Discovering simple DNA sequences by compression

An information-theoretic DNA compression scheme devised by Milosavljevic and Jurka (1993) has been used in many places in the literature for both the discovery of new genes and the compression of DNA. Their compression method applies an encoding of previously occurring runs. They use 5 different cod...

Full description

Saved in:
Bibliographic Details
Published in:Pacific Symposium on Biocomputing. Pacific Symposium on Biocomputing p. 597
Main Authors: Powell, D R, Dowe, D L, Allison, L, Dix, T I
Format: Journal Article
Language:English
Published: United States 1998
Subjects:
Online Access:Get more information
Tags: Add Tag
No Tags, Be the first to tag this record!
Description
Summary:An information-theoretic DNA compression scheme devised by Milosavljevic and Jurka (1993) has been used in many places in the literature for both the discovery of new genes and the compression of DNA. Their compression method applies an encoding of previously occurring runs. They use 5 different code-words: four being the DNA bases, A, C, G and T, and the other being a pointer to a previously occurring run. They advocate a code-word of length log2 5 for each of these and then encoding a run by a code-word of length 2 x log2 n, where n is the length of the sequence. This scheme encodes the start of the sequence with a code-word of length log2 n and likewise encodes the end of the sequence with a code-word of length log2 n. In this paper, we show the above coding scheme to be inefficient in various ways and improve upon it so that it can compress DNA. We discuss our implementation of various schemes some of which run in linear time.
ISSN:2335-6936