Discovering simple DNA sequences by compression
An information-theoretic DNA compression scheme devised by Milosavljevic and Jurka (1993) has been used in many places in the literature for both the discovery of new genes and the compression of DNA. Their compression method applies an encoding of previously occurring runs. They use 5 different cod...
Saved in:
Published in: | Pacific Symposium on Biocomputing. Pacific Symposium on Biocomputing p. 597 |
---|---|
Main Authors: | , , , |
Format: | Journal Article |
Language: | English |
Published: |
United States
1998
|
Subjects: | |
Online Access: | Get more information |
Tags: |
Add Tag
No Tags, Be the first to tag this record!
|
Summary: | An information-theoretic DNA compression scheme devised by Milosavljevic and Jurka (1993) has been used in many places in the literature for both the discovery of new genes and the compression of DNA. Their compression method applies an encoding of previously occurring runs. They use 5 different code-words: four being the DNA bases, A, C, G and T, and the other being a pointer to a previously occurring run. They advocate a code-word of length log2 5 for each of these and then encoding a run by a code-word of length 2 x log2 n, where n is the length of the sequence. This scheme encodes the start of the sequence with a code-word of length log2 n and likewise encodes the end of the sequence with a code-word of length log2 n. In this paper, we show the above coding scheme to be inefficient in various ways and improve upon it so that it can compress DNA. We discuss our implementation of various schemes some of which run in linear time. |
---|---|
ISSN: | 2335-6936 |