A new efficient referential genome compression technique for FastQ files

Bibliographic Details
Published in:	Functional & integrative genomics Vol. 23; no. 4; p. 333
Main Authors:	Kumar, Sanjeev; Singh, Mukund Pratap; Nayak, Soumya Ranjan; Khan, Asif Uddin; Jain, Anuj Kumar; Singh, Prabhishek; Diwakar, Manoj; Soujanya, Thota
Format:	Journal Article
Language:	English
Published:	Berlin/Heidelberg: Springer Berlin Heidelberg, 01-12-2023; Springer Nature B.V.
Subjects:	Algorithms; Animal Genetics and Genomics; Biochemistry; Bioinformatics; Biomedical and Life Sciences; Cell Biology; Compression; Data compression; Data Compression - methods; Decompression; FastQ; Gene mapping; Genome; Genomes; High-Throughput Nucleotide Sequencing - methods; Identifiers; Life Sciences; Medical laboratories; Microbial Genetics and Genomics; Nucleotide sequence; Original Article; Palindromes; Plant Genetics and Genomics; Quality scores; Sequence Analysis, DNA - methods; Software

Description
Summary:	Hospitals and medical laboratories create a tremendous amount of genome sequence data every day for use in research, surgery, and illness diagnosis. Compression is therefore essential for the storage, monitoring, and distribution of all these data, and a novel data compression technique is required to reduce the time and cost of storage, transmission, and data processing. General-purpose compression techniques do not perform well on these data because of their special features: a large number of repeats (tandem and palindrome), a small alphabet, highly similar sequences, and specific file formats. In this study, we provide a method for compressing FastQ files that relies on a reference genome without sacrificing data quality. FastQ files are first split into three streams (identifier, sequence, and quality score), each of which receives its own compression technique. A novel, quick, and lightweight mapping mechanism is also presented to compress the sequence stream effectively. Experiments show that both the compression ratio and the compression/decompression time of NGS data compressed using RBFQC are superior to those achieved by other state-of-the-art genome compression methods. In comparison to GZIP, RBFQC may achieve a compression ratio of 80–140% for fixed-length datasets and 80–125% for variable-length datasets. Compared to domain-specific referential genome compression techniques for FastQ files, RBFQC improves total compression and decompression speed by 10–25%.
ISSN:	1438-793X 1438-7948
DOI:	10.1007/s10142-023-01259-x
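
The Summary describes splitting a FastQ file into identifier, sequence, and quality-score streams that are each compressed separately. The following is a minimal sketch of that splitting step only, not the authors' RBFQC implementation: the function names are hypothetical, `zlib` stands in for the per-stream codecs (which the record does not describe), no reference genome is used, and the input is assumed to use strict four-line records with a bare `+` separator line.

```python
import zlib


def split_fastq(text):
    """Split FastQ text into identifier, sequence and quality-score streams.

    Assumes strict 4-line records: @id, sequence, '+', quality.
    """
    lines = text.strip().splitlines()
    if len(lines) % 4 != 0:
        raise ValueError("expected 4 lines per FastQ record")
    ids, seqs, quals = [], [], []
    for i in range(0, len(lines), 4):
        header, seq, plus, qual = lines[i:i + 4]
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"malformed record starting at line {i + 1}")
        if len(seq) != len(qual):
            raise ValueError(f"sequence/quality length mismatch at line {i + 1}")
        ids.append(header[1:])  # drop the leading '@'
        seqs.append(seq)
        quals.append(qual)
    return ids, seqs, quals


def compress_streams(ids, seqs, quals):
    """Compress each stream independently (zlib is a placeholder codec)."""
    return tuple(
        zlib.compress("\n".join(stream).encode("ascii"))
        for stream in (ids, seqs, quals)
    )


def decompress_streams(blobs):
    """Inverse of compress_streams: recover the three lists of strings."""
    return tuple(zlib.decompress(b).decode("ascii").split("\n") for b in blobs)


def join_fastq(ids, seqs, quals):
    """Rebuild FastQ text; the '+' line is written bare (an assumption)."""
    return "".join(
        f"@{i}\n{s}\n+\n{q}\n" for i, s, q in zip(ids, seqs, quals)
    )


# Two made-up reads, used only to exercise the round trip.
SAMPLE = (
    "@read1\nACGTACGT\n+\nIIIIHHHH\n"
    "@read2\nTTGACCAN\n+\nFFFF##!!\n"
)

ids, seqs, quals = split_fastq(SAMPLE)
blobs = compress_streams(ids, seqs, quals)
restored = join_fastq(*decompress_streams(blobs))
```

Keeping the streams apart lets each one be handed to a codec suited to its statistics; in a referential scheme such as the one summarized above, the sequence stream would be mapped against a reference genome rather than passed to a general-purpose compressor.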