HSRA: Hadoop-based spliced read aligner for RNA sequencing data

Bibliographic Details
Published in:	PLoS ONE, Vol. 13, no. 7, p. e0201483
Main Authors:	Expósito, Roberto R.; González-Domínguez, Jorge; Touriño, Juan
Format:	Journal Article
Language:	English
Published:	United States: Public Library of Science (PLoS), 31-07-2018
Subjects:	Alignment; Big Data; Bioinformatics; Biology and Life Sciences; Computer and Information Sciences; Computer memory; Cost analysis; Data management; Data processing; Datasets; Deoxyribonucleic acid; Design optimization; Distributed memory; DNA; Downloading; Engineering and Technology; Gene expression; Gene mapping; Gene sequencing; Genomes; High-Throughput Nucleotide Sequencing; International conferences; Mapping; Research and Analysis Methods; Ribonucleic acid; RNA; RNA Folding; RNA sequencing; Sequence Alignment - methods; Sequence Analysis, RNA - methods; Software; Galicia, Spain

Description
Summary:	The analysis of transcriptome sequencing (RNA-seq) data has become the standard method for quantifying gene expression levels. In RNA-seq experiments, mapping short reads to a reference genome or transcriptome is a crucial step and remains one of the most time-consuming. With the steady development of Next Generation Sequencing (NGS) technologies, unprecedented amounts of genomic data pose significant challenges for storage, processing and downstream analysis. As sequencing cost and throughput continue to improve, there is a growing need for new software solutions that minimize the impact of increasing data volume on RNA read alignment. In this work we introduce HSRA, a Big Data tool that uses the MapReduce programming model to extend the multithreading capabilities of a state-of-the-art spliced read aligner for RNA-seq data (HISAT2) to distributed-memory systems such as multi-core clusters or cloud platforms. HSRA is built upon the Hadoop MapReduce framework and supports both single- and paired-end reads from FASTQ/FASTA datasets, providing output alignments in SAM format. The design of HSRA has been carefully optimized to avoid the main limitations and major causes of inefficiency found in previous Big Data mapping tools, which cannot fully exploit the raw performance of the underlying aligner. On a 16-node multi-core cluster, HSRA is on average 2.3 times faster than previous Hadoop-based tools. The Java source code and a user's guide are publicly available for download at http://hsra.dec.udc.es.
Bibliography:	Competing Interests: The authors have declared that no competing interests exist.
ISSN:	1932-6203
DOI:	10.1371/journal.pone.0201483
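
The summary describes distributing read alignment across many workers, each running the aligner on its own portion of a FASTQ dataset. The sketch below illustrates that general split-then-align pattern only; it is not HSRA's implementation and does not call Hadoop or HISAT2. All names (`parse_fastq`, `make_splits`, `align_splits`, `placeholder_align`) are hypothetical, a local thread pool stands in for cluster workers, and the input is assumed to be single-end FASTQ with exactly four lines per read.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable, Iterator, List, Tuple

# One FASTQ read: header line, sequence, separator line, quality string.
FastqRecord = Tuple[str, str, str, str]


def parse_fastq(lines: Iterable[str]) -> Iterator[FastqRecord]:
    """Group a FASTQ stream into 4-line records (assumes unwrapped sequences)."""
    buf: List[str] = []
    for line in lines:
        line = line.rstrip("\n")
        if not line and not buf:
            continue  # tolerate blank lines between records
        buf.append(line)
        if len(buf) == 4:
            if not buf[0].startswith("@") or not buf[2].startswith("+"):
                raise ValueError(f"malformed FASTQ record: {buf[0]!r}")
            yield (buf[0], buf[1], buf[2], buf[3])
            buf = []
    if buf:
        raise ValueError("truncated FASTQ record at end of input")


def make_splits(records: Iterable[FastqRecord],
                reads_per_split: int) -> Iterator[List[FastqRecord]]:
    """Cut the read stream into splits without ever breaking a record."""
    if reads_per_split < 1:
        raise ValueError("reads_per_split must be at least 1")
    split: List[FastqRecord] = []
    for rec in records:
        split.append(rec)
        if len(split) == reads_per_split:
            yield split
            split = []
    if split:
        yield split  # last, possibly shorter, split


def align_splits(splits: Iterable[List[FastqRecord]],
                 align_split: Callable[[List[FastqRecord]], List[str]],
                 workers: int = 4) -> List[str]:
    """Run the aligner callback on each split in parallel, keeping input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        parts = pool.map(align_split, splits)  # map() preserves split order
        return [line for part in parts for line in part]


def placeholder_align(split: List[FastqRecord]) -> List[str]:
    """Stand-in for a real aligner: emits read name and read length only."""
    return [f"{hdr[1:].split()[0]}\t{len(seq)}" for hdr, seq, _, _ in split]


if __name__ == "__main__":
    fastq = ["@r1", "ACGT", "+", "IIII",
             "@r2", "GGCCA", "+", "IIIII",
             "@r3", "TT", "+", "II"]
    splits = make_splits(parse_fastq(fastq), reads_per_split=2)
    for line in align_splits(splits, placeholder_align, workers=2):
        print(line)
```

Splitting on whole records matters because a split boundary that falls inside a four-line read would hand a worker an unparseable fragment. In a real deployment the callback would invoke an actual aligner and emit SAM lines rather than the placeholder output used here.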