SMusket: Spark-based DNA error correction on distributed-memory systems

Bibliographic Details
Published in:	Future generation computer systems Vol. 111; pp. 698 - 713
Main Authors:	Expósito, Roberto R., González-Domínguez, Jorge, Touriño, Juan
Format:	Journal Article
Language:	English
Published:	Elsevier B.V., 01-10-2020
Subjects:	Apache Spark; Big Data; Error correction; Next-Generation Sequencing (NGS); Sequence analysis

Description
Summary:	Next-Generation Sequencing (NGS) technologies have revolutionized genomics research over the last decade, bringing new opportunities for scientists to perform groundbreaking biological studies. Error correction in NGS datasets is considered an important preprocessing step in many workflows, as sequencing errors can severely affect the quality of downstream analysis. Although current error correction approaches provide reasonably high accuracies, their computational cost can still be unacceptable when processing large datasets. In this paper we propose SparkMusket (SMusket), a Big Data tool built upon the open-source Apache Spark cluster computing framework to boost the performance of Musket, one of the most widely adopted and top-performing multithreaded correctors. Our tool efficiently exploits Spark features to implement a scalable error correction algorithm intended for distributed-memory systems built using commodity hardware. The experimental evaluation on a 16-node cluster using four publicly available datasets has shown that SMusket is up to 15.3 times faster than previous state-of-the-art MPI-based tools, also providing a maximum speedup of 29.8 over its multithreaded counterpart. SMusket is publicly available under an open-source license at https://github.com/rreye/smusket.
Highlights:
• Big Data tool for efficient DNA read error correction on distributed-memory systems.
• Scalable Spark implementation of a k-spectrum algorithm based on Musket.
• SMusket shows speedups of up to 29.8x over Musket on a 16-node cluster.
• SMusket is up to 15.3 times faster compared with state-of-the-art MPI-based tools.
ISSN:	0167-739X 1872-7115
DOI:	10.1016/j.future.2019.10.038
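
Illustration:	The summary describes SMusket as an implementation of a k-spectrum algorithm. As a rough, single-machine sketch of that general idea (not the actual Musket or SMusket code, which is more elaborate and distributed), the Python below counts every k-mer across a set of reads, treats k-mers seen fewer than a threshold number of times as untrusted, and tries a single-base substitution that makes every k-mer in a read trusted. All function names, the value of k, and the threshold are hypothetical choices made for this example.

```python
from collections import Counter

BASES = "ACGT"

def kmers(read, k):
    """Yield every overlapping substring of length k in a read."""
    for i in range(len(read) - k + 1):
        yield read[i:i + k]

def build_spectrum(reads, k):
    """Count how often each k-mer occurs across all reads (the k-spectrum)."""
    counts = Counter()
    for read in reads:
        counts.update(kmers(read, k))
    return counts

def untrusted_positions(read, counts, k, min_count):
    """Return start offsets of k-mers seen fewer than min_count times."""
    return [i for i, km in enumerate(kmers(read, k)) if counts[km] < min_count]

def correct_one_base(read, counts, k, min_count):
    """Return the first single-base substitution that makes every k-mer trusted.

    Illustrative assumption: at most one substitution error per read.
    The read is returned unchanged if it is already trusted or if no
    single substitution repairs it.
    """
    if not untrusted_positions(read, counts, k, min_count):
        return read
    for pos in range(len(read)):
        for base in BASES:
            if base == read[pos]:
                continue
            candidate = read[:pos] + base + read[pos + 1:]
            if not untrusted_positions(candidate, counts, k, min_count):
                return candidate
    return read

# Toy data: three error-free copies of a read and one copy with a substitution.
reads = ["ACGTACGT"] * 3 + ["ACGTTCGT"]
spectrum = build_spectrum(reads, k=4)
corrected = [correct_one_base(r, spectrum, k=4, min_count=2) for r in reads]
```

In a cluster setting, the counting step in `build_spectrum` is the part that would most naturally become a distributed aggregation over partitions of the input reads; how SMusket actually organizes this work is described in the paper itself, not in this record.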