MaRe: a MapReduce-Oriented Framework for Processing Big Data with Application Containers

Bibliographic Details
Main Authors: Capuccini, Marco, Dahlö, Martin, Toor, Salman, Spjuth, Ola
Format: Journal Article
Language: English
Published: 2018-08-07
Abstract Background. Life science is increasingly driven by Big Data analytics, and the MapReduce programming model has proven successful for data-intensive analyses. However, current MapReduce frameworks offer poor support for reusing existing processing tools in bioinformatics pipelines. Furthermore, these frameworks lack native support for application containers, which are becoming popular in scientific data processing. Results. Here we present MaRe, a programming model with an associated open-source implementation, which introduces support for application containers in MapReduce. MaRe is based on Apache Spark and Docker, the MapReduce framework and the container engine with the largest open-source communities, and thus provides interoperability with the cutting-edge software ecosystem. We demonstrate MaRe on two data-intensive applications in life science, showing ease of use and scalability. Conclusions. MaRe enables scalable data-intensive processing in life science with MapReduce and application containers. Compared with current best practices, which involve the use of workflow systems, MaRe has the advantage of providing data locality, ingestion from heterogeneous storage systems, and interactive processing. MaRe is generally applicable and available as open-source software.
Author Spjuth, Ola
Toor, Salman
Capuccini, Marco
Dahlö, Martin
Author_xml – sequence: 1
  givenname: Marco
  surname: Capuccini
  fullname: Capuccini, Marco
– sequence: 2
  givenname: Martin
  surname: Dahlö
  fullname: Dahlö, Martin
– sequence: 3
  givenname: Salman
  surname: Toor
  fullname: Toor, Salman
– sequence: 4
  givenname: Ola
  surname: Spjuth
  fullname: Spjuth, Ola
BackLink https://doi.org/10.48550/arXiv.1808.02318 (View paper in arXiv)
ContentType Journal Article
Copyright http://arxiv.org/licenses/nonexclusive-distrib/1.0
Copyright_xml – notice: http://arxiv.org/licenses/nonexclusive-distrib/1.0
DBID AKY
GOX
DOI 10.48550/arxiv.1808.02318
DatabaseName arXiv Computer Science
arXiv.org
DatabaseTitleList
Database_xml – sequence: 1
  dbid: GOX
  name: arXiv.org
  url: http://arxiv.org/find
  sourceTypes: Open Access Repository
DeliveryMethod fulltext_linktorsrc
ExternalDocumentID 1808_02318
GroupedDBID AKY
GOX
ID FETCH-LOGICAL-a678-9e30cd520e4bfc6270a6698101196d37fae109d50347a3d91b4e55f4f2de86b33
IEDL.DBID GOX
IngestDate Mon Jan 08 05:49:02 EST 2024
IsDoiOpenAccess true
IsOpenAccess true
IsPeerReviewed false
IsScholarly false
Language English
LinkModel DirectLink
MergedId FETCHMERGED-LOGICAL-a678-9e30cd520e4bfc6270a6698101196d37fae109d50347a3d91b4e55f4f2de86b33
OpenAccessLink https://arxiv.org/abs/1808.02318
ParticipantIDs arxiv_primary_1808_02318
PublicationCentury 2000
PublicationDate 2018-08-07
PublicationDateYYYYMMDD 2018-08-07
PublicationDate_xml – month: 08
  year: 2018
  text: 2018-08-07
  day: 07
PublicationDecade 2010
PublicationYear 2018
Score 1.7081051
SecondaryResourceType preprint
SourceID arxiv
SourceType Open Access Repository
SubjectTerms Computer Science - Distributed, Parallel, and Cluster Computing
Title MaRe: a MapReduce-Oriented Framework for Processing Big Data with Application Containers
URI https://arxiv.org/abs/1808.02318
hasFullText 1
inHoldings 1
isFullTextHit
isPrint
link.rule.ids 228,230,782,887
linkProvider Cornell University
openUrl ctx_ver=Z39.88-2004&ctx_enc=info%3Aofi%2Fenc%3AUTF-8&rfr_id=info%3Asid%2Fsummon.serialssolutions.com&rft_val_fmt=info%3Aofi%2Ffmt%3Akev%3Amtx%3Ajournal&rft.genre=article&rft.atitle=MaRe%3A+a+MapReduce-Oriented+Framework+for+Processing+Big+Data+with+Application+Containers&rft.au=Capuccini%2C+Marco&rft.au=Dahl%C3%B6%2C+Martin&rft.au=Toor%2C+Salman&rft.au=Spjuth%2C+Ola&rft.date=2018-08-07&rft_id=info:doi/10.48550%2Farxiv.1808.02318&rft.externalDocID=1808_02318