Practical guide for managing large-scale human genome data in research

Bibliographic Details
Published in:	Journal of human genetics Vol. 66; no. 1; pp. 39-52
Main Authors:	Tanjo, Tomoya; Kawai, Yosuke; Tokunaga, Katsushi; Ogasawara, Osamu; Nagasaki, Masao
Format:	Journal Article
Language:	English
Published:	England: Nature Publishing Group; Springer Singapore, 01-01-2021
Subjects:	Bioinformatics; Computational Biology - methods; Computer applications; Data processing; Genome, Human - genetics; Genomes; Genomic analysis; Genomics - methods; High-Throughput Nucleotide Sequencing - methods; Human Genome Project; Humans; Information processing; Information Storage and Retrieval - methods; Reproducibility of Results; Review; Software; Whole genome sequencing; Whole Genome Sequencing - methods

Description
Summary:	Studies in human genetics deal with a plethora of human genome sequencing data, both generated from specimens and publicly available. With the development of various bioinformatics applications, maintaining research productivity, managing human genome data, and analyzing downstream data are essential. This review aims to guide researchers in processing and analyzing these large-scale genomic data to extract relevant information for improved downstream analyses. Here, we discuss worldwide human genome projects whose data could be integrated with any dataset for improved analysis. Obtaining human whole-genome sequencing data from both data stores and processes is costly; therefore, we focus on the development of data formats and software that manipulate whole-genome sequencing data. Once the sequencing is complete and the data format and processing tools are selected, a computational platform is required. For the platform, we describe a multi-cloud strategy that balances cost, performance, and customizability. High-quality published research relies on data reproducibility to ensure reliable results, reusability for application to other datasets, and scalability for future growth in datasets. To address these needs, we describe several key technologies developed in computer science, including workflow engines. We also discuss the ethical guidelines that are indispensable for human genomic data analysis and that differ from those for model organisms. Finally, an ideal future perspective on data processing and analysis is summarized.
ISSN:	1434-5161, 1435-232X
DOI:	10.1038/s10038-020-00862-1