Statistical Integration of Genomic Data Across Labs, Platforms and Experimental Conditions

Bibliographic Details
Main Author: Lyu, Yafei
Format: Dissertation
Language: English
Published: ProQuest Dissertations & Theses 01-01-2018
Description
Summary: High-throughput sequencing technology is the most important tool in modern biotechnological and biomedical research. Analyzing and interpreting such data is crucial for understanding genomic and epigenomic landscapes. The fast accumulation of publicly available high-throughput sequencing data provides new opportunities to compare and combine genomic data from various sources for deeper exploration, potentially improving the robustness of downstream analyses and leading to novel discoveries.
To improve the utilization of high-throughput sequencing data, this dissertation contributes three novel statistical tools that can be used to integrate data from different platforms, labs and experimental conditions. The first tool, CFGL, can be used to construct gene co-expression networks across multiple conditions. By using a data-driven approach to capture condition-specific co-expression patterns, this method effectively identifies co-expression patterns that are specific to a condition and those that are common across conditions. The application of CFGL to TCGA breast cancer data reveals interesting insights about disease-type specificity. The second tool, IDR3c, is a rank-based semi-parametric model that improves the identification of differentially expressed genes by using information across different sequencing platforms. IDR3c incorporates both the significance of differential expression and the consistency across sources. It effectively detects differentially expressed genes whose signals are moderate but consistent across data sources. In simulations and real data studies, IDR3c shows higher discriminative power and identifies more biologically relevant differential expression than analyses based on individual sources. The third tool, nestedIDR, integrates findings from replicated genomic data from multiple labs. This method models the hierarchy of data sources, measures reproducibility (within each lab) and replicability (between labs) simultaneously, and accounts for heterogeneity across data sources. Applications to RNA-seq and ChIP-seq data show that it improves the reliability of identifications and rescues signals that are not reproducible within a lab but are shown to be replicable in other labs.
These methods provide a set of tools for integrating high-throughput sequencing data in a variety of situations. The real data analyses discussed in the thesis also provide examples of current genomic data integration, illustrating the benefit of such integration.
ISBN: 9798582524007