Integrative learning of structured high‐dimensional data from multiple datasets

Bibliographic Details
Published in: Statistical Analysis and Data Mining, Vol. 16, no. 2, pp. 120-134
Main Authors: Chang, Changgee; Dai, Zongyu; Oh, Jihwan; Long, Qi
Format: Journal Article
Language: English
Published: Hoboken: Wiley Subscription Services, Inc., A Wiley Company, 01-04-2023
Description
Summary: Integrative learning of multiple datasets has the potential to mitigate the challenge of small n and large p that is often encountered in analysis of big biomedical data such as genomics data. Detection of weak yet important signals can be enhanced by jointly selecting features for all datasets. However, the set of important features may not always be the same across all datasets. Although some existing integrative learning methods allow heterogeneous sparsity structure where a subset of datasets can have zero coefficients for some selected features, they tend to yield reduced efficiency, reinstating the problem of losing weak important signals. We propose a new integrative learning approach which can not only aggregate important signals well in homogeneous sparsity structure, but also substantially alleviate the problem of losing weak important signals in heterogeneous sparsity structure. Our approach exploits a priori known graphical structure of features and encourages joint selection of features that are connected in the graph. Integrating such prior information over multiple datasets enhances the power, while also accounting for the heterogeneity across datasets. Theoretical properties of the proposed method are investigated. We also demonstrate the limitations of existing approaches and the superiority of our method using a simulation study and analysis of gene expression data from ADNI.
Bibliography:
Funding information: National Institutes of Health, Grant/Award Number: RF1AG063481
Author contributions: All authors were involved in discussion and writing the manuscript. Changgee Chang developed the model, algorithm, and theory, implemented the method, and performed the data analysis. Zongyu Dai developed the theory. Jihwan Oh conducted the simulation study. Qi Long supervised this work.
Present address: 423 Guardian Drive, Philadelphia, Pennsylvania, USA
ISSN: 1932-1864, 1932-1872
DOI: 10.1002/sam.11601