Long short-term memory model – A deep learning approach for medical data with irregularity in cancer predication with tumor markers
Published in: | Computers in biology and medicine Vol. 144; p. 105362 |
---|---|
Format: | Journal Article |
Language: | English |
Published: | United States, Elsevier Ltd, 01-05-2022 |
Summary: | Machine learning (ML) has emerged as a superior method for the analysis of large datasets. Application of ML is often hindered by incompleteness of the data which is particularly evident when approaching disease screening data due to varied testing regimens across medical institutions. Here we explored the utility of multiple ML algorithms to predict cancer risk when trained using a large but incomplete real-world dataset of tumor marker (TM) values.
TM screening data were collected from a large asymptomatic cohort (n = 163,174) at two independent medical centers. The cohort included 785 individuals who were subsequently diagnosed with cancer. Data included levels of up to eight TMs, but for most subjects only a subset of the biomarkers was tested. In some instances, TM values were available at multiple time points, but intervals between tests varied widely. The data were used to train and test various ML models to evaluate their robustness for predicting cancer risk. Multiple methods for data imputation were explored, and models were developed for both single time-point and time-series data.
Among the ML algorithms tested, long short-term memory (LSTM) handled irregular medical data better than the other models. A cancer risk prediction tool was trained and validated for a single time-point test of a TM panel including up to four biomarkers (AUROC = 0.831, 95% CI: 0.827–0.835), which outperformed a single threshold method using the same biomarkers. A second model relying on time-series data of up to four time points for five TMs had an AUROC of 0.931.
A cancer risk prediction tool was developed by training an LSTM model using a large but incomplete real-world dataset of TM values. The LSTM model handled irregular data better than the other ML models. The use of time-series TM data can further improve the predictive performance of LSTM models even when the intervals between tests vary widely. These risk prediction tools can direct subjects to further screening sooner, resulting in earlier detection of occult tumors.
•Early cancer risk prediction tool powered by the largest real-world dataset of tumor biomarkers (TM) to date.
•The first study of a model using time-series TM data, which further improves the early cancer predictive performance of the LSTM model.
•The first study of semi-quantitative measurement of early cancer diagnosis by TM screening and clinical follow-up. |
---|---|
ISSN: | 0010-4825 1879-0534 |
DOI: | 10.1016/j.compbiomed.2022.105362 |
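The summary describes training a sequence model on TM panels in which most subjects have only some markers tested and the gaps between tests vary. One common way to present such data to an LSTM is to pad every subject to a fixed number of visits and attach an "observed" flag to each value, plus the time gap since the previous visit. The sketch below illustrates that idea only; it is not taken from the paper. The marker names, the `encode_series` helper, and the zero-fill choice are all assumptions made for illustration.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical marker panel: this record does not list which markers were used.
MARKERS = ["AFP", "CEA", "CA19-9", "CA125"]
MAX_VISITS = 4  # the summary mentions up to four time points

Visit = Tuple[int, Dict[str, float]]  # (day of test, {marker name: value})


def encode_series(visits: List[Visit]) -> List[List[float]]:
    """Encode one subject's visits as MAX_VISITS rows of equal width.

    Each row holds, per marker, the value (0.0 if untested) and a 0/1
    'observed' flag, followed by the gap in days since the previous visit
    and a 0/1 flag that is 1.0 for a real visit and 0.0 for padding.
    """
    # Keep the most recent MAX_VISITS visits in chronological order.
    visits = sorted(visits, key=lambda v: v[0])[-MAX_VISITS:]
    rows: List[List[float]] = []
    prev_day: Optional[int] = None
    for day, results in visits:
        row: List[float] = []
        for name in MARKERS:
            if name in results:
                row += [float(results[name]), 1.0]
            else:
                row += [0.0, 0.0]  # assumed zero-fill for untested markers
        gap = 0.0 if prev_day is None else float(day - prev_day)
        row += [gap, 1.0]
        rows.append(row)
        prev_day = day
    # Pad at the end so every subject has the same number of rows.
    width = 2 * len(MARKERS) + 2
    while len(rows) < MAX_VISITS:
        rows.append([0.0] * width)
    return rows


# Two visits 400 days apart, each with a different subset of markers tested.
subject = [(0, {"AFP": 3.1, "CEA": 2.0}), (400, {"CEA": 2.6, "CA125": 18.0})]
encoded = encode_series(subject)
```

Keeping the observed flags separate from the values lets a downstream model distinguish "not tested" from a measured value of zero, and the gap feature exposes the irregular spacing between tests instead of hiding it.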