Missing value estimation of microarray data using Sim-GAN
Published in: | Knowledge and information systems Vol. 64; no. 10; pp. 2661 - 2687 |
---|---|
Format: | Journal Article |
Language: | English |
Published: | London, Springer London, 01-10-2022, Springer Nature B.V |
Summary: | Microarray data analysis requires great care because it plays a significant role in cancer research. Because the data extraction process is highly complex, some relevant information is lost (missing values), which causes a significant, irrecoverable deviation from the actual scenario. Imputing missing values is therefore a crucial preprocessing step in analyzing microarray data. Numerous methods have been designed to address the problem, but they give unsatisfactory results when the missing rate is high. To estimate the missing expression values and complete the dataset, a novel method based on a similarity index and a generative adversarial network (Sim-GAN) is proposed. First, the raw dataset is divided into two subsets: the target set (genes with missing expression values) and the candidate set (genes without missing values). Next, the similarity index between target genes and candidate genes is computed. Because microarray data reflect several biological factors, three similarity matrices (structural, functional, and semantic similarity) are derived to find a small subset of candidate genes for each target gene. For structural similarity, a novel approach is used that reduces the time complexity to O(1) and handles nonlinearity. The resulting subsets are then fed into a generative adversarial network to compute the missing values of the target genes. The experimental results support the claim that the proposed method performs satisfactorily in terms of meaningful expression values. A detailed comparative study based on several statistical (e.g., NRMSE, AUROC) and biological (CPP, BLCI) metrics confirms that the proposed Sim-GAN outperforms existing missing value estimation techniques. |
---|---|
ISSN: | 0219-1377 0219-3116 |
DOI: | 10.1007/s10115-022-01718-0 |
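
The first two steps described in the summary (splitting genes into a target set and a candidate set, then choosing a small subset of candidates for each target gene) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `split_target_candidate` and `top_k_candidates`, the toy matrix, and the use of absolute Pearson correlation as the similarity score are all assumptions made for illustration. The record does not specify how the structural, functional, and semantic similarities are computed, and the GAN stage is not shown.

```python
import numpy as np


def split_target_candidate(expr):
    """Return row indices of genes with (target) and without (candidate) NaNs."""
    has_missing = np.isnan(expr).any(axis=1)
    return np.flatnonzero(has_missing), np.flatnonzero(~has_missing)


def top_k_candidates(expr, target_row, candidate_idx, k=2):
    """Pick the k candidate genes most similar to one target gene.

    Similarity here is the absolute Pearson correlation computed over the
    columns observed in the target gene. This is a stand-in (assumption),
    not the similarity index defined in the paper.
    """
    observed = ~np.isnan(expr[target_row])
    target_values = expr[target_row, observed]
    scores = []
    for c in candidate_idx:
        r = np.corrcoef(target_values, expr[c, observed])[0, 1]
        # A constant row yields NaN correlation; treat it as zero similarity.
        scores.append(abs(np.nan_to_num(r)))
    order = np.argsort(scores)[::-1][:k]
    return candidate_idx[order]


# Hypothetical toy expression matrix: rows are genes, columns are samples.
expr = np.array([
    [1.0, 2.0, np.nan, 4.0],  # gene 0 has a missing value (target)
    [1.1, 2.1, 3.1, 4.1],     # gene 1 (candidate)
    [4.0, 1.0, 3.0, 2.0],     # gene 2 (candidate)
    [2.0, 4.0, 6.0, 8.0],     # gene 3 (candidate)
])

target_idx, candidate_idx = split_target_candidate(expr)
selected = top_k_candidates(expr, target_idx[0], candidate_idx, k=2)
print("target genes:", target_idx.tolist())
print("selected candidates for gene 0:", sorted(selected.tolist()))
```

In a pipeline like the one the summary describes, the selected candidate rows, together with the target row, would form the input handed to the generative model that estimates the missing entries.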