Missing value estimation of microarray data using Sim-GAN

Bibliographic Details
Published in: Knowledge and Information Systems, Vol. 64, no. 10, pp. 2661-2687
Main Authors: Pati, Soumen Kumar, Gupta, Manan Kumar, Shai, Rinita, Banerjee, Ayan, Ghosh, Arijit
Format: Journal Article
Language: English
Published: London Springer London 01-10-2022
Springer Nature B.V
Description
Summary: Microarray data analysis demands great care because it plays a significant role in cancer research. Because the data extraction process is highly complex, some relevant information is lost (missing values), which causes a significant and irrecoverable deviation from the actual scenario. The imputation of missing values is therefore a crucial preprocessing step in analyzing microarray data. Numerous methodologies have been designed to resolve the problem, but they give unsatisfactory results when the missing rate is high. To estimate the missing expression values and complete the dataset, a novel method is proposed based on a similarity index and a generative adversarial network (Sim-GAN). First, the raw dataset is divided into two subsets: the target set (genes with missing expression values) and the candidate set (genes without missing values). Next, the similarity index between target genes and candidate genes is computed. Because microarray data reflect several biological factors, three similarity matrices (structural, functional, and semantic similarity) are derived to find a small subset of candidate genes for each target gene. For structural similarity, a novel approach is used that reduces the time complexity to O(1) and handles nonlinearity. The resulting subsets are then fed into a generative adversarial network to compute the missing values of the target genes. The experimental results support the claim that the proposed methodology performs satisfactorily in terms of meaningful expression values. A detailed comparative study based on several statistical (e.g., NRMSE, AUROC) and biological (e.g., CPP, BLCI) metrics confirms that the proposed Sim-GAN outperforms existing missing value estimation techniques.
ISSN:0219-1377
0219-3116
DOI:10.1007/s10115-022-01718-0
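
The preprocessing steps named in the summary (splitting genes into a target set and a candidate set, then choosing a small candidate subset per target gene) can be illustrated with a minimal sketch. This is not the authors' implementation: the summary does not define the structural, functional, or semantic similarity indices, so absolute Pearson correlation over the target gene's observed columns is used here purely as a stand-in, and the GAN imputation step is omitted. The function names, the toy matrix, and the NRMSE normalisation (RMSE divided by the standard deviation of the true values, one common convention) are all assumptions made for illustration.

```python
import numpy as np


def split_target_candidate(X):
    """Return row indices of genes with (target) and without (candidate) NaNs."""
    has_missing = np.isnan(X).any(axis=1)
    return np.flatnonzero(has_missing), np.flatnonzero(~has_missing)


def top_k_candidates(X, target_idx, candidate_idx, k):
    """Pick the k candidate genes most similar to one target gene.

    Stand-in similarity (assumption): absolute Pearson correlation computed
    only on the columns where the target gene is observed.
    """
    observed = ~np.isnan(X[target_idx])
    t = X[target_idx, observed]
    C = X[candidate_idx][:, observed]
    t_c = t - t.mean()
    C_c = C - C.mean(axis=1, keepdims=True)
    denom = np.linalg.norm(t_c) * np.linalg.norm(C_c, axis=1)
    # Constant rows have zero variance; give them correlation 0.
    corr = np.divide(C_c @ t_c, denom,
                     out=np.zeros(len(candidate_idx)), where=denom > 0)
    order = np.argsort(-np.abs(corr), kind="stable")[:k]
    return candidate_idx[order]


def nrmse(true_vals, imputed_vals):
    """RMSE normalised by the standard deviation of the true values."""
    true_vals = np.asarray(true_vals, dtype=float)
    imputed_vals = np.asarray(imputed_vals, dtype=float)
    rmse = np.sqrt(np.mean((true_vals - imputed_vals) ** 2))
    return rmse / np.std(true_vals)


# Toy expression matrix: rows are genes, columns are samples (hypothetical data).
X = np.array([
    [1.0, 2.0, np.nan, 4.0],  # gene 0: one missing value, so it is a target
    [1.0, 2.0, 3.0, 4.0],     # gene 1: complete
    [4.0, 3.0, 2.0, 1.0],     # gene 2: complete
    [2.0, 1.0, 5.0, 1.5],     # gene 3: complete
])

targets, candidates = split_target_candidate(X)
subset = top_k_candidates(X, targets[0], candidates, k=2)
print("target genes:", targets)
print("candidate genes:", candidates)
print("selected candidates for gene", targets[0], ":", subset)
```

In a full pipeline, the selected candidate rows for each target gene would be passed to the generative model, and `nrmse` would be evaluated on entries that were masked artificially so that the true values are known.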