Centralized vs. distributed feature selection methods based on data complexity measures

Bibliographic Details
Published in: Knowledge-Based Systems, Vol. 117, pp. 27-45
Main Authors: Morán-Fernández, L., Bolón-Canedo, V., Alonso-Betanzos, A.
Format: Journal Article
Language: English
Published: Elsevier B.V., 01-02-2017
Description
Summary:
• A methodology for distributing the process of feature selection based on several data complexity measures is proposed.
• We tackle both strategies for partitioning the datasets: horizontal (i.e. by samples) and vertical (i.e. by features).
• We present an experimental study on 11 datasets (five of them microarrays) in terms of number of selected features, classification accuracy and running time.
• The novel procedures are able to significantly reduce the running time while maintaining (or even improving) the classification performance.

In the era of Big Data, many datasets share a common characteristic: a large number of features. As a result, selecting the relevant features and ignoring the irrelevant and redundant ones has become indispensable. However, when dealing with large amounts of data, most existing feature selection algorithms do not scale well, and their efficiency may deteriorate to the point of becoming inapplicable. Moreover, data is often distributed across multiple locations, and it is not economical or legal to gather it at a single site. For these reasons, we propose a distributed approach for partitioned data using two techniques: horizontal (i.e. by samples) and vertical (i.e. by features). Unlike existing procedures for combining the partial outputs obtained from each partition of data, we propose a merging process that uses the theoretical complexity of these feature subsets. The novel procedure, tested on 11 datasets, has proved to be useful, showing competitive results in terms of both runtime and classification accuracy.
ISSN: 0950-7051, 1872-7409
DOI:10.1016/j.knosys.2016.09.022
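
The two partitioning strategies named in the summary, horizontal (by samples) and vertical (by features), can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the random shuffling, the near-equal chunk sizes and the toy data are all assumptions made for illustration, and the complexity-based merging step described in the abstract is not shown.

```python
import random

def split_indices(n_items, n_parts, seed=0):
    """Shuffle range(n_items) and deal it into n_parts near-equal chunks.
    (Hypothetical helper; the paper may partition differently.)"""
    idx = list(range(n_items))
    random.Random(seed).shuffle(idx)
    return [idx[i::n_parts] for i in range(n_parts)]

def horizontal_partition(X, y, n_parts, seed=0):
    """Split by samples: each partition keeps every feature."""
    return [([X[i] for i in rows], [y[i] for i in rows])
            for rows in split_indices(len(X), n_parts, seed)]

def vertical_partition(X, y, n_parts, seed=0):
    """Split by features: each partition keeps every sample and all labels.
    The original column indices are returned so selected features can be
    mapped back to the full dataset."""
    return [(cols, [[row[j] for j in cols] for row in X], y)
            for cols in split_indices(len(X[0]), n_parts, seed)]

# Toy data (assumed): 10 samples, 6 features, binary labels.
X = [[10 * i + j for j in range(6)] for i in range(10)]
y = [i % 2 for i in range(10)]

h_parts = horizontal_partition(X, y, 3)  # 3 subsets of samples, 6 features each
v_parts = vertical_partition(X, y, 2)    # 2 subsets of features, 10 samples each
```

In a distributed setting, a feature selection method would then run independently on each partition, and the partial results would be combined afterwards.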