Comparison of variable selection methods in partial least squares regression

Bibliographic Details
Published in: Journal of chemometrics, Vol. 34, no. 6
Main Authors: Mehmood, Tahir; Sæbø, Solve; Liland, Kristian Hovde
Format: Journal Article
Language: English
Published: Chichester: Wiley Subscription Services, Inc., 01-06-2020
Description
Summary: Advances in technology make it increasingly easy to generate vast numbers of variables from a given sample. Variable selection is imperative for data reduction and for understanding the modeled relationship. Partial least squares (PLS) regression is among the modeling approaches that address high-throughput data. A considerable number of variable selection methods have been introduced for PLS, and most of them were reviewed in a recent study. Motivated by this, we conducted a comparison of the available methods for variable selection within PLS. The main focus of this study was to reveal patterns of dependency between variable selection method and data properties, which can guide the choice of method in practical data analysis. To this end, a simulation study was conducted with data sets having diverse properties, such as the number of variables, the number of samples, model complexity level, and information content. The results indicate that these factors, together with the variant of PLS method and their mutual higher-order interactions, all significantly affect the prediction capability of the model and the choice of variable selection strategy. Variable selection methods can be divided into three groups: filter, wrapper, and embedded. The comparison of variable selection methods in PLS is based on simulated data sets of diverse characteristics. Root mean square error is the main criterion for comparison, and a meta-analysis is carried out. The article provides the link between data properties and variable selection methods. Moreover, the characteristics of variable selection methods are explored.
ISSN: 0886-9383, 1099-128X
DOI: 10.1002/cem.3226
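
The summary describes filter-type variable selection in PLS and comparison by root mean square error. The sketch below illustrates that idea only and is not the authors' simulation design. It fits a single-response PLS model with the NIPALS algorithm, ranks variables by the absolute value of their regression coefficients (one common filter criterion), refits on the top-ranked variables, and compares test-set RMSE. The function names (fit_pls1, predict, rmse), the simulated data, the number of components, and the number of retained variables are all assumptions made for the example. In practice the cut-off would be tuned, for instance by cross-validation.

```python
import numpy as np


def fit_pls1(X, y, n_components):
    """Fit single-response PLS (PLS1) via NIPALS; return (coef, x_mean, y_mean)."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xr, yr = X - x_mean, y - y_mean           # centred copies, deflated in the loop
    p = X.shape[1]
    W = np.zeros((p, n_components))           # loading weights
    P = np.zeros((p, n_components))           # X loadings
    q = np.zeros(n_components)                # y loadings
    for a in range(n_components):
        w = Xr.T @ yr                         # direction of maximal covariance with y
        w /= np.linalg.norm(w)
        t = Xr @ w                            # scores
        tt = t @ t
        P[:, a] = Xr.T @ t / tt
        q[a] = yr @ t / tt
        W[:, a] = w
        Xr = Xr - np.outer(t, P[:, a])        # deflate X
        yr = yr - q[a] * t                    # deflate y
    coef = W @ np.linalg.solve(P.T @ W, q)    # regression coefficients in X space
    return coef, x_mean, y_mean


def predict(X, coef, x_mean, y_mean):
    return (X - x_mean) @ coef + y_mean


def rmse(y_true, y_pred):
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


# Simulated data (hypothetical): 30 variables, only the first 5 carry signal.
rng = np.random.default_rng(0)
n_train, n_test, p, k = 200, 200, 30, 5
X = rng.standard_normal((n_train + n_test, p))
beta = np.zeros(p)
beta[:k] = [3.0, -3.0, 3.0, -3.0, 3.0]
y = X @ beta + 0.5 * rng.standard_normal(n_train + n_test)
X_train, X_test = X[:n_train], X[n_train:]
y_train, y_test = y[:n_train], y[n_train:]

# Full model on all variables.
coef_full, xm, ym = fit_pls1(X_train, y_train, n_components=3)
rmse_full = rmse(y_test, predict(X_test, coef_full, xm, ym))

# Filter step: keep the k variables with the largest |regression coefficient|.
selected = np.sort(np.argsort(np.abs(coef_full))[-k:])

# Refit on the selected variables only and evaluate on the same test set.
coef_sel, xm_sel, ym_sel = fit_pls1(X_train[:, selected], y_train, n_components=3)
rmse_filtered = rmse(y_test, predict(X_test[:, selected], coef_sel, xm_sel, ym_sel))

print("selected variables:", selected)
print("RMSE, all variables:", rmse_full)
print("RMSE, filtered variables:", rmse_filtered)
```

Wrapper and embedded methods differ in where the selection happens: a wrapper repeats a fit-and-select loop like the one above, while an embedded method builds the selection into the PLS fit itself.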