Prediction of enzymatic function with high efficiency and a reduced number of features using genetic algorithm

Bibliographic Details
Published in: Computers in biology and medicine Vol. 158; p. 106799
Main Authors: Reis, Diogo R., Santos, Bruno C., Bleicher, Lucas, Zárate, Luis E., Nobre, Cristiane N.
Format: Journal Article
Language: English
Published: United States, Elsevier Ltd, 01-05-2023
Description
Summary: The post-genomic era has raised a growing demand for efficient procedures to identify protein functions, which can be accomplished by applying machine learning to a set of characteristics extracted from the protein. This feature-based approach has been the focus of several works in bioinformatics. In this work, we investigated which protein characteristics, representing the primary, secondary, tertiary, and quaternary structures, improve model quality when dimensionality reduction techniques are applied and a Support Vector Machine classifier is used to predict enzyme classes. Two approaches were evaluated: feature extraction/transformation, performed with the statistical technique Factor Analysis, and feature selection. For feature selection, we proposed an approach based on a genetic algorithm to address the trade-off between the simplicity and the reliability of an ideal representation of the enzymes' characteristics, and we also employed and compared other methods for this purpose. The best result was achieved with a feature subset generated by our implementation of a multi-objective genetic algorithm, enriched with features that this work identified as relevant for representing the enzymes. This subset reduced the dataset by about 87% and reached an F-measure of 85.78%, improving the overall classification quality of the model. In addition, we identified a subset of only 28 of the 424 features that reached an F-measure above 80% for four of the six evaluated classes, showing that satisfactory classification performance can be achieved with a reduced number of enzyme characteristics. The datasets and implementations are openly available.
• We identified 424 attributes of 17,275 unique sequences of the enzymes considered.
• A multi-objective genetic algorithm was proposed to select the best attributes.
• The method combines simplicity and reliability of an ideal representation of the enzymes.
• Subset reduced the dataset by about 87% and reached 85.78% of the F-measure.
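
The trade-off described above, between a small feature subset and a reliable classifier, can be illustrated with a minimal sketch of multi-objective feature selection. This record does not describe the authors' actual genetic algorithm, so everything here is an assumption made for illustration: the function names (`dominates`, `pareto_front`, `evolve`), the operators (binary tournament, one-point crossover, bit-flip mutation), the parameter values, and the toy `toy_error` function. In the paper's setting the error would be something like 1 minus the F-measure of a Support Vector Machine trained on the selected features; the toy function only stands in for it.

```python
import random


def dominates(a, b):
    """True if objective tuple a Pareto-dominates b (all objectives minimized)."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def pareto_front(population, objectives):
    """Return the individuals whose objectives no other individual dominates."""
    scored = [(ind, objectives(ind)) for ind in population]
    return [ind for ind, obj in scored
            if not any(dominates(other, obj) for _, other in scored)]


def evolve(n_features, error_fn, pop_size=40, generations=60, p_mut=0.02, seed=0):
    """Evolve 0/1 feature masks, minimizing (error, number of selected features)."""
    rng = random.Random(seed)

    def objectives(mask):
        return (error_fn(mask), sum(mask))

    def tournament(pop):
        # Binary tournament: prefer the dominating mask, otherwise pick at random.
        a, b = rng.choice(pop), rng.choice(pop)
        if dominates(objectives(a), objectives(b)):
            return a
        if dominates(objectives(b), objectives(a)):
            return b
        return rng.choice([a, b])

    pop = [tuple(int(rng.random() < 0.5) for _ in range(n_features))
           for _ in range(pop_size)]
    for _ in range(generations):
        children = []
        while len(children) < pop_size:
            p1, p2 = tournament(pop), tournament(pop)
            cut = rng.randrange(1, n_features)            # one-point crossover
            child = p1[:cut] + p2[cut:]
            child = tuple(bit ^ int(rng.random() < p_mut) for bit in child)  # bit-flip mutation
            children.append(child)
        combined = list(dict.fromkeys(pop + children))    # drop duplicates, keep order
        front = pareto_front(combined, objectives)
        rest = [ind for ind in combined if ind not in front]
        rng.shuffle(rest)
        pop = (front + rest)[:pop_size]                   # elitism: the front survives first
    return pareto_front(pop, objectives)


# Toy stand-in for a classifier's error: features 0, 3 and 7 are assumed informative.
INFORMATIVE = (0, 3, 7)


def toy_error(mask):
    return sum(1 for i in INFORMATIVE if not mask[i]) / len(INFORMATIVE)


if __name__ == "__main__":
    for mask in evolve(12, toy_error):
        print(sum(mask), "features, error", round(toy_error(mask), 2), mask)
```

Each mask returned is one point on the trade-off curve: no other mask in the final population is both smaller and more accurate. Choosing a single subset from that front, such as the 28-feature subset mentioned in the summary, remains a separate decision.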
ISSN: 0010-4825, 1879-0534
DOI: 10.1016/j.compbiomed.2023.106799