Prediction of enzymatic function with high efficiency and a reduced number of features using genetic algorithm
The post-genomic era has raised a growing demand for efficient procedures to identify protein functions, which can be accomplished by applying machine learning to the characteristics set extracted from the protein. This approach is feature-based and has been the focus of several works in bioinformat...
Saved in:
Published in: | Computers in biology and medicine Vol. 158; p. 106799 |
---|---|
Main Authors: | , , , , |
Format: | Journal Article |
Language: | English |
Published: |
United States
Elsevier Ltd
01-05-2023
Elsevier Limited |
Subjects: | |
Online Access: | Get full text |
Tags: |
Add Tag
No Tags, Be the first to tag this record!
|
Summary: | The post-genomic era has raised a growing demand for efficient procedures to identify protein functions, which can be accomplished by applying machine learning to the characteristics set extracted from the protein. This approach is feature-based and has been the focus of several works in bioinformatics. In this work, we investigated the characteristics of proteins, representing the primary, secondary, tertiary, and quaternary structures of the protein, that improve the model’s quality by applying dimensionality reduction techniques and using the Support Vector Machine classifier for predicting the enzymes’ classes. During the investigation, two approaches were evaluated: feature extraction/transformation, which was performed using the statistical technique Factor Analysis, and feature selection methods. For feature selection, we proposed an approach based on a genetic algorithm to face the optimization conflict between the simplicity and reliability of an ideal representation of the characteristics of the enzymes and also compared and employed other methods for this purpose. The best result was accomplished using a feature subset generated by our implementation of a multi-objective genetic algorithm enriched with features that this work identified as relevant to represent the enzymes. This subset representation reduced the dataset by about 87% and reached 85.78% of F-measure performance, improving the overall quality of the model classification. In addition, we verified in this work a subset addressed with only 28 features out of a total of 424 that reached a performance above 80% of F-measure for four of the six evaluated classes, showing that satisfactory classification performance can be achieved with a reduced number of enzymes’s characteristics. The datasets and implementations are openly available.
•We identified 424 attributes of 17,275 unique sequences of the enzymes considered.•A multi-objective genetic algorithm was proposed to select the best attributes.•The method combines simplicity and reliability of an ideal representation of the enzymes.•Subset reduced the dataset by about 87% and reached 85.78% of the F-measure. |
---|---|
Bibliography: | ObjectType-Article-1 SourceType-Scholarly Journals-1 ObjectType-Feature-2 content type line 23 |
ISSN: | 0010-4825 1879-0534 |
DOI: | 10.1016/j.compbiomed.2023.106799 |