iPromoter-ET: Identifying promoters and their strength by extremely randomized trees-based feature selection

Bibliographic Details
Published in: Analytical biochemistry Vol. 630; p. 114335
Main Authors: Liang, Yunyun, Zhang, Shengli, Qiao, Huijuan, Yao, Yingying
Format: Journal Article
Language: English
Published: United States Elsevier Inc 01-10-2021
Description
Summary: A promoter is a region of DNA that determines the transcription of a particular gene. RNA polymerase contains several σ factors, which identify the promoter and facilitate the binding of RNA polymerase to it. Owing to the importance of promoters in genome research, and given the avalanche of DNA sequences discovered in the post-genomic age, there is an urgent need for computational tools that effectively identify promoters and their strength. In this paper, we develop a model named iPromoter-ET that uses k-mer nucleotide composition, binary encoding, and dinucleotide property matrix-based distance transformation for feature extraction, and extremely randomized trees (extra trees) for feature selection. Its 1st layer identifies whether a DNA sequence is a promoter or not, while its 2nd layer classifies promoter samples as strong or weak. A support vector machine performs the identification, and five-fold cross-validation is used to assess performance. The results indicate that our model remarkably outperforms the existing models in both the 1st and 2nd layers in terms of accuracy and stability. We anticipate that our proposed model will become a very effective intelligent tool or, at the least, a complementary tool to the existing approaches for identifying promoters and their strength. Moreover, the datasets and codes for iPromoter-ET are freely available at https://github.com/shengli0201/iPromoter-ET.
•A novel identification model named iPromoter-ET is proposed, based on extremely randomized trees for feature selection.
•Our iPromoter-ET model achieves remarkably higher accuracy for identifying promoters and their strength.
•The iPromoter-ET model outperforms existing related models.
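As a rough illustration of two of the feature extraction steps named in the summary, the following Python sketch computes a k-mer nucleotide composition and a binary (one-hot) encoding for a DNA sequence. It is not the authors' code: the function names, the A/C/G/T bit ordering, the frequency normalization, and the handling of ambiguous bases are all assumptions made for this example, and the dinucleotide property matrix-based distance transformation, extra-trees feature selection, and support vector machine stages are not covered.

```python
from itertools import product

ALPHABET = "ACGT"

# Assumed one-hot layout (A, C, G, T); the paper's exact bit order is not given here.
ONE_HOT = {
    "A": (1, 0, 0, 0),
    "C": (0, 1, 0, 0),
    "G": (0, 0, 1, 0),
    "T": (0, 0, 0, 1),
}


def kmer_composition(seq, k=2):
    """Return the frequencies of all 4**k k-mers, in lexicographic order."""
    seq = seq.upper()
    kmers = ["".join(p) for p in product(ALPHABET, repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    total = len(seq) - k + 1  # number of sliding windows
    if total <= 0:
        return [0.0] * len(kmers)
    for i in range(total):
        window = seq[i:i + k]
        if window in counts:  # windows with ambiguous bases (e.g. N) are skipped
            counts[window] += 1
    return [counts[m] / total for m in kmers]


def binary_encoding(seq):
    """Concatenate a 4-bit one-hot code per nucleotide (all zeros if unknown)."""
    vec = []
    for base in seq.upper():
        vec.extend(ONE_HOT.get(base, (0, 0, 0, 0)))
    return vec


def extract_features(seq, k=2):
    """Hypothetical helper: join both encodings into one feature vector."""
    return kmer_composition(seq, k) + binary_encoding(seq)
```

For a sequence of length n, `extract_features` returns 4**k + 4n values; a vector of this kind would then be passed to a feature selection step and a classifier such as those described in the summary.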
ISSN: 0003-2697, 1096-0309
DOI: 10.1016/j.ab.2021.114335