Identification of Novel Bacterial Microproteins Encoded by Small Open Reading Frames Using a Computational Proteogenomics Workflow

Bibliographic Details
Published in:	Methods in molecular biology (Clifton, N.J.) Vol. 2836; p. 19
Main Authors:	de Souza, Eduardo Vieira; Bizarro, Cristiano Valim
Format:	Journal Article
Language:	English
Published:	United States 2024
Subjects:	Bacteria - genetics; Bacteria - metabolism; Bacterial Proteins - genetics; Bacterial Proteins - metabolism; Computational Biology - methods; Machine Learning; Mass Spectrometry - methods; Micropeptides; Open Reading Frames - genetics; Proteogenomics - methods; Proteomics - methods; Software; Workflow; smORFs; Genome annotation; μProteInS; RNA-seq; Mass spectrometry; Proteomics

Description
Summary:	Genome annotation has historically ignored small open reading frames (smORFs), which encode a class of proteins shorter than 100 amino acids, collectively referred to as microproteins. This cutoff was established to avoid thousands of false positives due to limitations of pure genomics pipelines. Proteogenomics, a computational approach that combines genomics, transcriptomics, and proteomics, makes it possible to accurately identify these short sequences by overlaying different levels of omics evidence. In this chapter, we showcase the use of μProteInS, a bioinformatics pipeline developed for the identification of unannotated microproteins encoded by smORFs in bacteria. The workflow covers all the steps from quality control and transcriptome assembly to the scoring and post-processing of mass spectrometry data. Additionally, we provide an example of how to apply the pipeline's machine learning method to identify high-confidence spectra and pinpoint the most reliable identifications from large datasets.
ISSN:	1940-6029
DOI:	10.1007/978-1-0716-4007-4_2