Discovering molecular features of intrinsically disordered regions by using evolution for contrastive learning
Published in: | PLoS computational biology Vol. 18; no. 6; p. e1010238 |
---|---|
Main Authors: | , , , , , |
Format: | Journal Article |
Language: | English |
Published: | United States: Public Library of Science (PLoS), 01-06-2022 |
Subjects: | |
Summary: | A major challenge to the characterization of intrinsically disordered regions (IDRs), which are widespread in the proteome, but relatively poorly understood, is the identification of molecular features that mediate functions of these regions, such as short motifs, amino acid repeats and physicochemical properties. Here, we introduce a proteome-scale feature discovery approach for IDRs. Our approach, which we call "reverse homology", exploits the principle that important functional features are conserved over evolution. We use this as a contrastive learning signal for deep learning: given a set of homologous IDRs, the neural network has to correctly choose a held-out homolog from another set of IDRs sampled randomly from the proteome. We pair reverse homology with a simple architecture and standard interpretation techniques, and show that the network learns conserved features of IDRs that can be interpreted as motifs, repeats, or bulk features like charge or amino acid propensities. We also show that our model can be used to produce visualizations of what residues and regions are most important to IDR function, generating hypotheses for uncharacterized IDRs. Our results suggest that feature discovery using unsupervised neural networks is a promising avenue to gain systematic insight into poorly understood protein sequences. |
---|---|
Bibliography: | Competing interests: I have read the journal’s policy and the authors of this manuscript have the following competing interests: AMM is a Consultant to Dewpoint Therapeutics Inc. Current address: Department of Electrical Engineering and Computer Sciences, Berkeley, California, United States of America. Current address: Systems Biology Program, Center for Genomic Regulation, Barcelona, Spain. Current address: Molecular Biology and Biochemistry, Medical University of Graz, Graz, Austria. Current address: Microsoft Research, Cambridge, Massachusetts, United States of America. |
ISSN: | 1553-734X, 1553-7358 |
DOI: | 10.1371/journal.pcbi.1010238 |
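
The contrastive task described in the summary (pick the held-out homolog out of a set that also contains randomly sampled IDRs) can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `reverse_homology_loss`, the mean-pooling of the homolog set, the dot-product scoring, and the random vectors standing in for learned IDR embeddings are all assumptions made for illustration.

```python
import numpy as np


def reverse_homology_loss(homolog_embs, candidate_embs, target_index):
    """Cross-entropy for picking the held-out homolog among the candidates.

    homolog_embs:   (n_homologs, d) embeddings of the reference homolog set
    candidate_embs: (n_candidates, d) embeddings of the held-out homolog
                    plus IDRs sampled at random from the proteome
    target_index:   row of candidate_embs that holds the true homolog
    """
    # Assumption: summarize the homolog set by mean-pooling its embeddings.
    query = homolog_embs.mean(axis=0)
    # Assumption: score each candidate by dot-product similarity.
    scores = candidate_embs @ query
    # Log-softmax over candidates, shifted by the max for numerical stability.
    scores = scores - scores.max()
    log_probs = scores - np.log(np.exp(scores).sum())
    # Loss is low when the true homolog gets the highest probability.
    return -log_probs[target_index]


# Toy data: random vectors stand in for the output of a sequence encoder.
rng = np.random.default_rng(0)
family = rng.normal(size=8)                        # shared "conserved" signal
homologs = family + 0.1 * rng.normal(size=(4, 8))  # reference homolog set
held_out = family + 0.1 * rng.normal(size=8)       # the homolog to be found
randoms = rng.normal(size=(5, 8))                  # unrelated IDRs
candidates = np.vstack([held_out, randoms])

loss = reverse_homology_loss(homologs, candidates, target_index=0)
print(f"contrastive loss: {loss:.4f}")
```

In a trained model, the embeddings would come from a neural network applied to the IDR sequences, and minimizing this kind of loss would push the network to encode whatever features are shared among homologs.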