GENERALIST: A latent space based generative model for protein sequence families

Generative models of protein sequence families are an important tool in the repertoire of protein scientists and engineers alike. However, state-of-the-art generative approaches face inference, accuracy, and overfitting- related obstacles when modeling moderately sized to large proteins and/or prote...

Full description

Saved in:
Bibliographic Details
Published in:PLoS computational biology Vol. 19; no. 11; p. e1011655
Main Authors: Akl, Hoda, Emison, Brooke, Zhao, Xiaochuan, Mondal, Arup, Perez, Alberto, Dixit, Purushottam D
Format: Journal Article
Language:English
Published: United States Public Library of Science 01-11-2023
Public Library of Science (PLoS)
Subjects:
Online Access:Get full text
Tags: Add Tag
No Tags, Be the first to tag this record!
Description
Summary:Generative models of protein sequence families are an important tool in the repertoire of protein scientists and engineers alike. However, state-of-the-art generative approaches face inference, accuracy, and overfitting- related obstacles when modeling moderately sized to large proteins and/or protein families with low sequence coverage. Here, we present a simple to learn, tunable, and accurate generative model, GENERALIST: GENERAtive nonLInear tenSor-factorizaTion for protein sequences. GENERALIST accurately captures several high order summary statistics of amino acid covariation. GENERALIST also predicts conservative local optimal sequences which are likely to fold in stable 3D structure. Importantly, unlike current methods, the density of sequences in GENERALIST-modeled sequence ensembles closely resembles the corresponding natural ensembles. Finally, GENERALIST embeds protein sequences in an informative latent space. GENERALIST will be an important tool to study protein sequence variability.
Bibliography:new_version
ObjectType-Article-1
SourceType-Scholarly Journals-1
ObjectType-Feature-2
content type line 23
ISSN:1553-7358
1553-734X
1553-7358
DOI:10.1371/journal.pcbi.1011655