Data Augmentation Algorithms for Detecting Conserved Domains in Protein Sequences: A Comparative Study

Bibliographic Details
Published in:	Journal of Proteome Research, Vol. 7, no. 1, pp. 192-201
Main Author:	Bi, Chengpeng
Format:	Journal Article
Language:	English
Published:	United States: American Chemical Society, 01-01-2008
Subjects:	Algorithms; Amino Acid Motifs; Conserved Sequence; Information Storage and Retrieval; Likelihood Functions; Monte Carlo Method; Protein Structure, Tertiary; Sequence Alignment; Stochastic Processes; Structural Homology, Protein; data augmentation; protein sequence analysis; Markov chain Monte Carlo; motif discovery; multiple local alignment; expectation maximization (EM); Gibbs sampling

Description
Summary:	Protein conserved domains are distinct units of molecular structure, usually associated with particular aspects of molecular function such as catalysis or binding. These conserved subsequences are often unobserved and must therefore be detected. Motif discovery methods can find these unobserved domains given a set of sequences. This paper presents a data augmentation (DA) framework that unifies a suite of motif-finding algorithms: all of them maximize the same likelihood function by imputing the unobserved data. Data augmentation refers to methods that formulate iterative optimization by exploiting the unobserved data. Two categories of maximum-likelihood motif-finding algorithms are illustrated under the DA framework. The first comprises deterministic algorithms, which maximize the likelihood function through an iterative, locally optimal search in the alignment space. The second comprises stochastic algorithms, which iteratively draw motif location samples via Monte Carlo simulation while keeping track of the solution with the best likelihood found so far. Four DA motif discovery algorithms are described, evaluated, and compared by aligning real and simulated protein sequences.
ISSN:	1535-3893, 1535-3907
DOI:	10.1021/pr070475q
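
Illustrative sketch (not part of the catalog record): the summary contrasts deterministic local search with stochastic Monte Carlo sampling of motif locations, both treating the unobserved motif start positions as the data to impute. The Python below is a minimal sketch of that contrast, not a reproduction of the paper's four algorithms. It assumes exactly one motif occurrence per sequence, a fixed motif width, and sequences written in the 20 standard amino-acid letters. All function names are hypothetical. The deterministic branch is a greedy, one-sequence-at-a-time update rather than a full EM; the stochastic branch is a basic Gibbs-style site sampler that remembers the best-scoring alignment seen.

```python
import math
import random

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"  # assumption: standard residues only


def background_freqs(seqs):
    """Smoothed residue frequencies over all sequences."""
    total = sum(len(s) for s in seqs)
    return {a: (sum(s.count(a) for s in seqs) + 1.0) / (total + len(ALPHABET))
            for a in ALPHABET}


def build_pwm(seqs, starts, width, skip=None, pseudo=0.5):
    """Position-specific frequencies from the current sites (optionally leaving one out)."""
    counts = [{a: pseudo for a in ALPHABET} for _ in range(width)]
    for i, (s, p) in enumerate(zip(seqs, starts)):
        if i == skip:
            continue
        for j in range(width):
            counts[j][s[p + j]] += 1.0
    pwm = []
    for col in counts:
        col_total = sum(col.values())
        pwm.append({a: c / col_total for a, c in col.items()})
    return pwm


def site_log_odds(seq, pos, pwm, background):
    """Log-likelihood ratio (motif vs. background) of the window starting at pos."""
    return sum(math.log(pwm[j][seq[pos + j]] / background[seq[pos + j]])
               for j in range(len(pwm)))


def alignment_score(seqs, starts, width, background):
    """Score of a complete alignment: sum of site log-odds under its own PWM."""
    pwm = build_pwm(seqs, starts, width)
    return sum(site_log_odds(s, p, pwm, background) for s, p in zip(seqs, starts))


def find_motif(seqs, width, stochastic, sweeps=50, seed=0):
    """Impute motif start positions; return (best_starts, best_score)."""
    rng = random.Random(seed)
    background = background_freqs(seqs)
    starts = [rng.randrange(len(s) - width + 1) for s in seqs]
    best_starts = list(starts)
    best_score = alignment_score(seqs, starts, width, background)
    for _ in range(sweeps):
        for i, s in enumerate(seqs):
            # Re-estimate the motif model without sequence i, then re-impute its site.
            pwm = build_pwm(seqs, starts, width, skip=i)
            scores = [site_log_odds(s, p, pwm, background)
                      for p in range(len(s) - width + 1)]
            if stochastic:
                # Monte Carlo step: draw a start with probability proportional
                # to its likelihood ratio (shifted by the max for stability).
                top = max(scores)
                weights = [math.exp(x - top) for x in scores]
                starts[i] = rng.choices(range(len(scores)), weights=weights)[0]
            else:
                # Deterministic step: take the locally best start.
                starts[i] = max(range(len(scores)), key=scores.__getitem__)
        score = alignment_score(seqs, starts, width, background)
        if score > best_score:  # keep track of the best solution so far
            best_starts, best_score = list(starts), score
    return best_starts, best_score


# Toy usage with made-up sequences (hypothetical data, for illustration only).
toy = ["MKTAYIAKQRWCHGDELVNS", "GSHWCHGDPLQERTMKVA",
       "AQLNWCHGDTTYIPRSEK", "PLKVTEWCHGDAMNQRSG"]
greedy_starts, greedy_score = find_motif(toy, width=5, stochastic=False)
sampled_starts, sampled_score = find_motif(toy, width=5, stochastic=True)
```

Both modes share the same scoring function, which is the point of the sketch: only the rule for imputing each start position differs. Neither mode is guaranteed to find the global optimum, and results depend on the random initial alignment.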