A Monte Carlo EM Algorithm for De Novo Motif Discovery in Biomolecular Sequences

Bibliographic Details
Published in:	IEEE/ACM transactions on computational biology and bioinformatics Vol. 6; no. 3; pp. 370 - 386
Main Author:	Bi, Chengpeng
Format:	Journal Article
Language:	English
Published:	United States IEEE 01-07-2009 The Institute of Electrical and Electronics Engineers, Inc. (IEEE)
Subjects:	Algorithms; Amino Acid Motifs; Amino Acid Sequence; Animals; Base Sequence; Bioinformatics; Clustering algorithms; Computer Simulation; Databases, Genetic; DNA - chemistry; Expectation maximization (EM); Genetics; Genomics; Iterative algorithms; Markov Chains; Models, Molecular; Molecular Sequence Data; Monte Carlo EM; Monte Carlo Method; Monte Carlo methods; Monte Carlo simulation; motif discovery; multiple sequence alignment; Nucleic Acid Conformation; Protein Structure, Secondary; Proteins; Proteins - chemistry; Pulse width modulation; Sequence Analysis - methods; Sequences; Stochastic processes; Studies; Transcription Factors; transcriptional regulation; Statistical computing; Biology and genetics

Description
Summary:	Motif discovery methods play pivotal roles in deciphering the genetic regulatory codes (i.e., motifs) in genomes as well as in locating conserved domains in protein sequences. The Expectation Maximization (EM) algorithm is one of the most popular methods used in de novo motif discovery. Based on the position weight matrix (PWM) updating technique, this paper presents a Monte Carlo version of the EM motif-finding algorithm that carries out stochastic sampling in local alignment space to overcome the conventional EM's main drawback of being trapped in a local optimum. The newly implemented algorithm is named the Monte Carlo EM Motif Discovery Algorithm (MCEMDA). MCEMDA starts from an initial model and then iteratively performs Monte Carlo simulation and parameter updates until convergence. A log-likelihood profiling technique, together with the top-k strategy, is introduced to cope with the phase-shift and multimodality issues in the motif discovery problem. A novel grouping motif alignment (GMA) algorithm is designed to select motifs by clustering a population of candidate local alignments and is successfully applied to subtle motif discovery. MCEMDA compares favorably with other popular PWM-based and word-enumerative motif algorithms tested on simulated (l, d)-motif cases and on documented prokaryotic and eukaryotic DNA motif sequences. Finally, MCEMDA is applied to detect large blocks of conserved domains in protein benchmarks and exhibits excellent capacity when compared with other multiple sequence alignment methods.
ISSN:	1545-5963 1557-9964
DOI:	10.1109/TCBB.2008.103
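
Illustration:	The loop outlined in the summary (start from an initial model, then alternate Monte Carlo sampling of local alignments with a PWM update) can be sketched roughly as follows. This is a minimal sketch and not the MCEMDA implementation: every function name is hypothetical, and it assumes DNA sequences over A/C/G/T, exactly one motif occurrence per sequence, each sequence at least as long as the motif, a uniform background model, and a fixed iteration count in place of a convergence test. The log-likelihood profiling, top-k, and GMA steps from the summary are not reproduced.

```python
import math
import random

ALPHABET = "ACGT"  # assumption: DNA input only


def build_pwm(sequences, positions, width, pseudocount=1.0):
    """Estimate a position weight matrix from one motif start per sequence."""
    counts = [{b: pseudocount for b in ALPHABET} for _ in range(width)]
    for seq, start in zip(sequences, positions):
        for j in range(width):
            counts[j][seq[start + j]] += 1.0
    pwm = []
    for col in counts:
        total = sum(col.values())
        pwm.append({b: c / total for b, c in col.items()})
    return pwm


def site_log_ratio(seq, start, pwm, background):
    """Log-likelihood ratio (motif vs. background) of the site at `start`."""
    return sum(
        math.log(pwm[j][seq[start + j]] / background[seq[start + j]])
        for j in range(len(pwm))
    )


def site_weights(seq, pwm, background):
    """Unnormalized sampling weight for every possible motif start in seq."""
    n_starts = len(seq) - len(pwm) + 1
    return [math.exp(site_log_ratio(seq, s, pwm, background)) for s in range(n_starts)]


def mcem_motif_search(sequences, width, iterations=200, seed=0):
    """Alternate stochastic site sampling with PWM re-estimation.

    Returns (best_score, best_positions), where the score is the summed
    log-likelihood ratio of the sampled alignment under its own PWM.
    """
    rng = random.Random(seed)
    background = {b: 1.0 / len(ALPHABET) for b in ALPHABET}  # assumed uniform

    # Initial model: a random local alignment and the PWM it implies.
    positions = [rng.randrange(len(s) - width + 1) for s in sequences]
    pwm = build_pwm(sequences, positions, width)
    best_score, best_positions = -math.inf, list(positions)

    for _ in range(iterations):
        # Monte Carlo step: draw one site per sequence, with probability
        # proportional to its likelihood ratio under the current PWM.
        positions = []
        for seq in sequences:
            weights = site_weights(seq, pwm, background)
            positions.append(rng.choices(range(len(weights)), weights=weights)[0])

        # Update step: re-estimate the PWM from the sampled alignment.
        pwm = build_pwm(sequences, positions, width)

        score = sum(
            site_log_ratio(seq, start, pwm, background)
            for seq, start in zip(sequences, positions)
        )
        if score > best_score:
            best_score, best_positions = score, list(positions)

    return best_score, best_positions
```

Calling mcem_motif_search(sequences, width) returns the highest-scoring sampled alignment as a list of start positions, one per input sequence. Because sites are sampled rather than chosen greedily, different seeds may yield different alignments, which is the property a sampling-based EM variant relies on to move away from a poor starting point.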