What is the expectation maximization algorithm?

Chuong B. Do and Serafim Batzoglou. Nature Biotechnology.
The expectation maximization algorithm arises in many computational biology applications that involve probabilistic models. What is it good for, and how does it work? 
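
To make the "how does it work?" question concrete, here is a minimal Python sketch of EM for a mixture of two biased coins. It assumes each set of tosses comes from one of two coins with equal, fixed mixing weights, and that we never observe which coin was used. The function name `em_two_coins`, the head counts, and the starting values are illustrative assumptions, not taken from the listing above.

```python
import math


def em_two_coins(head_counts, n_tosses, theta_a, theta_b, n_iter=20):
    """Estimate the head probabilities of two coins from unlabeled toss sets.

    head_counts: number of heads seen in each set of n_tosses tosses.
    theta_a, theta_b: initial guesses for the two head probabilities.
    Returns the final estimates and the log-likelihood at each iteration
    (up to an additive constant, since binomial coefficients are omitted).
    """
    log_likelihoods = []
    for _ in range(n_iter):
        # E-step: accumulate expected head/tail counts attributed to each coin.
        heads_a = tails_a = heads_b = tails_b = 0.0
        log_lik = 0.0
        for h in head_counts:
            t = n_tosses - h
            like_a = theta_a ** h * (1 - theta_a) ** t
            like_b = theta_b ** h * (1 - theta_b) ** t
            resp_a = like_a / (like_a + like_b)  # posterior that coin A was used
            resp_b = 1.0 - resp_a
            heads_a += resp_a * h
            tails_a += resp_a * t
            heads_b += resp_b * h
            tails_b += resp_b * t
            log_lik += math.log(0.5 * like_a + 0.5 * like_b)
        log_likelihoods.append(log_lik)
        # M-step: re-estimate each coin's bias from its expected counts.
        theta_a = heads_a / (heads_a + tails_a)
        theta_b = heads_b / (heads_b + tails_b)
    return theta_a, theta_b, log_likelihoods


# Illustrative data: five sets of ten tosses (assumed values).
theta_a, theta_b, lls = em_two_coins([5, 9, 8, 4, 7], 10, 0.6, 0.5)
print(round(theta_a, 2), round(theta_b, 2))
```

The E-step replaces the unknown coin labels with posterior probabilities, and the M-step treats the resulting expected counts as if they were observed. The recorded log-likelihood should never decrease from one iteration to the next, which is a useful sanity check on any EM implementation.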

MapReduce for Bayesian Network Parameter Learning using the EM Algorithm

Details of the MapReduce formulation of EM are presented, speed-ups versus the sequential case are reported, and various Hadoop cluster configurations in experiments with Bayesian networks of different sizes and structures are compared.

Molecular interaction motifs in a system-wide network context: Computationally charting transient kinase-substrate phosphorylation events

Molecular interaction motifs are evaluated in a system-wide network context by computationally charting transient kinase-substrate phosphorylation events, and relationships between these motifs and kinase activity are shown.

A Genetic Algorithm for Learning Parameters in Bayesian Networks using Expectation Maximization

It is shown that GAEM provides significant speed-ups since it tends to select more fit individuals, which converge faster, as parents for the next generation, while producing better log-likelihood scores than the traditional EM algorithm.

Variants of compound models and their application to citation analysis

A thesis submitted in partial fulfilment of the requirements of the University of Wolverhampton for the degree of Doctor of Philosophy.

Improving the Performance and Understanding of the Expectation Maximization Algorithm: Evolutionary and Visualization Methods

This work proposes a genetic algorithm for expectation maximization (GAEM) and finds that small population sizes are sufficient to produce high solution quality and considerable speed-up compared to the traditional EM algorithm. It also develops an age-layered EM algorithm, ALEM, which enables comparisons between similarly aged EM runs and discards less promising runs well before they converge.

EM*: An EM Algorithm for Big Data

The strategy is to embed EM-T into a non-linear hierarchical data structure (heap) that allows us to separate data that needs to be revisited from data that does not and narrow the iteration toward the data that is more difficult to cluster.

Belief Revision and the EM Algorithm

This paper provides a natural interpretation of the EM algorithm as a succession of revision steps that try to find a probability distribution in a parametric family of models in agreement with

Using data to build a better EM: EM* for big data

The strategy is to embed EM-T into a nonlinear hierarchical data structure (heap) that allows us to separate data that needs to be revisited from data that does not and narrow the iteration toward the data that is more difficult to cluster.

An expectation-maximization algorithm enables accurate ecological modeling using longitudinal microbiome sequencing data

BEEM addresses a key bottleneck in “systems analysis” of microbiomes by enabling accurate inference of ecological models from high-throughput sequencing data without the need for experimental biomass measurements.



How does gene expression clustering work?

Clustering is often one of the first steps in gene expression analysis. How do clustering algorithms work, which ones should we use and what can we expect from them?

A modified expectation maximization algorithm for penalized likelihood estimation in emission tomography

A. Pierro. Mathematics. IEEE Trans. Medical Imaging, 1995.
The new method is a natural extension of EM for maximizing likelihood with concave priors in emission tomography, and convergence proofs are given.

Testing for linkage disequilibrium in genotypic data using the Expectation-Maximization algorithm

It is concluded that with highly polymorphic loci, the EM algorithm does lead to a useful test for linkage disequilibrium, but that it is necessary to find the empirical distribution of likelihood ratios in order to perform a test of significance correctly.

An expectation maximization (EM) algorithm for the identification and characterization of common sites in unaligned biopolymer sequences

Statistical methodology for the identification and characterization of protein binding sites in a set of unaligned DNA fragments is presented and the final motif is utilized in a search for undiscovered CRP binding sites.

Hidden Markov models in computational biology. Applications to protein modeling.

The results suggest the presence of an EF-hand calcium binding motif in a highly conserved and evolutionarily preserved putative intracellular region of 155 residues in the alpha-1 subunit of L-type calcium channels, which play an important role in excitation-contraction coupling.

A statistical model for identifying proteins by tandem mass spectrometry.

A statistical model is presented for computing probabilities that proteins are present in a sample on the basis of peptides assigned to tandem mass (MS/MS) spectra acquired from a proteolytic digest of the sample, and it is shown to produce probabilities that are accurate and have high power to discriminate correct from incorrect protein identifications.

Genome-wide discovery of transcriptional modules from DNA sequence and gene expression

The EM algorithm is used to identify transcriptional modules (sets of genes that are co-regulated in a set of experiments through a common motif profile). It refines both the module assignment and the motif profile so as to best explain the expression data as a function of transcriptional motifs.

Maximum-likelihood estimation of molecular haplotype frequencies in a diploid population.

An expectation-maximization (EM) algorithm leading to maximum-likelihood estimates of molecular haplotype frequencies under the assumption of Hardy-Weinberg proportions is implemented and appears to be useful for the analysis of nuclear DNA sequences or highly variable loci.
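
As an illustration of the idea in this summary, the following Python sketch estimates haplotype frequencies by EM in the simplest possible setting: two biallelic loci under Hardy-Weinberg proportions. It is a simplification under stated assumptions, not the implementation from the paper. The function name `em_haplotype_freqs` and the genotype encoding (number of `A` alleles at locus 1, number of `B` alleles at locus 2) are hypothetical. Only the double heterozygote is phase-ambiguous, so the E-step only has to split those individuals between the AB/ab and Ab/aB phasings.

```python
def em_haplotype_freqs(genotype_counts, n_iter=50):
    """EM estimate of two-locus haplotype frequencies from unphased genotypes.

    genotype_counts maps (copies of A, copies of B), each 0, 1 or 2,
    to the number of individuals with that genotype.
    """
    # Genotypes whose two haplotypes are known without phase information.
    resolved = {
        (2, 2): ("AB", "AB"), (2, 1): ("AB", "Ab"), (2, 0): ("Ab", "Ab"),
        (1, 2): ("AB", "aB"), (1, 0): ("Ab", "ab"),
        (0, 2): ("aB", "aB"), (0, 1): ("aB", "ab"), (0, 0): ("ab", "ab"),
    }
    fixed = {"AB": 0.0, "Ab": 0.0, "aB": 0.0, "ab": 0.0}
    for genotype, (hap1, hap2) in resolved.items():
        n = genotype_counts.get(genotype, 0)
        fixed[hap1] += n
        fixed[hap2] += n

    n_double_het = genotype_counts.get((1, 1), 0)
    total_haplotypes = 2 * sum(genotype_counts.values())
    freqs = {hap: 0.25 for hap in fixed}  # uniform starting point

    for _ in range(n_iter):
        # E-step: under Hardy-Weinberg, P(AB/ab) and P(Ab/aB) are proportional
        # to the products of their haplotype frequencies (the factor 2 cancels).
        cis = freqs["AB"] * freqs["ab"]
        trans = freqs["Ab"] * freqs["aB"]
        p_cis = cis / (cis + trans) if cis + trans > 0 else 0.5
        expected_cis = n_double_het * p_cis
        expected_trans = n_double_het - expected_cis
        # M-step: haplotype frequencies from known plus expected counts.
        freqs = {
            "AB": (fixed["AB"] + expected_cis) / total_haplotypes,
            "ab": (fixed["ab"] + expected_cis) / total_haplotypes,
            "Ab": (fixed["Ab"] + expected_trans) / total_haplotypes,
            "aB": (fixed["aB"] + expected_trans) / total_haplotypes,
        }
    return freqs


# Illustrative counts (assumed values): many AB/AB and ab/ab homozygotes,
# so the double heterozygotes should be attributed mostly to AB/ab.
print(em_haplotype_freqs({(2, 2): 40, (0, 0): 40, (1, 1): 20}))
```

With more loci or more alleles the number of compatible haplotype pairs per genotype grows quickly, but the structure stays the same: weight each compatible pair by its current probability, then recount.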

RNA sequence analysis using covariance models.

We describe a general approach to several RNA sequence analysis problems using probabilistic models that flexibly describe the secondary structure and primary sequence consensus of an RNA sequence


This method is applied to data on blood groups collected from villages near the mouth of the River Po, in northern Italy, in the course of an investigation on microcythaemia, and it is shown to be equivalent to maximum likelihood, and therefore fully efficient in the statistical sense.