Maximum likelihood and the graphical EM algorithm

Maximum likelihood and the graphical EM algorithm

Latombe, Guillaume and Granger, Eric and Dilkes, Fred A.

Canadian Conference on Electrical and Computer Engineering 2006

Abstract : A challenge in the design of SCFGs for some applications is the computation of the probability distributions associated with its production rules. The most popular methods for this task are by far the Inside-Outside (IO) algorithm, which maximizes the likelihood of a data set, and the Viterbi Score (VS) algorithm, which maximizes the likelihood of its best parse trees. Even though both algorithms are very demanding and have a similar time complexity, VS is known to require an overall lower computational cost in practice. Several methods have been developed to accelerate the computation of production rules probabilities with IO. Among them is the graphical EM (gEM) algorithm. During pre-processing, it computes the rules used in the production of the data set, and organizes them into an ordered data structure called “support graphs.” The iterative process uses these graphs to optimize the likelihood of the data set in a similar way to the IO algorithm. In this paper, a VS variant of gEM, that selects in the support graphs the elements belonging to the best parse trees of the data set, is proposed. The performance of the original algorithm, gEM(IO), and the new one, gEM(VS), are compared in terms of time and space complexity, convergence time, and perplexity. An experimental protocol is defined such that the impact on performance of factors like training set size, sequence length, and ambiguity of grammars can be assessed. Computer simulations are performed on the real-world European Molecular Biology Laboratory data set, consisting of DNA sequences. Results indicate that gEM(VS) provides significantly lower convergence time and time complexity, at the expense of a moderately higher perplexity and space complexity level.