Learning Hidden Markov Model Topology

Alexander Schliep




ZAIK/ZPR, Universität zu Köln
Weyertal 80
50931 Köln
Germany
Phone: +49­221­470­6011
Fax: +49­221­470­5160
E-mail: schliep@zpr.uni­koeln.de






ABSTRACT

Hidden­Markov­Models (HMMs) are a widely and successfully used tool in statistical modeling and statistical pattern recognition with gene finding being one of the prime examples in computational biology. One fundamental problem in the application of Hidden Markov Models is finding the HMMs underlying architecture or topology especially when there is no strong evidence towards a specific choice from the application domain (e.g., when doing black box modeling). Or similarly, if the existence of rarely used or too frequently used states after training suggests that the chosen topology does not fit the data well. Topology is important with regard to good parameter estimates and with regard to performance: A model with "too many" states -- and hence too many parameters -- requires too much training data while an model with "not enough" states prohibits the HMM from capturing subtle statistical patterns. To determine the "optimal" topology either knowledge from the application domain is used or a trial and error procedure using ad­hoc methods (i.e., model surgery) are employed; systematic procedures have been rarely considered (e.g., Bayesian Model merging, Stolcke and Omohundro). We have developed a novel algorithm that will infer an HMM representation of the (ergodic) process generating a sequence, without prespecifying the topology of the model. That is, we infer the number of hidden states, the transitions allowed and the transition and emission probabilities. We use a Bayesian approach where a suitable prior on one crucial parameter forces generalization (and thus necessarily reduces data likelihood) from the maximum likelihood model. We will present the algorithm, some of our theoretical results and results from numerical experiments on biological --- DNA and Protein --- sequence data.


Keywords: Hidden Markov Models, Bayesian statistics, Sequence Analysis