Abstract
We present a unified probabilistic framework for statistical language modeling that can simultaneously incorporate various aspects of natural language, such as local word interactions, syntactic structure, and semantic document information. Our approach is based on the latent maximum entropy principle, a statistical inference principle we have recently proposed, which allows relationships over hidden features to be effectively captured in a unified model. Our work extends previous research on maximum entropy methods for language modeling, which allow only observed features to be modeled. The ability to conveniently incorporate hidden variables extends the expressiveness of language models while alleviating the need to pre-process the data to obtain explicitly observed features. We describe efficient algorithms for marginalization, inference, and normalization in our extended models. We then use these techniques to combine two standard forms of language model: local lexical models (Markov N-gram models) and global document-level semantic models (probabilistic latent semantic analysis). Our experimental results on the Wall Street Journal corpus show an 18.5% reduction in perplexity compared to a baseline tri-gram model with Good-Turing smoothing.
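To make the combination concrete, here is a minimal sketch in Python (not from the paper) of the normalized log-linear product through which a maximum entropy model can fuse a local N-gram distribution with a PLSA-style semantic distribution, followed by the perplexity measure used in the evaluation. The toy vocabulary, probabilities, and weights lam1/lam2 are hypothetical placeholders, and the sketch omits the latent-variable parameter estimation that the paper actually develops.

import math

# Toy vocabulary and hypothetical component distributions; all numbers
# here are illustrative placeholders, not the paper's trained models.
VOCAB = ["the", "fed", "raised", "rates", "crops"]

# p_ngram[w]: trigram probability p(w | w_{i-2}, w_{i-1}) for one fixed history.
p_ngram = {"the": 0.05, "fed": 0.10, "raised": 0.30, "rates": 0.40, "crops": 0.15}

# p_sem[w]: PLSA-style probability of w given the document's latent topic mixture.
p_sem = {"the": 0.20, "fed": 0.25, "raised": 0.10, "rates": 0.35, "crops": 0.10}

def combine(p1, p2, lam1=1.0, lam2=1.0):
    # Log-linear combination p(w) ∝ p1(w)^lam1 * p2(w)^lam2, renormalized
    # over the vocabulary; computing the normalizer Z is what makes
    # large-scale versions of this step expensive.
    scores = {w: lam1 * math.log(p1[w]) + lam2 * math.log(p2[w]) for w in VOCAB}
    z = sum(math.exp(s) for s in scores.values())
    return {w: math.exp(s) / z for w, s in scores.items()}

p_combined = combine(p_ngram, p_sem)

# Perplexity of a toy test sequence: PP = exp(-(1/N) * sum_i log p(w_i)).
test = ["fed", "raised", "rates"]
log_prob = sum(math.log(p_combined[w]) for w in test)
print("perplexity:", math.exp(-log_prob / len(test)))

Lower perplexity means the combined model assigns higher probability to held-out text; the 18.5% figure reported above is this quantity measured against a Good-Turing-smoothed trigram baseline.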
Editors: Dan Roth and Pascale Fung
Cite this article
Wang, S., Schuurmans, D., Peng, F. et al. Combining Statistical Language Models via the Latent Maximum Entropy Principle. Mach Learn 60, 229–250 (2005). https://doi.org/10.1007/s10994-005-0928-7