Abstract
We present a unified probabilistic framework for statistical language modeling that can simultaneously incorporate various aspects of natural language, such as local word interactions, syntactic structure, and semantic document information. Our approach is based on the latent maximum entropy principle, a statistical inference principle we have recently proposed, which allows relationships over hidden features to be effectively captured in a unified model. Our work extends previous research on maximum entropy methods for language modeling, which allow only observed features to be modeled. The ability to conveniently incorporate hidden variables extends the expressiveness of language models while alleviating the need to pre-process the data to obtain explicitly observed features. We describe efficient algorithms for marginalization, inference, and normalization in our extended models. We then use these techniques to combine two standard forms of language model: local lexical models (Markov N-gram models) and global document-level semantic models (probabilistic latent semantic analysis). Our experimental results on the Wall Street Journal corpus show an 18.5% reduction in perplexity compared to a baseline tri-gram model with Good-Turing smoothing.
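To make the combination concrete, here is a minimal sketch in Python (not from the paper) of the normalized log-linear product through which a maximum entropy model can fuse a local N-gram distribution with a PLSA-style semantic distribution, followed by the perplexity measure used in the evaluation. The toy vocabulary, probabilities, and weights lam1/lam2 are hypothetical placeholders, and the sketch omits the latent-variable parameter estimation that the paper actually develops.

import math

# Toy vocabulary and hypothetical component distributions; all numbers
# here are illustrative placeholders, not the paper's trained models.
VOCAB = ["the", "fed", "raised", "rates", "crops"]

# p_ngram[w]: trigram probability p(w | w_{i-2}, w_{i-1}) for one fixed history.
p_ngram = {"the": 0.05, "fed": 0.10, "raised": 0.30, "rates": 0.40, "crops": 0.15}

# p_sem[w]: PLSA-style probability of w given the document's latent topic mixture.
p_sem = {"the": 0.20, "fed": 0.25, "raised": 0.10, "rates": 0.35, "crops": 0.10}

def combine(p1, p2, lam1=1.0, lam2=1.0):
    # Log-linear combination p(w) ∝ p1(w)^lam1 * p2(w)^lam2, renormalized
    # over the vocabulary; computing the normalizer Z is what makes
    # large-scale versions of this step expensive.
    scores = {w: lam1 * math.log(p1[w]) + lam2 * math.log(p2[w]) for w in VOCAB}
    z = sum(math.exp(s) for s in scores.values())
    return {w: math.exp(s) / z for w, s in scores.items()}

p_combined = combine(p_ngram, p_sem)

# Perplexity of a toy test sequence: PP = exp(-(1/N) * sum_i log p(w_i)).
test = ["fed", "raised", "rates"]
log_prob = sum(math.log(p_combined[w]) for w in test)
print("perplexity:", math.exp(-log_prob / len(test)))

Lower perplexity means the combined model assigns higher probability to held-out text; the 18.5% figure reported above is this quantity measured against a Good-Turing-smoothed trigram baseline.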
Editors: Dan Roth and Pascale Fung
Cite this article
Wang, S., Schuurmans, D., Peng, F. et al. Combining Statistical Language Models via the Latent Maximum Entropy Principle. Mach Learn 60, 229–250 (2005). https://doi.org/10.1007/s10994-005-0928-7