Abstract
We study the problem of extracting cross-lingual topics from non-parallel multilingual text datasets with partially overlapping thematic content (e.g., aligned Wikipedia articles in two different languages). To this end, we develop a new bilingual probabilistic topic model called comparable bilingual latent Dirichlet allocation (C-BiLDA), which is able to deal with such comparable data and, unlike the standard bilingual LDA model (BiLDA), does not assume the availability of document pairs with identical topic distributions. We present a full overview of C-BiLDA and show its utility in the task of cross-lingual knowledge transfer for multi-class document classification on two benchmark datasets for three language pairs. The proposed model outperforms the baseline LDA model, as well as the standard BiLDA model and two standard low-rank approximation methods (CL-LSI and CL-KCCA) used in previous work on this task.
Notes
For instance, Wikipedia articles about Madrid in English and Spanish address many common topics such as “demographics”, “geography and location” or “climate”, while at the same time, only the Spanish article contains text (i.e., a non-shared topic) about “the emblems of the city”, and only the English article elaborates on “business schools” or “bohemian culture” in Madrid.
Without loss of generality, and for simplicity, we restrict the presentation in this article to bilingual topic models.
The energy between two vectors X and Y is defined as \(||X - Y||^2\).
By writing out the joint probability conditioned on all language assignments \(l_{ji}\), one can check that these formulations are indeed equivalent.
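As a small illustration of the energy \(||X - Y||^2\) defined in the note above, the following sketch computes the squared Euclidean distance between two equal-length vectors. The function name `energy` and the example vectors are assumptions made for illustration; they are not taken from the article.

```python
from typing import Sequence


def energy(x: Sequence[float], y: Sequence[float]) -> float:
    """Return ||x - y||^2, the squared Euclidean distance between x and y."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    # Sum of squared component-wise differences.
    return sum((xi - yi) ** 2 for xi, yi in zip(x, y))


if __name__ == "__main__":
    # (1 - 4)^2 + (2 - 6)^2 = 9 + 16
    print(energy([1.0, 2.0], [4.0, 6.0]))  # prints 25.0
```

The energy is zero exactly when the two vectors coincide and grows quadratically with their component-wise differences.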
Acknowledgments
The research presented in this article has been carried out in the context of the SCATE (SBO-130047) research project financed by the (Flemish) agency for Innovation through Science and Technology (IWT).
Ethics declarations
Conflict of Interest
The authors declare that they have no conflict of interest.
Additional information
Responsible editors: Joao Gama, Indre Zliobaite, Alipio Jorge, Concha Bielza.
About this article
Cite this article
Heyman, G., Vulić, I. & Moens, MF. C-BiLDA extracting cross-lingual topics from non-parallel texts by distinguishing shared from unshared content. Data Min Knowl Disc 30, 1299–1323 (2016). https://doi.org/10.1007/s10618-015-0442-x
DOI: https://doi.org/10.1007/s10618-015-0442-x