C-BiLDA extracting cross-lingual topics from non-parallel texts by distinguishing shared from unshared content

Data Mining and Knowledge Discovery

Abstract

We study the problem of extracting cross-lingual topics from non-parallel multilingual text datasets with partially overlapping thematic content (e.g., aligned Wikipedia articles in two different languages). To this end, we develop a new bilingual probabilistic topic model called comparable bilingual latent Dirichlet allocation (C-BiLDA), which is able to deal with such comparable data, and, unlike the standard bilingual LDA model (BiLDA), does not assume the availability of document pairs with identical topic distributions. We present a full overview of C-BiLDA, and show its utility in the task of cross-lingual knowledge transfer for multi-class document classification on two benchmarking datasets for three language pairs. The proposed model outperforms the baseline LDA model, as well as the standard BiLDA model and two standard low-rank approximation methods (CL-LSI and CL-KCCA) used in previous work on this task.
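The knowledge-transfer setup mentioned in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' pipeline: it assumes that a bilingual topic model has already mapped every document, in either language, to a distribution over the same K cross-lingual topics, so that a classifier trained on labeled source-language documents can be applied directly to target-language documents. The topic distributions, the labels, and the nearest-centroid classifier are all hypothetical stand-ins chosen to keep the example self-contained.

```python
from collections import defaultdict


def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def squared_distance(x, y):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(x, y))


def fit_centroids(doc_topics, labels):
    """Compute one centroid per class from labeled topic distributions."""
    by_label = defaultdict(list)
    for theta, label in zip(doc_topics, labels):
        by_label[label].append(theta)
    return {label: centroid(vectors) for label, vectors in by_label.items()}


def predict(centroids, theta):
    """Assign the class whose centroid is closest to the topic distribution."""
    return min(centroids, key=lambda label: squared_distance(centroids[label], theta))


# Hypothetical per-document topic distributions over K = 3 shared topics.
# Source-language documents are labeled; target-language documents are not.
source_docs = [[0.8, 0.1, 0.1], [0.7, 0.2, 0.1], [0.1, 0.1, 0.8], [0.2, 0.1, 0.7]]
source_labels = ["sports", "sports", "politics", "politics"]
target_docs = [[0.75, 0.15, 0.10], [0.10, 0.20, 0.70]]

# Train on the source language only, then classify the target language.
centroids = fit_centroids(source_docs, source_labels)
predictions = [predict(centroids, theta) for theta in target_docs]
print(predictions)
```

The point of the sketch is only that the shared topic space removes the need for labeled target-language data; any standard classifier could take the place of the nearest-centroid rule used here.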

Notes

  1. For instance, Wikipedia articles about Madrid in English and Spanish address many common topics such as “demographics”, “geography and location” or “climate”, while at the same time, only the Spanish article contains text (i.e., a non-shared topic) about “the emblems of the city”, and only the English article elaborates on “business schools” or “bohemian culture” in Madrid.

  2. For simplicity and without loss of generality, we restrict the presentation in this article to bilingual topic models.

  3. The energy between two vectors X and Y is defined as \(||X - Y||^2\).

  4. By writing out the joint probability conditioned on all language assignments \(l_{ji}\), one can check that these formulations are indeed equivalent.

  5. http://svmlight.joachims.org/.

  6. http://scikit-learn.org/.
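The energy defined in note 3 is the squared Euclidean distance between the two vectors. A minimal sketch, assuming the vectors are plain Python sequences of equal length (the function name `energy` is chosen here for illustration):

```python
def energy(x, y):
    """Return ||x - y||^2, the squared Euclidean distance between x and y."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    return sum((a - b) ** 2 for a, b in zip(x, y))


# Component-wise differences are 0, 2 and -2, so the energy is 0 + 4 + 4.
example = energy([1.0, 2.0, 3.0], [1.0, 0.0, 5.0])
print(example)
```

The energy is zero exactly when the two vectors coincide, and it is symmetric in its arguments.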

Acknowledgments

The research presented in this article has been carried out in the context of the SCATE (SBO-130047) research project financed by the (Flemish) agency for Innovation through Science and Technology (IWT).

Author information

Corresponding author

Correspondence to Geert Heyman.

Ethics declarations

Conflict of Interest

The authors declare that they have no conflict of interest.

Additional information

Responsible editors: Joao Gama, Indre Zliobaite, Alipio Jorge, Concha Bielza.

About this article

Cite this article

Heyman, G., Vulić, I. & Moens, MF. C-BiLDA extracting cross-lingual topics from non-parallel texts by distinguishing shared from unshared content. Data Min Knowl Disc 30, 1299–1323 (2016). https://doi.org/10.1007/s10618-015-0442-x

  • DOI: https://doi.org/10.1007/s10618-015-0442-x
