Skip to main content
Log in

EVE: explainable vector based embedding technique using Wikipedia

Journal of Intelligent Information Systems Aims and scope Submit manuscript

Abstract

We present an unsupervised explainable vector embedding technique, called EVE, which is built upon the structure of Wikipedia. The proposed model defines the dimensions of a semantic vector representing a concept using human-readable labels, thereby it is readily interpretable. Specifically, each vector is constructed using the Wikipedia category graph structure together with the Wikipedia article link structure. To test the effectiveness of the proposed model, we consider its usefulness in three fundamental tasks: 1) intruder detection—to evaluate its ability to identify a non-coherent vector from a list of coherent vectors, 2) ability to cluster—to evaluate its tendency to group related vectors together while keeping unrelated vectors in separate clusters, and 3) sorting relevant items first—to evaluate its ability to rank vectors (items) relevant to the query in the top order of the result. For each task, we also propose a strategy to generate a task-specific human-interpretable explanation from the model. These demonstrate the overall effectiveness of the explainable embeddings generated by EVE. Finally, we compare EVE with the Word2Vec, FastText, and GloVe embedding techniques across the three tasks, and report improvements over the state-of-the-art.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Institutional subscriptions

Fig. 1
Fig. 2
Fig. 3
Fig. 4

Notes

  1. http://lsa.colorado.edu/

  2. Besides the basic similarity with PageRank

  3. With an intuition to penalize untrusted pages (or spam)

  4. This can be an exact match or a partial best match using an information retrieval algorithm

  5. This dimension is the most relevant dimension defining the concept which is the article itself.

  6. In case of best match strategy, where more than one article is mapped to a concept i.e., Aconcept1,Aconcept2,... the score computed is further scaled by the relevance score of each article for the top-k articles, then reduced by the vector addition, and normalized again.

  7. In case of the partial best match it is the relevance score returned by BM25 algorithm.

  8. By simply, 1—normalized similarity score over each dimension

  9. The full set of experimental visualizations is available at http://mlg.ucd.ie/eve/

References

  • Adler, P., Falk, C., Friedler, S.A, Rybeck, G., Scheidegger, C., Smith, B., Venkatasubramanian, S. (2016). Auditing black-box models for indirect influence. In 2016 IEEE 16th international conference on data mining (ICDM) (pp. 1–10). IEEE.

  • Agirre, E., & Soroa, A. (2009). Personalizing pagerank for word sense disambiguation. In Proceedings of the 12th conference of the European chapter of the association for computational linguistics, association for computational linguistics (pp. 33–41).

  • Arora, S., Li, Y., Liang, Y., Ma, T., Risteski, A. (2016). A latent variable model approach to pmi-based word embeddings. Transactions of the Association for Computational Linguistics, 4, 385–399.

    Article  Google Scholar 

  • Baroni, M., & Lenci, A. (2010). Distributional memory: a general framework for corpus-based semantics. Computational Linguistics, 36 (4), 673–721.

    Article  Google Scholar 

  • Baroni, M., Dinu, G., Kruszewski, G. (2014). Don’t count, predict! A systematic comparison of context-counting vs. context-predicting semantic vectors. In ACL (Vol. 1, pp. 238–247).

  • Bengio, Y., Ducharme, R., Vincent, P., Jauvin, C. (2003). A neural probabilistic language model. JMLR, 3, 1137–1155.

    MATH  Google Scholar 

  • Bhargava, P., Phan, T., Zhou, J., Lee, J. (2015). Who, what, when, and where: multi-dimensional collaborative recommendations using tensor factorization on sparse user-generated data. In Proceedings of the 24th international conference on world wide web (pp. 130–140). ACM.

  • Bian, J., Gao, B., Liu, T.Y. (2014). Knowledge-powered deep learning for word embedding. In Joint European conference on machine learning and knowledge discovery in databases (pp. 132–148). Springer.

  • Bojanowski, P., Grave, E., Joulin, A., Mikolov, T. (2016). Enriching word vectors with subword information. arXiv preprint arXiv:160704606.

  • Bordes, A., Weston, J., Collobert, R., Bengio, Y. (2011). Learning structured embeddings of knowledge bases. In Conference on artificial intelligence, EPFL-CONF-192344.

  • Budanitsky, A., & Hirst, G. (2006). Evaluating wordnet-based measures of lexical semantic relatedness. Computational Linguistics, 32 (1), 13–47.

    Article  MATH  Google Scholar 

  • Caliński, T., & Harabasz, J. (1974). A dendrite method for cluster analysis. Communications in Statistics-Theory and Methods, 3 (1), 1–27.

    Article  MathSciNet  MATH  Google Scholar 

  • Collobert, R., & Weston, J. (2008). A unified architecture for natural language processing: deep neural networks with multitask learning. In Proceedings of the ICML’2008 (pp. 160–167). ACM.

  • Datta, A., Sen, S., Zick, Y. (2016). Algorithmic transparency via quantitative input influence: theory and experiments with learning systems. In 2016 IEEE symposium on security and privacy (SP) (pp. 598–617). IEEE.

  • Deerwester, S. (1988). Improving information retrieval with latent semantic indexing. In Proceedings of the 51st annual meeting of the American Society for information science (Vol. 25, pp. 36–40).

  • Diao, Q., Qiu, M., Wu, C.Y., Smola, A.J., Jiang, J., Wang, C. (2014). Jointly modeling aspects, ratings and sentiments for movie recommendation (jmars). In Proceedings of the 20th ACM SIGKDD international conference on knowledge discovery and data mining (pp. 193–202). ACM.

  • Diaz, F., Mitra, B., Craswell, N. (2016). Query expansion with locally-trained word embeddings. In Association for computational linguistics (pp. 367–377).

  • Everitt, B., Landau, S., Leese, M. (2001). Cluster analysis. Wiley: Hodder Arnold Publication.

    MATH  Google Scholar 

  • Faruqui, M., Dodge, J., Jauhar, S.K, Dyer, C., Hovy, E., Smith, N.A. (2014). Retrofitting word vectors to semantic lexicons. arXiv preprint arXiv:14114166.

  • Firth, J. (1957). A synopsis of linguistic theory 1930–1955. In Studies in linguistic analysis (pp. 1–32).

  • Fu, X., Wang, T., Li, J., Yu, C., Liu, W. (2016). Improving distributed word representation and topic model by word-topic mixture model. In Proceedings of the 8th Asian conference on machine learning (pp. 190–205).

  • Gabrilovich, E., & Markovitch, S. (2007). Computing semantic relatedness using wikipedia-based explicit semantic analysis. In Proceedings of the IJCAI’07 (Vol. 7, pp. 1606–1611).

  • Gallant, S.I., Caid, W.R., Carleton, J., Hecht-Nielsen, R., Qing, K.P., Sudbeck, D. (1992). Hnc’s matchplus system. In ACM SIGIR Forum (Vol. 26, pp. 34–38). ACM.

  • Ganguly, D., Roy, D., Mitra, M., Jones, G.J. (2015). Word embedding based generalized language model for information retrieval. In Proceedings of the 38th international ACM SIGIR conference on research and development in information retrieval (pp. 795–798). ACM.

  • Ganitkevitch, J., Van Durme, B., Callison-Burch, C. (2013). Ppdb: the paraphrase database. In HLT-NAACL (pp. 758–764).

  • Globerson, A., Chechik, G., Pereira, F., Tishby, N. (2007). Euclidean embedding of co-occurrence data. JMLR, 8, 2265–2295.

    MathSciNet  MATH  Google Scholar 

  • Goodman, B., & Flaxman, S. (2016). European union regulations on algorithmic decision-making and a “right to explanation”. arXiv preprint arXiv:160608813.

  • Gyöngyi, Z., Garcia-Molina, H., Pedersen, J. (2004). Combating web spam with trustrank. In Proceedings of the thirtieth international conference on very large data bases. VLDB Endowment (Vol. 30, pp. 576–587).

  • Harris, Z.S. (1954). Distributional structure. Word, 10 (2–3), 146–162.

    Article  Google Scholar 

  • Harris, Z.S. (1968). Mathematical structures of language. New York: Wiley.

    MATH  Google Scholar 

  • Henelius, A., Puolamäki, K., Boström, H., Asker, L., Papapetrou, P. (2014). A peek into the black box: exploring classifiers by randomization. Data Mining and Knowledge Discovery, 28 (5–6), 1503.

    Article  MathSciNet  Google Scholar 

  • Hoffart, J., Seufert, S., Nguyen, D.B., Theobald, M., Weikum, G. (2012). Kore: keyphrase overlap relatedness for entity disambiguation. In Proceedings of the 21st ACM international conference on information and knowledge management (pp. 545–554).

  • Hunt, J., & Price, C. (1988). Explaining qualitative diagnosis. Engineering Applications of Artificial Intelligence, 1 (3), 161–169.

    Article  Google Scholar 

  • Jarmasz, M. (2012). Roget’s thesaurus as a lexical resource for natural language processing. arXiv preprint arXiv:12040140.

  • Jiang, Y., Zhang, X., Tang, Y., Nie, R. (2015). Feature-based approaches to semantic similarity assessment of concepts using wikipedia. Info Processing & Management, 51 (3), 215–234.

    Article  Google Scholar 

  • Kuzi, S., Shtok, A., Kurland, O. (2016). Query expansion using word embeddings. In Proceedings of the 25th ACM international on conference on information and knowledge management (pp. 1929–1932). ACM.

  • Landauer, T.K., Foltz, P.W, Laham, D. (1998). An introduction to latent semantic analysis. Discourse Processes, 25 (2–3), 259–284.

    Article  Google Scholar 

  • Levy, O., & Goldberg, Y. (2014). Neural word embedding as implicit matrix factorization. In Proceedings of the NIPS’2014 (pp. 2177–2185).

  • Levy, O., Goldberg, Y., Ramat-Gan, I. (2014). Linguistic regularities in sparse and explicit word representations. In CoNLL (pp. 171–180).

  • Levy, O., Goldberg, Y., Dagan, I. (2015). Improving distributional similarity with lessons learned from word embeddings. Transactions of the Association for Computational Linguistics, 3, 211–225.

    Article  Google Scholar 

  • Lipton, Z.C. (2016). The mythos of model interpretability. arXiv preprint arXiv:160603490.

  • Liu, Y., Liu, Z., Chua, T.S., Sun, M. (2015). Topical word embeddings. In AAAI (pp. 2418–2424).

  • Lopez-Suarez, A., & Kamel, M. (1994). Dykor: a method for generating the content of explanations in knowledge systems. Knowledge-Based Systems, 7 (3), 177–188.

    Article  Google Scholar 

  • Manning, C.D., Raghavan, P., Schütze, H. (2008). Introduction to information retrieval. New York: Cambridge University Press.

    Book  MATH  Google Scholar 

  • Metzler, D., Dumais, S., Meek, C. (2007). Similarity measures for short segments of text. In European conference on information retrieval (pp. 16–27). Springer.

  • Mihalcea, R., & Tarau, P. (2004). Textrank: bringing order into text. In Proceedings of the 2004 conference on empirical methods in natural language processing.

  • Mikolov, T., Chen, K., Corrado, G., Dean, J. (2013a). Efficient estimation of word representations in vector space. arXiv preprint arXiv:13013781.

  • Mikolov, T., Sutskever, I., Chen, K., Corrado, G.S, Dean, J. (2013b). Distributed representations of words and phrases and their compositionality. In Proceedings of the NIPS’2013 (pp. 3111–3119).

  • Nikfarjam, A., Sarker, A., O’Connor, K., Ginn, R., Gonzalez, G. (2015). Pharmacovigilance from social media: mining adverse drug reaction mentions using sequence labeling with word embedding cluster features. Journal of the American Medical Informatics Association, 22, 671–681.

    Google Scholar 

  • Niu, L., Dai, X., Zhang, J., Chen, J. (2015). Topic2vec: learning distributed representations of topics. In 2015 International conference on asian language processing (IALP) (pp. 193–196). IEEE.

  • Page, L., Brin, S., Motwani, R., Winograd, T. (1999). The pagerank citation ranking: bringing order to the web. Tech. rep., Stanford InfoLab.

  • Pennington, J., Socher, R., Manning, C.D. (2014). Glove: global vectors for word representation. In Empirical methods in natural language processing (EMNLP) (pp. 1532–1543).

  • Qureshi, M.A. (2015). Utilising wikipedia for text mining applications. PhD thesis, National University of Ireland Galway.

  • Ren, Z., Liang, S., Li, P., Wang, S., de Rijke, M. (2017). Social collaborative viewpoint regression with explainable recommendations. In Proceedings of the tenth ACM international conference on web search and data mining (pp. 485–494). ACM.

  • Ribeiro, M.T., Singh, S., Guestrin, C. (2016). Why should i trust you?: Explaining the predictions of any classifier. In Proceedings of the 22nd ACM SIGKDD international conference on knowledge discovery and data mining (pp. 1135–1144). ACM.

  • Salton, G., & McGill, M.J. (1986). Introduction to modern information retrieval. New York: McGraw-Hill, Inc.

    MATH  Google Scholar 

  • Sari, Y., & Stevenson, M. (2016). Exploring word embeddings and character n-grams for author clustering. In Working notes. CEUR Workshop Proceedings, CLEF.

  • Schütze, H. (1992). Word space. In Proceedings of the NIPS’1992 (pp. 895–902).

  • Sherkat, E., & Milios, E.E. (2017). Vector embedding of wikipedia concepts and entities. In International conference on applications of natural language to information systems (pp. 418–428). Springer.

  • Socher, R., Chen, D., Manning, C.D, Ng, A. (2013). Reasoning with neural tensor networks for knowledge base completion. In Proceedings of the NIPS’2013 (pp. 926–934).

  • Strube, M., & Ponzetto, S.P. (2006). Wikirelate! Computing semantic relatedness using wikipedia. In Proceedings of the 21st national conference on artificial intelligence (pp. 1419–1424).

  • Tintarev, N., & Masthoff, J. (2015). Explaining recommendations: design and evaluation. In Recommender systems handbook (pp. 353–382). Springer.

  • van der Maaten, L., & Hinton, G. (2008). Visualizing data using t-sne. JMLR, 9, 2579–2605.

    MATH  Google Scholar 

  • Wang, Z., Zhang, J., Feng, J., Chen, Z. (2014). Knowledge graph and text jointly embedding. In EMNLP, Citeseer (Vol. 14, pp. 1591–1601).

  • Wang, P., Xu, B., Xu, J., Tian, G., Liu, C.L, Hao, H. (2016). Semantic expansion using word embedding clustering and convolutional neural network for improving short text classification. Neurocomputing, 174, 806–814.

    Article  Google Scholar 

  • Wick, M.R, & Thompson, W.B. (1992). Reconstructive expert system explanation. Artificial Intelligence, 54 (1–2), 33–70.

    Article  Google Scholar 

  • Witten, I., & Milne, D. (2008). An effective, low-cost measure of semantic relatedness obtained from wikipedia links. In AAAI workshop on wikipedia and artificial intelligence: an evolving synergy (pp. 25–30).

  • Wu, F., Song, J., Yang, Y., Li, X., Zhang, Z.M, Zhuang, Y. (2015). Structured embedding via pairwise relations and long-range interactions in knowledge base. In AAAI (pp. 1663–1670).

  • Xu, C., Bai, Y., Bian, J., Gao, B., Wang, G., Liu, X., Liu, T.Y. (2014). Rc-net: a general framework for incorporating knowledge into word representations. In Proceedings of the 23rd ACM international conference on conference on information and knowledge management (pp. 1219–1228).

  • Yeh, E., Ramage, D., Manning, C.D, Agirre, E., Soroa, A. (2009). Wikiwalk: random walks on wikipedia for semantic relatedness. In Proceedings of the 2009 workshop on graph-based methods for natural language processing (pp. 41–49).

  • Yu, M., & Dredze, M. (2014). Improving lexical embeddings with semantic knowledge. In ACL (Vol. 2, pp. 545–550).

  • Zesch, T., & Gurevych, I. (2007). Analysis of the wikipedia category graph for nlp applications. In Proceedings of the TextGraphs-2 Workshop (NAACL-HLT 2007) (pp. 1–8).

  • Zhang, Y., Lai, G., Zhang, M., Zhang, Y., Liu, Y., Ma, S. (2014). Explicit factor models for explainable recommendation based on phrase-level sentiment analysis. In Proceedings of the 37th international ACM SIGIR conference on research & development in information retrieval (pp. 83–92). ACM.

  • Zheng, G., & Callan, J. (2015). Learning to reweight terms with distributed representations. In Proceedings of the 38th international ACM SIGIR conference on research and development in information retrieval (pp. 575–584). ACM.

  • Zuccon, G., Koopman, B., Bruza, P., Azzopardi, L. (2015). Integrating and evaluating neural word embeddings in information retrieval. In Proceedings of the 20th Australasian document computing symposium (p. 12). ACM.

Download references

Acknowledgements

This publication has emanated from research conducted with the support of Science Foundation Ireland (SFI), under Grant Number SFI/12/ RC/2289.

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to M. Atif Qureshi.

Rights and permissions

Reprints and permissions

About this article

Check for updates. Verify currency and authenticity via CrossMark

Cite this article

Qureshi, M.A., Greene, D. EVE: explainable vector based embedding technique using Wikipedia. J Intell Inf Syst 53, 137–165 (2019). https://doi.org/10.1007/s10844-018-0511-x

Download citation

  • Received:

  • Revised:

  • Accepted:

  • Published:

  • Issue Date:

  • DOI: https://doi.org/10.1007/s10844-018-0511-x

Keywords

Navigation