Short Text Similarity Measurement Using Context from Bag of Word Pairs and Word Co-occurrence

Yang, Shuiqiao; Huang, Guangyan; Ofoghi, Bahadorreza

doi:10.1007/978-981-15-2810-1_22


Part of the book series: Communications in Computer and Information Science (CCIS, volume 1179)

Included in the following conference series:

International Conference on Data Service


Abstract

With the rapid development of social networks, short texts have become a prevalent form of social communication on the Internet. Measuring the similarity between short texts is fundamental to many applications, such as social network text querying, short text clustering, and geographical event detection for smart cities. However, short texts in social media typically carry limited contextual information and are sparse, noisy, and ambiguous. Hence, effectively measuring the distance between short texts is a challenging task.

In this paper, we propose a new heuristic word pair distance measurement (WPDM) technique for short texts, which exploits corpus-level word relations and enriches the context of each short text with a bag-of-word-pairs representation. We first adjust Jaccard similarity to measure the distance between words. Then, words are paired up to capture latent semantics in a short text document, transforming each short text into a bag of word pairs. Finally, the similarity between short text documents is calculated by averaging the distances of the word pairs. Experimental results on a real-world dataset demonstrate that the proposed WPDM is effective and achieves much better performance than state-of-the-art methods.
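The three steps named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not give the adjusted Jaccard formula or the exact pairing scheme, so the sketch assumes plain Jaccard distance over the sets of documents in which each word occurs, and it assumes that word pairs are formed across the two documents being compared. The function names (`build_index`, `word_distance`, `doc_distance`) and the toy corpus are hypothetical.

```python
from collections import defaultdict
from itertools import product


def build_index(corpus):
    """Map each word to the set of ids of the documents that contain it."""
    index = defaultdict(set)
    for doc_id, tokens in enumerate(corpus):
        for tok in tokens:
            index[tok].add(doc_id)
    return index


def word_distance(w1, w2, index):
    """Plain Jaccard distance over document co-occurrence sets.

    Assumption: the paper's "adjusted" Jaccard is not specified in the
    abstract, so the unadjusted form is used here.
    """
    d1, d2 = index.get(w1, set()), index.get(w2, set())
    union = d1 | d2
    if not union:
        return 1.0  # neither word is in the corpus: treat as maximally distant
    return 1.0 - len(d1 & d2) / len(union)


def doc_distance(doc_a, doc_b, index):
    """Average word distance over all cross-document word pairs.

    Assumption: pairing one word from each document is a simplification
    chosen for this sketch, not necessarily the paper's pairing scheme.
    """
    pairs = list(product(set(doc_a), set(doc_b)))
    if not pairs:
        return 1.0  # an empty document has no pairs to average over
    return sum(word_distance(a, b, index) for a, b in pairs) / len(pairs)


# Hypothetical toy corpus of tokenized short texts.
corpus = [
    ["flood", "river", "melbourne"],
    ["flood", "rain", "melbourne"],
    ["football", "match", "score"],
]
index = build_index(corpus)
near = doc_distance(corpus[0], corpus[1], index)
far = doc_distance(corpus[0], corpus[2], index)
```

In this toy corpus, `near` comes out smaller than `far` because the first two texts share words that co-occur across the corpus, while the third shares none. Note that under this simplification even words that never appear together in the same text, such as "river" and "rain", still contribute to the average through the pairs they form with other words.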



Acknowledgments

This work was partially supported by Australian Research Council (ARC) Grant (No. DE140100387).

Author information

Authors and Affiliations

School of Information Technology, Deakin University, Melbourne, Australia
Shuiqiao Yang & Guangyan Huang
Deakin Business School, Deakin University, Melbourne, Australia
Bahadorreza Ofoghi


Corresponding author

Correspondence to Guangyan Huang.

Editor information

Editors and Affiliations

Swinburne University of Technology, Melbourne, VIC, Australia
Jing He
University of Illinois at Chicago, Chicago, USA
Philip S. Yu
College of Information Science and Technology, University of Nebraska at Omaha, Omaha, NE, USA
Yong Shi
Research Institute of Extenics and Innovation Methods, Guangdong University of Technology, Guangzhou, China
Xingsen Li
Ningbo University, Ningbo, China
Zhijun Xie
Deakin University, Burwood, VIC, Australia
Guangyan Huang
Department of Computer Science and Technology, Nanjing University of Science and Technology, Nanjing, China
Jie Cao
Nanjing University of Posts and Telecommunications, Nanjing, China
Fu Xiao


Cite this paper

Yang, S., Huang, G., Ofoghi, B. (2020). Short Text Similarity Measurement Using Context from Bag of Word Pairs and Word Co-occurrence. In: He, J., et al. Data Science. ICDS 2019. Communications in Computer and Information Science, vol 1179. Springer, Singapore. https://doi.org/10.1007/978-981-15-2810-1_22


DOI: https://doi.org/10.1007/978-981-15-2810-1_22
Published: 02 February 2020
Publisher Name: Springer, Singapore
Print ISBN: 978-981-15-2809-5
Online ISBN: 978-981-15-2810-1
eBook Packages: Computer Science, Computer Science (R0)
