
Sequential Embedding Induced Text Clustering, a Non-parametric Bayesian Approach

  • Conference paper
Advances in Knowledge Discovery and Data Mining (PAKDD 2019)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 11441)

Abstract

Current state-of-the-art nonparametric Bayesian text clustering methods model documents with multinomial distributions over bags of words. Although these methods effectively exploit the word-burstiness representation of documents and achieve decent performance, they do not explore the sequential information in text or the relationships among synonyms. In this paper, documents are modeled jointly through bags of words, sequential features, and word embeddings. We propose the Sequential Embedding induced Dirichlet Process Mixture Model (SiDPMM) to effectively exploit this joint document representation for text clustering. The sequential features are extracted by an encoder-decoder component, and word embeddings produced by the continuous-bag-of-words (CBOW) model are introduced to handle synonyms. Experimental results demonstrate the benefits of our model in two major aspects: (1) improved performance across multiple diverse text datasets in terms of normalized mutual information (NMI); (2) more accurate inference of the ground-truth number of clusters, thanks to a regularization effect on tiny outlier clusters.
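The paper's own SiDPMM sampler is not reproduced on this page, but the overall pipeline the abstract sketches (embed documents, cluster them with a nonparametric Bayesian mixture, then read off the inferred number of clusters and the NMI) can be illustrated with off-the-shelf stand-ins. The sketch below is a minimal, assumption-laden approximation: gensim's CBOW Word2Vec supplies the word embeddings, averaged per document, and scikit-learn's BayesianGaussianMixture with a Dirichlet-process weight prior stands in for the Dirichlet process mixture; the encoder-decoder sequential features and the multinomial bag-of-words likelihood of the actual model are omitted.

```python
# Minimal illustrative sketch -- NOT the paper's SiDPMM inference.
# Assumed stand-ins: gensim's CBOW Word2Vec for the embedding component and
# scikit-learn's Dirichlet-process BayesianGaussianMixture for the
# nonparametric clustering step; the encoder-decoder sequential features
# and the multinomial bag-of-words likelihood of SiDPMM are omitted.
import numpy as np
from gensim.models import Word2Vec
from sklearn.mixture import BayesianGaussianMixture
from sklearn.metrics import normalized_mutual_info_score


def embed_documents(tokenized_docs, dim=16):
    """Train CBOW embeddings (sg=0) and average them per document."""
    w2v = Word2Vec(sentences=tokenized_docs, vector_size=dim, sg=0,
                   window=3, min_count=1, epochs=50, seed=0)
    doc_vecs = []
    for doc in tokenized_docs:
        words = [w for w in doc if w in w2v.wv]
        doc_vecs.append(w2v.wv[words].mean(axis=0) if words else np.zeros(dim))
    return np.vstack(doc_vecs)


def cluster_documents(doc_vecs, max_clusters=10):
    """Dirichlet-process mixture: max_clusters is only a truncation level;
    the prior shrinks unused components, so fewer clusters are typically used."""
    dpmm = BayesianGaussianMixture(
        n_components=max_clusters,
        weight_concentration_prior_type="dirichlet_process",
        covariance_type="diag", reg_covar=1e-3,
        max_iter=500, random_state=0)
    return dpmm.fit_predict(doc_vecs)


if __name__ == "__main__":
    # Toy corpus: two obvious topics (fruit vs. vehicles).
    docs = [["apple", "banana", "fruit", "sweet"],
            ["orange", "fruit", "juice", "sweet"],
            ["grape", "fruit", "juice", "vine"],
            ["car", "engine", "wheel", "road"],
            ["truck", "engine", "road", "cargo"],
            ["bus", "wheel", "road", "passenger"]]
    true_labels = [0, 0, 0, 1, 1, 1]
    labels = cluster_documents(embed_documents(docs), max_clusters=4)
    print("inferred number of clusters:", len(set(labels)))
    print("NMI:", normalized_mutual_info_score(true_labels, labels))
```

Because the Dirichlet-process prior shrinks the weights of unused components, the number of distinct labels returned is an estimate rather than a fixed hyperparameter, which mirrors the abstract's point about inferring the ground-truth number of clusters.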


Notes

  1. http://qwone.com/~jason/20Newsgroups/.
  2. http://trec.nist.gov/data/microblog.html.
  3. http://mattmahoney.net/dc/text8.zip.
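
The links above point to the original distributions of the corpora used in the experiments. As a convenience (an assumption here, not necessarily the preprocessing used in the paper), the 20 Newsgroups corpus can also be fetched directly with scikit-learn:

```python
# Convenience loader for the 20 Newsgroups corpus; this mirrors, but is not
# guaranteed identical to, the distribution at qwone.com referenced above.
from sklearn.datasets import fetch_20newsgroups

news = fetch_20newsgroups(subset="all", remove=("headers", "footers", "quotes"))
print(len(news.data), "documents across", len(news.target_names), "newsgroups")
```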


Author information

Corresponding author

Correspondence to Tiehang Duan.

Copyright information

© 2019 Springer Nature Switzerland AG

About this paper


Cite this paper

Duan, T., Lou, Q., Srihari, S.N., Xie, X. (2019). Sequential Embedding Induced Text Clustering, a Non-parametric Bayesian Approach. In: Yang, Q., Zhou, Z.H., Gong, Z., Zhang, M.L., Huang, S.J. (eds) Advances in Knowledge Discovery and Data Mining. PAKDD 2019. Lecture Notes in Computer Science (LNAI), vol. 11441. Springer, Cham. https://doi.org/10.1007/978-3-030-16142-2_6

  • DOI: https://doi.org/10.1007/978-3-030-16142-2_6

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-16141-5

  • Online ISBN: 978-3-030-16142-2

  • eBook Packages: Computer Science, Computer Science (R0)
