ABSTRACT
Efficiently understanding large-scale document collections is an important problem. Document data usually come with associated information (e.g., an author's gender, age, and location) and with links to other entities (e.g., co-authorship and citation networks). Analyzing such data often requires revealing both the common and the discriminative characteristics of documents with respect to this associated information, e.g., male- vs. female-authored documents or old vs. new documents. To address these needs, this paper presents a novel topic modeling method based on joint nonnegative matrix factorization that simultaneously discovers common and discriminative topics across multiple document sets. Our approach is based on a block-coordinate descent framework and, through a novel pseudo-deflation approach, uses only the most representative, and thus most meaningful, keywords in each topic. We perform both quantitative and qualitative evaluations on synthetic and real-world document data sets, including research paper collections and nonprofit micro-finance data. We show that our method has great potential for in-depth analysis by clearly identifying common and discriminative topics among multiple document sets.
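To make the idea of a joint factorization concrete, the following is a minimal NumPy sketch, not the paper's algorithm. It assumes two term-by-document matrices `X1` and `X2` over the same vocabulary, factors each as `X ≈ W @ H`, and adds two illustrative penalties: `alpha` pulls the first `k_common` topic columns of `W1` and `W2` together, and `beta` discourages overlap between the remaining `k_distinct` columns. The objective, the multiplicative updates (used here in place of block-coordinate descent), and all function and parameter names are assumptions made for illustration; the pseudo-deflation step and topic-column normalization are omitted.

```python
import numpy as np


def objective(X1, X2, W1, H1, W2, H2, k_common, alpha, beta):
    """Illustrative joint objective (an assumption, not taken from the paper)."""
    c, d = slice(0, k_common), slice(k_common, None)
    fit = np.linalg.norm(X1 - W1 @ H1) ** 2 + np.linalg.norm(X2 - W2 @ H2) ** 2
    common = alpha * np.linalg.norm(W1[:, c] - W2[:, c]) ** 2
    distinct = beta * np.sum(W1[:, d].T @ W2[:, d])
    return fit + common + distinct


def joint_nmf(X1, X2, k_common, k_distinct, alpha=1.0, beta=1.0,
              n_iter=200, seed=0):
    """Hypothetical joint NMF with shared and set-specific topic columns."""
    rng = np.random.default_rng(seed)
    eps = 1e-10  # guards against division by zero
    k = k_common + k_distinct
    m = X1.shape[0]  # vocabulary size, shared by both document sets
    W1, W2 = rng.random((m, k)), rng.random((m, k))
    H1, H2 = rng.random((k, X1.shape[1])), rng.random((k, X2.shape[1]))
    c, d = slice(0, k_common), slice(k_common, k)

    def update_W(X, W, H, W_other):
        # Negative gradient terms go in the numerator, positive in the denominator.
        num = X @ H.T
        den = W @ (H @ H.T)
        num[:, c] += alpha * W_other[:, c]   # pull common topics together
        den[:, c] += alpha * W[:, c]
        # Penalize overlap with the other set's discriminative topics.
        den[:, d] += 0.5 * beta * W_other[:, d].sum(axis=1, keepdims=True)
        return W * num / (den + eps)

    for _ in range(n_iter):
        H1 *= (W1.T @ X1) / (W1.T @ W1 @ H1 + eps)
        H2 *= (W2.T @ X2) / (W2.T @ W2 @ H2 + eps)
        W1 = update_W(X1, W1, H1, W2)
        W2 = update_W(X2, W2, H2, W1)
    return W1, H1, W2, H2


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    X1, X2 = rng.random((30, 20)), rng.random((30, 25))  # toy data
    W1, H1, W2, H2 = joint_nmf(X1, X2, k_common=2, k_distinct=2)
    print(objective(X1, X2, W1, H1, W2, H2, 2, 1.0, 1.0))
```

In this sketch, the top-weighted terms in the first `k_common` columns of `W1` and `W2` would be read as candidate common topics, and those in the remaining columns as candidate discriminative topics. Because the columns are not normalized here, the penalty weights interact with the scale of `W`, so the values of `alpha` and `beta` are only placeholders.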