An effective approach for semantic-based clustering and topic-based ranking of web documents

Roul, Rajendra Kumar

doi:10.1007/s41060-018-0112-3

Applications
Published: 15 March 2018

Volume 5, pages 269–284, (2018)

International Journal of Data Science and Analytics

Rajendra Kumar Roul¹

Abstract

On the large, dynamic and ever-expanding web, extracting the information that a user query actually seeks is a significant problem for search engines. Clustering and ranking are two important techniques that can help address it. To build such a clustering-ranking mechanism, this study proposes a combined approach of semantic-based clustering and topic-based ranking of web documents. The proposed clustering approach combines latent semantic indexing (LSI) with the min-cut algorithm. To make the clustering technique more effective, a new feature selection method called clustering-based feature selection has been developed; it focuses on finding the feature set that captures the crux of the documents in the corpus without deteriorating the outcome of the construction process. While LSI completely overcomes the constraint of synonymy, the min-cut algorithm helps to generate efficient clusters at each stage of the clustering process. The number of clusters to form is decided using the silhouette coefficient, a measure that incorporates both the cohesion and the separation of clusters. To rank the documents in each semantic cluster, the proposed approach transforms the text into topics using latent Dirichlet allocation (LDA) and then runs the inverted indexing technique on those topics. The 20-Newsgroups and DMOZ datasets are used for the experimental work, and the results show that the clustering approach performs better than the traditional clustering approaches and that the ranking approach is promising.
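
To make the role of the silhouette coefficient concrete, the following is a minimal sketch of how it can score competing clusterings and so suggest a number of clusters. It uses the standard definition s(i) = (b - a) / max(a, b), where a is the mean distance from a point to the other members of its cluster (cohesion) and b is the smallest mean distance to the members of any other cluster (separation). It is not the article's implementation: the 2-D points, the candidate labelings, and the names silhouette_coefficient, candidates and best_k are assumptions made for illustration, standing in for document vectors and for the partitions a clustering step would produce.

```python
from math import dist


def silhouette_coefficient(points, labels):
    """Mean silhouette value over all points, using Euclidean distance."""
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)
    if len(clusters) < 2:
        raise ValueError("the silhouette needs at least two clusters")

    values = []
    for point, label in zip(points, labels):
        own = clusters[label]
        if len(own) == 1:
            values.append(0.0)  # usual convention for singleton clusters
            continue
        # Cohesion: mean distance to the other members of the same cluster
        # (the point's zero distance to itself does not affect the sum).
        a = sum(dist(point, q) for q in own) / (len(own) - 1)
        # Separation: smallest mean distance to the members of another cluster.
        b = min(
            sum(dist(point, q) for q in members) / len(members)
            for other, members in clusters.items()
            if other != label
        )
        denom = max(a, b)
        values.append((b - a) / denom if denom else 0.0)
    return sum(values) / len(values)


# Hypothetical stand-ins for document vectors (assumption, not article data).
points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0)]

# Hypothetical candidate partitions, keyed by their number of clusters.
candidates = {
    2: [0, 0, 1, 1],
    3: [0, 0, 1, 2],
}

scores = {k: silhouette_coefficient(points, labels)
          for k, labels in candidates.items()}
best_k = max(scores, key=scores.get)  # cluster count with the highest score
```

In this toy setting the two-cluster partition scores higher than the three-cluster one, so best_k is 2. In practice the candidate labelings would come from running the clustering procedure for each cluster count under consideration.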

Notes

http://lsa.colorado.edu/papers/dp1.LSAintro.pdf.
http://web.mit.edu/be.400/www/SVD/Singular_Value_Decomposition.htm.
https://radimrehurek.com/gensim/tutorial.html.
http://ag.arizona.edu/classes/rnr555/lecnotes/10.html.
http://www.azimuthproject.org/azimuth/show/Eckart-Young+low+rank+approximation+theorem#idea_2.
http://jmlr.org/papers/volume3/blei03a/blei03a.pdf.
http://tartarus.org/martin/PorterStemmer/.
https://nlp.stanford.edu/IR-book/html/htmledition/stemming-and-lemmatization-1.html.
Decided by threshold.
http://qwone.com/~jason/20Newsgroups/.
http://www.dmoz.org.
Determined by running the code many times, since LDA is a stochastic topic model, and taking the value of t for which the best results are obtained.

Author information

Authors and Affiliations

Department of Computer Science and Information Systems, BITS, Pilani-K.K.Birla Goa Campus, Zuarinagar, Goa, 403726, India
Rajendra Kumar Roul

Corresponding author

Correspondence to Rajendra Kumar Roul.

Ethics declarations

Conflict of interest

The corresponding author states that there is no conflict of interest.

About this article

Cite this article

Roul, R.K. An effective approach for semantic-based clustering and topic-based ranking of web documents. Int J Data Sci Anal 5, 269–284 (2018). https://doi.org/10.1007/s41060-018-0112-3

Received: 14 January 2017
Accepted: 24 February 2018
Published: 15 March 2018
Issue Date: June 2018
DOI: https://doi.org/10.1007/s41060-018-0112-3
