An effective approach for semantic-based clustering and topic-based ranking of web documents

  • Applications
  • Published:
International Journal of Data Science and Analytics

Abstract

On the large, dynamic and ever-expanding web, extracting the information a user query asks for is a significant problem for search engines. Clustering and ranking are two important techniques that can help address it. To realize such a clustering-ranking mechanism, this study proposes a combined approach of semantic-based clustering and topic-based ranking of web documents. The proposed clustering approach combines latent semantic indexing (LSI) with the min-cut algorithm. To make the clustering technique more effective, a new feature selection method called clustering-based feature selection has been developed; it focuses on finding the feature set that captures the crux of the documents in the corpus without degrading the outcome of the construction process. While LSI completely overcomes the constraint of synonymy, the min-cut algorithm helps to generate efficient clusters at each stage of the clustering process. The number of clusters to form is decided using the silhouette coefficient, a parameter that incorporates both the cohesion and the separation of clusters. To rank the documents in each semantic cluster, the proposed approach transforms the text into topics using latent Dirichlet allocation (LDA) and then runs the inverted indexing technique on those topics. The 20-Newsgroups and DMOZ datasets are used for the experimental work, and the results show that the proposed clustering approach performs better than traditional clustering approaches and that the ranking approach is promising.
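As an illustration of how a silhouette coefficient can guide the choice of the number of clusters, the following is a minimal sketch and not the author's implementation. It uses the standard definition s(i) = (b - a) / max(a, b), where a is the mean distance from a point to the other members of its own cluster (cohesion) and b is the smallest mean distance to the members of any other cluster (separation). The Euclidean distance, the toy two-dimensional points, the hand-written candidate labelings and all function names are assumptions made for this example; the abstract does not say how the candidate clusterings are compared.

```python
import math


def euclidean(p, q):
    """Euclidean distance between two equal-length vectors (assumed metric)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def silhouette_score(points, labels):
    """Mean silhouette coefficient over all points for a given labeling."""
    clusters = {}
    for idx, label in enumerate(labels):
        clusters.setdefault(label, []).append(idx)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")

    total = 0.0
    for i, label in enumerate(labels):
        own = clusters[label]
        if len(own) == 1:
            continue  # a singleton cluster contributes s(i) = 0 by convention
        # a: cohesion, mean distance to the other members of the same cluster
        a = sum(euclidean(points[i], points[j]) for j in own if j != i) / (len(own) - 1)
        # b: separation, smallest mean distance to the members of another cluster
        b = min(
            sum(euclidean(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items()
            if other != label
        )
        denom = max(a, b)
        if denom > 0:
            total += (b - a) / denom
    return total / len(points)


# Hypothetical toy data: three visually separate pairs of points.
points = [(0, 0), (0, 1), (5, 5), (5, 6), (10, 0), (10, 1)]

# Hypothetical candidate labelings for k = 2 and k = 3 clusters.
candidates = {
    2: [0, 0, 1, 1, 1, 1],
    3: [0, 0, 1, 1, 2, 2],
}

scores = {k: silhouette_score(points, labels) for k, labels in candidates.items()}
best_k = max(scores, key=scores.get)
print(scores, best_k)
```

On this toy data the three-cluster labeling scores higher than the two-cluster one, because merging two distant pairs into one cluster worsens cohesion without improving separation. In a real pipeline the candidate labelings would come from the clustering algorithm itself rather than being written by hand.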

Notes

  1. http://lsa.colorado.edu/papers/dp1.LSAintro.pdf.

  2. http://web.mit.edu/be.400/www/SVD/Singular_Value_Decomposition.htm.

  3. https://radimrehurek.com/gensim/tutorial.html.

  4. http://ag.arizona.edu/classes/rnr555/lecnotes/10.html.

  5. http://www.azimuthproject.org/azimuth/show/Eckart-Young+low+rank+approximation+theorem#idea_2.

  6. http://jmlr.org/papers/volume3/blei03a/blei03a.pdf.

  7. http://tartarus.org/martin/PorterStemmer/.

  8. https://nlp.stanford.edu/IR-book/html/htmledition/stemming-and-lemmatization-1.html.

  9. Decided by threshold.

  10. http://qwone.com/~jason/20Newsgroups/.

  11. http://www.dmoz.org.

  12. Determined by running the code many times, since LDA is a stochastic topic model; the value of t that gave the best results was chosen.

Author information

Corresponding author

Correspondence to Rajendra Kumar Roul.

Ethics declarations

Conflict of interest

The corresponding author states that there is no conflict of interest.

About this article

Cite this article

Roul, R.K. An effective approach for semantic-based clustering and topic-based ranking of web documents. Int J Data Sci Anal 5, 269–284 (2018). https://doi.org/10.1007/s41060-018-0112-3

  • DOI: https://doi.org/10.1007/s41060-018-0112-3
