A Concept-Based Approach for Generating Better Topics for Web Search Results

  • Original Research
SN Computer Science

Abstract

Because the web is accessible to a vast population around the globe, users today pose a large number of queries through web search tools, often with dynamic, vague, or unclear intentions, and organizing search results has consequently become an increasingly challenging task. Such queries make it difficult for search tools to determine the exact user context, so they retrieve an extensive volume of results, a significant portion of which is irrelevant to the user. One answer to this problem is search result clustering (SRC), which groups the search results and presents them to users as a set of options for the query. In this work, we propose an approach that first classifies the related topics and lays them out in the form of concepts, then builds search result clusters by assigning each result to its relevant topic, and finally provides relevant labels for these topics. We examine the effectiveness of our approach by comparing it against two of the most popular non-commercial methods in this field, Lingo and STC, on two standard datasets, ODP and Ambient, and a newly developed dataset, Ex-Ambient, which is a rigorously extended version of the Ambient dataset. We performed the analysis along both qualitative and quantitative dimensions. We define the qualitative dimension as the expressiveness of the generated cluster labels, while the quantitative dimension concerns the correctness of the documents assigned to each cluster. The experimental results of the proposed method were encouraging compared with Lingo and STC on all the datasets and along both dimensions.
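The three stages named in the abstract (find topics, assign results to topics, label the clusters) can be illustrated with a deliberately small sketch. This is not the authors' concept-based method, nor Lingo or STC: as a stand-in for concepts it assumes that the terms occurring in the most snippets are the topics, and every name in it (`tokenize`, `cluster_results`, `SNIPPETS`, the stopword list) is hypothetical.

```python
from collections import Counter, defaultdict

# Assumption: a tiny hand-picked stopword list, for illustration only.
STOPWORDS = {"a", "an", "the", "of", "and", "in", "for", "to", "is", "on", "with"}

# Hypothetical result snippets for the ambiguous query "jaguar".
SNIPPETS = [
    "Jaguar car dealer prices",
    "Jaguar car reviews",
    "Jaguar cat habitat in the rainforest",
    "The jaguar is a big cat",
    "Jaguar operating system release",
]


def tokenize(text):
    """Lowercase, strip surrounding punctuation, and drop stopwords."""
    words = (w.strip(".,;:!?()").lower() for w in text.split())
    return [w for w in words if w and w not in STOPWORDS]


def cluster_results(query, snippets, num_topics=2):
    """Toy SRC pipeline: pick topics, assign snippets, label clusters."""
    query_terms = set(tokenize(query))
    token_sets = [set(tokenize(s)) - query_terms for s in snippets]

    # Stage 1: candidate topics are the terms found in the most snippets
    # (ties broken alphabetically so the result is deterministic).
    doc_freq = Counter(term for tokens in token_sets for term in tokens)
    ranked = sorted(doc_freq.items(), key=lambda kv: (-kv[1], kv[0]))
    topics = [term for term, _ in ranked[:num_topics]]

    # Stage 2: assign each snippet to the first topic it mentions.
    # Stage 3: the topic term itself serves as the cluster label.
    clusters = defaultdict(list)
    for idx, tokens in enumerate(token_sets):
        label = next((t for t in topics if t in tokens), "other")
        clusters[label].append(idx)
    return dict(clusters)


if __name__ == "__main__":
    for label, members in cluster_results("jaguar", SNIPPETS).items():
        print(label, members)
```

In this toy setting, label expressiveness would be judged by reading the labels, while assignment correctness would be measured by comparing the snippet indices in each cluster against gold-standard subtopics such as those supplied with the Ambient or ODP datasets.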

Notes

  1. http://www.dmoz.org/

Author information

Corresponding author

Correspondence to Divyansh Bhatia.

Ethics declarations

Conflict of Interest

The authors declare that they have no conflict of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This article is part of the topical collection “Advances in Internet Research and Engineering” guest edited by Mohit Sethi, Debabrata Das, P. V. Ananda Mohan and Balaji Rajendran.

About this article

Cite this article

Mehala, N., Bhatia, D. A Concept-Based Approach for Generating Better Topics for Web Search Results. SN COMPUT. SCI. 1, 294 (2020). https://doi.org/10.1007/s42979-020-00311-y

  • DOI: https://doi.org/10.1007/s42979-020-00311-y
