Abstract
Because the web is accessible to a vast population around the globe, users today pose a large number of queries to web search tools, often with dynamic, vague, or unclear intentions, and organizing search results has consequently become an increasingly challenging task. Such queries make it difficult for search tools to comprehend the user's exact context, so they retrieve an extensive volume of results, a significant portion of which is unnecessary for the user. One answer to this problem is search result clustering (SRC), a strategy that groups the search results and presents them to users as multiple options for the query. In this work, we propose an approach that first classifies the related topics and lays them out in the form of concepts, then builds search result clusters by assigning each result to the relevant topic, and finally provides relevant labels for these topics. We examine the effectiveness of our approach by comparing it against the two most popular non-commercial methods in this field, Lingo and STC, on two standard datasets, ODP and Ambient, and on a newly developed dataset, Ex-Ambient, which is a rigorously extended version of the Ambient dataset. We analyze the results along both qualitative and quantitative dimensions: the qualitative dimension is the expressiveness of the generated cluster labels, while the quantitative dimension is the correctness of the documents assigned to each cluster. The experimental results of the proposed method were encouraging in comparison with Lingo and STC on all the datasets and along both dimensions.
Ethics declarations
Conflict of Interest
The authors declare that they have no conflict of interest.
Additional information
This article is part of the topical collection “Advances in Internet Research and Engineering” guest edited by Mohit Sethi, Debabrata Das, P. V. Ananda Mohan and Balaji Rajendran.
About this article
Cite this article
Mehala, N., Bhatia, D. A Concept-Based Approach for Generating Better Topics for Web Search Results. SN COMPUT. SCI. 1, 294 (2020). https://doi.org/10.1007/s42979-020-00311-y
DOI: https://doi.org/10.1007/s42979-020-00311-y