Augmenting and structuring user queries to support efficient free-form code search

Empirical Software Engineering

Abstract

Source code terms such as method names and variable types are often different from conceptual words mentioned in a search query. This vocabulary mismatch problem can make code search inefficient. In this paper, we present COde voCABUlary (CoCaBu), an approach to resolving the vocabulary mismatch problem when dealing with free-form code search queries. Our approach leverages common developer questions and the associated expert answers to augment user queries with the relevant, but missing, structural code entities in order to improve the performance of matching relevant code examples within large code repositories. To instantiate this approach, we build GitSearch, a code search engine, on top of GitHub and Stack Overflow Q&A data. We evaluate GitSearch in several dimensions to demonstrate that (1) its code search results are correct with respect to user-accepted answers; (2) the results are qualitatively better than those of existing Internet-scale code search engines; (3) our engine is competitive against web search engines, such as Google, in helping users solve programming tasks; and (4) GitSearch provides code examples that are acceptable or interesting to the community as answers for Stack Overflow questions.
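
The augmentation idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the tiny Q&A corpus, the stop word set, the overlap-based matching, and the regular-expression heuristics for spotting Java types and method calls are all assumptions made for illustration, and every function name here is hypothetical. The sketch matches a free-form query against question titles, then collects the structural code entities found in the answers of the matching posts.

```python
import re
from collections import Counter

# Hypothetical mini Q&A corpus: (question title, answer code snippet).
QA_POSTS = [
    ("How to generate a random string in Java",
     "Random rnd = new Random(); StringBuilder sb = new StringBuilder(); "
     "sb.append((char) ('a' + rnd.nextInt(26)));"),
    ("Read a text file line by line in Java",
     "BufferedReader br = new BufferedReader(new FileReader(path)); "
     "String line = br.readLine();"),
    ("Pick a random word from a list in Java",
     "Random rnd = new Random(); String word = words.get(rnd.nextInt(words.size()));"),
]

# Assumed stop word set, kept tiny for the example.
STOP_WORDS = {"a", "an", "the", "in", "to", "how", "from", "by", "of"}


def tokenize(text):
    """Lowercase the text and keep alphabetic tokens that are not stop words."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOP_WORDS]


def relevant_posts(query, posts, min_overlap=2):
    """Return posts whose title shares at least min_overlap terms with the query."""
    query_terms = set(tokenize(query))
    scored = []
    for title, code in posts:
        overlap = len(query_terms & set(tokenize(title)))
        if overlap >= min_overlap:
            scored.append((overlap, title, code))
    # Highest overlap first; ties keep corpus order because the sort is stable.
    scored.sort(key=lambda item: -item[0])
    return scored


def code_entities(snippet):
    """Crude heuristics standing in for real parsing of the snippet."""
    # Capitalized identifiers are treated as type names.
    types = set(re.findall(r"\b[A-Z][A-Za-z0-9]*\b", snippet))
    # Lowercase identifiers between '.' and '(' are treated as method calls.
    methods = set(re.findall(r"\.([a-z][A-Za-z0-9]*)\s*\(", snippet))
    return types, methods


def augment(query, posts):
    """Build a structured query: free-text terms plus mined code entities."""
    types, methods = Counter(), Counter()
    for _, _, code in relevant_posts(query, posts):
        found_types, found_methods = code_entities(code)
        types.update(found_types)
        methods.update(found_methods)
    return {
        "text": tokenize(query),
        "types": sorted(types),
        "methods": sorted(methods),
    }


if __name__ == "__main__":
    print(augment("generate random words in java", QA_POSTS))
```

In this toy setting, the query "generate random words in java" contains no code terms, yet the structured query gains entities such as `Random` and `nextInt` from the matching answers, which is the kind of vocabulary a code index is more likely to contain. A realistic system would need proper snippet parsing and ranking, which this sketch deliberately omits.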

Notes

  1. https://github.com/about/press (verified 14.08.2015).

  2. This is a real question asked by a user in this post: http://stackoverflow.com/questions/4951997/generating-random-words-in-java.

  3. In this illustrative example, we excluded the actual post (http://stackoverflow.com/questions/4951997/generating-random-words-in-java) where this question is asked. To eliminate bias, in all experiments described in Section 5, in which we selected a question of a Q&A site as a subject, we removed the corresponding posts from the list of relevant posts to be used for augmenting the query.

  4. https://goo.gl/MqETzP (last accessed 12.07.2015).

  5. https://goo.gl/VPvxnX (last accessed 12.07.2015).

  6. https://www.google.com

  7. We use the dump that contains the oldest data available since the launch of Stack Overflow in 2008.

  8. https://api.stackexchange.com/

  9. Lucene’s (version 4) English default stop word set.

  10. http://lucene.apache.org

  11. Ohloh is now OpenHub.

  12. Although the queries themselves differ, our query sets are similar to those of Lv et al. (2015) and are representative of common developer search queries.

References

  • Bajracharya SK, Ngo T, Linstead E, Dou Y, Rigor P, Baldi P, Lopes CV (2006) Sourcerer: a search engine for open source code supporting structure-based search. In: Proceedings of the companion to the 21st ACM SIGPLAN symposium on object-oriented programming systems, languages, and applications (OOPSLA). Portland, Oregon, USA, pp 681–682

  • Bajracharya SK (2010) Facilitating internet-scale code retrieval. Ph.D. thesis, Long Beach. AAI3422111

  • Bajracharya SK, Ossher J, Lopes CV (2010) Leveraging usage similarity for effective retrieval of examples in code repositories. In: Proceedings of the 18th ACM SIGSOFT international symposium on foundations of software engineering (FSE). Santa Fe, New Mexico, USA, pp 157–166

  • Barzilay O, Treude C, Zagalsky A (2013) Facilitating crowd sourced software engineering via stack overflow. In: Finding source code on the web for remix and reuse. Springer, Berlin, pp 289–308

  • Bissyande T, Thung F, Lo D, Jiang L, Reveillere L (2013) Popularity, interoperability, and impact of programming languages in 100,000 open source projects. In: Computer software and applications conference (COMPSAC), 2013 IEEE 37th annual. https://doi.org/10.1109/COMPSAC.2013.55, pp 303–312

  • Bissyandé TF, Thung F, Lo D, Jiang L, Réveillère L (2013) Orion: a software project search engine with integrated diverse software artifacts. In: ICECCS

  • Carpineto C, de Mori R, Romano G, Bigi B (2001) An information-theoretic approach to automatic query expansion. ACM Trans Inf Syst 19(1):1–27. https://doi.org/10.1145/366836.366860

  • Chatterjee S, Juvekar S, Sen K (2009) Sniff: a search engine for java using free-form queries. In: Fundamental approaches to software engineering. Springer, Berlin, pp 385–400

  • Chen TH, Thomas SW, Nagappan M, Hassan AE (2012) Explaining software defects using topic models. In: Proceedings of the 9th IEEE working conference on mining software repositories, MSR ’12. http://dl.acm.org/citation.cfm?id=2664446.2664476. IEEE Press, Piscataway, pp 189–198

  • Cleland-Huang J, Czauderna A, Gibiec M, Emenecker J (2010) A machine learning approach for tracing regulatory codes to product specific requirements. In: ACM/IEEE 32nd international conference on software engineering. https://doi.org/10.1145/1806799.1806825, vol 1, pp 155–164

  • Codota (2016) http://www.codota.com. Last accessed 12.03.2016

  • Dagenais B, Robillard MP (2012) Recovering traceability links between an API and its learning resources. In: Proceedings of the 34th international conference on software engineering (ICSE). IEEE, Piscataway, pp 47–57

  • Eckert K, Stuckenschmidt H, Pfeffer M (2007) Interactive thesaurus assessment for automatic document annotation. In: Proceedings of the 4th international conference on knowledge capture, k-CAP ’07. https://doi.org/10.1145/1298406.1298426. ACM, New York, pp 103–110

  • Furnas GW, Landauer TK, Gomez LM, Dumais ST (1987) The vocabulary problem in human-system communication. Commun ACM 30(11):964–971. https://doi.org/10.1145/32206.32212

  • Gallardo-Valencia RE, Elliott Sim S (2009) Internet-scale code search. In: Proceedings of the 2009 workshop on search-driven development-users, infrastructure, tools and evaluation, SUITE

  • Gollapudi S, Ieong S, Ntoulas A, Paparizos S (2011) Efficient query rewrite for structured web queries. In: Proceedings of the 20th ACM international conference on information and knowledge management, CIKM ’11. https://doi.org/10.1145/2063576.2063981. ACM, New York, pp 2417–2420

  • Grechanik M, Fu C, Xie Q, McMillan C, Poshyvanyk D, Cumby C (2010) A search engine for finding highly relevant applications. In: 2010 ACM/IEEE 32nd international conference on software engineering. https://doi.org/10.1145/1806799.1806868, vol 1, pp 475–484

  • Gu X, Zhang H, Zhang D, Kim S (2016) Deep API learning. In: International symposium on foundations of software engineering (FSE)

  • Haiduc S, Bavota G, Marcus A, Oliveto R, De Lucia A, Menzies T (2013) Automatic query reformulations for text retrieval in software engineering. In: Proceedings of the 2013 international conference on software engineering. IEEE Press, Piscataway, pp 842–851

  • Haiduc S, De Rosa G, Bavota G, Oliveto R, De Lucia A, Marcus A (2013) Query quality prediction and reformulation for source code search: The refoqus tool. In: Proceedings of the 2013 international conference on software engineering, ICSE ’13. http://dl.acm.org/citation.cfm?id=2486788.2486991. IEEE Press, Piscataway, pp 1307–1310

  • Hill E, Roldan-vega M, Fails JA, Mallet G (2014) NL-based query refinement and contextualized code search results: a user study. In: 2014 Software evolution week - IEEE conference on software maintenance, reengineering, and reverse engineering, CSMR-WCRE 2014, Antwerp, Belgium, February 3-6, 2014. https://doi.org/10.1109/CSMR-WCRE.2014.6747190, pp 34–43

  • Hoffmann R, Fogarty J, Weld DS (2007) Assieme: finding and leveraging implicit references in a web search interface for programmers. In: Proceedings of the 20th annual ACM symposium on user interface software and technology (UIST). Newport, Rhode Island, USA, pp 13–22

  • Holmes R, Murphy GC (2005) Using structural context to recommend source code examples. In: Proceedings of the 27th international conference on software engineering (ICSE). St. Louis, MO, USA, pp 117–125

  • Kalliamvakou E, Gousios G, Blincoe K, Singer L, German DM, Damian D (2014) The promises and perils of mining GitHub. In: Proceedings of the 11th working conference on mining software repositories (MSR). Hyderabad, India, pp 92–101

  • Keivanloo I, Rilling J, Zou Y (2014) Spotting working code examples. In: Proceedings of ICSE

  • Kim S, Kim D (2016) Automatic identifier inconsistency detection using code dictionary. Empir Softw Eng (EMSE) 21(2):565–604

  • Lemos OAL, de Paula AC, Zanichelli FC, Lopes CV (2014) Thesaurus-based automatic query expansion for interface-driven code search. In: Proceedings of the 11th working conference on mining software repositories (MSR). Hyderabad, India, pp 212–221

  • Liu LM, Halper M, Geller J, Perl Y (1999) Controlled vocabularies in oodbs: Modeling issues and implementation. Distrib. Parallel Databases 7(1):37–65. https://doi.org/10.1023/A:1008682210559

  • Lozano A, Kellens A, Mens K (2011) Mendel: Source code recommendation based on a genetic metaphor. In: Proceedings of the 2011 26th IEEE/ACM international conference on automated software engineering, ASE ’11. https://doi.org/10.1109/ASE.2011.6100078. IEEE Computer Society, Washington, pp 384–387

  • Lu M, Sun X, Wang S, Lo D, Duan Y (2015) Query expansion via WordNet for effective code search. In: Proceedings of 22nd IEEE international conference on software analysis, evolution, and reengineering (SANER). Montreal, QC, Canada, pp 545–549

  • Lv F, Zhang H, Lou JG, Wang S, Zhang D, Zhao J (2015) CodeHow: effective code search based on API understanding and extended boolean model (E). In: 30th IEEE/ACM international conference on automated software engineering (ASE), pp 260–270

  • Mamykina L, Manoim B, Mittal M, Hripcsak G, Hartmann B (2011) Design lessons from the fastest Q&A site in the west. In: Proceedings of the SIGCHI conference on human factors in computing systems (CHI). Vancouver, BC, Canada, pp 2857–2866

  • Mandelin D, Xu L, Bodík R, Kimelman D (2005) Jungloid mining: helping to navigate the API jungle. ACM SIGPLAN Not 40(6):48–61

  • Manning CD, Raghavan P, Schütze H (2008) Introduction to information retrieval. Cambridge University Press, New York

  • Martie L, LaToza TD, van der Hoek A (2015) CodeExchange: supporting reformulation of internet-scale code queries in context (T). In: 2015 30th IEEE/ACM international conference on Automated software engineering (ASE). Lincoln, USA, pp 24–35

  • McMillan C, Grechanik M, Poshyvanyk D, Fu C, Xie Q (2012) Exemplar: a source code search engine for finding highly relevant applications. IEEE Trans Softw Eng 38(5):1069–1087. https://doi.org/10.1109/TSE.2011.84

  • McMillan C, Grechanik M, Poshyvanyk D, Xie Q, Fu C (2011) Portfolio: finding relevant functions and their usage. In: Proceedings of ICSE

  • Moreno L, Bavota G, Di Penta M, Oliveto R, Marcus A (2015) How can I use this method? In: ICSE

  • Nasehi SM, Sillito J, Maurer F, Burns C (2012) What makes a good code example?: a study of programming Q&A in stackoverflow. In: Proceedings of 28th IEEE international conference on software maintenance (ICSM). Trento, Italy, pp 25–34

  • Nguyen AT, Nguyen TT, Al-Kofahi J, Nguyen HV, Nguyen T (2011) A topic-based approach for narrowing the search space of buggy files from a bug report. In: 26th IEEE/ACM international conference on automated software engineering (ASE). https://doi.org/10.1109/ASE.2011.6100062, pp 263–272

  • Nie L, Jiang H, Ren Z, Sun Z, Li X (2016) Query expansion based on crowd knowledge for code search. IEEE Trans Serv Comput 9(5):771–783. https://doi.org/10.1109/TSC.2016.2560165

  • Openhub (2016) http://code.openhub.net. Last accessed 12.03.2016

  • Ponzanelli L, Bavota G, Di Penta M, Oliveto R, Lanza M (2014) Mining stackoverflow to turn the IDE into a self-confident programming prompter. In: Proceedings of the 11th working conference on mining software repositories (MSR). Hyderabad, India, pp 102–111

  • Roldan-vega M, Mallet G, Hill E, Fails JA (2013) Conquer: a tool for nl-based query refinement and contextualizing source code search results. In: Proceedings 29th IEEE international conference on software maintenance. Citeseer

  • Ruthven I (2003) Re-examining the potential effectiveness of interactive query expansion. In: Proceedings of the 26th annual international ACM SIGIR conference on research and development in information retrieval, SIGIR ’03. https://doi.org/10.1145/860435.860475. ACM, New York, pp 213–220

  • Sadowski C, Stolee KT, Elbaum S (2015) How developers search for code: a case study. In: Proceedings of the 2015 10th joint meeting on foundations of software engineering, ESEC/FSE 2015. https://doi.org/10.1145/2786805.2786855. ACM, New York, pp 191–201

  • Shepherd D, Fry ZP, Hill E, Pollock L, Vijay-Shanker K (2007) Using natural language program analysis to locate and understand action-oriented concerns. In: Proceedings of the 6th international conference on aspect-oriented software development (AOSD). Vancouver, British Columbia, Canada, pp 212–224

  • Sisman B, Kak AC (2013) Assisting code search with automatic query reformulation for bug localization. In: Proceedings of the 10th working conference on mining software repositories (MSR). San Francisco, CA, USA, pp 309–318

  • Stylos J, Myers BA (2006) Mica: a web-search tool for finding API components and examples. In: IEEE symposium on Visual languages and human-centric computing, 2006. VL /HCC 2006. https://doi.org/10.1109/VLHCC.2006.32, pp 195–202

  • Subramanian S, Inozemtseva L, Holmes R (2014) Live API documentation. In: Proceedings of the 36th international conference on software engineering (ICSE). Hyderabad, India, pp 643–652

  • Thummalapenta S, Xie T (2007) Parseweb: a programmer assistant for reusing open source code on the web. In: Proceedings of the 22nd IEEE/ACM international conference on automated software engineering (ASE). Atlanta, Georgia, USA, pp 204–213

  • Thung F, Bissyande TF, Lo D, Jiang L (2013) Network structure of social coding in Github. In: Proceedings of the 17th European conference on Software maintenance and reengineering (CSMR). Genova, Italy, pp 323–326

  • Treude C, Robillard M (2016) Augmenting API documentation with insights from stack overflow. In: Proceedings of the 38th international conference on software engineering, ICSE ’16, pp 392–403

  • Wang S, Lo D, Jiang L (2014) Active code search: incorporating user feedback to improve code search relevance. In: Proceedings of the 29th ACM/IEEE international conference on automated software engineering (ASE). Vasteras, Sweden, pp 677–682

  • Xie T, Pei J (2006) Mapo: mining API usages from open source repositories. In: Proceedings of the 2006 international workshop on mining software repositories, MSR ’06. https://doi.org/10.1145/1137983.1137997. ACM, New York, pp 54–57

  • Xu J, Croft WB (1996) Query expansion using local and global document analysis. In: Proceedings of the 19th annual international ACM SIGIR conference on research and development in information retrieval (SIGIR). Zurich, Switzerland, pp 4–11

  • Yang J, Tan L (2014) Swordnet: inferring semantically related words from software context. Empir Softw Eng 19(6):1856–1886

  • Zhao L, Callan J (2010) Term necessity prediction. In: Proceedings of the 19th ACM international conference on information and knowledge management, CIKM

  • Zhao L, Callan J (2012) Automatic term mismatch diagnosis for selective query expansion. In: Proceedings of the 35th international ACM SIGIR conference on research and development in information retrieval, SIGIR

Acknowledgments

The authors would like to thank the anonymous reviewers for their helpful comments and suggestions. This work was supported by the Fonds National de la Recherche (FNR), Luxembourg, under projects RECOMMEND C15/IS/10449467, FIXPATTERN C15/IS/9964569, FNR-AFR PhD/11623818, and by the Singapore Ministry of Education (MOE) Academic Research Fund (AcRF) Tier 1 grant, under project 16-C220-SMU-004.

Author information

Corresponding author

Correspondence to Dongsun Kim.

Additional information

Communicated by: Denys Poshyvanyk

We make all our data available: the source code of GitSearch, the search indices, and the user study results. See https://github.com/serval-snt-uni-lu/cocabu. A prototype implementation of the CoCaBu-based search engine, GitSearch, is live at http://www.cocabu.com.

About this article

Cite this article

Sirres, R., Bissyandé, T.F., Kim, D. et al. Augmenting and structuring user queries to support efficient free-form code search. Empir Software Eng 23, 2622–2654 (2018). https://doi.org/10.1007/s10664-017-9544-y

  • DOI: https://doi.org/10.1007/s10664-017-9544-y
