Analyzing and mining a code search engine usage log

Bajracharya, Sushil Krishna; Lopes, Cristina Videira

doi:10.1007/s10664-010-9144-6

Analyzing and mining a code search engine usage log

Open access
Published: 18 September 2010

Volume 17, pages 424–466, (2012)
Cite this article

Download PDF

You have full access to this open access article

Empirical Software Engineering Aims and scope Submit manuscript

Analyzing and mining a code search engine usage log

Download PDF

Sushil Krishna Bajracharya^1,2 &
Cristina Videira Lopes¹

6257 Accesses
66 Citations
1 Altmetric
Explore all metrics

Abstract

This paper presents an analysis of a year long usage log of Koders, the first commercially available Internet-Scale code search engine (http://www.koders.com). The usage log comprises about ten million activities from more than three million users. Analysis of the usage data shows that despite of attracting a large number of visitors, Koders has a very sparse usage and that it lacks regular usage from many of its users. When compared to Web search, search behavior in Koders showed many similar patterns. A topic modeling analysis of the usage data shows what topics users of Koders are looking for. Observations on the prevalence of these topics among the users, and observations on how search and download activities vary across topics, lead to the conclusion that users who find code search engines usable are those who already know to a high level of specificity what to look for. This paper also presents a general categorization of these topics that provides insights on the different ways code search engine users express their queries. It identifies various forms of queries in Koders’s log and the kinds of results addressed by the queries. It also provides several suggestions for improvements in code search engines based on the analysis of usage, topics, and query forms. The work presented in this paper is the first of its kind that reveals several insights on the usage of an Internet-Scale code search engine.

Infrastructure for Building Code Search Applications for Developers

What do developers search for on the web?

Article 09 April 2017

Xin Xia, Lingfeng Bao, … Zhenchang Xing

Novel and Applied Algorithms in a Search Engine for Java Code Snippets

1 Introduction

Searching for source code constitutes a significant part of a software development activity (Singer et al. 1997; Murphy et al. 2006). Software developers use a variety of tools to search source code that range from conventional tools such as ‘grep’ to advanced search facilities in integrated development environments. With the explosion of source code available on the Web, it has become a routine practice among developers to search and reuse source code from the Web.

Recent research in software engineering has focused on understanding and supporting the search-driven nature of software development, and many such work have produced tools that aid various search needs of developers (Hoffmann et al. 2007; Bajracharya et al. 2006, 2009; Lemos et al. 2009; Holmes and Murphy 2005; Mandelin et al. 2005; Thummalapenta and Xie 2007; Hummel et al. 2008; Reiss 2009). On the other hand, there has been considerable commercial effort in the development of Internet-Scale code search engines that have powerful crawlers and large databases providing an index to large quantities of software available on the Web (Web site for Koders 2010; Web site for Krugle 2010; Web site for Google Code Search 2010).

Despite the progress in developing novel code search tools, there has been little work in understanding the actual usage of these tools. Recently, a questionnaire-based study obtained some insights on developers’ practice of searching for source code on the Web (Umarji et al. 2008). Two other studies used search logs to investigate information needs and query styles of developers on Web search (Hoffmann et al. 2007; Brandt et al. 2009). These studies show that searching for source code on the Web has various purposes: learning existing APIs, finding code samples, implementations etc (Umarji et al. 2008; Hoffmann et al. 2007); and that developers express their information need using various forms of queries, ranging from natural language terms to names of code entities they are aware of (Brandt et al. 2009).

Code search engines are relatively new category of tools, and there are no existing studies that explain what users of these code search engines are looking for, and how they express what they are looking for. Despite of the apparent popularity of code search engines,^{Footnote 1} it is not clear whether they are effective in providing the information software developers need. We conducted an exploratory analysis of the usage log of Koders, the first commercial Internet-Scale code search engine, with a goal to answer three major research questions:

Usage: What kind of usage behavior can we see in Koders?
Search Topics: What are the users searching for?
Query Forms: How are users expressing their information need in their queries?

There are four motivations behind these questions. First, we wanted to compare the usage patterns in Koders with known usage patterns in general purpose search and developers’ search behavior on the Web. Second, we wanted to get a high-level summary of what developers are looking for in Koders. Third, by analyzing selected user interactions in detail we wanted to know if users have specific ways of expressing their queries, and whether some forms of expression were more effective than other. Finally, we wanted to gain some insights on what improvements could be made to Koders, and code search engines similar to it, based on our findings.

This paper is an extension of our work presented in Bajracharya and Lopes (2009) with new results on usage statistics, query forms, and more details on topic extraction. It describes the process and results that we extracted from the usage log to answer the three major questions listed above. To the best of our knowledge the work presented in this paper is the first of its kind with detailed analysis of results extracted from millions of entries in the usage log of an Internet-Scale code search engine. In the context of the prior work mentioned earlier, the contributions of this paper are as follows:

1.
It provides a statistical characterization of search behavior in large scale code search by presenting an analysis of a year long usage data from Koders. It compares code search usage with Web search usage, and finds many similarities. It also reveals usage behavior that is unique to Koders.
2.
Using topic modeling analysis on the Koders’s usage data, it gives a solid empirical evidence of the range of topics that users of code search engines look for. It provides empirical data on the prevalence of these topics (popularity) among users and shows how search and download activities vary across the topics, supporting the conclusion that the most successful searches on Koders are those where the users already know what they are looking for. It also provides valuable insights at the different ways code search engine users express their queries.
3.
With an analysis of 150 randomly selected search sessions across various topics (mined using the topic modeling), it identifies various lexical forms of the queries, and the kinds of results addressed by these queries that users gave to Koders.
4.
It provides several suggestions and possible directions to improve code search engines based on the analysis of usage data, topic modeling results, and the analysis of query forms; something the next generation of code search engines should take into account.
5.
It makes the Koders usage log, and associated software used to produce analysis results available to others; facilitating replication, extension, and improvement of the presented work.

The paper is organized as follows. Section 2 discusses the usage data we analyzed. Section 3 provides an analysis of the usage data at large, providing some general statistics on usage and its comparison with usage in Web search. Section 4 presents the LDA (Latent Dirichlet Allocation) topic modeling technique, describes how we applied it to the Koders’s usage log, and presents our findings from topic modeling. Section 5 presents the results of analyzing 150 search sessions sampled from various topics where we identify five lexical forms and four result types users generally expressed in their queries. Section 5 also presents the form of queries that were effective in producing relevant results. Section 6 provides discussion on our interpretation and implication of the results we obtained from analyzing the general usage statistics, mining topics, and encoding various forms of queries. Section 7 discusses validity and limitations of our work. Section 8 presents related work, and we conclude in Section 9.

2 Usage Log Data

The data used for topic modeling consists of a year long user activity log obtained from Koders (Web site for Koders 2010). The usage log contained records of 5,207,758 search activities and 5,072,045 download activities from 3,187,969 unique users covering the period of 2007-01-01 to 2007-12-31.

The log data was recorded in a relational database. The portion of the log we used is represented as a set of tuples with the following fields; <uid, act-type, term-or-file, ts, l>, where:

1.
uid = a unique user id assigned by Koders to each of its user based on the combination of the user’s IP address and browser cookies.
2.
act-type = activity type, that can be either search or download. A search activity constitutes a query consisting of several terms, whereas a download activity is an activity where the user interacts with one of the results shown in the hits either by selecting the code or downloading it. A download activity means the user showed interest in the code that was found in the search results and used it in some way. Therefore a download activity in Koders is equivalent to the result-click events in Web search.
3.
term-or-file = denotes the collection of terms in the query when the activity is search, otherwise denotes a unique file identifier denoting the source code that was accessed during the download activity.
4.
ts = the timestamp attached to each activity that denotes when that activity took place.
5.
l = the programming language specified by the user for each query. If no language is specified the value is ‘*’ denoting search in all languages. Other possible values are languages listed in the Koders’s user interface such as Java, C, Python etc. This value exists only for search activities and not applicable to download activities.

The usage log studied in this paper is available from the UCI Source Code Data Sets Web site (Lopes et al. 2010). The software used to process the data is available as an open source project at Web page for Koders log analysis github repository (2010).

3 Analysis of Usage Data

In this section we look at several statistics to gain insights on usage behavior in Koders. We look at variables that are commonly used in analysis of query logs such as statistics on activities, search sessions, query types and query reformulations (Silverstein et al. 1999; Brandt et al. 2009). Being a code search engine, Koders offers some unique features not found in general purpose search engines. For example, query operators specific to source code, and facility to download (or browse) code after search. We focus more on variables pertaining to these features. These statistics not only reveal usage patterns that are unique to Koders, but also allow us to compare search behavior of users in Koders to those on the Web. Given below is a summary of all the variables we look at.

Routine usage: First, we look at three variables to understand whether users are searching in Koders routinely and actively. We look at number of days that users are active in Koders, number of search activities, and number of download activities among the users.
Analysis of sessions: Second, we do an analysis on sessions of activities in the usage log. A session is considered to be a series of queries by a single user in a short interval that represents a single information need (Silverstein et al. 1999). We look at three variables in sessions: duration, activities, and page views. Duration is the length of session in minutes, activities are either search or download activities, and page views are count of consecutive repeating queries that are recorded in the log when a user navigates through multiple pages of search results for the same query.
Analysis of queries: Third, we do an analysis of the queries in the log to understand how users are expressing their queries. We look at query length, common usage of terms in queries among users, types of queries user give, and the kind of query operators and reformulations in the queries.
Comparing with Web search: Finally, we compare some of the results we obtained with existing results from analysis of logs in Web search.

3.1 Routine Usage

Koders mentions in its Web site that more than 30,000 developers use the search engine every day. However this number does not indicate how routinely users rely on the system. A code search engine might have many visitors, but if the visitors are not coming back, or using it routinely, its utility is questionable. Therefore, we seek to answer the following questions: How many activities users typically have in the system? Do users who use Koders once come back to it again later for their information needs?

Table 1 lists statistics on number of days the users used Koders, and the count of search and download activities for users. These statistics are computed for users who had at least one search activity. We can make the following observations based on the data in Table 1.

Users engage in very few activities: A large percentage of the users had only few search activities. More than 85% of users had just three or less search activities; about 67% of users had only one search activity. More than half of the users do not download anything after search, and those who download have only few downloads. About 64% of the users who searched had no download activity, and 91% of users had three or less downloads.
Most of the users did not used Koders again after using it for a day: About 90% of the users were active only for one day. Among others, 98% of users were active for less than or equal to three days. Only 0.14% of users were active for more than 10 days.

Table 1 Usage statistics

Analyzing and mining a code search engine usage log

Abstract

Similar content being viewed by others

Infrastructure for Building Code Search Applications for Developers

What do developers search for on the web?

Novel and Applied Algorithms in a Search Engine for Java Code Snippets

1 Introduction

2 Usage Log Data

3 Analysis of Usage Data

3.1 Routine Usage

3.2 Analysis of Sessions

Activities in Sessions

Screen Views (Repeating Queries)

3.3 Analysis of Queries

Query Terms

Use of Operators

Query Types

Query Reformulation

3.4 Comparison with Web search

4 Topic Modeling

Data Processing for LDA

4.1 Results—Latent Topics

4.2 Topic Categories

4.3 Users and Topics

4.4 Search, Downloads and Topics

Search Activities Under a Topic

Downloads that Follow a Topic

Results

4.5 Prevalence in Topic Categories

5 Query Forms

5.1 Lexical Structure

5.2 Result Types

5.3 Form and Relevance

6 Discussion

6.1 Usage

Supporting Users who Give up Quickly

Support for Programming Language Specific Queries

Support for Browsing Code After Search

Leveraging Query Logs for Effective Retrieval

Support for Natural Queries

6.2 Topic Modeling

Facilitating Search on Specific Topics

Supporting Users who do not Know what to Look for

Support for Natural Language Expressions in the Queries

6.3 Query Forms

Understanding the Underlying Structure in Code

Supporting Code Abbreviations

Generating the Right Code Snippet to Show in the Result Page

Support for Retrieving Various Result Types

7 Validity

7.1 Usage Analysis

Accuracy of Identifying Users

Noise Introduced by Activities from Bots

Limitations of Log Analysis

Generalizing to Other Search Engines

7.2 Topic Modeling

Choice of LDA Topic Modeling to Mine Topics

Criteria for Selecting Users Searching in Java

Choice of a Model for a Document

Data Processing for LDA

Assumptions with LDA

Topic Categories

Popularity of Topics

Choice of LDA Tools

7.3 Form and Relevance

Relevance Judgements

Analysis of Topics in Search-Download Sessions

8 Related Work

9 Conclusion

Notes

References

Acknowledgements

Open Access

Author information

Authors and Affiliations

Corresponding author

Additional information

Appendix: List of Topics Mined from the Usage Log

Appendix: List of Topics Mined from the Usage Log

Rights and permissions