Community Engagement and Quality Knowledge with Stackoverflow's Reputation System

Bullipedia, the online gastronomic encyclopedia, is an idea yet to be developed. In this work, we analyze Stack Overflow (SO) and extract some good practices from this popular question-and-answer (Q&A) site to incorporate them into the future Bullipedia. SO is an online forum in which users ask and answer questions related to programming, web development, operating systems, and other technical topics. Expertise is rewarded through a detailed reputation system: questions and answers can receive upvotes and downvotes from other members of the community, so that their authors (askers and answerers) gain reputation for posing good questions and providing helpful solutions. Besides this, the asker may mark (accept) one of the answers as the best one at any point. In this paper, we present a study on how this reputation system can be used to predict the likely accepted answer (from a set of candidate answers) for a yet unresolved question. In our approach, we selected a subset of questions with their respective answers, and for each answer we created a question-answer pair (quan). Then we extracted a set of key features from every quan, and applied supervised machine learning techniques to train a classifier that learned, based on those features, whether or not a quan contained the accepted answer for its question. Finally, we made use of the trained classifier to predict if, given a quan (related to a question with no marked answer), its answer might potentially be the accepted one for the question. Our findings show that the resulting model predicted the accepted answer correctly with high accuracy (88 percent of the time). A question and its accepted answer constitute a source of quality knowledge, as they provide the solution to a specific problem. We propose to adopt a similar Q&A forum and reputation system for Bullipedia, and then apply a similar classification model to identify the best answer for unsolved questions.


Introduction
Bullipedia has its origin in elBulli, one of the most lauded restaurants of all time. Michelin awarded it three stars (1976, 1990, and 1997), and more recently it was voted best restaurant in the world in 2002 and from 2006 to 2009 by industry authority Restaurant magazine (Williams 2012). The successful restaurant incorporated disciplines such as technology, science, philosophy, and the arts into its research, and Ferran Adrià, its owner and the most influential chef in the world, published his results in international conferences, books, and journal articles, in a way similar to the academic process of peer review.
elBulli has now become elBulliFoundation, a center that seeks to be a hub for creativity and innovation in high cuisine and to continue the creative activity of the former restaurant. The key project of the foundation is an attempt to externalize all its wisdom onto the Bullipedia: Adrià's vision for "an online database that will contain every piece of gastronomic knowledge ever gathered" (Williams 2012). He justified the need for such a culinary encyclopedia by claiming that "there is no clear codification of cuisine" (Pantaleoni 2013). However, the Bullipedia is an idea yet to be developed.
Thus, the question to answer at this point is: what should the Bullipedia be like? By analyzing different sources (specific literature, elBulli's publications, interviews with Adrià, news, and emails with elBulliFoundation's staff), we have identified several requirements that the Bullipedia must meet to achieve its mandate. In this work, we focus on encouraging user contribution and maintaining quality knowledge.
For a project such as this (creating an online encyclopedia on cuisine), we believe that the collaboration of the community is indispensable for building quality content.
Bullipedia is an idea inherently 2.0 that can take advantage of crowdsourcing (Quinn and Bederson 2011; Doan, Dieter and Harmelen 2011) and harness the collective intelligence (O'Reilly 2005) to generate value. There exist many successful projects that could teach us valuable lessons for Bullipedia. The best example is Wikipedia, the online encyclopedia par excellence, whose success is due largely to its reliance on the crowd to create, edit, and curate its content. Other relevant cases are Allrecipes, Epicurious, and Cookpad, three of the most popular websites for recipe exchange. On these platforms, the members upload their own recipes and review and rate other members' recipes.
A major concern in these kinds of projects that rely on their own users to succeed is precisely how to engage them. In our previous work (Jiménez-Mavillard and Suárez 2015), we recapitulated the main motivations for the crowd to create content.
Among these, in the context of this work, we highlight recognition and reputation as intrinsically rewarding factors that motivate the community to collaborate (Herzberg 2008). For this reason, we look into SO and argue that its reputation system: 1) encourages participation, 2) is an excellent quality control mechanism that guarantees true value because it allows users to collectively decide whether or not contents are reliable, and 3) can be used to measure and predict the quality of these contents. This paper is organized as follows. In section 2, we describe Q&A sites and in particular SO. In section 3, we outline the problem and significant related works.
The experiment and the methodology are detailed in section 4, and the results are shown in section 5. We discuss the relevance of SO and other projects to Bullipedia in section 6, and finally, we end with some conclusions and future work in section 7.

Stack Overflow and Q&A sites
Since the origin of the Internet, the volume of information on the web, in digital libraries, and in other media has kept increasing. Traditional search engines are helpful tools to tackle this abundance of information, but they just return ranked lists of documents that users must browse manually. In many cases, users simply want the exact answer to their questions, asked in natural language.
Q&A sites have emerged in the past few years as an enormous market to fulfill these information needs. They are online platforms where users share their knowledge and expertise on a variety of subjects. In essence, users of the Q&A community ask questions and other users answer them. These sites go beyond traditional keyword-based querying and retrieve information in a more precise form than a list of documents. This fact has changed the way people search for information on the web. For instance, the abundance of information to which developers are exposed via social media is changing the way they collaborate, communicate, and learn (Vasilescu and Serebrenik 2013). Other authors have also investigated the interaction of developers with SO, and reported how this exchange of questions and answers is providing a valuable knowledge base that can be leveraged during software development (Treude, Barzilay and Storey 2011). These changes have produced a marked shift from mere websites born to provide useful answers to questions, towards large collaborative production and social computing web platforms aimed at crowdsourcing knowledge by allowing users to ask and answer questions. The end product of such a community-driven knowledge creation process is of enduring value to a broad audience; it is a large repository of valuable knowledge that helps users to solve their problems effectively (Anderson et al. 2012).
The ever-increasing number of Q&A sites has caused the number of questions answered on them to far exceed the number of questions answered by library reference services (Janes 2003), which until recently were one of the few institutional sources for such information. Library reference services have a long tradition of evaluation to establish the degree to which a service is meeting user needs. Such evaluation is no less critical for Q&A sites, and perhaps even more so, as these sites do not have a set of professional guidelines and ethics behind them, as library services do (Shah and Pomerantz 2010). Instead, most Q&A sites use collaborative voting mechanisms for users inside the community to evaluate and maintain high quality questions and answers (Tian et al. 2013). By a quality answer, we mean one that satisfies the asker (Liu et al. 2008) and also other web users who will face similar problems in the future (Liu et al. 2011). SO, created in 2008 by Jeff Atwood and Joel Spolsky, emerged as an alternative to traditional sources of information for programmers, like books, blogs, or other existing Q&A sites. The fact that both Atwood and Spolsky were popular bloggers contributed to its success in the early stages of the project, as they brought their two communities of readers to the new site and generated the critical mass that made it work (Atwood 2008; Spolsky 2008). This success was also promoted by its novel reputation system. The site employs gamification to encourage its participants to contribute (Deterding 2012). Participation is rewarded by means of an elaborate reputation system that is set in motion through a rich set of actions. The main actions are asking and answering questions, but users can also vote up or down on the quality of other members' contributions. The basic mode of viewing content is the question page, which lists a single question along with all its answers and their respective votes. The vote score on an answer (the difference between the upvotes and downvotes it receives) determines the relative ordering in which it is displayed on the question page. When users vote, askers and answerers gain reputation for asking good questions and providing helpful answers, and they also obtain badges that give them more privileges on the website. In addition, at any point, an asker can select one of the posted answers as the accepted answer, suggesting that this is the most useful response. This also makes the asker and the answerer earn reputation.
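As a minimal sketch of these mechanics (class and field names are our own illustration, not SO's actual schema):

```python
# Illustrative sketch of SO's answer scoring and ordering; the names here
# are ours, not SO's actual data model.
from dataclasses import dataclass

@dataclass
class Answer:
    body: str
    upvotes: int
    downvotes: int
    accepted: bool = False  # set by the asker, independently of votes

    @property
    def score(self) -> int:
        # The vote score shown next to an answer on the question page.
        return self.upvotes - self.downvotes

def question_page_order(answers: list[Answer]) -> list[Answer]:
    # Answers are displayed from highest to lowest vote score.
    return sorted(answers, key=lambda a: a.score, reverse=True)
```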
The reputation score can be seen as a measure of expertise and trust, which signifies how much the community trusts a user (Tian et al. 2013).
SO's success is largely due to the engaged and active user community that collaboratively manages the site. This community is increasing both in size and in the amount of content it generates. According to the January 2014 SO data dump provided by the Stack Exchange network (and analyzed in this work), SO stores around 9.5 million questions, almost 16 million answers, and has a community with more than 4 million users. The number of questions added each month has been steadily growing since the inception of SO (Ponzanelli et al. 2014) and has reached peaks of more than 200,000 new questions per month (see Figure 1).
The content is heavily curated by the community; for example, duplicate questions are quickly flagged as such and merged with existing questions, and posts considered to be unhelpful (unrelated answers, commentary on other answers, etc.) are removed. As a result of this self-regulation, and despite its size, content on SO tends to be of very high quality (Anderson et al. 2012).

Problem and related work
The primary idea of this study is to understand the relation between a question and its accepted answer in order to predict potential accepted answers (from a set of candidate answers) for new questions. We tackled this problem by analyzing SO to verify whether the metrics associated with the reputation system's activities (question score, answer score, user reputation, etc.) can decisively predict accepted answers for yet unresolved questions.
This problem has two main components: crowdsourcing and machine learning.
The term crowdsourcing is a combination of the words crowd and outsourcing, and refers to the process of getting work from a large group of people, especially from online communities, rather than from employees or suppliers. This model of contribution has been applied to a wide range of disciplines, from bioinformatics (Khare et al. 2015) to the digital humanities (Carletti et al. 2013). "The Wisdom of Crowds" (Surowiecki 2005) is a popular science work that offers a general introduction to the concept. SO relies on the crowd, and some authors have underlined the soundness of this user-generated content model for providing quality solutions. Vasilescu and Serebrenik (2013) investigated the correlation between the activity of SO's users and their activity on GitHub, the largest social coding site on the web. They demonstrated that the most productive programmers, in terms of amount and uniform distribution of their work, are the ones who answer more questions on the Q&A site. Therefore, a large number of answers on SO are presumably effective solutions, as they come from qualified programmers who incorporate good work practices. Parnin, Treude and Grammel (2012) showed that companies like Google and Oracle also acknowledge the quality of the content produced on SO. These companies entrusted the documentation of their respective APIs (Google Web Toolkit, Android, and Java) to the SO community. The authors collected usage data using Google Code Search (currently shut down), and analyzed the coverage, quality, and dynamics of the SO documentation for these APIs. They found that the crowd is capable of generating a rich source of content with code examples and discussion that is more actively viewed and used than traditional API documentation.
The second pillar of this work is machine learning, a subfield of artificial intelligence that studies how to create algorithms that can learn and improve with experience. These algorithms learn from input observations, build a model that fits the observed data, and apply the model to new observations in order to make predictions about them. Machine learning is a cross-cutting field used in a large number of disciplines, with multiple applications, from computer vision to speech and handwriting recognition. "Machine Learning" (Mitchell 1997) is a classic introductory textbook on the primary approaches to the field.
A recurrent problem solved with machine learning is classification: the problem of identifying the category to which a new observation belongs. It is a supervised method, that is, the task of building a model from labeled data. As the goal of our study is to determine when a question has been correctly answered, we posed it as a classification problem: "Is this answer (likely to be) the accepted answer for this question?" The possible options are "yes" or "no." Therefore, for every question, each of its answers belongs to one of these two categories: "Yes" (it is the accepted answer) or "No" (it is not). The question seems trivial, but even in many branches of pure mathematics, where specialists deal with objective and universal knowledge, it can be surprisingly hard to recognize when a question has, in fact, been answered (Wilf 1982).
While it is true that finding the "right" answer is ambitious, efforts to detect "good enough" ones are underway. As already mentioned above, our approach is to extract features from questions, answers, and users, and apply classification to learn the relation between a question and its accepted answer. Many authors have applied machine learning techniques to provide the correct answer to a question, pursuing different objectives. For instance, some of them focused on directly identifying the best answer. Wang et al. (2009) understood questions and their answers on Yahoo! Answers as relational data. They assumed that answers are connected to their questions with various types of links that can be positive (indicating high-quality answers), negative (indicating incorrect answers), or user-generated spam.
They proposed an analogical reasoning-based approach that measures the analogy between new question-answer linkages and those of previous relevant knowledge containing only positive links. The answer that had the most analogous link to the supporting set was assumed to be the best answer. Shah and Pomerantz (2010), instead, evaluated and predicted the quality of answers, also on Yahoo! Answers. They extracted features from questions, answers, and the users who posted them, and trained different classifiers that were able to measure the quality of the answers. The answer with the highest quality was considered the best one. Interestingly, the authors reported that contextual information such as a user's profile (they included information like the number of questions asked and the number of those questions resolved, for the asker; and the number of questions answered and the number of those answers chosen as best answers, for the answerer) can be critical in evaluating and predicting content quality. This is actually a key finding in our experiment, as we will see in the results (section 5).
Another approach is to redirect the question to the best source of information.
For example, Singh and Shadbolt (2013) matched questions on SO to the corresponding Wikipedia articles, so that users could find the answer by themselves. For this, they applied natural language processing tools to the questions and the articles in order to establish these matches.

Experiment and methodology
As stated before, we formulated the idea of identifying valuable knowledge on SO as a classification problem in machine learning. In particular, we made use of SO's reputation system to predict if an answer will be the accepted answer for a question.
Let us have a question, q_i, that has k answers, a_i1, a_i2, ..., a_ik, and none of them has been marked as accepted yet. Let us pair the question with each of its answers; we obtain k pairs: <q_i, a_i1>, <q_i, a_i2>, ..., <q_i, a_ik>. The problem we want to solve is: "which question-answer pair contains the answer that will (likely) be accepted?" In the context of this paper, we defined the following concepts:
• Information unit or quan: a quan is a question-answer pair. Each answer on SO has a minimum level of accuracy with respect to its question; otherwise, it would have been removed by the community. Therefore, every quan always provides valid information.
• Knowledge unit or kuan: a kuan is a particular quan formed by a question and its accepted answer. The accepted answer solved the asker's question.
Hence, a kuan is a source of valuable knowledge for other users who could face the same problem in the future. In the previous example, if the first answer, a_i1, were the accepted answer, then from the list of quans <q_i, a_i1>, <q_i, a_i2>, ..., <q_i, a_ik>, only <q_i, a_i1> would also be a kuan.
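To make the definitions concrete, a small hypothetical sketch of the pairing (field names such as accepted_answer_id are ours for illustration):

```python
# Hypothetical sketch: build the k quans for a question and mark which one
# is a kuan. Field names (id, accepted_answer_id) are illustrative.
def build_quans(question: dict, answers: list[dict]) -> list[dict]:
    quans = []
    for answer in answers:
        quans.append({
            "question_id": question["id"],
            "answer_id": answer["id"],
            # A quan is a kuan only if its answer is the accepted one.
            "is_kuan": answer["id"] == question.get("accepted_answer_id"),
        })
    return quans
```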
The rephrased goal now is to answer the question: "which quans are kuans?" In order to address this experiment, we combined different tools: the Ubuntu command line, and Python, in particular its ElementTree XML API, Pandas, and Scikit Learn. Python is a programming language that allowed us to work quickly and integrate systems effectively. The ElementTree module implements a simple and efficient API for parsing and creating XML data. Pandas is an open source, BSD-licensed library that provides high-performance, easy-to-use data structures and data analysis tools for Python. Scikit Learn is a collection of simple and efficient tools for data mining, data analysis, and machine learning. Our experiment was performed in several steps:

a) SO's data dump
The first step was to dump SO's data on posts and users. With the Ubuntu command line, we split the posts into questions and answers, and then used the ElementTree XML API to parse and extract the data from questions, answers, and users. Table 1 summarizes the information of the total data dump.

b) Experiment dataset
Secondly, we selected a subset of questions with their respective answers and their authors (the users who posted those questions and answers). The subset of questions ranged from January 1 to January 10, 2015, while we collected their answers posted during the whole month of January 2015. One month from posting a question is enough time to gather most of its answers; in fact, 63.5% of questions on SO are answered in less than an hour (Bhat et al. 2014). We used Pandas to process this large dataset and store it in an easy-to-use format: CSV.
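A minimal sketch of these two steps, assuming the layout of the public Stack Exchange data dump (row elements in Posts.xml with attributes such as Id, PostTypeId, ParentId, and CreationDate, where PostTypeId 1 marks questions and 2 marks answers):

```python
# Sketch of steps (a)-(b); attribute names follow the public Stack Exchange
# data dump (PostTypeId 1 = question, 2 = answer).
import xml.etree.ElementTree as ET
import pandas as pd

def parse_posts(path: str) -> pd.DataFrame:
    rows = []
    # iterparse streams the XML, which matters for a dump of this size.
    for _, elem in ET.iterparse(path, events=("end",)):
        if elem.tag == "row":
            rows.append(dict(elem.attrib))
            elem.clear()  # free memory as we go
    return pd.DataFrame(rows)

posts = parse_posts("Posts.xml")
questions = posts[posts["PostTypeId"] == "1"]
answers = posts[posts["PostTypeId"] == "2"]

# Questions from January 1-10, 2015; answers from the whole of January 2015.
qdays = questions["CreationDate"].str[:10]
adays = answers["CreationDate"].str[:10]
questions = questions[qdays.between("2015-01-01", "2015-01-10")]
answers = answers[adays.between("2015-01-01", "2015-01-31")]
answers = answers[answers["ParentId"].isin(questions["Id"])]

questions.to_csv("questions.csv", index=False)
answers.to_csv("answers.csv", index=False)
```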

c) Feature selection
Third, we created the set of quans from the selected dataset. We needed to transform our data into a suitable representation that the classifier could process; to accomplish this task, every quan was represented as a vector of features (these came from the attributes of questions and answers, as shown in Table 3). Table 2 summarizes the information of our new dataset.
We selected the features based on similar previous works (Anderson et al. 2012). Features 7 and 9 were also suggested by others (Shah and Pomerantz 2010; Ponzanelli et al. 2014; Bhat et al. 2014); we then extended them with user-profile metrics such as the asker's percentage of answered questions and the answerer's percentage of accepted answers.

d) Feature extraction
Next, we needed to classify the set of quans into two categories: "Yes" (as in "Yes, it contains the accepted answer") for kuans, and "No" (as in "No, it does not contain the accepted answer") for the rest of the quans that are not kuans. Feature extraction consisted of transforming arbitrary data, such as text or images, into numerical features usable for machine learning. In our case, a quan is represented by a Python-dictionary-like object composed of 13 feature-value pairs. As previously defined, if QA is the set of quans and m its cardinality, then:

QA = {qa_1, qa_2, ..., qa_m}

where qa_i is each of the total m quans, decomposed into its thirteen features with their respective values:

qa_i = {(f_1, v_i1), (f_2, v_i2), ..., (f_13, v_i13)}
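This transformation can be sketched with scikit-learn's DictVectorizer; in the toy quans below, only percent_answered_questions_q and percent_accepted_answers_a are feature names from our experiment, the rest being illustrative stand-ins:

```python
# Toy feature-extraction sketch: each quan is a dict of feature-value pairs,
# vectorized into one numeric row per quan. Only two feature names here
# (percent_answered_questions_q, percent_accepted_answers_a) come from the
# paper; the rest are illustrative.
from sklearn.feature_extraction import DictVectorizer

quans = [
    {"question_score": 5, "answer_score": 12,
     "percent_answered_questions_q": 0.9, "percent_accepted_answers_a": 0.7},
    {"question_score": 0, "answer_score": 1,
     "percent_answered_questions_q": 0.2, "percent_accepted_answers_a": 0.1},
]
labels = ["Yes", "No"]  # "Yes" = this quan is a kuan

vectorizer = DictVectorizer(sparse=False)
X = vectorizer.fit_transform(quans)        # shape: (len(quans), n_features)
print(vectorizer.get_feature_names_out())  # column order of X
```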

e) K-fold cross-validation
The aforementioned set of feature vectors is required to train the classifier. In the basic approach, the total set is split into two sets: training set (usually 90% of the original set) and testing set (the remaining 10%). Then, the classifier is trained with the training set and tested with the testing set. In the k-fold cross-validation approach, the training set is split into k smaller sets, and the following procedure is followed for each of the k folds: first, a model is trained using k-1 of the folds as training data; and second, the resulting model is validated on the remaining part of the data (i.e., it is used as a test set to compute performance measures such as accuracy). The performance measure reported by k-fold cross-validation is then the average of the k values computed in the loop.
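A self-contained sketch of this procedure in scikit-learn, with synthetic data standing in for the real quan vectors:

```python
# 10-fold cross-validation: the reported figure is the mean of the ten
# per-fold accuracies. Synthetic data stands in for the real quan vectors.
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 13))      # 1,000 quans, 13 features each
y = rng.integers(0, 2, size=1000)    # 1 = kuan ("Yes"), 0 = not ("No")

scores = cross_val_score(LinearSVC(), X, y, cv=10, scoring="accuracy")
print(f"mean accuracy over 10 folds: {scores.mean():.3f}")
```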

f) Classifier training and testing
Finally, we did 10-fold cross-validation on our set of 72,847 quans, obtaining 10 subsets of roughly 7,285 quans each. Approximately one third of the quans are kuans (i.e., contain an accepted answer). We trained and tested several classifiers with different parameters.
These classifiers were: Ridge, Perceptron, Passive-Aggressive, Stochastic Gradient Descent (SGD), Nearest Centroid, Bernoulli Naive Bayes (NB), and Linear Support Vector Classification (SVC). Of these, the classifier with the best performance was Linear SVC.
Linear SVC belongs to the family of SVMs, a set of supervised learning methods used for classification, among other applications. SVMs have shown good performance in many natural language applications, such as text classification (Joachims 2002), and have been used in multiple studies on question classification (Blooma et al. 2008; Tamura, Takamura and Okumura 2005; Solorio et al. 2004; Zhang and Lee 2003). Table 4 summarizes the performance of all the classifiers evaluated in our experiment.
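The comparison itself can be sketched as a loop over the candidates under the same 10-fold protocol, again with synthetic stand-ins for the real data:

```python
# Compare the candidate classifiers under identical 10-fold cross-validation.
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import (RidgeClassifier, Perceptron,
                                  PassiveAggressiveClassifier, SGDClassifier)
from sklearn.neighbors import NearestCentroid
from sklearn.naive_bayes import BernoulliNB
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 13))      # stand-in for the real quan vectors
y = rng.integers(0, 2, size=1000)

candidates = {
    "Ridge": RidgeClassifier(),
    "Perceptron": Perceptron(),
    "Passive-Aggressive": PassiveAggressiveClassifier(),
    "SGD": SGDClassifier(),
    "Nearest Centroid": NearestCentroid(),
    "Bernoulli NB": BernoulliNB(),
    "Linear SVC": LinearSVC(),
}
for name, clf in candidates.items():
    scores = cross_val_score(clf, X, y, cv=10, scoring="accuracy")
    print(f"{name}: {scores.mean():.3f}")
```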

Results
The results show that Linear SVC was the classifier with the highest performance (on all four metrics) in comparison with the rest. Table 5 displays the Linear SVC performance in detail. The evaluation was done on the testing set (7,285 quans). We can see that the classifier performed slightly better for "No" than for "Yes" (first and second rows, respectively). The third row shows the average values for the total testing set of quans. Our model gave an accuracy of 88%; that is, it correctly predicted the accepted answer for 88% of the questions. This result is superior to others reported in similar works. Shah and Pomerantz (2010) measured the quality of answers on Yahoo! Answers with an accuracy of 84%, while Wang et al. (2009) identified the best answer for new questions, also on Yahoo! Answers, with a precision of 78%. Table 6 is the confusion matrix of the model. It represents the actual numbers of "Yes" and "No" instances, and how Linear SVC classified them. We also calculated the two most informative features (the two features most helpful to the classifier in classifying the quans), and they were percent_answered_questions_q and percent_accepted_answers_a, i.e., the percentage of questions by this question's asker that were answered (with respect to the total number of questions posted by this asker) and the percentage of answers by this answer's answerer that were accepted (with respect to the total number of answers posted by this answerer), respectively. We trained a new classifier with only these two features and obtained an average accuracy of 82% (86% for "No" and 76% for "Yes").
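One way to recover such a ranking from a trained linear model is to inspect the magnitudes of its coefficients; the sketch below uses synthetic stand-ins for the real data and feature names:

```python
# For a linear model, |coefficient| indicates how strongly each feature
# drives the decision; the two largest give the most informative features.
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 13))                      # stand-in quan vectors
y = rng.integers(0, 2, size=1000)
feature_names = [f"feature_{i}" for i in range(13)]  # real names come from
                                                     # the DictVectorizer step

model = LinearSVC().fit(X, y)
top2 = np.argsort(np.abs(model.coef_[0]))[-2:]
print([feature_names[i] for i in top2])

# Retraining on those two columns alone mirrors the second experiment
# (82% average accuracy in the paper).
small_model = LinearSVC().fit(X[:, top2], y)
```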
This result is unsurprisingly inferior to the previous one (88%), as we left out other important features, but it demonstrates that the two most informative features alone provide enough information for the classifier to classify the quans with fairly high accuracy. These results are shown in Figure 2.

Discussion
Furthermore, building up trust is one of the major motivations for information exchange (Barachini 2009; Krogh, Roos and Kleine 1998). Thanks to this involvement, good questions and answers are easily identified by the community itself, and unhelpful posts are removed. This keeps the quality of content very high (Anderson et al. 2012). At the same time, the level of trust in a community and the value of the knowledge that it generates are also important factors in users' willingness to collaborate (Krogh, Roos and Kleine 1998; Tsai 2000). All of this activity combines into a self-perpetuating cycle (Figure 3).

The viability of Bullipedia depends on how we face some big challenges, namely: how to engage the community that will build the content, how to turn this content into quality knowledge, and how to compete with established recipe websites.
As mentioned above, the reputation system fits the quality knowledge needs and the rewarding factors that motivate people to contribute; but social factors play an equally decisive role. Some of these are simply the common good, or less altruistic motivators such as recognition and career advancement. Reputation translates into expertise, which, in the case of SO, is very valuable for the software development industry. Only time will tell if a similar reputation system on Bullipedia will be helpful in identifying the next generation of best chefs in the world. Bullipedia will likely also lure users away from other recipe websites, since Bullipedia has an important unfair advantage in its connection to elBulli and Ferran Adrià's trademark. This is a project that has raised high expectations among culinary professionals and the media since its conception. Moreover, Bullipedia is different from its main competitors because it will be more than a recipe exchange website: it will be a platform for recipe and culinary knowledge creation.

Conclusions and future work
In this work, we have studied SO, a popular Q&A site for programming that owes its success to its committed community. This commitment is achieved by means of an elaborate reputation system and its triple role: 1) it is an implementation of the gamification employed by the site to encourage participation, 2) it is a collective mechanism to control the quality of the crowdsourced contents, and 3) it stores a reputation score for every user, that is, a measurement of their expertise on the site.
These scores provide useful metrics to evaluate the quality of the contents. Thus, we contemplated using this reputation system to predict the quality of answers. In particular, we wondered whether it was possible to predict the likely accepted answer (from a set of candidate answers) for a yet unresolved question.
We formulated this issue as a machine learning classification problem: "for every quan (question-answer pair), is that answer (likely to be) the accepted answer for that question?" We gave two possible categories: "Yes" or "No". In order to solve this problem, we first represented every quan as a feature vector; we then did a 10-fold cross-validation on our set of quans and trained and tested several classifiers with different parameters. The classifier that performed best was Linear SVC. Our key finding shows that this model correctly predicted the accepted answer with an accuracy of 88%. This result is higher than others reported in similar studies. Shah and Pomerantz (2010) measured the quality of answers on Yahoo! Answers with an accuracy of 84%, while Wang et al. (2009) identified the best answer for new questions, also on Yahoo! Answers, with a precision of 78%.
Our prediction is based on the thirteen features extracted for each quan, but the two most informative features were percent_answered_questions_q and percent_accepted_answers_a (the percentage of resolved questions posted by the asker and the percentage of accepted answers posted by the answerer, respectively). These two metrics suggest the importance of asking clear questions in obtaining an answer, and of giving good answers in having them accepted. When a question is asked by a good asker and the answer is provided by a good answerer, the probability that the question and the answer form a kuan (question-accepted_answer pair) increases markedly (to 82% according to our second experiment, and up to 88% according to our first one).
These results reaffirm Shah and Pomerantz's (2010) findings: they reported that users' information is critical in evaluating and predicting content quality.
A question and its accepted answer constitute reliable knowledge, as they provide the solution to a specific problem that a user had in the past and that other users may face in the future. From this reliable knowledge (obtained by identifying questions and their accepted answers) we can build a repository that contains exclusively quality knowledge from SO. This idea can be extrapolated to any domain. We have demonstrated that Linear SVC is suitable for Q&A classification, so, if we implemented a similar forum and reputation system on the future Bullipedia, it would be possible to apply this same idea to predict the best answer for unsolved questions on the gastronomic encyclopedia. Questions like "Is it safe to cook chicken in a slow-cooker?" or "What's a good dressing for a salmon salad?" would have a best answer because our classifier would select it from all the supplied answers.
We believe that a Q&A forum with gamification implemented through a reputation system, plus elBulli and Adrià's trademark, will form a delightful combination that will raise high expectations and engage a large, enthusiastic community.
This engagement would create quality knowledge and increase the potential of the Bullipedia project, from which the next generation of best chefs in the world might arise. In the future, we plan to research whether it is also feasible to use the reputation system to identify influential users in the community (well-known chefs, other users with high reputations, good askers, or good answerers) and improve the process of knowledge creation by analyzing the habits of these key actors.