1 Introduction

Online epistemic communities, or communities of creation, aim at creating new knowledge [1, 2]. Two well-known examples of these communities are open source software projects and Wikipedia, the largest free and open-access online encyclopedia. As self-organizing online groups, these communities perform a large number of activities, such as developing software and editing Wikipedia articles, but they also prioritize the work of participants (via bug lists or lists of articles in need of improvement) and manage the global organization of the project. This organizational model of the online epistemic community is at the core of new and innovative knowledge generation. At the same time, the path to community success is uncertain and risky. As with building a business, most attempts fail, no matter how the money was spent.

The questions of how to integrate newcomers and of what makes a good (efficient and effective) team are intensively discussed in the recent literature. Many crucial factors, at various stages of group composition, are stressed: motivation of participants [3], governance structure [8], culture and ideology [4], social structure and network ties [6, 7] and social identity [8, 9]. In general, these studies highlight the reasons individuals participate in online communities and the manner in which members are organized to reach the group’s target. In particular, they are interested in the processes by which members coordinate their work. They say less about members’ characteristics and about how member diversity affects the quality of the knowledge produced by their virtual work.

To understand the production of an article as common knowledge, we base our analysis on the Institutional Analysis and Development framework shown in Fig. 1. This framework distinguishes the characteristics of the community, or the “inputs” (“biophysical characteristics”, “attributes of the community”, “rules-in-use”), from the “action arena”, which constrains the way people interact, leading to “outcomes” [10].

Fig. 1. Institutional analysis and development framework [10]

Based on this framework, we can overlook neither the effect of the process nor the effect of the community attributes on the quality of the articles. Before discussing the process and the action arena, we need to identify the appropriate group based on member attributes, because these attributes greatly affect the way members work together and how effective their collaborative efforts will be. In traditional organizations, performance has been linked to members’ characteristics and their diversity [11]. However, in teams using new collaborative technologies, the negative effects of these factors may be limited by capabilities such as visual anonymity and equality of participation. Likewise, the positive effects may be enhanced by additive capabilities such as coordination support and the use of electronic tools. On the one hand, then, two key factors must be considered: group characteristics in general and member diversity. More specifically, the focus should be placed on characteristics reflecting the similarity or dissimilarity of members’ attributes.

On the other hand, another specific characteristic of online groups is their self-organizing nature. In contrast to work groups in virtual organizations, where membership is determined by management control and the organizational model, self-organized online groups rely on voluntary involvement. It is therefore important to look at how membership is constituted, based on the careers of the contributors before the creation of the article. This factor also plays an important role in the success of online collaboration. However, this idea has not been developed in the case of Wikipedia as an online epistemic community. Most existing studies on the subject focus on what happens after article creation. They link knowledge quality to group characteristics within the same project but neglect what may have happened before the studied project.

Our study advances the conditions under which such crowd-based groups create high-quality knowledge in online epistemic communities. In this paper, we investigate the following research questions:

  1. What are the characteristics of a “good” team to produce new knowledge?

  2. How does member diversity or similarity affect the outcomes that a group produces in an online epistemic community?

As the primary contribution of our work, we developed a theoretical model of initial structure, in terms of group composition and member diversity in virtual teams, based on what happened before article creation. We further integrated various social theories to develop these propositions, considering how structure can be instantiated in shared mental models and the specific behaviors that contribute to building such models.

The paper is organized as follows: the next section reviews related work on group composition and core members. The following section develops hypotheses about the effects of group composition on the quality of the knowledge produced. Then, we describe our research methods, followed by our results. Finally, we discuss our findings and highlight implications for both theory and practice, before concluding with possible directions for future research.

2 Literature Review

There is a large body of literature describing structure in online communities. To develop our hypotheses, we draw on two strands of literature. The first is the group composition literature, in which Tan et al. [12] show that group composition is an important factor in online community success. The second is the extensive literature on core members and their relationship with knowledge quality [13].

2.1 Group Composition

Research on team composition is based on members’ attributes and on the effects of the combination of these attributes on the action arena and outcomes [14]. Two important dimensions of group composition have been investigated in the literature: the mean level of a group characteristic and the diversity within the group [13].

On the one hand, the mean level is considered a simple combination of group members’ attributes [15, 16]. The average of member attributes serves as a summary based on some measure of members’ characteristics [14].

On the other hand, diversity in group work is defined as the heterogeneity or dissimilarity of individual attributes that arises when members differ from one another on any attribute [9]. These attributes can be socio-demographic characteristics (age, gender, race, and nationality), informational ones (tenure, experience, education, …) or deeper individual differences (beliefs, values and personality). The relationship between team diversity and outcomes has been extensively studied [9].
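The two composition dimensions described above can be illustrated with a minimal sketch. The coefficient of variation used here for diversity is only one common choice for numeric attributes such as tenure; it is an assumption made for illustration, not necessarily the measure used in this study, and the sample values are hypothetical.

```python
from statistics import mean, pstdev


def mean_level(values):
    """Mean level of a group characteristic: simple average over members."""
    return mean(values)


def disparity(values):
    """Coefficient of variation (population std dev divided by the mean).

    Assumed diversity measure for a numeric attribute; returns 0.0 when
    the mean is zero to avoid a division by zero.
    """
    m = mean(values)
    return pstdev(values) / m if m else 0.0


# Hypothetical tenure (in months) of the members of two groups.
homogeneous = [12, 12, 12, 12]
mixed = [1, 2, 24, 60]

print(mean_level(homogeneous), disparity(homogeneous))
print(mean_level(mixed), disparity(mixed))
```

Both groups can then be compared on the same two summary values, one for the mean level and one for the within-group diversity.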

This line of research has focused on the dimensions of the group’s activity, based on what is done during article creation. It does not detail how members’ previous activity determines group performance and the quality of results. To forecast knowledge quality, we use these group composition dimensions, focusing on the core members on the basis of their previous behavior.

2.2 Core Member Constitution

Some studies on epistemic communities, especially in the case of open source software, have stressed the importance of core members [13, 17, 18]. In this context, they define the core-periphery structure by members’ activity: a set of highly active members accounts for the majority of project contributions, while a large and loosely coupled group of peripheral members supports them. Core members are mostly more active than the others, and their presence is important for the success of online communities.

These studies lead us to hypothesize that Wikipedia projects also have a core-periphery structure, similar to the one found in open source software production, and that the initiators of a project are its ‘natural’ core members and are key to its success. In this study, we examine how to identify the core members of an article and whether their characteristics affect the quality of the article produced.

3 Research Hypotheses

As mentioned above, most studies look at the contributions made after the first contribution. In our study, we are interested in what happened before. We define the “first contribution” as the first time an editor makes a revision and proposes a new article as new and original knowledge.

Based on these definitions, we define the article creation life cycle as described in Fig. 2. The first contribution acts as a separator that allows us to define three main periods. “The learning period” is the period before the first contribution, in which we learn about the members’ attributes based on their history. “The teaming period” is the period of four months after the first contribution, in which the core members are constituted. “The active period” is the period just after the teaming period, when teams start their activities. These definitions help to refine our main hypothesis:

Fig. 2. Article life cycle in Wikipedia

H: “The teaming period carries important information about the initial group composition, drawn from what we learned about the members in the learning period, and this composition in turn affects the quality of the knowledge produced by the group.”
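The three periods defined above can be sketched as a simple classification of revision timestamps relative to the first contribution. This is a minimal illustration: the function name `period_of` is hypothetical, and approximating “four months” by 120 days is an assumption, since the exact cutoff is not stated.

```python
from datetime import datetime, timedelta

# Assumption: the four-month teaming period is approximated as 120 days.
TEAMING_DAYS = 120


def period_of(revision_time, first_contribution):
    """Return the life-cycle period in which a revision falls.

    Revisions before the first contribution belong to the learning period
    (members' history), those within the next TEAMING_DAYS days to the
    teaming period, and all later ones to the active period.
    """
    teaming_end = first_contribution + timedelta(days=TEAMING_DAYS)
    if revision_time < first_contribution:
        return "learning"
    if revision_time < teaming_end:
        return "teaming"
    return "active"


# Hypothetical article created on 1 March 2010.
created = datetime(2010, 3, 1)
print(period_of(datetime(2009, 11, 15), created))
print(period_of(datetime(2010, 4, 10), created))
print(period_of(datetime(2011, 1, 5), created))
```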

Our main hypothesis argues that group composition is responsible for the success of an online epistemic community. The way in which groups form themselves and automatically recruit and attract participants may affect group composition, which in turn indirectly affects knowledge quality. In our study, a key factor to consider is member characteristics within the virtual work environment, at both the individual and the group level. In addition, we provide insights into member diversity, which reflects whether group members have different or similar attributes.

3.1 Effects of Initial Group Size on the Quality of Knowledge Produced by a Virtual Group

Many studies have tried to determine whether small or large groups are more likely to cooperate on a project and produce knowledge. Group size can affect quality in different ways. On the one hand, large groups provide a large amount of knowledge and correct errors and discover incomplete information faster [19]. On the other hand, small groups often lack the resources that large groups can deploy. These limited resources make it difficult to devote additional effort to producing an article within Wikipedia as a collective action [20]. Ostrom suggested that further research on collective action should focus on the hypothesized curvilinear effects of group size [10]. Based on Ostrom’s proposition, we hypothesize that:

H1: “Group size in the teaming period is an important determinant of project success in an online epistemic community and has a curvilinear effect on article quality.”

3.2 Effects of Diversity of Member Characteristics Within the Virtual Group on the Quality of Knowledge Produced

Some studies of team dynamics show that, in the case of low-skilled workers, heterogeneous teams are more productive than heterogeneous isolated workers. Our study of diversity in Wikipedia extends the literature on diversity in virtual teams. We test theoretical propositions about the effects of diversity in order to determine whether diversity should be encouraged or discouraged.

3.2.1 Effects of Experience Disparity on the Quality of Knowledge Produced by a Virtual Group

The members of a team may be classified according to their experience. Some are newcomers with little experience and few skills. The others, named old-timers or incumbents, are persons with identifiable talents.

Evidence from virtual organizations has shown that, although old-timers are more experienced and skilled than newcomers, their effort is generally lower [21]. A blend of incumbents and newcomers ensures sufficient group experience to establish and maintain the task structure and, at the same time, brings in the new ideas and information needed to complete the task [9]. However, when experience disparity increases, old-timers and newcomers may hold different views of collaborative work. Experience diversity may thus reduce communication and social integration, and it has been linked to increased conflict. This leads to our next hypothesis:

H2: “There is a nonlinear relationship between experience diversity and article quality. Article quality increases as experience diversity increases. However, beyond a certain level of experience diversity, article quality decreases.”
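One common way to formalize such an inverted-U relationship is a quadratic specification. The symbols below (Q for article quality, D for experience diversity, and the coefficients β) are introduced here for illustration only; this is a sketch of the hypothesized shape, not the model estimated in this study.

```latex
Q = \beta_0 + \beta_1 D + \beta_2 D^2 + \varepsilon,
\qquad \beta_1 > 0,\ \beta_2 < 0,
\qquad
\frac{\partial\, \mathbb{E}[Q]}{\partial D} = \beta_1 + 2\beta_2 D = 0
\;\Longrightarrow\;
D^{*} = -\frac{\beta_1}{2\beta_2} > 0 .
```

Under these sign assumptions, expected quality rises with diversity up to the turning point D* and declines beyond it, which is the pattern the hypothesis describes.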

3.2.2 Effects of Reputation Diversity on the Quality of Knowledge Produced by a Virtual Group

User reputations are computed according to the number of past contributions, the quality of the articles produced and the quantity of succeeding edits (see [22]). Reputation systems are considered one of the primary factors in the success of online communities [18]. By exploring German Wikipedia, [22] showed that high-quality articles are not necessarily edited by a large number of people; what matters most is that they are written by contributors with a reputation for high-quality contributions. On the other hand, [23] found that the highest-quality contributions come from the large number of anonymous users who rarely contribute. We must therefore find a balance between users with a high level of reputation and anonymous users.

H3: “High-quality content in Wikipedia comes from the mean level of reputation distributed among members during the teaming period. In addition, there is a nonlinear relationship between reputation diversity and article quality. Article quality increases as reputation diversity increases. However, beyond a certain level of reputation diversity, article quality decreases.”

4 Research Method

4.1 Wikipedia Case Study

To address our research questions and to test the proposed hypotheses, we chose Wikipedia as our data source. Wikipedia is one of the greatest success stories of peer collaboration and the best-known example of an online epistemic community.

The first important characteristic of Wikipedia is data availability. Many data records, publicly available online, contain useful information about Wikipedia [24]. Furthermore, we chose Wikipedia as a case study because of its connection with firms’ knowledge management and production challenges. Another important aspect of Wikipedia is that a MediaWiki site is maintained for each language edition.

4.2 Data Collection

The best way to retrieve large portions, or the whole set, of activity data from any Wikipedia language edition is to use the database dump files. These dump files contain precise information about all actions performed in a given language edition. Dump files can be retrieved from the Wikimedia Downloads center.

For our purposes, we are interested in French Wikipedia. We first retrieve the French Wikipedia XML database dump file “pages-meta-history.xml.7z” from the set of available dump files. We use data extracted on December 12, 2015. These data contain the complete metadata of every version of all French articles, from the beginning of the online encyclopedia (January 2001) to December 2015.

Once the dump file is loaded, we use WikiDAT and the MediaWiki API for data extraction. WikiDAT is a tool for Wikipedia data analytics, based on Python and R and using a MySQL database. It aims to provide an extensible toolkit that automates the extraction of Wikipedia data into five tables of a MySQL database (page, people, revision, revision hash, logging).

The MediaWiki action API is a web service that allows the collection of data and metadata from Wikipedia. It is available for several languages, in particular French, the one we are concerned with. It is maintained by the MediaWiki project and has well-structured documentation; query results can be returned as JSON. In our project, we used it to retrieve the articles that belong to the categories “Featured Articles” and “Good Articles”.
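As an illustration of this kind of retrieval, the sketch below builds a `list=categorymembers` request for the action API and parses one page of its JSON response, following the `cmcontinue` token for pagination. It is a minimal sketch, not the script used in this study: the helper names are hypothetical, and the French category titles “Catégorie:Article de qualité” and “Catégorie:Bon article” are assumptions about how the two quality classes are named on French Wikipedia. No network call is made here; the sample payload is hand-written.

```python
import json
from urllib.parse import urlencode

API_URL = "https://fr.wikipedia.org/w/api.php"

# Assumed French category titles for the two quality classes.
QUALITY_CATEGORIES = ["Catégorie:Article de qualité", "Catégorie:Bon article"]


def category_query_url(category, cmcontinue=None):
    """Build the URL listing the members of one category (500 per page)."""
    params = {
        "action": "query",
        "list": "categorymembers",
        "cmtitle": category,
        "cmlimit": "500",
        "format": "json",
    }
    if cmcontinue:
        # Token returned by the previous page, used to fetch the next one.
        params["cmcontinue"] = cmcontinue
    return API_URL + "?" + urlencode(params)


def parse_category_page(payload):
    """Return (titles, next_token) from one JSON response page."""
    data = json.loads(payload)
    titles = [member["title"] for member in data["query"]["categorymembers"]]
    next_token = data.get("continue", {}).get("cmcontinue")
    return titles, next_token


# Hand-written sample response with hypothetical page entries.
sample = json.dumps({
    "continue": {"cmcontinue": "page|ABC|123", "continue": "-||"},
    "query": {"categorymembers": [
        {"pageid": 1, "ns": 0, "title": "Exemple A"},
        {"pageid": 2, "ns": 0, "title": "Exemple B"},
    ]},
})
print(category_query_url(QUALITY_CATEGORIES[1]))
print(parse_category_page(sample))
```

In practice the URL would be fetched repeatedly, passing the returned token back in, until no `cmcontinue` value is present in the response.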

4.3 Data Preparation

4.3.1 Data Selection

We preprocessed the XML records in the raw data using WikiDAT into a tabular data set representing 7833289 articles and 114907858 edits. We used the MediaWiki API to retrieve article quality; a qualified article needs to belong to one of two classes: “Featured Articles” and “Good Articles”. There are 2920 qualified articles (1 % of total number), and we randomly sampled 2500 non-qualified articles. We finally analyzed these articles based on their revisions and on the history of their members’ revisions of other articles. As a first step, we start by analyzing 100 articles drawn from the 114907878 editors’ revisions.
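The selection step described above can be sketched as follows: all qualified articles are kept and a fixed-size random sample is drawn from the remaining ones. This is a minimal sketch under assumptions: the function name `build_sample`, the fixed seed and the integer article identifiers are hypothetical and only serve to make the example reproducible.

```python
import random


def build_sample(article_ids, qualified_ids, n_unqualified=2500, seed=0):
    """Label qualified articles 1 and a random sample of the others 0."""
    qualified = set(qualified_ids)
    # Sort the pool so that the sample depends only on the seed.
    pool = sorted(a for a in article_ids if a not in qualified)
    rng = random.Random(seed)
    sampled = rng.sample(pool, min(n_unqualified, len(pool)))
    return [(a, 1) for a in sorted(qualified)] + [(a, 0) for a in sampled]


# Toy example with hypothetical identifiers.
dataset = build_sample(range(100), qualified_ids=[3, 17, 42], n_unqualified=10)
print(len(dataset), sum(label for _, label in dataset))
```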

4.3.2 Variable Measure

In this part, Table 1 describes the different variables we consider for modeling our hypotheses.

Table 1. Variables measure

4.4 Data Analysis

In this paper, we examined WikiProjects from 2001 to 2015 in order to understand how group composition and diversity characteristics affect the quality of the knowledge created. For data analysis, we used Random Forest to select the most relevant group attributes leading to successful articles. We built a predictive model using the random forest algorithm.

This algorithm is a classification method that operates by constructing many decision trees at training time and outputs the attributes ranked by importance [29].

For our predictive model, we split the dataset into a training set and a test set. The training set consisted of a random 70 % of all the articles, and the test set contained the remaining 30 %. The predictive model was then trained on the training set and applied to the test set to predict article quality. We compared these predictions with the real values of article quality. To obtain the accuracy of each predictive model, we computed the ratio of correct predictions to the total number of predictions. This process was repeated 5000 times to smooth over extreme cases.
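The evaluation procedure above (random 70/30 split, accuracy on the test set, repetition and averaging) can be sketched as follows. This is an illustrative harness only: `fit_majority` is a deliberately trivial stand-in learner, not a random forest, and all function names and the toy data are hypothetical. Any function that takes training rows and labels and returns a predictor could be passed in its place.

```python
import random


def holdout_accuracy(rows, labels, fit, train_share=0.7, rng=random):
    """Train on a random train_share of the data, return test accuracy."""
    indices = list(range(len(rows)))
    rng.shuffle(indices)
    cut = int(train_share * len(indices))
    train, test = indices[:cut], indices[cut:]
    predict = fit([rows[i] for i in train], [labels[i] for i in train])
    hits = sum(predict(rows[i]) == labels[i] for i in test)
    return hits / len(test)


def repeated_holdout(rows, labels, fit, repeats=5000, seed=0):
    """Average the hold-out accuracy over many random splits."""
    rng = random.Random(seed)
    scores = [holdout_accuracy(rows, labels, fit, rng=rng)
              for _ in range(repeats)]
    return sum(scores) / len(scores)


def fit_majority(train_rows, train_labels):
    """Stand-in learner (NOT a random forest): predicts the majority label."""
    majority = max(set(train_labels), key=train_labels.count)
    return lambda row: majority


# Toy data: one hypothetical attribute per article, label 1 = qualified.
rows = [[i] for i in range(20)]
labels = [1] * 14 + [0] * 6
print(repeated_holdout(rows, labels, fit_majority, repeats=200))
```

Averaging over many splits reduces the influence of any single unusually easy or hard test set, which is the purpose of the repetition described in the text.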

5 Results

Descriptive statistics are presented in Fig. 3 for French Wikipedia. The random forest method provides values that quantify the importance of each attribute for the quality of the prediction. Variable importance is a critical output of the random forest algorithm: for each variable, it indicates how much that variable contributes to classifying the data and predicting the dependent variable. The plot in Fig. 3 orders the variables from top to bottom, from the most important to the least important. The importance estimate is given on the x-axis; it is computed as the average decrease in accuracy of each tree in the forest in the absence of the given attribute. The higher this value, the more important the attribute is for the prediction. To choose the important variables, we look for a large break between variables. By this criterion, Fig. 3 shows that the large break lies between max_Nb_participation and diversity_participation_FA.

Fig. 3. Attribute importance for French Wikipedia
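The idea behind the mean-decrease-in-accuracy measure can be illustrated with a simplified, model-agnostic permutation importance: the values of one attribute are shuffled and the resulting drop in accuracy is averaged over several repetitions. This is a sketch under assumptions, not the per-tree, out-of-bag computation of a random forest implementation; the `predict` function and the toy data are hypothetical.

```python
import random


def accuracy(predict, rows, labels):
    """Share of rows for which the prediction equals the true label."""
    return sum(predict(r) == y for r, y in zip(rows, labels)) / len(rows)


def permutation_importance(predict, rows, labels, n_repeats=20, seed=0):
    """Mean drop in accuracy when each attribute column is shuffled."""
    rng = random.Random(seed)
    baseline = accuracy(predict, rows, labels)
    importances = []
    for col in range(len(rows[0])):
        drops = []
        for _ in range(n_repeats):
            shuffled = [r[col] for r in rows]
            rng.shuffle(shuffled)
            # Rebuild every row with the shuffled value in position col.
            permuted = [r[:col] + [v] + r[col + 1:]
                        for r, v in zip(rows, shuffled)]
            drops.append(baseline - accuracy(predict, permuted, labels))
        importances.append(sum(drops) / n_repeats)
    return importances


# Toy model that only looks at the first attribute.
rows = [[i, i % 3] for i in range(12)]
labels = [int(i > 5) for i in range(12)]
predict = lambda row: int(row[0] > 5)
print(permutation_importance(predict, rows, labels))
```

In this toy case the second attribute receives an importance of zero because the model never uses it, while shuffling the first attribute degrades accuracy.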

Our model contains many attributes, and some of them may be redundant because they are highly correlated with others. Hence, we calculated the Pearson correlation for all pairs of attributes, as shown in Fig. 4, and removed one element of every pair for which the absolute value of the correlation exceeded 0.8. The most important attributes overall, those that stand out in predicting a successful French Wikipedia project, are therefore average reputation, group size, diversity of contribution and participation, and average experience.

Fig. 4. Pearson correlation for all attribute pairs of French Wikipedia
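The correlation filter described above can be sketched as follows. Which member of a highly correlated pair is removed is not specified in the text; this sketch assumes that the attribute encountered first is kept. The attribute names and values in the example are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences.

    Undefined (division by zero) if either sequence is constant.
    """
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def drop_correlated(columns, threshold=0.8):
    """Keep an attribute only if |r| <= threshold with every kept one.

    Assumption: within a correlated pair, the attribute seen first
    (in dictionary order of insertion) is the one that is kept.
    """
    kept = []
    for name in columns:
        if all(abs(pearson(columns[name], columns[k])) <= threshold
               for k in kept):
            kept.append(name)
    return kept


# Hypothetical attribute values for five articles.
columns = {
    "group_size": [1, 2, 3, 4, 5],
    "nb_edits": [2, 4, 6, 8, 11],
    "avg_reputation": [5, 1, 4, 2, 3],
}
print(drop_correlated(columns))
```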

6 Discussion

The results detailed in Sect. 5 indicate that the most important variable is average reputation. This means that the average effective participation of the authors who edit qualified articles is higher than the participation of authors contributing to other articles. Similarly, the editors who wrote qualified articles participate during the teaming periods of qualified articles. At the same time, average reputation is more important than maximum reputation. More specifically, average reputation is more important than reputation diversity, which has no significant importance for predicting qualified articles. At the beginning of the core member recruitment process, recruiting editors with high or heterogeneous reputations is not necessary. What matters most is to recruit well-known editors, judged either by their edits or by the pages they have edited.

The results on the effects of experience differ from the findings of [9], which posit that high experience or tenure disparity leads to high productivity. Experience diversity does not have any significant impact on article quality during the teaming period. This means that core members do not need to have diverse experience to produce a qualified article. In contrast, the average experience distributed among core members has a slight importance for article quality.

Experience in Wikipedia can generally confer social status. Long-standing editors and newcomers can refer to each other, which may cause conflict in a WikiProject and reduce performance. In particular, conflicts at the beginning of article creation, during the teaming period, can be especially damaging to online volunteer groups. As in prior research on offline groups, high tenure diversity among core members increases conflict at the beginning. However, status inequalities during the teaming period can be less prominent in online groups. Therefore, in online volunteer groups such as Wikipedia, most editors should participate at the beginning with equal experience.

Furthermore, people in online volunteer groups are categorized based on experience and are treated differently. But in the first stage of the teaming period, having experience distributed equally among members preferably leads to a high-quality article.

Interestingly, we found that while diversity in experience and reputation has no important impact on article quality, the amount of contributions and their length do matter, and in a single direction: the more diverse and the longer, the better. Our results show that group size is also an important attribute for qualifying the article produced. This can be understood as follows: teams that rapidly attract contributors (within the four-month period) have a better chance of success, which relates to the fact that a well-known indicator of article quality is the number of contributors who have participated in writing it. This is consistent with the finding related to the control variable. Our results indicate that article age had a positive effect on all constructs by extending the teaming period of all articles. The construction of a quality article is mainly a question of stock accumulation (here, of edits): the longer the period, the better the chance that new edits have been made.

7 Conclusion and Future Works

The first contribution of our work is to develop a set of theoretical models of structure in terms of group composition in a virtual team. We further integrate various social theories to develop these models, taking into account the manner in which the structure can be instantiated in collaborative models.

We started by showing the importance of studying group composition in online open collaboration. Existing research on online collaboration has focused mostly on social structure, governance and motivation. In contrast, our results point to other attributes of group members that influence group success. Our findings, on the one hand, confirm the importance of identifying group composition in online collaboration based on contribution history and, on the other hand, highlight the importance of diversity in some attributes.

As a second step in our model, we will work on the manner in which members organize their activities. In particular, we will study the different kinds of leadership and their effect on article quality. Based on theoretical suggestions about leadership in virtual teams, we will present three orders of leadership that seem likely to be more effective.