Dataset of open-source software developers labeled by their experience level in the project and their associated software metrics

Developers are extracted from 17 open-source projects hosted on GitHub. Projects are chosen that use the Java programming language, the Spring framework and the Maven/Gradle build tools. For each of these developers, 24 software engineering metrics are extracted. These metrics are either computed by analyzing the source code or derived from project management metadata. Each developer is then manually searched for in professional social media such as LinkedIn or Twitter to be labeled with his/her experience level in the project. Outliers are statistically detected and manually re-assigned when needed. The resulting dataset contains 703 anonymized developers, qualified by their 24 project-related software engineering metrics and labeled with their experience level. It is suitable for empirical software engineering studies that need to connect developers' experience levels to tangible software engineering metrics.


Specifications table
Subject: Software Engineering
Specific subject area: Labeled dataset of developers extracted from GitHub open-source projects associated with 24 software metrics
Type of data:

Value of the Data
• This dataset contains more than 700 developers extracted from 17 open-source projects hosted on GitHub, with 24 software metrics computed for each developer. The value of this dataset comes both from its size (24 metrics for 703 developers) and from the different pieces of information attached to each developer (metrics and experience levels).
• Developers in the dataset are manually labelled with one of the following labels: Experienced Software Engineer, Software Engineer, Bot, Other, Unknown. The quality of the labelling is improved by a statistical analysis followed by a manual inspection of outliers and a re-labelling when needed.
• Gathering information about software developers, and more particularly about their experience level in open-source projects, is a cumbersome task. Hence, this dataset should be of interest to researchers in software engineering.
• The dataset can be used to perform empirical studies in software engineering, more precisely about the characteristics of software developers or the relations between project code quality and developers. Moreover, it can be used in machine learning approaches (either unsupervised or supervised) thanks to both the labelling and the number of software metrics associated with each developer.

Objective
This dataset was created in the context of empirical software engineering and machine learning. The data has been extracted from open-source GitHub projects and is related to a research article [3]. This dataset is provided openly to researchers working in empirical software engineering and machine learning, to ease their data collection, developer-related software metrics computation and data labelling. Such datasets are rare in this context, as building one requires both heavy computation and tedious manual indexing; it is therefore important for us to share it widely with the scientific community. Furthermore, this article matters for reproducibility, as it clearly documents the retrieval process of the data used in the companion research article [3]. The dataset could also serve as a benchmark for comparing the performance of future research in this field. By officially publishing this dataset through Data in Brief, the authors wish to advertise its solid conception.

Data Description
The dataset of experienced developers is composed of 703 developers extracted from 17 open-source projects hosted on GitHub [4]. The selected GitHub projects are mainly written in Java and all use the Java Spring Framework [5]. This framework provides languages (such as a deployment descriptor XML dialect and Java annotations) that support the definition of the architecture that will be automatically instantiated by the system to execute an application. The projects also use the Gradle [6] and Maven [7] build automation tools. The use of these technologies is a deliberate choice, in order to constitute a dataset of developers working with a Java ecosystem (Gradle/Maven, Java, etc.), Spring and GitHub. Table 1 provides metadata on the 17 selected projects: their total number of developers, their number of stars on GitHub, their GitHub creation date and their URL. The numbers of both developers and stars vary with time; values in Table 1 are those retrieved on 2021/09/22. The selection criteria are described below (in Section Experimental design, materials and methods).
Developers from those 17 projects are extracted using the GitHub API [8]. For each developer of each project, 24 metrics, described in Table 2, are computed.
Four metrics (Number of Commits (NoC), Followers, Days in Project (DiP) and Inter-commit Time (ICT)) are process metrics (i.e. metrics monitoring the development process). The remaining 20 metrics described in Table 2 are code metrics. These metrics measure the software architecture at different scales (or granularities), shown in Fig. 1. Moreover, to choose these metrics, we rely on the work of Di Bella et al. [9] and Perez et al. [10]. Di Bella et al. use an unsupervised method to classify developers into 4 groups, from rare to core developers. They show that several metrics are discriminant for this classification: Number of Commits, Lines of Code, Days in Project and Inter-commit Time. Hence, we choose to reuse these metrics in our dataset. Perez et al. use Spring markers (specific Java annotations) to statistically distinguish categories of developers experienced in working on the runtime architecture of the software. Therefore, we also reuse the three variables they identified as specific to the Spring runtime software architecture. Table 3 statistically describes the 24 metrics with figures computed on the whole dataset. For each metric, we compute its:
• minimum (Min) and maximum (Max) values,
• 1st (25%) and 3rd (75%) quartiles,
• and median.
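The per-metric figures reported in Table 3 can be reproduced with a few lines of pandas; this is a minimal sketch on a toy extract, where the column names (NoC, DiP) follow Table 2 but the values are illustrative, not taken from the dataset.

```python
import pandas as pd

# Toy extract with two of the 24 metrics; values are illustrative only.
df = pd.DataFrame({
    "NoC": [1, 3, 12, 250, 4],      # Number of Commits
    "DiP": [1, 40, 300, 2900, 15],  # Days in Project
})

# Same figures as Table 3 for each metric:
# minimum, 1st quartile, median, 3rd quartile, maximum.
summary = df.describe(percentiles=[0.25, 0.5, 0.75]).loc[
    ["min", "25%", "50%", "75%", "max"]
]
print(summary)
```

On the real dataset, the same call applied to all 24 metric columns yields the full Table 3.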
We check that the computed metrics are consistent, for instance that AB + NAB = CE + NCE. As seen in Table 3, the metrics exhibit a large statistical dispersion, due to some developers having a high level of seniority and therefore a high level of contribution to projects.
Developers in the dataset are manually labelled according to their experience level in their project, using one of the following labels: Software Engineer (SE), Experienced Software Engineer (ESE), Software Architect (SA), Non Software Engineer (NSE), BOT and Unknown (UNK). Labels are described below (in Section Experimental design, materials and methods). Fig. 2 presents the total number of developers per experience level. The major part of the dataset (505 out of 703 developers) is composed of developers whose role was not found. This comes from the nature of open-source projects, where a large proportion of developers are very occasional or even contributed only once. In the other categories, except for the BOT category, a total of 188 developers have a clearly identified experience level. There is a good balance between software engineers (73) and experienced software engineers (69). 29 developers are software architects, whereas 17 clearly identify as having a specific IT role (such as UX/UI designer or project manager) while not being developers. Finally, 10 developers are identified as BOTs, i.e. continuous integration systems such as Jenkins or Travis, which automatically commit on GitHub repositories. Fig. 3 shows the number of developers per experience level for each project (represented using a logarithmic scale). As shown in Fig. 2, in all projects, a majority of developers have an unknown role (UNK). Four projects (Activiti, Broadleaf, dhis2-core and flowable-engine) have developers in all identified categories (SE, ESE, SA, NSE, BOT). The other projects have only a few SEs, ESEs or SAs.

Experimental Design, Materials and Methods
The data acquisition process, described using the Business Process Model and Notation (BPMN) [11], is shown in Fig. 5. Its steps are the following:
1. GitHub project selection: we manually select 17 projects from GitHub using the quality criteria given by Kalliamvakou et al. [12] for open-source repository mining. We also add extra selection criteria to target projects that use the Spring Framework and have at least two developers.
2. Data acquisition for developers (parallel tasks):
(a) Developer extraction from projects: we extract the set of 951 developers from the 17 selected projects using the GitHub API. Each extracted developer is linked to its project; thus, a developer appearing in two projects is considered as two distinct developers.
(b) Developer metadata retrieval: extracted data about developers contain the username, name and email given in developers' GitHub accounts.
3. Data acquisition for metrics (parallel tasks):
(a) Source code retrieval: for each project, we collect the source code.
(b) Commit retrieval: we acquire project histories, composed of the set of all commits from the first (date of the project creation on GitHub) to the latest (date of the dataset retrieval, as given by the commit ID in Table 4).
4. Metrics computation for each developer: using a modified version of the PyDriller tool [1], we compute the 24 metrics described in Table 2. For each project and developer, metrics are computed over the whole project history extracted at Step 3. Table 5 presents 4 global metrics characterizing the extracted software projects. We use the cloc software [13] to compute the number of files and lines of code listed in Table 5. Table 4 also gives the number of developers present in the dataset for each project.
5. Data cleaning: we perform a manual cleaning step to exclude developers that did not change at least one line, as synthesized in the following six variables: AddLGM, DelLGM, AddLoC, DelLoC, AddSAM, DelSAM. When the sum of these six variables equals zero, the developer is removed from the dataset. By this means, the dataset is reduced from 951 to 703 developers.
6. Developer labelling: each developer extracted from GitHub is mapped to his/her experience level in the project through a three-step process (see Fig. 4). The labelling process mainly relies on a manual search on the Internet for each developer, using his/her GitHub username and name. We trust this labelling method because many developers use social networks [14]. We collect developers' experience levels from LinkedIn [15], Twitter [16] and GitHub profiles or project documentation websites. When a developer's GitHub name is found in one of those sources, we check that the developer mentions that he/she is working on the given project (so as to prevent confusion with potential homonyms). The developer's profile is manually read through to determine the developer's label. The list of labels used to qualify developers' experience is inspired by the 2021 Stack Overflow Developer Survey [17]. After this first step, a statistical analysis (isolation forest) is performed to detect labelling outliers with respect to their metrics values. The outliers are then reviewed manually in a third step, in order to check their labelling and correct it if needed.
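The cleaning rule of Step 5 (drop any developer whose six change metrics sum to zero) can be sketched as a simple filter; the column names come from the step above, the developer names and values are made up for illustration.

```python
import pandas as pd

# The six change-related variables used in the cleaning step.
CHANGE_COLS = ["AddLGM", "DelLGM", "AddLoC", "DelLoC", "AddSAM", "DelSAM"]

# Toy extract: dev_b never changed a single line.
df = pd.DataFrame({
    "developer": ["dev_a", "dev_b", "dev_c"],
    "AddLGM": [0, 0, 1], "DelLGM": [0, 0, 0],
    "AddLoC": [10, 0, 3], "DelLoC": [2, 0, 0],
    "AddSAM": [0, 0, 0], "DelSAM": [1, 0, 0],
})

# Keep only developers whose six change metrics do not all sum to zero.
cleaned = df[df[CHANGE_COLS].sum(axis=1) > 0].reset_index(drop=True)
print(cleaned["developer"].tolist())  # ['dev_a', 'dev_c']
```

Applied to the full extraction, this filter is what reduces the dataset from 951 to 703 developers.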
Following is a detailed description of this three-step process.
Step 1: Manual labelling. Each developer is searched for on LinkedIn, Twitter and GitHub profiles or project documentation websites, using his/her GitHub username and name. When a developer is found in one of those sources, we check that the developer mentions that he/she is working on the given project (so as to prevent confusion with potential homonyms). If the profile of a given developer mentions [17]:
• "Architect" or "Senior Software Engineer", then the developer is labelled "Experienced Software Engineer" (ESE) [18],
• "Junior Software Engineer" or "Software Engineer", then the developer is labelled "Software Engineer" (SE),
• "Developer", then we search whether the developer holds a Master of Science in Software Engineering. If so, the developer is labelled "SE"; otherwise the developer is labelled "OTHER",
• any description other than the above, then the developer is labelled "OTHER".
Table 6 summarizes the keywords searched for in developers' profiles to label them.
Step 2: Outlier detection. To avoid misclassifications, we search for outliers using the Isolation Forest method. Indeed, we assume that developers with the same label should have comparable metric values and, conversely, that developers with clearly different metric profiles should be labelled differently. Isolation Forest calculates a score for each observation in the dataset. This score provides a measure of normality for each observation and thus yields a set of possibly mislabelled developers.
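Step 2 can be illustrated with scikit-learn's IsolationForest on synthetic metric profiles; the data, contamination rate and feature count below are illustrative assumptions, not the study's actual settings.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
# Synthetic "metric profiles": 60 similar developers plus 2 extreme ones.
normal = rng.normal(loc=[50, 10], scale=[5, 2], size=(60, 2))
outliers = np.array([[500.0, 200.0], [480.0, 180.0]])
X = np.vstack([normal, outliers])

# Isolation Forest scores every observation; predict() returns -1 for
# isolated points, i.e. candidate mislabelled developers to review manually.
clf = IsolationForest(contamination=0.05, random_state=0).fit(X)
flags = clf.predict(X)
suspects = np.where(flags == -1)[0]
print(suspects)
```

In the study, the flagged developers are not relabelled automatically: they are handed to the manual review of Step 3.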
Step 3: Manual relabelling. After an inspection of the potential outliers, we have manually relabelled 21 of them. This manual relabelling step increases the quality of the labelling.

Table 6. Keywords and information used to label developers.

Keywords → Developer label
• "Architect" or "Senior Architect": Software Architect (SA)
• "Senior Software Engineer": Experienced Software Engineer (ESE)
• "Junior Software Engineer" or "Software Engineer": Software Engineer (SE)
• "Developer" AND "MSc in Software Engineering": Software Engineer (SE)
• "Developer": Non Software Engineer (NSE)
• "Bot" (in GitHub username): BOT
• Other experience level: Non Software Engineer (NSE)
• No information found: Unknown (UNK)
It is important to note that the dataset is enriched by manual labelling, which makes it ready for supervised machine learning algorithms. However, users of the dataset might want to dismiss this labelling for unsupervised learning, or might want to perform a labelling of their own. In those cases, the dataset remains a relevant contribution, as it is rich with 24 computed metrics.
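The decision rules of Table 6 can be sketched as a small function. This is an illustrative re-implementation, not the authors' labelling code: the function name, the plain-text profile input and the keyword matching are all assumptions.

```python
# Hypothetical re-implementation of the Table 6 decision rules.
def label_developer(profile: str, username: str = "") -> str:
    text = profile.lower()
    if "bot" in username.lower():
        return "BOT"
    if not text.strip():
        return "UNK"  # no information found
    if "architect" in text:  # matches "Architect" and "Senior Architect"
        return "SA"
    if "senior software engineer" in text:
        return "ESE"
    if "software engineer" in text:  # also matches "Junior Software Engineer"
        return "SE"
    if "developer" in text:
        return "SE" if "msc in software engineering" in text else "NSE"
    return "NSE"  # any other experience level

print(label_developer("Senior Software Engineer at some company"))  # ESE
```

Note that the rule order matters: "architect" must be tested before "software engineer", and "senior software engineer" before the plain "software engineer" keyword.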

Ethics Statements
By its nature, the extracted data contains GitHub usernames associated with metrics and experience levels in 17 projects. Code extraction and the retrieval of information relative to developers for each project on GitHub comply with the GitHub policies. Information about developers gathered using social networks (Twitter, GitHub and LinkedIn) complies with the platforms' data distribution policies. Developers' experience level provides information about developers' skills. Hence, we fully anonymize the GitHub usernames. By doing so, it is very difficult to trace back to a non-anonymized developer by simple metric calculation. This computational difficulty, combined with the full anonymization of GitHub usernames, guarantees developers' anonymity.
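The article does not specify the anonymization mechanism; one conventional way to produce stable pseudonyms, sketched below purely as an assumption, is a salted one-way hash of each username.

```python
import hashlib
import secrets

# Illustrative only: the authors' actual anonymization scheme is not documented.
# A random salt (discarded after anonymization) prevents dictionary attacks
# that would hash known GitHub usernames and compare digests.
SALT = secrets.token_hex(16)

def anonymize(username: str) -> str:
    digest = hashlib.sha256((SALT + username).encode("utf-8")).hexdigest()
    return "dev_" + digest[:12]

print(anonymize("octocat"))  # stable pseudonym within one anonymization run
```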

Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Data Availability
Dataset of Open-Source Software Developers Labeled by their Experience Level in the Project and their Associated Software Metrics (Original Data) (Zenodo).