Data in Brief


Abstract

Developers are extracted from 17 open-source projects hosted on GitHub. Projects are chosen that use the Java programming language, the Spring framework and the Maven/Gradle build tools. For each of these developers, 24 software engineering metrics are extracted; these metrics are either computed by analyzing the source code or derived from project management metadata. Each developer is then manually searched for on professional social media such as LinkedIn or Twitter and labelled with his/her experience level in the project. Outliers are statistically detected and manually re-assigned when needed. The resulting dataset contains 703 anonymized developers, qualified by their 24 project-related software engineering metrics and labelled with their experience level. It is suitable for empirical software engineering studies that need to connect developers' levels of experience to tangible software engineering metrics.

Value of the Data
• This dataset contains more than 700 developers extracted from 17 open-source projects hosted on GitHub, together with 24 software metrics computed for each developer. The value of the dataset comes both from its size (24 metrics for 703 developers) and from the different pieces of information attached to each developer (metrics and experience levels).
• Developers in the dataset are manually labelled with one of the following labels: Experienced Software Engineer, Software Engineer, Bot, Other, Unknown. The quality of the labelling is improved by a statistical analysis followed by a manual inspection of outliers and a re-labelling when needed.
• Gathering information about software developers, and more particularly about their experience level in open-source projects, is a cumbersome task. Hence, this dataset should be of interest to researchers in software engineering.
• The dataset can be used to perform empirical studies in software engineering, more precisely about the characteristics of software developers or the relations between project code quality and developers. Moreover, it can be used in machine learning approaches (either unsupervised or supervised) thanks to both the labelling and the number of software metrics associated with each developer.

Objective
This dataset was created in a context related to empirical software engineering and machine learning. The data have been extracted from open-source GitHub projects and are related to a research article [3]. The dataset is provided openly to researchers working in empirical software engineering and machine learning, to ease their data collection, developer-related software metrics computation and data labelling. Such datasets are rare in this context, as building them requires both heavy computation and tedious manual indexing; it is therefore important for us to share it widely with the scientific community. Furthermore, this article matters for reproducibility purposes, as it clearly documents the retrieval process of the data used in the companion research article [3]. The dataset could also be used as a benchmark for comparing the performance of future research in this field. By officially publishing this dataset through Data in Brief, the authors wish to advertise its solid conception.

Data Description
The dataset of experienced developers is composed of 703 developers extracted from 17 open-source projects hosted on GitHub [4]. The selected GitHub projects are mainly written in Java and all use the Java Spring Framework [5]. This framework provides languages (such as a deployment-descriptor XML dialect and Java annotations) that support the definition of the architecture that will be automatically instantiated by the system to execute an application. The projects also use the Gradle [6] and Maven [7] software management and build-automation tools. The use of these technologies is a deliberate choice, made in order to constitute a dataset of developers working with a Java ecosystem (Gradle/Maven, Java, etc.), Spring and GitHub. Table 1 provides metadata on the 17 selected projects: their total number of developers, their number of stars on GitHub, their GitHub creation date and their URL. The numbers of both developers and stars vary with time; values in Table 1 are those retrieved on 2021/09/22. The selection criteria are described below (in Section Experimental design, materials and methods).
Developers from those 17 projects are extracted using the GitHub API [8]. For each developer of each project, the 24 metrics described in Table 2 are computed.
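As an illustration of this extraction step, contributor logins can be obtained from the GitHub REST API's `/repos/{owner}/{repo}/contributors` endpoint. The minimal sketch below parses a response of that shape; the sample payload and the helper name are ours for illustration and are not part of the dataset pipeline.

```python
import json

def contributor_logins(payload: str) -> list[str]:
    """Extract contributor logins from a GitHub contributors API JSON response."""
    return [entry["login"] for entry in json.loads(payload)]

# Truncated sample shaped like GET /repos/{owner}/{repo}/contributors
sample = '[{"login": "alice", "contributions": 120}, {"login": "bob", "contributions": 3}]'
print(contributor_logins(sample))  # -> ['alice', 'bob']
```

In the actual pipeline, each login would additionally be tied to its project, since the same person contributing to two projects is counted as two distinct developers.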
Four metrics (Number of Commits (NoC), Followers, Days in Project (DiP) and Inter-commit Time (ICT)) are process metrics (i.e. metrics monitoring the development process). The remaining 20 metrics described in Table 2 are code metrics and are inherently related to source code. Code metrics measure different kinds of elements. Eight metrics focus on the Java structure (e.g. Number of Abstract Classes (AB) or Number of Classes Implementing an Interface (CII)). Four metrics relate to the Gradle/Maven structure and three metrics measure the use of the Spring framework. Then, three metrics qualify the number of lines of code and two the number of files added or deleted. These metrics measure the software architecture at different scales (or granularities); those scales are shown in Fig. 1. Moreover, to choose these metrics, we rely on the work of Di Bella et al. [9] and Perez et al. [10]. Di Bella et al. use an unsupervised method to classify developers into four groups, from rare to core developers. They show that several metrics are discriminant for this classification: Number of Commits, Lines of Code, Days in Project and Inter-commit Time. Hence, we choose to reuse these metrics to constitute our dataset. Perez et al. use Spring markers (specific Java annotations) to statistically distinguish categories of developers having experience in working on the runtime architecture of the software. Therefore, we also choose to reuse the three variables they identified as specific to the Spring runtime software architecture.
We check that the computed metrics are consistent, for instance that AB + NAB = CE + NCE. As seen in Table 3, the metrics exhibit a large statistical dispersion, due to some developers having a high level of seniority and therefore a high level of contribution in projects.
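Such consistency checks are easy to automate. The sketch below is a minimal illustration of the identity above (both sums count every class exactly once); the dictionary keys mirror the metric acronyms from Table 2, and the sample values are ours.

```python
def class_metrics_consistent(row: dict) -> bool:
    """Abstract + non-abstract class counts must equal the sum of the two
    complementary extension counts: both partitions cover all classes once."""
    return row["AB"] + row["NAB"] == row["CE"] + row["NCE"]

# Illustrative per-developer metric values (not taken from the dataset)
dev = {"AB": 4, "NAB": 21, "CE": 10, "NCE": 15}
print(class_metrics_consistent(dev))  # -> True
```

Running this check over every row of the dataset flags any developer whose class-level metrics were corrupted during extraction.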
Developers in the dataset are manually labelled according to their experience level in their project, using the labels described below (in Section Experimental design, materials and methods). Fig. 2 presents the total number of developers per experience level. The largest part of the dataset (505 out of 703 developers) is composed of developers whose role was not found. This comes from the nature of open-source projects, where a large proportion of developers are very occasional or even contributed only once. In the other categories, and excluding the BOT category, a total of 188 developers have a clearly identified experience level. There is a good balance between software engineers (73) and experienced software engineers (69). 29 developers are software architects, whereas 17 clearly identify as having a specific IT role (such as UX/UI designer or project manager) while not being developers. Finally, 10 developers are identified as BOTs, i.e. continuous integration systems, such as Jenkins or Travis, which automatically commit on GitHub repositories. Fig. 3 shows the number of developers per experience level for each project (represented using a logarithmic scale). As in Fig. 2, in all projects a majority of developers have an unknown role (UNK). Four projects (Activiti, Broadleaf, dhis2-core and flowable-engine) have developers in all the remaining categories (SE, ESE, SA, NSE, BOT); the other projects have only a few SE, ESE or SA.

Experimental Design, Materials and Methods
The data acquisition process, described using the Business Process Model and Notation (BPMN) [11], is shown in Fig. 5. The steps of this process are the following:

1. GitHub project selection: we manually select 17 projects from GitHub using the quality criteria given by Kalliamvakou et al. [12] for open-source repository mining. We also add extra selection criteria to target projects that use the Spring Framework and have at least two developers.
2. Data acquisition for developers (parallel tasks): (a) Developer extraction from projects: we extract the set of 951 developers from the 17 selected projects using the GitHub API. Each extracted developer is linked to his/her project; thus, a developer appearing in two projects is considered as a different developer in each. (b) Developer metadata retrieval: the extracted data about developers contain the username, name and email as described in the developers' GitHub accounts.
3. Data acquisition for metrics (parallel tasks): (a) Source code retrieval: for each project, we collect the source code. (b) Commit retrieval: we acquire the project histories, composed of the set of all commits from the first (date of the project creation on GitHub) to the latest (date of the dataset retrieval, as given by the commit ID in Table 4).
4. Metrics computation for each developer: using a modified version of the PyDriller tool [1], we compute the 24 metrics described in Table 2. For each project and developer, metrics are computed using the whole project history extracted at Step 3. Table 5 presents 4 global metrics characterizing the extracted software projects. We use the cloc software [13] to compute the numbers of files and lines of code listed in Table 5. Table 4 also gives the number of developers present in the dataset for each project.
5. Data cleaning: we perform a manual cleaning step to exclude developers that did not change at least one line, as synthesized in the following variables: AddLGM, DelLGM, AddLoC, DelLoC, AddSAM, DelSAM. When the sum of these six variables is equal to zero, the developer is removed from the dataset. By this means, the dataset is reduced from 951 to 703 developers.
6. Developer labelling: each developer extracted from GitHub is mapped to his/her experience level in the project through a three-step process (see Fig. 4). The labelling process mainly relies on a manual internet search for each developer, using his/her GitHub username and name. We trust this labelling method because many developers use social networks [14]. We collect developers' experience levels from LinkedIn [15], Twitter [16] and GitHub profiles or project documentation websites. When a developer's GitHub name is found in one of those sources, we check that the developer mentions that he/she is working on the given project (so as to prevent confusion with potential homonyms). The developer's profile is then manually read through to determine the developer's label. The list of labels used to qualify developers' experience is inspired by the 2021 Stack Overflow Developer Survey [17]. After this first step, a statistical analysis (isolation forest) is performed to detect labelling outliers with respect to their metrics values. Potential outliers are then manually reviewed in a third step, in order to check their labelling and correct it if needed.
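The cleaning rule of Step 5 amounts to a simple filter over the six line-change variables. The sketch below illustrates it on plain records; the sample developers and the helper name are ours, not part of the published pipeline.

```python
# The six line-change metrics named in the cleaning step.
CHANGE_VARS = ["AddLGM", "DelLGM", "AddLoC", "DelLoC", "AddSAM", "DelSAM"]

def changed_at_least_one_line(dev: dict) -> bool:
    """Keep a developer only if the six line-change counts are not all zero.
    The metrics are non-negative counts, so a zero sum means all are zero."""
    return sum(dev[v] for v in CHANGE_VARS) > 0

# Illustrative records: the second developer never changed a line and is dropped.
devs = [
    {"name": "dev-a", "AddLGM": 2, "DelLGM": 0, "AddLoC": 150, "DelLoC": 30, "AddSAM": 1, "DelSAM": 0},
    {"name": "dev-b", "AddLGM": 0, "DelLGM": 0, "AddLoC": 0, "DelLoC": 0, "AddSAM": 0, "DelSAM": 0},
]
kept = [d for d in devs if changed_at_least_one_line(d)]
print([d["name"] for d in kept])  # -> ['dev-a']
```

Applied to the full dataset, this filter is what reduces the 951 extracted developers to the 703 retained ones.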
The following is a detailed description of this three-step process.

Step 1: Manual labelling. Each developer is searched for in LinkedIn, Twitter and GitHub profiles or project documentation websites, using his/her GitHub username and name. When a developer is found in one of those sources, we check that the developer mentions that he/she is working on the given project (so as to prevent confusion with potential homonyms). If the profile of a given developer mentions [17]:
• "Architect" or "Senior Software Engineer", then the developer is labelled as "Experienced Software Engineer" (ESE) [18];
• "Junior Software Engineer" or "Software Engineer", then the developer is labelled as "Software Engineer" (SE);
• "Developer", then we search whether the developer holds a Master of Science in Software Engineering. If so, the developer is labelled as "SE"; otherwise, the developer is labelled as "OTHER";
• any description other than the above, then the developer is labelled as "OTHER".
Table 6 summarizes the keywords searched for in developers' profiles to label them.
Step 2: Outlier detection. To avoid misclassifications, we search for outliers using the Isolation Forest method. Indeed, we assume that equally labelled developers should have comparable metrics values and, conversely, that developers with two different metrics profiles should be labelled differently. Isolation Forest calculates a score for each observation in the dataset. This score provides a measure of normality for each observation and thus yields a set of possibly mislabelled developers.
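A minimal sketch of this step, using scikit-learn's IsolationForest on synthetic metric values (the data and hyperparameters are ours; the article does not specify them):

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
# 50 developers with comparable two-metric profiles, plus one extreme profile.
X = np.vstack([rng.normal(loc=[100.0, 10.0], scale=5.0, size=(50, 2)),
               [[1000.0, 500.0]]])

forest = IsolationForest(random_state=0).fit(X)
scores = forest.decision_function(X)  # lower score = more anomalous
most_anomalous = int(np.argmin(scores))
print(most_anomalous)  # index of the developer to re-inspect manually
```

In the actual process, the lowest-scoring developers form the set of potential labelling outliers passed on to the manual review of Step 3.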
Step 3: Manual relabelling. After inspecting the potential outliers, we manually relabelled 21 of them. This manual relabelling step increases the quality of the labelling.

Table 2 (continued)
DelSAM: Number of Spring Architectural Modifications (lines specific to Spring) deleted by a given developer
ChurnSAM: Difference between added and deleted Spring-specific lines for a given developer
Lines of Code
AddLOC: Number of Lines Of Code added by a given developer in project files
DelLOC: Number of Lines Of Code deleted by a given developer in project files
ChurnLOC: Difference between added and deleted lines of code in project files for a given developer
Number of Files
AddF: Number of files added by a given developer
DelF: Number of files deleted by a given developer
Process Metrics
Followers: Number of GitHub followers of a given developer
DiP: Days in Project. Number of days the developer has been in the project (time between first and last commit)
ICT: Inter-commit Time. Average time (in days) between two successive commits for a given developer
NoC: Number of commits made by a developer

Fig. 2. Number of developers per experience level in the dataset.

Fig. 3. Number of developers per category for each project (represented using a logarithmic scale).
How the data were acquired: software developers are extracted from 17 open-source software projects hosted on GitHub. To do so, we reuse and adapt the PyDriller [1] tool. Using PyDriller, we compute 24 software metrics for each developer of a given project. Then, we search for the experience level of each developer in professional social networks and project documentation.
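The per-developer aggregation that PyDriller enables can be sketched as follows. The stub Commit record stands in for the richer objects yielded by PyDriller's Repository.traverse_commits(); the field names are simplified for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass

@dataclass
class Commit:
    """Stub for a mined commit record: author name and lines touched."""
    author: str
    insertions: int
    deletions: int

def lines_per_developer(commits):
    """Aggregate added/deleted lines per developer over a project history."""
    totals = defaultdict(lambda: {"added": 0, "deleted": 0})
    for c in commits:
        totals[c.author]["added"] += c.insertions
        totals[c.author]["deleted"] += c.deletions
    return dict(totals)

# Illustrative three-commit history (not taken from the dataset)
history = [Commit("alice", 120, 30), Commit("bob", 5, 0), Commit("alice", 10, 2)]
print(lines_per_developer(history))
# -> {'alice': {'added': 130, 'deleted': 32}, 'bob': {'added': 5, 'deleted': 0}}
```

The actual tool walks the full commit history of each repository and computes all 24 metrics of Table 2 in this per-developer fashion.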

Table 1
Metadata on projects in the dataset.

Table 2
Description of the 24 computed metrics.

Table 4
Latest selected commit for each project.