Published June 24, 2020 | Version v2.0
Dataset Open

Defect Prediction Tool Validation Dataset 2

  • 1. Tilburg University

Description

This dataset is used to address the research questions of the Transactions on Software Engineering study "Within-Project Defect Prediction of Infrastructure-as-Code using Product and Process Metrics".

See also: https://github.com/stefanodallapalma/TSE-2020-05-0217.

It provides:

* repositories.json - a list of repositories selected from open-source GitHub repositories based on the Ansible language.

* fixing-commits.json - a list of defect-fixing commits extracted from those repositories.

* fixed-files.json - a list of Ansible files fixed in those defect-fixing commits, together with the respective bug-inducing commits.

* failure-prone-files.json - a list of files that were failure-prone at some point in each repository's commit history.

* metrics.zip - CSV files consisting of releases (sets of files) and their IaC-oriented, delta, and process metrics, extracted from each analyzed repository.

* projects.zip - for each analyzed project, it contains the data (models, performance, and results of Recursive Feature Elimination) used to answer the Research Questions.
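The JSON files can be inspected with the Python standard library. The following is a minimal sketch, assuming each file holds a single top-level JSON array or object; the record schema is not described here, so the hypothetical helper `count_records` only counts top-level entries.

```python
import json
from pathlib import Path


def count_records(path):
    """Return the number of top-level entries in a JSON file.

    Assumption: the file contains one JSON array or object at the top level.
    """
    with Path(path).open(encoding="utf-8") as handle:
        data = json.load(handle)
    return len(data)


if __name__ == "__main__":
    # File names are taken from the list above; missing files are skipped.
    for name in ("repositories.json", "fixing-commits.json", "fixed-files.json"):
        if Path(name).exists():
            print(name, count_records(name))
```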

Context

Infrastructure-as-code (IaC) is the DevOps strategy that allows management and provisioning of infrastructure through the definition of machine-readable files and automation around them, rather than physical hardware configuration or interactive configuration tools.

On the one hand, although IaC is increasingly widely adopted, little is known about how to best maintain, speedily evolve, and continuously improve the code behind the IaC strategy in a measurable fashion.
On the other hand, source code measurements are often computed and analyzed to evaluate different quality aspects of the software being developed.
In particular, Infrastructure-as-Code is simply "code" and, as such, is as prone to defects as code written in any other programming language.

This dataset targets the YAML-based Ansible language to devise within-project defect prediction approaches for IaC based on machine learning.

Content

The dataset contains metrics extracted from 85 open-source GitHub repositories based on the Ansible language that satisfied the following criteria:

* The repository has at least one push event to its master branch in the last six months;
* The repository has at least 2 releases;
* At least 10% of the files in the repository are IaC scripts;
* The repository has at least 2 core contributors;
* The repository has evidence of continuous integration practice, such as the presence of a .travis.yml file;
* The repository has a comments ratio of at least 0.1%;
* The repository has a commit frequency of at least 2 per month on average;
* The repository has an issue frequency of at least 0.01 events per month on average;
* The repository has evidence of a license, such as the presence of a LICENSE.md file;
* The repository has at least 100 source lines of code.
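The criteria above amount to a conjunction of simple threshold checks. The sketch below illustrates this under stated assumptions: `RepoStats` and its field names are hypothetical and are not taken from the dataset or from the tooling used to build it.

```python
from dataclasses import dataclass


@dataclass
class RepoStats:
    """Hypothetical per-repository statistics; field names are illustrative."""
    months_since_last_push: float   # time since the last push to master
    releases: int
    iac_file_ratio: float           # fraction of files that are IaC scripts
    core_contributors: int
    has_ci: bool                    # e.g. a CI configuration file is present
    comments_ratio: float           # fraction of comment lines
    commits_per_month: float
    issue_events_per_month: float
    has_license: bool
    sloc: int                       # source lines of code


def satisfies_criteria(repo: RepoStats) -> bool:
    """Return True only if every selection criterion listed above holds."""
    return (
        repo.months_since_last_push <= 6
        and repo.releases >= 2
        and repo.iac_file_ratio >= 0.10
        and repo.core_contributors >= 2
        and repo.has_ci
        and repo.comments_ratio >= 0.001      # 0.1%
        and repo.commits_per_month >= 2
        and repo.issue_events_per_month >= 0.01
        and repo.has_license
        and repo.sloc >= 100
    )


# Made-up values for illustration only.
candidate = RepoStats(1.0, 3, 0.4, 2, True, 0.02, 5.0, 0.5, True, 1200)
print(satisfies_criteria(candidate))
```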

Metrics are grouped into three categories:

* IaC-Oriented: metrics of structural properties derived from the source code of infrastructure scripts. Click [here](https://www.sciencedirect.com/science/article/pii/S0164121220301618) for more info.

* Delta: metrics that capture the amount of change in a file between two successive releases, collected for each IaC-oriented metric.

* Process: metrics that capture aspects of the development process rather than aspects of the code itself. Descriptions of the process metrics in this dataset can be found [here](https://pydriller.readthedocs.io/en/latest/processmetrics.html).
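To make the delta category concrete, here is a minimal sketch that assumes a delta metric is the plain difference between the values of an IaC-oriented metric in two successive releases of the same file. The metric names and the `delta_` prefix are illustrative assumptions, not a statement about the column names in metrics.zip.

```python
def delta_metrics(previous, current):
    """Return current - previous for every metric present in both releases.

    Assumption: a delta metric is a simple difference between two
    successive releases of the same file.
    """
    return {
        f"delta_{name}": current[name] - previous[name]
        for name in current
        if name in previous
    }


# Illustrative values for one file in two successive releases.
release_1 = {"lines_code": 120, "num_tasks": 8}
release_2 = {"lines_code": 150, "num_tasks": 7}
print(delta_metrics(release_1, release_2))
```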

In addition to the metrics, the dataset contains the pre-trained models (*.joblib) in the folders rq1 and rq2 of projects.zip.

You can load the model in Python as follows:

```
from joblib import load

# Load the serialized object (memory-mapped in read-only mode)
model = load('projects/owner/repository/rq1/random_forest.joblib', mmap_mode='r')

best_estimator = model['estimator']  # The estimator that maximized the AUC-PR

cv_results = model['cv_results']  # The results of each step of the validation procedure

best_index = model['best_index']  # The index to access the best cv_results
```
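Once a model is loaded, `best_index` can be used to pull the best configuration out of `cv_results`. The sketch below assumes that `cv_results` is a mapping from column names to equally long sequences, in the style of scikit-learn's `cv_results_`; the helper `best_cv_entry` and the keys in the stand-in dictionary are hypothetical.

```python
def best_cv_entry(model):
    """Return the value of each cv_results column at position best_index.

    Assumption: cv_results maps column names to sequences of equal length.
    """
    cv_results = model["cv_results"]
    best_index = model["best_index"]
    return {key: values[best_index] for key, values in cv_results.items()}


# Stand-in for a loaded model dictionary; the values are made up.
example = {
    "estimator": None,
    "cv_results": {
        "mean_test_score": [0.61, 0.74, 0.70],
        "params": [{"depth": 2}, {"depth": 4}, {"depth": 8}],
    },
    "best_index": 1,
}
print(best_cv_entry(example))
```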

 

Acknowledgements

 

This work is supported by the European Commission grant no. 825040 (RADON H2020).


Inspiration

What source code properties and properties about the development process are good predictors of defects in Infrastructure-as-Code scripts?

Files

Files (923.0 MB)

* md5:30027f0425bb6982592582b0eba47e8e - 148.6 MB
* md5:f4c90de4414d73c05d9a6c43e73d9fdb - 1.0 MB
* md5:aba91f032d892f160f68999c21a3c6cf - 1.6 MB
* md5:536464b257bb2e9312e5051e85dab9cd - 8.5 MB
* md5:d8af582339c90bfdad98a541a6704d06 - 763.2 MB
* md5:678c4ac53e7b7dfbe288c90cb53e3502 - 119.7 kB
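The md5 values listed above can be used to check a download for corruption. This is a minimal sketch using only the standard library; the helper names are hypothetical, and because the listing does not say which checksum belongs to which file, the expected value has to be taken from the record page for the file in question.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the md5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_checksum(path, expected):
    """Compare a file against a checksum given as 'md5:<hex>' or '<hex>'."""
    expected_hex = expected.split(":", 1)[-1].strip().lower()
    return md5_of_file(path) == expected_hex
```

For example, `matches_checksum("metrics.zip", "md5:<value from the record page>")` returns `True` only when the local copy is byte-identical to the published file.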

Additional details

Funding

RADON – Rational decomposition and orchestration for serverless computing 825040
European Commission