Published February 26, 2021 | Version 1
Thesis Open

Large-scale Java GitHub search of "test" in content, filename and file path

  • 1. Technical University of Košice

Contributors

Supervisor:

  • 1. Technical University of Košice

Description

Dataset from a large-scale GitHub analysis based on the GHTorrent list of repositories from May 2019. The dataset includes only non-fork repositories whose majority language is Java. Each of the 4.3M repositories was searched for the word "test" via the GitHub Search API in:

  • all files content
  • java files content
  • all filenames
  • java filenames
  • all file paths
  • java file paths

Simultaneously, the current numbers of repository commits and watchers were obtained. The dataset was collected between 2019-08-20 and 2019-10-01.
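
The six searches listed above could be expressed as GitHub code-search queries. The following Python sketch is an illustration only, not the queries actually used for this dataset, which this record does not state. The names `build_test_queries` and `search_url`, the scope keys, and the use of `language:java` to select Java files are all assumptions. The code only builds query strings and request URLs for the `/search/code` endpoint; it sends no requests.

```python
from urllib.parse import urlencode

SEARCH_URL = "https://api.github.com/search/code"


def build_test_queries(full_name: str, term: str = "test") -> dict:
    """Return one code-search query string per search scope for a repository.

    Hypothetical reconstruction: the exact queries behind the dataset are unknown.
    """
    repo = f"repo:{full_name}"
    # Assumption: "java files" means files that GitHub classifies as Java.
    java = "language:java"
    return {
        "body": f"{term} in:file {repo}",
        "body_java": f"{term} in:file {java} {repo}",
        "filename": f"filename:{term} {repo}",
        "filename_java": f"filename:{term} {java} {repo}",
        "path": f"{term} in:path {repo}",
        "path_java": f"{term} in:path {java} {repo}",
    }


def search_url(query: str) -> str:
    """Build the request URL for a query; the caller decides how to send it."""
    return f"{SEARCH_URL}?{urlencode({'q': query})}"
```

For example, `build_test_queries("octocat/Hello-World")["path_java"]` yields `test in:path language:java repo:octocat/Hello-World`. A search response reports the number of matches in its `total_count` field, which is presumably what the `found_test_in_*` columns record.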

The dataset is a MySQL dump of one table containing the following columns:

  • id - internal table ID
  • project_id - ID from the `projects` table of GHTorrent's mysql-2019-05-01 mirror
  • full_name - full name of the project
  • found_test_in_path_java - number of occurrences of "test" in java paths
  • found_test_in_path - number of occurrences of "test" in all paths
  • found_test_in_body_java - number of occurrences of "test" in java files content
  • found_test_in_body - number of occurrences of "test" in all files content
  • found_test_in_filename_java - number of occurrences of "test" in java filenames
  • found_test_in_filename - number of occurrences of "test" in all filenames
  • watchers - number of project's watchers
  • created_at - datetime of data fetching
  • last_commit - datetime of last commit
  • all_commits - number of all commits, including those inherited from other projects
  • project_commits - number of the project's own commits, excluding inherited ones
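
To show how the columns above can be combined, the following Python sketch computes the share of repositories with at least one Java filename containing "test". It is a minimal illustration under stated assumptions: the table name `repositories` and the column types are hypothetical (the dump defines the real ones), SQLite stands in for MySQL so the example runs without a server, and the demo rows are invented placeholders, not values from the dataset.

```python
import sqlite3

# Hypothetical table name; this record does not name the table in the dump.
TABLE = "repositories"

# Column names follow the list above; the types are assumptions.
SCHEMA = f"""
CREATE TABLE {TABLE} (
    id INTEGER PRIMARY KEY,
    project_id INTEGER,
    full_name TEXT,
    found_test_in_path_java INTEGER,
    found_test_in_path INTEGER,
    found_test_in_body_java INTEGER,
    found_test_in_body INTEGER,
    found_test_in_filename_java INTEGER,
    found_test_in_filename INTEGER,
    watchers INTEGER,
    created_at TEXT,
    last_commit TEXT,
    all_commits INTEGER,
    project_commits INTEGER
)
"""


def share_with_java_test_files(conn: sqlite3.Connection) -> float:
    """Fraction of repositories with at least one Java filename containing "test"."""
    hits, total = conn.execute(
        f"SELECT SUM(found_test_in_filename_java > 0), COUNT(*) FROM {TABLE}"
    ).fetchone()
    # SUM over an empty table is NULL (None), so guard on the row count.
    return hits / total if total else 0.0


def make_demo_db() -> sqlite3.Connection:
    """Create an in-memory table filled with invented placeholder rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(SCHEMA)
    rows = [("example/a", 3), ("example/b", 0), ("example/c", 12), ("example/d", 0)]
    conn.executemany(
        f"INSERT INTO {TABLE} (full_name, found_test_in_filename_java) VALUES (?, ?)",
        rows,
    )
    return conn
```

With the four placeholder rows, `share_with_java_test_files(make_demo_db())` returns 0.5. Against the real data, the same query would run on the imported MySQL table after substituting its actual name.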

Notes

This work was supported by project VEGA No. 1/0762/19: Interactive pattern-driven language development.

Files

Files (165.1 MB)

md5:4e70f5afd7997a7ae2bfd43ce85c2707 (165.1 MB)