Match score dataset for team ball sports

In this data article, we present a dataset containing match scores from major international competitions for 12 popular team ball sports: basketball, cricket, field hockey, futsal, handball, ice hockey, lacrosse, roller hockey, rugby, soccer, volleyball, and water polo. The dataset was obtained by web scraping data available on Wikipedia pages and includes the following information related to individual matches: the year of the competition edition when a match occurred, the names of the two opposing teams, their respective scores, and the name of the winning team. Our match score dataset provides researchers in the field of sports analytics with valuable data that can be used to compute team statistics, develop team ranking and rating systems, infer patterns and trends in a team's performance across the edition years, build predictive models to forecast the outcome of future matches, and evaluate the performance of machine learning algorithms.


Department of Industrial and Systems Engineering, Lehigh University, Bethlehem, PA 18015-1582, USA
Emails: tna324@lehigh.edu, tog220@lehigh.edu, lnv@lehigh.edu, rgm424@lehigh.edu, orr224@lehigh.edu

Subject area: Data Science
Specific subject area: Big Data Analytics, Sports Analytics
Data format: Raw, Filtered
Type of data:

Value of the data
• The dataset can be used to compute team statistics and develop team ranking and rating systems to evaluate the performance of teams based on their past match scores.
• Statistical and data-driven methodologies can be used to infer patterns and trends in a team's performance across the years.
• Researchers in the field of sports analytics can use the dataset to build a predictive model to forecast the outcome of future matches and evaluate the performance of different machine learning algorithms.

Background
In the last decade, there has been increasing research interest in the analysis of team ball sports [6]. Various avenues have been explored. Some articles use match score data to identify sports with the most random outcomes [1,4]. Others aim to determine scoring patterns across different sports [5]. Finally, there are articles focused on predicting the outcome of sports matches using machine learning algorithms [2,3].

The information contained in the match score dataset presented in this article is highly valuable for researchers in the field of sports analytics. By using and exploring the dataset, researchers can compute team statistics (such as the number of matches won and lost by a team) and develop team ranking and rating systems to evaluate the performance of teams based on their past match scores. By analyzing historical match score data, researchers can gain insights into the strengths and weaknesses of each team across different editions of a competition. Statistical and data-driven methodologies offer the opportunity to infer patterns and trends in a team's performance across the years. The dataset can also be used to build predictive models aimed at forecasting the outcome of future matches. By training machine learning algorithms on historical match score data, such models can predict which team is likely to win upcoming matches. Additionally, the dataset enables machine learning practitioners to evaluate the performance of different machine learning algorithms in predicting match outcomes, allowing for the selection of the most effective algorithms for this task.
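As one possible illustration of the rating systems mentioned above, the following minimal sketch applies an Elo-style update to a sequence of match results. Elo is used here only as a familiar example and is not a method proposed in this article. The function name `elo_ratings`, the parameters `k` and `initial`, and the toy matches are all assumptions made for illustration. The sketch also assumes that the matches are supplied in the order in which they were played.

```python
def elo_ratings(matches, k=32.0, initial=1500.0):
    """Return Elo-style ratings after processing matches in the given order.

    Each match is a (team_1, team_2, winner) tuple. A winner that is neither
    of the two teams is treated as a draw (an assumption of this sketch).
    """
    ratings = {}
    for team_1, team_2, winner in matches:
        r1 = ratings.setdefault(team_1, initial)
        r2 = ratings.setdefault(team_2, initial)
        # Expected score of team_1 under the standard Elo logistic curve.
        expected_1 = 1.0 / (1.0 + 10.0 ** ((r2 - r1) / 400.0))
        if winner == team_1:
            actual_1 = 1.0
        elif winner == team_2:
            actual_1 = 0.0
        else:
            actual_1 = 0.5
        # The update is zero-sum: what one team gains, the other loses.
        delta = k * (actual_1 - expected_1)
        ratings[team_1] = r1 + delta
        ratings[team_2] = r2 - delta
    return ratings


# Made-up matches for illustration only; they are not taken from the dataset.
toy_matches = [("Alpha", "Beta", "Alpha"), ("Beta", "Gamma", "Draw")]
ratings = elo_ratings(toy_matches)
for team, rating in sorted(ratings.items()):
    print(f"{team}: {rating:.1f}")
```

Because the update depends on match order, results from such a sketch are only meaningful if the chronological order of matches within and across editions is known or can be reasonably assumed.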

Data description
The dataset we present in this article contains match score data from major international competitions across 12 team ball sports: basketball, cricket, field hockey, futsal, handball, ice hockey, lacrosse, roller hockey, rugby, soccer, volleyball, and water polo. For each sport, the dataset includes the following information related to individual matches: names of the two opposing teams, their respective scores, and the name of the winning team. Table 1 provides the official names of the competitions selected for each sport, all of which are men's events, and the total number of matches across all the years of the editions of each competition. The complete table with the edition years for each selected competition is provided in Table 3 of Appendix A. Tables 4-7 present the number of matches and teams for each edition. Some popular team ball sports like American football, baseball, and tennis were omitted from our dataset due to either the absence of international competitions or the limited size of their teams compared to the sports included in our paper.
Table 2 introduces some general notation that will allow us to formally describe the match score dataset. Denoting as $S$ the set of sports, let $E_s$ be the set of editions for the competition selected for sport $s \in S$ in Table 1 (one can think of such a set as a set of edition years) and let $P_e^s = \{p_1, p_2, \ldots, p_{N_e^s}\}$ be the set of all teams playing in edition $e \in E_s$, where $N_e^s$ is the total number of teams in that edition. We will denote as $M_e^s \subseteq P_e^s \times P_e^s$ the set of matches in edition $e$, represented as a set of tuples $(i, j)$, where $i$ and $j$ are opposing teams belonging to $P_e^s$. For any sport $s \in S$, edition $e \in E_s$, and match $(i, j) \in M_e^s$, let $\mathrm{score}_{ij}^{s,e}(i)$ denote the score that team $i$ obtained when playing against team $j$ in edition $e$ (for example, if "5-6" is the outcome of the match $(i, j)$, then $\mathrm{score}_{ij}^{s,e}(i) = 5$ and $\mathrm{score}_{ij}^{s,e}(j) = 6$) and let $\mathrm{winner}_e^s(i, j) \in \{i, j, \mathrm{Draw}\}$ represent the winner of the match, where Draw denotes that the match resulted in a tie.

Table 1: Major international competitions selected for the team ball sports included in our paper, along with the total number of matches across all the edition years of each competition.
Given a sport $s \in S$, the match score dataset for each edition $e \in E_s$ can be represented by the following set

$$D_e^s = \{(i, j, \mathrm{score}_{ij}^{s,e}(i), \mathrm{score}_{ij}^{s,e}(j), \mathrm{winner}_e^s(i, j)) \mid (i, j) \in M_e^s\}. \tag{2.1}$$

Note that given a sport $s$ and an edition $e$, the size of $D_e^s$ is equal to the size of $M_e^s$, which can be obtained from the values in the 'Matches' column in Tables 4-7. The size of $P_e^s$, which is equal to $N_e^s$, can be obtained from the values in the 'Teams' column in Tables 4-7. Combining the match score datasets in (2.1) across all the edition years of the competition selected for each sport $s$, one obtains the following dataset

$$D^s = \bigcup_{e \in E_s} D_e^s. \tag{2.2}$$

The dataset available in our repository consists of 12 CSV files, each corresponding to a sport $s \in S$. Such files are named as follows: "match score dataset [SPORT NAME].csv", where [SPORT NAME] denotes the name of a sport (see the 'Sport' column in Table 1). For each sport $s$, the corresponding CSV file contains the dataset $D^s$, as defined in (2.2). Each dataset $D^s$ consists of five features, i.e., 'Team 1', 'Team 2', 'Score 1', 'Score 2', and 'Winner', which are associated with $i$, $j$, $\mathrm{score}_{ij}^{s,e}(i)$, $\mathrm{score}_{ij}^{s,e}(j)$, and $\mathrm{winner}_e^s(i, j)$ in (2.1), where $(i, j) \in M_e^s$.
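The five-feature layout described above can be read with standard CSV tooling. The following is a minimal sketch, not the authors' code, that counts matches played, won, and lost per team from rows with the columns 'Team 1', 'Team 2', 'Score 1', 'Score 2', and 'Winner'. The function name `team_records`, the toy rows, and the exact spelling of the tie label in the 'Winner' column are assumptions. For that reason, any 'Winner' value that matches neither team is counted in a separate "other" bucket rather than being interpreted.

```python
import csv
import io
from collections import defaultdict

# Toy rows in the five-column layout described above. The team names and
# scores are made up for illustration; they are not taken from the dataset.
SAMPLE_CSV = """Team 1,Team 2,Score 1,Score 2,Winner
Alpha,Beta,5,6,Beta
Alpha,Gamma,3,3,Draw
Beta,Gamma,2,1,Beta
"""


def team_records(rows):
    """Count matches played, won, lost, and without a decisive winner."""
    records = defaultdict(lambda: {"played": 0, "won": 0, "lost": 0, "other": 0})
    for row in rows:
        team_1, team_2, winner = row["Team 1"], row["Team 2"], row["Winner"]
        for team, opponent in ((team_1, team_2), (team_2, team_1)):
            records[team]["played"] += 1
            if winner == team:
                records[team]["won"] += 1
            elif winner == opponent:
                records[team]["lost"] += 1
            else:
                # Any label that names neither team (assumed here to
                # include ties) is counted as a non-decisive match.
                records[team]["other"] += 1
    return dict(records)


# A real file could be read the same way by passing csv.DictReader an
# open file handle for one of the 12 CSV files instead of io.StringIO.
records = team_records(csv.DictReader(io.StringIO(SAMPLE_CSV)))
for team, record in sorted(records.items()):
    print(team, record)
```

Keeping unrecognized 'Winner' labels in their own bucket avoids silently counting them as wins or losses for either team.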

Experimental design, materials, and methods
For each of the 12 team ball sports included in our paper, to populate the datasets (2.1) and (2.2), we collected real data on match scores from available editions of the major international competitions listed in Table 1. In particular, we obtained match score data through Python web-scraping of 147 Wikipedia pages that include match outcomes from the selected editions of the competitions. Table 8 of Appendix B includes links to the web-scraped pages.

Most of the major international sports competitions listed in Table 1 include a group stage, where teams are divided into groups and compete against each other within their group to accumulate points and progress in the competition, and a knockout stage (or bracket stage), where teams are eliminated from the competition if they lose a match. The knockout stage typically consists of the following additional phases: round of 16, quarterfinals, semifinals, and finals. The FIVB Men's World Cup for volleyball is an exception to this format. In such a competition, teams are divided into two groups. In the first phase, each team plays one match against all other teams in its group. In the second phase, each team plays one match against all the teams in the other group, and the competition champion is determined based on the total number of points and other criteria.

Table 2: Notation.
$E_s$: Set of editions for the competition selected for sport $s \in S$.
$P_e^s = \{p_1, p_2, \ldots, p_{N_e^s}\}$: Set of all teams playing in edition $e \in E_s$.
$D_e^s$: Match score dataset for edition $e \in E_s$, as defined in (2.1).
$D^s$: Dataset resulting from the union of the datasets $D_e^s$ for all $e \in E_s$, as defined in (2.2).

Limitations
We note that due to the complexity of the cricket scoring system, scores for cricket matches are not available in the dataset. However, the dataset does include the name of the winning team for each match, as this information is obtainable from Wikipedia pages related to the ICC Men's Cricket World Cup. In 7 matches, the winning team is not available and is listed as "No result", while in 5 matches, it is labeled "Match abandoned".

Table 7: Tables presenting the number of matches and teams per edition for each competition selected for soccer, volleyball, and water polo.

Table 4: Tables presenting the number of matches and teams per edition for each competition selected for basketball, cricket, and field hockey.

Table 5: Tables presenting the number of matches and teams per edition for each competition selected for futsal, handball, and ice hockey.

Table 6: Tables presenting the number of matches and teams per edition for each competition selected for lacrosse, roller hockey, and rugby.