The Ballpark Effect: Spatial-Data-Driven Insights into Baseball’s Local Economic Impact

Abstract: The impact of sporting events on local economies and their spatial distribution is a topic of active policy debate. This study adds to the discussion by examining granular cellphone location data to assess the spillover effects of Major League Baseball (MLB) games in a major US city. Focusing on the 2019 season, we explore geospatial patterns in mobility and consumer spending on game days versus non-game days in the Saint Louis region. Through density-based clustering and hotspot analysis, we uncover distinct spatiotemporal signatures and variations in visitor affluence across different teams. The study uses features such as game day characteristics, location data (latitude and longitude), business types, and spending data. A significant finding is that multiple distinct spatial clusters of economic activity form around the stadium, particularly on game days. These clusters reveal a marked increase in spending at businesses such as restaurants, bars, and liquor stores, with revenue surges of up to 38% in certain areas. We identified a significant change in spending patterns in the local economy during games, with results varying greatly across teams. Among the tree-based models tested, XGBoost performs best, achieving a test R² of 0.80. The framework presented enhances the literature at the intersection of urban economics, sports analytics, and spatial modeling while providing data-driven actionable insights for businesses and policymakers.


Introduction
Understanding precise human mobility is essential for the study of economic sciences, as it provides real-time data and shows small- and large-scale movement trends. This makes it helpful in examining the movements of people in relation to sporting events to gain knowledge about travel patterns and business effects. The combination of cell-based GPS data and business-level economic data allows for an understanding of meta-scale business trends and human movement based on whether it is a game day or not. These insights are important in fields such as urban planning, transportation, and emergency preparedness [1–3]. Sports events uplift the local economy of host cities through enhanced city branding, economic benefits, non-infrastructure benefits, and revenue from event ticket sales and tourism [4]. In the past, stadiums were financed by private entities, but nowadays they are funded publicly, which means that the costs of constructing a new stadium fall on citizens through taxes [5]. Over the past two decades, virtually every one of the 100 professional stadiums that have been inaugurated has received financial support, either directly or indirectly, from government entities at the local, state, or federal levels [6].
The Saint Louis Cardinals are a professional baseball team founded in 1882 that has won 11 World Series championships, making it one of the most successful franchises in Major League Baseball history. Since 2006, the Saint Louis region has benefited from an estimated economic impact of over USD 4.5 billion in output as a result of the Cardinals' 16 regular seasons, with around 40% of annual fan attendance being made up of out-of-town visitors [7]. As part of their agreement with the city, the owners of the Cardinals and their private equity partners invested over USD 270 million, which included at least USD 65 million in public subsidies, into the Ballpark Village, a sports-themed entertainment district [8,9]. Given the significant investment involved, it is crucial to understand the potential impact that Cardinals games may have on the mobility and spending patterns of people in the area.
The use of geospatial technologies and location-based data has been instrumental in studying various aspects of human behavior [10,11]. However, the application of modern geospatial technology in investigating the economic effects of sports events on regional economies has been limited. One possible reason could be the insufficiency of relevant data or the lack of available infrastructure capable of handling big data. Nevertheless, recent advances in cloud computing infrastructure and the availability of SafeGraph data have made it possible to explore the tangible economic impact of sports events. Therefore, this study investigates the hypothesis that baseball games contribute to the local economy.
This study seeks to bridge the gap between mobility patterns and economic impact by posing the following research questions: How do Major League Baseball (MLB) games affect local economic activity, and how can mobility data enhance our understanding of this impact? Additionally, how can we exploit mobility data to predict local spending, aiding in future planning, such as optimizing revenue generation, determining preferred game times or days, and selecting teams that maximize economic benefits? While it is well established that large events disrupt mobility patterns, our analysis goes beyond descriptive insights to explore how these changes translate into economic benefits. By integrating mobility data into machine learning models, we provide a more holistic view of the spillover effects of baseball games. This integration allows us to capture not only the spatial dynamics of visitor movements, but also their financial impact on local businesses. This analysis focuses on spatial patterns and the predictive power of features beyond business location, offering insights into how mobility data can inform revenue predictions. By connecting spatial analytics with economic outcomes, this study contributes to a more comprehensive understanding of the economic influence of sports events on local communities and how such data can be strategically used for future economic planning.

Related Works
The increasing availability of location-based data from mobile devices enables novel analyses across domains like transportation, urban planning, and economics [12–14]. Researchers are taking advantage of this rich source of information to study a wide range of topics, from population behavior to recovery efforts from COVID-19 [15,16]. However, findings on the local economic effects of stadium investments are mixed; some studies even suggest focusing funds elsewhere, given modest impacts and opportunity costs [17–19]. Though multiple analyses find negligible links between sports infrastructure spending and growth [20,21], others show potential for increased consumer spending, employment, and housing market uplift [22].
Machine learning is increasingly applied to spatiotemporal data across sports and other domains, such as using cell data to predict COVID-19 spread [23]. Machine learning techniques are also used to estimate attendance, optimize player health, analyze team performance, and more [24,25]. In economics, machine learning has been used to evaluate housing price spillovers from development programs [26] and regional innovation impacts [27].
Additionally, the authors of [28] used machine learning models to predict attendance demand in European football games, with features such as the performance of the home team, the performance of the visiting team, the day of the game, distance, and uncertainty of outcome as independent variables. However, no study uses mobile data and machine learning to quantify the localized spillover effects of sporting events.
Tree ensemble methods effectively handle mixed data types and capture complex relationships, making them suited to sports economics tasks [29]. In this study, we extract features such as game day and time, which exhibit intricate interactions with the target variable, the amount spent locally. Tree-based algorithms are suitable nonlinear modeling techniques that can hierarchically estimate the importance of predictor variables in classification and regression tasks [30]. These models have proven to be effective in predicting outcomes, such as visitor attendance and revenue generation, and have been applied to various sports events [31].
The flexibility of machine learning provides new means to evaluate economic trends, though applications estimating sports-related impacts remain limited [32]. In particular, little research has used machine learning techniques to estimate the economic impact of specific events, such as baseball games. This research responds to these gaps through a data science approach, uniting spatiotemporal data and sports economics.

Study Area
The Saint Louis Metropolitan Area (SLMA) is a bi-state region located at the confluence of the Missouri and Mississippi Rivers, on the border between the states of Missouri and Illinois (Figure 1). The area consists of fourteen counties, covering about 20,367 km², with a population of 2,805,617 as of 2019 [33]. Saint Louis is known as a hub for major sports markets, home to Busch Stadium, the Enterprise Center, and the new Saint Louis City Stadium. Playing out of Busch Stadium, the Saint Louis Cardinals are a professional baseball team based in Saint Louis, Missouri. They are part of the National League (NL) of Major League Baseball (MLB) and have won 11 World Series championships, the most of any NL team and second-most in MLB history [34]. The team was founded in 1882 as the Saint Louis Brown Stockings and is one of the oldest teams in American professional sports [35]. Busch Stadium has been home for the team since 2006.
Because baseball fans travel from across the St. Louis region, we used the entire SLMA as the study area to visualize the movement of people before, during, and after baseball games. This gives a comprehensive view of movement patterns. However, it is also important to study the economic impact of baseball games on the local economy specifically around the stadium, because most of the economic activity related to sports tourism happens in that area [36]. By focusing on these areas, we can gain a more accurate understanding of how the local economy is affected by the influx of visitors and fans. This study identified four key point-of-interest (POI) categories for investigation: restaurants and bars, grocery stores, hotels, and liquor stores. These locations were selected because they are frequently visited and are estimated to impact the local economy [37].
In 2019, the St. Louis Metro Area had a diverse and robust economy, with a GDP of approximately USD 152.4 billion, ranking it as the 22nd largest in the United States [38]. The region's economy was supported by key sectors including manufacturing, healthcare, and professional services, with healthcare being the largest employer [39]. Despite its economic diversity, St. Louis faced challenges such as stagnant population growth and a relatively small tech sector compared to national averages. However, the area's low cost of living and strong educational institutions provided a competitive advantage, fostering opportunities in emerging industries like biotech and fintech [38,39].

Data and Resources
This study merged cell-based GPS location data providing human-movement insights, parcel data providing location reference information, and business-level daily economic data. Mobile location data, which are gathered via GPS and other technologies built into smartphones and other mobile devices, have grown in importance as a tool for studying economic activity. These data can provide valuable information on the performance of businesses and other economic entities, as well as on consumer behavior, such as travel and purchasing trends.

Cellphone Location Data
This research uses real-world fine-grain mobility data provided by a data-as-a-service company that specializes in collecting and providing anonymized population movement data collected through GPS signals of cellphones across the U.S. We are obliged not to name the data provider due to contractual restrictions. The Global Positioning System (GPS) is a navigation system that uses a network of satellites to provide location and time information to GPS receivers on Earth. Each satellite broadcasts a signal that includes a timestamp and information about its location, which is used by GPS receivers to calculate their own position and velocity [40]. The data provider claims to provide precise polygon-based building footprints for over 6 million places in the U.S. from over 20 million devices daily, with an accuracy of up to 10 m [41].
The data used in our study are protected by robust privacy measures, including advanced anonymization techniques, data aggregation, and strict adherence to consent protocols. The data vendor provides geospatial location data by aggregating raw GPS signals from mobile devices, merged with precise place polygons, to identify visits to specific points of interest. These data are sourced from a network of third-party applications and software development kits (SDKs), and comply with privacy laws through pseudonymization and aggregation to protect individual identities. Despite these robust measures and the high credibility of the data, there remains the potential for inaccuracies due to factors such as signal interference, atmospheric conditions, and the urban built environment, which can all affect the precision of GPS-based location data.
The dataset used in this study consists of location signals from across the U.S. for the time period of 1 March 2019 to 31 December 2019 (Figure 2). This period was specifically chosen because it represents the most recent year of data available to us that is unaffected by the COVID-19 pandemic and the associated restrictions that significantly altered mobility patterns and economic activities in subsequent years. This allows for an accurate assessment of the MLB game days' economic impact under normal conditions, making the findings more relevant and generalizable. Each record in the dataset contains a unique 'caid', latitude, longitude, and timestamp, where 'caid' represents a unique identifier for each mobile device; latitude and longitude represent the geographical coordinates of the device at the time of the recording; and timestamp indicates the date and time of the record (Appendix A). The dataset provides detailed information on the movement patterns of individuals within the SLMA during the study period (Figure 2). By analyzing the data, we can identify changes in population movement and spending patterns that may be attributed to the games. These fine-grain mobility data can be used to gain insights into the behavior of individuals and businesses in the area and can assist in making data-driven decisions.

Parcel Data
Parcel data refer to information about parcels of land, such as their size, location, and ownership. These data are collected and maintained by local governments and can be used for various purposes, such as property assessment and taxation, land use planning, and environmental management. Parcel data from Lightbox include information about the boundaries and dimensions of the parcel, the type of land use, the owner's name and contact information, and any buildings or structures on the property. Parcel data can be visualized and analyzed using software such as ArcGIS Pro 2.7.0. These data are imported as a shapefile and can then be symbolized, joined with other data sources, and analyzed using built-in tools. The Lightbox parcel data are organized according to 'Use Code', which assigns a unique number to properties based on their usage, such as a 'Residential Building' or a 'Grocery Store' [43].
As a pre-processing step, we used the GeoPandas library in Python to remove any duplicate parcels with the same 'Use Code' and geometry. For a subset of parcels that had duplicated geometry with a different 'Use Code', we manually inspected the properties and assigned the appropriate 'Use Code' for further analysis. This resulted in 1,274,721 parcels with unique geometries in the study area.
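The two-stage de-duplication described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' pipeline: the column names (`parcel_id`, `use_code`, `geometry_wkt`) and the toy rows are hypothetical, and geometry is held as WKT text so that plain pandas can compare shapes. With a GeoPandas frame, a comparable key could be derived from each geometry's WKB or WKT representation.

```python
import pandas as pd

# Hypothetical parcel table; geometry is kept as WKT text so identical shapes compare equal.
parcels = pd.DataFrame({
    "parcel_id": [1, 2, 3, 4, 5],
    "use_code": [101, 101, 205, 310, 101],
    "geometry_wkt": [
        "POLYGON ((0 0, 1 0, 1 1, 0 1, 0 0))",
        "POLYGON ((0 0, 1 0, 1 1, 0 1, 0 0))",  # same shape and use code as parcel 1
        "POLYGON ((2 2, 3 2, 3 3, 2 3, 2 2))",
        "POLYGON ((2 2, 3 2, 3 3, 2 3, 2 2))",  # same shape as parcel 3, different use code
        "POLYGON ((5 5, 6 5, 6 6, 5 6, 5 5))",
    ],
})

# Stage 1: drop exact duplicates (same use code and same geometry), keeping the first.
deduped = parcels.drop_duplicates(subset=["use_code", "geometry_wkt"], keep="first")

# Stage 2: geometries that still occur more than once carry conflicting use codes;
# these are the rows that would be set aside for manual inspection.
conflict_mask = deduped.duplicated(subset="geometry_wkt", keep=False)
needs_review = deduped[conflict_mask]

print(len(deduped), "parcels kept;", len(needs_review), "flagged for manual review")
```

Keeping the conflicting rows in a separate frame, rather than resolving them automatically, mirrors the manual inspection step described in the text.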

SafeGraph Data
SafeGraph specializes in providing location data from a variety of public and private sources, ensuring that personal information is not disclosed or misused [41]. This allows for the analysis of consumer behavior and patterns without violating individuals' privacy. The data provided by SafeGraph are often utilized for market research, urban planning, and other purposes [44,45]. Additionally, the company offers spending data, which include a comprehensive set of information that reflects the monetary expenditure of individuals or businesses on specific goods or services. These data can be broken down into categories, such as retail, dining, or entertainment, and can provide valuable insights into consumer spending patterns and trends.
Our initial step was to acquire spending data for the year 2019 from SafeGraph, hosted in AWS. We then filtered the data by date, retaining only the records corresponding to the home and away games of the Saint Louis Cardinals played between Thursday, 28 March 2019, in Milwaukee against the Milwaukee Brewers, and Sunday, 29 September 2019, at Busch Stadium against the Chicago Cubs. Subsequently, we extracted the 'daily spend' data from the file, which contain information about the amount spent in a particular business each day across the entire SLMA.
While this study leverages comprehensive datasets to analyze mobility patterns, it is crucial to acknowledge a significant limitation in these data sources. Cellphone data coverage differs across demographic groups, which may affect the representativeness of the results.
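A minimal sketch of this date filter is given below, assuming the spend records and the game schedule are both available as tables keyed by date. The column names (`business_id`, `daily_spend`, `home_game`, `opponent`) and the spend amounts are hypothetical; only the two schedule entries shown are taken from the text.

```python
import pandas as pd

# Schedule rows taken from the text: season opener (away) and final game (home).
schedule = pd.DataFrame({
    "date": pd.to_datetime(["2019-03-28", "2019-09-29"]),
    "home_game": [False, True],
    "opponent": ["Milwaukee Brewers", "Chicago Cubs"],
})

# Hypothetical daily spend records; identifiers and amounts are illustrative only.
spend = pd.DataFrame({
    "business_id": ["a", "a", "b", "b"],
    "date": pd.to_datetime(["2019-03-01", "2019-03-28", "2019-07-04", "2019-09-29"]),
    "daily_spend": [120.0, 340.5, 89.9, 410.0],
})

# An inner join keeps only spend records that fall on a scheduled game date and
# attaches the game attributes (home or away, opponent) to each record.
game_day_spend = spend.merge(schedule, on="date", how="inner")

print(game_day_spend)
```

Joining on the schedule, instead of filtering on a date range alone, has the side benefit of carrying the game attributes along for later use as model features.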

Methodology
In this study, we examined the impact of baseball games on revenue in our POI by using a combination of statistical analysis and spatial visualization techniques (Figure 3). We performed Spearman's rank correlation test on the number of people pinged in the given points of interest on game days and non-game days, analyzed spend data to visualize the average USD amount spent on game days versus non-game days, and used ArcGIS Pro to analyze movements and clusters on game days compared to non-game days. Finally, we used machine learning to predict revenue for future games.

Location Analysis
To test the hypothesis that game days generate more revenue in our points of interest than non-game days, we conducted Spearman's rank correlation test to compare the number of people pinged in the given points of interest on game days and the closest non-game days. This provided us with a quantitative measure of the difference in foot traffic on these two types of days. The test was conducted using the Python library SciPy.
We collected location data of people in our selected POI for all home game days and, for each game day, chose the nearest non-game day within a week for comparison. We removed twelve days that were affected by rain. This was carried out to avoid any potential lack of samples and to ensure that our results were not skewed by external factors.
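The pairing and test described above can be sketched as follows, assuming daily visitor counts per POI are already tabulated. The counts, the `is_game_day` flags, and the column names are hypothetical; `pandas.merge_asof` is used here as one convenient way to find the nearest non-game day within seven days, and `scipy.stats.spearmanr` computes the rank correlation and its p-value.

```python
import pandas as pd
from scipy.stats import spearmanr

# Hypothetical daily visitor counts for one POI category (illustrative numbers only).
daily = pd.DataFrame({
    "date": pd.date_range("2019-05-01", periods=14, freq="D"),
    "visitors": [210, 340, 205, 198, 360, 371, 190, 215, 352, 207, 201, 365, 349, 196],
    "is_game_day": [False, True, False, False, True, True, False,
                    False, True, False, False, True, True, False],
})

game = daily.loc[daily["is_game_day"], ["date", "visitors"]].sort_values("date")
non_game = daily.loc[~daily["is_game_day"], ["date", "visitors"]].sort_values("date")
# Keep a copy of the non-game date, since the join key column is shared after merging.
non_game = non_game.assign(non_game_date=non_game["date"])

# For each game day, take the nearest non-game day that lies within one week.
pairs = pd.merge_asof(
    game, non_game, on="date", direction="nearest",
    tolerance=pd.Timedelta(days=7), suffixes=("_game", "_non_game"),
).dropna(subset=["visitors_non_game"])

# Spearman's rank correlation between the paired game-day and non-game-day counts.
rho, p_value = spearmanr(pairs["visitors_game"], pairs["visitors_non_game"])
print(f"rho = {rho:.3f}, p = {p_value:.3f}")
```

Rain-affected days would be dropped from `daily` before pairing; that step is omitted here because it depends on an external weather record.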

Spatial Statistics
In this research, we used spatial statistics techniques such as density-based clustering and hotspot analysis, supported by ArcGIS Pro, to analyze mobility data from the SLMA and gain insights into spatial patterns, processes, and relationships. These techniques allowed us to visualize spatial patterns and understand underlying processes, as spatial statistics is an interdisciplinary field that uses concepts and methods from statistics, geography, computer science, and other fields [46,47]. By utilizing these tools and techniques, we took full advantage of the spatial component of our data and gained a deeper understanding of the patterns and relationships present in our study area (Figure 4).
Using GeoPandas and the h3 package in Python [48], we tracked individuals in key locations like residential areas, restaurants and bars, hotels, liquor stores, and the Cardinals' stadium (Figure 4). To avoid clutter and noise in the visualization, we removed other data points such as people in transit whose GPS pings were located on roads. We then used polylines to connect the first and last location of unique individuals and employed the shapely package, on which GeoPandas builds, to create arcs to visualize any return movements. To further aid in visualization, we aggregated the movements within 6-mile hexagons and calculated the flow intensity of individuals moving between these hexagons. The resulting map generated in ArcGIS Pro revealed that the intensity of people moving towards the city was higher before the game and that this intensity decreased as people moved away from the city area after the game. While the two maps may appear similar at first glance, a closer examination reveals that the flow intensity increases around specific areas, particularly near the stadium, after the game. This post-game clustering indicates a significant movement of people towards local businesses, which corresponds to the increased economic activity observed in those areas. The increased flow intensity around the stadium and key points of interest post-game suggests a boost in local spending.
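The flow aggregation described above can be sketched in plain pandas, assuming each ping has already been assigned to a hexagon. The `cell` labels below are hypothetical stand-ins for hexagon indexes (the assignment itself, for example with the h3 package, is not shown), and the timestamps are illustrative; `caid` is the device identifier named in the dataset description.

```python
import pandas as pd

# Hypothetical pings with a precomputed hexagon label per record.
pings = pd.DataFrame({
    "caid": ["d1", "d1", "d1", "d2", "d2", "d3", "d3"],
    "timestamp": pd.to_datetime([
        "2019-05-12 10:00", "2019-05-12 12:30", "2019-05-12 16:00",
        "2019-05-12 09:45", "2019-05-12 15:20",
        "2019-05-12 11:00", "2019-05-12 17:10",
    ]),
    "cell": ["hexA", "hexB", "hexC", "hexA", "hexC", "hexB", "hexB"],
})

# First and last observed hexagon for each device, in time order.
pings = pings.sort_values(["caid", "timestamp"])
trips = (
    pings.groupby("caid")["cell"]
    .agg(origin="first", destination="last")
    .reset_index()
)

# Devices that end in the hexagon where they started contribute no flow line.
moved = trips[trips["origin"] != trips["destination"]]

# Flow intensity is the number of devices moving between each hexagon pair.
flows = (
    moved.groupby(["origin", "destination"])
    .size()
    .reset_index(name="flow_intensity")
)
print(flows)
```

Splitting `pings` into before-game and after-game windows and running the same aggregation on each would give the two flow maps compared in the text.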

HDBSCAN
We used Hierarchical Density-Based Spatial Clustering of Applications with Noise (HDBSCAN), a density-based clustering algorithm, to identify clusters in our mobility data. Density-based clustering algorithms have the advantage of identifying clusters of varied sizes and shapes in a dataset, as opposed to other clustering methods that assume clusters have a regular shape and require the user to specify the number of clusters [49,50]. HDBSCAN uses a hierarchical approach to identify clusters based on the density of points in the data, making it robust and efficient in handling data with noise or varying density. HDBSCAN follows a four-step process to create a cluster. The first step involves estimating the density of points by calculating their distance to the kth nearest neighbor, also known as the core distance. HDBSCAN then uses a new distance metric, known as the mutual reachability distance, to identify low-density points or noise. The mutual reachability distance between two points is calculated with Equation (1):
$$d_{\mathrm{mreach}\text{-}k}(a, b) = \max\{\mathrm{core}_k(a),\ \mathrm{core}_k(b),\ d(a, b)\} \tag{1}$$
Equation (1) considers the original metric distance d(a, b) between two points a and b and the core distance, or density estimate, core_k of each point for a given parameter k. In the second step, the mutual reachability distance is used to generate a minimum spanning tree that identifies connected points in dense regions. The third step is crucial and involves pruning the tree by comparing the total number of points in a branch to the minimum cluster size. Finally, the stability of each resulting cluster in the pruned spanning tree is determined in the last step by applying Equation (2):
$$S_{\mathrm{cluster}} = \sum_{p \in \mathrm{cluster}} \left(\lambda_p - \lambda_{\mathrm{birth}}\right) \tag{2}$$
The stability of each cluster is calculated using two thresholds, namely λ_birth and λ_p. λ_birth is the threshold value at which a cluster splits off and becomes its own cluster, while λ_p is the threshold value at which a point p in a given cluster falls out of the cluster. The stability of each cluster is then used to determine which clusters are included in the final set of clusters.
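To make the first two steps concrete, the core distance and the mutual reachability distance can be computed directly with NumPy. The following is a small sketch, not the ArcGIS Pro implementation: the function name `mutual_reachability`, the toy coordinates, and the convention that the kth nearest neighbor excludes the point itself are assumptions made for illustration.

```python
import numpy as np

def mutual_reachability(points, k):
    """Return core distances and the mutual reachability matrix for Euclidean points."""
    diff = points[:, None, :] - points[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    # Core distance: distance to the k-th nearest neighbour. Column 0 of each sorted
    # row is the point itself (distance 0), so column k is the k-th other point.
    core = np.sort(dist, axis=1)[:, k]
    # Mutual reachability: the largest of the two core distances and the raw distance.
    mreach = np.maximum(dist, np.maximum(core[:, None], core[None, :]))
    np.fill_diagonal(mreach, 0.0)
    return core, mreach

# Three nearby points and one isolated point (hypothetical coordinates).
points = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [10.0, 10.0]])
core, mreach = mutual_reachability(points, k=2)

print("core distances:", np.round(core, 3))
print("mutual reachability matrix:")
print(np.round(mreach, 3))
```

The isolated point has a large core distance, so every pair involving it is pushed far apart; this is how sparse points end up separated from dense regions before the spanning tree is built.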
Additionally, HDBSCAN has several parameters that can be adjusted to control the algorithm's sensitivity and improve its performance on different datasets [51]. For our study, we chose HDBSCAN rather than multi-scale OPTICS or DBSCAN, as provided in the spatial statistics tools of ArcGIS Pro. We made this choice because of its ability to handle the complexity of our large-scale data and its flexibility in identifying clusters of varying sizes and shapes without requiring a distance parameter [52].
We utilized HDBSCAN to identify statistically significant clusters of individuals in key locations during the game day of the Saint Louis Cardinals vs. the Pittsburgh Pirates on 12 May 2019. For visualization in our research, we selected a minimum of 25 features to be considered a crowd in our points of interest, which means that the algorithm only considers groups of at least 25 points as valid clusters. Points that do not belong to a cluster of at least 25 points are labeled as noise. We chose this number because it represents a reasonable threshold for defining a group of people that can be visually identified and distinguished from other individuals in the area.

Getis-Ord Gi* Clustering
Similarly, we utilized Getis-Ord Gi* clustering, also known as hotspot analysis, as a statistical technique to identify clusters of high or low values in our mobility data on the same game day. The Gi* statistic is a z-score that measures local spatial association [53]. Statistically significant positive values indicate that high values of the variable cluster together (hot spots), while statistically significant negative values indicate that low values cluster together (cold spots).
The Getis-Ord local statistic is given as Equation (3):
$$G_i^* = \frac{\sum_{j=1}^{n} w_{i,j} x_j - \bar{X} \sum_{j=1}^{n} w_{i,j}}{S \sqrt{\dfrac{n \sum_{j=1}^{n} w_{i,j}^2 - \left(\sum_{j=1}^{n} w_{i,j}\right)^2}{n - 1}}} \tag{3}$$
where the weight between features i and j is denoted by w_{i,j}, the attribute value for feature j is represented by x_j, n denotes the total number of features in the dataset, and
$$\bar{X} = \frac{\sum_{j=1}^{n} x_j}{n}, \qquad S = \sqrt{\frac{\sum_{j=1}^{n} x_j^2}{n} - \bar{X}^2}$$
This method is particularly useful in identifying regions where a certain occurrence is exceptionally concentrated or scattered [54]. The Getis-Ord Gi* statistic is a valuable tool for understanding the spatial distribution of a certain phenomenon and how it relates to other variables in the dataset. It allowed us to locate clusters of high and low values in our mobility data and helped us to understand the spatial distribution of mobility patterns in our POI before and after a baseball game.
We conducted a hotspot analysis to identify significant clusters of people in key locations during the Saint Louis Cardinals vs. Pittsburgh Pirates game on 12 May 2019, with a focus on the Saint Louis City area.
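As an illustration of how the Getis-Ord local statistic is evaluated, the sketch below computes it with NumPy in its standard z-score form for a toy row of seven cells. The function name `getis_ord_gi_star`, the visitor counts, and the binary neighborhood weights (each cell, its immediate neighbors, and itself) are assumptions for illustration; ArcGIS Pro offers several other ways to conceptualize spatial relationships.

```python
import numpy as np

def getis_ord_gi_star(x, w):
    """Gi* z-score per feature, given attribute values x and a weight matrix w."""
    x = np.asarray(x, dtype=float)
    w = np.asarray(w, dtype=float)
    n = x.size
    mean = x.mean()
    s = np.sqrt((x ** 2).mean() - mean ** 2)
    w_sum = w.sum(axis=1)            # sum of weights for each feature i
    w_sq_sum = (w ** 2).sum(axis=1)  # sum of squared weights for each feature i
    numerator = w @ x - mean * w_sum
    denominator = s * np.sqrt((n * w_sq_sum - w_sum ** 2) / (n - 1))
    return numerator / denominator

# Hypothetical visitor counts along a row of seven cells; the last two are busy.
counts = np.array([2, 1, 2, 1, 2, 9, 10])

# Binary weights: a cell is a neighbour of itself and of the adjacent cells.
n = counts.size
idx = np.arange(n)
weights = (np.abs(idx[:, None] - idx[None, :]) <= 1).astype(float)

z = getis_ord_gi_star(counts, weights)
print(np.round(z, 2))
```

Cells whose neighborhood sum is well above what the global mean would predict receive large positive scores, which is what the hotspot map displays as hot spots.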

Machine Learning
This research investigated the use of tree-based machine learning techniques in economics and their potential to improve our understanding of the economy. Here, we tested three tree-based machine learning models (decision trees, random forests, and gradient boosting), which are known for their ability to handle complex interactions between features and for their robustness to outliers and missing values [52]. Considering these properties, and based on the characteristics of our dataset, we decided to focus specifically on tree-based models in this study, disregarding other models such as linear regression and support vector regression.
Our analysis involves the use of tree-based models to analyze substantial amounts of economic data and identify trends or patterns that can be used to make more informed decisions. The results of this research will be used to evaluate the effectiveness of the tree-based models in identifying patterns in economic data.
In addition, we analyzed the impact of spatial data on the performance of machine learning models. To accomplish this, we conducted two experiments using the same datasets. In the first experiment, we included the latitude and longitude data in our dataset and trained a machine learning model using these data. In the second experiment, we removed the latitude and longitude data from our dataset and trained the same model. The results of the two experiments were then compared to assess the significance of spatial data in the model's performance. This approach allowed us to draw conclusions on the importance of spatial data in such research and to understand the potential impact of including or excluding spatial information on the performance of machine learning models in economic studies.
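The two-experiment design described above can be sketched as a small feature-ablation routine. The sketch below is a hedged illustration, not the study's pipeline: `fit_group_means` is a hypothetical stand-in model that predicts the mean spend for each combination of categorical feature values, and the column names (`lat`, `lon`, `spend`) are assumptions. Any regressor, including the tree-based models used in this study, could be substituted for the stand-in.

```python
from collections import defaultdict


def fit_group_means(rows, target, features):
    """Stand-in model: mean target per unique combination of feature values."""
    totals = defaultdict(lambda: [0.0, 0])
    for row in rows:
        key = tuple(row[f] for f in features)
        totals[key][0] += row[target]
        totals[key][1] += 1
    overall = sum(row[target] for row in rows) / len(rows)
    means = {key: total / count for key, (total, count) in totals.items()}
    # Unseen combinations fall back to the overall training mean.
    return lambda row: means.get(tuple(row[f] for f in features), overall)


def r_squared(actual, predicted):
    """Coefficient of determination; assumes the actual values vary."""
    mean = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot


def compare_feature_sets(train, test, target, features, spatial_features):
    """Fit the same model with and without the spatial columns."""
    without = [f for f in features if f not in spatial_features]
    results = {}
    for label, cols in (("with_spatial", features), ("without_spatial", without)):
        predict = fit_group_means(train, target, cols)
        predictions = [predict(row) for row in test]
        results[label] = r_squared([row[target] for row in test], predictions)
    return results
```

Holding the model and the train/test rows fixed while changing only the feature list is what allows any difference in the scores to be attributed to the spatial columns.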
We predicted the amount spent by individuals at local businesses in the SLMA, using features related to the Cardinals' baseball games in 2019. The features utilized in our model included the business type, the longitude and latitude of the business, brands associated with the business, whether the game was a home or away game for the Cardinals, the opponent team, the time of the game, whether the game was played during the day or at night, and the day and month of the game. The outcome variable of interest was the amount spent at each business, derived from SafeGraph (Table 1). To examine the presence of non-linear relationships among the input features, we conducted a Pearson's correlation t-test. Based on our findings, we transformed all input features into categorical formats, while retaining the numerical format of the outcome variable "Amount Spent". To ensure that the machine learning algorithms were properly trained, we incorporated transaction data from all businesses within the SLMA, in addition to our POI, which increased the sample size to 329,249 observations. Upon examination of the data, 103 observations were found to have a value of less than USD 1 in the "Amount Spent" variable (Table 1). These values were deemed errors in data collection or measurement that do not reflect a realistic spending amount, so the observations were removed from the dataset before proceeding with the analysis. This filtering process resulted in 329,146 transaction data points, which were later used in our machine learning algorithms.
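The cleaning and encoding steps described above can be sketched as follows. This is a minimal illustration under stated assumptions: the record layout, the field name `amount_spent`, and the helper names `drop_implausible_spend` and `encode_categorical` are hypothetical, and integer codes are only one of several possible categorical encodings, since the text does not specify which encoding was used.

```python
def drop_implausible_spend(records, field="amount_spent", min_spend=1.0):
    """Keep rows with spend of at least min_spend; also report how many were dropped."""
    kept = [row for row in records if row[field] >= min_spend]
    return kept, len(records) - len(kept)


def encode_categorical(records, feature_names):
    """Replace each feature value with an integer code, one codebook per feature."""
    codebooks = {name: {} for name in feature_names}
    encoded = []
    for row in records:
        new_row = dict(row)  # copy so the input records are left unchanged
        for name in feature_names:
            book = codebooks[name]
            # First occurrence of a value gets the next unused code.
            new_row[name] = book.setdefault(row[name], len(book))
        encoded.append(new_row)
    return encoded, codebooks
```

Applied to the counts reported above, dropping 103 of the 329,249 observations leaves the 329,146 rows used for modeling.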

Decision Tree
Decision Tree is a machine learning algorithm that uses a tree-like model to make decisions by breaking down a dataset into smaller subsets based on the values of the features. It is commonly used for both classification and regression tasks due to its simplicity, interpretability, and ability to handle different types of data [55]. In our study, we used a Decision Tree algorithm to predict the amount spent in local businesses using various categorical features related to baseball games, such as game day, opponent, and other relevant variables. The algorithm was trained on a dataset of historical spending patterns and baseball game information to generate predictions of local business expenditure based on the input game-related variables.

Random Forest
A Random Forest Regression model was also used to predict the amount spent by individuals in our POI during a baseball game. Random Forest Regression is a type of ensemble machine learning model composed of multiple decision trees [56]. The use of multiple decision trees in a random forest model allows the model to make more accurate predictions by averaging out the errors made by the individual trees.
One of the advantages of using a Random Forest Regression model is its ability to handle complex interactions between features [57]. In our study, we used features such as the opponents, time of the game, day of the game, date of the game, and type of businesses that consumers visit before and after the game, all of which can have complex relationships with the amount spent in our POI. By creating many decision trees on different subsets of the data and combining their results, a random forest can reduce overfitting and improve generalization [58].

XGBoost
Additionally, the Extreme Gradient Boosting (XGBoost) algorithm was tested as another method for prediction. XGBoost is an optimized implementation of the gradient boosting algorithm. It is an ensemble learning method, meaning it combines multiple decision trees to improve the model's overall performance. XGBoost is known for its ability to handle large, high-dimensional datasets and to limit overfitting [59]. By utilizing the XGBoost algorithm, we were able to improve the generalization performance of our model and identify key features affecting the amount spent in our POI during a baseball game. The results of our study demonstrate the value of using XGBoost in addition to Decision Tree and Random Forest to limit overfitting and improve overall prediction accuracy.

Model Evaluation
In this study, we evaluated the performance of our algorithms using several metrics, including the coefficient of determination (R²) and the root mean squared error (RMSE). To account for the right-skewed distribution of the spend data, we used RMSE instead of relative root mean squared error (RRMSE) to evaluate the model, as RRMSE is more sensitive to large errors, which are more likely to occur in right-skewed data [60]. The RMSE is calculated as the square root of the average of the squared differences between the predicted and actual values, and it is a measure of the overall accuracy of the model. A smaller RMSE value indicates a higher accuracy of the model. The R² coefficient, on the other hand, represents the proportion of variation in the responses that is explained by the model using predictor values from the test data. A higher R² value indicates closer agreement between the predicted and actual values.
We used the scikit-learn library in Python to evaluate our algorithms. The library provides several functions for model evaluation, including mean_squared_error, from which the RMSE is obtained, and r2_score for calculating the R² coefficient. Finally, we used randomized data partitioning of 70% training and 30% testing data to evaluate our models.
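For reference, the two metrics and the 70/30 partition can be written out in a few lines of plain Python. This sketch mirrors the standard definitions rather than reproducing the scikit-learn calls used in the study; the helper names `split_70_30`, `rmse`, and `r_squared`, as well as the fixed random seed, are assumptions introduced here.

```python
import math
import random


def split_70_30(rows, seed=0):
    """Shuffle a copy of the rows and cut it into 70% train / 30% test."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = int(round(0.7 * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]


def rmse(actual, predicted):
    """Square root of the mean squared difference between the two series."""
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def r_squared(actual, predicted):
    """1 - SS_res / SS_tot; assumes the actual values are not all identical."""
    mean = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot
```

Because the RMSE is expressed in the units of the outcome variable (USD here) while R² is unitless, reporting both gives an absolute and a relative view of model fit.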

Location Analysis
The results of our analysis are represented in a graph (Figure 5), with the p-value on the y-axis and the time of day on the x-axis. These data were also separated by significance level, with p-values of less than 0.05 being considered statistically significant. The t-test and the separation of data by significance level allowed us to identify specific time periods during which the population distribution was significantly different between game and non-game days. The results of the Location Analysis showed that, during game days, there were statistically significant differences in the population distribution at certain time periods in restaurants and bars (8 AM-12 AM), grocery stores (7 AM-10 PM), and hotels (6 AM-12 PM). This suggests that there is a discernible pattern in visitation habits at these businesses. In contrast, no significant differences were observed in liquor stores in this study.
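The hour-by-hour comparison behind Figure 5 can be illustrated with a two-sample t statistic computed separately for each hour. The sketch below assumes Welch's unequal-variance form, which the text does not specify, and the function names and input layout (a mapping from hour to a list of daily visitor counts) are hypothetical. It returns the t statistic and degrees of freedom; converting these to a p-value would require the t distribution from a statistics library.

```python
import math
from statistics import mean, variance


def welch_t(sample_a, sample_b):
    """Welch's t statistic and approximate degrees of freedom.

    Each sample needs at least two observations, and the samples must not
    both be constant (otherwise the standard error is zero).
    """
    n_a, n_b = len(sample_a), len(sample_b)
    se_a = variance(sample_a) / n_a  # squared standard error of mean A
    se_b = variance(sample_b) / n_b  # squared standard error of mean B
    t_stat = (mean(sample_a) - mean(sample_b)) / math.sqrt(se_a + se_b)
    dof = (se_a + se_b) ** 2 / (se_a ** 2 / (n_a - 1) + se_b ** 2 / (n_b - 1))
    return t_stat, dof


def hourly_t_tests(game_counts, non_game_counts):
    """Map each hour present in both inputs to its (t, dof) pair."""
    return {
        hour: welch_t(game_counts[hour], non_game_counts[hour])
        for hour in game_counts
        if hour in non_game_counts
    }
```

Hours whose resulting p-value falls below 0.05 would correspond to the shaded, statistically significant time frames in the figure.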

HDBSCAN Clustering
In this study, we utilized HDBSCAN, a density-based clustering algorithm, to analyze the movement patterns of individuals before and after the baseball game. The data were collected for a period of three hours before and three hours after the game to observe any changes in the movement patterns of individuals (Figure 6). The objective was to identify significant clusters of people in our POI and potential areas of high foot traffic for businesses in the surrounding areas. Our analysis revealed statistically significant clusters of people moving before and after the game, indicating a discernible pattern in their behavior (Figure 6). The figure also illustrates an increase in clusters of people in our POI and throughout Saint Louis City after the game.

Getis-Ord GI* Clustering
The figure presented illustrates the change in the density of individuals before and after a baseball game on 12 May 2019 (Figure 7). The same data collected for a period of three hours prior to and three hours after the game began were used to observe changes in movement patterns. The Getis-Ord GI* algorithm was utilized to identify statistically significant clusters of dense kernels, specifically, individuals congregating around the POI. The results of the Getis-Ord GI* clustering for the SLMA and city showed that the baseball games are attracting people to local businesses such as restaurants, bars, hotels, and grocery stores (Figure 7). This was demonstrated by the marked intensification of hotspot clusters around the stadium between the pre-game and post-game heatmaps. Before the game, the clustering analysis indicated that people were present at various POI across the city, with some areas having a higher concentration of individuals. However, after the game, the heatmap revealed a marked increase in the number of people near the stadium. These findings suggest a significant increase in the movement of individuals around these POI after the start of the baseball game.

Machine Learning
The results of the analysis showed that the XGBoost model had the highest R² value of 0.80, followed by Random Forest (0.79) and Decision Tree (0.55). The R² value indicates the proportion of variation in the amount spent at local businesses in Saint Louis that can be explained by the features used in the model (Table 2). Additionally, the root mean squared error (RMSE) values for the three models were 334.0 for Random Forest, 403.71 for Decision Tree, and 297.58 for XGBoost (Table 2). The RMSE value measures the difference between the predicted values and the actual values, with lower values indicating a better fit. The results also showed that removing the longitude and latitude features from the model resulted in a decrease in the R² values for all three models, with the lowest R² value of 0.39 for Decision Tree. The RMSE values also increased, with the highest value of 595.07 for Decision Tree.
The feature importance analysis within our Random Forest model highlights the significant role of geographic coordinates, with longitude (0.35) and latitude (0.32) as key predictors in determining the economic impact of MLB games on local businesses. This analysis suggests that longitude, in particular, serves as a critical indicator due to its representation of the east-west positioning of businesses within the broad geographic landscape of Saint Louis (Figure 8). Such positioning may influence consumer behavior and spending patterns, given the city's layout and the distribution of commercial areas.

Discussion
To address the first research question of whether game days generate more foot traffic and revenue in our POI compared to non-game days, we conducted an analysis of the location data. Specifically, we performed Spearman's statistical test to compare the number of people who were detected within our POI on game days versus the closest non-game days. The results of our analysis were consistent with prior research, indicating that individuals are more likely to visit our POI while games are taking place [61]. This finding suggests that game days may play an important role in driving both foot traffic and revenue for our POI. We found that people tend to disperse randomly before the game, but after the game, they were more likely to be clustered around the downtown area, which is closer to the stadium. Additionally, the spillover effects of the game were visually evident as they expanded outwards from the stadium and into various POI throughout the SLMA, particularly in the central region. The correlations between game days and increased economic activity are evident from our analysis, but pinpointing causality requires more advanced methods. Future research should consider employing synthetic controls or similar approaches to conclusively determine the causal impact of sporting events.
Second, we utilized the SafeGraph spend data to investigate changes in revenue during game days compared to the closest non-game day, which showed a surge in revenue in our study area. To gain a deeper understanding of this phenomenon, we compared the total revenue generated in our POI on game days with that of the nearest non-game day within a week, choosing the nearest day to avoid overlap with other game days. Our analysis showed that restaurants and bars had a 38% increase in revenue, hotels had an 8% increase, grocery stores had a 37% increase, and liquor stores had a 34% increase in revenue (Table 3). To further investigate the trend in the downtown area, we focused on a 4 km radius around the stadium. Our analysis revealed a 4% increase in revenue for restaurants and bars, a 3% increase for hotels, a 6% increase for grocery stores, and a significant 27% increase for liquor stores (Table 3). It is important to note that other confounding factors, such as consumer behavior, marketing strategies, or seasonal trends, may have influenced the observed increase in revenue at the various POI during game days. Additionally, it should be emphasized that we conducted a direct comparison of the total amount spent in our POI during game days and the closest non-game days, rather than a day-to-day comparison. This distinction is important because it allows us to assess the overall spending patterns of participants rather than just their spending on individual days, which could be subject to fluctuations and outliers. It should also be noted that the 4 km radius surrounding the stadium had a significantly lower number of POI compared to the entire SLMA (Table 3). This may have contributed to the smaller increase in revenue observed within the 4 km radius compared to the Metro Area, as there are likely more opportunities for spending outside of the immediate vicinity of the stadium.
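The comparison logic described above, pairing each game day with its nearest non-game day within a week and expressing the difference as a percentage, can be sketched as follows. The function names, the tie-breaking rule (the earlier day is checked first), and the dates in the usage example are assumptions made for illustration; the text does not state how ties were resolved.

```python
from datetime import date, timedelta


def nearest_non_game_day(game_day, game_days, max_gap_days=7):
    """Closest date within max_gap_days that is not itself a game day."""
    for gap in range(1, max_gap_days + 1):
        # Assumption: at equal distance, the earlier day is preferred.
        for candidate in (game_day - timedelta(days=gap),
                          game_day + timedelta(days=gap)):
            if candidate not in game_days:
                return candidate
    return None  # every day within the window hosts a game


def percent_change(game_revenue, baseline_revenue):
    """Relative change of game-day revenue over the baseline, in percent."""
    return 100.0 * (game_revenue - baseline_revenue) / baseline_revenue


# Hypothetical three-game home stand; not taken from the actual schedule.
home_stand = {date(2019, 7, 5), date(2019, 7, 6), date(2019, 7, 7)}
baseline_day = nearest_non_game_day(date(2019, 7, 7), home_stand)
```

Under this sketch, a category earning 138 on a game day against 100 on its baseline day would register a 38% increase, the same form as the figures reported in Table 3.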
Our study also compared revenue with respect to home games and opposing teams to identify any significant differences (Figure 9). The results showed that the Milwaukee Brewers and Pittsburgh Pirates consistently generated higher revenue in our POI than other teams. To further validate this finding, we analyzed the total revenue generated in the local POI while the Saint Louis Cardinals played away games in Milwaukee and Pittsburgh. There was a similar increase in revenue compared to other teams, suggesting that the Saint Louis Cardinals have a significant impact on local businesses during games against Milwaukee and Pittsburgh. Conversely, the Oakland Athletics had the lowest revenue generated in our POI (Figure 9). We also conducted the same analysis within the 4 km radius of the stadium and found consistent results with the larger study area. It is important to acknowledge that the transaction data analyzed in this study are not comprehensive and do not represent all POI within the study area. Rather, the data were solely derived from SafeGraph and, therefore, may not capture the complete range of economic activities in the study area.
The results from the spatial analysis revealed that clusters and hotspots of people tend to concentrate more around the stadium, indicating that baseball games may have a positive impact on the local city economy by driving more people to that area. This finding is consistent with previous research in the literature review, which suggests that sports events can bring economic and social benefits to local communities [22,62]. The clustering of people in the areas near the stadium is significant because it implies that baseball games are attracting more visitors, which can result in increased revenue for local businesses and create job opportunities for the community. This finding is consistent with studies that have shown a positive relationship between sports events and local economic development.
In addition to our previous analysis, we tracked the location of 336 individuals who attended the Saint Louis Cardinals vs. Pittsburgh Pirates game on 12 May 2019. We captured the individuals whose first location was at the Cardinals' Busch Stadium during the game time and traced their location 3 h after the game. This filter made it clear that the majority of the visitors are from outside of the Saint Louis City area, coming from all around the SLMA to attend the game (Figure 10). Nearly 80% of the tracked attendees resided outside of the host city, and 58% returned home after the game while a minority visited local establishments, which nonetheless highlights spillover effects that draw metro-wide patronage (Figure 10).
Comparing machine learning models for predicting event-induced spending that included spatial attributes versus those that excluded such location data showed significant differences; incorporating longitude and latitude information improved predictive accuracy. Specifically, XGBoost slightly outperformed alternatives like Random Forest. Attempts at tuning model hyperparameters marginally lifted performance over baseline models but introduced overfitting risks. Overall, the geospatial analysis underscores baseball games' economic influence in widening foot traffic across the broader region, while machine learning validates the potency of location-based features. Although this study primarily examines in-sample data, the techniques used, especially the machine learning models, offer a promising avenue for out-of-sample analysis, such as forecasting visits to local businesses before or after games. Further research could validate and refine these models for broader predictive use, ensuring that they perform well across different datasets and contexts.
While we have identified significant spending shifts on game days, it is essential to consider the broader economic context, such as the displacement of regular economic activities and the sustainability of relying on such events for economic boosts. Future studies could explore the long-term economic benefits versus potential risks, such as increased public spending on infrastructure and the volatility of sports-related revenue.

Conclusions
In conclusion, this research presents an innovative framework that combines granular human mobility data, spatial analytics, and machine learning methodologies to quantify the economic impact of baseball games on local economies. By posing clear research questions, we explored how MLB games affect local economic activity and how mobility data can be leveraged to enhance our understanding of this impact. Our study revealed significant spending patterns during game days near Busch Stadium, with restaurants, bars, grocery stores, and liquor stores experiencing notable revenue increases. Teams like the Milwaukee Brewers and the Pittsburgh Pirates generated consistently higher economic activity, highlighting the importance of team selection in maximizing local benefits. The integration of mobility data into machine learning models provided a more comprehensive view of the spillover effects, capturing both spatial dynamics and financial impacts on businesses. The findings provide actionable recommendations for urban planners and policymakers, demonstrating how these methodologies can inform decisions such as optimizing game schedules and team matchups to enhance revenue generation. Our analysis underscores the critical role of geospatial technology in economic research, demonstrating the potential of mobility data to predict local spending and guide strategic decision making. Our findings are specific to the studied location and may not be generalizable across different regions; therefore, future studies should explore the variability in economic impacts across states or cities with varying demographics and team characteristics.

Figure 1. Map of the study area. The map also details the bi-state counties in the Saint Louis Metropolitan Area separated by the Mississippi River, with a focus on Saint Louis City.

Figure 2. Comparison of visitors' count on (a) game days and (b) non-game days in SLMA. These numbers reflect the amount of time tracked by the data provider, which does not account for 100% of everyone's location. The figure shows that the number of visitors remained stable during the game time and gradually decreased after the game ended.
We counted the number of visitors in our POI and the nearby Ballpark Village to track visitor foot traffic, as shown in Figure 2. Data from the Saint Louis Cardinals vs. Pittsburgh Pirates game on 12 May 2019, which had the highest recorded attendance of 48,555 at the Cardinals Stadium, were used for analysis [42]. The figure indicates that there were more visitors in our POI on 12 May compared to a non-game day on 19 May. The number of visitors remained stable during the game time and gradually decreased after the game ended.

Figure 4. Movement of individuals three hours before (a) and after (b) a baseball game, where the map shows the intensity of people in the city area. The dark arrow indicates a higher number of people, while the light arrow represents fewer people moving around the area. It is evident from the figure that the intensity of people increases in the city area after the game.

Figure 5. Comparison of foot traffic on game days and non-game days at points of interest. The figure shows the results of the t-test conducted on the number of individuals located in (a) restaurants and bars, (b) grocery stores, (c) hotels, and (d) liquor stores on game days and non-game days. The shaded region represents statistically significant time frames. We noticed statistically significant differences in restaurants and bars (8 AM-12 AM), grocery stores (7 AM-10 PM), and hotels (6 AM-12 PM).

Figure 6. HDBSCAN clustering results for the Saint Louis Metro Area, with a focus on St. Louis City, three hours before and after the game: (a,b) show the spatial distribution of clusters before the game and (c,d) show the clustering results after the game. The maps reveal that more clusters form in central SLMA and around the stadium after the game.

Figure 7. Hotspot analysis of foot traffic in our POI before and after the game, with (a,b) representing three hours before the game and (c,d) representing three hours after the game. The color-coded maps indicate areas with high (red) and low (purple) foot traffic density.

Figure 8. Random Forest feature importance scores. D/N refers to a day or night game.

Figure 9. Comparison of total revenue generated by teams in our POI for home and away games. This chart captures only a subset of the spending at the points of interest: it is based on SafeGraph data and does not represent 100% of the spending activity.

Figure 10. Outflow of people from the stadium after a game. The majority of visitors come from outside Saint Louis City, traveling from across the Saint Louis Metro Area to attend the game.

Table 1. Descriptive statistics of SafeGraph spend data.

Table 2. Results of the machine learning models. * indicates the best model result for each variable.

Table 3. Percentage change in revenue between game days and non-game days.
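As a reading aid for Table 3, the following minimal sketch shows one common way to compute a percentage change in revenue, taking non-game days as the baseline. This baseline is an assumption, since the caption does not define the calculation. The function name `pct_change`, the business categories, and the revenue figures are all hypothetical.

```python
def pct_change(game_day, non_game_day):
    """Percentage change of game-day revenue relative to a non-game-day baseline."""
    if non_game_day == 0:
        raise ValueError("baseline revenue must be nonzero")
    return (game_day - non_game_day) / non_game_day * 100.0


# Hypothetical revenue per category: (game day, non-game day).
revenue = {
    "restaurants_bars": (1250.0, 1000.0),
    "grocery": (950.0, 1000.0),
}

changes = {
    category: round(pct_change(game, baseline), 1)
    for category, (game, baseline) in revenue.items()
}
```

A positive value means revenue was higher on game days than on non-game days, and a negative value means it was lower.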