Using Spatial Scan Statistics and Geographic Information Systems to Detect Monthly Human Mobility Clusters and Analyze Cluster Area Characteristics

Introduction: This study evaluated the detection of monthly human mobility clusters and characteristics of cluster areas before the coronavirus disease 2019 (COVID-19) outbreak using spatial epidemiological methods, namely, spatial scan statistics and geographic information systems (GIS). Methods: The research area covers approximately 10.3 km², with a population of about 350,000 people. Analysis was conducted using open data, with the exception of one dataset. Human mobility and population data were used on a 1-km mesh scale, and business location data were used to examine the area characteristics. Data from January to December 2019 were utilized to detect human mobility clusters before the COVID-19 pandemic. Spatial scan statistics were performed using SaTScan to calculate relative risk (RR). The detected clusters and other data were visualized in QGIS to explore the features of the cluster areas. Results: Spatial scan statistics identified 33 clusters. The detailed analysis focused on clusters with an RR exceeding 1.5. Meshes with an RR over 1.5 included one with clusters in all 12 months of the year, one with clusters for 9 months, three with clusters for 6 months, three with clusters for 3 months, and four with clusters for 1 month. September had the highest number of clusters (eight), followed by April and November (seven each). The remaining months had five or six clusters. Characteristically, the cluster areas included the vicinity of railway stations, densely populated business areas, ball game fields, and large-scale construction sites. Conclusions: Statistical analysis of human mobility clusters using open data and open-source tools is crucial for the advancement of evidence-based policymaking based on scientific facts, not only for novel infectious diseases but also for existing ones, such as influenza.


Introduction
Coronavirus disease 2019 (COVID-19) is an infectious disease that has affected over 760 million people worldwide, resulting in more than 6.8 million deaths (1). Various strategies have been employed to combat the COVID-19 pandemic, including nonpharmaceutical interventions (NPIs), such as effective communication strategies and governmental support; strict measures, such as lockdowns; and pharmaceutical interventions, such as vaccines and antiviral drugs (2). Since the early stages of the pandemic outbreak, NPIs delayed the spread of the infection as effectively as strict solutions. NPIs typically involve measures such as social distancing and cancellation of small-scale gatherings (3), primarily implemented by local health departments. However, restrictions imposed by local governments can infringe human rights (4). Therefore, nonrestrictive and effective measures are necessary. Moreover, infectious disease policies should be grounded in scientific facts to form the basis for evidence-based policymaking (EBPM).
Avoiding infectious disease clusters through NPIs is crucial to containing infection and death rates. Spatial epidemiology and spatial scan statistics (SSS), including the examination of disease clusters, have been employed in the analysis of cancer incidence rates, healthcare sectors, and COVID-19 clusters (5), (6), (7), (8), (9). To date, the data utilized to develop COVID-19 countermeasures have been derived from the examination of disease clusters after the outbreak. Analyzing the characteristics of human mobility clusters using data from periods without outbreaks is essential to rapidly manage emerging infectious diseases and prepare business continuity plans for health administrations. In this study, "human mobility cluster" refers to the aggregation of groups in an area resulting from human movement and is not meant to denote the aggregation of infectious disease patients, which is commonly used in the context of infectious diseases (10), (11), (12). In Japan, human flow data have been used to support NPIs, such as social distancing (13), and statistical identification of human mobility clusters is necessary because dense human gatherings pose a risk of COVID-19 cluster formation.
This study aimed to detect human mobility clusters before the COVID-19 pandemic using SSS and elucidate the characteristics of cluster areas, thereby facilitating the implementation of NPIs.

Data
This study used open data available online, with one exception (business location information). Boundary data for the target region, using administrative district data, were obtained from the National Land Numerical Information site of the Ministry of Land, Infrastructure, Transport and Tourism (14). Residential population data, using future population data with a 1-km mesh (H30 National Bureau Estimates), were also obtained from the same site (15). The future estimated population was calculated by the National Institute of Population and Social Security Research using the 1-km mesh format based on the 2018 census results every 5 years from 2020 to 2050 (16). Mesh data, a digitalized map format for various statistical information (17), were used, with each mesh being a 1-km square dividing the area. However, as no data were available for 2019, the 2020 total estimated population was used. Human flow data were obtained from the nationwide human flow open data of the G Spatial Information Center (1-km mesh) (18). The resident population was based on GPS data collected from smartphones using Agoop SDK (19). The average number of people per day during 1 month was calculated based on the converted population value, and the data were available from January 2019 on a monthly basis. Monthly data between January and December 2019 were selected to exclude the impact of COVID-19. Comprehensive data were obtained by selecting the full-day data for each month. As railways affect human flow, railway station and line data were obtained from the National Land Numerical Information site (20).
Business location information was purchased from Zenrin Co., Ltd., which sells corporate search data. These data combine information, including location data, on approximately 6 million corporations in Japan, encompassing a wide range of companies and organizations (21). In this study, the data were classified and mapped into five categories: eateries, customer attraction stores, offices, retail stores, and medical and care facilities.
Figure 1 shows the geographical location of Takatsuki City in Japan, where the analysis was conducted. Takatsuki City is a municipality located in Osaka Prefecture, Western Japan, between Kyoto and Osaka. The northern part of the city is mountainous, featuring scenic tourist spots along highways, whereas the southern part is urban, with a mix of redeveloped high-rise apartments and traditional houses. Two railway companies operate in the city, with a major commercial area centered around the railway station in the city center. The population is approximately 350,000, with children (14 years old or younger), people of productive age (15-64 years old), and older adults (65 years old or older) accounting for 21%, 57%, and 28% of the total population, respectively (22).

Mapping
To examine the geographical distribution of the future estimated population and human flow data, QGIS (version 3.28.3) was used. QGIS, an official project of the Open Source Geospatial Foundation, is a user-friendly open-source geographic information system (GIS) (23). It has an intuitive UI and robust spatial analysis capabilities and is continually upgraded with numerous additional features through plugins developed by contributors worldwide. The downloaded data were inputted into QGIS for mapping, enabling data visualization and examination of the geographical distribution of clusters identified through scan statistics. Furthermore, the characteristics of businesses located in the detected cluster areas were obtained using corporate search data. To understand the geographical trends, aerial photographs from the Geospatial Information Authority of Japan's Geospatial Information Tiles were used as background maps (24).

Spatial scan statistics (SSS) using the Poisson distribution
SaTScan™ (version 10.1, 64-bit) was used to detect clusters (25). SaTScan was developed in 1997 by Professor Kulldorff of Harvard Medical School. It is free software capable of performing statistical analyses to detect clusters (disease agglomerations) in space or space-time (26). SSS involves the use of a window, referred to as a connected area, that could potentially be a cluster within a larger connected region. This circular window is continuously expanded and moved, and the window with the maximum likelihood ratio, as determined by a Monte Carlo probability simulation, is considered to be the most likely cluster (26), (27). If the observed values within a window are significantly higher than the expected values based on the Poisson distribution, this indicates the presence of a cluster. Conversely, it is also possible to detect clusters that are not statistically significant. Expected values can be calculated using the area population, and the degree of agglomeration is expressed as relative risk (RR). SSS has been applied in various fields beyond epidemiological research and is particularly effective in detecting hotspot clusters in suburban areas with low population densities. Thus, it is suitable for detecting clusters in areas such as the study area (28), with a densely populated southern area and mountainous northern area with a low population density.
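The quantities described above can be illustrated for a single candidate window with a minimal sketch. It assumes the standard formulation of the Poisson scan statistic, in which the expected count is proportional to the window's share of the population and the RR is the ratio of the risk inside the window to the risk outside it. The function names and the toy numbers are illustrative only and are not taken from the study; the Monte Carlo significance test that SaTScan performs is not reproduced here.

```python
import math


def expected_count(window_pop: float, total_pop: float, total_count: float) -> float:
    """Expected count in a window under the null hypothesis (proportional to population)."""
    return total_count * window_pop / total_pop


def relative_risk(observed: float, expected: float, total_count: float) -> float:
    """Risk inside the window divided by the risk outside it."""
    outside_observed = total_count - observed
    outside_expected = total_count - expected
    if outside_observed == 0:
        return math.inf  # every observation falls inside the window
    return (observed / expected) / (outside_observed / outside_expected)


def poisson_llr(observed: float, expected: float, total_count: float) -> float:
    """Log likelihood ratio for a high-rate window; 0 when the window shows no excess."""
    if observed <= expected:
        return 0.0
    llr = observed * math.log(observed / expected)
    remaining = total_count - observed
    if remaining > 0:
        llr += remaining * math.log(remaining / (total_count - expected))
    return llr


# Toy numbers (hypothetical): a window holding 10% of residents but 20% of the human flow.
e = expected_count(window_pop=100, total_pop=1000, total_count=1000)
rr = relative_risk(observed=200, expected=e, total_count=1000)
llr = poisson_llr(observed=200, expected=e, total_count=1000)
print(f"expected={e:.1f}, RR={rr:.2f}, LLR={llr:.2f}")
```

In a full scan, this ratio would be evaluated for every candidate window, and the window with the largest value would be reported as the most likely cluster.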
The following SaTScan settings were used to conduct SSS (Table 1).The location ID in the coordinate file corresponded to the mesh ID, and the centroids of each mesh were calculated using the geometry tool in QGIS.The latitude and longitude were obtained using the function feature of the field calculator.
Furthermore, the space-time scan statistic (STSS) can be performed using SaTScan if the concept of time is included (29). However, STSS is suitable for examining when the largest clusters occur within a specific period, whereas SSS is more appropriate for detecting monthly clusters and identifying their characteristics (30), which was the focus of this study. Therefore, this study employed SSS for cluster detection.
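As a rough illustration of how per-mesh data could be arranged for one monthly SSS run, the sketch below builds the lines of the three input files. It assumes the space-delimited ASCII layouts described in the SaTScan user guide (case file: location ID and count; population file: location ID, year, and population; coordinates file: location ID, latitude, and longitude), which should be checked against the installed version. The `Mesh` class, the mesh IDs, and all values are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Mesh:
    mesh_id: str      # location ID shared by all three files
    flow: int         # human flow count for the month (treated as "cases")
    population: int   # residential population of the mesh
    lat: float        # centroid latitude
    lon: float        # centroid longitude


def satscan_lines(meshes, year):
    """Return (case, population, coordinates) file lines for one monthly run."""
    cas = [f"{m.mesh_id} {m.flow}" for m in meshes]
    pop = [f"{m.mesh_id} {year} {m.population}" for m in meshes]
    geo = [f"{m.mesh_id} {m.lat:.6f} {m.lon:.6f}" for m in meshes]
    return cas, pop, geo


# Hypothetical meshes; IDs and numbers are placeholders, not study data.
meshes = [
    Mesh("A1", 1200, 800, 34.85, 135.62),
    Mesh("A2", 300, 450, 34.86, 135.63),
]
cas, pop, geo = satscan_lines(meshes, year=2020)
for name, lines in (("case", cas), ("population", pop), ("coordinates", geo)):
    print(name, lines)
```

Writing each list to its own text file, one line per mesh, would give one set of inputs per month, matching the month-by-month analysis used here.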

Results
Figure 2 shows the geographical distribution of the residential population, human flow, and business locations in Takatsuki City. The residential population of Takatsuki City in 2020 was concentrated in the southern part of the city, particularly along the railway lines. In contrast, the northern part, which is mountainous, had a smaller residential population with some meshes containing no residents. The human flow data for January 2019 showed a similar trend to that of the residential population, with a higher concentration of people around the four railway stations in the city. In addition, the heatmap showed a concentration of business locations around railway stations in the southern part of the city. Expanding the area where businesses are concentrated and mapping their locations enable a detailed understanding of the geographical distribution trends. This instance examined only the industry type; however, the subsequent analysis utilized data contained in the business location information, including business names, addresses, detailed industry types, and latitudes and longitudes (only location information is shown in Figure 2d due to data usage agreement terms).
Figure 1. Takatsuki City (indicated in red). Ⓒ OpenStreetMap contributors (38)
Cluster detection using spatial scan statistics (SSS)
SSS revealed that 33 meshes had significant clusters detected in at least 1 month of 2019. Some meshes showed significant clusters throughout the year, whereas others were identified as clusters only during specific months. Clusters with an RR below 1.5 exhibited low monthly variation and consistently remained below 1.5. Therefore, 14 meshes with an RR exceeding 1.5 were selected for a detailed analysis because they represented areas with a high risk of human gathering and significant monthly variability, potentially indicating the occurrence of seasonal events (Table 2 and Figure 3). Meshes with an RR above 1.5 included two that showed clusters throughout the year (1 and 2), two for 9 months (3 and 4), three for 6 months (5, 6, and 7), three for 3 months (8, 9, and 10), and four for 1 month (11, 12, 13, and 14). September had the most clusters (eight), followed by April and November (seven each), whereas the remaining months had five or six clusters.
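The selection and tallying described above can be sketched as follows, assuming the monthly SSS results have been collected as (mesh ID, month, RR) records for significant clusters. The record layout, the function name, and the toy values are assumptions made for illustration and do not reproduce Table 2.

```python
from collections import Counter


def summarize_clusters(records, threshold=1.5):
    """Count, for clusters with RR above the threshold, months per mesh and clusters per month."""
    high = [(mesh, month) for mesh, month, rr in records if rr > threshold]
    months_per_mesh = Counter(mesh for mesh, _ in high)
    clusters_per_month = Counter(month for _, month in high)
    return months_per_mesh, clusters_per_month


# Hypothetical records: (mesh ID, month, relative risk).
records = [
    ("M01", "Jan", 2.5), ("M01", "Feb", 2.6), ("M01", "Mar", 2.4),
    ("M02", "Jan", 1.2), ("M02", "Feb", 1.3),
    ("M03", "Feb", 1.8),
]
months_per_mesh, clusters_per_month = summarize_clusters(records)
print(dict(months_per_mesh))     # months in which each mesh exceeded the threshold
print(dict(clusters_per_month))  # number of high-RR clusters detected in each month
```

Meshes that never exceed the threshold, such as the second one in the toy data, simply do not appear in either tally.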

Characteristics of cluster areas
Factors potentially contributing to high risk in the detected cluster areas were identified using business location information. Cluster 1 was detected throughout the year and consistently exhibited a high monthly RR > 2.4. Cluster 1 was located in a hub for public transportation with stations of various railway companies and a high concentration of eateries and offices, which matched the heatmap results. Cluster 3 was located in the central-eastern part and formed in April 2019. A major highway expansion project was identified at this location. Clusters 5, 6, 10, and 13 shared common features, such as comprehensive sports parks, baseball fields, soccer fields, and tennis courts. Cluster 7, which was detected from January to April, September, and November, was identified as a golf course. Cluster 11 had a high RR (RR = 4) only in October and included a valley known for beautiful autumn leaves and

Discussion
This study identified monthly human mobility clusters and their characteristics before the COVID-19 pandemic using SSS, GIS, and integrated business location data. The results indicated that human mobility clusters were associated with areas central to public transportation, commercial areas with a high concentration of eateries and offices, construction sites, comprehensive sports parks, and ball game fields. Seasonal variations were also observed. The use of GIS to represent multiple data types on a single map is crucial. This enables a realistic representation of physical spaces based on geographical spatial information by combining data on population dynamics, human mobility, transportation, and business locations. This approach is crucial for public health professionals, including health officers, for community diagnosis and surveillance, not only during outbreaks but also in regular times (31).
Mapping of SSS results in GIS promotes a common understanding among epidemiologists, public health experts, and the general public. SaTScan outputs cluster information in widely used GIS formats, such as Shapefiles and KML files for Google Earth (32).
This study conducted SSS on a computer; however, cloud implementation could allow for on-demand execution, reduce processing time, and benefit health authorities when using large datasets (33).
The detected cluster areas were confirmed using business location information. However, some clusters, such as 4 and 7, were detected in areas without corresponding business locations, indicating that some clusters may be transit points. For instance, a large golf course was located beyond Cluster 4. Road information and data from neighboring municipalities could provide further insight into cluster characteristics (34).
In addition, local interviews could be valuable, as local customs and festivals are potential cluster sources (35). However, these data are often not openly available. Combining GIS with interviews in spatial epidemiology can validate the spatial analysis results (36).
Evidence-based policy decisions are preferred, particularly for public institutions that implement policies restricting human rights or require substantial budgets. This study identified areas prone to human mobility clusters and suggested areas where interventions should be seasonally intensified or relaxed. During the COVID-19 response, European countries initially justified lockdowns based on science but later prioritized economic values and voter opinions over scientific advice (37). To implement NPIs that are reasonable with respect to human rights, restrictions on movement and similar measures can be limited to the smallest possible population within the minimum necessary areas to suppress the spread of infectious diseases. In addition, by identifying outbreak-prone areas based on the flow of people and characteristics of regions during normal times, it is feasible to issue preemptive warnings as NPIs based on data before imposing movement restrictions. As a result, if the outbreak is suppressed, it will be possible to minimize the need for movement restrictions and other measures, thereby protecting human rights while controlling the spread of infectious diseases.
Promoting EBPM based on scientific facts derived from open data and open-source tools is imperative, not only for new infectious diseases but also for existing ones, such as influenza and RSV (3). To utilize the results of this study for the implementation of NPIs, it is important for municipalities or prefectural governments to estimate baseline data during normal times. Normal times refer to periods before the arrival of new infectious diseases or when diseases such as influenza are not in an outbreak phase. By continuously monitoring the baseline data of people's movements, anomalies can be detected by comparison with the baseline when an infectious disease outbreak occurs. In addition, attempting to calculate the baseline after an outbreak of an infectious disease could strain the time and human resources required to implement outbreak suppression policies. Therefore, the analysis described in this study should be automated in a system, thereby reducing the burden on public health officials.
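One simple way to picture the baseline comparison described above is sketched below. This is an assumption made for illustration, not a method used in this study: the baseline for a mesh is summarized by the mean and standard deviation of its values during normal times, and a new value is flagged when it exceeds the mean by more than `k` standard deviations. The function name, the threshold `k`, and the toy values are all hypothetical.

```python
from statistics import mean, stdev


def exceeds_baseline(baseline_values, current_value, k=3.0):
    """Flag a value more than k sample standard deviations above the baseline mean."""
    center = mean(baseline_values)
    spread = stdev(baseline_values)  # requires at least two baseline values
    return current_value > center + k * spread


# Hypothetical daily-average human flow for one mesh during normal times.
baseline = [100, 110, 90, 105, 95]
print(exceeds_baseline(baseline, 200))  # far above the usual range
print(exceeds_baseline(baseline, 102))  # within the usual range
```

An automated system could repeat such a check for every mesh each month, with the threshold rule replaced by whatever criterion the health authority considers appropriate.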
The reason for this is that the results of this study can flexibly adapt to the differences in transmission patterns, groups susceptible to infection, or age groups prone to severe illness for each infectious disease. Depending on whether the infection is spread through contact or airborne transmission, the size of the area for detecting human mobility clusters can vary. Moreover, the data on human flow includes information such as age, gender, and starting points. Although this study analyzed the data for the entire population, it is possible to limit the analysis only to data for the elderly or for children based on the characteristics of each infectious disease. This indicates the possibility of developing customized outbreak suppression policies tailored to individual infectious diseases. Furthermore, by accumulating these insights, when an unknown infectious disease emerges, the best NPIs can be selected by processing the vast amount of accumulated data with genera-
The universality of this methodology allows similar analyses worldwide. All data utilized, except one source, were open. SaTScan and QGIS are open-source and freely available software packages for the implementation of SSS and GIS (38). Business location data, although not open, can be substituted with open-source alternatives, such as OpenStreetMap or free aerial and satellite images provided by national geospatial authorities (39).
SSS in SaTScan can detect circular clusters; thus, noncircular clusters may be overlooked, and low-risk areas may be included in the results (40). Noncircular clusters can be detected using software such as FleXScan; however, this software requires more computational resources (41).
This study successfully employed spatial epidemiological methods, SSS, and GIS to detect human mobility clusters and analyze cluster area characteristics before the COVID-19 pandemic.

Figure 2. (a) Residential population per mesh (blue = few residents; red = many residents). (b) Human flow per mesh for January 2019 (blue = few residents; red = many residents). (c) Heatmap based on corporate search data demonstrating the distribution of business locations. (d) Business locations in the southern part of the city, where businesses are most densely situated (red: eateries; pink: customer attraction stores; blue: retail stores; green: offices; yellow: medical and care facilities). Each side of all black squares is 1 km. GSI Tile (National Latest Photo [Seamless]) provided by GSI is used as a background map.

Figure 3. Geographical location and relative risk of clusters detected using spatial scan statistics (SSS) in each month between January and December 2019. GSI Tile (National Latest Photo [Seamless]) provided by GSI is used as a background map.

Table 1. Data and Settings for Conducting Spatial Scan Statistics.

Table 2. Detected Clusters with a Relative Risk Above 1.5.

Table 2. Continued. Mesh ID, unique number of the 1-km square grid dividing the area; Observation Case, number of people in the mesh; Expected Case, expected number of people as calculated using spatial scan statistics; Population, residential population in the mesh; RR, relative risk calculated using the spatial scan statistics. All clusters were considered statistically significant at P < 0.01.