Identification of spatio-temporal clusters of lung cancer cases in Pennsylvania, USA: 2010–2017

It is known that geographic location plays a role in developing lung cancer. The objectives of this study were to examine spatio-temporal patterns of lung cancer incidence in Pennsylvania, to identify geographic clusters of high incidence, and to compare demographic characteristics and general physical and mental health characteristics in those areas. We geocoded the residential addresses at the time of diagnosis for lung cancer cases in the Pennsylvania Cancer Registry diagnosed between 2010 and 2017. Relative risks over the expected case counts at the census tract level were estimated using a log-linear Poisson model that allowed for spatial and temporal effects. Spatio-temporal clusters with high incidence were identified using scan statistics. Demographics obtained from the 2011–2015 American Community Survey and health variables obtained from 2020 CDC PLACES database were compared between census tracts that were part of clusters versus those that were not. Overall, the age-adjusted incidence rates and the relative risk of lung cancer decreased from 2010 to 2017 with no statistically significant space and time interaction. The analyses detected 5 statistically significant clusters over the 8-year study period. Cluster 1, the most likely cluster, was in southeastern PA including Delaware, Montgomery, and Philadelphia Counties from 2010 to 2013 (log likelihood ratio = 136.6); Cluster 2, the cluster with the largest area was in southwestern PA in the same period including Allegheny, Fayette, Greene, Washington, and Westmoreland Counties (log likelihood ratio = 78.6). Cluster 3 was in Mifflin County from 2014 to 2016 (log likelihood ratio = 25.3), Cluster 4 was in Luzerne County from 2013 to 2016 (log likelihood ratio = 18.1), and Cluster 5 was in Dauphin, Cumberland, and York Counties limited to 2010 to 2012 (log likelihood ratio = 17.9). Census tracts that were part of the high incidence clusters tended to be densely populated, had higher percentages of African American and residents that live below poverty line, and had poorer mental health and physical health when compared to the non-clusters (all p < 0.001). These high incidence areas for lung cancer warrant further monitoring for other individual and environmental risk factors and screening efforts so lung cancer cases can be identified early and more efficiently.


Introduction
Lung cancer is the most frequently diagnosed cancer worldwide, accounting for 1.74 million deaths annually and lung cancer cases are expected to increase by 38% to 2.89 million by 2030 [1]. Lung cancer is also the leading cause of cancer mortality in both men and women in the U.S. In 2021, Pennsylvania (PA) was ranked 32 out of 49 states with an age-adjusted lung cancer incidence rate of 63 per 100,000 population and a five-year survival rate of 25 percent [2]. An estimated 5,990 Pennsylvanians are expected to die from lung cancer in 2022 with approximately 11,170 new cases being reported [3]. People diagnosed at early stages of lung cancer are five times more likely to survive; however, in Pennsylvania only 16 percent of lung cancer cases are diagnosed at early stages [2].
Documenting the extent of cancer incidence remains central to improving public health research and to developing population-based strategies for cancer prevention. The need to understand the incidence of lung cancer is influenced by potentially modifiable risk factors (e.g., tobacco use, alcohol drinking, unhealthy diet, radon exposure) and others that are not (e.g., inherited genetic mutations) [3]. Cancer outcomes are influenced also by socioeconomic status, access to care, supportive services, and rural-urban environmental factors, all of which contribute to both the physical and mental health of cancer patients [4].
Mapping spatial patterns of lung cancer risk is an increasingly popular approach given the greater availability of geographically enabled cancer data and sophisticated visualization methods [5]. Maps are useful for examining disease patterns in relation to local environmental factors with the ability to examine disease causation through the identification of demographic patterns and trends [6,7].
Spatial statistical methods like space-time models can also be used to quantify patterns and trends over space and time (i.e., spatio-temporal) and cancer clusters are frequently used by researchers to respond to public concerns. The aims of this study were to examine spatio-temporal patterns of lung cancer incidence in Pennsylvania over an 8-year period (2010-2017), identify high incidence clusters, and compare the demographic and health characteristics of residents inside and outside of clusters.

Data sources
Lung and bronchus cancer cases in PA between 2010 and 2017 were obtained from the Pennsylvania Cancer Registry (PCR) [8] using International Statistical Classification of Diseases, 10th revision (ICD 10) diagnosis codes-C340 (main bronchus), C341 (upper lobe, bronchus or lung), C342 (middle lobe, bronchus or lung), C343 (lower lobe, bronchus or lung), C348 (overlapping sites of bronchus and lung), and C349 (unspecified part of bronchus or lung). PCR is an incidence-based registry and has earned Gold Certification from the North American Association of Central Cancer Registries (NAACCR), the highest level of data quality achieving at least 95% completeness, for all years under study [9]. The following three exclusion criteria were applied to exclude cases that were: (i) in situ and non-carcinoma histology, (ii) not uniquely matched with a census tract ID, and (iii) the age of diagnosis belonged to an age group with zero population size as estimated by US Census Bureau indicating a possible error. This resulted in a total of 73,937 cases from 3,197 census tracts. We used the census tract, the small and relatively permanent statistical subdivision defined by the US Census Bureau, as the unit of analysis for the consistency in the data collected over the years and the validity when used in research studies [10]. We conducted the present analysis under a data use agreement with the Pennsylvania Department of Health and with the approval of the University of Pennsylvania Institutional Review Board (IRB number 831671).
Demographic data at the census tract level were extracted from the 2011-2015 ACS including median age (years), percentage of males, distribution by race and ethnicity, per capita income, median household income (thousands of $), percent poverty, distribution by educational attainment, total population size, and population density (per square mile).
Poor mental health and poor physical health, defined as the percent of individuals ≥ 18 years who self-reported having 14 or more days during the past 30 days in which their mental or physical health was not good, were extracted from the Centers for Disease Control and Prevention (CDC) PLACES 2020 database derived using the 2018 Behavioural Risk Factor Surveillance System (BRFSS). Both mental and physical health measures were based on self-assessment only without an objective health component [13].

Age-adjusted Incidence rates and trends over time
The age-adjusted incidence rates (number of cases per 100,000) for each census tract were calculated by adjusting the crude incidence rate with respect to the 2000 U.S. Standard Million Population, a commonly used standard population for adjustment that assumes a total population of 1,000,000 [14]. The adjustment used the 18 age groups and population size estimates described above. A choropleth map for the age-adjusted incidence rate using the cumulative cases over 8 years was created to visualize the spatial pattern. Temporal trends in the adjusted incidence rates were examined and modeled using linear quantile mixed models [15]. Such mixed models were utilized to allow census tract level random effects of intercept and slope for the calendar year to be estimated, while the use of the quantile regression provided a robust summary of the trends that were less sensitive to outlying values in the incidence rates, which are often observed in smaller census tracts. The estimated 50 th (median), 75 th , 80 th , and 90 th quantiles were plotted, and the mean profile was included as a reference.

Spatio-temporal disease risk and mapping
To understand the spatio-temporal disease risk, we modeled the observed case counts through a log-linear Poisson regression with both spatial and temporal terms, as well as a space-time interaction term. Specifically, the mean case count for location i (in this case a census tract) and year j was modeled as the expected case counts for the same location and year combination (E ij ) times the relative risk parameter, RR ij , which is also indexed by location i and year j (i.e., relative risk specific to a location and a time). The expected case counts E ij were determined based on the age distribution of the corresponding location i and year j such that E ij equals the crude incidence rate in a particular age group in the study population in year j times the population size in the same age group of the location i from the same year (i.e., internal standardization). Extending the model proposed by Lawson et al. [16] for the spatial model, the log of the spacetime relative risk parameter RR ij was modeled with four components: an intercept as the overall relative risk for the study region, location-specific random effects, a linear trend term in time j, and the interaction random effects between the location and time. The spatial random effects were assumed to follow a normal distribution under the conditional autoregressive (CAR) setting based on Queen contiguity spatial weight matrix (i.e., two areas are considered neighbors if they share a common boundary). The model was fit using the R-integrated nested Laplace approximation, R-INLA [17] under the Bayesian framework with a normal prior distribution. Temporal trends in the RR estimates from 20 randomly selected census tracts were plotted to examine the changes over time. The spatial pattern for the estimated RR for a given year was illustrated using a choropleth map. Furthermore, we calculated the standardized incidence ratio (SIR) for each census tract as the total number of cases observed divided by the total number of cases expected E ij across the 8 years combined. We then created a choropleth map of the SIR to examine this spatial pattern empirically with SIR > 1 indicating an elevated risk such that the number of cases observed is higher than the expected number of cases.

Detection of high-risk clusters
We used the SaTScan cluster detection method which employs Kulldorff scan statistics to detect high risk clusters. This approach has been widely used in spatial statistics to evaluate the risk of disease geographically to detect high risk clusters. This method generated circular spatial windows of various sizes and evaluated the observed over the expected number of cases by comparing inside versus outside the circles to identify statistically significant clusters [18]. To detect spatio-temporal clusters [19,20], scan statistics covered the study area with many overlapping "windows" now defined as cylinders with the base as the area and the height as the time period in the spacetime setting. As the window expanded to contain more areas and more cases, we used a log-linear ratio (LLR) to compare the number of cases inside the windows to the number of cases outside the window. The null hypothesis was calculated under the probability that being a case is the same inside and outside the window relative to the age-adjusted expected number of cases. A LLR > > 1 indicated evidence that the current window forms a high incidence or high-risk cluster. In our analysis, the ageadjusted expected case counts used were the same E ij that was used for the log-linear Poisson model in the previous section. The most likely cluster (i.e., the window with the maximum LLR) and secondary clusters (i.e., other statistically significant windows at 0.05 significance level) were identified in the current analysis. The RR of each cluster was determined by the total number of cases observed over the total number of cases expected in the years when the cluster is present. The statistical significance of a cluster was determined through a Monte Carlo hypothesis testing procedure [21]. The proposed analysis was performed using the R shiny application SpatialEpiApp, which allows estimation of spatio-temporal disease risk and detection of clusters [22].

Comparison of census tracts in high-risk cluster versus not
The nonparametric Wilcoxon rank sum test with a continuity correction was used to compare demographic variables between census tracts in any high incidence cluster at any time during 2010 to 2017 versus those not in any clusters. Data on smoking, which is a known risk factor for lung cancer development, were not available at the census tract level and for the same time frame as the demographic variables used, thus comparison of the smoking prevalence between census tracts was not possible. A two-sided p < 0.01 was considered statistically significant. We used a lower p-value threshold for statistical significance to account for testing multiple variables.

Age-adjusted incidence rates and spatio-temporal disease risk
The population density by census tract in Pennsylvania using the 2011-2015 ACS is shown in Fig. 1A. The population was mainly concentrated in a small number of metropolitan areas including the southeast and western regions of Pennsylvania, specifically in Philadelphia and Pittsburgh areas, respectively. A map of age-adjusted incidence rate using the cumulative cases over 8 years is provided in Fig. 1B showing that higher age-adjusted incidence rates were mainly observed in the major cities located in southeastern (e.g., Philadelphia), northeastern (e.g., Allentown, Scranton), and western (e.g., Pittsburgh, Erie) Pennsylvania. The age-adjusted incidence rates decreased slightly over the study period with median incidence rates (25 th -75 th quantiles) of 51.7 per 100,000  Figure 2B shows the estimated RR over time for 20 randomly selected census tracts and the median of RR estimates for a decile group created using the 2010 estimates. The parallel lines observed in Fig. 2A and B reflected that the fitted models suggested no space and time interaction such that the decreasing trends in the age-adjusted incidence rates and RR values were consistent across the study region. Maps showing the estimated RR for 2013 and SIR are provided in Fig. 3A and B, respectively, indicating a similar pattern to the age-adjusted incidence rates as shown in Fig. 1B, such that higher values of RR and SIR were concentrated in the major cities located in southeastern (e.g., Philadelphia), northeastern (e.g., Allentown, Scranton) and western (e.g., Pittsburgh, Erie) Pennsylvania while most of the central PA showed lower than expected case counts (RR < 1 and SIR < 1). Maps from other years also show a similar pattern (maps not shown).

Detection of high-risk clusters
Five spatio-temporal clusters were identified based on lung cancer cases in Pennsylvania during the study period 2010 to 2017, as shown in Fig. 4. Information for each of the clusters is provided in Table 1 Table 2.
Significant differences were observed in median age, percent male, percent African American, per capita income, percent poverty, percent high school graduate or higher, population density, poor mental health, and poor physical health (all p < 0.001) between the clustered and nonclustered census tracts. In our analysis, census tracts that were part of the high incidence clusters tended to have residents of lower median age, had a higher percentage of  African Americans, residents below the poverty line, be densely populated, and had poorer mental and physical health compared to residents, not in the clusters. The demographic and health characteristics for each cluster are presented in Table 3. The cluster located in the southeast area of Pennsylvania, Cluster 1, had the highest percent African American (median = 45.4%, IQR = 73.2) and population density (median = 17,785.9, IQR = 15,192.9). Cluster 3 in Mifflin County showed the lowest per capita income (median = 16.5%), the highest percent poverty (median = 29.3%), and poor mental and physical health (median = 19.3% and 17%, respectively).

Discussion
In this study, we aimed to identify spatio-temporal clusters of high lung cancer incidence in Pennsylvania over an 8-year period (2010-2017). Overall, the age-adjusted   (2010-2013) to 5.21 for Cluster 3 (2014 to 2016). One of the secondary clusters in the Mifflin County revealed the highest risk ratio of lung cancer compared to the other four clusters, associated with the lowest per capita income (median = 16.51%) and the highest percent poverty (29.3% below poverty line), and the poorest physical and mental health (17% and 19.3%, respectively). The primary economic activities in this area in the past, included the manufacturing of steel [23], machinery [24], and textiles [25], which are often accompanied by adverse environmental impacts and may have contributed to the increased risk of lung cancer. Radon, a known risk factor for lung cancer [26], is also shown to have higher levels in the Mifflin County For example, zip codes in the Mifflin County area were discovered to have high radon levels in homes in the past (e.g., the average radon concentrations were 5.2 pCi/L (first floor) and 11.2 pCi/L (basement) in 17,044, 16.7 pCi/L (first floor), and 20.4 pCi/L (basement) in 17,084). These measurements are much higher than the PA average indoor concentrations of 3.6 and 7.1 pCi/L (first floor and basement, respectively) and above the US EPA standard of 4 pCi/L [27]. Our analyses of socioeconomic and health characteristics suggested that the census tracts with the following characteristics were more likely found in a lung cancer cluster: higher percent African American, lower per capita income, higher percent poverty, and higher population density. Differences in these demographic characteristics reflect that disadvantaged communities are more likely to be exposed to environmental pollutants known to be lung cancer risk factors due to, for example, proximity to manufacturing facilities or hazardous waste sites, traffic, or living in houses with high radon values. Similar spatio-temporal analyses of lung cancer cases performed by others for the U.S. states of Kentucky [28] and Maine [29] also found results like our findings. For example, the reports by Christian and colleagues that analyzed the spatial and temporal distributions of lung cancer histological types in Kentucky using Scan statistics also found that high risk clusters were near the major metropolitan area and/or overlapped with areas of high poverty. Although our analysis did not distinguish histology types, the locations and characteristics of the clusters found in the current and other studies have further demonstrated the importance of environmental and socioeconomics factors for lung cancer.

Limitations and future work
One factor that may affect the results is the size and shape of the study area, known as zoning effect [30]. This is a relevant effect on the results of spatial analysis that may change depending on the unit of the analysis. The zoning effect arises when the number of the spatial units of measure remains the same but changes in their relative structure (unit boundaries and shape) generate different analytical results. The scaling parameters in SaTScan may also produce different results depending on scan window sizes, so multiple scans can be performed at various circle sizes [31]. Known as the aggregation effect, the choice of unit for analysis (census tract) may result in the loss of statistical power to detect clusters, and different results may be obtained if other units such as census block groups were selected [32]. Another methodological limitation is the adjustment for age for the clustering analysis which was done using an internal standardization method such that age distribution for the study area was used as the reference population rather than an external standard population (e.g., standard million population). Additionally, lung cancer cases were combined from different stages, histological subtypes and were not stratified by sex. Data on other risk factors such as smoking, occupational and residential history were also not available on the census tract level to allow comparisons between clusters and non-clusters in the current analysis. Smoking is the single highest risk factor for lung cancer. If the smoking data were available, we would have hypothesized that the census tracts in the clusters will have a higher smoking prevalence than the census tracts outside the clusters. This hypothesis was supported indirectly by the results in Table 2 showing the census tracts that were part of the clusters tended to have more residents living below the poverty line, lower-income, and less educated, and positive associations are known to exist between these variables and smoking [33][34][35][36]. Furthermore, mental and physical health variables were self-reported and thus subject to error. Individuals living in high lung cancer incidence clusters may be more vulnerable to multiple risk factors, e.g., smoking history, which is indicative of the health disparity in the state of Pennsylvania. Educational interventions in counties with a higher incidence of lung cancer are important for promoting public health and other risk mitigation practices. These geographic areas warrant further investigation to potentially identify additional risk factors or unique patterns of cancer stage and histology at the diagnosis, to further address environmental exposures and lung cancer risk in those specific counties. Geographic surveillance may help to identify disparities in disease burden among different regions or communities with high-risk populations that could be targeted for public health interventions. Incorporating spatiotemporal statistical methods, such as cluster detection, into existing disease surveillance activities can provide information about potential cancer clusters. For instance, Cluster 5, which included the counties of Cumberland, Dauphin, and York, had the largest U.S. Amish population. Health among the Amish is characterized by higher incidences of many genetic disorders [37,38], so these communities could be more susceptible to lung cancer. With more vigilant surveillance, early detection of lung cancer may improve overall survival.

Conclusions
Spatio-temporal analysis of lung cancer incidence in Pennsylvania has led us to the identification of areas with higher risk of developing the disease. Although the age-adjusted incidence rates and RR of lung cancer decreased over time, five statistically significant clusters were identified over the study period from 2010 to 2017. Our analysis also demonstrated significant differences in demographic characteristics including percent African American, percent poverty, and population density between those census tracts that were part of the identified clusters versus those that were not. Poorer mental and physical health were also associated with clustered areas. These geographic areas with increased cancer risk factors require further environmental monitoring and screening efforts to reduce exposures and detect lung cancer at an earlier stage.