Methods of spatial cluster detection in rare childhood cancers: Benchmarking data and results from a simulation study on nephroblastoma

The potential existence of spatial clusters in childhood cancer incidence is a debated topic. Identification of rare disease clusters in general may help to better understand disease etiology and develop preventive strategies against such entities. The incidence of newly diagnosed childhood malignancies under 15 years of age is 140/1,000,000. In this context, the subgroup of nephroblastoma represents an extremely rare entity with an annual incidence of 7/1,000,000. We evaluated widely used statistical approaches for spatial cluster detection in childhood cancer (Ref. Schündeln et al., 2021, Cancer Epidemiology). For the simulation study, random high risk clusters of 1 to 50 adjacent districts (NUTS-level 3, nomenclature des unités territoriales statistiques) were generated on the basis of the 402 German administrative districts. Each cluster was simulated with different relative risk levels (1 to 100). For each combination of cluster size and risk level 2000 iterations were performed. Simulated data was then analyzed by three local clustering tests: Besag-Newell method, spatial scan statistic and the Bayesian Besag-York-Mollié approach (fit by Integrated Nested Laplace Approximation). The performance characteristics of all three methods were systematically documented (sensitivity, specificity, positive/negative predictive values, exact- and minimum power, correct classification, positive/negative diagnostic likelihood and false positive/negative rate). This data article links to a Mendeley online repository which includes the raw data of simulated high-risk clusters and simulated cases on the district level for an all-childhood-malignancy scenario as well as for cases of nephroblastoma. These data was used for the evaluation of the three cluster detection methods. The R code for simulation and analysis are available from GitHub. The article also includes analyzed data summarizing the performance of the cluster detection tests in very rare disease entities, using the example of simulated nephroblastoma cases. The raw data from the study can be used for benchmarking analyses applying different spatial statistical methods systematically and evaluating their performance characteristics comparatively. The analyzed data from the nephroblastoma example can be useful to interpret the performance of the three applied local cluster detection tests in the setting of extremely rare disease entities. As a practical application, data and R code can be used for performance analyses when planning to establish surveillance systems for rare disease entities.


Keywords:
Simulation study Random distribution Spatial cluster Childhood cancer Nephroblastoma Spatial scan statistic Besag-Newell Bayesian Besag York Mollié widely used statistical approaches for spatial cluster detection in childhood cancer (Ref. Schündeln et al., 2021, Cancer Epidemiology ). For the simulation study, random high risk clusters of 1 to 50 adjacent districts (NUTS-level 3, nomenclature des unités territoriales statistiques) were generated on the basis of the 402 German administrative districts. Each cluster was simulated with different relative risk levels (1 to 100). For each combination of cluster size and risk level 20 0 0 iterations were performed. Simulated data was then analyzed by three local clustering tests: Besag-Newell method, spatial scan statistic and the Bayesian Besag-York-Mollié approach (fit by Integrated Nested Laplace Approximation). The performance characteristics of all three methods were systematically documented (sensitivity, specificity, positive/negative predictive values, exact-and minimum power, correct classification, positive/negative diagnostic likelihood and false positive/negative rate). This data article links to a Mendeley online repository which includes the raw data of simulated high-risk clusters and simulated cases on the district level for an all-childhoodmalignancy scenario as well as for cases of nephroblastoma. These data was used for the evaluation of the three cluster detection methods. The R code for simulation and analysis are available from GitHub. The article also includes analyzed data summarizing the performance of the cluster detection tests in very rare disease entities, using the example of simulated nephroblastoma cases. The raw data from the study can be used for benchmarking analyses applying different spatial statistical methods systematically and evaluating their performance characteristics comparatively. The analyzed data from the nephroblastoma example can be useful to interpret the performance of the three applied local cluster detection tests in the setting of extremely rare disease entities. As a practical application, data and R code can be used for performance analyses when planning to establish surveillance systems for rare disease entities.

Value of the Data
• The raw data from this study can be used for benchmarking analyses when applying other statistical methods and evaluating their performance characteristics systematically. • The data is of benefit for researchers investigating the spatial epidemiology of extremely rare disease entities.
• The analyzed data from the nephroblastoma example can be useful to interpret the comparative performance of local cluster detection tests in the setting of extremely rare disease entities. • Data and R code can be used for performance analyses when planning to establish surveillance systems for various disease entities.

Raw data
The aim of the study was to evaluate three local clustering tests: Besag Newell (BN), spatial scan statistics ( SSS ) and the Bayesian Besag-York-Mollié approach (fit by Integrated Nested Laplace Approximation). To measure their performance, the tests were conducted with simulated data: Randomly assembled high-risk clusters of adjacent districts, increasing in size ( Cluster ) and in various risk levels ( RR ) were generated. The simulation process is described in detail in paragraph 2.3.
The raw data, generated by the simulation is presented online in the Mendeley repository ( https://data.mendeley.com/datasets/3hrg9tpsx9/4 ). In the online repository, the raw data for the nephroblastoma incidence simulation is documented in the file "NephroblastomaSimulation.RData". The file for the simulation of the all-childhood-malignancies scenario is in the file "AllMalignancies.RData". Both files can be loaded into the statistical software R. Each file contains six lists for the different cluster sizes ("Cluster Size X"). Within each of these lists 20 0 0 simulations for clusters in 10 different risk levels ("RR Y Cluster"). Corresponding to each run of simulation, the simulated cases for each of the respective scenario ("RR Y SimCases") are found. The files also contain the population of children under 15 years for each district ("District Population") as published by the German Federal Statistical Office [2]. In addition the expected cases for the entities, all malignancies or nephroblastoma, ("Expected Cases") per district, based on the expected incidence rates are given within the files.
The adjacency matrix for the 402 German districts is added as a separate RData file (Adjacency Matrix.RData).

Analyzed data from Nephroblastoma example
In the study, the performance of the three spatial cluster detection tests was systematically documented (details see 2.5). The data in Table 1 summarizes the analyzed results using the example of nephroblastoma cases. Selected performance measures are displayed as percentage sensitivity ( Sens ), specificity ( Spec ), positive predictive value (PPV), negative predictive value ( NPV ), exact power ( EP ), minimum power ( MP ) and correct classification ( CC ). Fig. 1 gives an overview of results on sensitivity, specificity, PPV and NPV for each of the three methods separately, depending on relative risk and cluster size.
The examples of a small and a large cluster high-risk scenario, 5-and 50-district clusters respectively, are shown in Fig. 2 .
The complete analyzed data can be found in the Mendeley repository ( ( https://data.mendeley. com/datasets/3hrg9tpsx9/4 , file "Analyzed Data.xlsx"). The complete data includes the results of the nephroblastoma scenario for all simulated RR levels and all simulated cluster sizes. In addition to the performance measures presented above, the correct proportion (CP), the positive diagnostic likelihood ( PDL ), negative diagnostic likelihood ( NDL ), false positive ( FPR ) and false negative rate ( FNR ) are displayed including the respective upper and lower confidence intervals.
Additionally, the file includes the complete analyzed data for the all-childhood-malignancies scenario (presented in detail in Ref. Cancer Epidemiology ).

General aspects of simulation study
For our study, we simulated random spatial clusters of the extremely rare childhood cancer subentity of nephroblastoma. The clusters varied in size and magnitude of risk increase. Overall incidence, population and spatial structure (districts) reflect conditions in Germany. The simulation was also performed for an all childhood malignancy scenario (see Ref. Schündeln et al., 2020, Cancer Epidemiology 2020). We systematically varied input parameters in the simulation and assessed performance of three cluster detection methods. The code for the computational implementation of the simulation to reproduce the analysis of the published study is provided online ( https://github.com/Pediatrics/ Childhood-Cancer-Study ). Here also the GADM shapefiles and the baseline population data can be found. Table 2 summarizes the notation for the explanatory remarks in the next paragraphs. total number of cases in district i and its j closest neighbors U j (i) population size in district i and its j closest neighbors  [ 20 , 21 ] standard deviation of the Monte Carlo estimator (RR), taken across repetitions (n)

Simulation and raw data generation
For the simulation and analysis of spatial data at the district-level (nomenclature des unités territoriales statistiques, NUTS-level 3 we used a shapefile obtained from the Database of Global Administrative Areas [1] . It represents 402 German districts according to the German administrative divisions of mid-2016. Corresponding population sizes, as of 31 December 2017, were obtained from the German Federal Statistical Office (Statistisches Bundesamt) [2] . The German pediatric population was estimated to be 11,048,523 children under the age of 15 years (13.3%). The number of children at risk below the age of 15 years for each spatial unit ranged between 3,594 and 4 92,44 8. The population density of children under 15 years of age ranged from 4 to 620 per km 2 .
High-risk clusters of different sizes were generated by randomly compiling a number of 1, 2, 3, 5, 10, 20 or 50 adjacent districts. A random district was (repeatedly) selected as a starting point using a fixed seed. Neighboring districts were identified using the adjacency matrix (evaluation of rows and columns) with recursion. The operation was terminated when the desired cluster size was reached. In case "donut-shaped" polygons were selected by the random process, the enclosed district was included into the generated risk cluster.
Crude incidence rates of expected pediatric cancer cases were assumed to follow a Poisson distribution with λ nephroblastoma = 7 / 1 0 0 0 0 0 0 (or λ all = 140 / 1 0 0 0 0 0 0 for the "all-pediatriccancer-scenario") [ 7 , 8 ]. Generally, RR i was assumed to be 1, while RR i associated with the generated clusters was varied in 10 steps from 1 to 100 (1, 1.1, 1.2, 1.3, 1.4, 1.5, 2.5, 10, 100), thus λ i = λ nephroblastomal × R R i . The data was aggregated over a 10-year period, as is regularly done for spatial epidemiological analysis of childhood cancer using population-based cancer registry data [7] . Therefore, the case numbers per district during the time period were calculated as follows: c i, 10 = 10 n = 1 λ × 1 0 0 0 0 0 0 × u i, 10 . The crude incidence rates for each district were then calculated as follows: ci r i = c i 10 u i 10 × 1 0 0 0 0 0 0 . Cancer incidence was simulated for all 402 districts 20 0 0 times for each scenario. The simulation estimand was the district cumulative RR

Cluster detection methods
Besag and Newell method The first approach was introduced by and named after Besag and Newell [9] . Here a test for each single region i based on the number of neighbors that must be combined to contain a minimum number of user defined cases k . The cases surrounding district i are ranked according to their distance to i to identify the k nearest cases. The area containing those k nearest cases is then identified ( M i ), in which M i constitutes a possible disease cluster. The following explanatory remarks are based on Song and Kulldorff [10] . To test for clustering around i , the approach considers whether the total number of cases in M i is large relative to the total risk population. The test statistic is defined as follows: where M i is a random variable denoting the minimum number of districts needed to have at least k cases in district i and its M i closest neighboring districts, m i is the observed value of Mi that is Under the null hypothesis, every individual person in a given region is equally likely to be a case, independent of other individuals and the location of residence. The null hypothesis of no clustering is rejected when the test statistic R is large. The method was applied as implemented in the R package SpatialEpi, version 1.2.3 [11] . For "nephroblastoma" scenario , k was set to 5 (for "all malignancies" to 50). These thresholds cover around the 75 th percentile of expected cases per district for the respective scenarios.

Spatial scan statistics
SSS in this study are represented by a modified approach adapted from Kulldorff [12] . SSS imposes a circular window on the map and lets the circle centroid move across the study region. For any given position of the centroid, the radius of the window is changed continuously between zero and an upper limit of radius or a maximum fraction of total population. Let L j(i) be the likelihood under the alternate hypothesis that there is a cluster in district i and its j closest neighbors, and let L 0 be the likelihood under the null hypothesis. It can then be shown that As this likelihood ratio is maximized over all circles, it identifies the one that constitutes the most likely cluster. The test statistic is where I is the indicator function with value 1 when D j(i ) > U j(i ) N C and 0 otherwise. The null hypothesis of no clustering is rejected when T is large. The method was applied as implemented in the R package SpatialEpi, version 1.2.3 [11] . The maximum population within the circles was set to 10 % of the total population.

Besag-York-Mollié method
In the Bayesian approach, the disease risk is estimated using a hierarchical model, comprising random effects that allow borrowing strength from the respective neighboring observations, therefore smoothing the spatial variation of relative risk and minimizing the likelihood of risk variation by chance. This makes the approach attractive for application in rare diseases and underpopulated areas. The general form is as follows (see e.g. [ 13 , 14 ]): where RR i is the relative risk in area i , which is modelled by an intercept term μ, an exchangeable area-specific effect ν i and another spatially structured area-specific effect m i . The spatially structured random effects can be estimated by a number of different models. Commonly, conditional autoregressive (CAR) prior distribution models are used in disease mapping studies. Spatial correlation between the random effects is defined by a binary n x n neighborhood matrix W . In two neighboring districts denoted by j ∼i , the random effects are correlated. Non-neighboring districts are modelled as being conditionally independent, given the remaining elements of m . The intrinsic autoregressive model includes the simplest CAR prior [15] and is referred to as the Besag, York and Mollié (BYM) model. The full conditional distribution in this model is then given The conditional expectation of m i is equal to the mean of the random effects in neighboring areas, while the conditional variance is inversely proportional to the number of neighbors f i . Therefore, in the presence of strong spatial correlation, more neighbors yields increased information in the data about the value of the random effect.
The parameter τ 2 l controls the variation between random effects. While Inference in this such models is usually based on Markov chain Monte Carlo (MCMC) simulation, the presented approach applies the Integrated Nested Laplace Approximation (INLA) [16] . INLA has been shown to produce results comparable to MCMC sampling and is nowadays often used in spatial applications, see e.g. [ 17 , 18 ]. It was applied as implemented in the R package R-INLA, version 18.07.12 [19] . High-risk districts/ clusters were defined as regions where the estimated RR is larger than 1 as determined by its two-sided equal-tailed 95% credible interval. Minimally informative priors (1, 0.001) on the log-precisions of the unstructured and spatially structured effects (based on the log-gamma distribution; as is the default setting in R-INLA) were used.

Performance of cluster detection methods
The performance of each of the various cluster detection methods and scenarios in this study is reported according to the quality criteria detailed below.
Variance estimates: Mean, standard deviation (SD) as well as lower and upper 95% confidence intervals (CI = mean ± 1.96 × SE) were calculated for all measured parameters (LCI and UCI).

Ethics Statement
No humans and animals were directly involved in the study. Publicly available population based data was used from DESTATIS, GADM and the German Childhood Cancer Registry.

Credit Author Statement
MSS and CStock: conceptualization, design of the study, coding, statistical analyses, drafting of manuscript; TL and MK: coding, CSpix: raw data supply; KB: conceptualization and drafting of manuscript; all authors contributed to revising the manuscript critically for important intellectual content and approved the final version.

Declaration of Competing Interest
CStock is now full-time employee of Boehringer Ingelheim Pharma GmbH & Co. KG. The company had no role in design, analysis or interpretation of the present study. The remaining authors declare that they have no known competing financial interests or personal relationships which have or could be perceived to have influenced the work reported in this article.