Applying machine learning to model radon using topsoil geochemistry



Introduction
Radon is a naturally occurring radioactive noble gas that is released from soil, water and rocks. There are three main radon isotopes, ²²²Rn (radon), ²²⁰Rn (thoron) and ²¹⁹Rn (actinon), which have half-lives of 3.82 days, 55.6 s and 3.96 s, respectively. Uranium (²³⁸U) decays to radon (²²²Rn) as an intermediate daughter product before completely decaying to the stable lead isotope (²⁰⁶Pb). ²²⁰Rn and ²¹⁹Rn are products of the ²³²Th and ²³⁵U decay chains, respectively. Importantly, if ²²²Rn or its daughter products (²¹⁴Po and ²¹⁸Po) are inhaled, they can cause significant damage by emitting alpha radiation directly into lung cells, which links radon exposure to lung cancer (Rodríguez-Martínez et al., 2021; Zagà et al., 2021).
The majority (80%) of radiation exposure received by the general public is from natural sources (United Nations Scientific Committee on the Effects of Atomic Radiation, 2011). Approximately 53% of natural radiation exposure is caused by the inhalation of decay products from the uranium and thorium series, with the isotope ²²²Rn accounting for over 90% of these decay products (United Nations Scientific Committee on the Effects of Atomic Radiation, 2011; Zohuri, 2020). Radon (²²²Rn) exposure is the leading cause of lung cancer in non-smokers, with 3-15% of global lung-cancer cases attributed to radon (World Health Organization, 2009). It is estimated that up to 350 people are affected by radon-induced lung cancer in Ireland annually (Elío et al., 2018; Environmental Protection Agency, 2022). The 5-year survival rate after a lung-cancer diagnosis is between 12% and 25% (Lin et al., 2019; Schabath and Cote, 2019).
Abbreviations: IRC, indoor radon concentration; PCA, principal component analysis; ML, machine learning; SGRn, soil-gas radon; GPR, Gaussian process regression; RF, random forest; LR, logistic regression; GRP, geogenic radon potential.
Reducing radon exposure is paramount to reducing radon-related lung-cancer cases. Internationally, several countries have implemented initiatives to reduce the harmful effects of radon exposure. These initiatives include setting reference levels for indoor radon concentration (IRC) and having prevention measures (e.g. radon barriers, sumps, ventilation) in place for new buildings (Radiological Protection (Ionising Radiation) Regulations, 2019). It is important to note that there is no safe level of radon exposure (World Health Organization, 2009). However, countries often set the 'safe' IRC level based on average background levels and the realistic ability to lower IRCs to that level (Degu Belete and Alemu Anteneh, 2021). In Ireland, 200 becquerels per cubic metre (Bq m⁻³) is the reference level for IRC in houses (Environmental Protection Agency (Ireland), 2019).
Identifying radon-prone elements in topsoil may help distinguish radon-prone areas. Soil-gas radon (SGRn) is the prime contributor to IRC (World Health Organization, 2007). Understanding the association of soil geochemistry with the natural variation of geogenic radon distribution could therefore allow topsoil geochemistry to be incorporated into radon mapping. This could help overcome the lack of national soil-gas radon data and add value to legacy soil geochemical mapping programmes. It is possible that radon-prone elements covary spatially for several reasons, including association with different lithologies, weathering and element mobility. As such, it may be possible to determine the radon potential of a region using reliable soil geochemical data.
The application of multivariate geostatistical data analysis and machine learning to geochemical and environmental modelling is becoming more established (Bossew et al., 2020; He et al., 2022; Huntingford et al., 2019; Zuo, 2017). Principal component analysis (PCA) and machine learning have been applied to investigate the distribution of potentially toxic metals in topsoils at regional (Xu et al., 2021) and local scales (Wu et al., 2022). PCA and compositional data analysis have also been used to research the distribution of rare earth elements in topsoils (Ambrosino et al., 2022), to map pollution sources in sea sediments (Somma et al., 2021), and for geological and geochemical mapping at local, regional and larger scales (Ballabio et al., 2019; Wang et al., 2021; Zheng et al., 2021). Machine learning applied to topsoil geochemistry has also been used to monitor soil organic carbon (Sakhaee et al., 2022) and to investigate heavy metal distribution in soils (Wang et al., 2023). Although machine learning has been used for modelling geogenic/indoor radon distribution (Elío et al., 2019, 2023; Petermann et al., 2021; Rezaie et al., 2021), the present research explicitly investigates the topsoil geochemical signature of geogenic radon in a 17,983 km² area in Ireland.
The main purpose of this study is to test multivariate statistical methods (PCA, ML) for analysing the relation between topsoil geochemistry and geogenic radon at a regional scale. Considering the lack of comparable scientific literature on this subject, several ML models are tested. The analysis is performed on a topsoil dataset obtained from Tellus Geological Survey Ireland (GSI) and compared with corresponding geogenic radon categories derived from Elío et al. (2020).
The results demonstrate a correlation between topsoil elements and geogenic radon. This indicates that the geochemical signature of a topsoil sample can provide insight into the geogenic radon available for an area. The geostatistical approach used in this research confirms the feasibility of using topsoil geochemistry for assessing geogenic radon risk at a regional scale (>10 km²).

Methodology
In total, 4279 shallow topsoil samples were obtained from the 2017-2019 GSI Tellus survey G5 area of the North Midlands, Ireland, including samples from counties Galway, Mayo, Roscommon, Longford, Westmeath, Offaly, Meath, Kildare and Dublin (Geological Survey Ireland Tellus programme, 2020). The GSI Tellus programme collected the topsoil samples at a density of one sample per 4 km². The topsoil samples cover a range of Quaternary sediments, including till derived from limestones, cherts, alluvium, Namurian sandstones and shales, blanket peat, Lower Palaeozoic and Devonian sandstones, Devonian and Carboniferous sandstones, and gravels derived from limestones.
The bedrock geology includes Devonian granites, Ordovician sandstones, conglomerates, and Silurian sandstones and siltstones in the west, with a range of Carboniferous limestones predominating throughout the study area. There are occurrences of Silurian siltstones, sandstones and shales in the north-east of the area, and the northern portion of the Leinster Granites is located in the south-east of the study area. A detailed description of all the bedrock lithologies in this study area can be viewed on the Geological Survey of Ireland (GSI) online map viewer (www.gsi.ie; https://dcenr.maps.arcgis.com/apps/webappviewer/index.html?id=ebaf90ff2d554522b438ff313b0c197a&scale=0).
The geogenic radon potential (GRP) categories developed by Elío et al. (2020) were derived from GSI Tellus airborne radiometric surveys, in which equivalent soil-gas radon and soil permeability were used to model GRP at a 1 km² resolution. The soil permeability categories were estimated from the Groundwater Subsoil Permeability (GWSP) map of Ireland and the all-Ireland Quaternary map (Elío et al., 2020). The airborne GRP categories published by Elío et al. (2020), namely High (H), Moderate-High (M-H), Moderate-Low (M-L) and Low (L), are used as dependent variables in the machine learning models (Fig. 1B). The four categories are assigned to two classes: the Low category is assigned to class 1, and the Moderate-Low, Moderate-High and High categories are assigned to class 2. The Low GRP category has a significantly lower probability (6.93%) of indoor radon concentration exceeding the 200 Bq m⁻³ reference level than the remaining GRP categories (average 16.98%) (Elío et al., 2020).
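The two-class assignment described above can be sketched in a few lines of Python. This is a minimal illustration only: the label strings and the name `to_binary_class` are assumptions made for the example and are not taken from the study's code.

```python
# Hypothetical label strings for the four GRP categories named in the text.
GRP_TO_CLASS = {"L": 1, "M-L": 2, "M-H": 2, "H": 2}


def to_binary_class(grp_category):
    """Map a four-level GRP category to class 1 (Low) or class 2 (all others)."""
    return GRP_TO_CLASS[grp_category]


# Illustrative labels, not real survey data.
labels = ["L", "M-L", "H", "L", "M-H"]
classes = [to_binary_class(g) for g in labels]
print(classes)  # [1, 2, 2, 1, 2]
```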
The independent variables, the shallow topsoil geochemistry data, were downloaded from the Geological Survey Ireland data and maps/geochemistry website (Geological Survey Ireland Tellus programme, 2020). The dataset investigated is '6117xxA-6174xxA_-Shallow_Topsoil_Download_v1.0.xlsx' (Geological Survey Ireland Tellus programme, 2020). Each shallow topsoil (0.05-0.20 m depth) sample is a composite of five subsamples: four are collected from the corners of a 20 m square and the fifth is collected from the centre of that square, which corresponds to the GPS location for that sample (Knights et al., 2020).
The GSI Tellus protocol for preparing samples prior to analysis includes fan-oven drying at 30 °C, carefully breaking clumps, dry sieving to obtain the < 2 mm soil fraction, and pulverising in an agate ball mill to obtain the 63 μm fraction (Young et al., 2016). Tellus topsoil geochemistry was analysed using ICP-MS for multiple elements (Al, Ba, Ca, Cr, Cu, Fe, K, Li, Mg, Mn, Na, Ni, P, S, Sr, Ti, V, Zn, Zr, Ag, As, Be, Bi, Cd, Ce, Co, Cs, Ga, Ge, Hf, Hg, In, La, Mo, Nb, Pb, Rb, Sb, Sc, Se, Sn, Te, Th, Tl, U, W, Ta, Au, Pd, Pt, Re and Y) (Geological Survey Ireland Tellus programme, 2020). Several elements (Ta, Au, Pd, Pt and Re) were omitted from further analysis because more than 5% of their values were below the detection limit. Only 4130 of the 4279 topsoil samples from the Tellus North Midlands dataset were used for analysis. The 149 excluded samples did not fall within the boundaries of any radon potential class reported by Elío et al. (2020) and therefore could not be used to investigate the link with geogenic radon potential.
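The element screening step can be illustrated with a short Python sketch. It assumes that censored measurements are stored as numbers lower than the detection limit; the concentrations, detection limits and function names below are invented for illustration and do not come from the Tellus dataset.

```python
def fraction_below_limit(values, detection_limit):
    """Fraction of measurements that fall below the detection limit."""
    return sum(v < detection_limit for v in values) / len(values)


def keep_elements(table, limits, max_fraction=0.05):
    """Names of elements with at most `max_fraction` of values below the limit."""
    return [
        element
        for element, values in table.items()
        if fraction_below_limit(values, limits[element]) <= max_fraction
    ]


# Invented concentrations and detection limits, for illustration only.
table = {
    "U": [1.2, 2.0, 1.8, 2.4] * 5,        # 0 of 20 values below the limit
    "Au": [0.001] * 2 + [0.01] * 18,      # 2 of 20 values (10%) below the limit
}
limits = {"U": 0.1, "Au": 0.005}

print(keep_elements(table, limits))  # ['U']
```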

Data closure
Geochemical data are a subset of compositional data, in which element concentrations represent proportions of the entire composition, leading to the constant-sum (or data closure) issue (Aitchison, 1982). In this study, the centred log-ratio (clr) transformation is applied to the data prior to multivariate analysis (i.e. PCA and machine learning models) to accommodate the data closure constraint.
The clr transformation projects the dataset into Euclidean space, allowing geochemical results to be interpreted without the issue of spurious correlation between interdependent variables (Aitchison, 1982; Grunsky and Caritat, 2019; Wang et al., 2021). The isometric log-ratio (ilr) transformation is applied to the data prior to univariate data analysis (i.e. correlation heat maps, Pearson r and r²). The ilr method transforms a composition from the Aitchison simplex to a D-1 (dimension minus 1) Euclidean vector space, retaining isometry (Egozcue et al., 2003). The ilr-transformed dataset was back-transformed to recover the original dimensionality of the starting dataset; this was done using 'R' software version 4.0.2 and the 'compositions' package.
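For reference, the clr transformation of a composition x with D strictly positive parts is clr(x)_i = ln(x_i) - (1/D) Σ_j ln(x_j). The study used R and the 'compositions' package; the following is only a minimal Python sketch of the same formula, with an invented three-part composition. It assumes all parts are positive, so zeros or values below detection would need separate treatment.

```python
import math


def clr(composition):
    """Centred log-ratio transform of one composition with positive parts."""
    logs = [math.log(x) for x in composition]
    mean_log = sum(logs) / len(logs)
    return [value - mean_log for value in logs]


# Invented three-part composition (e.g. percentages), for illustration only.
sample = [10.0, 30.0, 60.0]
z = clr(sample)

# Rescaling the composition (a change of units) leaves the clr values unchanged.
z_rescaled = clr([x * 10 for x in sample])
```

Two properties are worth noting: the clr values of a sample sum to zero, and they do not depend on the units or closure constant of the original composition, which is why the transform sidesteps the constant-sum problem.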

Principal component analysis
Principal component analysis (PCA) is an affine geometric transformation technique that increases the interpretability of a multivariate dataset (Tolosana-Delgado and McKinley, 2016). PCA is a linear dimensionality reduction technique based on Euclidean methods that transforms multiple variables into a smaller number of principal components (PCs), where principal component 1 (PC1) represents the direction of maximal variance in the dataset (Mueller et al., 2020). The second principal component (PC2) represents the direction of second-largest variance and is orthogonal to PC1 (Vermeesch, 2013). If each variable contributed equally to the variance of the dataset, each PC would explain (100/n)% of the variance, where n is the number of variables. With the 47 elements used in this analysis, each PC would therefore explain 2.1% (100/47) of the variance.
The first several PCs approximate the combination of elements that causes the highest degree of variation in the dataset. The last PCs explain the least variance and may correspond to quasi-constant elements, which could be used to study immobile and mobile element migration if a geospatial/geochemical dataset is used (Tolosana-Delgado and McKinley, 2016). PCA aids the interpretation of elemental variance within a compositional dataset and can be used in the process discovery phase of research (Grunsky, 2010; McKinley et al., 2018).
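The equal-contribution benchmark is simple arithmetic, sketched below in Python. The function names and the explained-variance percentages are hypothetical and are not results from this study; the sketch only shows how PCs could be compared against the 100/n threshold.

```python
def equal_share_percent(n_variables):
    """Variance each PC would explain if all variables contributed equally."""
    return 100.0 / n_variables


def components_above_equal_share(explained_percent, n_variables):
    """1-based indices of PCs explaining more than the equal share."""
    threshold = equal_share_percent(n_variables)
    return [i + 1 for i, p in enumerate(explained_percent) if p > threshold]


print(round(equal_share_percent(47), 1))  # 2.1

# Hypothetical explained-variance percentages for the first five PCs.
hypothetical = [30.0, 12.0, 5.0, 2.0, 1.0]
print(components_above_equal_share(hypothetical, 47))  # [1, 2, 3]
```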
Each sample has a distinct score for each principal component, with the scores representing the samples in the transformed Euclidean space.
Each principal component has associated loadings, which represent the size and direction of the contribution from each of the original variables (i.e. elements) (Abdi and Williams, 2010).
The position and size of each loading are significant: loadings with similar directions depict variables with similar variation patterns, whereas loadings that differ greatly in direction indicate lower correlation (Dempster et al., 2013).
For the purposes of this study, four ML models are compared to investigate the suitability of using topsoil geochemistry for predicting geogenic radon potential categories. Results from RF, GPR and LR are compared with a baseline/control model, which assigns topsoil samples to geogenic radon classes with equal probability. The ML algorithms (RF, LR, GPR) were chosen for their robustness and ability to analyse large (> 1000 samples) multivariate datasets.
In simple terms, Random Forest (RF) can be used as a supervised classification algorithm that aims to predict which group an observation belongs to. A random forest classifier is an ensemble model that builds multiple decision trees on different subsets of the dataset and employs averaging to enhance predictive accuracy while mitigating overfitting (Breiman, 2001; Farhadi et al., 2022; Shang et al., 2019). A more elaborate explanation of the mathematical theory underpinning RF can be found in the literature (Biau and Scornet, 2016; Breiman, 2001; Schonlau and Zou, 2020). The 'RandomForestClassifier' function in the sklearn library (Python version 3.6) was used to implement the random forest model.
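The control model can be pictured as a classifier that ignores the geochemistry entirely. The sketch below is an assumption about how such a baseline might be written, not the study's implementation; the names `control_predict` and `accuracy` and the synthetic labels are illustrative.

```python
import random


def control_predict(n_samples, classes=(1, 2), seed=0):
    """Assign each sample to a class with equal probability, ignoring features."""
    rng = random.Random(seed)
    return [rng.choice(classes) for _ in range(n_samples)]


def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted class matches the true class."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Synthetic, perfectly balanced labels (not real data).
y_true = [1, 2] * 5000
y_pred = control_predict(len(y_true))
acc = accuracy(y_true, y_pred)
print(f"control accuracy: {acc:.3f}")
```

On balanced classes, such a baseline is expected to score close to 50% accuracy, which is the reference point against which the trained models are judged.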
Gaussian Process Regression (GPR) is a nonparametric ML algorithm based on Bayesian probability, applied here to classification. The GPR model initially forms a prior distribution, then uses the training dataset to reallocate probabilities and form a posterior probability distribution. Thorough explanations of the theory underlying GPR can be found in the literature (Bernardo et al., 1998; Bousquet et al., 2011; Kanagawa et al., 2018). The 'GaussianProcessClassifier' function in the sklearn library (Python version 3.6), with a squared exponential kernel, was used to train and test the GPR model.
Logistic regression (LR) is a statistical method for modelling the relationship between a binary dependent variable and one or more independent variables by estimating probabilities using a logistic function. More details regarding the LR mathematical model are published in the literature (Hastie et al., 2009; Kirasich et al., 2018; Sperandei, 2014). Specifically, the 'LogisticRegression' function in the sklearn library (Python version 3.6) was used.
Cross-validation is important for determining the stability of model parameters. Spatial cross-validation aims to eliminate overfitting in the model due to spatial autocorrelation (Pohjankukka et al., 2017; Talebi et al., 2022). The Tellus North Midlands G5 dataset was spatially cross-validated using 5-fold k-means cluster spatial blocking. The dataset was split into 5 separate spatial blocks; in each fold, 4 blocks were used to train the model and the remaining block was used to test it. To balance the model, equal numbers of samples from classes 1 and 2 were randomly drawn during training and testing. The 5-fold cross-validation was repeated 10 times for RF, GPR, LR and the control model.
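The blocking and balancing steps can be sketched as follows. This is an illustrative pure-Python outline under stated assumptions, not the study's code: k-means is written out as a basic Lloyd's algorithm, balancing is done by random undersampling of the larger class, and the coordinates and labels are synthetic. All function names are hypothetical.

```python
import random


def kmeans_blocks(coords, k=5, n_iter=50, seed=0):
    """Assign each (x, y) coordinate to one of k spatial blocks (Lloyd's algorithm)."""
    rng = random.Random(seed)
    centres = rng.sample(coords, k)
    labels = [0] * len(coords)
    for _ in range(n_iter):
        # Assignment step: nearest centre by squared Euclidean distance.
        labels = [
            min(range(k), key=lambda j: (x - centres[j][0]) ** 2 + (y - centres[j][1]) ** 2)
            for x, y in coords
        ]
        # Update step: move each centre to the mean of its members.
        for j in range(k):
            members = [c for c, lab in zip(coords, labels) if lab == j]
            if members:
                centres[j] = (
                    sum(m[0] for m in members) / len(members),
                    sum(m[1] for m in members) / len(members),
                )
    return labels


def balanced_indices(indices, y, rng):
    """Randomly undersample so that classes 1 and 2 are equally represented."""
    by_class = {1: [], 2: []}
    for i in indices:
        by_class[y[i]].append(i)
    n = min(len(by_class[1]), len(by_class[2]))
    return rng.sample(by_class[1], n) + rng.sample(by_class[2], n)


def spatial_folds(coords, y, k=5, seed=0):
    """Yield (train, test) index lists, holding out one spatial block per fold."""
    rng = random.Random(seed)
    blocks = kmeans_blocks(coords, k=k, seed=seed)
    for held_out in range(k):
        train = [i for i, b in enumerate(blocks) if b != held_out]
        test = [i for i, b in enumerate(blocks) if b == held_out]
        yield balanced_indices(train, y, rng), balanced_indices(test, y, rng)


# Synthetic sample locations and class labels (not real data).
data_rng = random.Random(1)
coords = [(data_rng.uniform(0, 100), data_rng.uniform(0, 100)) for _ in range(200)]
y = [data_rng.choice((1, 2)) for _ in coords]
folds = list(spatial_folds(coords, y))
```

Because each test block is spatially separated from its training blocks, nearby and therefore autocorrelated samples are less likely to appear on both sides of the split. Repeating the procedure with different seeds would mimic the repeated cross-validation described above.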

Results and discussion
The primary aim of this case study is to investigate the applicability of topsoil geochemistry for classifying geogenic radon potential at a regional scale. Initially, PCA is used to determine the extent of variation within the dataset and to identify clusters of elements. Several ML models are then tested to investigate whether any can use the topsoil geochemistry to predict the geogenic radon potential (GRP) class. Summary statistics (i.e. mean, min, max, quartiles 1-3, standard deviation, skewness and kurtosis) as well as Pearson r and r² results are included in the supplementary materials (supp. mat.).

Principal component analysis results
The PCA biplot (Fig. 2) shows the data cloud colour-coded by GRP class. It is important to note that PCA does not use the assigned GRP class when calculating the loadings. Nevertheless, the data cloud shows samples with a low GRP class concentrating in the positive PC1-PC2 region of the biplot (i.e. top right), whereas samples collected from regions with a higher GRP class have lower PC1-PC2 scores (i.e. bottom left) (Fig. 2). As such, the data cloud shows directionality, indicating that GRP class decreases as PC1 and PC2 scores increase.
The PCA results indicate that radon-prone elements are spatially constrained, suggesting that topsoil geochemistry can be used as a tool to understand the geogenic radon potential of an area. Explicitly, if a topsoil sample contains elevated concentrations of elements with negative PC1 and PC2 loadings, the topsoil may correlate with higher geogenic radon potential. Conversely, elevated concentrations of elements with positive PC1-PC2 loadings may indicate that the soil has lower geogenic radon potential.
The PC1 score map was superimposed onto the 1:100 k GSI bedrock map using ArcMap 10.8.1. The full legend for the different bedrock geologies can be found on GSI's online map viewer (www.gsi.ie, accessed 2022). High PC1 scores occur above various limestones, including the Waulsortian limestones, Croghan limestone Formation, Ballymore limestone Formation, Cong Canal Formation and the majority of the Burren Formation (Fig. 3a). However, some of the highest PC1 scores also occur above the Devonian granites in Co. Galway and Co. Dublin (Fig. 3b), where a higher concentration of uranium and radon-prone elements would be expected (i.e. if the soil formed from the underlying bedrock and insignificant chemical and physical weathering occurred). The lack of radon-prone elements in soils above the Devonian granites indicates that (a) the soil did not form from the underlying bedrock or, more plausibly, (b) sedimentary processes including weathering and erosion have mobilised radon-prone minerals/elements, which were then deposited above different bedrock geologies (Moles and Moles, 2002). Mobility of elements in soils due to glacial activity has been shown for Northern Ireland, north of the study area (Dempster et al., 2013).
Elements with low PC1 loadings (including Tl, Y, Mn, Cr, Ni, Co, Al, V, Sc, Be and Zr) are observed in higher concentrations above Namurian (undifferentiated) sandstones, siltstones and shales, as well as in soils above siltstones and sandstones in the east of the study area, including the Balrickard Formation (Namurian sandstone and shale), Walshestown Formation (Namurian shale, sandstone, limestone), Denhamstown Formation (Silurian greywacke sandstone and siltstone) and Clatterstown Formation (Silurian siltstone and sandstone). Low PC1 scores also occur above the limestones and shales located in the north-east of the study area (i.e. the Mornington Formation, Crufty Formation, Lucan Formation, Loughshinny Formation and Clontail Formation (calcareous red-mica greywacke)).
The highest PC2 scores occur above the Shannapheasteen and Galway Granite and the Leinster Granite (Fig. 3b). Higher PC2 scores also occur scattered in the middle and east of the study area above various limestones. The most negative PC2 loadings (< −0.06) are linked to Rb, Li, Cs, Th, Ce, Al, K, Ga, La and Cr (supp. mat.). Low PC2 scores are found above the undifferentiated Visean limestones, dark fine-grained limestone and shale, as well as in proximity to black mudstone, siltstone and greywacke (Fig. 3b). The inverse trend of GRP class decreasing as PC1-PC2 values increase can also be observed by comparing the GRP class map (Fig. 1b) with the PC1-PC2 score maps (Fig. 3). The lowest PC1 scores (Fig. 3a) correspond with areas that show moderate-high geogenic radon potential (GRP) in the North Midlands (Fig. 1b). Areas exhibiting a low GRP class (Fig. 1b) correspond to areas with high PC1-PC2 scores (Fig. 3).
A canonical heatmap is shown for the shallow topsoil elements measured in the Tellus North Midlands G5 study area, using ilr-transformed data (Fig. 4). The covariance of elements that cluster together on the PC1-PC2 biplot (Fig. 2) is reinforced by their relatively positive correlation on the canonical heatmap (Fig. 4). Elements with low PC1-PC2 loadings, including La, Y, Zr, Hf, Tl, Sc, Be and Mn, as well as Cr, Al, Ga, V, Ni, Co, Li and Rb, show relatively positive correlation on the heatmap (Fig. 4). In comparison, elements with high PC1-PC2 loadings, including S, Na, Se, Sr, Hg and Ca, are positively correlated with each other and mostly exhibit negative correlation with the low PC1-PC2 loadings (Figs. 2 and 4).

Machine learning model performance
The accuracy, precision, recall, f1-score and ROC-AUC results for RF, LR, GPR and the control machine learning model are reported in Table 1. The GPR model is the most accurate (74%; f1-score 0.74), followed by LR (74%; f1-score 0.73) and RF (73%; f1-score 0.73), although the accuracies of these models are within uncertainty of one another (Table 1). The GPR, LR and RF models are significantly more accurate than the control model (50.2%).
The performance of the ML models supports a link between topsoil geochemistry and geogenic radon class, as the models are capable of identifying patterns in the training set that predict the radon class of unseen data with an average f1-score of ~0.73.
The average true positive, true negative, false positive and false negative results from each model were used to compute a simplified confusion matrix (Table 2). Type I errors (false positives) represent samples that were mistakenly classified in a high GRP class. Type II errors (false negatives) occur when a model misclassifies a high-risk GRP sample as being in the low-risk GRP class.
The type I error is lowest in the LR model (22%), followed by GPR (23%) and RF (32%). The control model has the highest rate of type I and type II errors (49.9%). The RF model has the lowest rate of type II errors (23%), followed by GPR (29%) and LR (29%). Overall, the percentage of correctly classified samples is consistently higher in the GPR (73.9%), LR (73.7%) and RF (72.8%) models than in the control model (50.1%) (Table 2). A model chosen for low type I error would minimise the resources spent on targeting low geogenic radon areas that are mistakenly classified as high, whereas a model with the lowest type II error would be the most effective for targeting areas associated with a higher geogenic radon potential probability.
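How these quantities follow from a confusion matrix can be sketched in Python. The sketch treats the higher GRP class (class 2) as the positive class, in line with the definitions above, and assumes the type I rate is FP/(FP+TN) and the type II rate is FN/(FN+TP); the exact normalisation used in Table 2 is not stated here, so that choice is an assumption. The labels are invented.

```python
def confusion_counts(y_true, y_pred, positive=2):
    """Count true/false positives and negatives, with `positive` as the higher GRP class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)  # type I errors
    fn = sum(t == positive and p != positive for t, p in pairs)  # type II errors
    return tp, tn, fp, fn


def summarise(tp, tn, fp, fn):
    """Standard classification metrics derived from the four counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),
        "type_I_rate": fp / (fp + tn),   # assumed normalisation
        "type_II_rate": fn / (fn + tp),  # assumed normalisation
    }


# Invented labels: four low-GRP (1) and four higher-GRP (2) samples.
y_true = [1, 1, 1, 1, 2, 2, 2, 2]
y_pred = [1, 1, 1, 2, 2, 2, 2, 1]
counts = confusion_counts(y_true, y_pred)
metrics = summarise(*counts)
print(counts)  # (3, 3, 1, 1)
```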

Feature importance
The feature importance of the ML models indicates which elements are important for distinguishing low and higher GRP classes. The feature importance is calculated for each of the four models using sklearn libraries in Python (Fig. 5). The LR, RF and GPR models show a pattern of elements that rank among the most important features in at least two models (Fig. 5); these elements include Y, Tl, Mn, Sc, Ba, Sb, Cr, U, Li and Rb. Some elements are ranked as important features in individual models only. For instance, Random Forest also ranks Pb, U, Zn, Sr, P and Zr among the important features, Gaussian Process Regression additionally ranks Co, Cu and Be highly, whereas Logistic Regression includes Al, Rb and Li among its most important features (Fig. 5).
Logistic Regression ranks Y, Al, Rb, Mn, Sc, Li, Cr and K with high feature importance (Fig. 5); these elements group together on the PCA biplot with the negative PC1-PC2 loadings (Fig. 2). The majority of the next highest-ranking features also occur in the negative PC1-PC2 biplot cluster (i.e. Ga, Cs, Nb, Ce and Tl). Gaussian Process Regression also highly ranks elements with negative PC1-PC2 loadings, i.e. Y, Tl, Mn, Sc, Co, Be and Cr. Other highly ranked features include Cu, Ba and U, which have positive PC2 and negative PC1 loadings. Random Forest highly ranks elements from various PC1-PC2 biplot clusters, namely the negative PC1-PC2 cluster (Tl, Y, Mn and Sc) and the positive PC2-negative PC1 cluster (Ba, P, U, Zn, Sb and Pb). Elements with low PC1-PC2 loadings are consistently ranked highly for feature importance in several ML models. However, elements with more positive PC1 and PC2 loadings also occasionally have higher feature importance in the ML models, e.g. Pb, Sb, Ca, Sr, Hg and Se. The latter results indicate that the models also use the geochemical signature of areas classified as low GRP when performing a prediction.
Elements with low PC1 and PC2 loadings are more frequently ranked among the most important features, reinforcing the PCA directionality in which low PC1-PC2 scores are associated with a higher GRP class (Fig. 2).

Geological meaning of affinities
The affinities between shallow topsoil geochemistry and geogenic radon potential across the study area are influenced by different geological processes. At any given location, the geochemistry of a topsoil sample integrates the processes involved in rock formation, weathering, glaciation and soil formation, and possibly anthropogenic processes (Dosseto et al., 2011; Heimsath et al., 1997; Johnson and Watson-Stegner, 1987; Shepherd, 1989). It is important to highlight that affinities between elements can change if chemical and physical parameters change. Some possible explanations for the topsoil affinities are elaborated below.
The affinity of Co with Ni (0.85), Cr (0.68) and V (0.62) in high geogenic radon regions (low PC1 areas; Fig. 4) could be related to sulphide, ferromagnesian and oxide minerals. For example, Ni and Co can covary in siegenite ((Ni,Co)₃S₄), pyrrhotite (Fe₁₋ₓS) or pentlandite ((Fe,Ni)₉S₈) as stoichiometric lattice substitutions in hydrothermal sulphide deposits; alternatively, Ni and Co, along with Cr and V, can be adsorbed onto fine-grained inorganic particles (Jansson and Liu, 2020; Kovalev et al., 2014; Loring, 1976). As such, covariances between Ni, Co, Cr and V can form as a result of different geological processes. However, considering that the geology underlying the Co, Ni, V and Cr covariances is dominated by sedimentary lithologies, it is likely that the [Co, Ni, V, Cr] affinity relates to sedimentary processes, namely adsorption onto clay-rich or shale layers within limestones. Nevertheless, the Navan Zn-Pb deposit in Co. Meath, which formed as a result of hydrothermal sulphides interacting with carbonate host lithologies, could contribute to the Co, Ni and Fe affinity (Ashton et al., 1980; Johnston et al., 2013; Marks et al., 2017; Yesares et al., 2019). This highlights that heterogeneous geological processes across the study area could be contributing to the affinity between elements.
It is possible that physical weathering and leaching of the granites is the source of the Li, Cs and Rb covariation observed in the high geogenic radon regions (i.e. above the bedrock formations in the east of the study area and in tills derived from limestones and from Silurian and Namurian sandstones and shales). It is also possible that these [Li, Rb, Cs] affinities within soils in higher geogenic radon potential areas are due to the clay minerals illite and chlorite, which are reported to incorporate traces of radon precursor elements (i.e. uranium and radium) by adsorption in phosphorus-rich environments (Benedicto et al., 2014; Kim et al., 2017; Liao et al., 2020; Mei et al., 2022). Confirming the geological provenance of element affinities within the topsoil would require specific mineralogical analysis, petrographic microscopy and/or provenance studies on bedrock and soil samples to validate the geological processes and controls responsible for the geochemical affinities.

Limitations
There are some discrepancies between the feature importance from the various models, likely connected with the different model architectures.
To evaluate the feature importance on the same footing for all models, we applied the same method, specifically permutation importance, as implemented in the sklearn library (Python version 3.6). Considering that topsoils are partially derived from bedrock, the topsoil geochemical signatures that relate to radon-prone areas are associated with the geologies within the study region. It is important to note the limitation of applying the results of our compositional and multivariate statistical analysis to areas of distinctly different geologies.
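Permutation importance is model-agnostic: a feature's importance is the drop in score when that feature's values are shuffled, which breaks its link to the target. The study used the sklearn implementation (available as `sklearn.inspection.permutation_importance`); the following is only a minimal pure-Python sketch of the idea, with a hypothetical toy model and synthetic data.

```python
import random


def accuracy(model, X, y):
    """Fraction of rows for which the model's prediction matches the label."""
    return sum(model(row) == label for row, label in zip(X, y)) / len(y)


def permutation_importance(model, X, y, n_repeats=10, seed=0):
    """Mean drop in accuracy when each feature column is shuffled in turn."""
    rng = random.Random(seed)
    baseline = accuracy(model, X, y)
    importances = []
    for j in range(len(X[0])):
        drops = []
        for _ in range(n_repeats):
            column = [row[j] for row in X]
            rng.shuffle(column)
            # Rebuild the rows with only column j permuted.
            X_perm = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, column)]
            drops.append(baseline - accuracy(model, X_perm, y))
        importances.append(sum(drops) / n_repeats)
    return importances


def toy_model(row):
    """Hypothetical classifier that uses feature 0 only."""
    return 2 if row[0] > 0 else 1


# Synthetic two-feature data; labels depend on feature 0 alone.
data_rng = random.Random(1)
X = [[data_rng.uniform(-1, 1), data_rng.uniform(-1, 1)] for _ in range(200)]
y = [toy_model(row) for row in X]
importances = permutation_importance(toy_model, X, y)
```

In this toy case the first feature receives a clearly positive importance and the second receives exactly zero, because the model never reads it. Applying one such procedure to every model is what allows the rankings in Fig. 5 to be compared across different architectures.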

Conclusion
Exposure to indoor radon gas is a major contributor to lung cancer globally (Gaskin et al., 2018; World Health Organization, 2009), and it is in the interest of human wellbeing to provide rigorous research aimed at understanding natural radon distribution. Soil gas is the main contributor to indoor radon and, considering that no national soil-gas radon dataset is available, our study set out to investigate whether topsoil geochemistry can be used to predict geogenic radon class. Multiple methods of data analysis are used to investigate the extent to which topsoil geochemistry can be used to understand geogenic radon distribution. The ML models tested can use topsoil geochemistry to predict radon class with ~74% accuracy (f1-score ~0.73). Elements that are relatively more correlated with geogenic radon consistently group together when comparing multivariate and univariate results for regional-scale data (Tellus 'North Midlands', Ireland). The feature importance from several machine learning algorithms designed to test the relation of topsoil geochemistry to geogenic radon reinforces the hypothesis that topsoil geochemistry can be indicative of geogenic radon. In particular, Y, Tl, Mn, Cr, Co, Al and Sc are elements commonly associated with elevated geogenic radon. Additionally, Pb, Sb, Ca and Se are among the elements frequently negatively correlated with geogenic radon. The results demonstrate the value that topsoil geochemical surveys can have in determining radon-prone areas. The analytical approach used in this paper contributes a thorough and novel method for extracting useful information from topsoil geochemical data. The machine learning results demonstrate a rigorous method for testing and interpreting large datasets applied to geogenic radon studies, although the methodologies are also applicable to other areas of research such as resource exploration (e.g. topsoil geochemistry applied to predicting underlying lithologies as opposed to geogenic radon). Overall, we contribute a better understanding of the correlation between topsoil geochemistry and geogenic radon, and provide a robust methodological approach for interpreting compositional data. As such, the methods presented here could be applied as a diagnostic tool to assist radon mitigation measures, adding value to legacy soil geochemistry datasets.

Role of funding source
The partial SUSI grant provided for Méabh Banríon's Ph.D. has no involvement or role in the study design; the collection, analysis and interpretation of data; the writing of the report; or the decision to submit the article for publication. The financial scholarship provided to Matteo Cobelli from IRC has no involvement or role in any aspect of this research project, including research design, sampling, analysis, report writing or decision to submit the article for publication.

Declaration of competing interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Fig. 1. (A) Study area in relation to the GSI bedrock geology map (1:100 k); the topsoil samples are shallow topsoil geochemical samples collected by the GSI Tellus topsoil geochemical survey (G5 North Midlands study area). (B) Geogenic radon potential class for the study area (derived from Fig. 5, Elío et al. 2020). Both maps were made using ArcMap 10.8, utilizing GSI basemaps and the OSI Coast - National 250k Map of Ireland from Tailte Éireann.

Fig. 2. PC1 vs PC2 biplot for Tellus North Midlands topsoil geochemistry data. PCA scores are grouped by geogenic radon potential class (i.e. each topsoil sample is assigned to the Low (L), Moderate-Low (ML), Moderate-High (MH) or High (H) GRP class derived from Elío et al., 2020b). 'R' software was used, specifically the 'prcomp' function with the centre and scale arguments set to TRUE.

Fig. 4. Shallow topsoil geochemistry ilr-transformed canonical heatmap. Blue and positive numbers up to +1 depict positive correlations, and red and negative values down to −1 depict negative correlations between variables.

Table 1
Accuracy, precision, recall, f1-score and ROC-AUC results for Random Forest, Logistic Regression, Gaussian Process Regression and the control machine learning model. The type I and type II errors are reported after ± in the relevant cells.

Table 2
Confusion matrix results for Random Forest, Gaussian Process Regression, Logistic Regression and a control model applied to classifying topsoil samples into geogenic radon potential classes. The asterisk in '*Type I' refers to the percentage rate of false positives, while '**Type II' refers to the percentage rate of false negatives.