Forensic soil provenancing in an urban/suburban setting: A sequential multivariate approach

Abstract Compositional data from a soil survey over North Canberra, Australian Capital Territory, are used to develop and test an empirical soil provenancing method. Mineralogical data from Fourier transform infrared spectroscopy (FTIR) and magnetic susceptibility (MS), and geochemical data from X‐ray fluorescence (XRF; for total major oxides) and inductively coupled plasma‐mass spectrometry (ICP‐MS; for both total and aqua regia‐soluble trace elements), are obtained for the survey's 268 topsoil samples (0–5 cm depth; 1 sample per km²). Principal components (PCs) are calculated after imputation of censored data and centered log‐ratio transformation. The sequential provenancing approach is underpinned by (i) the preparation of interpolated raster grids of the soil properties (including PCs); (ii) the explicit quantification and propagation of uncertainty; (iii) the intersection of the soil property rasters with the values of the evidentiary sample (± uncertainty); and (iv) the computation of cumulative provenance rasters (“heat maps”) for the various analytical techniques. The sequential provenancing method is tested on the North Canberra soil survey with three “blind” samples representing simulated evidentiary samples. Performance metrics of precision and accuracy indicate that the FTIR and MS (mineralogy), as well as XRF and total ICP‐MS (geochemistry) analytical methods, offer the most precise and accurate provenance predictions. Inclusion of PCs in provenancing adds marginally to the performance. Maximizing the number of analytes/analytical techniques is advantageous in soil provenancing. Despite acknowledged limitations and gaps, it is concluded that the empirical soil provenancing approach can play an important role in forensic and intelligence applications.


| INTRODUCTION
Soils are complex mixtures of minerals, amorphous material, organic matter, water, gases, organisms, and, in places, man-made particles. The composition of soils is fundamentally controlled by their location through the environmental controls of climate (moisture, temperature), life (plants, organisms), topography (elevation, aspect, slope, relief), substrate (geology, parent material), and time (weathering), among others, as first articulated by Jenny in 1941 [1]. Thus, the natural soil composition varies in a largely predictable and structured, rather than random and chaotic, fashion. Therefore, coherent maps showing the spatial variability of natural soil parameters can be produced provided the density at which they are measured is appropriate relative to the scale of their heterogeneity. Human land use may either confound or complement understanding of the spatial patterns. Once a series of soil property maps is produced, it can serve two important forensic purposes: (i) assessing the evidentiary relevance of observing nondistinguishable questioned and control samples, and (ii) constraining the spatial provenance of an unknown questioned soil sample.
The use of geological material such as soil in forensic investigations is increasing in police forces around the world, including the Federal Bureau of Investigation, the Royal Canadian Mounted Police, and the Australian Federal Police (e.g., [2][3][4][5][6][7]). In Australia, successful soil forensic investigations have contributed evidence that has been used in Australian Supreme Courts (e.g., [8]). Forensic soil provenancing can be defined as the capability to spatially constrain the likely region of origin of an evidentiary sample of earth-related material [9,10]. Rawlins et al. [9] characterized the prediction of the provenance of a sample of earth-related material as "one of the most difficult and challenging tasks for analytical earth scientists." Caritat et al. [11] introduced a predictive soil provenancing method that does not require a specific soil survey to be carried out over an area of interest. More typically, however, forensic soil provenancing is implemented empirically by comparing the spatial multivariate information contained in the evidentiary soil's geochemistry, mineralogy, bulk properties, etc., to either purposely acquired or pre-existing knowledge (see fig. 1 in [11]). Such knowledge generally is derived from soil geochemical surveys and stored in databases containing this same or similar multivariate information over the region of interest at an appropriate density [12]. Geochemical surveys come in many guises (e.g., [13,14]) and although many already exist at a range of spatial coverages (continental to local), sampling densities (1 sample per 1000s of km² to 100s of samples per 1 km²), and sampling media (materials) selections (topsoil, C horizon, sediment, …), forensic applications have specific requirements that may not have been the primary focus of the original surveys [15].
Despite this, these pre-existing surveys and associated databases have their use in forensic applications, as long as their limitations (e.g., sampling density, sampling medium, sample collection method) are understood.
Finally, if the evidentiary sample is non-distinguishable from a particular region of origin, a detailed forensic investigation can proceed there. If the comparison is unsuccessful or inconclusive, more and better data must be collated (if pre-existing) or collected (if not), which may imply undertaking a more refined geochemical survey at a scale relevant for the case at hand.
In this and a companion paper, we describe and compare different approaches to soil provenancing based on a local (i.e., relatively small area and relatively high sampling density) soil geochemical survey in and around North Canberra, Australian Capital Territory, in inland southeastern Australia. The approaches under consideration are (i) a sequential multivariate approach (this paper), and (ii) a simultaneous multivariate (degree of geochemical similarity) approach (upcoming paper in preparation). A complementary probabilistic (likelihood ratio) approach will be published separately (upcoming paper in preparation). The aims of the present contribution accordingly are to:
• briefly introduce the North Canberra soil geochemical survey
• present the sequential multivariate provenancing approach
• present results for this method
• quantify the performance of this approach
• draw conclusions as to the suitability of the sequential multivariate provenancing approach for forensic and intelligence applications

• Those properties are interpolated to create 250 × 250 m raster grids over the survey area.
• Evidentiary (blind) sample properties are compared within uncertainty to grid cell values.
• For every grid cell a score of 1 is given if a property matches the blind sample value, 0 otherwise.
• Scores are added for all properties, mapping areas more closely matching the blind samples.

| The North Canberra soil geochemical survey
The North Canberra soil geochemical survey was initiated in 2017 and focused on the northern part of Canberra city and surrounding suburban areas, in the Australian Capital Territory (ACT) (Figure 1). The total area covered by the survey was ~260 km², sampled at an average density of 1 site/km² [35]. In addition to the 268 samples in this survey, three "blind" samples (Blind 1, Blind 2, and Blind 3 hereafter) were collected from sites within the survey area (but away from the survey's grid samples). Their geographical coordinates, and even their approximate locations, were unknown to the lead researcher until the project had concluded all data analysis and map production.
General background, results, and interpretations of the geochemical mapping of the ACT, including the investigation of the effects of lithology and land use on soil geochemistry, will be presented elsewhere (upcoming paper in preparation). A simple description of the blind sample sites is, however, warranted here as it will have a bearing on the interpretation of the provenance analysis we focus on (see Fig. S1 in Appendix S1). We note here that Blind 1 was deliberately collected from a local environment not representative of the broader landscape to test the limit of soil provenancing. Blind 2 is a Kurosol/Rudosol (Alluvial) soil collected over undifferentiated Quaternary alluvium and fluvial deposits of gravel, sand, silt, and clay along Ginninderra Creek. Blind 3 is a Kurosol soil collected over a thin, folded Acton Shale Member, an Early Ordovician black graptolitic siliceous shale within the broader turbiditic (sandstone, mudstone, shale) Adaminaby Group.
In this paper, the analytical focus is directed to both (i) soil mineralogy via infrared spectroscopy (informing on, e.g., hydrated minerals such as clay minerals, carbonates, and sulfates) and magnetic susceptibility (informing on, e.g., ferrimagnetic minerals such as maghemite or magnetite, and their grain sizes); and (ii) soil geochemistry via major oxides and organic matter concentrations as well as trace element concentrations after two chemical extractions of different strengths. Sample collection, preparation, and analysis methods are detailed in the Appendix S1, as are data analysis, spatial analysis, quality control, and detailed uncertainty analysis procedures.

| Uncertainty analysis
Uncertainty arises from any attempt to quantify natural phenomena, from sampling through to analysis. In this project, two main types of uncertainty were specifically quantified: measurement uncertainty ($U_m$) and interpolation uncertainty ($U_i$). They were quantified as three standard deviations of field triplicates ($SD_m$) and of residuals ($SD_i$), respectively, that is, $U_m = 3\,SD_m$ and $U_i = 3\,SD_i$. Residuals are the differences between the interpolated (modeled) values and the measured values at each sampled site. The combined uncertainty ($U_c$), which applies to the generated property raster surfaces, is calculated using the root sum of squares method (e.g., [37,38]) as follows:

$$U_c = \sqrt{U_m^2 + U_i^2} \qquad (1)$$

The standard deviations ($SD_m$, $SD_i$) and uncertainties ($U_m$, $U_i$, $U_c$) of each analyte are given in Table S3 in Appendix S1.
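The root sum of squares combination described above can be illustrated with a minimal Python sketch. The function name `combined_uncertainty` and the keyword `k` are hypothetical names introduced here for illustration; the default multiplier of 3 follows the "three standard deviations" stated in the text.

```python
import math


def combined_uncertainty(sd_m: float, sd_i: float, k: float = 3.0) -> dict:
    """Return U_m, U_i and U_c for one analyte.

    sd_m: standard deviation of field triplicates (measurement).
    sd_i: standard deviation of interpolation residuals.
    k:    uncertainty multiplier (3 in the text; hypothetical parameter name).
    """
    u_m = k * sd_m  # measurement uncertainty
    u_i = k * sd_i  # interpolation uncertainty
    u_c = math.hypot(u_m, u_i)  # root sum of squares of U_m and U_i
    return {"U_m": u_m, "U_i": u_i, "U_c": u_c}
```

Because the two terms are combined in quadrature, the larger of the two uncertainties dominates $U_c$.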

| Determination of search ranges
For each variable, the Search Range (SR) for a blind (evidentiary) sample was set to the measured value of that variable in that blind sample (Target Value or TV) with a buffer reflecting the sum of the uncertainty in the analytical data ($U_m$) and of the uncertainty in the raster surface ($U_c$), according to:

$$SR = TV \pm (U_m + U_c) \qquad (2)$$

This accounts for uncertainty in both the interpolated surface (which is derived from measured values in the database and a smoothing interpolation algorithm), via $U_c$, and the measured value in the blind sample, via $U_m$, as illustrated in Figure 2. The graphic illustrates that the interpolated grid value for a particular soil property needs to fall within the uncertainty envelope ($U_m + U_c$) around that soil property for the evidentiary sample to count as a match and score a 1 in the provenance raster computation (see below and Appendix S1).
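As a hedged illustration of the Search Range test, the sketch below computes the interval TV ± ($U_m$ + $U_c$) and checks whether a single interpolated grid value falls inside it. The function names are hypothetical, and treating the interval bounds as inclusive is an assumption, since the text does not say how values exactly on the boundary are handled.

```python
def search_range(tv: float, u_m: float, u_c: float) -> tuple:
    """Return (lower, upper) bounds of the Search Range around the Target Value."""
    half_width = u_m + u_c  # buffer: measurement plus combined raster uncertainty
    return tv - half_width, tv + half_width


def is_match(grid_value: float, tv: float, u_m: float, u_c: float) -> bool:
    """True if an interpolated grid value lies within the Search Range.

    Inclusive bounds are an assumption made for this sketch.
    """
    lower, upper = search_range(tv, u_m, u_c)
    return lower <= grid_value <= upper
```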

| Raster generation and clipping
Interpolation rasters for each available variable were prepared by inverse distance weighting (IDW), clipped, and analyzed in QGIS as explained in the Spatial Analysis section of the Appendix S1.
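The study performed its interpolation in QGIS. Purely to illustrate the principle of inverse distance weighting, the following NumPy sketch estimates values at query points from the nearest sampled sites. The function name `idw`, its signature, and the handling of query points that coincide with a sample are assumptions of this sketch and do not reproduce the QGIS implementation; the defaults (power 3, 12 neighbors) follow the settings reported in the text.

```python
import numpy as np


def idw(sample_xy, sample_values, query_xy, power=3.0, n_neighbors=12):
    """Inverse distance weighted estimate at each query point."""
    sample_xy = np.asarray(sample_xy, dtype=float)
    sample_values = np.asarray(sample_values, dtype=float)
    query_xy = np.asarray(query_xy, dtype=float)
    k = min(n_neighbors, len(sample_xy))

    # Distances between every query point and every sample: (n_query, n_samples)
    d = np.linalg.norm(query_xy[:, None, :] - sample_xy[None, :, :], axis=2)

    # Keep only the k nearest samples for each query point
    nearest = np.argsort(d, axis=1)[:, :k]
    d_k = np.take_along_axis(d, nearest, axis=1)
    v_k = sample_values[nearest]

    out = np.empty(len(query_xy))
    exact = d_k[:, 0] == 0.0  # query coincides with a sampled site
    out[exact] = v_k[exact, 0]  # assumption: return the measured value there

    w = 1.0 / d_k[~exact] ** power  # weights decay with distance^power
    out[~exact] = (w * v_k[~exact]).sum(axis=1) / w.sum(axis=1)
    return out
```

To build a raster, `query_xy` would hold the centre coordinates of each 250 m cell. The brute-force distance matrix is adequate for a few hundred samples but would need a spatial index for much larger surveys.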

| Provenancing methodology
FIGURE 2 Schematic illustration of the values of a measured variable at seven survey samples A to G (light blue rectangles) with uncertainty (light blue error bars), and inverse distance weighting (IDW) interpolated surface (solid dark blue line) with combined uncertainty $U_c$ (dashed dark blue lines above and below solid line). Blind sample being provenanced is shown as a dark blue rectangle, with its measurement uncertainty $U_m$ (dark blue error bar)
A sequential multivariate approach to soil provenancing based on an empirical database of soil properties is developed in this contribution. The first step in this approach is to measure and map a number of mineralogical (e.g., FTIR, MS) and geochemical (e.g., XRF and ICP-MS) soil properties at the sampled sites. The next step is the interpolation of those properties between sampled sites, here performed using IDW (power 3; 12 neighbors; 250 m cells) as detailed elsewhere. The final step of this method is to select raster cells from those grids whose values fall within the Search Range of the evidentiary sample of interest, $SR = TV \pm (U_m + U_c)$ (Equation 2). This is akin to drawing contours on a topographic map that follow a given elevation with allowance for some slack or uncertainty in that elevation value; this essentially yields a corridor (or corridors) of locations (cells) that satisfy the elevation ± uncertainty criterion. A raster calculation in QGIS assigns a value of 1 to cells that satisfy a given criterion (i.e., those whose soil property values fall within the Search Range), and a value of 0 to those that do not (i.e., those whose soil property values fall outside the Search Range). Once the cells that satisfy the Search Range for one composition variable are established, those for one or many more variables can be added to them. 
This generates a map over the area of interest with cells having values ranging from 0 to N (the number of soil properties under consideration). Such maps can be colored to produce "heat maps" that readily draw attention to those areas where most criteria are satisfied and that are thus more likely to include the potential origin of the evidentiary sample. It is noted that the provenancing methodology presented here is not intended to be used to the exclusion of other provenancing avenues such as soil microbiome or palynology, but rather to complement those by providing a geochemical/mineralogical perspective. Once areas of enhanced provenance potential are identified, further resources can be allocated to these focussed regions with a lower failure risk.
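The cumulative scoring described above was carried out with the QGIS raster calculator. The NumPy sketch below only illustrates the same logic: each property raster contributes 1 where its value lies within the Search Range and 0 elsewhere, and the contributions are summed. The function name, the dictionary-based inputs, and the inclusive bounds are assumptions of this sketch.

```python
import numpy as np


def provenance_scores(rasters, targets, u_m, u_c):
    """Sum per-property match rasters into a cumulative score raster.

    rasters: dict mapping property name -> 2D array of interpolated values
    targets: dict mapping property name -> Target Value of the blind sample
    u_m, u_c: dicts mapping property name -> uncertainties for that property
    """
    if not rasters:
        raise ValueError("at least one property raster is required")
    score = None
    for name, grid in rasters.items():
        grid = np.asarray(grid, dtype=float)
        half_width = u_m[name] + u_c[name]
        # 1 where the cell value lies within TV +/- (U_m + U_c), else 0;
        # NaN cells (e.g., clipped areas) compare False and therefore score 0
        match = np.abs(grid - targets[name]) <= half_width
        score = match.astype(int) if score is None else score + match
    return score
```

The result holds integers from 0 to N, where N is the number of properties supplied, and can be rendered directly as a heat map.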

| RESULTS AND DISCUSSION
A statistical summary of the data collected during this project can be found in Table 1. Lower limits of detection, and proportions of the variance explained by the principal components obtained for the FTIR, XRF, total ICP-MS, and aqua regia (AR) ICP-MS datasets (the latter three after centered log-ratio, or clr, transformation), are given in Appendix S1 (Tables S1 and S2).
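For readers unfamiliar with the clr transformation and principal components mentioned above, a minimal NumPy sketch follows. It assumes that all concentrations are strictly positive, that is, that censored values have already been imputed as described in the text. The function names and the use of singular value decomposition on column-centered data are choices made for this illustration, not a description of the software used in the study.

```python
import numpy as np


def clr(x):
    """Centered log-ratio transform of compositional rows (all parts > 0)."""
    x = np.asarray(x, dtype=float)
    if np.any(x <= 0):
        raise ValueError("clr requires strictly positive values")
    log_x = np.log(x)
    # Subtract each row's mean log, i.e. the log of its geometric mean
    return log_x - log_x.mean(axis=1, keepdims=True)


def pca_scores(z, n_components):
    """Return PC scores and the fraction of variance explained by each PC."""
    z = np.asarray(z, dtype=float)
    zc = z - z.mean(axis=0)  # center each variable
    u, s, _ = np.linalg.svd(zc, full_matrices=False)
    scores = u[:, :n_components] * s[:n_components]
    explained = s**2 / np.sum(s**2)
    return scores, explained[:n_components]
```

The clr transform is invariant to the units in which concentrations are expressed, which is one reason it is commonly applied before multivariate analysis of compositional data.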

| Validation
Standard deviations and uncertainties derived for each parameter as described above are given in Appendix S1 (Table S3). Data for samples Blind 1, Blind 2, and Blind 3 are given in Tables 2, 3, and 4, respectively. The results of soil provenancing investigations using the sequential multivariate approach are discussed below.
The maps of provenance prediction for samples Blind 1, Blind 2, and Blind 3 based on three FTIR principal components and two MS parameters (for a total of five parameters) are shown in Figure 3.
Results indicate that for these three blind samples, 3 of (a theoretical maximum of) 5, 2 of 5, and 3 of 5 parameters match the Search Ranges for Blind 1, Blind 2, and Blind 3, respectively. If the three PCs from FTIR are removed from the analysis and only MS data are considered (not shown), the match rates for these three blind samples change to 1 of 2 for all three blind samples.
The soil provenance rasters generated by the present sequential multivariate provenancing method can be interpreted like "heat maps" where raster grid cells with hotter colors are a better match to the evidentiary sample under investigation than cooler colored cells. In Figure 3A, grid cells colored light, medium, and dark red (scores of 3, 4, or 5) indicate a match equivalent or superior to that of the cell from which simulated evidentiary sample Blind 1 actually comes (which has a score of 3). Provenancing grids computed from the cumulative results of more variables yield smoother, more gradational spatial patterns than those generated from fewer variables, as demonstrated by subsequent figures. In a separate section (Performance Assessment), we will discuss metrics to quantify how good the provenance predictions are.
The maps of provenance prediction for samples Blind 1, Blind 2, and Blind 3 based on 11 compositional XRF parameters are shown in

| Performance assessment
The performance statistics of the sequential method of provenancing soil samples are summarized in Table 5. Each Blind sample behaves slightly differently in terms of provenancing performance. Inclusion of principal components (PCs) in the provenancing workflow provides a marginal advantage in terms of provenancing performance.

| Sensitivity analysis
The sequential multivariate soil provenancing method developed here suggests a number of steps to take for identifying regions within a search area (i.e., cells within a raster) that are more likely to contain the source of an evidentiary (blind) sample being provenanced.
In this section, we test a number of variations on the previously  Table 6 shows the impact of these scenarios relative to the base scenario for XRF and Total ICP-MS analyses.
The sensitivity analysis (

| Limitations and future research
The present study focussed specifically on data analysis workflows for the provenancing of soil trace evidence. It did not address the (acknowledged) issues of (i) sample size available for analysis in a geochemical survey situation vs a crime scene forensic casework; (ii) soil transfer and persistence from the crime scene to the point where soil is sampled for forensic assessment; (iii) the potential for a questioned soil sample from an urban/suburban environment being impacted by human activity (e.g., transported soil for landscaping or engineering purpose); and (iv) the choice of interpolation method to predict the values of a soil property between survey grid points.
The latter point has been the focus of investigations in the past (e.g., [39][40][41][42]), though perhaps not specifically with a forensic application in mind. Other limitations to this provenancing approach, such as contamination, are common to all forensic traces, for example, fingerprinting, biological tissues, fibers, and not specific to soil provenancing; they are of course an important concern and need to be managed by appropriate protocols.
Future research could thus include expanding the present investigation to include (i) micro-analysis techniques, and (ii) quantitative mineralogical and geochemical assessment of soil transfer and persistence (e.g., as footsteps are taken with dirty boots, a car is driven with muddied tires, or a shovel is subjected to drying and shaking to simulate transport in a vehicle).
Despite the acknowledged limitations to the empirical soil provenancing approach developed herein and the recognition that additional research is recommended, it is concluded that empirical soil provenancing based on soil mineralogical and geochemical surveys can play an important role in forensic and intelligence applications.

| SUMMARY AND CONCLUSIONS
A sequential multivariate method of soil provenancing was ap- […] as comprehensive an analytical suite as possible is advantageous as shown by the performance of the ALL methods category; and (iv) inclusion of PCs in the provenancing workflow provides a marginal advantage in terms of provenancing performance compared to not considering PCs. In a companion paper, we will investigate a simultaneous, rather than sequential, empirical soil provenancing method.
Methods are as follows: Fourier transform infrared (_FTIR), mass-specific (Xlf) and frequency-dependent in percent (Xfd_pc) magnetic susceptibility, X-ray fluorescence (_XRF), and aqua regia (_AR) and total (_Tot) inductively coupled plasma-mass spectrometry. Units are as follows: All PCs: dimensionless; Xlf: 10⁻⁶ m³/kg; Xfd_pc: %; XRF: wt%; AR and Tot: mg/kg (ppm). See text for details.
Methods are as follows: Fourier transform infrared (FTIR), magnetic susceptibility (MS), X-ray fluorescence (XRF), and aqua regia (AR) and total (Tot) inductively coupled plasma-mass spectrometry; ALL represents all the above methods combined. Precision (Prc) is defined as the ratio of cells in a grid that have scores equivalent to, or lower than, the score of the cell containing the Blind (evidentiary) sample over the total number of cells. Accuracy (Acc) is defined as the ratio of the score for the cell containing the Blind (evidentiary) sample over the (actual) maximum score obtained at any cell within the grid. Prc and Acc reported in %. See text for details.
TABLE 6 Sensitivity analysis of provenancing performance statistics for the sequential multivariate method for unknown samples Blind 1, Blind 2, and Blind 3 for X-ray fluorescence (XRF) and total (Tot) inductively coupled plasma-mass spectrometry analytical methods, with and without principal components (PCs) included. The reference scenario (Sc 0) is the base case developed herein (IDW power 3; grid origin 679750,6090750; cell size 250 m × 250 m; and uncertainty multiplier 3). Variations modifying one of these parameters at a time are Sc 1 (IDW power 2), Sc 2 (grid origin 679625,6090625), Sc 3 (cell size 500 m × 500 m), and Sc 4 (uncertainty multiplier 6). Precision (Prc) and accuracy (Acc) reported in %. See text for details.
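The precision (Prc) and accuracy (Acc) definitions given in the table notes above can be expressed as a short NumPy sketch. The function name `performance`, the row/column indexing of the cell containing the blind sample, and the exclusion of NaN (clipped) cells from the counts are assumptions of this illustration.

```python
import numpy as np


def performance(score_grid, row, col):
    """Return (Prc, Acc) in percent for the cell at (row, col).

    Prc: share of cells scoring equal to or lower than the blind sample's cell.
    Acc: blind sample cell score relative to the maximum score in the grid.
    """
    grid = np.asarray(score_grid, dtype=float)
    valid = ~np.isnan(grid)  # assumption: ignore cells outside the clipped area
    s = grid[row, col]
    prc = 100.0 * np.count_nonzero(grid[valid] <= s) / np.count_nonzero(valid)
    top = grid[valid].max()
    # Acc is undefined when no cell matches any property
    acc = 100.0 * s / top if top > 0 else float("nan")
    return prc, acc
```

Under these definitions, a high Prc means that few cells outrank the true source cell, and an Acc of 100% means that the true source cell attains the highest score found anywhere in the grid.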

ACKNOWLEDGEMENTS
We would like to express our gratitude toward Australian Federal