Abstract

Recent years have seen an increase in the forensic interest associated with the poison ricin, which is extracted from the seeds of the Ricinus communis plant. Both light element (C, N, O, and H) and strontium (Sr) isotope ratios have previously been used to associate organic material with geographic regions of origin. We present a Bayesian integration methodology that can more accurately predict the region of origin for a castor bean than individual models developed independently for light element stable isotopes or Sr isotope ratios. Our results demonstrate a clear improvement in the ability to correctly classify regions based on the integrated model with a class accuracy of 60.9±2.1% versus 55.9±2.1% and 40.2±1.8% for the light element and strontium (Sr) isotope ratios, respectively. In addition, we show graphically the strengths and weaknesses of each dataset in respect to class prediction and how the integration of these datasets strengthens the overall model.

1. Introduction

Castor bean is the common term used for the seed of the plant Ricinus communis. Castor beans have a long history as a commercial crop through the world and thus are a valuable commercial commodity [1]. However, castor beans also contain the toxic protein ricin, which is classified as a Schedule 1 controlled substance under the Chemical Weapons Convention and as a select agent by several other agencies. In particular, very little ricin is necessary for a lethal dose, estimated to be only 5–10 micrograms per kilogram of body weight if injected or inhaled and 1–20 milligrams if ingested [2]. Although few deaths to date have been attributed to ricin poisoning [3], recent years have seen an increase in the seizure of ricin-containing samples related to biocriminal activity [4]. Thus, methods to track the source of castor beans have the potential to be of value to investigators. Recent methods associated with the attribution of ricin-containing samples have focused on characterization of the procedure that could have been used to extract ricin from the seeds [5, 6]. These methods give valuable information to the investigator. However, the information is associated with the extracted ricin and not the castor bean. Additionally, under the circumstance that the investigator collects castor beans in lieu of the processed product, it may be of interest to discover the region from which those castor beans originated.

By integrating data from a variety of analytical instruments, it may be possible to help identify the geographic source of castor seeds. These integration methods promise to yield a more complete and accurate view of a sample compared to any individual data source. However, heterogeneous data collected from different analytical methods cannot be simply concatenated together to build a statistical model. In addition, adequately orthogonal datasets are required to construct an integration-based classifier that is more accurate than the individual models.

Here, we investigate two sources of isotope ratio (IR) data that may provide insight into the region of origin for castor beans: (1) light element (C, N, O, and H) stable isotope ratios (LeIRs) and (2) Sr isotope ratios (SrIRs). Both data types have been used to associate plant and animal material with regions of origin [711]. Stable isotope ratios of C, O, and H in plants are influenced by climate [12], while 87/86Sr isotope ratios are influenced by bedrock and soil [13], suggesting that they should be treated as independent datasets. We find that each dataset can predict region of origin more accurately than expected by chance, but with moderate success. Further, we present a statistical integration approach based on a simple Bayesian network to utilize the two datasets in combination to predict region of origin (Figure 1). The integrated classification model significantly improves our capability to predict region of origin. In addition, this statistical integration approach simultaneously yields a classification prediction as well as a probabilistic confidence in the identification.

2. Experimental Material and Methods

The castor seeds used for this study were a subset of a larger collection assembled by various means including purchasing seeds in various locales, accepting donations of seeds from collaborators, and sending seeds to volunteer growers throughout the United States [14]. The collection includes ornamental and agricultural varieties of castor seeds; our goal was to gather seeds from as geographically diverse regions as possible, and we accepted all acquisitions regardless of strain. Consequently, the collection is genetically variable. The actual growth conditions of the seeds were also not controlled; seeds could have been from irrigated plots, watered only by precipitation or any other method. From this broader collection, we subsampled seeds from 8 geographic regions from which we had multiple acquisitions to determine whether, despite the variation in genetics and potential cultivation method, there were characteristic isotope ratio values that linked the seeds to their regions of origin.

The nature of the seed collection imposed significant limitations in our definition of growth region. Ideally, a growth region would consist of an area homogeneous in geology, climate, and isotope ratios of precipitation and surface water from which we would have multiple seed acquisitions. Given the opportunistic nature of the castor seed collection, however, it lacked the sampling density in tightly constrained geographic regions required for the ideal experimental design. We therefore defined regions from diverse global locations in which we had multiple acquisitions in relatively limited areas. The sample set used in our analyses consisted of 68 castor seed acquisitions from 8 such geographic regions (Table 1).

We defined two categories of data: (1) 𝛿13C, 𝛿15N, 𝛿18O, and 𝛿2H isotope ratios (LeIRs) of the seeds and (2) 87Sr/86Sr isotope ratios (SrIRs) of the seeds. The average value of the observations within each of these data categories is summarized in Table 1. Boxplots of the data are given in Figures 2 and 3 for the observed LeIRs and SrIR data, respectively.

2.1. Light Element (C, N, O, and H) Isotope Ratios

Light element stable isotope ratios of castor beans were measured by isotope ratio mass spectrometry as described in [14, 15]. In brief, five castor beans from a single geographic source were homogenized using a Retsch MM200 machine (Retsch GmbH & Co., Germany). The C and N stable isotope ratios of the paste were determined on a Finnegan MAT Delta X isotope ratio mass spectrometer (Bremen, Germany) coupled to a Carlo Erba Elemental Analyzer 1108. The O and H stable isotope ratios of the paste were determined using a Thermo-Finnegan Delta Plus XL isotope ratio mass spectrometer (Bremen, Germany) equipped with a thermal conversion elemental analyzer (TC/EA). Stable isotope content is measured as a ratio, R (e.g., 13C/12C), and reported as a delta (𝛿) value where 𝛿=[(𝑅sample/𝑅standard)1]1,000%. In this equation, 𝑅sample is the measured isotope ratio of the sample, and 𝑅standard is the isotope ratio of an internationally recognized standard. The standard for C isotope ratio measurement is Vienna PeeDee Belemnite (VPDB), for N is air (AIR), and for O and H is Vienna Standard Mean Ocean Water (VSMOW) [16].

2.2. Sr Isotope Ratios

Sr isotopes were measured by digesting 1-2 castor bean with 2-3 treatments of concentrated nitric acid and hydrogen peroxide coupled with heating and drying in order to break down all the organics. Strontium was separated from the digested sample using Sr-Spec (Eichrom) resin and nitric acid. The eluted Sr sample was dried and treated with 1-2 drops concentrated acid to further drive off organics and then reconstituted to a final volume of 4 mL of 2% nitric acid. The Sr isotope analyses were performed on a MC-ICPMS (Neptunen Plus) using a standard spray chamber and a self-aspirating nebulizer on 50 ppb solutions. For quality control, NBS-987 was run along with the unknowns (87Sr/86Sr  =  0.71026 ± 1; 𝑛=5). The analyses were corrected for mass bias using 86Sr/88Sr = 0.1194 and normalized to a NBS-987 standard value of 0.71024.

3. Statistical Material and Methods

The statistical model is formulated as a simple Bayesian network (Figure 1). Bayesian statistics is a common approach to make inferences from biological data because all data are treated as random variables. Bayesian models provide a full joint distribution over both the observable and unobservable variables (1). Furthermore, the posterior probability of interest can be computed by integration or summation, such as viewed in (2) [17, 18]. In particular, for the Bayesian formulation in Figure 1, the random variable region (𝑅) is conditionally dependent upon each data type; however, the sources of data are not conditionally related to each other. Thus, the joint probability can be described by these conditional relationships Joint=𝑃(LeIR,SrIR,𝑅)=𝑃(LeIR𝑅)𝑃(SrIR𝑅)𝑃(𝑅).(1) The specific probability of interest is the probability of observing region 𝑘 given our two data types, which can be obtained directly by applying Bayes formula to (1), 𝑃𝑅𝑘=𝑃LeIR,SrIRLeIR,SrIR,𝑅𝑘=𝑃𝑃(LeIR,SrIR)LeIR𝑅𝑘𝑃SrIR𝑅𝑘𝑃𝑅𝑘𝑘𝑃LeIR𝑅𝑘𝑃SrIR𝑅𝑘𝑃𝑅𝑘.(2) Thus, the task of computing the probability of interest in (2) simplifies to computing the posterior probability models of 𝑃(LeIR|𝑅) and 𝑃(SrIR|𝑅).

3.1. Individual Posterior Probability Models

MATLAB 2011b with Statistics Toolbox V7.6 was used to perform all statistical analyses on the LeIR and SrIR datasets, as well as the integration and validation of the models.

3.1.1. Light Element (C, N, O, and H) Isotope Ratios

The light element stable isotope ratios (IRs) consisted of four variables with little colinearity. These variables were relatively normally distributed with 𝑃 values ranging from 0.24 to 0.5 based on a Jarque-Bera test of normality [19]. Boxplots of the distribution of each variable are given in Figure 2. Given the normal structure of the data and a categorically distributed dependent variable (regions), linear discriminant analysis (LDA) was used to derive a statistical classification model. LDA is a multivariate discrimination method commonly used for classification in chemometrics [20]. LDA uses statistical learning to infer an optimal linear combination of the features to separate the regions. The “classify” function in MATLAB was used to obtain the probability of region 𝑘(𝑅𝑘) given a set of IR values. The statistical model based on training data can be described as 𝑃𝑅𝑘𝛿13C,𝛿15N,𝛿18O,𝛿2H=𝑓𝑘(LeIR),(3) where 𝑓𝑘(LeIR) is computed from a multivariate normal distribution. The posterior probability of interest (2) for a test sample 𝑗(LeIR𝑗) is computed directly from the “classify” function 𝑃𝛿13C,𝛿15N,𝛿18O,𝛿2H𝑅𝑘=𝑃LeIR𝑗𝑅𝑘.(4)

3.1.2. Sr Isotope Ratios

The SrIR data is described by a single observed value, and unlike the LeIR, the observed data for Sr is not normally distributed (𝑃 value of approximately 0.001). A boxplot of the distribution of Sr across regions is given in Figure 3. Given that the data is non-normal with a single independent variable and categorically distributed dependent variable, multinomial logistic regression (MLR) was used to derive a statistical classification model [21] using the “mrnfit” function in MatLab 𝑃𝑅𝑘=𝑒Sr(Sr𝛽𝑘)1+𝑘𝑒(Sr𝛽𝑘),(5) where 𝛽𝑘 is the vector of regression coefficients for region 𝑘. Again the posterior is computed in MATLAB using the “mrnval” function to obtain 𝑃(SrIR𝑅𝑘).

3.2. Classification Model Evaluation Metrics

Each model is evaluated independently using a leave-one-out bootstrapping cross-validation approach (LOOB-CV) with resampling [22] to obtain the full set of posterior probabilities for each sample in our datasets. The LOOB-CV method was selected to reduce the likelihood of overtraining the model, and resampling is performed to acquire uncertainty estimates on the metrics of model accuracy. In particular, for the LOOB-CV method, each sample is left out to create 𝑁 datasets [𝑋1,𝑋2,,𝑋𝑁], where 𝑁=68 for the data described in Section 2. A set of 100 bootstrap samples, each containing 50 samples, are randomly selected for each 𝑋𝑖, [𝐵(1)𝑖,𝐵(2)𝑖,,𝐵(100)𝑖]. The model is trained on each 𝐵(𝑘)𝑖, and the posterior of sample 𝑖 is obtained. The posteriors across the 100 bootstrap samples are averaged to obtain a more accurate estimate of the posterior probability.

The results are evaluated using two approaches: (1) average classification accuracy (CA) and (2) average area under a receiver operating characteristic curve (AUC). To compute these, each sample was defined by a binary vector where all values are initialized to zero. The probabilities for the sample were sorted, and all locations equal to or greater than the correct answer were set to 1. For example, suppose the correct region has the third largest probability of the 8, then it is set to 𝑆𝑖=[0,0,1,1,1,1,1,1]. If the correct region is identified as the most probable then this becomes 𝑆𝑖=[1,1,1,1,1,1,1,1]. The CA is defined as the fraction of samples that are correctly classified into the appropriate region CA=𝑖𝑆𝑖1𝑁.(6) The AUC is computed based on a modified receiver operating characteristic (ROC) curve. ROC curves traditionally plot the false positive rate (FPR) versus true positive rate (TPR) of a binary classifier. For this data, there are 8 possible classes with one correct answer and seven potential false identifications. Thus, there may be 0, 1, 2, 3, 4, 5, 6 or 7 regions incorrectly identified prior to correct answer, which means that each sample may have an FPR defined as one of these eight values [0, 1/7, 2/7, 3/7, 4/7, 5/7, 6/7, 1]. Based on these FPR states, the associated TPR for the full dataset is computed as the total number of samples that have an FPR less than or equal to the defined value, 𝑡𝑝𝑟𝑚=𝑖𝑆𝑖𝑚𝑁.(7) Note that CA=𝑡𝑝𝑟1. Similar to a binary ROC curve, a perfect classifier would identify all samples correctly with no regions correctly identified ahead of the true classification, which would result in an AUC of 1.0. Additionally, a random permutation would yield a linear relationship between the FPR and the TPR and an AUC of 0.5.

4. Results and Discussion

Prior to evaluation of the improvement in classification accuracy based on the integrated model, an exhaustive evaluation of the variables that could be used for the LeIR data model is performed. There are 15 possible combinations of the four light element IRs (Table 2). For each combination, the average CA and the average AUC are computed (6) and (7) and are ordered in Table 2 by average CA. The most accurate model is the one that includes all variables, although the models excluding either O, H, or both are very similar. The average ROC, however, is nearly identical for these top four models. This result is consistent with past observations that climatic factors and thus geography influence the C, O, and H isotope ratios of plants [2328]. These effects are likely to be similar for plants growing in similar geographic locales. N isotope ratios of a single species, in contrast, are a function of N sources such as fertilizer (if any) and are likely to be independent of geography, at least in most cases [29]. However, since the inclusion of N improves our ability to classify these regions and the IR of O may be highly valuable to regions not included in our sample dataset and it does not decrease accuracy, we utilize the LDA model that included all four variables for the development of statistical models for the purposes of integration.

The LOOB-CV analysis was performed 100 times, which yielded an average CA of 55.9%±2.1% and 40.2%±1.8% for the LeIR and SrIR datasets, respectively. The integrated posterior probability was computed as described in (2), and the overall classification accuracy improved to 60.9%±2.1%. This is a significant improvement over the individual IR model; the null hypothesis that the average difference in the means between the two models is equal to zero is rejected with a 𝑃 value less than 1𝑒40 (based on a two-sample t-test). The ROC curve (Figure 4), shows this is a clear trend with an improvement of the AUC to 0.94 versus 0.88 and 0.80 for LeIR and SrIR, respectively. The 100 sampled observations of the AUC of the integrated model are significant larger than the LeIR model at a 𝑃 value <1𝑒10 (based on a two-sample t-test), and the curves depicted in Figure 4 are significantly different at a 𝑃 value of 0.06 (based on a sign rank nonparametric paired test) [30]. Thus, although SrIR does not perform well alone, it does offer a significant contribution if integrated with the LeIR data.

The evaluation of the datasets via the CA and AUC gives an overall view of the predictability of the datasets with respect to region but does not give insight across the regions. The improved accuracy of the integrated model points to a capability of one dataset to correctly predict the appropriate region with higher probability than the misclassified probability of the other dataset. To evaluate the classification accuracy in respect to region, a visualization akin to that used in Visual Integration for Bayesian Evaluation (VIBE) was employed [31]. Figure 5 gives classification accuracy plot across the 8 regions as the true class on the y-axis and predicted class on the x-axis. A perfect classifier would have a diagonal of solid black since all of the predicted classes would be equal to the true classes and have a value of 1. The classification accuracies observed in Figure 5 are slightly different than above since this is a single sample from the 100 iterations performed. However, a similar trend is observed in which the IR data has a larger CA than the Sr data and the integrated model outperforms either alone. The class accuracy plot quickly demonstrates that the IR data has the most challenge distinguishing regions 5 through 8 (outside of the united states), and most often regions are misclassified into region 4 (US04 = TX)—see Table 1. The Sr data has a completely different profile in terms of how it correctly and incorrectly classifies samples. It most often misclassifies samples into either the third or fifth regions, US03 (UT) and BRAZ01. In addition, it never correctly classifies any samples from regions 1, 2, 4, and 8. The integrated model on the right corrects many of the imbalances observed from the individual datasets. The specific geographic regions that are being correctly classified are easily distinguished from the class accuracy plot, as well as those regions that cannot be easily distinguished by isotope ratios.

The relative lack of power of the O and H isotope ratio data to link seeds to their regions of origin initially appeared surprising, as O and H isotope ratios of plant material have been shown to be linked to geographic region of origin [12]. Plants derive O and H atoms from their water sources, while there is a strong and well-recognized link between the isotope ratios of precipitation and geography [32]. However, the limitations on experimental design imposed by nature of the seed collection might lead one to expect this effect. Some of the defined growth regions spanned a gradient of climate. For example, the “Texas” region included samples from the vicinity of Lubbock, which has an arid climate, and Houston, which is quite humid. The surface water isotope ratios of these two parts of the state are predicted to differ somewhat, based on US Geological Survey data [33]. Thus, the source water accessed the plants as well as the extent of evaporative enrichment of plant leaf water, a source used for biosynthesis of many plant organic components [12], would likely be expected to differ and result in differing O and H isotope ratios in the seeds. A higher sampling density in strictly limited geographic regions would permit a better analysis of the effect of O and H isotope ratios on region of origin association. Another possible reason for the lack of power of O and H isotope ratios to associate seeds with regions of origin is the potential variation in cultivation conditions of the seeds. Even if a region could be more tightly defined, cultivation practices could influence the actual isotope ratios of the water used by the castor plants. For example, irrigation via open ditches could result in significant evaporative enrichment in isotopic content of the water taken up by the plants, while drawing water from a deep well could provide them with water isotopically different from surface water. Controlling the environmental variation imposed by cultivation conditions might also improve the discrimination power of the O and H isotope ratios. Finally, the genetic variability of the castor seeds may affect water dynamics within the individual plants, which could impart some variation to the O and H isotope ratios of plants growing in identical environments.

The ability of isotope ratios to associate castor seeds with region of origin despite the limitations imposed by sampling density, the genetic variability of the seeds, and the probable variation in cultivation methods is noteworthy. Presumably, if we had been able to more strictly define geographic regions to those with homogenous climate and geology and control the variables of genetics and growth conditions, the accuracy of the association would be improved, perhaps significantly. Further experiments with multiple seed acquisitions from tightly defined source regions could address this question. For real-world application of isotope ratios for assigning region of origin, however, it is unlikely that either the genetic strain of the sample (whether of castor seeds or some other plant product such as a food) or its cultivation method would be controllable. Demonstrating that integrated isotopic data can associate a plant product with its region of origin in the absence of such control suggests that this approach could be broadly useful for geographic sourcing.

5. Conclusion

Both light element (C, N, O, and H) stable isotope ratios and 87/86Sr isotope ratios have been used to associate plant and animal materials with its geographic region of origin. Here, we show that each of these datasets independently can associate castor seeds with region of origin more accurately than would be expected by chance, as shown by both classification accuracy and a modified ROC curve model. Bayesian integration of these two data streams yielded results that were significantly better than those from either individual dataset. This approach illustrates the benefits afforded by a rigorous approach to data integration and its application to forensics community.

Acknowledgments

This work was supported in part by Laboratory Directed Research and Development at Pacific Northwest National Laboratory (PNNL) and the National Science Foundation (Grant 0743543). PNNL is a multiprogram national laboratory operated by Battelle for the U.S. Department of Energy under Contract DE-AC06-76RL01830.