Introduction

Sand is a foundational material to both natural and human systems. From concrete to silicon microchips, the modern world needs more construction aggregates (mainly sand and gravel) than any other solid material resource1. As demand for sand continues to increase, the impacts of the extraction and use of sand resources on biodiversity and society are increasingly reported and recognized2,3,4. Ensuring that sand resources for urban and infrastructure development are extracted and transported in a socially and environmentally sound manner represents an urgent need5,6,7.

Over the last decade, ‘responsible sourcing’ and traceability of supply-networks have become topics of broad interest, as a way to address issues from human health risks in food sources (e.g., sea lettuce;8 bivalves9) to sustainability risks in commodity mineral supply-chains2,10,11 or illegal trade (e.g., to determine the origin of stolen gold12 or poached ivory13). In the mining sector, responsible sourcing has been traditionally applied to the so-called “conflict minerals” (tin, tantalum, tungsten, diamonds, cobalt, and gold)14. Despite the scale and importance of the construction sector, for which most sand is extracted15, the traceability of sand and other construction aggregates is still at an emerging stage14,16. While remote sensing techniques hold immense promise in identifying sand mining sources and even in tracing distribution pathways, particularly when extraction and transport are done by boat17,18,19, these methods still require direct observation of extraction and transport. Traceability tools to certify and verify the geographic origin of sand resources, along with strong regulations and monitoring systems, are increasingly encouraged by international organizations to guarantee sustainable outcomes5. The current paucity of metrics by which to assess the efficacy of any effort to set sustainable sourcing standards or instate traceability in the construction aggregate sector is a hurdle that must be overcome before any such efforts can be broadly successful.

Here, we present a proof-of-concept study that examines the potential uses of sand provenance analysis or compositional “fingerprinting” in tracing construction sand supply-networks. Compositional fingerprinting methods widely used in sedimentology20,21 could provide a way to fingerprint the source of sand resources2, which would allow interested stakeholders to monitor and re-construct unknown or poorly defined sand supply-networks, i.e., the connections among sourcing areas, processing and storage sites, and markets3. Naturally occurring sand inherits a compositional fingerprint from the unique surface geology in the catchment from which it was eroded. Dozens of well-established provenance techniques exist to “fingerprint” sand and tie it back to its source, ranging from bulk mineralogy22 and geochemistry23 to more sophisticated techniques that build signatures from isotopic compositions of domains within individual sand grains24. Decades of work exist on the geologic controls on different sand compositions and how to leverage this information to trace sand dispersal pathways in natural sedimentary systems.

Moreover, applications of sand fingerprinting have not been limited to natural systems, with documented success in forensic geology25,26,27 and archaeology28,29,30 in answering questions rooted in understanding the provenance of sand at a crime scene or in artefacts. For example, neutron activation analysis was used to examine beach samples31 in the case of the 2008 Coral Springs beach theft in Jamaica. However, the full potential of provenance methods for tracing construction sand supply-networks from “source to sink” remains untested. Other than the fact that construction sand is transported via truck, barge or rail car instead of rivers, waves or wind, there is little practical difference in applying sand provenance analysis to commodity supply-networks vs. modern natural dispersal systems. To test the utility of fingerprinting methods in tracing construction sand supply-networks, we conducted a proof-of-concept study in central and north Texas, USA (Fig. 1).

Fig. 1: Conceptual schematic for natural sand compositional “fingerprints” carrying through construction sand supply-networks.

Natural sand composition is inherited from colour-coded source regions (indicated in circles) and carried through extraction, transport, and use as a discernible schematic signal. The signal can include the mixing of fingerprints from different sources as shown in the lower right.

We address three research questions crucial to understanding the potential of sand fingerprinting for construction sand traceability and monitoring applications. First: Are natural sand compositional signatures preserved through processing of construction sand? Knowing if this processing deleteriously alters sand compositional fingerprints is a crucial first step in considering applying fingerprinting to construction sand supply-networks. Second: Can sand compositional fingerprints trace construction sand supply-networks at a useful spatial scale? Any natural sand will have a definable compositional fingerprint, but it is crucial to understand the conditions required to use that fingerprint to trace construction sand supply networks. Third: Can machine learning-aided image analysis be used as a more exportable and inexpensive sand provenance method? When considering all costs from sample preparation through analysis, conventional sand provenance methods range from around $50 to over $1000 per sample (Fig. 2). While these costs may be reasonable for academic studies, agencies in high-income countries, and large industries, the broad adoption of sand provenance analysis as a scalable monitoring approach in low-income and under-served areas requires low-cost analytical tools.

Fig. 2: Generalized overview of the training and cost required for various sand provenance or “fingerprinting” methods and the approximate cost of instrumentation required for each type of analysis.

Methods employed in this study are highlighted in blue.

We collected 41 sand samples across seven sourcing areas of construction sand supply-networks in Texas, USA (Fig. 3). Four of the sampled supply-networks comprise regional distribution of bagged concrete produced in four plants spread out over approximately 900 km across the state. Each plant sources sand from a local mine. These plants are located in the cities of Amarillo, Abilene, San Antonio and east of the city of Houston. To sample sand from these four, we procured bagged concrete samples at local hardware stores across the state. Bagged concrete is sold as a dry mixture of sand, gravel and cement that is mixed with water by the end user and is intended for applications that require only a small amount of concrete. As a value-added product, bagged concrete is generally shipped over much wider distribution networks than raw construction sand.

Fig. 3: Overview of natural sand deposits and the geography of the construction sand industry in Texas, USA.

a geologic units from which sand is mined, sand mine locations, concrete plant locations and samples used in this study coded by their source as outlined in the panel symbol legend. b detailed inset of the San Antonio–Austin area in which we tested compositional fingerprinting over local scales.

The rest of the samples belong to a series of denser, more complex, supply-networks of sand mines and concrete batch plants across the cities of Austin and San Antonio and their surrounding peri-urban areas (Fig. 3b). To encompass material from these supply-networks, we sampled natural sand from sand mine pits, processed construction sand (washed and size sorted) at the mining site, and sand from sand stockpiles at concrete batch plants. Concrete batch plants mix large volumes of sand, gravel and cement on site to generate batches of wet concrete that are then transported to local construction sites. Mines in this region process sand in classifiers that largely work on hydrodynamic and specific gravity sorting32. These classifiers take raw natural pit sand and sort it into size ranges that match the desired engineering specifications that the mining site has set for various construction sand products like concrete or masonry sand.

The sampled suppliers source sand from seven different geologic units:33 (1) Holocene-age and (2) Pleistocene-age terraces of the upper Colorado River in and around the city of Austin, (3) modern sand from the Llano River near the town of Llano, (4) Pleistocene terraces of the lower Colorado River, (5) Paleogene-age shallow marine sand deposits preserved in an arcuate outcrop belt across central Texas (Fig. 3); tapped in the mines in our study in an area just south of San Antonio, (6) Pleistocene sands near Abilene and (7) Pliocene to Miocene-age sand deposits near Amarillo (Fig. 3). For the purposes of our study, these seven sand sources offer a useful range for determining the resolution at which supply-networks can be distinguished: four sources (Llano River, Holocene up. Colorado, Pleistocene up. Colorado and Pleistocene low. Colorado River) lie within the same natural sediment dispersal system (Fig. 3), whereas the other three (San Antonio, Abilene and Amarillo) are geologically unique and unrelated.

Results and Discussion

Preservation of fingerprints through construction sand processing

To test if natural sand compositional signatures are preserved through construction sand processing, we sampled raw pit sand and processed sand products sourced from four mining areas: (1) San Antonio, (2) Holocene upper Colorado River terraces, (3) Pleistocene upper Colorado River terraces and (4) the Llano River (Fig. 4a, b). To encompass pre- and post-processing, we sampled both raw pit sand and processed sand at each site (excluding the Llano River mine, where we only sampled raw river sand from the mining area) and sand from stockpiles at concrete batch plants from known sources. For batch plant samples, we confirmed with the mine-site manager where on-site sand came from, ensuring that we were comparing sand samples from the same original set of mines across the study area. We were only able to sample sand from the final product in bagged-concrete distribution networks. Therefore, those samples are not included in this section. We found that by bulk major element geochemistry and framework mineralogy, natural compositional sand fingerprints are preserved through processing such that compositional fingerprint variability between mining areas is much greater than compositional variability within sample sets from each area (Fig. 4c, d). However, bulk trace element geochemistry shows that there is some enrichment of elements associated with heavy minerals like Zr in processed sand samples. This suggests there may be some fractionation of denser mineral phases during processing and thus care should be taken in using compositional fingerprints that rely on heavy minerals (Fig. 5).

Fig. 4: Conventional provenance results from the detailed Central Texas sample region.

a Sample location map of sand mine (n = 13) and concrete batch plant (n = 6) samples from central Texas used to test if processing affects sand compositional fingerprints. b Inset showing the location of sand mines on the upper Colorado River south of Austin that mine from Holocene-age (24) terraces (uCRm1-3) and Pleistocene-age (24) terraces (uCRm4). c Bulk geochemistry results showing major elements, Si, Al + K content for each sample in this area. Note that the Pleistocene u. Colorado River sample cluster is comprised of five samples; two samples in the upper left plot too closely to distinguish their symbols. d Optical petrography results for this sample set. Qm monocrystalline quartz, F total feldspar, Lt total lithic fragments.

Fig. 5: Grain size and geochemical fingerprinting results from closely spaced sand mines on the Colorado River.

a Grain size for upper Colorado River mine samples displayed as weight percent cumulative distributions measured in test sieves. Samples are colour coded by type. b Bulk major element X-ray fluorescence analysis (XRF) results for upper Colorado River mine samples. c Chondrite normalised Rare Earth Element (REE; Taylor and McLennan, 1985) signatures of sand from mining location uCRm1 which taps Holocene upper Colorado River terraces. d Chondrite normalised Rare Earth Element (REE; Taylor and McLennan, 1985) signatures of sand from mining location uCRm4 which taps Pleistocene upper Colorado River terraces. Note that this is the only figure that includes results for masonry sand samples.

By bulk major element geochemical signatures ([Al2O3 + K2O] vs. SiO2; Fig. 4c) and framework mineral petrography (Fig. 4d), all four mining areas are clearly distinguishable with the silica rich San Antonio sand samples particularly distinct from sand in the Llano—upper Colorado River areas (Fig. 4). The similarity between Llano River and Holocene and Pleistocene upper Colorado River terraces sands is consistent with the fact that they are part of the same regional natural sand dispersal system. Sand from Pleistocene upper Colorado River terraces is compositionally distinct from sand from Holocene upper Colorado River terraces across all samples (Fig. 4c, d). Natural variation in compositional sand fingerprints between Pleistocene and Holocene Colorado River terraces is further supported by previous studies focused on the natural sand dispersal system and is attributed to variations in climate and weathering regimes since the last glacial maximum34.
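One way to make the comparison of within-area and between-area variability concrete is to compare the spacing of source centroids with the scatter of samples around their own centroid in the ([Al2O3 + K2O], SiO2) plane. The following is a minimal sketch under that assumption and is not the procedure used in this study; the function name `separation_ratio` and all sample values are hypothetical.

```python
from math import dist
from statistics import mean


def centroid(points):
    """Mean position of a list of (x, y) points."""
    xs, ys = zip(*points)
    return (mean(xs), mean(ys))


def separation_ratio(groups):
    """Smallest centroid-to-centroid distance divided by the largest
    sample-to-own-centroid distance.

    Values well above 1 mean the sources sit farther apart than any
    single source is wide in this compositional space.
    """
    centroids = {name: centroid(pts) for name, pts in groups.items()}
    within = max(
        dist(p, centroids[name]) for name, pts in groups.items() for p in pts
    )
    names = list(groups)
    between = min(
        dist(centroids[a], centroids[b])
        for i, a in enumerate(names)
        for b in names[i + 1:]
    )
    return between / within


# Hypothetical (SiO2, Al2O3 + K2O) values in wt%, NOT data from this study.
groups = {
    "source_A": [(96.0, 1.5), (96.5, 1.2), (95.8, 1.6)],
    "source_B": [(85.0, 8.0), (84.2, 8.6), (85.5, 7.8)],
}
print(f"separation ratio: {separation_ratio(groups):.1f}")
```

A ratio near or below 1 would correspond to sources that overlap in this space. Because the ratio uses raw wt% distances, it implicitly weights both axes equally; a real analysis might scale the axes or use a formal statistical test instead.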

To further assess potential processing fractionation, we also sampled and analyzed masonry sand from two mines in the upper Colorado River mining area; uCRm1 in Holocene terraces and uCRm4 in Pleistocene terraces (Fig. 4b). Masonry sand represents the most heavily processed product that these mines produce as it needs to be consistently fine, well-sorted and clean; generally much finer and better sorted than the bulk sand grain size in area mining pits (Fig. 5a). Masonry sand results are only considered in this section on examining fractionation and are not compared to concrete sands in any other section. To look for potential compositional fractionation by mineral density, we compared the zirconium (Zr) concentration, bulk Rare Earth Element (REE) signatures, and major element geochemical signatures of each sample (Fig. 5a–c). The granitic rocks of central Texas in the Colorado River catchment are known to be particularly fertile with respect to detrital zircons35. With a chemical formula of ZrSiO4, zircon is the primary mineral host for Zr in most sands35 and with a specific gravity of 3.9–4.7, the concentration of the mineral is a useful proxy for heavy mineral fractionation36,37. Masonry sand from both uCRm1 and uCRm4 is notably elevated in Zr concentration even as compared to natural pit sand of a similar grain size (uCRm1 fine raw pit sand vs. masonry sand; Fig. 5), suggesting that mine-site processing is enacting some heavy mineral fractionation. This is also perhaps suggested by the enrichment of light REEs (La–Gd) in masonry sand, particularly from the uCRm1 site (Fig. 5).

However, the bulk major element composition of masonry sand from uCRm1 is similar to fine raw pit sand from the same site with both relatively depleted in Al2O3 and K2O as compared to coarser pit sand and concrete sand. Masonry sand at uCRm4 is also relatively depleted in Al2O3 and K2O as compared to raw pit sand and concrete sand from this mine (finer pit sand was unavailable from this site). This depletion in Al2O3 and K2O likely reflects a natural difference in the composition of sand at each site by grain size, a common feature of natural sands (27).

Cumulatively, results from these four supply-networks suggest that any fractionation that does occur when processing construction sand is unlikely to affect bulk major element and other framework mineralogy fingerprints like optical petrography QFL. However, care must be taken that the compositional fingerprint used to represent the raw natural source sand is of the correct grain size to match the grain size of the construction sand product in question, and mineral density fractionation needs to be considered when using trace element geochemistry or methods that rely on heavier minerals like detrital zircon.
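The Zr comparison described above can be expressed as a simple enrichment factor: the concentration in a processed product divided by the concentration in raw pit sand of comparable grain size. The sketch below is illustrative only; the function names, the threshold and the concentrations are hypothetical and are not values from this study.

```python
def enrichment_factor(processed_ppm, raw_ppm):
    """Ratio of an element's concentration in processed sand to its
    concentration in raw pit sand of comparable grain size."""
    if raw_ppm <= 0:
        raise ValueError("raw concentration must be positive")
    return processed_ppm / raw_ppm


def flag_fractionation(processed_ppm, raw_ppm, threshold=1.5):
    """True when enrichment exceeds a user-chosen threshold.

    The default threshold is arbitrary and only serves the example.
    """
    return enrichment_factor(processed_ppm, raw_ppm) > threshold


# Hypothetical Zr concentrations in ppm, NOT measurements from this study.
raw_fine_pit_zr = 200.0
masonry_zr = 500.0
print(enrichment_factor(masonry_zr, raw_fine_pit_zr))
print(flag_fractionation(masonry_zr, raw_fine_pit_zr))
```

Comparing against raw sand of a matching grain size matters here, because composition also varies naturally with grain size and would otherwise be confounded with processing effects.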

Defining supply-networks with conventional fingerprinting techniques

After determining that intra-source area variance in compositional fingerprints was much less than inter-source area variance, we set out to identify the resolution with which construction sand supply-networks can be reconstructed by conventional provenance methods and the specific conditions that must be met to do so. For this, we added compositional sand fingerprints from regional bagged concrete samples to the central Texas networks described above. As with local mine-to-batch plant networks in central Texas, sand from each bagged concrete plant produces a distinct compositional fingerprint by bulk major element geochemistry and QFL petrography and each of the four is entirely distinct from the central Texas networks (Fig. 6a). Even sand from the San Antonio bagged-concrete plant is distinguishable from San Antonio-derived sand mine and concrete batch plant sand; a finding we encountered while iterating the image analysis methods described below and then confirmed by the bulk major element geochemistry. This distinction derives not from any natural differences in sand composition but instead from the fact that the San Antonio bagged-concrete plant mixes natural sand with crushed limestone as the coarse aggregate to produce its final product. Particles of crushed limestone remained in the sand-sized fraction of material we analyzed for this study, resulting in San Antonio-derived bagged concrete having systematically higher bulk calcium content (wt% CaO; Fig. 6a). This plant therefore introduces useful artificial compositional variability not present in the natural sand deposit that can be used in compositional provenance analysis.
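The effect of crushed limestone on bulk CaO can be pictured with a two-end-member mass balance, in which the mixture concentration is a weighted average of the two components. The sketch below assumes simple linear mixing on a common reporting basis, which real XRF data may not satisfy without further normalization; the function name `mixing_fraction` and all concentrations are hypothetical, apart from the roughly 56 wt% CaO of pure calcite used as an idealized limestone end-member.

```python
def mixing_fraction(mix, end_a, end_b):
    """Mass fraction of end-member B in a two-component mixture.

    Solves mix = f * end_b + (1 - f) * end_a for f, using one conserved
    concentration given in the same units for all three arguments.
    """
    if end_a == end_b:
        raise ValueError("end-members must differ in the tracer")
    f = (mix - end_a) / (end_b - end_a)
    # Clamp small excursions caused by analytical noise to [0, 1].
    return min(1.0, max(0.0, f))


# Hypothetical CaO values in wt%, NOT measurements from this study.
sand_cao = 0.5        # silica-rich natural sand
limestone_cao = 56.0  # idealized pure-calcite limestone
bagged_cao = 6.05     # sand-sized fraction washed from bagged concrete
print(f"{mixing_fraction(bagged_cao, sand_cao, limestone_cao):.2f}")
```

With more than two contributing sources, the same idea generalizes to a least-squares unmixing over several elements, which is beyond this sketch.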

Fig. 6: Overview of all conventional provenance results and dispersal pathways for construction sand.

a Bulk major element geochemistry results for sand samples from all sampled locations. b Bar graph of Ca weight % in San Antonio area samples showing artificially introduced compositional difference in sand from bagged concrete. c Optical petrography results for all samples. Note that while not as distinct as bulk geochemistry results, each distribution network is distinguishable based on framework mineralogy. Qm: monocrystalline quartz, F: total feldspar, Lt: total lithic fragments. d and e Regional supply-networks traced by sand fingerprinting.

The fact that all eight sourcing areas sampled for this study are distinct and distinguishable across extraction, processing and transport is an encouraging sign for using provenance analysis in tracing supply-networks. These results also illustrate the specific conditions required to employ these techniques. Where natural compositional variability exists between two sand sources (by any provenance method), as here in Texas, that variability is likely to be preserved from “source to sink” in a construction sand supply-network. Additionally, if the processing phase adds compositional variability, by mixing sands from multiple sources (e.g., naturally occurring sands and crushed rock), as in the example of the San Antonio bagged-concrete plant, compositional provenance analysis will also be effective. However, if no compositional variability exists, provenance analysis will be ineffective. As an example of this counterpoint, we cannot distinguish, by any method employed in this study, sand sourced from the uCRm1, 2 or 3 sites (uCRm: upper Colorado River mine), which all mine from Holocene upper Colorado River terraces (Fig. 4b–d). By coincidence, the concrete batch plants we sampled for this study that sourced sand from the upper Colorado River mining area all sourced from uCRm4 specifically.

Had any of those plants sourced from uCRm1, 2 or 3, we would not have been able to independently distinguish which specifically it came from with compositional analysis. Similarly, the Paleogene silica-rich sand deposits that are mined south of San Antonio extend in an arcuate outcrop belt across the entire central Texas study area (Fig. 4a). If there were mines extracting from those deposits in the Austin area, it is unlikely that we would have been able to distinguish that sand from sand mined south of San Antonio.

The efficacy of provenance analysis in construction aggregates therefore depends on both natural (or artificial) variability in sand composition and the internal complexity of the sand-sourcing regime of the supply-network in question. This is to say that there must be heterogeneity in the natural “fingerprints” of the sourcing areas and the networks must be sufficiently diverse to leverage that heterogeneity into answering an impactful question on sand sourcing. Unless both of these requirements are met, sand fingerprinting is unlikely to be effective.

How sand fingerprinting might be used at the final site of consumption depends on the use of the sand. If the sand is used in an unconsolidated state as landfill, sand fingerprinting as described here can be employed. If it is set with cement in a concrete product, optical petrography is likely still viable as a sample can be cut and polished into a thin section in the same way as natural sandstone. However, applying bulk geochemical methods may not be viable as the cement will alter the elemental signature. Further work is needed to unravel how best to fingerprint sand from set-concrete. Moreover, while most sand considered in this study was sourced from pre-Anthropocene natural deposits, sand from certain modern settings might also include diagnostic anthropogenic detritus that could contribute to source fingerprinting2.

Cost effective sand fingerprinting with machine learning image analysis

Although conventional provenance analysis methods clearly have potential in tracing construction sand supply-networks from “source to sink”, the analytical facilities needed to conduct conventional provenance analysis are not ubiquitous globally, nor is analytical funding. Fortunately, in addition to geochemical and petrographic signatures that can be expensive to unravel, natural sand from different deposits often has systematic differences in grain size, shape and color, all owing to natural mineralogy and local sedimentary processes. We reasoned that these same features could be leveraged by an algorithm to predict provenance using images of sand samples. Similar approaches are already becoming commonplace on the engineering side of construction aggregate research, where machine learning image analysis is used to determine things like sand particle size and shape to assess structural parameters and a material's best uses38,39.

To test the viability of a machine learning image analysis approach to provenance analysis, we developed an image classification pipeline, sandID, which uses transfer learning40 to train a deep convolutional neural network to predict sample provenance using photos of sand captured with an iPhone. The sandID model is, on average, 88% effective at identifying the original source of mined concrete sand in our Texas study area (Fig. 7). A large fraction of prediction error derives from model prediction mix-ups between samples taken from the Holocene and Pleistocene river terraces on the upper Colorado River which are only subtly different compositionally by conventional methods as described above. Combining these categories yields an average accuracy of 93% in provenance prediction.
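The effect of combining two easily confused categories on average accuracy can be illustrated with a small confusion-matrix calculation. This is a sketch that assumes accuracy is averaged over source classes; it is not sandID code, the helper names are hypothetical, and the counts are invented for illustration rather than taken from Fig. 7.

```python
def per_class_accuracy(matrix):
    """Fraction of each true class (row) assigned its correct label."""
    return [row[i] / sum(row) for i, row in enumerate(matrix)]


def merge_classes(matrix, i, j):
    """Collapse classes i and j (with i < j) into one class at index i."""
    # Add row j into row i, then drop row j.
    rows = []
    for r, row in enumerate(matrix):
        if r == j:
            continue
        if r == i:
            row = [a + b for a, b in zip(row, matrix[j])]
        rows.append(list(row))
    # Add column j into column i, then drop column j.
    merged = []
    for row in rows:
        new_row = [v for c, v in enumerate(row) if c != j]
        new_row[i] += row[j]
        merged.append(new_row)
    return merged


# Rows are true sources, columns are predicted sources.
# Hypothetical counts for three classes: Hol. uCR, Pleis. uCR, other.
confusion = [
    [80, 18, 2],
    [15, 83, 2],
    [1, 1, 98],
]
before = per_class_accuracy(confusion)
after = per_class_accuracy(merge_classes(confusion, 0, 1))
print(sum(before) / len(before), sum(after) / len(after))
```

In this toy case most errors are swaps between the first two classes, so merging them removes those errors from the tally, which mirrors the behaviour described for the Holocene and Pleistocene upper Colorado River terraces.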

Fig. 7: sandID image-analysis results.

a All sand samples (all material <2 mm) and b all samples sieved at 500 microns (medium sand and finer). For both a and b, the scatter plot is a two-dimensional, simplified representation of what the neural network “sees” as differences between each source population in images using t-Distributed Stochastic Neighbor Embedding (t-SNE) to squeeze 1024 identified features into 2-D representations that can be assessed visually. Each color-coded point is a snip of a training image and distance between two points roughly correlates to degree of difference. In both a and b, the confusion matrix illustrates model success in assigning an image of sand to its correct original source. low. CR samples derived from the lower Colorado River (east Houston bagged-concrete), Hol. uCR samples derived from Holocene terraces of the upper Colorado River, Pleis. uCR samples derived from Pleistocene terraces of the upper Colorado River. SA bag conc.: San Antonio bagged concrete, SA m&bp: samples from San Antonio mines and concrete batch plants.

We found that the relative placement of our samples within the t-Distributed Stochastic Neighbor Embedding (t-SNE) plots, which describe what sandID “sees” as differences between samples, reflects natural relationships between sand sources and relative natural compositional variability. The Llano River and Colorado River samples cluster closely together (Fig. 7), reflecting that these sources belong to the same sediment dispersal system. Sand from different bagged concrete plants plots in distinct clusters relatively far apart (Fig. 7). Our results therefore suggest that regionally keyed machine learning models may be useful tools for determining sand provenance in any area with sufficient compositional differences. More generally applicable models that could be used inter-regionally may be possible with more training data in the future, but the possibility of extending beyond region-specific models remains a work in progress. Additionally, results from running sandID on the full sand fraction (<2 mm) and a medium sand and finer fraction (<500 μm) from the full case study sample set illustrate that these methods are sensitive to sample grain size (Fig. 7). Care needs to be taken when using these methods that imaged samples consist of either their full grain size distribution or a consistently prescribed grain size fraction across all samples.

The sandID tool requires only a personal laptop to run and, once trained, takes only seconds to classify new sand samples at no additional cost outside of the labor required to collect and photograph the samples. Thus, we conclude that this method holds promise as a scalable approach for tracing sand provenance that should be readily exportable to settings lacking access to specialized and expensive methods of provenance analysis.

Beyond its utility for predicting sand provenance, sandID can also function as a tool for uncovering salient heterogeneity within sand sources that may not be apparent in initial conventional analysis. We originally trained sandID with seven defined source populations: (1) Amarillo, (2) Abilene, (3) Llano River, (4) upper Colorado River Hol., (5) upper Colorado River Pleis., (6) lower Colorado River and (7) San Antonio under the assumption that the model would not be able to distinguish San Antonio bagged concrete sand from mine and concrete batch plant sand. Differences between the two sands, which are >95 wt% SiO2, are minimal in the conventional compositional fingerprints we initially plotted (Fig. 6). However, even when trained on a lumped San Antonio source, sandID suggested there were multiple San Antonio provenance families, with the two groups on the left-hand side of Fig. 7a reflecting samples from San Antonio mines and batch plants (“SA m&bp”) and the group on the right-hand side reflecting San Antonio bagged concrete. These sample differences are captured in Ca wt% from each sample set (Fig. 6b).

Conclusions

Our results indicate that sand provenance analysis, whether with conventional compositional methods or image analysis approaches, has untapped potential as a monitoring tool to support traceability systems (e.g., certification schemes) and to support monitoring and enforcement in areas where there are concerns about illegal, illicit or simply unknown construction sand sourcing. A few particular facts about the success of our case study in Texas can be extrapolated to discuss potential for success in other places globally. First, Texas is not particularly geologically complex. The abundant leverage in compositional provenance analysis and image analysis in these passive margin sand deposits bodes well for regions with more complex surface geology in adjacent sand dispersal system catchments like South and Southeast Asia. Countries like Bangladesh, Myanmar, Laos and Malaysia show greater than 20% average annual growth in aggregate consumption over the last 20 years and are known areas of sand mining conflict41,42 with opaque sand sourcing issues and are among the most geologically complex areas in the world. Several decades of conventional provenance work show that the sands from major rivers in this region are compositionally distinct43,44, suggesting provenance analysis of construction aggregate supply networks could be effective in the region.

A second finding from Texas that bodes well for broader exportation of sand provenance analysis for effective monitoring and certification is the fact that natural compositional variability between Pleistocene and Holocene river sand terraces from closely spaced mines in the same river valley is preserved through the supply-network. Natural climate cycles over ten to hundred thousand year time scales are known to shift sand composition due to both drainage reorganization and changing weathering regime in many places globally45,46,47. Many sand extraction environmental sustainability issues boil down to mining from active sand dispersal systems vs. older sand deposits (e.g., modern river sand bars vs. older river sand terraces). Consequently, the regulations of some regions across the world forbid or limit the extraction of sand from active river channels for the construction industry3. If, as in the upper Colorado River in Texas, young or modern sand in a given river of concern is distinguishable from older river terraces, it may be possible to develop a location-specific certification scheme that can flag unauthorized extraction from the modern river vs. extraction from older terraces.

Broadly speaking, provenance analysis will likely be useful in any traceability strategy that includes certification and verification of the geographic origin of sand resources and could be used to ensure the correct performance of responsible sourcing schemes. There are a growing number of management frameworks designed specifically to assess, audit and certify supply chains for construction materials48. By providing a method to independently confirm the geographic origin of samples, sand provenance could identify illegal extraction and fraudulent trade practices. Responsible sourcing applications of these methods could be particularly useful in regions and countries with existing regulatory concerns and active illicit supply-networks49 and in places with limited local supply that rely heavily on imports such as Singapore5 or Hong Kong50. The full spectrum of specific applications of provenance analysis in construction sand supply-networks is likely broader than we have currently described. Now that this approach has been shown to be effective in principle, and sandID provides a tool to make it more broadly accessible and exportable, more work is needed to expand applications of sand provenance analysis toward making human sand supply-networks more transparent, equitable and sustainable.

Methods

Sample collection and processing

We collected all 41 samples used in this study from July through September of 2021. Sand samples from sand mines (n = 15) were directly collected with cooperation from mine-site personnel from 6 different mines. We collected one or two raw pit sand samples from the area of the raw natural sand deposit being mined that day. Processed sand samples were collected directly in the processing area from the active stockpile below the outflow of the mine site’s aggregate classifying machinery. At 5 different concrete batch plants, we collected 7 samples from sand stockpiles (two batch plants had sand from two different mining sources in their stockpiles) and confirmed the original source of the sand with the plant manager. Bagged concrete samples were purchased at local hardware stores in the sampling localities (n = 19).

Bagged concrete comes as pre-mixed cement, sand and gravel. We washed sand and gravel out of the cement-aggregate mixture by hand in a five-gallon bucket. We dumped approximately 3–4 kg of the cement-aggregate mixture into the bucket and filled the bucket with water while mixing until the bucket was nearly full. We then let the aggregates settle out of suspension and decanted off the cement-laden water. We repeated this process until the water was clear and then dried the sample. For all raw pit mine sand samples, we washed out any topsoil or mud present in the sample using a similar decanting method. All samples were sieved at 2 mm. All of this sample processing was completed before samples were sub-sampled for any further analysis.

Grain size analysis

Using simple test sieve analysis, we conducted grain size analysis on sand samples from the upper Colorado River mines (Fig. 4), from which we collected a full suite of raw, concrete and masonry sand. Our sieve stack consisted of thirteen sieves with mesh sizes at ½ phi intervals from 4 mm to 63 microns. Each sample was run on an automatic sieve shaker for 15 minutes, and the mass retained in each fraction was converted to grain size distributions using GRADISTAT.
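As an illustration of the sieve arithmetic described above, the following minimal sketch builds a thirteen-sieve, half-phi stack from 4 mm (−2 phi) to 0.0625 mm (4 phi, nominally 63 microns) and converts mass retained into percent and cumulative percent retained. It is not GRADISTAT and does not reproduce its statistics. The function names and the example masses (thirteen sieves plus a pan) are hypothetical and are not data from this study.

```python
def phi_to_mm(phi):
    """Krumbein phi scale: grain diameter in mm is 2 ** (-phi)."""
    return 2.0 ** (-phi)


# Thirteen sieves at half-phi steps, coarsest (-2 phi = 4 mm) to finest (4 phi).
SIEVE_PHI = [-2.0 + 0.5 * i for i in range(13)]
SIEVE_MM = [phi_to_mm(p) for p in SIEVE_PHI]


def percent_retained(masses_g):
    """Convert the mass retained on each sieve (and pan) to percent of total mass."""
    total = sum(masses_g)
    if total <= 0:
        raise ValueError("total retained mass must be positive")
    return [100.0 * m / total for m in masses_g]


def cumulative_percent(percents):
    """Running sum of percent retained, from the coarsest sieve downward."""
    out, running = [], 0.0
    for p in percents:
        running += p
        out.append(running)
    return out


# Hypothetical masses in grams: 13 sieves followed by the pan (assumption).
masses = [0.0, 1.2, 3.5, 8.1, 15.4, 22.0, 19.8, 13.6, 8.3, 4.4, 2.1, 1.0, 0.4, 0.2]
retained = percent_retained(masses)
cumulative = cumulative_percent(retained)

for mm, pct in zip(SIEVE_MM, retained):
    print(f"{mm:8.4f} mm  {pct:6.2f} % retained")
```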

Conventional compositional provenance methods

We analyzed all sand samples with optical petrography and bulk major and trace element geochemistry. Optical petrography consisted of point counting grain-mount thin sections using the Gazzi-Dickinson method, in which every sand-sized mineral (>62.5 µm) is counted individually. This method is designed to reduce grain size bias and produces a result that reflects the bulk framework mineralogy of the surface geology in the catchment from which the sand eroded. We counted 400 points per thin section following the conventional Gazzi-Dickinson method51, and one half of each thin section was stained to identify potassium feldspar (using sodium hexanitritocobaltate) and plagioclase feldspar (using rhodizonic acid). Full optical petrography results are available in Supplementary Table 2. Bulk sand geochemical analyses were conducted at the Washington State University (WSU) Peter Hooper GeoAnalytical Lab. Bulk major and trace element geochemistry was determined via X-ray fluorescence analysis (XRF) and inductively coupled plasma mass spectrometry (ICPMS). XRF analyses were conducted on a Thermo-ARL automated X-ray fluorescence spectrometer. XRF sample material was analyzed in a Li-tetraborate fused bead. ICPMS analyses were conducted on an Agilent inductively coupled plasma mass spectrometer. Full data tables for all geochemical results and further detailed references for geochemical methods can be found in Supplementary Tables 3, 4.

We display bulk major element results here as (Al2O3 + K2O) vs. SiO2 because this is a particularly useful discriminator in our study area. Its discriminating power largely derives from natural differences in plagioclase feldspar, potassium feldspar (K-spar) and quartz content across samples. Aluminum and potassium are hosted preferentially in the feldspars, while silica derives preferentially from quartz. Combining aluminum and potassium accentuates the presence of K-spar in Colorado River catchment sands eroded in part from central Texas granites.
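The plotted coordinates are a simple sum of oxide weight percentages, which can be sketched as follows. The function name and both example compositions are hypothetical illustrations, not measurements from this study. They only show how a feldspar-rich sand would separate from a quartz-rich sand on these axes.

```python
def discriminator(oxides_wt_pct):
    """Return (SiO2, Al2O3 + K2O) in wt% for one bulk sand analysis."""
    x = oxides_wt_pct["SiO2"]
    y = oxides_wt_pct["Al2O3"] + oxides_wt_pct["K2O"]
    return x, y


# Hypothetical compositions in wt% (assumptions for illustration only).
quartz_rich = {"SiO2": 95.0, "Al2O3": 2.0, "K2O": 0.5}
feldspar_rich = {"SiO2": 80.0, "Al2O3": 9.5, "K2O": 4.0}

for name, sample in [("quartz-rich", quartz_rich), ("feldspar-rich", feldspar_rich)]:
    sio2, al_k = discriminator(sample)
    print(f"{name}: SiO2 = {sio2:.1f} wt%, Al2O3 + K2O = {al_k:.1f} wt%")
```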

Machine Learning Image-Analysis: sandID

As described above, we generated image analysis results from sample material sieved at 2 mm and at 500 microns to look for grain size bias in results. The sub-500 micron fraction was only analyzed via image analysis and was not included in conventional geochemistry and petrography analyses. The first step in our image analysis process was generating a dataset of sand images that could then be used to train the image classification model. To generate a training dataset containing images of sand samples, we placed material from each sample in a 5 cm diameter PVC pipe cap and took a photograph from directly overhead, at a standardized height of 15 cm, using an iPhone 12. This provided sufficient scaling consistency for this proof-of-concept work. However, future iterations may require embedded scaling control to simplify image acquisition for non-expert users. The phone’s camera was left at its default settings. Through this process, we produced 78 distinct images of our sand samples (two different images of each sample [n = 39]; excluding masonry sand samples). Due to the random nature of sand distribution within each large-scale sample image, it was possible to computationally subdivide each of the 78 images into smaller 176 × 176 pixel image squares, each of which could serve as a separate training sample. This produced a dataset containing 1,690 sample images of sand, with at least 150 sample images per supply network category. This process was repeated for the sub-500 micron image set as well. All analyzed images were color images. Iterations using greyscale did not produce useful results. We estimate that each 176 × 176 pixel training image contains roughly 150 (<2 mm set) to 400 (<500 µm set) individual sand grains. As can be seen in the training images provided on the GitHub page listed at the end of the manuscript and in Fig. 7, individual grains are easily identified visually, and sand from each source area is generally distinct on simple visual examination.
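The subdivision of each photograph into 176 × 176 pixel squares can be sketched as below. This is a minimal illustration under stated assumptions, not the sandID implementation: it assumes non-overlapping tiles and discards partial tiles at the right and bottom edges, neither of which is specified in the text. The function name and the example image dimensions are hypothetical. Boxes are returned as (left, upper, right, lower) tuples, the convention accepted by, for example, Pillow's `Image.crop`.

```python
TILE = 176  # tile edge length in pixels, as stated in the text


def tile_boxes(width, height, tile=TILE):
    """List (left, upper, right, lower) boxes for non-overlapping square tiles.

    Partial tiles at the right and bottom edges are discarded (assumption).
    """
    boxes = []
    for upper in range(0, height - tile + 1, tile):
        for left in range(0, width - tile + 1, tile):
            boxes.append((left, upper, left + tile, upper + tile))
    return boxes


# Hypothetical cropped photo size in pixels (not the dimensions used in the study).
boxes = tile_boxes(width=800, height=1000)
print(len(boxes), "tiles; first box:", boxes[0])
```

In practice, the region covered by the pipe cap rim or background would also need to be excluded before tiling; that step is omitted here.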

In general, our relatively small sample size of 1,690 total images would be insufficient to train a large machine learning model from scratch. To circumvent this problem, we employed transfer learning52, in which models previously trained on a distinct but related classification problem (typically with large amounts of training data) are retrained to apply to a new problem (where training data is typically more limited). For our image classification problem, we took GoogLeNet as our starting point, which is a deep convolutional neural network with 22 layers that was originally trained to classify 1000 distinct everyday objects (e.g., keyboard, mouse, pencil). We retrained the model to predict the provenance of different sand samples from our case study sample set using 105 images from each source. To prevent overfitting, we set the learning rate for the first 21 layers of the model to be low (rate = 10⁻⁴) and set the final, fully connected layer to have a tenfold faster learning rate (rate = 10⁻³). In this way, we essentially treated the first 21 layers as a deep, pretrained feature-extraction network and updated the final layer to leverage these features to make class predictions in the context of our sand provenance dataset. We used standard back-propagation methods for training. We held out 45 images per source and used these as a validation set to periodically gauge model accuracy over the course of training. We note that our approach works directly on raw image snippets and does not require preprocessing steps such as sand particle segmentation.
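The per-source partition into 105 training images and 45 held-out validation images can be sketched as follows. This is a minimal illustration, not the authors' code: the text does not say how images were assigned to each set, so the seeded random shuffle is an assumption, and the function name, source labels and file names are hypothetical.

```python
import random


def split_per_source(tiles_by_source, n_train=105, n_val=45, seed=0):
    """Randomly assign n_train training and n_val validation tiles per source."""
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    train, val = {}, {}
    for source, tiles in tiles_by_source.items():
        if len(tiles) < n_train + n_val:
            raise ValueError(f"{source}: need at least {n_train + n_val} tiles")
        shuffled = list(tiles)
        rng.shuffle(shuffled)
        train[source] = shuffled[:n_train]
        val[source] = shuffled[n_train:n_train + n_val]
    return train, val


# Hypothetical source labels and tile file names (assumptions for illustration).
tiles_by_source = {
    "source_A": [f"source_A_tile_{i:03d}.png" for i in range(150)],
    "source_B": [f"source_B_tile_{i:03d}.png" for i in range(160)],
}
train, val = split_per_source(tiles_by_source)
print({s: (len(train[s]), len(val[s])) for s in tiles_by_source})
```

One design consideration when adapting this sketch: tiles cut from the same photograph can land in both sets under a purely random split, so grouping the split by photograph may give a stricter estimate of accuracy.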