High-resolution X-ray diffraction datasets: Carbonates

X-ray diffraction (XRD) analysis is a versatile and reliable method for identifying minerals in solid samples. It is one of the primary techniques that geoscientists, mineralogists, and solid-state chemists depend on to characterize the composition of unknown samples. In recent years, researchers have shown growing interest in large, readily accessible datasets for calibrating experiments or building statistical models. Such datasets are difficult to come by: most well-curated datasets are proprietary and often too expensive for the average researcher. When datasets are available, they may still be unfit for purpose because they lack adequate coverage of a particular mineral of interest. For these reasons, we have carefully selected and curated samples rich in calcium carbonate that will be useful for various applications. Our dataset includes 1680 X-ray diffraction scans of samples collected from outcrops of carbonate-rich rock formations in Spain, Italy, and Saudi Arabia. The samples represent materials with total carbonate concentrations between 30% and 99%. The spectra were acquired on a Malvern PANalytical EMPYREAN diffractometer system over a 2θ range of 2-70° with a 0.01° step size. This dataset will be valuable to geoscientists, mineralogists, solid-state chemists, and data scientists looking to design experiments or to build mineralogical reference databases or statistical models with sufficient data points. We currently use the dataset in our own projects to develop a comprehensive carbonate library and felt compelled to share it.

© 2022 The Author(s)

Value of the Data
• In recent years, researchers have shown growing interest in large, readily accessible datasets for calibrating experiments or building statistical models.
• Such datasets are difficult to come by: most well-curated datasets are proprietary and often too expensive for the average researcher. When datasets are available, they may be unfit for purpose because they lack adequate coverage of a mineral of interest. For these reasons, we have carefully selected and curated samples rich in calcium carbonate that will be useful for various applications.
• Our dataset includes 1680 X-ray diffraction scans of samples collected from Spain, Italy, and Saudi Arabia.
• The samples represent materials with calcium carbonate concentrations between 30% and 99%.
• The dataset will benefit geoscientists, mineralogists, solid-state chemists, and data scientists looking to design experiments or build statistical models that need many data points.
• This dataset can be used to build databases, to create data science models requiring many data points, or as a reference dataset for comparing carbonates.
• Fig. 1 is a sample spectrum from the dataset, depicting a typical spectrum file.
• Fig. 2 shows the result of Uniform Manifold Approximation and Projection (UMAP) and hierarchical clustering (HCPC), illustrating the variability among the carbonate spectra files.
• Fig. 3 is a generalized pairs plot, offering a range of displays of paired combinations of the categorical variable (cluster) and the quantitative variables (mineral concentration). On the top, boxplots display the variability in the concentration of the minerals. On the left, bar charts and scatterplots display the main group to which each spectrum file belongs according to the cluster analysis, while the scatterplots pair the mineral concentrations. Pearson correlation is displayed on the right, and the distribution of each variable is shown on the diagonal.
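The Pearson correlation shown in the pairs plot can be illustrated with a minimal sketch. The authors worked in R; the Python below only illustrates the statistic itself, and the function name `pearson_r` and the concentration values are hypothetical, not taken from the dataset.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length with at least 2 values")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    # Sums of centred cross-products and centred squares.
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


if __name__ == "__main__":
    # Made-up concentrations (percent) of two minerals in four scans.
    calcite = [95.0, 80.0, 60.0, 40.0]
    dolomite = [5.0, 20.0, 40.0, 60.0]
    print(pearson_r(calcite, dolomite))
```

In this made-up example the two concentrations sum to 100% in every scan, so they are perfectly anti-correlated. Compositional data of this kind tend to show negative correlations between major phases for the same reason.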

Experimental Design, Materials and Methods
Samples for this dataset were collected during several of our past research projects. They come from Spain, Italy, and Saudi Arabia and represent more than 10 carbonate formations. The spectra included in this dataset were carefully selected by examining the percentage of calcium carbonate in each individual file. The calcite group of minerals is summarized in Table 1. Calcium carbonate (CaCO₃) and its polymorphs (calcite, aragonite, and vaterite) are the focus of our dataset and are some of the most abundant, and therefore important, biominerals. Calcium carbonate is ubiquitous in nature and can be found in all rock types, hot springs, caves, ore deposits, and geyser deposits. It also constitutes both structural and nonstructural components of living organisms, so its biomineralization makes a huge contribution to our ecological and geochemical systems [1][2][3].
Table 1. Summary statistics for the quantified spectra files, including the mean, standard deviation (sd), quartiles (p0-p100), and the concentration range of each mineral depicted as a histogram.
A general non-linear dimension reduction technique, Uniform Manifold Approximation and Projection (UMAP), was first applied to thousands of spectra files to sort out scans rich in carbonates. Hierarchical clustering (HCPC) of the UMAP embeddings was then used to characterize the geochemical signatures and compare them with known carbonate files before quantification. Any spectrum whose quantified calcium carbonate content was below 30% was excluded. Analyses were conducted using the R statistical language (version 4.1.2; R Core Team, 2021) on Windows 10 x64 (build 19044), using the packages hrbrthemes [4], powdR [5], tidymodels [6], purrr [7], report [8], datawizard [9], embed [10], rxylib [11], ggforce [12], and tidyverse [13]. Conversion between the various powder X-ray file formats was carried out in the PowDLL software [14]. The dataset was acquired on a Malvern PANalytical EMPYREAN diffractometer system over a 2θ range of 2-70° with a 0.01° step size.
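The exclusion rule and the scan settings can be sketched in a few lines. This is an illustration in Python, not the authors' R workflow. It assumes that both ends of the 2-70° range are included in each scan, and the names `two_theta_grid`, `QuantifiedScan`, and `select_carbonate_rich` are hypothetical.

```python
from dataclasses import dataclass

# Scan settings stated for the dataset: 2-70 degrees two-theta, 0.01 degree step.
TWO_THETA_START = 2.0
TWO_THETA_END = 70.0
STEP = 0.01
# Exclusion rule from the text: spectra below 30% calcium carbonate are dropped.
MIN_CARBONATE_PCT = 30.0


def two_theta_grid(start=TWO_THETA_START, end=TWO_THETA_END, step=STEP):
    """Return the two-theta positions of one scan (assumes both endpoints are included)."""
    n_points = round((end - start) / step) + 1
    # Building each value from its index avoids accumulating floating-point error.
    return [round(start + i * step, 2) for i in range(n_points)]


@dataclass
class QuantifiedScan:
    scan_id: str          # hypothetical identifier, not a file name from the dataset
    carbonate_pct: float  # quantified calcium carbonate content, in percent


def select_carbonate_rich(scans, threshold=MIN_CARBONATE_PCT):
    """Keep scans whose calcium carbonate content is at or above the threshold."""
    return [s for s in scans if s.carbonate_pct >= threshold]


if __name__ == "__main__":
    grid = two_theta_grid()
    print(len(grid), grid[0], grid[-1])
    # Made-up quantification results for three scans.
    demo = [
        QuantifiedScan("scan_a", 12.5),
        QuantifiedScan("scan_b", 30.0),
        QuantifiedScan("scan_c", 98.7),
    ]
    print([s.scan_id for s in select_carbonate_rich(demo)])
```

Under the inclusive-endpoint assumption, each scan holds 6801 intensity values. A scan quantified at exactly 30% is kept, because the text excludes only spectra below that value.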

Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Data Availability
High-resolution X-ray diffraction datasets: Calcium Carbonates (Original data) (Mendeley Data)