A database and framework for carbon ore resources and associated supply chain data

The Carbon Ore Resources Database (CORD) is a working collection of 399 data files associated with carbon ore resources in the United States. The collection includes spatial/non-spatial, filtered, processed, and secondary data files, with original data acquisition efforts focused on domestic coal resources. All data were acquired from open-source, online sources maintained by a combination of 18 national, state, and university entities. Datasets are categorized to represent aspects of carbon ore resources, including Geochemistry, Geology, Infrastructure, and Samples. Geospatial datasets are summarized and analyzed by record and dataset density, that is, the number of records or datasets per 400 square kilometer grid cell. Additionally, the “CORD Platform,” an ArcGIS Online geospatial dashboard web application, enables users to interact with and query CORD datasets. The CORD provides a single database and location for data-driven analytical needs associated with the utilization of carbon ore resources.


Specifications
Subject: Energy Economics
Specific subject area: Carbon ore resources including in-situ coal resource geology and geochemistry, supply chains, waste streams, and beneficial uses.
Type of data: Tables; Geodatabases; Feature classes/shapefiles; Rasters; Figures
How data were acquired: All data were acquired from online databases using standard configuration PC and internet browser software.
Data format: Secondary; Filtered; Processed
Parameters for data collection: Data were collected for relevance to domestic US coal resources.
Description of data collection: The Carbon Ore Resources Database (CORD) is a working collection of spatial/non-spatial, filtered, processed, and secondary datasets categorized to represent aspects of US carbon and coal resources, including supply chains, waste streams, and end uses.
Data source location: Data were downloaded from various national, state, and university entities.

Value of the Data
• The Carbon Ore Resources Database (CORD) enables broader understanding and data-driven analyses of in-situ-, supply chain-, and consumer-based carbon resources by providing a single location to efficiently access carbon ore resource datasets for a range of applications and end users. The systematized database organizes carbon ore data so that they can be easily retrieved and analyzed.
• Increased accessibility to systematized carbon ore resource datasets benefits research and development scientists, analysts, developers, economists, and engineers from various organizations. These organizations include coal mining companies, power plant operators, government agencies, non-governmental organizations (NGOs), and natural resource managers.
• Access to integrated, comprehensive carbon ore resource data is necessary for a range of applications, including optimizing coal production and deliveries to existing and new markets; mitigating the impacts of coal ash disposal, acid mine drainage, and greenhouse gas emissions; increasing beneficial use of coal and coal by-products; and extracting specific coal sources for carbon-based products and rare earth elements.
• Broader applications include decision support for carbon management and policy, and identifying opportunities for the development of coal and carbon management technologies.
• Geospatial datasets within the CORD facilitate mapping and analysis using GIS (Geographic Information Systems) software.

Data Description
The Carbon Ore Resources Database (CORD) is a collection of 399 individual data files associated with carbon ore resources. The original data acquisition efforts focused on coal resources in the United States. Supplementary File 1 provides descriptions for each individual data file organized by category. Supplementary File 2 lists each file by name, category, data type (secondary or processed), coal filter field (name of field used to filter coal related records), data format type (spatial (vector), spatial (raster), or table), available formats, source organization, link to the data download source, and publication citation (if available). The CORD can be downloaded from the NETL's Energy Data eXchange website ( https://edx.netl.doe.gov/dataset/cord ) as two separate zipped folders, one in a geodatabase format and the other in a folder file structure.
A summary of the number of data files and records within the CORD by category and data type (general format of the data) is shown in Table 1. Data types include either tables or spatial formats. Most of these data files (87% of total) and records (79% of total) are contained within the Geology category (Table 1). The data are organized into six categories (Table 1): Geochemistry, Geology, Infrastructure, Infrastructure network, Samples integrated, and Samples original. These categories are described in further detail below. Note: Individual raster data files are counted as a single record.
The number of data files and records within the CORD by primary source organization and data type (general format of the data) is shown in Table 2. In total, there are 397 data files containing 1,328,704 records from 18 primary sources (organizations). Most of the data files (74% of total) and records (72% of total) are sourced from the USGS [1][2][3][4][5][6][7][8][9][10][11][12].
The "Geochemistry" category consists of seven secondary data files and 4012 records associated with coal geochemistry but not explicitly coal sample data (Table 1). This includes data that are derived from or associated with coal sample geochemistry, for example, interpolations of elemental concentrations or water produced from coal beds. These data are sourced from the USGS [1, 2] and ISGS [13] (Table 2). Fig. 1 A displays the quantity of spatial records within the "Geochemistry" category across the United States, including Alaska, within 400 sq. km grid cells.
The "Infrastructure" category consists of 17 secondary data files and 123,777 records associated with coal resource infrastructure (Table 1). Currently, this includes datasets related to coal mines and power plants. These data are sourced from the USGS [6], ISGS [13], EIP [17], MSHA [18], PASDA [19], SkyTruth [20], and the TRC [21] (Table 2). Fig. 1 C displays the quantity of spatial records within the "Infrastructure" and "Infrastructure network" categories across the United States, including Alaska, within 400 sq. km grid cells.

Fig. 1. Spatial data summarization of database categories within 400 sq. km grid cells across the United States, including Alaska. Fig. 1 A (top left) displays the quantity of spatial records contained within the "Geochemistry" category. Fig. 1 B (top right) displays the quantity of spatial datasets contained within the "Geology" category. Fig. 1 C (bottom left) displays the quantity of spatial records contained within the "Infrastructure" and "Infrastructure network" categories.
The "Infrastructure network" category consists of nine processed data files and 90,634 records associated with coal resource infrastructure (Table 1). Currently, this includes datasets related to coal supply chains from mines to power plants (i.e., sources, production, deliveries, consumption, stockpiles, by-products) from 2011 through 2016. These data are sourced from the USGS [3], MSHA [18], and the EIA [22][23][24] (Table 2). Fig. 1 C displays the quantity of spatial records within the "Infrastructure" and "Infrastructure network" categories across the United States, including Alaska, within 400 sq. km grid cells. Additionally, Fig. 2 displays these datasets as used within an online dashboard web application entitled "CORD Platform". A link to the application is provided through EDX (https://edx.netl.doe.gov/dataset/cord).

The "Samples integrated" category consists of two processed data files and 64,776 records associated with coal samples (Table 1). This includes datasets integrated from the "Samples original" category. Both datasets integrate all collected coal sample records. One is provided as a table ("Samples_All") and the other in a spatial format ("Samples_spatial"). Fig. 1 D displays the quantity of spatial records within the "Samples_spatial" dataset of the "Samples integrated" category across the United States, including Alaska, within 400 sq. km grid cells.
A data dictionary (field names and descriptions) is provided in an Excel workbook for datasets that required additional processing and integration steps (Supplementary File 3), where each tab corresponds to one of 10 datasets.

Additionally, two Python scripts and two CSV files associated with field mapping of the "Samples_All" table are provided in Supplementary File 4:
• CORD_Data_Script1.py: This script takes an input folder path with CSV files and exports the files into a new folder with updated and modified attribute names.
• CORD_Data_Script2.py: This script takes an input folder path with CSV files and exports the combined files into a new folder with an updated schema and all empty rows removed.
• CORD_Field_map.csv: This CSV file provides the mapping of the input data fields to the output data fields for the data conversion script CORD_Data_Script1.py.
• CORD_Schema_combined.csv: This CSV file contains the combined schema for the data that are converted from multiple input CSV files to a single output CSV. It is used with the CORD_Data_Script2.py Python script.
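To give a rough sense of what a field-mapping and row-cleaning step of this kind can look like, the sketch below renames CSV columns through a lookup table and drops fully empty rows. It is not the code in Supplementary File 4. The two-column map layout (`input_field`, `output_field`) and all sample field names and values are assumptions made for illustration only.

```python
import csv
import io


def load_field_map(map_csv_text):
    """Read an ASSUMED two-column map: input_field -> output_field."""
    reader = csv.DictReader(io.StringIO(map_csv_text))
    return {row["input_field"]: row["output_field"] for row in reader}


def rename_fields(data_csv_text, field_map):
    """Return rows with mapped column names; unmapped columns keep their names."""
    reader = csv.DictReader(io.StringIO(data_csv_text))
    return [{field_map.get(k, k): v for k, v in row.items()} for row in reader]


def drop_empty_rows(rows):
    """Remove rows in which every value is blank."""
    return [r for r in rows if any((v or "").strip() for v in r.values())]


# Hypothetical inputs; the real map and sample tables have their own fields.
FIELD_MAP = "input_field,output_field\nSAMPLE_NO,Sample_ID\nASH_PCT,Ash_percent\n"
DATA = "SAMPLE_NO,ASH_PCT,STATE\nA1,9.5,IL\n,,\nB2,12.1,PA\n"

rows = drop_empty_rows(rename_fields(DATA, load_field_map(FIELD_MAP)))
for row in rows:
    print(row)
```

Keeping the mapping in a separate CSV, as the supplementary files do, means the renaming rules can be reviewed and changed without editing the script itself.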

Experimental Design, Materials and Methods
All data processing was performed using ESRI's ArcGIS ArcMap 10.7 software [35], Microsoft Excel, and Python scripting. The data collection method consisted of manual internet searches through authoritative sources, starting at the national level and then at the state level. Searches were focused on explicit coal datasets and on data that do not primarily include coal information (e.g., the USGS produced water database [1]). Data were also collected from journal publications where readily available (e.g., Taggart et al. [28]). As these data were collected, they were catalogued and sorted into file folders according to their category (Table 1) and converted to a table (File Geodatabase Table or CSV) or spatial file (feature class, shapefile, FGDBR, or TIFF) format. Where applicable, each data file was renamed according to location (state or geologic basin), name and/or physical representation, and source organization acronym (e.g., App-Basin_Pocahontas_Coal_Bed_Thickness_USGS). To summarize overlapping datasets and features (records) by category within 400 sq. km grid cells (Fig. 1), the Cumulative Spatial Impact Layers (CSIL) tool [36] was applied within ArcMap 10.7 software [35].
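The density summaries were produced with the CSIL tool [36]. The following is only a simplified stand-in that shows the underlying idea of counting records per grid cell. It assumes square 20 km by 20 km cells (400 sq. km) and point coordinates already expressed in a projected system in metres. The function names and sample points are hypothetical.

```python
import math
from collections import Counter

# ASSUMPTION: square cells, 20 km on a side, giving 400 sq. km per cell.
CELL_SIZE_M = 20_000


def cell_index(x_m, y_m, cell_size=CELL_SIZE_M):
    """Map projected coordinates (metres) to an integer (column, row) cell index."""
    return (math.floor(x_m / cell_size), math.floor(y_m / cell_size))


def record_density(points):
    """Count how many point records fall inside each grid cell."""
    return Counter(cell_index(x, y) for x, y in points)


# Hypothetical projected point records (x, y) in metres.
points = [(1_000, 2_000), (19_999, 19_999), (20_000, 0), (-1, 5_000)]
density = record_density(points)
for cell, count in sorted(density.items()):
    print(cell, count)
```

Using `math.floor` rather than integer truncation keeps cells west or south of the origin from being merged into cell zero.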
Secondary data files that did not require any processing were imported directly into the database. If data required filtering for explicit coal records, the name of the field used to filter the records was recorded in Supplementary File 2, within the "Coal_filter_field" field. Files labelled as "processed" in Supplementary File 2 required additional modification before inclusion in the database. These data files include those in the "Infrastructure network", "Samples original", and "Samples integrated" categories. Each processed data file involved a unique method before integration into the CORD. These methods are described by category and for each processed data file as necessary.

Infrastructure network:

Coal_Source_Regions_Production_Deliveries_2011_2016: This dataset was created from the original secondary data file "Coal_fields_USGS" [3]. The 602 polygons representing coal fields and basins were dissolved by name (i.e., areas with the same name were merged into a single record) and then manually split in key areas. The Appalachian Basin Region was split into North, Central, and Southern regions by county boundaries, following the coal basins map within the EIA Coal Data Browser (https://www.eia.gov/coal/data/browser/). The North Appalachian Basin Region was further split to separate the Pennsylvania Anthracite Region. In total, 109 separate coal source regions were developed. Coal production and delivery quantities from 2011 through 2016 were added by spatially joining (point features closest to each region) and summing values from the "Coal_Mine_Production_2011_2016" and "Coal_Mine_Deliveries_2011_2016" datasets, respectively.
Coal_Delivery_Pathways_2011_2016: This dataset was extracted from the "Page 5 Fuel Receipts and Costs" tab in the EIA-923 Excel files [22] for the years 2011 through 2016. Annual spreadsheets were compiled into one CSV file and filtered for "Coal" in the "Fuel Group" field, and exports to other countries were removed using the "Plant State" field. The table was then pivoted on the delivery quantity ("QUANTITY") field to obtain new field totals by month and year(s). A field was added to designate interstate or intrastate ("InterIntraState") deliveries. To obtain latitude and longitude coordinates, mine coordinate information was joined from the MSHA mines dataset using the unique MSHA identifier fields. Before joining, the raw mine longitude coordinates first had to be multiplied by −1 to correctly represent the decimal degrees format. Additionally, locations with null or visibly incorrect coordinates were corrected by comparing the location description ("DIRECTIONS" and "NEAREST_TO" fields) with satellite imagery in Google Maps. A total of 14 mine locations associated with deliveries (2011-2016) were corrected. This mine location correction procedure was repeated for the "Coal_Mine_Production_2011_2016" dataset as well. For delivery records that did not have a unique MSHA identifier number, the centroid coordinates of the associated counties ("COALMINE_COUNTY") or states ("COALMINE_STATE"; if county information was unavailable) were used. These centroids were obtained by calculating the centroids of county and state polygons in the "WGS84" datum. A total of 132 delivery records could not be mapped due to lack of spatial information and were left out. To obtain latitude and longitude coordinates for power plants, the original data were extracted from the "Plant" tab in the annual EIA-860 Excel files ("2__Plant_Y [YEAR].xlsx") for the years 2012 (locations not available before 2012) through 2016 [23].
Annual spreadsheets were combined into one CSV file and pivoted on the unique identifying field for power plants ("Plant Code") to obtain a single unique record for each location. Power plant coordinates were then joined to the coal deliveries table using the "Plant Code" field. Polylines representing delivery origin-destination paths from mines to power plants were created using the "XY to Line" tool within ArcMap 10.7 software [35]. After the spatial data were created, polyline lengths were calculated in kilometers using the North America Lambert Conformal Conic projection.
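The attribute preparation described for the delivery records can be sketched roughly as follows. This is an illustration under stated assumptions, not the processing actually used. The `MSHA_ID` key, the `Interstate`/`Intrastate` labels, the rule of comparing mine state with plant state, the lookup tables, and all coordinate values are hypothetical. Only the "Fuel Group", "Plant State", "QUANTITY", "COALMINE_COUNTY", "COALMINE_STATE", and "InterIntraState" names come from the text above.

```python
def locate_mine(record, mine_coords, county_centroids, state_centroids):
    """Resolve a mine location: MSHA coordinates first, then county, then state centroid."""
    mine_id = record.get("MSHA_ID")  # hypothetical field name
    if mine_id in mine_coords:
        lat, raw_lon = mine_coords[mine_id]
        # Raw longitudes are multiplied by -1 to obtain signed decimal degrees.
        return lat, -1 * raw_lon, "mine"
    county_key = (record.get("COALMINE_STATE"), record.get("COALMINE_COUNTY"))
    if county_key in county_centroids:
        lat, lon = county_centroids[county_key]
        return lat, lon, "county"
    state = record.get("COALMINE_STATE")
    if state in state_centroids:
        lat, lon = state_centroids[state]
        return lat, lon, "state"
    return None  # no spatial information available


def prepare_deliveries(records, mine_coords, county_centroids, state_centroids):
    """Keep coal records, attach coordinates, and flag inter- vs intrastate deliveries."""
    mapped, unmapped = [], []
    for rec in records:
        if rec["Fuel Group"] != "Coal":
            continue
        location = locate_mine(rec, mine_coords, county_centroids, state_centroids)
        if location is None:
            unmapped.append(rec)
            continue
        lat, lon, source = location
        same_state = rec["COALMINE_STATE"] == rec["Plant State"]  # ASSUMED rule
        mapped.append({**rec, "lat": lat, "lon": lon, "location_source": source,
                       "InterIntraState": "Intrastate" if same_state else "Interstate"})
    return mapped, unmapped


# Hypothetical lookup tables and records with illustrative coordinates.
mine_coords = {"M1": (37.8, 81.9)}  # longitude stored without its sign
county_centroids = {("WY", "Campbell"): (44.2, -105.5)}
state_centroids = {"WY": (43.0, -107.5)}
deliveries = [
    {"Fuel Group": "Coal", "MSHA_ID": "M1", "COALMINE_STATE": "WV",
     "COALMINE_COUNTY": "Logan", "Plant State": "WV", "QUANTITY": 100},
    {"Fuel Group": "Coal", "MSHA_ID": None, "COALMINE_STATE": "WY",
     "COALMINE_COUNTY": "Campbell", "Plant State": "TX", "QUANTITY": 250},
    {"Fuel Group": "Natural Gas", "MSHA_ID": None, "COALMINE_STATE": None,
     "COALMINE_COUNTY": None, "Plant State": "TX", "QUANTITY": 40},
    {"Fuel Group": "Coal", "MSHA_ID": None, "COALMINE_STATE": None,
     "COALMINE_COUNTY": None, "Plant State": "OH", "QUANTITY": 10},
]

mapped, unmapped = prepare_deliveries(deliveries, mine_coords,
                                      county_centroids, state_centroids)
print(len(mapped), "mapped;", len(unmapped), "left out")
```

Recording which fallback supplied each coordinate (`location_source`) makes it possible to separate exact mine locations from county or state centroids in later analysis.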

Coal_Mine_Deliveries_2011_2016:
This dataset was created from the "Coal_Delivery_Pathways_2011_2016" dataset by dissolving on the MSHA unique identifying number, latitude, and longitude fields to obtain a single unique record for each mine. Delivery quantities were aggregated and summed for each and all years, including the total delivery count from each mine. Point features were then created to represent individual mines. To obtain the name of the coal source region associated with each mine ("Coal_Source_Region"), the "Coal_Source_Region_2011_2016" dataset was spatially joined to the mine point features (by nearest coal source region polygon). The mine point features were then joined to the "Coal_Delivery_Pathways_2011_2016" dataset by a temporary "Delivery_ID" field (deleted after the join) to obtain the coal source region names within the deliveries dataset.
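The nearest-polygon spatial join was performed in ArcMap. As a minimal sketch of the same idea outside a GIS, the code below assigns each point to the polygon with the smallest planar distance, treating a point inside a polygon as distance zero. It assumes simple polygons without holes and planar (projected) coordinates. The region names and geometries are made up for the example.

```python
import math


def point_in_polygon(pt, ring):
    """Ray-casting test; ring is a list of (x, y) vertices of a simple polygon."""
    x, y = pt
    inside = False
    n = len(ring)
    for i in range(n):
        x1, y1 = ring[i]
        x2, y2 = ring[(i + 1) % n]
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def segment_distance(pt, a, b):
    """Shortest planar distance from pt to the segment a-b."""
    px, py = pt
    ax, ay = a
    bx, by = b
    dx, dy = bx - ax, by - ay
    if dx == 0 and dy == 0:
        return math.hypot(px - ax, py - ay)
    t = ((px - ax) * dx + (py - ay) * dy) / (dx * dx + dy * dy)
    t = max(0.0, min(1.0, t))  # clamp the projection onto the segment
    return math.hypot(px - (ax + t * dx), py - (ay + t * dy))


def polygon_distance(pt, ring):
    """Zero if pt lies inside the polygon, else distance to the nearest edge."""
    if point_in_polygon(pt, ring):
        return 0.0
    n = len(ring)
    return min(segment_distance(pt, ring[i], ring[(i + 1) % n]) for i in range(n))


def nearest_region(pt, regions):
    """Return the name of the region polygon closest to pt."""
    return min(regions, key=lambda name: polygon_distance(pt, regions[name]))


# Hypothetical regions in arbitrary planar units.
regions = {
    "Region A": [(0, 0), (10, 0), (10, 10), (0, 10)],
    "Region B": [(20, 0), (30, 0), (30, 10), (20, 10)],
}
mines = {"inside_A": (5, 5), "near_A": (14, 5), "near_B": (17, 5)}
assigned = {name: nearest_region(pt, regions) for name, pt in mines.items()}
print(assigned)
```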
Coal_Mine_Production_2011_2016: This dataset was extracted from the EIA-7A Excel files [24] for the years 2011 through 2016 and compiled into one CSV file. The table was then pivoted on the "YEAR" field to obtain a unique record for each mine, and new field totals for production quantities were summed for each and all year(s) ("p_[YEAR]"). Mine latitude, longitude, and metadata fields were added by joining on the MSHA unique identifying numbers in the "MINE_ID" field of the MSHA mines dataset [18]. Although latitude and longitude were available in the EIA-7A data [24], the MSHA mines dataset [18] was used due to inconsistent location and name values from year to year. Using the same mine relocation method used for the "Coal_Delivery_Pathways_2011_2016" dataset, a total of 23 mine locations were corrected. Point features were then created to represent individual mines. To obtain the name of the coal source region associated with each mine ("Coal_Source_Region"), the "Coal_Source_Region_2011_2016" dataset was spatially joined to the mine point features (by nearest coal source region polygon).
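Pivoting on the year field can be illustrated with a short sketch that turns one row per mine and year into one row per mine with "p_[YEAR]" columns. The `PRODUCTION` input field, the `p_total` output field, and the sample values are assumptions for illustration. Only "MINE_ID", "YEAR", and the "p_[YEAR]" naming pattern come from the description above.

```python
from collections import defaultdict


def pivot_by_year(records, id_field="MINE_ID", year_field="YEAR",
                  value_field="PRODUCTION"):
    """Collapse long-format rows into one row per mine with p_<year> totals."""
    wide = defaultdict(lambda: defaultdict(float))
    for rec in records:
        wide[rec[id_field]][f"p_{rec[year_field]}"] += float(rec[value_field])
    table = []
    for mine_id, totals in sorted(wide.items()):
        row = {id_field: mine_id, **dict(sorted(totals.items()))}
        row["p_total"] = sum(totals.values())  # hypothetical all-years field
        table.append(row)
    return table


# Hypothetical long-format production records.
records = [
    {"MINE_ID": "100", "YEAR": 2011, "PRODUCTION": 50},
    {"MINE_ID": "100", "YEAR": 2012, "PRODUCTION": 70},
    {"MINE_ID": "200", "YEAR": 2011, "PRODUCTION": 30},
    {"MINE_ID": "100", "YEAR": 2011, "PRODUCTION": 5},
]
table = pivot_by_year(records)
for row in table:
    print(row)
```

In this sketch a mine with no record for a given year simply lacks that column, so a consumer writing the result to CSV would need to decide whether to fill such gaps with zero or leave them blank.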
Power_Plant_Deliveries_by_Coal_Source_Region_2011_2016: This dataset was extracted from the "Coal_Delivery_Pathways_2011_2016" dataset. First, the "Coal_Delivery_Pathways_2011_2016" dataset was dissolved on the unique identifying number for power plants ("Plant_code") and the "Coal_Source_Region" field to obtain a single record for each unique combination of power plant and coal source region. Delivery quantities were aggregated and summed for each and all years. The "Coal_Source_Region" field was then pivoted to add fields for coal delivery quantity totals for each unique combination of region and year. Point features were then created to represent individual power plants.
Power_Plant_Deliveries_and_ByProducts_2011_2016: This dataset was extracted from the "Coal_Delivery_Pathways_2011_2016" dataset and the "8A Annual Byproduct Disposition" tab in the EIA-923 Excel files [22] for the years 2011 through 2016. First, the "Coal_Delivery_Pathways_2011_2016" dataset was dissolved on the unique identifying number for power plants ("Plant_code"), latitude, and longitude fields to obtain a single unique record for each power plant. Delivery quantities were aggregated and summed for each and all years (total average as well). Point features were then created to represent individual mines. Next, the "8A Annual Byproduct Disposition" annual spreadsheets [22] were compiled into one CSV file and then pivoted on the "Year" and "Plant ID" fields to obtain by-product quantity totals for each combination of disposition type (sold, stored, used, disposed) and year for each power plant. The point features containing the delivery quantity information were then joined to the IL_Coal_quality_samples_ISGS: The original data was directly converted from the "coal-quality-