Highlighting and overcoming data barriers: creating open data for retrospective analysis of US electric power systems by consolidating publicly available sources

Studies of power system operation commonly draw from two key databases produced by the US Environmental Protection Agency: the Acid Rain Program’s Continuous Emission Monitoring System (CEMS) data, and the Emissions and Generation Resource Integrated Database (eGRID). Separate reporting requirements and heterogeneity in data aggregation between these two databases create a barrier to systematic spatial and temporal retrospective power system analysis. This work describes the inherent challenges of this undertaking and documents a method for reconciling the two seemingly disparate data sources. While fundamental differences in data reporting and aggregation prevent us from achieving full coverage, this work represents an important initial step toward aligning these two repositories of US power system data. We demonstrate the value of this linkage by computing relative unit-level, hourly utilization metrics for most thermal power plants in the US. Analysis of these metrics across time illustrates thermal generator cycling trends in California between 2011 and 2017. These unit-level results indicate that combined cycle units within California increased their part-load generation by 15% and resultant CO2 emissions by 17%, and decreased their start/stop frequency over time by 8.5% and resultant emissions by 47%. Open cycle gas turbines increased their generation overall: the number of start/stop cycles rose by 97% and resultant emissions by 85%, part-load generation by 120% and resultant emissions by 100%, and full-load generation by 40% and resultant emissions by 18%. We also observe a temporal shift in thermal generation from morning hours to evening hours in California.


Background
The addition of renewable power capacity in the electricity sector increased exponentially in the United States (US) over the past decade, doubling the contribution of renewables to total electricity generation from 2008 levels to 742 million MWh in 2018 [1]. During the same period, coal generation declined and was surpassed by natural gas generation, which now represents 34% of US electricity supply [2]. These changes were facilitated by declines in natural gas prices and the cost of renewables [3] and brought about US power system emissions reductions of 27% between 2005 and 2017 [4]. With these significant changes in the electricity paradigm, it is imperative to understand emerging patterns of flexible thermal power generation in the context of expanded use of intermittent renewables such as wind and solar energy [5][6][7].
For such analysis in the US, we are fortunate to have access to a public database of hourly generation and emissions of the thermal fleet [8], and comprehensive data on the characteristics of almost all electric power plants [9][10][11]. The US Environmental Protection Agency (EPA) measures hourly emissions and generation using Continuous Emissions Monitoring Systems (CEMS) at the exhaust stacks of thermal units larger than units and open cycle gas turbines over time in California are explored. At a unit level, difference in operational behavior, and thereby emissions, for two different combined cycle units are explored.

Method
In this section, the steps for linking EPA's CEMS data with the EPA's eGRID database are described. The resulting database contains hourly generation profiles for every generator integrated with the eGRID data.
The CEMS data provide the hourly gross generation and the corresponding emissions for most thermal units in the US since 1996 [8]. A CEMS comprises the measuring devices (pollutant analyzers) and computer programs needed to determine a gas or particulate matter concentration and to report results in the units of the applicable emission limitation or standard. Because these emissions occur at the exhaust stack, the measuring devices are physically installed at the exhaust points of the power plants. Several configurations of generators can be connected to an exhaust where measurements are taken. The set of generators connected to an exhaust is called a unit and is assigned a unique identification number. Records within eGRID represent every licensed generator in the US and provide information on environmental characteristics, annual generation, and nominal operating capacities [9]; this information can be merged with the CEMS data to derive insights into the hourly operational behavior of different types of units.
Merging the information in these two datasets is a nontrivial effort. To summarize, CEMS recordkeeping is done for individual 'combustion units,' while eGRID presents data at both the 'generator' and 'unit' levels. A unit can be a single generator or a group of generators run together to produce electricity; the latter commonly occurs in combined cycle plants. While these levels of aggregation may be identical for some unit configurations, they cannot be easily paired for other configurations. We discuss this point in more detail below. We create an informed mapping between CEMS and eGRID data to analyze CEMS data in the context of eGRID unit attributes.

Combining the data
Hourly profiles at the 'unit' level, as reported by CEMS, record generation and emissions at the exhaust stack, where gases/pollutants are emitted. The nominal capacity of these units is not available from CEMS and is instead determined using the eGRID database. Though the method primarily focuses on deriving the nominal capacities from the eGRID database, all other attributes from the eGRID database can be combined with the CEMS data using our matching procedure (figure 1).
First, we pre-process the data within the eGRID database before combining it with the CEMS data. For this step, the attributes at the generator level must be translated to the unit level within the eGRID data. The unit dataset is titled 'UNT18' and the generator dataset is titled 'GEN18'. The generator dataset has the nominal capacity value of most of the turbines in the US. These capacity values are mapped to the associated units within every plant (footnote 6) within the eGRID data. The algorithm for this mapping is developed using the Python language. In the following paragraphs, we explain in detail the steps of the algorithm for mapping generator data to unit data presented in figure 1. The heuristics for mapping differ by type of generator and unit, which we categorize into three cases. In the first case, each unit comprises a single generator (e.g., natural gas-based gas turbines). In the second case, each unit comprises more than one generator, as seen for combined cycle units (combustion turbine and bottoming cycle steam turbine). In the third case, a single generator connects to multiple units, usually when a single steam turbine has multiple boiler units. The columns of interest from the unit and generator data of eGRID are the unique identification number (ID) of the plants (i.e., the 'ORIS code'), state, plant name, the unique ID of the units within the plant (i.e., 'UNITID'), turbine type, primary fuel, and the unique ID of the generators (i.e., 'GENID'). It is a nontrivial effort to map all the generators to their respective units, even within the eGRID database.
Figure 1. Schematic diagram of the process used to combine EPA's datasets: eGRID database and CEMS database. The eGRID data comprise important technical characteristics of plants such as efficiency, prime mover type, fuel type, and nameplate capacities. Since the CEMS data is at the unit level, eGRID data is first consolidated at the unit level before combining with the ARP data.
Case 1. One generator per unit
For units running with a single generator, the UNITID and GENID should match exactly, but the actual data requires processing before this step. Therefore, we first outer-join the GEN18 data with the UNT18 data based on the plant's ORIS code. This ensures that each UNITID is compared with all the GENIDs within a plant before proceeding to the mapping. We exclude combined cycle units from this part of the algorithm because, unlike other units, which generally consist of a single generator, they comprise a group of generators run together to produce electricity. Although GENIDs and UNITIDs should match on a one-to-one basis, in approximately 30% of cases these identifiers do not match exactly. Therefore, we assign a score based on the similarity of the identifiers to each combination of UNITID and GENID using SequenceMatcher from Python's difflib library. A ratio of 1 means an exact match and a ratio of 0 means no similar characters. Since each individual UNITID is compared to all GENIDs within a plant, we identify the mapping of UNITIDs to GENIDs based on the maximum resulting similarity score. For instance, a representative plant with ORIS code 7315 is shown in table 1. There is an equal number of units and generators. The unit prime mover is labeled GT (gas turbine), and the generator prime mover is correspondingly labeled GT. Each unit is an open cycle gas turbine that drives its own generator. We map the generators to the corresponding units based on the similarity in their IDs. For 9892 plants, which is about 95.4% of plants in this category, there is a one-to-one mapping between units and generators. For the remaining plants, the correct mapping cannot be determined automatically.
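The identifier matching described for Case 1 can be sketched as follows. This is a minimal illustration rather than the authors' code: the function names, the whitespace and case normalization, and the sample identifiers are assumptions introduced here, and only `difflib.SequenceMatcher` is taken from the text.

```python
from difflib import SequenceMatcher


def similarity(unit_id: str, gen_id: str) -> float:
    """Return a 0-1 similarity ratio between two identifiers.

    Stripping whitespace and upper-casing is an assumption of this sketch,
    not a step documented in the article.
    """
    a = unit_id.strip().upper()
    b = gen_id.strip().upper()
    return SequenceMatcher(None, a, b).ratio()


def match_units_to_generators(unit_ids, gen_ids):
    """Map each UNITID to the GENID with the highest similarity score.

    Both lists are assumed to belong to the same plant (same ORIS code).
    """
    mapping = {}
    for unit_id in unit_ids:
        best_gen = max(gen_ids, key=lambda g: similarity(unit_id, g))
        mapping[unit_id] = (best_gen, similarity(unit_id, best_gen))
    return mapping


# Hypothetical identifiers for one plant; not taken from eGRID.
units = ["GT1", "GT2", "GT-3"]
generators = ["1", "2", "3"]
matches = match_units_to_generators(units, generators)
for unit_id, (gen_id, score) in matches.items():
    print(unit_id, "->", gen_id, round(score, 2))
```

Because each unit independently takes its best-scoring generator, two units could in principle select the same generator; a check for such collisions would be one way to flag plants that need manual review.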

Case 2. Combined Cycle units with bottoming steam fired generator-multiple generators in each unit
For combined cycle units, there is more than one generator associated with each unit. This happens where there is a bottoming cycle in a two-to-one combined cycle facility, as depicted in figure 2. In that case, we assume that the nominal capacity of the bottoming steam turbine generator is divided equally among the associated gas turbine generators, each of which then forms a unit. Other cases could exist in which two or more combustors/boilers feed one or more steam turbines, a traditional layout for coal-fired units. In these cases, the proper fraction of electricity from each of the component generators is assigned to the appropriate unit. This case is illustrated for a representative plant with ORIS code 5567 in table 2. This plant, taken from eGRID unit and generator data, illustrates a case with two units and three generators. Furthermore, the 'unit bottom firing type' in the UNT18 sheet of eGRID shows that the unit is a combined cycle unit. The generator prime mover column reports that the 'G1' and 'G2' generators of this plant are combined cycle gas turbines, denoted by CT. Generator 'G3' is denoted as CA, indicating a combined cycle steam turbine. Furthermore, the 'G3' generator has two associated boilers, which matches the number of gas turbines. The resultant nominal capacity of the unit is shown in table 2. The nominal capacity of each of the units 'CT01' and 'CT02' of ORIS code 5567 is 326 MW. This value is the sum of the nominal capacity of one CT and half of the CA generator.
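The equal-split assumption for the bottoming steam turbine can be written as a short calculation. The sketch below uses hypothetical capacities, not the values for ORIS code 5567, and the function name is introduced here for illustration only.

```python
def combined_cycle_unit_capacities(ct_capacities_mw, steam_capacity_mw):
    """Add an equal share of the steam turbine capacity to each combustion turbine.

    ct_capacities_mw: {generator ID: nominal capacity in MW} for the CT generators.
    steam_capacity_mw: nominal capacity in MW of the shared CA (steam) generator.
    """
    if not ct_capacities_mw:
        raise ValueError("at least one combustion turbine is required")
    share = steam_capacity_mw / len(ct_capacities_mw)
    return {gen_id: cap + share for gen_id, cap in ct_capacities_mw.items()}


# Hypothetical two-on-one plant: two combustion turbines and one steam turbine.
cts = {"G1": 170.0, "G2": 170.0}
unit_caps = combined_cycle_unit_capacities(cts, steam_capacity_mw=200.0)
print(unit_caps)
```

The unit capacities sum to the plant total, so no capacity is lost or double counted by the split.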
In order to map combined cycle generators to combined cycle units, we must first map the steam turbines used in the bottoming cycle to the corresponding combustion turbines within the GEN18 sheet. These together form a unit. For this, we separate the combustion turbines and steam turbines in the generator data of the eGRID database. The mapping of steam turbines to combustion turbines involves two major steps. First, we check whether the total number of boilers within each plant is equal to the number of combustion turbines. This is part of the quality check: in ∼2% of cases, the total number of boilers reported under a plant is inconsistent with the actual number of combustion turbines using the boilers. After eliminating these cases, we calculate the similarity scores as discussed above, mapping the steam turbines to the corresponding combustion turbines. In the final step, we calculate a new total capacity at the unit level, which divides the nameplate capacity of the steam generators among their corresponding combustion turbines. We then repeat the process of mapping the combustion turbine GENIDs, with the new capacities calculated at the unit level, to the UNITIDs within the eGRID data using the similarity scores as described previously. This creates a complete mapping of generators to units within the eGRID data for both combined cycle units and other units.
Footnote 6. Each plant has units with individual exhaust stacks that emit waste gases/pollutants, where the CEMS data is recorded. These units are a combination of generators. All the generators produce electricity by converting mechanical energy through turbines to useful energy (i.e., electricity). The type of turbine/prime mover is driven by the type of fuel used to run the turbines. Throughout the article, 'prime mover' is used as the naming convention for turbine types.
The unit data from eGRID, including the nominal capacity values, is now combined with the CEMS data by ORIS code and UNITID. For instances where nominal capacity is still missing for a CEMS unit after this matching algorithm, we estimate the nominal capacity as the maximum generation of the unit within a given year. Missing values account for 5% of the total number of units across all the states.
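The join on ORIS code and UNITID, with the fallback to annual maximum generation, could look like the sketch below. It uses plain dictionaries and tuples in place of the real tables, and the record layout, function name, and sample values are assumptions made for illustration.

```python
def attach_nominal_capacity(cems_hourly, egrid_unit_capacity):
    """Return {(oris, unit_id): (capacity_mw, source)} for every unit in the CEMS records.

    cems_hourly: iterable of (oris, unit_id, gross_load_mw) hourly records for one year.
    egrid_unit_capacity: {(oris, unit_id): capacity_mw} from the eGRID unit mapping.
    """
    # Track the maximum hourly load observed for each unit during the year.
    max_load = {}
    for oris, unit_id, gross_load_mw in cems_hourly:
        key = (oris, unit_id)
        max_load[key] = max(max_load.get(key, 0.0), gross_load_mw)

    capacity = {}
    for key, observed_max in max_load.items():
        egrid_cap = egrid_unit_capacity.get(key)
        if egrid_cap is None:
            # Fall back to the annual maximum generation when eGRID has no value.
            capacity[key] = (observed_max, "cems_max")
        else:
            capacity[key] = (egrid_cap, "egrid")
    return capacity


# Hypothetical records: (ORIS code, UNITID, gross load in MW) for a few hours.
cems = [(9001, "CT01", 120.0), (9001, "CT01", 150.0), (9001, "CT02", 95.0)]
egrid = {(9001, "CT01"): 160.0}  # CT02 has no eGRID capacity in this example
result = attach_nominal_capacity(cems, egrid)
print(result)
```

Recording the source of each capacity value alongside the value makes it straightforward to report how many units relied on the fallback.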

Case 3. Multiple boiler units connected to a single generator
In this case, the total number of units within a plant is greater than the number of generators. This mostly occurs when multiple boilers, designated as units, are connected to a large steam turbine for electricity production. Assigning capacities based on heuristics would not guarantee a good mapping, as there could be cases with faulty IDs. Therefore, we filter out the 357 plants that have more units than generators, because a good mapping cannot be identified in those cases without manual intervention. Instead, for each of these units we use the annual maximum generation from the ARP data, which is a better representation of the distribution of generator load between the boilers. Overall, these cases constitute 1% of the total thermal power plant capacity in the US.

Description of key metrics used in the case studies
Loading fraction ('LF') in this study is the CEMS-reported level of generation load output divided by the nominal capacity obtained from eGRID (equation (1)). Following the matching described in section 2.1, it is now possible to estimate the LF values at both the individual unit level and the aggregate technology level (e.g., by prime mover type). Additionally, the annual capacity factor ('CF') is the total annual generation over the maximum generation (equation (2)). We compute total annual generation as the summation of the hourly generation from the CEMS data; hourly generation ('MWh') is the electricity produced (GLOAD) times the fraction of time the unit operated in any given hour (equation (3)). Generators do not always produce electricity throughout the hour, so the output is adjusted based on the fraction of the hour in which they operated.
where subscript u denotes the unit level; subscript h denotes the temporal scale (a given hour or a given year); P is the nominal capacity (MW); LF is the loading fraction; CF is the annual capacity factor; α is the fraction of the hour the unit is on; Load is the electricity generation (MWh); and MWh is the output in any given hour (MWh).
Figure 2. Assumption that in any combined cycle unit, the nominal capacities of steam turbines from the eGRID database will be equally distributed and added to the nominal capacities of the combustion turbines to estimate the total capacity of the combined cycle unit in the CEMS database.

For analyzing the operational behavior of power plants, individual units are grouped into the broader plant-level generator types. These include (i) combined cycle, (ii) open cycle gas turbines, (iii) steam turbines, and (iv) other types. In California, the analysis below is performed for combined cycle and open cycle gas turbines. Depending upon the loading fractions, the operational behavior of the units at the plant level is semi-heuristically characterized into four different buckets: start/stop, part load, near full load, and full load (table 3). The start/stop bucket covers any generator operating below its minimum loading requirement [33,34].
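The metrics defined in this section can be illustrated with a short sketch. Several points are assumptions rather than statements from the article: 'maximum generation' in the capacity factor is interpreted as nominal capacity times 8760 h, the 20% and 75% thresholds are taken from figures quoted elsewhere in the text, and the 90% boundary between near full load and full load is a placeholder because the table 3 criteria are not reproduced here.

```python
HOURS_PER_YEAR = 8760  # assumption: non-leap year


def hourly_mwh(gload_mw, op_time):
    """Hourly output: gross load (MW) times the fraction of the hour the unit ran."""
    return gload_mw * op_time


def loading_fraction(gload_mw, nominal_capacity_mw):
    """LF: CEMS-reported load divided by the eGRID nominal capacity."""
    return gload_mw / nominal_capacity_mw


def capacity_factor(hourly_records, nominal_capacity_mw):
    """Annual CF: total generation over the maximum possible generation.

    hourly_records: iterable of (gload_mw, op_time) pairs for one year.
    The denominator (capacity times 8760 h) is an interpretation, not a quoted formula.
    """
    total = sum(hourly_mwh(g, a) for g, a in hourly_records)
    return total / (nominal_capacity_mw * HOURS_PER_YEAR)


def loading_bucket(lf, start_stop_max=0.20, part_load_max=0.75, near_full_max=0.90):
    """Classify a loading fraction into one of the four buckets.

    near_full_max is a placeholder value chosen for this sketch only.
    """
    if lf < start_stop_max:
        return "start/stop"
    if lf < part_load_max:
        return "part load"
    if lf < near_full_max:
        return "near full load"
    return "full load"


# Hypothetical unit with a 200 MW nominal capacity.
print(hourly_mwh(50.0, 0.5))            # half an hour at 50 MW
print(loading_bucket(loading_fraction(100.0, 200.0)))
```

Keeping the thresholds as parameters lets the classification be rerun with the actual table 3 criteria once they are available.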
The efficiencies are calculated after classifying the units by generation technology, such as the combined cycle unit type in California (equation (4)), where MWh is the output in any given hour (MWh).
For all temporal aggregation levels, the total energy output and heat input are calculated first, and then the efficiency, as shown in equation (4).
The average emission intensity for each LF bucket is calculated by dividing total emissions by total electricity produced for combined cycle units and open cycle gas turbine units (equation (5)).
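The aggregate-then-divide order described above can be sketched as follows. Equations (4) and (5) are not reproduced in this text, so the formulas here are an interpretation: efficiency is taken as total electrical output over total heat input, with heat input assumed to be in MMBtu and converted using 1 MWh = 3.412142 MMBtu, and emission intensity as total CO2 over total generation per bucket. The record fields and values are hypothetical.

```python
MMBTU_PER_MWH = 3.412142  # energy conversion: 1 MWh = 3.412142 MMBtu


def aggregate_efficiency(records):
    """Total electrical output over total heat input, summed before dividing."""
    total_mwh = sum(r["mwh"] for r in records)
    total_heat_mmbtu = sum(r["heat_input_mmbtu"] for r in records)
    return total_mwh * MMBTU_PER_MWH / total_heat_mmbtu


def emission_intensity_by_bucket(records):
    """Total CO2 (tons) divided by total generation (MWh) for each LF bucket."""
    totals = {}
    for r in records:
        co2, mwh = totals.get(r["bucket"], (0.0, 0.0))
        totals[r["bucket"]] = (co2 + r["co2_tons"], mwh + r["mwh"])
    # Skip buckets with no generation to avoid dividing by zero.
    return {b: co2 / mwh for b, (co2, mwh) in totals.items() if mwh > 0}


# Hypothetical hourly records for one technology group.
records = [
    {"bucket": "part load", "mwh": 100.0, "heat_input_mmbtu": 800.0, "co2_tons": 45.0},
    {"bucket": "part load", "mwh": 100.0, "heat_input_mmbtu": 800.0, "co2_tons": 55.0},
    {"bucket": "start/stop", "mwh": 10.0, "heat_input_mmbtu": 150.0, "co2_tons": 9.0},
]
efficiency = aggregate_efficiency(records)
intensity = emission_intensity_by_bucket(records)
print(round(efficiency, 3), intensity)
```

Summing before dividing weights each hour by its output, which avoids low-output hours dominating an average of hourly ratios.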

Results
In this section, we demonstrate the significance of cross referencing eGRID data with CEMS data by illustrating the changes in generation behavior by prime mover type over time and the change in operational behavior of individual turbine units within a single plant. The results section is organized as follows: we first show a breakdown of changes in loading behavior of the generating plants over time for combined cycle and gas turbines, then we compare the efficiency changes between similar individual combined cycle plants during start and stop operations.

Data quality issues
Overall, the eGRID dataset comprises 10,964 plants with 26,709 units and 27,935 generators. Data quality issues arise in two places: first, when mapping generators to units within the eGRID dataset, and second, when combining the resultant dataset with the CEMS data. A summary of the total number of mappings considered for each case described in section 2.1 is provided in figure 3. Overall, 7% of the plants could not be mapped owing to data quality issues within the eGRID dataset between units and generators. The nameplate capacity of the plants falling under case 3 could not be evaluated; these plants account for 1% of the total plant capacity.
Around 65% of the units from eGRID could be mapped at the unit level to the UNITIDs of the ARP data. For the rest of the cases, the nameplate capacities were either unavailable or the maximum value of annual generation from the ARP data was greater than the estimated nameplate capacity. In those cases, the ARP's maximum generation was used. Apart from nameplate capacities, the combined data could indicate the unit turbine type, primary fuel, and other eGRID plant-level attributes for 95% of the cases, which could be useful for hourly generation analysis. From the combined data, we found that for ∼80% of the observations, the estimated capacity from eGRID and the maximum generation from ARP data were within 25% of each other.
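A consistency check of this kind could be computed as sketched below. The article does not state how 'within 25% of each other' is measured, so the relative difference here is taken against the eGRID estimate as an assumption, and the sample pairs are hypothetical.

```python
def within_tolerance(egrid_capacity_mw, max_generation_mw, tolerance=0.25):
    """True when the two values differ by at most `tolerance` of the eGRID estimate.

    Measuring the difference relative to the eGRID value is an assumption of this sketch.
    """
    return abs(max_generation_mw - egrid_capacity_mw) <= tolerance * egrid_capacity_mw


def share_within_tolerance(pairs, tolerance=0.25):
    """Fraction of (eGRID capacity, ARP maximum generation) pairs that agree."""
    checks = [within_tolerance(cap, gen, tolerance) for cap, gen in pairs]
    return sum(checks) / len(checks)


# Hypothetical (eGRID capacity, ARP annual maximum generation) pairs in MW.
pairs = [(100.0, 90.0), (100.0, 130.0), (200.0, 160.0), (50.0, 50.0)]
share = share_within_tolerance(pairs)
print(share)
```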
Overall, in this article, we hope to highlight the data quality issues, the possible range of applications with cleaner transferable data, and heuristics to be able to combine the data. If government agencies can resolve these issues, time and resources could be utilized towards meaningful applications of the data.

Change in plant loading fraction by hour of day
A combination of the CEMS data and the eGRID data allows us to estimate loading fractions in each operating hour at both the individual unit level and the aggregated generation technology level. Figure 4 shows the breakdown of the loading fractions of combined cycle units for 2011 and 2017 by hour of the day, along with the resultant emissions. Results show that between 2011 and 2017, the total generation by combined cycle units during starts/stops went down by 8.5%, while part-load generation went up by 15% and full-load generation by 10%. The change in emissions follows the same direction as total generation. Also, the emission intensity (ton/MWh) is largely unaffected for all load fractions except start/stops between 2011 and 2017. An 8.5% decrease in start/stop generation results in a 47% decrease in emissions and a 43% decrease in emissions intensity from start/stops (figure 4). The change in energy generation and emissions during start/stops and part load is concentrated during the afternoon hours, but the change in full load generation is mostly concentrated during the evening and early morning hours. Taken together, these observations illustrate a shift in CC units ramping midday in preparation for the subsequent ramp down of solar generation. This trend is further supported by changes in loading fractions at full load increasing in the late evening hours; observed LFs in this range increase by an average of 40% between 4 PM and 10 PM, and by an average of 115% between 11 PM and 6 AM, concurrent with a decreasing frequency of start/stops.
Table 3. Unit loading characterizations. Based on the loading fraction at which a unit operates in any given hour, it is categorized as start/stop, partial load, near full load, or full load. The criteria for categorizing the unit's operation is described in the
The decrease in emissions intensity during starts/stops could possibly be attributed to several factors, such as the installation of newer power plants designed for flexible operation, a switch from cold starts to hot starts due to better operations planning, or simply the definition used to separate start/stops from part load operations. Through this article, we hope to highlight the applications of such comprehensive data; further analysis of individual trends in detail is out of scope for this study.
Figure 5 shows the breakdown of the change in loading fractions of the open cycle gas turbine plants over the 24 h of a day. Metrics displayed in figure 5 demonstrate that between 2011 and 2017, the energy output by gas turbines increased across load fractions, by 2 TWh (74%) overall. The total generation spent on starts/stops increased by 97% and was spread over the day. Also, the delivery of part-load capacities increased by 120%. The delivery of full-load capacities occurs largely during the evening between 4 PM and 9 PM, and it increased by 384 GWh (40%). The annual capacity factors of the gas turbines remain almost constant at 55% during 2011 and 2017. The emissions follow a similar trend to total generation. Also, the change in emissions intensity across loading fractions is about 8%, unlike for combined cycle units. What is changing is the intra-day dynamics of the GT units, which primarily work towards providing flexibility when renewables are absent. Overall, these results show that the gas turbine plants increased their start/stop times and their part-load generation, but decreased their peak load generation in the hours before 3 PM.
Given the availability of annual data from CEMS, intra-day operation trends can be compared comprehensively over time. We present an illustration of this loading fraction analysis for combined cycle units and gas turbines in figure 6. From the figure, it can be inferred that combined cycle plants increased their generation at part load and full load from about 4:00 PM, and 50% of these units then ramp to near-full load from 5:00 PM until around 8:00 PM. The trend in start/stops shifted from morning hours before 10:00 AM to afternoon hours after 10:00 AM. For the gas turbines, the trend in generation is a more pronounced peak during the evening hours from 5 PM to 7 PM, when 60% of gas turbines are operating at near-full to full load capacities, 33% at partial load, and 7% in starts/stops. This figure clearly demonstrates the changing operational behavior of combined cycle plants and highlights the importance of accounting for relative loading levels in any assessment of plant operating trends.

Detailed unit-level analysis of sample combined cycle plants
Figure 7 shows the efficiency and CO2 emissions rate for two different CC units (ARP ORISPL numbers 260 and 358). The data show the efficiency changes and the resultant CO2 emissions of the actual units within each CC plant at different loading fractions. While these curves can be constructed using fundamental mechanical principles, the thickness of the cluster at different loading fractions could visually indicate the operational behavior of the CC units. Here, we examine both unit efficiency levels and emissions intensities. Each point on the scatter plot is the actual efficiency/CO2 emissions observed from the data set during the year 2017. The two plants considered in figure 7 have similar CC unit configurations: each plant comprises four combustion units and two steam turbines, with nominal capacities of 1,300 MW and 1,100 MW, respectively.
CC units in the plant with ORIS code 260 have a larger cluster of start/stops with loading fractions <20% and thus higher emissions than those in the plant with ORIS code 358. Furthermore, it can be inferred that because of the larger cluster of start/stops, plant 260 operates more often at part load, at loading fractions of 50%-75% of nominal capacity, than the plant with ORISPL code 358.
Thus, knowledge of the nominal capacities allows us to analyze changes in the operational behavior of power plant operators at both the unit level and fleet level, which are influenced by various parameters such as the level of variable renewable generation and seasonal fluctuations. Also, this allows us to examine individual unit-level behaviors to understand the relative changes in fleet operating characteristics within a state, or across different regions.

Discussion
Energy models support decision making, inform policy, and grapple with issues of uncertainty and forward-looking strategies for emissions reductions. A vast amount of data describing electricity grid characteristics both at the granular plant level and the fleet level is publicly available in the US. Combining comprehensive datasets such as CEMS and eGRID provides a foundation for generating invaluable insights about changes in grid operational behavior over time, and a basis for developing metrics to aid in retrospective analysis of policy changes. The results from this study can be used to expand our understanding of the interaction between the operational behavior of generating units and the variability of renewable generation. These trends are also likely influenced by seasonal and weather patterns. We leave these inquiries to future work.

Conclusion
In this paper, we presented a novel approach to combining two prominent power system databases, CEMS and eGRID. We then detailed a selected set of insights from cross referencing these datasets. We assessed key generator metrics at both the fleet and individual plant level in an effort to demonstrate the application of the resultant data to inform changes in trends and operational behaviors of thermal units. Overall, we observed that combined cycle plants in California between 2011 and 2017 increased their part load generation at 20%-75% of nominal capacity during the morning hours and their peak load generation during the evening hours. On the other hand, the gas turbines in 2017 increased both their start/stops and their part load generation at 20%-75% of nominal capacity. Meanwhile, they decreased their generation at full load during the afternoon hours and increased it during the evening hours. Also, the increase in generation from morning to evening hours is much steeper in recent years than the gradual change observed in 2011. Furthermore, for combined cycle units at the plant level, we analyzed two different cases in which the cluster of start/stop emissions of the plant with ORISPL code 260 was larger than that of the plant with ORISPL code 358. Also, the plant with ORISPL code 260 consistently operated at lower loading fractions, at about 70% of nominal capacity, alluding to larger start/stops and emissions. Further analysis can be done to identify location-based drivers and time-of-day-based drivers for these two similar plants that exhibit different operational behaviors.
These results, and the overall dataset, can be used for an array of power system analyses. This includes efforts to analyze: historical operational behavior for different grid mixes, response to renewable integration, response to policy changes, and other market-based influences. This study's data are an essential component of an integrated life cycle and cost assessment tool called SESAME. The analytical framework of SESAME encompasses the vast majority of the energy sector. Therefore, the tool can be used to conduct: (1) conventional pathway-level life cycle analysis (LCA) to study a specific technology or to comparatively assess different carbon mitigation pathways, and (2) system-level LCAs to study energy systems besides analyzing the impact of technology adoption rates, and interaction between different energy sectors [32].