A new hourly dataset for photovoltaic energy production for the continental USA

This new dataset is an ensemble of solar photovoltaic energy production simulations over the continental US. The simulations are carried out in three steps. First, a weather forecast system is used to predict incoming insolation; then, forecast ensembles with 21 members are generated using the Analog Ensemble technique; finally, each ensemble member is used to simulate 13 different solar panels. In total, there are 21 × 13 = 273 simulated scenarios. Simulations are carried out for the entire year 2019, with a temporal resolution of one hour and a spatial resolution of 12 km. The data support high-resolution spatio-temporal analysis of power production under different weather and engineering scenarios. The entire dataset is about 1 TB in size, but it is openly available and can be accessed by day and scenario. Details on how to access and use the dataset are provided in this article.


Table
Subject: Renewable Energy, Sustainability and the Environment
Specific subject area: Ensemble simulation of solar photovoltaic energy production to study generation efficiency and forecast uncertainty
Type of data: NetCDF

Value of the Data
• These new data provide an ensemble of power production simulations with high spatial and temporal resolutions. They can be used for a multitude of studies, from assessing the performance of forecasting systems to identifying the best locations for solar power generation. Simulations are generated with multiple solar panels, so together they form a comprehensive tool for studying power generation efficiency and simulation uncertainty.
• Institutions and researchers involved in renewable energy, specifically in the photovoltaic solar sector, can benefit from these data because they provide simulations for a wide range of weather and engineering scenarios. Weather and economic modelers can also benefit by deriving uncertainty from the ensemble simulations.
• The simulated power production data can be used in a variety of applications, including power potential assessment at a specific location. The simulation ensemble enables research on forecast uncertainty quantification, which is critical for grid operation and renewable penetration.
• The additional value of these data lies in their readiness and completeness. Generating the data is computationally intensive, but this dataset enables rapid assessment of solar power generation under various weather scenarios and panel configurations.

Data Description
This dataset contains hourly power production simulations for 2019 over the continental US (CONUS) at a 12 km spatial resolution. There are 21 members in the weather forecast ensemble and 13 solar panel modules, giving year-round power simulations for 273 different scenarios of weather and engineering conditions. Because of the high spatial and temporal resolution, the entire dataset is about 1 TB in size, but it can be accessed by day and scenario, with each file about 246 MB.
Fig. 1 shows the annual accumulated power production for 2019, calculated from the ensemble mean for the solar panel module SP128. The domain of interest is the CONUS, which is covered by a 12 km grid with 56,776 grid points. As expected, photovoltaic power production is high in the western and southwestern US. The Pacific Northwest shows lower production, especially over the coastal region, which is related to the high amount of cloud cover year-round.
Fig. 2 shows the hourly power production time series for 2019 for the solar panel module SP128, accumulated over CONUS. There are 21 forecast members (weather scenarios); the range of the ensemble is shown as grey shading and the ensemble median as a solid red line. Given the wide range of possible atmospheric states, the variability of the simulated power production is significant, especially in spring and summer. During the summer, the upper bound of the power production reaches a plateau and the ensemble median approaches the upper bound. This is because the solar panels are already performing at their maximum power level under very high solar irradiance.
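The ensemble range and median described for Fig. 2 can be computed hour by hour across the 21 members. The following is a minimal sketch that uses synthetic placeholder values rather than the dataset itself; the function name `ensemble_summary` and the in-memory list layout are assumptions made for illustration and do not reflect the structure of the NetCDF files.

```python
import random
import statistics

N_MEMBERS = 21  # ensemble size stated in the text
N_HOURS = 24    # one synthetic day; the real data cover all of 2019


def ensemble_summary(power):
    """Return per-hour (lower, median, upper) across ensemble members.

    `power` is a list of members, each a list of hourly values.
    """
    summary = []
    for hourly_values in zip(*power):  # one tuple per hour, across members
        summary.append(
            (min(hourly_values),
             statistics.median(hourly_values),
             max(hourly_values))
        )
    return summary


# Synthetic placeholder data, NOT values from the dataset.
rng = random.Random(0)
power = [[rng.uniform(0.0, 10.0) for _ in range(N_HOURS)]
         for _ in range(N_MEMBERS)]

summary = ensemble_summary(power)
lower, median, upper = summary[12]
print(f"hour 12: min={lower:.2f}, median={median:.2f}, max={upper:.2f}")
```

With real data, the same per-hour reduction would be applied to the member dimension of one scenario file after spatial accumulation.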
Figs. 1 and 2 show only one of the 13 simulated modules; all other modules can be analyzed similarly. While the general spatial patterns and temporal trends of power produc-
The dataset is hosted on a publicly available repository. It contains the following files:

Experimental Design, Materials and Methods
The dataset is generated with a three-stage workflow. Fig. 4 shows the workflow and its three components: the weather forecast system, the Analog Ensemble (AnEn) generation [2], and the power production simulation.
Weather Model
A weather forecast system provides the primary forcing for simulating power production, including estimates of solar irradiance, wind speed, and temperature. The model used for this dataset is the NAM forecast system [3].
Ensemble Generation
NAM is a deterministic weather model, which provides only a single state of the atmosphere. An additional process is carried out to generate the weather ensembles using the Parallel Analog Ensemble library [4,5]. AnEn is a technique for generating accurate and calibrated forecast ensembles from deterministic weather forecasts without running the model multiple times. It is therefore well suited for operational forecasts, where ensembles can be generated directly when the operational model is run. AnEn looks for similar historical weather forecasts and then uses the corresponding historical observations as the ensemble members. The predictors used by the AnEn are downward shortwave radiation flux, surface pressure, temperature at 2 m above ground, total cloud cover, wind speed, and wind direction. These parameters are pre-selected because they have a relatively high correlation with the final solar energy production and have been found to be useful predictors for solar energy forecasts [6,7]. The search period is 2017 and 2018, and 21 ensemble members are generated to account for sampling error [8]. Please refer to [2,4] for a detailed description of the AnEn and the open-source software used.
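The core idea of the AnEn, finding the most similar historical forecasts and using their paired observations as ensemble members, can be illustrated with a short sketch. This is a simplified illustration and not the Parallel Analog Ensemble implementation: it assumes a Euclidean distance over predictors scaled by their standard deviation, with no predictor weights and no temporal window. The function name `analog_ensemble` and all numbers are hypothetical.

```python
import math


def analog_ensemble(target, hist_forecasts, hist_observations, n_members):
    """Return observations paired with the forecasts most similar to `target`.

    target            -- predictor values of the current forecast
    hist_forecasts    -- past forecasts, each a list of predictor values
    hist_observations -- observed value paired with each past forecast
    """
    n_predictors = len(target)
    # Standard deviation of each predictor over the search period, so that
    # predictors with different units contribute on a comparable scale.
    scales = []
    for j in range(n_predictors):
        column = [f[j] for f in hist_forecasts]
        mean = sum(column) / len(column)
        std = math.sqrt(sum((v - mean) ** 2 for v in column) / len(column))
        scales.append(std if std > 0 else 1.0)

    def distance(forecast):
        return math.sqrt(sum(
            ((forecast[j] - target[j]) / scales[j]) ** 2
            for j in range(n_predictors)
        ))

    # Rank historical forecasts from most to least similar.
    ranked = sorted(range(len(hist_forecasts)),
                    key=lambda i: distance(hist_forecasts[i]))
    return [hist_observations[i] for i in ranked[:n_members]]


# Hypothetical predictors: [shortwave flux (W/m^2), total cloud cover (%)]
hist_forecasts = [[800, 10], [200, 90], [750, 20], [300, 80], [820, 5]]
hist_observations = [790, 180, 700, 310, 815]  # hypothetical observed irradiance
target = [790, 12]

members = analog_ensemble(target, hist_forecasts, hist_observations, n_members=3)
print(members)  # [790, 815, 700]
```

In the dataset, the same selection would be repeated at every grid point and forecast hour with 21 members drawn from the 2017 to 2018 search period.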
Power Simulation
Finally, weather forecasts are fed into a simulated power system to generate the power production estimates at each grid point. This process is carried out with the pvlib Python package [9]. pvlib is an open-source, community-supported tool that simulates the performance of photovoltaic energy systems. A 10 kW system is assumed at each model grid cell (12 km by 12 km), and the number of panels is therefore calculated as the desired system output divided by the nominal power output of a single panel. Global horizontal irradiance, air temperature, and wind speed are the inputs to the power system. Global horizontal irradiance is used to estimate the incident irradiance reaching the panels. Air temperature and wind speed are used to estimate the cell temperature, which affects panel efficiency. The decomposition of global horizontal irradiance is carried out using the DISC model [10]. The Hay and Davies 1980 model [11,12] is used to calculate the diffuse irradiance from the sky. Cell temperature is estimated with the Sandia Array Performance Model (SAPM) [13], and the power output is then estimated using the same model [13].
The three components are combined using the RADICAL Ensemble Toolkit (EnTK) [14]. EnTK is a Python library for developing and executing large-scale ensemble-based workflows. It allows a high level of flexibility in defining the workflow and simplifies the management of communication and jobs on a cluster. This dataset was generated on the Cheyenne supercomputer at the National Center for Atmospheric Research. Executing the entire workflow cost 18,675 core-hours. It is, however, possible to apply the same workflow to a historical time series of data collected on site.
For example, the Surface Radiation Budget Network (SURFRAD) [15], established in 1993 through the support of NOAA's Office of Global Programs, aims to provide high-quality ground measurements of variables related to the energy budget (e.g., global horizontal irradiance) and weather conditions (e.g., air temperature and wind speed). Another option is to use the historical weather records from the Automated Surface Observing System (ASOS), also maintained by NOAA. For this dataset, the rationale for using a weather forecast system is twofold: complete temporal and spatial coverage, and the ability to study the uncertainty of solar photovoltaic energy forecasts. SURFRAD has eight stations and ASOS has around 700 stations across the CONUS, so both provide only scattered local coverage. A weather forecast ensemble also provides a series of possible atmospheric states with which to test how sensitive the performance simulation is to forecast uncertainty.
Table 1 lists the 13 solar panel modules simulated in the dataset, with detailed information including the manufacturer, the year, and the panel efficiency. The modules are selected from the Sandia Module Database [9]. The database contains 523 modeled solar panel modules, but most of them share similar characteristics in size and power efficiency. We selected a subset of the modules by running a hierarchical clustering algorithm based on Area, Cells_in_Series, and MP (Maximum Power). We also excluded modules prior to 2008 to favor recent technological advances.
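The module selection step can be sketched as follows. The article does not state the linkage criterion, the feature scaling, or how a representative is picked from each cluster, so this sketch assumes standardized features, complete linkage, and a fixed number of clusters. The module records, the helper names `standardize` and `agglomerate`, and the choice of three clusters are all hypothetical.

```python
import math


def standardize(rows):
    """Scale each feature column to zero mean and unit variance."""
    cols = list(zip(*rows))
    means = [sum(c) / len(c) for c in cols]
    stds = [math.sqrt(sum((v - m) ** 2 for v in c) / len(c)) or 1.0
            for c, m in zip(cols, means)]
    return [[(v - m) / s for v, m, s in zip(row, means, stds)]
            for row in rows]


def agglomerate(points, n_clusters):
    """Merge the two closest clusters (complete linkage) until n_clusters remain."""
    clusters = [[i] for i in range(len(points))]

    def linkage(a, b):
        # Complete linkage: largest pairwise distance between two clusters.
        return max(math.dist(points[i], points[j]) for i in a for j in b)

    while len(clusters) > n_clusters:
        pairs = [(x, y) for x in range(len(clusters))
                 for y in range(x + 1, len(clusters))]
        x, y = min(pairs, key=lambda p: linkage(clusters[p[0]], clusters[p[1]]))
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]
    return clusters


# Hypothetical records: (name, year, Area, Cells_in_Series, MP)
modules = [
    ("A", 2006, 1.2, 60, 180.0),
    ("B", 2009, 1.3, 60, 200.0),
    ("C", 2010, 1.3, 60, 205.0),
    ("D", 2012, 1.6, 72, 300.0),
    ("E", 2013, 1.7, 72, 310.0),
    ("F", 2011, 0.7, 36, 90.0),
]

recent = [m for m in modules if m[1] >= 2008]   # drop modules prior to 2008
features = standardize([m[2:] for m in recent])  # Area, Cells_in_Series, MP
clusters = agglomerate(features, n_clusters=3)
groups = [[recent[i][0] for i in c] for c in clusters]
print(groups)  # [['B', 'C'], ['D', 'E'], ['F']]
```

One module per group would then be retained; the criterion for that final choice is left open here because the article does not specify it.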
An example of how to interact with and visualize the dataset can be found in Example.ipynb, available from the same data repository. Potential data users are encouraged to review the HTML page generated from Example.ipynb directly in the repository.

Ethics Statement
This work is supported by the EarthCube office, US National Science Foundation. Information on the funded project is available at https://www.nsf.gov/awardsearch/showAward?AWD_ID=1639707.