Simulation of long-term time series of solar photovoltaic power: is the ERA5-land reanalysis the next big step?

Modelling long time series of photovoltaic electricity generation in high temporal resolution using reanalysis data has become a commonly used alternative to assess the viability of systems with high shares of renewables, their risks of failure and probability of extreme events. While there is a considerable amount of literature evaluating the accuracy of the original solar radiation and temperature variables in these data sets, the validation of the calculated output of photovoltaic installations is scarce and usually limited to locations in Europe. This work combines the new ERA5-land reanalysis data set and PV_LIB to generate hourly time series of photovoltaic electricity generation for several years and validates the results using individual data of 57 large photovoltaic plants located in Chile. Results are also compared with PV output for these locations calculated using renewables.ninja, a platform relying on MERRA-2, a global reanalysis with five times lower spatial resolution. Accuracy and bias indicators are satisfactory for plants that do not present severe anomalies in their generation profiles and where basic plant characteristics such as size and orientation match our model assumptions. However, the improvements in indicators over results obtained with renewables.ninja from MERRA-2 are minor. The validation process serves not only to confirm the suitability of the proposed workflow to model the output of individual photovoltaic plants, but also to list and discuss data quality and availability issues. Efforts towards availability and standardization of data of individual installations are necessary to improve the basis for further developments.


Introduction
Solar photovoltaic (PV) systems play a major role in the renewable energy transition taking place around the globe. For example, in Germany, an early adopter of PV, 20% of installed electricity generation capacity, i.e. 45 GW, was PV in 2019 [1]. Newcomers to the renewable energy transition are catching up quickly. One example is Chile, where more than 2.6 GW of PV have been installed since 2013, representing 11% of the total installed electricity generation capacity. More than 2 GW are currently under construction, and environmental permits for an additional 17.7 GW had been granted by the beginning of 2020 [2].
There is considerable work on the forecasting of solar radiation and PV electricity generation, showing improvements in accuracy of up to two thirds in the last decade [3][4][5]. Many of these improvements are driven by the utilization of machine learning techniques that rely on high-quality reference data from well documented and monitored measurement stations. Three recent reviews [3][4][5] show that most new research is dedicated to forecasting exercises for the very short and short term and is predominantly performed for locations in North America and Europe. There is increasing attention to locations in Asia and Australia, but there is very little work on South America and Africa. Furthermore, the scientific literature in general tends to avoid reporting on the performance of models for locations and time horizons where models are known to underperform [3].
However, there is considerably less scientific work related to the simulation of long-term time series of PV generation, as necessary for modelling studies of the energy transition. Private companies such as solargis [6] and VAISALA [7] offer high-resolution time series of solar radiation and PV output estimations for particular locations, but at costs that would be difficult to cover for research projects interested in data for thousands of locations or in large-scale systems analysis. Pfenninger and Staffell [8] addressed this issue with the renewables.ninja platform, which allows users to freely generate long time series of hourly PV output for any location in the world. The underlying solar radiation data, the Modern-Era Retrospective Analysis for Research and Applications (MERRA) [9], the Modern-Era Retrospective Analysis for Research and Applications, version 2 (MERRA-2) [10] and the Surface Radiation Data Set - Heliosat (SARAH) [11], have been validated extensively in the literature and are in use in numerous studies, see e.g. [12][13][14]. There is also a growing body of literature making use of the renewables.ninja data, but the output of individual PV installations estimated there has only been validated by Pfenninger and Staffell [8] themselves. This restricts the geographical coverage of the validations to Europe. Similarly, version 5 of the PVGIS platform of the Joint Research Centre of the European Commission [15] allows users to generate hourly time series for up to 12 years (2005-2016) for locations in most parts of the world. The underlying solar radiation data sets, which include data derived from the Climate Monitoring Satellite Application Facility (CM-SAF) [16], the regional reanalysis COSMO-REA6 [17], the US National Solar Radiation Database (NSRDB) [18] and the global reanalysis ERA5 [19], have also been validated multiple times, and a classification of which data set works better for which locations is also provided [20,21].
In contrast to the solar radiation data used in PVGIS, to the best knowledge of the authors, validations of the derived time series of PV output of this platform have not been performed.
While satellite-derived solar radiation data were shown to be usually more accurate than reanalysis data sets of previous generations, the accuracy of state-of-the-art regional reanalysis data sets such as COSMO-REA6 is approaching that of their satellite-derived counterparts [21,22]. In particular, the latest global reanalysis of the European Centre for Medium-Range Weather Forecasts (ECMWF), ERA5, presents promising results compared to its global reanalysis predecessors. Urraca et al. [21] showed that ERA5 solar radiation data have an average bias on the global scale that is 50% to 75% lower compared to ERA-interim and MERRA-2. Huang et al. [23] found that ERA5 performs relatively robustly across Australia without notable deficiency and is more accurate than the Global Forecast System (GFS) and the Australian Community Climate and Earth-System Simulator (ACCESS). Trolliet et al. [24] compared 5 data sets including MERRA-2 and ERA5 for irradiance estimation in the tropical Atlantic Ocean and stated that, while the reanalysis data sets are not the best performing among all available data sets, ERA5 presented consistently higher correlations than MERRA-2, i.e. correlation coefficients greater than 0.85 for MERRA-2 and 0.89 for ERA5. Overestimation of global horizontal irradiation (GHI) by ERA5 has, however, been reported for 98 sites in China [25] and multiple locations in Norway [26]. Low performance of ERA5 was also reported for direct normal radiation at a location in Brazil in a comparison with 10 other solar radiation data sets, where ERA5 scored a root mean square error (RMSE) of 63.4% but none of the other data sets in the comparison reached better values than 37% [27].
However, concerning GHI, ERA5 performed in the midfield of the evaluated data sets for the same location [27], did not perform considerably worse than data from CAMS, CM-SAF and SARAH for a location in Methoni, in southwest Peloponnese, Greece [28], had a bias similar to that of satellite data for inland regions with few clouds globally [21], and presented promising results as a complement to satellite-based databases in regions not covered by geostationary satellites when simulating PV systems [20]. Atencio Espejo et al. [29] even present a comparison between PV power derived from ERA5 and PV_LIB and measured data for one location in Milano, Italy, for 2014-2016. They find correlations around 0.91 and a normalized RMSE around 11% when comparing the calculated against the measured PV output.
The positive accuracy results have already been used as justification in multiple studies that employ ERA5 data as input to calculate PV output time series. These studies include an assessment of synergies of solar PV and wind power potential in West Africa at hourly resolution [30], an assessment of on-site steady electricity generation from hybrid (PV, wind power and battery) renewable energy systems for the entire territory of Chile [31], and the mapping of degradation mechanisms and total degradation rates for a monocrystalline silicon PV module at the global scale [32]. However, none of these studies presented a validation of the PV output data.
Recently, the Copernicus Climate Change Service (C3S) launched the ERA5-land [33] data set of the ECMWF, derived from ERA5, with a spatial resolution of 9 km x 9 km, which is more than three times and five times higher than the resolution of ERA5 and MERRA-2, respectively. To the best of our knowledge, no validation of the radiation variables in this data set or of PV output derived from them has been performed so far. Considering the progress in the accuracy of radiation variables in the global reanalysis data sets, it could be expected that such a high-resolution data set allows the estimation of PV output time series that are considerably more accurate than calculations relying on previous global reanalysis generations. In order to test this, the present study proposes the calculation of PV output time series relying on ERA5-land data and the widely used PV_LIB library [34]. Moreover, it compares the estimations to renewables.ninja PV output data as well as to measured data from the locations of large PV installations in Chile. The comparison is performed using hourly, daily and monthly capacity factors as well as commonly used indicators such as the mean bias error (MBE), Pearson's correlation coefficient and RMSE. The selection of locations in Chile is related not only to the massive expansion of PV generation in this country but also to the fact that data of all large PV installations connected to the grid are open and freely available online through official sources. An additional particularity of this country is that most of its territory is not reached by the Meteosat Second Generation (MSG) geostationary satellite, which is the main source of most of the solar radiation data sets that are the usual benchmark for solar radiation reanalysis data sets and that are also in use in renewables.ninja and PVGIS.

Data sets and pre-processing
Three data sets are used to test the proposed hypothesis: the measured data from PV installations in Chile, the variables of the ERA5-land data necessary for PV output estimation, and PV output time series for the locations of the Chilean PV plants calculated with renewables.ninja. All of them are openly and freely available for academic use and have an hourly temporal resolution; their basic descriptions are summarized in Table 1. Details of the data and the necessary pre-processing are presented in sections 2.1.1-2.1.3.

Generation profiles of large PV installations in Chile
Chile started its transition to renewable energies recently, but at a fast pace [35]. Apart from regulations to support the deployment of RES and the establishment of energy efficiency measures that are developing rapidly (see e.g. [36] for an overview), the Chilean government also follows a policy of transparency in the energy sector that led to the commissioning of the Open Energy ("Energía Abierta") platform [37]. This web platform includes, among many other data, hourly data on marginal electricity costs, grid balances, electricity demand and the electricity generation of every single large generation plant connected to the grid. Monthly reports of hourly generation of all large-scale PV plants connected to the grid were downloaded for the period 2014-2018. The very first installations were supposed to go online in 2012; however, we only found generation observations from 2014 onwards. The output of the PV installations was merged with a spatially referenced data set of 103 PV installations that also included basic information about commissioning year and size in MWp. The matching was performed based on installation names in a semi-automatic fashion, supported by a fuzzy string matching function that uses the Levenshtein distance algorithm to calculate differences between strings [38]. This was necessary since the names of the installations in both data sets had small differences that did not allow a fully automated matching. After the matching and the exclusion of all installations with less than one year of generation data, a total of 57 installations remained available for validation. Hourly capacity factors were calculated by dividing the total output of the installation per hour by a maximum output value determined as the highest value within the 99th percentile of each installation. This served to correct differences between maximum observed generation values and the installed capacity reported in the data base.
These differences are on average only 5.9%, but several cases showed a significant difference of +/-30%, and two exceptional cases showed differences even beyond 150%. Additionally, all values beyond 1.1 times the 99th percentile were classified as outliers and excluded from further calculations. Similarly, only values larger than 0 were preserved for the validation in order to avoid the inclusion of periods where plants were entirely offline as well as night periods, when prediction is trivial. A summary of the main characteristics of the measured data is provided in Table A1 in the appendix. Additionally, the processed data set with the capacity factors is available under https://data.mendeley.com/datasets/6mkhck9t6x/draft?a=07b9492f-16ad-4126-a977-e5c8dd0308d7.
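As a rough illustration of the two pre-processing steps described above, the following Python sketch matches installation names by Levenshtein distance and then derives capacity factors from hourly output. It is a minimal sketch under stated assumptions, not the code used in this study: all function names and the second candidate name are hypothetical, the percentile uses the nearest-rank definition, and the 99th percentile is computed over positive values only.

```python
import math


def levenshtein(a, b):
    """Minimum number of single-character edits that turn string a into b."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def closest_name(name, candidates):
    """Candidate with the smallest case-insensitive edit distance to name."""
    return min(candidates, key=lambda c: levenshtein(name.lower(), c.lower()))


def percentile(values, q):
    """Nearest-rank percentile (q between 0 and 100) of a non-empty sequence."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q * len(ordered) / 100))
    return ordered[rank - 1]


def capacity_factors(hourly_output):
    """Scale output by its 99th percentile, dropping zeros and outliers.

    Assumption: the percentile is taken over positive values only.
    """
    positive = [v for v in hourly_output if v > 0]
    if not positive:
        return []
    p99 = percentile(positive, 99)
    # Values beyond 1.1 times the 99th percentile are treated as outliers.
    return [v / p99 for v in positive if v <= 1.1 * p99]


# Hypothetical candidate list; a human would still review each suggested match.
match = closest_name("Puerto Seco Solar", ["PUERTO SECO SOLAR PV", "Planta Ejemplo"])
```

In a semi-automatic workflow, the suggested match would be accepted or rejected by hand, which is why the sketch returns only the closest candidate rather than applying a distance threshold.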
Moreover, two subsets of installations were created after a manual screening of the time series of each plant. The first subset of data (S1) has a total of 23 PV installations and excludes installations with long periods of exceptionally low generation after the commissioning day (determined by the first record of the system).
An example is presented in Figure 1. We assume that such periods are related to some kind of failure or maintenance in the plant or to an erroneous data logging process that we are not able to account for in the model. These periods are also difficult to remove individually from the time series, and we therefore prefer to exclude the generation time series of the entire installation. The second subset of data (S2) excludes time series that present a pattern resembling a system configuration different from the one assumed for our simulation with ERA5-land data. In the present study we rely on a PV system configuration that aims at maximizing annual yield by using a panel orientation towards the equator (in this case towards North) and an inclination equal to the latitude. These conditions represent optimally installed PV systems without tracking [8,39]. The metadata on installations does not include information about the configuration of the PV systems (orientation, inclination, use of a tracking system, size of the inverter). We therefore manually selected installations whose time series do not present several hours of (close to) maximum generation per day during summer days with clear sky conditions. Time series with such plateaus reflect the generation of installations with some sort of tracking system, an undersized inverter or a fixed injection limit. Since it is not possible to determine with certainty which type of tracking system is responsible for the particular shapes of the time series, and since these would differ from the time series simulated with an optimal static configuration, we decided to also create a subset that excludes such installations. Figure 2 shows an example of one excluded installation. In this case, the PV system "Puerto Seco Solar" has a one-axis tracking system [40]. The resulting data set excluding time series of installations with such particularities is subset S2.

ERA5-land data and PV power output using PV_LIB
ERA5-land is defined by the ECMWF as an enhanced version of ERA5 for land applications [41]. Its main particularity compared to ERA5 is the spatial resolution of around 9 km, which is more than three times higher than that of ERA5 and more than five times higher than that of MERRA-2. The data will eventually cover the same time horizon as ERA5 (January 1950 to near real time); the period 2014-2018 was retrieved from the Copernicus Climate Data Store [33]. The data necessary for PV output modelling are solar irradiance, air temperature and wind speed. These parameters are derived from the ERA5-land variables listed and described in Table 2. Wind speed is calculated from the u and v components of wind at 10 m (u10 and v10) and adjusted to a height of one meter using a logarithmic vertical wind profile equation with a surface roughness length of 0.25, i.e. assuming terrain with scattered obstacles. Furthermore, the accumulated radiation values are transformed to hourly values by subtracting the previous values within each forecast horizon, i.e. in this case 24 hours starting at 00 UTC. The PV output is calculated using PV_LIB for Python [34]. This is a widely used open source toolbox created by the PV Performance Modeling Collaborative (PVPMC) of the Sandia National Laboratories and in continuous development since 2014. The reported users include Atencio Espejo et al. [29], who employ PV_LIB in combination with ERA5 data to predict the output of one installation in Italy with promising results. For this study it has been assumed that PV installations are oriented towards North and inclined at an angle equal to the latitude of the location. These characteristics approximate the necessary conditions for maximum output during a year without any tracking system. The conversion from horizontal irradiance to irradiance on an inclined surface requires an estimation of the direct normal irradiance (DNI) and diffuse horizontal irradiance (DHI) from the derived instantaneous surface solar radiation downwards (SSRD) in W m^-2.
These are obtained using the Erbs model as implemented in PV_LIB. GHI, DNI, DHI, air temperature and wind speed as well as technical details of technologies launched in 2014 are the inputs to the model. The selected PV panel, "Silevo_Triex_U300_Black", is part of the module data base of the Sandia National Laboratories and the selected inverter, "ABB__MICRO_0_3_I_OUTD_US_240_240V", belongs to the list of approved systems of the US Clean Energy Council. Among the systems in the data base available for PV_LIB, these resemble the state-of-the-art technology in 2014, the first year for which we have generation observations for PV installations in Chile. The resulting time series were adjusted for daylight saving time since it is present in the time series of observed generation.
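The weather pre-processing steps described above can be sketched in plain Python. This is an illustrative re-implementation under stated assumptions, not the code used in this study and not the PV_LIB implementation, which includes further safeguards: all function names are hypothetical, accumulated radiation is assumed to be given in J m^-2 for the 24 hourly steps of one forecast day, the extraterrestrial normal irradiance `dni_extra` is assumed to be supplied by the caller, and the clearness index is simply clamped to the range 0 to 1.

```python
import math


def wind_speed_at_height(u10, v10, target_height=1.0, ref_height=10.0,
                         roughness=0.25):
    """Wind speed from u/v components, scaled with a logarithmic profile."""
    speed_ref = math.hypot(u10, v10)
    return (speed_ref * math.log(target_height / roughness)
            / math.log(ref_height / roughness))


def deaccumulate(accumulated_j_m2, seconds_per_step=3600.0):
    """Accumulated energy per forecast step -> mean irradiance per step (W m^-2).

    Assumption: the list holds consecutive steps of ONE forecast horizon,
    so the accumulation restarts from zero before the first element.
    """
    hourly = []
    previous = 0.0
    for value in accumulated_j_m2:
        hourly.append((value - previous) / seconds_per_step)
        previous = value
    return hourly


def erbs_diffuse_fraction(kt):
    """Diffuse fraction as a function of clearness index kt (Erbs correlation)."""
    if kt <= 0.22:
        return 1.0 - 0.09 * kt
    if kt <= 0.8:
        return (0.9511 - 0.1604 * kt + 4.388 * kt ** 2
                - 16.638 * kt ** 3 + 12.336 * kt ** 4)
    return 0.165


def erbs_decompose(ghi, zenith_deg, dni_extra):
    """Split GHI into (DNI, DHI) for a given solar zenith angle in degrees."""
    cos_z = math.cos(math.radians(zenith_deg))
    if ghi <= 0.0 or cos_z <= 0.0:
        # Simplification: sun at or below the horizon, treat all light as diffuse.
        return 0.0, max(ghi, 0.0)
    kt = min(max(ghi / (dni_extra * cos_z), 0.0), 1.0)
    dhi = erbs_diffuse_fraction(kt) * ghi
    dni = (ghi - dhi) / cos_z
    return dni, dhi
```

For example, `wind_speed_at_height(3.0, 4.0)` scales a 5 m/s wind at 10 m by ln(4)/ln(40), and `erbs_decompose` always returns components that satisfy GHI = DHI + DNI cos(zenith) when the sun is above the horizon.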

PV output from renewables.ninja
The third data set was retrieved from renewables.ninja. Time series from 2014 to 2018 were calculated and downloaded for the location of each of the 57 available PV plants in Chile. The orientations and inclinations of the PV panels are the same as for the ERA5-land derived data set, the selected source for the weather variables is MERRA-2 (SARAH is not available for most of Chile), the selected capacity is 1.0 and no additional system loss is assumed. The authors of renewables.ninja provide their own simplified PV model with temperature-dependent panel efficiency that relies on global solar irradiance data, the BRL model for estimating the diffuse irradiance fraction and the ground temperature data at each location. Further details are available in [8] or directly on the renewables.ninja web page. An adjustment for daylight saving time was also performed.

Accuracy indicators for PV output time series
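To allow comparability with the results obtained for locations in Europe in [8], the comparison is performed for capacity factors. Three indicators commonly used in the solar and PV forecasting literature, MBE, RMSE and Pearson's correlation coefficient [3], defined in equations 1-3 respectively, are calculated for the hourly, daily and monthly values:

$$\mathrm{MBE} = \frac{1}{N} \sum_{t=1}^{N} \left( \hat{I}_t - I_t \right) \qquad (1)$$

$$\mathrm{RMSE} = \sqrt{\frac{1}{N} \sum_{t=1}^{N} \left( \hat{I}_t - I_t \right)^2} \qquad (2)$$

$$r = \frac{\sum_{t=1}^{N} \left( \hat{I}_t - \bar{\hat{I}} \right) \left( I_t - \bar{I} \right)}{\sqrt{\sum_{t=1}^{N} \left( \hat{I}_t - \bar{\hat{I}} \right)^2} \, \sqrt{\sum_{t=1}^{N} \left( I_t - \bar{I} \right)^2}} \qquad (3)$$

where $\hat{I}_t$ are the simulated and $I_t$ the measured capacity factors, $N$ is the number of time steps and overbars denote means over the $N$ time steps.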
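The three indicators named above can be computed with a few lines of Python. The sketch below uses the standard textbook definitions with the bias taken as simulated minus measured, which is an assumption about the sign convention; the function names are hypothetical.

```python
import math


def mbe(simulated, measured):
    """Mean bias error: average of (simulated - measured)."""
    return sum(s - m for s, m in zip(simulated, measured)) / len(measured)


def rmse(simulated, measured):
    """Root mean square error between simulated and measured values."""
    squared = sum((s - m) ** 2 for s, m in zip(simulated, measured))
    return math.sqrt(squared / len(measured))


def pearson(simulated, measured):
    """Pearson's correlation coefficient of two equally long sequences."""
    n = len(measured)
    mean_s = sum(simulated) / n
    mean_m = sum(measured) / n
    cov = sum((s - mean_s) * (m - mean_m) for s, m in zip(simulated, measured))
    var_s = sum((s - mean_s) ** 2 for s in simulated)
    var_m = sum((m - mean_m) ** 2 for m in measured)
    return cov / math.sqrt(var_s * var_m)


# Invented capacity factors, for illustration only.
simulated_cf = [0.1, 0.5, 0.9]
measured_cf = [0.2, 0.4, 0.8]
indicators = (mbe(simulated_cf, measured_cf),
              rmse(simulated_cf, measured_cf),
              pearson(simulated_cf, measured_cf))
```

A positive MBE under this convention means that the simulation overestimates the measured capacity factors on average.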
Additionally, for the subset S2 of PV installations, the indicators were also calculated for deseasonalized time series of the data sources. In order to do this, clear sky global horizontal irradiance profiles were calculated for each location using the Ineichen and Perez clear sky model [42], also available in PV_LIB. These were normalized to one by dividing the time series by the highest irradiance value.

All installations
Pearson's correlation coefficient and RMSE for all sets of installations and the hourly, daily and monthly capacity factors for the ERA5-land derived PV output as well as for the renewables.ninja PV output data are presented in Figure 3. When considering the data set with all installations, the correlation values for the hourly capacity factors are mainly around 0.8; there are some outliers below 0.6, and the best cases even exceed 0.95. These results deteriorate considerably for the daily and monthly values. In the case of RMSE for the hourly capacity factors, the most common results are around 0.2, and in general the distributions are very similar for the ERA5-land and the renewables.ninja PV output data sets. Concerning the MBE for the hourly capacity factors, most installations are around 0.0 in the case of the ERA5-land derived PV output data and around -0.05 for the renewables.ninja data. However, as in the case of the correlations, the distribution of the MBE values is very similar between the two compared data sets. The distributions of the MBE for the daily and monthly capacity factors for both data sets are analogous to the ones obtained for the hourly capacity factors (see Figure B1 in the appendix).

Subsets of installations
The full set of installations includes numerous installations with characteristics and events that cannot be simulated by the models. The statistics for subsets S1 and S2 show that results improve considerably when the characteristics of the installations actually match the simulated systems (see Figure 3). Correlations increase, RMSEs decrease, and the MBE has values closer to 0. Also, the spread of results decreases. For the subset S2 of PV installations and the ERA5-land derived data, the correlation values of the hourly capacity factors range between 0.88 and 0.97, the MBEs are -0.02 -0.05, and the RMSEs are around 0.12. Similarly, the renewables.ninja data show a correlation spread between 0.86 and 0.96, and the MBE and RMSE present the same maximum and minimum values as the ERA5-land derived data. The statistics are only slightly better for the ERA5-land derived data. Furthermore, for this subset of PV installations, the monthly correlations improve considerably, with all values above 0.84 for the ERA5-land derived data and above 0.82 for renewables.ninja. Figure 5 shows an example of the time series of capacity factors from both simulated and measured data. In that case, the ERA5-land data reproduce the time series for the 7th of June significantly better. However, even in this period they are not entirely accurate. For the other 6 days the results are mixed. In some cases renewables.ninja data are closer to the measured data than the ERA5-land derived PV output. Over the long time series the differences compensate, and the indicators tend to be similar for both compared data sets. When comparing the indicators for the deseasonalized data of subset S2 (see Figure 6), daily and monthly correlations improve, whereas hourly ones deteriorate.
In the latter case, the medians for both compared data sets have values below 0.8, showing that there is still considerable need for improvement of the reanalysis data sets to reproduce the intra-daily variability of the weather variables relevant to PV generation. Figure 6 and Figure B2 in the appendix also allow a direct comparison of the results of ERA5-land derived and renewables.ninja PV output, showing that the medians of the ERA5-land derived data have higher correlations, lower MBE and lower RMSE independent of the time horizon. However, the differences are small and may not significantly improve the results of modelling studies using these time series as input.

Aggregated results
The indicators for the aggregated values have been calculated using the time series since February 2016, which is the month after the last of the installations in subset S2 starts having records. These indicators are presented in Table 3 and show that spatial aggregation reduces errors. The correlations for the aggregated time series are equal to or larger than the ones for the individual installations in all cases. Even the deseasonalized time series reach correlations for hourly capacity factors of 0.88 for the ERA5-land derived PV output and 0.839 for the renewables.ninja data. MBEs are, in all cases but the monthly renewables.ninja data, below 0.05, and RMSEs always remain below 0.01. The results for the aggregated values corroborate the results for individual installations. Time series of PV output calculated with ERA5-land data and PV_LIB are consistently better than the ones calculated with renewables.ninja, but the difference is minimal.
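A toy example with invented numbers may help to see why spatial aggregation tends to reduce errors: simulation errors of opposite sign at individual plants partly cancel in the aggregate. The sketch assumes an unweighted mean over plants, which is an assumption of this illustration and not a description of how the aggregation for Table 3 was performed; the function names are hypothetical.

```python
import math


def rmse(simulated, measured):
    """Root mean square error between two equally long sequences."""
    squared = sum((s - m) ** 2 for s, m in zip(simulated, measured))
    return math.sqrt(squared / len(measured))


def aggregate(series_by_plant):
    """Unweighted mean capacity factor across plants for each time step."""
    return [sum(step) / len(step) for step in zip(*series_by_plant)]


# Two hypothetical plants, two time steps each (invented capacity factors).
measured = [[0.5, 0.6], [0.4, 0.7]]
simulated = [[0.6, 0.5], [0.3, 0.8]]

# Each plant is off by 0.1 at every step, but with opposite signs.
plant_rmse = [rmse(s, m) for s, m in zip(simulated, measured)]
aggregate_rmse = rmse(aggregate(simulated), aggregate(measured))
```

In this constructed case each plant has an RMSE of about 0.1 while the aggregate error vanishes up to floating-point rounding; with real data the cancellation is only partial.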

Discussion
The RMSE values for all stations are in most cases worse than the RMSE values presented in [8] for hourly capacity factors and European installations, which present a median around 0.1. For the daily and monthly capacity factors, the results improve, and the majority of results are below 0.15 for both data sets being compared. Nevertheless, these remain worse than the results presented in [8] for European installations. In contrast, the RMSE values of subset S2 are very similar to the ones presented in [8] for installations in Europe for both compared data sets and both hourly and daily values. They even achieve the level of accuracy of the PV output calculated with SARAH irradiance data for installations in Europe. Similarly, the results obtained with subset S2 are comparable to or better than those obtained by Atencio Espejo et al. [29]. While these authors show a correlation for hourly values for a location in Italy that reaches 0.91 when using ERA5 data and PV_LIB, the time series of five out of six installations in subset S2 simulated using ERA5-land data and PV_LIB reach correlations equal to or higher than this value.
A clear outcome of the validation exercise presented here is the need for more and better data for validation and for the development of forecasting models. Using standard assumptions about PV system configurations may lead to the generation of PV output profiles that are considerably different from the actual ones, since differences in e.g. tracking system type or orientation already have a large impact on the estimated output of the systems. This might sound trivial, but it is a known issue of data bases of PV installations in Europe, where the information is either not available or contains errors [8], an experience that is now replicated in energy transition newcomer countries such as Chile and Brazil. These issues are relatively easy to correct, at least for new installations that will be integrated in the data bases, but awareness of the importance of data requirements is necessary. We emphasize here that the open energy data initiative "Energía Abierta" in Chile goes beyond its official European counterparts in terms of the provision of data. Comparable, entirely open, hourly PV output data at the individual installation level for a whole country, as provided by Chile, are only available for Brazil to the best of our knowledge, but not for European countries or the US. Making the data available, however, is an important step towards a geographical redistribution of research in the field of PV output forecasting and simulation.
Furthermore, while most of the progress in the PV output forecasting field for the very short and short term in the last decade came from artificial intelligence based procedures, the quality of the measured long time series of PV output has to improve in order to be a suitable input for e.g. machine learning applications. Keeping good records of measured data is challenging: Müller et al. [43], for example, report on a well monitored data set of the Fraunhofer ISE with more than 300 PV installations, of which only 38 have data that are good enough for comparisons of long-term yield predictions. At the very least, making the effort of keeping metadata on the operational status of installations would make a considerable contribution to the field. To give a simple example, with such metadata our subset S1 would have had 57 instead of 23 installations, since we would have had certainty about which hourly values are suitable for comparison. In machine learning based approaches, such metadata on the hourly values would bring certainty about which data are worth using for learning and validation.
Finally, this study is a contribution to the evaluation of suitability of global reanalysis for the modelling of the output of PV installations but a global validation would require global efforts. Results are, however, mixed: a new generation of reanalysis with several times higher spatial resolution has not generated a significant step forward in the prediction of PV output at individual installation level.
It should also be taken into account that there are uncertainties in the models which transform horizontal irradiance into irradiance on an inclined surface as well as in the technical PV and inverter models. However, the main parameters which determine the hourly variability of PV output are still weather variables.
Following the results obtained with the deseasonalized data, there is still considerable room for improvement of the reanalysis data sets. Moreover, the selection of optimal input weather data and of PV and inverter models can only be improved if there are improvements in the reference data for a wide range of locations. Similarly, generalization of results at the global scale can only be achieved if such improvements, validation exercises and development of forecasting methods are also made for locations distant from the typical hot spots of research. Beyond the necessity of open availability of data and standardization of data warehousing procedures by official institutions, the scientific community can contribute here by making available not only their own data sets but also reference data that have been gathered from data bases that are only available in locally known repositories. A platform such as Open Power System Data [44] is a positive start, but efforts have to be made to make data for locations outside the European borders also available in an open and usable way. This, however, relies on original data providers using open data licenses. Furthermore, metadata about basic characteristics such as orientation, inclination, size of the inverter and tracking type of the existing PV systems play a key role in PV output estimation but are usually missing. There is progress in terms of data availability at the individual plant level, coming from transparency and open data initiatives in newcomers to the energy transition such as Chile and Brazil, but standardization of metadata and additional data about the operational status of the installations will be necessary for further modelling developments. The improvements that have been made in the very short and short term forecasting of solar radiation and PV output by using machine learning techniques will only be possible for the long-term simulation of time series if input data quality is improved.
While most of the responsibility for data availability might lie with governmental institutions, scientists might start contributing to simplifying such validation exercises and avoiding repetition of work by also making available the data of individual installations that have been used in their research.

Conclusions
Appendix A: