SOLETE, a 15-month long holistic dataset including: Meteorology, co-located wind and solar PV power from Denmark with various resolutions

The aim of the SOLETE dataset is to support researchers in the meteorological, solar and wind power forecasting fields. Particularly, co-located wind and solar installations have gained relevance due to the rise of hybrid power plants and systems. The dataset has been recorded in SYSLAB, a laboratory for distributed energy resources located in Denmark. A meteorological station, an 11 kW wind turbine and a 10 kW PV array have been used to record measurements, transferred to a central server. The dataset includes 15 months of measurements, from 1 June 2018 to 1 September 2019, covering: timestamp, air temperature, relative humidity, pressure, wind speed, wind direction, global horizontal irradiance, plane of array irradiance, and active power recorded from both the wind turbine and the PV inverter. The data was recorded at a 1 Hz sampling rate and averaged over 5 min and hourly intervals. In addition, there are three Python source code files accompanying the data file. RunMe.py is a code example for importing the data. MLForecasting.py is a self-contained example of how to use the data to build physics-informed machine learning models for solar PV power forecasting. Functions.py contains utility functions used by the other two.


Value of the Data
• This data is useful for meteorological, renewable energy, and big data studies, as it is a collection of recordings of both atmospheric conditions and energy production.
• The resolution and length of SOLETE make it particularly useful for the forecasting community in both the meteorological and power fields, especially for those working in machine learning (ML) and big data.
• The dataset was originally developed for one- to three-day-ahead forecasting of solar and wind power using data-driven methods such as ML. The ongoing discussion in the renewable energy community on whether to forecast meteorological metrics or power directly requires honest comparisons; this dataset and the related publications can serve as a baseline.
• The dataset is complemented by a Git repository [2] including three Python scripts that employ only open-access libraries. The first script simply imports the data, while the second showcases the methodology discussed in [3] and [4] to build physics-informed ML models for solar power forecasting. This resource is particularly useful for students and researchers starting out in machine learning-based forecasting.
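As a flavour of the repository's first script, the following is a minimal sketch of importing the data with pandas. The csv layout and the `TIMESTAMP` column name are assumptions for illustration; RunMe.py in the repository [2] contains the actual import code.

```python
# Minimal sketch of loading the SOLETE data (assumed csv layout).
# The "TIMESTAMP" column name is an illustrative assumption;
# see RunMe.py in the repository [2] for the real import.
import pandas as pd

def load_solete(path_or_buffer) -> pd.DataFrame:
    """Read the data and index it chronologically by timestamp."""
    df = pd.read_csv(path_or_buffer, parse_dates=["TIMESTAMP"])
    return df.set_index("TIMESTAMP").sort_index()
```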

Data Description
The dataset [1] and its related repository [2] contain the data file and three Python scripts:
• RunMe.py A minimal code example showing how to import the data.
• MLForecasting.py This file is directly related to the dataset's twin publications [3,4]. It is a self-contained example of how to build a physics-informed solar power forecaster based on different off-the-shelf ML algorithms. First, it imports the dataset and expands it according to the methodology proposed in both papers. Then, it generates the training, validation, and testing sets. Subsequently, the model is trained, tested, and its results presented using root mean squared error (RMSE) as the evaluation metric. The user can choose from five different ML methods: random forest, support vector machine, and three types of artificial neural networks. Each of them has a dictionary defining its configuration, which can be adapted to suit the user's preferences.
• Functions.py Utility functions used by the other two scripts.
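The workflow MLForecasting.py follows, i.e. split chronologically, train, and report RMSE, can be sketched as below. This is a hedged illustration, not the script itself: the configuration dictionary values, the 80/20 split, and the synthetic features are assumptions, with a random forest standing in for the five selectable methods.

```python
# Illustrative sketch of the MLForecasting.py workflow (assumed settings).
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# A configuration dictionary the user could adapt, as in the script;
# these particular values are illustrative assumptions.
RF_CONFIG = {"n_estimators": 100, "max_depth": 8, "random_state": 0}

def train_and_score(X, y):
    """Train a random forest on a chronological split and return it with its test RMSE."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, shuffle=False)  # no shuffling: keep time order
    model = RandomForestRegressor(**RF_CONFIG).fit(X_train, y_train)
    rmse = mean_squared_error(y_test, model.predict(X_test)) ** 0.5
    return model, rmse
```

Swapping the estimator and its dictionary (e.g. a support vector machine or a neural network) reproduces the "choose from five methods" structure the script offers.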

Experimental Design, Materials and Methods
The pyranometers, the humidity and temperature probe, the W200P wind vane, and the WindSensor P2546A-OPR cup anemometer constitute the meteorological mast (MM). It is located less than 10 m from the PV array, directly on the roof where the SMA Sunny Tripower 10000TL inverter sits, approximately 6 m above the ground. However, the Gaia wind turbine (WT) is placed roughly 230 m away, which implies a certain displacement between the wind speed recorded at the anemometer and at the turbine. Lastly, the barometer is located 1180 m and 970 m from the MM and the WT, respectively. These distances are depicted in Fig. 1. Each device records and samples differently, with a minimum resolution of 1 Hz. The recordings are collected in different nodes over the SYSLAB topology. There, they are minimally preprocessed: the recorded signals are interpreted into actual SI units and timestamped. Later, the data from all the nodes is transmitted to a central log.
The data has been retrieved from a DTU server via an FTP client, where there is one csv file per day and node. The compiled metrics are distributed over five different nodes; hence, the raw data has been imported and cleaned of spurious strings and other meaningless characters embedded in the original csv files. Then, the data has been time-aligned using the timestamps and averaged with a fixed window to obtain the desired resolutions.
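The fixed-window averaging step above can be sketched with pandas resampling. The column name and index construction are illustrative assumptions; only the 5 min and hourly mean windows come from the text.

```python
# Sketch of the post-processing step: average time-aligned 1 Hz records
# over fixed windows to obtain the 5 min and hourly resolutions.
# The column name is an illustrative assumption.
import pandas as pd

def average_records(df: pd.DataFrame) -> tuple[pd.DataFrame, pd.DataFrame]:
    """Return 5-minute and hourly mean-aggregated copies of a 1 Hz frame."""
    df = df.sort_index()  # time-align on the timestamp index
    return df.resample("5min").mean(), df.resample("1h").mean()
```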