Optimal Selection of Weather Stations for Electric Load Forecasting

For the proper planning and operation of any electric power system, it is essential to have a reliable software tool that allows for accurate short-term forecasting of electric power consumption. Temperature is one of the most important drivers of electric energy consumption and selecting the appropriate weather station or combination of weather stations is crucial for improving the forecasting accuracy. In this study, we propose an algorithm to determine the optimal selection of weather stations based on (i) an initial reduction in problem complexity, which drastically decreases the computational cost, and (ii) a posterior incremental search algorithm. The proposed methodology, which identifies the optimal number of weather locations to be used and their specific geographical locations, is exhaustively tested using three insightful case studies using real data from the Spanish mainland electric power grid and two different weather forecast databases, denoting a significant reduction in forecasting errors. The developed method is currently in use by the Spanish transmission system operator (Red Eléctrica de Espa na, REE) to make hourly forecasts.


I. INTRODUCTION
Short-term load forecasting plays an important role in all aspects of the planning, operation, and control of real-world electric power systems. Accurate forecasts are essential for reliable operation of power networks and for optimizing the use of both natural and renewable resources. Short-term load forecasting has been a very active area of research over the last three decades and a significant number of studies have been devoted to this area, addressing the problem from different mathematical perspectives ( [1], [2]): regression analysis, artificial neural networks, fuzzy logic, and time-series models, among others.
Because weather is a major driving factor of electric demand, the vast majority of load forecasting models include information on meteorological variables such as temperature, humidity and solar radiation, among others. Specifically, the hourly load demand exhibits a strong non-linear relationship with the ambient temperature; in winter, electricity demand The associate editor coordinating the review of this manuscript and approving it for publication was Jethro Browell . increases due to heating devices, decreases for spring or autumn, and increases for summer due to air-conditioning equipment. The correlation temperature-demand is so strong that some works exploit this relationship to detect anomalous temperature data [3]. For large networks, a unique weather station is not sufficient to represent the ambient meteorological conditions of the entire area [4].
The load-forecasting software used by any Transmission System Operator (TSO) considers the meteorological information from one or more weather stations located throughout the corresponding power network. It is common practice to integrate ambient temperature information from several locations in the mathematical model by computing the average value [5], [6] or using a selection based on experience or population statistics. In fact, the temperature information weighted by load and economy is not necessarily better than the equally-weighted average [7].
To the best of our knowledge, minimal research effort has been devoted to the creation of novel methodologies to appropriately select weather locations to increase forecasting accuracy (as stated in [8]). In recent years, some studies have shown that an enhanced model for temperature information can be obtained using advanced methodologies. In reference [9], several methods have been proposed and compared, such as linear and exponential combination, geometric mean, MAPE-based combination, genetic algorithms as in [10], and two-fold combination, among others. Fan et al. [4] proposed an alternative approach: make use of multiregional load forecasting to create smaller systems with higher temperatureload relationships. A wavelet-squared coherence approach was proposed in [11]. Moreno-Carbonell et al. [10] proposed the use of genetic algorithms to analyze data from the Global Energy Forecasting Competition of 2012 and 2014. Ziel and Liu [12] addressed the problem of weather station selection by analyzing the goodness of fit to a cubic regression. In [8], two main questions were addressed: (i) how many weather stations should be used and (ii) which ones should be considered. The mathematical technique of singular value decomposition was employed in [13], and Nedellee et al. [14] proposed a station selection method based on minimizing a V-fold cross-validation criterion. Some of those studies are devoted to modeling the effect of COVID pandemic [15]. Recently, the work [11] presents a new method for selection and combination of weather stations based on Wavelet Squared Coherence that seeks to improve the performance of forecast models for the electric load.
In this paper, we propose a weather station selection framework for electric load forecasting that aims to optimally select the number and location of temperature information to increase prediction accuracy. The proposed algorithm has been adapted to the Spanish TSO electric load forecasting software to identify the optimal number of weather locations to be used and their specific geographical locations. It has been observed that the obtained optimal selection outperforms the previous model [5], [6].

A. CONTRIBUTION
This contribution is twofold. First, we propose an algorithm to determine the optimal selection of weather stations (identifying both the optimal number and location of temperature information) based on (i) an initial reduction of problem complexity by computing a demand time series without holiday effects, which drastically decreases the computational cost, and (ii) a posterior incremental search algorithm. Second, the developed methodology is exhaustively tested using several case studies and two databases. The developed model is currently in use by the Spanish TSO on a daily basis to compute load forecasts.

B. PAPER STRUCTURE
The remainder of this paper is organized as follows. Section II explains the forecasting model used in this study. In Section III, we propose an algorithm for optimally choosing a set of weather stations. Section IV provides three insightful case studies using actual data from the Spanish mainland system. Finally, Section V provides relevant conclusions.

II. FORECASTING MODEL
This section briefly presents the mathematical model for short-term electric load forecasting currently used by the Spanish transmission system operator. Further details can be found in [6].

A. REG-ARIMA MODEL
The dynamic behavior of hourly electric power consumption exhibits both daily and weekly seasonality. In addition, the mathematical properties (both mean and covariance) of this time series are highly influenced by the hour of the day.
To address this complex model, in [6] is proposed to model the electric load by means of 24 hourly seasonal Reg-ARIMA models, as follows: where the logarithm of the electric consumption for hour h of day t (variable y t,h ) is computed in (1) ; the regular and seasonal auto-regressive polynomials (φ h (B) and h B 7 , respectively); and the regular and seasonal moving-average polynomials (ϑ h (B) and h B 7 , respectively). The term w t,h represents the independent zero-mean Gaussian-distributed errors with variance σ 2 h . The Box-Jenkins methodology [16] has been applied to estimate the above-mentioned mathematical model.

B. MODEL FOR THE TEMPERATURE
The thermic effect has a significant influence on electric energy consumption [5], [17], [18]. An accurate modelling of this complex effect must be taken into consideration: (i) the relationship between temperature and load varies throughout the year; (ii) the spatial dimension should be included in the mathematical model for large countries; and (iii) the thermic effect may have a non-instantaneous repercussion because the temperature of a particular day may have a significant influence over the following dates.
In the forecasting mathematical model employed in this study, the non-linear relationship between temperature and consumption has been modeled using the regression-spline procedure [19]. The temperature regressor is computed using the average daily maximum temperature for the ten most representative locations throughout the Spanish mainland. The temperature range is divided into smaller ranges and a polynomial is fitted for each section, ensuring smooths joints at the beginning and end of each period. The mathematical procedure [19] ensures that the function as well as its first and second derivatives are continuous.
The non-instantaneous impact of the thermic effect on energy consumption is modeled by including an additional regressor with the temperature information corresponding to the previous days. Specifically, four-day persistence has been included in the model: the temperature of day t can affect the demand of days t, t + 1, t+2, and t + 3. The thermic information is included in vector x t in (1).

C. MODEL FOR SPECIAL DAYS
During the year, some weekdays are declared public holidays (non-working days) for political, cultural, or religious reasons. In the Spanish working calendar, there are six national public holidays: 1 st -Jan., 1 st -May, 15 th -Aug., 1 st -Nov., 6 th -Dec., and 25 th -Dec.. Additionally, a few more days are declared as regional holidays for a specific territory or city. The decrement in consumption for any special day also depends on the day of the week on which it occurs. It should be noted that the existence of a holiday in the middle of the week (e.g., on Tuesdays or Thursdays), in general, provokes a ''long weekend effect,'' influencing the consumption of adjacent days. The load profile for the aforementioned special days is significantly different from regular working days, and load forecasts usually have larger errors.
The peculiarities of each type of special day have been quantified by analyzing the information collected over several years. Specifically, it is required an historic equal or larger than eleven years, to ensure that the database contains each national public holidays in each of the days of the week (i.e., the database must contain -for example-a 'May 1 st ' on Sunday, on Monday,. . . and on Saturday). The 'patterns of behavior' for each special day is identified by the method in the estimation phase, and are used next in the forecasting phase to predict consumption's variations for future holidays. Information on special days is included in (1) using vector z t .

III. OPTIMAL SELECTION OF WEATHER STATIONS
As detailed in Section B, one of the key factors for improving forecasting accuracy is the appropriate modelling of the temperature information. During the last few decades [5], [6], the Spanish TSO has employed the data of the temperature-related regressors obtained by computing the average value of the daily maximum of the ten most-representative cities in the Spanish mainland. Figure 1 depicts the non-linear relationship between daily demand and maximum daily temperature for working days, for the ten most-representative cities in Spain. From Figure 1, it can be observed that the relationship is tighter in some locations (such as Barcelona or Madrid), and the U shape is not well-defined in other weather stations (such as Bilbao or Oviedo). These ten plots lead us to wonder whether there is an optimal subset of these locations with a higher accuracy performance than including all of them on average. In this section, an algorithm is proposed for selecting the optimal set of weather stations to be included in the model.

A. METHOD 1: EXHAUSTIVE ANALYSIS
To determine the optimal selection of weather stations, a basic procedure consists of an exhaustive analysis of all possible combinations. With respect to the Spanish electric power system, in the last decades, the TSO traditionally makes use of the temperature information for ten cities [5], [6]. Thus, if a subset of K weather stations is selected out of ten, the number of possible dispositions is the binomial coefficient ''10 choose K,'' computed as nchoosek (10, Because the optimal value for parameter K is not known a priori, all possible values for K (from one to ten) should be explored. Consequently, the total number of possible combinations to be explored is computed as Once all possible combinations have been analyzed individually, the optimal selection of weather stations is the one with the lowest outof-sample forecasting errors.

B. COMPUTATIONAL EFFORT AND FEASIBILITY
The computational effort required to estimate the forecasting model detailed in Section A depends on the technical specifications of the computer used. In this work, a desktop PC with a multi-threading octa-core i7-6700 3.41 GHz Intel processor and 32 GB of RAM has been used, taking advantage of parallel computation techniques (which increased the parameter estimation speed by approximately 700%). The CPU time required to estimate the 24-hourly Reg-ARIMA models and the subsequent evaluation of forecasting performance using an out-of-sample one year test period is approximately 150-180 minutes to complete.
From a practical perspective, testing all possible combinations is not feasible; it would require more than 110 consecutive days of continuous CPU operation. In this study, a methodology to reduce the computational effort is proposed.
As detailed in Section II-A, each hourly Reg-ARIMA model employs 284 regressors: 16 regressors for temperature and 268 regressors to model special day effects. The associated 284 parameters are estimated using maximum likelihood [16], employing the implemented estimate function of MATLAB R2017a, Econometrics Toolbox [20], which is based on solving a large non-linear optimization problem with constraints, with more than 4300 datapoints. The estimation of a single hourly model takes approximately 40-50 minutes to complete. A significant computational effort is required for the estimation of parameters related to holidays and special days.
Because the main purpose of this study is to analyze the temperature location effect on demand, the effect of special days is secondary. Therefore, the estimation of the parameters related to special days constitutes a significant computational VOLUME 11, 2023 FIGURE 1. Scatterplot denoting the non-linear relationship between demand of working days (Y axis, in GWh) and temperature (X axis, in • C).
cost, which represents no profit for this study. Because special days can neither be eliminated nor disregarded, in this work, we propose a technique to drastically reduce the computational effort without compromising the accuracy of the obtained results.

C. MODIFIED DEMAND TIME SERIES WITHOUT THE HOLIDAYS EFFECT
To reduce the computational effort, we propose estimating model (1)-(2) using the electric load values as the response variable after removing the special days effect. For illustrative purposes, Figure 2 depicts the Spanish mainland hourly power demand during December 3 rd -9 th , 2018. In Spain, both December 6 th and 8 th are two public holidays (Constitution Day and Immaculate Conception Day, respectively). As it can be observed, there is a significant reduction of power demand from Thursday to Saturday, due to national holiday on Thursday and the so-called 'long weekend' effect. The affected days do not follow the weekly seasonal pattern, and consequently, appropriate regressors must be included in (1)-(2) to model this particular behavior owing to special days.
Once model (1)-(2) has been estimated, the parameter vector β h allows quantifying the effect of each special day over the hourly demand. Thus, since the impact of the public holidays December 6 th and 8 th over the demand has been estimated, this effect can be removed from the original observed demand, leading to a 'synthetic' demand without the influence of the holidays (orange line plotted in Figure 2). It should be noted that: (i) the obtained ''demand without holidays effect'' shows a clear weekly seasonal cyclic behavior, without the sporadic disruption of public holidays; (ii) this demand does contain the unaltered effect of temperature; and (iii) this time series can now be analyzed with model (1)-(2) without the need to include the term β T h z t , because no special days are present any more. Additionally, because the effect of special days is not present in this modified demand time series, the requirement of 11 years of history (see Section C) is no longer required. Consequently, the computational cost of model estimation has been drastically reduced because the number of regressors has been decreased from 284 to 16 (96% reduction), and the historic data can be reduced from 11 years to 1-2 years (85% reduction).
The proposed algorithm works as follows: Step 1) Solve model (1)-(2), estimating the Reg-ARIMA parameters and parameter vectors β h and α h . The estimated values of vector β h are stored in vectorβ h .
Step 2) Compute the demand without holidays effect (denoted as y ′ t,h ), as the difference between the observed demand and the special-day-effect reduction.
Step 3) Estimate the following model (for any specific selection of weather stations): After comprehensive testing, it has been observed that the computational effort required to estimate the 24 Reg-ARIMA models is approximately 10 minutes. Thus, an exhaustive analysis of the 1023 possible combinations would require approximately 170 hours (7.1 days), which is now a feasible amount of time.

D. INCREMENTAL SEARCH
The main disadvantage of the exhaustive analysis proposed in Section A is that the computational effort increases exponentially as the number of available weather stations increases. For instance, if the number of available weather locations is 15 (instead of 10), the required CPU time is 227 days (instead of 7 days). Thus, it is impractical to analyze all possible combinations if the number of weather stations is higher. To overcome this limitation, an alternative algorithm is proposed.
To determine the best selection of temperature locations, we propose a procedure that involves progressively adding the location that causes the lowest prediction error. The proposed algorithm is as follows: Step 1) Start the counter k← 0 and the set 0 ← ∅ Step 2) Update the counter: k ← k + 1 Step 3) Maintaining the set of selected weather stations k−1 , compute the out-of-sample forecasting errors of all possible combinations of including a new weather station, using the model (4)-(5). The number of possible combinations is K − k + 1.
Step 4) The weather station which minimizes the forecasting error in Step 3) is included in set k .
Step 5) If k < K , go to Step 2). Otherwise, the algorithm ends. Once the above algorithm is concluded, the optimal set of weather stations is the one which exhibits the lowest out-ofsample forecasting error.
The total number of evaluations is K k=1 K − k + 1 = (K 2 + K )/2, providing a computational cost drastically lower than that of the exhaustive analysis proposed in Section A, which is based on a brute-force algorithm. The reduction in computational effort is computed using the following equation: For instance, if the number of available temperature locations is 10, the reduction in computational effort is 95%. If the number of available stations is 20, the reduction in computational effort is 99.98%. Figure 3 provides the number of evaluations for Methods 1 and 2, and the reduction in computational cost of the incremental search compared with the exhaustive analysis.

E. DECREMENTAL SEARCH
An alternative algorithm can be created by substituting incremental search with decremental search. The computational effort is equivalent. Note that the resulting final selection may not coincide with the selection obtained one using incremental search. Consequently, this decremental method can be used to either check the robustness of the incremental method solution or to obtain an alternative local optimal solution.
The decremental search algorithms works as follows: Step 1) Start the decremental counter k← K and the full set 0 ← {1, 2, . . . , K } Step 2) Update the decremental counter k ← k − 1 Step 3) Maintaining the set of selected weather stations k−1 , compute the forecasting errors of all possible combinations of removing a single weather station, using the model (4)- (5). Note that the number of possible combinations is K − k + 1.
Step 4) The weather station whose removal minimizes the forecasting error in Step 3) is eliminated from set k .
Step 5) If k > 0, go to Step 2). Otherwise, the algorithm ends. Once the above algorithm is concluded, the optimal set of weather stations is the one which exhibits the lowest out-ofsample forecasting errors.
A flowchart of the proposed algorithms (incremental and decremental) and the methodology used to reduce the CPU time are shown in Figure 4.

IV. CASE STUDY
Following the methodology described in Section III, the performance of the proposed methodology has been tested using VOLUME 11, 2023  real data from the Spanish mainland electric power system. As in [6], the metric error employed in this work corresponds to the Root Mean Squared Percentual Error (RMSPE), defined as: · 100 (7) where values y d,h and y * d,h are the observed and forecasted hourly Spanish demands at the h-th hour of the d-th day, respectively, and value n d is the number of days considered in the study. Forecasts y * d,h are computed using the estimated model (1)-(2).

A. FIRST CASE STUDY: ANALYSIS OF 2018
The time period considered for estimating the parameters in model (1)-(2) is the 12-year period from 2006 to 2017. Such a large estimation period is required to estimate the effects on all special days. Once the complete model is estimated, the computational effort technique described in Section C is applied, reducing the estimation period for model (4)-(5): from June 2016 to Dec. 2017. The electric data have been provided by the Spanish TSO. This information is open-source and can be publicly accessed through: https://demanda.ree.es/demanda.html. Concerning the tem-perature model, weather information has been provided by AEMET(Spanish State Meteorological Agency). To evaluate the forecasting performance of each configuration, the Reg-ARIMA model is employed to compute one day-ahead forecasts from January 1, 2018, to December 31, 2018.

1) CASE A: EXHAUSTIVE ANALYSIS
The extensive analysis algorithm proposed in Section A has been applied to estimate the model (4)-(5) for the 1023 possible selections and computing the out-of-sample forecasting for the prediction period. The global RMSPE error for each possible weather station selection is plotted in the boxplot of Figure 5 (left plot), as a function of the number of considered temperature locations. Note that the reference model corresponds to the model with ten temperature locations (the RMSPE is equal to 1.36%). The right plot in Figure 5 provides the hourly forecasting errors for all possible combinations; the line color indicates the number of weather stations considered.
From Figure 5 (left plot), it can be observed that: (i) none of the possible combinations with one or two weather stations can improve the forecasting accuracy; (ii) some configurations considering from three to nine weather stations can reduce the prediction errors; (iii) in particular, more than 25%  of the combinations with n = 8 and n = 9 are more accurate than the reference model with n = 10.
From Figure 5 (right plot), it can be observed that (i) the forecasting error during night hours is not significantly influenced by the weather station selection employed, whereas (ii) the prediction error from 12 noon until 12 midnight highly depends on the selected temperature locations. Table 1 provides the forecasting errors (RMSPE column) of the reference model (n = 10) and the 20 configurations with higher prediction accuracy. The first column indicates the number of weather locations considered, and columns from 2 nd -11 th specifies the weather locations included in the mathematical model (in the following order: Barcelona, Bilbao, Cáceres, Madrid, Málaga, Murcia, Oviedo, Sevilla, Valencia, and Zaragoza). The last two columns provide the out-of-sample forecasting error and the position out of 1023 (sorted by decreasing accuracy).
From Table 1, it can be observed that Barcelona, Madrid, and Sevilla appear in (almost) all scenarios, whereas Murcia or Bilbao rarely appear, and Zaragoza, Málaga, and Cáceres appear in a percentages of 65%, 60%, and 55%, respectively. Note that the accuracy of the reference model (n = 10) is in position 79 th (out of 1023 combinations).

2) CASE B: INCREMENTAL SEARCH AND DECREMENTAL SEARCH
The algorithms described in Sections D and E are applied. Table 2 indicates the iteration number (first column), location considered at the k-th iteration (second column), and forecasting error (third column). The optimal location is denoted by a gray-colored row (k = 5 in both cases).
From Table 2 it is observed that (i) in both cases, the optimal selection includes the same locations; (ii) as the iteration process continues, the forecasting error is reduced as it approaches the value k = 5; (iii) the required CPU time has been 95.5% lower than method 1 (exhaustive analysis), since the number of required evaluations have been 45; and (iv) the optimal selection obtained using the three methods has been exactly the same.

B. SECOND CASE STUDY: ANALYSIS OF 2019
Once the set of weather locations has been optimally selected (using data up to 2018), the forecasting accuracy has been  tested using data from 2019. The complete model (1)- (2) has been used, and the traditional model with 10 locations (labeled as ''reference'') is compared with the model using the optimal set of weather stations. The forecasting accuracy of each method has been quantified using hourly/monthly RMSPE (see Table 3 and Figure 6).
From Table 3 and Figure 6, it can be observed that: • The optimal location model provides significantly higher forecasting accuracy, reducing the hourly error up to 7%.
• Specifically, greater improvements in accuracy are achieved during hours from 2 PM to 12 midnight. During these hours, the percentage of improvement ranges between 4.0% and 7.4%, with an average of 5.5%. The hourly behavior of percentage of improvement is reasonable, because the highest increments in accuracy are achieved during sunny hours with the highest influence of temperature. • During the period between 3 AM and 11 AM, the improvement can be disregarded, because it is lower than 1%.
• The highest accuracy increments are obtained during summer months (June, July, and August) and February to March.

C. THIRD CASE STUDY: ECMWF DATABASE
In this case study, an alternative database of temperature forecasts is used. The database is generated by the European Centre for Medium Range Weather Forecasts (ECMWF) [21], and the processed file contains the maximum daily temperature for the 38 cities, corresponding to the provincial capital cities in the Spanish mainland. It should be noted that the database source is different (thus, the forecasted weather information may be slightly different), and 28 temperature locations have been added. Consequently, the resulting optimal selection is expected to differ from that obtained in Section A. The exhaustive analysis algorithm cannot be applied here because the number of possible combinations is higher than 274 billion configurations (which would lead to a computation time greater than five million years). Therefore, the incremental search method has been applied. After running the incremental search algorithm indicated in Section D, Table 4 provides the selected location at each step of the iterative process, whereas Figure 7 shows the RMSPE of each configuration analyzed at each iteration.
From Table 4 and Figure 7, it can be observed that: (i) seven locations out of 38 is the number of weather stations considered at the optimal selection that provides the best forecasting accuracy; (ii) the geographical location of the optimal selection using the ECMWF database has some common patterns with the optimal selection obtained using the AEMET database.
Finally, Figure 8 shows the hourly forecasting error of the following configurations: optimal selection (labeled as ''7 loc''), the reference configuration (labeled as ''10 loc''), and the configuration using just a single weather station (the Spanish's main city, Madrid, labeled as ''1 loc''). From this figure, it can be observed that an appropriate selection of weather stations influences the hours between 9 AM and midnight the most, whereas the forecasting accuracy between 1 AM and 8 AM is not significantly affected. The average improvement is 2.6%, with an average improvement of 5% between 2 PM and 8 PM, both included.

V. CONCLUSION
Temperature is one of the most important drivers of electric load, and selecting the right station or combination of stations is crucial for reducing the forecasting error. In this study, we propose an algorithm that determines the optimal selection of weather locations based on (i) an initial reduction in problem complexity which decreases drastically the computation effort using a simplified demand time series without the holiday effect, and (ii) a posterior incremental/decremental search algorithm. The developed methods not only determine the optimal number of temperature regressors to be used but also identify their optimal geographical location. This study employs the model which is currently in use by the Spanish TSO to make hourly forecasts up to 10 days ahead.
The proposed methodology has been exhaustively tested using three insightful case studies using real data from the Spanish mainland electric power network. Two different weather forecast databases have been used: the Spanish meteorology agency AEMET and the European weather forecast agency ECMWF. The global forecast accuracy increased by approximately 2-3%, denoting a higher enhancement between 2 PM and 8 PM (average increment of accuracy between 5.0-5.5%) Future work will focus on analyzing the effect of optimizing the location of additional weather variables such as solar irradiance and/or humidity. He has been a Professor of statistics with Universidad Politécnica de Madrid, since 2001. He has more than 25 years of experience in the solution of associated power system problems, ranging from reliability analysis, the development of planning tools or production, price, and demand prediction. Along with Eduardo Caro, he have developed the hourly electricity demand forecasting model used by Red Eléctrica de España in the Spanish electricity systems. He has published several theoretical and methodological papers in the area of statistics. His research interest includes the application of statistics to engineering problems.
SHADI NOUHITEHRANI received the bachelor's and master's degrees in industrial engineer (industrial management and organization) from Islamic Azad University, Iran, in 2015 and 2018, respectively.
Her research interests include electric load and electricity price forecasting.