Solar Radiation Forecasting Based on the Hybrid CNN-CatBoost Model

The renewable energy industry is expanding rapidly because of environmental pollution from fossil fuels and continued price hikes. In particular, the solar energy sector accounts for about 48.7% of renewable energy, the highest production ratio among renewable sources. Because solar power is affected by weather and climate change, climate prediction is essential. However, solar radiation, the variable most closely related to solar power, is not currently predicted by the Korea Meteorological Administration; therefore, solar radiation prediction technology is needed. In this study, we predict solar radiation using extra-atmospheric solar radiation and three weather variables: temperature, relative humidity, and total cloud volume. In previous work, we compared the performance of single machine-learning and deep-learning models: boosting techniques, such as extreme gradient boosting and categorical boosting (CatBoost), and the recurrent neural network (RNN) family (long short-term memory and gated recurrent units). In this paper, we compare CatBoost (previously the best model) with a convolutional neural network (CNN) and present a CNN-CatBoost hybrid prediction method that combines the two to improve on the best single-model performance. In addition, we checked the change in accuracy when wind speed and precipitation were added to the hybrid model. The model that considers wind speed and precipitation improved at all but three (Gangneung, Suwon, and Cheongju) of the 18 locations.


I. INTRODUCTION
The fossil fuel-based energy supply system has low sustainability due to price volatility, limited fuel reserves, and environmental problems, spurring sharp growth in the renewable energy industry. In addition, several countries, including Germany and Australia, have already achieved grid parity, as fossil fuel prices are soaring and technology development is lowering renewable energy production costs. While the added fossil fuel capacity decreased from 64 GW in 2019 to 60 GW in 2020 [1], the produced renewable energy was 261 GW worldwide, and solar power generation increased to 127 GW, the most among renewable energy facilities [2]. Solar power generation occupies a high proportion of new and renewable energy due to its infinite resources, ease of installation, and eco-friendly characteristics that do not emit noise or pollutants. These advantages are expected to increase the proportion of solar energy generation further. Solar photovoltaic (PV) systems are a primary renewable energy source and are simply panels that convert sunlight into electricity. However, solar power generation requires advanced prediction technology because the energy supply is unstable under the influence of the weather. The output of PV is highly dependent on solar irradiance, solar radiation, temperature, and various weather variables. Predicting solar radiation means that the output of PV is predicted one or more steps ahead of time [3]. Therefore, various weather variables are used to predict solar radiation accurately.
The associate editor coordinating the review of this manuscript and approving it for publication was Rosalia Maglietta.
In addition, previous studies have used various weather variables to predict solar radiation. Alluhaidah et al. examined studies using various weather variables to identify the root mean square error (RMSE) and mean absolute percentage error (MAPE) and revealed that cloud cover, humidity, and temperature contribute the most to prediction [4]. Kwon et al. attempted to predict the global horizontal irradiance (GHI) using the temperature, relative humidity, dewpoint, and sky-coverage values [5]. Kisi et al. attempted to predict solar radiation in the Antakya and Adana areas using the lowest temperature, highest temperature, wind speed, relative humidity, and sunshine hours using an artificial neural network (ANN) and the extreme learning machine [6]. McCandless et al. used cloud cover, dewpoint temperature, categorical precipitation in the last hour (1 = precipitation did not occur), and a more accurate k-nearest neighbors cluster model with six meteorological parameters [7]. Wojtkiewicz et al. used weather variables and cloud cover as exogenous variables to predict solar radiation and fit the long short-term memory (LSTM) network and gated recurrent unit (GRU) models, confirming that LSTM provides a predicted performance of 23.79% based on the MAPE [8]. Qing and Niu used such variables as temperature, dewpoint, humidity, visibility, wind speed, and weather type (13 types of weather) to argue that the data fit the LSTM model, which performs much better than the other benchmark models [9].
In addition to using assorted variables, techniques for solar radiation prediction have also been studied. Research on single machine-learning models has actively been conducted [10], [11], [12], [13]. For example, Yadav and Behera applied the recurrent neural network (RNN) and wavelet transform by adding variables, such as temperature, humidity, wind speed, wind direction, dewpoint temperature, and pressure, to predict solar radiation values. The wavelet transform technique was excellent regarding the mean absolute error (MAE) at 9.62% and RMSE at 14.96% [10]. Kim fitted the autoregressive integrated moving average (ARIMA), ARIMA exogenous (ARIMAX), and multiple regression models and proposed the multiple regression model, which had an accuracy of 0.1553 based on the MAE [11]. Fan et al. applied the support vector machine, M5 model tree, random forest (RF), extreme gradient boosting (XGBoost), and categorical boosting (CatBoost) models using the data from three stations in humid subtropical China. Comprehensively considering prediction accuracy, generalizability, and computational efficiency, they found CatBoost to be the best model for developing general models [12]. Pang et al. determined that solar radiation prediction using the RNN model has higher accuracy than the ANN model [13].
Recently, a hybrid model combining the two models was also developed [13], [14], [15], [16]. Some studies on the hybrid model have combined single machine-learning models. Ghimire et al. introduced convolutional neural network (CNN)-LSTM techniques combining the CNN, LSTM, deep neural network, decision tree, and multilayer perceptron. When the CNN-LSTM was used for prediction for a month, the results confirmed that the MAE (%) value was superior at 13.131 [14]. Agga et al. evaluated two hybrid models (CNN-LSTM and convolutional LSTM) that incorporate an LSTM layer using two types of datasets (univariate and multivariate, with weather features, such as wind speed, temperature, humidity, and cloud cover) [15]. The LSTM method is used as a baseline to evaluate the performance and efficiency of the models. Both hybrid models predicted the one-day-ahead power output well, using only the single-variable dataset with MAE values of 5.04 and 5.18 for the convolutional LSTM and CNN-LSTM models, respectively [15].
Lai et al. used a hybrid model applied to various previous studies, such as LSTM, GRU, CNN-LSTM, CNN-GRU, and the CNN with bidirectional LSTM (BiLSTM). The CNN-BiLSTM model outperformed other models in univariate and multivariate predictions in terms of the MAE [16]. In addition, Gala et al. used a single machine model, such as support vector regression (SVR), gradient boosted regression (GBR), and RF regression, and a hybrid SVR-GBR-RFR model, as a weighted linear combination of the SVR, GBR, and RFR outputs [17].
In previous studies, machine-learning models, including RNN and boosting models, have been used as single models, whereas research on hybrid models has been largely limited to combinations of the CNN and RNN. In our earlier work, we demonstrated the performance of the CatBoost model by comparing a time-series model (ARIMA), RNN series models (LSTM, GRU, and simple RNN), and boosting series models (XGBoost and CatBoost) [18]. This study expands on that work to propose an accurate solar radiation prediction technique. Using basic weather variables, such as temperature, relative humidity, solar radiation, and total cloud volume, we applied a hybrid model combining the CNN with CatBoost and examined the results of adding two variables: precipitation and wind speed. This work proposes the rarely used CNN-CatBoost technique and shows that it gives the best performance. This work contributes to predicting solar power generation in the future.
Section II describes the single machine-learning and hybrid models used as prediction techniques. Next, Section III discusses the solar insolation and meteorological variables for model training and fit, and Section IV details the performance after fitting the model. Finally, Section V proposes that the hybrid model is superior to a single model and that these models should be actively studied for accurate solar radiation and solar power generation prediction.

II. METHODOLOGY
This section describes the RNN series models (LSTM and GRU), boosting series models (XGBoost and CatBoost), and a CNN model. The hybrid model combines the CNN and CatBoost models, and the results are compared by varying the number of convolutional layers.
VOLUME 11, 2023
A. UNBIASED BOOSTING WITH CATEGORICAL FEATURES
The CatBoost model is an improved version of the gradient-boosting decision tree algorithm that can handle categorical features well. This algorithm has two main advantages: (1) dealing with categorical features during training time instead of preprocessing time and (2) using a new but more efficient schema to calculate leaf values during tree structure selection, reducing overfitting. A random permutation of the dataset is performed, and for each example, the average label value is computed over the examples with the same category value that are placed before the given example in the permutation [19]. Let $\sigma = [\sigma_1, \cdots, \sigma_n]_n^T$ be a permutation; the categorical value $x_{\sigma_p,k}$ is then substituted as follows [20]:
$$ x_{\sigma_p,k} \leftarrow \frac{\sum_{j=1}^{p-1} [x_{\sigma_j,k} = x_{\sigma_p,k}] \, Y_{\sigma_j} + a \cdot P}{\sum_{j=1}^{p-1} [x_{\sigma_j,k} = x_{\sigma_p,k}] + a} $$
where $[\cdot]$ equals 1 when the condition holds and 0 otherwise, $Y_{\sigma_j}$ is the label of example $\sigma_j$, $P$ is a prior value, and the parameter $a$ is the weight of the prior. For regression tasks, the standard technique for calculating the prior is to average the label value in the dataset. Another advantage of the CatBoost model is that oblivious trees are used as base predictors. In such trees, the same splitting criterion is used for an entire tree level [21]. The primary property of CatBoost is that the feature sample permutations maintain the diversity of the coupled inputs and prevent overfitting. The average values are classified into the same category and converted into numerical values. This stage copes with noisy low-frequency categories. The feature combination passes through greedy subtree splitting for terms that the initial trees do not consider in the first generation [22].
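As an illustration of the permutation-based encoding described above, the following is a minimal sketch, not the CatBoost implementation. The function name, the weather-type categories, and the label values are hypothetical and chosen only for the example; the prior is taken as the mean label, as stated for regression tasks.

```python
def ordered_target_statistic(categories, labels, permutation, prior, a=1.0):
    """Encode one categorical column with a permutation-ordered target statistic.

    For each example, only examples that appear earlier in the permutation
    and share its category contribute to the encoded value.
    """
    sums = {}    # running label sum per category
    counts = {}  # running example count per category
    encoded = [0.0] * len(categories)
    for idx in permutation:
        cat = categories[idx]
        s = sums.get(cat, 0.0)
        c = counts.get(cat, 0)
        # (sum of earlier labels + a * prior) / (number of earlier examples + a)
        encoded[idx] = (s + a * prior) / (c + a)
        sums[cat] = s + labels[idx]
        counts[cat] = c + 1
    return encoded


# Hypothetical toy data: a weather-type category and a radiation label.
cats = ["clear", "cloudy", "clear", "clear", "cloudy"]
y = [2.0, 0.5, 1.0, 3.0, 1.5]
prior = sum(y) / len(y)      # mean label used as the prior
perm = [0, 1, 2, 3, 4]       # identity permutation, for readability only
enc = ordered_target_statistic(cats, y, perm, prior, a=1.0)
```

The first example of each category receives the prior itself, because no earlier example shares its category; later examples move toward the running category mean.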

B. CONVOLUTIONAL NEURAL NETWORK
The CNN is the most successful deep-learning algorithm for extracting image features at a good resolution by assigning weights [23]. These features become complex at a coarser resolution as the network becomes deeper. The CNN architecture can be divided into three layers. The initial layer is convolutional and extracts features by convolving images with weights known as kernels, which are randomly initialized. The kernel slides over the image with a certain stride value, extracting low-level features, such as shapes and edges, in the initial layer. After applying these kernels, the final output at each layer is known as a feature map.
To extract distinct features with various weights, we can vary the number of kernels according to the model requirements. More convolutional layers help extract high-level features [24], and the CNN is an effective technology for automatic feature extraction and has achieved remarkable success in image vision. Moreover, the CNN has a strong potential for dealing with time series, such as automatic speech recognition and wind speed forecasting [25]. The design of a CNN is determined by the types and number of layers it comprises, such as convolutional layers, pooling layers, and fully connected layers, and is inspired by the genetic structure of the visual cortex, which has configurations of simple and complicated cells [26]. Initially, input data are entered into the input layer to process the model for feature transformation. Then, features are extracted in the convolutional and pooling layers. Afterward, the extracted information from the convolutional and pooling layers is assimilated using the fully connected layers. Finally, the result is communicated through the output layer.
Each convolutional layer is intended to extract spatial patterns from the target variable (i.e., GHI) and its related input variables (i.e., meteorological data and historical GHI values), as follows:
$$ h_{ij}^{k} = f\left( (W^{k} \times x)_{ij} + b_{k} \right) $$
where $f$ represents an activation function, $W^{k}$ denotes the kernel weight of the $k$-th feature map, $x$ is the input, $b_{k}$ is the bias term, and $\times$ indicates the convolution operator. In the present work, the rectified linear unit (ReLU) activation function is used [27]:
$$ f(x) = \max(0, x) $$
In this study, CNN models with one and two convolutional layers were fitted, and their accuracy was checked. This model was combined with CatBoost to generate the proposed model.
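The convolution followed by the activation function can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions: the function names, kernel values, and input values are hypothetical, the convolution uses no padding and a stride of one, and, as in most deep-learning libraries, the kernel is applied without flipping.

```python
def relu(v):
    """Rectified linear unit: max(0, v)."""
    return max(0.0, v)


def conv1d(x, kernel, bias=0.0):
    """One-dimensional convolution over time (no padding, stride 1), then ReLU.

    x:      list of time steps, each a list of feature values
    kernel: list of kernel rows, each with the same feature width as x
    """
    k = len(kernel)
    out = []
    for i in range(len(x) - k + 1):
        total = bias
        for j in range(k):
            for f in range(len(kernel[j])):
                total += kernel[j][f] * x[i + j][f]
        out.append(relu(total))
    return out


# Hypothetical input: 3 time steps with 2 features each.
x = [[1.0, 0.0], [2.0, 1.0], [0.0, 3.0]]
kernel = [[1.0, -1.0], [0.5, -1.0]]
feature_map = conv1d(x, kernel)
```

With a kernel of length two over three time steps, the feature map has two entries; the second pre-activation value is negative, so the activation sets it to zero.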

C. HYBRID MODEL (CNN-CatBoost)
In this study, the hybrid (CNN-CatBoost) model was constructed for the most accurate solar radiation prediction. This model is divided into feature extraction and prediction parts. In the feature extraction part, we compare the results of a one-layer one-dimensional convolution (Conv1D) and a two-layer Conv1D. The feature extraction part takes temperature, relative humidity, and cloudiness as inputs, and the prediction part, the CatBoost model, yields the GHI predictions [28]. The basic structure of the one-layer Conv1D hybrid model is shown in Figure 1. For the two-layer Conv1D hybrid model, one more Conv1D layer is added for the feature extraction.

A. DATA COLLECTION AND PREPROCESSING
The data applied to this study are 1-h weather observation data provided by the Weather Data Open Portal (https://data.kma.go.kr) from March 1, 2017, to February 28, 2022. The data cover 5 years, and the training and testing data are divided at a ratio of 8:2 so that all seasons can be tested. The data from March 1, 2017, to February 28, 2021, were used as training data, and the remaining data from March 1, 2021, to February 28, 2022, were used as testing data to evaluate model performance. The basic input variables applied to the model as independent variables were three weather variables with high correlation (temperature, humidity, and total cloud volume) and the out-of-atmosphere solar radiation (ei) proposed by He et al. [29].
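The chronological split described above can be checked with a short sketch. The function name and the sample records are hypothetical; only the dates come from the text. Counting days confirms that four of the five years correspond to a training share of about 0.8.

```python
from datetime import date

TRAIN_START = date(2017, 3, 1)
TEST_START = date(2021, 3, 1)
END = date(2022, 3, 1)  # exclusive bound; the last observed day is Feb 28, 2022


def split_by_date(records, test_start=TEST_START):
    """Split (day, value) records chronologically, without shuffling."""
    train = [r for r in records if r[0] < test_start]
    test = [r for r in records if r[0] >= test_start]
    return train, test


# Hypothetical records, only to show on which side boundary days fall.
records = [
    (date(2021, 2, 28), 0.42),
    (date(2021, 3, 1), 0.37),
    (date(2022, 2, 28), 0.51),
]
train, test = split_by_date(records)

train_days = (TEST_START - TRAIN_START).days
test_days = (END - TEST_START).days
train_share = train_days / (train_days + test_days)
```

Splitting by date rather than at random keeps a full year, and therefore every season, in the test set.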
The weather data were reconstructed through preprocessing. The points where the solar radiation value or total cloud amount was missing for a long time were removed, and the analysis was conducted at 18 points, as depicted in Figure 2. In addition, solar radiation starts to form a peak after sunrise, and a pattern with a value of zero is repeated daily after sunset. Due to these characteristics, many observations had zero or inapplicable values after sunset and before sunrise. Therefore, sunrise and sunset times were set, as listed in Table 1, and solar radiation was assumed not to be observed outside those hours. Sunrise and sunset times are the same in spring and autumn, so there are only three time windows, although four seasons are shown. In addition, days on which no solar radiation was observed at a point for the entire day were excluded, and missing values of temperature, humidity, cloudiness, and wind speed were replaced using linear interpolation. If precipitation was missing, it was judged that no rain had fallen, and the value was replaced with zero. The prediction results were compared by fitting the LSTM, GRU, XGBoost, CatBoost, and CNN models using the final preprocessed data.
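The two gap-filling rules above can be illustrated with a minimal sketch. The function names and sample values are hypothetical, missing observations are represented by None, and leaving leading or trailing gaps unfilled is an assumption of this sketch rather than something stated in the text.

```python
def interpolate_linear(values):
    """Fill None gaps by linear interpolation between the nearest observations.

    Gaps at the start or end have only one neighbor and are left as None.
    """
    filled = list(values)
    known = [i for i, v in enumerate(values) if v is not None]
    for left, right in zip(known, known[1:]):
        step = (values[right] - values[left]) / (right - left)
        for i in range(left + 1, right):
            filled[i] = values[left] + step * (i - left)
    return filled


def fill_precipitation(values):
    """Treat a missing precipitation value as no rain (zero)."""
    return [0.0 if v is None else v for v in values]


# Hypothetical hourly series with gaps.
temperature = [10.0, None, None, 16.0, 15.0]
precipitation = [None, 1.5, None]

temperature_filled = interpolate_linear(temperature)
precipitation_filled = fill_precipitation(precipitation)
```

Interpolation suits variables that change smoothly from hour to hour, whereas precipitation is intermittent, which is consistent with replacing its missing values by zero instead.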
In the previous analysis, the predicted results were compared by fitting the XGBoost, CatBoost, simple RNN, LSTM, GRU, and CNN models with the four input variables mentioned above [30]. Among them, the results of CatBoost, the best model, were compared with those of the CNN model, which has recently been used in several algorithms. Both models demonstrated high performance, and a hybrid model combining CatBoost and the CNN performed better still; therefore, the hybrid model was selected as the final model. To understand the influence of other weather variables on the selected model, the change in accuracy was examined when wind speed and precipitation were added to the basic input variables.

B. EVALUATION METRICS
To compare suitable models, MAE and RMSE were used as error measures. In general, MAPE is widely used to evaluate models, but it was challenging to apply the MAPE calculation because the solar radiation value was often zero. Therefore, the accuracy was evaluated on the scale of the MAE and RMSE, defined as follows:
$$ \mathrm{MAE} = \frac{1}{n} \sum_{t=1}^{n} \left| Y_t - F_t \right| $$
$$ \mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{t=1}^{n} \left( Y_t - F_t \right)^2} $$
where $n$ indicates the number of data points for prediction, $Y_t$ denotes the observation value at time $t$, and $F_t$ represents the prediction value through the model at time $t$. For MAE and RMSE, a smaller value indicates higher accuracy.
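The two error measures can be computed directly from their definitions, as in the minimal sketch below. The function names and the sample values are hypothetical; the sample deliberately contains zero observations, the case in which the percentage error of MAPE would require division by zero.

```python
import math


def mae(observed, predicted):
    """Mean absolute error between observations Y_t and predictions F_t."""
    return sum(abs(y - f) for y, f in zip(observed, predicted)) / len(observed)


def rmse(observed, predicted):
    """Root mean square error between observations Y_t and predictions F_t."""
    squared = sum((y - f) ** 2 for y, f in zip(observed, predicted))
    return math.sqrt(squared / len(observed))


# Hypothetical hourly values; zeros stand for hours without solar radiation.
observed = [0.0, 1.0, 2.0, 0.0]
predicted = [0.0, 1.5, 1.0, 0.5]

mae_value = mae(observed, predicted)
rmse_value = rmse(observed, predicted)
```

The RMSE is never smaller than the MAE on the same data and weights large errors more heavily, so reporting both gives a view of the typical error and of the occasional large miss.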

IV. MODEL APPLICATION
A. PERFORMANCE COMPARISON OF MACHINE-LEARNING AND HYBRID MODELS
The performance results of the machine-learning (CNN and CatBoost) and hybrid models are presented in Table 2, and the best MAE and RMSE values for each branch are underlined. When the boosting series (XGBoost and CatBoost) was compared with the RNN series (LSTM and GRU) models in previous studies, the MAE of the boosting models was about 0.12, and that of the RNN models was about 0.16, confirming that the boosting series was more accurate than the RNN series [18]. Among the boosting models, CatBoost performed best at all points in terms of the MAE [18]. The importance of the variables in the CatBoost model varies slightly from branch to branch, but the importance ranking was the same: out-of-atmosphere solar radiation, total cloud volume, humidity, and temperature, as depicted in Figure 3 for the Seoul branch. Additional CNN models were assessed to improve performance further. For the CNN, weather observation values from 3 h prior to the prediction time point were used as input variables, and a rolling prediction of the hourly solar radiation was performed. Because few features and few past values are used, it is difficult to use multiple layers or a large kernel; thus, the CNN structure comprises Conv1D, flatten, and dense layers, as illustrated in Figure 4.
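The construction of the CNN inputs for rolling prediction can be sketched as follows. This is a minimal illustration that assumes the input for each hour consists of the observations of the three preceding hours; the function name and the sample values are hypothetical.

```python
def make_windows(features, target, n_lags=3):
    """Pair each target hour with the feature rows of the preceding n_lags hours."""
    X, y = [], []
    for t in range(n_lags, len(target)):
        X.append(features[t - n_lags:t])  # hours t-3, t-2, t-1
        y.append(target[t])               # solar radiation at hour t
    return X, y


# Hypothetical data: 5 hours, 2 features per hour.
features = [[10.0, 0.2], [11.0, 0.3], [12.0, 0.1], [13.0, 0.0], [14.0, 0.4]]
target = [0.0, 0.3, 0.9, 1.4, 1.8]
X, y = make_windows(features, target)
```

Each window has the shape (3 time steps, number of features), which is the layout a Conv1D layer expects, and sliding the window forward by one hour yields the rolling hourly prediction.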
The output of the fully connected layer in the CNN model was extracted, and a hybrid model was implemented by passing these values to the CatBoost model, which had given the best results in the previous step. Although no significant difference exists in the average of the MAEs by point between the CNN and CatBoost, the MAE values of the CNN were lower than those of CatBoost in Table 2. However, the improvement in accuracy was small; thus, a hybrid method was applied to extract features from the CNN and predict the final value using CatBoost. As a result, the MAE results improved at almost all points.
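One way to wire the two stages together is sketched below using the Keras and CatBoost Python packages. This is a sketch under stated assumptions, not the configuration used in this study: the function names, the number of filters, the kernel size, the dense-layer width, the optimizer, and the number of epochs are all hypothetical. Inputs are assumed to be arrays of shape (samples, time steps, features).

```python
def build_cnn(n_lags, n_features, n_filters=32, n_dense=16):
    """Return (cnn, extractor): a Conv1D -> Flatten -> Dense regressor and a
    second model exposing the dense-layer activations as features."""
    from tensorflow import keras  # imported lazily; requires TensorFlow

    inputs = keras.Input(shape=(n_lags, n_features))
    x = keras.layers.Conv1D(n_filters, kernel_size=2, activation="relu")(inputs)
    x = keras.layers.Flatten()(x)
    features = keras.layers.Dense(n_dense, activation="relu")(x)
    outputs = keras.layers.Dense(1)(features)

    cnn = keras.Model(inputs, outputs)
    cnn.compile(optimizer="adam", loss="mae")
    extractor = keras.Model(inputs, features)  # shares weights with cnn
    return cnn, extractor


def fit_hybrid(X_train, y_train, epochs=20):
    """Train the CNN, then fit CatBoost on the dense-layer features."""
    from catboost import CatBoostRegressor  # requires the catboost package

    cnn, extractor = build_cnn(X_train.shape[1], X_train.shape[2])
    cnn.fit(X_train, y_train, epochs=epochs, verbose=0)
    booster = CatBoostRegressor(loss_function="MAE", verbose=0)
    booster.fit(extractor.predict(X_train, verbose=0), y_train)
    return extractor, booster


def predict_hybrid(extractor, booster, X):
    """Extract CNN features for X and let CatBoost produce the final prediction."""
    return booster.predict(extractor.predict(X, verbose=0))
```

Training the CNN end to end first gives the dense layer a representation tuned to the target, and CatBoost then replaces only the final linear output layer.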
The graphs in Figures 5 and 6 present the training and validation curves of the CNN and hybrid (CNN+CatBoost) models in Seoul according to the number of convolutional layers. The x- and y-axes represent the number of epochs and MAE values, respectively. The training and validation curves confirm that the MAE value for training and validation gradually decreases as the number of epochs increases, and no overfitting or underfitting occurs. The MAE value fluctuates in the model using only the CNN, whereas in the hybrid model using the CNN and CatBoost, the MAE values decrease steadily to similar values in both training and validation; thus, it is a more stable model. A hybrid model with two Conv1D layers was also fitted. The MAE of the two-layer hybrid model remained similar or slightly lower at most points compared to the one-layer hybrid model. Although the average MAE value per point does not significantly differ, the two-layer hybrid model was more stable with better results at most points; therefore, it was selected as the final model. Figure 7 presents the graph of the solar radiation observations in Seoul and the predictions of the CatBoost and two-layer hybrid models from February 22 to 28, 2022. The blue line indicates the two-layer hybrid prediction, the green line marks the CatBoost single model, and the red line represents the actual observations. In most cases, the peak points of the two-layer hybrid model were closer to the actual values than those of the model using only the CatBoost model. In particular, the gap in peak points can be observed on the fourth, fifth, and seventh days (February 25, 26, and 28). In addition, the CatBoost predictions often fluctuate, such as on the afternoons of Days 4 and 7, whereas the hybrid model seems to correct these parts, which improves the accuracy.

B. RESULTS OF ADDING VARIABLES
To understand the effects of other weather variables, changes in accuracy were examined when wind speed and precipitation were added to temperature, humidity, total cloud volume, and out-of-atmosphere solar radiation. The accuracy of the selected hybrid (two Conv1D layers) model with the added variables is presented in Table 3. Adding wind speed or precipitation to the basic variables made the average MAE for each point slightly smaller, but the degree of influence of wind speed and precipitation differed for each point. The model considering both wind speed and precipitation performed better at most locations, as the MAE became smaller except in three locations (Gangneung, Suwon, and Cheongju). Gangneung is located on the coast and has higher humidity and more precipitation than other areas. Therefore, accuracy improved there when only precipitation was added to the input variables, whereas adding wind speed had no such effect. In the case of Suwon, the average wind speed and precipitation were lower than in the other regions. Although Suwon is inland, the difference between when the wind does and does not blow is significant enough to be distinguished. Therefore, the results were improved only when wind speed with these characteristics was reflected. Cheongju is an area with less precipitation and very low wind speed compared to the other regions, so overfitting seems to have occurred by adding precipitation and wind speed. Overall, however, wind speed and precipitation do not appear to affect solar radiation significantly, given that adding them changed the accuracy only slightly.

V. CONCLUSION
In this paper, we aimed to predict the seasonal wind power generation in Gangwon-do using the wind direction and wind speed variables that affect each season as exogenous variables. The raw data were transformed using log transformation, and the existing time-series models (ARIMA and ARIMAX) and the machine-learning regression models (support vector machine, RF, and XGBoost) were used. When comparing the evaluation indicators MAE and MAPE, the machine-learning models displayed better predictive performance than the time-series models. Among the machine-learning models, the RF model had the best predictive performance, followed by the XGBoost and SVR models. In this study, prediction was attempted using log transformation and a single machine-learning model, but in future studies, other data transformation techniques, the addition of new weather variables, or complex machine-learning models may be evaluated.
As interest in solar power prediction increases, the importance of solar radiation prediction is increasing. Therefore, this study was conducted to improve the accuracy of predicting solar radiation.
The solar radiation was predicted using three weather variables (temperature, humidity, and total cloud volume), and the results were compared using various models. The boosting (XGBoost and CatBoost) and RNN (LSTM and GRU) models were fitted after determining the optimal hyperparameters for each point. The point-by-point average MAE was 0.1251 for XGBoost, 0.1157 for CatBoost, 0.1599 for LSTM, and 0.1515 for GRU. Thus, the CatBoost model was the best. Additionally, the one-layer CNN model was fitted, giving an average MAE of 0.1104 and slightly improving the performance. Subsequently, the hybrid model combining the CNN and CatBoost models was selected as the final model, lowering the average MAE of the solar radiation prediction from 0.1104 to 0.1027.
In addition, the influence of wind speed and precipitation was identified by considering these parameters in addition to the temperature, humidity, and cloud volume that have frequently been used. Wind speed and precipitation were added to improve accuracy, resulting in an average MAE of 0.1003. The model that considers wind speed and precipitation improved at all locations except three (Gangneung, Suwon, and Cheongju). However, considering that the MAE value changed little, wind speed and precipitation do not significantly affect solar radiation.
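To put the reported averages in relative terms, the short calculation below converts the MAE values quoted above into percentage reductions. The function and variable names are hypothetical; the four MAE values are those reported in this section.

```python
def relative_reduction(before, after):
    """Percentage reduction in MAE when moving from `before` to `after`."""
    return 100.0 * (before - after) / before


mae_catboost = 0.1157     # single CatBoost model
mae_cnn = 0.1104          # single one-layer CNN model
mae_hybrid = 0.1027       # CNN-CatBoost hybrid with the basic variables
mae_hybrid_plus = 0.1003  # hybrid with wind speed and precipitation added

cnn_to_hybrid = relative_reduction(mae_cnn, mae_hybrid)
catboost_to_hybrid = relative_reduction(mae_catboost, mae_hybrid)
added_variables = relative_reduction(mae_hybrid, mae_hybrid_plus)
```

The hybrid model lowers the average MAE by roughly 7% relative to the CNN and roughly 11% relative to CatBoost, whereas adding wind speed and precipitation lowers it by only about 2%, in line with the conclusion that these two variables have little influence.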
The analysis focused on comparing the model based on weather variables. A deeper study of the relationship between weather variables and solar radiation is needed. A diverse approach should be used, such as examining the relationship between weather variables through principal component analysis by assessing multicollinearity and extracting appropriate characteristics for solar radiation prediction using various methods. The accuracy of solar radiation prediction can be improved through the proposed future research to establish future renewable energy generation plans based on improved prediction accuracy.
SUJIN PARK was born in Seoul, South Korea. She received the B.S. degree in information statistics from Kangwon National University and the M.S. degree in statistics from Chung-Ang University, Dongjak, Seoul, where she is currently pursuing the Ph.D. degree in statistics. Her main research interests include time-series analysis and renewable energy.
HEE-JUN PARK received the B.S. degree from the Department of Astronomy and Space Science, Chungbuk National University, Cheongju, South Korea. Since 2012, he has been a Developer of energy management system (EMS) and an Operator of transmission grid with Korea Power Exchange. His main research interests include renewable energy demand forecasting and short-term electricity load forecasting for stable power systems.