Estimating missing data in historic series of global radiation through neural network algorithms

When managing meteorological data, it is common to find time series with incomplete data over some intervals; the issue is commonly addressed using the autoregressive integrated moving average (ARIMA) model or regression analysis (interpolation), both with certain limitations under particular conditions. This paper presents the results of an investigation aimed at solving the problem using neural networks. The analysis of a time series of global radiation recorded at the Francisco de Paula Santander University (UFPS) is presented, based on the data recorded by the weather station attached to the Department of Fluids and Thermals. The ten-year study series comprises 125,658 records of temperature, radiation and energy, with 9.98% missing data, which were duly cleaned and completed by a neural network using backpropagation algorithms in the mathematical software MATLAB.


I. Introduction
In 1998, the Department of Fluids and Thermals of the Francisco de Paula Santander University (UFPS) took the initiative to acquire a weather station in order to support research in areas related to alternative energy generation and meteorological data analysis for regional agronomy. From that date, it recorded weather-influenced variables, with recording problems over some time periods due to factors such as power outages and datalogger failures. In order to solve this problem, the Development Research Group of Industrial Processes (GIDPI) carried out the current research, whose main objective is to complete the missing data of the variables that were lost over time.
The processing of meteorological time series commonly runs into the problem of missing data in some time intervals; there is a large variety of alternative analyses to complete the missing data, among which two common methods stand out: the autoregressive integrated moving average (ARIMA) model and the regression analysis method (i.e. interpolation).
The autoregressive integrated moving average model is based on a dynamic time series that makes future estimates by observing past events in the series. This method is commonly used in the analysis of phenomena that do not vary much over time, e.g. temperature, humidity, pressure and wind speed, in which the tolerance of change over short periods of time is low (Vásquez, Rojas, & Duarte, 2015).
The application of the regression analysis method, for its part, requires selecting a data series with similar behavior, within the same topoclimatic influence area (reference series), which must have records in the same time intervals that are missing in the study series; in this way, a mathematical regression between the study series and the reference series is performed (Medina, 2008). Solar radiation is a natural phenomenon with a cycle of action of 10-12 hours, generally between 6:00 AM and 6:00 PM, reaching its maximum value around the middle of this interval. These characteristics prevent the application of the ARIMA method for predicting the missing data. The interpolation method, on the other hand, appears feasible because the nearest weather station attached to the Institute of Hydrology, Meteorology and Environmental Studies [IDEAM] is within the range of influence, 5 km away; however, that station provides global radiation values only at daily intervals, so it was not considered, since a high uncertainty range would be introduced in the results.
Unlike these two methods, which work with statistical equations, Neural Networks [NN] are considered a powerful data-modeling tool, since they can capture and represent complex implicit relationships between input/output [I/O] variables. The advantages of NN are that they can represent both linear and nonlinear relationships, and they can learn a direct relationship between the I/O variables (Hamzaoui et al., 2011). These valuable advantages allow us to propose a method to complete missing data, making use of the known I/O variables, based on NN.
It should be clear that no method can reproduce the missing data exactly; the best method to implement is the one that yields reasonable values consistent with the nature of a phenomenon that varies over time (Serlin, 2010).

II. Selection of the neural network
There is a variety of criteria and types of algorithms for implementing NN, but their operation is similar: multiple interconnected artificial neurons behave as Elemental Processors [EP] whose function is to relate the external stimuli of a system with the response of the same (Figure 1; Hamzaoui et al., 2011). In order to model the behavior of an EP, the existing I/O relationship has to be determined, referred to as the transfer function, where the output (Y) is obtained from the sum of the initial state, denoted by the bias coefficient (b), which defines the state of every neuron, i.e. inhibition or excitation (Ramírez, 2010), and the products of the external stimuli (X) with the interconnection coefficients between EP, known as weights (W) (1) (Ponce, 2010).
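The transfer function of equation (1) can be sketched as follows. This is a Python/numpy illustration (the paper itself works in MATLAB); the stimulus and weight values are hypothetical, and tanh is used only as an example of a continuous transfer function.

```python
import numpy as np

def ep_output(x, w, b, f=np.tanh):
    """Transfer function of an Elemental Processor (EP):
    Y = f(b + sum_i(W_i * X_i)), i.e. equation (1).

    x: external stimuli X, w: connection weights W,
    b: bias coefficient, f: continuous transfer function."""
    return f(b + np.dot(w, x))

# Example: an EP with two external stimuli (hypothetical values)
x = np.array([0.5, 0.2])   # stimuli X
w = np.array([0.8, -0.3])  # weights W
y = ep_output(x, w, b=0.1)
```

With an identity transfer function, the EP reduces to the weighted sum b + W·X; nonlinear choices such as tanh are what let layered EP capture nonlinear I/O relationships.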
Neural networks can be classified in several ways according to one or more of their relevant characteristics, such as:
• the function for which the EP are designed (e.g., pattern matching of I/O); • degree of connectivity between EP, which can be partial or complete; • direction of flow of information between the NN connections; • type of learning algorithm, which represents the entirety of the existing relationship between the I/O under an arbitrary training of the EP; • learning rules, which can define relevant parameters such as the number of iterations and learning coefficients; and • degree of supervision of the learning required for network formation (i.e. the percentage of data used for training and testing; Basheer & Hajmeer, 2000).
Neurons are grouped into different layers and interconnected according to the network architecture. The network often has one or more hidden layers between the input layer and the output layer, which allow it to learn both nonlinear and linear relationships (Figure 2; Infante, Ortega, & Cedeño, 2008).
The optimal number of neurons in the hidden layer is difficult to calculate, and depends on the type, dynamics and complexity of the phenomenon under study. This number is generally determined iteratively, observing the uncertainty range between the training data and validation results in the learning phase of the NN (Infante et al., 2008; Hernández, Bassam, Siqueiros, & Juarez, 2009).
Based on Hamzaoui et al. (2010), Basheer and Hajmeer (2000), and Colmenares (n.d.), who recommend the use of a backpropagation multilayer perceptron [MLP] network for the reconstruction of damaged or completely missing data patterns, a feed-forward MLP of the Backpropagation [BP] type was implemented. The term BP refers to how the error is calculated at the output layer and then propagated back to the hidden layers and subsequently to the input layer. This gives a major advantage because it provides versatility to the NN at training time, allowing it to self-adjust the coefficients W of each connection and the bias b of each neuron used.

The characteristics of the NN BP architecture are: • it is connected forward; that is, its connection type is unidirectional between the layers of the first line and the following layers; • it has an input layer with nodes that represent the input stimuli of the NN (this layer acts only as a distribution point; for that reason, some authors do not categorize it as a layer); • it has an output layer with nodes that represent the dependent variable (i.e. the modeled variable); and • it has one or more hidden layers containing neurons that help to capture the nonlinearity using a supervised learning strategy. Ramírez (2010) and Jamett (2004) recommend the use of the Levenberg-Marquardt BP training function, known as Trainlm, which adjusts each variable according to the Levenberg-Marquardt algorithm in order to obtain higher performance with higher training speed. In addition, the activation functions must be continuous; for that reason, a sigmoidal-type restriction must be applied, which is responsible for normalizing the dependent values to the interval between 0 and 1.

III. Experimental data
The series of temperature, radiation and energy used here are records taken on an hourly basis at a height of 15.4 m, supplied by the Groweatherlink weather station from Davis Instruments located at the UFPS, latitude 7°53'545''N, longitude 72°29'166''W (Vásquez et al., 2015). Due to the periodicity of solar radiation, data between 6:00 AM and 6:00 PM are analyzed, grouping the records on a monthly basis; in this way the treatment is more convenient and simple. Table 1 shows the percentage of records supplied by the station; of the 120 months shown, 10 contain less than 30% of the data, 21 have between 30% and 97%, and 88 have all the data.
In the treatment of databases with neural networks, it is recommended to distribute the records as follows (Jamett, 2014; San Juan, Jamett, Kaschel, & Sánchez, 2015): • 70% of the data must be for training; • 20% must be for testing; and • 10% must be for validation.
This is done in order to check the results of the test phase of the NN. As shown in Table 1, the months containing all the data constitute 73.3%, which confirms the viability of the series for the validation process and record estimation.
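The recommended 70/20/10 distribution can be sketched as follows. This is an illustrative Python/numpy equivalent of what nntool does internally; the record count and seed are hypothetical.

```python
import numpy as np

def split_records(n_records, seed=0):
    """Shuffle record indices and split them 70/20/10 into
    training, testing and validation sets, as recommended."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n_records)       # random order avoids seasonal bias
    n_train = int(0.7 * n_records)
    n_test = int(0.2 * n_records)
    return (idx[:n_train],                              # 70% training
            idx[n_train:n_train + n_test],              # 20% testing
            idx[n_train + n_test:])                     # 10% validation

train_idx, test_idx, val_idx = split_records(1000)
```

Shuffling before splitting matters for hourly radiation data: consecutive records are strongly correlated, so a contiguous split would give training and validation sets with different seasonal content.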
As previously mentioned, the operation of the NN depends on the I/O relationship, where the output is the Global Radiation [R], and the independent inputs must be classified according to the analysis of this phenomenon. In our case, the most influential external stimuli on R provided by the weather station are temperature (T) and energy (E; see Figure 2).

IV. Implementation of the NN
The tool that lets us create any type of network, train it, and put it into operation and testing is the Neural Network Tool Graphical User Interface [nntool] from MATLAB, which, as its name implies, provides a graphical interface where the user interacts with the I/O variables.
As previously stated, the external stimuli are temperature (T) and energy (E), and the dependent variable is radiation (R). To operate nntool we must prepare these variables in such a way that the system output (R) is defined by a 1 x n matrix, in which n denotes the number of known output records; in our case, for the month of January, 3,224, which corresponds to the sum of the thirteen radiation values from 6:00 AM to 6:00 PM in the eight known years of January. These records must then be imported into Target Data. The input variables (X), which must be imported into Input Data, must be grouped into a vector in such a way that the first row corresponds to (T) and the second to (E), resulting in a 2 x n size. Two additional inputs, IN2001 and IN2004, must also be imported; these are used in the operating and testing phase of the network.
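The matrix shapes just described (a 1 x n target holding the known radiation R, and a 2 x n input with T in the first row and E in the second) can be sketched as follows; this is a Python/numpy illustration with hypothetical hourly values, equivalent to what is imported into nntool's Input Data and Target Data.

```python
import numpy as np

# Hypothetical hourly records from a single day (illustrative values only)
T = np.array([24.1, 26.3, 28.0, 27.5])      # temperature
E = np.array([0.12, 0.45, 0.80, 0.66])      # energy
R = np.array([110.0, 320.0, 610.0, 540.0])  # known global radiation

input_data = np.vstack([T, E])   # shape (2, n): row 1 is T, row 2 is E
target_data = R.reshape(1, -1)   # shape (1, n): the known output records
```

Each column of `input_data` thus pairs one (T, E) stimulus with the same column of `target_data`, which is the column-wise alignment the training phase relies on.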
The next step is to adjust the parameters of the created network (Figure 4), modifying parameters such as: • network type; as indicated previously, a feed-forward backpropagation NN is implemented; • input stimuli X, denoted by T and E; • network output R; • type of training function, defined by TRAINLM; • learning function LEARNGDM, which modifies W and b according to the amount of training required; • type of error calculation, MSE, which adjusts performance on the mean squared error; • number of layers of the NN, defined iteratively until the best validation is reached; in this case, two hidden layers and the output layer are created; • number of neurons in each layer; as described above, there is no methodology that indicates how to select the number of layers or neurons; and • type of relationship between inputs and outputs: TANSIG is assigned to the two hidden layers, allowing them to learn the nonlinearity of the input stimuli, and PURELIN is assigned to the output layer, which relates the values of the hidden layers to the output variables in a linear way. Figure 5 shows the structure created, which consists of two inputs, (T) and (E); two hidden layers, each with 30 EP with a TANSIG relationship; and an output layer R, with one neuron and a PURELIN relationship.
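The 2-30-30-1 structure just described can be sketched as a forward pass. This is an illustrative Python/numpy equivalent of the MATLAB network, with random untrained weights; TANSIG corresponds to the hyperbolic tangent and PURELIN to the identity.

```python
import numpy as np

rng = np.random.default_rng(0)

# Layer sizes: 2 inputs -> 30 -> 30 -> 1 output (random, untrained weights)
W1, b1 = rng.normal(size=(30, 2)), np.zeros((30, 1))
W2, b2 = rng.normal(size=(30, 30)), np.zeros((30, 1))
W3, b3 = rng.normal(size=(1, 30)), np.zeros((1, 1))

def forward(X):
    """Forward pass: TANSIG (tanh) in both hidden layers,
    PURELIN (identity) in the single-neuron output layer."""
    A1 = np.tanh(W1 @ X + b1)
    A2 = np.tanh(W2 @ A1 + b2)
    return W3 @ A2 + b3          # linear output: estimated radiation R

X = np.array([[24.1, 26.3],      # row 1: temperature T (hypothetical)
              [0.12, 0.45]])     # row 2: energy E (hypothetical)
R_hat = forward(X)               # shape (1, n): one estimate per column
```

Training (the next section) amounts to adjusting W1-W3 and b1-b3 so that `forward` reproduces the known radiation targets.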

V. Training phase of NN
The ability to learn is a characteristic peculiar to intelligent systems, whether biological or of any other kind. In artificial systems, learning is seen as the process of updating the internal representation of the system (S) in response to external stimuli (E), in such a way that it can perform a specific task. This includes changing the network architecture through adjustments to its connections (W) (Basheer & Hajmeer, 2000).
The general method for training a BP NN can be summarized in five steps.
Forward steps: 1. Select an input vector from the training set.
2. Apply this entry to the network and calculate the output.
Backward steps: 3. Calculate the error between the calculated output and the desired output of the used input.
4. Adjust the weights so that the error between the calculated output and the desired output decreases.
5. Repeat steps 1-4 for all entries in the training set until the global error is acceptably low.
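The five steps above can be sketched with the simplest possible case: a single linear neuron trained by gradient descent. This Python/numpy sketch is only an illustration of the BP loop structure, not the Levenberg-Marquardt (Trainlm) algorithm the paper actually uses; the data are hypothetical.

```python
import numpy as np

def train_bp(X, y, epochs=500, lr=0.1, seed=0):
    """Minimal sketch of the five training steps for one linear neuron."""
    rng = np.random.default_rng(seed)
    w = rng.normal(size=X.shape[1])
    b = 0.0
    for _ in range(epochs):                # step 5: repeat until error is low
        for xi, yi in zip(X, y):           # step 1: pick an input vector
            out = b + xi @ w               # step 2: forward pass
            err = out - yi                 # step 3: error vs desired output
            w -= lr * err * xi             # step 4: adjust weights...
            b -= lr * err                  # ...and bias to reduce the error
    return w, b

# Hypothetical training set following y = 2x + 1 exactly
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([1.0, 3.0, 5.0, 7.0])
w, b = train_bp(X, y)
```

After training, `w` and `b` should approach 2 and 1; the real network repeats the same loop over the W and b of every connection and neuron, with the error propagated backward layer by layer.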
To start the training phase of the NN from nntool, we must enter the known parameters (I/O) in Training Info and then, in Training Parameters, adjust the training values, as shown in Table 2 (Sumathi, Ashok, & Surekha, 2015).
After importing the inputs and outputs of the network and adjusting the parameters, the training process is run as many times as necessary to reach the desired validation indices. In this case, a validation of 11.16 was obtained after three training runs.

VI. Operation phase and NN testing
In the training process of the NN, the nntool tool updates the value of the weights (W) and the bias (b) automatically, allowing one to continue with the operating phase and testing from the same interface.
For the operation of the NN, vectors of the incomplete variables had to be imported (Figure 3). In the case of January, the years with missing data are 2001 and 2004, of which only 25% and 39% of the data are known, respectively.
The input vector for 2001, denoted IN2001, was grouped in such a way that the first row corresponds to (T) and the second to (E); since only 25% of these values are known, the blank or unknown spaces must be filled with an arbitrary placeholder number (see Table 3).
To run the NN test from nntool, the IN2001 variable must be selected as input and a name assigned to the output variable, thus obtaining the vector corresponding to the completed radiation values (Figure 8).
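The completion step can be sketched as follows: run the full month's inputs through the trained network and keep the measured radiation where it exists, using the network's estimates only for the gaps. This Python/numpy sketch uses a hypothetical `predict` stand-in for the trained network and NaN to mark missing records.

```python
import numpy as np

def complete_series(measured, inputs, predict):
    """Fill NaN gaps in a measured radiation series with the
    trained network's estimates for the same input stimuli."""
    estimated = predict(inputs)                        # NN output for all columns
    return np.where(np.isnan(measured), estimated, measured)

# Hypothetical stand-in for the trained NN (scales the first input row)
predict = lambda X: X[0] * 10.0
measured = np.array([110.0, np.nan, 610.0, np.nan])    # gaps marked as NaN
inputs = np.vstack([[11.0, 32.0, 61.0, 54.0],          # row 1: T (hypothetical)
                    [0.1, 0.4, 0.8, 0.7]])             # row 2: E (hypothetical)
completed = complete_series(measured, inputs, predict)
```

Keeping the measured values untouched matters: the network's role is only to supply reasonable estimates for the gaps, as discussed in the introduction.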

VII. Results
Historical data series allow us to know the behavior of a variable with a high level of reliability; an incomplete series, in contrast, has many drawbacks when calculating averages. Thanks to the procedure carried out, these problems were solved, and a multi-year series was obtained with time intervals that define the behavior of the global radiation.
The results show that the highest radiation occurs in the months between August and October, with averages between 3.9 and 4.2 kW/m2, and the lowest radiation occurs between November and January, with averages between 3.2 and 3.4 kW/m2 (Figure 9). These values converge with the results obtained by UPME/IDEAM (2005) and Leal and Hernández (2013), who reported multi-year average radiation for the city of Cúcuta of 3.5-4.0 kW/m2, an interval to which the value of 3.51 kW/m2 obtained here conforms.
The use of strategies that allow us to complete missing data provides the advantage of obtaining mean values more accurately over short time intervals. Figure 10 shows the radiation scale (W/m2) across the hours and months of the year, vital information for optimizing the use of energy. A clear example of this is a smart solar tracker, which seeks to operate at the point of maximum energy capture.

García, F., Rojas, J., Vásquez, A., Parra, D., & Castro, E. (2016).

VIII. Conclusions
The use of artificial neural networks to complete missing data in meteorological variables becomes a viable and easy-to-handle solution thanks to computational tools such as nntool. They provide global minimum percentages in the validation of results, some below 10%. The accuracy of these validations depends directly on the type of training and the percentage of known values, as well as on the number of inputs analyzed by the system. In this case, only two were analyzed (temperature and energy), but it is recommended to introduce other meteorological variables, such as maximum temperature, minimum temperature, wind speed, pressure and relative humidity, because of their relationship to global radiation.
It is not recommended to complete values in series where less than 20% of the data are known, since there is not an appropriate amount of data for validating and testing the results, which leads to a high level of uncertainty in the output variable. This particular case occurred in eight months of the series, in which all data were unknown; for that reason, it was decided to average the known input variables and enter them as the input variable for the test phase of the NN.