Accurate determination of parameters relationship for photovoltaic power output by augmented dickey fuller test and engle granger method

Power output from photovoltaic (PV) systems in outdoor conditions is substantially influenced by climatic parameters such as solar irradiance and temperature. PV manufacturers always provide technical specifications in laboratory conditions but reliable relationship for the power output must be determined for accurate prediction under real operating conditions. For the present study, solar irradiance G, temperature T and electrical power output P data under real conditions are methodically and regularly inscribed in dataloggers. Hence, in this paper, we investigate rigorous and robust statistical methods for small sample such as Augmented DickeyFuller and Engle Granger for stationary series to determine the estimate regression between variables P, G & T. A first regression of power output P time series variable on solar irradiance G time series has shown spurious results and thus spurious regression. The first differences of such time series are stationary and a regression is proposed whereas temperature variable is identified as not significant and where autocorrelation of residuals is suspected. Finally, the novelty of this paper is the Engle & Granger method that is used to provide a relationship between variables P and G in a difference level. A final relationship without suspicious heteroscedasticity has been determined. Our model is formulated on the basis of PV real conditions statistical approach and is more realistic than steady approach models.


Introduction
Forecasting accurately photovoltaic (PV) power output is not only based on climatic conditions but large variation of the power output is also observed due to several factors such as solar irradiance, temperature, wind speed and humidity. Actually PV nominal specifications such as power output or energy yield of recent modules are evaluated by manufacturers under standard conditions (STC) by a flashing technology and these nominal specifications are hand out to customers. Under real outdoor conditions theses specifications are not always valid. Meanwhile, an economical solar simulator based on micro-channel solar cell thermal (MCSCT) [1][2][3] has been designed and tested in indoor condition. Yet, the PV measured data from indoor and outdoor conditions are decorrelated. Since late 70 s the simplest explicit equation [4] for the steady-state operating temperature of a solar module links with the ambient temperature and the incident solar radiation flux had been studied. It is difficult to effectively assess the impact of PV output variability [5,6] on the power grid stability without a clear understanding of the factors that influence this variability.
Many factors have become redundant in the presence of other factors, and their value in the forecasting framework may vary under different prediction horizons and error measurements. Nevertheless, many studies conducted to date have shown that temperature [7][8][9] is the most important parameter that influences the PV output, although the latter is largely dependent on solar irradiance. Comparative studies of different models have been used to predict the PV module temperature in a mathematically [10] explicit and several empirical models [6,11,12] for estimation of PV module temperature have been proposed based on solid state physics of PV or thermal radiation and convection contributions [13][14][15] to the PV operating temperature as well as for PV technologies such as thin film PV system under Mediterranean climate conditions [16]. More implicit in form or experimental results have validated temperature effect on electrical model [17] in outdoor conditions. In the past few years, many statistical techniques [18][19][20][21] have also been proposed to forecast energy output of PV system or others have shown either how to improve confidence intervals of PV data from estimator's variance [22] or modeling solar forecasting through Artificial Neural Network [23][24][25][26] as an alternative to conventional approaches. Also a novel contribution of fault detection algorithm [27] has been proposed using a statistical analysis of realtime long term measured data and theoretical thresholds.
Some studies have used classical regression methods that took advantage of correlation nature of meteorological variables which are used prediction model as inputs. However, regression between non-stationary series may lead to conclude on the existence of a relationship between two variables even though there is no meaningful linear relationship between them. The novelty of this paper is to show that if PV explanatory variables are non stationary then a study of first difference must be determined. An incidental advantage of the first-difference transformation is that it may make a nonstationary time series stationary to reach the final PV equation taking into account the temporal relations between the PV explanatory variables in a moderate zone.
We compared measurements and prediction of energy output [28] in real outdoor conditions as a function of location and environmental conditions mainly solar irradiance  (W/m 2 ). The goal of the present work carried out is to develop an operational forecasting framework zone. An accurate PV output forecasting method would be of great help to grid operators and would minimize costs while enabling more integration of variable renewable energy (RE) in the grid. The resulting statistical model should then be applied in future works on collected datasets at several particular temperate climatic sites for validation. A mathematical model [29] using satellite data has also been proposed to determine PV performance in continental scale. The two models should be compared in a future work.
Statistically we studied a linear relation analysis of time series data collected over a period of time and investigated mainly the dependent variable of power output P on explanatory variables such as solar irradiance G and temperature T. For that we first determined the stationary character or not of each time series of solar irradiance and temperature as this will set the statistical method to be used. Testing the stationarity of a series would be able to understand autocorrelation function and its statistical significance. Dickey Fuller (DF) have developed a test called Augmented Dickey & Fuller (ADF) [30,31] which is a unit root test for stationarity.
The ADF test is used to determine the method of regression estimation between PV variables such as P, G and T. Then we show that these variables are not stationary at level and an ordinary least square (OLS) regression is only possible at difference level for each variable including a constant in the regression. First results of the study show that temperature at first difference level and the constant are not very significant as the OLS coefficients are well estimated. Then we analyze residual hypotheses.
Analysis of residuals is a powerful diagnostic tool and it helps to assess whether some of underlying assumptions of regression have been violated. Ideally all residuals should be small and unstructured meaning that the regression analysis has been successful in explaining the essential part variation of the dependent variable. We first apply the Goldfeld-Quandt (GQ) test when heteroscedastic variance is related to variables in the regression model then the Durbin Watson (DW) test to detect serial correlation in the residuals. In this present study, the GQ test is well verified but DW test is still doubtful and not very conclusive. Outliers are suspected in the model. Therefore, a more recent procedure such as the Engle & Granger (EG) method is suggested to determine the most appropriate model. Our model is formulated on the basis of PV real conditions statistical approach and is more realistic than steady approach models.
For this study, the small samples of one year daily means data is retrieved among the 7 years measurements from the GREEN lab of Physics department of University of Lorraine in Metz. The PV design of the GREEN lab is an on grid connected system. Six polycrystalline modules of SCHÜCO technologies are connected in a series wiring pattern and mounted on the south-south east vertical wall of the platform building. Each module has a peak power of 205 W P , at tilt angle of 60°, low ventilation and connected to a SCHÜCO inverter for a power level up to 1 kW P . This paper is organized as follows. Section 2 is an introduction to the methodology explaining the ADF stationary test and Engle Granger tests for cointegration. In Section 3, we describe the input solar irradiance distribution of this peculiar zone to predetermine if a frequency distribution such as Weibull distribution is expected. In Section 4, we use informal tools such correlograms and Q-statistic to test whether each variable such as solar irradiance G, power output P and temperature T is stationary or non-stationary. The ADF test is applied to each variable and results are presented and analyzed in table form. If a time series has a unit root then first differences of such time series are assume to become stationary. In that section, we apply the ADF test to first difference series transforming non-stationary time series to yield stationary series, we propose a probable estimates difference equation regression model. In section 5, we propose the OLS regression equation and identify outliers before analyzing residuals from the propose regression using the GQ test then the DW-d stat test with its corresponding tabulated bound. We discuss and highlight the proposed regression equation with residuals and introduction to EG method. In section 6, we apply the EG correction mechanism reconciling the short-run behavior with its long-run behavior and a final relationship without suspicious heteroscedasticity is proposed. Finally, conclusion and future work are indicated in Section 7.

Methodology
Let us consider a dependent variable Y and a number of explanatory variables such as X's (X 1 , X 2 , ..., X k ). X k , being the k th explanatory variable. These (k+1) variables will denote the i th observation on explanatory variable X k . In order to determine the linear relationship between variables, for example, Y = a 0 + a 1 X 1 + … + a j X j + … + a k X k , the well-known least squares linear regression technique is used. This technique provides an estimate of each of the coefficients for a i , giving a measure of the quality of the linear fit of the correlation coefficient R and provides confidence intervals for the predictions, etc. The interpretation of R in a multiple regression model is of dubious value and this technique can only be used when observations of one explanatory variable is independent one to each other.
However, and in this study, this hypothesis of independency is not necessarily verified, due to time series: there exists a relationship between the y(t) series and y(t-1), y(t-2) series and so on, there may be an autocorrelation phenomenon for X k variables as well.
In this context, it is then necessary to check whether the series are stationary or not. Indeed, performing a regression between non-stationary series leads to what is known as spurious results or regression [29] for R 2 determination as well as for the regression coefficients. This can lead to conclude the existence of a relationship between two variables when in fact there is no linear relationship between them. Regressions involving time series data include the possibility of obtaining spurious or dubious results [29] in the sense that superficially results look good but on further probing they look suspect.
If variables are non stationary which is the case of this study, then the first difference (or upper orders if needed) must be studied. If these first differences are stationary then the multiple regression technique is used to determine linear relationships between theses first differences. An incidental advantage of the first-difference transformation is that it may make a nonstationary time series stationary.
A linear combination of variables at a non stationary level may become stationary which is given as the co-integration equation also known as the Engel Granger (EG) method [32,33].
Highlighting of such a relationship should allow improving the estimation of the linear relationship between the variables in difference that should also lead to reach the final equation, first by taking into account the temporal relations between the variables, then by introducing explanatory variables as variables measured in the previous period.

Dickey-Fuller stationary test
The principle of DF [9] in statistics stationary test is based on null hypothesis or test for the existence of a unit root in an autoregressive model. Unit root test is explained in appendix A. DF have shown that under the null hypothesis that is if  = 0 then the estimated t value of the coefficient of X t−1 follows the τ (tau) statistic. The tau statistic or test is known as the DF test. DF has computed critical tau values and can be obtained in a tabulated form.
The DF test is usually estimated on three models with their corresponding equation under three different null hypotheses. But we should consider only one random walk equation for the model as others are mainly used by economists in a financial field. Model: X t is a random walk, where there is no constant or drift component and no trend: In each case, the null hypothesis is given for  = 0, that is there is a unit root and the time series is non-stationary. The alternative hypothesis is that  is less than zero and the time series is stationary.
In this paper, the following indication for different hypotheses will be used: Hypothesis H 0  = 0 means  = 1 then such series is a non-stationary series. Hypothesis H 1  < 0 means  < 1 then such series is a stationary series. It is extremely important to note that the critical values of the tau test to test the hypothesis that  = 0, are different for each of the preceding three specifications of the DF test. The estimation procedure of tau will be explained later from experimental data table. However, if the computed absolute value of the tau statistic (τ) less than the DF critical tau values from tables, hypothesis that  = 0 is rejected in which case the time series is stationary. On the other hand, if the computed τ is less than the critical tau value the null hypothesis is not rejected in which case the time series is non-stationary.
In conducting the DF test it was assumed that the white noise error term u t was uncorrelated. But in case the u t are correlated then DF have developed a test, known as the Augmented Dickey-Fuller (ADF) test which is explained in the next section.

Augmented Dickey-Fuller stationary test
To tackle autocorrelation problems DF have developed a test called Augmented Dickey-Fuller (ADF) which is still a unit root test for stationarity. This test is conducted by augmenting the above equation by adding the lagged values of the dependent variable X t . This equation is given as follows: The number of lagged difference terms to include is often determined empirically, the goal is to include enough terms so that the error is serially uncorrelated. In ADF still test whether  = 0 and the ADF test follows the same asymptotic distribution as the DF statistic so the same critical values can be used.

Spurious regression
Let us consider a random walk model given by Eq 3a: And the same number of observations generated from v t et u t and also assumed that the initial values of both Z and X were zero. Il is also assumed that u t and v t are serially uncorrelated as well as mutually uncorrelated. Suppose Z t is regressed on X t as both are uncorrelated processes the R 2 from the regression of Z on X should be close to zero; that is, there should not be any relationship between the two variables. However, it may happen that the R 2 value is significant but there is no linear relationship between the variables. This is the phenomenon of spurious regression and spurious model is not desirable [33].
To avoid spurious regression problem that may arise from regressing a non-stationary time series on one or more non-stationary time series, we have to transform non-stationary time series to make them stationary. If a linear combination of non-stationary variables is stationary, the variables are said to be cointegrated [33]. Their fluctuations are concomitant. But the regression model is not appropriate for highlighting linear relationships between non-stationary variables, this is the problem of spurious regression, so other procedures such as Engle & Granger must be used as described in the next section.

Engle-Granger tests for cointegration
A linear combination of non stationary variables can be stationary, the variables are then cointegrated and their fluctuations are related. But the OLS regression model is not appropriate for highlighting linear relationships between non stationary variables, this is the problem of spurious regression as discussed earlier. Another procedure such as EG [34,35] is then applied. The EG test for cointegration is a three-step procedure.
The first step: it is necessary to ensure that the first differences of the corresponding Z t and X t series are stationary series.
The second step: let v t be the residual of the regression of Z t with respect to X t given as follows: then if v t is stationary then Z t and X t series are cointegreted and the relationship is usually called long run equilibrium. Practically it's not possible to do a stationary test on the unknown z t series but the test can be applied to the following series: where hat a and hat b are determined by regressing Z t with respect to X t . The critical values of DF test and ADF test are then modified.

The third step: Error Correction Model (ECM)
The model is as follows: The three variables of this equation ∆Z t ,  t-1 and ∆X t-1 are stationary so that all estimations of this equation are given by the least squares method. If  is a negative value, v t-1 acts as spring force and thus when v t-1 is a positive value the Z t value has a higher value than it should be given the long run relationship, Z t-1 > aX t-1 + b. As  has negative value the v t-1 effect on ∆Z t gives a negative value and thus Z t < Z t-1 that tends to balance the long run relationship. For those interested on that topic may refer to a more specialized books in econometrics.

Solar irradiance frequency distribution test
Solar energy is harnessed with solar PV converting sunlight directly into electricity. Thus the energy output of PV system depends on the input solar irradiance distribution. In this section, the solar data of a peculiar site is systematically recorded and analyzed to determine whether the distribution of solar irradiance frequency distribution follows a predetermined distribution such as Weibull distribution which is then used to predict performance and energy output as for wind turbine 34. Although 8 years of solar irradiance data have been recorded from the GREEN Platform of University of Lorraine and due to data similarities between years, only one year data is used for the test. Otherwise the 8 years data that is being sampled every 10 minutes would need huge processing time.
These long term irradiance data are used to calculate the probability density function of the irradiance for different hours of a typical day in a month. The frequency distribution of solar irradiation of the study series is displayed as a histogram in Figure 1. The histogram divides the series range (the distance between the maximum and minimum values) into a number of equal length intervals or bins and displays a count of the number of observations that fall into each bin. Figure 1 indicates the number of observation against 15 bins of classes and each class is an interval of values of the irradiance variable G (W/m 2 ). For example, the fourth bin between 75 to 100 indicate the corresponding number of 17 observations. The visual examination does not seem to identify any appropriate functional form of a Weibull distribution curve that may be superimposed on the histogram pattern. We therefore proceed by a statistical test as indicated in Table 1. The statistical hypothesis testing framework is as follows, the null hypothesis H 0 is Weibull distribution and H 1 is non Weibull distribution. The statistical test is based on the likelihood ratio that gives the number of times data is more likely under one model than the other.
This likelihood ratio is then used to determine the probability or p-value or compared to a critical value to decide whether or not to reject the null hypothesis. Table 2 is the second part of the output of Table 4. The p-value is less than the required significance level then we say the null hypothesis is rejected at the given level of significance. But at the same time the z-statistic value is so great compared to test critical values.
Therefore, the null hypothesis of a Weibull distribution cannot be retained but seems to be very unlikely.    Before we go to the ADF model, we draw the graph of the study series for an eye examination of any pattern that might be important in our future assumption. In the next sections, we should verify whether the series are stationary or not.

Correlograms & Q-Statistics
In that part the two alternative hypothesis are considered, H 0 : series X is stationary; H 1 : series X is not stationary.
We shall be using correlograms or Ljung box statistics to test whether the series is stationary or not. The following tables show pictures of correlograms of a time series. Autocorrelation and partial autocorrelation functions characterize the pattern of temporal dependence in the series and typically make sense only for time series data. This is applied to solar irradiance data and is displayed in Table 3. The first column is the autocorrelation column represented as spikes between vertical lines.
The autocorrelation of a series Z at lag k is estimated by the following Eq 4: where is the sample mean of Z. This is the correlation coefficient for values of the series k periods apart. If 1 is non zero, it means that the series is first order serially correlated. If k dies off more or less geometrically with increasing lag k it is a sign that the series obeys a low-order autoregressive (AR) process. If k drops to zero after a small number of lags it is a sign that the series obeys a low order moving-average (MA) process.

Z
Normally, when the spikes are outside the two lines we suspect that the data is not stationary and this is for example one sign among others. From the AC column in Table 3, the estimated values at lag k that are usually noted as hat symbol is gradually going down which means that probably data is non-stationary and is lagged from 1 to 36. Table 3. Correlogram solar irradiance series.
The partial autocorrelation at lag k is the regression coefficient on Z t-k when Z t is regressed on a constant and Z t-1 , … , Z t-k . This is a partial correlation since it measures the correlation of Z values that are k periods apart after removing the correlation from the intervening lags. When the pattern of autocorrelation is one that can be captured by an autoregression of order less than k then the partial autocorrelation at lag k will be close to zero that is indicated by the PAC ( Partial AutoCorrelation) column in Table 3.
Finally, the last value of the Q-stat (Q-Statistic) column in Table 3 are significant at all lags indicating significant serial correlation in the residuals. The last value of the Q-stat column which 931.47 is a high value and the corresponding probability or p-value is zero which less that 5% meaning that the null hypothesis H 0 is rejected.
We note that equation for Q-stat is given as follows: where T is the total number of observations et m is the lag length. The stationary test of the solar irradiance series seems to be a non-stationary data so OLS between these series could not be applied. The slow linear decay of the AC coefficients in Table 3 can be observed which indicates the need to differentiate. An advantage of the first-difference transformation is that it may make a non stationary time series stationary. k  Correlogram of the first difference for solar irradiance is given in Table 4, the autocorrelations at various lags hover around zero which is a picture of the correlogram of a stationary time series. We repeated the procedure for power output and temperature data. Tables 5 & 7 respectively, show each correlogram of power and temperature series with the corresponding correlogram of the first difference series given in Tables 6 & 8. These correlograms show similarities to the solar irradiance series. The null hypothesis is thus rejected and all series are non-stationary series but not for the first difference series.
The ADF is discussed in the next section to determine the functional form of the regression model with the first difference as variables.

Augmented dickey fuller solar irradiance test
To tackle the autocorrelation problem, the Augmented Dickey-Fuller (ADF) is applied to test the null hypothesis of whether a unit root unit is present in a time series test. The test is applied to the three series solar irradiance series (G), power output series (P) and temperature series (T) with the following assumption, H 0 : series has a unit root; H 1 : series has no unit root.
The ADF test results are divided into two distinct sections. The first portion displays the test of the unit root output provides information about the form of the test that is, the type of test, the exogenous variables and lag length used and contains the test output associated critical values such as the probability p-value. The second part of the output shows the intermediate test equation computed to calculate the ADF statistic. In this section, the unit root test is applied to the solar irradiance and the results are explained. In the following sections, the same procedure is then applied to the power and temperature series.
The first portion of the ADF test is given in Table 8a, where the none exogenous estimate the test that does not include a constant and linear trend in the test regression and the variable is E that is lagged once as displayed by the correlogram of Table 3.  Table 8a is − 0.778035 and the associated one-sided p-value is − 0.3786 for a number of observations specified in Table 8b. In addition, the critical values at the 1%, 5% and 10% levels are also reported. We notice here that the t-statistic value is greater than the critical values which is higher than 5% so that we do not reject the null hypothesis at conventional test sizes.

ADF statistic absolute value in
Second part of the output is displayed in Table 8b and shows the intermediate test equation that has been used to calculate the ADF statistic upon 363 observations and the dependent variable is D (G) which is regressed first lag.
In Table 8b, the column labeled coefficient depicts the estimated coefficients or estimates and should not be viewed as being deterministic. It's a kind of indication upon the variable precision. The absolute t-statistic value associated to the G ( − 1), that is 0.778035 is defined as the ratio coefficient to the standard error and given as follows: The t-stat value of the coefficient value should be compared to critical values tabulated by DF. This value is less than those indicated in Table 2, given as 2.58, 1.95, 1.61 for the corresponding threshold 1%. This comparison indicates that the solar irradiance series is not stationary. In this test we ignored the probability value from Table 8b because the coefficient value of G ( − 1) is not distributed that is it does not follow the student distribution. The t-statistic value and corresponding probability value of the first difference term lagged one are highly significant so we did not make any mistake from earlier analysis in considering this variable.
In the following sections, only results of the first part of the ADF test corresponding to power output and temperature data are presented as the second parts have similar characteristics to the first difference lagged at different levels to the second part of solar irradiance ADF test.

Augmented dickey fuller power & temperature test
Similar ADF test is applied to the variable power (P) and temperature (T) and each first part is given in appendix C. Examination of these tables reveals that the observed p-value in each case given respectively as 0.3459 and 0.6907 is not significant as it is a very high value compare to 5%. Also t-stat value of the variable power (P) equal to − 0.853089 is again higher than all the critical values. The t value of the temperature coefficient is 0.026823 but this value is greater than even the 10 percent critical τ value of − 1.616101 suggesting that even after taking care of possible autocorrelation in the error term, both power and temperature series is a non-stationary series. However, the both dependent variable D (P) and D (T) first difference lagged one shows significant characteristics values of a stationary series.
The next section is concerned with the ADF test applied to difference series.

ADF test to difference series of irradiance, power and temperature
To avoid the spurious regression problem that may arise from regressing a non stationary time series on one or more non stationary time series, we have to transform non stationary time series to make them stationary. If a time series has a unit root then the first differences of such time series may be stationary. Therefore, the solution here is to take the first differences of the time series.  D operator is used to specify differences of series. To specify first differencing, we simply include the series name in parentheses after D with the corresponding lagged. The number of lagged difference terms to include is often determined empirically, the idea being to include enough terms so that the white noise error term in is serially uncorrelated. The difference solar irradiance table of the ADF test is illustrated in following tables. The similar tables of each dependent variable P and T are given in appendix C.
The lag length is being checked with the corresponding correlogram of the series to determine and to re-estimate with one less round of differencing.

Conclusion
The time series of variable P, G, T are not stationary series but differencing may yield stationary series. We can therefore use the regression test between the 3 variables in difference. This is discussed in the next section.

Difference regression equation
We computed the regression as each variable is stationary in difference. Table 10, is the regression difference where the explanatory variables G and T are used including a constant C. The dependent variable is P (Delta P) for 363 samples and 362 included observations after adjustments. Let's just turn to interpreting the results. From Table 14, firstly we deduce the expected regression equation referring to the corresponding explanatory coefficient. This is given as Eq 6a below: (6a) It is a standard practice to use the coefficient p-values to decide whether to include variables in the final model. Yet, the corresponding p-value of T and constant term C are not statistically significant because their corresponding p-value (0.4077, 0.9953) are greater than the usual significance level of 1%, 5%, 10% and can be removed from Eq 6a. Whereas the t-statistic value of G is highly significant as well as the corresponding probability that is less than 5%, expecting good regression between P and G.
The overall regression fit as measured by the R 2 value is more than 94% indicating a very tight fit. Obviously we'll focus on R-squared (R 2 ) suggesting that solar irradiance alone can explain over nearly 95% of the variation of power output and it will have only 1% impact on the variable power P. Although this regression seems to be very significant however, both the constant and temperature difference are not significant as is revealed by the corresponding probability p-value 0.99 and 0.40. Indeed, the p-value given just below the F-statistic in Table 10 denoted as Prob (F-statistic) is the marginal significance level of the F-test. If the p-value is less than the significance level that is 5% the hypothesis that all slope coefficients are equal to zero are rejected.
For example, in Table 10, the p-value is essentially zero for G. Note that the F-test is a joint test so that even if all the t-statistics are insignificant the F-statistic can be highly significant. However, we can propose the apparent estimate equation using least squares given as follows by Eq 6b. (6b) The proposed regression equation is still not satisfactory because other data in the table seem to indicate differences between the predicted value (based on the regression equation) and the actual observed value. This is discussed in the next section.

Outliers & analyzing residuals from regression
One problem with least squares occurs when there are one or more large deviations for example, cases whose values differ substantially from the other observations. These points are called outliers. They may represent important information about the relationship between variables. Robust regression might be a good strategy since it is a compromise between excluding these points entirely from the analysis and including all the data points and treating all them equally in ordinary least squared (OLS) regression. In linear regression, an outlier is an observation with large residual. Figure 3, represents the residual graph where the blue curve is the residuals from the regression provided by a residual table. The green and red curves are respectively the real or actual and fitted curves of power output against the number of observations. We see that the regression seems to be going rather well from the point of view of predicting power output evolution. However, the residual graph shows 4 outliers diagnosing failures of the above assumptions of the regression model. The 4 outliers do appear on one side at 220 and 221 observations with amplitude around 100 and on the other side at 357 and 358 with amplitude around 170. Given the existence of outliers, we re-estimate the regression without these 4 points as given in Table 11 with the corresponding residual curve of P as illustrated in Figure 4. From the residual graph of P in Figure 4, we can see that there is no longer any spurious residual but we can suspect a little heteroscedasticity, that is to say a variance of the residual values varying with the level of the variables. This is discussed in the next section.

Heteroscedasticity & autocorrelation
When using some statistical techniques such as OLS, a number of assumptions are typically made. In a linear regression model in which the errors have expectation zero and are uncorrelated as well as having equal variances then the best linear unbiased estimator (BLUE) of the coefficients is given by the OLS estimator. If the variance of the error term ε i is not the same across all observations i=1, ..., n, then the disturbances are said to be heteroscedastic.
Heteroscedasticity tends to produce p-values that are smaller than they should be. This effect occurs because heteroscedasticity increases the variance of the coefficient estimates but the OLS procedure does not detect this increase. Consequently, OLS calculates the t-values and F-values using an underestimated amount of variance.
This problem can lead to conclude that a model term is statistically significant when it is actually not significant as the coefficients are biased. In our case, as the t-test for each coefficient examines them individually the t-test of G in Table 11 is higher than that in Table 10 with the F-value very significant in Table 11. The overall F-test is significant and the R-squared value has been improved therefore correlation between the model and dependent variable is statistically significant.
We thus identified the single significant explanatory variable which is G with heteroscedastic disturbance and therefore the kind of heteroscedasticity must be identified. In the next section we propose more in-depth studies in order to propose a regression equation closer to real experimental conditions.
The corresponding graph of first difference of power output P against first difference of solar irradiance G is illustrated in Figure 5. However, some scattered values do not fit the regression line reasonably and although the constant term has been removed, the very slight intercept on the delta P axis are indications to pursue further studies.  We then test that the residual is an increasing function of the explanatory variable and the Goldfeld Quandt (GQ) test is proposed. This is discussed in the next section.

The goldfeld quandt test
The blue curve of Figure 6 is the residual of P with no outliers and is oscillating in a range of values that act as good estimates. This range is also known as the confidence interval and is represented by the dotted blue lines. GQ have argued that the error term may not satisfy the ordinary least squares assumptions and may itself be heteroscedastic. The GQ test is accomplished by undertaking separate least squares analyses on two subsets of the original dataset: these subsets are specified so that the observations for which the pre-identified explanatory variable takes the lowest values are in one subset, with higher values in the other. The test statistic or F-test used is the ratio of the mean square residual (MSR) errors for the regressions on the two subsets given by Eq 7 where MSR max is the highest value in a subset and MSR min is the lowest value. In our study, the OLS regressions of the GQ test is based on the first 100 and the last 100 observations and their associated residual sums of squares are given respectively in Tables 13a and  13b.  From Tables 13a and 13b, the corresponding sum squared residuals values are MSR max = 101375 and MSR min = 86421. The F ratio as indicated above is determined and is equal to 1.17 so we can reject the hypothesis of heteroscedasticity.
However, this value from the GQ method is a non-zero value for the error term and a further study on the DW d-stat test is proposed in the next section.

Durbin Watson D-Stat test
The corresponding residual graph is displayed in Figure 6, the colored curves have the same defintion as defined in Figure 3 where the blue curve indicates that no outliers values are blatant for the residuals but does not explain the DW statistic value. The conventional DW tables are not applicable when a constant term does not exist in the regression and instead an appropriate set of Durbin-Watson tables is used as reference table.
The DW is a test that the residuals from a linear regression or multiple regression are independent and makes it possible to detect a first order autocorrelation errors that is an equation given as follows: Because most regression problems involving time series data exhibit positive autocorrelation the hypotheses usually considered in the DW test are H 0 :  = 0 and H 1 : The test statistic is given as follows:  Table 16 is 2.72 which is between 2.336 and 4 therefore a negative autocorrelation is expected.
However, a more rigorous method such as Cochrane Orcutt method [29] can be applied if required but this is beyond the scope of this study.

Discussion
Regression of a non stationary series on another non stationary series may cause spurious regression which is not desirable. In our case, we have two variables P and G and they have unit root at level meaning non stationary. They are stationary after first difference. Nevertheless the regression model is not appropriate for highlighting linear relationships between non stationary variables and suppose P is regressed on G as follows: P t = G t + C + e t (10) where e t has a unit root and is stationary as explained in the previous section. It is then proposed to estimate the regression of equation (10) and obtain the residuals. However, there is one precaution to exercise. Since the estimated e t are based on the estimated cointegrating parameter , ADF critical significance values are not quite appropriate and other values are used and determined from Engle and Granger method. This is discussed in the next section.

Engle and granger method
The outline of Engle & Granger (EG) [30] method is explained in section 2.4, still we can simply resume the principle. EG note that a linear combination of two or more series may be stationary, in which case we say the series are cointegrated. Such a linear combination defines a cointegrating equation between variables with cointegrating vector of weights characterizing the long-run relationship between the variables. Therefore, from Eq 10 is deduced the following relationship: This is computed and the results are given in distinct tables. Table 14a shows result of regression between variables at level, that is long term relationship. The dependent variable is P and the explanatory variables are solar irradiance G and a constant term C. When the dependent variable P t is regressed on G t the following regression is obtained P t = 0,9398G t + 16,6755 + e t (11b) So we focus on stationarity or not of the residual of Table 14c. The residual is noted as ResidCoint.
Since the computed t value (-14.677) is very significant and much more negative than -2.5713 [30,33,36], our conclusion is that the residuals from the regression of P on G are stationary.
Hence, Eq 17b is a cointegrating equation and is called the long run function and interprets its parameters as long run parameters. Thus, 0.9398 represents the long-run or equilibrium. Obviously, in the short run there may be disequilibrium. Therefore, the error term in Eq 11a must be processed as the equilibrium error and this error term is used to link the short-run behavior of PCE to its longrun value.  The error correction mechanism (ECM) which is the third step of the EG test as described in section 2.4 is applied and the similar Eq to (3b) is given as follows: ∆ P t = α 0 + α∆G t + βe t-1 + ε t (12) Where, ∆ as usual denotes the first difference operator, ε t is a random error term, and e t-1 = (P t-1 -G t-1 -C ) (13) that is the one lagged value of the error from the cointegrating regression of Eq 11b. ECM equation, that is Eq 12, states that ∆P depends on ∆G and also on the equilibrium error term e t-1 . If the latter is non zero, then the model is out of equilibrium. Suppose ∆G is zero and e t-1 is positive this means ∆P t-1 is too high to be in equilibrium. Since β is expected to be negative the term βe t-1 is negative and therefore ∆P t will be negative to restore the equilibrium. That is, if P t is above its equilibrium value it will start falling in the next period to correct the equilibrium error. As well as if e t-1 is negative that is P is below its equilibrium value βe t-1 will be positive which will cause ∆P t to be positive leading P t to rise in period t. Thus, the absolute value of β decides how quickly the equilibrium is restored.
Eq 13 is computed the corresponding results are given in Table 15. The coefficient β tells us at what rate it corrects the previous period disequilibrium of the system. When β is significant and contains a negative sign it validates that there exists a long run equilibrium relationship among the variables. The coefficient value β of ResidCoint (-1) or (e t-1 ) = -0.7255 that validate the long run relationship between P and G.
Therefore, from Table 19 we can deduce that the model is very significant as the R squared value is very high thus the Fisher statistic value is high. The two explanatory variables that is, first difference stationary solar irradiance G, ResidCont lagged one (e t-1 ) are very significant whereas the constant term C = α 0 is not significant as indicated by its p-value.
Still, we plot the residual graph and this is displayed in Figure 7. The colored curves have the same defintion as defined in Figure 3 and where two outliers are identified respectively at position 320 and 357. Thus we applied the EG test for difference regression and the dependent variable is ∆P.
Results are given in Table 16 for the coefficient of the equilibrium error term and the short run term (coefficient of ∆G). The R-squared value is very significant as well as the DW statistic close to 2 there is no more heteroscedasticity as represented in Figure 8. Thus, the relationship between the variables is given as ∆P = 0.966∆G -0.601(P t-1 -0.939G t-1 -16.67) (14) where ∆P = P t -P t-1 and ∆G = (G t -G t-1 ) and by substituting ∆P and ∆G in Eq 14 the final relationship for P t is written as follows P t = 0,966 G t -0.402G t-1 +1,601 P t-1 + 10.019 (15) Hence, Eq 15 is the full equation not only with variables P and G in difference but also with P and G with one period lagged.

Conclusion and future work
It is difficult to effectively assess the impact of PV output variability on the power grid stability without a clear understanding of the factors that influence this variability. Many factors can become redundant in the presence of other factors and their value in the forecasting framework may vary under different prediction horizons and error measurements. One of the objectives of this paper consisted of applying a statistical method of time series data to identify the most important on-site climatic and environmental parameters that influence PV output variability. Various estimation technique to estimate a linear relationship between PV variables have been proposed in the literature but these OLS regression techniques had lead to spurious results as these techniques were based on the assumption that the variables involved are stationary. Regressions involving time series data include the possibility of obtaining spurious or dubious results in the sense that superficially results look good but on further probing they look suspect.
The originality of this paper is that a technique of cointegration has been introduced known as the EG method according to which models containing nonstationary stochastic variables have been constructed in such a way that the results are statistically meaningful. Through this technique a linear relationship is obtained including not only the dependent variable P(t) and explanatory variable G k (t), but also their one time lagged variables P(t-1) and G k (t-1), thus a temporal link between the variables in small samples, that is daily mean data upon one year.
For that, the EG method requires testing whether variables are integrated of the same order. This is done using the ADF unit root test. We then used the error correction model to test whether the residuals of the long run relationship between P and G are stationary. All these conditions are satisfied and the two variables under investigation cointegrate.
The advantage of the Engle-Granger method over the other techniques is its ease of implementation. However, its results are dependent on how the long-run equilibrium equation is specified. In some cases, it might not be easy to identify which variable enters as the dependent variable.
This study was conducted in a temperate region and temperature had been eliminated in the final PV relationship. However, the latter cannot be readily adopted by other countries as both the methodologies employed depend on several site-dependent features. The PV output also is heavily reliant on numerous local meteorological and environmental factors. Therefore, future work has been proposed in a South West Indian Ocean project to improve this PV model.