Towards Multiple Regression Analyses for Relationships of Air Quality and Weather

Multiple regression is a common data analytic approach applied in many areas. The objective is to learn the regression coefficients, which can then be used to evaluate statistically significant relationships between variables with statistical tests and to predict future values on the basis of those relationships. Sample data of air quality and weather in Hong Kong are analyzed. The pilot study shows the factors that are statistically significant for the pollutant PM2.5.


I. INTRODUCTION
Regression is a curve fitting technique to model the distribution of data. By minimizing the residual error, i.e. the sum of squared residuals, the best curve is found by "learning" the most appropriate parameters, i.e. the regression coefficients. Among the various regression techniques, multiple linear regression is the most common one and is the focus of this paper.
Air quality analysis is essential for human health and for economic and environmental development. Multiple regression is one of the common methods used to analyze air quality, for example [1][2][3]. Most papers focus on discussing the findings obtained by regression methods and usually do not include the related equations. As a tutorial paper, this paper concisely presents the equations and demonstrates the steps of using multiple regression to analyze month-scale data of air quality [4] and weather [5] in Hong Kong in Sep 2016. The basic steps of the regression analysis are as follows. Firstly, the data are prepared, cleaned and, if needed, normalized. Secondly, the estimated regression coefficients of a regression model are calculated. Thirdly, the R related values and the F-test for the regression model are calculated. Fourthly, the t tests for the regression coefficients are computed. Fifthly, according to the significance of each coefficient, the regression model may be refined by deleting and adding variables. The second to the fourth steps are repeated, and the ANOVA with F test of the old and new models is calculated, until all tests are statistically significant. A regression model with statistical significance and/or good R related values is finally produced. The related concepts and functions, details of which can be found in the recommended literature [6][7][8][9][10], are concisely presented in this paper with some modification of the notation.
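As a rough illustration of the first step (data preparation), the sketch below drops incomplete records and standardizes a column to z-scores. The paper does not state which normalization is used, so z-scoring is only an assumed choice, and the helper names `clean_rows` and `standardize` are hypothetical.

```python
import statistics


def clean_rows(rows):
    """Drop records that contain a missing value (represented here as None)."""
    return [row for row in rows if all(v is not None for v in row)]


def standardize(column):
    """Return z-scores: subtract the mean and divide by the sample standard deviation."""
    mean = statistics.mean(column)
    std = statistics.stdev(column)  # sample standard deviation (n - 1 denominator)
    if std == 0:
        raise ValueError("a constant column cannot be standardized")
    return [(v - mean) / std for v in column]
```

Standardizing the input variables changes the scale of the regression coefficients but not the t or F tests of the slopes, so it is optional for the significance analysis.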
The rest of the paper is organized as follows. Section 2 presents the formulation of the multiple regression model. Section 3 presents the statistics for regression analysis. Section 4 demonstrates the pilot test of regression analysis of the linear relationship among air quality and weather factors in Hong Kong. Section 5 concludes this work and outlines future study.

II. MULTIPLE REGRESSION MODEL
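For a data set, $X_j = (x_{1j}, \ldots, x_{ij}, \ldots, x_{nj})^T$, $j = 1, \ldots, m$, is a vector of values for input variable $j$ (also called independent variable, regressor, predictor, explanatory variable, factor or feature), where $n$ is the number of individuals and $m$ is the number of input variables. $Y = (y_1, \ldots, y_n)^T$ is the vector of observed values of the dependent variable. The model value $\hat{y}_i$ for individual $i$ is a linear combination of the input variables:

$$\hat{y}_i = w_0 + \sum_{j=1}^{m} w_j x_{ij}, \quad i = 1, \ldots, n,$$

where $w_0$ is the intercept and $w_j$ is the regression coefficient of input variable $j$. Let $W = (w_0, w_1, \ldots, w_m)^T$, and let $X$ be the $n \times (m+1)$ matrix formed by a column of ones ($x_{i0} = 1$) and all input variables $X_j$.

In matrix form, $\hat{Y}$ is used to estimate the true $Y$ and is defined as below.

$$\hat{Y} = XW$$

Since there are residual errors $e = (e_1, \ldots, e_n)^T$, that is

$$Y = XW + e, \quad e = Y - \hat{Y}.$$

To find $\hat{W}$, the residual error sum of squares is minimized.

$$\hat{W} = \arg\min_{W} \sum_{i=1}^{n} e_i^2 = \arg\min_{W} (Y - XW)^T (Y - XW)$$

For an ideal regression model, $e_i$ is zero and $\hat{W}$ is used to estimate $W$, i.e. $W \approx \hat{W}$, and the closed form is

$$\hat{W} = (X^T X)^{-1} X^T Y. \tag{8}$$

If the matrix rank of $X$, i.e. $\mathrm{Rank}(X)$, is equal to $m+1$ (or $\det(X^T X) \neq 0$), $X^T X$ is invertible and $W$ has a unique solution. The QR decomposition is one of the common approaches to compute the inverse matrix. $\hat{W}$ is used to calculate the model values as below.

$$\hat{Y} = X\hat{W}$$

Therefore $\hat{Y}$ is used to estimate $Y$ with some estimation error. To measure the fit of multiple regression models, some statistical methods are used and presented in the next section.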
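As a minimal sketch of the estimation step, the following Python code solves the least-squares problem through a QR decomposition built with modified Gram-Schmidt, using only the standard library. The function names (`add_intercept`, `qr_least_squares`, `predict`) and the rank tolerance `1e-12` are illustrative assumptions and are not taken from the original analysis.

```python
import math


def add_intercept(rows):
    """Prepend the constant 1 to every record so that w[0] is the intercept."""
    return [[1.0] + [float(v) for v in row] for row in rows]


def qr_least_squares(X, y):
    """Estimate W by minimizing ||y - X w||^2 with X = QR (modified Gram-Schmidt)."""
    n, k = len(X), len(X[0])
    # Q holds the columns of X and is orthonormalized column by column.
    Q = [[X[i][j] for i in range(n)] for j in range(k)]
    R = [[0.0] * k for _ in range(k)]
    for j in range(k):
        norm = math.sqrt(sum(v * v for v in Q[j]))
        if norm < 1e-12:  # assumed tolerance for detecting rank deficiency
            raise ValueError("X does not have full column rank")
        R[j][j] = norm
        Q[j] = [v / norm for v in Q[j]]
        for l in range(j + 1, k):
            R[j][l] = sum(a * b for a, b in zip(Q[j], Q[l]))
            Q[l] = [b - R[j][l] * a for a, b in zip(Q[j], Q[l])]
    # Solve R w = Q^T y by back substitution.
    qty = [sum(q * yi for q, yi in zip(Q[j], y)) for j in range(k)]
    w = [0.0] * k
    for j in reversed(range(k)):
        s = qty[j] - sum(R[j][l] * w[l] for l in range(j + 1, k))
        w[j] = s / R[j][j]
    return w


def predict(X, w):
    """Model values: each entry is the inner product of a record with w."""
    return [sum(a * b for a, b in zip(row, w)) for row in X]
```

With a full-rank $X$ this returns the same coefficients as the closed form, while avoiding the explicit inverse of $X^T X$.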

A. Multiple Coefficient of Determination
The Squared Multiple Correlation Coefficient ($R^2$) for a linear regression model is the ratio of the variance of the model values to the variance of the observed values of the dependent variable. In other words, $R^2$ measures the portion of variability of the data explained by the regression model and has the form below.

$$R^2 = \frac{SSR}{SST} = 1 - \frac{SSE}{SST}$$

$\bar{Y}$ is the mean of $Y$. The square root of $R^2$ is the multiple correlation coefficient, $R$, which represents the linear correlation between the observed values ($Y$) and the regression model values ($\hat{Y}$). The Regression Sum of Squares (SSR), or Model Sum of Squares (SSM), has the form below.

$$SSR = \sum_{i=1}^{n} (\hat{y}_i - \bar{Y})^2$$

The Error Sum of Squares (SSE), or Residual Sum of Squares, has the form below.

$$SSE = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = \sum_{i=1}^{n} e_i^2$$

The Total Sum of Squares (SST) below is the sum of SSR and SSE.

$$SST = \sum_{i=1}^{n} (y_i - \bar{Y})^2 = SSR + SSE$$

For a sample of small size, $R^2$ can be adjusted to offset bias and has the form below.

$$R^2_{adj} = 1 - (1 - R^2)\,\frac{n - 1}{n - m - 1}$$

When the sample size $n$ is larger, $R^2_{adj}$ is closer to $R^2$.

To show $H_a$ for the regression model, $H_0$ should be tested and rejected at significance level $\alpha$ by the F statistic.

$$H_0: w_1 = w_2 = \cdots = w_m = 0, \qquad H_a: w_j \neq 0 \text{ for at least one } j \tag{15}$$

The F value for the hypothesis test in form (15) is the ratio of the Regression Mean Squares (MSR) to the Mean Squares Error (MSE). MSR is the quotient of SSR and its degrees of freedom $m$ ($df_1 = m$), whilst MSE is the quotient of SSE and its degrees of freedom $n - m - 1$ ($df_2 = n - m - 1$).

$$F = \frac{MSR}{MSE} = \frac{SSR / df_1}{SSE / df_2}$$

The p value of the ratio $F$ from the F distribution $D_F$, taken here as the cumulative distribution function, can be obtained as below.

$$p = 1 - D_F(F;\, df_1, df_2)$$
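The quantities of this section can be sketched in standard-library Python as follows. The upper tail of the F distribution is evaluated through the regularized incomplete beta function with a continued-fraction expansion, which is a choice made for this sketch and not something stated in the paper. All names (`betainc`, `f_sf`, `fit_statistics`) are hypothetical, and the code assumes SSE > 0 and a least-squares fit with an intercept.

```python
import math


def _betacf(a, b, x, max_iter=300, eps=1e-14):
    """Continued fraction of the incomplete beta function (modified Lentz method)."""
    tiny = 1e-300
    qab, qap, qam = a + b, a + 1.0, a - 1.0
    c = 1.0
    d = 1.0 - qab * x / qap
    d = 1.0 / (d if abs(d) > tiny else tiny)
    h = d
    for m in range(1, max_iter + 1):
        m2 = 2 * m
        for aa in (m * (b - m) * x / ((qam + m2) * (a + m2)),
                   -(a + m) * (qab + m) * x / ((a + m2) * (qap + m2))):
            d = 1.0 + aa * d
            d = 1.0 / (d if abs(d) > tiny else tiny)
            c = 1.0 + aa / c
            if abs(c) < tiny:
                c = tiny
            delta = d * c
            h *= delta
        if abs(delta - 1.0) < eps:
            break
    return h


def betainc(a, b, x):
    """Regularized incomplete beta function I_x(a, b)."""
    if x <= 0.0:
        return 0.0
    if x >= 1.0:
        return 1.0
    bt = math.exp(math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
                  + a * math.log(x) + b * math.log(1.0 - x))
    if x < (a + 1.0) / (a + b + 2.0):
        return bt * _betacf(a, b, x) / a
    return 1.0 - bt * _betacf(b, a, 1.0 - x) / b


def f_sf(f, df1, df2):
    """Upper tail P(F_{df1, df2} > f), i.e. the p value of an F statistic."""
    if f <= 0.0:
        return 1.0
    return betainc(df2 / 2.0, df1 / 2.0, df2 / (df2 + df1 * f))


def fit_statistics(y, y_hat, m):
    """Sums of squares, R^2, adjusted R^2 and the overall F test for m input variables."""
    n = len(y)
    y_bar = sum(y) / n
    ssr = sum((yh - y_bar) ** 2 for yh in y_hat)
    sse = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))
    sst = sum((yi - y_bar) ** 2 for yi in y)
    r2 = ssr / sst
    df1, df2 = m, n - m - 1
    f_value = (ssr / df1) / (sse / df2)
    return {"SSR": ssr, "SSE": sse, "SST": sst, "R2": r2,
            "adj_R2": 1.0 - (1.0 - r2) * (n - 1) / (n - m - 1),
            "F": f_value, "df1": df1, "df2": df2, "p": f_sf(f_value, df1, df2)}
```

For example, with one input variable, observed values `[2, 4, 5, 4, 5]` and fitted values `[2.8, 3.4, 4.0, 4.6, 5.2]`, the sums of squares are SSR = 3.6, SSE = 2.4 and SST = 6, so $R^2 = 0.6$ and $F = 4.5$.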

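C. Hypothesis and Significance Tests for Regression Coefficients
The significance test for regression coefficients is used to test whether there is a statistically significant relationship between the dependent variable $Y$ and the intercept $w_0$ and each independent variable $x_j$ ($j = 1, 2, \ldots, m$). The following hypotheses are established.

$$H_0: w_j = 0, \qquad H_a: w_j \neq 0$$

The t values for $\hat{W}$ are given as follows.

$$t_{w_j} = \frac{\hat{w}_j}{SE(\hat{w}_j)}, \qquad SE(\hat{w}_j) = \sqrt{MSE \cdot \left[(X^T X)^{-1}\right]_{jj}} \tag{24}$$

The p value of $t_W$ is obtained from the t-distribution $D_t$ with $n - m - 1$ degrees of freedom:

$$p_{w_j} = 2\left(1 - D_t\!\left(|t_{w_j}|;\, n - m - 1\right)\right) \tag{25}$$

After substitution of Eqs. (20) and (21), the p value of each coefficient is obtained. If the p value is not greater than the significance level, i.e. $p_{w_j} \le \alpha$ (e.g. $\alpha = 0.05$), we reject the null hypothesis $H_0$ that $w_j = 0$.
In other words, there is a significant relationship between $x_j$ and $Y$ in the linear regression model. The input variable(s) without statistical significance should be removed to form a better regression model.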
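The coefficient tests can be sketched in standard-library Python as below. The coefficients are obtained here from the normal equations with a Gauss-Jordan inverse, because the diagonal of $(X^T X)^{-1}$ is needed for the standard errors. The two-sided p value is computed by integrating the Student t density with Simpson's rule, which is a simplification chosen for this sketch and cannot resolve extremely small p values. The names `invert`, `t_two_sided_p` and `coefficient_t_tests` are hypothetical.

```python
import math


def invert(a):
    """Gauss-Jordan inverse of a small square matrix with partial pivoting."""
    k = len(a)
    aug = [[float(v) for v in row] + [float(i == j) for j in range(k)]
           for i, row in enumerate(a)]
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:  # assumed singularity tolerance
            raise ValueError("matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(k):
            if r != col:
                factor = aug[r][col]
                aug[r] = [rv - factor * cv for rv, cv in zip(aug[r], aug[col])]
    return [row[k:] for row in aug]


def t_two_sided_p(t, df, steps=2000):
    """Two-sided p value: 1 - 2 * integral of the t density over [0, |t|] (Simpson)."""
    const = math.exp(math.lgamma((df + 1) / 2) - math.lgamma(df / 2)) / math.sqrt(df * math.pi)

    def pdf(u):
        return const * (1 + u * u / df) ** (-(df + 1) / 2)

    b = abs(t)
    h = b / steps  # steps must be even for Simpson's rule
    total = pdf(0.0) + pdf(b)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * pdf(i * h)
    return max(0.0, 1.0 - 2.0 * total * h / 3.0)


def coefficient_t_tests(X, y):
    """Return (w_j, SE_j, t_j, p_j) for each column of X; X must include the ones column."""
    n, k = len(X), len(X[0])  # k = m + 1
    xtx = [[sum(X[i][a] * X[i][b] for i in range(n)) for b in range(k)] for a in range(k)]
    xty = [sum(X[i][a] * y[i] for i in range(n)) for a in range(k)]
    c = invert(xtx)
    w = [sum(c[a][b] * xty[b] for b in range(k)) for a in range(k)]
    sse = sum((y[i] - sum(X[i][a] * w[a] for a in range(k))) ** 2 for i in range(n))
    df = n - k  # n - m - 1
    mse = sse / df
    results = []
    for j in range(k):
        se = math.sqrt(mse * c[j][j])
        t = w[j] / se
        results.append((w[j], se, t, t_two_sided_p(t, df)))
    return results
```

A coefficient whose p value exceeds the chosen significance level would then be a candidate for removal in the next round.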

D. ANOVA F-test for a pair of models
By using ANOVA, variable selection can be evaluated by the steps below.
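1. Evaluate SSE for the "full" regression model with more input variables, i.e. $SSE^{+}$;
2. Evaluate SSE for the reduced (or restricted) model with fewer input variables, i.e. $SSE^{-}$, on the basis of the "full" model in the previous step;
3. Decide if some variables should be removed by

$$F' = \frac{(SSE^{-} - SSE^{+}) / (m^{+} - m^{-})}{SSE^{+} / (n - m^{+} - 1)}, \tag{28}$$

where $m^{+}$ and $m^{-}$ are the numbers of input variables in the full and reduced models. The p value of $F'$ can be obtained from the F distribution function $D_F$ as below.

$$p = 1 - D_F\!\left(F';\, m^{+} - m^{-},\, n - m^{+} - 1\right) \tag{29}$$

If $p \le \alpha$, $H_0$ (under which the additional variables of the full model have zero coefficients) should be rejected, which means the full model is favored. Otherwise, the reduced model is favored, i.e. $H_0$ is not rejected.
4. Repeat steps 1 to 3 until a good enough and simpler regression model without "unrelated" variables is found.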
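The model comparison in step 3 can be sketched as below. To keep the example short, the decision is made by comparing the statistic with a critical value taken from an F table, which is equivalent to comparing the p value with the significance level. The function names and the argument `f_critical` are hypothetical, and the numbers in the usage note are invented for illustration only.

```python
def partial_f_statistic(sse_reduced, sse_full, m_reduced, m_full, n):
    """Return (F, df1, df2) for a reduced model nested in a full model."""
    if m_full <= m_reduced:
        raise ValueError("the full model must have more input variables")
    df1 = m_full - m_reduced      # number of variables removed
    df2 = n - m_full - 1          # residual degrees of freedom of the full model
    if df2 <= 0:
        raise ValueError("not enough observations for the full model")
    f_value = ((sse_reduced - sse_full) / df1) / (sse_full / df2)
    return f_value, df1, df2


def prefer_full_model(f_value, f_critical):
    """True when the removed variables explain a significant share of the error."""
    return f_value > f_critical
```

For instance, with illustrative values `sse_reduced=12.0`, `sse_full=10.0`, `m_reduced=1`, `m_full=3` and `n=24`, the statistic is 2.0 with (2, 20) degrees of freedom. The tabulated 5% critical value for these degrees of freedom is about 3.49, so the reduced model would be kept.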

IV. ANALYSIS OF AIR QUALITY
The air pollutant data were obtained from [4] and the weather data were obtained from [5]. For the pilot test, and given the availability of the data, the air quality data from the Yuen Long station in Hong Kong in Sep 2016 are chosen for the analysis. The data are presented in Appendix A.
To analyze the factors related to PM2.5 (or Fine Suspended Particulates), 10 factors are listed in order as follows: Carbon Monoxide (CO), Nitrogen Dioxide (NO2), Nitrogen Oxides (NOX), Ozone (O3), PM10 or Respirable Suspended Particulates (RSP), Sulphur Dioxide (SO2), Mean Air Temperature, Mean Relative Humidity, Mean Amount of Cloud, Prevailing Wind Degree and Wind Speed.
The regression coefficients $\hat{W}$ are calculated by Eq. 8 and the results are presented in Table I.

Figure 1. PM2.5 values and their predicted values with 10 and 6 factors in Sep 2016

TABLE I
After the t test for each coefficient is performed, some factors should be removed. The SE and t values calculated by Eq. 24 and the p values for $\hat{W}$ calculated by Eqs. 25 and 27 are presented in Table I. As p values less than 0.05 provide strong evidence against the null hypothesis, the 5 factors that do not meet this threshold, NO2, NOX, SO2, Cloud and Wind Speed, should be removed. O3, whose p value is marginal, is kept for the next round of analysis after the 5 factors are removed.
The factors are re-indexed and the regression coefficients and related statistics are shown in Table II. The R related values and F tests are presented in Table III. Table III also shows the ANOVA analysis with F statistics calculated by Eqs. 28 and 29 for both regression models.

TABLE III. R RELATED VALUES, F TESTS AND ANOVA RESULTS

This paper demonstrates the use of multiple regression for the factors related to PM2.5 with one month of air quality and weather data in Hong Kong. There are several directions for future study. More data should be analyzed, i.e. data over a longer period and data from more stations. More methods will be evaluated for feature selection. The extended study will use and compare other regression methods.