Comparative Variance and Multiple Imputation Used for Missing Values in Land Price DataSet

Abstract: Based on a two-dimensional relational table, this paper studies missing values in sample land price data for Shunde District, Foshan City. GeoDa software was used to eliminate insignificant factors by stepwise regression analysis; NORM software was used to construct the multiple imputation models; and the EM algorithm and data augmentation were applied to fit multiple linear regression equations, producing five different imputed datasets. The mean and variance of each imputed dataset are calculated, and a weight for each dataset is determined from the difference between its variance and the integrated variance. The weighted datasets are then combined to give the imputed value of each missing entry. Three missing-data cases were tested: PRICE missing at a rate of 5%, PRICE missing at a rate of 10%, and PRICE and CBD both missing. In these cases, the proportion of estimates closer to the true value under the new method versus traditional multiple imputation was 75% to 25%, 62.5% to 37.5%, and 100% to 0%, respectively. The new method therefore outperforms traditional multiple imputation in these experiments, and the missing values it estimates have reference value.


Introduction
The advent of the big data era has made missing data a more serious problem in data collection, transmission, storage, and processing. Usually, missing values are either deleted outright or replaced by simple approximations [Mallinckrodt, Roger and Chuang (2014)]. These approaches have serious limitations and cannot guarantee that the substituted values reflect the true ones. Missing data not only discards valuable information but also seriously hampers subsequent data mining.

Content and technology
Based on a two-dimensional relational table, this paper studies missing values in sample land price data for Shunde District, Foshan City. Combining the ideas of multiple imputation and probability models [Papadopoulos, Papadopoulos and Sager (2016)], it proposes a new variance-based imputation method for determining the missing values, which to a certain extent improves the data quality of the relevant two-dimensional relational tables. First, stepwise regression analysis in GeoDa is used to eliminate the insignificant factors. Then, according to the correlation between the land price data and the attribute values of the factors affecting land price [Jeong, Koh and Park (2010)], NORM software is used to construct the multiple imputation models, with the EM algorithm and data augmentation used to fit multiple linear regression equations [Horton and Kleinman (2007)]. On this basis, the imputed datasets are analyzed, and the weight of each dataset is determined from the difference between its individual variance and the integrated variance, which yields the imputed value of each missing entry. The technical route is as follows:
(1) Randomly remove 5%-10% of the land price data and keep the removed values as clean values, determine the missing mechanism and variable type of the remaining data, find the other variables related to the variable with missing values [Pan, Li and Zhang (2012)], and eliminate the insignificant factors step by step using stepwise regression analysis in GeoDa.
(2) Determine the number of imputations, preferably 3 to 5. NORM software is used to obtain a number of different linear regression equations between the variable with missing values and the other variables.
(3) From the different linear regression equations, different imputed datasets are calculated and the multiple imputation model is constructed [Dorigato, Pegoretti and Penati (2010)]. In this way, the set of imputed datasets is created.
(4) Statistical analysis is performed on the imputed datasets: the mean and variance of each imputed dataset are calculated and then integrated.
Mean:
$$M = \frac{1}{K}\sum_{k=1}^{K} m_k$$
where $M$ is the integrated mean, $K$ is the number of imputed datasets, and $m_k$ is the mean calculated from the $k$-th imputed dataset.
Variance, written here in the standard multiple-imputation combining form:
$$W = \frac{1}{K}\sum_{k=1}^{K} W_k, \qquad B = \frac{1}{K-1}\sum_{k=1}^{K}\left(m_k - M\right)^2, \qquad T = W + \left(1 + \frac{1}{K}\right)B$$
where $W_k$ is the variance calculated from the $k$-th imputed dataset, $W$ is the average of these variances (the within-imputation variance), $B$ is the between-imputation variance, which indicates the uncertainty of the point data between the datasets, and $T$ is the integrated variance [Xie, Qin, Xiang et al. (2018)].
(5) Determine the weight of each imputed dataset from the difference $|T - W_k|$, which indicates how close $W_k$ is to $T$. Here $K$ is the number of imputed datasets [Qin, Li, Xiang et al. (2019)], $W_k$ is the variance calculated from the $k$-th imputed dataset, $T$ is the integrated variance, $Q = \sum_{k=1}^{K} |T - W_k|$ is the sum of the distances between the variances $W_k$ of the imputed datasets and the integrated variance $T$ [Sarnak, Zhao and Woodbury (2013)], and $\partial_k$ is the resulting weight coefficient of the $k$-th dataset.
(6) The imputed estimate of each missing point is determined from the weight coefficients. The missing value at each point is approximated by the sum, over the imputed datasets, of the point value in each dataset multiplied by the corresponding weight coefficient:
$$\hat{x} \approx \sum_{k=1}^{K} \partial_k \, x_k$$
where $x_k$ is the value of the point in the $k$-th imputed dataset.
(7) Finally, the estimated values are compared with the clean values, and conclusions are drawn from the experimental results.
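The integration in step (4) can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation. It assumes that each imputed dataset is reduced to a one-dimensional array of the imputed variable, that $W_k$ is the sample variance of that array, and that the integrated variance takes the standard combining form $T = W + (1 + 1/K)B$. The function name `pool_imputations` and the input values are hypothetical.

```python
import numpy as np

def pool_imputations(datasets):
    """Integrate the means and variances of K imputed datasets.

    Assumption: W_k is the sample variance of each dataset and
    T = W + (1 + 1/K) * B (standard combining form).
    """
    data = [np.asarray(d, dtype=float) for d in datasets]
    K = len(data)
    m = np.array([d.mean() for d in data])         # m_k: mean of each dataset
    W_k = np.array([d.var(ddof=1) for d in data])  # W_k: variance of each dataset
    M = m.mean()                                   # integrated mean
    W = W_k.mean()                                 # within-imputation variance
    B = ((m - M) ** 2).sum() / (K - 1)             # between-imputation variance
    T = W + (1 + 1 / K) * B                        # integrated variance
    return {"m": m, "W_k": W_k, "M": M, "W": W, "B": B, "T": T}

# Hypothetical toy input: three imputed datasets of three values each.
pooled = pool_imputations([[1, 2, 3], [2, 3, 4], [3, 4, 5]])
print(pooled["M"], pooled["W"], pooled["B"], pooled["T"])
```

For the toy input, the three dataset means are 2, 3 and 4 and each dataset has variance 1, so $M = 3$, $W = 1$, $B = 1$ and $T = 1 + 4/3 \approx 2.33$.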

Data source
This study uses 2013 residential land price sample data from Shunde District, Foshan City. The land price and real estate information are real estate transaction data provided by the Shunde District Land and Resources Bureau. The location-related data, such as the distances associated with each land price record, are based on questionnaires and field research [Li, Zhao, Du et al. (2016)]: coordinates were obtained by looking up the name of each property in Google Maps, and the distances were then calculated by geographic information spatial analysis. The dataset is organized as a two-dimensional relational table composed of the land price data and the factors affecting land price. Its fields include land price, sampling point number, address, coordinates, distance to the city center, distance to the park, population density, distance to the school, distance to the mountain, distance to the bus stop, distance to the hospital [Hanula, Mayfield and Reid (2016)], distance to the main road, distance to the highway, distance to the river, number of floors, base area, and building floor area, for a total of 29 attribute fields and 713 records. Because the dataset was collected and calculated in the field rather than taken from a standard test dataset, it contains certain errors and its quality is limited, which has a considerable impact on the final accuracy.

The specific implementation of multiple imputation
3.2.1 Elimination of insignificant factors by stepwise regression analysis using GeoDa software
An ordinary least squares (OLS) regression was run first; its output gives the Multicollinearity Condition Number, the LM-lag and LM-error statistics, and their p values. With all factors included, the results were: Multicollinearity Condition Number=48.118, LM-lag=43.57, LM-error=24.47, P(LM-lag)=0.0000000, P(LM-error)=0.0000008. The Multicollinearity Condition Number of 48.12 is greater than 30, which indicates multicollinearity among the independent variables [Hagen and Mohl (2016)], so variables must be eliminated step by step to resolve it. Further experiments showed that after eliminating the three independent variables river, population density, and school, the Multicollinearity Condition Number is 28.24, which is less than 30 [Reddy (2011)]. This basically meets the requirement of the multicollinearity test, so the remaining independent variables can be analyzed further. For this reduced model, P(LM-lag)=0.0000000 and P(LM-error)=0.0000002. Because the p values of LM-lag and LM-error are both less than 0.01, both the spatial lag model (SLM) and the spatial error model (SEM) are statistically highly significant. For the robust tests, P(Robust-lag)=0.0000048 and P(Robust-error)=0.2510494, so Robust-lag is significant while Robust-error is not. In addition, LM-lag=46.32 is larger than LM-error=26.71, and Robust-lag=20.92 is larger than Robust-error=1.32. The spatial lag model (SLM) was therefore selected for estimation. The results of the SLM regression analysis are shown in Tab. 1.
At the 0.05 significance level, the p values in the regression results [Kuznetsova, Brockhoff and Christensen (2017)] show that the four independent variables CBD, PARK, MOUNTAIN, and FREEWAY have p values very close to 0.05, indicating a notable influence on land price. HOSPITAL and HIGHSTRE, however, are not significant and can be eliminated in further regression analysis to obtain the optimal spatial autocorrelation model. After removing HOSPITAL and HIGHSTRE, the results of the further regression analysis are shown in Tab. 2. The p values of CBD and FREEWAY are less than 0.05, which is significant; the p values of PARK and MOUNTAIN are slightly larger than 0.05, indicating some influence on land price, although the significance is not strong; the remaining independent variables were eliminated in the stepwise regression because their influence on land price is not significant. Through this stepwise elimination, the four independent variables CBD, FREEWAY, PARK, and MOUNTAIN were selected as the significant factors related to land price, in preparation for building the optimal imputation model for the land price data.
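The multicollinearity check used above can be illustrated with a short Python sketch. It assumes a Belsley-style condition number: an intercept column is added, every column of the design matrix is scaled to unit length, and the ratio of the largest to the smallest singular value is reported. The exact computation in GeoDa may differ in detail, and the function name and simulated variables are hypothetical.

```python
import numpy as np

def condition_number(X):
    """Condition number of a design matrix with unit-length columns.

    Assumption: Belsley-style scaling with an added intercept column.
    """
    X = np.asarray(X, dtype=float)
    X = np.column_stack([np.ones(len(X)), X])  # add the intercept column
    X = X / np.linalg.norm(X, axis=0)          # scale every column to unit length
    s = np.linalg.svd(X, compute_uv=False)     # singular values, largest first
    return s[0] / s[-1]

# Simulated example: c is almost a copy of a, so adding it causes collinearity.
rng = np.random.default_rng(0)
a = rng.normal(size=200)
b = rng.normal(size=200)
c = a + 1e-3 * rng.normal(size=200)

print(condition_number(np.column_stack([a, b])))     # well below the threshold of 30
print(condition_number(np.column_stack([a, b, c])))  # far above the threshold of 30
```

Dropping variables one at a time and recomputing this value until it falls below 30 mirrors the elimination procedure described in the text.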

Fitting multiple linear regression equations with NORM software to construct a multiple imputation model
3.2.2.1 Determine the number of iterations and perform the operation of the EM algorithm
The EM (Expectation Maximization) algorithm is an iterative algorithm for maximum likelihood estimation or maximum a posteriori estimation of the parameters of a probability model with hidden variables [Wang, Hao, Xu et al. (2018)]. It alternates between two steps. In the expectation step (E step), the current parameter estimates are used to compute the expected log-likelihood over the hidden variables. In the maximization step (M step), this expected log-likelihood is maximized to update the parameter values. The parameter estimates found in the M step are used in the next E step, and the two steps alternate until convergence, which determines the number of iterations. Experiments were performed on three missing-data cases: PRICE missing at a rate of 5%, PRICE missing at a rate of 10%, and PRICE and CBD both missing. A statistical test of the PRICE variable in the three cases shows that it is normally distributed, which satisfies the basic requirement of NORM software for estimating missing data in multivariate normal data. Following the stepwise elimination described above, the four independent variables CBD, FREEWAY, PARK, and MOUNTAIN, which are closely related to PRICE, were selected as the factors affecting the land price data, and the imputation model was constructed. The EM results show that the numbers of iterations in the three cases are 8, 11 and 10, respectively. To ensure that the posterior predictive distribution is independent and stable, the number of iterations of the subsequent data augmentation is generally set to at least 2 to 3 times the number of EM iterations.
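The alternation of E and M steps can be made concrete with a small Python sketch for a multivariate normal model with missing entries. This is not NORM's implementation; it is a minimal version under the assumptions that missing entries are coded as NaN, that every row has at least one observed value, and that convergence is judged by the largest change in the parameters. The function name `em_mvn` is hypothetical.

```python
import numpy as np

def em_mvn(X, n_iter=100, tol=1e-6):
    """EM estimates of the mean and covariance of multivariate normal data.

    Missing entries are NaN. Assumption: every row has an observed value.
    Returns (mean, covariance, number of iterations used).
    """
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    miss = np.isnan(X)
    mu = np.nanmean(X, axis=0)               # start from observed-data moments
    sigma = np.diag(np.nanvar(X, axis=0))
    for it in range(1, n_iter + 1):
        filled = X.copy()
        extra = np.zeros((p, p))             # summed conditional covariances
        # E step: expected sufficient statistics given the current parameters
        for i in range(n):
            m, o = miss[i], ~miss[i]
            if not m.any():
                continue
            coef = sigma[np.ix_(m, o)] @ np.linalg.inv(sigma[np.ix_(o, o)])
            filled[i, m] = mu[m] + coef @ (X[i, o] - mu[o])
            extra[np.ix_(m, m)] += sigma[np.ix_(m, m)] - coef @ sigma[np.ix_(o, m)]
        # M step: update the parameters from the expected statistics
        new_mu = filled.mean(axis=0)
        centered = filled - new_mu
        new_sigma = (centered.T @ centered + extra) / n
        change = max(np.abs(new_mu - mu).max(), np.abs(new_sigma - sigma).max())
        mu, sigma = new_mu, new_sigma
        if change < tol:
            break
    return mu, sigma, it
```

The returned iteration count plays the role of the EM iteration numbers reported in the text, which are then used to choose the length of the data augmentation run.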

Perform data augmentation (DA) operations
Data augmentation (DA) is widely used to deal with non-conjugate models [Xiang, Zhao, Li et al. (2018)]. Its idea comes from the EM algorithm. It assumes that the original model has some implicit structure that can be explored by introducing augmented variables, and that the original model can be recovered by marginalizing the augmented variables out of the augmented model. In this framework, the augmented variables are treated as "augmented data", which makes the inference tractable. The idea was first proposed by Tanner and Wong to simplify the calculation of the maximum likelihood estimate and was then applied to Bayesian posterior inference. For each imputation in the three cases, a linear regression equation between the variable with missing values and the other variables is simulated, and the corresponding imputed dataset is obtained. Because the numbers of EM iterations are 8, 11 and 10, respectively, and the number of DA iterations is set to at least 2 to 3 times that of EM, the number of DA iterations in all three cases was set to 50. The number of imputations is then chosen according to the number of iterations. In the experiment, an interval of 10 was chosen, and an imputed dataset was drawn every 10 iterations. Therefore, five imputed datasets were obtained in each of the three cases.
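The sampling schedule described above (50 iterations, one imputed dataset kept every 10 iterations) can be sketched in Python. The sketch is a deliberate simplification and is not what NORM does internally: it assumes a single incomplete response variable modelled by linear regression on fully observed predictors, with a noninformative prior on the regression parameters. The function name `da_impute` and its parameters are hypothetical.

```python
import numpy as np

def da_impute(X, y, n_iter=50, keep_every=10, seed=0):
    """Data augmentation for y = [1, X] beta + e, with NaN marking missing y.

    Assumptions: X is fully observed; noninformative prior on (beta, sigma^2).
    Returns one completed copy of y for every `keep_every`-th iteration.
    """
    rng = np.random.default_rng(seed)
    y = np.asarray(y, dtype=float)
    X = np.column_stack([np.ones(len(y)), np.asarray(X, dtype=float)])
    miss = np.isnan(y)
    n, p = X.shape
    xtx_inv = np.linalg.inv(X.T @ X)
    y_cur = np.where(miss, np.nanmean(y), y)   # crude starting fill
    kept = []
    for it in range(1, n_iter + 1):
        # P step: draw the parameters from their posterior given completed data
        beta_hat = xtx_inv @ X.T @ y_cur
        resid = y_cur - X @ beta_hat
        sigma2 = resid @ resid / rng.chisquare(n - p)
        beta = rng.multivariate_normal(beta_hat, sigma2 * xtx_inv)
        # I step: draw the missing values given the drawn parameters
        y_cur = y_cur.copy()
        noise = rng.normal(0.0, np.sqrt(sigma2), miss.sum())
        y_cur[miss] = X[miss] @ beta + noise
        if it % keep_every == 0:
            kept.append(y_cur.copy())          # keep every 10th completed dataset
    return kept
```

With the default settings the function returns five completed datasets, taken at iterations 10, 20, 30, 40 and 50. Spacing the retained draws apart is intended to reduce the dependence between successive imputed datasets.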

Calculation of missing values
3.2.3.1 Analyze the imputed datasets in the three missing cases
The mean and variance of each imputed dataset are calculated separately and then integrated. The results for each group of imputed datasets are shown in Tab. 3, where $K$ is the number of imputed datasets, $m_k$ is the mean calculated from the $k$-th imputed dataset, $M$ is the integrated mean, $W_k$ is the variance calculated from the $k$-th imputed dataset, $W$ is the average of these variances, and $T$ is the integrated variance.
It is worth noting that the integrated variance $T$ consists of two parts, $W$ and $B$. $W$ is the within-imputation variance, which reflects the natural variation of the data within the datasets. $B$ is the between-imputation variance, which indicates the uncertainty of the point data between datasets. This decomposition is consistent with the idea described above. Therefore, taking the integrated variance as the reference, the weights are determined by how close the variance of each imputed dataset is to it.
The values of $W_k$, $W$, $B$ and $T$ calculated for the three missing cases are shown in Tab. 4 and Tab. 5:

Determination of the weight value of the padding data set
The difference $|T - W_k|$ indicates how close $W_k$ is to $T$, and this in turn determines the weight.
Here $K$ is the number of imputed datasets, $W_k$ is the variance calculated from the $k$-th imputed dataset, $T$ is the integrated variance, $Q = \sum_{k=1}^{K} |T - W_k|$ is the sum of the distances between the variances $W_k$ of the imputed datasets and the integrated variance $T$, and $\partial_k$ is the resulting weight coefficient.
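A short Python sketch may help to fix ideas. The explicit expression for the weight coefficient is not preserved in the text, so the sketch assumes one plausible normalization, $\partial_k = (Q - |T - W_k|) / ((K - 1)Q)$, which gives larger weights to datasets whose variance is closer to $T$ and makes the weights sum to 1. This formula, the function names, and the input values are assumptions for illustration only and may differ from the authors' definition.

```python
import numpy as np

def variance_weights(W_k, T):
    """Weights from the distances |T - W_k|.

    Assumed form (not confirmed by the source):
    weight_k = (Q - |T - W_k|) / ((K - 1) * Q), with Q the sum of distances.
    """
    d = np.abs(T - np.asarray(W_k, dtype=float))  # distance of each W_k from T
    K = len(d)
    Q = d.sum()                                   # sum of the distances
    if K == 1 or Q == 0:
        return np.full(K, 1.0 / K)                # degenerate case: equal weights
    return (Q - d) / ((K - 1) * Q)

def weighted_estimate(point_values, weights):
    """Imputed value of one point: sum of dataset values times their weights."""
    return float(np.dot(weights, point_values))

# Hypothetical numbers: three dataset variances and an integrated variance.
w = variance_weights([1.0, 2.0, 4.0], T=2.0)
print(w)                                          # weights 1/3, 1/2, 1/6
print(weighted_estimate([10.0, 12.0, 16.0], w))   # 12.0 up to rounding
```

In this toy case the second dataset, whose variance equals $T$, receives the largest weight, and the final estimate is pulled towards its value.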
The calculated coefficients are shown in Tab. 6. The experimental results lead to the following observations:
(1) Once the weight calculation is added to the multiple imputation model, an imputed value can be obtained for each missing value, and overall these estimates are closer to the true values than those of the traditional multiple imputation method.
(2) When only one variable is missing, the imputation quality of the new method declines as the missing rate of the dataset increases, but overall it remains better than the traditional multiple imputation method.
(3) When the missing rate of the dataset stays the same and the number of missing variables increases, the estimates of the new method are still better than those of the traditional multiple imputation method.
(4) Although the new method is better overall than the traditional multiple imputation method, the average difference between its estimated values and the true values is in the range of 100-300, which is still a significant gap. This is mainly because of the real-time nature of the data, the still rough algorithm of the new method, and the limited data quality. With further research and optimization, there is still potential to improve the accuracy of the imputation results.
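The "closer to the true value" proportions used in the comparison can be computed as in the following Python sketch. It assumes a point-by-point comparison in which the new method is counted as closer only when its absolute error is strictly smaller than that of the traditional method. The text does not spell out this procedure, so the function and the numbers below are hypothetical.

```python
def closer_ratio(true_values, new_estimates, traditional_estimates):
    """Share of points where the new estimate is strictly closer to the truth."""
    triples = zip(true_values, new_estimates, traditional_estimates)
    new_wins = sum(abs(new - t) < abs(old - t) for t, new, old in triples)
    return new_wins / len(true_values)

# Hypothetical clean values and the two sets of estimates.
truth = [100, 200, 300, 400]
new = [110, 190, 330, 395]
traditional = [120, 205, 310, 420]
print(closer_ratio(truth, new, traditional))  # 0.5: the new method wins 2 of 4
```

A ratio of 0.75, for example, would correspond to the new method being closer at three out of every four held-out points.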

Conclusion
Combining the ideas of multiple imputation and probability models, this paper proposes a new variance-based imputation method for determining missing values and applies it to the imputation of missing values in the two-dimensional land price sample data of Shunde District, Foshan City. The experimental sample data were analyzed by stepwise regression, and the insignificant factors were eliminated. The four independent variables CBD, FREEWAY, PARK, and MOUNTAIN were selected from the 29 attribute fields as the significant factors affecting the land price data. According to the experimental results (Tab. 7, Tab. 8, Tab. 9), in the three cases the proportion of estimates for which the new method is closer to the true value is 75%, 62.5% and 100%, compared with 25%, 37.5% and 0% for the traditional multiple imputation method. Overall, the new method is therefore superior to the traditional multiple imputation method in these experiments.