Modeling on Financial Revenue via Data Mining

As a new first-tier city and a core force in the central region, Wuhan plays a vital role in the economic development of central China, which motivates our interest in the factors driving its rapid economic growth. In this paper, we select 13 factors related to financial revenue from 2002 to 2018, identify the significant ones with the Adaptive-Lasso algorithm, and then predict future financial revenue by combining the grey model with a neural network model. A real example shows that the combined model offers a feasible way to predict financial revenue.


Introduction
Financial revenue is a comprehensive reflection of the level of the national and regional economy, and the state and the government macro-regulate the market economy on the basis of it. Accurate financial revenue forecasting not only helps the relevant departments formulate fiscal policies effectively, but also provides forward-looking support for strengthening the supervision of financial revenue. To ensure scientific and precise financial management and to optimize the allocation of social resources, effective and accurate predictions of financial revenue are necessary.
For a long time, the main prediction methods for financial revenue have included the grey forecasting method, the stepwise regression method and the co-integration analysis method. Yuan R P et al. [1] used the grey model to predict the financial revenue of Fengtai Science and Technology Park and provided a basis for the park's management decisions. Zhang Q and Dong X [2] used the stepwise regression method to predict and analyze the financial revenue of Wuhan. Cao P P et al. [3] used co-integration analysis to find a long-term equilibrium relationship among total retail sales of social consumer goods, economic growth and financial revenue in Anhui Province. As research deepened, some scholars analyzed the linear correlation between financial revenue and its influencing factors by establishing multiple linear regression models, estimating the parameters by the least squares method and selecting variables by stepwise regression [4][5]. However, both stepwise regression and the least squares method share an obvious shortcoming: they can only reach a local optimum rather than the global optimum. Tibshirani R [6] proposed the Lasso algorithm, which not only reduces the dimension of the data but also allows financial revenue to be predicted more accurately. Some scholars use this method to estimate the parameters: the sum of squared residuals is minimized subject to the constraint that the sum of the absolute values of the regression coefficients is less than a constant. When a variable's coefficient is compressed to 0, the variable drops out of the model, which is equivalent to variable selection [7][8].
As an important city in central China, Wuhan plays a vital role in the field of financial revenue. In this paper, we use the Adaptive-Lasso variable selection method to analyze the influencing factors, and combine the grey forecasting model with a neural network model to establish an effective and feasible financial revenue prediction model from the relevant data for Wuhan over the years. We then predict the financial revenue of Wuhan in 2019 and 2020, which provides a basis for macroeconomic regulation and control of the market economy.

2019 2nd International Symposium on Big Data and Applied Statistics, Journal of Physics: Conference Series 1437 (2020) 012118, IOP Publishing, doi:10.1088/1742-6596/1437/1/012118

Lasso regression
In 1996, Robert Tibshirani proposed Lasso regression, a compression (shrinkage) estimator for the linear regression model. To shrink some coefficients exactly to zero and thereby select variables, it takes the sum of squared residuals as the objective function and adds a penalty term, as shown in (1).
$$\hat{\beta}^{\text{lasso}} = \arg\min_{\beta} \left\| y - \sum_{j=1}^{p} x_j \beta_j \right\|^2 + \lambda \sum_{j=1}^{p} |\beta_j| \tag{1}$$
where $\lambda$ is the non-negative regularization parameter and $\lambda \sum_{j=1}^{p} |\beta_j|$ is the penalty term.
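To see why the penalty term of the Lasso objective sets some coefficients exactly to zero, it may help to look at a special case. The derivation below is only a sketch under the assumption of an orthonormal design ($X^\top X = I$), which is not assumed for the Wuhan data; the symbols $b_j$ (the least squares coefficient $x_j^\top y$) and $(\cdot)_+$ (positive part) are introduced here purely for illustration.

```latex
% Assumption: orthonormal design, X^T X = I, so that b_j = x_j^T y is the least squares estimate.
\begin{aligned}
\Bigl\| y - \sum_{j=1}^{p} x_j \beta_j \Bigr\|^2 + \lambda \sum_{j=1}^{p} |\beta_j|
  &= \|y\|^2 - \sum_{j=1}^{p} b_j^2
     + \sum_{j=1}^{p} \Bigl[ (\beta_j - b_j)^2 + \lambda |\beta_j| \Bigr], \\
\text{so each coordinate is minimized separately:}\quad
\hat{\beta}_j &= \operatorname{sign}(b_j)\,\Bigl( |b_j| - \tfrac{\lambda}{2} \Bigr)_{+}. \\
\text{Example with } \lambda = 2,\ b = (3,\ 0.5,\ -1.5):\quad
\hat{\beta} &= (2,\ 0,\ -0.5).
\end{aligned}
```

In this special case every least squares coefficient is pulled toward zero by $\lambda/2$, and any coefficient smaller than $\lambda/2$ in absolute value becomes exactly zero, which is the variable selection effect described above.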

Variable Selection via Adaptive-Lasso Process
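The Lasso regression of Section 2.1 overcomes the local-optimum shortcoming of the least squares method and stepwise regression, but it is difficult to derive a definite parameter estimate from it. Zou [9] therefore assigns a different weight to each coefficient in the penalty term, as shown in (2).
$$\hat{\beta}^{*} = \arg\min_{\beta} \left\| y - \sum_{j=1}^{p} x_j \beta_j \right\|^2 + \lambda \sum_{j=1}^{p} \hat{w}_j |\beta_j| \tag{2}$$
where $\hat{w}_j = 1 / |\hat{\beta}_j|^{\gamma}$ with $\gamma > 0$, and $\hat{\beta}_j$ is the coefficient of the $j$-th variable from the least squares solution.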
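The weighting idea can be illustrated with a minimal sketch. It assumes the objective $\|y - X\beta\|^2 + \lambda \sum_j \hat{w}_j |\beta_j|$ with weights $\hat{w}_j = 1/|\hat{\beta}_j|^{\gamma}$ taken from a least squares fit, and it uses plain coordinate descent as the solver, which is a choice made here for illustration and not necessarily the solver used in this paper. The function names, the synthetic data, and the values `lam=2.0` and `gamma=1.0` are all hypothetical.

```python
import random


def soft_threshold(rho, t):
    """Shrink rho toward zero by t; return exactly 0.0 inside [-t, t]."""
    if rho > t:
        return rho - t
    if rho < -t:
        return rho + t
    return 0.0


def weighted_lasso(X, y, lam, weights, n_iter=500):
    """Coordinate descent for ||y - X b||^2 + lam * sum_j weights[j] * |b_j|."""
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    resid = list(y)  # residual y - X beta for beta = 0
    col_sq = [sum(X[i][j] ** 2 for i in range(n)) for j in range(p)]
    for _ in range(n_iter):
        for j in range(p):
            # Correlation of column j with the residual that excludes variable j.
            rho = sum(X[i][j] * resid[i] for i in range(n)) + col_sq[j] * beta[j]
            new = soft_threshold(rho, lam * weights[j] / 2.0) / col_sq[j]
            delta = new - beta[j]
            if delta != 0.0:
                for i in range(n):
                    resid[i] -= X[i][j] * delta
                beta[j] = new
    return beta


def adaptive_lasso(X, y, lam, gamma=1.0):
    """Return (least squares coefficients, Adaptive-Lasso coefficients)."""
    p = len(X[0])
    # With lam = 0 the same routine converges to the least squares solution.
    ols = weighted_lasso(X, y, lam=0.0, weights=[1.0] * p)
    # Small constant guards against division by zero (an implementation assumption).
    weights = [1.0 / (abs(b) ** gamma + 1e-12) for b in ols]
    return ols, weighted_lasso(X, y, lam, weights)


# Synthetic data (hypothetical, NOT the Wuhan data): variables 1 and 3 are irrelevant.
random.seed(0)
n = 50
X = [[random.gauss(0, 1) for _ in range(4)] for _ in range(n)]
true_beta = [3.0, 0.0, -2.0, 0.0]
y = [sum(b * v for b, v in zip(true_beta, row)) + random.gauss(0, 0.1) for row in X]

ols, beta = adaptive_lasso(X, y, lam=2.0)
selected = [j for j, b in enumerate(beta) if b != 0.0]
print("least squares:", ols)
print("adaptive lasso:", beta)
print("selected variable indices:", selected)
```

Because a small least squares coefficient produces a large weight, irrelevant variables are penalized heavily and dropped, while large coefficients are shrunk only slightly.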

Combined Model of Grey Prediction and Neural Network
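The grey model is used to predict the selected variables and obtain their predicted values for the next two years; it predicts well even with a small volume of data. The neural network model is trained on historical data, and the grey prediction results are then fed into the trained network to obtain a more accurate forecast.
The GM(1,1) model is established in the following steps. First, suppose the original sequence is $X^{(0)} = \{x^{(0)}(1), x^{(0)}(2), \dots, x^{(0)}(n)\}$, and accumulate it once, $x^{(1)}(k) = \sum_{i=1}^{k} x^{(0)}(i)$, to get the accumulated sequence $X^{(1)} = \{x^{(1)}(1), x^{(1)}(2), \dots, x^{(1)}(n)\}$. Second, establish a first-order linear differential equation for $X^{(1)}$, as in (3).
$$\frac{dX^{(1)}}{dt} + aX^{(1)} = u \tag{3}$$
This first-order linear differential equation is the GM(1,1) model, where $a$ and $u$ are the parameters to be estimated. Solving it gives the prediction model
$$\hat{x}^{(1)}(k+1) = \left( x^{(0)}(1) - \frac{u}{a} \right) e^{-ak} + \frac{u}{a}.$$
The GM(1,1) model thus yields a cumulative quantity, so the predicted value of the original sequence is recovered by differencing: $\hat{x}^{(0)}(k+1) = \hat{x}^{(1)}(k+1) - \hat{x}^{(1)}(k)$.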
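The GM(1,1) steps can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes the usual textbook choices of averaging consecutive accumulated values as background values and estimating the two parameters (called `a` and `u` here) by least squares. The function name `gm11_forecast` and the sample series are hypothetical.

```python
import math


def gm11_forecast(x0, steps):
    """Fit GM(1,1) to the sequence x0 and forecast `steps` further values."""
    n = len(x0)
    # Step 1: accumulate once, x1(k) = x0(1) + ... + x0(k).
    x1 = []
    total = 0.0
    for value in x0:
        total += value
        x1.append(total)
    # Background values: mean of consecutive accumulated values (assumed choice).
    z = [0.5 * (x1[k] + x1[k - 1]) for k in range(1, n)]
    obs = x0[1:]
    # Step 2: least squares fit of x0(k) = -a * z(k) + u via 2x2 normal equations.
    m = len(z)
    s_z = sum(z)
    s_zz = sum(v * v for v in z)
    s_y = sum(obs)
    s_zy = sum(zi * yi for zi, yi in zip(z, obs))
    det = m * s_zz - s_z * s_z
    a = -(m * s_zy - s_z * s_y) / det
    u = (s_zz * s_y - s_z * s_zy) / det

    def x1_hat(k):
        # Solution of dx1/dt + a * x1 = u; index k = 0 is the first observation.
        return (x0[0] - u / a) * math.exp(-a * k) + u / a

    # Step 3: difference the cumulative predictions to recover the original scale.
    return [x1_hat(k) - x1_hat(k - 1) for k in range(n, n + steps)]


# Hypothetical series growing by 10% per period (NOT the Wuhan data).
series = [100 * 1.1 ** k for k in range(6)]
forecast = gm11_forecast(series, steps=2)
print(forecast)
```

For a smoothly growing series such as this one, the two forecasts land close to the values the growth rate implies, which is why the model suits short annual series with few observations.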

Empirical process
Many factors affect financial revenue (y). Considering the relationships between financial revenue and energy consumption, population and consumption, we select 13 influencing factors of financial revenue from 2002 to 2018. The factors are: the number of social employees (x1), total wages of on-post staff and workers (x2), total retail sales of social consumer goods (x3), urban per capita disposable income (x4), urban per capita consumption expenditure (x5), population at year-end (x6), total social investment in fixed assets (x7), regional GDP (x8), primary industry output value (x9), tax (x10), CPI (x11), the ratio of tertiary industry to secondary industry output value (x12) and household consumption level (x13). Taking these as independent variables, we analyze their impact on financial revenue and their internal relationships.

Variable selection based on Adaptive-Lasso
We use the Adaptive-Lasso method to select the variables affecting financial revenue. Table 2 lists the coefficients of the 13 variables; 10 of them are 0, so these 10 factors can be screened out of the model. The selected variables are urban per capita consumption expenditure (x5), total retail sales of social consumer goods (x3) and the number of social employees (x1). Table 3 shows that the number of social employees (x1), total wages of on-post staff and workers (x2), urban per capita consumption expenditure (x5), total social investment in fixed assets (x7) and regional GDP (x8) are screened out. In the same way, the household consumption level is taken as the dependent variable. As shown in Table 4, the two factors of urban per capita consumption expenditure (x5) and primary industry output value (x9) are selected. Table 5 shows that, among the factors affecting regional GDP, the two variables of urban per capita disposable income (x4) and urban per capita consumption expenditure (x5) are selected.

Financial Revenue prediction.
The factors influencing financial revenue are substituted into the grey model and the artificial neural network. The neural network parameters are $10^{-7}$, 10000 and 3 for the error precision, the number of learning iterations and the number of neurons, respectively. The predicted values of x1, x3 and x5 in 2019 and 2020 are obtained via the grey model, whose prediction accuracy is shown in Table 6. Through the neural network model, the predicted value of financial revenue is 32,747,426 million yuan in 2019 and 33,384,371 million yuan in 2020. Figure 1 compares the predicted and true values of financial revenue based on the neural network model.

For tax, the prediction accuracy of the grey model is shown in Table 7. Based on the tax neural network prediction model, the predicted value is 1,427,728,700 yuan in 2019 and 1,588,590,100,000 yuan in 2020. Figure 2 shows the corresponding comparison of predicted and true values.

Household Consumption level prediction.
The selected influencing factors are substituted into the grey model and the artificial neural network. The neural network parameters are $10^{-7}$, 10000 and 2 for the error precision, the number of learning iterations and the number of neurons, respectively. The grey model gives the predicted values for the next two years, and its prediction accuracy is shown in Table 8.

Regional GDP prediction.
The selected influencing factors are substituted into the grey model and the artificial neural network. The neural network parameters are $10^{-7}$, 10000 and 2 for the error precision, the number of learning iterations and the number of neurons, respectively. The prediction accuracy of the grey model is shown in Table 9. According to the neural network prediction model of regional GDP, the predicted value of regional GDP is 156.085 billion yuan in 2019 and 15224.3304 billion yuan in 2020. Figure 4 compares the predicted and true values of regional GDP via the neural network model.
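The three neural network settings quoted in this section can be made concrete with a small sketch. The paper does not state the network architecture or training algorithm, so the following assumes a single hidden layer with tanh units trained by full-batch gradient descent, and it reads "error precision" as a stopping tolerance on the mean squared error (`tol`), "number of learning" as the maximum number of passes (`max_epochs`), and "number of neurons" as the hidden-layer size (`hidden`). All names, the learning rate and the toy data are hypothetical.

```python
import math
import random


def train_network(inputs, targets, hidden=3, tol=1e-7, max_epochs=10000,
                  lr=0.05, seed=0):
    """Train a one-hidden-layer tanh network by full-batch gradient descent.

    Stops once the mean squared error falls below `tol` or after `max_epochs`
    passes. Returns (predict, history); history holds the mean squared error
    measured at the start of each pass.
    """
    rng = random.Random(seed)
    n, d = len(inputs), len(inputs[0])
    w1 = [[rng.uniform(-0.5, 0.5) for _ in range(d)] for _ in range(hidden)]
    b1 = [0.0] * hidden
    w2 = [rng.uniform(-0.5, 0.5) for _ in range(hidden)]
    b2 = [0.0]  # one-element list so the closure below sees updates

    def forward(x):
        h = [math.tanh(sum(w1[j][i] * x[i] for i in range(d)) + b1[j])
             for j in range(hidden)]
        return h, sum(w2[j] * h[j] for j in range(hidden)) + b2[0]

    history = []
    for _ in range(max_epochs):
        g_w1 = [[0.0] * d for _ in range(hidden)]
        g_b1 = [0.0] * hidden
        g_w2 = [0.0] * hidden
        g_b2 = 0.0
        mse = 0.0
        for x, t in zip(inputs, targets):
            h, out = forward(x)
            err = out - t
            mse += err * err / n
            g_b2 += err / n
            for j in range(hidden):
                g_w2[j] += err * h[j] / n
                back = err * w2[j] * (1.0 - h[j] ** 2) / n  # backpropagated error
                g_b1[j] += back
                for i in range(d):
                    g_w1[j][i] += back * x[i]
        history.append(mse)
        if mse < tol:
            break
        for j in range(hidden):
            w2[j] -= lr * g_w2[j]
            b1[j] -= lr * g_b1[j]
            for i in range(d):
                w1[j][i] -= lr * g_w1[j][i]
        b2[0] -= lr * g_b2

    return (lambda x: forward(x)[1]), history


# Hypothetical, already scaled training data (NOT the Wuhan series).
xs = [[k / 7.0, (k / 7.0) ** 2] for k in range(8)]
ts = [0.2 + 0.5 * x[0] + 0.3 * x[1] for x in xs]
predict, history = train_network(xs, ts, hidden=3, tol=1e-7, max_epochs=10000)

# In the combined model, this input would hold the grey forecasts of the selected variables.
next_value = predict([1.1, 1.21])
print(len(history), history[0], history[-1], next_value)
```

In the combined model described above, the network would be trained on the historical values of the selected variables, and the grey model forecasts for 2019 and 2020 would then be passed to `predict` in place of the hypothetical input used here.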

Conclusion
In the past, the least squares method was used to estimate the model parameters and stepwise regression was used to select variables. Both have an obvious shortcoming: their solutions can only be locally optimal. The Adaptive-Lasso method for variable selection makes up for this shortcoming, because it is efficient in processing high-dimensional data and fast enough to produce a clear approximation path. We used the grey model to predict each single variable, because grey prediction performs well on small data sets, and then applied the neural network for the final prediction. This yields more accurate results and provides an effective approach to financial analysis.