Statistical Inference on Nonparametric Spline Models and its Applications

The Life Expectancy Rate (LER) is one of the components of the Human Development Index that remains a problem and a concern in the world today. The United Nations Development Program (UNDP) uses the LER to measure community health status and as a benchmark for development success. LER in Indonesia has increased almost every year; that is, the expected lifespan of a newborn baby keeps rising. Parametric regression is not necessarily suitable for modelling LER data because the pattern of the relationship varies at certain age intervals. Spline regression is a regression method that can handle data whose pattern changes at certain intervals. The spline is one of the models in nonparametric regression that has a particularly good statistical and visual interpretation. In addition, splines can handle data or functions that are smooth. This study aims to derive the form of the estimator and the shortest confidence interval for the quadratic spline model, and to model the LER data in Indonesia.


Introduction
From a probabilistic point of view, the Life Expectancy Rate (LER) behaves like a quadratic function, because its graph is U-shaped and resembles an inverted normal distribution curve, although it is not symmetrical. For LER data the curve tends to be quadratic, but it is slightly skewed to the right, not smooth, and not truncated. A curve that is not necessarily smooth needs to be analysed with a method that can sensitively detect the points at which the curve breaks into pieces. One method that can be used to solve such modelling problems is regression analysis, which determines the pattern of the relationship between response variables and predictor variables. There are three approaches to regression analysis: the parametric, the nonparametric, and the semiparametric approach. If the regression curve follows a known pattern, such as linear, quadratic, cubic, or polynomial of degree k, the parametric approach is appropriate. The nonparametric approach is used when the shape of the regression curve is unknown. The semiparametric approach is used when the shape of the regression curve is partly known and partly unknown. Not all data encountered in practice have a known relationship pattern, so a nonparametric regression approach can be used in these cases.
According to [1], in the nonparametric regression approach the data are expected to find their own form of estimate without being influenced by the subjectivity of the researcher. In other words, the nonparametric regression approach is highly flexible. The nonparametric spline regression approach is more suitable for LER data than the parametric approach. Nonparametric regression approaches that are often used include Fourier series, kernel, spline, and wavelets. The advantage of the spline approach is that the model tends to find its own estimate wherever the data move. This advantage arises because the spline has knot points, which are joint points that indicate changes in the behaviour pattern of the data [2]. In general, the spline is an estimator obtained by minimizing the Penalized Least Square (PLS) criterion, an optimization criterion that combines goodness of fit and curve smoothness. The solution of this optimization is obtained with the reproducing kernel Hilbert space (RKHS) or Gateaux method, which is mathematically relatively difficult. To overcome this, a simpler estimation method is used, namely Least Square (LS) optimization, which leads to easier mathematical calculations and interpretations [3]. A very important form of statistical inference in spline regression is the confidence interval, here constructed as the shortest confidence interval based on a pivotal quantity.
The population is part of the wealth of a nation. Indonesia is one of the five countries with the largest populations in the world. In 2015, Indonesia was in fourth place after China, India and the United States. As a large nation, Indonesia should be able to carry out development by making the population both the subject and the object of development. The success of nation-building cannot be separated from the achievement of human development as the real wealth of a nation. In principle, there are three basic choices at all levels of development: people want to live a long and healthy life, to get an education, and to have access to the resources needed for a decent life. This concept became the forerunner of the Human Development Index (HDI), which was first launched by the United Nations Development Program (UNDP) in 1990. Therefore, the success of a nation's development can also be seen from its Human Development Index. One of the components of the HDI that is still a problem and a concern for the world today is the Life Expectancy Rate (LER). UNDP uses LER to measure community health status and as a benchmark for development success. A low LER in an area must be followed by health development programs and other social programs, such as environmental health, adequate nutrition and calorie intake, and poverty eradication. LER in Indonesia has increased almost every year; that is, the expected lifespan of a newborn baby keeps rising. In 2010 Indonesia's LER was 69.81 years, and it increased to 70.78 years by 2015. According to Bappenas, the government targets an LER of 73.7 years by 2025. However, on closer inspection, the growth of LER in Indonesia is still very low. In the period 1990-2015, the average growth of LER was only about 0.3 percent per year. This is in line with WHO's observation that LER in developing countries increases slowly, even more slowly than in underdeveloped countries.
Along with the national increase in LER, the provinces in Indonesia also experienced positive LER growth. However, there are still LER disparities between provinces. This is not in line with the government's development target stated in the NAWACITA program, namely the equitable distribution of development between regions in Indonesia. In 2015, provincial LER in Indonesia ranged from 64 to 74 years. Of the 34 provinces, 73.53 percent still have an LER below the national LER. Most of these provinces are located in the eastern part of Indonesia. Based on study results, Indonesia's LER is ranked 137th out of 223 countries in the world. This reflects the low quality of life of the population in Indonesia. The finding is in line with the problems and challenges stated in the National Long-Term Development Plan (RPJPN) 2005-2025, namely improving the quality of life in various fields of development. The Global Burden of Disease (GBD) study, under the supervision of the Institute for Health Metrics and Evaluation (IHME), states that developing countries currently face serious challenges from lifestyle and deadly diseases, namely heart disease, stroke and diabetes, which can affect all ages. According to the records of the World Health Organization (WHO), one of the countries that has succeeded in overcoming these deadly diseases is South Korea: in 2015 South Korea achieved an LER above 80 years, and in the same year it had a community of people aged 80 to 100 years. As an indicator of the quality of health and well-being, LER is influenced by many factors. Several studies have examined factors that affect LER, including literacy rates, average years of schooling, and per capita expenditure.
Other studies also mention that the factors that influence LER are the Human Development Index, the percentage of deliveries with non-medical assistance, the average age at first marriage for women, the average length of schooling, the average household expenditure, the percentage of areas with village status, the average length of exclusive breastfeeding, and the percentage of households using clean water.

Nonparametric regression
The parametric regression approach is used if the spread of the data forms a certain pattern, whether linear, quadratic or cubic. To use parametric regression models, past experience or other sources that provide detailed information about the data are needed. Parametric regression rests on a firm assumption, namely that the shape of the regression curve is known [1]. The methods used for parameter estimation are Least Square and Maximum Likelihood [4]. Fitting the model with these methods is equivalent to estimating the parameters in the model. In general, linear parametric regression can be written in the form

$$y_i = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + \cdots + \beta_r x_{ri} + \varepsilon_i, \quad i = 1, 2, \ldots, n \qquad (1)$$

where $y$ is the response variable, $x_1, x_2, \ldots, x_r$ are the predictor variables, and $\varepsilon_i$ is an independent random error that is normally distributed with zero mean and variance $\sigma^2$. If the assumed model is correct, parametric regression estimation is very efficient; otherwise it leads to misleading interpretation of the data [7].
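In general, the multivariable parametric regression with r predictor variables in equation (1) can be written in matrix form:

$$\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon} \qquad (2)$$

The vector and matrix forms in equation (2) are $\mathbf{y} = (y_1, y_2, \ldots, y_n)^T$, $\boldsymbol{\beta} = (\beta_0, \beta_1, \ldots, \beta_r)^T$, $\boldsymbol{\varepsilon} = (\varepsilon_1, \varepsilon_2, \ldots, \varepsilon_n)^T$, and

$$\mathbf{X} = \begin{pmatrix} 1 & x_{11} & \cdots & x_{r1} \\ 1 & x_{12} & \cdots & x_{r2} \\ \vdots & \vdots & & \vdots \\ 1 & x_{1n} & \cdots & x_{rn} \end{pmatrix}.$$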
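As a rough illustration of the matrix form, the least squares estimate of the coefficient vector can be computed as below. This is a minimal sketch in Python with NumPy on synthetic data; the function name `ols_fit`, the simulated predictors and the coefficient values are assumptions made for the example and do not come from the study.

```python
import numpy as np


def ols_fit(X, y):
    """Least squares estimate of beta in the model y = X beta + eps."""
    # lstsq solves the least squares problem without forming (X'X)^{-1} explicitly.
    beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta_hat


# Synthetic data (assumed values, for illustration only).
rng = np.random.default_rng(0)
n = 200
x1 = rng.uniform(0, 10, n)
x2 = rng.uniform(0, 5, n)

# Design matrix: a column of ones for the intercept, then the predictors.
X = np.column_stack([np.ones(n), x1, x2])
beta_true = np.array([2.0, 0.5, -1.0])
y = X @ beta_true + rng.normal(0, 0.1, n)

beta_hat = ols_fit(X, y)
print(beta_hat)
```

With a correctly specified model and small error variance, the estimate should lie close to the coefficients used to simulate the data.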

Spline model
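Many studies of spline estimators in nonparametric regression models have been carried out, such as [10,11,12,13]. The nonparametric regression model that states the relationship between the predictor variable and the response variable can be written as follows:

$$y_i = f(x_i) + \varepsilon_i, \quad i = 1, 2, \ldots, n.$$

Approximating $f$ by a spline with knot points $\boldsymbol{\lambda}$ and estimating its coefficients by least squares gives the estimate of $f(x)$:

$$\hat{f}(x) = \mathbf{X}(\boldsymbol{\lambda})\hat{\boldsymbol{\beta}} = \mathbf{X}(\boldsymbol{\lambda})\left[\mathbf{X}(\boldsymbol{\lambda})^T\mathbf{X}(\boldsymbol{\lambda})\right]^{-1}\mathbf{X}(\boldsymbol{\lambda})^T\mathbf{y} \qquad (8)$$

The knot points control the balance between the smoothness of the curve and the fit of the curve to the data. The knot points can be selected using the Generalized Cross Validation (GCV) method [14,15].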
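The least squares spline estimate of f(x) can be sketched in code as follows. The sketch assumes a truncated power basis for the quadratic spline, that is, the columns 1, x, x^2 and (x - k)_+^2 for each knot k; the basis used in the study is not recoverable from the text, so this choice, the function names `quadratic_spline_basis` and `spline_fit`, and the coefficient values are all assumptions for illustration.

```python
import numpy as np


def quadratic_spline_basis(x, knots):
    """Design matrix with columns 1, x, x^2 and (x - k)_+^2 for each knot k."""
    x = np.asarray(x, dtype=float)
    cols = [np.ones_like(x), x, x**2]
    # (x - k)_+ is zero to the left of the knot and x - k to the right of it.
    cols += [np.clip(x - k, 0.0, None) ** 2 for k in knots]
    return np.column_stack(cols)


def spline_fit(x, y, knots):
    """Least squares coefficients and fitted values for the given knots."""
    X = quadratic_spline_basis(x, knots)
    beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta_hat, X @ beta_hat


# Noise-free synthetic curve built from the same basis (assumed coefficients).
x = np.linspace(0.0, 100.0, 60)
knots = (35.0, 50.0)
beta_true = np.array([70.0, -0.1, -0.002, 0.003, -0.001])
y = quadratic_spline_basis(x, knots) @ beta_true

beta_hat, fitted = spline_fit(x, y, knots)
```

Because each truncated term and its first derivative vanish at the knot, the fitted curve stays continuous with a continuous slope while its curvature is allowed to change at each knot.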
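The GCV criterion is

$$\mathrm{GCV}(\boldsymbol{\lambda}) = \frac{\mathrm{MSE}(\boldsymbol{\lambda})}{\left[n^{-1}\,\mathrm{trace}\left(\mathbf{I} - \mathbf{A}(\boldsymbol{\lambda})\right)\right]^2} \qquad (9)$$

where $\mathbf{A}(\boldsymbol{\lambda}) = \mathbf{X}(\boldsymbol{\lambda})\left[\mathbf{X}(\boldsymbol{\lambda})^T\mathbf{X}(\boldsymbol{\lambda})\right]^{-1}\mathbf{X}(\boldsymbol{\lambda})^T$ and $\mathrm{MSE}(\boldsymbol{\lambda})$ in equation (9) is

$$\mathrm{MSE}(\boldsymbol{\lambda}) = n^{-1}\sum_{i=1}^{n}\left(y_i - \hat{f}(x_i)\right)^2.$$

The optimal knot points are those that give the smallest GCV value.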
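A minimal sketch of the GCV computation for one set of knots is given below, using the usual definition: the mean squared error divided by the squared value of trace(I - A)/n, where A is the hat matrix of the least squares fit. The truncated power basis, the function names and the synthetic data are assumptions for illustration, and the sketch is written in Python rather than the R program used in the study.

```python
import numpy as np


def quadratic_spline_basis(x, knots):
    """Design matrix with columns 1, x, x^2 and (x - k)_+^2 for each knot k."""
    x = np.asarray(x, dtype=float)
    cols = [np.ones_like(x), x, x**2]
    cols += [np.clip(x - k, 0.0, None) ** 2 for k in knots]
    return np.column_stack(cols)


def gcv(x, y, knots):
    """GCV value of the least squares quadratic spline with the given knots."""
    X = quadratic_spline_basis(x, knots)
    n = len(y)
    # Hat matrix A = X (X'X)^{-1} X', computed through the pseudo-inverse.
    A = X @ np.linalg.pinv(X)
    mse = np.mean((y - A @ y) ** 2)
    return mse / (np.trace(np.eye(n) - A) / n) ** 2


# Synthetic data with a change in curvature at x = 40 (assumed values).
rng = np.random.default_rng(1)
x = np.sort(rng.uniform(10, 100, 80))
y = 75 - 0.1 * x - 0.002 * np.clip(x - 40, 0, None) ** 2 + rng.normal(0, 1.0, x.size)

gcv_value = gcv(x, y, knots=(40.0,))
print(gcv_value)
```

For a full-rank design with p columns, trace(I - A) equals n - p, so the GCV value is the MSE inflated by the factor (n / (n - p))^2, which penalizes models with more knots.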

Method
The steps of this research are as follows:
1. Specify the nonparametric regression model that states the relationship between the predictor variable and the response variable.
2. Find the 1-α confidence interval for the nonparametric quadratic spline regression using the pivotal quantity approach, with the following steps:
   a. Determine the basis of the quadratic spline function.
   b. Estimate the parameters of the nonparametric quadratic spline regression model using the MLE method.
3. Calculate the length of the confidence interval using the concept of the shortest confidence interval.
4. Derive the shortest confidence interval for the nonparametric regression curve.

Result
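Using the concept of a pivotal quantity, that is, a statistic whose distribution does not depend on the regression curve $f(x_i)$, $i = 1, 2, \ldots, n$, the standardized quantity

$$Z_i = \frac{\hat{f}(x_i) - f(x_i)}{\sqrt{\mathrm{var}\left(\hat{f}(x_i)\right)}} \sim N(0,1), \quad i = 1, 2, \ldots, n \qquad (17)$$

has a probability function that does not involve the regression curve $f$.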
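The pivotal quantity argument can be illustrated with a small numerical sketch. For a pivot with a symmetric, unimodal distribution such as N(0,1), the shortest 1-α interval is the one that uses symmetric quantiles, which gives a band of the form fitted value ± z × standard error. The sketch below assumes a truncated power basis and replaces the unknown error variance with the residual estimate, which is a plug-in simplification and not necessarily the derivation of the paper; all names and data are hypothetical.

```python
from statistics import NormalDist

import numpy as np


def quadratic_spline_basis(x, knots):
    """Design matrix with columns 1, x, x^2 and (x - k)_+^2 for each knot k."""
    x = np.asarray(x, dtype=float)
    cols = [np.ones_like(x), x, x**2]
    cols += [np.clip(x - k, 0.0, None) ** 2 for k in knots]
    return np.column_stack(cols)


def spline_confidence_band(x, y, knots, alpha=0.05):
    """Pointwise (1 - alpha) band for f(x_i) from a standard normal pivot."""
    X = quadratic_spline_basis(x, knots)
    n, p = X.shape
    A = X @ np.linalg.pinv(X)            # hat matrix, so fitted = A y
    fitted = A @ y
    # Assumption: sigma^2 is replaced by its residual-based estimate.
    sigma2_hat = np.sum((y - fitted) ** 2) / (n - p)
    # var(fitted_i) = sigma^2 * A_ii because A is a projection matrix.
    std_err = np.sqrt(sigma2_hat * np.diag(A))
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return fitted - z * std_err, fitted, fitted + z * std_err


# Synthetic data (assumed values, for illustration only).
rng = np.random.default_rng(3)
x = np.sort(rng.uniform(10, 100, 80))
y = 75 - 0.1 * x - 0.002 * np.clip(x - 40, 0, None) ** 2 + rng.normal(0, 1.0, x.size)

lower, fitted, upper = spline_confidence_band(x, y, knots=(40.0,))
```

When the error variance is estimated from a small sample, a Student t quantile with n - p degrees of freedom would be the more careful choice than the normal quantile used here.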

Quadratic spline model for life expectancy rate
To see the pattern of the relationship between LER and the infant mortality rate (IMR), a scatter plot is given in Figure 1.
Figure 1. Scatter plot of the relationship between LER and IMR
The figure indicates changes in the behaviour pattern of the data. In addition, the data pattern tends to be quadratic. Therefore, this study approaches the data using nonparametric quadratic spline regression.
In spline regression, the first step is to determine the location of the knot points. The knots are selected by trying locations where the behaviour of the data changes at certain intervals. Among the candidate knots, those with the minimum GCV value are chosen. To get the best results, quadratic spline models with one to three knots were tried; for each model the parameter estimates were calculated and an estimation curve was created. The GCV was calculated using R for the quadratic spline models with one, two and three knot points. Based on Table 1, it can be seen that the smallest GCV value is 5.832 in the quadratic spline model with three knot points. After obtaining the minimum GCV value for the quadratic spline model, the next step is to calculate the estimate for the quadratic spline model.
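The knot selection procedure can be sketched as an exhaustive search: compute the GCV for every combination of one to three candidate knots and keep the combination with the smallest value. The following Python sketch uses synthetic data and a hypothetical candidate grid; it does not reproduce Table 1, the LER data, or the R code used in the study, and the truncated power basis is an assumption.

```python
from itertools import combinations

import numpy as np


def quadratic_spline_basis(x, knots):
    """Design matrix with columns 1, x, x^2 and (x - k)_+^2 for each knot k."""
    x = np.asarray(x, dtype=float)
    cols = [np.ones_like(x), x, x**2]
    cols += [np.clip(x - k, 0.0, None) ** 2 for k in knots]
    return np.column_stack(cols)


def gcv(x, y, knots):
    """GCV = MSE / (trace(I - A) / n)^2 for the least squares spline fit."""
    X = quadratic_spline_basis(x, knots)
    n = len(y)
    A = X @ np.linalg.pinv(X)
    mse = np.mean((y - A @ y) ** 2)
    return mse / (np.trace(np.eye(n) - A) / n) ** 2


def select_knots(x, y, candidates, max_knots=3):
    """Return the knot combination (1 to max_knots knots) with minimum GCV."""
    best_knots, best_gcv = None, np.inf
    for m in range(1, max_knots + 1):
        for knots in combinations(candidates, m):
            value = gcv(x, y, knots)
            if value < best_gcv:
                best_knots, best_gcv = knots, value
    return best_knots, best_gcv


# Synthetic data (assumed values) and a hypothetical grid of candidate knots.
rng = np.random.default_rng(2)
x = np.sort(rng.uniform(10, 100, 100))
y = (75 - 0.05 * x - 0.004 * np.clip(x - 35, 0, None) ** 2
     + 0.003 * np.clip(x - 50, 0, None) ** 2 + rng.normal(0, 1.0, x.size))
candidates = [float(c) for c in range(20, 90, 10)]  # 20, 30, ..., 80

best_knots, best_gcv = select_knots(x, y, candidates)
print(best_knots, best_gcv)
```

An exhaustive search is only practical for a small candidate grid, since the number of combinations grows quickly with the number of candidates and knots.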
The estimation of the quadratic spline model with two knot points is as follows: The estimated spline model above can be interpreted as follows: in Indonesia, when the infant mortality rate is above 35, life expectancy decreases. A gradual decline becomes more visible when the infant mortality rate is above 50, as indicated by the slowly decreasing shape of the curve. The estimated quadratic spline model with two knot points (35; 50) for LER in Indonesia has a coefficient of determination $R^2 = 89.17\%$ and a Mean Square Error of 28.56943. The estimated quadratic spline curve is shown in Figure 2. For the quadratic spline model with two knot points on the infant mortality rate, a 95% confidence interval is constructed with the equation: