Weighted Regression Curves for the Population with Diabetes Mellitus

This study contributes to the fields of Statistics and Numerical Analysis by applying the Least Square Method to develop new regression functions. Based on the chronological, discrete dataset for the national and international population with diabetes mellitus from 1980 through 2015, we examine the accuracy of various traditional regression functions. Moreover, in order to emphasize the importance of more recent data to future outputs, the article suggests a weighted Least Square Method. The weight vector is developed by appropriately transforming the standard logistic function σ(t) on the given timeline while counting the number of node points. The root mean square error (RMSE) is then redefined with these components, and the derived weighted regression functions predict near-future values more accurately than the non-weighted ones. The obtained regression functions are then employed to predict the near (2016) and remote (2030) future populations of patients with diabetes, and the outputs are analyzed with their unique properties.


Diabetes Mellitus
The World Health Organization (WHO)'s 2016 global report [1] states that "Diabetes mellitus (DM) is a chronic disease caused by inherited and/or acquired deficiency in production of insulin by the pancreas, or by the ineffectiveness of the insulin produced. Such a deficiency results in increased concentrations of glucose in the blood, which in turn damages many of the body's systems, in particular the blood vessels and nerves". That is why diabetes can be a major cause of many serious complications, including kidney failure, neuropathy, heart disease, blindness, stroke, and diabetic foot ulcers [2].
According to the WHO's Fact Sheet, the number of people of all ages with diabetes worldwide has risen from 108 million in 1980 to 422 million in 2014 [3,4]. See the chronological numbers in Section 2. The International Diabetes Federation (IDF) predicts that the population with diabetes will reach 642 million by 2040 [5], which means 1 in 10 adults will have diabetes. The global prevalence of diabetes among adults over 18 years of age has risen rapidly, reaching 8.5% in 2014. The 2015 Diabetes Atlas shows that there were 5 million deaths worldwide directly caused by, or attributable to, diabetes. That is, one person dies from diabetes every 6 seconds. This number also indicates a remarkable increase from the 3.9 million deaths in 2012 [1,5].
Meanwhile, the US Centers for Disease Control and Prevention (CDC) state that the number of Americans with diagnosed diabetes has increased fourfold, from 5.5 million in 1980 to 22.0 million in 2014 [6,7] (see the chronological data for the number of US adults with diabetes in Section 2). That is, about one out of every 11 US adults has diabetes. The report also states that an additional 8.1 million people in the US with diabetes (i.e., 27.8% of people with diabetes) are still undiagnosed. According to the American Diabetes Association (ADA), citing the National Vital Statistics Reports (NVSR), a total of 76,488 deaths were directly caused by diabetes in 2015. Additionally, diabetes was a direct or contributing cause in a total of 234,051 deaths, and it represents the 7th leading cause of death in the United States in 2014 [8]. The CDC also identifies about 86 million US adults who have prediabetes and who are likely to develop diabetes within 10 years without intervention. Furthermore, 90% of those with prediabetes are unaware of the risk they face [9].
One of the outcomes of the increasing prevalence of diabetes is the financial burden placed on healthcare systems in various countries. In terms of economics, the Harvard School of Public Health estimates that the global expenditure for directly treating diabetes is 825 billion dollars per year as of 2016, including 105 million dollars in the US [10]. Meanwhile, the ADA estimated the total direct and indirect cost at 245 billion dollars as of March 6, 2013 [7,11].

Mathematical Background
In statistics, regression analysis is the study of the relationships between dependent (or random or response) variables and related independent (or predictor) variables for the purpose of prediction. The function that expresses these relationships is called the regression function, and the regression curve is its graphical representation. Regression analysis has been widely used where statistical forecasts based on known data are required or where it is of interest to explore trends in the relationships among variables. (Some practical applications can be found in [12].) In particular, population models in the social sciences and income or sales revenue models in business illustrate the importance of regression analysis.
Many methods for performing regression analysis have been developed, from simple linear regression to various nonlinear approaches, and a large literature demonstrates the usefulness of these methods with distinct techniques (e.g., see [13]). Nevertheless, the purpose of a regression function is ultimately to find the most likely function, a process known as curve fitting, such that the total deviation of the data points from the curve is minimized. Thus most regression curves are sought through the Least Square Method (LSM), although some curves derive from alternative approaches, such as the exponential trend function in MS Office [14].
The Least Square Method is a numerical approach to find an approximate function using the root-mean-square error (RMSE) [15,16,17,18]. That is, we are looking for a fitting function $f(x)$ that minimizes
$$E = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\bigl(y_i - f(x_i)\bigr)^2} \qquad (1)$$
for sampled discrete data $(x_i, y_i)$ for $i = 1, \cdots, n$. Here $\sum_{i=1}^{n}\bigl(y_i - f(x_i)\bigr)^2$ in Eq. (1) is commonly called the sum of squared deviations in statistics. If the data are given in the form of a continuous function $g(x)$ on the interval $[a, b]$, then $f(x)$ is defined by minimizing
$$E = \sqrt{\frac{1}{b-a}\int_{a}^{b}\bigl(g(x) - f(x)\bigr)^2\,dx}.$$
Given these startling numbers of national (US) and international populations with diabetes, we examine traditional regression curves in linear, exponential, and powered function forms in Section 2. In Section 3 a reasonable weight vector is introduced, and the weighted regression functions are created on the basis of this vector. The resulting outputs are then compared in accuracy with those of Section 2. Section 4 predicts the number of persons with diabetes in 2016 and 2030 using both non-weighted and weighted methods, and the related observations are discussed.
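As a minimal sketch of Eq. (1) for a linear fitting function, the following Python snippet computes the closed-form least-squares coefficients and the resulting RMSE. The sample values and the helper names (`fit_linear`, `rmse`) are illustrative assumptions and are not taken from Tables 1 or 2.

```python
import math

def fit_linear(xs, ys):
    """Closed-form least-squares line y = m*x + c (minimizes the sum of squared deviations)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    m = sxy / sxx
    c = mean_y - m * mean_x
    return m, c

def rmse(f, xs, ys):
    """Root-mean-square error of the fitting function f over the data points."""
    n = len(xs)
    return math.sqrt(sum((y - f(x)) ** 2 for x, y in zip(xs, ys)) / n)

# Hypothetical sample points (x = time index, y = population in millions).
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

m, c = fit_linear(xs, ys)
error = rmse(lambda x: m * x + c, xs, ys)
print(f"y = {m:.3f}x + {c:.3f}, RMSE = {error:.3f}")
```

For the exponential and powered forms the minimization is nonlinear in the coefficients, so a closed form like the one above is generally not available and an iterative solver would be needed.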

Data
The international chronological populations of adults aged 18 or older with diagnosed and undiagnosed diabetes are shown in Table 1 [5,19,20,21,22,23,24]. Each data point carries an uncertainty of about ±30%, mostly due to inaccurate counting of underrepresented patients. Table 2 shows the chronological data of the adult (18+) population with diagnosed diabetes in the United States during 1980-2014 [6]. Each data point includes an error bound of around ±5%. Since the numbers in Table 2 include only diagnosed patients, they have less error than those in Table 1.
In addition, the more recent data are the most accurate. It is remarkable that the numbers decreased in 1996, 2011 and 2014.

Traditional and Non-weighted Least Square Methods
The traditional trend curves for the data between 1994 and 2014 from Table 1, obtained with Microsoft Excel formulas, are shown in Fig. 1. According to these regression functions, the worldwide population of age 20+ with diabetes in 2015 is predicted to be 373 million, 297 million, and 409 million by the linear, powered, and exponential functions, respectively.
When the data are applied to the LSM, we obtain the natural exponential function $y = 9.7579 \times e^{(\ldots)}$, which estimates the population in 2015 to be 420 million. The RMSE (simply, the average error per data point) is 17.812 (million). As shown in Table 4, the exponential approach fits the data better. As mentioned, the natural exponential function and powered function using LSM are slightly different from the ones using the Microsoft Excel formula. Table 4 shows more details. The other exponential function with an unknown exponential base is given by y = 97.579 × 10.655 in millions, and the predicted number for 2015 and the RMSE are exactly the same as in the natural exponential case. This is expected, since $a c^{x} = a e^{x \ln c}$ means that both forms describe the same family of curves, and the same coincidence appears in the other data (see Table 7). Thus we do not pursue exponential functions with an unknown base any further. The LSM linear regression function is identical to the one from Excel. The functions described here take the starting year 1994 as $x = 1$ and 2015 as $x = 22$.
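The coincidence between the two exponential forms can be checked numerically. The sketch below fits an exponential by the log-linear shortcut (ordinary linear regression on ln y); this shortcut is an assumption made here for brevity and is not the direct nonlinear LSM used in this section, and the data are hypothetical. It then rewrites the fitted curve with a general base c = e^b and confirms that both forms return the same values.

```python
import math

def fit_exponential_loglinear(xs, ys):
    """Fit y = a * e^(b*x) by linear least squares on (x, ln y).

    Assumption: this log-linear shortcut minimizes squared errors of ln y,
    so its coefficients generally differ from a direct nonlinear LSM fit.
    """
    n = len(xs)
    ln_ys = [math.log(y) for y in ys]
    mean_x = sum(xs) / n
    mean_ln = sum(ln_ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (ly - mean_ln) for x, ly in zip(xs, ln_ys))
    b = sxy / sxx
    a = math.exp(mean_ln - b * mean_x)
    return a, b

# Hypothetical data generated from y = 100 * e^(0.05 x); not taken from Table 1.
xs = list(range(1, 11))
ys = [100.0 * math.exp(0.05 * x) for x in xs]

a, b = fit_exponential_loglinear(xs, ys)
c = math.exp(b)  # equivalent general base: a * c**x equals a * e^(b*x)

natural_form = [a * math.exp(b * x) for x in xs]
general_form = [a * c ** x for x in xs]
max_gap = max(abs(p - q) for p, q in zip(natural_form, general_form))
print(f"a = {a:.4f}, b = {b:.4f}, c = {c:.4f}, max gap = {max_gap:.2e}")
```

Because the two parameterizations describe the same curves, any optimizer that reaches the minimum in one form reaches it in the other, which is why the predictions and RMSE agree.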
Unlike the data for international patients, the data of the US population with diabetes are sequentially given without missing years since 1980. Fig. 2

Remark
A common difficulty in computing regressions with the LSM is finding the optimal solution of the resulting systems of nonlinear equations. As shown in the sample contour graph in Fig. 3,

Weights
In order to develop a more accurate predicting tool on the basis of past chronological information, it is reasonable to emphasize the importance of more recent data to the future outputs. We thus suggest a weighted Least Square Method, that is, the LSM combined with weights. To define a set of weights, we consider a logistic function, in particular the standard logistic function $\sigma(t) = \frac{1}{1 + e^{-t}}$ (its graph is shown in Fig. 4). Since its output always takes values between zero and one, it is also interpretable as a probability. The weight vector is then obtained by evenly partitioning a domain that is symmetric about zero and appropriately transforming the function $\sigma(t)$ according to the given timeline and the number of node points, as in Table 3. The first sum of weights, in the last column of the second row, is 54% because the missing years also take their own weight portions. Dividing again by the sum of the actual node weights, we obtain the final weights in the third row.
The root-mean-square error (RMSE) is then redefined with these weight components $w_i$ by
$$E_w = \sqrt{\sum_{i=1}^{n} w_i\bigl(y_i - f(x_i)\bigr)^2}$$
for the discrete data $(x_i, y_i)$ for $i = 1, \cdots, n$.
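One possible reading of this construction is sketched below in Python. The half-width `half_width = 6.0` of the symmetric domain, the one-node-per-year spacing, the sample timeline, and the helper names are all assumptions for illustration; the actual values behind Table 3 are not reproduced here. The sketch evaluates σ(t) at evenly spaced nodes over the full timeline, keeps only the years that have data, renormalizes those weights to sum to one, and evaluates a weighted RMSE of the assumed form sqrt(Σ w_i (y_i − f(x_i))²).

```python
import math

def sigma(t):
    """Standard logistic function."""
    return 1.0 / (1.0 + math.exp(-t))

def logistic_weights(first_year, last_year, observed_years, half_width=6.0):
    """Weights that grow toward recent years, normalized over observed years only.

    Assumptions: one evenly spaced node per calendar year on
    [-half_width, half_width]; half_width = 6.0 is an arbitrary choice.
    """
    span = last_year - first_year
    raw = {}
    for year in range(first_year, last_year + 1):
        t = -half_width + 2.0 * half_width * (year - first_year) / span
        raw[year] = sigma(t)
    total_all = sum(raw.values())
    observed_total = sum(raw[y] for y in observed_years)
    # Share of the total weight carried by the years that actually have data.
    observed_share = observed_total / total_all
    weights = [raw[y] / observed_total for y in observed_years]
    return weights, observed_share

def weighted_rmse(f, xs, ys, ws):
    """Weighted RMSE, assuming the weights ws sum to one."""
    return math.sqrt(sum(w * (y - f(x)) ** 2 for x, y, w in zip(xs, ys, ws)))

# Hypothetical timeline with missing years.
years = [1994, 2000, 2003, 2007, 2010, 2015]
weights, share = logistic_weights(1994, 2015, years)
print([round(w, 3) for w in weights], round(share, 2))
```

With equal weights of 1/n, `weighted_rmse` reduces to the ordinary RMSE, so the non-weighted method is a special case of this form.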

Remark
We note that since the x-variable represents the year, logistic regression analysis (in which the dependent variable is categorical) does not fit this model. However, observed characteristics of potential diabetes factors (e.g., the prevalence of obesity, the growth of fast-food markets, the indexes of various blood tests, etc.) could be employed in the determination of weights.

Weighted Regressions
The results of applying the weights (listed in Table 3) to the weighted RMSE and fitting the linear, powered, and natural exponential functions to the international data are shown in Table 4. The weighted functions differ from the non-weighted ones and are mostly improved, in that their predicted estimates are closer to the existing data value in 2015. For example, while the non-weighted linear function has a 10% relative error in its prediction of the population in 2015, the weighted linear function $y = 16.650x + 26.617$ gives 393 million in 2015, which is a 5.1% relative difference from the actual value of 414 million in 2015. In particular, in the approach with powered functions of the form $y = a \cdot x^{b}$, the non-weighted functions, $y = 90.218 \times x^{(\ldots)}$ … in LSM, with relative errors of 28.3% and 14.3%, respectively, are remarkably improved to a weighted function with an error of 3.4%.
Since the deterioration from 1.3% to 5.1% in the natural exponential approach is not serious considering the counting error bounds of the populations, we conclude that the weighted models are more accurate in near-future prediction than the non-weighted ones.
In the case of the US population with diabetes, we also confirm generally enhanced accuracy in predicting the population in 2014, especially for the powered functions (see Table 5). It is not possible to directly compare the exponential regression from Excel with the weighted (or non-weighted) exponential regression in LSM because they adopt different techniques for building functions. Table 5 also shows that the RMSE and weighted RMSE both remain stable at the level of 1-2 million.
The results shown in Tables 4-5 justify adopting the weighted regression functions as an alternative to, or as an improvement on, the non-weighted functions.
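The relative differences quoted in this section can be rechecked with a few lines of Python. The helper name `relative_error` is hypothetical. The inputs are the 2015 predictions and the reference value of 414 million quoted in the text, the weighted linear function is read as y = 16.650x + 26.617 with x = 22 standing for 2015, and the 297 million figure is the Excel powered-function prediction from the non-weighted section.

```python
def relative_error(predicted, actual):
    """Relative difference of a prediction from the reference value, in percent."""
    return abs(predicted - actual) / actual * 100.0

ACTUAL_2015 = 414.0  # millions, reference value quoted in the text

def weighted_linear(x):
    """Weighted linear function as read from the text (x = 1 is 1994)."""
    return 16.650 * x + 26.617

predictions = {
    "non-weighted linear": 373.0,
    "Excel powered": 297.0,
    "weighted linear": weighted_linear(22),
}
for name, value in predictions.items():
    print(f"{name}: {value:.1f} million, {relative_error(value, ACTUAL_2015):.1f}% off")
```

These evaluate to about 9.9%, 28.3%, and 5.1%, matching the rounded figures in the text.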

PREDICTION OF THE TREND IN 2030
The global population with diabetes is anticipated to range from 389 million to 450 million in 2016, and to exceed 600 million by 2030, as shown in Table 6 and Fig. 5. Here, the large values in the exponential cases reflect the rapidly increasing property of an exponential function with a base greater than 1. Table 7 shows the prediction of the US population with (diagnosed) diabetes in 2016 and in 2030. The two linear functions estimate that 21.5 million and 22.9 million people in 2016, and 29.7 million and 32.9 million in 2030, respectively, will have diabetes. Following the trends of the exponential functions, the number of patients may reach 53 million by 2030.
Overall, the weighted regression analysis generally forecasts larger numbers than the non-weighted one in the global case. This may imply that the population with diabetes has increased more rapidly in recent years than in the remote past. Nevertheless, the prediction using an exponential function looks less realistic for remote-future populations. One possible reason is that, unlike bacteria, the human population grows in very constrained environments.
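To make the extrapolation step concrete, the following sketch maps calendar years to the index used for the international data (1994 as x = 1) and evaluates the weighted linear function, read from the previous section as y = 16.650x + 26.617. The helper names are hypothetical, and only the linear case is shown because the coefficients of the other functions are not fully legible in the text.

```python
BASE_YEAR = 1994  # x = 1 corresponds to 1994 for the international data

def year_to_index(year):
    """Convert a calendar year to the regression index x."""
    return year - BASE_YEAR + 1

def weighted_linear(x):
    """Weighted linear function for the international data (millions)."""
    return 16.650 * x + 26.617

for year in (2015, 2016, 2030):
    x = year_to_index(year)
    print(f"{year}: x = {x}, predicted {weighted_linear(x):.1f} million")
```

Under this reading, the linear model gives roughly 410 million for 2016 and 643 million for 2030, which falls inside the 2016 range and above the 600 million mark stated above.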

CONCLUSION
We examined the accuracy of various traditional regression functions and newly defined weighted regression functions based on the chronological, discrete dataset for the national and international population with diabetes mellitus from 1980 through 2015 released by the CDC and IDF. In comparison, the exponential approach has a smaller RMSE and thus fits the data better both nationally and internationally. The weighted regression functions are defined on the basis of a weight vector that takes advantage of the standard logistic function σ(t).
They therefore reflect the trend of more recent data more strongly and so work well in the prediction of near-future values. In addition, these regression functions forecast that the population of diabetes patients will exceed 600 million worldwide and 30 million in the United States in 2030 without intervention.
The experiments show that while the exponential function reflects drastic increases and near-future prediction well, the linear and powered functions better express slower or gentler increasing trends. Nevertheless, using only a single type of function as a regression tool for the remote future is still untrustworthy. Thus we finally suggest that a well-organized combination of several types of functions can be an alternative that reflects the overall trends in increase or decrease as well as more subtle movements.