Prediction of incident chronic kidney disease in a population with normal renal function and normo-proteinuria

Given the irreversible clinical course of chronic kidney disease (CKD), identifying high-risk subjects susceptible to CKD has important clinical implications. Previous studies have developed risk prediction models that identify high-risk individuals within a group, including those who may have experienced minor renal damage, to provide an opportunity for initiating therapies or interventions at earlier stages of CKD. To date, no other study has developed a prediction model with quantitative risk factors to detect the earliest stage of CKD that individuals with normal renal function in the general population may experience. We identified 11,495,668 individuals with an estimated glomerular filtration rate (eGFR) ≥90 mL/min/1.73 m2 and normo-proteinuria, who underwent health screening ≥2 times between 2009 and 2016, from the prospective nationwide registry cohort. The primary outcome was incident CKD, defined by an eGFR <60 mL/min/1.73 m2. Sex-specific multivariate Cox regression models predicting the 8-year incident CKD risk were developed. The performance of the developed models was assessed using Harrell's C and the area under the receiver operating characteristics curve (AUROC) with 10-fold cross-validation. Both men and women who met the definition of incident CKD were older and more often had a history of medical treatment for hypertension and diabetes. Harrell's C and AUROC of the developed prediction models were 0.82 and 0.83 for men and 0.79 and 0.80 for women. This study developed sex-specific prediction equations with reasonable performance in a population with normal renal function.


Introduction
Chronic kidney disease (CKD) is a global public health problem. The global prevalence of CKD stages 3-5 is reported to be 10.6%, and its prevalence is consistently high in Europe, the USA, Canada [1], and low- and middle-income countries [2]. Although the reported prevalence of and deaths from CKD have increased globally [3], disease awareness is often low [4]. Because of its asymptomatic nature, CKD is often not detected until it has already advanced, and as a consequence, the medical costs borne by patients and healthcare systems have increased [5].
Given the irreversible clinical course of CKD to end-stage kidney disease and the association of CKD with all-cause and cardiovascular morbidity and mortality [6], identifying high-risk subjects susceptible to CKD has important clinical implications [7]. Many studies have reported the cost-effectiveness and advantages of early identification [8], which enables delay or prevention of CKD progression by altering modifiable risk factors [9,10]. Because the benefit of accurate risk prediction in CKD is clear, risk prediction models for CKD have been developed and validated in Western populations [11-15]. Owing to the limited applicability and validity of these algorithms in Asian populations, several studies have developed prediction models for Asians [16-20]. These models demonstrated sufficient discrimination performance (c statistics of 0.7 to 0.8), but their clinical usefulness is still questionable because of their study populations, limited risk measures, limited follow-up time, and/or lack of evaluation information (diagnostic analysis) [21]. Most of these previous studies were based on individuals with an eGFR >60 mL/min/1.73 m2 at baseline. However, the population with an eGFR ≥90 mL/min/1.73 m2 and no trace of proteinuria, which is considered to have normal kidney function, may demonstrate different characteristics and CKD progression from the rest of the population [22]. We focused exclusively on the population with normal kidney function to develop prediction models and identify related risk factors using large, representative data. The developed models may provide more accurate prediction of incident CKD while individuals are at risk of early-stage CKD and thereby provide a meaningful clinical tool that facilitates early intervention and treatment to improve clinical outcomes.
Therefore, we conducted a study to develop and validate risk prediction models for the Korean population with normal renal function. The developed prediction models were based on 11,495,668 adults who had undergone two or more health screenings between 2009 and 2016 and had a baseline eGFR ≥90 mL/min/1.73 m2 and negative dipstick proteinuria.

Study population
This study was conducted using the national registry database derived from the National Health Insurance Service (NHIS) in South Korea. The NHIS collects all national health screening results for all Koreans and provides these data for the purpose of policy and academic research [23]. In Korea, all insured subjects have a legal obligation to participate in regular national health screening programs. During the program, they were asked to complete a self-reported standard questionnaire and undergo a routine health check-up at designated screening hospitals. From 2009 to 2016, 98,484,853 health screenings were performed on 30,209,982 subjects aged between 20 and 80 years. The specific information on the measurements in the screening data is described in detail elsewhere [24]. The screening data of 6,561,245 screenings (807,459 subjects) were excluded due to missing or outlier values. Cutoff values for identifying outliers in this study were defined after thorough review of the distribution of the data as follows: Height (100-200[cm]), weight   (29,402,523 screenings), who had not completed health screening more than twice, which were performed at least six months apart, during the study period, or those with a baseline eGFR <90 mL/min/1.73 m2 or dipstick proteinuria (trace or higher) were excluded (the same process was applied to subjects with an eGFR ≥60 mL/min/1.73 m2 regardless of their proteinuria status, and a total of 21,740,341 subjects were included). A total of 11,495,668 subjects (5,862,343 men and 5,633,325 women) were included in developing our proposed CKD risk prediction equations (Fig 1). The Institutional Review Board of Seoul National University Hospital (E-1505-034-670) waived the requirement for informed consent and approval because of the nature of this study, which retrospectively analyzed the national registry data.
The health screening data included a structured questionnaire, clinical measurements, and laboratory tests. Past medical history (stroke, heart diseases, hypertension, diabetes mellitus, and hyperlipidemia), family history (stroke, heart disease, hypertension, and diabetes mellitus), and lifestyle factors (smoking status, drinking habit, and physical activity) were collected using a structured questionnaire of the specified form. Height, weight, waist circumference, and blood pressure were measured. The laboratory tests for urine and blood were fasting serum sugar, liver function test, blood hemoglobin, serum lipids (total cholesterol, high-density lipoprotein cholesterol, and triglycerides), serum creatinine, and urine protein using a urine dipstick. Drinking habits were categorized using the World Health Organization (WHO) classification [25]: (1) no drinking; (2) low risk: average daily alcohol consumption <40 g/day for men and <20 g/day for women; (3) medium risk: average daily alcohol consumption of 40-59.9 g/day for men and 20-39.9 g/day for women; (4) high risk: average daily alcohol consumption ≥60 g/day for men and ≥40 g/day for women. The daily average alcohol consumption was calculated as follows: Average daily alcohol consumption (g/day) = [Frequency (drinking days in a week) × Quantity (number of drinks per day) × Volume of drink (50 cc) × Alcohol by volume (0.2) × Density of alcohol (0.785)] / 7 (days/week), where a standard drink is a glass of "Soju" (a distilled liquor commonly consumed in Korea). Physical activity was defined as total metabolic equivalent task minutes per week (MET-min/week) using the International Physical Activity Questionnaire (IPAQ) and categorized into 3 levels [26]: low (<600 MET-min/week), moderate (600-2999 MET-min/week), and high (≥3000 MET-min/week).
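The alcohol and physical activity categorizations above can be sketched in a few lines of Python. This is only an illustration of the stated formula and thresholds; the function names and the "male"/"female" coding are assumptions, not part of the study's SAS or R code.

```python
def average_daily_alcohol_g(drinking_days_per_week, drinks_per_day,
                            volume_cc=50.0, abv=0.2, density=0.785):
    """Average daily alcohol intake (g/day) from the formula in the text.

    Defaults follow the stated constants: a 50 cc glass of soju,
    alcohol by volume 0.2, alcohol density 0.785 g/cc.
    """
    weekly_grams = (drinking_days_per_week * drinks_per_day
                    * volume_cc * abv * density)
    return weekly_grams / 7.0


def who_drinking_category(grams_per_day, sex):
    """WHO drinking risk category; sex is 'male' or 'female' (assumed coding)."""
    if sex not in ("male", "female"):
        raise ValueError("sex must be 'male' or 'female'")
    if grams_per_day <= 0:
        return "no drinking"
    # Thresholds from the text: 40/60 g/day for men, 20/40 g/day for women.
    medium_from, high_from = (40.0, 60.0) if sex == "male" else (20.0, 40.0)
    if grams_per_day < medium_from:
        return "low risk"
    if grams_per_day < high_from:
        return "medium risk"
    return "high risk"


def ipaq_category(met_min_per_week):
    """IPAQ activity level from total MET-min/week."""
    if met_min_per_week < 600:
        return "low"
    if met_min_per_week < 3000:
        return "moderate"
    return "high"


# Example: 3 drinking days per week, 7 glasses per drinking day.
grams = average_daily_alcohol_g(3, 7)
print(round(grams, 2), who_drinking_category(grams, "male"),
      who_drinking_category(grams, "female"))
```

With the same intake, the category can differ by sex because the WHO thresholds for women are lower.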
BMI was categorized following the definition of the WHO [27]: underweight, <18.5 kg/m2; normal, 18.5-24.9 kg/m2; overweight, 25.0-29.9 kg/m2; and obese, ≥30.0 kg/m2. Incident CKD was defined as the development of a low eGFR (<60 mL/min/1.73 m2) during follow-up. The eGFR was calculated using the Chronic Kidney Disease Epidemiology Collaboration (CKD-EPI) equation based on serum creatinine level [28].
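As a rough sketch of these definitions, the snippet below computes eGFR and the BMI category. It assumes that the cited CKD-EPI equation is the 2009 creatinine-based form (the text names only the equation and its reference) and that the race coefficient is not applied; the function names are illustrative.

```python
def ckd_epi_2009_egfr(scr_mg_dl, age_years, female, black=False):
    """eGFR (mL/min/1.73 m2) from the 2009 CKD-EPI creatinine equation.

    Assumption: this is the form cited in the text; serum creatinine
    is expected in mg/dL.
    """
    kappa = 0.7 if female else 0.9
    alpha = -0.329 if female else -0.411
    ratio = scr_mg_dl / kappa
    egfr = (141.0
            * min(ratio, 1.0) ** alpha
            * max(ratio, 1.0) ** -1.209
            * 0.993 ** age_years)
    if female:
        egfr *= 1.018
    if black:
        egfr *= 1.159
    return egfr


def who_bmi_category(weight_kg, height_cm):
    """WHO BMI category using the cutoffs listed in the text."""
    height_m = height_cm / 100.0
    bmi = weight_kg / (height_m * height_m)
    if bmi < 18.5:
        return "underweight"
    if bmi < 25.0:
        return "normal"
    if bmi < 30.0:
        return "overweight"
    return "obese"


def is_incident_ckd(followup_egfr):
    """Outcome definition from the text: eGFR below 60 during follow-up."""
    return followup_egfr < 60.0


# Example: a 50-year-old man with serum creatinine of 0.9 mg/dL.
print(round(ckd_epi_2009_egfr(0.9, 50, female=False), 1))
```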

Model development
Two sets of sex-specific multivariate Cox regression models were developed to estimate the individual risk for incident CKD: one utilizing data of all the subjects with an eGFR ≥60 mL/min/1.73 m2 regardless of their proteinuria status (referred to as the overall model; S1 Table) and the other utilizing data of the subjects with an eGFR ≥90 mL/min/1.73 m2 and normo-proteinuria (referred to as the proposed model; Table 2). Univariate Cox regression was performed using all available baseline variables: age, waist circumference, BMI, SBP, DBP, FSG, total serum cholesterol, HDL, LDL, TG, GGT, SGOT and SGPT, blood hemoglobin, eGFR, past medical history (diabetes mellitus, hypertension, heart disease, stroke, and hyperlipidemia), family history (stroke, heart disease, hypertension, and diabetes mellitus), smoking status (never, past, and current smokers), physical activity (low, moderate, and high activity), and alcohol intake (low, medium, and high risk). Although this univariate screening did not account for confounding effects, all of these variables were statistically significant and were therefore included in the multivariable analysis. Our intent was to develop a model using a subset of potentially relevant factors that is both statistically parsimonious and clinically applicable and useful. To achieve this, a stepwise selection process was used to build the final multivariate Cox regression model, and risk factors with statistical significance were selected. Ten-fold cross-validation was performed to test the predicted 8-year risk of incident CKD. Discrimination was assessed using Harrell's C and the area under the receiver operating characteristics curve (AUROC) [29,30]. We performed diagnostic analyses by setting threshold values with the Youden index [31]. Youden's index (sensitivity + specificity - 1), with a bootstrapping procedure, was used to find the cutoff point that maximizes sensitivity and specificity simultaneously with equal weight.
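To make the two discrimination measures concrete, the following is a minimal pure-Python sketch of Harrell's C and the AUROC. It is a didactic O(n²) version, not the SAS or R routines used in the study; the function names are illustrative, and pairs with tied follow-up times are simply skipped as a simplifying assumption.

```python
def harrell_c(times, events, risks):
    """Harrell's concordance index for right-censored data.

    A pair (i, j) is comparable when subject i had the event and
    subject j was followed longer. It is concordant when the subject
    with the earlier event has the higher predicted risk; tied
    predictions count as half.
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue  # censored subjects cannot anchor a comparable pair
        for j in range(n):
            if times[i] < times[j]:
                comparable += 1
                if risks[i] > risks[j]:
                    concordant += 1.0
                elif risks[i] == risks[j]:
                    concordant += 0.5
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


def auroc(labels, scores):
    """AUROC as the probability that a case outscores a non-case."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one case and one non-case")
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy data: follow-up years, event indicator, predicted risk.
times = [1, 2, 3, 4]
events = [1, 1, 0, 1]
risks = [0.9, 0.7, 0.2, 0.1]
print(harrell_c(times, events, risks))
print(auroc([1, 0, 1, 0], [0.9, 0.8, 0.4, 0.1]))
```

In a 10-fold cross-validation such as the one described, each measure would be computed on the held-out fold and then averaged across the ten folds.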
To assess the predictive power of the proposed model, we divided the vulnerable group (subjects with an eGFR of 60-90 mL/min/1.73 m2 or positive dipstick proteinuria) and the less vulnerable group (subjects with an eGFR ≥90 mL/min/1.73 m2 and negative dipstick proteinuria) into 10 subgroups and computed the sensitivity and specificity of the incident CKD prediction using both the proposed model and the overall model. A two-sided p value less than 0.05 was considered statistically significant. Analyses were performed using SAS version 9.4 (SAS Institute, Cary, NC, USA) and R 3.5.2 (http://www.R-project.org).
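The cutoff selection with Youden's index described in this section can be illustrated as follows. This sketch scans every observed predicted risk as a candidate threshold and keeps the one that maximizes sensitivity + specificity - 1; the bootstrapping step mentioned in the text is omitted, and the function name and toy data are hypothetical.

```python
def youden_cutoff(labels, scores):
    """Return (cutoff, sensitivity, specificity, J) maximizing Youden's J.

    A subject is predicted positive when score >= cutoff. When several
    cutoffs tie on J, the lowest one is kept.
    """
    n_pos = sum(1 for y in labels if y == 1)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need at least one case and one non-case")
    best = None
    for cutoff in sorted(set(scores)):
        tp = sum(1 for y, s in zip(labels, scores) if y == 1 and s >= cutoff)
        tn = sum(1 for y, s in zip(labels, scores) if y == 0 and s < cutoff)
        sensitivity = tp / n_pos
        specificity = tn / n_neg
        j = sensitivity + specificity - 1.0
        if best is None or j > best[3]:
            best = (cutoff, sensitivity, specificity, j)
    return best


# Toy example: predicted 8-year risks and observed incident CKD (1 = yes).
labels = [0, 0, 1, 0, 1, 1]
scores = [0.01, 0.02, 0.03, 0.04, 0.05, 0.06]
cutoff, sens, spec, j = youden_cutoff(labels, scores)
print(cutoff, sens, round(spec, 3), round(j, 3))
```

Because sensitivity and specificity enter the index with equal weight, the chosen cutoff does not favor either type of misclassification; a screening program that values sensitivity more would need a different criterion.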

Results
Of the 11,495,668 subjects (5,862,343 men and 5,633,325 women) aged 20-80 years, 187,767 subjects (71,192 men and 116,575 women) developed incident CKD during a median follow-up of 6.2 years (6.3 years for men and 6.1 years for women). The CKD incidences per 100,000 person-years were 208.2 for men and 365.1 for women. Table 1 shows the descriptive characteristics of the study subjects. Both men and women who met the definition of incident CKD were older and more often had a history of medical treatment. In particular, among subjects with incident CKD, a high proportion had a history of medical treatment for hypertension (26.2% in men and 26.8% in women) and diabetes (12.5% in men and 9.5% in women). Table 2 shows adjusted hazard ratios (aHR) for CKD risk factors obtained from the proposed Cox regression models. In both men and women, the risk of incident CKD was positively associated with continuous covariates such as age, SBP, DBP, FSG, SGOT, and TG, and with the categories of BMI (≥25 kg/m2), current smoker, high-intensity exercise, and medical treatment history (hypertension and diabetes). The highest risks of incident CKD occurred in both sexes with diabetes (aHR 1. 38  Higher waist circumference and a history of past smoking were associated with a slightly increased risk of incident CKD in men, while a medical history of treatment for heart disease was positively associated with the risk of incident CKD only in women. Those who had higher baseline eGFR, higher HDL, higher hemoglobin, underweight BMI, alcohol intake of any amount, and a positive family history of heart disease, stroke, or hypertension had a decreased risk of incident CKD in both men and women. LDL and a medical history of stroke were negatively associated with the risk of incident CKD only in men, and a family history of diabetes mellitus was negatively associated with the risk of incident CKD only in women.
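The incidence figures above are crude rates per 100,000 person-years. The short sketch below shows the arithmetic; the total person-years are not reported in this section, so the second helper only back-calculates the value implied by the reported counts and rates. Both function names are illustrative.

```python
def incidence_per_100k(events, person_years):
    """Crude incidence rate per 100,000 person-years."""
    if person_years <= 0:
        raise ValueError("person_years must be positive")
    return events / person_years * 100_000


def implied_person_years(events, rate_per_100k):
    """Person-years implied by an event count and a reported rate."""
    if rate_per_100k <= 0:
        raise ValueError("rate_per_100k must be positive")
    return events / rate_per_100k * 100_000


# Reported values: 71,192 events at 208.2 per 100,000 person-years (men)
# and 116,575 events at 365.1 per 100,000 person-years (women).
for sex, events, rate in (("men", 71_192, 208.2), ("women", 116_575, 365.1)):
    py = implied_person_years(events, rate)
    print(sex, f"{py:,.0f} person-years implied")
```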
In the discriminatory analysis performed with the 10-fold cross-validation technique, the average Harrell's C-statistics were 0.818 for men and 0.785 for women. The average AUROCs for men and women were 0.827 and 0.801, respectively, for the proposed model. When the overall model (eGFR ≥60) was applied to the validation set of subjects with normal renal function, the average AUROCs for men and women were 0.815 and 0.796, respectively.
Applying the Youden index to find the maximum overall sensitivity and specificity in the proposed models resulted in a cutoff point of 0.0177 for men and 0.0346 for women. These cutoff points yielded a sensitivity of 74.7% and specificity of 75.2% in men and a sensitivity of 66.9% and specificity of 77.6% in women. The same analyses were applied to the overall model (eGFR ≥60) to determine its cutoff points. Then, 10-fold cross-validation was applied to the subjects with normal renal function to assess the discriminating power of each model in this group. The main purpose of the CKD prediction model is to allow early detection of CKD and the application of appropriate preventive measures before renal damage becomes irreversible. For this reason, we compared the sensitivity and specificity of the overall and proposed models (Table 3), which were constructed and validated using different datasets. The overall model was constructed from all subjects with an eGFR ≥60 mL/min/1.73 m2, and the proposed model was constructed from subjects with normal renal function (eGFR ≥90 mL/min/1.73 m2 and negative dipstick proteinuria), which is a subset of the former group. The proposed model demonstrated higher sensitivity while maintaining a sufficient level of specificity in subjects with normal renal function, demonstrating stronger discriminating power. Moreover, the proposed model demonstrated high sensitivity for subjects with renal damage, indicating that it may serve as a screening tool not only for subjects with normal renal function but also for subjects with renal damage. Schoenfeld residual plots were used to test whether the proportional hazard assumption was satisfied for each variable in both the proposed model and the overall model for men and women. No variables violated this assumption.

Discussion
Using the nationwide registry data and all clinically relevant variables, we developed and validated sex-specific equations for 8-year CKD risk prediction in the general population with normal renal function. Our risk prediction equations included most of the variables that were previously recognized as renal risk factors [32,33]. We combined the results of stepwise selections from both the proposed and overall models into one final model in order to evaluate the significance of developing the proposed model (normal renal function). The final set of predictors for incident CKD in men were age, SBP, DBP, waist circumference, FSG, GGT, SGPT, SGOT, total cholesterol, HDL, LDL, TG, blood hemoglobin, eGFR, BMI, alcohol intake, smoking status, physical activity, past medical history (heart disease, stroke, hypertension, diabetes mellitus, and hyperlipidemia), and family history (heart disease, stroke, and hypertension). For men, only family history of diabetes mellitus was excluded. For women, all variables were included in the final set except LDL cholesterol. Following the general risk prediction equations [34], parameters and specifications for our equations are shown in S2 Table. Risk estimates derived from our equations for men and women demonstrated good discrimination in the AUROC and Harrell's C calculated with 10-fold cross-validation. In previous studies [11, 13-16, 18, 20], CKD risk prediction models were developed with subjects who might have potential kidney damage. A few Asian models have been developed that incorporate such potential risks. Prediction models for CKD stages I-V [35] were developed previously [17], but no information about the exclusion criteria for the baseline subjects was given. A recent study excluded subjects with stage 3-5 CKD or dipstick proteinuria (2+ or 3+) at the initial health screening [19]. This may indicate that subjects with baseline renal damage might have been included in their study.
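For readers unfamiliar with how a fitted Cox model is turned into an absolute risk, the sketch below uses the common form risk(t) = 1 - S0(t)^exp(Σ β(x - mean)). This is an assumption about the "general risk prediction equations" referred to above; the actual parameters are those in S2 Table and are not reproduced here. The coefficients, means, and baseline survival in the example are hypothetical placeholders.

```python
import math


def cox_absolute_risk(x, beta, x_mean, baseline_survival):
    """Absolute risk at a fixed horizon from Cox model components.

    x, beta, and x_mean are dicts keyed by predictor name;
    baseline_survival is S0(t) at the horizon (for example, 8 years)
    evaluated at the mean covariate values.
    """
    if not 0.0 < baseline_survival <= 1.0:
        raise ValueError("baseline_survival must be in (0, 1]")
    linear_predictor = sum(beta[k] * (x[k] - x_mean[k]) for k in beta)
    return 1.0 - baseline_survival ** math.exp(linear_predictor)


# Hypothetical placeholders, NOT the study's estimates.
beta = {"age": 0.06, "sbp": 0.004, "diabetes": 0.32}
x_mean = {"age": 40.0, "sbp": 120.0, "diabetes": 0.05}
s0_8yr = 0.99

subject = {"age": 62.0, "sbp": 138.0, "diabetes": 1.0}
risk = cox_absolute_risk(subject, beta, x_mean, s0_8yr)
print(f"predicted 8-year risk: {risk:.4f}")
```

A predicted risk computed this way is what would be compared against a sex-specific cutoff point to flag a subject as high risk.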
Unlike any other study, subjects with an eGFR <90 mL/min/1.73 m2 or dipstick proteinuria (trace or higher) at the initial health check-up were excluded from our study; therefore, our prediction models may be more suitable for identifying the CKD risk factors and risks of subjects with healthy renal function. As our study design differs from that of other studies (which included potential risk groups), a direct comparison between models is not possible, but the discrimination of our models (C statistics and AUROC: 0.818 and 0.827 for men and 0.785 and 0.801 for women, respectively) is fairly acceptable compared to other recently developed Asian long-term prediction models (C statistics 0.826 for men and 0.827 for women [19], and AUROC 0.79 for both sexes [16]). In men, the sensitivity was 74.7% (95% CI: 74.2%-75.3%) for the proposed model and 26.8% (26.3%-27.2%) for the overall model; in women, it was 66.9% (66.0%-67.8%) for the proposed model and 39.7% (39.2%-40.3%) for the overall model. The selected predictors in the proposed model reflect many renal risk factors known to be associated with an increased risk of CKD, which may help clinicians interpret the results and discuss intervention strategies for the risk of CKD. The association between lifestyle factors and CKD has been reported, emphasizing that lifestyle factors may play a significant role in developing CKD [36,37]. All lifestyle factors (smoking, drinking, and physical activity) were selected in our final prediction model, but our study found an inverse association between incident CKD and alcohol intake, contrary to general knowledge. Our prediction model did not account for all possible phenomena, such as hidden reverse epidemiology, because our primary goal was to construct a prediction model with variables that increase predictive power rather than an inference model.
Smoking had adverse effects on CKD in our study, and the same result was seen in a case-control study by Yacoub et al. [38], who reported that current smokers have an increased risk of CKD incidence compared to nonsmokers. Physical activity (exercising for 30 minutes, 5 times per week) is recommended as a lifestyle intervention for subjects with CKD by the Kidney Disease Improving Global Outcomes [39]. Our findings showed that the high activity group (≥3000 MET-min/week) was positively associated with the risk of incident CKD. A similar result was reported in a cross-sectional study by Wang et al. [40], indicating that the high-intensity exercise group was associated with decreased eGFR (≤90 mL/min/1.73 m2). These results imply the importance of the proper intensity and duration of exercise for subjects at high risk of incident CKD, and the possible differential association between physical activity and CKD needs to be confirmed in future studies. Our study showed a protective effect of alcohol intake on the risk of incident CKD, similar to previous observations elsewhere [41-44]. This negative association between alcohol intake and incident CKD should be interpreted cautiously because higher alcohol consumption is reported to be associated with an increased risk of overall mortality [45]. Our findings suggest that future studies on these complex relationships between lifestyle factors and CKD should be conducted to clarify the associations. Above all, it is advisable to adopt a healthy lifestyle to reduce the risk of not only CKD but also other chronic health conditions.
The strengths of the current study include its study design and methodology. This study developed prediction models to estimate the risk of CKD in subjects with healthy renal function (eGFR ≥90 mL/min/1.73 m2) and no trace of proteinuria, based on large cohorts. We used the national registry database of a large number of Koreans aged 20-80 (~22 million). According to the Korean Statistical Information Service [46], the estimated general population aged 20-79 years in 2010 was 35 million. By using the data of about two-thirds of the general population, we were able to develop CKD risk prediction equations with less selection bias. Moreover, we developed and validated our prediction equations using 10-fold cross-validation to reduce bias and variance. All available variables were tested statistically, and only validated variables were selected for our final prediction equations to provide enough information for interpretation and understanding of the estimated risk of CKD. Lastly, we provided the diagnostic characteristics of different cutoff points to improve the detection of subjects at high risk for CKD; the evaluation of CKD risk with cutoff points may help physicians decide whether close monitoring or additional testing is necessary. This study has some limitations. First, the definition of the outcome was based on eGFR (<60 mL/min/1.73 m2) using the CKD-EPI equation instead of measured GFR. The measurement of GFR is not practical in most studies with a large sample size, such as ours. Second, variables reported by subjects in the self-reported standard questionnaire may be misclassified. Although the standard questionnaire was designed to be clear and easy to understand, the possibility of mistakes could not be excluded. Finally, the risk prediction equations were developed in ethnic Korean subjects; therefore, caution may be necessary when planning implementation in groups whose lifestyles differ from those of the Korean general population.
In summary, we developed and validated sex-specific risk prediction equations for incident CKD in a general population with a baseline eGFR ≥90 mL/min/1.73 m2 and negative dipstick proteinuria. To the best of our knowledge, this is the first study to develop and validate CKD risk prediction equations in subjects with normal renal function. The discriminative ability of the final prediction equations was fairly acceptable, and the selected risk factors in the equations may provide additional interpretation and understanding of subjects with normal renal function who are at high risk of CKD. The clinical utility of these equations must be validated in other populations.
Supporting information S1

Author Contributions
Conceptualization: Seung Min Lee.