A Noninvasive Prediction Model for Hepatitis B Virus Disease in Patients with HIV: Based on the Population of Jiangsu, China

Objective To establish a machine learning model for identifying patients coinfected with hepatitis B virus (HBV) and human immunodeficiency virus (HIV) through two sexual transmission routes in Jiangsu, China. Methods A total of 14197 HIV cases transmitted by homosexual and heterosexual routes were recruited. After data processing, 12469 cases (HIV and HBV, 1033; HIV, 11436) remained for further analysis, including 7849 cases with homosexual transmission and 4620 cases with heterosexual transmission. Univariate logistic regression was used to select variables with a significant P value and odds ratio for multivariable analysis. In the homosexual and heterosexual transmission groups, 10 and 6 variables were selected, respectively. To identify HIV individuals coinfected with HBV, a machine learning model was constructed with four algorithms: Decision Tree, Random Forest, AdaBoost with decision trees (AdaBoost), and extreme gradient boosting decision tree (XGBoost). The detective value of each variable was calculated using the optimal machine learning algorithm. Results The AdaBoost algorithm showed the highest efficiency in both transmission groups (homosexual transmission group: accuracy = 0.928, precision = 0.915, recall = 0.944, F1 = 0.930, and AUC = 0.96; heterosexual transmission group: accuracy = 0.892, precision = 0.881, recall = 0.905, F1 = 0.893, and AUC = 0.98). Calculated by the AdaBoost algorithm, the detective value of PLA was the highest in the homosexual transmission group, followed by CR, AST, HB, ALT, TBIL, leucocyte, age, marital status, and treatment condition; in the heterosexual transmission group, the detective value of PLA was also the highest, followed by ALT, AST, TBIL, leucocyte, and symptom severity. Conclusions Univariate logistic regression combined with the AdaBoost algorithm could accurately screen the risk factors of HBV in HIV coinfection without invasive testing. Further studies are needed to evaluate the utility and feasibility of this model in various settings.


Background
Acquired immune deficiency syndrome (AIDS), caused by human immunodeficiency virus (HIV), is a global health crisis [1]. HIV invades the CD4+ T lymphocytes of the human immune system, leading to a variety of opportunistic infections and death [2,3]. Hepatitis B (HB), a chronic disease characterized by hepatitis B virus (HBV) infection for more than 6 months and varying degrees of inflammatory necrosis/fibrosis of the liver, also poses a global health threat [4]. Worldwide, estimates suggest that more than 2 billion people have been infected with HBV, including 248 million chronic infections (defined as hepatitis B surface antigen (HBsAg) positivity) [4,5]. Approximately 10% of HIV-infected people are also chronically coinfected with HBV [6]. Coinfected patients present with rapidly progressive liver disease and face a higher risk of cirrhosis and even death [7]. HBV infection worsens the prognosis in HIV-positive people [6]. Therefore, efficient tools should be developed to identify HIV/HBV coinfection before specific treatments are established.
Because China has the largest HBV-infected population worldwide, its HIV/HBV coinfection rate may exceed the global average [8]. HIV and HBV share similar transmission routes and risk factors, which makes it difficult to create an accurate method for distinguishing HIV/HBV coinfection from HIV infection at an early stage [8].
It is difficult to use traditional logistic regression to process demographic and serological data, which are often nonlinear, non-normally distributed, and heterogeneous [9]. Machine learning (ML) offers an alternative. Unlike a logistic regression model, ML analysis does not assume a particular data structure and can balance the bias and variance of the fitted model. Nowadays, ML methods have wide applications in the medical field, such as the diagnosis of cancers and diabetes, medical imaging, and pediatric diseases [10][11][12]. In the present study, we established an ML model for identifying HIV patients coinfected with HBV based on four algorithms.
Previous studies have explored the differences in baseline variables between HIV patients and HIV/HBV coinfected patients. Compared to the HBV infection group, the HIV/HBV coinfection group had a lower level of platelets [13,14] and a higher level of liver enzymes, especially aspartate aminotransferase and alanine aminotransferase [15][16][17]. At the onset of coinfection, the CD4 count dropped significantly and recovered slowly after cART [16,18,19]. In addition, compared to HIV infection, HBV/HIV coinfection raised serum bilirubin levels [17]. Nevertheless, HBV/HIV coinfection could complicate the diagnosis of HBV on account of spontaneous loss of hepatitis B surface antibody and reactivation of HBV replication [14]. Hepatitis B surface antigen (HBsAg) seroclearance has been analyzed during the treatment and prognosis of chronic hepatitis B (CHB) using four ML algorithms: extreme gradient boosting (XGBoost), Random Forest, Decision Tree, and logistic regression (LR) [20]. However, our study is the first to use such algorithms to screen patients with HBV/HIV coinfection.

Study Population.
A total of 14197 HIV and HIV/HBV cases infected through homosexual or heterosexual transmission and recorded between 2005 and 2019 were recruited. Demographic and serological data were obtained from the Jiangsu Provincial Center for Disease Control (CDC). Demographic information mainly consisted of age, marital status, severity of symptoms (fever, cough, and so on), clinical stage, weight, and height. Four clinical stages of AIDS were included. The geographic setting was divided into Jiangsu province and other provinces. Marital status was divided into four classes: unmarried, married or cohabiting, divorced or separated, and widowed. The presence or absence of tuberculosis, drug use, and symptoms was recorded as "yes" or "no". The serological indexes comprised blood creatinine level (CR), leucocyte, triglyceride (TG), total cholesterol (TC), aspartate aminotransferase (AST), alanine aminotransferase (ALT), and total bilirubin (TBIL).

Data Processing.
A total of 14197 HIV and HIV/HBV cases were recruited from the CDC of Jiangsu. Variables with missing or abnormal values, as well as samples with many null variables (4 variables and 1728 cases in total), were deleted. The remaining 12469 cases (1033 HIV and HBV cases and 11436 HIV cases) were used for further analysis, including 7849 infected through homosexual transmission and 4620 through heterosexual transmission. For variables with only a few nulls, missing values were replaced with the most frequent value or, for continuous variables, with the mean. Categorical variables were presented as frequency (percentage), and normally distributed continuous variables were presented as mean ± standard deviation.
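The replacement rule for sparse nulls can be illustrated with a minimal Python sketch. It is not the study's actual pipeline: the record layout, the field names `marital_status` and `alt`, and the toy values are hypothetical, and the only behaviour taken from the text is "most frequent value for categorical variables, mean for continuous variables".

```python
from collections import Counter
from statistics import mean

def impute(records, categorical, continuous):
    """Fill None values: mode for categorical fields, mean for continuous ones."""
    filled = [dict(r) for r in records]  # copy so the input stays untouched
    for field in categorical:
        observed = [r[field] for r in records if r[field] is not None]
        mode = Counter(observed).most_common(1)[0][0]
        for r in filled:
            if r[field] is None:
                r[field] = mode
    for field in continuous:
        observed = [r[field] for r in records if r[field] is not None]
        average = mean(observed)
        for r in filled:
            if r[field] is None:
                r[field] = average
    return filled

# Hypothetical toy records; field names and values are illustrative only.
records = [
    {"marital_status": "unmarried", "alt": 20.0},
    {"marital_status": "married", "alt": None},
    {"marital_status": None, "alt": 40.0},
    {"marital_status": "unmarried", "alt": 30.0},
]
filled = impute(records, categorical=["marital_status"], continuous=["alt"])
```

The statistics are computed from the observed values only, so an imputed entry never influences the mode or the mean used for another entry.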

ML Model Establishment.
An ML model was established based on the Decision Tree, Random Forest, AdaBoost, and XGBoost algorithms. Specifically, Decision Tree is a supervised learning method based on a set of "if-then" rules, which makes it easy to read [21]. However, a single Decision Tree may suffer from high variance and overfitting. To address this, we employed ensembles of trees, such as Random Forest, to improve the prediction ability of the model. The AdaBoost and XGBoost algorithms were used to combine shallow trees (stumps) [1]. AdaBoost is based on reweighting the samples that were wrongly classified, and XGBoost is based on gradient boosting. In this study, we used these four methods to classify HBV coinfection among HIV patients.
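The idea of combining stumps by reweighting misclassified samples can be sketched in plain Python. This is a toy illustration of discrete AdaBoost with one-split stumps, not the implementation used in the study; the function names, the labels in {-1, +1}, and the six-point data set are assumptions made for the example.

```python
import math

def fit_stump(X, y, w):
    """Return (weighted error, feature, threshold, polarity) of the best stump."""
    best = None
    for j in range(len(X[0])):
        for thr in sorted({row[j] for row in X}):
            for pol in (1, -1):
                pred = [pol if row[j] <= thr else -pol for row in X]
                err = sum(wi for wi, p, yi in zip(w, pred, y) if p != yi)
                if best is None or err < best[0]:
                    best = (err, j, thr, pol)
    return best

def stump_predict(stump, row):
    _, j, thr, pol = stump
    return pol if row[j] <= thr else -pol

def adaboost(X, y, rounds=3):
    n = len(X)
    w = [1.0 / n] * n  # start with uniform sample weights
    ensemble = []
    for _ in range(rounds):
        stump = fit_stump(X, y, w)
        err = max(stump[0], 1e-10)  # avoid division by zero for a perfect stump
        if err >= 0.5:  # no better than chance: stop adding stumps
            break
        alpha = 0.5 * math.log((1 - err) / err)  # vote of this stump
        ensemble.append((alpha, stump))
        # Increase the weight of misclassified samples, decrease the others.
        w = [wi * math.exp(-alpha * yi * stump_predict(stump, row))
             for wi, row, yi in zip(w, X, y)]
        total = sum(w)
        w = [wi / total for wi in w]
    return ensemble

def predict(ensemble, row):
    score = sum(alpha * stump_predict(stump, row) for alpha, stump in ensemble)
    return 1 if score >= 0 else -1

# Toy one-feature data that no single stump can separate.
X = [[1], [2], [3], [4], [5], [6]]
y = [1, 1, -1, -1, 1, 1]
model = adaboost(X, y, rounds=3)
predictions = [predict(model, row) for row in X]
```

On this toy set each stump alone misclassifies at least two points, while the weighted vote of three stumps classifies all six correctly, which is the behaviour boosting is designed to produce.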

Feature Selection.
Informative features can be selected using ML and traditional statistical methods, such as Random Forest [22], logistic regression [23], principal component analysis (PCA), analysis of variance, and the Fisher discriminant ratio [24]. In this study, we used univariate logistic regression to keep the original data structure and make the variables comprehensible. The corresponding odds ratio (OR) with 95% confidence interval (95% CI) and P value were calculated.
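For a single binary predictor, the OR and Wald 95% CI obtained from univariate logistic regression coincide with the values computed from the 2 × 2 table of predictor against outcome, which allows a short worked sketch. The function name and the counts below are hypothetical and are not taken from the study data.

```python
import math

def odds_ratio_ci(a, b, c, d, z=1.96):
    """OR and Wald CI from a 2x2 table.

    a: exposed with outcome      b: exposed without outcome
    c: unexposed with outcome    d: unexposed without outcome
    """
    odds_ratio = (a * d) / (b * c)
    # Standard error of log(OR); equals the SE of the logistic coefficient.
    se_log_or = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(odds_ratio) - z * se_log_or)
    upper = math.exp(math.log(odds_ratio) + z * se_log_or)
    return odds_ratio, lower, upper

# Hypothetical counts, for illustration only.
odds_ratio, lower, upper = odds_ratio_ci(20, 80, 10, 90)
```

A variable would typically be carried forward when its interval excludes 1; in this toy table the OR is 2.25 but the interval still includes 1, so the association would not count as significant at the 5% level.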

Data Balancing.
Among HIV patients, those coinfected with HBV formed a small minority, so the two classes were imbalanced. Generally, it is more difficult to classify a small population than a large one [25]. Imbalanced data can be balanced with random undersampling (RUS) or random oversampling (ROS). RUS removes a fraction of the HIV/AIDS patients without HBV to equalize the numbers in both categories. After RUS, computation is faster and needs less memory, but some information may be lost. In this study, the Synthetic Minority Oversampling Technique (SMOTE), a method that evolved from ROS, was used for data balancing. SMOTE increases the number of samples in the smaller class by generating synthetic samples according to fixed rules, for example, by interpolating with a random factor between existing minority samples, which achieves class balance while preserving the original information.
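A minimal sketch of the SMOTE idea follows, assuming the standard formulation in which each synthetic sample lies on the segment between a minority sample and one of its k nearest minority neighbours. The function name, the parameters `n_new`, `k`, and `seed`, and the four toy points are illustrative assumptions; the sketch does not reproduce the settings used in the study.

```python
import random

def smote(minority, n_new, k=3, seed=0):
    """Generate n_new synthetic points by interpolating between minority samples."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        x = rng.choice(minority)
        # k nearest minority neighbours of x (x itself excluded),
        # ranked by squared Euclidean distance.
        neighbours = sorted(
            (p for p in minority if p is not x),
            key=lambda p: sum((a - b) ** 2 for a, b in zip(x, p)),
        )[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # random position along the segment x -> neighbour
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(x, neighbour)))
    return synthetic

# Toy minority class with two features; values are illustrative only.
minority = [(1.0, 1.0), (2.0, 1.0), (1.0, 2.0), (2.0, 2.0)]
new_points = smote(minority, n_new=6, k=2)
```

Because every synthetic point is a convex combination of two real minority samples, it stays inside the region already occupied by the minority class, in contrast to ROS, which only duplicates existing samples.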
Model Evaluation.
After balancing, the data were divided into two groups: a development group (70%) and a validation group (30%) [26]. To improve the accuracy of the Decision Tree, we plotted the validation curve based on fivefold cross-validation. The receiver operating characteristic (ROC) curve [27], confusion matrix, accuracy, precision, recall, and comprehensive evaluation index (F-Measure) were used to judge the performance of the model. The relevant concepts are defined as follows.
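The 70/30 division and the fivefold scheme can be sketched with index lists in plain Python. The rounding rule, the fixed seed, and the interleaved fold assignment are assumptions made for illustration; with this rounding, a data set of 8385 records (the size of the balanced heterosexual data set reported in the Results) yields 5870 development and 2515 validation records.

```python
import random

def split_70_30(n, seed=0):
    """Shuffle indices 0..n-1 and split them 70% / 30%."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    cut = (7 * n + 5) // 10  # 70% of n, rounded to the nearest integer
    return indices[:cut], indices[cut:]

def k_folds(indices, k=5):
    """Yield (train, validation) index lists for k-fold cross-validation."""
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        held_out = folds[i]
        train = [j for fold in folds[:i] + folds[i + 1:] for j in fold]
        yield train, held_out

development, validation = split_70_30(8385)
fold_sizes = [(len(train), len(held_out)) for train, held_out in k_folds(development)]
```

Each development record is held out exactly once across the five folds, so a hyperparameter such as tree depth can be scored on data the corresponding model did not see, while the 30% validation group stays untouched until the final evaluation.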
The confusion matrix presents the diagnostic results as a table that summarizes the numbers of correct and incorrect classifications by category. It shows which classes the classification model confuses during identification; its tabular form is shown in Table 1.
Accuracy refers to the proportion of correct classifications among all samples, calculated with the following formula, where TP, TN, FP, and FN are the numbers of true positives, true negatives, false positives, and false negatives in the confusion matrix:
$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$
However, accuracy is a poor summary of a classifier on imbalanced data. For example, if 99 of 100 cases in the development group are positive, a model that predicts every case as positive reaches an accuracy of 99% without any ability to recognize negative cases, so the result makes no sense.
Precision means the proportion of correctly predicted HIV/HBV coinfected cases among all cases predicted as HIV/HBV coinfected. It represents the probability that a positive prediction is correct:
$$\mathrm{Precision} = \frac{TP}{TP + FP}$$
The difference between accuracy and precision is that the former represents the correctness of prediction over all positive and negative samples, while the latter represents the correctness of prediction among the samples predicted as positive.
Recall represents the proportion of correctly predicted HIV/HBV coinfected cases among all actual HIV/HBV coinfected cases:
$$\mathrm{Recall} = \frac{TP}{TP + FN}$$
It is a trade-off measurement with precision and should be balanced according to actual requirements.
F-Measure is the weighted harmonic average of Precision and Recall. With equal weights, it is the F1 score:
$$F_1 = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$$
The ROC curve is used for evaluating the performance of the ML model. The vertical coordinate of the ROC curve is the true-positive rate (TPR), and the horizontal coordinate is the false-positive rate (FPR).
AUC represents the area under the ROC curve, whose value is between 0.5 and 1. An AUC close to 1 indicates better performance of the model [28].
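The evaluation measures described in this section can be computed directly from confusion-matrix counts and predicted scores. The sketch below uses the standard definitions and, for AUC, the equivalent rank formulation (the probability that a randomly chosen positive case scores higher than a randomly chosen negative case); the function names and toy inputs are illustrative, not study data.

```python
def classification_metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall, and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)  # correct among predicted positives
    recall = tp / (tp + fn)     # correct among actual positives
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1

def auc(labels, scores):
    """Area under the ROC curve via pairwise comparison; ties count as 0.5."""
    positives = [s for label, s in zip(labels, scores) if label == 1]
    negatives = [s for label, s in zip(labels, scores) if label == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))

# Toy inputs for illustration only.
accuracy, precision, recall, f1 = classification_metrics(tp=8, fp=2, fn=2, tn=8)
area = auc([1, 1, 0, 0], [0.9, 0.4, 0.5, 0.1])
```

The pairwise form of AUC is quadratic in the number of samples, which is acceptable for a small illustration; sorting-based implementations are preferable for data sets of the size used here.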
All data analyses were carried out with R version 4.0.2 and Python version 3.7.

Results
After data processing, 12469 individuals were incorporated into the model. There were 7849 cases in the homosexual transmission group, including 7239 HIV cases and 610 HIV/HBV coinfected cases. There were 4620 cases in the heterosexual transmission group, including 4197 HIV cases and 423 HIV/HBV cases. Univariate logistic regression analysis was performed in both groups. In the homosexual group, the risk of HBV infection increased with age. Compared with unmarried patients, those who were married or cohabiting and those who were divorced or separated had a higher risk of HBV. Some blood indexes, such as leucocyte, platelet, blood creatinine, ALT, AST, and TBIL, were significantly different between the HIV and HIV/HBV groups, suggesting that they can be used as risk-evaluating indicators. In the heterosexual transmission group, AST, ALT, TBIL, and platelet levels were significantly different between the HIV and HIV/HBV groups, which was consistent with the homosexual group. Compared with unmarried patients, the other marital-status groups showed no significant differences, unlike in the homosexual group. In both groups, the geographic setting did not differ between the HIV and HIV/HBV groups. Details are shown in Table 2. Moreover, we drew forest plots of the univariate logistic regression results (Figure 1). Figure 1 depicts the baseline information of the HIV and HIV/HBV cases and the OR values with 95% CI in the homosexual and heterosexual transmission groups.
The original data in the heterosexual transmission group contained 4620 cases. After applying the SMOTE algorithm, the number rose to 8385 cases, which were randomly divided into the development group (5870 cases) and the validation group (2515 cases). In the homosexual transmission group, the original data consisted of 7849 cases. After applying the SMOTE algorithm, the number rose to 14482 cases, which were randomly divided into the development group (10137 cases) and the validation group (4345 cases). The ratio of the development group to the validation group was 7 : 3. Results of the SMOTE algorithm are presented in Table 3.
The performance of the ML model in the heterosexual and homosexual groups is shown in Table 4 and Table 5. The confusion matrix is displayed as a heat map, in which a larger value corresponds to a darker region. The TN and TP regions were almost orange; the lighter the FN and FP regions, the higher the accuracy of the model. We found that the AdaBoost algorithm was the most accurate in both groups (homosexual group: accuracy = 0.928, precision = 0.915, recall = 0.944, F1 = 0.930, and AUC = 0.96; heterosexual group: accuracy = 0.892, precision = 0.881, recall = 0.905, F1 = 0.893, and AUC = 0.98). ROC curves of the four algorithms in the homosexual and heterosexual groups are shown in Figure 2.
The AdaBoost algorithm had the strongest classification capacity. Therefore, we ranked the significant variables using AdaBoost in the homosexual and heterosexual groups. AdaBoost calculated the detective value of each variable with three algorithms. However, the three scores of each variable showed no significant difference, indicating only a mild effect on the ranking. Finally, we chose the default method for further analysis. In the homosexual transmission group, 10 variables were selected by univariate logistic regression. Among them, the detective value of PLA was the highest, followed by CR, AST, HB, ALT, TBIL, leucocyte, age, marital status, and treatment condition (Figure 3). Figure 4 depicts the scores of the 6 variables selected by univariate logistic regression in the heterosexual group. Among them, the score of PLA was the highest (consistent with the homosexual group), followed by ALT, AST, TBIL, leucocyte, and symptom severity.

Discussion
Globally, among the annual 1.3 million HIV-related deaths, 96% are caused by complications of chronic hepatitis, 66% by HBV [29]. In China, about 8.7% of HIV/AIDS patients are coinfected with HBV, which shares the same transmission routes [30]. By 31 October 2019, the prevalence of AIDS had remained at a low level, but the HIV positive rate was still high in some high-risk groups. Data show that since the start of the Thirteenth Five-Year Plan (2016-2020), the HIV positive rate among men who have sex with men (MSM) has stayed between 6.97% and 8.58%. AIDS is mainly transmitted through sexual routes. In a large study of HIV-positive Chinese, the prevalence of HBV and HIV coinfection was 9.5%, highest in Eastern China (14.5%) and lowest in Central China (5.0%) [31]. Over one-third of HIV and HBV coinfected patients develop moderate-to-severe liver disease [32]. However, it is difficult to find an accurate model for identifying HBV coinfection among HIV patients.
In the present study, an ML model was constructed with four algorithms. The accuracy, recall, precision, and AUC value of each algorithm were analyzed. The SMOTE method was used to balance the data. In the homosexual transmission group, the accuracy calculated by DT, RF, AdaBoost, and XGBoost was 0.779, 0.844, 0.928, and 0.875, respectively; the precision was 0.750, 0.804, 0.915, and 0.844, respectively; the recall was 0.839, 0.910, 0.944, and 0.919, respectively; the AUC was 0.805, 0.921, 0.982, and 0.944, respectively; the F1 value was 0.792, 0.854, 0.930, and 0.880, respectively. In the heterosexual transmission group, the accuracy was 0.762, 0.837, 0.892, and 0.863, respectively; the precision was 0.710, 0.838, 0.881, and 0.860, respectively; the recall
We next chose the AdaBoost algorithm for further analysis. For the 10 variables selected by univariate logistic regression in the homosexual group, we calculated an importance ranking through ML analysis. The higher the feature score, the more a variable contributes to the detection of HBV coinfection in HIV cases. PLA had the highest score in our study, closely followed by CR, AST, HB, ALT, TBIL, leucocyte, age, marital status, and treatment condition, which is similar to previous studies [33][34][35]. PLA also had the highest detective value in the heterosexual group, followed by ALT, AST, TBIL, leucocyte, and symptom severity. Some studies have demonstrated a strong association of PLT, age, AST, ALT, and INR with liver fibrosis in coinfected patients [35]. The AST-to-platelet ratio index (APRI) and FIB-4 score have been proposed for staging hepatic fibrosis [36]. Future studies are needed to explain the differences between the two transmission groups.
Extensive studies have attested to the fact that HBV and HIV infections have mutually adverse effects. Compared to HIV patients, HIV and HBV coinfected patients display faster immunological and clinical progression, stronger hepatotoxicity after initiation of cART, and higher morbidity and mortality. Similarly, HIV impairs host immunity against HBV, because CD4 cells are destroyed by HIV. Coinfection with HIV can accelerate HBV-induced liver damage, leading to cirrhosis and advanced liver disease. Liver enzymes (mainly ALT and AST), synthetic function (albumin and prothrombin time), bilirubin, complete blood count, and platelet count, especially a gradual decline in platelet count, may be more sensitive markers of progressing liver fibrosis [37]. However, more sensitive markers should be explored.
In both the homosexual and heterosexual groups, HBV and HIV coinfected patients had a significantly lower platelet count and higher ALT and AST levels, suggesting the presence of advanced fibrosis. The higher baseline serum bilirubin suggests that tuberculosis is more strongly associated with HIV and HBV coinfection than with HIV infection alone. Multiple factors contribute to the hematological manifestations of HIV disease, such as anemia, neutropenia, and thrombocytopenia. This may explain why the leucocyte count was lower in the coinfected patients in both groups.
HIV infection influences all hematopoietic cell lineages and can cause a spectrum of hematological abnormalities [38]. The pathogenesis of HIV-associated anemia is not fully understood but is assumed to be multifactorial [32]. In addition, an association between CD4 count and hemoglobin level has been reported: the anemia rate increased significantly in patients with a low CD4 count, and a low hemoglobin level increased the risk of death in patients with AIDS, independent of the CD4 count [32]. In the homosexual transmission group, coinfected patients showed a lower hemoglobin level, suggesting that HBV facilitates HIV progression and increases the risk of death.
CR can be used to assess kidney function [39]. A study of 90 newly diagnosed HIV patients with viral hepatitis infection at the Cape Coast Teaching Hospital HIV clinic showed that severe kidney malfunction (chronic kidney disease stage 4, with eGFR < 15 mL/min/1.73 m²) was found only, and was nearly significant, in those with HIV-1/HBV [31]. This finding is also supported by the higher CR in heterosexual coinfected patients in the present study.
Unfortunately, we did not collect data on CD4+ T cell count, which is a reliable marker for assessing liver disease progression. Previous studies have shown that the baseline CD4 cell count is lower in HBV/HIV coinfected individuals than in monoinfected individuals. This association may be related to HBV genotype, chronic immune activation, cytokines, and the apoptotic pathways involved in these infections [14].
There are some limitations in our study as well. First, the data are provided by the CDC, so some potentially relevant variables may be left out. Second, other ML algorithms should be evaluated. Third, the efficiency of the model should be clinically validated.

Conclusions
Univariate logistic regression combined with the AdaBoost algorithm could accurately screen the risk factors of HBV in HIV coinfection without invasive testing. We found that PLA was the most significant detective variable in both the homosexual and heterosexual transmission groups and deserves more attention. The risk of HBV coinfection, as well as its risk factors, differed between the sexual transmission routes of HIV, but detailed evidence requires further study. Further studies are needed to evaluate the utility and feasibility of this model in various settings.

Data Availability
The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest
We declare that we do not have any commercial or associative interest that represents a conflict of interest in connection with the work submitted.