The lasso binary logistic regression method for selecting variables that affect the recovery of Covid-19 patients in China

Lasso regression performs least squares estimation under an ℓ1 constraint. This is a form of regularization that adds a penalty proportional to the sum of the absolute values of the coefficients; it yields sparse models in which some coefficients become exactly zero and the corresponding variables are eliminated. Larger penalties push coefficient values closer to zero, which produces simpler models. In this study, the lasso method was applied to select variables affecting the recovery of Covid-19 patients. The data consisted of the number of patients treated in several hospitals or clinics in China, the number of recovered patients, demographics, comorbidities, symptoms, and treatment. The performance of lasso binary logistic regression was compared with that of the full logistic model and stepwise logistic regression. The results showed that the lasso method selected more independent variables than the stepwise method, and that the coefficient values produced by the lasso method were smaller than those produced by the stepwise method. The independent variables that affect the cure rate of Covid-19 patients according to the lasso are gender, comorbidities, diabetes, cardiovascular disease, cough, fatigue, diarrhea, platelet count and antibiotics.


Background
Least Absolute Shrinkage and Selection Operator (lasso) regression applies the least squares method under an ℓ1 constraint [1]. The parameter estimator is obtained from the Lagrangian form, in which a penalty equal to lambda times the sum of the absolute values of the regression coefficients is added to the objective. The lasso shrinks the regression coefficients toward zero [2]. It can therefore select the independent variables that explain the response variable, much as stepwise regression does.
Coronavirus disease 2019 (Covid-19) is a disease that was first identified in Wuhan, China, in December 2019. As of October 26, 2020, there were 43,405,696 cases, 1,159,835 deaths and 31,935,211 recoveries worldwide, and these numbers were still rising. In China, however, the trend had flattened, with 85,810 cases, 4,634 deaths and 80,911 recoveries [3].
Meta-analysis is a quantitative approach that systematically combines the results of previous research to reach conclusions [4]. In this study, 202 papers about Covid-19 in China were selected, of which 26 were taken as the observations of the study. The data were analyzed with the lasso binary logistic regression model, and the results of the lasso model were then compared with those of the full model and the stepwise regression model. This comparison was conducted to examine a characteristic of the lasso, namely that it yields smaller coefficient values than the other methods.

Research purposes
This study has two objectives: (1) to build a lasso logistic regression model to assess the effect of demographics, disease history, symptoms and treatment on the recovery of Covid-19 patients, and to compare the model's performance with that of a full logistic regression model and a stepwise logistic regression model; and (2) to find the variables that explain the recovery of Covid-19 patients with lasso binary logistic regression.

Binary Logistic Regression
The binary logistic regression model expresses the log-odds of the response as a linear function of the independent variables.
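For reference, a standard way of writing the binary logistic regression model is sketched below. The notation ($\pi_i$, $\beta_0$, $\beta_j$, $x_{ij}$, $p$) is an assumption made for illustration, since the original symbols are not available in this text; $p = 9$ would correspond to the nine independent variables of this study.

```latex
\pi_i = \frac{\exp\left(\beta_0 + \sum_{j=1}^{p} \beta_j x_{ij}\right)}
             {1 + \exp\left(\beta_0 + \sum_{j=1}^{p} \beta_j x_{ij}\right)},
\qquad
\operatorname{logit}(\pi_i) = \ln\frac{\pi_i}{1 - \pi_i}
  = \beta_0 + \sum_{j=1}^{p} \beta_j x_{ij}
```

Here $\pi_i$ is the probability of recovery in observation $i$, $x_{ij}$ is the value of the $j$-th independent variable in observation $i$, $\beta_0$ is the intercept and $\beta_j$ are the regression coefficients.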

Stepwise regression
The stepwise regression is a modification of the forward selection. After the forward process, all candidate variables in the model are checked again. The stepwise regression requires two significance levels: one for adding variables and one for removing variables [6].

The lasso binary logistic regression
The lasso is a technique for shrinking the coefficients of predictors [7]. It is a penalized regression method in which the penalty shrinks the coefficients toward zero [8]. In general form, the coefficient estimators of the linear model minimize the residual sum of squares, and the lasso estimator minimizes the same criterion with an added ℓ1 penalty on the coefficients.
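The estimators described here can be sketched in a common textbook form, given below. The symbols are assumptions made for illustration: $\lambda \ge 0$ is the penalty parameter, $y_i$ is the response (in the logistic case, the number of recovered patients out of $m_i$ patients in observation $i$), and $\pi_i$ is the recovery probability implied by the logistic model. Some software scales the likelihood term differently, which changes the numerical scale of $\lambda$.

```latex
% Least squares and lasso estimators for the linear model
\hat{\beta}^{\text{ls}} = \arg\min_{\beta_0, \beta}
  \sum_{i=1}^{n} \Bigl( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \Bigr)^2,
\qquad
\hat{\beta}^{\text{lasso}} = \arg\min_{\beta_0, \beta}
  \Bigl\{ \sum_{i=1}^{n} \Bigl( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \Bigr)^2
  + \lambda \sum_{j=1}^{p} |\beta_j| \Bigr\}

% Lasso estimator for binary logistic regression with grouped data
\hat{\beta}^{\text{lasso}} = \arg\min_{\beta_0, \beta}
  \Bigl\{ -\sum_{i=1}^{n} \bigl[ y_i \ln \pi_i + (m_i - y_i) \ln (1 - \pi_i) \bigr]
  + \lambda \sum_{j=1}^{p} |\beta_j| \Bigr\},
\qquad
\ln\frac{\pi_i}{1 - \pi_i} = \beta_0 + \sum_{j=1}^{p} \beta_j x_{ij}
```

The intercept $\beta_0$ is left unpenalized, and the binomial log-likelihood is written up to an additive constant that does not depend on the coefficients.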

Data
The data in this study consisted of twenty-six (26) observations and nine (9) independent variables. The data were selected from the COVID Analytics website [9], which holds 202 research results with more than one hundred variables from research conducted between December 2019 and April 2020. The observations in this study were groups of hospitals or clinics in certain areas of China.
The data consisted of a response variable and nine independent variables. The response variable was formed from the population size of Covid-19 patients (Y1) and the number of patients who recovered (Y2), while the independent variables consisted of demographic information, disease history, symptoms and treatment. The demographic variable is the percentage of male patients. Disease history includes diabetes and cardiovascular disease. Symptoms consisted of cough, fatigue, diarrhea, and platelet count. The treatment variable is antibiotics, expressed as a percentage. The characteristics of the data can be seen in Table 1, where the number of platelets is recorded in units of 10^9/L and the treatment variable X9 (antibiotics) as a percentage; both are numeric.

Method
Steps of the research
The data were analyzed using R, SAS and Excel software. The steps of the study were (1) collecting data, where the selected data consist of 26 observations and 9 independent variables, and (2) constructing a lasso binary logistic regression model with minimum lambda, minimum Akaike Information Criterion (AIC) and minimum Bayesian Information Criterion (BIC).
The relationship between the percentage of Covid-19 patients who recovered and each independent variable follows a different pattern, see Figure 1. These scatter patterns need to be supported by the formal results of the lasso regression in Table 3, which make the patterns clearer. From Figure 1 and Table 3, Y (Covid-19 patients who recovered) was positively correlated with X1 (percentage of male patients) and X8 (platelet count of the patients). As an illustration of this positive correlation, the higher the percentage of male patients in a hospital, the more patients recover. Y was negatively correlated with X2, X4, X5, X6, X7 and X9. As an example of this negative correlation, the more patients who suffer from cardiovascular disease (X4), the fewer Covid-19 patients recover.
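The model-building step, fitting an ℓ1-penalized binary logistic regression to grouped data, can be illustrated with the minimal sketch below. It is not the procedure used in the study (which relied on R and SAS): it swaps in a plain proximal gradient method, scales the log-likelihood by the total number of patients, and runs on synthetic data. The function names, the step size, the iteration count and all data values are hypothetical.

```python
import math
import random

def sigmoid(t):
    """Numerically stable logistic function."""
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)

def soft_threshold(z, t):
    """Proximal operator of t*|z|: shrink z toward zero by t."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0

def fit_lasso_logistic(X, trials, successes, lam, step=0.5, n_iter=3000):
    """Minimize (-loglik / total trials) + lam * sum(|beta_j|); intercept unpenalized."""
    p = len(X[0])
    total = float(sum(trials))
    b0, beta = 0.0, [0.0] * p
    for _ in range(n_iter):
        g0, g = 0.0, [0.0] * p
        for xi, m, y in zip(X, trials, successes):
            pi = sigmoid(b0 + sum(b * x for b, x in zip(beta, xi)))
            r = (m * pi - y) / total  # scaled residual: expected minus observed recoveries
            g0 += r
            for j in range(p):
                g[j] += r * xi[j]
        b0 -= step * g0  # plain gradient step for the intercept
        # gradient step followed by soft-thresholding for the penalized coefficients
        beta = [soft_threshold(b - step * gj, step * lam) for b, gj in zip(beta, g)]
    return b0, beta

# Synthetic grouped data (hypothetical, NOT the study data): 26 groups, 3 predictors.
rng = random.Random(0)
X = [[rng.uniform(-1.0, 1.0) for _ in range(3)] for _ in range(26)]
trials = [50] * 26
true_b0, true_beta = 0.5, [2.0, 0.0, -1.0]
successes = [
    round(m * sigmoid(true_b0 + sum(b * x for b, x in zip(true_beta, xi))))
    for xi, m in zip(X, trials)
]

for lam in (1.0, 0.05, 0.001):
    b0, beta = fit_lasso_logistic(X, trials, successes, lam)
    n_selected = sum(1 for b in beta if b != 0.0)
    print(f"lambda={lam}: selected {n_selected} of 3, beta={[round(b, 3) for b in beta]}")
```

With the largest penalty in this sketch every coefficient stays at zero, and variables typically enter the model as lambda decreases, which mirrors the entry of variables along a decreasing lambda sequence.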

The lasso binary logistic regression method
The lasso binary logistic regression model with minimum lambda, minimum AIC and minimum BIC is built in 20 stages, as shown in Table 3. Initially, the lasso model contains only an intercept, with lambda equal to one, AIC = 1518.6 and BIC = 1519.9. Step one enters X6 with lambda = 0.8, giving AIC = 1517.3 and BIC = 1519.8. Step two enters X8 with lambda = 0.64, giving AIC = 1509 and BIC = 1512.7. The information for stages 3 to 19 can be seen in Table 3. Finally, stage 20 uses the minimum lambda = 0.0115, with minimum AIC = 1290.1 and minimum BIC = 1301.4; this lambda produces a model with 8 independent variables and one intercept that explains the recovery rate of Covid-19 patients.
Table 3. Selected detail.
The most accurate coefficients of the lasso binary logistic regression model are generated using the minimum lambda of 0.0115, and the lasso regression model is written from the results in Table 4. To see the characteristic of the lasso whereby coefficient values shrink toward zero, the coefficients are compared with those of the full model and the stepwise model. The full model yields four independent variables that can explain the recovery of Covid-19 patients: X4, X6, X7 and X9, each with a p-value of less than 0.05. Likewise, the stepwise regression model yields four independent variables that can explain the recovery of Covid-19 patients, the same ones as in the full model. In contrast, the lasso regression has 8 independent variables that can explain the recovery of Covid-19 patients. These variables fall into two groups: those with a positive effect (X1 and X8) and those with a negative effect (X2, X4, X5, X6, X7 and X9).
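The AIC, BIC and lambda values reported for the stages can be cross-checked with simple arithmetic, sketched below. The check assumes the usual definitions AIC = 2k - 2 ln L and BIC = k ln(n) - 2 ln L, so that BIC - AIC = k(ln n - 2) regardless of the likelihood; it also assumes n = 26 and that k counts the intercept plus the variables in the model, neither of which is stated explicitly in the source. The geometric ratio 0.8 for lambda is an observation from the reported values, not a documented setting.

```python
import math

def bic_minus_aic(k, n):
    """BIC - AIC = k*ln(n) - 2k when both use the same log-likelihood."""
    return k * (math.log(n) - 2.0)

n_obs = 26  # number of observations in the study

# (label, assumed parameter count k, reported AIC, reported BIC)
reported = [
    ("intercept only", 1, 1518.6, 1519.9),
    ("step 1 (X6 enters)", 2, 1517.3, 1519.8),
    ("step 2 (X8 enters)", 3, 1509.0, 1512.7),
    ("stage 20 (8 variables)", 9, 1290.1, 1301.4),
]

for label, k, aic, bic in reported:
    expected = bic_minus_aic(k, n_obs)
    observed = bic - aic
    print(f"{label}: reported BIC-AIC = {observed:.1f}, k(ln n - 2) = {expected:.2f}")

# Reported lambdas 1, 0.8, 0.64, ..., 0.0115 match powers of 0.8 (assumed ratio).
for stage in (0, 1, 2, 20):
    print(f"stage {stage}: 0.8**{stage} = {0.8 ** stage:.4f}")
```

Under these assumptions the reported differences agree with k(ln n - 2) to within rounding, which is consistent with the final model having nine parameters (eight independent variables plus the intercept), and 0.8 raised to the 20th power rounds to the reported minimum lambda of 0.0115.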

Conclusions
The conclusions of the study are: 1) the minimum lambda in the lasso is 0.0115, giving a model with nine selected terms (eight independent variables and one intercept), an AIC of 1290.1 and a BIC of 1301.4; 2) the lasso binary logistic regression model has more independent variables and smaller coefficient values than the stepwise logistic regression model, which is consistent with the characteristics of the lasso; 3) the variables that affect patient recovery based on the lasso binary logistic regression are male, comorbidity, diabetes, cardiovascular disease, cough, fatigue, diarrhea, platelets and antibiotic treatment; 4) male and platelets have a positive effect, while comorbidity, diabetes, cardiovascular disease, cough, fatigue, diarrhea and antibiotics have a negative effect.