Promote sign consistency in cure rate model with Weibull lifetime

In survival analysis, the cure rate model is widely adopted when a proportion of subjects have long-term survivors. The cure rate model is composed of two parts: the first part is the incident part which describes the probability of cure (infinity survival), and the second part is the latency part which describes the conditional survival of the uncured subjects (finite survival). In the standard cure rate model, there are no constraints on the relations between the coefficients in the two model parts. However, in practical applications, the two model parts are quite related. It is desirable that there may be some relations between the two sets of the coefficients corresponding to the same covariates. Existing works have considered incorporating a joint distribution or structural effect, which is too restrictive. In this paper, we consider a more flexible model that allows the two sets of covariates can be in different distributions and magnitudes. In many practical cases, it is hard to interpret the results when the two sets of the coefficients of the same covariates have conflicting signs. Therefore, we proposed a sign consistency cure rate model with a sign-based penalty to improve interpretability. To accommodate high-dimensional data, we adopt a group lasso penalty for variable selection. Simulations and a real data analysis demonstrate that the proposed method has competitive performance compared with alternative methods.


Introduction
Survival analysis is a method to analyze time to event of interest assuming that all subjects will eventually experience the event [1]. It has been widely applied in many fields such as medicine, engineering, credit scoring, etc. [2][3][4]. However, in practice, a proportion of subjects may not experience the event of interest. They are considered 'cured', or in other words, the long-term survivors. For instance, in medical practice, there may be a fraction of patients cured of their disease. Therefore, the cure rate model, an extension of survival analysis, is introduced to modeling data with long-term survivors [5].
Let i x be the covariate vector of subject i , and ( | ) ii Stx be the survival probability in time t . The cure rate model can be written as x is the survival function with the vector of regression coefficients  .
As shown in (1.1), the cure rate model is composed of two parts. The first part is the incident part, which predicts the probability of being cured. The second part is the latency part, which describes the conditional survival of the uncured subjects. Since the cure rate model can predict whether subjects are cured and the time to event of the uncured subjects, it is commonly adopted when a proportion of subjects have long-term survivors [6].
There are numerous studies in the literature regarding many extensions of the cure rate model. Cooner et al. [7] proposed a flexible hierarchical cure rate model to distinguish among underlying mechanisms that lead to cure. Rodrigues et al. [8] assumed the number of risk factors to follow the Conway-Maxwell Poisson distribution and proposed a Conway-Maxwell Poisson cure rate model to unify some cure rate models. Li et al. [9] considered a mixture of linear dependent tail-free processes as the prior for the distribution of the cure rate parameter to develop a latent promotion time cure rate model. Results showed that the cure rate model incorporated penalized spline has better performance in prediction [10]. Georgiana proposed a Bayesian spatial cure rate model with Weibull lifetime to model spatial variability in the censoring mechanism [11]. Pal et al. [12] proposed a projected non-linear conjugate gradient algorithm for the cure rate model under a competing risks scenario. Besides, many works have developed semi-parametric and nonparametric methods to investigate the effects of covariates on the outcome. For instance, Li et al. [13] proposed a semi-parametric additive predictor consisting of a sum of linear and nonparametric terms in the incident part. Chen et al. [14] modeled the covariate effects by a nonparametric form.
However, many existing works tend to pay much less attention to the relations between the vector of regression coefficients  and  in the two model parts. Generally, they assume that there are no direct constraints on the relation of the vector of regression coefficients  and  corresponding to the same covariates [15,16]. In other words, these works assume that the probability of being cured and the time to event are independent, and there are no direct constraints between the coefficients  and  .
Notably, in practical applications, the two model parts are quite related. The incident part describes the probability of cure (infinity survival), and the latency part describes the conditional survival of the uncured subjects (finite survival). It is desirable that there may be some connections between the vector of regression coefficients  and  corresponding to the same covariates.
Theoretical derivations and case studies also suggest that relaxing the assumption of no direct constraints on regression coefficients can improve the performance of the model [17,18]. Liu et al. [19] relax the assumption of no direct constraints by establishing a joint distribution of the covariates and the logarithm of the hazard rate. Fan et al. [20] incorporate structural effects of  and  in the cure rate model to improve the estimation accuracy and interpretability. A joint distribution or structural effects of  and  are crude yet effective way to impose the relations between the coefficients. However, the two parts of the model still describe two different aspects.
The assumption of a joint distribution or structural effects are too restrictive constraints and might not work well. So we consider a more flexible model that allows the coefficients  and  can be in different distributions and magnitudes. Sign consistency penalty is proposed to promote the similarity in sign to get more interpretable results by Zhang [21]. In many practical cases, it is hard to interpret the results when coefficients  and  corresponding to the same covariates have conflicting signs. In this paper, we consider a sign consistency cure rate model with a sign-based penalty. In addition, the models may suffer from bad performance due to the high dimension of the data, and grouping structure arises naturally in many practical cases. A group lasso penalty is also imposed for group variable selection.
In this paper, we propose a cure rate model with group selection and sign consistency (CRGS), which can select important groups of covariates and promote similarity in the sign of coefficients.
The proposed method can promote the similarity in the signs of coefficients  and  to improve interpretability. Compared to the individual variable selection approaches such as the sign consistency method in [22], the proposed method with the group selection approach takes into consideration the grouping structure and can lead to better prediction. Compared with previously employed approaches such as joint distribution or structural effects of coefficients, the CRGS method can avoid too strict constraints between the coefficients and lead to more consistency and hence more interpretable results.
The paper is organized as follows. In section 2, the sign consistency cure rate model with Weibull lifetime and the algorithm is introduced. Simulation is presented in section 3. Section 4 displays a real data application. Finally, section 5 discusses the conclusions.

Sign consistency cure rate model with Weibull lifetime
Consider data with n subjects and p covariates.  Examples include the expression of a multi-level factor by a group of dummy covariates and the expression of an addictive model by several basis functions [23]. In addition, grouping structure can also be introduced into a model by taking advantage of prior knowledge [24]. For example, genes belong to the same biological pathway [25].
In the incident part of the cure rate model, we adopted logistic regression, a generalized linear model, to describe the probability of cure  is the intercept.
In the latency part, let i  be the link function. The survival function is the chance an individual survives to time t given the individual will eventually experience the event of interest, while i  is the probability of an individual experiencing the event in the next instant of time t . We assume the time to event t follow the Weibull distribution. Weibull distribution is a considerable flexibility distribution to model lifetime data [26]. It had been justified to be a valid lifetime distribution within the broad family of generalized gamma models [27,28]. Referring to [1,29], the survival function for the uncured subjects following the Weibull distribution can be written as with the probability density function Here 0 r  is the shape parameter (more discussions below). The link function can be written as where 0  is the intercept, For promoting sign consistency and variable selection, we propose the following objective function Here 1 P ( , )  is the group variable selection penalty function, and 2 P ( , )  is the sign consistency penalty function as follows. Volume 7, Issue 2, 3186-3202.
where • is the 2 l norm, sign( ) • is the sign function, 1 0   and 2 0   are tuning parameters. 1  and 2  control the degree of penalty.
The first penalty 1 P ( , )  is a group lasso penalty, which can conduct group variable selection and generate more accurate estimation. Group lasso has the advantage in group variable selection performance and is commonly used in literature. It has been justified that the group lasso is more robust to noise compared with lasso when the underlying signal is strongly group-sparse [30]. In addition, different from methods in some existing works [20,22], group lasso can consider grouping structures and it has better performance compared with the group LARS and the group non-negative garrotte [31]. The second penalty 2 P ( , )  promotes the similarity in signs of coefficients  and  , which can lead to more interpretable results. The proposed method is more flexible than the methods with structural effect in [20] or joint distribution in [19] of coefficients.

EGCD algorithm
In this section, we propose the Expectation Group Coordinate Descent (EGCD) algorithm to optimize the objective function. In the E-step, we introduce a latent unobserved i U to obtain a complete log-likelihood function. In the GCD-step, group coordinate descent is adopted to iteratively update a subgroup of parameters with the remaining parameters fixed at their most recent values.
Since the sign function sign( ) • is not differentiable and continuous, and it is hard to optimize. So we introduce the following approximation for computational feasibility [21,22].
with  is a small positive constant.
The EGCD algorithm iteratively updates 0  ,  , 0  , and  in the m -th iteration.

E-step
In the E-step of the m -th iteration, let the observation of the Regarding The objective function can be written as (2.14)

GCD-step
In the GCD-step, group coordinate descent is adopted to iteratively update 0  ,  , 0  , and  . Group coordinate descent algorithms optimize the objective function with respect to a group of parameters at a time, and iteratively cycles through the parameter groups until convergence [25]. The intercept [ 1] 0 m  + is updated by ( ) Similarly, the intercept [ 1] 0 m  + is updated by ( )

Simulation
In this section, we perform a numerical study to evaluate our method, in terms of both variable selection and estimation performance. The variable selection is assessed in terms of the (1) true positive rate (TPR), and (2) false positive rate (FPR) of variable selection. Estimation is evaluated in terms of mean square error (MSE) of coefficient estimates.

Design and assessment
In Scenario 1, the covariates jk x , 1,..., jJ = , whereas = 0  when in different groups. We consider the case with discrete covariates in Scenario 2. In Scenario 2,jk x is defined as follows. 3 The sample size is 500. n = We consider low-dimension data with 40, p = and high-dimension data with 200 p = . The censoring time is generated from a Weibull distribution with the shape parameter = {0.25, 2.5}. We compare the proposed method (CRGS) with three alternative methods. The alternatives are the standard cure rate model without sign consistency and variable selection penalty (CR), the cure rate model with sign consistency (CRS), and the cure rate model with group lasso penalty (CRG), respectively. For comparison, we also consider the alternatives with the logistic regression in the incident part and the Weibull distribution in the latency part. The grouping structure and coefficients of the 2 scenarios are generated as listed in Table 2. Table 2. Grouping structure and coefficients in Scenarios 1 and 2.

Results
Tables 3 and 4 summarize the mean and standard deviation in parentheses of the MSEs, TPRs, and FPRs for Scenarios 1 and 2. Both the scenarios are repeated 100 times.
As shown in Tables 3 and 4, the MSEs of methods with group lasso penalty (the proposed and CRG) are smaller than the CRS and CR methods. The estimates of CRS and CR methods have increasing MSEs with higher dimensions. The results indicate that higher dimension leads to less efficient estimation, and group lasso penalty can improve the estimation performance. Comparing the TPRs and FPRs of the proposed and CRG methods, the proposed method has better performance in terms of variable selection. Compared with alternatives, the proposed method has the lowest MSEs. Results of simulation reveal that compared with alternatives, the proposed method can improve the performance in terms of variable selection as well as estimation. Note: in each cell, mean (standard deviation). Note: in each cell, mean (standard deviation).

Analysis of credit data in China
In this section, we apply the proposed method to credit data. The data comes from a retail business of a commercial bank in China. It contains 16 covariates of 1213 customers in a personal loan from 2014 to 2019. The primary interest is to assess the credit risk of a credit loan and find the important covariates to predict the time to default of the credit loan customers. The mean observed time is 1.38 years with a standard deviation of 0.69. Customers with missing data of annual household income are removed from our analysis. After preprocessing, the covariates and their descriptions are summarized in Table 5. By transforming the multi-level covariates, we have 24 covariates in the credit model. Censoring time i C is the interval between the value date and either default or the end of observation (June 1, 2019). Due to the different value dates of the loans, the censoring time vary from individual to individual. Customers whose time to event i Y is longer than the censoring time i C are censored ( = 0). In this data, 1201 out of 1213 customers are censored.  ii x at the median and get two different groups of customers. One group with lower ( )  ii x is denoted as "high risk", whereas another with higher ( )  ii x is denoted as "low risk". Figure 1 presents the Kaplan-Meier curves of the survival of the customers belonging to different groups. Kaplan-Meier curve describes the change of the survival probability over time. It is commonly used in survival analysis, see Rodrigues et al [8], and Pal [34]. As indicated in Figure 1, the "low risk" group has higher survival probability than the "high risk" group. The median estimates of coefficients based on 100 duplicates are listed in Table 6. A positive coefficient  indicates that the covariate is positively related to the probability of default, and a positive coefficient  indicates that the covariate is negatively related to default time. Both the probability of default and default time are two quite relevant credit aspects. Customers with a higher probability of default are likely to default earlier. Compared with the alternative method, the signs of the  and  of the proposed method are promoted to be more consistent, whereas many covariates such as the housing status in the CR method have an opposite effect on the probability and time to default.
The coefficient results of the proposed method reveal that loan line, early repayment, gender, housing status, annual household income, employment status, type of workplace, and occupation are important covariates for credit risk assessment. The impact of business type and education on credit is not clear.
The loan line has a positive effect. One possible explanation is that customers with better credit status are more likely to obtain a higher loan line. Customers with housing loans are more likely to default. The increasing annual household income leads to better credit status. Employed customers are less likely to default. Compared with other employment groups such as self-employed, freelance, and unemployed, the employed group has a more stable income and is less likely to default. Customers who work in a government organization and institution have better credit status. Customers who are managers or commercial and service workers are less likely to default. Customers with early payment records tend to maintain good credit records and are less likely to default. Compared with women, men are more likely to default. This is consistent with the results of [35] and the personality characteristics of men's risk preference [36].

Conclusions
The cure rate model is commonly used when the data has long-term survivors. The model is composed of two parts. The incident part describes the probability of cure and the latency part describes the survival function of the uncured group. The drawback of the standard cure rate model is that it assumes that there are no direct constraints between the coefficients corresponding to the same covariates in the two parts of the model. This may lead to conflicting results of covariate effects on the probability of cure and conditional survival of the uncured group. In fact, the two parts of the model describe quite related aspects. It is desirable that there may be some connections between coefficients corresponding to the same covariates. Existing works have considered joint distribution or structural effect of the two sets of covariates, which is too strict.
In this paper, we consider a more flexible cure rate model that allows the two sets of covariates can be in different distributions and magnitudes. In the proposed method, a sign consistency cure rate model is proposed to promote the similarity in the sign of coefficients in the two model parts to improve interpretability. In addition, we also impose a group lasso penalty for variable selection. Simulation results show that compared with alternatives, the proposed method has better performance in terms of variable selection and estimation. An analysis of credit data in China illustrates that the proposed method can improve prediction performance as well as interpretability.