Variational Bayes survival analysis for unemployment modelling

Mathematical modelling of unemployment dynamics attempts to predict the probability of a job seeker finding a job as a function of time. This is typically achieved by using information in unemployment records. These records are right censored, making survival analysis a suitable approach for parameter estimation. The proposed model uses a deep artificial neural network (ANN) as a non-linear hazard function. Through embedding, high-cardinality categorical features are analysed efficiently. The posterior distribution of the ANN parameters is estimated using a variational Bayes method. The model is evaluated on a time-to-employment data set spanning from 2011 to 2020 provided by the Slovenian public employment service. It is used to determine the employment probability over time for each individual on the record. Similar models could be applied to other questions with multi-dimensional, high-cardinality categorical data including censored records. Such data is often encountered in personal records, for example in medical records.

Some recent approaches employ recurrent neural networks (RNN) in order to predict the empirical probability distribution of future events [30,31]. These models are widely applicable but share one drawback: they provide only point estimates of the posterior distributions of the survival/hazard functions.
A point estimate only provides a partial view of the results. For example, in a classification problem, this would mean providing the most probable class without the assessment of the probability that a certain entity belongs to the selected class. The cases where the confidence in the classification is overwhelming (for instance 90%) and the cases where the confidence is low (for instance 51%) would result in the same output. However, these cases must be handled very differently in any subsequent decision steps. In the presented case of survival analysis, where the underlying phenomenon is inherently stochastic, it is clear that maximum likelihood estimates do not provide the whole picture.
A method harnessing the power of obtaining complete posterior distributions, like with Markov chain Monte Carlo approaches, while preserving the efficiency of the machine learning methods, is the variational Bayes (VB) method [32]. It has been successfully applied in various fields, such as Gaussian processes modelling [33,34], deep generative models [35], compressed sensing [36,37], hidden Markov models [38,39], reinforcement learning and control [40], etc. It is possible to define a survival model with an arbitrary risk function that is implemented as an artificial neural network (ANN) and estimate the posterior distribution of the ANN model parameters despite the large number of parameters.
All these properties come into play when building survival models from personal records, for instance medical or PES data. Such records tend to be multi-dimensional and usually contain high-cardinality categorical data [41-43]. They are therefore ideal candidates for the VB-based survival method.
The proposed model estimating time-to-employment is a survival model with an ANN used as a non-linear hazard function. Half of the covariates are categorical, some of them with high cardinality. The data set of unemployment records is described in Section 5. The results are presented in Section 6 and discussed in Section 7.

Survival analysis
Survival analysis is the study of time-to-event data that initially focused on estimation of lifespans. The majority of the developed methods investigate continuous-time models [44], but there are also developments on discrete-time approaches [45].
Survival analysis explores the probability distribution of an event over time. The probability that an event occurs at time T before a certain time t can be written as

F(t) = Pr[T < t] = ∫_0^t f(s) ds,    (1)

where the functions f(t) and F(t) are the probability density and the cumulative probability function, respectively. The opposite case, i.e. the probability that the time of occurrence T will be after a certain time t, is

S(t) = Pr[T ≥ t] = 1 − F(t).    (2)

The function S(t) is called the survival function. The hazard rate, or hazard function, λ(t) is defined as the event rate at time t if the event has not occurred up until t and is expressed as

λ(t) = lim_{Δt→0} Pr[t ≤ T < t + Δt | T ≥ t] / Δt = f(t) / S(t).    (3)

The survival function can be obtained from the hazard function as

S(t) = exp(−∫_0^t λ(s) ds).    (4)

Survival models are categorised based on the type of the hazard rate. Often, the multiplicative hazard function is used [44],

λ(t|x) = λ_0(t) h(x).    (5)

Its two components are the baseline hazard λ_0(t) and a risk function h(x), which depends on directly measurable covariates x that are also known as explanatory variables or features. In the simplest form, i.e. Cox proportional hazards models, the risk function is linear, h_β(x) = β^T x, and needs no constant term as it is included in the baseline term λ_0(t). The parameter set β is identified from the data. This is achieved using various estimation approaches, the most popular being the Kaplan-Meier estimator [46]. For the case of non-linear models, the risk function h(x) can be an arbitrary real function. In such cases, various forms of ANNs have been applied [26-28]. Another possibility is the use of accelerated failure-time models [44], where the logarithm of the survival time y is approximated with linear regression,

y = ln T = β^T x + σ W.    (6)

The probability distribution of the error W is transformed using the regression coefficients β, the covariates x, and the scale factor σ to obtain the probability distribution of the event time T. The choice of the probability distribution of W directly defines the probability distribution of the survival time T. Typical pairs are listed in Table 1.
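For illustration, the following Python sketch numerically evaluates the relations (1)-(4) for an assumed hazard function; the Weibull-type hazard and its parameters are chosen arbitrarily and are not part of the presented model.

```python
import numpy as np

# Illustrative hazard function (not from the paper): a Weibull-type hazard.
def hazard(t, k=1.5, lam=100.0):
    return (k / lam) * (t / lam) ** (k - 1)

t = np.linspace(1e-6, 720, 2000)          # time grid in days
lam_t = hazard(t)

# S(t) = exp(-integral of the hazard), evaluated with a cumulative trapezoidal rule.
cum_hazard = np.concatenate(
    ([0.0], np.cumsum(np.diff(t) * 0.5 * (lam_t[1:] + lam_t[:-1])))
)
S = np.exp(-cum_hazard)

# f(t) = lambda(t) * S(t) and F(t) = 1 - S(t), cf. equations (1)-(4).
f = lam_t * S
F = 1.0 - S

print(f"S(180 d) = {np.interp(180.0, t, S):.3f}")
```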
The model can be extended by replacing the linear function β^T x with an arbitrary real function of the covariates h_z(x), where z is the set of the parameters of the non-linear function. When the model (6) is extended this way, the equation reads

y = ln T = h_z(x) + σ W.    (7)

This has been done with the output of an ANN used as h_z(x) [23,26,47,48].

Censoring
Two particularities of time-to-event data sets that complicate their processing are censoring and truncation [44]. Censoring happens when the time of event may only be known to be within a particular time interval. Truncation occurs when the cases with event times outside of the observation period are not observed. In the studied example, there is no truncation and only right censoring, also known as Type I censoring. That is, all the subjects are observed, the timing of the event is known if it occurs prior to the end of the observation time, and the exact timing is unknown for the events that occur later.
Censoring might lead to false conclusions if it is not accounted for properly [20]. The distinction between survival analysis and simple regression is in the handling of censored data.

Addressed limitations of typical survival models
The most common survival analysis models are of the types (5) and (7) and are limited to functions h(x) or h_z(x) for which the likelihood function exists in closed form [44]. However, the modelling benefits from the use of more general functions that are capable of handling non-linear relationships between the covariates x.
For those, the evidence p(x) in equation (8) is intractable. Solving the equation (8) in order to estimate the posterior probability distribution p(θ|x) of the model parameters θ thus requires the use of an approximation.
We choose to use an ANN in place of h_z(x) in equation (7) and to estimate the posterior probability distribution p(θ|x) using the VB method [32]. Another issue is the use of high-cardinality discrete covariates, necessitating some form of dimension embedding. Such properties are becoming typical in medical records [41-43] and are also present in this study.

Variational Bayes method
When performing a stochastic analysis, such as survival analysis, the inference process of estimating the model parameters relies on the Bayes' rule. For a set of observations x, generated by a system with parameters θ, the Bayes' rule reads

p(θ|x) = p(x|θ) p(θ) / p(x).    (8)

The likelihood p(x|θ) is typically known because it is prescribed by the model structure. The prior p(θ) is typically chosen. The biggest obstacle in computing the posterior probability distribution p(θ|x) is the evidence p(x). It is the solution of the equation

p(x) = ∫ p(x|θ) p(θ) dθ,    (9)

which in most cases cannot be obtained in a closed form. For multi-dimensional cases, even Monte Carlo integration becomes impractical due to the immense computational load. The VB method provides an approximative solution to this problem.
The idea of the VB method is to sufficiently closely approximate the true posterior p(θ|x) with an approximative probability distribution q_{ω*}(θ), referred to as the variational distribution. It belongs to a variational family of functions q_ω(θ) ∈ Q, ω ∈ Ω, where Ω is the set of all possible values of the latent parameters ω.
A typical variational family is the mean-field variational family [49] in which the parameters θ are mutually independent.
The optimal values of the latent parameters ω* are obtained by minimising the Kullback-Leibler (KL) divergence KL(q_ω(θ) || p(θ|x)) between the variational distribution and the true posterior. The optimisation problem reads

ω* = arg min_{ω ∈ Ω} KL(q_ω(θ) || p(θ|x)).    (10)
Since the true posterior is unknown, calculating the KL divergence requires a minor rearrangement. As shown by Šmídl and Quinn [32], KL(q_ω(θ) || p(θ|x)) can be written as

KL(q_ω(θ) || p(θ|x)) = −( E_{q_ω(θ)}[log p(x, θ)] − E_{q_ω(θ)}[log q_ω(θ)] ) + log p(x).    (11)

The expression in the parentheses is known as the evidence lower bound (ELBO) and maximising it results in minimising the KL divergence between the variational distribution and the true posterior. The second term, log p(x), is constant and is therefore not a part of the optimisation process.
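For illustration, the ELBO in (11) can be estimated by Monte Carlo sampling from the variational distribution; the sketch below uses a toy log-joint density and a mean-field Gaussian q_ω(θ), both assumed purely for the example.

```python
import torch

def elbo_estimate(log_joint, mu, log_sigma, num_samples=64):
    """Monte Carlo estimate of ELBO = E_q[log p(x, theta)] - E_q[log q(theta)]
    for a mean-field Gaussian variational distribution."""
    sigma = log_sigma.exp()
    q = torch.distributions.Normal(mu, sigma)
    theta = q.rsample((num_samples,))            # reparameterised samples of theta
    log_q = q.log_prob(theta).sum(-1)            # log q(theta), factors multiplied
    log_p = log_joint(theta)                     # log p(x, theta) for each sample
    return (log_p - log_q).mean()

# Toy log-joint (assumed for illustration): standard normal prior and a single
# Gaussian observation x = 1.3 with unit noise.
def toy_log_joint(theta):
    prior = torch.distributions.Normal(0.0, 1.0).log_prob(theta).sum(-1)
    lik = torch.distributions.Normal(theta.sum(-1), 1.0).log_prob(torch.tensor(1.3))
    return prior + lik

mu = torch.zeros(2, requires_grad=True)
log_sigma = torch.zeros(2, requires_grad=True)
print(elbo_estimate(toy_log_joint, mu, log_sigma))
```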
Generally, the criterion (11) is not convex and no optimisation algorithm guarantees convergence to a global extremum. Several optimisation algorithms have been used for this problem, such as coordinate ascent variational inference and conjugate models [49], stochastic variational inference [50], black-box variational inference [51], and, partially, automatic differentiation variational inference [52].
In the presented case, the optimisation of the ELBO loss function is performed using the stochastic variational inference with gradients of loss function calculated following the black-box variational inference [51].
According to Hoffman et al. [50,Eq. (23)], this iterative algorithm converges to the optimal parameters if the objective function is convex or to a local optimum otherwise.
Having an approximative variational distribution instead of the true posterior distribution introduces an inherent bias which depends on the variational family used. The selection of the variational family Q is thus not an ad-hoc decision but is based on prior knowledge such as empirical observations or experts' knowledge. In spite of the inherent bias, the VB method is justified by the substantial increase in computational efficiency compared to alternatives such as Monte Carlo integration. In our analysis, the ADAM optimiser [53,54], implemented as a part of the PyTorch package [55], is used for the optimisation (10) through Pyro [56]. The overall idea is schematically presented in Figure 1.
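A minimal Pyro sketch of this workflow is given below for illustration; the simple Gaussian model is assumed for the example only and is not the survival model used in this study.

```python
import torch
import pyro
import pyro.distributions as dist
from pyro.infer import SVI, Trace_ELBO
from pyro.optim import Adam

def model(x):
    # Illustrative model: unknown mean with a standard normal prior.
    theta = pyro.sample("theta", dist.Normal(0.0, 1.0))
    with pyro.plate("data", len(x)):
        pyro.sample("obs", dist.Normal(theta, 1.0), obs=x)

def guide(x):
    # Mean-field Gaussian variational distribution for theta.
    mu = pyro.param("mu", torch.tensor(0.0))
    sigma = pyro.param("sigma", torch.tensor(1.0),
                       constraint=dist.constraints.positive)
    pyro.sample("theta", dist.Normal(mu, sigma))

x = torch.randn(100) + 2.0                     # synthetic observations
svi = SVI(model, guide, Adam({"lr": 0.01}), loss=Trace_ELBO())
for step in range(2000):
    loss = svi.step(x)                         # one stochastic gradient step on -ELBO
```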

Variational Bayes method in survival analysis
The use of the model of the form (7) with an ANN as h_z(x) results in the integral (9) that cannot be calculated analytically. Obtaining the posterior probability density function p(θ|x) thus requires an approximation. ANNs typically have hundreds of parameters, precluding even the use of Monte Carlo methods. Another option are Gaussian processes that are shown to be equivalent to a fully connected ANN with an infinite number of hidden units in each layer [57]. A variational inference approach provides a computationally efficient solution using the ELBO (11) as the loss function.
With the VB method, one has to select the prior distribution p(θ) and the variational family Q of the posterior distribution. This is followed by the optimisation (10) using the ELBO loss function (11), resulting in the variational distribution latent parameters ω*. In the most general form of the VB method, every model parameter may be treated as a stochastic one.
We implement the function h_z(x) as an ANN. The model parameters are assembled as θ = [σ, z], where z = [z_1, . . . , z_K] are the ANN parameters. We use a mean-field variational family, so the probability distribution q_ω(θ) can be expressed as

q_ω(θ) = q_0(σ) ∏_{k=1}^{K} q_k(z_k),    (12)

where K is the number of the ANN parameters. For the variational distribution, we use a half-normal distribution for the factor q_0(σ) and a normal distribution for every q_k(z_k). The vector of latent parameters is therefore ω = [σ_σ, µ_1, σ_1, . . . , µ_K, σ_K], with q_0(σ) = (2/σ_σ) φ(σ/σ_σ) for σ ≥ 0 and q_k(z_k) = (1/σ_k) φ((z_k − µ_k)/σ_k), where φ is the probability density function of the standard normal probability distribution.
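For illustration, such a variational distribution could be written as a Pyro guide as sketched below; the parameter names and the value of K are assumptions, while the initial values match the initial guess ω^(0) used later for parameter estimation.

```python
import torch
import pyro
import pyro.distributions as dist

K = 271 * 200  # number of ANN parameters in the first layer, assumed for illustration

def guide(x=None, y=None, c=None):
    # Latent parameter of q_0(sigma): scale of the half-normal factor.
    sigma_sigma = pyro.param("sigma_sigma", torch.tensor(5.0),
                             constraint=dist.constraints.positive)
    pyro.sample("sigma", dist.HalfNormal(sigma_sigma))

    # Latent parameters of q_k(z_k): mean and scale of each normal factor.
    mu = pyro.param("mu", torch.zeros(K))
    scale = pyro.param("scale", torch.ones(K),
                       constraint=dist.constraints.positive)
    pyro.sample("z", dist.Normal(mu, scale).to_event(1))
```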
Models relying on the VB method are described using the concepts of probabilistic graphical models [58,59] and directed factor graphs [60]. The model (7) is transformed into a VB form that takes censoring of the observations into account. The directed factor graph is shown in Figure 2.
The used VB approach relies purely on numerical evaluation and requires only proper specification of the model. That is, only the model likelihood p(x|θ) and the observations for parameter estimation x have to be specified. The prior probability distribution p(θ) is assumed to be constant and the variational family Q is chosen. In our case, the likelihood is given by equation (7) and the variational family Q is as described in equation (12).

Parameter estimation
As derived by Wingate and Weber [61], the gradient of the criterion function is expressed as

∇_ω L(ω) = E_{q_ω(θ)}[ ∇_ω log q_ω(θ) (log p(x, θ) − log q_ω(θ)) ],    (13)

where L(ω) is the ELBO. This expression is used in stochastic gradient optimisation to find an estimate of ω*. The likelihood of the i-th training data point for a sampled value of θ is calculated based on equation (7).
As illustrated in Figure 2, the evaluation process depends on whether the observation is censored or not.
The vector of covariates is labelled with x^(i), y^(i) is the logarithm of the time-to-event or time-to-censoring, and c^(i) is the censoring label. For a complete observation, i.e. c^(i) = 0, the gradient (13) is calculated at the observed y^(i). For censored observations, i.e. c^(i) = 1, the loss is calculated from the predicted probability of censoring at y^(i), which equals the survival function 1 − CDF_W evaluated at y^(i). That is, the predicted probability distribution of c^(i) is Bernoulli with the expected value 1 − CDF_W. The complete procedure is also shown as pseudo code in Algorithm 1. For the initial guess ω^(0), we use σ_σ = 5 and µ_k = 0, σ_k = 1 for all k ∈ {1, . . . , K}.
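A sketch of this censoring-aware log-likelihood in PyTorch is given below for illustration; the function and variable names are assumptions, and the standard normal W corresponds to the log-normal event-time model adopted later.

```python
import torch
from torch.distributions import Normal

def log_likelihood(y, c, mu, sigma):
    """Per-observation log-likelihood for log survival times y, cf. equation (7).

    y     -- log of time-to-event or time-to-censoring
    c     -- censoring label (0 = event observed, 1 = right censored)
    mu    -- h_z(x), the ANN output for each observation
    sigma -- scale of the error term W
    """
    sigma = torch.as_tensor(sigma, dtype=y.dtype)
    w = Normal(0.0, 1.0)
    z = (y - mu) / sigma
    # Complete observations: density of y under the model.
    log_f = w.log_prob(z) - torch.log(sigma)
    # Censored observations: probability that the event happens after y,
    # i.e. the survival function 1 - CDF_W evaluated at (y - mu) / sigma.
    log_s = torch.log1p(-w.cdf(z))
    return torch.where(c.bool(), log_s, log_f)
```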

Estimating the survival function S(t)
Solving the survival model in the Bayesian framework, the survival function S(t|x) for a given value of the covariates x and of the model latent parameters ω can be obtained as the posterior predictive distribution, using the identified variational distribution q_ω(θ) instead of the true unknown posterior. The probability density function is

p(t|x) = ∫ p(t|x, θ) q_ω(θ) dθ.    (15)

The survival function S(t|x) = Pr[T ≥ t] is by definition expressed as

S(t|x) = ∫_t^∞ p(s|x) ds    (16)

and can be evaluated from (15) using Monte Carlo integration. It is a fairly simple and computationally efficient process. The pseudo code is presented in Algorithm 2.
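In the spirit of Algorithm 2, the Monte Carlo evaluation of S(t|x) can be sketched as follows; the function names are placeholders, and the log-normal survival function corresponds to the error distribution used in this study.

```python
import torch
from torch.distributions import Normal

def survival_curve(t, x, sample_theta, risk_fn, num_samples=200):
    """Monte Carlo estimate of S(t|x) from the variational distribution.

    t            -- 1-D tensor of evaluation times (days)
    x            -- covariate vector of one job seeker
    sample_theta -- callable returning one sample (z, sigma) from the variational distribution
    risk_fn      -- callable h_z(x) returning the ANN output for parameters z
    """
    std_normal = Normal(0.0, 1.0)
    s = torch.zeros_like(t)
    for _ in range(num_samples):
        z, sigma = sample_theta()
        mu = risk_fn(z, x)                    # location of log T for this parameter sample
        # Log-normal survival function: S(t) = 1 - Phi((ln t - mu) / sigma).
        s += 1.0 - std_normal.cdf((torch.log(t) - mu) / sigma)
    return s / num_samples
```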
We have mentioned that Monte Carlo integration of the integral (9) would be too computationally intensive, but the integral (15) is easy to calculate using the Monte Carlo method even though the integration variable θ has the same dimensionality in both cases. The reason is that the probability density function q_ω(θ) is known in an explicit, factorised form, so it can be sampled directly and efficiently.

Algorithm 1: VB-based model parameter estimation assuming normal distribution as the variational family.
Input: N observations, where x^(i) are covariates and y^(i) is the logarithm of time-to-event or time-to-censoring; variational family Q; prior p(θ); model structure p(y|x, θ).

Variational Bayesian survival analysis applied to unemployment modelling
PES records provide a rich description of job seekers in the form of their profiles. One of the most common metrics that PES associates with every job seeker is the so-called probability of exit, referring to "exiting" from their records. It should be noted that not every "exit" is due to employment but also events such as retirement, PES determining that someone is not genuinely seeking a job, and many others.
Some of the job seekers entering the records at any point in time have exited at a later known date, while the others are still unemployed. From the data perspective, this is a clear example of right censored data.
Survival analysis is therefore a viable tool for analysing PES data where the probability of exit from PES records serves the function of the hazard rate.

Data structure
Every job seeker is described using 19 covariates. Some describe personal characteristics of each job seeker; the full list is given in Table 2. It should be noted that the covariates with cardinality 2 are treated as continuous and do not go through the process of embedding.

Model error distribution
When defining the survival model, one of the key decisions is the selection of the probability density function of the error W in (7). Since the probability of finding a job decreases over time [11], a suitable choice is the standard normal distribution for W. The survival time T in (7) then follows the log-normal distribution and the hazard rate is monotonically decreasing after its maximum [44]. The probability density function of the event f(t) in (1) and the corresponding survival function S(t) become

f(t) = (1 / (σ t)) φ((ln t − µ) / σ),    S(t) = 1 − Φ((ln t − µ) / σ),

where φ and Φ are the probability density and the cumulative density functions of the standard normal distribution, respectively, and µ = h_z(x). Some typical shapes of the hazard functions for various values of µ and σ are shown in Figure 3.
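For illustration, the log-normal density, survival function and hazard rate can be evaluated as in the short sketch below; the values of µ and σ are arbitrary and serve only to show the rise-then-decay shape of the hazard.

```python
import numpy as np
from scipy.stats import norm

mu, sigma = 5.0, 0.8                   # illustrative values of h_z(x) and sigma
t = np.linspace(1.0, 1000.0, 1000)     # time in days

z = (np.log(t) - mu) / sigma
f = norm.pdf(z) / (sigma * t)          # log-normal density f(t)
S = 1.0 - norm.cdf(z)                  # survival function S(t)
hazard = f / S                         # hazard rate lambda(t) = f(t) / S(t)

print(f"hazard peaks around day {t[np.argmax(hazard)]:.0f}")
```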

Artificial neural network model for the risk function h_z(x)
The underlying risk function h_z(x) is modelled using a deep ANN of the architecture shown in Figure 4.
At the entry level of the network, the dimensions of categorical (discrete) covariates are reduced and the continuous covariates are normalised. The network then consists of three repeated groups of linear layer, normalisation, and dropout [62]. The cardinality of the categorical covariates ranges from simple boolean (yes/no) up to 3772 categories for a covariate describing occupation. The dimensions are reduced using embedding [25], with the reduced number of dimensions n determined by the rule (17), which depends on the original number of categories d and is capped at 50.

As listed in Table 2, there are 9 continuous covariates and 10 discrete ones. Using (17), the categorical values are embedded into 262 dimensions. Therefore, the first linear node has 271 input and 200 output values. The structure of the other layers is shown in Figure 4.
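A PyTorch sketch of such an architecture is given below for illustration; the 271-to-200 first layer follows the text, while the remaining layer widths, the activation, the dropout rate and the embedding-dimension rule are assumptions, as the exact values appear only in Figure 4 and equation (17).

```python
import torch
import torch.nn as nn

class RiskNet(nn.Module):
    """Risk function h_z(x): embeddings for categorical covariates followed by
    three repeated blocks of linear layer, normalisation and dropout."""

    def __init__(self, cardinalities, n_continuous=9, hidden=(200, 100, 50), p_drop=0.1):
        super().__init__()
        # One embedding per categorical covariate; the dimension rule is illustrative.
        self.embeddings = nn.ModuleList(
            nn.Embedding(d, min(50, (d + 1) // 2)) for d in cardinalities
        )
        emb_dim = sum(e.embedding_dim for e in self.embeddings)
        sizes = [emb_dim + n_continuous, *hidden]
        blocks = []
        for n_in, n_out in zip(sizes[:-1], sizes[1:]):
            blocks += [nn.Linear(n_in, n_out), nn.BatchNorm1d(n_out),
                       nn.ReLU(), nn.Dropout(p_drop)]
        self.body = nn.Sequential(*blocks)
        self.head = nn.Linear(sizes[-1], 1)      # scalar output h_z(x)

    def forward(self, x_cat, x_cont):
        emb = torch.cat([e(x_cat[:, i]) for i, e in enumerate(self.embeddings)], dim=1)
        return self.head(self.body(torch.cat([emb, x_cont], dim=1))).squeeze(-1)
```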
The variational distribution of each ANN parameter z_i is the normal distribution q_{ω_i}(z_i). The latent parameters ω_i = {µ_i, σ_i} represent the mean value and the variance of the variational distribution. In essence, such a choice means that the resulting variational distribution will not be able to capture any dependencies among the network parameters. However, this is not a general limitation of the method: the mean-field approximation treats the latent parameters of the selected variational distribution as mutually independent [49], but the model parameters, in our case the weights of the neural network, would not have to be treated as mutually independent. The standard normal distribution is used for W, and the parameter σ is sampled from a half-normal distribution.

Results
The data set contains daily updates on every job seeker in Slovenia from 2011 up to 2020. The network is trained on the data covering 12 months and evaluated on the records from the following 6 months.

Hyperparameter selection
The optimisation is performed with the ADAM optimiser. We use two standard approaches for choosing the optimiser's parameters. Initially, following the guidelines from Smith [63], the loss is checked over a range of the optimiser's parameters, in particular the learning rate. Furthermore, the impact of stepwise adaptation of the learning rate is also checked, as suggested in [64]. It should be noted that the whole data set is used in this analysis.
As shown in Figure 6, the minimal value of the loss function over a fixed number of iterations was achieved for a learning rate somewhat above 1. Therefore, one tenth of this learning rate is used for the training process.
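A learning-rate range test of this kind can be sketched as follows; the model, guide and data arguments are placeholders, and the candidate learning rates are assumptions.

```python
import numpy as np
import pyro
from pyro.infer import SVI, Trace_ELBO
from pyro.optim import Adam

def lr_range_test(model, guide, data, lrs=np.logspace(-4, 1, 20), steps=200):
    """Run a short SVI training for each candidate learning rate and record
    the final ELBO loss, following the range-test idea of Smith [63]."""
    losses = {}
    for lr in lrs:
        pyro.clear_param_store()                 # start each run from scratch
        svi = SVI(model, guide, Adam({"lr": float(lr)}), loss=Trace_ELBO())
        loss = None
        for _ in range(steps):
            loss = svi.step(*data)
        losses[float(lr)] = loss
    return losses
```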

Stopping criterion
Typically, the stopping criterion is the convergence of the ELBO loss (11), i.e. a sufficiently small relative change of its value [65].
There are improved stopping criteria such as [66,67], which use a combination of a smaller step size and Monte Carlo gradient estimates. The potential of such approaches becomes apparent for high dimensional problems.
In the case analysed here, it turns out that the ELBO loss decreases monotonically and converges to an asymptotic value after a certain number of iterations. This is shown in Figure 7. As a result, in this particular case the stopping criterion can be rather simple, i.e. a heuristic that observes the relative change of the loss function values.
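Such a heuristic can be expressed in a few lines; the tolerance and window size below are assumptions chosen for illustration.

```python
def should_stop(loss_history, rel_tol=1e-4, window=50):
    """Stop when the relative change of the averaged ELBO loss over the last
    `window` iterations falls below `rel_tol`."""
    if len(loss_history) < 2 * window:
        return False
    recent = sum(loss_history[-window:]) / window
    previous = sum(loss_history[-2 * window:-window]) / window
    return abs(previous - recent) / abs(previous) < rel_tol
```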

Classification accuracy
Although the overall goal is not classification of the unemployed persons, viewing the results from a classification standpoint offers an easy way of quantifying the performance of the approach. A simple assessment of the model accuracy can be achieved by analysing the distribution of the survival probabilities of the whole test population at a certain time. In Figure 8, the values of the survival function S(t) at t = 180 days from entry are shown separately for the job seekers that exit the records in under 180 days and for the ones that stay unemployed longer. This value of t = 180 days is chosen because more than 180 days of unemployment has a significant negative influence on a job seeker's capability of finding a job [11].
Therefore, this result can be treated as an indicator of a job seeker's capability of finding a job.
The classification accuracy depends on the selected threshold value. The threshold S(t = 180 days) = 0.61 corresponds to the optimal operating point on the ROC curve. At this value, the classification accuracy is 75.6%, whereas the trivial model has 50.5% accuracy. The trivial model is the case when the whole test set is labelled as either employed or unemployed. The two groups are shown in Figure 8. The blue histogram contains persons that exited the PES records prior to 180 days, which we label as the positive outcome, and the orange histogram contains persons that are either still on the records or exited after 180 days.
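The evaluation described above amounts to thresholding the predicted S(t = 180 d) values; a sketch using scikit-learn is given below, with array names as placeholders.

```python
import numpy as np
from sklearn.metrics import accuracy_score, roc_auc_score

# s180:   predicted survival probabilities S(t = 180 d) for the test set
# stayed: 1 if the person stayed on the records longer than 180 days, else 0
def evaluate(s180, stayed, threshold=0.61):
    predicted_stay = (s180 >= threshold).astype(int)
    acc = accuracy_score(stayed, predicted_stay)
    auc = roc_auc_score(stayed, s180)          # threshold-free discrimination measure
    trivial = max(np.mean(stayed), 1.0 - np.mean(stayed))
    return acc, auc, trivial
```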

Survival analysis results
Unlike classification, the estimated survival probability over time t offers a better insight. The set of job seekers remaining in the PES records for over 180 days is investigated in more detail. Table 3 lists the most common outcomes for these job seekers, divided into three groups based on the values of S(t = 180 d).
We are particularly interested in the ones for whom the model predicted a survival probability below the threshold, S(t = 180 d) < 0.61, and in their outcomes. The job seekers that exit the records through employment are divided further based on the time of the event.
The rest tend to end up as not active job seekers, meaning they quit fulfilling their obligations toward PES. Most of the remainder are employed by PES in public service, and a significant fraction of the rest go on maternity leave. The false positives, i.e. the job seekers to whom the model assigns a low survival probability S(t = 180 days) but who remained for more than 180 days, deserve further attention, as they may not get the necessary PES support if the model is relied upon. In Table 3, we split them into two subgroups based on S(t = 180 d) and compare them to the true negatives, for whom the model correctly predicts over 180 days to exit. Let us term employment within a further 60 days, maternity leave, and having a less common outcome (lumped as "other") a "good" outcome. An underestimate of the date of exit by under 60 days and the failure to take maternity leave into account are inconsequential for PES and hopefully for the job seeker. The "other" outcomes are of different desirabilities for the job seeker, ranging from enrolling into further education up to imprisonment, but they tend to be independent of PES actions. Just like the people employed within a further 60 days and the ones on maternity leave, the "other" people are neither failed by PES nor a burden on PES. Therefore, assigning a low survival probability to such a job seeker is not likely to cause much harm.
The outcome is "good" for 30% of the false positives, rising to 47% with the model-predicted S(t = 180 d) < 0.4. In contrast, only 13.2% of the true negatives achieve a "good" outcome. It seems the type II error is more likely for the job seekers for whom it is less damaging, which is fortunate.
The fraction of the long-term job seekers of over 180 days that lose their status and become "not active" is the highest in the true negative S(t = 180 d) ≥ 0.61 group. For this group, losing the status is the most likely outcome. They lose the status by not fulfilling their obligations toward PES. It can be inferred that part of the group are the job seekers that PES actions did not help. Inexplicably from the data, the outcome is very common for job seekers around 60 years of age, indicating that there may be regulatory actions making the status less appealing to them. Detailed age profile results are shown in Appendix A.
The second most frequent outcome for the true negatives is getting employed after 240 days, thus using PES resources and suffering the consequences of unemployment for a longer time. A lot of the remainder get employed in public service, meaning that they stay in a contractual relationship with PES. The model-predicted S(t = 180 d) is strongly correlated with the probability of the job seeker serving in public service.
This is not surprising as public service is only available to the long-term unemployed and there may be causal connections between the model inputs and the availability of public service to the job seeker.

Discussion on the accuracy
Two of the goals of PES are to get as many of the job seekers from their records employed as quickly as possible and to lower the number of long-term (over 1 year) unemployed [68]. The significance of long-term unemployment is that it has various negative consequences, some of which decrease the chance of getting employed [11,69]. The presented model could help PES better estimate which job seekers need more assistance, potentially improving their service.
However, all of the reported accuracy results for implemented systems addressing unemployment records focus solely on classification. As a result, not many such systems have become operational, and others were disbanded due to fears of discrimination arising from the classification process. Most recently this happened with the Austrian AMAS system [70].
Due to the lack of proper profiling model results, the only comparable systems are those that perform classification. Even so, comparing the obtained 76% classification accuracy with other published results has proven challenging. Due to data privacy regulations, it is not feasible to get access to the data sets from the various PES registries used in the literature. The algorithms are not published either, so it is not possible to apply them to the data that is available to us. Model comparison through applying different models to the same data is therefore not possible. We thus compare the reported results with the results of our model. One has to assume that the data sets of various PES organisations are of comparable quality and information content, and to neglect the differences between labour markets, in order to compare various modelling approaches this way.
Results of several systems implemented in PES offices are reported by Scoppetta and Buckenleib [71]. Table 4 shows a list of reported models comparable to the presented one, their underlying modelling methods and the resulting accuracies. In most published cases, the accuracy of the model in predicting the correct probability of exit over time is close to 70%.
It should be noted that in many cases the reported accuracy listed in Table 4 refers to a particular subset of data, limited by gender, location, education, etc. Furthermore, there is no information on the balance of the test data, and it is easier to achieve a certain accuracy with a less balanced data set. The proposed VB model is tested on the whole population of job seekers, the data set is balanced (50.5% remained unemployed for longer than 180 days, and 49.5% were employed before that threshold), and the achieved accuracy is above the average reported in the literature. It can therefore be concluded that the model works well. The source code is published and the curves of survival probability for each classification outcome are reported in Figure 8. It will thus be possible to compare future models with the presented VB algorithm.

Conclusion
The study successfully introduces advanced survival analysis methods to the question of unemployment. The proposed survival model predicts the time until exit from the job seeker records. As the most common reason for exiting the records is employment, the model provides time-resolved information on the probability of employment. This information is useful both for the job seekers and for the operation of the public employment service (PES). The model can assist the PES employment counsellors in recognising the job seekers that do not need PES resources as they will get employed soon regardless of the interventions.
It thus enables the counsellors to focus their attention on the job seekers that need help.
Data analysis has been used to make predictions of similar kinds before. However, the performance of the proposed model cannot be accurately compared with the historical ones because the published results are incomplete. Based on the available data and the theoretical soundness, we believe that the proposed model performs relatively well.
The resulting model can be used in gamification scenarios, allowing both the counsellors and the job seekers to explore various strategies for improving the job prospects, i.e. for reducing the survival probability. A limitation of the study is that it only explores the survival function and not the type of event that results in exiting the unemployment records. Most exits are of a desirable kind, but some are not.
Future research that includes the differences between types of exit and focuses on increasing the likelihood of desirable ones would be highly beneficial.

Ethics statement
The data for this analysis was provided by the public employment service in Slovenia in the framework of the EU H2020 HECAT project. All records were anonymised according to the General Data Protection Regulation of the European Union [78] and respecting all national privacy laws.

Appendix A. Age distribution for inactive persons
The age distribution of unemployed persons that were identified as inactive by the PES office is shown in Figure A.9. It is clear that the majority of the cases identified as not active job seekers are persons older than 50 years. There is no such significant skewness in the age distribution for persons with lower estimated survival probability.