Regularized Bayesian transfer learning for population-level etiological distributions

Summary Computer-coded verbal autopsy (CCVA) algorithms predict cause of death from high-dimensional family questionnaire data (verbal autopsy) of a deceased individual, which are then aggregated to generate national and regional estimates of cause-specific mortality fractions. These estimates may be inaccurate if CCVA is trained on non-local training data different from the local population of interest. This problem is a special case of transfer learning, i.e., improving classification within a target domain (e.g., a particular population) using a classifier trained in a source domain. Most transfer learning approaches concern individual-level (e.g., a person's) classification. Social and health scientists such as epidemiologists are often more interested in understanding etiological distributions at the population level. The sample sizes of their data sets are typically orders of magnitude smaller than those used for common transfer learning applications like image classification, document identification, etc. We present a parsimonious hierarchical Bayesian transfer learning framework to directly estimate population-level class probabilities in a target domain, using any baseline classifier trained on the source domain and a small labeled target-domain dataset. To address small sample sizes, we introduce a novel shrinkage prior for the transfer error rates guaranteeing that, in the absence of any labeled target-domain data or when the baseline classifier is perfectly accurate, our transfer learning agrees with direct aggregation of predictions from the baseline classifier, thereby subsuming the default practice as a special case. We then extend our approach to use an ensemble of baseline classifiers to produce a unified estimate. Theoretical and empirical results demonstrate how the ensemble model favors the most accurate baseline classifier. We present data analyses demonstrating the utility of our approach.


INTRODUCTION
Verbal autopsy, a survey of the household members of a deceased individual, acts as a surrogate for a medical autopsy report in many countries. Computer-coded verbal autopsy (CCVA) algorithms are high-dimensional classifiers that predict cause of death (COD) from these high-dimensional family questionnaires; the predictions are then aggregated to generate national and regional estimates of cause-specific mortality fractions (CSMF). These estimates may be inaccurate because CCVA algorithms are usually trained using non-local information not representative of the local population of interest. This problem is a special case of transfer learning, a burgeoning area in statistics and machine learning.
Classifiers trained on source-domain data tend to predict inaccurately in a target domain that differs from the source domain in terms of the marginal and conditional distributions of the features (covariates) and labels (responses) (Shimodaira, 2000). Various domain adaptation strategies have been explored for transfer learning of generic classifiers which adjust for this distributional difference between the two domains. We refer the readers to Weiss and others (2016) and Pan and Yang (2010) for a comprehensive review of transfer learning for classification problems. We focus on the setting where there is abundant labeled source-domain data, abundant unlabeled target-domain data, and limited labeled target data. Transfer learning approaches pertaining to this setting include multi-source-domain adaptation (CP-MDA, Chattopadhyay and others, 2012), neural networks (TCNN, Oquab and others, 2014), adaptive boosting (TrAdaBoost, Dai and others, 2007; Yao and Doretto, 2010), the feature augmentation method (FAM, Daumé III, 2009), and spectral feature alignment (SFA, Pan and others, 2010), among others.
All of the aforementioned transfer learning classification approaches are motivated by applications in image, video, or document classification, text sentiment identification, and natural language processing, where individual classification is the goal. Hence, they usually focus on the individual's (e.g., a person's or an image's) classification within a target domain (e.g., a particular population), with training performed on data from a different source domain.
Social and health scientists such as epidemiologists are often more interested in understanding etiological distributions at the population level rather than classifying individuals. For example, we aim to produce national and regional estimates of cause-specific fractions of child mortality. Hence, our goal is not individual prediction but rather transfer learning of population-level class probabilities in the target domain. None of the current transfer learning approaches are designed to directly estimate population-level class membership probabilities.
Additionally, the extant transfer learning approaches rely on large source-domain databases of millions of observations for training the richly parameterized algorithms. The sample sizes of datasets in epidemiology are typically orders of magnitude smaller. Most epidemiological applications use field data from surveys, leading to databases with much smaller sample sizes and yet with high-dimensional covariates (survey records). For example, in our application, the covariate space is high-dimensional (∼200-350 covariates), the "abundant" source-domain data have around ∼2000 samples, while the local labeled data can have as few as 20-100 samples. Clearly, in such cases, the local labeled data are too small to train a classifier on a high-dimensional set of covariates, as the resulting estimates will be highly variable. A baseline classifier trained on the larger source-domain data will tend to produce more stable estimates, but this precision comes at the cost of accuracy if the source and target domains differ substantially.
Our parsimonious solution to this bias-variance trade-off problem is to use the baseline classifier trained on source-domain information to obtain an initial prediction of target-domain class probabilities, but then refine it with the labeled target-domain data. We proffer a hierarchical Bayesian framework that unifies these two steps. With C classes and S-dimensional covariates, the advantage of this new approach is that the small labeled data for the target domain is only used to estimate the C × C confusion matrix of the transfer error (misclassification) rates instead of trying to estimate O(SC) parameters of the classifier directly from the target-domain data. Since S ≫ C, this approach considerably reduces the dimensionality of the problem. To ensure a stable estimation of the confusion matrix, we additionally use a regularization prior that shrinks the matrix towards the identity unless there is substantial transfer error. We show that, in the absence of any target-domain labeled data or in case of zero transfer error, posterior means of class probability estimates from our approach coincide with those from the baseline learner, establishing that the naive estimation that ignores transfer error is a special case of our algorithm. We devise a novel, fast Gibbs sampler with augmented data for our Bayesian hierarchical model.
We then extend our approach to one that uses an ensemble of input predictions from multiple classifiers. The ensemble model accomplishes method-averaging over different classifiers to reduce the risk of using one method that is inferior to others in a particular study. We establish a theoretical result that the class probability estimates from the ensemble model coincide with those from a classifier with zero transfer error. A Gibbs sampler for the ensemble model is also developed, as well as a computationally lighter version of the model that is much faster and involves fewer parameters. Simulation and data analyses demonstrate how the ensemble sampler consistently produces estimates similar to those produced by using our transfer learning on the single best classifier.
Our approach is also post hoc, i.e., it only uses pre-trained baseline classifier(s), instead of attempting to retrain the classifier(s) multiple times with different versions of training data. This enables us to use publicly available implementations of these classifier(s) and circumvents iterative training runs of the baseline classifier(s), which can be time-consuming and inconvenient in epidemiological settings where data collection continues for many years and the class probabilities need to be updated continually with the addition of every new survey record. The post hoc approach also ensures we can work with non-statistical classifiers that do not use training data but rather some other form of source-domain information (e.g., the CCVA algorithms InterVA and EAVA).
The rest of the manuscript is organized as follows. We present the motivating application in Section 1.1. In Sections 2 and 3, we present the methodology and its extension to the ensemble case. Section 4 considers the extension where class probabilities can be modeled as a function of a few covariates like age, sex, season, spatial region, etc. Section 5 presents simulation results. Section 6 returns to the motivating dataset and uses our transfer learning model to estimate national CSMFs for child deaths in India and Tanzania. We end the manuscript in Section 7 with a discussion of limitations and future research opportunities. A glossary of the different abbreviations used throughout the text is provided in Table 1.

Motivating dataset
In low- and middle-income countries, it is infeasible to conduct full autopsies for the majority of deaths due to economic and infrastructural constraints, and/or religious or cultural prohibitions against autopsies (AbouZahr and others, 2015; Allotey and others, 2015). An alternative method to infer the COD (or "etiology") is to conduct a verbal autopsy (VA)-a systematic interview of the relatives of the deceased individual-to obtain information about symptoms observed prior to death (Soleman and others, 2006). Statisticians have developed several specialized classifiers that predict COD using the high-dimensional VA records as input. Examples include Tariff (James and others, 2011; Serina and others, 2015), InterVA (Byass and others, 2012), InSilicoVA (McCormick and others, 2016), the King and Lu method (King and others, 2008), and EAVA or expert algorithm (Kalter and others, 2015). Implementations of many of these algorithms are publicly available, e.g., Tariff (Li and others, 2018c), InSilicoVA (Li and others, 2018a), and InterVA (Thomas and others, 2018), and the openVA R-package (Li and others, 2018b) has consolidated most of these individual software packages into a single package. Generic classifiers like random forests (Breiman, 2001), naive Bayes classifiers (Minsky, 1961), and support vector machines (Cortes and Vapnik, 1995) have also been used (Flaxman and others, 2011; Miasnikof and others, 2015; Koopman and others, 2015) for classifying verbal autopsies. Predicted COD labels for each VA record in a nationally representative VA database are aggregated to obtain the national CSMF-the population-level class membership probabilities that are often the main quantities of interest for epidemiologists, local governments, and global health organizations. Formally, a CCVA algorithm is simply a classifier using the S × 1 covariate vector (VA report) s to predict c, one of C possible COD categories.
Owing to the high dimensionality of the covariate space (a VA record consists of responses to 200-350 questions), learning this mapping P(c | s) requires a substantial amount of gold standard (labeled) training data. Usually in the country of interest, VA records are available for a large representative subset of the entire population, but the gold standard cause of death (GS-COD) is ascertained for only a very small fraction of these deaths. In other words, there is abundant unlabeled data but extremely limited labeled data in the target domain. The ongoing Countrywide Mortality Surveillance for Action (COMSA) project in Mozambique typifies this circumstance: in addition to conducting a nationally representative VA survey, researchers will have access to the gold standard COD for a small number of deaths from one or two local hospitals using minimally invasive autopsies (MIA) (Byass, 2016). Budgetary constraints and socio-cultural factors unfortunately imply that only a handful of deaths can eventually be autopsied (up to a few hundred).
Lack of sufficient labeled target-domain data implies that CCVA classifiers need to be trained on non-local data like the publicly available Population Health Metrics Research Consortium (PHMRC) Gold Standard VA database (Murray and others, 2011a), which has more than 10 000 paired physician and VA assessments of cause of death across four countries. However, there exists considerable skepticism about the utility of CCVA trained on non-local data (McCormick and others, 2016; Flaxman and others, 2018). To illustrate the issue, in Figure 1, we plot the confusion matrices between the true COD of the PHMRC child cases in Tanzania against the predicted COD for these cases using two CCVA algorithms, Tariff and InSilicoVA, both trained on all PHMRC child data non-local to Tanzania. Both matrices reveal very large transfer errors, some as high as 60%, indicating that the naive estimates of population-level class probabilities from CCVA classifiers trained on non-local source data are likely to be inaccurate, thereby highlighting the need for transfer learning in this application. Additionally, as in any other application area, there exists considerable disagreement about which CCVA algorithm is the most accurate (Leitao and others, 2014; McCormick and others, 2016; Flaxman and others, 2018). In our experience, no method is universally superior, and a robust ensemble transfer learning approach guarding against the use of inaccurate classifiers is desirable.

Naive approach
Let p = (p_1, p_2, . . . , p_C)′ denote the true population-level class probabilities in a target domain where we have abundant unlabeled covariate data, which we denote by U, and a very small labeled data set L of paired labels G and covariates s. Our estimand is p_i = P(G = i), where G denotes the true (gold standard) class membership. For a case r with covariates s_r, let A_r = A(s_r | G) denote the predicted class membership, based on s_r, from the baseline classification algorithm A trained on some large "gold-standard" information G. Our methodology is agnostic to the type of G: it can be either a labeled data set from the source domain or a set of context-specific information (e.g., medical guidelines for determining COD) used to construct the classifier. If we do not use any transfer learning, the naive estimate of p from A is given by q̂ = (q̂_1, . . . , q̂_C)′ with q̂_i = v_i / N, where v_i is the number of observations in U classified by A into category i, and N = Σ_i v_i is the sample size of U. If U is large enough to be representative of the target population, it is clear that E(q̂_j) = P(A = j) =: q_j, i.e., q̂ is the method-of-moments estimator of q.
Unless the algorithm A trained on source-domain information perfectly agrees with the true membership assignment mechanism G in the target domain, there is no reason to consider q or q̂ to be a good estimate of p. More realistically, accuracy depends on how similarly the algorithm A behaves in the source and target domains. Hence, more generally, we can think of q as the expected population class probabilities in the target domain that would be predicted by A(· | G).
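As a concrete illustration, the naive aggregation step can be sketched in a few lines of numpy; the predicted labels below are simulated stand-ins, not output from any real CCVA algorithm:

```python
import numpy as np

# Hypothetical predicted class labels A_r for N unlabeled target-domain
# cases, coded 0..C-1 (standing in for the paper's categories 1..C).
C = 3
rng = np.random.default_rng(0)
a = rng.integers(0, C, size=2000)

# Naive estimate: q_hat_i = v_i / N, where v_i counts predictions in class i.
v = np.bincount(a, minlength=C)
q_hat = v / v.sum()

# q_hat is a valid probability vector over the C classes.
assert np.isclose(q_hat.sum(), 1.0)
```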
Marginally, both G and A are categorical variables. So, we can write p_i = P(G = i) and q_j = P(A = j). To infer about G from A, we need to model their joint distribution. We express q_j = Σ_{i=1}^C m_ij p_i, where m_ij = P(A = j | G = i). In matrix notation, we have q = M′p, where M = (m_ij) is a transition matrix (i.e., M1 = 1) which we refer to as the confusion matrix. First note that, if M = I, then p = q, and hence this subsumes the case where class probabilities from the baseline algorithm are trusted as reliable surrogates of the true class probabilities.
For transfer learning to improve estimation of p, instead of assuming M = I, we can opt to use the more general relationship q = M′p and estimate the transfer error rates m_ij from L. Let n = Σ_{i=1}^C n_i denote the sample size of L, with n_i denoting the number of objects belonging to class i. Also let T = (t_ij), where t_ij is the number of objects in L with true class i classified into class j by A, denote the transfer error matrix for algorithm A. Like many transfer learning algorithms, exploiting the transfer errors is key to our strategy. It is clear that t_ij / n_i is a method-of-moments estimator of m_ij.
We can use these estimates of the m_ij, along with the earlier estimate of q, to obtain a substantially improved estimate of p. Formally, we can specify this via the hierarchical model

A_r | M, p ~ Multinomial(1, M′p), r = 1, 2, . . . , N,
T_i* | M ~ Multinomial(n_i, M_i*), i = 1, 2, . . . , C,     (2.3)

where, for any matrix M, M_i* and M_*j denote its ith row and jth column, respectively. The top row of (2.3) represents the relationship q = M′p and yields the method-of-moments estimators q̂ = v/N and M̂ = (t_ij / n_i). To estimate p, one can adopt a modular two-step approach: first compute q̂ and M̂ separately, and then obtain p̂ = arg min_p L(q̂, M̂′p), where L is some loss function like the squared-error or, more appropriately, the Kullback–Leibler divergence between the probability vectors. This approach fails to propagate the uncertainty in the estimation of M into the final estimate of p. Benefits of a one-stage approach over a two-stage one have been demonstrated in recent work in transfer learning (Long and others, 2014). Alternatively, one can estimate the joint MLE of M and p from (2.3).
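The modular two-step estimate can be sketched as follows. The counts T and v are hypothetical, and the clip-and-renormalize step is a crude stand-in for a proper constrained or KL-divergence-based fit:

```python
import numpy as np

# Hypothetical transfer-error counts from L (rows: true class, cols: predicted)
# and predicted-class counts from U.
T = np.array([[40, 5, 5],
              [4, 30, 6],
              [2, 3, 25]])
v = np.array([800, 700, 500])

# Step 1: method-of-moments estimators M_hat (rows t_ij / n_i) and q_hat = v/N.
M_hat = T / T.sum(axis=1, keepdims=True)
q_hat = v / v.sum()

# Step 2: solve q_hat = M_hat' p for p, then crudely project to the simplex.
p_hat = np.linalg.solve(M_hat.T, q_hat)
p_hat = np.clip(p_hat, 0, None)
p_hat /= p_hat.sum()
```

Note that this plug-in estimate carries no measure of the (large) uncertainty in M̂, which is precisely the motivation for the one-stage hierarchical model.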
The advantage of this simple transfer learning method is that it circumvents the need to improve the individual predictions of A and directly calibrates the population-level class probabilities p, which are the quantities of interest here. We deliberately model the distribution of (A, G) marginalized over the covariates s, allowing us to use this in situations where the covariates are discarded after prediction or cannot be provided due to privacy concerns, and only the predictions A_r are available. Even if the covariates were available, trying to use the extremely small L to estimate the P(s | G) relationship is not advisable. Instead, we efficiently exploit the small local training data L to reduce cross-domain bias by learning about the confusion matrix P(A | G). This involves estimation of only C(C − 1) parameters as opposed to the O(SC) parameters needed to estimate P(s | G). For verbal autopsy data, S is typically around 250, while we can choose C to be small, focusing on the top 3-5 causes. Hence, our approach achieves considerable dimension reduction by switching from the original covariate space to the predicted class space.
In (2.3) above, q = M′p can be estimated precisely because N is large. However, M has C × (C − 1) free parameters, so that if there are many classes, the estimates of the m_ij will have large variances owing to the small size of L. Furthermore, in epidemiological studies, data collection often spans a few years; in the early stages, L may only have a very small sample size, resulting in an extremely imprecise estimate of M, even if we group the classes into a handful of larger classes. This can lead to an imprecise estimate of p based on estimates of q and M, as the latter will have high variance. Consequently, in the next section we propose a regularized approach that stabilizes the transfer learning.

Bayesian regularized approach
We champion the idea that in the absence of enough target-domain labeled data, our method should shrink towards the default method familiar to practitioners. If L were not available, i.e., there were no labeled data in the target domain, we would only have U and G. Then in many applications (including verbal autopsies) the current practice is to train A once using G and then predict on U to obtain q̂ as the estimate for p. This is equivalent to setting p = q and M = I, i.e., assuming that the algorithm A perfectly classifies in the target domain even when trained only using source-domain information G. Extending this argument, when L is very small, direct estimates of M would be unstable and we can rely more on this default practice. Hence, for small L, we will shrink p towards q̂, i.e., we shrink towards the default assumption that the baseline learner is accurate. This is equivalent to shrinking the estimate of M towards I. The simplest way to achieve this is by using the regularized estimate

M̂(λ) = λ I + (1 − λ) M̂.

In epidemiological applications, as data will often come in batches over a period spanning a few years, one needs to rerun the transfer learning procedure periodically to update the class probabilities. In the beginning, when L is extremely small, it is expected that more regularization is required. Eventually, when L becomes large, we can rely on the direct estimate M̂. Hence, λ should be a function of the size n of L, with λ = 1 for n = 0 and λ ≈ 0 for large n. Furthermore, at intermediate stages, since the distribution of true class memberships in L will be non-uniform across the classes, we will have a disparity in the sample sizes n_i for estimating the different rows of M. Consequently, it makes more sense to regularize each row of M separately instead of using a single λ. A more flexible regularized estimate is given by

M̂_i*(λ_i) = λ_i I_i* + (1 − λ_i) M̂_i*.

The row-specific weights λ_i should be chosen such that λ_i = 1 when n_i = Σ_{j=1}^C t_ij = 0, and λ_i ≈ 0 when n_i is large.
One choice is given by λ_i = γ_i / (γ_i + n_i) for some γ_i > 0. We now propose a hierarchical Bayesian formulation that accomplishes this regularized estimation of any transition matrix M. We consider a Dirichlet prior M_i* ~ind Dirichlet(γ_i(I_i* + ε1′)) for the rows of M. We first offer some heuristics explaining the choice of this prior. We will have

E(M_i* | T, γ_i) = (γ_i(I_i* + ε1′) + T_i*) / (γ_i(1 + Cε) + n_i).

Hence, using a small enough ε, the Bayes estimator (posterior mean) for M becomes equivalent to the desired shrinkage estimator M̂_i*(λ_i) proposed above. When n = 0, the Bayes estimate E(M | T, γ = (γ_1, γ_2, . . . , γ_C)′) ≈ I, and for large n, E(M | T, γ) approaches the method-of-moments estimator M̂. Hence, the Dirichlet prior ensures that in data-scarce settings, M is shrunk towards I and consequently p towards q̂.
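A minimal numpy sketch of the row-wise shrinkage estimator, taking λ_i = γ/(γ + n_i) as an assumed choice of weight that mirrors the Dirichlet-prior heuristic (λ_i = 1 when n_i = 0, λ_i → 0 as n_i grows):

```python
import numpy as np

def shrink_rows(T, gamma=1.0):
    """Shrink each row of the method-of-moments confusion-matrix estimate
    toward the corresponding row of the identity matrix.

    T: C x C transfer-error counts (rows: true class, cols: predicted).
    gamma: assumed shrinkage strength; lambda_i = gamma / (gamma + n_i).
    """
    T = np.asarray(T, dtype=float)
    C = T.shape[0]
    n = T.sum(axis=1)
    M = np.empty((C, C))
    for i in range(C):
        lam = gamma / (gamma + n[i])
        row = T[i] / n[i] if n[i] > 0 else np.zeros(C)
        M[i] = lam * np.eye(C)[i] + (1 - lam) * row
    return M

# With no labeled data (all counts zero), the estimate is exactly the identity.
M0 = shrink_rows(np.zeros((3, 3)))
assert np.allclose(M0, np.eye(3))
```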
In Theorem 2.1, we will present a more formal result that looks at the properties of the marginal posterior of p.
To complete the hierarchical formulation, we augment (2.3) with the priors:

p ~ Dirichlet(δ1), M_i* ~ind Dirichlet(γ_i(I_i* + ε1′)), γ_i ~iid π(γ), i = 1, . . . , C,     (2.4)

where π is a prior on (0, ∞) for the concentration parameters γ_i. In practice, we need to use a small ε > 0 to ensure a proper posterior for M when any off-diagonal entries of T are zero, which is very likely due to the limited size of L. Note that our model only uses the data from L to estimate the conditional probabilities P(A | G). We do not model the marginal distribution of A or G in L like we do for U. This is because data for the labeled set are often collected under controlled settings, and the marginal distribution of G (and hence of A) for the samples in L is not representative of their true target-domain marginal distributions. Hence, we only use L to estimate the conditional probabilities M.
Our previous heuristic arguments, illustrating the shrinkage estimation of M induced by the Dirichlet prior, are limited to the estimation of M from L as an independent piece and disregard the data and model for U, i.e., the first row of (2.3). In a hierarchical setup, however, the models for U and L contribute jointly to the estimation of M and p. We now state a more general result showing that, for our full hierarchical model specified through (2.3) and (2.4), when there is no labeled target-domain data or the algorithm A demonstrates perfect accuracy (zero transfer error) on L, the marginal posterior estimate of p from our model coincides with the baseline estimate q̂. Before stating the result, first note that the likelihood for a = (A_1, A_2, . . . , A_N)′ can be represented using the sufficient statistics v = (v_1, . . . , v_C)′.

THEOREM 2.1 If T is a diagonal matrix, i.e., either there is no L, or A classifies perfectly on L, then lim_{ε→0} E(p | v, T) = q̂.

The proofs of the theorems are provided in Section S4 of the Supplementary Materials available at Biostatistics online. Note that Theorem 2.1 is a result about the posterior of our quantity of interest p, marginalizing out the other parameters M and the γ_i from the hierarchical model specified through (2.3) and (2.4). We also highlight that this is not an asymptotic result and holds true for any sample size as long as we take the limit ε → 0. This is important as our manuscript pertains to epidemiological applications where the sample size of L will be extremely small.
Theorem 2.1 also does not require any assumption about the underlying data generation scheme and is simply a desirable property of our transfer learning model. If there is no labeled target-domain data, then we recover the same estimate as the method currently used by practitioners, i.e., we trust A trained on a source domain and only use the target-domain marginal distribution of s from U to arrive at the estimate q̂ of p. Similarly, in the best-case scenario, when A is absolutely accurate for the target domain, Theorem 2.1 guarantees that our model automatically recognizes this accuracy and does not modify the baseline estimate q̂ from A. This shrinkage towards the default method familiar to practitioners, in the absence of enough evidence, is a desirable property. The result of Theorem 2.1 is confirmed in simulations in Section 5.
Although Theorem 2.1 is assumption-free, it only concerns the performance of the model when there is no L or when A is perfect on L. While this is a good sanity check for our model, realistically we will have a small L on which A will be inaccurate. In such cases, the performance of our model will of course depend on the data generation process. Hence, we summarize the data generation assumptions that drive the model formulation. Since there are no labels in U, we need to assume some commonality between L and U in order for the labeled data in L to be useful for estimating the CSMFs in U. Hence, the model assumes that the conditional distribution of A | G (i.e., the M matrix) is the same in U and L. This is a transportability assumption that the error rates on the validation set L reflect the true error rates in the population U. These rates are then used to correct for bias in the estimates of p. This process is thus analogous to applying measurement error correction to a study by assuming transportability of the measurement error distribution from some validation samples (Carroll and others, 2006). We would like to emphasize that we do not assume that the marginal distributions of the cause G (i.e., the CSMFs), and hence also of the symptoms s, are the same across G, U, and L. Of course, the assumption of the same confusion matrix M for U and L can also be incorrect (all models are wrong). However, the class of models spanned by the use of a general M is a superset of the default approach of using the baseline classifier (i.e., assuming M = I). Also, we can relax the assumption of a constant M between U and L by making the entries of M functions of some covariates. This model and its implementation are discussed in Section 4. This would lead to a substantial increase in parameter dimensionality and is only recommended when L is large.

Gibbs sampler using augmented data
We devise an efficient implementation of the hierarchical transfer learning model using a data augmented Gibbs sampler. The joint posterior density can be expressed as

p(p, M, γ | v, T) ∝ p(v | M, p) p(T | M) p(M | γ) p(γ) p(p).

Let p | · denote the full conditional distribution of p. We use similar notation for the other full conditionals. First note that since p(v | M, p) ∝ Π_j (Σ_i m_ij p_i)^{v_j}, the full conditional densities p | · and M | · do not belong to any standard family of distributions, thereby prohibiting a direct Gibbs sampler. We therefore use a data augmentation scheme enabling a Gibbs sampler with conjugate distributions.
The term (Σ_i m_ij p_i)^{v_j} can be expanded using the multinomial theorem, with each term corresponding to one of the partitions of v_j into C non-negative integers. Equivalently, introducing latent counts b_ij with Σ_i b_ij = v_j (the number of observations in U predicted to be in class j whose latent true class is i), we can write

(Σ_i m_ij p_i)^{v_j} ∝ Σ_{B_*j} (v_j! / Π_i b_ij!) Π_i (m_ij p_i)^{b_ij},

where the proportionality constant depends only on the observed v_j. Using the augmented data matrix B = (b_1, b_2, . . . , b_C) = (b_ij), we can write the complete posterior as

p(p, M, γ, B | v, T) ∝ p(p) p(γ) p(M | γ) p(T | M) Π_j (v_j! / Π_i b_ij!) Π_i (m_ij p_i)^{b_ij}.     (2.5)

The full conditional distributions can be updated as follows (derivations omitted):

B_*j | · ~ Multinomial(v_j, (m_1j p_1, . . . , m_Cj p_C)′ / Σ_i m_ij p_i), j = 1, . . . , C;
p | · ~ Dirichlet(δ1 + B1);
M_i* | · ~ Dirichlet(γ_i(I_i* + ε1′) + T_i* + B_i*), i = 1, . . . , C.

The data augmentation ensures that, except for the C γ_i, which are updated using a Metropolis random walk with log-normal proposals, all the other O(C²) parameters are updated by sampling from standard distributions, leading to an extremely fast and efficient Gibbs sampler.
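The sampler can be sketched compactly in numpy. This is a simplified illustration in which the γ_i are held fixed rather than updated by Metropolis steps, and the hyperparameter values, counts, and number of iterations are all assumptions for the sake of the example:

```python
import numpy as np

def gibbs(v, T, n_iter=2000, burn=500, gamma=1.0, delta=1.0, eps=0.01, seed=1):
    """Data-augmented Gibbs sampler (single classifier), gamma_i held fixed.

    v: length-C predicted-class counts on U; T: C x C transfer-error counts on L.
    Returns the posterior mean of p over the retained draws.
    """
    rng = np.random.default_rng(seed)
    C = len(v)
    p = np.full(C, 1.0 / C)
    M = np.eye(C) * 0.9 + 0.1 / C      # start near the identity
    draws = []
    for it in range(n_iter):
        # 1. Augmentation: split each count v_j across latent true classes i,
        #    with probabilities proportional to m_ij * p_i.
        B = np.zeros((C, C))
        for j in range(C):
            prob = M[:, j] * p
            B[:, j] = rng.multinomial(v[j], prob / prob.sum())
        # 2. Conjugate update of p from Dirichlet(delta*1 + row sums of B).
        p = rng.dirichlet(delta + B.sum(axis=1))
        # 3. Conjugate row-wise update of M, shrunk toward the identity.
        for i in range(C):
            alpha = gamma * (np.eye(C)[i] + eps) + T[i] + B[i]
            M[i] = rng.dirichlet(alpha)
        if it >= burn:
            draws.append(p)
    return np.mean(draws, axis=0)

v = np.array([800, 700, 500])
T = np.diag([20, 15, 10])    # diagonal T: perfect accuracy on L
p_hat = gibbs(v, T)
```

With a diagonal T, the posterior mean of p should sit close to the baseline estimate q̂ = v/N, in line with Theorem 2.1 as ε becomes small.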

ENSEMBLE TRANSFER LEARNING
Let there be K classifiers A^(1), . . . , A^(K) and let a^(k) = (a^(k)_1, a^(k)_2, . . . , a^(k)_N)′ be the predicted class memberships from the kth algorithm for all N observations in U. Let v^(k) denote the vector of counts of predicted class memberships on U using A^(k). We expect variation among the predictions from the different classifiers and consequently among the baseline estimates of population-level class probabilities q̂^(k) = v^(k)/N and their population equivalents q^(k)_j = P(A^(k) = j). Since the true population class probability vector p is unique, following Section 2.1 we can write q^(k) = (M^(k))′p, where M^(k) is now the classifier-specific confusion matrix. The predicted class membership for the rth observation in U by algorithm A^(k), denoted by a^(k)_r, marginally follows a Multinomial(1, q^(k)) distribution. We have K such predictions for the same observation, one for each classifier, and these are expected to be correlated. So, we need to look at the joint distribution of the K C-dimensional multinomial random variables. Since, in its most general form, this would involve O(C^K) parameters, we use a pragmatic simplifying assumption to derive the joint distribution. We assume that a^(1)_r, a^(2)_r, . . . , a^(K)_r are independent conditional on G_r, i.e.,

P(a^(1)_r = j_1, . . . , a^(K)_r = j_K | G_r = i) = Π_{k=1}^K P(a^(k)_r = j_k | G_r = i).     (3.6)

This assumption is unlikely to hold exactly in reality but is a common dimension-reducing assumption used in classification problems. For example, the naive Bayes classifier uses this assumption to jointly model the probability of the covariates given the true class memberships. Similar assumptions are used by InSilicoVA and InterVA to derive the joint distribution of the vector of symptoms s_r. Here, we apply the same assumption not to s_r but to the lower-dimensional prediction vector (a^(1)_r, a^(2)_r, . . . , a^(K)_r)′. Under this assumption, marginal independence of the a^(k)_r will not generally hold. Instead, we will have

w_j := P(a^(1)_r = j_1, a^(2)_r = j_2, . . . , a^(K)_r = j_K) = Σ_{i=1}^C p_i Π_{k=1}^K m^(k)_{i j_k},     (3.7)

where j = (j_1, j_2, . . . , j_K)′ denotes a K × 1 vector index. From the limited labeled data set L in the target domain, the classifier-specific transfer error matrices T^(k) are also available and can be used to estimate the respective confusion matrices M^(k) in the same way M was estimated from T in Section 2.1. To introduce shrinkage in the estimation of the M^(k), as in Section 2.2, we assign Dirichlet priors to the rows of each M^(k).
Let w denote the C^K × 1 vector formed by stacking up all the w_{j_1,j_2,...,j_K} defined in (3.7). The full specification of the ensemble model that incorporates the predictions from all the algorithms is given by:

a_r | w ~ Multinomial(1, w), r = 1, . . . , N;
T^(k)_i* | M^(k) ~ Multinomial(n_i, M^(k)_i*), i = 1, . . . , C, k = 1, . . . , K,     (3.8)

with priors on p, the M^(k), and their concentration parameters as in (2.4). Although w is a C^K × 1 vector, courtesy of the conditional independence assumption (3.6), it is parameterized only through the matrices M^(k) and p as specified in (3.7), and hence involves KC² + C parameters. This ensures that there is adequate data to estimate the enhanced number of parameters for this ensemble method, as for each M^(k) we observe the corresponding transfer error matrix T^(k). The Gibbs sampler for (3.8) is provided in Section S3 of the supplementary material available at Biostatistics online (http://www.biostatistics.oxfordjournals.org). To understand how the different classifiers are given importance based on their transfer errors on L, we present the following result:

THEOREM 3.1 If T^(1) is diagonal with positive diagonal entries, and all entries of T^(k) are ≥ 1 for all k > 1, then lim_{ε→0} E(p | a, T^(1), . . . , T^(K)) = q̂^(1).

Theorem 3.1 reveals that if one of the K algorithms (which we assume to be the first algorithm without loss of generality) produces perfect predictions on L, then the posterior mean estimate of p from the ensemble model coincides with the baseline estimate from that classifier. The perfect agreement assumed in Theorem 3.1 will not occur in practice. However, simulation and data analyses will confirm that the estimate of p from the ensemble model tends to agree with that from the single-classifier model of Section 2.2 applied to the most accurate algorithm. This offers a more efficient way to weight the multiple algorithms, yielding a unified estimate of class probabilities that is more robust to the inclusion of an inaccurate algorithm in the decision making. In comparison, a simple average of the estimated p's from single-classifier transfer learning models for each of the K algorithms would be more adversely affected by inaccurate algorithms.
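The conditional-independence construction in (3.6)-(3.7) can be checked numerically; the confusion matrices and p below are randomly generated stand-ins, not fitted quantities:

```python
import numpy as np
from itertools import product

# Hypothetical p and classifier-specific confusion matrices M^(k)
# (each row of each M^(k) sums to 1).
C, K = 3, 2
rng = np.random.default_rng(2)
p = rng.dirichlet(np.ones(C))
Ms = [rng.dirichlet(np.ones(C), size=C) for _ in range(K)]

# w_j = sum_i p_i * prod_k m^(k)_{i j_k}, for every joint prediction pattern j.
w = {}
for j in product(range(C), repeat=K):
    w[j] = sum(p[i] * np.prod([Ms[k][i, j[k]] for k in range(K)])
               for i in range(C))

# w is a probability distribution over the C^K joint patterns ...
assert np.isclose(sum(w.values()), 1.0)
# ... and its k-th margin recovers q^(k) = (M^(k))' p, as claimed.
q0 = np.array([sum(val for j, val in w.items() if j[0] == c) for c in range(C)])
assert np.allclose(q0, Ms[0].T @ p)
```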

Independent ensemble model
The likelihood for the top row of (3.8) is proportional to ∏_j w_j^{y_j}, where y_j = Σ_{r∈U} I(a_r^(1) = j_1, ..., a_r^(K) = j_K) denotes the total number of observations in U where the predicted class memberships from the K algorithms correspond to the combination j = (j_1, ..., j_K)'. Even though U will be moderately large (a few thousand observations in most epidemiological applications), unless both C and K are very small (C ≤ 5 and K ≤ 3), the y_j's will be zero for most of the C^K possible combinations j. This will in turn affect the estimates of w. For applications to verbal autopsy-based estimation of population CSMFs, there are many CCVA algorithms (as introduced in Section 1), and researchers often want to use all of them in an analysis. We may also be interested in more than 3-5 top causes. In such cases, the extremely sparse C^K vector formed by stacking the y_j's will destabilize the estimation of w. Also, the Gibbs sampler (see Section S3 of the supplementary material available at Biostatistics online) for the joint ensemble model introduces an additional C^K independent multinomial variables of dimension C, thereby accruing substantial computational overhead and entailing long runs of the high-dimensional Markov chain to achieve convergence.
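The sparsity of the y_j counts is easy to verify empirically. The toy sketch below (sizes chosen by us for illustration; the predictions are random rather than from actual CCVA algorithms) tabulates the joint prediction combinations for C = 10 causes, K = 4 classifiers, and N = 2000 unlabeled records:

```python
import numpy as np
from collections import Counter

rng = np.random.default_rng(0)
C, K, N = 10, 4, 2000                # 10 classes, 4 classifiers, 2000 records in U
# hypothetical predicted labels in U from the K classifiers
preds = rng.integers(0, C, size=(N, K))
# y_j: count of each observed combination j = (j_1, ..., j_K)
y = Counter(map(tuple, preds))
n_possible = C ** K                  # C^K possible combinations
frac_empty = 1 - len(y) / n_possible # fraction of combinations never observed
```

Since at most N = 2000 of the C^K = 10 000 cells can be nonzero, at least 80% of the combinations have y_j = 0, illustrating why the joint likelihood becomes unstable as C and K grow.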
In this section, we offer a pragmatic alternative model for ensemble transfer learning that is computationally less demanding. From (3.7), we note that Σ_{j: j_k = j} w_j = Σ_i p_i m^(k)_{ij} by exchanging the order of the summations. Hence, the marginal distribution of a_r^(k) is Multinomial(1, q^(k)), where q^(k) = (M^(k))' p. We model the a_r^(k)'s independently for each k, ignoring the correlation among the predictions in U from the K classifiers:

a_r^(k) ~ ind. Multinomial(1, q^(k)), r = 1, ..., N. (3.10)

We replace the top row of (3.8) with (3.10), keeping the other specifications the same as in (3.8). We call this the independent ensemble model. Note that, while we only use the marginal distributions of the a_r^(k)'s, ignoring their joint dependence, the joint distribution is preserved in the model for the transfer errors on L specified in the second row of (3.8), as all the M^(k)'s are tied to the common truth p through the equations q^(k) = (M^(k))' p. While the total number of parameters for the joint and independent ensemble models remains the same, eliminating the joint model for the C^K combinations of predicted causes from the K algorithms allows the likelihood for (3.10) to decompose as a product of individual likelihoods on U for each of the K classifiers. Additionally, the Gibbs sampler for the independent ensemble model is much simpler and closely resembles the sampler for the single-classifier model in Section 2.3: we only need to introduce K augmentation matrices, one corresponding to each CCVA algorithm, akin to the matrix B introduced in Section 2.3. Observe that the sampler for the independent model uses CK additional parameters, as opposed to the C^K parameters introduced in the joint sampler. This ensures that the dimensionality of the Markov chain does not increase exponentially when predictions from more algorithms are included in the ensemble model. The theoretical result in Theorem 3.1 no longer remains true for the independent model.
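The marginalization q^(k) = (M^(k))' p is a one-line matrix computation. A small numeric check (the confusion matrix and probability vector below are illustrative, not estimates from the paper):

```python
import numpy as np

p = np.array([0.5, 0.3, 0.2])       # true target-domain class probabilities
M = np.array([[0.8, 0.1, 0.1],      # hypothetical confusion matrix:
              [0.2, 0.7, 0.1],      # row i gives P(A = j | G = i)
              [0.1, 0.1, 0.8]])
q = M.T @ p                         # marginal predicted-class probabilities q = M' p
```

Here q = (0.48, 0.28, 0.24)', a valid probability vector: the classifier's marginal predictions over-represent class 1 relative to the truth because of the leakage encoded in the off-diagonal entries of M.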
However, our simulation results in Section S5.5 of the supplementary material available at Biostatistics online (http://www.biostatistics.oxfordjournals.org) show that in practice it continues to put higher weight on the more accurate algorithm and consistently performs similarly to or better than the joint model. In Section S1, we present an EM algorithm approach to obtain maximum a posteriori (MAP) estimates for the model as a fast alternative to the fully Bayesian approach adopted here. In Section S2, an algorithm for generating posterior samples of individual-level class predictions is outlined.

DEMOGRAPHIC COVARIATES AND SPATIAL INFORMATION
The transfer-learning model introduced up to this point focuses on generating population-level estimates of the CSMF p. An important extension for epidemiological applications is to model p as a function of covariates such as geographic region, seasonality, socioeconomic status (SES), sex, and age group, enabling regional and age-sex stratified estimates. In this section, we generalize the model to accommodate covariates. We illustrate with the single-classifier model of Section 2.2; a similar approach extends the ensemble model. Let x_r denote a vector of covariates for the r-th VA record in U. We propose the modifications in (4.11), allowing covariate-specific class distributions p_r = (p_r1, p_r2, ..., p_rC)'. All other components of the original model in (2.3) and (2.4) remain unchanged. The middle row of (4.11) specifies a multi-logistic model for the class probabilities using the covariates. The top row uses the covariate-specific p_r to model the analogous class probabilities q_r = M' p_r as would be predicted by A. Finally, the bottom row specifies Normal priors for the regression coefficients. The switch from a Dirichlet prior for p to the multi-logistic model implies we can no longer directly leverage conjugacy in the Gibbs sampler. Polson and others (2013) proposed a Polya-Gamma data augmentation scheme that allows conjugate sampling for generalized linear models. We now show how our own data augmentation scheme introduced in Section 2.3 harmonizes with the Polya-Gamma sampler to create a streamlined Gibbs sampler.

Gibbs sampler using Polya-Gamma scheme
We will assume there are H unique combinations of covariate values: for example, if there are four geographic regions and three age groups, then H = 12. If we have a continuous covariate, then H = N, where N is the number of subjects sampled in U. Letting h, h = 1, ..., H, index a specific covariate combination x_h, we can again represent the likelihood for a = (A_1, A_2, ..., A_N)' as proportional to ∏_{h,j} q_hj^{v_hj}, where q_h = M' p_h and v_hj is the total number of subjects with covariate combination h that were predicted to have died of cause j. Let β = (β_1, β_2, ..., β_{C−1})'. The terms of the posterior that differ from Section 2.3 are p(V | M, β) and p(β); the sampling step for γ remains exactly the same as previously discussed. We use a data augmentation strategy similar to Section 2.3 and combine it with a Polya-Gamma data augmentation to sample from this posterior distribution. Expanding p(V | M, β) introduces latent counts b_hj as in Section 2.3; let B denote the HC × C matrix formed by stacking all the b_hj's row-wise. The conditional updates for M and B then ensue immediately. For the β_i's, we introduce Polya-Gamma variables ω_hi and define Ω_i = diag({ω_hi}_{h=1}^H) and n_h = Σ_j v_hj. Here PG denotes the Polya-Gamma distribution and c_i = (c_1i, c_2i, ..., c_Hi)'. This completes the steps of a Gibbs sampler where all the parameters except γ are updated via sampling from conjugate distributions. We can transform the posterior samples of β to obtain posterior samples of p_hi. Estimates of the marginal class distribution for the whole country can also be obtained using the relationship p_i = ∫ p_hi dP(h), where an empirical estimate of the covariate distribution P(h) can be obtained from U.
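The deterministic parts of this covariate model, the multi-logistic link from β to p_h and the marginalization p_i = ∫ p_hi dP(h), can be sketched as follows. This is not the Polya-Gamma sampler itself (which requires drawing PG variables); the function names and dimensions are our own illustrative choices:

```python
import numpy as np

def softmax_probs(X, beta):
    """Multi-logistic link: p_h = softmax((x_h' beta_1, ..., x_h' beta_{C-1}, 0)),
    with class C as the reference category.
    X: H x d matrix of covariate combinations; beta: d x (C-1) coefficients."""
    eta = np.column_stack([X @ beta, np.zeros(X.shape[0])])
    e = np.exp(eta - eta.max(axis=1, keepdims=True))  # stabilized softmax
    return e / e.sum(axis=1, keepdims=True)

def marginal_csmf(P_h, counts_h):
    """Marginal class distribution p_i = sum_h p_hi * Phat(h), where Phat(h)
    is the empirical covariate distribution estimated from the counts n_h in U."""
    w = np.asarray(counts_h, dtype=float) / np.sum(counts_h)
    return w @ np.asarray(P_h)
```

With β = 0 the link returns the uniform distribution for every covariate combination, and the marginal CSMF is likewise uniform regardless of the empirical covariate weights.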

Covariate-specific transfer error
Until now, we have assumed that the transition matrix M is independent of the covariates. We can also introduce covariates when modeling the conditional probabilities m_ij, using a similar multi-logistic regression. This model will be particularly useful if there is prior knowledge about covariate-dependent biases in the predictions from a classifier. Letting m_rij denote the conditional probability P(A = j | G = i, x_r), we can model the m_rij via the multi-logistic regression in (4.12). The implementation involves Polya-Gamma samplers for each row of M, in a manner similar to the sampler outlined above (we omit the details). Since the parameters ζ_ij can only be estimated from the limited local data, this approach is feasible only with a very small set of covariates for modeling the transfer error rates.

SIMULATION STUDIES
The PHMRC study, conducted in four countries across six sites, is a benchmark database of paired VA records and GS-COD for children, neonates, and adults. PHMRC data are frequently used to assess the performance of CCVA algorithms. We conduct a set of simulation studies using the PHMRC data (obtained through the openVA package, version 1.0.5) to generate a wide range of plausible scenarios in which the performance of our transfer learning models can be assessed against the popular CCVA algorithms. First, we randomly split the PHMRC child data (2064 samples) into three parts representing G, an initial L, and an initial U, using a 2:1:2 ratio, yielding roughly 800, 400, and 800 samples, respectively. As accurate estimation of the mortality fractions for the most prevalent causes is usually the priority, we restrict our attention to four causes: the three most prevalent causes in the target-domain data (L ∪ U), namely Pneumonia, Diarrhea/Dysentery, and Sepsis, and an Other cause grouping together all the remaining causes.
We wanted to simulate scenarios where both (i) the marginal distribution P(G) of the classes and (ii) the conditional distributions P(A | G) differ between the source and target domains. To ensure the latter, given a confusion matrix M = (m_ij), we want P(A = j | G = i) = m_ij on L ∪ U. We achieve this by discarding the actual labels in L ∪ U and generating new labels such that an algorithm A trained on G shows transfer error rates quantified by M on L ∪ U. Additionally, the new labels need to be assigned in a way that ensures the target-domain class probability vector is p_U, for any choice of p_U different from the source-domain class probabilities p_G.
Note that if the true population class probability vector in the target domain is to be p_U, then q_U, the population class probabilities as predicted by A, is given by q_U = M' p_U. Hence, we first use A trained on G to predict the labels for each case in the initial U. We then resample cases from the initial U to create a final U such that the predicted labels of A have the marginal distribution q_U. Next, from Bayes theorem, α_ij = P(G = i | A = j) = m_ij p_Ui / q_Uj. For cases in U such that A = j, we generate the new "true" labels from Multinomial(1, (α_1j, α_2j, ..., α_Cj)'). This data generation process ensures that for any case in U both G ~ Multinomial(1, p_U) and A | G = i ~ Multinomial(1, M_i*) hold approximately. We repeat the procedure for L, using the same M but a different p_L. This reflects the reality for verbal autopsy data, where the symptom-given-cause dynamics are the same for all deaths L ∪ U in the new country, but the hospital distribution of causes p_L is unlikely to match the population CSMF p_U. When resampling to create the final L, we also vary n, the size of L, as 50, 100, 200, and 400, to represent the varying amounts of local labeled data that will be available at different stages of a project. We consider two choices of A: Tariff and InSilicoVA. To ensure that p_U and p_L are different, we generate pairs of probability vectors (p_L, p_U) from a Dirichlet(1) distribution and divide the cases into three scenarios: low: CSMFA(p_L, p_U) < 0.4; medium: 0.4 < CSMFA(p_L, p_U) < 0.6; and high: CSMFA(p_L, p_U) > 0.6. Here CSMFA denotes the CSMF accuracy, a metric quantifying the distance of a probability vector (p_L) from a reference probability vector (p_U), given by (Murray and others, 2011b): CSMFA(p_L, p_U) = 1 − Σ_i |p_Li − p_Ui| / {2 (1 − min_i p_Ui)}. For each scenario, we generated 100 pairs of p_L and p_U. For each generated dataset, we use all the algorithms listed in Table 2 to predict p_U. For an estimate p̂_U(x) generated by a model x, we assess the performance of x using CSMFA(x) = CSMFA(p̂_U(x), p_U).
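The two computational ingredients of this data generation scheme, the CSMF accuracy metric and the Bayes-theorem relabeling, can be sketched in a few lines. The function names below are ours, and the relabeling step is a simplified per-record version of the resampling described above:

```python
import numpy as np

def csmf_accuracy(p_est, p_true):
    """CSMF accuracy of Murray and others (2011b):
    1 - sum_i |p_est_i - p_true_i| / (2 * (1 - min_i p_true_i))."""
    p_est, p_true = np.asarray(p_est, float), np.asarray(p_true, float)
    return 1 - np.abs(p_est - p_true).sum() / (2 * (1 - p_true.min()))

def resample_true_labels(pred_labels, M, p_target, rng):
    """Given predicted labels whose marginal is q = M' p_target, draw new
    'true' labels from alpha[i, j] = P(G = i | A = j) = m_ij * p_i / q_j."""
    p_target = np.asarray(p_target, float)
    q = M.T @ p_target
    alpha = (M * p_target[:, None]) / q[None, :]   # columns sum to one
    return np.array([rng.choice(len(p_target), p=alpha[:, j])
                     for j in pred_labels])
</n```

A perfect estimate attains CSMFA = 1, and when M is the identity the relabeling leaves the predicted labels unchanged, mirroring the zero-transfer-error special case.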
We present a brief summary of the results here; a much more detailed analysis is provided in Section S5 of the supplementary material (http://www.biostatistics.oxfordjournals.org). Figure 2 presents the CSMFA for all five models for n = 400. The three columns correspond to the three choices of M described above, and in each figure the x-axis, from left to right, marks the low, medium, and high settings.
We observe that in almost all settings the Bayesian transfer learning approach was better than its corresponding baseline, i.e., Tariff BTL was better than Tariff G and InSilicoVA BTL was better than InSilicoVA G . The improvement in CSMFA was most drastic for M 2 (middle column), where it was as much as 0.3 in some cases. Only for M 1 , i.e., when the classifier is assumed to be perfect for predicting in the target domain, do we see Tariff BTL and Tariff G produce similar CSMFA (top-left panel) and InSilicoVA BTL and InSilicoVA G produce similar CSMFA (bottom-left panel). This corroborates Theorem 2.1: transfer learning leaves the estimate unchanged if the classifier has zero transfer error. We also observe that within each figure the CSMFAs generally increase as we go from the low to the high setting, indicating that increased representativeness of the class distribution in the small labeled set L leads to improved performance. Also, across all settings, transfer learning based on the algorithm used to simulate the data performs better, i.e., for the top row Tariff BTL performs better than InSilicoVA BTL , as in this case they respectively correspond to a true and a misspecified model. Similarly, for the bottom row, InSilicoVA BTL performs better than Tariff BTL . However, even under model misspecification, the transfer learning models perform better than their baselines, i.e., even when data are generated using Tariff, InSilicoVA BTL performs better than InSilicoVA G . Finally, across all scenarios, the ensemble learner performs close to the better-performing individual learner, highlighting its utility and robustness.
In Section S5 of the supplementary material available at Biostatistics online (http://www.biostatistics.oxfordjournals.org), we present more insights into the simulation study. Section S5.1 assesses the impact of the disparity in the class distributions between the source and target domains. In Section S5.2, we compare the biases in the estimates of individual class probabilities. Section S5.3 delves into the role of the sample size and quality of the limited labeled set L. Section S5.4 demonstrates the value of the Bayesian shrinkage by comparing with the frequentist transfer learning outlined in Section 2.1. In Section S5.5, we compare the joint and independent ensemble models and demonstrate how they favorably weight the more accurate algorithm. Section S5.6 shows how one can use informed shrinkage, if a practitioner has a priori knowledge of which causes are likely to be misclassified by an algorithm. Finally, in Section S5.7, we compare the individual-level prediction performance of the models using the algorithm outlined in Section S2.

PREDICTING CSMF IN INDIA AND TANZANIA
We evaluate the performance of the baseline CCVA algorithms and our transfer learning approach when predicting the CSMF for under-5 children in India and Tanzania using the PHMRC data with the actual COD labels. We used India and Tanzania, as they were the only countries with sufficiently large sample sizes (N India = 948, N Tanzania = 728). For a given country, we first split the PHMRC child data into subjects from within the country (L and U) and from outside the country (G). We then used weighted sampling to select n (∈ {50, 100, 200}) subjects from within the country of interest to be in L, with weights chosen such that CSMFA(p L , p U ) was low. Figure S8 in Section S6 of the supplementary material available at Biostatistics online (http://www.biostatistics.oxfordjournals.org) shows the difference in the marginal symptom distributions between U and L. All the subjects from the country were put in U. We trained InSilicoVA G and Tariff G using the non-local data G, which were then used to predict the top COD for all subjects in L and U. We classified all causes of death into "External," "Pneumonia," "Diarrhea/Dysentery," "Other Infectious," and "Other." These predictions were then used to estimate the baseline CSMFs and as inputs to our transfer learning models Tariff BTL , InSilicoVA BTL , and Ensemble I . Since the true labels (GS-COD) are available in PHMRC, we calculated the true p U for a country as the empirical proportions of deaths from each cause, based on all the records within the country. This p U was used to calculate the CSMF accuracy of each model. The whole process was repeated 500 times for each combination of country and value of n, ensuring that the results presented are averages over 500 different random samples of L for each country, and not for an arbitrary sample. Figure 3 presents the results of this analysis. The top and bottom rows represent the results for India and Tanzania, respectively.
The four columns correspond to four different choices of n.
There are several notable observations. First, regardless of n, the choice of algorithm A, and the country, the calibrated estimates of prevalence from our transfer learning model performed better than or similar to the analogous baseline CSMFs, i.e., Tariff BTL performed better than Tariff G , and InSilicoVA BTL performed better than InSilicoVA G . Second, the magnitude of improvement for our approach depends on the country and the size of L. Within India, the CSMFAs of Tariff BTL and InSilicoVA BTL are similar to those from Tariff G and InSilicoVA G , respectively. Tariff does better than InSilicoVA for India, with Tariff BTL being the best performer. In Tanzania, the baseline InSilicoVA model InSilicoVA G does better than Tariff G , and similarly InSilicoVA BTL does better than Tariff BTL . The improvement of Tariff BTL and InSilicoVA BTL over Tariff G and InSilicoVA G , respectively, is more prominent than in India, with InSilicoVA BTL being the most accurate. The magnitude of improvement of the three transfer learning approaches also increased with n for Tanzania. In Section S7 of the supplementary material available at Biostatistics online, we investigate the effect of adding more causes of death on the transfer learning CSMFA.

DISCUSSION
Epidemiological studies pose unique challenges to transfer learning, stemming from their focus on estimating population-level quantities as opposed to individual predictions, small sample sizes coupled with high-dimensional covariate spaces (survey records), and the lack of the large training databases available for many other machine learning tasks. Motivated by these settings, we have presented a parsimonious hierarchical model-based approach to transfer learning of population-level class probabilities, using a pre-trained classifier, limited labeled target-domain data, and abundant unlabeled target-domain data.
For the transfer learning approach to work, the labeled data L must be useful for improving CSMF estimation in U, i.e., there has to be some commonality between the distributions in L and U. Usually L will not be representative of the marginal cause distribution of U. If, additionally, it is also not representative of the conditional distributions of s | G (or, equivalently, by marginalizing out s, of A | G), then L is of no use for improving CSMF estimation in U. Hence, our transfer learning is useful when the conditional distributions are the same between L and U (constant M), or share the same functional form between L and U (the regression approach of Section 4.2).
Shrinkage or regularization is at the core of our approach. In data sets with large numbers of variables (dimensions), regularized methods have become ubiquitous. The vast majority of the literature focuses on shrinking estimates (mostly regression coefficients and covariance or precision matrices) towards some known sub-model. We apply the same principle of regularization in a unique way for estimating the population class probabilities. Instead of shrinking towards any underlying assumption about the true population distribution, we shrink towards the baseline estimate. In the absence of sufficient target-domain data, this is the estimate that is commonly used, and this shrinkage principle is desirable for practitioners familiar with this default method. We show how this shrinkage is achieved by shrinking the confusion matrix towards the identity matrix using appropriate Dirichlet priors. This regularized estimation of a confusion matrix (or any transition matrix) can also be applied in other contexts.
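The idea of shrinking a confusion matrix towards the identity with Dirichlet priors can be illustrated with a closed-form posterior-mean sketch. The hyperparameters lam and eps below are illustrative choices of ours, not the paper's defaults, and the conjugate row-wise update is a simplification of the full model:

```python
import numpy as np

def shrinkage_posterior_M(T, lam=10.0, eps=0.1):
    """Posterior mean of each row of a confusion matrix M under a Dirichlet
    prior concentrated on the identity: row i gets Dirichlet(eps + lam * e_i).
    T holds transfer-error counts on L (rows: true class, cols: predicted).
    With conjugate row-wise updating, the posterior mean of row i is
    (T_i + prior_i) / sum(T_i + prior_i)."""
    T = np.asarray(T, dtype=float)
    C = T.shape[0]
    prior = eps + lam * np.eye(C)       # mass concentrated on the diagonal
    post = T + prior
    return post / post.sum(axis=1, keepdims=True)
```

With no labeled target-domain data (T = 0), the estimate collapses to (nearly) the identity matrix, so aggregating the baseline classifier's predictions is recovered as a special case; as transfer-error counts accumulate, the estimate moves towards the empirical row proportions.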
The fully Bayesian implementation is fast, owing to a novel data-augmented Gibbs sampler. The ensemble model ensures robust estimates via data-driven averaging over many classifiers and reduces the risk of selecting a poor one for a particular application. Our simulations demonstrate the value of transfer learning, offering substantially improved accuracy. The PHMRC data analysis makes evident the value of collecting a limited amount of labeled data (GS-COD) in the local population using full or minimally invasive autopsy, alongside the nationwide VA survey; subsequently applying transfer learning improves the CSMF estimates. The results also show how our approach benefits from larger sample sizes of the local labeled set L, and from closer alignment between the marginal class probabilities in L and the true target-domain class probabilities.
For VA data analyses, we note that while we used a labeled data set in the source domain as G, in practice it can be any other form of gold standard information sufficient to train a VA classifier. CCVA methods like Tariff and the approach in King and others (2008) represent a traditional supervised learning approach and need a substantial labeled training data set G. InterVA is a semi-supervised learning approach where G is a standard matrix of letter grades representing the propensity of each symptom given each cause. InSilicoVA generalizes InterVA and endows the problem with a proper probabilistic framework allowing coherent statistical inference. It adapts to the type of G and can work either with the default symptom-cause matrix used in InterVA or by estimating this matrix from labeled training data of paired VA and GS-COD records. Our transfer learning is completely agnostic to the choice of the baseline CCVA algorithm and the form of G it requires; we only need the predictions from a pre-trained algorithm for all observations in L ∪ U.
One important direction forward would be to generalize this approach to more complex COD outcomes. Currently, the COD outcome is viewed as a discrete variable taking values in a set of causes like pneumonia or sepsis. In practice, death is a complex chronological series of events starting from some root cause and ending at the immediate or proximal cause. In addition to understanding the prevalence of causes in the population, another goal of many of the aforementioned programs is to identify medical events occurring before death for which an intervention could prevent or delay mortality. Extending the current setup to hierarchical or tree-structured COD outcomes would be a useful tool to address this aim. Many CCVA algorithms, in addition to predicting the most likely COD, also predict the (posterior) distribution of likely causes. Our current implementation only uses the most likely COD as an input; an extension enabling the use of the full predictive distribution as an input could improve the method. Finally, VA records, containing about 250 questions for thousands of individuals, naturally contain erroneous entries. Preprocessing VA records to eliminate absurd entries and records entails onerous manual effort, and it is challenging to develop quality control models for VA data due to the high dimensionality of the symptoms. Akin to what we did here, one can consider dimension reduction via the predictions of CCVA algorithms for automated statistical quality control.

SOFTWARE
The R package calibratedVA, containing code to obtain estimates of population CSMFs from our transfer learning approach using baseline predictions from any verbal autopsy algorithm, is available at https://github.com/jfiksel/CalibratedVA/. The package also contains code for the ensemble model using outputs from several VA algorithms. A vignette describing how to navigate the package and demonstrating the use of the methodology is provided at https://github.com/jfiksel/CalibratedVA/blob/master/vignettes/CalibratedVA.Rmd. All results in this article can be recreated using the scripts at https://github.com/jfiksel/BayesTLScripts.