Main

In response to an immunological challenge, immune cells act in concert to form a complex and dense cell-signalling network1,2. The single-cell evaluation of intracellular signalling responses is particularly valuable in characterizing this cellular network as it provides a functional assessment of an individual’s immune system. In clinical settings, a deep understanding of functional immune responses not only provides diagnostic opportunities, but is also often the first step in developing immune therapies (recent examples include stratification of COVID-19 disease severity3,4 and successful immune modulation in chronic lymphocytic leukaemia5, neurodegeneration6 and Ebola virus disease7).

Advanced flow cytometry technologies can characterize millions of single cells from a given patient, which enables the identification of signalling pathways, even in rare cell populations8. The recent advent of high-dimensional polychromatic flow cytometry9,10 and mass cytometry11,12 technologies have vastly increased our ability to study the human immune system with unprecedented functional depth by increasing the number of features measured per cell. However, the increased dimensionality relative to the cohort sizes in clinical studies, as well as the inherently complex networks of internal correlations between the measured cell types and pathways present unique computational challenges13. Translating complex and high-dimensional observations, like immunological data, into relevant computational models requires statistically rigorous analysis techniques. Multivariable modelling, in contrast to univariable analysis, can simultaneously consider all measured aspects of the immune system to improve predictions. However, multivariable modelling requires exponentially larger cohort sizes as the number of measurements grows (also know as the ‘curse of dimensionality’14,15,16). This is especially true for more modern deep-learning-based models that provide general function approximation but require substantially larger sample sizes. In practice, increasing the cohort size by several orders of magnitude to power such analyses is often a significant challenge in resource-constrained settings. Moreover, multivariable analyses performed on all available measurements produce large complex models that are difficult to interpret and implement in resource-constrained settings17 and often lack robustness18.

Integration of prior knowledge has been broadly recognized as an effective approach for reducing model complexity and increasing robustness19,20,21,22,23,24,25,26,27,28,29,30,31. In biological sciences, examples of such knowledge integration include inference of biological networks24,32 and causal pathway modelling33,34. In modern immunological datasets, however, integration of prior knowledge has been impractical due to the unstructured format of prior immunological datasets and the complex nature of the measured features. In this work, we propose a framework for integration of prior immunological knowledge into the model optimization process of the Elastic Net35 (EN) algorithm (Fig. 1). Our decision to select the EN as a candidate method was primarily motivated by two factors. First, as a sparse model, the EN is broadly used when the number of features exceeds the number of observations, as is the case with single-cell profiling of the immune system. This sparsification not only improves performance, but also makes models more interpretable, which would prove indispensable for tasks that require accuracy from an interpretable model17 (such as conflict forecasting36, drug discovery37, generalizing findings from animal models to human studies38,39 and translational clinical studies). Second, while other algorithms can potentially be modified to account for prior knowledge, this modification for linear models such as the EN is a natural fit, as demonstrated by the Bayesian interpretation of the EN (Methods). Although other multivariable models have previously been extended to incorporate feature group information21,40,41, none has integrated per-feature prior knowledge as needed in this work. In our immunological Elastic-Net (iEN) framework, the prior knowledge developed by a panel of expert immunologists on a per-feature basis is integrated into the EN algorithm during coefficient optimization (similar to adjusting Bayesian priors; Methods). The addition of knowledge-based immunological priors guides the sparsification process towards solutions more consistent with biological knowledge while still allowing all measured immune features to be included in the exploratory analysis.

Fig. 1: The immunological Elastic-Net analysis pipeline.
figure 1

a,b, Immunological prior knowledge for each feature, in response to each ex vivo stimulation condition, is extracted by a panel of experts (a) and encoded into a prior knowledge tensor to guide the model optimization process (b). c,d, Individuals within the cohort of study (c) provide blood samples, which are subsequently stimulated with ligands ex vivo to activate various signalling pathways of the immune system (d). e, This produces single-cell measurements of the immune system, resulting in a complex network of cell types and signalling pathways representing both innate and adaptive immunity. f,g, This dataset is then fed into the iEN algorithm (f) for predictive modelling of the outcome of interest (g).

In our experiments, the iEN outperformed the standard EN, as well as a broad range of standard machine learning algorithms, including the K-nearest-neighbours42 (KNN), support vector machine43 (SVM), random forest44 (RF) and least absolute shrinkage and selection operator45 (LASSO). Additionally, ‘Super Learner’46 (an ensemble of the aforementioned methods) and two alternative knowledge-integrated algorithms, Know-GRRF25 and graper41, were included in our evaluation. A two-layer repeated 10-fold cross-validation (CV) was used to determine model consistency and establish a clear comparison of model performance among the algorithms. In the first CV layer, the free parameters of the models were optimized (including a factoring controlling the impact of domain knowledge in the case of the iEN) and the second CV layer predicted previously unseen observations. These predictions were then aggregated and evaluated by an appropriate hypothesis test (Pearson or Wilcoxon rank-sum), root-mean-squared error (r.m.s.e.) and area under the receiver operator curve (AUROC) to measure model performance. This process was repeated with randomly generated CV folds to visualize the distribution of results subject to variations in the cohort.

In this Article, we have included two real-world clinical examples as well as a large simulation study. The first analysis, as an example of a continuous clinical outcome, identified components of maternal immune adaptations in a longitudinal term pregnancy (LTP) study, which included an independent validation cohort. The second example was a classification analysis of a categorical outcome, modelling patient and control populations for chronic periodontitis (ChP). The third example used synthetic data generated to replicate mass cytometry measurements to enable in-depth understanding of the iEN behaviour across varying cohort sizes. Additional analyses were performed to determine the effect of prior knowledge on general model behaviour and the stability of results subject to errors in the expert-defined prior knowledge. Each of the three examples was chosen to demonstrate the generalizability of the iEN algorithm, as well as its efficacy, in a range of real-world scenarios.

Knowledge integration

Prior knowledge tensors constructed before the analysis emphasized receptor-specific signalling responses describing canonical pathways activated downstream of the ex vivo stimulation conditions used in the mass cytometry assays. The biological priors used in the iEN model were created by an independent panel of immunologists, such that features more consistent with known biology have higher values (Supplementary Tables 1 and 2). For example, panel members broadly agreed on the prioritization of the phosphorylation of signal transducer and activator of transcription 1 (STAT1), STAT3 and STAT5 in all adaptive and innate immune cells in response to interferon-α (IFN-α) stimulation47,48; the phosphorylation of STAT1, STAT3, STAT5 and extracellular signal-regulated kinase 1/2 (ERK1/2) mitogen-activated protein kinase (MAPK) in all adaptive and innate immune cells in response to the interleukin (IL) cocktail containing IL-2 and IL-649,50; the phosphorylation of P38 MAPK, MAPKAPK2, ERK1/2, ribosomal protein S6 (rpS6), cAMP-response element binding protein (CREB) and nuclear factor κ light-chain-enhancer of activated B cells (NF-κB) and total IκB signal in all innate immune cells (except plasmacytoid dendritic cells) and in regulatory T cells in response to the lipopolysaccharide (LPS) stimulation condition51,52,53,54,55. The resulting five tensors from each expert were then aggregated into a single median tensor used for iEN analyses. The use of median aggregation was to avoid bias by any one expert. An example of all measured immune features and those selected by this prior knowledge tensor is visualized in Fig. 2a.

Fig. 2: Integration of immunological priors.
figure 2

a, Overview of the LTP study. A correlation network of intracellular signalling responses, measured in peripheral immune cells and coloured by ex vivo stimulation status, is visualized. Edges represent significant (P < 0.05) pairwise correlation after Bonferroni adjustment for multiple hypothesis correction. Node sizes represent the significance of correlations with the response variable (gestational age during term pregnancy). b, Immune features that were congruent with domain-specific knowledge as determined by a panel of five immunologists were refined into a tensor and used to determine the node colour of the correlation network. Here, immune features that have a value of 1 (full agreement among the panel) are coloured red and all other immune features are coloured black. c, The network is coloured by the standard deviation of scores assigned to each feature by the panel of immunologists. Overall, the consistency of the prior knowledge among panel members is higher in the features with a higher score, indicating a stronger agreement regarding the top features that should be prioritized by the algorithm and disagreements regarding features inconsistent with prior knowledge.

These tensor values vary from zero to one, with one representing the immune features that are most consistent with prior knowledge according to the panel of experts. iEN regularized regression models are constructed by optimization of the objective function L(β) = ||Y − Xϕβ||2 + λ[(1 − α)||β||2/2 + α||β||1], where X is a matrix of p measured immune features (columns) for n patients (rows) and Y is a vector of the clinical outcomes of interest. The algorithm calculates the coefficients β that minimize the objective function subject to the L1 = ||β||1 and L2 = ||β||2 penalties. The combination of these penalty terms allow for the selection of the features correlated with the outcome of interest and the exclusion of redundant measurements, while also accounting for internally correlated measurements. Model optimization for iEN is controlled by three parameters: λ, α and φ. These parameters can be interpreted as the amount of sparsity in a model (λ), how sparsity is balanced between the L1 and L2 penalty terms (α), and the amount of prior knowledge prioritization (φ). Prioritization of biologically consistent features is accomplished through ϕ, which is a p × p diagonal matrix of the form \({\rm{diag}}(\phi ) = \{ \varphi _{1,1},\varphi _{2,2}, \ldots ,\varphi _{p,p}\}\) such that \(\varphi _{i,i} = \left\{ {{\rm{e}}^{ - \varphi (1 - z_i)}} \right.\) where zi is the score of the ith immune feature, and φ is the amount of prioritization attributed to the model}. This definition allows for a limited effect of φ on model coefficients while also increasing the impact of the features consistent with the prior knowledge tensor (Fig. 3). In the most extreme case, this results in only features with a tensor value of 1 being selected (Supplementary Fig. 1b). This behaviour allows the iEN to generate sparse models while still maintaining the exploratory nature of the EN, as opposed to a priori limiting the model to the features consistent with prior knowledge. Similar to the α and λ free parameters, prioritization of prior knowledge affects the sparsification (Fig. 3 and Supplementary Fig. 1) and optimization of the model (Supplementary Fig. 2). The two-layer 10-fold CV used for iEN optimization and estimation was implemented and parallelized over parameters α, λ and φ. Runtime analysis for this procedure is presented in Supplementary Fig. 3b.

Fig. 3: Prior knowledge and sparsification.
figure 3

Visualization of an example of the impact of prior immunological knowledge on various features in an iEN model. As φ is increased across the x axis (increased impact of prior knowledge), the contributions of each feature to the final model (y axis) change to select models consistent with immunological priors. We have highlighted two examples where a feature is emphasized or de-emphasized (in red and black, respectively) by prior knowledge. In this example, the STAT1 response to IFN-α stimulation in regulatory T cells is prioritized as STAT1 is downstream of the IFN-α/β receptor and is integral to their homeostasis and function86. Conversely, the prpS6 response to stimulation by IL-2 and IL-6 in non-classical monocytes is increasingly deprioritized as this signalling response is inconsistent with prior understanding of these signal transduction pathways in this cell type; IL-2 primarily drives T-cell differentiation through the Janus kinase (JAK)/STAT pathway87. Similarly, IL-6 primarily activates the JAK/STAT pathway and IL-6 receptors are expressed only in a subset of immune cells88,89. This confirms that integration of the priors can not only modify the algorithm’s behaviour, but also that the intensity of this impact can be controlled through the φ free parameter.

Experiments

To evaluate the effectiveness of integrating prior immunological knowledge using the iEN algorithm, we have provided two real-world clinical examples as well as a simulated study.

Analysis of longitudinal term pregnancy

The first example investigated the adaptations of the maternal immune system during pregnancy that can be incorporated into predictive models of gestational age56. During a healthy pregnancy, the immune system strikes a delicate balance to enable tolerance towards the fetus and simultaneously mount a response to defend against pathogens. Abnormal immune system adaptations during pregnancy have been linked to adverse maternal and neonatal outcomes, such as pregnancy loss, preterm birth and preeclampsia57,58,59. This study aims to understand the immunological mechanisms behind term birth as a pivotal first step in understanding abnormal pregnancies and their impact on long-term outcomes60. In this example, a total of 54 blood samples from 18 women were studied during and six weeks after pregnancy. Three antepartum blood draws were collected at different gestational ages, with gestation being measured via ultrasound at the time of sample collection. The resulting 54 whole-blood profiles of the immune system were manually gated (Supplementary Fig. 4) into 24 cell types to measure the endogenous activity of 10 signalling markers, as well as the activity of these 10 markers in response to ex vivo stimulations with three different ligands, providing 960 immune features for analysis (Fig. 2a). Unsupervised analysis of each antepartum sample using the t-distributed stochastic neighbour embedding (t-SNE)61 algorithm showed no readily identifiable patterns associated with gestational age (Supplementary Fig. 5a), motivating supervised iEN analysis. Predictive models were built using iEN, and model parameters were optimized to minimize the residual sum of squares of the predicted versus actual gestational age. iEN produced models of immune features that more accurately predicted gestational age than the similar EN analysis and other contemporary machine learning methods, which were agnostic to the immunologic priors (Fig. 4a,e and Supplementary Figs. 5b and 6a). iEN analyses of postpartum samples demonstrated that the immune system returns to a state that is similar to early pregnancy by six weeks postpartum (Supplementary Fig. 5c). An additional cohort of 10 women were prospectively studied and analysed as an independent validation set. Importantly, the validation cohort also demonstrated that iEN models produced substantially more accurate results than the EN algorithm (Fig. 4b,f and Supplementary Fig. 5e,f). Stepwise reduction of iEN and EN model coefficients revealed superior predictions in the validation cohort by the iEN compared to the EN algorithm for models of equal size (Supplementary Fig. 5d).

Fig. 4: Incorporation of prior knowledge improves predictions in two clinical studies and a simulated experiment.
figure 4

a, Boxplot of Pearson correlation P values calculated on out-of-sample predictions from repeated 10-fold cross-validation of EN (black) and iEN (red) models for the LTP dataset. b, Validation of the LTP model on an independent validation cohort. These predictions were compared against the true response variable using −log10 Pearson correlation P values. c, Boxplots of Wilcoxon rank-sum test P values similarly calculated on out-of-sample predictions for the ChP dataset (null hypothesis: the sample-to-class assignment probabilities produced by the model are equal between the two outcome classes). Comparison of model performance for the respective datasets demonstrated improved predictions for the iEN, as shown by −log10 P values. d, A simulated study with varying cohort sizes of simulated ‘patients’ with 700 features demonstrated a larger gain (measured by −log10 Pearson’s test P values) for the integration of prior immunological knowledge in datasets with a relatively small cohort size and a large number of features. Locally fitted polynomial curves of prediction performance over multiple cohort sizes are displayed with 95% confidence intervals (CIs). eh, R.m.s.e. values to demonstrate the effect sizes of the models in ad.

Analysis of chronic periodontitis

The second example investigated the classification of patients with ChP, a chronic inflammatory disease of the oral cavity. ChP is associated with severe systemic illnesses (such as heart disease, various malignancies and preterm labour)62,63 and, in its most severe form, was estimated to affect ~11% of the global population in 201064. A better understanding of the immunological manifestations of ChP is a critical first step for the development of immune therapies that may alter the course of systemic diseases associated with ChP. This dataset was generated from 28 participants, 14 diagnosed with ChP and 14 healthy controls. Blood samples from the study participants were analysed by mass cytometry and manually gated (Supplementary Fig. 4) for 18 cell types to measure 11 signalling markers in response to ex vivo stimulation with four different ligands; this provided a total of 792 immune features for analysis (Supplementary Fig. 7a). Unsupervised t-SNE visualization of the entire dataset showed no clear separation between case and control populations across all immune features, motivating supervised classification analysis using iEN (Supplementary Fig. 7b). For this example, iEN and EN algorithms were used to fit binomial models for the classification of patients and controls. All free parameters were optimized for AUROC65. The analysis results indicated that iEN outperformed the EN (Fig. 4c,g and Supplementary Fig. 7c) as well as other machine learning methods (Supplementary Figs. 6c and 7d).

Simulation study

The third example, a simulation study, demonstrated the advantages of prior knowledge integration in high-dimensional studies in relation to cohort size. Datasets were generated with 700 features per simulated patient while the number of patients varied from 100 to 1,000 in increments of 100. Data were generated with features that were random and uniform. A limited number of features were generated to have various degrees of correlation with the response variable. Specifically, 50 highly predictive features, 200 moderately predictive features and 450 randomly distributed features were assigned a corresponding biological prior. For a more detailed description of the data-generating process see the simulated data section in the Methods. Repeated 10-fold analysis of the simulated data with increasing population sizes displayed a convergent trend between the iEN and EN models as n increased (Fig. 4d,h, Supplementary Fig. 3a and Supplementary Table 6). As cohort size increased, iEN captured the highly predictive features faster than EN (Supplementary Fig. 3c). These results suggest that if the cohort size is too small to the point that model outputs are affected by noise, the advantage offered by prior knowledge also diminishes (left side, Fig. 4d,h). The results also indicate that integration of prior knowledge is of particular importance in settings where the number of instances (for example, cohort size) is limited yet a relatively large number of features are measured, as is common in translational systems immunology studies.

Sensitivity analysis of the prior knowledge tensor

The iEN pipeline depends on the prior knowledge tensor. We therefore investigated the robustness of iEN by introducing errors into the prior knowledge tensor in each of the three examples. Introduction of moderate to substantial error into the prior tensor resulted in a consistent reduction in the predictive benefit of the iEN over the traditional EN model. Error was introduced stochastically and progressed towards uniform random noise, with 11 total incremental steps and 100 tensors generated per increment. As the prior tensor approaches a uniform random distribution (the highest amount of error in the prior knowledge matrix) the iEN and EN performances converge. These results remain consistent across the LTP, LTP validation, ChP and simulation studies (Fig. 5). Additional analyses performed on simulated data investigated the effect of errors and missing information in the prior knowledge tensor of individual experts (Supplementary Fig. 8). This analysis demonstrated that the algorithm is more sensitive to erroneous information introduced into the prior knowledge tensor than it is to missing information.

Fig. 5: iEN is robust to errors in the prior knowledge tensor.
figure 5

ad, Various levels of noise were artificially added to the prior knowledge values, as indicated by the r.m.s.e. values of the true prior values versus the simulated ones (x axis). As the value on the x axis increases, the amount of noise in the simulated prior increases until all priors are sampled from a uniform and random distribution (vertical dashed line). Reassuringly, at this point, the performance of iEN is close to that of the EN (with no priors), as indicated by a horizontal dashed line. Importantly, iEN continues to outperform the EN (horizontal dashed line) for even high amounts of error in the priors. All curves are locally fitted polynomials of predictive performance with the shaded region representing 95% CI.

Empirical evaluation

In all analyses, the integration of expert knowledge improved the prediction of clinical outcomes in comparison to EN35 with no prior knowledge (Fig. 4). iEN also outperformed standard machine learning algorithms including LASSO45, RF44, SVM43,66, KNN42,46, Super Learner46 (an ensemble of the aforementioned methods) and the knowledge integrated Know-GRRF25 and graper41 methods (Supplementary Figs. 6 and 7d). Further comparison of iEN and EN models by selected features demonstrated a substantial overlap between the models; however, model comparison by coefficient weights displayed a substantial difference in the predictive importance of the features selected (Supplementary Fig. 9). The iEN results demonstrated a higher variability in predictions (Supplementary Table 5). This is probably caused by the additional φ parameter applied in iEN, resulting in a larger search space for potential model parameters. Hyperparameter selection frequency across all generated models displayed a consistent prioritization of prior knowledge via φ (Supplementary Fig. 10). The integration of prior knowledge allows the iEN to determine the predictive benefit of prioritizing canonical signalling pathways in a data-driven manner (Supplementary Fig. 2). Importantly, this enables iEN to function without excluding any of the features from consideration, even when scored as inconsistent with prior knowledge by the human experts. This behaviour allows for the iEN to contain the EN as an edge case when the prior knowledge is not beneficial. Similarly, when prioritization is most beneficial, the resulting model will contain only features scored as 1 in the prior tensor (those with complete consensus among the panel of experts).

Discussion

A structured collaboration between clinicians, biologists and computer scientists can lead to machine learning algorithms in the life sciences that achieve stronger results67. In this Article, we proposed a collaborative framework that enables integration of prior knowledge of cell signalling pathways in a machine learning algorithm to improve the predictions and robustness of the resulting models in clinical datasets. This knowledge-integrated approach to data analysis can be generalized under the ‘learning using privileged information’ paradigm described by Vapnik and Vashist68 where external information is used at the time of training to improve the incurring decision rule. In our experiments, this strategy improved the accuracy of predictions in both translational clinical studies and simulated experiments, even when moderate amounts of noise were artificially added to the extracted prior knowledge.

Importantly, the data-driven approach implemented here allows for prior knowledge only to be incorporated when an improvement in the model is observed. Functionally, this reduces the regularization of the features consistent with prior knowledge, resulting in the development of sparse models that prioritize features in line with previous biological studies. This not only increases the predictive performance, but also facilitates biological interpretation and translation of the results. From a biological perspective, the iEN enhances the interpretability of the results as model features are enriched for cell-type and receptor-specific signalling responses (as opposed to over-reliance on visualization using dimension reduction tools14). These resulting multivariable models with increased biological consistency and improved predictive capabilities could also contribute to the development of robust and simplified assays for resource-limited settings. From a Bayesian perspective, this prior knowledge integration could be viewed as a shift in the prior distributions over β towards estimates that are more congruent with the true underlying distributions. A more explicit connection to the Bayesian setting is presented in the Methods.

This study has several limitations, which will guide our future research directions. First, definition of the prior knowledge tensor by individual human experts has the potential to induce a source of bias into the analysis. Although our analysis suggested that the method is robust to potential errors in the prior knowledge tensor, a more accurate and consistent definition of prior knowledge would improve this pipeline. In addition, the development of the prior knowledge is labour-intensive and requires careful and objective analysis of a broad range of studies. We believe stronger results can be achieved using text-mining strategies for direct extraction of prior knowledge from the literature (for example, see immuneXpresso69). Second, although our simulations point to trends in the advantages of prior knowledge integration in relation to cohort size, they cannot be used for specific guidance regarding the cohort size needed in future studies. Preliminary studies for determining the effect size and proper power analysis should be performed to guide this decision for a new study. Third, this work relies on manual analysis for the identification of all cell types (Supplementary Fig. 2) and mapping them to the prior knowledge. This process is labour-intensive, error-prone and may not identify all cell populations of interest70. In our future studies, we will combine state-of-the-art cell population identification algorithms71,72,73,74,75,76 with our prior knowledge integrated to dynamically match clusters to the prior knowledge tensors for a more unbiased analysis. Fourth, this work only investigated the incorporation of prior knowledge into the EN algorithm. However, other methods can similarly be extended to incorporate expert knowledge (for example, see ref. 19 for a relevant extension of SVMs). In our future work, we will focus particularly on the incorporation of prior knowledge into deep learning methods77,78. Although these algorithms can model complex relationships that are valuable in high-throughput characterization of the immune system, the number of patients that are required for training a large neural network is often beyond the reach of typical immunological studies. We believe incorporation of prior immunological knowledge can reduce the number of patients required for implementation of deep learning approaches in clinical studies79.

Additional research directions include exploring ensembles of prior knowledge integrated models, application of the iEN to domains outside of clinical immunology such as proteomics, metabolomics and transcriptomics, and application of domain-knowledge integrated models to multi-omic studies, which would provide a systems-level perspective on human biology80.

Methods

Integration of immunological priors

The iEN framework extends the EN regularized regression method by integrating prior biological knowledge of cellular signal transduction into the coefficient optimization process. Consider an analysis with mass cytometry that generated features X, composed of observations (rows) Xi = (xi1,...,xip)T for i = 1, 2, ..., n, with each observation consisting of p measurements, where pN and p is much greater than n. Corresponding to each observation is a value of interest yi. Values of interest then constitute the response vector Y = (y1,...,yn)T. Response vectors are dataset-specific (for example, a vector of gestational age during pregnancy in the LTP example). A multivariable regression model can be constructed by computing the coefficients β = (β1,β2,...,βp)T that optimize the objective function, L(β) = ||Y − ||2. The EN method expands this definition with a linear combination of two regularization terms, the L1 = ||β||1 and L2 = ||β||2 penalties from the least absolute shrinkage and selection operator (LASSO) and ridge regression, respectively81. L1 penalization reduces the model complexity and increases sparsity, while simultaneously selecting more descriptive features. However, it can select, at most, the number of observations when working in a high-dimensional space (specifically high-dimensional small observation size) and cannot select multiple, highly correlated features. L2 penalization reduces the coefficient size and encourages the inclusion of highly correlated features, but cannot remove features completely. Incorporating both L1 and L2 regularization terms compensates for these issues. Penalization is applied to coefficients during model fitting and is determined by a penalization factor λ, as well as the ratio of penalization applied to each penalty term, α. The optimal ratio (α) and degree (λ) of penalization can be determined through optimization of the EN objective function:

$$L(\beta ) = ||Y - X\beta ||^2 + \lambda [(1 - \alpha )||\beta ||^2/2 + \alpha ||\beta ||_1]$$

EN models are agnostic to any information not included in X. However, the iEN incorporates a third parameter that encodes prior immunological knowledge: ϕ, a p × p diagonal matrix of the form \({\mathrm{diag}}(\phi ) = \{ \varphi _{1,1},\varphi _{2,2}, \ldots ,\varphi _{p,p}\}\) where \(\varphi _{i,j} = 0\,\forall _{i \ne j}\). The ϕ factor guides models to be more consistent with the current understanding of signal transduction response. Biological priors compiled a priori by an independent panel of immunologists are used to prioritize certain signal transduction responses by scaling the features of X. The adapted model takes the form

$$L(\beta ) = ||Y - X\phi \beta ||^2 + \lambda [(1 - \alpha )||\beta ||^2/2 + \alpha ||\beta ||_1]$$

The biological priors, represented as a tensor of domain-specific knowledge, are manually constructed by a panel of experts. These biological priors are represented as a tensor of scores where features more consistent with known biology have higher values. These priors are a conservative indication of response from canonical signalling pathways that the field has a high level of confidence in observing. They are constructed as an m × l × o tensor, Z [0,1]m×l×o, where the associated mass cytometry assay consists of m cell types, l stimulations and o measured responses. An element in this tensor would correspond to a particular cell type and whether it will elicit a specific signalling response in response to each ex vivo stimulation. To make the connection between the prior tensor Z and the iEN parameter ϕ clear, consider the function \(F(Z) \to {\rm{diag}}(\phi ) \in {\Bbb R}_{[0,1]}^P\); that is to say, diag(ϕ) is a vector of dimension m × l × o = p with values in the range [0,1]. In other words, Z is transformed to a p-dimensional vector, where \({\rm{diag}}(\phi ) = \{ \varphi _{1,1},\varphi _{2,2}, \ldots ,\varphi _{p,p}\}\) such that \(\varphi _{i,i} = \left\{ {{\rm{e}}^{ - \varphi (1 - z_i)}} \right.\), where zi is the score of the ith immune feature, and φ is the amount of prioritization applied}. This formulation of \(\varphi _{i,i} \in {\rm{diag}}(\phi )\) as \({\rm{e}}^{ - \varphi (1 - z_i)}\) affects features with lower prior value more than features with larger values. This definition allows for increased model stability than a formulation with \(\varphi _{i,i} \in {\rm{diag}}(\phi )\) as \({\rm{e}}^{\varphi z_i}\) for large values of φ (Supplementary Fig. 1a,b).

Bayesian interpretation

The EN has a Bayesian representation82 with priors over the estimates of β. This can help define the role of immunological priors in improving model predictions. The unnormalized version of this prior is reported as follows:

$$p(\beta |\lambda ,\alpha ) \propto {\rm{exp}}[ - \lambda \{ (1 - \alpha )||\beta ||^2 + \alpha ||\beta ||_1\} ]$$

In the following, we show that the iEN has a similar interpretation, in which the prior distributions over β are altered according to the prioritization of biological knowledge, that is, the value of φ and the shape of Z. That is to say, our definition of the iEN can be represented as an alteration of the prior distribution over β given ϕ. The objective function of the iEN is \(\hat \beta = {\rm{argmin}}(||Y - X\phi \beta ||^2 + \lambda [(1 - \alpha )||\beta ||^2/2 + \alpha ||\beta ||_1])\). Now let \(\bar \beta = \phi \beta\), then substitution for β results in the following optimization problem:

$$\hat \beta = {\rm{argmin}}(||Y - X\bar \beta ||^2 + \lambda [(1 - \alpha )||\phi ^{ - 1}\bar \beta ||^2/2 + \alpha ||\phi ^{ - 1}\bar \beta ||_1])$$

From this formulation, the adjusted Bayesian prior for the iEN can be directly derived as follows:

$$p(\beta |\lambda ,\alpha ,\phi ) \propto {\rm{exp}}[ - \lambda \{ (1 - \alpha )||\phi ^{ - 1}\bar \beta ||^2 + \alpha ||\phi ^{ - 1}\bar \beta ||_1\} ]$$

To further illustrate the connection between iEN and the regular EN and their Bayesian interpretations, we show that the EN is a special case of iEN. For this, let us define the following two sets:

$$S_1 = \{ P|{\rm{where}}\,z_p = 1\}$$
$$S_2 = \{ P|{\rm{where}}\,z_p < 1\}$$

These sets indicate which estimates of \(\hat \beta\) are affected by φ and which remain unaffected as previously defined. We can then subset the Z vector accordingly, with ZS1 being all biological priors of value one and ZS2 being all biological priors of value less than one. Therefore, we can separate the L1 and L2 norms according to these sets, reformulating the optimization problem as follows:

$$\begin{array}{l}{\hat{\beta}} = {\rm{argmin}}\left(\Vert Y - X{\bar \beta}\Vert ^2 + \lambda \left[(1 - \alpha )\Vert {\bar \beta}_{{{S}}_1}\Vert^2/2 + \alpha \Vert {\bar \beta}_{{{S}}_1}\Vert _1\right.\right.\\ \left.\left.\quad +\, (1 - \alpha )({\rm{e}}^{ - \varphi (1 - z_{{{s}}_2})})^{ - 2}\Vert {\bar \beta }_{{{S}}_2}\Vert ^2/2 + \alpha ({\rm{e}}^{ - \varphi (1 - z_{{{s}}_2})})^{ - 1}\Vert {\bar \beta }_{{{S}}_2}\Vert _1\right]\right)\end{array}$$

Here, \(\bar \beta _{{\it{S}}_1}\) and \(\bar \beta _{{\it{S}}_2}\) represent the betas that are affected by the φ value. This allows for us to replace ϕ with \({\rm{e}}^{ - \varphi (1 - z)}\) respective of S1 and S2, which demonstrates how prioritization affects the estimation of β. Similar separation in the prior distribution also illustrates how priors over β are affected in the same manner:

$$\begin{array}{l}p(\beta |\lambda ,\alpha ,\phi ) \propto {\rm{exp}}\left[ - \lambda \left\{ (1 - \alpha )\Vert \bar \beta _{S_1}\Vert ^2/2 + \alpha \Vert \bar \beta _{S_1}\Vert _1\right.\right.\\ \left.\left. \quad+\, (1 - \alpha )({\rm{e}}^{ - \varphi (1 - z_{s_2})})^{ - 2}\Vert \bar \beta _{S_2}\Vert ^2/2 + \alpha ({\rm{e}}^{ - \varphi (1 - z_{s_2})})^{ - 1}\Vert \bar \beta _{S_2}\Vert _1\right\}\right]\end{array}$$

Because φ alters the prior distribution over β, this can be used to improve estimates of the true β in the Bayesian setting of iEN, as it does in the classic formulation. This also allows for the EN estimates to be included as a special case of the iEN when φ = 0.

Parameter optimization

iEN parameters were optimized over a grid of possible parameter values for each parameter (φ,α,λ). The φ search grid was generated from a logarithmic sequence with values between 0 and 100, which allows for the EN as a special case (φ = 0). Similarly, α was uniformly generated from values between 0 and 1. Generation of λ was done so that a large range of model sizes were tested during the 10-fold CV. Specifically, λ values were generated during the inner CV loop to avoid any possible information leak. The metrics used to justify parameter selection were the residual sum of squares for continuous response and the AUROC curve, as appropriate for each example presented.

Simulated data

All simulated data were generated using the ‘simglm’ R package83. A total of 450 features were generated with a standard deviation of 15. However, 200 of these features had a mean value sampled from N(N(0,102),152) to simulate features moderately associated with prior knowledge and 50 were sampled from N(N(0,52),152), representing features highly associated with prior knowledge. The response variable was then generated as a linear combination of these 250 features. The additional 450 features represented features not associated with prior knowledge and were generated randomly using a normal distribution.

LTP cohort

Twenty-one pregnant women were included in the LTP study, all of whom received routine antepartum care at Lucile Packard Children’s Hospital. Three patients were excluded from the study due to premature delivery (<37 weeks of gestation). Analysis was performed on the remaining 18 women who delivered at term (≥37 weeks of gestation). These 18 participants were aged 31.9 years (±3.4 (s.d.)). An independent cohort of 10 pregnant women who delivered at term were later enrolled as a validation cohort.

ChP cohort

A total of 30 patients were enrolled in the study of ChP: 15 healthy controls and 15 patients with ChP receiving treatment at Bell Dental Centre (San Leandro) and Stanford University School of Medicine. Two participants were excluded from the analysis, one patient due to autoimmune disease and one control due to onset of hand infection during the study. The final cohort consisted of 14 patient (aged 42.2 ± 10.5 years) and 14 control (36.5 ± 8.07 years) samples, each of which were split by gender: eight female and six male.

Whole blood sample processing

Whole blood samples were collected in 10-ml heparin-containing tubes and processed within 1 h of collection. Samples for the LTP cohort were stimulated with either 1 μg ml−1 of LPS, 100 ng ml−1 of IFN-α or a cocktail of 100 ng ml−1 IL (IL-2, IL-6), or they were left unstimulated to measure endogenous cellular activity. Samples for the ChP cohort were stimulated with LPS, IFN-α, tumour necrosis factor-α or a cocktail of IL-2, IL-4, IL-6 and granulocyte-macrophage colony-stimulating factor, or left unstimulated. Samples were fixed using a stabilization buffer (SmartTube) according to the manufacturer’s instructions and stored at −80 °C until further processing.

Mass cytometry analysis

Post-thaw, fixed samples were added to an erythrocyte lysis buffer (SmartTube) and underwent two rounds of erythrocyte lysis. Cells were then barcoded as previously described84. In summary, 20-well barcode plates were prepared with a combination of two Pd isotopes out of a pool of six (102Pd, 104Pd, 105Pd, 106Pd, 108Pd, 110Pd) and added to the cells in 0.02% saponin/phosphate buffered saline. Samples were pooled and stained with metal-conjugated antibodies collectively to minimize experimental variation. The panels for the different cohorts are listed in Supplementary Tables 3 and 4. Intracellular staining was performed in methanol-permeabilized cells. Cells were incubated overnight at 4 °C with an iridium-containing intercalator (Fluidigm). Before mass cytometry analysis, cells were filtered through a 35-μm membrane and resuspended in a solution of normalization beads (Fluidigm).

Barcoded and stained cells were analysed on a Helios mass cytometer (Fluidigm) at an event rate of 500 to 1,000 cells per second. The data were normalized using Normalizer v0.1 MATLAB Compiler Runtime85 and debarcoded with a single-cell MATLAB debarcoding tool84. Gating was performed using Cytobank (cytobank.org). Gating strategies for the different cohorts are shown in Supplementary Fig. 4.

Correlation network visualization

All datasets were visualized using correlation network structures. Each immune feature was denoted by a node and the network layout was calculated using the t-SNE algorithm applied to the complete adjacency matrix. For visualization purposes, only the edges with Bonferroni-corrected Pearson correlation P values of less than 0.05 were visualized.

Reporting Summary

Further information on research design is available in the Nature Research Reporting Summary linked to this Article.