An immune dysfunction score for stratification of patients with acute infection based on whole blood gene expression

Dysregulated host responses to infection can lead to organ dysfunction and sepsis, causing millions of global deaths each year. To alleviate this burden, improved prognostication and biomarkers of response are urgently needed. We investigated the use of whole blood transcriptomics for stratification of patients with severe infection by integrating data from 3,149 samples from patients with sepsis due to community-acquired pneumonia or fecal peritonitis admitted to intensive care and healthy individuals into a gene expression reference map. We used this map to derive a quantitative sepsis response signature (SRSq) score reflective of immune dysfunction and predictive of clinical outcomes, which can be estimated using a 7- or a 12-gene signature. Last, we built a machine learning framework, SepstratifieR, to deploy SRSq in adult and pediatric bacterial and viral sepsis, H1N1 influenza, and COVID-19, demonstrating clinically relevant stratification across diseases and revealing some of the physiological alterations linking immune dysregulation to mortality. Our method enables early identification of individuals with dysfunctional immune profiles, bringing us closer to precision medicine in infection.

Mediation analysis estimates how much of the effect of X on Y depends on changes in M (the average causal mediation effect, ACME), as well as how much is independent of M (the direct causal effect, ADE) (18). To do so, Y is modeled as a function of M and X, with M itself being a function of X. ACME and ADE are then estimated either exactly or with simulations.
In this study, SOFA, organ function measurements, and the presence/absence of shock were considered potential mediators of SRSq on mortality. Thus, each of them was modeled as a function of SRSq while accounting for covariates (age and source of sepsis). More specifically, numeric variables (e.g. SOFA) were modeled using a standard linear regression: Where M is the mediator variable, Ci the i-th covariate, and and the regression coefficients and random error term, respectively.
In contrast, binary variables (e.g. shock) were modeled using logistic regression with a probit model: ( ) = 6 ) + ( + 0 !*( ! + !,( 7 Where p(M) is the probability of the mediator, Ci the i-th covariate, and the regression coefficients and random error term, respectively, and the cumulative distribution function (CDF) of the normal distribution: Assuming X is normally distributed with zero mean and unit standard deviation: The dependent variable (i.e. mortality) was then modeled as a function of each mediator, one variable at a time, while accounting for SRSq and covariates. This was done using a logistic regression with probit model: Where p(D) is the probability of death within 28 days of ICU admission, M the mediator of interest, Ci the i-th covariate, and the regression coefficients and random error term, respectively, and the cumulative distribution function (CDF) of the normal distribution.
ACMEs, ADEs, and their 95% confidence intervals were estimated using Monte Carlo simulations, as implemented in the mediation package (v4.5.0). Effect sizes were estimated relative to 0.1-unit increases in SRSq.
Mediation effects are best interpreted in terms of counterfactuals (18). In particular, ADEs tell us how much mortality would increase if SRSq were increased by 0.1 units while keeping the mediator constant. In contrast, ACMEs tell us how much mortality would increase if SRSq were held constant, but the mediator was artificially increased as if SRSq had increased by 0.1 units.

The SepstratifieR algorithm
SepstratifieR was developed using R and is publicly available for download and installation from our GitHub repository (https://github.com/jknightlab/SepstratifieR). SepstratifieR contains two independent approaches for patient stratification: 1) prediction based on pretrained and cross-validated random forest models (recommended mode), and 2) mNN-based prediction (recommended only for situations where samples are profiled individually or sample size drops below 20). The algorithms followed by each of these approaches are detailed below.

A) Random forest-based classification
Through the stratifyPatients() function, the SepstratifieR package provides the user with access to a set of validated random forest classification and prediction models to infer SRS and SRSq based on either the Davenport or the Extended gene signature. When this function is called, the following algorithm is followed: 1. The samples of interest (i.e. the user's input) are aligned to the samples in the Davenport of Extended reference set (as specified by the user) so as to remove differences between batches or technologies. This alignment is performed using the mutual nearest neighbors (mNN) algorithm (15), with the number of nearest neighbors specified by the user. This results in a set of batch-aligned expression measurements. 2. The pre-trained random forest models introduced in this study are used to classify samples into SRS groups based on the batch-alignment set of variables. 3. The pre-trained random forest models introduced in this study are used to predict SRSq for each sample. 4. The algorithm returns an object containing raw expression measurements, batchaligned expression measurements, SRS predictions, SRSq predictions, and information on the parameters used to run the algorithm.
We recommend this algorithm as the default way of assigning SRS and SRSq in new data sets. Further information on this function, as well as documentation of the associated R code, and an associated method to perform sensitivity analysis and determine the ideal number of nearest neighbors are available at https://github.com/jknightlab/SepstratifieR.

B) kNN-based classification
This approach relies on cosine similarities, a measure of the similarity between two data points which does not depend on the magnitude of each variable, but rather on the relationships (i.e. ratios) between different variables (in this case, the 7 or 19 genes in the signature). If we conceptualize each sample as a vector in space (each gene being a dimension), cosine similarity represents the cosine of the angle between two vectors. This angle is independent of vector length, and only depends on the direction in which vectors are pointing. This makes cosine similarities scale-independent and robust to technical differences. A cosine similarity of 1 means two samples are identical (i.e. vectors point in the same direction), a similarity of 0 means they are orthogonal, and a similarity of -1 means they are antiparallel (i.e. vectors point in opposite directions). Thus, cosine similarities are here used to compare a query sample to all samples in the reference sets, thus enabling identification of the top samples most similar to the query patient. Next, these samples are used to infer SRS groups and SRSq using a kNNbased classification system.
This algorithm is available via the projectPatient() function in the SepstratifieR package, and consists of the following steps: 1. Cosine similarities are calculated between the query and all the reference samples 2. The k samples with highest cosine similarity to the query (default k = 20) are identified. These are referred to as the k nearest neighbors. 3. An SRS label is assigned to the query sample based on a 'majority vote' system, where each nearest neighbor gets a vote with a weight proportional to its cosine similarity. The query sample is assigned to the SRS category with the highest vote count. 4. An SRSq value is predicted for the query by calculating a weighted average of SRSq across its nearest neighbors. Each neighbor's contribution is weighted by its cosine similarity with the query.
This algorithm follows a 'lazy learning' approach, because it does not rely on trained and validated machine learning models, but rather on distance calculations. The advantage of this is that it obviates the need for batch alignment and is applicable to isolated samples. The drawback is that distance measurements must be calculated every time a sample is classified, making the approach computationally intensive, and that it cannot learn structure in the way a machine learning model does.
We recommend this approach in situations where samples are either profiled individually, one at a time, or in batches with less than 20 samples each.

CyTOF data analysis
Mass cytometry measurements from whole blood leukocytes were from COMBAT (20). Protein abundances were averaged across all cells from each individual to create pseudobulk protein profiles. Principal component analysis was used for dimensionality reduction. The agreement between mRNA and protein was assessed using Pearson correlation tests, with Bonferroni correction for multiple testing. Associations between protein abundance and SRSq were tested using linear models with limma, with FDR correction for multiple testing. Proteins were deemed SRSq-associated if log2-fold changes ≥ 0.5 at FDR < 0.05.       (I) Correlation between SRSq (x axis) and protein levels (y axis) for the three proteins most significantly associated with SRSq. Cor = Pearson correlation; p = correlation p value.