Evaluating mesenchymal stem cell therapy for sepsis with preclinical meta-analyses prior to initiating a first-in-human trial

Evaluation of preclinical evidence prior to initiating early-phase clinical studies has typically been performed by selecting individual studies in a non-systematic process that may introduce bias. Thus, in preparation for a first-in-human trial of mesenchymal stromal cells (MSCs) for septic shock, we applied systematic review methodology to evaluate all published preclinical evidence. We identified 20 controlled comparison experiments (980 animals from 18 publications) of in vivo sepsis models. Meta-analysis demonstrated that MSC treatment of preclinical sepsis significantly reduced mortality over a range of experimental conditions (odds ratio 0.27, 95% confidence interval 0.18–0.40, latest timepoint reported for each study). Risk of bias was unclear as few studies described elements such as randomization and no studies included an appropriately calculated sample size. Moreover, the presence of publication bias resulted in a ~30% overestimate of effect and threats to validity limit the strength of our conclusions. This novel prospective application of systematic review methodology serves as a template to evaluate preclinical evidence prior to initiating first-in-human clinical studies. DOI: http://dx.doi.org/10.7554/eLife.17850.001


Introduction
The decision to initiate an early phase clinical trial requires careful evaluation of the benefits and risks of a novel intervention. However, for first-in-human studies for which there is no prior clinical experience, the assessment of potential therapeutic efficacy must rely solely on the preclinical investigations. Although regulatory guidance exists for the conduct of preclinical evaluation of novel therapies (U.S. Department of Health and Human Services, 2013), there is little guidance to help stakeholders summarize and assess the benefits and risks of novel therapies prior to first-in-human studies. As a result, the evidence from individual preclinical studies is often summarized and described in a non-systematic and potentially biased manner (Food and Drug Administration, 2015). Here, we present an approach to transparently evaluate preclinical evidence of a therapy prior to its potential clinical translation. Our exemplar is mesenchymal stem cell (MSC) therapy for sepsis.
A selective narrative summary of preclinical evidence has significant limitations because the methods used to identify studies are neither comprehensive nor transparent. This is of particular concern given that studies replicating high profile experiments fail in up to 50-90% of attempts (Begley and Ellis, 2012; Scott et al., 2008; Steward et al., 2012) and significant publication bias results in a skewed representation of effects (Sena et al., 2010). Further, fewer than 5% of high impact preclinical reports are clinically translated (Contopoulos-Ioannidis et al., 2003) and only 11% of clinically tested agents receive licensing (Kola and Landis, 2004). Thus, trialists have based predictions of clinical success of novel therapies on flawed data and an inappropriately highly selected and positive preclinical evidence base (Grankvist and Kimmelman, 2016).
Systematic reviews and meta-analyses have become very popular because they can overcome many of these challenges by promoting the transparent evaluation of therapies. Systematic reviews are guided by a protocol with explicit methods to identify, synthesize (which may include meta-analysis), and appraise all investigations pertinent to a particular research question. Similarly, meta-analysis enables pooling of effect sizes across studies and increases statistical power by reducing standard error around the average effect size, providing a more precise estimate of an overall treatment effect (Cohn and Becker, 2003). Systematic reviews and meta-analyses have long been regarded as essential tools to summarize and evaluate clinical research (Higgins and Green, 2009) and have become a requisite component of grant applications for clinical trials (Canadian Institutes of Health Research, 2016); however, the application of these tools to preclinical studies has been limited.

eLife digest
Most attempts to transform exciting findings from laboratories into clinical treatments are unsuccessful. One reason for this may be the failure to consider all of the laboratory work that has been performed before deciding to test a treatment on patients for the first time. In particular, negative findings (that suggest that a potential new treatment is ineffective) may be overlooked.
Stem cells may help to treat life-threatening infections, but this has not been tested in human patients. However, the effectiveness of stem cell treatments has been tested in animals that act as models of human infection.
Before deciding to begin a clinical trial of stem cell therapy for life-threatening infections, Lalu et al. performed an exhaustive search to find all the studies in which stem cells were used to treat animal models of infection. Combining the results of all of these studies using particular analysis techniques revealed that stem cell therapy increased the survival of these animals overall. These positive effects were seen over a range of different experimental conditions (for example, when treating the animals with different doses of stem cells, or giving the doses at different times).
Lalu et al. also identified some limitations with most of the laboratory studies that had tested stem cell therapy for infections. Many of the studies used animal models that may not be the best representations of humans with severe infection. In addition, many of the scientists did not report that they had used methods (such as randomization) that would generate the most confidence in their results. Despite these limitations, there was a lot of consistency in the reported results.
Overall, the results support the decision to proceed to a clinical trial that tests the effectiveness of stem cells for treating human infections. More generally, Lalu et al.'s analysis demonstrates a way of considering all laboratory evidence before deciding to proceed to a first clinical trial in humans. This may help researchers to identify promising therapies to further develop, and also to identify potential failures before they are tested in patients.

Preclinical systematic reviews may help predict the magnitude and direction of novel therapeutic effects in high stakes first-in-human trials. For example, preclinical systematic reviews of stroke (Horn et al., 2001) and heart failure (Lee et al., 2003) therapies demonstrated that the resulting negative clinical trials could have been predicted had available preclinical evidence been analyzed in a rigorous manner. Thus, thousands of patients may have avoided exposure to potential risk without any benefit (Kalra et al., 2002; Shuaib et al., 2007). Similarly, previous preclinical systematic reviews have demonstrated that failure to report threats to methodological quality (i.e. internal validity, risk of bias) and construct validity (i.e. extent a model corresponds to the human condition it is intended to represent [Henderson et al., 2013]) influences treatment effect sizes (Crossley et al., 2008; Hirst et al., 2014; Macleod et al., 2008, 2015; Rooke et al., 2011). Unlike this 'retrospective' approach that has been described in previous studies, a prospective application of preclinical systematic review methodology may help delineate the limits of a therapy prior to first-in-human application.
Our preclinical systematic review was conducted prior to the initiation of a Phase 1/2 clinical trial of immunomodulatory cell therapy (mesenchymal stromal cells, mesenchymal stem cells [MSCs], "adult stem cells") for septic shock (NCT02421484). The specific question addressed was: In preclinical in vivo animal models of sepsis, what is the effect of MSC administration (compared to control treatment) on death? Septic shock is the result of an overwhelming systemic infection; it is one of the most common and acutely devastating health problems in the intensive care unit with a 90-day mortality rate of approximately 20-30% despite modern therapy (Peake et al., 2014; Mouncey et al., 2015; Stevenson et al., 2014). It is caused by a maladaptive mismatch between host inflammatory response and pathogenic stimuli which leads to organ failure and death. MSCs are ubiquitous cells (da Silva Meirelles et al., 2006) that support tissue repair and are mobilized under inflammatory conditions (Hannoush et al., 2011; Rochefort et al., 2006). Exogenously administered MSCs represent an especially attractive therapeutic for sepsis because they have antibacterial and organ protective effects, in addition to their immune modulatory functions (Walter et al., 2014).
We quantitatively summarized the results of all preclinical studies of MSC therapy for in vivo animal models of sepsis to predict effect size and establish an ethical basis for exposing high-risk patients to this novel therapy. This is the first systematic evaluation of a novel biologic therapy prior to initiating a first-in-human trial. We believe our approach serves as a roadmap to transparently evaluate a preclinical therapy prior to its potential clinical translation. This study has been written in an explicatory manner so that other preclinical and translational researchers not familiar with systematic review methodology may replicate our approach. Readers wishing to replicate our approach for their research agendas are directed to the methods section where explanations are provided in greater depth, and encouraged to contact the authors for further guidance.

Effect of MSCs on sepsis mortality in rodents
MSC therapy in preclinical models of sepsis significantly reduced the overall odds of death (odds ratio [OR] 0.27, 95% confidence interval [CI] 0.18-0.40; Figure 2). Since it is important to consider the consistency of results between studies, we calculated the I² statistic, which demonstrated a low degree of heterogeneity across studies (I² = 33%). The reduction in mortality was maintained regardless of when death occurred, whether considering deaths before two days after induction of sepsis (OR 0.31, 95% CI 0.21-0.46), between two and four days (OR 0.20, 95% CI 0.11-0.38), or more than four days (OR 0.18, 95% CI 0.11-0.32) (Figure 3). Two studies administered multiple doses of MSCs, with one demonstrating benefit and the other having no statistically significant effect. The multiple dose study with no effect was also the only investigation of autologous cells (Chang et al., 2012). MSCs administered to mice were effective (OR 0.23, 95% CI 0.15-0.36); however, MSC administration to rats did not produce a statistically significant effect (OR 0.47, 95% CI 0.18-1.21). The choice of comparator control group (phosphate buffered saline vs. fibroblast vs. normal saline vs. medium) had no effect; however, the one study that did not administer vehicle to the control animals did not demonstrate a statistically significant effect of MSC therapy (Zhou et al., 2014) (Figure 2-figure supplement 10).
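For readers new to meta-analysis, the pooled odds ratio and I² reported above can be illustrated with a minimal sketch. The estimator is an assumption: the text reports a random effects model but does not name the estimator in this section, so the DerSimonian-Laird method is used here for illustration. The function names, the 0.5 continuity correction, and the example counts are hypothetical and are not data from the included studies.

```python
import math

def log_odds_ratio(deaths_t, n_t, deaths_c, n_c):
    """Return (log odds ratio, variance) for one treated-vs-control comparison.

    Assumption: a 0.5 continuity correction is added to all four cells
    whenever any cell of the 2x2 table is zero.
    """
    a, b = deaths_t, n_t - deaths_t   # treated: died, survived
    c, d = deaths_c, n_c - deaths_c   # control: died, survived
    if 0 in (a, b, c, d):
        a, b, c, d = a + 0.5, b + 0.5, c + 0.5, d + 0.5
    return math.log((a * d) / (b * c)), 1 / a + 1 / b + 1 / c + 1 / d

def random_effects(studies):
    """DerSimonian-Laird pooled OR, 95% CI, and I^2 (in percent)."""
    effects = [log_odds_ratio(*s) for s in studies]
    y = [e[0] for e in effects]
    v = [e[1] for e in effects]
    w = [1 / vi for vi in v]                          # fixed-effect weights
    fixed = sum(wi * yi for wi, yi in zip(w, y)) / sum(w)
    q = sum(wi * (yi - fixed) ** 2 for wi, yi in zip(w, y))  # Cochran's Q
    df = len(studies) - 1
    c = sum(w) - sum(wi ** 2 for wi in w) / sum(w)
    tau2 = max(0.0, (q - df) / c) if c > 0 else 0.0   # between-study variance
    w_re = [1 / (vi + tau2) for vi in v]              # random-effects weights
    pooled = sum(wi * yi for wi, yi in zip(w_re, y)) / sum(w_re)
    se = math.sqrt(1 / sum(w_re))
    i2 = max(0.0, (q - df) / q) * 100 if q > 0 else 0.0
    return {
        "or": math.exp(pooled),
        "ci": (math.exp(pooled - 1.96 * se), math.exp(pooled + 1.96 * se)),
        "i2": i2,
    }

# Hypothetical counts: (deaths_treated, n_treated, deaths_control, n_control)
example = [(3, 10, 7, 10), (4, 12, 9, 12), (2, 8, 5, 8)]
result = random_effects(example)
```

An odds ratio below 1 favours treatment, and I² expresses the share of variability across studies that is attributed to heterogeneity rather than chance. Dedicated meta-analysis software should be preferred for a real review.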

Assessment of threats to internal validity (methodological quality/risk of bias)
Practices such as blinding and randomization are known to affect the magnitude of effect in both clinical and preclinical studies. To determine if these threats to internal validity influenced our findings, we evaluated the risk of bias of included studies (Table 2). None of the experiments were considered low risk of bias across all six domains of methodological quality. Forty-five percent of experiments reported that the animals were randomized, but none described methods of sequence generation or how allocation concealment was achieved. Similarly, no studies described blinding of personnel performing the experiments. One study did not blind assessors for the outcome of mortality, which may be of concern given that surrogate endpoints (i.e. not true death due to animal welfare concerns) were assessed (Kim et al., 2014); the remaining studies were assessed as 'unclear' as insufficient details of outcome assessment were reported.
An assessment of high risk of bias for incomplete outcome data occurred in 10% of studies (examined as consistent n values reported from methods to results); in 65% of experiments the numbers (n) were not presented in both the methods and results in sufficient detail to permit judgment. No studies reported an appropriate rationale for selection of study sample size (where appropriate rationale included a correctly calculated sample size, Table 3). Given the paucity of studies that adequately implemented and reported internal validity practices, an analysis to determine the effects of high vs. low risk of bias on the effect size was not feasible.

Figure 3. Forest plot summarizing relationship of mesenchymal stromal cell (MSC) therapy on mortality over time in preclinical models of sepsis and endotoxemia (outcome windows: ≤2 days, >2 to 4 days, >4 days). Point estimates (odds ratio) and 95% confidence intervals (CI) are depicted for individual studies; size of point estimate depicts relative contribution to pooled effect. A pooled meta-analytic summary (random effects model) of overall effect of MSC therapy on mortality is depicted by the diamond at the bottom of each time interval (vertical points represent odds ratio point estimate and horizontal points represent 95% CIs). Heterogeneity is represented with the I² statistic. Data from Pedrazza et al. (2014) was included in total counts but not included in meta-analysis due to 100% mortality in both study arms. DOI: 10.7554/eLife.17850.017

Assessment of threats to construct validity
It has been suggested that failed preclinical to clinical translation may be related to a mismatch between experimental conditions and the clinical disease the model is intended to represent (i.e. construct validity) (Henderson et al., 2013; Kimmelman and Henderson, 2016). To evaluate clinical generalizability of the experimental conditions used, we performed a formal evaluation of construct validity using an eight item index that had been developed in a systematic review of preclinical sepsis (Table 4) (Lamontagne et al., 2010). None of the experiments used large animal models. Two experiments (10%) used animals with comorbidities (both used immunodeficient mice), 40% of experiments used adult animal models (40% did not report animal age), and 50% used infectious models of sepsis. 90% of studies initiated MSC therapy after the induction of the disease (as opposed to at the time of disease induction) but none documented severity of the disease state prior to initiating MSC therapy. Four studies used fluid resuscitation while two of these studies also administered antibiotics. Two studies incorporated a majority of construct validity elements (i.e. at least five of eight elements); there was no difference in effect size between these studies (OR 0.18, 95% CI 0.08-0.42) and those studies that incorporated fewer elements (OR 0.28, 95% CI 0.17-0.44) (Figure 2-figure supplement 11).

Table 2. Risk of bias assessment of preclinical studies investigating the efficacy of mesenchymal stromal cells in models of sepsis.
Blinding of Outcome Assessment for Mortality: Low risk = Outcome assessors were blinded to the study groups when assessing mortality through surrogate endpoints or animals were allowed to die. Unclear = Insufficient information to determine if outcome assessors were blinded during assessment or if animals were allowed to die. High risk = Outcome assessors not blinded to the study groups and death was defined according to surrogate endpoints.
Incomplete Outcome Data: Low risk = N values were consistent between methods and results for the mortality outcome. Unclear = N value was either not presented in the methods or in the results, and therefore there is insufficient information to permit judgement. High risk = N values were not consistent between methods and results for the mortality outcome.
Selective Reporting: Low risk = The methods section indicated mortality as a pre-specified outcome measure. High risk = The mortality outcome was presented in the results but not pre-specified in the methods section. DOI: 10.7554/eLife.17850.018

Evidence of publication bias
For the 20 experiments, 50% demonstrated statistically significant beneficial effects of MSCs with a median sample size of 19 animals per group. Visual inspection of a funnel plot analysis of all experiments suggested that publication bias exists (Figure 4), which was confirmed by Egger regression (p=0.019). Post-hoc trim and fill analysis suggested a relative overestimation of effect size of 27%, although MSCs remained associated with a statistically significant reduction in mortality after adjustment (OR 0.34, 95% CI 0.22-0.52).
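As an illustration of the regression test mentioned above, the following is a minimal sketch of the classic form of Egger regression, in which each study's standardized effect (log odds ratio divided by its standard error) is regressed on its precision (one over the standard error). An intercept far from zero suggests funnel plot asymmetry. This section does not specify which variant or software was used, so the formulation, the function name, and the example values are assumptions. A p-value would come from comparing the returned t statistic with a t distribution on the returned degrees of freedom.

```python
import math

def egger_intercept(log_ors, std_errors):
    """Ordinary least squares fit of (effect / SE) on (1 / SE).

    Returns (intercept, SE of intercept, t statistic, degrees of freedom).
    """
    n = len(log_ors)
    if n < 3:
        raise ValueError("Egger regression needs at least three studies")
    x = [1 / se for se in std_errors]                    # precision
    z = [y / se for y, se in zip(log_ors, std_errors)]   # standardized effect
    mx, mz = sum(x) / n, sum(z) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    slope = sum((xi - mx) * (zi - mz) for xi, zi in zip(x, z)) / sxx
    intercept = mz - slope * mx
    residuals = [zi - (intercept + slope * xi) for xi, zi in zip(x, z)]
    s2 = sum(r ** 2 for r in residuals) / (n - 2)        # residual variance
    se_int = math.sqrt(s2 * (1 / n + mx ** 2 / sxx))
    t_stat = intercept / se_int if se_int > 0 else float("nan")
    return intercept, se_int, t_stat, n - 2

# Hypothetical data: smaller studies (larger SE) show more extreme benefit.
intercept, se_int, t_stat, df = egger_intercept(
    [-2.0, -1.2, -1.1, -0.8], [0.8, 0.5, 0.4, 0.25]
)
```

In this hypothetical example the intercept is negative, which is the pattern expected when small studies report larger protective effects than large ones.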

Discussion
Preclinical studies provide necessary justification to conduct a first-in-human clinical trial. Thus, a systematic review approach offers an attractive method to comprehensively synthesize the totality of available evidence. Our systematic review demonstrates that MSC therapy reduces the odds of death in preclinical animal sepsis models. This effect is maintained over a range of time periods (less than two days, between two to four days, and longer than four days). These early outcome windows capture the majority of deaths that occur in these acute models. Moreover, the effect sizes are robustly maintained (replicated) over a variety of experimental conditions, varying models, and differing MSC immunologic compatibility (e.g. allogeneic vs. syngeneic). It has been suggested that individual study findings have low probability of being 'true' (Ioannidis, 2005); however, by aggregating results of similar experiments, the positive predictive value of a finding dramatically increases (Moonesinghe et al., 2007). Thus, the findings of this systematic review helped support our decision to initiate a Phase 1/2 trial to evaluate the safety of MSC therapy in human patients with septic shock (NCT02421484). We believe our approach of systematically reviewing preclinical evidence is widely applicable for researchers considering first-in-human studies. Although our synthesis suggests MSC treatment of sepsis may be beneficial, these results are tempered by the presence of potential threats to validity.
Our preclinical systematic review evaluated internal, external, and construct validity of the data. Methodological weaknesses (i.e. poor internal validity) in clinical trials are associated with an exaggeration of the treatment effect. Similarly, in preclinical studies, failure to address selection bias (through methods such as randomization and allocation concealment) and detection bias (through blinded outcome assessment) results in significantly increased effect sizes (Crossley et al., 2008; Hirst et al., 2014; Rooke et al., 2011). The significance of selection and detection bias has been acknowledged by The National Institutes of Health's recently issued guidelines for reporting preclinical research. These guidelines have specifically proposed randomization, blinding, and sample size calculations as key methodological information that must be described in preclinical reports (National Institutes of Health, 2015; Landis et al., 2012). In our review, none of the included studies reported randomization or allocation concealment in a manner that could be considered at low risk of bias. Similarly, no studies reported appropriate a priori defined sample sizes. Most of these items were judged as 'unclear' in our risk of bias evaluation due to the convention to judge unreported items as 'unclear' rather than 'high risk'. We speculate that many of these 'unclear' items were not performed (i.e. they were 'high risk') due to a general lack of training of basic scientists in methods to reduce risk of bias (Landis et al., 2012; Collins and Tabak, 2014). This lack of reporting precluded an evaluation of their efforts and points to the need to improve the methodology used in preclinical investigations.

Figure 4. Funnel plot to detect publication bias. Trim and fill analysis was performed on overall mortality. Open circles denote original data, black circles denote 'filled' studies. Open diamond denotes original pooled effect size (log odds ratio) and 95% confidence interval. Filled diamond represents adjusted effect size and 95% confidence interval. DOI: 10.7554/eLife.17850.021

To address external validity (i.e. generalizability) we performed a number of subgroup analyses. Overall, subgroup analyses suggested that MSC effects appeared to be robust over a number of varying experimental conditions and across a number of different laboratories. Results of specific subgroups (e.g. autologous cells, multiple doses, intraperitoneal administration, and adipose tissue source) should be interpreted cautiously as few studies were included in these groups, and the results of one study with differing results (Chang et al., 2012) may have skewed data. The ability of one study to heavily influence overall effect estimates is a short-coming of meta-analyses that include few studies. As such, these subgroup analyses should be treated as exploratory.
Despite the large effect sizes noted, one must bear in mind the potential effect of publication bias (i.e. bias due to the publication of only positive studies). Our funnel plot demonstrated a highly asymmetrical pattern and our trim and fill analysis indicated that a number of unpublished negative studies may exist. This is in keeping with previous analyses of preclinical stroke data that suggested up to one in six animal studies in that field were unreported and unpublished (Sena et al., 2010). Our inability to analyze these potential studies may have led to an overstatement of effect size.
To evaluate the potential clinical applicability of these results, we examined the construct validity of included studies. This was determined using recommendations that had been developed to improve the clinical generalizability of preclinical sepsis studies (Lamontagne et al., 2010). Animal sepsis models may not be representative of human sepsis because of the timing and severity of sepsis induction, the dose and timing of the treatment in relation to sepsis induction, the use of small/young animals without comorbid illnesses, and lack of administration of standard of care co-interventions such as fluids and antibiotics during the study period. How well animal models of sepsis mimic the pathophysiology of human sepsis has also been a contentious issue (Dyson and Singer, 2009; Osuchowski et al., 2014; Seok et al., 2013). Only two studies incorporated a majority of elements addressing construct validity; thus, the effect of construct validity on MSC therapy of sepsis remains to be determined.
There are a number of other issues of note that may impact the translation of MSC therapy to the clinical setting. First, although we did not formally evaluate characterization of cell products, this was variably reported in the included studies. Differences in the quality of cell therapeutics may have accounted for some of the heterogeneity of results observed. Second, dosing of cell products was not equivalent between species, even after adjusting for total cells given. Equivalence dosing of drugs between species is a complex issue and the FDA has endorsed conversion based on body surface area, rather than a dose per weight basis (Reagan-Shaw et al., 2008). Applying this guidance, 1 million cells in a mouse may be equivalent to 0.5 million in a rat; similarly, this dose in a human would be roughly equivalent to 3 million cells/kg. These equivalencies should be interpreted cautiously given the differences between typical drug therapies and the cellular therapy evaluated in our review. Third, the severity of disease in these animal models at the time of MSC administration is unclear. Based on our experience with endotoxemia and cecal-ligation and puncture models, at 1 hr after disease induction some symptoms may be apparent and after 6 hr most animals have both biochemical and physiological evidence of inflammation and organ-dysfunction. Thus, we performed a subgroup analysis based on timing of administration (<1 hr, >1-6 hr, >6 hr) as a rough correlate to early and more delayed (intermediate and late) administration of cells in an attempt to simulate the delays in treatment that may be seen in humans who present with severe infection. Of note, no study administered cells at a late time point. A clearer reporting of disease severity at time of cell administration may allow a more precise analysis of when these cells are more (or less) efficacious.
A fourth issue is the lack of transparent reporting of risk of bias elements, which limits the ability to evaluate threats to validity in our systematic review. We would suggest that a generally poor understanding of these core methodological issues may underlie their incomplete reporting. In order to increase the robustness and interpretation of future preclinical systematic review results, we submit that authors of primary studies and journal editors should ensure adherence to published reporting guidelines for preclinical research studies (National Institutes of Health, 2015; Kilkenny et al., 2010). These guidelines not only detail items relating to risk of bias (e.g. randomization and blinding) but also touch on issues that are very important when primary studies are included in systematic reviews (e.g. differentiating between biological and technical replicates, providing exact n numbers).
The strengths of our systematic review are in the transparent and thorough literature search and an attempt to examine potential for translation by evaluating threats to validity. To date, three clinical trials have been initiated following a systematic review and meta-analysis of animal data (van der Worp et al., 2007); all have repurposed currently used interventions for neurological conditions and are currently recruiting (NCT01833312, NCT01910259, ISRCTN83290762). To the best of our knowledge, ours is the first preclinical systematic review that has evaluated a novel biological therapeutic in preparation for a high risk first-in-human clinical trial.
The limitations of our review should be noted. First, we restricted our search to unmodified MSCs since our group was only considering a clinical trial of unmodified cells for sepsis. Although modified cells may be of clinical interest, there are a number of additional regulatory, ethical, and safety issues which significantly increase the complexity of clinical trials using these cells; this is an issue that members of our group have experienced first-hand (NCT00936819). Other limitations of our review relate to the potential methodological limitations of the included studies. None of the included studies were considered low risk of bias across all domains, and their construct validity was highly variable. It is unclear what the influence of these methodological limitations might be in this particular study due to our inability to perform meaningful subgroup analyses. Our evaluation of the methodological aspects of included studies also relied on what the authors reported, and this may have been incomplete in some cases. We would suggest, however, that similar to other fields, the failure to address threats to internal validity likely contributes to an exaggerated effect size.
Despite the stated limitations of this review, the consistency of the results across the included studies and the large effect size suggest that MSCs reduce the odds of death in preclinical models of sepsis. Moreover, there are a number of studies that have demonstrated biological mechanisms that may underlie the benefits of MSCs in sepsis, including antibacterial, anti-inflammatory, and trophic effects (Spees et al., 2016). These mechanisms do not require engraftment and have been demonstrated to work over thousands of molecular pathways that include improved cellular energetics and activation of macrophages (dos Santos et al., 2012). Given the results of our review along with this biological plausibility, our group gained the support of regulatory agencies, ethics boards, and other stakeholders to proceed to a first-in-human clinical trial. Nonetheless, our efforts to translate this therapy into a clinical trial were tempered by the limitations of the preclinical studies performed to date. Had this support not been provided, alternative methods to address efficacy of MSC therapy for sepsis could have included conducting a low risk of bias 'confirmatory' preclinical study that was informed by the results of this systematic review (Kleikers et al., 2015), or performing a multicenter preclinical study (Llovera et al., 2015). Ultimately, ongoing and future clinical evaluations will determine whether the therapeutic effects of MSCs will translate to the human patient population.

Materials and methods
The methods section has been written completely and transparently for researchers unfamiliar with systematic review and meta-analysis methodology. We would encourage readers wishing to replicate our approach for their own research agendas to refer to available resources (Higgins and Green, 2009) and/or our group for further guidance.

Review question and protocol
The research question for this review was, "In preclinical in vivo animal models of sepsis, what is the effect of MSC therapy (compared to control treatment) on death?" The protocol for this review was published on the Collaborative Approach to Meta Analysis and Review of Animal Data from Experimental Studies (CAMARADES) website (http://www.dcn.ed.ac.uk/camarades/research.html#protocols) and also the University of Ottawa's Open Access Research Institutional Repository (http://hdl.handle.net/10393/32833). A priori publication of our protocol encourages transparency in the systematic review process and safeguards against reporting biases in the review. This review is reported in accordance with the Preferred Reporting Items for Systematic Reviews and Meta-Analysis (PRISMA) Statement (Moher et al., 2009). The PRISMA guidelines are an evidence-based minimum set of items that should be reported in a systematic review and meta-analysis. Similar to other reporting guidelines, PRISMA ensures complete and transparent reporting of a study.

Inclusion and exclusion criteria
We included all pre-clinical in vivo studies of sepsis and endotoxemia that investigated treatment with mesenchymal stromal cells. MSCs must have been administered during or after experimental induction of sepsis. Since our group was considering a clinical trial of unmodified MSCs, studies were excluded if the MSCs were differentiated, altered, or engineered to over or under express particular genes. Neonatal animal models were excluded, as were models of acute lung injury. Finally, studies where MSCs were administered with another experimental therapy or cell type were excluded.

Literature search
To identify all relevant studies, we designed a search strategy in collaboration with a medical information specialist. We would suggest readers consult a medical librarian experienced in systematic searches if they wish to perform a literature search for a preclinical systematic review; this will ensure a comprehensive search is conducted. Although MSC terminology has been codified (Dominici et al., 2006), non-standard terms continue to be used in the literature; thus, a number of MSC related terms were used in the search strategy. Validated animal filters were applied to increase relevancy (Hooijmans et al., 2010); post-hoc, an inadvertent truncation was noted in the application of these filters, thus an updated search was performed to include the complete filters. We searched Ovid MEDLINE In-Process and Other Non-Indexed Citations, Embase Classic+Embase, BIOSIS and Web of Science (using Web of Knowledge) from inception until May 2015. The full search strategy is listed in the Appendix. Additional references were also sought through hand-searching the bibliographies of reviews and included primary studies.

Screening
Studies were independently screened by two reviewers, with consensus required for articles to proceed to either the next screening stage or to the final analysis. Disagreements were resolved by discussion or by consultation with a senior team member when necessary.

Data extraction
Data were extracted on the general characteristics of the study (e.g. study design, country of origin, sample size), animal model (e.g. disease induction method, use of resuscitation), and mesenchymal stromal cells (e.g. condition and source of cells). Data were collected for the primary outcome of overall mortality. Mortality was further stratified by time: ≤2 days, >2-4 days, and >4 days. If multiple measurements were reported within a period, the latest measurement within the period was used. Data in graphical format were extracted using open source software (Engauge Digitizer, github.com; http://markummitchell.github.io/engauge-digitizer/). Extracted data were verified by a second reviewer, with disagreements resolved by consultation with a third team member. Additionally, authors were contacted when further clarification was required.
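The rule of keeping the latest measurement within each time period can be sketched in a few lines of Python. This is a minimal illustration, not the procedure actually used for extraction: the function name `latest_per_window`, the window labels, and the example values are all hypothetical, and the first window is assumed to run up to and including 2 days.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical window definitions in days: (exclusive lower bound, inclusive upper bound).
# The first window is assumed to cover times up to and including 2 days.
WINDOWS: Dict[str, Tuple[float, float]] = {
    "<=2 days": (0.0, 2.0),
    ">2-4 days": (2.0, 4.0),
    ">4 days": (4.0, float("inf")),
}


def latest_per_window(
    measurements: List[Tuple[float, float]],
) -> Dict[str, Optional[Tuple[float, float]]]:
    """Return the latest (day, mortality) pair reported in each time window.

    `measurements` is a list of (day, mortality) pairs from one experiment.
    Windows with no reported measurement map to None.
    """
    result: Dict[str, Optional[Tuple[float, float]]] = {label: None for label in WINDOWS}
    # Sorting by day means a later measurement overwrites an earlier one in the same window.
    for day, mortality in sorted(measurements):
        for label, (low, high) in WINDOWS.items():
            if low < day <= high:
                result[label] = (day, mortality)
    return result


# Made-up survival-curve readings for a single experiment (day, cumulative mortality).
example = [(1, 0.10), (2, 0.20), (3, 0.35), (7, 0.50), (5, 0.45)]
print(latest_per_window(example))
```

Baseline readings at day 0 fall outside every window in this sketch and are ignored.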

Subgroup analyses/generalizability - assessment of threats to external validity
A priori determined subgroup analyses were conducted to determine the effects of important factors on the estimated treatment effect. These analyses were performed to assess generalizability of results over varying experimental conditions. Subgroups were analysed for the following: animal model (e.g. mouse, rat), gender, experimental model (e.g. cecal ligation and puncture, endotoxemia), source of MSC (e.g. autologous, xenogenic), route of MSC administration (e.g. intravenous, intraperitoneal), dose of MSC (less or greater than 1.0 × 10⁶ cells), frequency of MSC dose, timing of MSC administration (less than 1 hr, greater than 1 hr to less than or equal to 6 hr, greater than 6 hr, or multiple dosing), resuscitation used (e.g. fluid, antibiotics), and control group (phosphate buffered saline, fibroblasts, normal saline, medium, nothing administered). Given the number of analyses performed, the results were considered exploratory and hypothesis generating. Readers employing a similar analysis may consider adjusting the significance threshold for the number of comparisons (e.g. dividing 0.05 by 11 analyses, so that p<0.0045 would be considered significant).
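The adjusted threshold in the example corresponds to dividing a conventional significance level of 0.05 by the number of comparisons (a Bonferroni-style correction). A minimal Python sketch of that arithmetic follows; the function name `bonferroni_threshold` is hypothetical and the 0.05 starting level is an assumption, since the text states only the resulting value.

```python
def bonferroni_threshold(alpha: float, n_comparisons: int) -> float:
    """Per-comparison significance threshold under a Bonferroni correction."""
    if n_comparisons < 1:
        raise ValueError("n_comparisons must be at least 1")
    return alpha / n_comparisons


# Assumed overall alpha of 0.05 spread across 11 subgroup analyses.
threshold = bonferroni_threshold(0.05, 11)
print(f"Adjusted threshold: {threshold:.4f}")
```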

Risk of bias - assessment of threats to internal validity
Risk of bias was assessed independently in duplicate as high, low, or unclear for the six domains of bias identified by the Cochrane Risk of Bias tool (Higgins and Green, 2009). Domains include: (1) sequence generation, (2) allocation concealment, (3) blinding of participants and personnel, (4) blinding of outcome assessors, (5) incomplete outcome data, and (6) selective outcome reporting; operational definitions can be found in the legend for Table 2. Any disagreements were resolved through consultation with a senior member of the team. Other domains of risk of bias assessed were (1) source of funding, (2) conflict of interest, and (3) sample size calculations. Following reviewers' suggestions, we also included the SYRCLE Risk of Bias Tool, an alternative method of assessing risk of bias in preclinical animal studies. This tool is largely based on the Cochrane Risk of Bias Tool and includes several additional domains: (1) similarity of groups or adjustment for confounders at baseline, (2) random housing of animals, and (3) animal selection at random for outcome assessment. The last domain was not evaluated because the outcome being assessed was death, and it was unclear for most studies whether true death or surrogate measures were being evaluated.

Assessment of threats to construct validity
In preclinical studies, construct validity refers to the extent to which an animal model corresponds to the clinical entity it is intended to represent (Henderson et al., 2013). We used a previously published framework to evaluate construct validity of the included studies (Lamontagne et al., 2010). Items evaluated in each study included: (1) use of a large animal model (e.g. pig, dog, sheep), (2) use of adult animals, (3) presence of co-morbid diseases, (4) use of an infectious model of sepsis, (5) documentation of severity of illness prior to initiating therapy, (6) follow-up duration ≥24 hr, (7) use of antibiotics, and (8) use of intravenous fluid resuscitation. Each item was assessed independently by two reviewers as either a 'yes' or a 'no'. Disagreements were resolved by consultation with a third team member.

Statistical analysis
Statistical analysis was performed in consultation with a statistician experienced in systematic reviews and meta-analysis. Readers seeking to replicate these methods for their own purposes are encouraged to similarly seek advice from an experienced statistician. Data from studies were pooled by meta-analysis using the DerSimonian and Laird random effects method (Comprehensive Meta-Analysis 2.0, Englewood, USA). Outcomes are expressed as odds ratios and 95% confidence intervals. Studies with more than one extracted experiment had completely independent control groups (i.e. a control group was not shared between two experimental groups). Thus, no correction for the number of control animals was required for multiple comparisons within a single meta-analysis. Heterogeneity of effect sizes in the overall effect estimates was assessed using the I² statistic. The following are suggested thresholds to interpret the I² statistic: 0-40% may not be important, 30-60% moderate heterogeneity, 50-90% substantial heterogeneity, 75-100% considerable heterogeneity (Higgins and Green, 2009).
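For readers unfamiliar with the pooling step, the following Python sketch shows one common formulation of the DerSimonian and Laird estimator applied to log odds ratios, together with the I² statistic. It is an illustration only and not the Comprehensive Meta-Analysis implementation used in this review. The function names, the 0.5 continuity correction for zero cells, and the mortality counts in the example are assumptions introduced here.

```python
import math
from typing import List, Tuple


def log_odds_ratio(events_t: int, n_t: int, events_c: int, n_c: int) -> Tuple[float, float]:
    """Log odds ratio and its variance from deaths/total in treated and control groups."""
    a, b = float(events_t), float(n_t - events_t)
    c, d = float(events_c), float(n_c - events_c)
    if 0.0 in (a, b, c, d):
        # Assumed convention: add 0.5 to every cell when any cell is zero.
        a, b, c, d = a + 0.5, b + 0.5, c + 0.5, d + 0.5
    return math.log((a * d) / (b * c)), 1 / a + 1 / b + 1 / c + 1 / d


def dersimonian_laird(effects: List[float], variances: List[float]) -> Tuple[float, float, float, float]:
    """Return (pooled effect, standard error, tau^2, I^2 in percent)."""
    k = len(effects)
    if k < 2 or k != len(variances):
        raise ValueError("need at least two studies with matching variances")
    w = [1.0 / v for v in variances]                      # inverse-variance weights
    sum_w = sum(w)
    fixed = sum(wi * yi for wi, yi in zip(w, effects)) / sum_w
    q = sum(wi * (yi - fixed) ** 2 for wi, yi in zip(w, effects))  # Cochran's Q
    df = k - 1
    c = sum_w - sum(wi ** 2 for wi in w) / sum_w
    tau2 = max(0.0, (q - df) / c)                         # between-study variance
    i2 = max(0.0, (q - df) / q) * 100.0 if q > 0 else 0.0
    w_star = [1.0 / (v + tau2) for v in variances]        # random-effects weights
    pooled = sum(wi * yi for wi, yi in zip(w_star, effects)) / sum(w_star)
    return pooled, math.sqrt(1.0 / sum(w_star)), tau2, i2


# Made-up counts: (deaths treated, n treated, deaths control, n control).
studies = [(4, 20, 12, 20), (6, 15, 10, 15), (3, 10, 7, 10)]
ys, vs = zip(*(log_odds_ratio(*s) for s in studies))
pooled, se, tau2, i2 = dersimonian_laird(list(ys), list(vs))
low, high = math.exp(pooled - 1.96 * se), math.exp(pooled + 1.96 * se)
print(f"OR {math.exp(pooled):.2f} (95% CI {low:.2f}-{high:.2f}), I2 = {i2:.0f}%")
```

Pooling is done on the log scale and the result exponentiated, so that the confidence interval is symmetric on the scale where the normal approximation is applied.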
Presence of publication bias was assessed using a funnel plot (visually) and the Egger regression test (statistically). The funnel plot is a scatterplot of the intervention effect of individual studies plotted against a measure of each study's precision or size. The characteristic 'inverted funnel' shape arises from the fact that precision of the effect estimate increases as the study size increases (i.e. small studies will scatter more widely at the bottom of the funnel). A funnel plot would normally be expected to be symmetrical; the absence of symmetry can suggest publication bias (Sterne et al., 2011). Duval and Tweedie's trim and fill estimates were generated to estimate the number of missing studies and the adjusted effect size assuming the missing studies were present.
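As an illustration of the statistical test for funnel plot asymmetry, the sketch below implements the classic form of Egger's regression: each study's standardized effect (effect divided by its standard error) is regressed on its precision (one over the standard error), and an intercept far from zero suggests asymmetry. This is a minimal sketch under stated assumptions, not the software routine used in this review; the function name `egger_regression` and the input values are hypothetical.

```python
import math
from typing import List, Tuple


def egger_regression(effects: List[float], std_errors: List[float]) -> Tuple[float, float, float, float]:
    """Return (intercept, slope, intercept standard error, t statistic).

    Ordinary least squares of effect/SE on 1/SE. The t statistic has
    k - 2 degrees of freedom, where k is the number of studies.
    """
    k = len(effects)
    if k < 3 or k != len(std_errors):
        raise ValueError("need at least three studies with matching standard errors")
    x = [1.0 / se for se in std_errors]                    # precision
    z = [y / se for y, se in zip(effects, std_errors)]     # standardized effect
    x_bar, z_bar = sum(x) / k, sum(z) / k
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    if s_xx == 0:
        raise ValueError("all studies have the same precision")
    s_xz = sum((xi - x_bar) * (zi - z_bar) for xi, zi in zip(x, z))
    slope = s_xz / s_xx
    intercept = z_bar - slope * x_bar
    sse = sum((zi - intercept - slope * xi) ** 2 for xi, zi in zip(x, z))
    resid_var = sse / (k - 2)
    se_intercept = math.sqrt(resid_var * (1.0 / k + x_bar ** 2 / s_xx))
    t_stat = intercept / se_intercept if se_intercept > 0 else math.nan
    return intercept, slope, se_intercept, t_stat


# Made-up log odds ratios and standard errors for five experiments.
log_ors = [-1.6, -1.2, -0.9, -1.8, -0.7]
ses = [0.80, 0.55, 0.40, 0.95, 0.35]
intercept, slope, se_int, t_stat = egger_regression(log_ors, ses)
print(f"Egger intercept = {intercept:.2f} (SE {se_int:.2f}), t = {t_stat:.2f}, df = {len(ses) - 2}")
```

A two-sided p-value would be obtained by comparing the t statistic with a t distribution on k - 2 degrees of freedom, which requires a statistics library and is omitted here.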

We acknowledge Risa Shorr, Information Specialist from the Ottawa Hospital Research Institute (OHRI), for assistance in designing the systematic search strategy and Ranjeeta Mallick, statistician at the OHRI, for consultation and conduct of initial statistical analysis. We also thank Dr. Tania Bubela from the University of Alberta, and Drs. AJ Frenette and Jennifer Tsang from the Canadian Critical Care Translational Biology Group for review of the manuscript. The funders had no role in study design, data collection and interpretation, or the decision to submit the work for publication.