Use of propensity score methods to address adverse events associated with the storage time of blood in an obstetric population: a comparison of methods

A recent topic of interest in the blood transfusion literature is the existence of adverse effects of transfusing red cells towards the end of their storage life. This interest has been sparked by conflicting results in observational studies, however a number of methodological difficulties with these studies have been noted. One potential strategy to address these difficulties is the use of propensity scores, of which there are a number of possible methods. This study aims to compare the traditional methods for binary exposures with more recently developed generalised propensity score methods. Data were obtained from probabilistically linked hospital, births and blood bank databases for all women giving birth from 23 weeks gestation in New South Wales, Australia, between July 2006 and December 2010 with complete information on the birth admission and blood issued. Analysis was restricted to women who received 1–4 units of red cells. Three different propensity score methods (for binary, ordinal and continuous exposures) were compared, using each of four different approaches to estimating the effect (matching, stratifying, weighting and adjusting by the propensity score). Each method was used to determine the effect of blood storage time on rates of severe morbidity and readmission or transfer. Data were available for 2990 deliveries to women receiving 1–4 units of red cells. The rate of severe maternal morbidity was 3.7 %, and of readmission or transfer was 14.4 %. There was no association between blood storage time and rates of severe morbidity or readmission irrespective of the approach used. There was no single optimal propensity score method; the approaches differed in their ease of implementation and interpretation. Within an obstetric population, there was no evidence of an increase in adverse events following transfusion of older blood. 
Propensity score methods provide a useful tool for addressing the question of adverse events with increasing storage time of blood, as these methods avoid many of the pitfalls of previous studies. In particular, generalised propensity scores can be used in situations where the exposure is not binary.

blood can be difficult in an observational setting. The age of the blood that a patient receives varies due to blood bank inventory management processes, blood type, number of transfusions, and time of year [2][3][4]. These factors may also affect outcomes, and so need to be considered in any analysis. In addition, observational studies of the effect of age of blood on outcomes are prone to a number of confounding factors, including the need to untangle any adverse outcomes due to receiving older blood, from adverse outcomes which result from the condition requiring the transfusion (confounding by indication) [5]. Patients receiving greater numbers of transfusions are also more likely to receive older blood [5]. In addition, there are difficulties in defining the age of blood received, where more than one pack is transfused [6]. Many studies to date have focused on a binary exposure of 'fresh' or 'old' blood, where there is also the important consideration of what cutpoint to use. Although there are changes that occur in blood as it is stored (termed the 'storage lesion'), [7] there is no biologically intuitive cutpoint where the build-up of changes would be expected to have an effect, and so cutpoints are somewhat arbitrary, and may not provide sufficient distinction between patients receiving fresh and old blood.
The use of propensity scores has the potential to reduce the effect of confounders on associations between outcomes and age of blood. Propensity score methods have been developed to enable conclusions about causality to be drawn from observational data [8]. These methods involve developing a model of the probability that a patient received the particular treatment/exposure they received, based on their observed covariates. Under several assumptions [9], patients with similar propensity scores can be considered to have the same likelihood of exposure, and so the average difference in outcome for patients with the same propensity score, but different exposure, can be interpreted as being due to the exposure. The most important assumption for this approach is that treatment/exposure assignment is independent of the outcome given the observed covariates. Propensity scores are most commonly applied to binary exposures, which are less applicable for considering age of blood. Methods for ordinal and continuous exposures have been proposed [9,10], but applications are less common and have predominantly been in disciplines other than medicine [11][12][13][14]. Applications vary not only in the methods of construction of the propensity score or scores, but also in the approaches to using the score for matching, stratification, weighting or in regression adjustment [8,15,16].
Researchers typically perform a single propensity score analysis for a given study, meaning that the performance of the different applications in a real-life situation cannot be compared, although Brookhart et al. [17] perform such a comparison considering only the different approaches to estimating the effect using a binary propensity score. Understanding the differences between the varying propensity score applications in an applied context gives researchers the opportunity to select the application best suited to their research question. Our paper explores the application of three different methods of constructing the propensity score (binary, ordinal and continuous exposures) combined with four approaches to estimating the effect (matching, stratification, weighting and regression adjustment) to the problem of adverse outcomes after transfusion of older blood in a maternity population, focusing on differences in results and implementation.

Methods
The study population was all women giving birth from 23 weeks gestation in New South Wales, Australia, between July 2006 and December 2010, with complete information on the birth admission and blood issued. Only women receiving from 1 to 4 transfusions were selected to create a group of relatively homogenous risk (by excluding women with massive haemorrhage). The data for this study come from five sources: the New South Wales (NSW) Perinatal Data Collection ('birth data'); the Admitted Patients Data Collection ('hospital data'), Clinical Excellence Commission Red Cell Utilisation Database ('Red Cell data') and the Australian Red Cross Blood Service ('Red Cross data'), and the NSW Registry of Births, Deaths and Marriages death registrations ('deaths data'). These datasets have been described elsewhere [18]. The birth data contains pregnancy, labour and delivery data for women giving birth in NSW, and the hospital data contains data on diagnoses and medical procedures (including transfusion) for all hospital admissions. The Red Cell and Red Cross data together contain information on all blood packs issued from hospital pathology laboratories, including collection date and issue date, from which age of blood at transfusion can be derived. Fact of death was established from the deaths data.
The outcomes considered were readmission to the same or another hospital within 6 weeks of birth, and severe maternal morbidity. Transfers from the delivery hospital were considered a readmission. Severe maternal morbidity included a diagnosis of one or more of sepsis, thromboembolic events, organ dysfunction, shock, cardiac arrest, cerebral oedema, coma, cerebrovascular accident, assisted ventilation, or dialysis within 6 weeks of delivery, or death (within 12 months). Potential confounders considered were parity, plurality, antepartum or postpartum haemorrhage, gestational diabetes, pregnancy hypertension, maternal age, bleeding or platelet disorders, number of transfusions, month and year of admission, blood type, hospital, hospital level, and leucodepletion.
The age of blood was defined as the age (time between collection and issue of blood pack from the blood bank) of the oldest blood a patient received within the delivery admission. Three methods of constructing the propensity scores were considered: using a binary exposure (splitting age of blood at the median of the maximum age of blood transfused), using quartiles of the maximum age of blood transfused (ordinal exposure), and using the maximum age of blood transfused as a continuous exposure. In each case, the propensity score model was developed using binary logistic, ordinal logistic or linear regression models as appropriate, considering both supply and maternal factors as possible confounders. Models were built using an iterative approach, whereby a model was proposed, balance across covariates assessed, and then the model refined to promote balance. Interactions were included where they improved balance on the propensity score. Balance was assessed by dividing the population into quintiles based on the propensity score, and comparing the proportions of women receiving older vs fresher blood (binary case), or proportions within quintiles of actual age of blood received (ordinal and continuous cases). The application of the four approaches to incorporating the propensity scores differed for each method, and are explained in more detail below. For the purpose of comparison, results are presented as the rate of adverse outcomes with 95 % confidence intervals in each case.
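The balance check described above can be illustrated for the binary case with a minimal sketch. It assumes the propensity scores have already been estimated, and it compares the mean of a single covariate between exposure groups within each propensity score quintile. The function names and the use of a simple mean difference are illustrative assumptions, not the procedure used in the study.

```python
from statistics import mean

def quintile_index(scores):
    """Assign each score to a quintile 0..4 by rank (ties broken by position)."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    q = [0] * len(scores)
    for rank, i in enumerate(order):
        q[i] = min(4, rank * 5 // len(scores))
    return q

def balance_by_quintile(scores, exposed, covariate):
    """Within each quintile, mean covariate in exposed minus unexposed.

    Returns None for a quintile in which either group is empty.
    """
    q = quintile_index(scores)
    diffs = []
    for k in range(5):
        e = [c for c, x, qi in zip(covariate, exposed, q) if qi == k and x == 1]
        u = [c for c, x, qi in zip(covariate, exposed, q) if qi == k and x == 0]
        diffs.append(mean(e) - mean(u) if e and u else None)
    return diffs
```

Differences close to zero in every quintile would suggest that the covariate is balanced conditional on the score; large differences would prompt a refinement of the propensity score model.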

Ethical approvals
This study was approved by the NSW Population and Health Services Research Ethics Committee.

Method 1: binary propensity score
An arbitrary cutpoint of 22 days (the median age of the oldest blood transfused) was used to divide patients into those who had received any blood older than 22 days and those who had not. The mean or median is a commonly chosen cutpoint in age of blood studies [4,19,20], to increase the power of the analysis [5]. Logistic regression was used to construct the propensity score. To avoid extrapolating findings beyond the range supported by the data, the overlap of cases (the "common support") was assessed by plotting the distribution of propensity scores by older/fresher blood. Cases outside the common support were excluded to remove patients for whom the overall treatment effect may be unreliably estimated. This was not needed when matching on propensity scores within a caliper, as the matching process selects only similar cases. A summary of methods for binary propensity scores can be found in Williamson et al. [21], with relevant details outlined below.
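One simple way to operationalise the common support is sketched below. The paper describes a graphical assessment and does not state a numerical rule, so the min/max overlap rule used here is an assumption, as are the function names.

```python
def common_support(scores, exposed):
    """Return (low, high): the overlap of the propensity score ranges of the two groups."""
    older = [s for s, x in zip(scores, exposed) if x == 1]
    fresher = [s for s, x in zip(scores, exposed) if x == 0]
    return max(min(older), min(fresher)), min(max(older), max(fresher))

def inside_support(scores, exposed):
    """Flag each patient whose propensity score lies within the common support."""
    low, high = common_support(scores, exposed)
    return [low <= s <= high for s in scores]
```

Patients flagged False would be excluded before stratification, weighting or regression adjustment.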

Matching
Greedy one to one matching without replacement was used to match those receiving older blood to those receiving blood ≤22 days having a similar propensity score. Matches were restricted such that a woman receiving older blood could only be matched to a woman receiving fresher blood whose propensity score was within ±0.05 (the caliper). The rate of adverse outcomes in each group was compared.
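A minimal sketch of greedy one-to-one matching without replacement within a caliper follows. The order in which women receiving older blood are processed (here, input order) and the tie-breaking rule are assumptions; the paper does not specify them.

```python
def greedy_match(ps_older, ps_fresher, caliper=0.05):
    """Greedy 1:1 matching without replacement on the propensity score.

    Returns a list of (older_index, fresher_index) pairs. A woman receiving
    older blood is left unmatched if no remaining candidate lies within the caliper.
    """
    available = set(range(len(ps_fresher)))
    pairs = []
    for i, p in enumerate(ps_older):
        if not available:
            break
        # Nearest remaining candidate; ties are broken by the lower index.
        j = min(available, key=lambda k: (abs(ps_fresher[k] - p), k))
        if abs(ps_fresher[j] - p) <= caliper:
            pairs.append((i, j))
            available.remove(j)
    return pairs
```

Because matching is greedy, the result can depend on processing order; randomising the order before matching is a common precaution.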

Stratification
The sample was divided into strata defined by quintiles of the propensity score. The rate of each adverse outcome was calculated within each stratum, and the stratum-specific estimates were combined across strata, weighted by stratum size.
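The stratified estimate can be sketched as follows, assuming binary exposure and outcome indicators. Skipping strata that contain no patients in the group of interest is an assumption made here for robustness, not a rule stated in the paper.

```python
def stratified_rate(scores, exposed, outcome, group, n_strata=5):
    """Outcome rate in `group` (1 = older, 0 = fresher blood), averaged over
    propensity score strata and weighted by total stratum size."""
    n = len(scores)
    order = sorted(range(n), key=lambda i: scores[i])
    total, used = 0.0, 0
    for k in range(n_strata):
        members = order[k * n // n_strata:(k + 1) * n // n_strata]
        ys = [outcome[i] for i in members if exposed[i] == group]
        if ys:  # skip strata with no patients in this group (assumption)
            total += len(members) * sum(ys) / len(ys)
            used += len(members)
    return total / used
```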

Weighting
Inverse probability of treatment weights were calculated as the inverse of the propensity score for those receiving older blood, and the inverse of 1 minus the propensity score for those receiving fresher blood. These weights were multiplied by the marginal probability of receiving/not receiving older blood for those receiving and not receiving older blood. This stabilization results in the weighted sample size being the same as the original sample size, and reduces the variance of the estimates.
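The stabilised weights described above can be written out in a short sketch, assuming the propensity score is the estimated probability of receiving older blood. The function names are illustrative.

```python
def stabilized_weights(scores, exposed):
    """Stabilised inverse probability of treatment weights for a binary exposure."""
    p_older = sum(exposed) / len(exposed)  # marginal probability of older blood
    weights = []
    for ps, x in zip(scores, exposed):
        if x == 1:
            weights.append(p_older / ps)              # received older blood
        else:
            weights.append((1 - p_older) / (1 - ps))  # received fresher blood
    return weights

def weighted_rate(weights, exposed, outcome, group):
    """Weighted adverse outcome rate within one exposure group."""
    num = sum(w * y for w, x, y in zip(weights, exposed, outcome) if x == group)
    den = sum(w for w, x in zip(weights, exposed) if x == group)
    return num / den
```

When every propensity score equals the marginal probability of exposure, all weights equal 1, which illustrates why the stabilised weighted sample stays close to the original sample size.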

Regression adjustment
Logistic regression was used to calculate the odds of adverse outcomes for those receiving blood >22 days, including propensity score in the model. This was used to calculate the predicted adverse outcome rate for those receiving older and fresh blood.

Method 2: generalized propensity score-ordinal exposure
Women were divided into quartiles (≤15, 16-22, 23-30, 31 days or greater) based on the age of the oldest blood received, with ordinal logistic regression used to model the probability of receiving blood belonging to each age quartile. It has been suggested that where an ordinal logistic regression is appropriate for the data, a single score can be developed for each patient [22,23]. The linear predictor part of the model is taken as a balancing score, which balances covariates across quartiles [10]. This method results in a single balancing score, and four propensity scores (the probability of belonging to each quartile).
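The relationship between the single balancing score and the four propensity scores can be illustrated with a short sketch. It assumes a proportional odds model in which the cumulative probability of being in quartile k or lower is logistic(cutpoint_k minus linear predictor); the sign convention and the cutpoint values in the usage note are illustrative assumptions, not fitted values from the study.

```python
import math

def quartile_probabilities(linear_predictor, cutpoints):
    """Probability of each exposure category under a proportional odds model.

    `cutpoints` must be increasing; three cutpoints give four probabilities.
    """
    cumulative = [1 / (1 + math.exp(-(c - linear_predictor))) for c in cutpoints]
    cumulative.append(1.0)
    probs = [cumulative[0]]
    for k in range(1, len(cumulative)):
        probs.append(cumulative[k] - cumulative[k - 1])
    return probs
```

For example, `quartile_probabilities(0.0, [-1.0, 0.0, 1.0])` returns four probabilities that sum to 1; a larger linear predictor shifts probability towards the older quartiles, which is why the linear predictor alone can serve as a scalar balancing score.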

Matching
Following the work of Lu et al. [22,24] and using matching algorithms available in R (nbpMatching) [24], we created matched pairs of subjects who had similar balancing scores but had received blood from different age quartiles. In matching, preference is given to pairs with the greater difference in quartile (i.e. a patient receiving blood in the first quartile would match to a patient in the third or fourth quartile in preference to one in the second). Within each pair, the patient belonging to the higher quartile was considered to have received 'older' blood. The rates of adverse outcomes were compared for those receiving older vs fresher blood.
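The preference for pairs with similar balancing scores but widely separated quartiles can be expressed through the pairwise distance supplied to a nonbipartite matching algorithm. The sketch below shows one form of such a distance; the exact distance used in the study is not stated, so the formula, the `eps` constant and the function name are assumptions.

```python
import math

def pair_distance(score_i, score_j, quartile_i, quartile_j, eps=1e-6):
    """Distance between two subjects for matching with an ordinal exposure.

    Small when balancing scores are close and quartiles are far apart;
    infinite when both subjects are in the same quartile (pair not allowed).
    """
    if quartile_i == quartile_j:
        return math.inf
    return ((score_i - score_j) ** 2 + eps) / (quartile_i - quartile_j) ** 2
```

A matrix of these distances would then be passed to an optimal nonbipartite matching routine, which selects the set of pairs minimising the total distance.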

Stratification
Strata were created by dividing patients into quintiles based on their balancing score. Logistic regression, stratified by balancing score strata, was used to obtain estimated probabilities of adverse outcomes for each quartile and stratum. Stratum-specific estimates were combined to assess the effect of age of blood quartiles on adverse outcomes.

Weighting
Inverse probability of treatment weights were defined as the inverse of the propensity score for the quartile of age of blood received, divided by the marginal probability of that quartile [25]. These weights were applied to estimate the adverse outcome rates.

Regression
A logistic regression model including the propensity score and quartile of age of blood, with polynomial terms up to degree 4, was used to examine the relationship between age of blood and adverse outcomes. The model was developed using the actual propensity score and quartile of age of blood received, and then the probability of adverse outcome at each age of blood quartile for each patient calculated using this model and the estimated propensity scores for unobserved quartiles, giving the expected proportion of adverse outcomes for each quartile. Confidence intervals were calculated using 1000 bootstrap samples [9,12].
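The bootstrap step can be sketched with a percentile interval. In the study the whole estimation procedure would be repeated on each resample of patients; for brevity this sketch resamples a list of values and applies a user-supplied statistic, and the percentile method itself is an assumption, since the paper does not state which bootstrap interval was used.

```python
import random

def bootstrap_ci(values, statistic, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for `statistic(values)`."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(values)
    estimates = sorted(
        statistic([values[rng.randrange(n)] for _ in range(n)])
        for _ in range(n_boot)
    )
    lower = estimates[round((alpha / 2) * n_boot)]
    upper = estimates[round((1 - alpha / 2) * n_boot) - 1]
    return lower, upper
```

For a vector of binary outcomes, `bootstrap_ci(outcomes, lambda v: sum(v) / len(v))` gives an approximate 95 % interval for the adverse outcome rate.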

Method 3: generalized propensity score-continuous exposure
A linear regression model was built to predict age of blood received (as a continuous variable), considering supply and maternal factors. The analysis followed the method outlined above for quartiles, using each day of age of blood (days 1-42), instead of quartiles, and using the predicted age of blood as a balancing score. The assumption of constant variance on the multiple linear regression used to construct the propensity score was necessary to create a scalar balancing score and appeared reasonable. The regression model was built considering age of blood as a continuous variable [12]. Rates of adverse events were calculated for each approach, summarized by decile of age of blood.
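For a continuous exposure, the generalised propensity score is commonly taken as the conditional density of the observed exposure given covariates. The sketch below assumes normally distributed residuals with constant variance, in line with the constant variance assumption noted above, and takes the predicted ages from the linear model as given. The function names and the residual standard deviation estimator are illustrative assumptions.

```python
import math
from statistics import NormalDist

def residual_sd(observed_age, predicted_age):
    """Root mean squared residual of the exposure model (constant variance assumed)."""
    squared = [(o - p) ** 2 for o, p in zip(observed_age, predicted_age)]
    return math.sqrt(sum(squared) / len(squared))

def generalized_propensity_scores(observed_age, predicted_age):
    """Normal density of each patient's observed age of blood around its prediction."""
    sigma = residual_sd(observed_age, predicted_age)
    return [NormalDist(p, sigma).pdf(o) for o, p in zip(observed_age, predicted_age)]

def gps_at_day(day, predicted_age, sigma):
    """Score for a hypothetical age of blood `day`, evaluated for every patient."""
    return [NormalDist(p, sigma).pdf(day) for p in predicted_age]
```

The second function mirrors the regression approach described for quartiles, where the score is evaluated at exposure levels a patient did not actually receive.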

Results
Data were available for 2990 deliveries to women receiving 1-4 bags of blood. The median age of the oldest pack of blood transfused to each woman was 22 days. The rate of severe morbidity was 3.7 % (N = 111) and of readmission/transfer was 14.4 % (N = 430).

Method 1: binary propensity score
Of the 1424 women receiving older blood, 1018 (71 %) were matched to a woman receiving ≤22 day old blood. After matching the rate of severe adverse outcome was 4.2 % (95 % CI 3.1, 5.7) for those receiving fresher blood, and 3.1 % (2.1, 4.3) for those receiving older blood, with an average difference in age of blood of 14.5 days. Removing subjects outside the common support, there were 1412 patients receiving older blood (>22 days) and 1535 receiving fresher blood. The rates of severe morbidity after stratification, weighting and regression adjustment ranged from 3.8 to 3.9 % for fresher blood and 3.0-3.4 % for older blood (Table 1), and for readmission/transfer were from 14.1 to 14.9 % for fresher blood and 13.6-14.7 % for older blood. When considering severe morbidity, each method showed lower rates in the groups receiving older blood, although differences were small and not statistically significant. When considering readmission and transfer this pattern was reflected across matching, stratification and weighting approaches, but not regression adjustment. Regression adjustment was associated with the narrowest confidence intervals, and very similar estimates were obtained for regression and stratification.

Method 2: generalised propensity score-ordinal exposure
Women were divided into four groups (≤15, 16-22, 23-30, 31 days or greater) based on the age of the oldest blood received. There were 1472 matched pairs created (N = 2944, 98.5 %). The average difference in age of blood received between those receiving older blood and those receiving fresher blood was 12.1 days. After excluding patients outside the common support there were 2860 remaining for analysis. The rates of severe morbidity ranged from 3.5 to 4.9 % for fresher blood, and 3.1-3.8 % for older blood (Table 2), and for readmission/transfer were from 13.4 to 14.9 % for fresher blood, and 13.0-15.2 % for older blood. There were only small differences in outcome rates between quartiles across the different methods, with the middle quartiles tending to have lower morbidity rates.

Method 3: generalised propensity score-continuous exposure
There were 1490 matched pairs created (N = 2980, 99.7 %). The average difference in age of blood received between those receiving older blood and those receiving fresher blood was 12.3 days. After excluding patients outside the common support, there were N = 2756 available for the remaining analyses. There was no difference in rates of severe morbidity or readmission/transfer across deciles of age of blood (Table 3). With some exceptions, rates across deciles tended to be similar for the stratification and weighting approaches; where they differed, the weighting values tended to be more extreme. Both sets of rates tended to 'jump around', with little trend evident between deciles. The regression rates, however, were smoother between deciles, and less extreme than either weighting or stratification, except for the highest and lowest deciles.

Discussion
This study found no adverse effect of transfusion of older blood on maternal outcomes. Twelve different analyses using three methods of constructing the propensity score and four approaches to applying it were performed for each adverse outcome, with a high degree of consistency across methods. None of the methods considered showed a beneficial or detrimental effect of older blood on patient outcomes. The obstetric population provides an ideal population in which to study the effect of age of transfused blood on patient outcomes, as in this population patients are generally young and otherwise healthy [26]. A more complete discussion of the effect of age of blood in an obstetric population can be found in Patterson et al. [26]. This finding of no effect of age of blood is consistent with several studies amongst lower risk patients [27,28], however in some specific populations age of blood has been shown to affect outcomes [1,[29][30][31][32]. The adequacy of methods used in these studies to address confounding has been questioned [5]. Use of propensity score methods enabled us to separate the effect of older blood from confounders, particularly the number of units transfused. Consideration of propensity scores for ordinal and continuous exposures allowed us to move away from the need to dichotomise age of blood, which although commonly used, has little physiological justification [5]. Different propensity score methods were used, resulting in different estimates of effect, with each method and approach having different benefits and drawbacks.  Propensity scores are becoming more widely used, and have a number of advantages over other methods. In particular, matching on a propensity score for a binary exposure creates a situation similar to the baseline balance achieved in a randomized trial (on measured confounders), and so is accessible for clinicians [8]. 
It is also possible to assess the balance created by the propensity score method [14,33], and to exclude subjects where the results are unlikely to apply [33]. Another key benefit lies in the two step process of analysis, where the modeling process (constructing the propensity score) is conducted separately to the analysis of results, maintaining a level of objectivity [14,33]. As noted by Zanutto et al. an added benefit of this approach, used in this study, is that the same propensity score can be used across multiple outcomes [14]. In cases where the outcome is rare, but the exposure is common, traditional regression models are not able to fully model confounding, however propensity scores are able to adjust for more confounders [8,23].
Across the three methods of constructing the propensity score (using binary, ordinal or continuous exposures) used in our study, there were differences in the performance of the different approaches. The difference in age of blood received between pairs decreased when using the generalized propensity score compared with the binary and ordinal score approaches; however, matching was more successful (a greater proportion of patients matched) when using the generalized propensity score than with the ordinal and binary methods. This is due to the smaller number of potential matches excluded due to having the same value of the exposure. The impact of this can be seen in narrower confidence intervals compared with other methods when using ordinal or generalized propensity score methods. In contrast, more patients were excluded when using the generalized propensity score and ordinal propensity score methods for being outside of the 'common support', where a patient is deemed to have received an unusual treatment given their covariate pattern.
Within the binary propensity score method, stratification, weighting and regression produced similar estimates of effect, with the narrowest confidence intervals associated with the regression estimates. The confidence intervals associated with matching were wider, reflecting the smaller sample size used in this approach. With the ordinal model, stratification tended to give the widest confidence intervals, with matching and regression producing narrower intervals. However, using the generalized propensity score, the greatest uncertainty was associated with the regression model, reflecting the variability both in the propensity score and the modeling process. Stratification typically performed better in terms of reduced variability. Patterns of estimates obtained via stratification and weighting were similar when using the ordinal and generalized propensity scores, but differed from the results obtained from the regression-based approaches. The regression approaches impose a degree of smoothness between estimates of adjoining categories which the other estimates are unable to account for. This additional smoothness was more noticeable with the generalized propensity score than with the quartiles. Given the known ordering of quartiles and deciles, it seems beneficial to incorporate this knowledge in the effect estimates. The development and assessment of the propensity score models was most straightforward for the binary propensity score, as this resulted in two groups that could easily be compared to assess balance. In the more complex methods, both the exposure and propensity score need to be stratified, and patients within each stratum compared in order to determine if balance has been obtained.
Here we used quintiles of propensity score and observed age of blood, resulting in 25 strata. This difficulty carries over to the interpretation of results. It is possible to obtain effect estimates, odds ratios and other measures of effect for the binary and ordinal propensity score models, but for the generalized model, with the exception of matching, the relationship between outcome and exposure is difficult to summarise, and may be better captured graphically [12,34].
The relative merits of the different methods for a binary exposure have been discussed elsewhere [17] and carry across to the more complex designs with several exceptions. Weighting and stratification in the case of non-binary exposures represent only a small increase in complexity compared with the binary case. Weighting methods, while easily applied in the binary propensity score case, do not exploit the extra information available in the case of ordinal or continuous data, [34] and are more difficult in cases with a truly continuous outcome [35]. Weighting is also difficult when more than one propensity score has been used for each subject [22,34,36]. Stratification however is easy to apply in cases of one or more propensity score, [14] and retains the ease of interpretation that is present with binary scores [37].
Applications of matching and regression adjustment methods are somewhat more complicated in the case of non-binary exposures. While matching is intuitive in the case of binary exposures, in some sense replicating the setup of a controlled trial (although only ensuring groups are the same based on observed covariates), with more than 2 exposures matching is less intuitive. In the continuous case, the comparison of 'older vs fresher' blood is conducted without defining 'older' or 'fresher' across the sample. It is possible that there could be a matched pair with ages of blood 7 and 10 days, and another pair with ages 32 and 36 days. In this case, the patients receiving blood of 7 and 32 days would be included as receiving 'fresher' blood, even though the difference in age of blood received is considerable. It has been argued that matching can be helpful, even when a null result is returned, in that it can be interpreted as follows: 'considering the greatest possible differences between age of blood received, there was no effect of age of blood on outcomes, hence no effect would be expected at smaller differences' [22]. Newer techniques allow for matching in categorical outcomes, where one subject from the group with the smallest number of subjects is matched to one or more subjects in the remaining groups, and the matched sample used in regression analysis or similar [36,38]. The matching algorithm needed for matching with more than two exposures, although available, makes this method less accessible than stratification, weighting and regression which can be performed using traditional software. For this reason, it was considered beyond the scope of this paper [14]. Regression adjustment in the non-binary case can be performed in normal statistical software, but requires several additional steps. 
In particular, parameters obtained from the regression equation cannot be interpreted directly, but need to be averaged over the distribution of propensity scores evaluated at that dose [34].
It is important to note that when using matching techniques, the whole population is not included in the analysis, and so reported rates reflect the incidence of outcomes only in those women who were able to be matched to a woman in the other arm. With other methods, where 'common-support' criteria are imposed, the study populations are likewise limited to those women who are similar in terms of likelihood of exposure at varying treatment levels. In some cases, when a large number of women have been excluded, this may be quite different from the population rate [33]. A comparison of included and excluded cases would be important in practice to aid in the interpretation of results and generalisation to the wider population.
This study explored the application of different propensity score methods to the effect of age of blood on adverse outcomes. We used a large sample, with information on many potential confounders, and considered two outcomes: one with low incidence and one more common. A possible limitation of this study is that any unmeasured confounders such as hospital policies or clinician practice of preferentially transfusing fresher blood to sicker patients would not be adjusted for by the propensity score (regardless of method used), and may introduce bias into the result. Adjustment by hospital may somewhat offset such bias, but other unknown confounders may also affect results. In addition, for simplicity we only used one method of checking for balance, although several methods have been proposed and are used in practice, [9,10,22,39] and these methods may be more appropriate for particular approaches, such as the use of standardized mean differences [39] for binary matching. In practice, the balance method chosen should relate to the methods used in the analysis [33].

Conclusions
Propensity score methods are useful for the analysis of observational data around the age of blood transfused, and allow causal inference from such data. These methods are able to account for differences in number of blood packs transfused and other confounders that influence both the age of blood transfused and potential adverse outcomes. Although less intuitive than their binary exposure counterparts, propensity score methods for ordinal and continuous exposures are feasible, able to be implemented with standard software (using packages available online), and better reflect the underlying mechanism of age of blood. These methods should be considered in similar studies where it is not appropriate to dichotomise an exposure, and where the outcome is sufficiently rare to limit the utility of regression modeling. Each of the three methods (binary, ordinal and continuous exposure) produced slightly different estimates of effect, but found no significant relationship between age of blood transfused and adverse outcomes.