Assessing the validity of indicators of the quality of maternal and newborn health care in Kenya

Background
The measurement of progress in maternal and newborn health often relies on data provided by women in surveys on the quality of care they received. The majority of these indicators, however, including the widely tracked "skilled attendance at birth" indicator, have not been validated. We assess the validity of a large set of maternal and newborn health indicators that are included or have the potential to be included in population-based surveys.

Methods
We compare women's reports of care received during labor and delivery in two Kenyan hospitals prior to discharge against a reference standard of direct observations by a trained third party (n = 662). We assessed individual-level reporting accuracy by quantifying the area under the receiver operating curve (AUC) and estimated population-level accuracy using the inflation factor (IF) for each indicator with sufficient numbers for analysis.

Findings
Four of 41 indicators performed well on both validation criteria (AUC>0.70 and 0.75<IF<1.25). These were: main provider during delivery was a nurse/midwife, a support companion was present at birth, cesarean operation, and low birthweight infant (<2500 g). Twenty-one indicators met acceptable levels for one criterion only (11 for AUC; 9 for IF). The skilled birth attendance indicator met the IF criterion only.

Interpretation
Few indicators met both validation criteria, partly because many routine care interventions almost always occurred, and there was insufficient variation for robust analysis. Validity is influenced by whether the woman had a cesarean section, and by question wording. Low validity is associated with indicators related to the timing or sequence of events. The validity of maternal and newborn quality of care indicators should be assessed in a range of settings to refine these findings.

Nearly 275 000 maternal deaths occurred globally in 2011, almost all of which took place in low- and middle-income countries (LMIC) [1]. Most of these countries did not reduce maternal mortality to levels targeted in the Millennium Development Goals (MDG5) [1]. Progress has been hindered, in part, by a lack of reliable maternal health data, especially on maternal deaths [2]. Measurement challenges are particularly significant in LMIC with irregular and incomplete health system reporting.
To measure progress in maternal health, monitoring agencies have relied on tracking indicators proposed as measures of quality of care, such as the proportion of births attended by a skilled birth attendant, that are assumed to be strongly correlated with maternal mortality [3]. Such indicators are routinely assessed in population-based household survey programs, such as the Demographic and Health Surveys (DHS) and Multiple Indicator Cluster Surveys (MICS), in which female respondents report on events surrounding recent births [4]. Despite their widespread use, the majority of proposed quality of care indicators, including skilled birth attendance, have not been validated [1,5,6]. In fact, numerous researchers have noted the lack of correlation between these indicators and maternal mortality levels [5,7-9]. These researchers argue that information on the category of provider at birth is deficient as a measure of quality of care because it relies on assumptions about provider training and competence, as well as about access to essential supplies and equipment. It is important, therefore, to identify alternative indicators that describe the actual content of care, can be reported with accuracy, and have the potential to be included in routine data collection programs.
A growing, but still limited, body of research has examined the validity of indicators of the quality of care in the intrapartum and early postpartum period. To our knowledge, however, no study has yet reported on how accurately women can recall the skill level of their provider at birth, although there have been some attempts to examine data quality issues [10]. Furthermore, the few validation studies that have taken place have generally compared maternal self-reports with hospital records, which may be incomplete or inaccurate, or have been conducted in high-income settings, where maternal mortality rates are generally low [11-15].
To address this gap, this study assessed women's ability to report on a set of indicators of the quality of maternal and newborn health care that are either currently in use or have the potential to be included in routine survey-based data collection. In spite of its limitations, the "skilled birth attendance" indicator is likely to remain in use, and so we assess how accurately women report on the skill level of their provider during delivery. We compare women's self-reports of the maternal and newborn care they received against third-party observations during labor and delivery. Finally, we provide suggestions for modifications to data collection procedures that could improve the measurement of maternal and newborn health care.

Study sites
Validation exercises were conducted in two high volume public hospitals located in Kisumu District and Kiambu District in western and central Kenya, respectively. According to the 2014 Kenya DHS, nationally, 61% of births in the five years preceding the survey were delivered in a health facility; in Kisumu and Kiambu districts the prevalence was 70% and 93%, respectively [16]. Facility-based delivery is less likely among older women, those who have lower education, are poorer, or reside in rural areas [16]. Fertility levels among women in the two districts are lower than the national rate, with the total fertility rate in Kisumu at 3.6 births per woman and in Kiambu at 2.7, compared with 3.9 nationally [16].

Data collection
Data collection took place from July to September 2013. All pregnant women aged 15 to 44 who were admitted to a study facility maternity unit and in early labor were invited to participate. Participants included eligible women who underwent labor and delivery and were able to provide consent.
Our reference standard for validity analysis is data collected by trained researchers who observed providers in the maternity admission room and labor and delivery rooms using a structured checklist-type form. Observers were registered Kenyan nurse/midwives with at least three years of experience in a maternal and newborn health unit and previous research experience. Observations were used as the reference standard as they reflected all facets of caregiving including events related to the birth itself as well as interactions between the women and provider, before, during and up to one hour after delivery. In the few cases in which clarification was needed (eg, in the event the mother and infant were taken into separate rooms, the observer remained with the mother) observations were supplemented by checking facility records and by asking providers.
Exit interviews with women took place prior to hospital discharge. Data collectors, all of whom held degrees in a social science, interviewed women using a structured questionnaire. Questionnaires were translated into Kiswahili, Dholuo and Kikuyu and were administered in the woman's language of preference.
All data collectors received four days of intensive training on the study procedures, the rationale behind each element of the client questionnaire and observation checklist to ensure full understanding of the instrument components, and how to record responses and observations.

Ethical review
Written informed consent was obtained from all participants and their attending providers prior to participation. All women and providers were provided with a description of the study and procedures, including their right to refuse participation at any time. In Kenya, pregnant adolescents between ages 15-17 years are considered "emancipated minors" and their written informed consent was also obtained [17-19]. Staff who provide labor and delivery care were identified by the hospitals' obstetrics and gynecology director, and approached for recruitment and consent. No providers refused participation.
Prior to participant enrollment, the study protocol was approved by the ethical review committees of the Population Council and the Kenya Medical Research Institute (KEMRI).

Indicator selection
To identify indicators to be validated, a landscaping scan of published and grey literature was conducted from April to July 2012. The scan focused on indicators of the quality and content of care received during labor and delivery and the health outcomes related to this period [20]. Indicators were included if they were currently in use or proposed for use in household survey programs such as the DHS and MICS or reflected standard practices of maternal and newborn labor and delivery care. The scan yielded a list of 285 indicators. This list was assessed by a group of public health experts specializing in maternal health to select a set of 80 indicators for validity testing. Indicators were selected on the basis of their wide use and/or potential to assess the critical elements of maternal and newborn care during the initial assessment of the woman, the first, second and third stages of labor, and immediate postnatal period.

Analysis
Sample size was calculated assuming 50% prevalence for all indicators, the most conservative assumption given that some harmful practices would rarely occur and some beneficial practices would almost always occur, with 60% sensitivity ±6% precision, 70% specificity ±6% precision, and type 1 error set at α = 0.05, assuming a normal approximation to the binomial distribution. These specifications imply a minimum sample size of 500, which was increased to 600 women to allow for 20% attrition in a separate study that will re-interview women approximately one year after delivery.
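The calculation above follows the standard precision-based formula for estimating a proportion, n = z²·p(1-p)/d², inflated by the assumed 50% indicator prevalence (since only "positive" or "negative" cases contribute to sensitivity or specificity, respectively). A minimal sketch, with rounding conventions that are our own assumption:

```python
import math

def validation_sample_size(p_hat, precision, prevalence, z=1.96):
    """Sample size to estimate a proportion (sensitivity or specificity)
    to within +/- `precision`, using the normal approximation to the
    binomial, inflated because only a `prevalence` fraction of
    participants contribute to the estimate."""
    n_subgroup = (z ** 2) * p_hat * (1 - p_hat) / precision ** 2
    return math.ceil(n_subgroup / prevalence)

# Sensitivity of 60% +/- 6 points, assuming 50% indicator prevalence
n_se = validation_sample_size(0.60, 0.06, 0.50)
# Specificity of 70% +/- 6 points, assuming 50% indicator prevalence
n_sp = validation_sample_size(0.70, 0.06, 0.50)
n_required = max(n_se, n_sp)  # roughly 500, consistent with the study's minimum
```

The sensitivity target drives the total, since p(1-p) is larger at 60% than at 70%.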
Statistical analysis was performed using Stata Version 12 [21] to assess indicator accuracy at the individual and population level. For individual-level reporting accuracy, we calculated the sensitivity (ie, true positive rate) and specificity (ie, true negative rate) of indicators by constructing two-by-two tables for each indicator that had at least five counts per cell [22].
Missing pairwise data were excluded. To summarize the accuracy of each indicator, we quantified the area under the receiver operating curve (AUC), which plots the sensitivity (ie, true positive rate) of each indicator against its false positive rate (1-specificity). To measure the uncertainty associated with validity, we estimated 95% confidence intervals (CI), assuming a binomial distribution. In practice, the AUC represents the "average accuracy of a diagnostic test" [23,24]. AUC values range from 0 (no classification accuracy) to 1.0 (perfect classification accuracy); an AUC value of 0.5 is the equivalent of a random guess.
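For a yes/no survey question validated against a binary reference standard, the ROC has a single operating point, so the AUC reduces to the mean of sensitivity and specificity. A sketch with hypothetical counts (the example 2x2 table is illustrative, not data from this study):

```python
import math

def binary_auc(tp, fn, fp, tn, z=1.96):
    """Sensitivity, specificity, and AUC for a binary indicator
    validated against a reference standard (2x2 table).
    For a binary test, AUC = (sensitivity + specificity) / 2."""
    se = tp / (tp + fn)  # true positive rate
    sp = tn / (tn + fp)  # true negative rate
    auc = (se + sp) / 2
    # Normal-approximation 95% CIs for each proportion
    se_ci = (se - z * math.sqrt(se * (1 - se) / (tp + fn)),
             se + z * math.sqrt(se * (1 - se) / (tp + fn)))
    sp_ci = (sp - z * math.sqrt(sp * (1 - sp) / (tn + fp)),
             sp + z * math.sqrt(sp * (1 - sp) / (tn + fp)))
    return se, sp, auc, se_ci, sp_ci

# Hypothetical table: 90 true positives, 10 false negatives,
# 40 false positives, 60 true negatives
se, sp, auc, _, _ = binary_auc(90, 10, 40, 60)
# se = 0.90, sp = 0.60, auc = 0.75 -> "high" accuracy under the AUC > 0.70 cutoff
```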
To assess the population-based validity of indicators, we estimated the inflation factor (IF), also known as the Test to Actual Positives (TAP) ratio [25]. The IF is the prevalence of the indicator as it would be reported by women in a survey after accounting for sensitivity and specificity (Pr), divided by the true prevalence as recorded by observers (P). By taking the ratio of the estimated survey-based prevalence to the true prevalence, we calculated the degree to which each indicator would be over- or under-estimated by women's self-report (IF = Pr/P) [25,26].
The prevalence of women's self-report in a survey (Pr) is calculated by applying each indicator's estimated sensitivity (SE) and specificity (SP) to its true prevalence (P), using the following equation: Pr = P × (SE + SP - 1) + (1 - SP) [26]. We caution that the estimated survey-based prevalence depends on the observed prevalence of the indicator; IF estimates therefore reflect the magnitude of over- or under-estimation in the study setting. To illustrate the implications of the IF estimates for contexts in which the true prevalence differs from our study setting (eg, outside of a hospital facility), we model the estimated survey prevalence for select indicators across all possible coverage levels (ie, true prevalence ranging from 0 to 100%) using the above equation [27].
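The inflation factor and the survey prevalence implied by a given sensitivity/specificity pair can be sketched in code as follows (the SE and SP values here are illustrative, not study estimates):

```python
def survey_prevalence(true_p, se, sp):
    """Prevalence a survey would report (Pr) given true prevalence P,
    sensitivity SE, and specificity SP: Pr = P*(SE + SP - 1) + (1 - SP)."""
    return true_p * (se + sp - 1) + (1 - sp)

def inflation_factor(true_p, se, sp):
    """IF = Pr / P: values above 1 mean the survey over-estimates coverage."""
    return survey_prevalence(true_p, se, sp) / true_p

# Model the reported prevalence across all possible coverage levels
# for an illustrative high-sensitivity, low-specificity indicator
se, sp = 0.95, 0.40
curve = [(p / 100, survey_prevalence(p / 100, se, sp)) for p in range(1, 101)]

# At low true coverage the reported rate is badly inflated...
# inflation_factor(0.10, se, sp) -> ~6.35
# ...while at near-universal coverage the report is close to the truth:
# inflation_factor(0.95, se, sp) -> ~0.98
```

Note that the false positives (1 - SP) contribute a fixed amount to Pr regardless of P, which is why the same indicator can look accurate at high coverage and wildly inflated at low coverage.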
We categorized individual-level reporting accuracy as high (AUC>0.70), moderate (0.60<AUC<0.70), and low (AUC<0.60) [22] and the degree of bias reflected by the IF as low (0.75<IF<1.25), moderate (0.50<IF<1.5) and large (IF<0.50 or IF>1.5) [11]. In order to summarize indicator validity in terms of both individual and population-level accuracy, we considered indicators with high AUC and low IF to have high overall performance [22].

Role of the funding source
The funders of the study had no role in data collection, analysis, interpretation or writing of the study results, or decision to submit for publication.

Sample description
A total of 1039 women admitted to the maternity units of participating study facilities were recruited to participate. Of those, 676 women were observed (Kiambu = 395, Kisumu = 281). Approximately one-third of women were not observed because they were not in labor but required monitoring on the antenatal ward, did not progress into labor, or progressed so rapidly that full observation was not possible (Figure 1). Fourteen women who were observed did not participate in the exit interview.
Participants' background characteristics and differences by facility location are presented in Table 1. The majority of women were under age 25 with fewer than two prior births, married, and with no or primary education. A greater percentage of Kisumu participants were never married, while fewer were married/living together or separated/widowed.

Validation results
The full list of indicators selected for validity testing is presented in Table 2. The table provides the prevalence of each indicator as reported by women and by observers, which, for some indicators, varied substantially. For example, 73% of women reported that the provider(s) washed his or her hands or used antiseptic before any initial examination, while observers recorded that this took place for 27% of women. "Don't know" responses were minimal for most indicators; the four indicators for which the proportion of women responding "Don't know" exceeded 5% are reported in Table 3. Two of these indicators refer to the immediate postnatal period: whether the newborn was immediately dried after birth and whether the newborn was immediately wrapped in a towel. Having a cesarean section, as opposed to a vaginal delivery, was significantly associated with responding "Don't know" to these questions.

Table 3. Indicators with greater than 5% "Don't know" responses

Respondent question | N | "Don't know" (%)
Did the health provider(s) wash his/her hands with soap and water or use antiseptic before examining you? | 662 | 29.5
Was your baby dried off with a towel or cloth immediately after his/her birth, within a few minutes of delivery? | 660 | 8.3
(Of women who reported the newborn was not placed against her chest immediately after delivery) Was your baby wrapped in a towel or cloth immediately after birth? | 170 | 20.6
In your first physical examination after delivery, did a health provider do a perineal exam? | 662 | 9.8

A total of 8 indicators had high individual reporting accuracy (AUC>0.70), 7 had moderate accuracy (0.60<AUC<0.70), and 26 had low accuracy (AUC<0.60). Indicators with high AUC values reflected events leading up to the birth (eg, induction or augmentation of labor, episiotomy) and the birth itself (eg, cesarean section, main provider during delivery was a doctor or medical resident, main provider during delivery was a nurse/midwife, support person present during birth, low birthweight infant). Indicators with low AUC values tended to relate to events immediately following the birth (eg, uterotonic received following delivery of the placenta) and to postnatal health checks for the mother and newborn.

For population-level bias, a total of 13 indicators had low bias (0.75<IF<1.25), 7 had moderate bias (0.50<IF<1.50), and 21 had large bias (IF<0.50 or IF>1.50). Indicators with large bias varied with respect to the phase of labor and delivery, but those with the largest bias tended to have a low observed prevalence and to require medical knowledge to report accurately (eg, experience of complications).
To assess women's ability to recall the type of provider who attended them, respondents were asked, "Who was the main provider assisting you during delivery?" There were sufficient cell counts for robust analysis of two provider categories: nurse/midwife and student nurse. The nurse/midwife indicator met both the high AUC and low IF criteria, while the student nurse indicator had low individual accuracy (AUC = 0.57) and large bias (IF = 0.45). An indicator constructed in analysis that combines responses of "doctor", "medical resident" and "nurse/midwife" as "skilled attendants" had low individual accuracy (AUC = 0.55), primarily due to low specificity, and low population-level bias (IF = 1.02) [28]. Cross-tabulation of women's reports against observers' reports of the main provider during delivery suggests a tendency for medical residents and nurse/midwives to be misclassified by women as doctors (Table 5).

Table notes: *Validation analysis based on matched data, excluding "Don't know" responses. Sensitivity and specificity analysis was not performed for indicators that had fewer than 5 counts per cell. †Skilled provider includes doctor (ob-gyn), medical resident or nurse/midwife. ‡Indicator constructed from two skin-to-skin items: (1) newborn placed against mother's chest after delivery and (2) newborn was lying naked against the mother's chest.

To illustrate the implications of the indicator properties established in this study for other contexts, we plot the predicted survey prevalence (Pr) of select indicators across all possible levels of intervention coverage (from 0 to 100%).
Figures 2 and 3 compare the predicted prevalence using the sensitivity and specificity calculated in this study (blue line) with perfect reporting accuracy, assuming 100% sensitivity and specificity (black line), across all levels of coverage. Using the example of a high-sensitivity, low-specificity indicator such as "skilled birth attendance", these data demonstrate that in a high-coverage setting the estimated survey-based prevalence from women's self-report closely approximates the true prevalence, while in low-coverage settings the estimated survey-based prevalence would greatly overestimate the true rate (Figure 2). For example, in a setting where the true prevalence of skilled attendance is 40%, rather than the 93% observed in this study (the red triangle), the estimated survey-based prevalence would exceed the true prevalence by 50 percentage points. In contrast, an indicator with both high sensitivity and high specificity, such as "cesarean operation", would generate a survey-based estimate that closely approximates the true prevalence across all coverage levels (Figure 3).
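The skilled-attendance example can be reproduced numerically. The sensitivity and specificity below are illustrative values chosen to match the pattern described (high sensitivity, low specificity), not the study's published estimates:

```python
def survey_prevalence(true_p, se, sp):
    # Pr = P*(SE + SP - 1) + (1 - SP), as defined in the Analysis section
    return true_p * (se + sp - 1) + (1 - sp)

# Illustrative high-sensitivity, low-specificity indicator
se, sp = 0.95, 0.14

# Near the coverage observed in this study (~93%), self-report
# tracks the truth closely...
high = survey_prevalence(0.93, se, sp)   # ~0.94

# ...but at 40% true coverage the survey estimate overshoots
# by roughly 50 percentage points
low = survey_prevalence(0.40, se, sp)    # ~0.90
overshoot = low - 0.40                   # ~0.50
```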

DISCUSSION
This study provides validity results for 41 indicators of the quality of maternal and immediate newborn health care that are either currently in use or have the potential to be included in household surveys. Across phases of labor and delivery, we found that indicators tied to concrete, observable aspects of care, or to events involving pain or concern, were reported with higher accuracy. These results are consistent with previous studies, which have found that particularly memorable events, such as having a cesarean operation [11,22,29] and having a support person present [22], have high overall validity.
That only a small number of the initial list of 80 indicators met both validation criteria is partly due to the fact that many preventative care interventions almost always occurred, leaving insufficient variation (ie, not enough cases in each cell) for robust analysis. For many preventative care indicators we found that most women accurately reported receiving the care (ie, high indicator sensitivity). For example, an indicator that is a proxy for receiving a uterotonic for the prevention of postpartum hemorrhage (ie, whether an injection, IV medication or tablets were received in the first few minutes following birth) was accurately reported by nearly all women. Although the near-universal implementation of this practice limited robust analysis, these results suggest some aspects of routine care can be accurately reported. However, given that there were few instances in which standard preventative interventions were not received, unless there was almost perfect negative classification by women, these indicators also tended to have low specificity. An alternate interpretation is that the observed pattern of high sensitivity and low specificity for many preventative practices may reflect "facility reporting bias" among women, based on the expectation of receiving appropriate care. This finding has also been described in a study of women's reporting of maternal and child health care in China [11].
The potential for facility reporting bias may also be relevant for indicators of skilled birth attendance. Indicators measuring the assistance of a skilled provider had high sensitivity and low specificity for both labor and delivery. Women tended to under-report the presence of less skilled providers, such as student nurses, and over-report the presence of a doctor or obstetrician/gynecologist. The positive bias may also be due to differences in how women conceptualized who their "main" provider was. It is possible that women understood their "main" provider to be the attendant with the highest rank, who may have been deemed "in charge" of their care, while observers identified the primary provider as the attendant who administered the majority of the care to the woman.
Study findings also suggest the validity of some indicators may depend on context and question wording. Indicators that performed worse on the validity tests tend to relate to the timing or sequence of events, such as whether the newborn was placed skin-to-skin on the mother's chest immediately after delivery. A two-item question sequence that clarified the precise meaning of "skin-to-skin" greatly reduced women's overestimation of the practice compared with a one-item indicator (Table 4). These results are consistent with findings that women had difficulty reporting whether their newborn was placed skin-to-skin in a qualitative study of delivery and newborn care among women in Bangladesh and Malawi [30], but contrast with findings from a recent validation study in Mozambique [22]. The mixed results may be attributable to the longer recall period in the Mozambique study.
An influential aspect of the birth context was the type of delivery. Women who had a cesarean operation were less likely to be able to report on immediate newborn care than women with normal deliveries, as reflected in high levels of "Don't know" responses. This finding suggests that it may be worth excluding women with cesarean sections from questions about care immediately after birth in routine household surveys.
A number of indicators performed well on the IF test only; individual-level misclassification does not inherently mean that measurement at the aggregate level will be inaccurate [25]. In studies whose goal is to estimate the approximate population-based coverage of an indicator, false positives and false negatives may balance out to produce a close approximation of population-level coverage (ie, indicators that meet the IF criterion alone). Knowing whether an indicator's IF is large can inform when corrective methods, such as use of a two-item indicator, may be needed to limit false positive reporting.
Knowledge of whether an indicator is likely to be overestimated can also have significant programmatic implications. For example, where skilled birth attendance is over-reported, progress in scaling up the presence of higher cadre providers may not be as great as expected. It is important to recognize that the presence of a skilled provider is only one aspect of receiving quality care, and one which relies on the assumption that providers have received the necessary training to administer essential interventions and have acquired the competencies to address complications during childbirth. Additionally, even "skilled" providers may not be able to deliver adequate care if they lack access to necessary equipment and supplies. Information on skilled attendance as reported by women should therefore be corroborated with indicators on the content of care. When possible, we recommend that users also triangulate self-reported data on quality of care with other data sources, such as information on stock-outs of essential medicines [4].
While a strength of this study is the use of direct observation as the reference standard, there are some limitations. Validation results are based on women delivering in large public hospitals and may not be generalizable to women who deliver in other types of facilities or at home. The lack of variation in hospital practices also limited our ability to analyze all of the indicators, some of which may have proven valid had we collected data in a wider range of health institutions. Finally, our results represent a "best case" scenario in terms of recall accuracy because women were interviewed shortly after delivery. To examine how recall changes over time, and to investigate women's understanding of concepts such as who their main provider was, a follow-up study is under way to re-interview women one year after delivery.

CONCLUSION
The measurement of the quality of maternal and newborn health care received in LMIC settings often relies on data from surveys of women, yet little research has examined the validity of these indicators. The primary indicator of interest in this study, delivery by a skilled birth attendant, met validation criteria for reporting at the population level only, and the results indicate that reporting accuracy may be particularly problematic where skilled birth attendance coverage is low. The indicator properties established here provide insight into contexts where indicator use is appropriate, and where modifications to data collection procedures or question construction may be warranted.