Advancing contraceptive security, availability, and choice in Malawi using a quality improvement methodology

Many initiatives to improve contraceptive security (CS) rightly focus on strengthening national and regional systems. However, local health facilities are often under-resourced and lack technical capacity in the functions that feed into the larger supply chain. This study’s objective was to assess whether changes in facility CS indicators were associated with participation in the 2014-2016 implementation and scale-up of a quality improvement methodology, COPE® (Client-Oriented, Provider-Efficient services) for Contraceptive Security, in 60 facilities across 10 districts of Malawi. The intervention included facility self-assessment guides and action plans to address local challenges. Results showed significant improvements in facilities having both a trained provider and contraceptive supplies. The percentage of health centers with all requirements for implant services increased significantly, including implant removal (from 26.5%; 95% CI: 14.9-41.1 to 77.6%; 95% CI: 63.4-88.2, p<0.001). Health centers (from 0.0%; 95% CI: 0.0-7.3 to 10.2%; 95% CI: 3.4-22.2, p<0.05) and hospitals (from 45.5%; 95% CI: 16.7-76.7 to 90.9%; 95% CI: 58.7-99.8, p<0.05) significantly improved in the percentage of facilities able to insert intrauterine devices. Hospitals improved their ability to offer female sterilizations (27.2%; 95% CI: 6.0-61.0 to 63.6%; 95% CI: 30.7-89.1, p<0.05) and male sterilizations. Low-performing health centers showed significant improvement in staff capacity, logistics management information systems, equipment, and total CS performance. The percentage of facilities placing emergency orders for contraceptives during the three months prior to an assessment decreased non-significantly among hospitals and significantly among health centers (from 69.2%; 95% CI: 54.6-81.7 to 36.7%; 95% CI: 23.4-51.7; p<0.001). Facility staff commitment was associated with action item completion. Improvements tended to be sustained over time.
Community engagement is thought to be important to intervention success. COPE for CS may be an effective intervention and future


Introduction
Contraceptive security (CS) exists when people are able to choose, obtain, and use the contraceptive methods and services they desire from among a full range of methods (see Box 1) 1 . Achieving CS is critical to meeting the Sustainable Development Goals, especially Goal 3, which seeks to ensure healthy lives and promote well-being for all at all ages 2 . The international public health community recognizes that CS remains weak in many resource-poor settings of sub-Saharan Africa and elsewhere, and called for action and commitments at the two recent London Summits on Family Planning in 2012 and 2017 3-5 . Importantly, improving availability and choice of contraceptive methods and services is essential to fulfilling sexual and reproductive health and rights.
Improving CS requires systems transformation and concerted efforts to make improvements sustainable. Many donor and government-funded initiatives aimed at improving CS rightly focus on strengthening national and perhaps regional-level systems of forecasting, procurement, central stock management, supply chain and related elements. However, district zones and their local health facilities that are closer to clients are often under-resourced and/or lack technical capacity in logistics management, requisition, stock management and stock reporting, all of which feed into national systems. These "last mile" facilities face challenges that require specific tools and approaches designed to identify and solve local problems as part of the larger supply chain 6 . Examples of important initiatives geared towards last mile needs include: tools for community-based distribution programs; community score cards for accountability and community involvement in CS; and initiatives in pharmacy management 2,7,8 . Recognizing gaps in efforts for the last mile, the Reproductive Health Supplies Coalition's (RHSC) Advocacy and Accountability Work Group announced a call to join its new "Last Mile Workstream" as recently as late 2017 9,10 .
Recognizing a gap in methodologies, tools and approaches specifically targeted to facility management of CS issues at the last mile, EngenderHealth's RESPOND project developed and tested the quality improvement methodology COPE ® for Contraceptive Security (COPE stands for Client-oriented, Provider Efficient services) 11 . The methodology includes facility self-assessment guides and subsequent development and implementation of facility-level action plans to address local gaps and challenges to contraceptive security. This article presents findings from the 2014-2016 implementation and scale-up of the methodology in Malawi with support of the RESPOND project and the RHSC. The initiative was carried out at the request of, and in partnership with, the Ministry of Health and district health officials in 60 facilities across 10 districts of Malawi.
Program description
COPE ® for Contraceptive Security
Client-oriented, provider-efficient services (COPE ® ) is a quality improvement methodology first developed in 1995 by EngenderHealth to address clients' rights to health services and provider needs to deliver quality services 12,13 . Since that time, numerous iterations of COPE for different technical areas have been tested and published, including adaptations to improve the quality of services in reproductive health, HIV care and treatment, male circumcision, and abortion care, among others 14-18 . The COPE ® for Contraceptive Security methodology and tools are used by frontline health and logistics personnel to identify and implement low-cost, local solutions to address problems related to contraceptive supply 19,20 . The process incorporates staff accountability and linkages with district supervision systems and community and local leadership, as needed, creating ownership in improving quality and strengthening systems for sustainability.
The COPE for CS process begins with an exercise conducted by trained facilitators to orient facility teams on the activity. Once staff agree to tackle the issues under consideration, facility teams complete a series of self-assessment guides on issues including stock management, reporting, requisition, transportation, warehousing, and personnel. Identifying problems leads staff to develop action plans that address local bottlenecks and issues the facility and district can try to resolve themselves, and to form a COPE for CS committee to oversee implementation and follow-up of the action plan. A job aid is available to foster continued reflection and reanalysis of issues during implementation and for use in district supervision 21 . Intended for global use, COPE for CS was originally designed and tested in Tanzania from 2011 to 2013, where results showed statistically significant improvements in contraceptive availability and increases in family planning use after more than one year 11,22 .

Introduction and scale-up in Malawi
In Malawi, the Ministry of Health (MOH) made remarkable progress in improving family planning access over the past 15 years. Modern contraceptive prevalence among married women increased from 28% in 2004 to 42% in 2010, and again to 58.1% by 2015-2016 23-25 . The MOH made strides toward achieving contraceptive security at the national level, while noting that contraceptive security is weaker at district and lower-level health facilities. In 2014, the MOH Directorate for Reproductive Health (DRH) requested assistance to address contraceptive security at the last mile. Challenges identified at the local level included: lack of trained providers (especially for long-acting reversible contraception (LARCs) and permanent methods); unclear roles and responsibilities for logistics management; and lack of training in requisitioning and ordering (which can lead to stock-outs of contraceptives and related supplies) 26 . Box 2 shows the programmatic process for COPE for CS introduction and implementation in the two districts as well as 2015-2016 implementation in additional districts and facilities at the request of MOH/DRH.

Box 1. Contraceptive security exists when people are able to choose, obtain, and use the contraceptive methods and services they desire from among a full range of methods (short-acting, long-acting reversible, and permanent). In order for family planning programs to provide a full range of methods, three basic elements must be consistently present at a service delivery point: the contraceptives themselves; necessary medical equipment, instruments, and expendable supplies; and trained staff able to provide each method. When any of these elements is missing from a service delivery point, the method cannot be offered, and contraceptive security is neither achieved nor maintained 1 .

Implementation process of COPE for CS
To understand the intervention, it is necessary to understand the structure of the COPE for CS process. Site facilitators, trained in the methodology and tool, lead a facility team through an initial exercise and their continuous quality improvement efforts. COPE for CS is designed to be easy to implement without outside technical assistance and is adaptable for local facility contexts. The site facilitators, in coordination with facility leadership, determine the length of their initial COPE for CS facility exercise based on workflow at the facility, usually consisting of several hours in the late afternoon when client flow is slower over two to three days. In larger facilities, staff from multiple departments are asked to join the exercise, including those outside family planning and/or logistics management. For example, facilities are encouraged to invite cleaning personnel who use supplies for infection prevention and guards who may be a client's first point of contact at the facility. In smaller facilities, the entire staff may work on COPE for CS together for the duration of the work. Whatever staff configuration is chosen, the goal is to have teams with first-hand operational experience with different types of challenges within the facility and who want to participate in problem solving with their colleagues, generating a shared sense of ownership for results 19 .

Self-assessment guides
The COPE for CS tool includes 10 self-assessment guides containing a series of questions, based on international standards, regarding the quality of services, systems, and procedures (see Box 3) 19 . Site facilitators and teams review the guides during the initial exercise and can complete individual assessments as a team, in small groups, pairs, or as individuals, depending on their preference and the staff participating. Each guide includes instructions on which type of staff, by function, is best placed to respond to its questions. The guides are designed to be flexible and adaptable to the facility team's needs. Staff can write in issues that are not directly raised by a guide and are relevant, and can choose to skip questions they do not find relevant to their context. In addition, the team is not required to complete all 10 assessments at one time; rather, they can prioritize which guides to use and at which points in their exercises/process. After completing guide(s), the facility team reviews and identifies issues at the site as revealed by assessment questions.

Action plans
Following discussion of any one of the self-assessment guides, facility teams develop an action plan to consolidate and prioritize recommendations. Action plans identify problems related to CS; identify the root cause(s) of each problem; propose action items that are realistic, measurable, attainable, and address the root cause(s); assign an individual facility staff member responsibility for each action item; ensure a time-bound goal for completion of each action item; and provide space to comment on action item status and result. Box 4 shows an excerpt from a study facility's action plan.

Box 2 (continued)
• August-October 2015: EngenderHealth project staff conducted follow up visits at both preliminary and scale-up sites to report on progress, troubleshoot, and document for broader learning.
• June 2015-January 2016: District Health Management Team staff agreed to incorporate discussion of COPE for CS action plan progress into their regular supervisory visits.

COPE for CS emphasizes targeting problems with root causes at the facility level and within facility control. Facilities are also encouraged to include action items that address problems at the district level, if teams can identify a pathway through which the facility may affect change. For example, a facility requiring more trained staff in logistics management, or more trained providers in FP provision, may advocate to district-level management to assign additional personnel to the facility in need. National problems identified are not recommended for inclusion as facility action items.
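To make the action plan structure concrete, the following is a minimal sketch of how one action item with the fields described above could be recorded. The class name, field names, helper function, and the example entry are all hypothetical illustrations; they are not taken from the COPE for CS tool or from any study facility's plan.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class ActionItem:
    """One row of a facility action plan (hypothetical field names)."""
    problem: str            # CS-related problem identified by the team
    root_cause: str         # root cause of the problem
    action: str             # realistic, measurable, attainable action
    responsible: str        # individual staff member assigned
    due: date               # time-bound goal for completion
    status: str = "not initiated"           # comment on status and result
    within_facility_control: bool = True    # facility-level vs. district-level


def missing_fields(item: ActionItem) -> list:
    """Return the names of required text fields that were left blank."""
    required = ("problem", "root_cause", "action", "responsible")
    return [name for name in required if not getattr(item, name).strip()]


# Illustrative entry only; not an excerpt from a real action plan.
example = ActionItem(
    problem="Stock cards not updated after each issue",
    root_cause="No staff member assigned to maintain stock cards",
    action="Assign one staff member to update stock cards weekly",
    responsible="Facility pharmacy assistant",
    due=date(2015, 9, 30),
)
print(missing_fields(example))
```

A check like `missing_fields` mirrors the idea that each item should name a problem, a root cause, an action, and a responsible person before the plan is posted.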

COPE for CS committees
In addition to completing self-assessment guides and developing action plans, facility teams form a COPE for CS Committee to ensure follow-up and monitor action plan progress. Opportunities to discuss action plan progress include: regular staff meetings, special committee meetings, and during district supervision. COPE for CS Committees may decide to conduct additional full team COPE for CS exercises and continue to complete self-assessments as new issues arise. The COPE for CS job aid is another resource for staff to revisit key self-assessment issues in an abbreviated manner.
Committee members are encouraged to post the facility's action plan in a visible place where all staff and the public can see it to show the site is dedicated to quality improvement, to encourage accountability and transparency, and to monitor progress against goals. Committees and facility leadership are encouraged to share their action plans with local stakeholders, health advisory committees, politicians, implementing partners and community organizations, as an advocacy tool to request assistance and resources.

Intervention follow-up and support
The introduction and scale-up projects supported training of trainers, site facilitator trainings, facility exercises, and limited follow up visits, to check on action plan progress and provide space for COPE for CS Committees or facility leadership to ask questions about the methodology or seek implementation guidance as needed.
However, the intervention did not include additional inputs to improve contraceptive security at the 60 sites. The COPE for CS initiative purposefully did not provide technical assistance for clinical or logistics training, for example, nor for the procurement of contraceptives or required equipment and supplies. The idea is for facility staff to look for local solutions. The 10 supported districts received varying levels of support for health services, including FP, from other multilateral agencies and partners. Following COPE for CS exercises and action plan development, the facilities may clearly articulate their needs to district leadership who coordinate donor funding in the decentralized Malawian healthcare system.

Outcomes of interest/research questions
The overall objective of this study is to assess whether improvements in facility performance on contraceptive security indicators are associated with participation in the COPE for CS intervention by comparing baseline and endline performance levels. The primary sub-objectives are to: 1) describe implementation characteristics of intervention components; 2) assess whether facilities achieve key intervention-related outputs; and 3) examine changes in performance according to intermediate and ultimate outcomes, in particular, how implementation of the intervention relates to changes in performance, and whether any improvements in performance observed are sustainable over time. Figure 1 presents a logical framework that illustrates how the programmatic components fit together with the research questions of interest.

Data collection
Design and sampling
We obtained data for this analysis through facility surveys designed to collect facility-level data on performance related to contraceptive security. Data collection occurred in three waves as described in Figure 2. For all facilities, baseline data collection occurred before the start of the intervention (in 2014 for preliminary facilities and 2015 for scale-up facilities). Data were also collected at preliminary facilities in 2015 as a midline assessment. Endline data collection occurred for all sites in 2016. In the 18 preliminary facilities, the COPE for CS intervention was introduced between the 2014 and 2015 assessments and continuously implemented through 2016. In the scale-up facilities, the COPE for CS intervention was introduced between the 2015 and 2016 assessment points.

Survey administration and data management
We used a standardized facility questionnaire to assess each of the dimensions of contraceptive security, as defined by COPE ® for contraceptive security: An assessment guide (RESPOND, 2013). These dimensions included:
EngenderHealth staff trained the data collection team on survey administration. For each of the survey rounds, a one-day data collector training was held during which data collectors reviewed the questionnaire and staff provided instructions on its administration. The training also covered, and assessed understanding of, standard precautions for protecting human subjects, and included role-playing exercises.
Data collectors conducted the facility survey in English using a paper questionnaire. Interviewers identified the Facility In-Charge or their designate in each of the facilities to obtain permission to conduct the assessments. Following informed consent procedures, they administered the facility questionnaire. The facility survey tools are included as extended data to this paper 37 .
One data collector implemented the facility assessment at each facility. At the conclusion of each assessment, data collectors sent the completed questionnaire to EngenderHealth project staff, who reviewed the questionnaire to ensure it was complete.
A project staff member trained in appropriate coding and data entry techniques entered data from the physical questionnaires into a Microsoft Excel (Microsoft Corporation, Redmond, Washington, USA) database. A project manager then reviewed data entry, comparing the physical questionnaires to the database and documenting any discrepancies and subsequent discussion and resolution.

Protection of human subjects
The 2014, 2015 and 2016 survey protocols received ethical approval from an EngenderHealth review board. External review was not obtained given that the research did not meet the threshold to qualify as research conducted among human subjects. The data were primarily collected to inform program decision making and questions were not focused on individual perspectives; rather, they focused on health facility capacity, staffing, stock and related issues 27 . Data collectors conducted facility assessments only after administering a standard informed consent form. We employed standard measures to maintain confidentiality and anonymity for the facility staff respondent. All information collected was strictly confidential and used only for study purposes. Respondent names were not stored with the final clean data.

Description of variables
We identified variables included in the analysis according to the components of the logic model presented in Figure 1.

Inputs
Input variables include those related to how the intervention was implemented, such as the number of days spent on exercises, frequency of group discussion of the COPE for CS action plans, and the number of COPE for CS committee meetings in each facility.

Outputs
We developed variables pertaining to action plan quality, content, implementation, and commitment. Project staff assessed action plan quality according to a scoring rubric developed a priori (see extended data 28 ) according to several quality dimensions including: whether the problems were clearly identified, whether the action plan identified root causes, whether the action plans offered attainable/realistic solutions, whether individuals were assigned responsibility, and whether items that were identified were time bound. Staff conducted action plan content mapping to determine whether a facility identified items relating to staffing, LMIS, procurement/requisition, inventory control procedures/receiving supplies, warehousing and storage, transport and distribution, financing/budgeting, planning, and medical equipment/instruments/expendable supplies. Box 5 further details the development of variables related to action plan quality.
We measured commitment to COPE for CS action plans by the frequency of group discussion of the action plan and the number of COPE for CS committee meetings. Project staff assessed completion of items in a facility's action plan based on facility reporting. After reviewing reported plan updates, project staff calculated the number of action plan items that were not initiated, initiated but not yet complete, and completed at endline.
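The completion measure described above amounts to a tally of item statuses per facility. A minimal sketch follows, assuming each action item carries one of three status labels; the label strings, the function name, and the example data are hypothetical and are not drawn from the study's data files.

```python
from collections import Counter

# Hypothetical status labels matching the three categories in the text.
STATUSES = ("not initiated", "initiated", "completed")


def completion_summary(item_statuses):
    """Return the percentage of action items in each status category."""
    if not item_statuses:
        raise ValueError("facility has no action items")
    counts = Counter(item_statuses)
    unknown = set(counts) - set(STATUSES)
    if unknown:
        raise ValueError(f"unrecognized status labels: {sorted(unknown)}")
    total = len(item_statuses)
    return {status: 100.0 * counts[status] / total for status in STATUSES}


# Illustrative facility with four action items at endline.
summary = completion_summary(
    ["completed", "completed", "initiated", "not initiated"]
)
print(summary)
```

Expressing completion as a percentage of each facility's own items keeps facilities with long and short action plans comparable.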

Outcomes
Outcome variables are divided between intermediate and ultimate outcomes. Intermediate outcomes relate to changes in facility performance according to staffing, LMIS, supplies/equipment, storage, and procurement, and were developed based on existing literature and expert consultation 20,29,30 . We developed a detailed composite score in order to assess changes in facility performance (Box 6). Ultimate outcomes pertain to performance measures that may be indicative of improvement in overall facility performance. These indicators were adapted from the RHSC's harmonized list of CS indicators, including number of emergency orders for contraceptives that a facility placed in the three months prior to assessment and the number of contraceptive methods available at a facility on the day of the survey 31,32 . We defined method availability as whether the facility had the commodity in stock, all required method-specific equipment, and a provider who is trained in provision (and removal, if applicable) of a given method.
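The method availability definition is a conjunction of three conditions, which the sketch below makes explicit. The dictionary keys, function names, and the example facility are hypothetical; the study questionnaire may record these elements differently.

```python
def method_available(in_stock, has_equipment, trained_provider):
    """A method counts as available only if all three elements are present."""
    return bool(in_stock and has_equipment and trained_provider)


def available_methods(facility):
    """List the methods a facility can offer on the day of assessment."""
    return sorted(
        method
        for method, elements in facility.items()
        if method_available(
            elements["in_stock"],
            elements["has_equipment"],
            elements["trained_provider"],
        )
    )


# Illustrative facility: the IUD is in stock and a provider is trained,
# but required equipment is missing, so the IUD is not counted as available.
facility = {
    "implant": {"in_stock": True, "has_equipment": True, "trained_provider": True},
    "IUD": {"in_stock": True, "has_equipment": False, "trained_provider": True},
    "injectable": {"in_stock": True, "has_equipment": True, "trained_provider": True},
}
print(available_methods(facility))
```

Under this definition, a facility with a commodity on the shelf can still fail to offer the method, which is why stock levels alone understate gaps in contraceptive security.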

Confounding variables
Because confounding variables may influence both the implementation of the intervention and the outcomes of interest, we stratified results by important facility-level variables for which data were available, including facility location (urban/rural), facility type (health center or hospital), region, and baseline facility performance.

Data analysis
We analyzed the quantitative data using Stata ® v14.0 33 and produced graphics using the ggplot2 package for R 34 . We present descriptive statistics for the aforementioned variables of interest, stratified by possible confounding variables. We report means and medians for continuous variables and proportions for dichotomous and categorical variables.
We stratified results by key confounding variables, including baseline performance. We considered facilities that had baseline performance scores over 90% as having limited room for improvement between baseline and endline. We constructed a dichotomous baseline performance variable to stratify facilities that performed below/above the 90% threshold. We used t-tests to assess differences between groups for continuous variables, chi-2 tests to assess differences in binary variables, and ANOVA to assess differences in continuous outcomes between categorical variables with three or more categories.

Box 5. Development of output variables related to quality and content
Action Plan Quality: Two individuals initially scored a random sample of 10% of the action plans on whether the action plan clearly identified problems, identified root causes, offered attainable/realistic solutions, assigned responsibility to individuals, and assigned completion deadlines. The two individuals then discussed discrepancies and reached consensus in using the scoring rubric. One individual then scored the remaining action plans. Each quality dimension was scored on a scale from 0-4 (0 being poor quality).
Content Mapping: Project staff reviewed and coded action plans according to whether they included items regarding staffing, LMIS, procurement/requisition, inventory control procedures/receiving supplies, warehousing and storage, transport and distribution, financing/budgeting, planning, and medical equipment/instruments/expendable supplies. Two individuals reviewed five of the action plans for consistency and reached consensus on any discrepancies in coding. One individual continued coding the remaining action plans. We then streamlined the areas identified in the content mapping exercise according to the CS performance dimensions. If a facility identified at least one item in a CS performance dimension, then we considered it a priority area for that facility.

Box 6. Development of the CS composite score
We developed a composite score based on a facility's performance in relation to a list of questions in the facility questionnaires on staffing, LMIS, supplies/equipment, storage, and procurement. We determined the content of each dimension based on existing CS literature, the RHSC website, and toolkits. For example, the composite score for storage consists of 15 questions pertaining to a facility's performance on storage conditions, such as whether stock is properly labeled, products are stored away from direct sunlight, the storeroom is clean and free of trash, products are not stacked too high or close together, products are organized according to expiry date, etc.
A detailed explanation of scoring for each CS performance dimension, as well as the set of items included for each dimension, is available in Supplemental File 4.
We assigned a specific number of points to each item and then calculated and normalized scores for each individual CS performance dimension. We also calculated a total normalized score of up to 100 possible points by summing the scores in each performance dimension. Facilities' scores were not penalized if they were not required to provide a certain method as per national service delivery guidelines.

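The composite scoring described in Box 6 can be illustrated with a short sketch. It assumes, purely for illustration, that the five dimensions carry equal weight (20 points each) and that items a facility is not required to provide are dropped from both numerator and denominator; the actual point values and weights are those in the supplemental scoring file, not the ones shown here.

```python
DIMENSIONS = ("staffing", "LMIS", "supplies/equipment", "storage", "procurement")

# Assumption for illustration: each dimension contributes equally to 100 points.
MAX_PER_DIMENSION = 100.0 / len(DIMENSIONS)


def dimension_score(items):
    """Normalize one dimension.

    items: list of (points_earned, points_possible, applicable) tuples.
    Non-applicable items are excluded so the facility is not penalized.
    """
    applicable = [(earned, possible) for earned, possible, ok in items if ok]
    possible_total = sum(possible for _, possible in applicable)
    if possible_total == 0:
        raise ValueError("no applicable items in this dimension")
    earned_total = sum(earned for earned, _ in applicable)
    return MAX_PER_DIMENSION * earned_total / possible_total


def total_score(facility_items):
    """Sum the normalized dimension scores into a total out of 100."""
    return sum(dimension_score(facility_items[d]) for d in DIMENSIONS)


# Illustrative facility: full marks everywhere except storage, where one of
# two applicable items is met and a third item does not apply.
facility_items = {d: [(1, 1, True), (2, 2, True)] for d in DIMENSIONS}
facility_items["storage"] = [(1, 1, True), (0, 1, True), (0, 1, False)]
print(dimension_score(facility_items["storage"]), total_score(facility_items))
```

Excluding non-applicable items from the denominator is one straightforward way to implement the "not penalized" rule; other normalizations are possible.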
To assess changes in proportions between baseline and endline, we used McNemar's chi-2 paired tests of proportions. We also used simple linear (for continuous outcomes) and logistic (for binary outcomes) regression analysis to assess changes between baseline and endline according to key variables of interests.
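For readers unfamiliar with McNemar's test, the sketch below shows the calculation for paired baseline and endline observations on the same facilities. Only the discordant pairs (facilities that changed status) enter the statistic. The version shown omits a continuity correction, which is an assumption on our part, and the counts are hypothetical rather than study data.

```python
import math


def mcnemar_chi2(gained, lost):
    """McNemar's chi-squared test for paired binary outcomes.

    gained: facilities without the attribute at baseline but with it at endline
    lost:   facilities with the attribute at baseline but without it at endline
    Returns (statistic, p_value) on 1 degree of freedom, without a
    continuity correction (an assumption of this sketch).
    """
    discordant = gained + lost
    if discordant == 0:
        raise ValueError("no discordant pairs; the test is undefined")
    statistic = (gained - lost) ** 2 / discordant
    # Upper tail of a chi-squared distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(statistic / 2.0))
    return statistic, p_value


# Hypothetical counts: 12 facilities gained the capability, 2 lost it.
stat, p = mcnemar_chi2(gained=12, lost=2)
print(round(stat, 2), round(p, 4))
```

With small numbers of discordant pairs, as in the hospital subgroup, an exact binomial version of the test is often preferred over the chi-squared approximation.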
As discussed above, a facility's baseline measurement is the first assessment (2014 for the 18 preliminary facilities and 2015 for the 42 scale-up facilities) and a facility's endline assessment is the last assessment conducted (2016 for all facilities). For the 18 preliminary facilities that had a midline assessment conducted in 2015, we conducted an additional sub-analysis to examine CS performance trends over time at all assessment points.

Results
Table 1 presents a description of the facility characteristics and details on the types of intervention-related characteristics, by facility type. Hospitals had a greater median number of staff (n=376) than health centers (n=27). Health centers were almost universally located in rural areas (97.6%), compared with 45.5% of hospitals.
As shown in Table 2, health centers and hospitals implemented COPE for CS similarly with regard to the number of days used for the initial exercise (3 days) and the number of CS Committee meetings per facility (5 meetings). Similarly, most facilities had monthly group discussions of the COPE for CS action plan (health centers, 65.3%; hospitals, 63.6%).
The majority of facilities implemented the initial exercise in three days (75.6% of facilities), while 6.7% used less than three days and 18.7% used more than three days (data not shown). All facilities established a COPE for CS committee and all posted their action plan in a visible space within the facilities (data not shown).
Baseline facility performance according to the five CS dimensions appeared similar across facility type, location, and region.
On average, hospitals completed 20.1% more of their action plan items than health centers (95% CI: 4.0%, 36.3%; p=0.016) (data not shown). Health centers also initiated fewer items than hospitals. Simple linear regression analysis revealed no significant differences according to location (rural/urban) and region in the overall quality score of the action plans or the progress made in completing action items.
Results suggest that the frequency of group discussion of the COPE for CS action plan is associated with the facility's completion of action plan items. Simple linear regression analysis (not shown) found that facilities that reported having group discussion of their action plans more than once per month completed on average 21.5% (95% CI: 4.0%, 39.0%; p=0.017) more items than those that discussed the plans either monthly or less than once per month. No significant associations were found between the number of official COPE for CS committee meetings reported or the length of time spent on the initial exercise and the percentage of items completed and/or started.
Table 5 shows whether priority identification in action plans was consistent with areas of low performance. Overall, facilities did not consistently identify areas of low performance as priorities in their action plans. Only 60.6% of low performing dimensions were identified as priorities. Health centers were significantly more likely than hospitals to identify low performing dimensions, 63.2% (95% CI: 57.8-68.7) versus 49.1% (95% CI: 32.7-65.4, p<0.05). Similarly, rural facilities identified a greater percentage of low performing dimensions as priorities when compared to urban facilities (p<0.05). Fewer than one in three (29.2%) health centers with low performance in procurement and only 36.6% of hospitals with low performance in equipment identified the respective dimensions as priorities. We found no significant associations between the percent of low performing dimensions identified as priorities and the length of time used for the initial COPE for CS exercise.
Figure 3 compares mean baseline and endline performance scores by CS dimension. Overall, the change in scores suggests improvement between baseline and endline in each individual CS dimension as well as total performance score. This holds true when results are disaggregated by facility type.
The improvements between baseline and endline are most pronounced when results are presented separately for low performing facilities. The improvements in low performing health centers reach statistical significance in several areas, including staff capacity, LMIS, equipment, and in total score. Among hospitals, we observe improvement that suggests positive change, although the results do not reach statistical significance.
Table 6 compares baseline and endline performance with regard to FP commodity stock levels, method-specific trained providers, required equipment/supplies for specific methods, and placement of emergency orders. Health centers generally show some improvement in the types of contraceptive commodities in stock on the day of the assessment (except for a slight decrease in the percentage of health centers that have progestin-only pills and female condoms), although none of the differences are large enough to reach statistical significance. Hospitals perform very well at both baseline and endline, and show an increase in the percentage with CycleBeads in stock, though the increase is not statistically significant at the 0.05 level (p=0.08).
The percentage of health centers with at least five modern methods in stock on the day of the assessment increased slightly (from 87.8%; 95% CI: 75.2-95.3 to 91.8%; 95% CI: 80.3-97.7), though the increase was not statistically significant. All hospitals at both baseline and endline were found to have at least five modern methods in stock. The vast majority of health centers (>90%) had providers trained in general family planning and implant insertion/removal, while all hospitals had providers trained in these skills. The percentage of health centers with a provider trained in intrauterine device (IUD) insertion/removal increased significantly between baseline and endline by nearly 30 percentage points (p<0.05), while there was a slight decrease (not significant) in IUD providers at hospitals. At endline, just over 80% of hospitals had a trained provider in IUD insertion and removal. There was no change observed in the percentage of hospitals with a provider trained in male or female sterilization (45.0%; 95% CI: 16.7-76.6, and 81.8%; 95% CI: 48.2-97.7, respectively).
When examining whether a facility had both a trained provider and all of the essential equipment and supplies (including the contraceptive) to provide a method, the results indicate significant improvements between baseline and endline. The percentage of health centers with everything needed to insert and remove an implant increased significantly between baseline and endline, especially the percentage of health centers able to offer implant removal (from 26.5%; 95% CI: 14.9-41.1 to 77.6%; 95% CI: 63.4-88.2, p<0.001). Both health centers (from 0.0%; 95% CI: 0.0-7.3 to 10.2%; 95% CI: 3.4-22.2, p<0.05) and hospitals (from 45.5%; 95% CI: 16.7-76.7 to 90.9%; 95% CI: 58.7-99.8, p<0.05) showed significant improvement in the percentage of facilities able to insert an IUD. Similar positive trends were found for removals. Hospitals also improved in their ability to offer female sterilizations (27.2%; 95% CI: 6.0-61.0 to 63.6%; 95% CI: 30.7-89.1, p<0.05) and male sterilizations, although the latter was not statistically significant. Finally, the overall percentage of facilities placing emergency orders during the three months prior to the assessment showed a decreasing but non-significant trend among hospitals, while the decrease was significant among health centers (from 69.2%; 95% CI: 54.6-81.7 to 36.7%; 95% CI: 23.4-51.7; p<0.001).
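For facility counts this small, confidence intervals like those above are often computed with the exact binomial (Clopper-Pearson) method. The sketch below is illustrative only: the interval method is an assumption rather than something stated in this section, and the counts used in the usage lines (0 of 49 health centers, 10 of 11 hospitals) are inferred from the reported percentages and the 60-facility sample, not taken from the study data.

```python
from math import comb


def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


def _bisect_increasing(f, target, iters=100):
    """Find p in [0, 1] with f(p) = target, for f increasing in p."""
    lo, hi = 0.0, 1.0
    for _ in range(iters):
        mid = (lo + hi) / 2
        if f(mid) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def clopper_pearson(x, n, alpha=0.05):
    """Exact two-sided (1 - alpha) interval for the proportion x / n."""
    # Lower bound: p at which P(X >= x) equals alpha / 2 (0 when x == 0).
    lower = 0.0 if x == 0 else _bisect_increasing(
        lambda p: 1 - binom_cdf(x - 1, n, p), alpha / 2)
    # Upper bound: p at which P(X <= x) equals alpha / 2 (1 when x == n).
    upper = 1.0 if x == n else _bisect_increasing(
        lambda p: 1 - binom_cdf(x, n, p), 1 - alpha / 2)
    return lower, upper


# Assumed counts, inferred from the reported percentages (hypothetical).
for label, x, n in [("health centers, IUD insertion, baseline", 0, 49),
                    ("hospitals, IUD insertion, endline", 10, 11)]:
    lo, hi = clopper_pearson(x, n)
    print(f"{label}: {100 * x / n:.1f}% (95% CI: {100 * lo:.1f}-{100 * hi:.1f})")
```

Under these assumed counts the intervals come out close to the reported ones (an upper bound of roughly 7.3% for 0 of 49, and roughly 58.7-99.8% for 10 of 11), which is consistent with, but does not confirm, the use of an exact method.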
Table 7 examines in more detail the extent to which emergency orders were placed in the three months prior to the assessment. The results suggest an overall reduction in the placement of emergency orders across all variables related to facility characteristics (facility type, region, and location), though the only statistically significant reduction was in the Southern region (from 1.3; 95% CI: 0.41-1.85 to 0.4; 95% CI: -0.0-0.8, on average; p<0.05). In terms of intervention characteristics, the mean number of emergency orders decreased overall (from 1.2; 95% CI: 0.7-1.8 to 0.6; 95% CI: 0.2-1.0, p=0.09), but the decrease was only significant among facilities that had group discussion of their action plans more than once per month (from 1.3 to 0.0 on average; p=0.01). Overall, among the 37 facilities with complete data on emergency orders at both baseline and endline, there was a mean reduction of 0.8 orders over the last three months (p=0.09); while not statistically significant, this is a programmatically meaningful decrease. The sample size for this question is smaller because 18 facility respondents at baseline and 4 at endline reported not knowing the number of emergency orders placed in their facilities; these facilities are not included in this calculation to avoid missing data bias.
Figure 4 shows the results of a sub-analysis to assess changes in performance scores by CS dimension between 2014 and 2016 for the 18 preliminary facilities, and between 2015 and 2016 for the scale-up facilities. The chart stratifies by baseline performance level (lower performing facilities and all facilities combined). Generally, the results show that performance continues to improve across most dimensions nearly two years after initial implementation of the intervention, except in the areas of stock and storage. We observed the most pronounced improvements in equipment in the initial facilities between 2015 and 2016.

Discussion
The results highlight four important findings related both to the implementation of the COPE for CS intervention and to how the intervention may relate to changes in facility performance across the dimensions of contraceptive security.

Facilities overall maintained fidelity to the intervention
The data available on implementation provide important insight on which components worked well and which could improve in the future. The majority of facilities implemented the intervention according to plan, meaning they held exercises, developed action plans, displayed action plans in a visible location in the facilities, assembled CS committees, and had broader group discussions among their staff. Despite the leeway given to facilities to implement in the way that they thought to be most helpful, we observed little variation in implementation. For example, nearly all facilities held the initial exercise in three days. Additionally, the facility action plans tended to be of high quality based on the a priori rubric developed to evaluate them. Given the limited variation observed in quality, however, it is difficult to assess the relationship between the quality of a facility's action plan and changes in CS performance. Future research may build on the quality rubric presented here to be more discerning. One issue to note is that the quality rubric does not consider whether a facility distinguished areas of low baseline performance and specifically identified action plan items related to those areas.

Table 7. Mean emergency orders placed in the three months prior to assessments, by facility and intervention characteristics (n=37).

The data also point to areas for future implementation research and possible improvement. Facilities identified few action plan items in the domains of procurement and equipment, despite low performance in these areas. It is possible that facilities determined that, to realize results, progress in procurement and equipment would require a longer-term advocacy strategy more dependent on the system as a whole and outside of the control of the facility itself. In the future, working with facilities to better identify actionable areas of the local system in these domains, and supporting sites with advocacy approaches for local financing and community participation with in-kind donations, may be an area for improvement in the COPE for CS exercise and action plan implementation.
The COPE for CS intervention is associated with improvements in facility CS performance
Facility performance improved as a whole across CS dimensions between baseline and endline. Given the nature of the intervention, these results suggest that facilities successfully advocated for resources and other inputs needed to improve performance, perhaps including trainings or additional staff assigned to the facility. The data show large improvements in the technical capacity of facility staff, especially in health centers. Additionally, health centers developed staff capacity in IUDs, which goes above and beyond the minimum service requirements for that type of facility. Of note, the projects did not provide trainings or other support to facilities aside from the initial COPE for CS exercise and a brief follow up visit a few months later to discuss action plan progress. Therefore, the changes observed were driven from within each facility.
Additionally, the findings suggest that after participation in the COPE for CS intervention, facilities were better able to accommodate a wide variety of client needs. By endline a large percentage of facilities met requirements to serve clients selecting a new method, or switching from or discontinuing a method requiring removal, as discussed in Table 6. Also, the decreases observed in the number of emergency orders placed may indicate that the CS system within facilities improved as a whole.
Higher levels of staff commitment to the intervention appear to be associated with greater CS improvements
More frequent staff discussion of action plans (a proxy for staff commitment) is associated with improved outputs and outcomes, in particular with action item completion. As staff commitment to and engagement in facility quality improvement are a key underpinning of the COPE for CS methodology, this finding provides a proof of concept that staff action is a key mechanism of action. Interestingly, we found no evidence for association between the number of official COPE for CS committee meetings and outputs/outcomes. However, any association between committee meetings and performance improvement is difficult to assess, as there was little variation among facilities in the number of these meetings, and more data on meeting frequency and timing are needed.
The improvements observed tend to be sustained two years after the initial COPE for CS exercise in the 18 preliminary facilities
This is an important finding regarding sustainability of the intervention. The sub-analysis of the baseline, midline, and endline using data from the preliminary facilities suggests that improvements tend to be sustained over time (measured up to two years post-initiation), despite there being very limited additional investment by the projects. Of note, the preliminary facilities did hold another project-supported facility exercise at the same time that the scale-up sites held their initial exercises. However, the data do not suggest that holding another formal exercise had a major impact on facility performance. Figure 4 shows that preliminary sites achieved their largest gains between 2014 and 2015, with minimal gains after that point.
An interesting observation in assessing the longer-term data is that there appears to be a slight lag in the improvement of equipment scores. While it is impossible to rule out the potential influence of the second supported exercise, it is possible that equipment performance takes more time to improve. Changes in this dimension may require larger investment, action at higher levels of the health system that may be difficult for a facility to influence, and/or longer-term advocacy and planning. Future research could examine sustainability of the intervention in a more rigorous way, and explore whether there is a need for additional supported facility exercises.

Limitations
While this study offers both important data and programmatic reflection on implementation of the COPE for CS intervention, there are several limitations worth noting. First, this study was designed within a programmatic setting where implementation decisions were made in conjunction with local priorities and realities, and not just from a research perspective. Within this context, study districts and facilities received varying levels of support for health services, including FP, from other multilateral agencies and partners. As a result, we cannot assess attribution or make any causal claims. As there is no control group and assignment to the intervention was non-random, a variety of influences not related to the intervention itself could have led to secular changes in performance. Additionally, given the relatively small sample size of 60 facilities in 10 districts, there is not sufficient power to analyze the data using more robust statistical methods; as a result, we were limited to using descriptive analysis and simple regression. We did not adjust for multiple comparisons given the small sample size, the number of comparisons made, and the exploratory nature of this study, a decision that is supported in the statistical literature 35,36. Finally, the implementation period was rather short, and there may not have been adequate time for the output and outcome measures used in this analysis to register change. However, despite these limitations, there appears to be an association between some of the intervention components (in particular, staff commitment) and key outputs/outcomes along hypothesized mechanisms, which lends some support to the intervention being at least partially responsible for the results. Given this, the results of this study warrant more rigorous future evaluation of this intervention to assess causality.
While several important variables relating to facility-level characteristics are included in the analysis, there remains the potential for unmeasured variables (such as those identified in Figure 1) to confound the relationship between the action plan and the performance score. While stratification according to baseline performance offers important insight on how the intervention may influence performance in low-performing facilities, we do not adjust for these variables. Additionally, it is possible that the results are influenced by regression to the mean between the baseline and endline measurements (i.e. the tendency of outliers to revert back to mean levels of a variable over time). However, given that the results are consistent according to all performance measures, it does not appear that regression to the mean is the primary driver of the observed results.
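To make the regression-to-the-mean concern concrete, the following minimal simulation sketches how facilities selected for low baseline scores can appear to improve even when nothing changes. Everything in it is hypothetical: the score distribution, the noise level, the bottom-quarter cutoff, and the function name `simulate_rtm` are illustrative assumptions and do not correspond to the study data or scoring method.

```python
import random
from statistics import mean


def simulate_rtm(n_facilities=5000, true_sd=5.0, noise_sd=5.0, seed=0):
    """Mean score change for 'low performers' and for all facilities
    when true performance is constant and only measurement noise varies."""
    rng = random.Random(seed)
    # Each hypothetical facility has a fixed true score; no intervention effect.
    true_scores = [rng.gauss(80.0, true_sd) for _ in range(n_facilities)]
    baseline = [t + rng.gauss(0.0, noise_sd) for t in true_scores]
    endline = [t + rng.gauss(0.0, noise_sd) for t in true_scores]
    # Stratify on the noisy baseline: the bottom quarter is "low performing".
    cutoff = sorted(baseline)[n_facilities // 4]
    low = [i for i, b in enumerate(baseline) if b < cutoff]
    change_low = mean(endline[i] - baseline[i] for i in low)
    change_all = mean(e - b for b, e in zip(baseline, endline))
    return change_low, change_all


change_low, change_all = simulate_rtm()
print(f"low performers: {change_low:+.2f}, all facilities: {change_all:+.2f}")
```

In this toy setup the low-baseline group shows a positive average change while the full sample shows essentially none, which is why consistency across measures, or a comparison group, is needed to separate a real effect from this artifact.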
In several cases, we constructed key variables of interest ourselves as there was no gold standard measurement available. However, in these cases, we were as rigorous and transparent as possible. For example, as measuring quality of the action plans is inherently somewhat subjective, we developed our measure a priori, so as not to be influenced by the content of action plans, and also used two coders and assessed inter-rater reliability to ensure that the measure was consistent. Additionally, in order to ensure content validity in our measure of CS, we drew heavily on the existing literature, standardized indicators, and expert opinion to develop the composite scores that we used. While we attempted to use standardized indicators as much as possible, some of the standardized CS indicators were not published until after baseline study implementation 31; thus, we adapted these measures as best as possible to the available data. Despite these attempts to develop a robust measure, there remains the possibility that the measures do not completely represent all domains of facility CS performance. Future work in developing and validating ways to measure performance across CS dimensions at the facility level would be a great contribution to the field.
Finally, an important component of the hypothesized mechanism of action underlying the COPE for CS intervention is engaging the broader community and local leaders in the implementation of the action plans. The facility assessments did not include questions on community engagement; however, a separate analysis of key informant interviews collected from stakeholders participating in the COPE for CS intervention indicates that strong community engagement through advisory committees was essential to the success of the intervention 37. As this is considered to be an important aspect of the success of the intervention, community engagement should be measured during future implementation of COPE for CS.

Conclusions
The results of this study suggest that COPE for CS may be an effective intervention to improve contraceptive security at last mile health facilities. Given the dearth of research and programmatic experience in this area, this study provides important preliminary programmatic experience, research insights, and lessons learned upon which future research and programs attempting to address this important but often underprioritized area of contraceptive security can build.

The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Data availability
Secular trends may account for most of the changes seen, since this corresponds to a period of rapidly increasing CPR in the country.
In terms of strength of implementation, only frequency of meetings was associated with differences in improvement; this may reflect underlying attitudes of the staff and contexts in the facility rather than the intervention leading to improvement.
Contraceptive security at "last mile" facilities is enormously important, but often overshadowed by the issues with CS at national level. However, improvements at national level may or may not lead to equal improvements at local levels, and it is essential to understand why and what could be done to address it. While the paper presents information on this important topic and has the potential to contribute to the body of knowledge on contraceptive security and the use of the COPE CS method, in its current form, the paper has several important limitations/issues that need to be addressed before further consideration. The principal problem with regard to the current manuscript is that the aim it purports to achieve (to assess the association between changes in performance indicators and participation in the COPE intervention) is not possible with the study design. There is no variation in participation: all facilities participated in the intervention; thus, the only thing that can be examined with regard to performance indicators is change over time, acknowledging that whether or not the change is related to the COPE intervention is impossible to determine with any degree of certainty. The single group, pre-/post-test design is subject to numerous threats to internal validity which preclude any conclusions or inferences with regard to causality. Although the authors touch on this issue tangentially in the limitations section of the discussion, it needs to better frame the entire paper. The fact that the entire paper is framed around this aim, despite not being possible, requires that the paper be refocused.
Another important issue pertains to the use and interpretation of the statistical analyses. While it is not uncommon for researchers and evaluation specialists to use tests of statistical significance on data from such study designs with small convenience samples, these tests should be considered exploratory in nature, hypothesis generating if you will. In this paper, this distinction is not noted. Also, despite employing tests of statistical significance, the authors highlight the very few relationships that are statistically significant, but also appear to discount non-significant results with statements such as "Health centers generally show some improvement in the types of contraceptive commodities in stock on the day of the assessment (except for a slight decrease in percentage of health centers that have progestin only pills and female condoms), although none of the differences are large enough to reach statistical significance." Here they appear to imply the relationship exists, but it's just not big enough to be detected. This is a misinterpretation of the results. Given the small sample size, the non-experimental nature of the design and the convenience sampling approach, it seems more reasonable to simply present their actual findings, without any statistical testing, since there is no defined population to which these results can be inferred. Presenting their results without tests of statistical significance is still interesting.
In terms of the actual analyses chosen to use on these data, the McNemar's Chi square test is appropriate for their examination of the change in proportions between timepoints; however, the authors also state "We also used simple linear (for continuous outcomes) and logistic (for binary outcomes) regression analysis to assess changes between baseline and endline according to key variables of interests." These tests would be inappropriate for these data, which are paired rather than independent; they should have used tests such as the paired t-test, McNemar's chi square or one of the other many possible tests, depending on their variables.
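As an illustration of the paired approach referred to above, the sketch below computes an exact (binomial) McNemar p-value from the two discordant counts of a baseline/endline cross-tabulation. It is a minimal sketch with hypothetical inputs: the function name `mcnemar_exact` and the example counts are assumptions for illustration and are not the authors' implementation or data.

```python
from math import comb


def mcnemar_exact(b, c):
    """Two-sided exact McNemar p-value from the discordant pair counts.

    b: facilities meeting the criterion at baseline but not at endline
    c: facilities meeting the criterion at endline but not at baseline
    Facilities with the same status at both timepoints do not enter the test.
    """
    n = b + c
    if n == 0:
        return 1.0  # no discordant pairs, so no evidence of change
    k = min(b, c)
    # Under the null, discordant pairs split 50/50: X ~ Binomial(n, 0.5).
    tail = sum(comb(n, i) for i in range(k + 1)) / 2**n
    return min(1.0, 2 * tail)


# Hypothetical example: 5 facilities gained the capability, none lost it.
print(mcnemar_exact(0, 5))
```

Only the facilities whose status changed contribute to the test, which is what distinguishes it from an unpaired comparison of the two proportions.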
Some specific concerns include: p. 3 (Program description section): The authors state: "The COPE® for Contraceptive Security methodology and tools are used by frontline health and logistics personnel to identify and implement low-cost, local solutions to address problems related to contraceptive supply." Is the COPE for CS method limited to supply issues? It seems broader based on the rest of the paper.
p.4, Box 2: The authors state "Qualitative data collection via key informant interviews noted improvements in stock management, on-time ordering, decreases in stock-outs, and improvements in collaboration between facilities and district medical stores." The wording here is a bit odd. Do you mean that through KI interviews, respondents reported they saw improvements in these things? The difference may seem small, but data collection did not note improvements (current wording).
p. 5 (Outcomes of interest/research questions): The overall objective is oddly phrased. Consider reversing (e.g. to assess whether participation in COPE is associated with improvements instead of if improvements are associated with COPE). Additionally, primary sub-objective 3 looks at "how implementation of the intervention relates to changes in performance". Wasn't implementation standardized? Does it refer to fidelity? Quality? Frequency of discussions? It is not clear why implementation would be different when there is one structured approach, which was subsequently scaled up.
p. 6: The criteria for selection into the intervention have the potential to greatly influence the outcomes. It is noted that "The MOH/DRH considered districts for inclusion if they reported stockouts in a high percentage of facilities and had previously submitted requests for assistance with contraceptive security." The potential effect these characteristics have on the observed outcomes should be discussed in the discussion section. In particular, facilities that had submitted requests for support may be more motivated to make changes than facilities that have low performance, but have not made such requests; might the intervention meet different results in such settings? Also, can you be more specific and clarify what was considered "a high percentage"? The districts where more than 30% of facilities reported stockouts? 50%? Higher?
p. 6, Fig 1: Inputs and outputs seem to be very similar. Shouldn't inputs include the elements that are required for implementation (guides/questionnaires developed, activities conducted, such as TOTs, facility exercises, number of discussions, supervisory visits, etc)? However, the inputs in Fig. 1 say "Staff develop action plans", which is technically one of the outputs. Additionally, for the first research question (Fig. 1/yellow box), the concern is the same as for primary sub-objective 3. It is not clear what "the way in which the intervention was implemented" refers to, considering the intervention was standardized. Are you asking about factors that contributed to improvements?
p. 7: In the first full paragraph, a rough timeline is provided (and in Box 2 as well), but more information on exact dates (month and year) for data collection should be presented for each wave so the reader can understand the timeline better. Additionally, note that the 2015 midline assessment for preliminary facilities is not reflected in Fig 2.
p. 7, Description of variables: Those things described in the logic model as inputs are not the same as what is noted in Figure 1. Those inputs in Figure 1, as pointed out above, would not typically be considered inputs.
p. 9, paragraph 4: the authors say: "Health facilities had significantly lower overall performance scores at baseline than hospitals (80.7 versus 87.0, p<0.01)". Would help to clarify that you refer to statistical significance. E.g. "At baseline, health facilities had lower overall performance scores than hospitals, with the difference being statistically significant". Because in actual numbers the difference was not dramatic.
Results tables overall: The way the p-values are presented in each table differs. It's not clear why, in some tables, there is a column for p-values, but no values are presented, just asterisks. If you choose to continue to report p-values, please present the actual values, as opposed to cut-off values. Also, please add numbers for facilities in Table 6, the same way as in all other tables.
p. 13: In the first full paragraph, results from Figure 3 are summarized. The authors note that "Overall, the change in scores suggests improvement between baseline and endline in each individual CS dimension as well as total performance score. This holds true when results are disaggregated by facility type." When reviewing the results presented in Figure 3, when all facilities are combined in terms of performance (top row), only one indicator shows improvement (equipment) for health centers and this holds true when combined with hospitals.
When the researchers examine only low-performing facilities, hospitals show no improvement on any of the indicators, whereas health facilities show improvement on 3 of 5.
It's unclear in the bottom row how the combined results (hospitals and health centers) mirror the health centers exactly (thus hospitals have no influence) for staff capacity, LMIS, equipment and total score; even more unusual is how the combined non-significant results for health centers and hospitals for storage and procurement go from non-significant for both to significant when combined. It seems these analyses should be verified.
p. 14-15, Table 6: A formatting issue: Delete "Facilities with the following methods in stock on the day of assessment" from the second page of the table. It seems like it was treated as a heading that must be repeated across multiple pages, but it is not the case here.
The note § § implies that availability of IUDs and implants on site is assessed only in conjunction with the presence of a trained provider at the facility. Does it mean that IUD/Implant availability is assumed in the section "Facility has a provider trained in…" on page 14? Or in the section "Facility has all required method-specific equipment AND a trained provider in…"? Or both?
The row on IUD removal on page 15 shows that at baseline, only 18.9% of hospitals had everything needed for removal, including a trained provider. At the same time, 45.5% of hospitals had everything in place for IUD insertion. Usually, these numbers are reversed (removals of IUD are more available than insertions). Is there an explanation for that? IUD removal is much easier, and the instruments/supplies required are the same as, but fewer than, those used during insertion. Any provider who is trained to insert an IUD should be able to remove it (which we cannot say about implants). Were you looking at complicated removals as well, those that usually require referral to Ob/Gyn and the use of alligator forceps? These are quite rare, but if so, it would be good to clarify as it may explain lower capacity to remove an IUD.
p. 16, Table 7: The Southern characteristic is noted as having a p-value=0.02, which would be considered significant using a cut-off of <0.05; however, the 95% CIs overlap considerably. This requires review; perhaps it's a typo and the p-value is 0.2?
One final thought for consideration by the authors. Overall, the baseline values of most measures for all facilities were quite high, leaving little room for improvement. Total performance scores at baseline were all 80% or greater on a scale going to 100%. Many of the individual dimensional scores were even higher. Improvement on these scores would not only be difficult to detect (statistically speaking) but would also call into question the real-world, practical value of efforts to make such small improvements. This is also the case for other findings. For example, while a couple of individual dimension indicators appear to have changed substantially in Figure 4, the relative importance of all changes appears inflated because of the scale used for the figure. Although the Y-axis represents a range from 0-100, only 55-100 are shown. If the values were noted for each data point, one would see that, for example, the mostly positive increases in scale-up are really very small, only a few percentage points, and given the small sample size, even the larger changes (e.g. 8 percentage points) could have been driven by change in one or two facilities. A more thorough discussion of the practical significance of these findings is warranted.

If applicable, is the statistical analysis and its interpretation appropriate? No
Are all the source data underlying the results available to ensure full reproducibility? Partly