1 Introduction

1.1 Background

Medical equipment plays an integral role in the delivery of healthcare services and is becoming increasingly critical to their efficient delivery. For this reason, healthcare facilities are expending considerable resources in procuring diagnostic and treatment devices such as linear accelerators for cancer treatment, magnetic resonance imaging (MRI) and computerized tomography (CT) scanners, and patient monitors, among other medical devices (uk and accessed on 8/10/2016). To ensure optimal availability of such equipment, prudent asset management strategies are necessary, since outage of critical devices directly affects the quality of healthcare delivery, for instance through delayed treatment that leads to the need for repeat treatment. Poor healthcare delivery resulting from equipment unavailability is associated with a high cost of treatment, with negative consequences for the quality of life of many households, especially in developing countries (Performance Audit Report of the Auditor-General Specialized Healthcare Delivery at Kenyatta National Hospital Waiting-time for Cancer, Renal and Heart Patients. Office of the Auditor General November 2012). This is especially the case for household members who need specialized treatment and who largely depend on publicly funded hospitals for subsidized treatment. For such patients, delayed treatment owing to outage of specialized equipment such as linear accelerators often worsens treatment outcomes. In extreme cases, delayed treatment increases the financial burden on patients, often forcing needy households to divert resources meant for housing or education towards treatment in costly privately funded healthcare facilities.

The high healthcare costs are primarily a concern for developing countries such as Kenya where, despite considerable progress in economic development, publicly funded hospitals continue to grapple with healthcare delivery challenges, especially access to quality diagnostic and treatment services that depend on the critical medical devices discussed above. One such challenge is that publicly funded healthcare facilities have few critical devices, which are highly utilized owing to the large number of patients seeking diagnostic and treatment services. The high utilization largely translates into high equipment unavailability owing to frequent failures, particularly in the absence of appropriate operation and maintenance strategies. Hence, to ensure that critical medical devices are safe, reliable and operating at the required performance levels, it is important for healthcare facilities to implement robust operation and maintenance strategies (Ridgway 2009).

Such strategies would ideally guide healthcare practitioners tasked with operating and maintaining critical medical devices to develop, in a structured way, appropriate operation protocols that ensure prudent use of the devices. The strategies also extend to maintenance, specifying aspects such as when to intervene prior to the occurrence of failure, or how to repair the equipment prudently so that the operational lifetime of the device is extended. Developing appropriate asset operation and maintenance strategies is also expected to maximize equipment use and decrease the total cost of ownership. This in turn is expected to translate into greater access to diagnostic and treatment services offered at Kenya’s publicly funded hospitals (Pun et al. 2002; Mkalaf et al. 2013). Examples of well-known maintenance strategies include time- or use-based maintenance (TBM/UBM), condition-based maintenance (CBM), and failure-based maintenance, where the latter is performed for non-critical equipment failures. Maintenance protocols are also a fundamental part of maintenance strategies. Prudent operation strategies, on the other hand, include the development and use of standard operating protocols for critical medical devices. Operating protocols are sets of written instructions that document routine operations performed on the critical devices, thus ensuring consistency and attainment of desired performance, and preventing inadvertent failure of the equipment owing to poor usage (United States Environmental Protection Agency Guidance for Preparing Standard Operating Procedures 2007).

Nonetheless, developing optimal or appropriate maintenance and operating strategies for critical medical devices is not straightforward. This is especially the case where hospitals lack structured and systematic methodologies for formulating such strategies (Taghipour et al. 2011). Such systematic approaches ought to consider aspects such as assessing the criticality of the equipment, where the assessment ideally draws on maintenance and operational data for decision support. In many instances, however, such data is not well structured, and thus identifying, prioritizing and mitigating recurrent critical failures is challenging. When properly structured, such data could support decisions such as strategy formulation and root cause analysis (Fouad et al. 2012). For hospitals in developing countries such as Kenya, adopting structured approaches to strategy formulation and decision support is challenging, more so in the absence of clearly structured frameworks for assessing equipment failure risks. This challenge is compounded by the lack of robust frameworks for collecting, structuring, and analyzing operation and maintenance data, which limits the extent to which equipment operators and medical practitioners are able to use such data for decision support. As a result, formulating operation and maintenance strategies is ad hoc and often reactionary, with formulated strategies seldom linked to empirical historical data collected from critical medical devices.

In addition, there is a tendency to over-rely on information from the original equipment manufacturer, at the expense of recurrent maintenance and operational issues experienced in practice during usage and maintenance of medical devices (Ridgway 2009; Khalaf et al. 2014). As a result, emergent maintenance and operational problems are seldom incorporated into the strategies adopted for managing the medical devices. In turn, sub-optimal operation and maintenance practices are often adopted, leading to recurrent failure and unavailability of critical diagnostic and medical devices. By extension, unavailability of such critical services leads to poor delivery of healthcare services, especially to underprivileged persons in society who rely on publicly funded hospitals for medical services. Hence, this study is motivated by the need to develop a structured and systematic methodology for formulating appropriate operation and maintenance strategies for medical devices. The study primarily focuses on critical diagnostic and treatment devices installed in public hospitals in resource-constrained developing countries, with Kenya as the case study.

1.2 Study aim and motivation for the research

This study is motivated by the need to develop a robust methodology for formulating maintenance and operational strategies for medical device management in Kenya. The proposed methodology leverages a risk assessment which guides practitioners in systematically identifying, analyzing and prioritizing operation and maintenance factors influencing equipment unavailability, whether due to poor usage or poor maintenance practices. The methodology considers the unique problem context of devices operated by public hospitals in Kenya, which are characterized by high utilization and the absence of decision support methodologies for formulating operation and maintenance strategies. Moreover, operation and maintenance decision support is constrained by the absence of data that could be used for formulating robust operation and maintenance protocols to mitigate the risk of failure of such critical medical devices.

The proposed methodology starts with a data standardization step, which involves structuring operation and maintenance data, from which failure modes and component failures are classified. The structuring scheme adopts intuitive criteria that practitioners are familiar with: the operation and maintenance failure modes associated with diagnostic and treatment devices are classified according to the equipment serial number, equipment type, model, the type of component and subsystem the failure originates from, the spare parts requisition linked with the failure, and whether in-house or external maintenance support is considered.

In the next step, a modified Failure Mode and Effect Analysis (FMEA) methodology is adopted where operation and maintenance-related failure modes are systematically identified, analyzed and consequently prioritized. For the prioritized failure modes, Pareto and ‘5-Whys’ analyses are performed to identify the focal root causes of the recurrent equipment failure modes identified through the FMEA. On the basis of the root causes, a structured approach is adopted for formulating appropriate maintenance and operational protocols that mitigate the recurrence of prioritized failure modes. For the maintenance strategy, a data collection and structuring approach is proposed with a view to enhancing the use of maintenance and spare part data, which in turn is expected to assist practitioners in aligning the developed maintenance strategy with historical maintenance data acquired from medical devices. The operation protocols are also expected to guide practitioners in formulating standard operating practices, which in turn is expected to ensure optimal use of critical medical devices.

2 Review of related literature

Risk-based maintenance approaches, and in particular Failure Mode and Effect Analysis (FMEA), have been applied for prioritizing failure modes in diverse sectors such as manufacturing, service, and healthcare. FMEA embeds a systematic approach which assists practitioners in identifying and understanding factors contributing to equipment failures, and the associated causes and effects of potential failure modes. For instance, Sutrisno and Lee (2011) evaluated the use of FMEA for assessing the reliability of operable equipment, and noted the need for enhancing the computation of the Risk Priority Number (RPN), an important metric used for prioritizing operation and maintenance related failure modes. The authors also point to the need to address the subjectivity of the RPN and to incorporate aspects such as human error. Nonetheless, their approach focuses on service-oriented aspects of operable equipment, and ignores the need for incorporating operation and maintenance data when assessing and prioritizing equipment failure modes. Similarly, Liu et al. (2013) reviewed risk evaluation approaches addressing the limitations of the RPN metric for prioritizing equipment failures in the FMEA approach. They suggest alternative prioritization metrics based on methods such as multi-criteria decision making, linear programming, and fuzzy approaches. However, the proposed methods are often not intuitive to decision makers and practitioners, for instance in many public hospitals in developing countries. Hence, their usefulness for robust decision support is questionable.

Jamshidi et al. (2014) proposed a fuzzy FMEA approach for prioritizing failure modes of medical devices, based on assessing multiple risk factors influencing optimal operation of the devices, such as device utilization or age. Based on the assessed risk factors, the authors propose an approach for selecting appropriate maintenance strategies according to the criticality score of each device, where the candidate strategies include corrective maintenance, time-based maintenance, condition-based maintenance, and predictive maintenance. Although the study presents a numerical case, the mitigation strategies only go as far as proposing maintenance strategies, ignoring operation-related failure modes, which invariably influence equipment availability. Moreover, the proposed maintenance strategies are not linked to the outcomes of a root cause analysis, hence their usefulness for decision support when assessing risks in critical medical devices is questionable. Nor is the proposed approach linked to the maintenance data that is often collected from medical devices.

Rahimi et al. (2013) developed an FMEA approach combining a fuzzy cost-based method, Grey Relational Analysis, and profitability theory, and applied it to minimizing equipment failures. For mitigation strategies, they propose an optimization modelling approach for bundling the failure modes that need to be repaired. However, apart from concerns over the intuitiveness of the approach, their method relies primarily on expert assessment of risks and of the potential cost effects of the failure modes. As such, practical metrics that may be used for prioritizing equipment failure modes, such as the effect of lost patient treatment time, are ignored. Moreover, operation-related factors influencing equipment downtime are ignored.

Carmignani (2009) proposed an integrated cost-based FMEA which allows equipment failure modes to be prioritized based on a profitability metric, the latter considering corrective actions performed on the equipment when a failure mode occurs. The author demonstrates the importance of the cost metric. However, for medical devices, cost is often not the overriding metric for assessing risk; metrics such as patient lost treatment time and patient safety are very important. Von Ahsen (2008) also developed an improved FMEA approach which takes an economic perspective for prioritizing equipment failure modes. Here too, considering failure cost as the primary prioritization metric may yield sub-optimal operation and maintenance strategies, especially where metrics such as patient safety or lost treatment time are ignored. Rosen et al. (2014) developed an FMEA approach for formulating maintenance strategies for medical equipment in a resource-constrained environment characterized by aspects such as lack of spare parts. The mitigation strategies proposed in that paper are oriented to maintenance and logistical aspects, and hence ignore operation-related aspects which also lead to equipment unavailability. Moreover, their methodology does not link the formulated maintenance strategies to a structured root cause analysis process, an aspect addressed in this study.

Liu et al. (2012) developed an improved FMEA approach based on fuzzy logic and grey relational analysis, which they used for assessing user-related risks for medical devices. The anomalies identified in the study include ineffective use of operation protocols for optimal use of medical devices. However, the study is limited to prioritizing user-related equipment failure modes, without proposing mitigation strategies which integrate the maintenance and operation perspectives of the equipment. Moreover, their approach may not be intuitive for decision making, especially considering practical aspects influencing operation and maintenance of medical devices in the resource-scarce environment characterizing public hospitals in developing countries. Lin et al. (2014) also proposed an improved FMEA methodology based on fuzzy linguistic theory, which is applied for prioritizing user-related risks associated with medical devices. Although their study focused on actual medical device failures, it is more general, as it focuses on identifying user-related equipment failures without orienting the mitigation strategies to both operation and maintenance aspects.

Onofrio et al. (2015) evaluated the application of failure mode, effects and criticality analysis (FMECA) for assessing risks associated with the use and maintenance of medical devices in practice. The authors observe that practitioners often follow strict standards which guide the FMECA analysis. However, they also note that these standards limit practitioners in the nature of mitigation strategies which may be developed from both operation and maintenance perspectives. This study addresses this flaw by proposing a structured and systematic methodology for aligning operation and maintenance risks associated with medical equipment to appropriate mitigation strategies. Xiuxu et al. (2010) similarly apply the FMEA methodology for enhancing the risk management process of medical devices throughout their life cycle. Although their approach focuses on design, manufacturing, and operation-related risks, the authors primarily apply the Risk Priority Number (RPN) for prioritizing equipment failure modes. The RPN is largely a subjective metric which relies on expert elicitation, and hence often yields suboptimal operation and maintenance strategies. Moreover, the study only proposes general guidelines for identifying critical failure modes of medical devices, and ignores important decision-making facets such as the need to orient the mitigation strategies to a structured root cause analysis of recurrent failure modes.

Rahimi et al. (2016) propose an FMEA approach for prioritizing failures of radiotherapy equipment. They propose a criticality assessment index, based on a fuzzy RPN, for assessing the effects of equipment failure on the safety of patients. However, their study only focuses on patient safety, and ignores other operation and maintenance related facets influencing robust management of critical medical devices. Fechter et al. (2004) applied the FMEA methodology for assessing failure risks of medical infusion pumps, where the authors identify equipment failure modes, their potential effects, and probable causes. The authors prioritize the failure modes based on the RPN, from which mitigating strategies are proposed, e.g. enhanced infusion pump inspection and proper pump calibration. Apart from applying the RPN metric, the study focuses largely on maintenance-related aspects of the equipment, and does not align the proposed mitigation strategies to a structured root cause analysis process.

Several of the FMEA approaches discussed above are largely limited to prioritizing failure modes, often at the expense of formulating mitigating strategies. Moreover, the proposed methods are limited to specific operation or maintenance aspects of the equipment, and hence seldom integrate both facets. They also overlook practical aspects such as the availability of data that is often recorded for critical medical devices, which could aid in developing optimal operation and maintenance strategies. The need for a structured root cause analysis, and for aligning the analysis to formulated maintenance and operation protocols, is also overlooked. This paper addresses these flaws by proposing an integrated approach for assessing failure risks of medical devices, with a view to aligning operation and maintenance strategies to the prioritized risks. The proposed methodology applies a modified FMEA, where failure modes for medical devices are prioritized based on two metrics: maintenance cost, and lost patient treatment time as a result of unavailability of the medical device. The maintenance cost metric focuses on technical risks associated with equipment failure, for instance usage of spare parts and cost of repair. The lost patient treatment time, on the other hand, focuses on patient safety risks, where it is assumed that delayed access to diagnostic and treatment services leads to deterioration of the condition of the patient. A novelty of the approach is that operation and maintenance protocols are developed such that they align with the focal root causes of equipment failure. For this reason, the integrated approach incorporates a structured root cause analysis process where a ‘5-whys’ analysis is performed. A case study of three critical medical devices is discussed: dialysis equipment, Cobalt-60 radiotherapy equipment, and a patient ventilator.

3 Methodology

3.1 Case study description

The study was conducted in a large referral hospital in Kenya. Equipment availability was critical in this setting because the hospital is the only public hospital offering specialized diagnostic and treatment services. For this reason, patient turnover was often high, and unavailability of critical medical devices directly impacted patient waiting times and, by extension, the quality of healthcare extended to patients. Three departments were considered in this study: radiotherapy, renal, and the intensive care unit (ICU). The radiotherapy department operates the Cobalt-60 radiotherapy equipment for treating cancer. The renal department operates dialysis machines for treating kidney-related diseases. Lastly, the ICU operates patient ventilators for critically ill patients requiring support for breathing.

3.2 Phases in the methodology

Figure 1 summarizes the five main phases of the methodology. The first step involves data collection and structuring to enable meaningful risk assessment. The second main step involves analyzing operation and maintenance related failure modes, their underlying causes, and trends of recurrent failure modes. To facilitate this step, two main types of analysis were performed: failure frequency analysis, and failure mode prioritization where an adapted FMEA is performed. In the fourth step, a root cause analysis was performed with a view to identifying the root causes of recurrent failure modes, both operation and maintenance related. For the root cause analysis, the ‘5 whys’ method was used to systematically identify the focal root causes of failure. It is recognized that eliminating such focal root causes avoids recurrent failure modes, leading to enhanced equipment availability (Mahto and Kumar 2008). The fifth step aligned the focal root causes with operation and maintenance strategies, which in this study took the form of operation and maintenance protocols. The steps of the methodology are discussed in the next sections.

Fig. 1 Methodology for risk based maintenance for critical care equipment

3.3 Data collection

In this study, maintenance data detailing failure modes for the three critical medical devices discussed in Sect. 3.1 were used for the risk assessment and protocol formulation analysis. The data was initially in unstructured form, and had been recorded over a period of 3 years. In the unstructured form, meaningful analysis was not feasible since important parameters relevant for the risk assessment exercise were not linked. Examples of such parameters include the time of failure, the type of failure mode, the nature of repair actions, spare parts usage, potential root causes of failure, and the commissioning time after completion of the repair process. As an example, the time of failure and the time the equipment is commissioned after repair are useful for computing the device unavailability and the patient lost treatment time. These metrics are important measures of technical and patient safety risks. Spare parts information was also useful for computing the expected cost of repair, an important indicator of technical risk.
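The role of the two timestamps can be made concrete with a minimal sketch. It assumes that downtime for one failure event is simply the elapsed time between failure and re-commissioning, and that unavailability is the share of an observation period lost to downtime; the function names, the example events and the 30-day period are hypothetical and are not taken from the study data.

```python
from datetime import datetime

def downtime_hours(failed_at: datetime, commissioned_at: datetime) -> float:
    """Elapsed hours between a failure and re-commissioning after repair."""
    if commissioned_at < failed_at:
        raise ValueError("commissioning time precedes failure time")
    return (commissioned_at - failed_at).total_seconds() / 3600.0

def unavailability(events, period_hours: float) -> float:
    """Fraction of the observation period lost to downtime (assumed definition)."""
    total = sum(downtime_hours(failed, back) for failed, back in events)
    return total / period_hours

# Hypothetical failure events, for illustration only.
events = [
    (datetime(2015, 3, 2, 8, 0), datetime(2015, 3, 2, 20, 0)),     # 12 h outage
    (datetime(2015, 6, 10, 9, 30), datetime(2015, 6, 11, 9, 30)),  # 24 h outage
]
total_downtime = sum(downtime_hours(failed, back) for failed, back in events)
share = unavailability(events, period_hours=30 * 24)
print(total_downtime, share)
```

This sketch counts calendar hours; a hospital that only treats patients during fixed operating hours would probably restrict the calculation to those hours.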

3.4 Data structuring

In this step, apart from linking the parameters discussed in Sect. 3.3 that are necessary for assessing failure risks, additional information relevant for risk assessment was incorporated in the analysis. This information is depicted in Table 1, columns 8–14, and includes, among other items, the component the failure originated from, potential root causes based on the repair activity, and whether or not spare parts were required. The data structure further enhanced the assessment of technical risks (associated with repair costs) and patient safety risks (associated with patient lost time during equipment downtime).
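One way to picture a structured failure record is sketched below. The field names are assumptions chosen to mirror the parameters named in the text (serial number, equipment type, model, subsystem, component, failure mode, failure and commissioning times, repair action, suspected root cause, spare parts, in-house versus external support); they do not reproduce the actual columns of Table 1, and the example values are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional

@dataclass(frozen=True)
class FailureRecord:
    """One structured maintenance entry (hypothetical field names)."""
    serial_number: str
    equipment_type: str
    model: str
    subsystem: str
    component: str
    failure_mode: str
    failed_at: datetime
    commissioned_at: datetime
    repair_action: str
    suspected_root_cause: Optional[str] = None
    spare_parts_required: bool = False
    external_support: bool = False  # False means in-house maintenance

    def __post_init__(self):
        # Reject records whose timestamps cannot yield a valid downtime.
        if self.commissioned_at < self.failed_at:
            raise ValueError("commissioning time precedes failure time")

# Hypothetical record, for illustration only.
record = FailureRecord(
    serial_number="SN-0001",
    equipment_type="radiotherapy",
    model="CM1",
    subsystem="couch",
    component="brake",
    failure_mode="couch does not hold position",
    failed_at=datetime(2015, 3, 2, 8, 0),
    commissioned_at=datetime(2015, 3, 2, 20, 0),
    repair_action="brake adjusted",
    spare_parts_required=False,
)
print(record.subsystem, record.commissioned_at - record.failed_at)
```

Keeping every parameter in a single record is what allows downtime, spare part usage and failure origin to be analyzed together rather than in isolation.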

3.5 Statistical analysis

As discussed, the statistical analysis entailed two main steps; failure frequency analysis, and failure mode prioritization through an adapted FMEA approach.

3.5.1 Failure frequency analysis

The failure frequency analysis entailed tallying the occurrences of the failure modes for the three critical devices analyzed in the study. The tallying focused on the failure modes, the components from which the failure modes originated, and the sub-systems from which they originated. After tallying, the failure modes were ranked in order of frequency of recurrence. In addition, the frequency of commonly performed maintenance actions was computed, with Pareto analysis used for the frequency analysis. From the frequency analysis, the components and sub-systems contributing the highest cumulative occurrences of operation and maintenance failure modes were visualized. The high-frequency failure modes formed the basis of the root cause analysis and strategy formulation processes.
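The tallying and ranking described above can be sketched as a small Pareto table. This is an illustrative sketch rather than the procedure used in the study: the function names are hypothetical, the input list is made-up data, and the 80% cut-off in `vital_few` is an assumed convention, since the text does not state a fixed threshold.

```python
from collections import Counter

def pareto_table(labels):
    """Rank labels by frequency; return (label, count, pct, cumulative pct)."""
    counts = Counter(labels)
    total = sum(counts.values())
    rows, running = [], 0
    for label, count in counts.most_common():
        running += count
        rows.append((label, count, 100.0 * count / total, 100.0 * running / total))
    return rows

def vital_few(rows, threshold_pct=80.0):
    """Labels needed to reach the cumulative threshold (assumed cut-off)."""
    selected = []
    for label, _count, _pct, cumulative_pct in rows:
        selected.append(label)
        if cumulative_pct >= threshold_pct:
            break
    return selected

# Hypothetical subsystem labels, one per recorded failure.
subsystems = ["couch"] * 5 + ["gantry"] * 3 + ["software"] + ["control"]
rows = pareto_table(subsystems)
for row in rows:
    print(row)
print(vital_few(rows))
```

The same function can be applied to failure modes, components, or maintenance actions by passing a different list of labels.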

3.5.2 Adapted failure mode and effect analysis

The adapted FMEA was used for prioritizing operation and maintenance related failure modes based on their impact on risk aspects such as lost patient treatment time resulting from equipment unavailability. The patient lost time is computed on the basis of the diagnostic or treatment time lost while performing repairs, or while verifying the performance of the equipment after repair to ensure optimal operation. The adapted FMEA process considered in this study consists of the five main steps depicted in Fig. 2: selecting the type of component, identifying failure modes associated with the component, enumerating potential causes of the component failure, computing the cumulative downtime associated with the device unavailability, and computing the number of patients whose treatment is deferred due to device outage.

Fig. 2 Adapted Failure Mode and Effect Analysis Process

The first three steps of the modified FMEA process are similar to the steps discussed previously, i.e. the data collection and structuring process. The additional phases of the FMEA concern Steps 4 and 5 of the process in Fig. 2, where the cumulative downtime is calculated, on the basis of which the impact of device unavailability on patient lost treatment time is derived. The prioritized failure modes derived from the FMEA were thereafter compared to the results of the failure frequency analysis.

The patient lost treatment time is calculated as follows:

$$ \text{Lost patient treatment time} = \frac{\text{Cumulative downtime of the component failure}}{\text{Average treatment time per patient}} \tag{1} $$

From the FMEA, the prioritized failure modes, component failures, and the sub-systems from which the failures originated were subjected to root cause analysis. Such failure modes were associated with a significant negative impact on patient lost treatment time.
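Equation (1) and the resulting prioritization can be sketched as follows. The sketch assumes that downtime and average treatment time are expressed in the same unit (hours here), so that the ratio reads as the number of patient treatments lost; the failure mode names, downtime values and 15-minute treatment time are hypothetical and are not results from the study.

```python
def lost_patient_treatments(cumulative_downtime_h: float,
                            avg_treatment_time_h: float) -> float:
    """Eq. (1): cumulative downtime divided by average treatment time per patient."""
    if avg_treatment_time_h <= 0:
        raise ValueError("average treatment time must be positive")
    return cumulative_downtime_h / avg_treatment_time_h

def prioritize(downtime_by_mode: dict, avg_treatment_time_h: float):
    """Rank failure modes by lost patient treatments, highest first."""
    scored = {
        mode: lost_patient_treatments(hours, avg_treatment_time_h)
        for mode, hours in downtime_by_mode.items()
    }
    return sorted(scored.items(), key=lambda item: item[1], reverse=True)

# Hypothetical cumulative downtime per failure mode, in hours.
downtime_by_mode = {"mode A": 12.0, "mode B": 30.0, "mode C": 4.5}
ranking = prioritize(downtime_by_mode, avg_treatment_time_h=0.25)
for mode, lost in ranking:
    print(mode, lost)
```

Because the metric divides by a single average treatment time, the ranking within one device type is driven entirely by cumulative downtime; the division matters when comparing devices with different treatment durations.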

3.6 Root cause analysis (RCA)

After prioritizing the critical equipment failures based on their impact on lost patient treatment time, and based on the frequency of failure occurrences, a root cause analysis was performed with a view to identifying the causal factors responsible for the recurrent failures. A structured approach was followed in which focal root causes were identified using a ‘5 whys’ analysis. In this approach, decision makers query the sequential cause and effect relations leading to the specific failure event of interest: a series of ‘why’ questions is asked about the potential cause of failure, from which the focal root causes are systematically identified. The ‘5 whys’ technique was selected mainly because the maintenance data was insufficient to support quantitative RCA approaches such as data mining or multivariate analysis. In this study, the RCA process involved gathering information by interviewing biomedical engineers and device users at the case study hospital. The RCA process followed in this study is depicted in Fig. 3.

Fig. 3 Root cause analysis process

3.7 Development of maintenance and operational protocols

In this step, mitigation strategies were formulated targeting the recurrent device failure modes identified from the frequency analysis, prioritization and root cause analysis steps. The mitigation strategies were formulated with a view to identifying specific measures which would minimize or eliminate unacceptable operational and technical risks associated with failure of the medical equipment. Thus, from the results of the root cause analysis, operation and maintenance protocols were formulated for mitigating recurrent failures affecting the three critical devices analyzed in this study. For operation-related failure modes, operation protocols were formulated, which proposed prudent guidelines that device users are required to follow when operating the medical devices. The operation protocols largely focused on recurrent human errors which could lead to device misuse, and hence to unavailability and lost patient treatment time.

The maintenance protocols, on the other hand, were formulated with a view to guiding the biomedical engineers in performing maintenance more effectively, with periodic maintenance activities suggested at weekly, monthly, quarterly, and yearly intervals. The maintenance protocols also extended to formulating structures for collecting maintenance data such that the data could provide maintenance decision support.

4 Results

Although this study was carried out for three types of medical devices, i.e. Cobalt-60 radiotherapy machines, dialysis machines and patient ventilators, for brevity only the results for the Cobalt-60 radiotherapy equipment are discussed in depth. The results derived from the proposed methodology for the dialysis machines and patient ventilators are discussed briefly.

4.1 Failure frequency analysis: Cobalt-60 radiotherapy machine

4.1.1 Analysis as per type of model

The case study hospital operates two Cobalt-60 models, namely CM1 and CM2 whose failure information was analyzed over a 3-year period. Figure 4 shows the failure frequency analysis where it was found that model CM1 contributed 62.7% of the total Cobalt-60 equipment-related failure modes experienced for the two models, while model CM2 contributed the remaining proportion of 37.3%. For this reason, CM1 was considered for a more detailed root cause analysis discussed in Sect. 4.3 of this paper.

Fig. 4 Failure data bar graph for the radiotherapy machine, Cobalt model

Apart from the Cobalt-60 equipment, the frequency analysis was performed for the dialysis machines, where failure modes of four equipment models were assessed using the proposed methodology. The analysis showed that model DM2 contributed 61.4% of the total failures experienced for the dialysis equipment, followed by DM4 (26.1%), DM1 (6.9%), and finally DM3, with 5.3% of the total failure modes. For this reason, models DM2 and DM4 were evaluated further, with root cause analysis performed owing to the high observed frequency of failures. For the patient ventilators, five models were assessed, where the PVM3 model accounted for 43.5% of the total patient ventilator failures, followed by PVM2 (31.5%), PVM4 (15%), and lastly PVM1 and PVM5, each contributing 5% of the total ventilator-related failures. Hence, models PVM3 and PVM2 were considered for root cause analysis owing to their high frequency of failure.

4.1.2 Pareto analysis as per the subsystem

Figure 5 illustrates the results of the failure frequency analysis for sub-systems of the Cobalt-60 device. Cumulatively, the couch, software, gantry and control subsystems accounted for 85% of the total sub-system related failures. The couch subsystem positions patients in the rest position during treatment, where a braking mechanism ensures that the patient is held stationary in the desired position. The software subsystem, on the other hand, actuates the various commands entered by the operator. The gantry subsystem houses the radiation source from which the beams for treating cancer are emitted. The gantry also embeds the collimator, which directs the radiation beam to the desired area of treatment.

Fig. 5 Pareto analysis for subsystem

Of the total Cobalt-60 sub-system related failures, the couch subsystem contributed 40%, followed by the gantry (23.6%), the software sub-system (12.4%), and the control sub-system (8.5%). Based on this frequency analysis, a detailed root cause analysis was performed for the couch subsystem owing to its high frequency of failures (discussed further in Sect. 4.3).

For the dialysis machines, the subsystem failure frequency analysis showed that the fluid path and control sub-systems together accounted for 87.6% of the total sub-system related failures. Decomposed by type of subsystem, the fluid path contributed 69% of the total recorded failures, while the control subsystem contributed the remaining 18.6%. From this analysis, the fluid path subsystem was selected for a more detailed ‘5 whys’ analysis.

For the patient ventilators, the pneumatics and electronics subsystems had a cumulative failure frequency of 90.5%, with the pneumatics subsystem accounting for 71.4% of the recorded subsystem related failures and the electronics subsystem contributing the remaining 19.1%. For this reason, the pneumatics subsystem was selected for a more detailed ‘5 whys’ analysis.
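The Pareto ranking used in this subsection can be reproduced with a few lines of Python. The sketch below is illustrative only: it uses the Cobalt-60 subsystem shares reported in the text, and the function name `pareto_table` is a hypothetical helper introduced for the example, not part of the study.

```python
def pareto_table(shares):
    """Return (name, share, cumulative share) rows sorted by descending share."""
    rows = []
    cumulative = 0.0
    ranked = sorted(shares.items(), key=lambda item: item[1], reverse=True)
    for name, share in ranked:
        cumulative += share
        # Round to one decimal to avoid floating-point noise in the running total.
        rows.append((name, share, round(cumulative, 1)))
    return rows


# Cobalt-60 subsystem shares (% of subsystem related failures) as reported.
cobalt_subsystems = {
    "couch": 40.0,
    "gantry": 23.6,
    "software": 12.4,
    "control": 8.5,
}

for name, share, cumulative in pareto_table(cobalt_subsystems):
    print(f"{name:<10}{share:>6.1f}%{cumulative:>8.1f}%")
```

The four named subsystems sum to 84.5%, which corresponds to the cumulative figure of roughly 85% reported for the Cobalt-60 device.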

4.1.3 Pareto analysis as per component type for the Cobalt 60 radiotherapy machine

Figure 6 illustrates the Pareto analysis of the percentage component failures of the two Cobalt-60 radiotherapy machines, in which the brakes, collimator, and host PC ranked highest. To underscore the importance of these components, their functions are briefly explained. The braking system maintains the couch in the desired position depending on the type of examination or cancer treatment offered to the patient. The collimator directs the radiation beams to the desired location on the patient. The host PC acts as the command console for the equipment, while the hand controller actuates movements of different parts of the Cobalt-60 equipment.

Fig. 6 Pareto analysis for Cobalt machine components

From the Pareto analysis, the brakes, collimator, host PC, compressor, hand controller, emergency switch, and power supply had a cumulative failure frequency of 82.5% based on the data analyzed for the equipment over the 3 years. Of these, the braking components accounted for 36% of the total recorded component failures, followed by the collimator (5%), host PC (12%), compressor (7.1%), and hand controller (5.9%). Consequently, the brakes, collimator and host PC were considered for a detailed ‘5 whys’ analysis, discussed in Sect. 4.3 of this paper.

For the patient ventilators, the oxygen sensors accounted for 23.6% of the total component related failures for the device, followed by the flow sensors (20.7%) and the power supply (18%). These components were therefore selected for a more detailed ‘5 whys’ analysis. For the dialysis machines, the Bicart holder contributed 14.9% of the total device related component failures, followed by the inlet/outlet connector (13.7%) and the flow pump (11.7%). Hence, the components of the dialysis equipment with the highest cumulative failures were selected for a detailed ‘5 whys’ analysis.

4.1.4 Pareto analysis of the maintenance activities

Apart from analysing the failure frequencies of the medical devices, it was necessary to understand the type of maintenance activities implemented for the aforementioned critical medical devices. Figure 7 illustrates the Pareto analysis of the maintenance activities performed for the devices over the 3-year period of analysis. From the Pareto chart, the repair and replacement actions had a cumulative percentage of 80.5% of total maintenance activities. Of these, the repair policy contributed 67.5% of the total activities, while the replacement policy contributed 13%. These actions were mostly performed when the equipment failed and thus imply a corrective maintenance approach. Here too, a ‘5 whys’ analysis was performed (Sect. 4.3) with a view to understanding the reasons for the reliance on corrective maintenance rather than more proactive maintenance approaches.

Fig. 7 Pareto analysis of maintenance activities for the Cobalt machines

For the dialysis machines, the Pareto analysis showed that cumulatively, the repair and replacement activities accounted for 89.4% of total maintenance activities, of which the repair activity accounted for 69.5% and the replacement activity for 19.9%. Similarly, these activities can be considered part of a corrective maintenance approach, and a ‘5 whys’ analysis was performed to investigate the reasons for reliance on this type of approach. For the patient ventilators, the Pareto analysis indicated that the replacement and repair policies cumulatively accounted for 83.3% of the total maintenance activities carried out. Of these, the replacement policy accounted for 52.1% of the total activities, while repair actions accounted for the remaining 31.2%. For this reason, a more detailed ‘5 whys’ analysis was performed on the underlying reasons for reliance on the corrective actions (Table 1).
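As a quick cross-check of the cumulative figures, the repair and replacement shares reported for each device type can be summed as follows. This is a minimal sketch; the dictionary layout and the helper name `corrective_share` are assumptions made for illustration.

```python
# Reported shares (% of all maintenance activities) for each device type.
maintenance_shares = {
    "Cobalt-60": {"repair": 67.5, "replacement": 13.0},
    "dialysis": {"repair": 69.5, "replacement": 19.9},
    "ventilator": {"repair": 31.2, "replacement": 52.1},
}


def corrective_share(shares):
    """Sum the repair and replacement shares, treated here as corrective actions."""
    return round(shares["repair"] + shares["replacement"], 1)


for device, shares in maintenance_shares.items():
    print(f"{device}: {corrective_share(shares)}% corrective")
```

In all three cases, more than 80% of the recorded maintenance activities fall under the corrective category.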

Table 1 Structured data maintenance record for radiotherapy department

4.2 Adapted failure mode and effect analysis

The adapted FMEA was used to determine the effects of equipment failures on patient treatment times; the results for the Cobalt-60 devices are discussed in detail. As indicated earlier, the braking system was associated with the highest percentage failure frequency and, from Table 2, was also associated with a high lost patient treatment time of 219 patients over the 3-year period. This was followed by the collimator, the host PC, the power supply, the compressor, the emergency stop and finally the hand control failures. The number of patients not treated during equipment downtime is calculated using Eq. 1, described earlier. For the braking system, the failure frequency was 56 over the period of analysis. These 56 failures contributed a total downtime of 3285 min, i.e. an average downtime of about 59 min per failure. Thus, considering a treatment time of 15 min per patient, a total of 219 patients were not treated as a result of the braking system failures.
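The arithmetic behind the braking system figures can be checked with a short sketch. Eq. 1 is not reproduced in this section, so the form used below (total downtime divided by treatment time per patient) is an assumption inferred from the reported numbers, and the function names are hypothetical.

```python
def patients_not_treated(total_downtime_min, treatment_time_min):
    """Assumed form of Eq. 1: downtime divided by treatment time per patient."""
    return total_downtime_min / treatment_time_min


def mean_downtime_per_failure(total_downtime_min, failure_count):
    """Average downtime attributed to each recorded failure."""
    return total_downtime_min / failure_count


# Braking system figures reported for the 3-year period.
failures = 56
total_downtime_min = 3285
treatment_time_min = 15

print(patients_not_treated(total_downtime_min, treatment_time_min))    # 219.0
print(round(mean_downtime_per_failure(total_downtime_min, failures)))  # 59
```

The same calculation applies to any other component once its total downtime is known.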

Table 2 FMEA analyses for Cobalt radiotherapy machine

The FMEA was carried out for all the components, and the results of the adapted FMEA correlated with those of the failure frequency analysis previously discussed. This is because, in both instances, the brakes, the collimator, the host PC and the hand controller were ranked highly based on the cumulative percentage failure. For the dialysis machines, pump failures accounted for the highest number of deferred patient treatments, with 236 patients not accessing treatment due to failure of the pump. For the patient ventilators, the oxygen sensor accounted for the highest lost patient treatment time, with 53 patients not accessing treatment due to the breakdown of this component.

4.3 5-WHY’s analysis Cobalt 60 radiotherapy machine

4.3.1 5-WHY’s analysis for the CM1 cobalt 60 model

A ‘5 whys’ analysis for the CM1 Cobalt-60 model was performed based on the failure frequency analysis, which showed that it contributed a high number of failures for the two models operated by the case study hospital. A further failure frequency analysis also indicated that couch mechanism failure contributed a higher proportion of the total failures of the Cobalt-60 equipment. In the ‘5 whys’ analysis, high utilization of the equipment, coupled with few opportunities for maintenance, was identified as one of the main reasons for the high failure frequency. The high utilization was a result of high demand for radiology services at the hospital. In addition, the ‘5 whys’ analysis indicated a general lack of spare parts for the CM1 model type, the model being obsolete as it was more than 20 years old. Hence, critical spare parts were no longer manufactured and, as a result, the hospital relied on modified spare parts to keep the equipment operational. This, in the opinion of the decision makers, contributed to the high failure rate.

4.3.2 5-WHY’s analysis as per the subsystem

From the analysis, failure of the couch mechanism was highlighted as an important problem and the ‘5 whys’ analysis was performed as shown in Fig. 5. From the analysis shown in Table 3, the high utilization and the need for frequent adjustment of the patient position were observed to contribute to the high failure frequency through wear of the positioning mechanism.

Table 3 5 WHY’s analysis for the radiotherapy

Further analysis indicated that the couch had not been replaced despite the aging of the equipment (20 years of age). Hence, coupled with technical obsolescence, the positioning system experienced frequent failure modes as per the results of the root cause analysis. Moreover, spare parts for the positioning system were difficult to procure, necessitating minimal repair actions, which in turn led to constant repairs. Owing to the unavailability of critical spare parts for the positioning system, the hospital resorted to using modified components, which required technical modifications that contributed to device outages because the specifications of the modified components did not match those of the original components.

4.3.3 ‘5 whys’ analysis for critical device components

For the braking system, which was found to contribute the highest number of failures in the Cobalt system, the root cause analysis indicated an underlying root cause of high utilization with few opportunities for repair. Moreover, the mechanical movements of the braking mechanism while positioning the patient implied wear of its moving parts. On further analysis, it was found that owing to obsolescence, and hence a lack of spares, the repair process relied on modified components whose reliability was questionable, as depicted in Table 3. A ‘5 whys’ analysis of the collimator indicated that the component failed owing to high utilization and to the movement of parts of the mechanism when positioning and concentrating the treatment beam on the desired treatment area. Further analysis indicated that rather than replacing the component periodically as recommended by the equipment manufacturer, modified parts were used, coupled with minimal repair actions. The results of this analysis are indicated in Table 3.

Finally, for the host PC (computer), the root cause analysis indicated instability in the electricity supply as one of the main contributors to equipment outage. The equipment manufacturer usually recommends installing uninterruptible power supply (UPS) systems; however, this had not been done at the case study hospital. Additional results of the root cause analysis are shown in Table 3.

4.3.4 5 Why analysis for the maintenance activity

For the maintenance activities, the root cause analysis considered both the operational and maintenance perspectives, as indicated in Table 3. From the summary, the high utilization of the radiotherapy machines was viewed as an important root cause, which is expected given the high demand for radiotherapy services at the referral hospital. The high utilization implied early onset of operational failures, more so when coupled with few opportunities for maintenance. A Cobalt 60 radiotherapy machine should serve, on average, 30–40 patients, but because the hospital operated only two devices, the equipment served, on average, 100 patients each day. Secondly, the root cause analysis indicated an absence of structured approaches for inspecting the system before and during operation of the devices. Such inspection is critical since it informs the biomedical engineers of deviations in the operation of the system. The absence of inspection checks implied an early onset of failures that would have been avoidable had the checks been performed.

Considering the maintenance perspective of the equipment, the root cause analysis indicated a tendency towards a reactive approach to maintenance of the devices. Importantly, the reactive maintenance was performed when the system was out of service. Moreover, the analysis indicated that preventive maintenance was performed every 6 months as per the recommendation of the Original Equipment Manufacturer, without taking into consideration the high utilization of the equipment. Ideally, the high utilization should necessitate more frequent maintenance interventions, which was not the case. Further analysis also indicated that maintenance data analysis was not carried out because a structured maintenance data collection and analysis procedure was not in place. Such data would have assisted the biomedical engineers in formulating more effective maintenance strategies. In the absence of such a system, formulating effective maintenance activities is not straightforward. From the results of the root cause analysis, a more effective operation and maintenance strategy was considered important for mitigating the root causes of failure.

4.4 Maintenance and operational protocols

From the root cause analysis, it was evident that appropriate mitigation strategies were needed to prevent recurrence of the focal device failure modes. Hence, in this research, operation and maintenance protocols were developed for the three critical devices: Cobalt 60 radiotherapy machines, dialysis machines and patient ventilators.

4.4.1 Operator protocols for Cobalt-60 radiotherapy equipment

Since device operators are always in contact with the equipment, they are often the first persons to note device malfunctions or the onset of early defects. From the risk assessment and root cause analysis process, the braking system, collimator and host PC components were identified as the most critical. Hence, there is a need to develop a structured operator protocol for assessing whether the system is functioning correctly and within the required parameters. The operator protocols will ideally guide the device users and hence enhance equipment availability through prudent device usage.

Table 4 illustrates the proposed operator protocols, which were developed to mitigate recurrent equipment failures. The proposed protocols indicate the daily tasks and checks that should be performed for effectively operating the Cobalt-60 equipment. As an example, in order to prevent failure of the braking system, regular cleaning of the mechanism is necessary to avoid dust accumulation, which may result in reduced functionality. Moreover, inspecting the proper functioning of the braking system of the couch and gantry subsystems before positioning the patient could reveal the onset of malfunctions, for instance, abnormal noise during movement of the couch or gantry. Additional operation protocols that would assist in prudently operating the equipment include recording defects in structured log sheets, since such records would enable prompt maintenance interventions.

Table 4 Operator protocols for the Cobalt-60 radiology equipment

4.4.2 Maintenance protocols for Cobalt-60 equipment

The maintenance protocol for the Cobalt-60 device is divided into three parts. The first part proposes weekly activities which the biomedical engineers are required to perform on the equipment. The second part of the protocol proposes monthly activities, which the biomedical engineers should observe when maintaining the equipment. In addition, the protocols propose preventive maintenance every 3 months instead of every 6 months, as was the case during the study. The more frequent maintenance is necessitated by the high utilization of the equipment, where an average of 100 patients are often attended to, instead of the recommended 30 to 40 patients each day per unit. The high utilization thus necessitates more frequent maintenance interventions to mitigate frequent equipment breakdown.
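To put the overload in numbers, the sketch below compares the reported daily load with the recommended range. It only quantifies the ratio: the paper does not give a formula linking utilization to the maintenance interval, the 100 patients per day is assumed here to apply per unit, and the function name `utilization_ratio` is introduced purely for illustration.

```python
def utilization_ratio(actual_patients_per_day, recommended_patients_per_day):
    """Ratio of observed daily load to the recommended daily load."""
    return actual_patients_per_day / recommended_patients_per_day


actual = 100                  # average patients per day, as reported
recommended_range = (30, 40)  # recommended patients per day per unit

low = utilization_ratio(actual, recommended_range[1])   # against the upper bound
high = utilization_ratio(actual, recommended_range[0])  # against the lower bound
print(f"Load is {low:.1f}x to {high:.1f}x the recommended level")

# Halving the preventive maintenance interval from 6 to 3 months
# doubles the number of scheduled interventions per year.
print(12 // 6, "->", 12 // 3, "preventive interventions per year")
```

Under these assumptions the equipment carries roughly 2.5 to 3.3 times the recommended load, while the proposed 3-month interval doubles the preventive maintenance frequency.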

Table 5 shows the proposed maintenance protocols for the Cobalt-60 machine. Operation and maintenance protocols were also developed for the dialysis machines and patient ventilators; however, for brevity, these protocols are not illustrated in this paper.

Table 5 Maintenance protocols for the Cobalt-60 device

4.4.3 Importance of the developed protocols

The high utilization of the medical equipment is unavoidable because few units are available. Hence, to ensure availability of the equipment, one of the mitigation strategies identified in this study, and discussed in the previous section, is developing operation and maintenance protocols targeting the focal root causes of equipment failure identified through the proposed methodology. The operator protocols are designed to guide operators in operating the equipment prudently, thus reducing operation related failures. The maintenance protocols are expected to guide biomedical engineers in maintaining the equipment efficiently by increasing opportunities for maintenance, i.e. on a daily, weekly and monthly basis. The enhanced interventions are expected to help the engineers identify the early onset of failures and hence intervene more promptly, prior to equipment outage. The operation and maintenance policies are also expected to contribute to enhanced availability of the critical medical devices and to optimize repair costs.

5 Summary and conclusion

This paper demonstrates a methodology for developing strategies for mitigating operation and maintenance failures. The strategies are linked to the focal root causes of recurrent equipment failure and are expected to guide practitioners in better managing the operation and maintenance aspects of critical medical devices. By formulating such strategies, better outcomes are expected for critical diagnostic and treatment devices, and hence better healthcare delivery to patients in need of such services. The proposed approach embeds a data collection and structuring framework which assists decision makers in analyzing and prioritizing equipment failures. Based on the prioritization process, a systematic and structured root cause analysis is suggested for assessing the focal root causes of equipment failure, for which mitigation strategies are proposed. This approach differs from existing studies in the literature, where risk assessment is often performed with a view to prioritizing equipment failures, at the expense of analyzing the focal root causes and formulating mitigation strategies. Moreover, the proposed approach is novel in the sense that both technical and patient safety related risk metrics are analyzed, i.e. repair interventions and deferred patient treatment (lost patient treatment time). Furthermore, the study focuses on operation and maintenance data collected from the equipment.

From the study, important areas of research which could enhance the management of critical medical devices are identified. These include the need for a robust data collection and analysis structure for operational and technical information generated from the equipment. Such data would enhance risk assessment and yield more robust root causes and mitigation strategies. However, several limitations are also apparent, specifically the need to evaluate the effectiveness of the strategies proposed in this study. This requires actual implementation of the strategies and evaluation of their effect on operation and maintenance outcomes. This aspect will be evaluated in future work.