Failure analysis and lessons learned on LHC experiments crate and power supply equipment

The LHC accelerator's first long shutdown period (LS1), in 2013–2014, has given the experiments the opportunity to perform planned upgrade and maintenance activities on systems and equipment. It has also been the right time to conduct a preventive maintenance campaign on crate and power supply equipment which is foreseen to operate smoothly for another 4 to 8 years. This paper presents the lessons learned during the LS1 power supply preventive maintenance activities as well as an in-depth analysis of the most common failure modes and weaknesses encountered on the power supplies in the LHC experiments over the past years of operation.


Introduction
Large quantities of high and low DC voltage power supplies (PS) and crates are used to house and power sub-detector electronics in the LHC experiments (ATLAS, CMS, ALICE, LHCb, TOTEM) and elsewhere. The PS and crate service in the PH-ESE group supports this equipment and manages a set of long-term maintenance and purchase contracts with the PS and crate manufacturers.
The equipment supported by the PS and crate service is individually traced during maintenance (verification, calibration, preventive maintenance or repair). All maintenance tasks are centrally managed via a technical database and a maintenance history is individually recorded for every piece of equipment. Besides tracking equipment, the database offers the possibility to generate detailed statistical analysis of failure rates, identify possible weak equipment and highlight the type of interventions performed.
The numbers presented in this paper are mostly sourced from the PS and crate technical database. Additionally, where needed, closer collaboration was established with the manufacturers to get an in-depth knowledge and understanding of the failures, their origin and fixes.

JINST 10 C02038
Failure rate analysis
The failure rate calculation is based on the number of "Fault report" actions versus the total number of equipment registered in the database. A "Fault report" is issued every time a user requests a repair action using the technical database. Only fault reports leading to a repair from the contractor (proven failure) are taken into account in the failure rate calculation.
In the technical database the first deliveries were registered in 2002. However, the failure tracking (fault report) only started in 2007. The data used for the analysis were collected from the database in July 2014. Therefore, the fault report analysis covers the period from June 2007 to July 2014. All actions that have happened outside this period have not been considered in this study.
As some of the Caen equipment in operation in the LHC experiments was bought outside of the frame contract (the centralised purchasing and maintenance contract managed by CERN), the technical database equipment list is not exhaustive. In order to address this issue, information on real equipment quantities has been collected from the users. Caen also helped to improve the total quantity numbers by tracing back the sales of custom items only used at CERN. The number of fault reports in the database is accurate, as items purchased outside the frame contract are necessarily added to the database when the first fault report is issued. Moreover, Caen confirmed that almost all repairs performed for CERN pass through the frame contract and are traced in the database.
On the Wiener side, the equipment list in the database is exhaustive as practically all purchasing went through the frame contract and was therefore automatically registered.

Yearly failure rate
The yearly failure rate calculation is based on the proven failures that occurred in the given year versus the unit quantities registered at the end of that year. The focus is put on Caen and Wiener equipment, as it represents over 90% of the supported crates and PS.
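The definition above can be written as a short calculation. The following is only a minimal sketch: the function name and the example numbers are hypothetical and are not taken from the technical database.

```python
def yearly_failure_rate(proven_failures, units_at_year_end):
    """Return the yearly failure rate in percent.

    proven_failures: fault reports of the year that led to a contractor repair.
    units_at_year_end: unit quantity registered at the end of that year.
    """
    if units_at_year_end <= 0:
        raise ValueError("units_at_year_end must be positive")
    return 100.0 * proven_failures / units_at_year_end


# Hypothetical illustration values, not data from the paper.
example_rate = yearly_failure_rate(proven_failures=150, units_at_year_end=10000)
print(f"Yearly failure rate: {example_rate:.2f}%")
```

Dividing by the end-of-year quantity follows the wording of the text; a different denominator (for example a yearly average quantity) would give slightly different rates.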
In 2011 a first survey was carried out to determine the total number of Caen items in circulation. In 2014, a further investigation including Caen sales was performed. Based on the data obtained, the total Caen quantity for the previous years was estimated by subtracting the items delivered each year. These estimates explain the quantity jump between 2011 and 2012 in figure 1.
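The back-estimation of earlier quantities can be sketched as follows. This is an illustration under stated assumptions: the function name, the reference total and the delivery numbers are hypothetical, and the exact procedure used for figure 1 may differ in detail.

```python
def back_estimate_quantities(total_at_end_of_last_year, deliveries_by_year):
    """Estimate end-of-year quantities for earlier years.

    total_at_end_of_last_year: known total at the end of the latest year
        present in deliveries_by_year.
    deliveries_by_year: dict mapping year -> items delivered during that year.
    Returns a dict mapping year -> estimated quantity at the end of that year.
    """
    totals = {}
    running_total = total_at_end_of_last_year
    # Walk backwards in time, removing each year's deliveries in turn.
    for year in sorted(deliveries_by_year, reverse=True):
        totals[year] = running_total
        running_total -= deliveries_by_year[year]
    return totals


# Hypothetical numbers for illustration only.
estimates = back_estimate_quantities(1000, {2012: 100, 2013: 50, 2014: 20})
print(estimates)
```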
Two main observations can be made from figure 1. First, the global yearly failure rate is relatively low with an average below 2% over the last 8 years. The second observation is that the failure rate tends to decrease despite the recent tests, maintenance and handling activities related to LS1. In particular, the Caen yearly failure rate decreased by 3% over the last five years.
It is also important to note that the global failure rate for 2014 only indicates a tendency as the numbers could change at the end of the year. The data used for 2014 only covers the period from January to June and numbers were extrapolated to cover the whole year.
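The text does not state how the half-year numbers were extrapolated. A minimal sketch, assuming a simple linear scaling of the January to June count to twelve months (the function name and the example count are hypothetical), could look like this:

```python
def extrapolate_to_full_year(count_so_far, months_covered):
    """Linearly scale a partial-year count to a full year (assumed method)."""
    if not 1 <= months_covered <= 12:
        raise ValueError("months_covered must be between 1 and 12")
    return count_so_far * 12 / months_covered


# Hypothetical example: 30 proven failures recorded from January to June.
print(extrapolate_to_full_year(30, 6))
```

A linear scaling ignores any seasonal effect, such as increased handling during a shutdown, which is one reason why the 2014 value only indicates a tendency.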

Operating time before failure
Considering the theoretical failure bathtub plot illustrated in figure 2, the time before failure tends to indicate the failure phase the equipment is currently in. Figure 3 shows the number of fault reports versus the equipment age. As the item quantities are changing constantly (new equipment is delivered 4 times a year), the fault reports are given in absolute numbers rather than as a relative failure rate in the ordinate of figure 3.
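Counting fault reports per equipment age can be sketched as below. The record layout (a delivery date paired with a fault report date), the one-year bin width and the example dates are assumptions made for illustration; they do not describe the actual schema of the technical database.

```python
from collections import Counter
from datetime import date


def fault_reports_by_age(records):
    """Count fault reports per equipment age, in whole years.

    records: iterable of (delivery_date, fault_report_date) pairs.
    Returns a dict mapping age in years -> absolute number of fault reports.
    """
    counts = Counter()
    for delivered, reported in records:
        if reported < delivered:
            raise ValueError("fault report dated before delivery")
        # Approximate age in years, ignoring leap days.
        age_years = (reported - delivered).days // 365
        counts[age_years] += 1
    return dict(sorted(counts.items()))


# Hypothetical records for illustration only.
example_records = [
    (date(2007, 3, 1), date(2007, 6, 15)),
    (date(2007, 3, 1), date(2010, 9, 1)),
    (date(2008, 1, 10), date(2011, 5, 20)),
]
print(fault_reports_by_age(example_records))
```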
Equipment older than 8 years shows a lower number of fault reports. This is mainly driven by the low equipment quantity of that age. In fact, the purchase and delivery peak was recorded between 2006 and 2008, just before the LHC start.
For the Caen equipment, an early failure period of 6 months can clearly be identified and the first 4 operating years can be considered as a finalizing development period and/or a user learning phase.
The time before failure, combined with the yearly failure rate, shows that the equipment is most likely leaving its early failure period.

Caen equipment analysis
The Caen equipment represents the largest quantity of equipment supported by the service (around 7500 items). One of the goals of this analysis is to find out the most common and evident weaknesses, understand their causes, and if possible, propose recommendations for future developments.
The following analyses were performed on all item types having a minimum of one fault report. Considering that the uncertainty on the total item numbers could distort the failure rates, all item types for which the quantities in operation are not known have been taken out of the analysis. According to the database, this concerns at least 225 items distributed in 13 types. Figure 4 shows the cumulative failure rate between 2004 and 2014, broken down by design type, standard or custom. A so-called standard design refers to all items available in the standard Caen commercial catalog and/or designed for common/general use at CERN. A custom design refers to items specifically designed on request and/or for a specific application (e.g. the VX1390 LV module designed to supply the TDC Readout Modules in the ALICE experiment).

Failure breakdown
Considering the fault reports, custom devices present a lower cumulative failure rate when compared to standard catalog equipment. However, the final development and adjustment phase required for custom equipment (see figure 3) is not taken into account in this analysis. Experience has shown that, after reception and acceptance of the equipment, this final adjustment phase represents a non-negligible amount of effort and time. Most of the time this effort is not registered as a fault report in the technical database but is instead sorted out between the users and the contractor via direct contacts.
About one third of the fault reports are not related to actual equipment failures. To determine the proven failures among the fault reports, we used the ratio between the total repairs and the repairs considered as "No fault found". The reasons for these "No fault found" cases are discussed in the next section.
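One possible reading of this ratio-based estimate is sketched below: the fault reports are scaled by the fraction of repairs that were not classified as "No fault found". The function name and the numbers are hypothetical, and the exact formula used for the analysis is an assumption.

```python
def estimate_proven_failures(fault_reports, total_repairs, no_fault_found_repairs):
    """Scale the fault reports by the share of repairs with a proven failure."""
    if total_repairs <= 0:
        raise ValueError("total_repairs must be positive")
    if not 0 <= no_fault_found_repairs <= total_repairs:
        raise ValueError("no_fault_found_repairs must lie within total_repairs")
    proven_repairs = total_repairs - no_fault_found_repairs
    return fault_reports * proven_repairs / total_repairs


# Hypothetical example in which one third of the repairs are "No fault found".
print(estimate_proven_failures(fault_reports=300, total_repairs=900,
                               no_fault_found_repairs=300))
```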
Figure 5 represents the failure rate breakdown by equipment type. The SYxx27 mainframes show a high failure rate compared to other equipment types, even if their quantity is low. The SY failure modes are detailed in the next section. Obviously the Easy crates failure rate is very low as this equipment type mainly consists of mechanics and a passive backplane.

Cause of failures
During repairs, Caen technicians fill in a repair report and tick the entries corresponding to the repairs performed. Those entries were used to build up the distribution of failure causes for each equipment family as shown in figure 6. Two main categories can be distinguished: the proven failures and the "No failure found" (greyed out in figure 6). The proven failures are composed of the following repair entries: components, functional block, soldering, calibration and other fine tunings. The no fault found, software upgrade, improper use and hardware upgrade entries fall into the "No failure found" category. The main contributors to this category are upgrade (software & hardware) related issues as illustrated in figure 6.
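The grouping of repair entries into the two categories can be sketched as follows. The category contents are taken from the text, but the exact spelling of the entry labels on the Caen repair report, the function names and the example entries are assumptions made for illustration.

```python
from collections import Counter

# Entry labels as listed in the text; their exact wording on the
# repair report is an assumption.
PROVEN_FAILURE_ENTRIES = {
    "components",
    "functional block",
    "soldering",
    "calibration and other fine tunings",
}
NO_FAILURE_FOUND_ENTRIES = {
    "no fault found",
    "software upgrade",
    "improper use",
    "hardware upgrade",
}


def classify_entry(entry):
    """Map a ticked repair entry to one of the two main categories."""
    key = entry.strip().lower()
    if key in PROVEN_FAILURE_ENTRIES:
        return "proven failure"
    if key in NO_FAILURE_FOUND_ENTRIES:
        return "no failure found"
    raise ValueError(f"unknown repair entry: {entry!r}")


def category_distribution(entries):
    """Count how many ticked entries fall into each category."""
    return Counter(classify_entry(entry) for entry in entries)


# Hypothetical ticked entries for illustration only.
print(category_distribution(["Components", "soldering", "Software upgrade"]))
```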
Focus was put on the SY functional block repairs, which represent the most frequent cause of failure on the SY mainframes. For other equipment types, component failure is the leading cause of failure. However, the number of failing components on units presenting a component failure is relatively low considering the huge quantity of components mounted on the devices (on average, a Caen PS module is made of 3500 components [1]).
As illustrated in figure 7, the main SY functional block repairs are related to the fragility of the commercial PC components the SYxx27 mainframe family is based on. The new SY mainframe generation (SY4527 and SY5527) has therefore been designed differently and is based on a modular (and easy to exchange) CPU board. However, no failure statistics or analysis is available yet on this new equipment.

Wiener equipment analysis
A total of 5647 Wiener items (on 11 July 2014) are supported by the service. Past experience with this contractor has given us a good feeling for the most frequent failures encountered. The results presented in this section confirm and quantify this feeling.

Failure breakdown
The cumulative failure rate breakdown by cooling mode in figure 8 shows that the cooling fluid (water or air) has no impact on the equipment failure ratios. As the uncooled equipment is mainly composed of mechanics and passive components (crates), its failure rate is low.
For Wiener equipment, the database does not offer repair categories as for Caen. Therefore, the "proven failures" counting is based on the number of "fault reports" leading to a contractor repair. Thus, the difference between "fault reports" and "proven failures" is mainly due to the verification and no-fault found screening performed by the service before repair. This difference is relatively high for the PL512 PS family, as most of their "fault reports" are related to configuration errors.

Cause of failures
The PFC (AC/DC Power Factor Corrector) is the most important failure source of all devices equipped with it (All PS types except the Maraton and PL508 families). This PFC weakness is well known by Wiener and several actions have been put in place to cure this weakness:
• Early 2008, an additional diode was implemented in order to by-pass temporary high in-rush currents.
• The Bx revision with an improved power stage has been produced since 2009.
• Verification and exchange campaign of capacitors from a bad production batch with critical life time was launched during LS1 in 2013-2014.
Additional failure factors are still under Wiener's investigation. These are:
• Power transistor failure (suspected cause: too high inrush current and exceeded operating temperature).
According to figure 10, the PS internal DC/DC modules are the second most frequent cause of failure. However, failures related to the DC/DC modules have concerned only 71 PS units since 2007, which is a modest number considering the high DC/DC module multiplicity (up to 6 modules per PS unit), their complexity and their diversity (water/air cooled, magnetically shielded, output voltages).
Figure 11 presents the yearly PFC failure modes observed by Wiener during repairs [2]. Until 2009, the number of repairs was constantly increasing. Since the launch of the PFC Bx revision, the number of repairs has significantly decreased. However, the recent LS1 preventive maintenance campaign has revealed some weaknesses to be addressed. Those weaknesses mainly concern a bad capacitor, preventively exchanged during LS1, and the output power transistors. Figure 11 confirms that constant monitoring and preventive maintenance can effectively contribute to keeping the failure rate low.

Lessons learned
• Failure rates have decreased with time and are kept low, possibly thanks to the targeted preventive maintenance campaigns.
• The failures related to the PFC represent over 56% of all Wiener equipment repairs.
• The SY mainframes present a failure rate almost 4 times higher than the other equipment types, even if this represents only 7.8% of all Caen repairs.
• Around 2/3 of the Caen equipment used at CERN has been purchased outside the frame contract. These items are neither registered nor traced in the technical database and therefore cannot be taken into account when failure analyses are performed. These items unfortunately also risk missing some important upgrades and/or preventive maintenance campaigns.
• Caen customized equipment tends to be effective in the long term despite a relatively high starting failure rate or tuning phase (with corresponding delivery delays).

Recommendations
• The preventive maintenance campaigns are essential to prevent failures and extend the useful operating time of PS equipment.
• Using the frame contracts for purchasing and maintenance allows better and exhaustive tracking and leads to relevant statistics beneficial to all. The frame contracts cover the majority of power supplies and crates used in the LHC experiments.
• Being able to keep track of what is going on in the experiments (equipment status, upgrades, etc.) helps to gain experience and to build a global view that benefits all users.

Conclusion
This failure analysis on LHC crates and power supply equipment shows a global yearly failure rate (average of 1.70% over 8 years) better than the originally foreseen 5-10%. These results give relatively good confidence in the equipment reliability. They probably also show the benefit of the preventive maintenance campaigns and reflect the good collaboration maintained with the contractors. The analysis carried out allows us to list the different failure modes and to clearly identify the Caen mainframes and the Wiener AC/DC converters as the least reliable devices. The data collected over the last 8 years provide, through this paper, the first complete and up-to-date overview of the failure rate of PS equipment used in the LHC experiments. The completeness and quality of the collected data are key to performing such an analysis successfully. We will therefore continue to encourage all actors (users and contractors) to go through the service and register equipment as much as possible. This will allow everyone to share experience and contribute to improving equipment knowledge. All lessons learned will be of invaluable help in designing and building the future power converter solutions and systems.