Availability and critical systems of the Future Circular Electron–Positron Collider

The Future Circular Electron–Positron Collider (FCC-ee) is planned to be the world’s largest particle collider and a precision instrument to study the heaviest known particles. Achieving substantial physics results requires producing high amounts of integrated luminosity, which calls for sufficient machine availability. Although the operational availability of lepton colliders has been high, the increased complexity of the new infrastructure creates a challenge to maintain this. At the early stage of research, the main activity of an availability study should be to identify the causes that potentially have the most significant effect on downtime. This paper identifies critical systems of the FCC-ee based on available failure data. The paper further presents an operation model for the FCC-ee that can be used for assessing the effect of unavailability on overall performance. Special attention is given to systems with built-in redundancies as this design concept has been proven to increase accelerator availability.


Introduction
The Future Circular Electron-Positron Collider (FCC-ee) [1] is one of the options for the next large particle collider to succeed the Large Hadron Collider (LHC) at CERN. If built, the FCC-ee will be the largest particle collider ever to exist. It will be 100 km long and operate in a top-up mode where new beams are continuously injected into the collider. For this purpose, the collider tunnel will also house a 100 km long booster accelerator. A schematic of the accelerator complex is shown in Fig. 1.
During its life-cycle, the FCC-ee will operate with different energies to study Z, W, and H bosons and the top quark. The aim is to increase the statistical precision of most electroweak and Higgs observable measurements by one to two orders of magnitude. Fig. 2 shows the operation schedule, and Table 1 shows the operational parameters and integrated luminosity goals [2] for different operation modes. The mode tt 1 that is shown in the schedule is a one-year-long intermediate stage with 175 GeV beam energy before the upgrade to the prevalent tt 2 -mode.
Reaching the luminosity production goals requires operating the machine with 75% efficiency for 185 days per operation year [3]. This number is formed by subtracting 5% from the 80% machine availability goal to take into account the required time for refilling¹ the collider after a failure. These numbers are estimates that have been based on operation experience. An earlier paper [3] used these numbers as inputs for a simple luminosity approximation with equation:
$$L_{\mathrm{int}} = \eta \, t \, \mathcal{L}, \qquad (1)$$
where $\eta$ is the efficiency, $t$ the time, and $\mathcal{L}$ the instantaneous luminosity. For example, in the Z-mode,² $2 \times 0.75 \times 200 \times 10^{34}\ \mathrm{cm^{-2}\,s^{-1}} \times 185\ \mathrm{days} \approx 48\ \mathrm{ab^{-1}}$. The number two at the start of the calculation signifies that the result is for two interaction points.
* Corresponding author. E-mail address: arto.niemi@cern.ch (A. Niemi).
¹ In this paper, a fill is a period between the first beam injection to the dumping of the beam. A filling is a period within a fill between first injections to the point when the beams have reached their nominal intensity.
² During the four years of Z operations total goal is 150 ab⁻¹. The annual luminosity goals for these four years are: 2 × 24 ab⁻¹ + 2 × 48 ab⁻¹ ≈ 150 ab⁻¹.
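The Z-mode example can be checked numerically. The following is a minimal sketch of the approximation, assuming only the numbers quoted in the text; the function name and the unit-conversion constants are illustrative choices, not part of the cited paper.

```python
SECONDS_PER_DAY = 86_400
CM2_PER_INVERSE_AB = 1e42  # 1 ab^-1 = 1e42 cm^-2, since 1 ab = 1e-42 cm^2


def integrated_luminosity_ab(efficiency, days, inst_lumi_cm2_s, n_ip=2):
    """Integrated luminosity in ab^-1, summed over n_ip interaction points."""
    seconds = days * SECONDS_PER_DAY
    return n_ip * efficiency * seconds * inst_lumi_cm2_s / CM2_PER_INVERSE_AB


# Z-mode inputs from the text: 75% efficiency, 185 days, 200e34 cm^-2 s^-1.
z_mode = integrated_luminosity_ab(0.75, 185, 200e34)
print(f"Z-mode: {z_mode:.1f} ab^-1 per year")
```

With these inputs the result is close to 48 ab⁻¹, in line with the annual Z-mode goal.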
Fig. 1. Schematic layout of the FCC-ee complex with the SPS serving as the pre-Booster [1]. Alternatively, the pre-Booster could be a linear accelerator.
Experience shows that availability is a highly subjective metric if it is not well defined. The following terms were initially coined in the thesis [4] to describe the availability of an accelerator:
Thesis [4] also defines:
$$\text{Availability for physics} = \frac{\text{Beam delivery time}}{\text{Run time}}, \qquad (3)$$
where the beam delivery time is the effective time when the beam is delivered to experiments or to other accelerators in an accelerator complex. This metric shows that an accelerator cannot always deliver a beam even though it is technically operational. The availability for physics is, however, an imprecise metric as it does not consider the rate of physics production, which is measured with luminosity. In this paper, the term availability stands for machine availability to keep the text compatible with the earlier paper [3].
Machine unavailability can be seen as a risk that reduces operational efficiency, and in the worst-case scenario, it limits a collider's ability to reach the physics goals. The European norm 31010 [6] presents a risk assessment process that consists of three steps: risk identification, analysis, and evaluation. This process is used for determining the actions that are required to treat the risk. An action can either mitigate the consequences or reduce the likelihood of a risk.
This paper presents two topics that relate to assessing the risk of unavailability in the FCC-ee. We give the first listing of critical systems for the FCC-ee availability and remarks on technologies that can reduce this risk. In this process, we uncovered a need for an energy storage system for the FCC-ee booster that was not considered in the design report [1]. Due to this, the paper provides additional details on how this system could be implemented.
Secondly, we present an operations & availability model that can be used to analyze the effect of system unavailability on luminosity production. The model uses the same concept [7][8][9] as the prior studies for the FCC-hh [10,11]. Special attention is given to systems with built-in redundancies. They are studied first in the quantitative analysis, and their benefits are later shown in a small calculation case.

Post-LHC colliders
The FCC-ee is being designed as a part of the Future Circular Collider study. The primary focus of the FCC study has been on the design of the 100 TeV hadron collider FCC-hh [12]. However, an integrated program where the 100 km long tunnel would first house the lepton collider FCC-ee [13] is now considered the leading approach within the FCC study. The FCC study has also conducted research on the so-called High Energy LHC (HE-LHC) [14]. In this option, the Nb$_3$Sn magnets designed for the FCC-hh would be used in a new collider housed in the LHC tunnel. In parallel to the FCC study, the Institute of High Energy Physics of the Chinese Academy of Sciences is designing the Circular Electron-Positron Collider (CEPC) that would later share its tunnel with a hadron collider [15]. Both the CEPC and the linked hadron collider are designed to achieve lower energies than their FCC counterparts. The CEPC is not planned to study the top quark, and the hadron collider would only achieve 75 TeV collision energy.
As an alternative to the circular options, the Compact Linear Collider (CLIC) [16] and the International Linear Collider (ILC) [17] studies are developing linear lepton collider options. The advantage of a linear lepton collider is that it avoids the large amounts of synchrotron radiation that a circular lepton collider produces and that limit luminosity production at high energies. This effect can be seen in Fig. 3. However, a circular collider can produce more luminosity to study the known particles. As an additional benefit, a tunnel for a circular lepton collider can later be used to house a hadron collider. For example, the LHC tunnel housed the Large Electron-Positron (LEP) collider [18] before the LHC was built.

Historic performance of lepton colliders
Several electron-positron colliders have been built over the years, for example, LEP at CERN; PEP-II at SLAC; KEKB at KEK; BEPC II at IHEP; and DAΦNE at INFN. Historically, electron-positron colliders have had high availability. Earlier, a literature review was conducted on their availability [3], and Table 2 summarizes its findings. Although we consider here only the lepton colliders, additional sources of information exist for availability assessment. For example, synchrotron light sources use the same technologies as electron-positron colliders.
Table 2. The availability goal of the FCC-ee appears conservative when it is compared to achieved performance in other lepton colliders [3]. However, it will have an unprecedented size and energy compared to the other lepton colliders.ᵃ
a Ref. [3] and additional technical information from Refs. [19][20][21][22].
b Unlike the other listed colliders, the LEP did not operate in a top-up mode. The beam was accelerated in the collider before collisions. During this time, the machine was available as defined by Eq. (2), but it was not available for physics (Eq. (3)). This table shows the machine availability.
c For KEKB and PEP-II, the higher energy value is for the electron beam and the lower for the positron beam.
Although the availability goal of the FCC-ee appears conservative in Table 2, the table also shows that the FCC-ee will have much more demanding operation requirements than the LEP-2, which is its closest comparison. The FCC-ee will be three times larger and will operate with 75% higher energy in the tt 2 -mode. The FCC-ee will also produce luminosity at a significantly higher rate, partly thanks to the far higher beam current in all but the tt 1 and tt 2 modes. These factors, combined with the booster ring for top-up injection, mean that the FCC-ee will be a far more complex system than the LEP. This is a challenge as complexity often correlates with unreliability [23], which results in lower availability and physics production performance.
The risk of low production performance is not the only reason why the FCC study should be interested in availability and reliability. Increased reliability and condition-based maintenance techniques can reduce the number of corrective maintenance interventions. This can reduce the operation costs if it allows scheduling the interventions during "office hours", limiting the need for around-the-clock surveillance. Further savings can be achieved if the majority of the interventions can be performed during scheduled operation stops, as this would limit the need for on-site personnel. Conversely, if nothing is done to enhance the availability of the FCC-ee, the number of technicians and the maintenance costs will likely scale up, and the performance will scale down compared to the LEP.

Accelerator availability studies
Refs. [10,11] provide comprehensive literature reviews on accelerator availability studies. In brief, operational availability has become more and more important as user requirements and machine complexity have grown. For example, OECD [24], European Science Foundation [25], and ESFRI [26] reports on sustainable operations all mention, in various terms, that it is essential to provide high-quality science services reliably and continuously.
This increased attention has produced multiple activities to assess accelerator reliability and availability. For example, in light sources, availability is one of the key performance indicators. Facilities have attempted to define common rules to calculate the availability to ensure that the results from different facilities are comparable [27]. In all cases, reliability data collection is a fundamental activity that enables other studies and allows prioritizing them. Reliability data collection has to be part of operations to achieve accurate results. The work presented in the paper [28] shows how organization-wide reliability data collection was established at CERN. Nevertheless, this practice is still new. For example, the Capability Maturity Model Integration (CMMI) program [29] would assess the current maturity level as 1/5 "performed process", as the rigor with which this task is performed varies considerably depending on the individuals managing and performing the work.
Collected reliability data is vital for modeling activities, as producing relevant results depends on accurate and detailed data. Modeling can be used for defining reliability requirements, estimating the effect of system reliability improvements on overall machine availability, or assessing the maintenance & operation costs. Extensive examples of modeling activities within the accelerator community include reliability studies for the MYRRHA nuclear waste transmutation facility, where even a beam trip lasting a few seconds can force a shutdown of the accelerator-driven nuclear reactor [30]. Also, the IFMIF fusion materials test facility project had an extensive reliability study that restarted the AvailSim modeling tool development [31]. The tool was initially created for ILC availability studies and has been developed further for ESS and CLIC availability studies.
Our modeling approach has relied on fault trees [32] for system failure modeling and semi-Markov models [33] for operations modeling. A Markov model consists of states and transitions between the states. At any given moment, only one state can be active. A semi-Markov model differs from a classical Markov model in that the classical model only allows exponential distributions for the transitions, whereas a semi-Markov model makes no such assumption. When any distribution is allowed, the model cannot be solved analytically, and a numerical method is required to obtain a solution. We use a stochastic discrete event simulation [34] to produce the results. This kind of approach is common in complex system reliability modeling, as trying to build a model that can be solved analytically quickly leads to unrealistic assumptions [35].
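To illustrate the idea, the sketch below simulates a two-state semi-Markov model with a stochastic discrete event simulation. It is a toy example and not the model used in this paper: the state names, the Weibull and log-normal sojourn-time distributions, and all parameter values are hypothetical and were chosen only to show transitions that are not exponentially distributed.

```python
import random


def simulate_availability(horizon_h, rng):
    """Simulate one history of a two-state semi-Markov model.

    Returns the fraction of the horizon spent in the "up" state.
    """
    # Sojourn-time samplers. Distributions and parameters are hypothetical
    # and deliberately non-exponential.
    sojourn = {
        "up": lambda: rng.weibullvariate(100.0, 1.5),   # hours until a failure
        "down": lambda: rng.lognormvariate(1.0, 0.5),   # hours needed for repair
    }
    next_state = {"up": "down", "down": "up"}

    state, clock, uptime = "up", 0.0, 0.0
    while clock < horizon_h:
        # Draw the time spent in the active state, truncated at the horizon.
        stay = min(sojourn[state](), horizon_h - clock)
        if state == "up":
            uptime += stay
        clock += stay
        state = next_state[state]
    return uptime / horizon_h


rng = random.Random(1)  # fixed seed so that the run is reproducible
runs = [simulate_availability(10_000.0, rng) for _ in range(200)]
mean_availability = sum(runs) / len(runs)
print(f"mean availability over {len(runs)} histories: {mean_availability:.3f}")
```

Averaging over many simulated histories replaces the analytical solution that is not available for general sojourn-time distributions.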
The approach taken by the FCC study was to rely, as much as possible, on the best industrial practices, long-term experience, and commercial services and tools. For this reason, the FCC study chose the ELMAS suite as the model platform. ELMAS, developed by the Finnish company Ramentor Oy, is a Java-based, platform-independent modeling and simulation engine that implements dynamic fault trees [36]. ELMAS version 4.8 was used in our initial study [10]. It permits specifying production functions, semi-Markov chain transition logic, and operation schedule specifications with user-defined Java snippets.
Our first effort to model the availability of the FCC-hh resulted in a combined availability and operations model that was validated against data from the LHC operation [10]. This model is useful for risk analyses where it can be used for studying a specific risk's impact on machine availability and luminosity production. This work, however, also recognized a need to improve the ELMAS approach. While the use of user-specified code in ELMAS gave a high degree of flexibility to tailor the tool to the domain-specific needs, it broke the concept of a single model, parameterization, and simulation specification, as parts of this information were in the code. This finding motivated the development of the OpenMARS approach [7,8]. Thesis [11] provides details on how the collider operations model was implemented initially in Ramentor's ELMAS software and later in OpenMARS. This paper uses a further derived Analysis of Things (AoT) framework for the calculations [9].
The key performance indicator of a collider is integrated luminosity [2]. Modeling luminosity production is more complicated than modeling just the machine availability. Colliders like the LHC require the machine to be filled with beam and to accelerate the beam before the luminosity production can start. The machine spends a significant amount of time preparing the beam for collisions, and modeling this process is crucial to model luminosity production. This feature is also necessary to model the FCC-ee operations, as frequent refillings are foreseen.
The luminosity production rate is often time-dependent. For example, in the LHC, the number of colliding particles decreases over collision time, which reduces the luminosity production rate. In our model, luminosity production is modeled with a so-called production function that can take the time dependency into account. For a collider where the luminosity production rate changes as a function of collision production time, this feature is essential. In the FCC-ee, this is also useful in certain situations. The luminosity production rate changes significantly at the start of the fill, where collision production starts before the machine is entirely filled. Once the filling is complete, the luminosity production rate can be understood to be constant over a long time scale. However, this is not the case in situations where collisions are ongoing, but the beam cannot be injected into the collider. In this paper, we assume that this situation causes the instantaneous luminosity to decay exponentially [2].
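A production function of this kind can be sketched as follows. This is a minimal illustration under stated assumptions, not the production function implemented in the model: the rate is normalized to 1 during top-up operation, the filling phase is ignored, and the decay time constant `lifetime_s` is a hypothetical parameter.

```python
import math


def luminosity_rate(t_since_injection_loss_s, peak_rate, lifetime_s):
    """Instantaneous rate after injection stops (assumed exponential decay)."""
    return peak_rate * math.exp(-t_since_injection_loss_s / lifetime_s)


def produced(t_topup_s, t_no_injection_s, peak_rate, lifetime_s):
    """Luminosity produced in a period of top-up operation followed by a
    period of collisions without injection."""
    constant_part = peak_rate * t_topup_s
    # Closed-form integral of the exponential decay over t_no_injection_s.
    decaying_part = peak_rate * lifetime_s * (
        1.0 - math.exp(-t_no_injection_s / lifetime_s)
    )
    return constant_part + decaying_part


# One hour of top-up operation followed by one hour without injection,
# with a hypothetical 1800 s decay constant and a normalized peak rate.
total = produced(3600.0, 3600.0, peak_rate=1.0, lifetime_s=1800.0)
print(f"produced: {total:.0f} (in units of peak rate x seconds)")
```

In this normalized example, the hour without injection contributes less than half of what an hour of top-up operation does, which shows why injector availability matters even while collisions continue.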

FCC-ee pre-booster availability
Today, two alternatives exist for implementing the pre-booster that feeds the booster ring: (1) the SPS at CERN or (2) a linear accelerator with 20 GeV energy. This report studies these options based on existing availability data. In 2018, the availability of the LHC injectors was closely monitored [37], and this effort led to credible statistics on the SPS availability. For the linear accelerator, this report uses data from SLAC's LCLS, which is a hard X-ray free-electron laser with 14.7 GeV beam energy [38,39].
In the SPS statistics [37], the SPS availability has been divided into two destinations: the LHC, and the North Area (NA) that contains a wide variety of fixed-target experiments [40]. The statistics show that the LHC availability had a high priority during 2018. For this reason, of the two options, the SPS availability for the LHC is the better indicator of what the availability could be for the FCC-ee. Overall, the SPS availability for the LHC was 92.8% and for the NA 82.8%.
In the current statistics, the downtime caused by other injectors is added to the SPS unavailability. If the SPS is the pre-booster for the FCC-ee, its injector chain will consist of a linear accelerator, a positron target, and a damping ring, as shown in Fig. 1. This chain will be much simpler than the injector chain the SPS has had during the LHC operations, so the majority of the SPS downtime that was caused by other injectors does not need to be considered. In 2018, this amounted to 517 h of downtime, which is approximately 41% of the total SPS downtime. Without this downtime, the SPS availability would be 96%. However, this is not an entirely accurate number as, in some instances, electrical network faults were counted as injector failures, even though an electrical network fault often affects the whole complex. Regardless, the lepton beam injector for the SPS cannot be ignored as a potential source of unavailability. The LEP Pre-Injector had a relatively low availability of 93%-98%, mainly due to issues with klystrons [41]. However, some of these issues could have been avoided with built-in redundancies within the accelerator systems that will be presented later in this paper.
Fig. 4 shows the SPS system downtime during 2018. The provided data do not take into account the beam destination. This has several effects. Most notably, a power converter fault caused more than 200 h of downtime for the North Area due to a decision to delay the corrective maintenance intervention. Also, the majority of the long extraction system and beam instrumentation faults only affected the North Area. The downtime linked to beam-induced failures is mainly caused by an incident where the beam loss created a hole in the beam pipe. In an injector, such a failure is mainly linked to proton operations. With leptons, the beam energy would be too low to cause damage in the SPS.
Fig. 4. SPS system downtime during 2018 [37]. Black bars indicate the total downtime for an individual system, and the white bars indicate the root cause duration. The root cause duration disregards the downtime where the root cause of the failure is outside the system boundary. However, it includes the downtime from other systems in cases where a failure in the specific system is the root cause of other systems' failures. In this graph, these assumptions might not be valid for injector downtime, where electrical network faults might be counted as injector faults.
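The 96% figure quoted for the SPS without injector-induced downtime can be reproduced with a short calculation. The sketch below assumes that the 92.8% availability for the LHC is the baseline and that the 41% injector share applies to that downtime; both are our reading of the text rather than something stated explicitly in Ref. [37].

```python
availability_lhc = 0.928    # SPS availability for the LHC in 2018 [37]
injector_downtime_h = 517   # downtime attributed to other injectors
injector_share = 0.41       # approximate share of the total SPS downtime

# Total downtime implied by the two injector figures.
total_downtime_h = injector_downtime_h / injector_share

# Remove the injector share from the unavailability (assumed baseline: LHC).
unavailability = 1.0 - availability_lhc
adjusted_availability = 1.0 - unavailability * (1.0 - injector_share)

print(f"implied total downtime: {total_downtime_h:.0f} h")
print(f"availability without injector downtime: {adjusted_availability:.3f}")
```

Under these assumptions, the adjusted availability rounds to 96%.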
This report uses two public references to assess the availability of SLAC's LCLS. The first report [38] describes an operation period in 2008-2009 and gives about 91% availability, and the second report [39] describes a period in 2010-2011 and gives 94.8% availability. The latter report also defines a 95% availability goal that reserves 2% of the time for machine tuning. Based on this, Ref. [39] estimates that the LCLS could operate with 97% hardware availability.
All in all, both SPS and a linear accelerator can reach high availability as the pre-booster for the FCC-ee. As such, minor differences in the availability performance of the existing facilities should not be the deciding factor between the two options.

Critical systems for availability
Our initial approach to identifying the critical systems for the FCC-ee is to study which systems have caused issues in the CERN accelerator complex. It is the only currently operated scientific infrastructure with similar complexity to the FCC-ee. In all of CERN's injectors, the primary sources of unavailability have been the radio-frequency (RF) system, magnet utilities (powering and cooling), and electricity distribution [37]. In the LHC, the cryogenic system and the systems linked to the machine protection are also among the top downtime contributors [42].
This section presents details on how these systems are planned to be implemented in the FCC-ee collider and booster rings, together with considerations on the system reliability. Section 3.2.4 gives special attention to the energy storage system for the booster ring. This system would level the voltage loads caused by the booster's operation cycles. The authors feel that this attention is warranted as the FCC-ee design report [1] did not provide details of this system, and such a system adds a layer of complexity to the magnet powering.

Radio-frequency system
The FCC-ee and the booster will have superconducting RF-cavities. The number of cavities will change through the life cycle as the machine is upgraded to operate with different energies. Tables 3 and 4 show how the type and number of cavities change in different scenarios in the collider and the booster. Running at the Z-pole, the collider is a heavily beam-loaded machine, while at the tt 1 and tt 2 energies, it becomes a high-energy machine. This affects the design. The fast RF feedback requirements and the high number of bunches in the Z and W-modes favor a design with a single cavity per power source. In other modes, a power source may feed multiple cavities, and in the tt-modes the RF system in the collider could be shared between the two beams, thanks to the small number of bunches [1].
Experience from the LEP and the LHC shows the importance of redundant design and operational margin. In the LHC, the RF system was designed for 2 MV load, but it has been operated with 1.5 MV load, which has led to relatively high operational reliability [44]. In contrast, during the LEP operations, the RF performance was pushed beyond the design specifications [18]. During the last year of operations, the average value of the accelerating gradient reached 7.5 MV/m, which is 25% over the nominal value of 6 MV/m. Even before this, in order to maximize the beam energy, the operating voltage of each individual RF unit was chosen to give the maximum acceptable trip rate so that the mean time between trips was 15 min [18]. This was sustainable as, in most cases, a trip would only affect one half-unit, while the LEP could survive a trip of two half-units. During the last year of operations, this margin was removed, and practically all the fills ended with an RF trip.
A lesson for the FCC-ee is that a large number of cavities allows designing built-in redundancy. The LEP could operate reliably with 2/22 unit redundancy with a high accelerating gradient leading to a high cavity trip rate [18]. Assuming that the FCC-ee's RF system will be operated with a low trip rate, the level of redundancy can be much lower than in the LEP. For example, even two units (power source and associated cavities) per beam could be a sufficient level of redundancy if the failure rates are low [44]. Still, the FCC-ee will have an additional challenge compared to the LEP as the RF systems are not shared in most of the operation modes. In this case, if an RF system loses accelerating capacity, it will only affect one beam, instead of affecting both beams symmetrically.
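The benefit of spare units can be illustrated with a k-out-of-n calculation. The sketch below is a simplified illustration and not the paper's model: it assumes independent, identical units described by a steady-state availability, and both the unit availability of 0.98 and the requirement of 20 working units out of 22 are hypothetical values chosen only to mirror the 2/22 redundancy mentioned above.

```python
from math import comb


def k_out_of_n_availability(n, k, unit_availability):
    """Probability that at least k of n independent, identical units are up."""
    a = unit_availability
    return sum(comb(n, i) * a**i * (1.0 - a) ** (n - i) for i in range(k, n + 1))


unit = 0.98  # hypothetical availability of a single RF unit

no_spares = k_out_of_n_availability(22, 22, unit)   # every unit is required
two_spares = k_out_of_n_availability(22, 20, unit)  # two units may be lost

print(f"no redundancy:   {no_spares:.3f}")
print(f"2/22 redundancy: {two_spares:.3f}")
```

Even with identical units, tolerating the loss of two units raises the system availability substantially in this toy case. A real assessment would also need the trip and recovery dynamics, which this static calculation ignores.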
RF system reliability can also be enhanced with new technology. Sound operational experience with solid-state amplifiers (SSA) at SOLEIL has led many other accelerator facilities to adopt this technology to power RF-cavities [45]. In these instances, the new technology has replaced the traditionally used klystrons. Experience at SOLEIL shows that the SSA is a highly reliable power source. The reliability is a result of built-in redundancies that form a fault-tolerant design where the failure of a single module does not stop operations. Instead, the corrective maintenance can be deferred to a planned operation stop. This concept is also used at CERN for the PS booster [46] and SPS [47] upgrades. Ref. [46] further presents an analysis showing that the redundancy in the PS booster RF powering indeed allows deferring most of the corrective maintenance to planned stops.
However, reliability is not the only factor that should be taken into account when deciding on the powering technology. The FCC study currently considers klystrons as the baseline powering option for several reasons. In Z, W, and H-modes, a single cavity requires on the order of 1 MW of power, which might limit the use of alternative technologies [48]. However, in the tt-modes, the lower power requirement (200-250 kW) opens the door for different options. It is also worth noting that the CEPC design report mentions that the RF system of the CEPC booster injector will use SSA power sources [15].
Another aspect is the life-cycle costs. There is an ongoing study that aims to improve the energy-efficiency of klystrons above 80% [1], which is higher than the current SSA performance of 65% [45]. On the other hand, Ref. [49] remarks that combined capital expenditure and maintenance costs could be lower for the SSA. However, this does not take into account the fact that in the FCC, the equipment will be underground. Klystrons require less space than a comparable SSA system. Besides, the current plan is that the power converters for klystrons are located on the surface, which might not be a suitable option for the low voltage DC converters for SSA [48].
Table 5. This table shows the values for magnets as they are presented in the conceptual design report [1]. The FCC study presently considers a modular design for the collider ring dipoles to ease the transportation. However, this design would significantly increase the number of magnets. The shown number of sextupoles is for tt-modes. In other modes, the required number is lower.

Magnet utilities
The concept of redundancy can also be applied to magnet powering. Modular power converters are used in the LHC [50] and the Diamond light source [51,52]. They are planned for LHC experiments [53], and such structures have also been studied for the CLIC [54]. The CEPC design report also mentions a plan to use redundant power converters with modular n+1 architecture [15].
Diamond's operational experience with fault-tolerant power converters shows the high potential of this design [43]. Diamond has 1500 power converters. During its first ten years of operations, Diamond has had two separate years without beam trips caused by power converters. Fig. 5 shows that, contrary to CERN's experience with normal conducting machines, Diamond's power converters are far from being the top contributor to unavailability. For this reason, adopting a fault-tolerant power converter design could significantly improve the FCC-ee's chances of reaching the physics goals.
Normal conducting magnets require cooling. A survey presented in [55] shows that water leaks from the magnet cooling system are the most prominent failure mode in these magnets. This survey confirms findings from an earlier study that focused on accelerators operated by SLAC [56]. Failure rates in the SLAC study show that the Mean Time Between Failures (MTBF) of a magnet is between 0.5 and 3 million hours.
To quantify this issue, if the number of magnets was 10 000 and the failure rate was the same as in the study [56], the system would suffer a failure every 0.5 to 2 weeks. This is relevant as Table 5 shows that the planned number of magnets in the FCC-ee collider and booster rings is on the order of tens of thousands. The magnets for the collider are planned to have twin apertures for both beams. This design halves the number of required magnets for the collider.
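The scaling from a single-magnet MTBF to the failure interval of a whole magnet population can be sketched as follows. The calculation assumes independent magnets with a constant failure rate, which is a simplification, and uses only the MTBF range and the magnet count quoted above; the function name is illustrative.

```python
HOURS_PER_WEEK = 168.0


def fleet_failure_interval_h(n_magnets, mtbf_h):
    """Mean time between failures anywhere in a population of identical,
    independent magnets with a constant failure rate."""
    return mtbf_h / n_magnets


for mtbf_h in (0.5e6, 3e6):  # MTBF range from the SLAC study [56]
    interval_h = fleet_failure_interval_h(10_000, mtbf_h)
    print(
        f"MTBF {mtbf_h:.1e} h -> one failure every {interval_h:.0f} h "
        f"({interval_h / HOURS_PER_WEEK:.1f} weeks)"
    )
```

With the rounded MTBF bounds, this simple estimate gives roughly 0.3 to 1.8 weeks between failures, which is of the same order as the interval quoted in the text.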
The most likely failure modes in studies [55,56] are leaks and blockages of the cooling, which also have relatively long recovery times. Papers [55,56] also highlight power connection and installation failures. This draws attention to issues such as the accessibility of water cooling hoses & connections and efficient installation quality assurance testing. Not all of these issues will translate directly to the FCC-ee. Unlike many other accelerators, the FCC-ee will have water-cooled low field dipoles that will require less cooling than high field magnets. However, the FCC-ee will have 10 000 photon absorbers that will also require water-cooling [1].

Electricity distribution
Today, the LHC is mainly supplied by one 400 kV power source. The FCC will be supplied from three such sources, as shown in Fig. 6. This can impact reliability as the number of potential failure sources increases. However, this will also allow a redundant design, and the current plan is that FCC-ee could operate with one source missing [1]. This redundant approach is also reflected in lower hierarchy levels. Fig. 7 shows the redundancy within the transformer substations, alternative power sources, and the transmission line segments that connect 135/36 kV transformers. These transmission lines will allow a substation to provide electricity to an adjacent point. The high level of redundancy will alleviate issues with failures within the internal CERN network and also make operations independent from the availability of a single electricity source.
Additionally, CERN's operations have suffered from electricity perturbations that can be caused, for example, by lightning strikes. The effect of short disturbances can be mitigated with buffers. An attractive option would be to switch from an alternating current (AC) distribution network to direct current (DC) power distribution and to combine it with local energy buffers [12].

Energy storing for the booster ring
A concrete need for an energy storage system arises from the FCC-ee booster ring cycle. The booster continuously injects beam into the FCC-ee. The shortest cycle length (5.6 s) is in the tt 2 -mode [57]. The conceptual design report [1] did not provide details on how this system is planned to be implemented in the FCC-ee. For this reason, this paper provides additional details on different energy storage technologies. The topic is of particular interest as the system significantly increases the complexity of the powering system, which can affect reliability. Fig. 8 presents different technologies for storing energy. In accelerators, capacitors and flywheels have been used for shielding the network from the cyclical load pattern with local energy storage systems. Besides, Superconducting Magnet Energy Storage (SMES) has at least been considered for use in an accelerator.
The PS accelerator at CERN was initially equipped with a flywheel system that was replaced with capacitors. A paper [58] published in 2005 studied different options for implementing the buffer for the PS. The flywheel was rejected due to a lack of suppliers for equipment with the required operating parameters. A direct connection to the network was rejected due to the cost of the required transformer and local reactive power compensator. The paper also identified a risk that a direct network connection might perturb operations of other CERN power converters and amplify disturbances caused by thunderstorms. SMES was apparently rejected due to a lack of commercial off-the-shelf industry products.
After the review [58], the different energy storage technologies have been advancing. For example, in Fig. 8, the values for flywheel are for second-generation models with active magnetic bearings. Third generation superconducting bearings (SCB) have lower electromagnetic losses and thus allow storing higher amounts of energy, which allows the flywheel to provide power for a longer duration of time [59]. SCB is  also more straightforward and reliable than an active magnetic bearing as the SCB does not need a sophisticated control system for stable levitation.
The SMES system has also been studied for J-PARC [61] and DESY [62]. Interestingly, the study for DESY notes that accelerator facilities often have cryogenics production on site. This may make the SMES a more economically viable solution than in other instances where the cryogenic cooling is built only to serve the SMES. This applies to the FCC-ee as it will have cryogenics to cool the superconducting RF-cavities and interaction region magnets [1].
For the FCC-hh, batteries are considered as a solution to store energy [12]. They will recover energy from superconducting magnets at the end of a cycle to support the powering of the accelerator during the subsequent ramp phase. In that case, the batteries appear to be a promising solution as this technology has progressed impressively over recent years, thanks to the automotive sector and the increasing use of renewable energies.
Regardless, challenges exist for batteries [63]. Today, the capital and, in some cases, the operational expenditures of battery storage are higher than those for most of the alternative technologies. Regarding applicability, Ref. [63] states that among batteries, lithium-ion technology has one of the best aging performances, with a maximum cycle life of 20 000 cycles. This number is sufficient for the FCC-hh, which cycles a few times per day. However, for the FCC-ee booster, which can cycle several times per minute, batteries might not be a suitable solution due to both the limited cycle life and the long recharge time. Other issues include a fire hazard, which might affect system availability.
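The cycle-life argument can be made concrete with a rough calculation. The sketch below takes the 20 000 cycle limit and the 5.6 s booster cycle from the text; the value of three FCC-hh cycles per day is only an assumed stand-in for "a few times per day", and continuous cycling at the shortest cycle length is a worst-case assumption.

```python
# Rough cycle-life arithmetic for a battery buffer (illustrative sketch).
CYCLE_LIFE = 20_000        # maximum number of cycles quoted from Ref. [63]
BOOSTER_CYCLE_S = 5.6      # shortest FCC-ee booster cycle (tt 2 -mode)
HH_CYCLES_PER_DAY = 3      # ASSUMPTION: stand-in for "a few times per day"

SECONDS_PER_DAY = 24 * 3600


def lifetime_days(cycle_life: int, cycles_per_day: float) -> float:
    """Days of operation until the cycle life is used up."""
    return cycle_life / cycles_per_day


# ASSUMPTION: the booster cycles continuously at its shortest cycle length.
booster_cycles_per_day = SECONDS_PER_DAY / BOOSTER_CYCLE_S
ee_days = lifetime_days(CYCLE_LIFE, booster_cycles_per_day)
hh_days = lifetime_days(CYCLE_LIFE, HH_CYCLES_PER_DAY)

print(f"FCC-ee booster: {ee_days * 24:.1f} h of continuous cycling")
print(f"FCC-hh:         {hh_days / 365:.1f} years of daily operation")
```

Under these assumptions, the cycle budget that lasts many years in the FCC-hh case is consumed in little more than a day of booster operation.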

Machine protection system
The decision to dump the beam in the FCC-ee will be governed by a machine protection system. The system will dump the beam if there is a high risk of uncontrolled beam losses that could damage the machine. Such a situation arises if a system required to maintain safe operation fails or if the beam becomes unstable. A key design principle in a machine protection system is to find a balance such that the system is sufficient to protect the machine but not so complex that it starts to hinder machine availability [64]. Achieving this balance requires good knowledge of the damage potential in failure scenarios. Uncertainty about this can lead to conservative policies that result in unnecessary dumps.
The concept of redundancy is highly present in the machine protection system. In cases where an individual sensor can trigger a beam dump, redundancy can be used to help to ensure the correct triggering. If false-positive signals are an issue, a system can be designed with a voting gate that only activates a beam dump when two out of three signals are positive. More often, a missed signal is the primary design concern. This issue can be tackled with an OR-gate where one out of two signals is needed to trigger a dump.
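The trade-off between the two gate types can be illustrated with a simple probability calculation. The following sketch assumes independent, identical channels; the per-channel probabilities `P_FALSE` and `P_MISS` are hypothetical placeholders and are not taken from any FCC-ee or LHC data.

```python
from math import comb


def at_least_k_of_n(p: float, k: int, n: int) -> float:
    """Probability that at least k of n independent channels fire,
    when each channel fires with probability p."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


P_FALSE = 1e-3  # HYPOTHETICAL probability of a spurious signal per channel
P_MISS = 1e-3   # HYPOTHETICAL probability that a channel misses a real event

for name, k, n in [("1oo1 (single)", 1, 1), ("1oo2 (OR-gate)", 1, 2), ("2oo3 (voting)", 2, 3)]:
    # A false dump needs at least k spurious signals.
    false_dump = at_least_k_of_n(P_FALSE, k, n)
    # A real event is missed when fewer than k channels detect it.
    missed_dump = 1 - at_least_k_of_n(1 - P_MISS, k, n)
    print(f"{name:15s} false dump: {false_dump:.2e}  missed dump: {missed_dump:.2e}")
```

With these placeholder numbers, the OR-gate lowers the missed-dump probability at the price of roughly doubling false dumps, while the two-out-of-three gate lowers both relative to a single channel.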
Compared to the LHC, the FCC-ee will be larger, and in some instances, this might result in a more complex system. However, one significant difference between the LHC and the FCC-ee is that, in the LHC, the main dipoles and quadrupoles are superconducting and require a quench protection system [64]. In contrast, the FCC-ee will have only a few superconducting magnets in the interaction regions [1,65]. The normal conducting magnets do not need a quench protection system, only interlocks against overheating and loss of cooling. This is a crucial difference, as quenches and the quench protection system have been significant contributors to the LHC downtime [42]. In addition, the beam loss monitor system contributes to the protection of the superconducting magnets. The system initiates a beam dump if the beam losses risk a quench. This system has caused substantial downtime in the LHC [66]. However, most of this downtime is related to technical issues that are foreseen to be solved in the near future.

Cryogenics
In the FCC-ee, the RF-cavities and the triplet magnets close to the interaction points will be superconducting, which will necessitate a cryogenic system. This system has been one of the leading causes of downtime in the LHC [42]. Although the system itself is quite reliable, cooling the system back to the operating temperature after a failure requires a long time. Table 6 shows the number of plants and their cooling capacity for the FCC-ee. The highest capacity will be required in the tt-modes, when the FCC-ee will have four cryogenic plants to supply the RF-systems of the collider and the booster [1]. Each of these plants will provide a 63 kW cooling capacity. The LHC has eight plants that each provide 20.4 kW cooling capacity. Although the FCC-ee needs a high amount of cryogenic cooling in the tt-modes, the cooling does not need to be distributed around the whole ring. As a result, the FCC-ee will need fewer cryogenic plants than the LHC, which may be beneficial for reliability. There are also ideas to improve plant reliability. For example, the FCC-hh conceptual design report [12] mentions an idea of using compressors with oil-free active magnetic bearings that do not need gearboxes and shaft seals.

Failure recovery time and considerations on robotic maintenance
Besides failure rates, the time required to perform corrective maintenance in the FCC has also attracted attention. Most concerns are linked to the longer distances between CERN sites and access points and to the longer tunnel sections between individual access points compared to the LHC. Regardless, there are several ways to alleviate the issue of intervention times. CERN could construct an additional site to serve points that will be far from the current sites.
There are also ways to reduce the need for interventions. Built-in redundancy in the design may reduce the number of failures and result in a lower number of reactive maintenance interventions. Instead, these interventions could be planned, which would lower the significance of the time it takes to access a point. A similar effect may also be achieved with condition-based maintenance techniques that allow predicting the remaining useful lifetime of a system [67]. Additionally, one could try to design accelerator systems such that the majority of critical systems would be located on the surface. This would reduce the need to enter the tunnel for repair work and shorten the intervention times.
A more technological way to reduce the intervention times is to implement a robotic maintenance system. For example, a futuristic vision presented in [68] describes a scenario where the computerized maintenance management system could handle spare parts and create work orders autonomously, which would then be carried out by robots. Although this scenario is far in the future, the nuclear industry has used remotely controlled robots for decades to limit the radiation exposure of humans [69], and such capabilities are also being developed at CERN [70,71]. At CERN, the robots perform monitoring and emergency response tasks.
Remote control still requires a person to operate the robot. The field of robotics is advancing rapidly, which leads to a high uncertainty on what capabilities robots will have at the time when FCC will be operated. When assessing the feasibility of substituting human labor with service robots, it is essential to understand the intended tasks [72]. The maintenance tasks differ from the current use case of today's assembly line robots. The tasks are non-repetitive as a robot would need to perform different kinds of interventions, and they will require some cognitive skills as it is unlikely that all the required tasks could be programmed in a robot.
The question of the feasibility of automated maintenance will also depend on economic factors and performance. Economic factors include the cost of the robotic system compared to human labor, the number of tasks that the robotic system can perform, and the frequency of these tasks. Performance depends on how much quicker a robot could perform maintenance compared to human intervention. One can answer these only later in the project as the answer will depend heavily on how these technologies advance and are adopted in the industry.
Some basic conclusions can already be drawn, such as that the longer distances in the FCC tunnel necessitate the use of robots and that this needs to be taken into account in the design [73]. In the tunnel, space needs to be reserved for robotic systems to operate, and systems must be designed to be repaired and maintained by robots. Doing this in a vast infrastructure calls for industrial standards. These would establish norms on how the different types of equipment and the robots should be designed to be compatible with each other.

Energy calibration
Beam energy calibration by resonant depolarization is the basis for the precise measurements of the mass and width of Z and W bosons in the FCC-ee [1]. Although lepton beams polarize spontaneously in storage rings, in the FCC-ee at the Z mode, achieving the required level of polarization could take more than 200 h [74]. This time can be shortened by using wiggler magnets that will reduce the polarization time to 45-90 min depending on the required polarization level. In the W mode, a similar time should be achievable even with spontaneous polarization.
The current plan to achieve polarization in the Z mode is that at the beginning of each fill, about 200 non-colliding low-intensity bunches will be injected per beam. These bunches will be polarized for the energy measurement before filling the rings with colliding bunches. In practice, the collider will require a 45-90 min setup time before a filling. This time was not considered in the earlier paper [3]. The calculation in Section 4.1.4 will show that this may result in stricter availability requirements in the Z-mode to allow the FCC-ee to reach the luminosity production goals.

Filling time
In the top-up operations mode, the beam is injected into the collider at the collision energy. In the FCC-ee complex, the beam will first be accelerated in the booster ring, from where it is injected into the collider. The number of bunches in the injected beam is the same as in the colliding beam, so each injection provides beam for all bunches in the colliding beam. A filling is the period within a fill from the first injections to the point when the beams have reached their nominal intensity. During this time, the current of the injected beam is about 10% of the full current of the colliding beam. Thus, a filling is completed after ten injections per beam. The beam current evolution in the collider during a filling is shown in Fig. 9. Once a filling is over, the beam current is maintained with continuous injections, as shown in Fig. 10. Table 1 shows that the amount of beam in the collider decreases as the collision energy increases in order to limit the amount of synchrotron radiation. This affects the filling times. The longest one is in the Z-mode, where a filling lasts 17 min and 14 s, while in the tt 2 -mode with the highest energy, a filling lasts 3 min and 44 s [57]. Earlier, the 17 min filling time was used for defining a 5% inefficiency budget [3], assuming that there will be about three to four failures per operation day. Here it is notable that the FCC-ee is designed such that the circulating beams are always colliding [57]. It is likely that the experiments will start to collect data during a filling. This assumption is reasonable, as even today, the experiments in the KEKB and PEP-II collect data during individual injections [75]. This foreseen ability to collect data during a filling reduces the filling time's impact on efficiency, as this time is not entirely lost in terms of production.

Injector availability
Failures occur both in colliders and injectors. In 2018, the SPS had 1063 recorded failures that lasted more than one minute. This statistic is not perfect as these failures are not limited only to those that affected or had the potential to affect the LHC. Also, the failures that lasted less than one minute were not recorded systematically.
The FCC-ee will require continuous injections to maintain a stable luminosity production level. The current assumption is that if the injections halt due to a failure, the instantaneous luminosity will decay exponentially [2]. In such a case, the instantaneous luminosity follows the equation:
$$L(t) = L_0 e^{-t/\tau}, \tag{4}$$
where $L_0$ is the instantaneous luminosity at the start of a degraded operations phase, and $\tau$ is the luminosity lifetime. Integrated luminosity is given by the equation:
$$L_{\mathrm{int}}(t) = \int_0^t L(t')\,\mathrm{d}t' = L_0 \tau \left(1 - e^{-t/\tau}\right). \tag{5}$$
We also assume that once the injector is operational, the injections can continue normally.
In the LHC, another source of injection issues is the beam quality requirements that are set to ensure machine protection and high luminosity performance. For example, in 2017, 30% of the injections were rejected [76]. In the FCC-ee, in the Z-mode, the amount of energy stored in an injected beam is similar to the LHC design value. However, the issue of rejected injections should not be as severe, because in the FCC-ee the synchrotron radiation will have a strong damping effect that may reduce the number of rejections related to beam emittance. Furthermore, some of the injection issues in the LHC are caused by the fact that beam injections occur only during fillings and the beam might not be ready at the start of a filling. In the FCC-ee, a new beam is injected continuously, and so these kinds of issues may only be present during a refilling.
Next, let us demonstrate how a short injector failure affects the luminosity production in the Z-mode based on the established assumptions. For simplicity, we will also assume that the instantaneous luminosity is approximately constant when there is no failure. In this case, the integrated luminosity is calculated with the equation
$$L_{\mathrm{int}}(t) = L_0 t. \tag{6}$$
When $L_0 = 200 \times 10^{34}$ cm$^{-2}$ s$^{-1}$ and $\tau = 4200$ s, the instantaneous luminosity starts to decay almost linearly, as Fig. 11 shows. In case of a short failure, this decay results in a minimal loss of production. Fig. 12 shows that the difference between the amount of produced luminosity in a normal case and during a fault is only 3.5% after a five-minute failure. This result is obtained by calculating the production in the two cases with Eqs. (5) and (6). Of course, if these failures are frequent, the total loss of production will become a significant issue.
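The 3.5% figure can be checked numerically. The sketch below assumes that production during the failure follows an exponential decay with lifetime 4200 s, integrated as $L_0 \tau (1 - e^{-t/\tau})$, and that normal production is $L_0 t$; the function names are illustrative only.

```python
import math

L0 = 200e34   # instantaneous luminosity at the start of the failure, cm^-2 s^-1
TAU = 4200.0  # luminosity lifetime, s


def integrated_decaying(t: float, l0: float = L0, tau: float = TAU) -> float:
    """Integrated luminosity while injections are halted (exponential decay)."""
    return l0 * tau * (1.0 - math.exp(-t / tau))


def integrated_constant(t: float, l0: float = L0) -> float:
    """Integrated luminosity with constant production (no failure)."""
    return l0 * t


def relative_loss(t: float) -> float:
    """Fraction of production lost during an injector failure of length t."""
    return 1.0 - integrated_decaying(t) / integrated_constant(t)


for minutes in (1, 5, 15, 60):
    print(f"{minutes:3d} min failure: {relative_loss(60 * minutes):.1%} of production lost")
```

For a five-minute failure the loss evaluates to about 3.5%, consistent with the value read from Fig. 12, and it grows quickly for longer failures.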

Availability modeling
The availability modeling section is divided into two parts. The first part describes a collider operations model for the FCC-ee. The goal of this task is to assess whether the FCC-ee can reach the luminosity goals shown in Table 1 with the current availability and operation assumptions. The second part shows an example case of how redundancy improves system reliability. The example uses a generic n+1 system design. The authors show this case partly to argue that similar calculations will be needed to assess the availability of FCC-ee systems once their design has matured.

Collider operation modeling
This section starts by showing what modifications are required for the FCC-hh model [11] to use it for FCC-ee operation modeling. This part is divided into operations, availability, and luminosity production to cover different aspects of the model. The developed model is then used for assessing how the unavailability affects the FCC-ee's luminosity production performance.

Operation cycle model
The hadron collider model includes a distinct operation cycle model that is shown in Fig. 13. It contains the following operational phases:
• Recovery: operations wait for the recovery of the collider or an injector.
• Setup: machine systems are prepared for particle beam injection.
• Injection: the machine is filled with nominal intensity bunches injected in bunch trains.
• Ramp-Squeeze & Adjust: beams are accelerated and squeezed to make them physically smaller and are then set to collide at the interaction points.
• Delivery: once adjustments are finished, beams are declared stable, and experiments start to collect data from the interaction points.
• Beam dump & Ramp down: beams are extracted from the machine and dumped due to an operational decision or an interlock activation in the machine protection system. After this, the energy in the magnetic fields is ramped down to a pre-injection level.
This FCC-hh cycle model needs to be altered to model FCC-ee operations. As the machine operates in a top-up mode, dedicated ramp and ramp-down phases are not needed. Also, the injector chain unavailability affects machine operations differently. In the FCC-hh, injector unavailability affects collider operations only during the injection phase. In contrast, the FCC-ee will require a new beam all the time. With continuous injections, an injector can be understood to be an on-off machine. If an injector is unavailable, the luminosity production level will decay, and a new beam can be injected normally once the injectors are available. Also, the hadron collider model simulated individual injections. This would be excessive in the FCC-ee, where injections occur continuously at intervals of less than one minute.
An operation cycle model can be constructed based on these assumptions. Fig. 14 shows the cycle where the injection phase is now called the ''filling'' and the phase ''degraded operations'' stands for a mode where collisions are ongoing, but an injector is unavailable. The model can enter this phase either from the filling phase or from the delivery phase. Similar to the hadron collider model, a failure in the collider will trigger a beam dump, and the model will enter into the recovery phase. The model also assumes that a failure in an injector during a setup phase will cause a transition to the recovery phase.
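The cycle described above can be sketched as a small state machine. This is only an illustration of the phases named in the text, not the authors' simulation model: the event names are hypothetical, and the transition from degraded operations back to filling once the injector recovers is an assumption, since the text states only that injections continue normally afterwards.

```python
from enum import Enum, auto


class Phase(Enum):
    RECOVERY = auto()
    SETUP = auto()
    FILLING = auto()
    DELIVERY = auto()
    DEGRADED = auto()  # collisions ongoing, injector unavailable


class Event(Enum):  # HYPOTHETICAL event names, chosen for this sketch
    COLLIDER_FAILURE = auto()
    INJECTOR_FAILURE = auto()
    INJECTOR_RECOVERED = auto()
    RECOVERED = auto()   # all failed machines repaired
    PHASE_DONE = auto()  # nominal end of the current phase


TRANSITIONS = {
    (Phase.RECOVERY, Event.RECOVERED): Phase.SETUP,
    (Phase.SETUP, Event.PHASE_DONE): Phase.FILLING,
    (Phase.SETUP, Event.INJECTOR_FAILURE): Phase.RECOVERY,
    (Phase.FILLING, Event.PHASE_DONE): Phase.DELIVERY,
    (Phase.FILLING, Event.INJECTOR_FAILURE): Phase.DEGRADED,
    (Phase.DELIVERY, Event.INJECTOR_FAILURE): Phase.DEGRADED,
    # ASSUMPTION: after injector recovery the beam current is topped up again.
    (Phase.DEGRADED, Event.INJECTOR_RECOVERED): Phase.FILLING,
}


def next_phase(phase: Phase, event: Event) -> Phase:
    """Return the phase that follows `phase` when `event` occurs."""
    if event is Event.COLLIDER_FAILURE:
        # A collider failure triggers a beam dump from any phase.
        return Phase.RECOVERY
    # Events without a defined transition leave the phase unchanged.
    return TRANSITIONS.get((phase, event), phase)


history = [Phase.RECOVERY]
for ev in (Event.RECOVERED, Event.PHASE_DONE, Event.PHASE_DONE,
           Event.INJECTOR_FAILURE, Event.INJECTOR_RECOVERED, Event.COLLIDER_FAILURE):
    history.append(next_phase(history[-1], ev))
print(" -> ".join(p.name for p in history))
```

Keeping the transitions in a single table makes it easy to compare the sketch against Fig. 14 and to change an assumed transition without touching the rest of the logic.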

Failure model
In the hadron collider model, system availability is modeled with fault trees that send information on failures and fault recoveries to the collider cycle model. The failure rates depend on the active cycle phase. This feature is important as some failures can only occur if there is a beam in the machine. The model implements this by informing the fault tree on the active cycle phase. Similarly, the fault tree informs the cycle model on failures that prohibit operations.
Built-in redundancies in different systems allow planning the corrective maintenance interventions during occasions that minimize the harm on operations. Taking this into account requires modeling an operational schedule that includes the planned maintenance interventions. Schedule modeling was done in the LHC operation model [10], but this implementation did not include deferred corrective maintenance as a feature. Implementing this is not difficult and was, for example, done to model CERN's PSB RF system reliability [46].
In this paper, to demonstrate the functionality of the collider model and to calculate basic results, only the top-level failures are modeled for the collider and injectors. The practical implementation of redundancy in system fault modeling is developed in Section 4.2 for a system with a redundant n+1 design.

Luminosity production
In the hadron collider model, luminosity production is limited to the delivery phase. This will not be the case in the FCC-ee, where the collision production starts when the beam is injected into the machine [57]. Figs. 9 and 10 in Section 3.4.2 showed the beam current during the filling and the delivery phases.
Availability modeling considers long time scales, and in this application, even a rough approximation of the luminosity production is sufficient. During the filling phase, the production rate is assumed to increase linearly, and in the delivery phase, the production rate is assumed to stay constant. With these assumptions, the instantaneous luminosity can be estimated with the equation:
$$L(t) = \begin{cases} L_0 + k t, & \text{filling phase}, \\ L_{\mathrm{d}}, & \text{delivery phase}, \end{cases} \tag{7}$$
where $k$ is the assumed slope factor for the filling phase, $L_0$ the instantaneous luminosity at the start of the filling phase and $L_{\mathrm{d}}$ is the assumed constant production rate during delivery. Further, integrated luminosity production follows the equation:
$$L_{\mathrm{int}}(t) = \begin{cases} L_0 t + \tfrac{1}{2} k t^2, & \text{filling phase}, \\ L_{\mathrm{d}} t, & \text{delivery phase}, \end{cases} \tag{8}$$
with $t$ measured from the start of the respective phase. During injector unavailability, we use Eq. (5) to calculate the integrated luminosity. Table 7 shows the parameter values for different operation modes.
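The linear-then-constant production model can be sketched as follows. The example works in normalized units (delivery luminosity equal to 1) and uses the 17 min 14 s Z-mode filling time from Section 3.4.2; the starting level of 10% of the delivery luminosity and the resulting slope are placeholder assumptions and are not the Table 7 values.

```python
def instantaneous(t: float, t_fill: float, l_start: float,
                  slope: float, l_delivery: float) -> float:
    """Production rate at time t after the start of a filling."""
    if t < t_fill:
        return l_start + slope * t  # linear increase during filling
    return l_delivery               # constant during delivery


def integrated(t: float, t_fill: float, l_start: float,
               slope: float, l_delivery: float) -> float:
    """Production accumulated between the start of a filling and time t."""
    t_f = min(t, t_fill)
    filling = l_start * t_f + 0.5 * slope * t_f**2
    delivery = l_delivery * max(t - t_fill, 0.0)
    return filling + delivery


T_FILL = 17 * 60 + 14               # Z-mode filling time in seconds
L_DELIVERY = 1.0                    # normalized delivery luminosity
L_START = 0.1 * L_DELIVERY          # ASSUMPTION: placeholder starting level
SLOPE = (L_DELIVERY - L_START) / T_FILL  # ASSUMPTION: reach L_DELIVERY at end of filling

params = (T_FILL, L_START, SLOPE, L_DELIVERY)
during_filling = integrated(T_FILL, *params)
print(f"Production during filling: {during_filling / (L_DELIVERY * T_FILL):.0%} "
      "of what the same time in delivery would give")
```

With these placeholder values, the filling phase yields a little over half of the production of an equally long delivery period, which illustrates why the filling time is not entirely lost.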

Calculation
This section presents analyses of how failure and recovery rates in the injectors and the collider affect the integrated luminosity production. Failures and repairs are modeled with exponential distributions. This distribution has a single parameter, the failure or repair rate $\lambda$. A cumulative distribution function gives the probability that an event has occurred before the time $t$. For the exponential distribution, this function is given by the equation:
$$F(t) = 1 - e^{-\lambda t}. \tag{9}$$
The exponential distribution assumes that the event probability is time-independent. For the failure rate, this is often a sensible assumption. However, for the repair rate, this assumption can cause the length of a repair to vary more than in reality. As such, using this assumption does not change the mean value, but might increase the variance of the result. This analysis is mostly interested in mean values, and in this case, the assumption does not affect the results.
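The remark on repair-time variance can be illustrated by sampling. The sketch below draws exponentially distributed repair times with a 1 h mean, a value chosen only for illustration, and compares their spread with a hypothetical fixed 1 h repair, which has the same mean but zero variance.

```python
import math
import random
import statistics


def exp_cdf(t: float, rate: float) -> float:
    """Probability that an exponentially distributed event occurs before t."""
    return 1.0 - math.exp(-rate * t)


MTTR_H = 1.0            # illustrative mean time to repair, hours
rng = random.Random(42)  # fixed seed so that the sketch is reproducible

# random.expovariate takes the rate, i.e. 1 / mean.
samples = [rng.expovariate(1.0 / MTTR_H) for _ in range(100_000)]
mean_repair = statistics.fmean(samples)
std_repair = statistics.pstdev(samples)

print(f"Mean repair time:   {mean_repair:.2f} h")
print(f"Standard deviation: {std_repair:.2f} h (a fixed-length repair would give 0)")
print(f"Share of repairs finished within the mean time: {exp_cdf(MTTR_H, 1.0 / MTTR_H):.0%}")
```

The sampled mean stays at the chosen value, while the standard deviation is as large as the mean itself, which is the extra variability the text refers to.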
The following calculation assumes 91% availability for the collider and 91% availability for the injectors. The calculation gives an overall availability of about 83%, which is close to the original 80% target. The 91% availability for the collider and the injector chain is achieved by setting the MTTF to 10 h and the MTTR to 1 h, as can be seen from the equation
$$A = \frac{\mathrm{MTTF}}{\mathrm{MTTF} + \mathrm{MTTR}}, \tag{10}$$
where MTTF is the mean time to failure and MTTR the mean time to repair. For the Z-mode, the calculation assumes a 1.5 h long setup time for energy calibration, as described in Section 3.4.1. For other modes, the calculation assumes a 10 min long setup time. Table 8 shows that with these assumptions, the FCC-ee surpasses the annual target luminosity in all but the Z-mode. The calculation did not, however, take into account the effect of commissioning and intensity ramp-up. For the LHC, earlier calculations have assumed that this lowers the annual performance by 10%. If this assumption were applied, the FCC-ee simulated production in the W, H, and tt 2 -modes would match the target values.
As the initial values failed to reach the target in the Z-mode, we performed a sensitivity analysis to study what availability is required to reach the luminosity goal. The result in Fig. 15 shows that the luminosity production is more affected by the failure rate of the collider ring than the failure rate of the injector chain. The same effect can be seen in Fig. 16, where the effects of the collider ring's and injectors' failure and recovery rates are studied. Collider failures affect the luminosity production more because they stop luminosity production, while we assume that a failure in an injector only reduces the production rate over time, as described in Section 3.4.3. This effect is also the reason why luminosity production is more sensitive to the length of an injector failure than the failure rate. In case of a collider failure, these parameters seem equally important.
Based on the sensitivity analyses, FCC-ee reaches the target luminosity in the most challenging Z-mode when the collider MTTF is over 15 h, and the injector MTTF is over 10 h with 1-h MTTR in both machines. Eq. (10) shows that the collider MTTF and MTTR values result in about 94% availability requirement for the collider ring.
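The availability figures quoted in this section follow from the steady-state relation A = MTTF / (MTTF + MTTR). The short sketch below reproduces them; treating the combined availability as the product of the collider and injector values is a simplifying assumption of independent machines that are both required, and is not necessarily how the simulation arrives at its result.

```python
def availability(mttf_h: float, mttr_h: float) -> float:
    """Steady-state availability from mean time to failure and mean time to repair."""
    return mttf_h / (mttf_h + mttr_h)


a_initial = availability(mttf_h=10.0, mttr_h=1.0)   # initial collider and injector values
a_required = availability(mttf_h=15.0, mttr_h=1.0)  # collider value from the sensitivity analysis

# ASSUMPTION: independent machines that must both be available.
a_combined = a_initial * a_initial

print(f"MTTF 10 h, MTTR 1 h: {a_initial:.1%}")
print(f"Collider and injectors combined: {a_combined:.1%}")
print(f"MTTF 15 h, MTTR 1 h: {a_required:.1%}")
```

The three values evaluate to roughly 91%, 83% and 94%, matching the numbers used in the text.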

Case description
This section demonstrates how n+1 redundancy in design can improve system availability. The case acts as an example of an analysis that should be performed for different FCC-ee systems as their design matures. The presented case is based on the paper [77] that presents a reliability study of a power converter with built-in redundancy. The converter consists of two identical units that share the load when both of them are working. If one of them fails, the other unit will carry the full load.
This study presents how different designs of n+1 redundancy or the lack of redundancy affect the reliability of a system. The initial failure rates are from the paper [77] that uses Weibull distributions to depict the failure behavior. The cumulative distribution, namely the probability that a unit has failed before time $t$, is given by the equation:
$$F(t) = 1 - e^{-(t/\eta)^{\beta}}, \tag{11}$$
where $\eta$ is a scale parameter, and $\beta$ is a so-called shape parameter that allows modeling effects of age on the failure rate. The effect of the load is implemented with an Acceleration Factor (AF) that modifies the $\eta$ parameter. The stress-relative value is obtained by dividing the reference value by AF:
$$\eta = \frac{\eta_{\mathrm{ref}}}{AF}. \tag{12}$$
We will use this concept further to study how improved or reduced reliability will affect the results.
The system failure behavior is described with a fault tree that is shown in Fig. 17, where a system consists of n+1 units and each unit has three possible failure modes. The unidentified fault stands for events where the failure cause was not recorded.
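The effect of the acceleration factor on a Weibull failure model can be sketched as below. The reference scale, the shape parameter and the acceleration factor are hypothetical placeholders and are not the values of [77] or Table 9; the sketch assumes that AF divides the scale parameter, as described above.

```python
import math


def weibull_cdf(t: float, eta: float, beta: float) -> float:
    """Probability that a unit has failed before time t (Weibull distribution)."""
    return 1.0 - math.exp(-((t / eta) ** beta))


def stressed_scale(eta_ref: float, af: float) -> float:
    """Scale parameter under stress: the reference value divided by AF."""
    return eta_ref / af


ETA_REF_DAYS = 5000.0  # HYPOTHETICAL reference scale parameter
BETA = 1.5             # HYPOTHETICAL shape parameter (failure rate grows with age)
AF_FULL_LOAD = 2.0     # HYPOTHETICAL acceleration factor for a unit carrying the full load
PERIOD_DAYS = 185.0    # one operational year, as used in the calculation section

p_shared = weibull_cdf(PERIOD_DAYS, ETA_REF_DAYS, BETA)
p_full = weibull_cdf(PERIOD_DAYS, stressed_scale(ETA_REF_DAYS, AF_FULL_LOAD), BETA)

print(f"Failure probability in {PERIOD_DAYS:.0f} days, reference load: {p_shared:.3%}")
print(f"Failure probability in {PERIOD_DAYS:.0f} days, full load:      {p_full:.3%}")
```

Because the acceleration factor shortens the scale parameter, a unit that has to carry the full load after its partner fails is more likely to fail within the same period.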

Calculation
Initial values for the Weibull distributions' $\eta$ and $\beta$ parameters are presented in Table 9, where values for different current loads are calculated with the acceleration factor techniques presented in [77]. A system failure is assumed to cause 3.5 h of downtime. The paper [77] assumed that the failure of an individual unit can be repaired on average within five days without an adverse effect on operations. Here we assume instead that the corrective maintenance for unit failures is deferred to an annual shutdown after 185 days of operations.
The calculation period duration is set to 14 × 185 days, which corresponds to the number of operational days during the life-cycle of the FCC-ee. The simulation is repeated 10 000 times to obtain statistically credible results. Fig. 18 shows the results for systems with different n+1 designs and the sensitivity to the failure rate. The system in question is remarkably reliable as it is a part of the machine protection interlock system. Therefore, significant differences only appear when the reliability is decreased. Interestingly, the 5+1 design is less reliable than the non-redundant design when the $\eta$ parameters are reduced to 25% of the original values.
A study of different designs is somewhat abstract, especially when, in the preceding example, different designs had different capabilities. The next example examines how redundancy affects a system that requires 20 units to function. The units are organized in clusters with different levels of redundancy. Table 10 shows the considered cases.
The highest level of redundancy, 1+1, leads to a system that has 20 clusters with 20 redundant units and is likely to be costly to implement. The investment should result in a more reliable system that would pay off the investment cost by providing high availability. The results in Fig. 19 show that in this case, any redundancy has a positive impact on system availability, as the downtime for the 20+0 design is an order of magnitude higher than for the 20+1 design. Regardless, the effect of the redundancy always has to be quantified, as Fig. 18 showed a case where the 1+0 design was better than the 5+1 design when the failure rate was high.
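The observation that a redundant design can be worse than a non-redundant one at high failure rates can be reproduced with a much simpler model than the simulation used here. The sketch below assumes independent units with the same failure probability over one maintenance period, ignores load sharing and aging, and counts a system failure when more units fail than there are spares; the unit failure probabilities are hypothetical.

```python
from math import comb


def system_failure_prob(p_unit: float, n_required: int, n_spare: int) -> float:
    """Probability that an n_required + n_spare system fails within one period,
    i.e. that more than n_spare of its independent units fail."""
    total = n_required + n_spare
    survive = sum(
        comb(total, k) * p_unit**k * (1 - p_unit) ** (total - k)
        for k in range(n_spare + 1)
    )
    return 1.0 - survive


for p_unit in (0.01, 0.5):  # HYPOTHETICAL unit failure probabilities per period
    p_1_0 = system_failure_prob(p_unit, n_required=1, n_spare=0)
    p_5_1 = system_failure_prob(p_unit, n_required=5, n_spare=1)
    better = "5+1" if p_5_1 < p_1_0 else "1+0"
    print(f"p_unit = {p_unit}: 1+0 fails with {p_1_0:.4f}, "
          f"5+1 fails with {p_5_1:.4f} -> {better} is more reliable")
```

In this simplified picture the 5+1 design wins when units are reliable, but loses when the unit failure probability is high, because six units offer more opportunities for a double failure than a single spare can absorb. The same function can be applied to the clustered cases of Table 10, for example by comparing `system_failure_prob(p, 20, 0)` with `system_failure_prob(p, 20, 1)`.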

Conclusions and discussion
The Future Circular Electron-Positron Collider (FCC-ee) is planned to be the world's largest particle collider and a precision instrument to study the heaviest known particles. Achieving substantial physics results requires producing high amounts of integrated luminosity, which calls for sufficient machine availability. A previous study [3] defined an 80% hardware availability goal, and one of the objectives of this paper was to test this assumption in an operation model.
The FCC-hh study developed an operations modeling platform that allows allocating the availability goals for different machines and studying different cases. This paper altered the existing model [10,11] to represent FCC-ee operations. The results of the simulations show that the FCC-ee can reach the luminosity goals in the H, W and tt-modes.
However, the set target might not be sufficient in the Z-mode, where the filling time and the energy calibration time are longest. This might further result in a tighter availability requirement in this mode. On the other hand, it is notable that the RF (Section 3.2.1) and the cryogenic systems (Section 3.2.6) are much more complex in the other modes. Due to this, it might be possible that at the system level, the availability requirements for the individual components may be reversed between the modes.
The results show that the FCC-ee operations can sustain some injector unavailability despite the top-up injection scheme. A contributing factor to this result was our assumption that the injection can continue normally once the injector has recovered from a failure. There will, however, be cases when this assumption is not correct. For example, if an FCC-ee booster failure requires an intervention in the tunnel, it is clear that the radiation hazard prohibits access while the collider has a beam. In such cases, the beam will be dumped before the intervention. In a more detailed system analysis, one should study which failures require intervention and how to reduce this need.
One way to reduce the interventions is by adopting a fault-tolerant system design. Throughout the paper, special attention was given to this design as it has been proven to increase the availability in many accelerator applications. This paper presented a list of availability-critical systems for the FCC-ee. Redundancy is shown to improve the availability of power converters, the RF system, and electricity distribution. The concept is also used heavily in the machine protection system to improve the reliability or availability with different trigger configurations.
This paper presented the benefits of redundancy with calculation cases on a generic n+1 system design. The case also showed that it is essential to quantify these benefits and to consider the cost efficiency. In the authors' opinion, the first candidates for such analyses in the FCC-ee should be the power converters and the RF-system. In both of these cases, the system implementation for the FCC-ee collider and booster will be vast. This paper also uncovered an additional challenge for the powering system of the booster ring. It will likely need an energy storage system to shield the network from continuous ramp cycles, which will increase the system complexity.
To conclude, this study was based on the conceptual design report of the FCC-ee [1] and resulted in the above-mentioned findings. Although these findings are valuable, the maturity of the FCC-ee's design and operation plans is still too low to benefit from the full potential of the methods used. However, it is clear that this topic should be readdressed once the design of the FCC-ee has matured, as it is essential to ensure that the FCC-ee will reach the set luminosity goals.

Declaration of competing interest
The corresponding author Arto Niemi is employed by CERN and the co-author J-P Penttinen by Ramentor Oy. While writing this paper, the authors benefited from comments and insights from a large number of colleagues who are mentioned in the acknowledgment section. That section also mentions that the research has received EU funding under two different Grant agreements. We further take the opportunity to clarify that J-P Penttinen is still a Ph.D. student at Tampere University, while his thesis is waiting to be published.