Preparing the hardware of the CMS Electromagnetic Calorimeter control and safety systems for LHC Run 2

The Detector Control System of the CMS Electromagnetic Calorimeter has undergone significant improvements during the first LHC Long Shutdown. Based on the experience acquired during the first period of physics data taking of the LHC, several hardware projects were carried out to improve data accuracy, to minimise the impact of failures and to extend remote control possibilities in order to accelerate recovery from problematic situations. This paper outlines the hardware of the detector control and safety systems and explains in detail the requirements, design and commissioning of the new hardware projects.


The CMS ECAL detector control system
The role of the CMS ECAL DCS is to monitor and summarise operating conditions and to enable control of the power supplies that deliver power to the detector hardware. The DCS software is built using the WinCC Open Architecture (WinCC OA) control system toolkit from ETM GmbH. It also makes use of existing CERN software developments in the form of the JCOP Framework [3] and components provided by the Central CMS DCS group [4]. Industry standards are used where possible, such as the OPC Data Access (OPC DA), Modbus and S7 protocols, to communicate with standard hardware components. The DCS software runs on three Dell blade servers running the Windows Server 2008 R2 operating system. A further set of three servers runs a replica of the software to act as a hot standby, providing redundancy in the event of a critical failure in the primary system. An overview of the system architecture is presented in figure 1.
In order to fully support redundancy with a seamless transition between the two running systems, all hardware devices to be monitored and controlled must be accessible from both sets of servers. This precludes the use of interfaces based on PCI and USB which typically attach peripherals to a single host computer. The chosen solution for the CMS ECAL DCS has been to install converters to provide access to existing field buses, such as Controller Area Network (CAN bus) and RS485, over Ethernet. These devices have been successfully tested and validated in the DCS environment. All hardware communication is now carried over Ethernet, except for the systems based on CAN, which will be upgraded later in 2015.
Finite State Machines (FSM) [5] are implemented to summarise the process variables of each device in a single human-readable state. They also allow the use of simple control commands (such as "ON" and "OFF") without detailed knowledge of the underlying hardware.

JINST 11 C01020
A hierarchy of FSMs is used to model the physical subdivision of the ECAL detector. This hierarchy enables the states of the various sub-components to be clearly summarised at a higher level and allows high-level commands (such as "ON" or "OFF" commands for the entire detector) to be propagated down to individual devices in a controlled way.
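The summarisation and command-propagation behaviour described above can be illustrated with a minimal sketch. This is not the JCOP FSM implementation; the class, the state names beyond "ON"/"OFF" and the toy hierarchy are purely illustrative.

```python
# Minimal sketch of a hierarchical finite state machine (FSM), illustrating
# state summarisation upwards and command propagation downwards.
# All names and the "ERROR"/"MIXED" summarisation rules are illustrative.

class FsmNode:
    def __init__(self, name, children=None):
        self.name = name
        self.children = children or []
        self._state = "OFF"  # only meaningful for leaf (device) nodes

    @property
    def state(self):
        # A parent summarises its children: any ERROR dominates, a uniform
        # state is reported as-is, anything else is reported as MIXED.
        if not self.children:
            return self._state
        states = {child.state for child in self.children}
        if "ERROR" in states:
            return "ERROR"
        return states.pop() if len(states) == 1 else "MIXED"

    def command(self, cmd):
        # High-level commands propagate down to every leaf device.
        if not self.children:
            self._state = cmd  # a real leaf would drive hardware here
        for child in self.children:
            child.command(cmd)

# Toy hierarchy: detector -> partitions -> individual channels
ecal = FsmNode("ECAL", [
    FsmNode("EB", [FsmNode("EB_HV_ch1"), FsmNode("EB_HV_ch2")]),
    FsmNode("EE", [FsmNode("EE_HV_ch1")]),
])
ecal.command("ON")
print(ecal.state)  # ON
```

Sending "ON" to the root node reaches every leaf, and a fault in any single channel surfaces at the top of the hierarchy without the operator needing to inspect individual devices.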

Barrel and Endcap environmental condition monitoring
The EB and EE are composed of 75,848 lead tungstate scintillator crystals. The scintillation response of these crystals varies with temperature and it is therefore critical to have precise monitoring of the temperatures in the detector volume. This is one of the key tasks of the DCS.
The temperature is monitored with 18 Embedded Local Monitor Boards (ELMB) [6] which compare signals from 512 thermistors mounted in the detector volume against signals from precise current sources. The ELMBs feature a CAN interface and are currently connected to the DCS via a CAN-USB interface.
Previously, the relative humidity in EB and EE was also monitored with the same ELMB infrastructure. However, as described in section 4.2, a newly designed system has recently been deployed.

Powering systems
The detector hardware requires high voltage, ranging from 60V to 800V, to bias the active sensing elements, which consist of Avalanche Photo Diodes (APD) for EB, Vacuum Phototriodes (VPT) for EE and silicon sensors for ES. Low voltage power supplies deliver more than 100kW to the on-detector electronics.
The powering hardware is provided by two commercial vendors, CAEN SpA and Wiener Plein & Baus GmbH. CAEN SY4527 mainframes are used to deliver high voltage to EB, EE and ES in addition to the supply of low voltage to ES. Radiation and magnetic field tolerant Wiener power supplies are used for providing low voltage to EB and EE. In total, the DCS controls 1624 high voltage channels and 1060 low voltage channels.

Safety systems
The CMS ECAL detector hardware is also protected by high-reliability, PLC-based safety systems. These monitor conditions that are important for the safe operation of the detector and can act to bring the detector to a safe state by applying failsafe, hardwired interlocks. They are described in more detail in section 3. The DCS has an interface to these safety systems to allow monitoring, visualisation and archiving of the safety-related data, as well as enabling the manual trigger and release of interlocks.
Automatic protective actions are implemented in the control system software to avoid major deviations from the nominal conditions. These actions are designed to shut down the detector in a controlled way, avoiding the need for the safety systems to act. However, the safety systems are the ultimate safety mechanism in cases where the software layer is unavailable or fails to act.

Interfaces to other systems
In addition to the powering and safety systems, the DCS has interfaces to several other systems. One such interface is to the Central CMS DCS, in order to provide integrated and centralised monitoring of ECAL by the CMS DCS shifter in the control room, as well as to benefit from centralised services such as the distribution of access control authorisation information. Other systems require less integration, requiring only the exchange of high level data, which is transferred using the CERN Data Interchange Protocol (DIP) [7]. In this way, real-time information from the LHC, CMS magnet and EB/EE and ES cooling systems is incorporated into the DCS to provide a complete overview of the operating conditions.

The CMS ECAL safety systems
The CMS ECAL features two independent safety systems: one assures the safety of the EB and EE partitions, while the other is dedicated to ES. Both systems are implemented with Siemens PLCs to ensure fast and reliable execution of the necessary actions to bring the detector into a safe state. An architectural overview of the systems is shown in figure 2.
The safety systems gather information from sensors located inside the detector volume, which measure temperature, relative humidity and detect water leakages. Additionally, the safety systems collect information from other PLC systems. These links are implemented with digital signals through failsafe hardwired connections, ensuring a reliable and dependable interconnection between systems. Information from the CMS magnet, the Detector Safety System (DSS) and the ECAL cooling systems are used by the safety systems to determine whether or not it is safe for the detector to be powered. When a safety critical condition is detected, the safety systems can act by interlocking the powering hardware to interrupt and prevent the supply of power to the detector. The safety systems are also able to send signals to the DSS and ECAL cooling systems, in case further actions need to be taken in these external systems.
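The interlock decision described above amounts to a conjunction of conditions. The sketch below illustrates the logic in simplified form; the thresholds, signal names and parameters are hypothetical, not the real PLC configuration.

```python
# Illustrative sketch of a safety-system interlock decision: power is
# permitted only while every monitored condition is within limits and the
# hardwired status inputs from external systems read OK.
# Thresholds and signal names are hypothetical, not the real PLC logic.

def power_permitted(temperatures_c, humidities_pct, leak_detected,
                    magnet_ok, dss_ok, cooling_ok,
                    t_max_c=25.0, rh_max_pct=50.0):
    """Return True if it is safe for the detector to be powered."""
    if leak_detected:
        return False
    if any(t > t_max_c for t in temperatures_c):
        return False
    if any(rh > rh_max_pct for rh in humidities_pct):
        return False
    # External systems are connected through failsafe digital signals:
    # a lost or low signal reads as "not OK" and interlocks power.
    return magnet_ok and dss_ok and cooling_ok

print(power_permitted([18.2, 18.4], [30.0], False, True, True, True))  # True
print(power_permitted([18.2, 26.0], [30.0], False, True, True, True))  # False
```

In the real system this evaluation runs continuously in the PLC scan cycle, and a False result drives the hardwired interlock outputs to the powering hardware.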
In addition to the functionality described above, the ES safety system also implements PID control and safety related actions for the thermal screen. The PID loop controls heaters to ensure that the external surfaces of ES are maintained at a constant temperature, thermally isolating the internal ES detector volume from neighbouring subdetectors.
While the ES safety system is based solely on commercial Siemens PLC components, the EB/EE safety system uses a combination of Siemens hardware and two custom designed hardware units, developed by the CMS Belgrade Group. The first of these units is used to monitor the temperature and water leakage sensors and to package and send this data to the PLC via a custom protocol on an RS485 bus. These units are called readout units and use a PIC microcontroller to digitise the probe signals and to handle the bus communication. The second type of unit, called interlock units, handles input and output interlock signals.

Hardware upgrades during LHC LS1
As a critical part of the detector operations, the 24/7 DCS on-call service responds rapidly to resolve issues whenever they occur. By documenting these interventions during Run 1, it was possible to identify particular topics where improvements could be made. Software changes could be made during short technical stops of the LHC, but hardware changes required longer periods to complete the migration and testing. For this reason LS1 was an excellent opportunity to upgrade and consolidate the hardware of the control and safety systems. For the new hardware projects, existing technologies and standards were used where possible to simplify the design effort and minimize the long term maintenance load.

High Voltage mainframe remote reset
It was observed that the CAEN mainframes can occasionally enter a state in which it is no longer possible to communicate with them remotely. To recover from this situation, it was previously necessary to power cycle the mainframe manually, which involved turning a key on the front panel. Depending on when such an error occurred, this could introduce a significant delay before the system was recovered.
To avoid this delay, a remote reset system was designed and implemented. The CAEN mainframes accept either a NIM or TTL signal on the front panel which can be used to trigger a reboot [8]. If a signal with a pulse length between 100ms and 200ms is sent, only the CPU of the mainframe is reset, which has no impact on the output power channels and is typically sufficient to resolve most communication issues. If a pulse of longer than 1000ms is sent, the mainframe reboots the CPU and resets the backplane which immediately cuts the power to all output channels.
The system was implemented using an Arduino Ethernet, with each unit providing TTL outputs to reset up to 14 CAEN mainframes. The unit can send pulses with a configurable length, providing access to both types of reset action. Due to the geographical distribution of the 23 mainframes used for ECAL, a total of 3 units were deployed. An implementation of Modbus was programmed in the Arduino to enable direct access from the DCS software using the native Modbus driver of WinCC OA.
The units have been installed and integrated into the DCS software. They have been used to resolve several real issues, proving to be an efficient way to speed up recovery of communication with the CAEN powering system. This extension has the potential to reduce downtime in Run 2 by ensuring that the powering systems are always controllable from the DCS and hence, always in the desired state.

Improved relative humidity monitoring system
The performance of the original ELMB-based humidity readout system for EB and EE was limited by the parasitic capacitance of the long cables between the humidity probes in the detector volume and the readout electronics installed outside of the detector. The resistive humidity probes, UPS-600 humidity sensors from Ohmic Instruments [9], require an AC excitation signal with a specified frequency between 33Hz and 10kHz. The original ECAL application used a frequency of 400Hz.
The parasitic capacitance of the cables imposed a lower limit on the readable humidity values. At low humidity values, the probe impedance became much higher than the parallel impedance of the cable, so the system was no longer sensitive to changes in humidity. The observable range of relative humidity was 60-80%. This range did not include the nominal, low humidity levels of the CMS ECAL, but was able to indicate anomalous rises in humidity.
To overcome this limitation, it was decided to reduce the probe excitation frequency to widen the range of probe impedances that could be monitored. A DC excitation is prohibited because it can lead to drift in the readout values, so it was decided to target an AC signal with a frequency of 1Hz. This value is outside of the specifications of the probe, so an intensive testing campaign was carried out in order to evaluate the long term impact on the probes. Any degradation of performance would be unacceptable as the probes are inaccessible and cannot be repaired or replaced for the lifetime of the current detector hardware. The tests with a low frequency were successful and demonstrated that the lower frequency excitation was a feasible method to improve the humidity readout range.
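A back-of-the-envelope calculation shows why lowering the excitation frequency widens the measurable impedance range. The cable capacitance used below is an assumed, illustrative figure, not a measured ECAL value.

```python
import math

# Why a lower excitation frequency extends the humidity readout range:
# the parasitic cable capacitance shunts the resistive probe, so probe
# impedances well above |Z_cable| = 1/(2*pi*f*C) cannot be resolved.
# The capacitance value below is an assumed, illustrative figure.

def cable_impedance_ohm(freq_hz, capacitance_f):
    """Magnitude of the parasitic cable impedance, |Z| = 1/(2*pi*f*C)."""
    return 1.0 / (2.0 * math.pi * freq_hz * capacitance_f)

C_CABLE = 10e-9  # assume ~10 nF of parasitic cable capacitance

z_400hz = cable_impedance_ohm(400.0, C_CABLE)  # original excitation
z_1hz = cable_impedance_ohm(1.0, C_CABLE)      # new excitation

# Dropping from 400 Hz to 1 Hz raises the impedance ceiling by a factor
# of 400, extending sensitivity to the much higher probe impedances seen
# at low relative humidity.
print(f"|Z| at 400 Hz: {z_400hz / 1e3:.0f} kOhm")
print(f"|Z| at   1 Hz: {z_1hz / 1e6:.1f} MOhm")
print(f"improvement: x{z_1hz / z_400hz:.0f}")
```

Whatever the actual cable capacitance, the ratio of the two impedance ceilings depends only on the frequency ratio, so the factor-of-400 gain in range holds independently of the assumed value.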
A completely new excitation and readout system was designed and implemented by the CMS Belgrade Group. Due to the positive experience with the PIC-based, Belgrade-developed safety system readout units that are installed in the experimental cavern, the new humidity readout was implemented using PIC18F452 microcontrollers. The microcontroller is used to coordinate the generation of the excitation signal, the digitisation of the amplified probe signals and the communication with the DCS supervision software. The excitation is precisely generated and ensured to be symmetrical to avoid causing drift of the humidity probes. The signal amplifiers are logarithmic in order to handle the large dynamic range of the measured signals and feature diode-based temperature compensation.
Communication between the DCS and the readout units was implemented with the Modbus protocol. A simple Modbus implementation was created in the PIC microcontroller to provide access via the WinCC OA Modbus driver. At the location where the units are installed, in the experimental cavern, there is no Ethernet service available. For this reason, RS485 was used to transport the Modbus data to the service cavern where commercial Modbus RS485-Ethernet adapters were installed.
The units have been successfully deployed and the humidity data is integrated into the DCS software. With this upgrade, the range of relative humidity that can now be observed has been extended to 10-80%.

Upgraded power distribution for the precision temperature monitoring system
Following experience from Run 1, it was observed that a single failure of a module of the ELMB-based temperature readout for EB and EE could lead to degradation of the complete temperature monitoring system. To avoid this situation, a new power distribution network to deliver power to the readout electronics was designed.
The new power distribution was designed to provide higher granularity to limit the consequences of a single failure. The temperature monitoring hardware requires 3 independent 12V inputs to power the ELMBs and a further 5V supply to power the precision current sources. The new powering network features switches and fuses on each of the distributed power lines. If a failure occurs, the fuse will isolate the problematic components, so that the remainder of the system can continue to function normally. The switches can be used to carry out debugging and to isolate parts of the system to perform repairs.

Safety system hardware updates and spare components
As the safety systems are critical for the operation of CMS ECAL and must run with the highest availability when the LHC is running, it was necessary to take steps to ensure that the systems continue to work successfully for the duration of Run 2.
The EB/EE safety system CPUs were due to reach the end of their supported lifecycle in July 2015, meaning that no more spare parts or repair services would be available. For this reason, the decision was taken to replace the CPUs with a newer, equivalent model to guarantee full support until October 2022.
Spare parts for the Siemens PLCs are provided by CERN PLC Spare Parts Critical Stock which is accessible 24/7. For the custom elements of the safety systems, developed by the CMS Belgrade Group, the local spare stock was reinforced. There were already four interlock units in stock and the manufacture of four readout units was commissioned to ensure that the spare stock is equivalent to at least 33% of the production system.
In addition to producing new readout units, there was also a campaign to build an additional stock of pre-programmed PIC microcontrollers, which are the most important component of the readout units. While the data retention of the PIC program memory is quoted as being 40 years [10], the units operate in a hostile environment with exposure to radiation and magnetic fields. For this reason, a batch of additional PICs were programmed and are stored in the local spare stock. To evaluate the new batch of microcontrollers and to monitor any differences over time compared to the previous generation, two new PICs were installed in existing readout units in the ECAL safety system during LS1. After running successfully for several months, it was decided to freeze the hardware configuration for Run 2.

Conclusion
Using the operational experience of Run 1, the CMS ECAL DCS team was able to identify key areas for improvement of the control and safety systems in order to ensure successful operations in Run 2.
The hardware modifications carried out on the DCS and safety system hardware have improved robustness and extended functionality. By reusing known technologies and taking advantage of standards and open platforms, the hardware upgrades were delivered on schedule and were rapidly integrated into the DCS software layer.
The operation of the CMS ECAL DCS and safety systems has proven to be very reliable in the first months of Run 2 of the LHC. With the benefits of the work described in this paper, the systems will continue to provide high levels of availability over the next several years, contributing to the efficient collection of physics data.