The prototype of the HL-LHC magnets monitoring system based on Recurrent Neural Networks and adaptive quantization

This paper examines the applicability of Recurrent Neural Network models to detecting anomalous behavior of the CERN superconducting magnets. In order to conduct the experiments, the authors designed and implemented an adaptive signal quantization algorithm and a custom GRU-based detector, and developed a method for selecting the detector parameters. Three different datasets were used for testing the detector. Two artificially generated datasets were used to assess the raw performance of the system, whereas a 231 MB dataset composed of signals acquired from HiLumi magnets was intended for real-life experiments and model training. Several different setups of the developed anomaly detection system were evaluated and compared with a state-of-the-art OC-SVM reference model operating on the same data. The OC-SVM model was equipped with a rich set of feature extractors accounting for a range of the input signal properties. The experiments showed that the detector, along with its supporting design methodology, reaches an F1 score equal or very close to 1 for almost all test sets. Due to the profile of the data, the best_length setup of the detector performed best among all five tested configuration schemes of the detection system. The quantization parameters have the biggest impact on the overall performance of the detector, with the best values of the input/output grid equal to 16 and 8, respectively. The proposed solution significantly outperformed the OC-SVM-based detector in most cases, with much more stable performance across all the datasets.


Introduction
The LHC (Large Hadron Collider) was built through more than 20 years of effort by CERN (the European Organization for Nuclear Research) personnel and the worldwide High Energy Physics community. The LHC consists of a 27 km ring located 100 m underground and filled mainly with superconducting magnets. The LHC started operating in 2008 and has since contributed to a number of notable scientific discoveries concerning the Standard Model [4,5].
Many research and development programs are continuously carried out in order to deliver improvements to

Table 1: The comparison of main parameters of LHC [1], HL-LHC [2] and FCC-hh [3].

will be much higher and may even be the only chance to maintain and operate the accelerator. The modifications in the accelerator structure related to the HL-LHC project require, in turn, the creation of new solutions for the MPS (Machine Protection System), the LHC components maintenance and monitoring system, which is the responsibility of the CERN TE-MPE (Technology Department - Machine Protection and Electrical Integrity) group.
The complexity of this task stems from the quantity of signals acquired from various LHC magnets and from the real-time operation requirement. The system needs to process all the data and detect anomalies quickly enough to allow operators to react and the various automatic fault prevention procedures to run. Conventional anomaly detection systems, such as those presented in section 3, cannot be used in this particular application due to the huge quantity of signals, the very few anomaly cases and the requirement of a hardware implementation in embedded systems.
One of the promising research directions involves using machine learning, especially deep learning, algorithms for magnet monitoring and anomaly detection. Real-time execution of deep learning algorithms requires dedicated, low-latency architectures, which are the authors' long-term research goal. The current work, presented in this paper, focuses on the development and verification of a dedicated, deep learning-based solution involving adaptive quantization and an RNN (Recurrent Neural Network). The created solution achieved very encouraging results for LHC magnet signals.
The main contributions of the presented research are as follows:
• development of an architecture for anomaly detection based on GRU (Gated Recurrent Unit),
• introduction of a new approach based on adaptive grid quantization,
• a detector design procedure, including neural network hyper-parameters selection which accounts for the detector operation environment,
• development of a prototype of the system suited for doing experiments with the adaptive grid-based approach. The software is available online [6].
The developed design procedure should allow the researched solution to be reused for various use cases, requiring only changes to the setup configuration.
The rest of the paper is organized as follows. Sections 2, 3, 4 provide background information about LHC, anomaly detection state of the art and Recurrent Neural Networks, respectively. The system description is presented in section 6 and the developed methodology in section 7. Section 8 provides the results of the experiments, with discussion being presented in section 9. Finally, the conclusions of our research are presented in section 10.

Large Hadron Collider superconducting magnets
The LHC, the biggest and most powerful accelerator in the world, is divided into eight sectors (octants) (Fig. 1). The tunnel itself contains strings of superconducting magnets, accelerating cavities and many other necessary instruments. Two vacuum beam pipes pass through the central part of the iron yoke of the magnets. The particles are produced and initially accelerated by a chain of smaller accelerators. Then, the particles are delivered into the LHC with energy at the injection level (see Tab. 3). During every turn around the whole trajectory, the energy of the particles rises and, synchronously, the magnetic field produced by the bending magnets must also be increased. This ramping process takes some time before the machine reaches the condition in which collisions are initiated. In this state, every 25 ns two particle clouds (bunches) collide in four interaction points denoted in Fig. 1. Products of each collision are observed by dedicated systems of particle detectors. The main goal of the whole engineering effort at the LHC is to maintain the collision state of the accelerator as long as possible in order to maximize the number of observed events. However, the quality of the beams decreases with each collision and at some point the collisions stop being useful for physics experiments. At this stage of operation the beams are dumped, and the machine must ramp down and be filled with particles again. This whole work cycle can be interrupted at any time by a malfunction of any one of the thousands of elements of the accelerator.
There are 1232 dipole and 392 quadrupole magnets that are key elements of the LHC (see Tab. 2 for the approximate list). The coils of those magnets are wound with multi-filament cables. The filaments are made of Nb-Ti alloy and are surrounded by a copper matrix. This kind of coil produces a magnetic field of 8 T, sufficient to drive particles along the ring at 7 TeV energy. The coil conducts a high superconducting current, but sometimes, locally, in a random and uncontrolled way, it becomes normally conducting. This event (a quench) is very dangerous because it is connected with a burst dissipation of the energy stored in the superconducting circuit. It is not a malfunction, but a physical phenomenon which takes place in any superconducting circuit which does not meet the condition of cryogenic stability [11]. The superconducting magnets applied in accelerators are designed as not safe in the sense of cryogenic stability. Many other design constraints make this the only feasible possibility. Therefore the QPS (Quench Protection System) was created at the LHC [12,13].
The QPS is a sophisticated subsystem dedicated to magnet coil monitoring and anomaly detection, supervising working condition changes during various phases of the system operation. The voltages on coils, busbars and current leads are acquired and stored in a database. Malfunctions or quenches are detected online when the value of the voltage exceeds a safety threshold. When the excess lasts longer than a discrimination time, a trigger signal is generated to stop the operation of the whole accelerator and to safely discharge the energy stored in its circuits. The diagram presented in Fig. 2 summarizes the described scenario.
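The threshold-plus-discrimination-time rule described above can be illustrated with a minimal sketch. The function name, the sample-based discrimination window and the example values are assumptions made for illustration only; they do not reproduce the actual QPS implementation.

```python
import numpy as np

def quench_trigger_index(voltage, threshold, discrimination_samples):
    """Return the sample index at which a trigger would fire, or None.

    Hypothetical helper: a trigger fires once |voltage| has stayed above
    `threshold` for more than `discrimination_samples` consecutive samples.
    """
    run = 0
    for i, v in enumerate(np.asarray(voltage, dtype=float)):
        # Extend the current excess run, or reset it when the signal drops back.
        run = run + 1 if abs(v) > threshold else 0
        if run > discrimination_samples:
            return i
    return None

# Illustrative values: a one-sample spike followed by a sustained excess.
signal = [0.0, 0.2, 0.0, 0.2, 0.3, 0.25, 0.4, 0.0]
trigger = quench_trigger_index(signal, threshold=0.1, discrimination_samples=3)
```

In this example the isolated spike at index 1 is ignored, while the sustained excess starting at index 3 produces a trigger at index 6.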
For the HL-LHC, even more powerful magnets are necessary. A number of selected main dipoles will be replaced by newly designed magnets. A magnetic field on the level of 11 T can be achieved with coils made of niobium-tin (Nb3Sn) [14]. However, this kind of coil suffers not only from quenches but also from a phenomenon called flux jumping [15], which affects the nature of the voltage waveforms describing the state of the superconducting magnets. As a consequence, previously used monitoring methods and fault prevention systems may need to be overhauled or replaced with new solutions.

State of the art
Monitoring changes in time series signals is critical in many areas of engineering and real-life applications. This is mostly due to the fact that roughly 80 % of the signals which occur in the world are temporal in nature. Consequently, anomaly detection has been heavily explored as a field over the last several decades and many methods have been developed to address this challenging task [16][17][18]. It is worth noting that most anomaly detection tasks deal with quite asymmetric datasets, which means that there are far fewer cases of anomalous behavior than regular ones. Furthermore, labeling the data is a challenging task. This leads to a preference for unsupervised methods over supervised ones when it comes to real-life applications. An ideal anomaly detection system should:
• be able to detect anomalies with the highest possible accuracy,
• be trained in an unsupervised fashion,
• trigger no false alarms,
• work with data in real time,
• be completely adaptable (no hyper-parameters tuning).
Unfortunately, it is very hard to construct such a system, not only because of the challenge of meeting all of the requirements at the same time but also because of the data profile. For instance, it is not always clear how frequently a model should be updated to account for the seasonality of changes in the data. Furthermore, real-time performance is not always at a premium. However, due to the rise of data volume and an increasing demand for the speed at which a system should deliver its result, we may expect a growing demand for real-time performance.
Anomaly detection systems in real-life applications are not ideal, which means that they do not meet all the requirements enumerated in the list. They do not have to, as very often it is enough that a system detects most of the anomalies in a reasonably short time. Sometimes, however, because of the task profile, it is critical that a system does not generate false alarms. This may even come at the expense of a slightly lower overall accuracy of the system. In some other cases, response accuracy is not as important as a low response time of the system. An analysis of how anomaly detection systems developed over the past few decades [18] shows that there is a trade-off between response accuracy and reaction time. This means that it is very hard to construct a system which is very accurate and fast at the same time. Consequently, depending on the expected performance, three different groups of anomaly detection systems may be distinguished:
• offline,
• partially online,
• online.
The first category of systems operates in an offline fashion, which means that they are trained offline and work offline. Such solutions are well suited for processing large volumes of data at a relatively low pace and usually require access to the whole dataset. Examples of such systems used in industrial applications are EGADS (Extendible Generic Anomaly Detection System) developed by Yahoo [19] and RPCA (Robust Principal Component Analysis) [20]. EGADS is based on the assumption that integrating several methods within a single framework helps to address different kinds of anomalies. This is in principle a very good approach, but it comes at a cost in processing time, which results from the necessity of weighting and incorporating the contributions of the different methods. There are also other solutions which can be classified as offline systems [21,22].
The second category of algorithms is trained offline and works online. Complex and large models usually need to be trained offline because of the time it takes to complete the process. However, sometimes the model is small enough to be trained online, but it is still beneficial to conduct the training process offline. This results from the profile of the application and the requirements the system is to meet.
In some applications, a system should be more sensitive to seasonal changes and it is essential to train a model only during certain periods of time. There is a whole branch of clustering-based algorithms which are trained offline to subsequently work online [23][24][25][26][27][28][29][30]. Time series are clustered according to their properties and all the outliers which do not belong to any of the clusters are considered anomalies. The number of clusters and the classification threshold are two of the more critical parameters which have to be chosen when applying those algorithms.
One of the common solutions which fall into the second category is an approach based on OC-SVM (One Class Support Vector Machine). Several implementations of anomaly detection systems based on OC-SVM were proposed and promising results were reported [31][32][33][34][35][36][37].
There is also a set of RNN-based methods which model the original signal and decide whether an anomaly occurred based on the discrepancy between the expected and the real value [38][39][40].
The third kind of the anomaly detection systems adapts online which means that all the novelties which are detected are incorporated in the model [41,42]. Next time the same phenomenon occurs in the input signal to the system it will not be considered as an anomaly. In such an approach, the system constantly adapts to the changing environment. This may be beneficial in many scenarios but there are applications in which due to the seasonality it is recommended to update a model in well-defined moments of time.
There is a set of more advanced streaming anomaly detection methods, such as ESD, ARIMA, and Holt-Winters [43][44][45][46], which are used in many industrial applications. A broad analysis of modern anomaly detection systems is beyond the scope of this paper; for a more in-depth review we suggest [16][17][18]. It is worth emphasizing that the area of novelty detection is expanding very fast, driven by the exponential growth of available information and the rising need for knowledge extraction. We may expect this trend to intensify as a result of the introduction of new hardware platforms which are capable of processing data faster [47][48][49].

Recurrent Neural Networks
RNNs fundamentally differ from FNNs (Feedforward Neural Networks). The recurrent neural models learn to recognize patterns in the time domain. Consequently, they are capable of modeling signals in which patterns span many time steps. For many years, the development and engineering applicability of recurrent neural networks were limited by the difficulty of learning very long patterns, with sequence lengths in the tens or hundreds. Classical RNNs were not able to learn them due to the so-called vanishing (or exploding) gradient phenomenon. Scientists working in the RNN domain were aware of it occurring not only in RNNs but also in deep FNNs. Therefore, extensive research was conducted, and in 1997 it resulted in the development of the LSTM (Long Short-Term Memory) architecture by Sepp Hochreiter and Jürgen Schmidhuber [50]. Unfortunately, due to the lack of computing power and limited available data quantities, LSTM-type networks developed slowly. In recent years, however, there has been huge progress in RNNs. Many variants of the original LSTM algorithm were introduced, optimizing the original architecture. One such modification is the GRU [51,52].

Figure 3: An architecture of GRU unit (legend: pointwise multiplication; sum over weighted inputs; signals z(t), r(t), hc(t)).
GRU has gating components which modulate the flow of information within the unit, as presented in Fig. 3.
The activation of the model at a given time t is a linear interpolation between the activation h(t−1) from the previous time step and the candidate activation hc(t). The activation is strongly modulated by z(t), as given by (2) and (3).
The formula for the update gate is given by (4); it modulates the degree to which a GRU unit updates its activation. The GRU has no mechanism to control to what extent its state is exposed; it exposes the whole state each time.
The response of the reset gate r(t) is computed according to the same principle as the update gate. The previous state information is multiplied by the coefficient matrix W_r, and so is the input data. It is computed by (5).
The candidate activation hc(t) is computed according to (6). When r(t) is close to 0, meaning that the gate is almost off, the stored state is forgotten and the input data is read instead.
As it was pointed out, GRU has a simpler structure than LSTM [53,54] which is also reflected in the performance of the algorithm.
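The gate description above can be made concrete with a minimal NumPy sketch of a single GRU time step. It follows one commonly used formulation of the GRU; the split into input weights W_*, recurrent weights U_* and biases b_*, as well as the random parameter values, are naming and illustration assumptions rather than the exact notation of equations (2)-(6).

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x, h_prev, p):
    """One GRU time step; `p` is a dict of (hypothetically named) parameters."""
    z = sigmoid(p["W_z"] @ x + p["U_z"] @ h_prev + p["b_z"])          # update gate z(t)
    r = sigmoid(p["W_r"] @ x + p["U_r"] @ h_prev + p["b_r"])          # reset gate r(t)
    hc = np.tanh(p["W_h"] @ x + p["U_h"] @ (r * h_prev) + p["b_h"])   # candidate hc(t)
    # Linear interpolation between the previous activation and the candidate.
    return (1.0 - z) * h_prev + z * hc

rng = np.random.default_rng(0)
n_in, n_hidden = 3, 4
params = {}
for gate in ("z", "r", "h"):
    params[f"W_{gate}"] = rng.normal(scale=0.1, size=(n_hidden, n_in))
    params[f"U_{gate}"] = rng.normal(scale=0.1, size=(n_hidden, n_hidden))
    params[f"b_{gate}"] = np.zeros(n_hidden)

# Run the unit over a short random input sequence, starting from a zero state.
h = np.zeros(n_hidden)
for x in rng.normal(size=(5, n_in)):
    h = gru_step(x, h, params)
```

Note that references differ on whether z(t) or 1 − z(t) weights the candidate activation; the sketch picks one of the two conventions.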

Previous work
In the authors' previous work concerning superconducting magnets monitoring [53], an RMSE (Root-Mean-Square Error) approach was used. It showed that RNNs are in fact able to model magnet behavior. However, it has several drawbacks that make it hard to use in practical anomaly detection applications.
Firstly, to effectively analyze anomalies using RMSE it would be necessary to select a detection threshold. Such a threshold would be very arbitrary, since it is hard to discern what value, allowing to detect all anomalies, would be appropriate based on the results obtained from all the data, including normal operation.
Secondly, the resolution of such an anomaly detection would depend on the window size used. Additionally, choosing too wide a window would result in an anomaly potentially 'drowning' in the correct data, while choosing too small a window could result in false positives. The window size would also influence the accuracy of the trained RNN.
Thirdly, the RMSE approach is fundamentally a regression problem, which limits the RNNs application potential.
The described drawbacks resulted in the authors' decision to switch from regression to classification, which can be more effectively addressed using RNNs. Initially, the signals were converted to classes using a static, evenly-spaced grid, mapping the whole signal amplitude [54]. Both the input signals and the output signal were mapped, and the task of the model was to correctly predict the output category given a tensor of input classes. When the prediction and the real signal did not match over a certain number of samples, an anomaly was reported.
A static quantization process maps the signal input space S_in to m classes (see Tab. 4 for the notation used), which can potentially be represented by log2(m) bits instead of the initial 32 or 64 bits per value. At first, as given by (7)-(8), the signal is normalized. Next, the normalized signal values are mapped to categories, using m evenly-spaced bins spanning the whole signal amplitude (9)-(10).
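The two steps of static quantization can be sketched as follows. This is an illustration under assumptions: min-max scaling over the given samples stands in for the normalization of (7)-(8), and the function names are hypothetical.

```python
import numpy as np

def min_max_normalize(signal):
    """Scale a signal to [0, 1] (assumed form of the normalization step)."""
    signal = np.asarray(signal, dtype=float)
    return (signal - signal.min()) / (signal.max() - signal.min())

def even_grid_quantize(norm_signal, m):
    """Map normalized values in [0, 1] to integer classes 0..m-1 of equal width."""
    classes = np.floor(np.asarray(norm_signal) * m).astype(int)
    return np.minimum(classes, m - 1)  # the value 1.0 belongs to the last bin

raw = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
norm = min_max_normalize(raw)
classes = even_grid_quantize(norm, m=4)
bits_per_sample = int(np.ceil(np.log2(4)))  # 2 bits suffice for m = 4 classes
```

For this input the resulting classes are 0, 1, 2, 3, 3, since the maximum value is clamped into the last bin.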
As a result of analyzing the conducted experiments, as well as scrutinizing the static quantization algorithm formally, the authors concluded that using an evenly-spaced grid does not allow anomalies to be detected effectively. The algorithm mapped most of the data to a very small number of categories, with the majority of categories almost never being used (see Fig. 4a and Tab. 5). The end effect was that, on the one hand, it took a long time for a model to adapt to a vertical shift in the data (Fig. 5a), resulting in false anomalies, and, on the other hand, smaller anomalies had no chance of being detected. Additionally, since most of the categories were barely used, resources were wasted.
Tab. 4 (notation, excerpt): sorted_samples_i is the i-th sample in the ascending sorted array of all available signal samples.

Adaptive grid
Analysis of the previously mentioned experiments and algorithms led to the conclusion that a more advanced algorithm is needed. It should avoid the threshold-selection problem, make it possible to harvest the potential of RNNs by using classification instead of regression, and use the available resources optimally. As a consequence, the adaptive grid quantization algorithm was developed. Its principle of operation is mapping the input space to a fixed number of categories (bins) in such a way that all categories have (ideally) the same sample cardinality (11)-(13). As a result, the bin widths are uneven, adjusted specifically to the input signal (see Fig. 4c and Tab. 5). Each of the signals used in the model training has its own bin edges calculated. This approach potentially maximizes the utilization of the grid and minimizes the consumption of resources.
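The equal-cardinality idea can be sketched with a few lines of NumPy. The edge selection below (taking every (N/m)-th element of the sorted sample array) is an assumed reading of the principle and may differ in detail from the authors' equations; the function names and the synthetic signal are hypothetical.

```python
import numpy as np

def adaptive_edges(norm_samples, m):
    """Inner bin edges chosen so that each of the m bins holds about N/m samples."""
    sorted_samples = np.sort(np.asarray(norm_samples, dtype=float))
    # Positions of the m - 1 inner edges within the sorted array.
    idx = (np.arange(1, m) * len(sorted_samples)) // m
    return sorted_samples[idx]

def quantize(norm_signal, inner_edges):
    """Class index = number of inner edges that are <= the value."""
    return np.searchsorted(inner_edges, norm_signal, side="right")

# Synthetic normalized signal concentrated in a narrow band around 0.5.
rng = np.random.default_rng(1)
samples = np.clip(rng.normal(0.5, 0.05, size=10_000), 0.0, 1.0)

edges = adaptive_edges(samples, m=8)
adaptive_counts = np.bincount(quantize(samples, edges), minlength=8)

# Evenly-spaced grid on the same data, for comparison.
even_classes = np.minimum(np.floor(samples * 8).astype(int), 7)
even_counts = np.bincount(even_classes, minlength=8)
```

On such data the evenly-spaced grid places nearly all samples in the two middle bins, whereas the adaptive grid fills all eight bins almost uniformly.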

System description

Principle of operation
The detector's principle of operation is a comparison between predicted and real signal values (Fig. 6).
Whenever a new sample arrives, it can be used (in conjunction with previous samples) to predict the category the next sample should be in. Assuming that the model was trained to perfectly anticipate normal operating conditions, any difference between the prediction and the category of the actual arriving sample means that an anomaly occurred.
In practical applications, however, achieving model perfection is not feasible: the data used to train the model, as well as the real samples that the predictions will be compared with, contain noise. Given a large enough pool of samples to learn from, the model should start to predict nearly ideal normal operation values, but even the actual normal operation samples will differ from that ideal due to noise. When an anomaly occurs, those differences should be much more pronounced. As a result, a method to discriminate between 'noise anomalies' and actual anomalies is needed.
The simplest discriminating method is to check the length of an anomaly candidate. In previous work [54], the authors assumed that the gap between the available history data and the prediction (look_ahead) must be bigger than length_threshold, so that, in case of an anomaly occurring, the model prediction would not get distorted by an irregular input. This, in turn, further affected the model accuracy. However, after further research on the RNN behavior, the authors concluded that this condition is unnecessary, since the model should (up to some point) ignore the anomalous sample in favor of the available normal operation historical data, with this 'smoothing' capability increasing with the look_back (history window) length. Predicting only one step forward (look_ahead = 1) has the additional advantage of potentially decreasing the system reaction time, especially in conjunction with more advanced anomaly discrimination methods.
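The roles of look_back and look_ahead can be illustrated by a short sketch that pairs each history window with its prediction target. The helper name make_windows and the single-channel input are assumptions; the actual data formatting in the authors' software may differ.

```python
import numpy as np

def make_windows(classes, look_back, look_ahead=1):
    """Pair each history window with the class `look_ahead` steps after it."""
    classes = np.asarray(classes)
    n = len(classes) - look_back - look_ahead + 1  # number of (window, target) pairs
    X = np.stack([classes[i:i + look_back] for i in range(n)])
    y = classes[look_back + look_ahead - 1:]
    return X, y

# With look_ahead = 1 the target is the sample directly following the window.
X, y = make_windows([0, 1, 2, 3, 4, 5], look_back=3)

# A larger gap between history and prediction leaves fewer usable pairs.
X2, y2 = make_windows([0, 1, 2, 3, 4, 5], look_back=3, look_ahead=2)
```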

Setup overview
The system is coded in Python, using the Keras [55] library with the Theano [56] backend for the classifier implementation. The detector module consists of two main sub-modules, model and analyzer, and a few helper scripts. The conceptual overview of a single detector setup is presented in Fig. 7.
The system is prepared to work with normalized data, which are prepared and saved beforehand. The normalization process takes into account all available data, both from training and testing sets. All the data fed into the system at a later time would need to be prepared using the same scale, with out-of-range values clipped to 0.0 or 1.0 as relevant. All setup variants use the same normalized data. Depending on the configuration, a particular system setup variant is created. The configuration includes data pre-processing options (e.g. the number of input/output categories or the bin edges calculation algorithm), the model hyper-parameters (number of layers, number of cells per layer, batch size etc.) and the analyzer rules (like minimum anomaly length or energy). Most of the configuration options are specified as arrays, which makes it easy to test several setups and compare them. Models trained during each of the setups are automatically saved. When a model with high enough performance is found, it can be loaded and used to further test various analyzer setups.
The data quantization, controlled by the in_grid, out_grid, in_algorithm and out_algorithm configuration parameters, is described in section 5. The in_grid and out_grid parameters control the number of classes for input and output signals, respectively. At the moment, each of the input channels is quantized using the same grid/algorithm combination, and likewise for the output channels. Available algorithms are even_grid (10) and even_histogram (12,13).
The model is the detector core. It is an abstraction over the actual classifier. In the current implementation, the classifier comprises a configurable number of GRU layers from the Keras library, followed by a Dense layer with dimensionality matching the out_grid parameter value (see Fig. 8). However, as long as this abstract interface is preserved, any classifier capable of prediction can be used. The fitted model accuracy should be high enough that, when the detector setup is tested using normal operation data, it ideally does not report any anomalies (no false positives).
The analyzer module uses the fitted model to generate predictions and compare them with the real quantized output signal values. Whenever it encounters an anomaly candidate (a discrepancy between the real and the predicted value), it runs a series of checks, according to the configured rules, to determine whether the candidate meets the requirements of a true anomaly. If the conditions are met, all samples belonging to that candidate are marked as anomalous.
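The candidate search performed by the analyzer can be sketched as grouping consecutive mismatches into runs and applying a rule to each run. Only the minimum-length rule is shown; the function names and the example sequences are hypothetical and do not come from the authors' code.

```python
import numpy as np

def find_candidates(real, predicted):
    """Return (start, end) index pairs, end exclusive, of runs of mismatches."""
    mismatch = np.asarray(real) != np.asarray(predicted)
    # Pad with zeros so that runs touching either end are also delimited.
    padded = np.concatenate(([False], mismatch, [False])).astype(int)
    change = np.diff(padded)
    starts = np.flatnonzero(change == 1)
    ends = np.flatnonzero(change == -1)
    return list(zip(starts.tolist(), ends.tolist()))

def mark_anomalies(real, predicted, min_length):
    """Flag every sample of each candidate at least `min_length` samples long."""
    flags = np.zeros(len(real), dtype=bool)
    for start, end in find_candidates(real, predicted):
        if end - start >= min_length:
            flags[start:end] = True
    return flags

real = [3, 3, 4, 3, 3, 7, 7, 7, 3]
predicted = [3, 3, 3, 3, 3, 3, 3, 3, 3]
candidates = find_candidates(real, predicted)
flags = mark_anomalies(real, predicted, min_length=2)
```

Here the single mismatching sample at index 2 is treated as noise, while the three-sample run at indices 5 to 7 is marked as anomalous.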
At the present time, the analyzer is able to calculate several properties of the anomaly candidate that can be used to discern its validity. Aside from the anomaly length (in samples), the amplitudes, the maximum amplitude and the energy values are determined. Assuming that a sample s really belongs to category r ∈ [0, m) and was predicted to belong to category p ∈ [0, m), the discrepancy between the mean signal values for bins r and p is the anomaly amplitude for that particular sample s (14). When amplitudes of all samples belonging to an anomaly candidate C are
Both model hyper-parameters and analyzer rules are application-specific and should be tweaked in order to achieve the best possible performance. A methodology for parameter selection is described in section 7.
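The per-sample amplitude described in this subsection can be sketched as below. Two points are assumptions: the bin means are estimated from the quantized signal itself, and the "discrepancy" is taken as an absolute difference. The energy is deliberately left out, because its definition is not recoverable from the text.

```python
import numpy as np

def bin_means(norm_signal, classes, m):
    """Mean normalized signal value of each class (NaN for unused classes)."""
    norm_signal = np.asarray(norm_signal, dtype=float)
    classes = np.asarray(classes)
    sums = np.bincount(classes, weights=norm_signal, minlength=m)
    counts = np.bincount(classes, minlength=m)
    return np.divide(sums, counts, out=np.full(m, np.nan), where=counts > 0)

def candidate_amplitudes(real, predicted, means):
    """Absolute difference between the bin means of real and predicted classes."""
    return np.abs(means[np.asarray(real)] - means[np.asarray(predicted)])

signal = [0.1, 0.3, 0.5, 0.7, 0.9]
classes = [0, 0, 1, 2, 2]
means = bin_means(signal, classes, m=4)

# Two-sample candidate: real class 2, predicted classes 0 and 1.
amplitudes = candidate_amplitudes([2, 2], [0, 1], means)
max_amplitude = amplitudes.max()
```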

System integration
After the right detector setup is chosen, it can be integrated into the target system, which in the CERN use case is the magnets protection and monitoring system. The currently used conventional system comprises, among others, hardware modules and a set of databases, as shown in Fig. 9.
Sensors installed in the magnets acquire voltage and current signals from selected parts of the device. The obtained data is then preprocessed and filtered in real time, finally arriving in the discriminating and thresholding module. This module determines whether a situation has arisen that requires operator intervention or the running of automatic fail-safe procedures. The described conventional solution depends heavily on expert knowledge of the behavior and parameters of the LHC magnets and associated equipment, which is what allows the monitoring system hyper-parameters to be selected.
All obtained data is also stored in CALS (CERN Accelerators Logging Service) databases, which allows for later analysis and reasoning about the LHC equipment condition and behavior. It should be emphasized that the data collected in the CALS database is of relatively low resolution; however, in case of a problem or in some predefined normal operation scenarios, very high-resolution data is gathered in the PM (Post Mortem) database. Fig. 10 shows model offline training, using data in the same format as currently stored in the PM database. The compatibility of the training data with the PM database makes it possible to test the Deep Learning-based solution using historical anomalies (Fig. 11). Once the model is trained, it can be periodically updated when more data is available. It needs to be highlighted that every kind of magnet will need to have its own setup. The huge amounts of available data should make it possible to train the required models very effectively. It is expected that a single model instance should be sufficient for all magnets belonging to one category; however, this will need to be verified experimentally.
The current research is focused on validation and quality evaluation of models implemented in a high-level language. However, once the detector setup is determined and the model fitted, the anomaly detection algorithm will need to be implemented in hardware, for example using FPGA (Field-Programmable Gate Array) platforms, in order to meet latency constraints (see Fig. 12).
Such an implementation poses several challenges that will not be widely discussed in this paper. However, it needs to be noted that fitting the detector system inside an FPGA platform that has limited computing and memory resources will require model compression. It also translates to constraints on the model size: the smaller the number of model parameters (weights), the better. For that reason, the developed design methodology, including model hyper-parameters selection, needs to take the availability of resources into account. Fig. 13 shows the vision of a final MPS that includes both the proposed RNN-based detector and the conventional solution. Such an approach would increase the reliability of the superconducting magnets monitoring system. It is also possible to use only the proposed detector module.

Detector design methodology
The presented detector system has a range of hyper-parameters that can be tweaked to achieve the best results for a particular use case. Some of them are directly influenced by the required operation macro-parameters, such as the smallest anomaly length or amplitude change that the system should be able to detect, or the maximum response latency.
The process of tweaking the detector setup is highly iterative, with the future hardware implementation in mind. Optimizing the model in order to achieve a better accuracy usually goes hand in hand with increasing its resource consumption. As such, contrary to the usual approach, the model underlying the detector, at the beginning very small, is improved until it is just good enough for the application.

Generic steps
In the initial phase (Alg. 1, lines 2-11), the data is preprocessed (normalized, quantized, formatted according to the model needs and split into training and testing sets) and, if possible, decimated in order to reduce computation time.
Next, the starting model is fitted with the decimated data, and then used to make predictions on the training set. The predictions obtained from the model are then used by the analyzer to detect anomalies.
Since these anomalies are detected in the training set (where the target system should report none), they can be used to automatically adjust the analyzer thresholds. The procedure for automatic selection of thresholds and rules is described in the following subsection. Finally, the model is used to make predictions on the testing data, which are then used for anomaly detection.
The iterative phase (Alg. 1, lines 12-35) starts with detector quality evaluation (using Precision, Recall and/or F1 score metrics, see subsection 8.3). Exact quality evaluation depends on application needs, e.g. there can be applications where a lower number of false positives is more important than a lower number of false negatives etc.
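For reference, the three quality metrics can be computed from per-sample anomaly flags as in the sketch below. Counting per sample (rather than per anomaly) is an assumption made here for simplicity; the function name and the example flags are hypothetical.

```python
def detection_scores(true_flags, reported_flags):
    """Return (precision, recall, f1) computed from per-sample boolean flags."""
    pairs = [(bool(t), bool(r)) for t, r in zip(true_flags, reported_flags)]
    tp = sum(t and r for t, r in pairs)        # anomalous and reported
    fp = sum((not t) and r for t, r in pairs)  # normal but reported
    fn = sum(t and (not r) for t, r in pairs)  # anomalous but missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

true_flags = [0, 0, 1, 1, 1, 0]
reported_flags = [0, 1, 1, 1, 0, 0]
precision, recall, f1 = detection_scores(true_flags, reported_flags)
```

An application that prioritizes avoiding false alarms would weigh precision more heavily, while one that must not miss anomalies would focus on recall.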
If the number of false positives is high and the real anomalies are bigger (in terms of measures like length or energy) than the incorrectly reported ones, the model can be considered oversensitive. This situation can be addressed by increasing the analyzer threshold values, especially those for which the gap between true and false anomalies is significant.
If threshold adjustment is impossible (e.g. the measurements of true anomalies are similar to those of false anomalies, or the model is not accurate enough, resulting in a high number of false negatives), the only way to improve the detector quality is to change the preprocessing and/or improve the underlying model.
After the setup is improved (an example procedure is described in subsection 7.3), the detector quality evaluation can begin anew.

Automatic rules adjustment
The proposed automatic analyzer rules selection procedure, described in Alg. 2, is conceptually simple. Its objective is to find such a combination of threshold values that will guarantee filtering out all false anomalies found in training set. At the same time, it should not be too greedy, in case true anomalies are close to false ones.
For example, most of the false anomalies can be shorter than a certain length, while at the same time  It can be then surmised that any longer (but still within range) anomaly with higher (in range) energy would be a true one and should not be filtered out. If only greedy, basic thresholds were applied ("anomaly is true if its length or energy or amplitude is bigger than the relevant maximum found for false anomalies"), the above-mentioned anomaly would not match any of those criteria and therefore would not be detected.
In Fig. 14, an example visualization of false anomaly properties is shown. This collimation chart is basically a 2-D histogram, with the number of bins equal to the number of possible discrete values (for small anomaly length ranges) or calculated using the numpy 'sturges' algorithm [57]. Assuming there are only two possible thresholds (length and energy, for example), the task of the algorithm is to find values that result in the biggest possible saved area (marked in green in Fig. 14). The saved area represents an additional space (aside from the areas outside of the found maximums) in which anomaly detection will be possible. This reasoning is then generalized to an arbitrary number of parameters.
Initially, the algorithm sets the best threshold combination to contain a value only for the first threshold, with the others set to 0. The first threshold value is based on the maximum value appearing in anomalies and the saved area is set to 0 (Alg. 2, lines 1-9).
In the next part, possible threshold combinations are checked (Alg. 2, lines 10-26). For every detected anomaly, it is checked if a particular threshold combination can be used to filter it out. If the combination can be used to filter all anomalies, it is considered valid and saved area for that particular combination is calculated. The best threshold combination is the one with the highest saved area.
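The two-threshold case described above can be illustrated with a minimal Python sketch. It is not a transcription of Alg. 2: it assumes that an anomaly is reported only when both its length and its energy exceed the respective thresholds, that candidate threshold values are taken from the values observed in the false anomalies, and that the saved area is the rectangle between the thresholds and the observed maximums. The function name `select_thresholds` and the example values are hypothetical.

```python
from itertools import product


def select_thresholds(false_anomalies):
    """Pick (length, energy) thresholds that filter out all false anomalies.

    false_anomalies: list of (length, energy) pairs found in the training set.
    Assumption: an anomaly is reported only if length > t_len and energy > t_en.
    """
    max_len = max(length for length, _ in false_anomalies)
    max_en = max(energy for _, energy in false_anomalies)

    # Initial state: only the first threshold is set, saved area is 0.
    best, best_area = (max_len, 0.0), 0.0

    len_candidates = sorted({length for length, _ in false_anomalies})
    en_candidates = sorted({energy for _, energy in false_anomalies} | {0.0})

    for t_len, t_en in product(len_candidates, en_candidates):
        # The combination is valid if every false anomaly is filtered out.
        valid = all(length <= t_len or energy <= t_en
                    for length, energy in false_anomalies)
        if not valid:
            continue
        # Assumed saved area: region below the maximums that stays detectable.
        area = (max_len - t_len) * (max_en - t_en)
        if area > best_area:
            best, best_area = (t_len, t_en), area
    return best, best_area


# Hypothetical false anomalies: (length in samples, energy).
example = [(2, 5.0), (10, 1.0), (3, 4.0)]
thresholds, area = select_thresholds(example)
print(thresholds, area)
```

For this toy input, the purely greedy rule would require a length above 10 or an energy above 5.0, whereas the selected pair also leaves room for anomalies that are moderately long and moderately energetic at the same time.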

Experiments
Experiments with both the evenly-spaced grid and the adaptive one were conducted and compared with the classical SVM (Support Vector Machine)-based solution. The collected results made it possible to judge the effectiveness and efficiency of the proposed solution.

Dataset
The data used in the experiments was acquired from HiLumi magnets. Those magnets are still in a testing and training phase, with their characteristics being checked and operation parameters verified.
The collected data is divided into four series (h1011, h1144, h1451 and h1819), all coming from the same magnet. Each of the series contains four data channels, with the first two representing the magnet coil voltage, the third the current measurement, and the fourth the compensated signal (the sum of the first two). To obtain actual voltage values, the signals would need to be multiplied by the gain (= 5) and lsb (= 9.5348 × 10^−6 V) constants. The current signal needs an additional (aside from gain and lsb) multiplication by a factor of 2000. The sampling frequency is about 9.3 kHz. For the experiments, only the first three channels were used.
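The scaling described above can be sketched as follows. The column order (two voltages, current, compensated signal) follows the text, while the array layout, the function name `to_physical` and the sample values are assumptions made only for illustration.

```python
import numpy as np

GAIN = 5                 # gain constant quoted in the text
LSB = 9.5348e-6          # lsb constant quoted in the text, in volts
CURRENT_FACTOR = 2000    # additional factor for the current channel


def to_physical(raw):
    """Scale raw samples of shape (n_samples, 4) as described in the text.

    Assumed column order: voltage 0, voltage 1, current, compensated signal.
    """
    scaled = np.asarray(raw, dtype=np.float64) * GAIN * LSB
    scaled[:, 2] *= CURRENT_FACTOR  # only the current channel is rescaled
    return scaled


# Hypothetical raw sample values.
raw = np.array([[1000, 2000, 10, 3000]])
print(to_physical(raw))
```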
Each of the series contains a long period of the normal operation, followed by an anomaly (quench) and results of a power abort procedure. Each of the series was then split into two parts, one containing only normal operation data and the second one containing the anomaly and power abort in addition to normal operation (see Tab. 6).
For the purpose of measuring the detector performance, both the quenches and the power abort fragments were annotated as anomalies, since both contain phenomena that the model has never seen.
Since the number of available real anomalies is very small, test sets containing synthetic anomalies were created to further examine the detector performance. For that purpose, the normal operation part of the h1011 series was augmented with a thousand synthetic anomalies.
In synthetic set I, the introduced anomaly is a unit step impulse with a length of 100 samples (see Fig. 20 and 21). Synthetic set II is similar, but with the unit step impulse length set to 50.
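A possible way of building such a test set is sketched below. The text specifies only the impulse shape and length, so the placement strategy (one impulse per equal-sized slot, which keeps the impulses from overlapping), the additive amplitude of 1.0 and the function name `add_step_anomalies` are assumptions.

```python
import numpy as np


def add_step_anomalies(signal, n_anomalies, length, amplitude=1.0, seed=0):
    """Add non-overlapping step impulses to a copy of a 1-D signal.

    Returns the augmented signal and the list of (start, end) intervals
    (end exclusive) that can serve as ground-truth annotations.
    """
    rng = np.random.default_rng(seed)
    out = np.array(signal, dtype=np.float64)
    slot = out.shape[0] // n_anomalies
    if slot <= length:
        raise ValueError("signal too short for the requested anomalies")

    intervals = []
    for i in range(n_anomalies):
        # Random offset inside the i-th slot; the impulse stays within it.
        start = i * slot + int(rng.integers(0, slot - length))
        out[start:start + length] += amplitude
        intervals.append((start, start + length))
    return out, intervals


# Toy stand-in for a normal-operation series.
normal = np.zeros(10_000)
augmented, truth = add_step_anomalies(normal, n_anomalies=10, length=100)
print(len(truth), int(augmented.sum()))
```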

Preprocessing
The acquired data needs to be prepared before it can be used for recurrent neural network training. The first step is the normalization of the signals to the [0, 1] range, using all of the available data. The normalized data is saved to be reused in all the experiments. Normal operation data is used for model training and validation, while data containing anomalies and power abort is used for testing the whole detector setup.
In the next step, based on in_grid, out_grid, in_algorithm and out_algorithm configuration values, the grid edges are calculated. They will later be used to quantize input and output signals.
Next, the data is structured. For all the data points that have the required history length (the sample index in the series is greater than or equal to look_back + look_ahead), tensors containing that history (of look_back length) for all three used channels are created. At the same time, this history data is quantized using the previously calculated edges. It needs to be highlighted that, unlike when working with statistical models such as SVM, the data used for training is overlapping. Simultaneously, the output data tensor is created using 'one hot' encoding, with its length equal to the out_grid parameter. The voltage 0 signal was selected as the prediction target.
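The normalization, quantization and structuring steps can be sketched in Python as below. The sketch uses an evenly-spaced grid only; an adaptive grid would replace `even_edges` with edges derived from the data. The exact indexing of the history window relative to the target (the window ends look_ahead samples before the target) and the helper names are assumptions, chosen to be consistent with the index condition quoted above.

```python
import numpy as np


def even_edges(n_bins):
    """Interior edges of an evenly-spaced grid with n_bins bins on [0, 1]."""
    return np.linspace(0.0, 1.0, n_bins + 1)[1:-1]


def build_tensors(data, look_back, look_ahead, in_grid, out_grid, target=0):
    """Return quantized history windows X and one-hot targets Y.

    data: float array of shape (n_samples, n_channels).
    Assumes every channel has a non-zero value range.
    """
    lo, hi = data.min(axis=0), data.max(axis=0)
    norm = (data - lo) / (hi - lo)                      # scale to [0, 1]

    q_in = np.digitize(norm, even_edges(in_grid))       # 0 .. in_grid - 1
    q_out = np.digitize(norm[:, target], even_edges(out_grid))

    # Only samples with a full history are usable as prediction targets.
    idx = np.arange(look_back + look_ahead, data.shape[0])
    X = np.stack([q_in[i - look_back - look_ahead:i - look_ahead] for i in idx])
    Y = np.eye(out_grid, dtype=np.float32)[q_out[idx]]  # one-hot categories
    return X, Y


rng = np.random.default_rng(0)
series = rng.random((50, 3))                            # toy 3-channel series
X, Y = build_tensors(series, look_back=8, look_ahead=1, in_grid=16, out_grid=8)
print(X.shape, Y.shape)
```

Consecutive windows share all but one sample, which is the overlap mentioned in the text.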
After history tensors and linked output categories are prepared for each data series, they are all mixed together. For the actual experiment, a fraction of data specified by the samples_percentage ∈ [0, 1] is randomly chosen.

Quality measures
In order to compare models and detector setups, as well as compare performance with alternative approaches, several standard quality measures were used.

Model quality
The metric used for measuring the quality of the underlying GRU model is accuracy. Given the values t and f representing, respectively, the number of correctly and incorrectly classified samples, the accuracy can be defined as in (17):
$$\mathrm{accuracy} = \frac{t}{t + f} \tag{17}$$

Detector quality
For the purpose of scoring the detector performance, a switch from measuring the quality on a per-sample basis to a per-anomaly basis is needed. This is especially true considering the rarity of anomalies, where the number of samples belonging to the 'anomaly' category is insignificant when compared with the number of normal operation samples. In such a case, any metric incorporating total number of samples (like accuracy) would not provide any meaningful information about anomaly detection capabilities.
Scoring the performance on a per-anomaly basis, on the other hand, needs some well-defined rules for the metrics to be useful. The simplest question that needs to be answered is "when is a detected anomaly (positive) considered true?". In this paper, a detected anomaly is considered to be a true positive if any part of it overlaps with a real anomaly. It follows that, if several detected anomalies overlap a single real one, all of them are considered true. The reverse also holds: if a single detected anomaly spans several real ones, all of them are considered to be found.
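The overlap rule can be written down as a short Python sketch. It assumes that anomalies are represented as half-open (start, end) sample intervals, that true and false positives are counted over detected anomalies, and that false negatives are counted over real anomalies; the text does not spell out these counting details, so they should be treated as one possible reading.

```python
def overlaps(a, b):
    """True if half-open intervals a = (start, end) and b share any sample."""
    return a[0] < b[1] and b[0] < a[1]


def score_detections(detected, real):
    """Count per-anomaly true positives, false positives and false negatives."""
    tp = sum(any(overlaps(d, r) for r in real) for d in detected)
    fp = len(detected) - tp
    fn = sum(not any(overlaps(d, r) for d in detected) for r in real)
    return tp, fp, fn


# Hypothetical intervals: two detections hit one real anomaly, one detection
# spans two real anomalies, one detection is spurious, one anomaly is missed.
detected = [(10, 20), (25, 30), (100, 110), (200, 260)]
real = [(15, 28), (210, 220), (240, 250), (400, 410)]
print(score_detections(detected, real))
```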
Depending on the application needs, it may be crucial to be able to further qualify the detection quality. An attempt to develop the more comprehensive anomaly detection metrics can be found for example in [59].
It needs to be noted that, due to the continuous nature of the detector operation, it is impossible to define a true negative. Such a notion would not only require an artificial splitting of the time series into windows of arbitrary length and overlap, but would also contradict the purpose of the switch from per-sample metrics to per-anomaly ones. This lack of true negatives narrows down the available standard quality measures.
The selected quality metrics should reflect the application needs. In the case of the HiLumi data, it is crucial to find all anomalies, since undetected faults may lead to extremely expensive repairs and long accelerator shutdowns. On the other hand, false positives, while undesirable, are not nearly so costly. The two metrics measuring those features are recall (18), also called sensitivity, and precision (19), respectively:
$$\mathrm{recall} = \frac{tp}{tp + fn} \tag{18}$$
$$\mathrm{precision} = \frac{tp}{tp + fp} \tag{19}$$
where:
• tp - true positive - item correctly classified as an anomaly,
• fp - false positive - item incorrectly classified as an anomaly,
• fn - false negative - item incorrectly classified as a part of normal operation.
In order to combine those two metrics into a single one that can be directly applied for comparing anomaly detection solutions, an F-measure is used (20). The β parameter controls the importance of recall relative to precision.
$$F_\beta = (1 + \beta^2) \cdot \frac{\mathrm{precision} \cdot \mathrm{recall}}{\beta^2 \cdot \mathrm{precision} + \mathrm{recall}} \tag{20}$$
During the detector performance experiments two β values were used, 1 and 2, in order to show the impact of recall on final score.
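The effect of β can be checked with a few lines of Python using the standard definitions of precision, recall and the F-measure. The function name and the counts in the example are hypothetical and do not come from the experiments.

```python
def precision_recall_fbeta(tp, fp, fn, beta=1.0):
    """Return (precision, recall, F_beta) for per-anomaly counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision == 0.0 and recall == 0.0:
        return precision, recall, 0.0
    b2 = beta ** 2
    f_beta = (1 + b2) * precision * recall / (b2 * precision + recall)
    return precision, recall, f_beta


# Hypothetical counts: every real anomaly found, plus two false alarms.
for beta in (1, 2):
    print(beta, precision_recall_fbeta(tp=8, fp=2, fn=0, beta=beta))
```

With perfect recall and imperfect precision, the score for β = 2 is higher than for β = 1, which reflects the larger weight put on recall.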

Alternative approach: OC-SVM
Several implementations of anomaly detection systems based on OC-SVM were proposed with promising results [31][32][33][34]. These algorithms are trained offline to subsequently work online. In this subsection, the comparison of OC-SVM models with the proposed GRU-based system is presented.
Some properties of the experimental setup needed to be changed as required by the nature of the OC-SVM. Therefore, the HiLumi data was preprocessed accordingly. In the following paragraphs, the OC-SVM algorithm, the preprocessing, and the experimental setup with its results are described in detail.

One Class Support Vector Machine
OC-SVMs are a special case of SVMs. They can be trained with unlabeled data and are therefore an example of unsupervised machine learning techniques. In an OC-SVM, the support vector model is trained on data that has only one class, the 'normal' class, and infers the properties of normal cases. As such, after training, the examples "unlike normal examples" can be identified as anomalies.
Tab. 8 (excerpt): the extracted features include the frequency that divides the power spectrum into two halves with the same energy [35] and the standard deviation σ.
In the experiments, the OC-SVM by Schölkopf et al. [36,37], as implemented in the scikit-learn Python library, was used. Basically, this SVM separates all the data points from the origin (in feature space F) and maximizes the distance from this hyperplane to the origin. This results in a binary function which captures the regions of the input space where the probability density of the data is concentrated. In such a way, the function returns +1 in a 'small' region (capturing the training data points) and −1 elsewhere.

Data preprocessing
The OC-SVM was trained on the normal operation data. As described in subsection 8.1, four series (h1011, h1144, h1451 and h1819) were used, all coming from the same magnet, with only the actual voltage values for each of the signals used in experiments. Each of the series was then split into two parts: a training set, containing only normal operation data and a testing set containing the anomalies.
The sets were then preprocessed and various features were extracted, similarly to [34], in order to achieve a simpler classifier. Tab. 8 lists the extracted features and their properties. Tab. 7 shows the different window scenarios which were used for extracting the features.

Training and testing
The OC-SVM, using an RBF kernel, was trained on the preprocessed training dataset (a grid search was used to find the best values of ν (= 0.07) and γ (= 0.06)). After training, the model was applied to the test datasets, which contain data labeled with known quenches. Because the test data contains very few quenches, the test set was augmented by increasing the number of quench instances so that it would match the number of normal samples. Tab. 7 shows the properties of this augmented test set. The model was also applied to the same synthetic data sets that were used to validate the GRU-based detector. Fig. 15 summarizes the main processing stages for the OC-SVM and contains a high-level description of the methods used in this alternative approach.
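A minimal sketch of such a pipeline with scikit-learn is given below. Only the kernel type and the ν and γ values come from the text. The window length, the three per-window features (mean, standard deviation, median) and the toy signals are assumptions that stand in for the feature set of Tab. 8 and the real data.

```python
import numpy as np
from sklearn.svm import OneClassSVM


def window_features(signal, window):
    """Split a 1-D signal into non-overlapping windows and extract features."""
    n = len(signal) // window
    w = np.asarray(signal[:n * window], dtype=np.float64).reshape(n, window)
    return np.column_stack([w.mean(axis=1), w.std(axis=1), np.median(w, axis=1)])


rng = np.random.default_rng(0)
normal = rng.normal(0.0, 0.01, 20_000)       # toy normal-operation signal
test = rng.normal(0.0, 0.01, 2_000)
test[1000:1100] += 1.0                       # toy anomaly in window 10

# Train on normal data only, using the hyperparameters quoted in the text.
model = OneClassSVM(kernel="rbf", nu=0.07, gamma=0.06)
model.fit(window_features(normal, window=100))

# predict() returns +1 for windows resembling the training data, -1 otherwise.
pred = model.predict(window_features(test, window=100))
print(pred[10])
```

Since the features in this toy example are not rescaled and lie very close together, the labels of the normal windows are not meaningful; the sketch is only meant to show the training and prediction calls.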
Tab. 9 shows the results (in accordance with the metrics described in 8.3) of the OC-SVM on the HiLumi data with synthetic data sets.

Methodology validation
Experiments involving neural networks are usually very resource-consuming. In order to reduce the computational cost, it may be beneficial to first train the model on a (small) representative fraction of the available training data. In order to select the percentage used in later experiments, a random sweep with increasing percentage values was conducted. The sweep results can be seen in Fig. 16. The data fraction can be considered big enough when, for a given model, the training and validation accuracy are similar. Based on the experiment results, it can be seen that starting with 1 % of the original dataset size (samples_percentage=0.01), the above condition is true: in the visualization, the line connecting those two accuracy values is nearly horizontal. Moreover, the relative difference between the average accuracy achieved for higher samples_percentage values is very small.
Fig. 17 shows the relationship between four hyperparameters: the history window length (look_back), the number of GRU model cells and the in_grid/out_grid values. It can be observed that the model performance mainly depends on the grid sizes, with the look_back and model size (cells) values having a surprisingly small impact. It is however worth noting that smaller models with smaller look_back values tend to have better performance than those with one of the parameters closer to the upper tested limit.
Fig. 18 visualizes the relationships between look_back, the model validation accuracy and the calculated threshold parameters. The influence of the grid sizes on the length and energy thresholds is small but noticeable, especially affecting the maximum false anomaly length, which in turn affects the saved area length threshold. The look_back values seem to play an important role in determining the model capabilities: smaller values tend to result in a lower maximum false anomaly length. It can also be clearly seen that the maximum false anomaly amplitude (topping out at around 0.5) depends almost entirely on out_grid.
As an additional experiment, the authors conducted a more thorough investigation into the impact of grid sizes on model accuracy (Fig. 19).
It turns out that the in_grid/out_grid ratio is clearly related to model accuracy, with lower ratio values lowering the model performance. However, if a higher ratio cannot be achieved, it is better to use a lower out_grid value.

Detector performance
To test the detector performance, five of the setups were selected using different criteria:
• best_length - the setup with the lowest maximum false anomaly length,
• best_energy - the lowest maximum false anomaly energy,
• best_max_amplitude - the lowest maximum false anomaly amplitude,
• best_accuracy - the highest model validation accuracy,
• balanced - a setup with relatively good accuracy, maximum length and energy values, as well as saved area length and energy thresholds.
Exact setup parameters, as well as performance results are described in Tab. 11.
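The four single-criterion selections can be expressed as a simple lookup over the results of a hyperparameter sweep, as in the sketch below. The record fields and the values are hypothetical, and the balanced setup is omitted because its selection rule is not formalized in the text.

```python
def select_setups(results):
    """Pick setups from sweep results using the single-criterion rules."""
    return {
        "best_length": min(results, key=lambda r: r["max_false_length"]),
        "best_energy": min(results, key=lambda r: r["max_false_energy"]),
        "best_max_amplitude": min(results, key=lambda r: r["max_false_amplitude"]),
        "best_accuracy": max(results, key=lambda r: r["val_accuracy"]),
    }


# Hypothetical sweep records; the numbers are not taken from the paper.
sweep = [
    {"name": "a", "max_false_length": 4, "max_false_energy": 0.9,
     "max_false_amplitude": 0.50, "val_accuracy": 0.91},
    {"name": "b", "max_false_length": 7, "max_false_energy": 0.3,
     "max_false_amplitude": 0.25, "val_accuracy": 0.88},
    {"name": "c", "max_false_length": 9, "max_false_energy": 0.6,
     "max_false_amplitude": 0.40, "val_accuracy": 0.95},
]
chosen = select_setups(sweep)
print({criterion: setup["name"] for criterion, setup in chosen.items()})
```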
Looking only at the results on the real data test set, it may seem that the best_length, best_accuracy and balanced setups perform equally well. The best_energy and best_max_amplitude setups perform very poorly, which disqualifies them from being used with this particular data.
Looking at the results for the test sets containing unit step impulses, however, gives a somewhat different picture. The balanced setup does not perform nearly as well, with a significant drop in the recall value even for long impulses and a very low score for shorter ones. The best_length setup scores are still very high, with the best_accuracy ones slightly outperforming it in terms of recall and F2 for longer impulses.
Fig. 20 shows an example of a false negative, missed by the best_accuracy setup in synthetic set I. Looking at the selected section of the signal and comparing the quantized real anomaly values with the predictions, it can be seen that in several points the predicted values fall into the highest range. From the detector point of view, it means that the model, after some small error, correctly predicted the current value, and the anomaly candidate can be disregarded. Such a situation occurs as a result of the discriminating thresholds being too high in this particular case.
One of the ways in which this problem may possibly be mitigated is an out_grid increase. A direct application of this solution, however, severely affects the model accuracy (as shown in Fig. 19): an increase in out_grid needs to go hand in hand with an adjustment of other parameters, especially in_grid.
Another way involves changing the way the analyzer confirms or rejects anomaly candidates. As mentioned, a candidate is currently rejected whenever a true prediction is made, unless the configured thresholds were passed first. An alternative approach, subject to future research, could involve tracking the ratio of true to false predictions or introducing a required number of true predictions for candidate rejection.
Figure 19: Influence of grid sizes on model accuracy (look_back=32, samples_percentage=0.01).
Fig. 21 shows an example false positive, found in synthetic set I using the best_accuracy setup. It can be seen that the false positive was reported soon after the synthetic anomaly occurred, so it can be assumed that the incoming anomalous signal affected the model predictions. In a real-world scenario this should not be a problem, since the first detected anomaly would probably trigger a failsafe mechanism. In cases where anomalies should be detected even if they occur one after another, a solution involving a small ignored window, equal in length to a half or a whole look_back value, could probably be implemented. How such a mechanism would affect the whole detector performance needs to be researched.
Overall, it seems that for the HiLumi data analysis, the setups selected based on the lowest maximum false anomaly length may yield the best performance, with the ones based on the best accuracy being nearly as good.

Discussion
This paper addresses the challenging task of proposing an efficient methodology, algorithm, and design approach for modeling and anomaly detection in time series signals of the LHC superconducting magnets. The conducted research resulted in the development of a solution based on a GRU RNN as its core component. During the work, several issues that needed to be addressed arose.
In order to be able to measure the detector system performance, it is necessary to answer the question of what and where an anomaly is in this context. When it comes to the signals acquired from the LHC magnets, especially the new HiLumi ones, the exact anomaly position is very hard to determine. As a result, the manually labeled signals may be slightly inaccurate, making it hard to compare the moment at which the labeled anomaly started with the detection system trigger point. In order to obtain more precise performance data, the system was also tested using synthetic datasets, in which the exact anomaly start and end positions are known.
The CERN equipment is very sensitive to false negative cases: failing to detect improper behavior of the system may lead to catastrophic results such as an explosion of magnets [60]. As such, a careful choice of the quality metrics, reflecting this system trait, is required. In the presented work, the F2 quality measure, putting more importance on recall than on precision, was used in order to satisfy that requirement. However, further research into existing metrics designed for anomaly detection systems (such as those presented in [59]) and/or the development of custom metrics could prove insightful.
The solution based on NNs was selected because those models may be updated in an automatic way, require very little feature extraction, can be compressed efficiently and can be ported into hardware [61][62][63]. The challenge lies in choosing the right model for the task and adjusting its hyper-parameters, which can be a very time- and resource-consuming process. In this work, an approach to model architecture selection, based on dataset reduction and the bisection idea, was briefly described. The huge challenge lies, however, in the automation of that process, taking into account not only the model performance but also its resource consumption. To address this issue, the authors plan to develop an RL-based optimization algorithm, the preliminary idea of which was presented in [58].
Such an algorithm could potentially not only automatically tweak the model hyper-parameters, but also address the challenges of model compression and precision reduction and make a hardware implementation of an NN-based solution much easier. Since the anomaly detection module needs to operate in a real-time regime and be able to respond with a very low latency to meet the CERN requirements, the developed solution will be ported to hardware (e.g. FPGA) at some point. The currently used anomaly detection system required the adoption of the high-level models by means of an HDL (Hardware Description Language) (e.g. VHDL). This is an extremely difficult and error-prone process. Furthermore, any updates or modifications of the high-level models require a complete reiteration of the design flow. The adaptability of the NN-based system coupled with the automatic optimization algorithm could significantly simplify that process.
It is worth noting that historically at CERN feature extraction was a very demanding phase, since it involved a lot of experiments with a range of filtering and discrimination methods to reach reliable parameters of the system as a whole. That modeling challenge, regarding the arrangement of the right preprocessing scheme, can potentially be reduced by introducing the proposed solution. While the naive adoption of RNNs requires an operator of the system to make an arbitrary decision regarding the values of the thresholds [53], the adaptive input and output quantization approach presented in this work alleviates this issue by introducing an automatic adjustment process for the required detector thresholds.
Despite being primarily focused on the CERN equipment monitoring, the results presented in this paper should be considered a part of a larger endeavor aiming at developing a methodology and architecture of an anomaly detection system operating in a space of time series analysis, especially under hard real-time constraints.

Conclusions and future work
In this paper, the applicability of RNN models for detecting anomalous behavior of CERN superconducting magnets was examined. The developed solution, based on GRU and adaptive quantization, achieved very promising results for the data acquired from HiLumi magnets. Three testing sets were used in the experiments, one including real anomalies and two with synthetic anomalies in the form of a unit step impulse with a length of 100 and 50 samples. For those datasets, the proposed anomaly detection system reached F2 equal to 1.0, 0.9980, and 0.8920, respectively, with F1 equal to 1.0, 0.9987, and 0.9296.
Several setups of the proposed solution were analyzed, with the configurations selected based on the shortest reported false anomaly length and the best underlying model accuracy achieving the best results. The GRU-based solution was compared with the state-of-the-art OC-SVM reference model and was found to perform significantly better in most of the cases, with much more stable performance across all the tested datasets. The input and output quantization parameters turned out to have a significant impact on the detector performance, with the 16/8 ratio providing the best overall results among the tested cases.
As future work, the authors plan to further test and improve the proposed algorithm, especially with regard to the adaptive quantization and the automatic analyzer rules selection. It is also planned to examine the performance of the proposed solution using more advanced quality measures and more sophisticated testing datasets. Ultimately, the authors plan to develop an RL-based NN model optimization algorithm and use it to simplify the process of implementing the detector prototype on an FPGA platform.