Operation of the ATLAS trigger system in Run 2

The operation of the ATLAS trigger system during Run 2 of the LHC was highly successful for both $pp$ and heavy-ion data-taking as well as for a variety of special runs. The recorded data are of high quality with only minimal losses due to operational problems. The trigger's excellent performance and flexibility allowed collection of $139\,fb^{-1}$ of high-quality $pp$ collision data at $\sqrt{s}=13$ TeV over a wide range of running conditions, with peak instantaneous luminosities ranging from $0.5\times 10^{34} cm^{-2}s^{-1}$ to $2.1\times 10^{34}cm^{-2}s^{-1}$ and 28-60 pile-up collisions. A variety of tools and procedures, as detailed in this paper, have been used in both the offline and online environments to efficiently validate, monitor and assess the operation and the performance of the trigger system. They allowed maximal exploitation of the collisions delivered by the LHC in Run 2 for physics analyses, while respecting the limitations of the trigger system in terms of maximum read-out rates, available HLT computing resources and sustainable storage output bandwidth.

The ATLAS TDAQ system in Run 2, showing the components relevant for triggering as well as the detector read-out and data flow.
The Level-1 (L1) trigger is a hardware-based system that uses custom electronics to trigger on reduced-granularity information from the calorimeter and muon detectors [9]. The L1 calorimeter (L1Calo) trigger takes signals from the calorimeter detectors as input [10]. The analogue detector signals are digitised and calibrated by the preprocessor, and sent in parallel to the Cluster Processor (CP) and Jet/Energy-sum Processor (JEP). The CP system identifies electron, photon, and τ-lepton candidates above a programmable threshold, and the JEP system identifies jet candidates and produces global sums of total and missing transverse energy. The signals from the LAr calorimeter are bipolar and span multiple bunch crossings, which introduces a dependence of the amplitude on the number of collisions occurring in neighbouring bunch crossings (out-of-time pile-up). Objects with narrow clusters, such as electrons, are not strongly affected by small shifts in energy; the missing transverse momentum, however, is very sensitive to small systematic shifts in energy over the entire calorimeter. These effects are mitigated in the L1Calo trigger by a dedicated pedestal correction algorithm implemented in the firmware [11].
The L1 muon (L1Muon) trigger uses hits from the RPCs (in the barrel) and TGCs (in the endcaps) to determine the deviation of the hit pattern from that of a muon with infinite momentum [12]. To reduce the rate in the endcap regions of particles not originating from the interaction point, the L1Muon trigger applies coincidence requirements between the outer and inner TGC stations, as well as between the TGCs and the tile calorimeter.
The L1 trigger decision is formed by the Central Trigger Processor (CTP), which receives inputs from the L1Calo trigger, the L1Muon trigger through the L1Muon Central Trigger Processor Interface (MUCTPI) and the L1 topological (L1Topo) trigger [13] as well as trigger signals from several detector subsystems such as the Minimum Bias Trigger Scintillators (MBTS) [14], the LUCID Cherenkov counter [15] and the zero-degree calorimeter (ZDC) [16].
The CTP is also responsible for applying dead time, a mechanism that keeps the number of L1 accepts within the constraints imposed by the detector read-out latency [17]. This preventive dead time limits the minimum time between two consecutive L1 accepts (simple dead time) to avoid overlapping read-out windows, and restricts the number of L1 accepts allowed in a given number of bunch crossings (complex dead time) to prevent front-end buffers from overflowing. The complex dead time uses a leaky bucket model to emulate a front-end buffer. In this model, dead time is applied when the bucket is full. The model has two parameters: the bucket size X, expressed in units of L1 accepts, and R, the time (in units of bunch crossings) that it takes to read out one L1 accept. With these parameters the trigger rate is limited, on average, to X triggers in a time period of X × R bunch crossings. At the end of Run 2, the simple dead time setting was four bunch crossings, which corresponds to an inefficiency of about 1% for a L1 rate of 90 kHz. The complex dead time was configured with four different leaky bucket algorithms and one sliding-window algorithm to cover the read-out limitations of the various sub-detectors. The total peak inefficiency was about 1% for a L1 rate of 90 kHz.
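The two dead-time rules described above can be illustrated with a small toy simulation. This is only a sketch under stated assumptions, not the CTP firmware logic: the function names are invented for illustration, L1 candidates are drawn at random per bunch crossing, the bucket drains continuously at one L1 accept per R bunch crossings, and the bucket parameters used in the example call (X = 15, R = 370) are hypothetical values, not ATLAS settings.

```python
import random

BC_NS = 25.0  # nominal LHC bunch spacing in nanoseconds


def simple_dead_time_fraction(l1_rate_hz, dead_bc):
    """Fraction of time vetoed when each L1 accept blocks `dead_bc` crossings."""
    return l1_rate_hz * dead_bc * BC_NS * 1e-9


def simulate_dead_time(n_bc, accept_prob, simple_bc, bucket_size, readout_bc, seed=1):
    """Toy model of preventive dead time; returns (accepted, vetoed) counts.

    bucket_size plays the role of X (in L1 accepts) and readout_bc the role
    of R (bunch crossings needed to read out one L1 accept).
    """
    rng = random.Random(seed)
    level = 0.0          # bucket fill level, in units of L1 accepts
    last_accept = None   # bunch crossing of the most recent L1 accept
    accepted = vetoed = 0
    for bc in range(n_bc):
        # The bucket leaks one L1 accept every `readout_bc` crossings.
        level = max(0.0, level - 1.0 / readout_bc)
        if rng.random() >= accept_prob:
            continue  # no L1 candidate in this crossing
        in_simple = last_accept is not None and bc - last_accept <= simple_bc
        bucket_full = level + 1.0 > bucket_size
        if in_simple or bucket_full:
            vetoed += 1
        else:
            accepted += 1
            level += 1.0
            last_accept = bc
    return accepted, vetoed


if __name__ == "__main__":
    # Four vetoed crossings per accept at 90 kHz: 90e3 * 4 * 25 ns, i.e. just under 1%.
    print(f"simple dead time at 90 kHz: {simple_dead_time_fraction(90e3, 4):.2%}")
    # Hypothetical bucket parameters, for illustration only.
    print(simulate_dead_time(1_000_000, 90e3 / 40e6, 4, 15, 370))
```

The closed-form function reproduces the arithmetic behind the quoted simple dead-time inefficiency, while the simulation shows how a full bucket caps the long-run accept rate at one accept per R bunch crossings.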
The L1 trigger can select events by considering event-level quantities (e.g. the total energy in the calorimeter), the multiplicity of objects above thresholds (e.g. the transverse momentum of a muon), or topological requirements (such as invariant masses or angular distances). The topological requirements are applied in the L1Topo trigger to geometric or kinematic combinations of trigger objects received from the L1Calo or L1Muon systems. The L1 trigger accepts events at a rate up to the maximum detector read-out rate of 100 kHz, down from the bunch crossing rate of about 40 MHz, within a latency of 2.5 µs.
For each L1-accepted event, the Front-End (FE) detector electronics read out the event data for all detectors. The data are sent first to ReadOut Drivers (RODs), performing the initial processing and formatting, and then to the ReadOut System (ROS) to buffer the data. The data from the different sub-detectors are sent from the ROS to the second stage of the trigger, the High-Level Trigger (HLT), only when requested by the HLT. In addition to performing the first selection step, the L1 triggers identify Regions-of-Interest (RoIs) in η and φ within the detector to be investigated by the second trigger stage.
JINST 15 P10004
The second stage of the trigger, the HLT, is software-based. A typical reconstruction sequence makes use of dedicated fast trigger algorithms to provide early rejection, followed by more precise and more CPU-intensive algorithms that are similar to those used for offline reconstruction to make the final selection. These algorithms are executed on a dedicated computing farm of approximately 40 000 selection applications known as Processing Units (PUs). Between each year of data taking, older hardware in the farm was replaced with newer hardware on a rolling basis to increase the available computing power and the total number of PUs. The PUs are designed to make decisions within a few hundred milliseconds. A step in such a sequence of algorithms will typically execute one or multiple feature-extraction algorithms requesting event-data fragments from within an RoI and terminate on a hypothesis algorithm which uses the reconstructed features to decide whether the trigger condition is satisfied or not. In some cases, information from the full detector is requested in order to reconstruct physics objects (e.g. for the reconstruction of the missing transverse momentum [18]). The HLT software is largely based on the offline software Athena [19], which itself is based on Gaudi [20], a framework for data processing for HEP experiments. Gaudi/Athena is a component-based framework where each component (e.g. algorithm, service, tool) is configured by a set of properties that can be defined during the configuration stage of the application. The physics output rate of the HLT during an ATLAS data-taking run (see section 4.4) is on average 1.2 kHz with an average physics throughput to permanent storage of 1.2 GB/s. Once an event is accepted by the HLT, the Sub-Farm Output (SFO) sends the data to permanent storage for offline reconstruction and exports the data to the Tier-0 facility [21] at CERN's computing centre.
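The step structure described above (feature extraction followed by a hypothesis decision, with early rejection) can be sketched in a few lines. This is an illustrative toy and not the Athena/Gaudi API: the `Step` class, `run_chain`, the dictionary keys and the thresholds are all hypothetical names and values chosen for the example.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

Features = Dict[str, float]


@dataclass
class Step:
    name: str
    extract: Callable[[Dict[str, float]], Features]  # feature extraction in the RoI
    hypothesis: Callable[[Features], bool]           # decision on those features


def run_chain(roi_data: Dict[str, float], steps: List[Step]) -> Tuple[bool, List[str]]:
    """Run the steps in order and stop at the first failed hypothesis."""
    executed = []
    for step in steps:
        executed.append(step.name)
        if not step.hypothesis(step.extract(roi_data)):
            return False, executed  # early rejection: later steps never run
    return True, executed


# Hypothetical two-step chain: a fast step first, then a more precise one.
steps = [
    Step("fast_calo", lambda d: {"et": d["coarse_et"]}, lambda f: f["et"] > 20.0),
    Step("precision", lambda d: {"et": d["calibrated_et"]}, lambda f: f["et"] > 26.0),
]

if __name__ == "__main__":
    print(run_chain({"coarse_et": 15.0, "calibrated_et": 16.0}, steps))
    print(run_chain({"coarse_et": 30.0, "calibrated_et": 28.0}, steps))
```

Ordering the cheap step first means that the CPU-intensive step only runs for the candidates that survive the fast selection, which is the motivation for early rejection given in the text.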
The Fast TracKer (FTK) [22] is a hardware-based system for inner-detector track reconstruction designed to provide tracks to the HLT at the L1 accept rate. It was undergoing commissioning during Run 2 and was not used by the HLT for trigger decisions.

LHC fill cycle, fill patterns and ATLAS runs
In the following, the LHC fill cycle, the fill pattern and their representation in ATLAS, the so-called bunch groups, are described. Additionally, the ATLAS run, which refers to a continuous period of data acquisition that typically corresponds to an LHC fill cycle, is laid out.

The LHC fill cycle
The LHC is a circular particle accelerator that is the last in a chain of accelerators used to bring particle beams into collisions at their final energies. The beams travel through the LHC in opposite directions in separate rings of superconducting magnets, which are crossed at four interaction points. The beams are kept separated in the interaction points using magnetic fields until they are ready for collisions. The LHC aims to provide the largest usable integrated luminosity of high-energy proton and ion collisions to the LHC experiments. To provide collisions to the experiments, the LHC has to go through a cycle composed of several phases [23], which are shown in figure 2. Each phase corresponds to one or several beam modes, and together they are referred to as the nominal cycle:
• Injection: after the current in the magnets is increased to provide the field necessary for injection, beams are injected from the accelerator chain into the LHC rings following a filling scheme, specifying the number of proton bunches and the spacing between them.
• Ramp: the beams are accelerated to the collision energy. During this phase, the radio frequency systems accelerate the particles and the current in the magnets is further increased.
• Squeeze and adjust: in these two phases, beams are prepared for collisions. First, the beam sizes at the interaction points are reduced (squeeze), then the beams are adjusted so that they are optimally colliding (adjust).
• Stable beams: this is the phase when the LHC conditions are stable, collisions take place in the experiments, and it is safe for detectors to be turned on to record data. Small adjustments of beam parameters are permitted [24,25]. The LHC spent approximately 50% of the time in stable beams throughout Run 2.
• Dump and ramp down: beams are extracted from the rings and safely dumped. The dump can either be planned (by the LHC), requested (for example by experiments in case of problems with the detector) or unplanned. Following the dump, the magnetic fields are ramped down.

Figure 2. The LHC goes through a cycle composed of several phases: the injection of beams into the rings, the acceleration to the collision energy during ramp, the preparation of beams for collisions during squeeze and adjust, the phase where collisions take place during stable beams, the extraction of the beams from the rings during dump, and finally the ramping down of the magnetic fields. Adapted from ref. [27].
The time in between two consecutive stable beams periods is referred to as turnaround, and includes the nominal cycle as well as all necessary actions to set up the machine for operation with beams. The ideal duration of the stable beams phase is typically 10-15 hours, depending on several factors, including luminosity lifetime, average turnaround duration, and predicted availability of the machine.
In about every second fill in the last year of Run 2, fast luminosity scans were performed [26] during stable beams to provide feedback on the transverse emittance at a bunch-by-bunch level to the LHC. During these scans, the beams are offset against each other in the x- and y-planes in several displacement steps. The scans were typically done a few minutes after stable beams had been declared and just before the end of the stable beams period, and each lasted a few minutes.

LHC fill patterns in Run 2
During Run 2, the LHC machine configuration evolved significantly. This was a major factor in improving luminosity performance each year of Run 2 [28]. The various bunch filling patterns used have a direct impact on the trigger configuration. With the changing running conditions, adjustments had to be made in order to respect the trigger and DAQ system limitations (see e.g. section 7.1).
Figure 3. Example bunch group configurations for four out of the 16 possible bunch groups. The numbers in blue on the right indicate the number of bunch crossings for each group. The group of the bunch counter reset veto (BCRVeto) leaves a short time slice for distribution of the LHC bunch counter reset signal to the on-detector electronics. The Paired bunch group indicates the bunch crossing IDs with colliding bunches, while the Empty bunch group contains no proton bunches and is generally used for cosmic ray, noise and calibration triggers. The calibration requests (CalReq) bunch group can be used to request calibration triggers.

At the start of Run 2 in 2015, the LHC used 50 ns bunch spacing and switched in August 2015 to the nominal 25 ns bunch spacing scheme [29]. In June 2016, the high-brightness version of the 25 ns beam obtained through the Batch Compression, Merging and Splitting (BCMS) scheme [30] became operational for physics production. These changes brought about an increase in instantaneous luminosity to about $1.3\times 10^{34}\,cm^{-2}s^{-1}$, resulting in higher trigger rates and an evolution in the trigger strategy. In 2017, several LHC fills were dumped because of beam losses in the LHC sector 16L2 [31]. As a consequence, the 8b4e filling scheme [32] (eight bunches with protons, four bunches without protons) and later its high-brightness variant 8b4e BCS (Batch Compression and Splitting) with their low e-cloud build-up characteristics were made operational. The 8b4e filling schemes circumvented the problem, but resulted in a reduction in the number of colliding bunches by 30% compared to the BCMS scheme. To compensate for the loss in luminosity due to the decrease in colliding bunches, the bunch intensity was increased and this led to a 33% increase of simultaneous interactions per bunch crossing ('pile-up'), up to 80 interactions compared to up to 60 interactions previously.
Such an increase together with the high luminosities of up to $1.9\times 10^{34}\,cm^{-2}s^{-1}$ [33] would have resulted in an increase of trigger rates, straining the CPU resources of the HLT farm. The trigger configuration intended to be used for a luminosity of up to $2.0\times 10^{34}\,cm^{-2}s^{-1}$ with a pile-up of 60 would have required higher trigger thresholds at a pile-up of 80, leading to a reduced efficiency for many physics analyses. Therefore, ATLAS requested that the luminosity be kept constant for the first few hours of a run (luminosity-levelling) at a luminosity of $1.56\times 10^{34}\,cm^{-2}s^{-1}$ with a pile-up of 60. In 2018, the LHC switched back to the 25 ns BCMS beam for luminosity production, as the problems with beam losses in 16L2 were mitigated [34], and the pile-up interactions were again reduced to about 60. More information about these filling schemes can be found in ref. [32].
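The link between luminosity, number of colliding bunches and pile-up quoted above can be checked with the standard relation $\mu = L\,\sigma_{inel} / (n_b f_{rev})$. The sketch below is an illustration under assumptions that are not taken from the text: an inelastic cross-section of 80 mb, a bunch-crossing clock of 40.079 MHz, and 1866 colliding bunches as a representative value for the 8b4e BCS scheme. The function names are hypothetical.

```python
SIGMA_INEL_CM2 = 80e-27      # assumed inelastic pp cross-section: 80 mb in cm^2
BCIDS_PER_TURN = 3564        # bunch crossings per LHC revolution
BC_RATE_HZ = 40.079e6        # assumed bunch-crossing clock, about 40 MHz
F_REV_HZ = BC_RATE_HZ / BCIDS_PER_TURN  # revolution frequency, about 11.2 kHz


def mean_pileup(lumi_cm2s, n_colliding):
    """Mean number of inelastic interactions per colliding bunch crossing."""
    return lumi_cm2s * SIGMA_INEL_CM2 / (n_colliding * F_REV_HZ)


def levelled_lumi(target_pileup, n_colliding):
    """Luminosity at which the mean pile-up equals `target_pileup`."""
    return target_pileup * n_colliding * F_REV_HZ / SIGMA_INEL_CM2


if __name__ == "__main__":
    n_8b4e = 1866  # assumed number of colliding bunches, for illustration only
    print(f"pile-up at 1.56e34: {mean_pileup(1.56e34, n_8b4e):.1f}")
    print(f"luminosity for pile-up 60: {levelled_lumi(60, n_8b4e):.3g} cm^-2 s^-1")
```

Under these assumptions the levelled luminosity of $1.56\times 10^{34}\,cm^{-2}s^{-1}$ corresponds to a pile-up close to 60, consistent with the values given in the text.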

Bunch groups
In the LHC, there are a total of 3564 bunch crossings per LHC revolution. Each of these bunch crossings can have either two bunches colliding, one bunch, or be empty of protons. Each bunch crossing is identified by a Bunch Crossing Identifier (BCID) from 0 to 3563. A list of BCIDs is called a bunch group.
Bunch group conditions are combined in a logical 'AND' with other trigger conditions to define which items generate a L1 accept. Up to 16 distinct bunch groups can be defined in ATLAS, each with its own particular purpose and each specifying which of the LHC bunch crossings it contains. Figure 3 shows four types of bunch groups, which are described in the following.
Bunch group conditions can be paired (colliding) bunches for physics triggers, single (one-beam) bunches for background triggers and empty bunches for cosmic ray, noise and calibration triggers. More complex schemes are possible, e.g. requiring unpaired bunches separated by at least 75 ns from any bunch in the other beam. Two bunch groups have a more technical purpose: the calibration requests group defines the times at which sub-detectors may request calibration triggers, typically in the long gap with no collisions, and the group of the bunch counter reset veto leaves a short time slice for distribution of the LHC bunch counter reset signal to the on-detector electronics.
As the LHC filling scheme can vary from fill to fill, ATLAS has developed and commissioned a procedure for monitoring and redefining the bunch groups using dedicated electrostatic detectors. These so-called beam pick-ups [35] are located 175 m upstream of the interaction point. An online application measures the filling scheme seen by the beam pick-ups and calculates the corresponding bunch groups. While the configuration of some bunch groups is given by the LHC (e.g. the colliding BCIDs) through the fill pattern, others can be defined to contain any desired list of BCIDs for specific data-taking requests (e.g. in van der Meer scans [33] or single-beam background studies).
The 16 bunch group configurations are together called a bunch group set, which is different for each different LHC filling scheme. The bunch group sets are generated for each filling scheme in advance of running, using information about the positions of bunches with protons in each beam provided by the LHC. The generated bunch group set is then checked against the measured beam positions to ensure that it matches. The CTP pairs each L1 trigger item with a specific bunch group defined in the set. Those L1 trigger items which are employed to select events for physics analyses trigger on bunch groups containing all colliding BCIDs. The CTP can also provide random triggers and apply specific bunch crossing requirements to those.
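The bunch group definitions above map naturally onto set operations on BCIDs. The following is a minimal sketch, assuming each beam's filling scheme is given as a list of filled BCIDs; the function names and the group labels are illustrative and do not correspond to the actual ATLAS online application.

```python
N_BCID = 3564  # bunch crossings per LHC revolution, BCIDs 0..3563


def make_bunch_groups(beam1, beam2):
    """Derive basic bunch groups from the filled BCIDs of the two beams."""
    b1, b2 = set(beam1), set(beam2)
    return {
        "Paired": b1 & b2,                        # both beams filled: collisions
        "Unpaired": b1 ^ b2,                      # exactly one beam filled
        "Empty": set(range(N_BCID)) - (b1 | b2),  # no protons in either beam
    }


def unpaired_isolated(beam1, beam2, min_sep_bc=3):
    """Unpaired bunches at least `min_sep_bc` crossings from the other beam.

    With 25 ns spacing, min_sep_bc=3 corresponds to a 75 ns separation.
    """
    b1, b2 = set(beam1), set(beam2)

    def isolated(bcid, other):
        offsets = range(-(min_sep_bc - 1), min_sep_bc)
        # The ring is circular, so BCIDs wrap around modulo N_BCID.
        return all((bcid + d) % N_BCID not in other for d in offsets)

    return {b for b in b1 if isolated(b, b2)} | {b for b in b2 if isolated(b, b1)}


def l1_accept(item_fired, bcid, bunch_group):
    """A L1 item only accepts if its bunch group condition is also met (AND)."""
    return item_fired and bcid in bunch_group


if __name__ == "__main__":
    groups = make_bunch_groups([1, 2, 3, 10], [1, 2, 3, 12, 20])
    print(sorted(groups["Paired"]), sorted(groups["Unpaired"]), len(groups["Empty"]))
    print(unpaired_isolated([1, 2, 3, 10], [1, 2, 3, 12, 20]))
```

Representing each group as a set keeps the AND with the trigger condition a simple membership test, and makes more complex groups, such as the isolated unpaired bunches, a matter of filtering.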

The ATLAS run structure
An ATLAS run is a period of data acquisition with stable detector configuration and, in the case of physics data-taking, usually coincides with an LHC fill, which can last many hours. Another example is a cosmic-ray data-taking run, which takes place when there is no beam in the LHC and the ATLAS detector is used to detect cosmic rays to study detector performance [36]. The DAQ system assigns a unique number to every run at its beginning. An identifier is assigned to each event and is unique within the run (starting at 0 for each run). A run is divided into Luminosity Blocks (LBs), each with a length of the order of one minute and identified by an integer unique within a given run. A LB defines an interval of constant luminosity and stable detector conditions (including the trigger system and its configuration). To define a data sample for physics, quality criteria are applied to select LBs where conditions are acceptable. The instantaneous luminosity in a given LB is multiplied by the LB duration to obtain the integrated luminosity delivered in that LB. The length of the LB can be changed during the run and a new LB can be started at any time (following a 10 second minimum delay). From a data quality point of view, the LB represents the smallest quantity of data that can be declared good or bad for physics analysis.
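The luminosity bookkeeping per LB described above amounts to a sum over the selected blocks. The sketch below is a simplified illustration: the `LumiBlock` class and its field names are hypothetical, and the example values (60 s blocks at $10^{34}\,cm^{-2}s^{-1}$) are invented for the demonstration.

```python
from dataclasses import dataclass

CM2_PER_INV_PB = 1e36  # 1 pb^-1 corresponds to 1e36 cm^-2


@dataclass
class LumiBlock:
    number: int        # integer unique within the run
    duration_s: float  # length of the block in seconds
    inst_lumi: float   # average instantaneous luminosity, cm^-2 s^-1
    good: bool         # data quality decision for this block


def integrated_lumi_inv_pb(blocks, good_only=True):
    """Sum luminosity x duration over the selected blocks, in pb^-1."""
    total_cm2 = sum(
        lb.inst_lumi * lb.duration_s
        for lb in blocks
        if lb.good or not good_only
    )
    return total_cm2 / CM2_PER_INV_PB


if __name__ == "__main__":
    run = [
        LumiBlock(1, 60.0, 1.0e34, True),
        LumiBlock(2, 60.0, 1.0e34, False),  # rejected by data quality
        LumiBlock(3, 60.0, 1.0e34, True),
    ]
    print(integrated_lumi_inv_pb(run), integrated_lumi_inv_pb(run, good_only=False))
```

Because the LB is the smallest unit that can be flagged good or bad, rejecting one block removes exactly its luminosity x duration contribution from the physics sample.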
To start a run, the software and hardware components of the ATLAS detector have to follow the transitions of a finite-state machine [37]. The transitions performed by the applications are shown in figure 4. During 'boot', all applications are started. In the 'configure' and 'connect' transitions, the hardware and applications are configured and connections between different applications are established where necessary. Finally, during 'start', a run number is assigned and the applications perform their final (run-dependent) configuration. Once all applications have arrived at the 'ready state', the CTP releases the inhibit and events start flowing through the system. Run-dependent configurations (e.g. loading of conditions data) are performed during the start transition, which can take several minutes for the entire ATLAS detector [37].
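The ordered transitions described above can be modelled as a very small finite-state machine. This is a toy sketch, not the ATLAS run control software: the `Application` class, the application names and the run number are hypothetical, and only the four transitions named in the text are modelled.

```python
TRANSITIONS = ("boot", "configure", "connect", "start")


class Application:
    """Hypothetical stand-in for one software or hardware controller."""

    def __init__(self, name):
        self.name = name
        self.completed = []
        self.run_number = None

    def apply(self, transition, run_number=None):
        if len(self.completed) == len(TRANSITIONS):
            raise ValueError(f"{self.name}: already in the ready state")
        expected = TRANSITIONS[len(self.completed)]
        if transition != expected:
            raise ValueError(f"{self.name}: expected '{expected}', got '{transition}'")
        if transition == "start":
            self.run_number = run_number  # run-dependent configuration happens here
        self.completed.append(transition)

    @property
    def ready(self):
        return len(self.completed) == len(TRANSITIONS)


def start_run(apps, run_number):
    """Drive all applications through the transitions in lock-step.

    Returns True when every application is ready, i.e. when the CTP
    could release the inhibit.
    """
    for transition in TRANSITIONS:
        for app in apps:
            app.apply(transition, run_number)
    return all(app.ready for app in apps)


if __name__ == "__main__":
    print(start_run([Application("HLT"), Application("CTP")], run_number=123456))
```

Enforcing the order inside `apply` makes an out-of-sequence transition fail loudly instead of leaving an application half-configured when events start flowing.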

Operational model of the ATLAS trigger system
During the operation of the LHC, the ATLAS detector is operated and monitored by a shift crew in the ATLAS control room (ACR), 24 hours a day, 7 days a week, supported by a pool of remote on-call experts. Shifters and experts are responsible for the efficient collection of high-quality data.
The operation and data quality monitoring of the trigger system is overseen by two operation coordinators whose main responsibility is to ensure smooth and efficient data-taking. They coordinate a team of weekly on-call experts, on rotation, for the areas listed below. Operation coordinators and on-call experts work together closely at a daily trigger operation meeting to plan the activities of the day.
• ACR trigger desk: the shifter at the trigger desk in the ACR is responsible for providing the needed trigger configuration and for monitoring the operation of the trigger system in close communication with other ACR shifters.
• Online: responsible for the proper operation of the ATLAS trigger and primary support for the ACR trigger shifter.
• Trigger menu: responsible for the preparation of the trigger configuration of active triggers and their prescale factors (see section 6).
• Online release: collection and review of software changes and monitoring the state of the software release for online usage via validation tests that run every night; deployment of the online software release on the machines used during data-taking.
• Reprocessing: in charge of validating the online software release (see section 10) by running the simulation of the L1 hardware and the HLT software on a dedicated, large dataset in order to spot errors.
• Data quality and debug stream: responsible for the data quality assessment of recorded data; investigate and recover the events in the debug stream (see section 11).
• Signature-specific: monitor the performance of triggers for signatures, assist in data quality assessment and reprocessing sign-off; several trigger signatures are grouped together (muon and B-physics and Light States; jet, missing transverse momentum and calorimeter energy clusters; τ-lepton, electron and photon; b-jet signature and tracks).
• Level-1: each L1 trigger system (L1Calo, L1Muon barrel, L1Muon endcap, and CTP) has an on-call expert who helps to ensure smooth operation of the L1 trigger and monitors the data quality for their respective system.
In addition to data-taking, the trigger operation group participates in special runs of a technical nature together with the ATLAS DAQ team to develop and test the online software and tools to be used for data-taking. It also provides support for other ATLAS systems during detector commissioning runs and for special tests during LHC downtime periods.

The Run-2 trigger menu and streaming model
Events are selected by trigger chains, where a chain consists of a L1 trigger item and a series of HLT algorithms that reconstruct physics objects and apply kinematic selections to them. Each chain is designed to select a particular physics signature such as the presence of leptons, photons, jets, missing transverse momentum, total energy and B-meson candidates. The list of trigger chains used for data-taking is known as a trigger menu, which also includes prescales for each trigger chain. To control the rate of accepted events, a prescale value, or simply prescale, can be applied. For a prescale value of n, an event has a probability of 1/n to be kept. Individual prescale factors can be given to each chain at L1 or at the HLT, and can be any value greater than or equal to one. More details of how prescales are applied can be found in section 8.1.3.
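The effect of a prescale can be illustrated with a random keep/discard decision of probability 1/n, as stated above. This is a sketch only: the function names are hypothetical, the random implementation is an assumption (the actual mechanism is described in section 8.1.3), and `prescaled_rate` ignores any rejection by the HLT algorithms themselves.

```python
import random


def passes_prescale(prescale, rng):
    """Keep an event with probability 1/prescale (prescale >= 1)."""
    if prescale < 1:
        raise ValueError("prescale must be greater than or equal to one")
    return rng.random() < 1.0 / prescale


def prescaled_rate(input_rate_hz, l1_prescale=1.0, hlt_prescale=1.0):
    """Average rate left after the L1 and HLT prescales of one chain."""
    return input_rate_hz / (l1_prescale * hlt_prescale)


if __name__ == "__main__":
    rng = random.Random(0)
    kept = sum(passes_prescale(10, rng) for _ in range(100_000))
    print(f"kept fraction for prescale 10: {kept / 100_000:.3f}")
    print(f"rate after prescales: {prescaled_rate(1000.0, 10, 5)} Hz")
```

A prescale of one keeps every event, which is why unprescaled chains are the ones used as primary triggers for physics analyses.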
The complete set of trigger selections must respect all trigger limitations and make optimal use of the available resources at L1 and the HLT (e.g., maximum detector read-out rate, available processing resources of the HLT farm, and maximum sustainable rate of permanent storage). Rates and resource usage are determined as described in section 6.2 and section 7.3.
The configuration is driven by the physics priorities of the experiment, including the number of clients satisfied by a particular trigger chain. The main goal of the Run-2 trigger menu design was to maintain the unprescaled single-electron and single-muon trigger $p_T$ thresholds around 25 GeV despite the expected higher trigger rates to ensure the collection of the majority of events with leptonic W and Z boson decays. The primary triggers (used for physics analyses and unprescaled) cover all signatures relevant to the ATLAS physics programme including electrons, photons, muons, τ-leptons, jets, b-jets and $E_T^{miss}$, which are used for Standard Model precision measurements including decays of the Higgs, W and Z bosons, and searches for physics beyond the Standard Model such as heavy particles, supersymmetry or exotic particles. A set of low transverse momentum dimuon triggers is used to collect B-meson decays, which are essential for the B-physics programme of ATLAS.
Heavy-ion (HI) collisions differ significantly from pp collisions, and therefore require a dedicated trigger menu to record the data. The main components of the HI trigger menu are triggers selecting hard processes (high $E_T$, b-jets, muons, electrons, and photons) in inelastic Pb+Pb collisions, minimum-bias triggers for peripheral and central collisions, triggers selecting events with particular global properties (event-shape triggers to collect events with large initial spatial asymmetry of the collisions, ultra-central collision triggers), as well as triggers selecting various signatures in ultra-peripheral collisions. More information about the HI trigger menu and associated streams can be found in ref. [38].
Apart from the trigger menu used to record nominal pp or HI collisions, additional trigger menus were designed in Run 2 for special data-taking configurations, with some examples discussed in section 7.

The trigger menu evolution in Run 2
The trigger menu for pp data-taking evolved throughout Run 2 due to the increase of the instantaneous luminosity and the number of pile-up interactions. The composition of the trigger menu is developed based on the expected luminosity for each year, with looser selections deployed during early data-taking or when the peak luminosity falls below the predicted target value. The main trigger chains that comprise the ATLAS trigger menu for 2015, targeting an instantaneous luminosity of $5\times 10^{33}\,cm^{-2}s^{-1}$ and valid for a peak luminosity up to $6.5\times 10^{33}\,cm^{-2}s^{-1}$, are described in detail along with their performance in ref. [39]. As the instantaneous luminosity increased substantially in 2016 (up to $1.3\times 10^{34}\,cm^{-2}s^{-1}$) and again in 2017 (up to $1.6\times 10^{34}\,cm^{-2}s^{-1}$), it became necessary to adjust the trigger menu each year accordingly. The various improvements and the performance for the trigger menu used in 2016 and 2017 are described in detail in refs. [40] and [38], respectively. The peak luminosity in 2018 was close to $2.0\times 10^{34}\,cm^{-2}s^{-1}$. Even though the luminosity was higher than in 2017, the number of interactions per bunch crossing was similar. The resources needed to continue running the same trigger menu as in 2017 were estimated to fall within the limitations of the trigger system. The 2018 trigger menu [41] therefore only contained additions on top of the 2017 menu together with a few changes and improvements to the trigger selections used in 2017.

Cost monitoring framework
The ATLAS cost monitoring framework [42] consists of a suite of tools to collect monitoring data on both CPU usage and data-flow over the data-acquisition network during the trigger execution. These tools are executed on a sample of events processed by the HLT, irrespective of whether the events pass or fail the HLT selection. The framework is primarily used to prepare the trigger menu for physics data-taking through the detailed monitoring of the system, allowing data-driven predictions to be made utilising dedicated datasets (enhanced bias dataset, see section 7.3). Monitored data include algorithm execution time, data request size, and the logical flow of the trigger execution for all L1-accepted events. To sample a representative subset of all L1-accepted events, a monitoring fraction of 10% is chosen. Example monitoring distributions are given for two of the many algorithms in figure 5: calorimeter topological clustering [43] and electron tracking. These monitoring data were collected over a 180 s data-taking period at $1\times 10^{34}\,cm^{-2}s^{-1}$. Topological clustering can run either within an RoI or as a full detector scan, leading to a double-peak structure in the processing time as shown in figure 5 (top left). Analogously to the procedure of predicting the rates of individual HLT chains and trigger menus (see sections 7.3 and 10), it is possible to estimate the number of HLT PUs which will be required to run a given trigger chain or menu. This functionality was extremely useful in planning for different LHC scenarios in 2017 and in preparation for 2018 data-taking.
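A rough way to see how monitored processing times translate into a number of PUs is to multiply the input rate by the mean processing time per event. The sketch below is an illustration and not the method of the cost monitoring framework: it assumes that each PU handles one event at a time and that the monitored sample is representative, and the function names and numbers are hypothetical.

```python
import math


def mean_time_per_event(sampled_times_s):
    """Average HLT processing time from a monitored sample of events."""
    if not sampled_times_s:
        raise ValueError("need at least one monitored event")
    return sum(sampled_times_s) / len(sampled_times_s)


def required_pus(l1_rate_hz, mean_time_s):
    """PUs needed to keep up: events in flight = input rate x mean time."""
    return math.ceil(l1_rate_hz * mean_time_s)


if __name__ == "__main__":
    # Hypothetical monitored sample: fast rejections plus a few slow events.
    sample = [0.05, 0.10, 0.15, 0.30, 0.65]
    t_mean = mean_time_per_event(sample)
    print(f"mean time: {t_mean:.2f} s, PUs at 100 kHz: {required_pus(100e3, t_mean)}")
```

With an input rate of 100 kHz, a mean processing time of a few hundred milliseconds leads to a requirement of several tens of thousands of PUs, which is the same order as the farm size quoted earlier.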

Run-2 streaming model
The trigger menu defines the streams to which an event is written, depending on the trigger chains that accepted the event. Data streams are subdivided into files for each luminosity block, which facilitates the subsequent efficiency and calibration measurements under varying running conditions.
The five different types of data streams considered in the recording rate budget available at the HLT during nominal pp data-taking are:
• Physics stream: contains events with collision data of interest for physics studies. The events contain full detector information and dominate in terms of processing, bandwidth and storage requirements.
• Express stream: very small subset of the physics stream events reconstructed offline in real time for prompt monitoring and data quality checks.
• Debug streams: events for which no trigger decision could be made are written to this stream. These events need to be analysed and recovered separately to identify and fix possible problems in the TDAQ system (see section 11).
• Calibration streams: events which are triggered by algorithms that focus on specific subdetectors or HLT features are recorded in these types of streams. Depending on the purpose of the stream, only partial detector information is recorded through a strategy called Partial Event Building (PEB) [5], which has the potential to significantly reduce the event size.
• Trigger-Level Analysis (TLA) streams: events sent to this stream store only partial detector information and specific physics objects reconstructed by the HLT to be used directly in a physics analysis.
• Monitoring streams: events are sent to dedicated monitoring nodes to be analysed online for, e.g., detector monitoring, but are not recorded.
For special data-taking configurations it is possible to introduce additional streams; an example is the recording of enhanced bias data, which is discussed in section 7.3. With the exception of the debug streams, the streaming model is inclusive, which means that an event can be written to multiple streams. Aside from the express stream, there are typically multiple different streams of each type. For PEB, data are only stored for specific sub-detectors, or for specific regional fragments from specific sub-detectors. Similarly, the TLA stream (see ref. [44] for more details of the procedures) only stores physics objects reconstructed by the HLT with limited event information and uses these trigger-level objects directly in a physics analysis [45]. By writing out only a fraction of the full detector data, the event size is reduced, making it possible to operate these triggers at higher accept rates while not being limited by constraints on the output bandwidth. This strategy is effective in avoiding high prescales at the HLT for low transverse momentum ($p_T$) triggers. Figure 6 shows the average recording rate of the physics data streams of all ATLAS pp runs taken in 2018. Events for physics analyses are recorded at an average rate of ∼1.2 kHz. This comprises two streams, one dedicated to B-physics and Light States (BLS) physics data, which averaged 200 Hz, and one for all other main physics data, which averaged the targeted 1 kHz. The BLS data are kept separate so the offline reconstruction can be delayed if available resources for processing are scarce. Figure 7 shows the HLT rates and output bandwidth as a function of time in a given run. The apparent mismatch between rate and output bandwidth in some streams is due to the use of PEB techniques. The increase of the TLA HLT output is part of the end-of-fill strategy of the ATLAS trigger.
Towards the end of the LHC fill, when the luminosity and the pile-up are reduced compared to their peak values, L1 bandwidth and CPU resources are available to record and reconstruct additional events using lower-threshold TLA trigger chains. Table 1 shows the average event sizes for the described streams. The event size of physics, express and BLS streams is comparable whereas the TLA stream event size is significantly smaller. The calibration stream size varies considerably depending on the purpose and what sub-detector information is written out.

Special data-taking configurations
In addition to standard pp and heavy-ion data-taking, the LHC programme includes a variety of short periods when the machine is operated with particular beam parameters, referred to as special data-taking configurations. The special data-taking configurations provide data for detector and accelerator calibration as well as additional physics measurements in the experiments. The specific LHC bunch configurations and related conditions (e.g. lower number of paired bunches, change in the average number of pile-up interactions), detector settings (e.g. subsystem read-out settings optimised for collecting calibration data) and desired trigger configuration have to be taken into account when preparing the trigger menu. The preparation of these configurations can be quite extensive as they require a specific trigger menu which needs to be prepared and adjusted to comply with the imposed, usually tightened limits in rate, bandwidth and CPU consumption. In the following, three examples of special data-taking configurations, and the challenges that come with them, are discussed: runs with a low number of bunches, luminosity calibration runs and a configuration used to record enhanced minimum-bias data for future estimates of trigger rates and CPU consumption.

Figure 7 (partial caption). Presented are the main physics stream rate, containing all trigger chains for physics analyses; the BLS stream, containing trigger chains specific to B-physics analyses; the express stream, which records events at a low rate for data quality monitoring; other physics streams at low rate, such as beam-induced background events; the trigger-level analysis stream; and the detector calibration streams. The monitoring stream is not reflected in the output bandwidth as the monitoring data are not written out to disk. The increase of the TLA HLT output rate is part of the end-of-fill strategy of the ATLAS trigger. At the end of the LHC fill, L1 and CPU resources are available to reconstruct and record additional events using lower-threshold TLA triggers. During Run 2 the TLA stream was seeded by jet triggers and only the HLT jet information was saved. This increased the total HLT output rate, but did not significantly increase the total output bandwidth due to the small size of TLA events.

Runs with few bunches
Runs with a low number of bunches (e.g. 3, 12, 70, 300 bunches) usually occur during the periods of intensity ramp-up of the LHC after long or end-of-year shutdowns [46]. While in most of these runs it is still desired to collect data for detector calibrations or for physics analyses that prefer low-luminosity conditions, they provide an operational challenge due to certain limits of the ATLAS detector which the trigger needs to take into account. The most stringent limitation when a small number of bunches are grouped into small sets of bunches (bunch trains) arises from events being accepted at L1 and the data being read out at the mechanical resonance frequencies of the wire bonds of the insertable B-layer (IBL) or the semiconductor tracker (SCT). Read-out at these frequencies can excite resonant vibrations of the wire bonds, driven by Lorentz forces induced by the magnetic field, and cause wire bonds to break due to fatigue stress. The resonant modes of the wire bonds lie at frequencies between 9 and 25 kHz for the IBL, which is of concern given the 11 245 Hz LHC bunch revolution frequency. The resonant modes of the SCT are less of a concern as they are typically above the maximum L1 trigger rate limit imposed by the IBL. To protect the detector, a so-called fixed-frequency veto is implemented, which prevents read out of the detector upon sensing a pattern of trigger rates falling within a dangerous frequency range [47,48].
Figure 8. The fixed-frequency veto limit to protect the innermost pixel detector of ATLAS (IBL) against irreparable damage due to resonant vibrational modes of the wire bonds has a direct impact on the maximum allowable rate of the first trigger level (L1). This limit depends on the number of colliding bunches in ATLAS and on the filling scheme of the LHC beams. This plot presents the simulated rate limits of the L1 trigger as imposed for IBL protection for two different filling schemes (in blue), and the expected L1 rate (in red) from rate predictions. The steps in the latter indicate a change in the prescale strategy. The rate limitation is only critical for the lower-luminosity phase, where the required physics L1 rate is higher than the limit imposed by the IBL veto. The rate can be reduced by applying tighter prescales.
The IBL veto provides the most stringent limit on the L1 rate in this particular LHC configuration. To prepare trigger menus which respect this limit, the maximum affordable trigger rate is first determined by simulating the effect of the IBL veto. If the expected rate from the nominal trigger menu is higher than the allowed rate, the menu is adjusted to reduce the rate to fit within the limitations. Figure 8 shows the simulated IBL rate limit for two different bunch configurations, together with the expected L1 trigger rate of the nominal physics trigger menu. This rate limitation is only critical for the lower-luminosity phase, where the required physics L1 rate is higher than the limit imposed by the IBL veto. In order to avoid impacting primary physics triggers, the required rate reduction is achieved by reducing the rate of the supporting trigger chains.
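The menu adjustment described in this section can be illustrated with a minimal sketch. It assumes, as a simplification, that the primary and supporting trigger rates add without overlap and that all supporting triggers are scaled by one common factor; the function name and the rates used with it are hypothetical and are not taken from the ATLAS software.

```python
def supporting_prescale_factor(primary_rate_hz, supporting_rate_hz, limit_hz):
    """Common factor by which supporting-trigger prescales would be multiplied.

    Simplifying assumption: primary and supporting rates add without overlap.
    Returns 1.0 if the nominal menu already fits under the L1 rate limit.
    """
    total = primary_rate_hz + supporting_rate_hz
    if total <= limit_hz:
        return 1.0
    # Rate left for supporting triggers once primaries are kept untouched.
    budget = limit_hz - primary_rate_hz
    if budget <= 0:
        raise ValueError("primary triggers alone exceed the L1 rate limit")
    return supporting_rate_hz / budget
```

For example, with 30 kHz of primary rate, 20 kHz of supporting rate and a simulated limit of 40 kHz, the supporting prescales would need to be doubled in this simplified picture.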

Luminosity calibration runs
Luminosity calibration runs are runs in which the absolute luminosity scale [33] is being determined and the calibration of the different luminosity detectors is measured. A precise measurement of the integrated luminosity is a key component of the ATLAS physics programme, in particular for cross-section measurements where it is often one of the leading sources of uncertainty. The luminosity measurement is based on an absolute calibration of the primary luminosity-sensitive detectors in low-luminosity runs with specially tailored LHC conditions using the van der Meer (vdM) method [49]. The luminosity calibration relies on multiple independent luminosity detectors and algorithms, which have complementary capabilities and different systematic uncertainties. One of these algorithms is the counting of tracks from the charged particles reconstructed in the inner detector in randomly selected bunch crossings. Since the different LHC bunches do not have the exact same proton density, it is beneficial to sample a few bunches at the maximum possible rate. For this purpose, a minimum-bias trigger [50] selects events for specific LHC bunches and uses partial event building to read out only the inner-detector data. The data are read out at about 5 kHz for five different LHC bunches defined in the specific bunch group of the bunch group set used in the run.

Enhanced bias runs
Certain applications such as HLT algorithm development, rate predictions and validation (described in section 10) require a dataset that is minimally biased by the triggers used to select it. The Enhanced Bias (EB) mechanism allows these applications to be performed utilising dedicated ATLAS datasets. These datasets contain events only biased by the L1 decision, by selecting a higher fraction of high-$p_{\mathrm{T}}$ triggers and other interesting physics objects than would be selected in a zero bias sample (i.e. a sample collected by triggering on random filled bunches). To collect the EB dataset, a specific trigger menu is used which consists of a selection of representative L1 trigger items spanning a range from high-$p_{\mathrm{T}}$ primary trigger items to low-$p_{\mathrm{T}}$ L1 trigger items, plus a random trigger item to add a zero-bias component for very high cross-section processes. The random trigger item corresponds to a random read-out from the detector on filled bunches and therefore corresponds to a totally inclusive selection. The bias from the choice of items in the EB trigger menu is invertible, which means that a single weight is calculable per event to correct for the prescales applied during the EB data-taking. This weight restores an effective zero-bias spectrum. The recorded events are only biased by the L1 system; no HLT selection is applied beyond the application of HLT prescales to control the output rates. The EB trigger menu can be enabled on top of the regular physics menu, adding a rate of 300 Hz for the period of approximately one hour in order to record around one million events. This sample contains sufficient events to accurately determine the rate of all primary, supporting and backup trigger chains which together make up a physics trigger menu.
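The per-event weight mentioned above can be illustrated with a minimal sketch. It assumes that the prescale decisions of the individual EB items are independent, so that an event satisfying items with prescales p_i is recorded with probability 1 − Π(1 − 1/p_i), and the weight is the inverse of that probability. The function names are hypothetical, and the exact procedure used by ATLAS may differ in detail.

```python
from math import prod

def eb_event_weight(prescales):
    """Weight that undoes the prescales of the EB items an event satisfied.

    `prescales` holds the prescale (>= 1) of every EB item whose L1
    condition the event satisfied, whether or not that item fired.
    Assumption: prescale decisions of different items are independent.
    """
    if not prescales:
        raise ValueError("event must satisfy at least one EB item")
    # Probability that every satisfied item was prescaled away.
    p_not_recorded = prod(1.0 - 1.0 / p for p in prescales)
    return 1.0 / (1.0 - p_not_recorded)

def predicted_rate_hz(weights, live_time_s):
    """Rate estimate: sum of event weights divided by the sampled live time."""
    return sum(weights) / live_time_s
```

An event that satisfies any unprescaled item (prescale 1) always gets weight 1, while an event seen only by an item with prescale 10 stands in for ten such events.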

Condition updates in the HLT
The HLT event selection is driven by dedicated reconstruction and selection algorithms. The behaviour and performance of some of those algorithms depend on condition parameters, or conditions, which provide settings, such as calibration and alignment constants, to the algorithms. Conditions are valid from the time of their deployment, and until their next update. Depending on the nature of the conditions, these updates can be frequent. While most conditions are updated only between runs and often much less frequently, some are volatile enough to require updates during ongoing data-taking. In the ATLAS experiment, all conditions data and their interval of validity are stored in the dedicated COOL database [51]. This section describes those special conditions and the procedure that was introduced to configure them consistently and reproducibly across the HLT farm.

Online beam spot
Many criteria employed in the event selection are sensitive to the changes in the transverse and (to a lesser extent) longitudinal position and width of the LHC beams, also referred to as the beam spot [4]. The parameters of the beam spot are important inputs for the selection of events with B-hadrons, which have a long lifetime, typically decaying a few millimetres from the primary proton-proton interaction vertex. Since the beam-spot parameters are not constant within a run, they are continuously monitored and updated during data-taking if there are large enough deviations from the currently used values. The beam spot is estimated online by collecting the primary-vertex information provided in histograms created by HLT algorithms executed on events selected by L1 jet triggers. These histograms are then collected by an application external to the HLT, and the beam-spot position and tilt are determined in a fit. For every new LB, the beam-spot application reads in the histograms of the last few LBs; usually at least four or five LBs are required in order to acquire enough statistics to perform an initial beam-spot fit. If the fit is successful, the conditions update procedure for the beam spot is started if any of the following is true: a) the beam position along any axis relative to the beam width has changed by more than 10% and the significance of this change is larger than two, b) the width has changed by more than 10% and the significance of this change is larger than two, or c) the precision of either the beam position or the width has improved by more than 50%.
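The three update criteria can be sketched for a single axis as follows. The text does not define the significance or the precision improvement explicitly, so this sketch assumes that the significance is the change divided by the old and new fit uncertainties added in quadrature, and that a precision improvement of more than 50% means the uncertainty has more than halved. All names are hypothetical.

```python
from dataclasses import dataclass
from math import hypot

@dataclass
class Estimate:
    value: float   # fitted value (position or width)
    error: float   # fit uncertainty on that value, must be > 0

def significance(new, old):
    """Change divided by the combined uncertainty (assumed definition)."""
    return abs(new.value - old.value) / hypot(new.error, old.error)

def needs_update(old_pos, new_pos, old_width, new_width):
    """Apply criteria a) to c) from the text for one axis."""
    # a) position shift relative to the beam width > 10%, significance > 2
    shift = abs(new_pos.value - old_pos.value) / old_width.value
    if shift > 0.10 and significance(new_pos, old_pos) > 2.0:
        return True
    # b) width changed by more than 10%, significance > 2
    change = abs(new_width.value - old_width.value) / old_width.value
    if change > 0.10 and significance(new_width, old_width) > 2.0:
        return True
    # c) uncertainty on position or width more than halved (assumed meaning)
    return (new_pos.error < 0.5 * old_pos.error
            or new_width.error < 0.5 * old_width.error)
```

In a full implementation the check would be repeated for each axis, and an update would be triggered if any axis satisfies any criterion.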

Online luminosity
Since the bunches in the LHC arrive in trains, there are several consecutive bunch crossings with collisions followed by a gap between the trains with empty bunch crossings. Additionally, the bunches in the train have slightly different bunch charges, which means that the luminosity for each bunch can be different from the average luminosity across the full train. The signals from energy depositions in the liquid-argon calorimeter span many bunch crossings, affecting the energy reconstruction of subsequent collision events. Therefore, the signal pedestal correction that is applied during the energy reconstruction depends on the per-bunch luminosity of the event bunch itself and that of the surrounding bunches [52]. The LUCID detector continuously monitors the overall and per-bunch luminosity, while a separate application compares it with the currently used luminosity values and starts the conditions update procedure for the luminosity if the average luminosity deviates by more than 5%. The per-bunch pile-up values are also used in pile-up-sensitive algorithms to correct for the bunch-crossing dependence of the calorimeter pulse-shapes [40], and for reconstruction of electron [53] and hadronic τ-lepton decay [54] candidates.

Updates of trigger prescales
Prescales can be used to adjust the rate of an item/chain or to disable it completely, or to only allow its execution after the event has already been accepted (so-called rerun condition). Being able to change trigger prescales during an ongoing run is extremely valuable in providing luminosity-dependent trigger configurations during data-taking and therefore in making optimal use of the available resources. As opposed to L1, where prescales are applied centrally by the CTP, the update of HLT prescales needs to be synchronised across the entire HLT farm. In addition, because of the processing delay in the HLT, the HLT nodes need to be able to access the history of prescale sets used for the currently ongoing run in order to apply the correct prescales. For this purpose, a table holding the mapping between LBs and prescale keys is recorded in the trigger database (see section 9), and propagated to the HLT farm using the same mechanism as the other conditions updates. Any change of prescale key is also recorded in COOL for use in offline analysis.
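The role of the LB-to-prescale-key table can be illustrated with a small sketch: each entry records the first LB from which a key applies, and an HLT node looks up the key for the LB of the event it is processing, which may be older than the current LB. The class and method names are hypothetical and do not correspond to the ATLAS implementation.

```python
from bisect import bisect_right

class PrescaleHistory:
    """Maps luminosity blocks (LBs) to the prescale key valid from that LB on."""

    def __init__(self):
        self._start_lbs = []   # first-valid LBs, kept in increasing order
        self._keys = []        # prescale key for each entry

    def set_key(self, first_lb, key):
        """Record that `key` applies from `first_lb` onwards."""
        if self._start_lbs and first_lb <= self._start_lbs[-1]:
            raise ValueError("updates must be appended in increasing LB order")
        self._start_lbs.append(first_lb)
        self._keys.append(key)

    def key_for(self, lb):
        """Return the key in force for an event belonging to `lb`."""
        i = bisect_right(self._start_lbs, lb) - 1
        if i < 0:
            raise KeyError(f"no prescale key defined for LB {lb}")
        return self._keys[i]
```

An event from LB 49 that is still being processed after a key change at LB 50 is thus treated with the key that was valid when it was taken.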

Conditions update procedure
To update conditions during the run, the new data are written to COOL with a validity range starting in the next LB, and the HLT processes are informed through markers in the event data itself that the new information has to be fetched from COOL. Distributing the update markers through the data path rather than the control path ensures that all HLT nodes receive them. The HLT algorithms that use such volatile conditions data, i.e. online beam-spot or luminosity information, receive a handle to the relevant conditions object and the framework takes care of providing the correct conditions data for every event (see ref. [37] for more details). An example showing the luminosity and the y-position of the beam spot used in the HLT compared with their respective live values is shown in figure 9.

Trigger configuration
The settings of both the hardware and software components of the ATLAS trigger system must be accurately recorded to ensure the correct interpretation of the trigger in offline analysis and the reproducibility of the trigger behaviour in the recorded data. The ATLAS trigger configuration system uses a relational database, the TriggerDB, to store these configuration data in a stable and accessible manner. L1Muon and L1Calo maintain additional configuration and condition databases, while the CTP stores most of its information in the TriggerDB. The Run-2 ATLAS trigger configuration system is described in detail in ref. [55] and is summarised in the following.
The upload of the configuration to the TriggerDB is handled by the TriggerTool via its graphical user interface (GUI). The TriggerTool GUI also makes it possible to browse the TriggerDB configurations and to edit existing configurations, which results in new configurations that maintain reproducibility.
The definition of the L1 configuration contains: the L1Topo trigger menu, a list of algorithms and their parameters used in the L1Topo system; the muon and calorimeter $p_{\mathrm{T}}$ thresholds and isolation requirements for each L1 trigger object; the L1 trigger menu, a list of L1 trigger thresholds and objects which are combined into L1 trigger items; the CTP firmware logic, which is derived from the L1 trigger menu; and global settings, such as the L1Calo isolation settings. To accurately describe the HLT software configuration, two components are needed: the HLT trigger menu and parameters needed to configure the HLT algorithms and other software components employed by the trigger chains. The software release version that was used to produce this configuration is also stored in order to provide a reference to the algorithm implementation that is applicable for the intended release (the original or a later release version).
In addition to the configuration, which is fixed during a run, the TriggerDB also stores information that can be updated during physics data-taking (see section 8): the prescales to be used with each L1 item or HLT trigger chain, and the bunch group set (see section 4.3).

The trigger database design
The TriggerDB [56] is an Oracle relational database schema in which all tables have an integer primary key corresponding to a unique combination of a name and version for the object and any dependent object that is stored. The primary key of the top-level table is known as the Super Master Key (SMK) and uniquely defines the L1 and HLT configuration. The SMK together with the Bunch Group Key (BGK), associated with the bunch group set, and Prescale Set Keys (PSK) for L1 and the HLT uniquely define the trigger configuration used to record a run. These keys are saved in COOL for each LB for use by offline analyses. A simplified representation of the database structure for the trigger configuration is shown in figure 10.
There are several instances of the TriggerDB hosted on the CERN Oracle servers, each for different use cases. The online TriggerDB is accessed by custom C++ configuration classes to load the configuration into the L1 and HLT systems at the start of a run. The online DB is only accessible from within the ATLAS control room network, but there is a read access clone replicated externally. This is needed to be able to run the trigger simulation using the same configuration classes as online. Other TriggerDB instances are used to produce MC simulation and for trigger reprocessing tasks.
The design of the TriggerDB is based on two principles: to avoid duplication of records in the database and to maintain every configuration ever used to record data. Link tables are used to allow the same primitive to appear in many configurations and similar primitives to appear within the same configuration. This removes the need to replicate the primitives themselves. If there is a change to any part of the configuration then it will always result in a new SMK, with only the updated information being saved.
Figure 10. A simplified representation of the database structure for the trigger configuration highlighting the primary keys and the relationship between a selection of tables. The primary keys (black rectangles) are associated with an index of a table (rectangles). Most of the top-level table structure is shown (red rectangles) with each of these linking to further tables (for example a trigger chain). These linked tables contain various objects (for example a chain name) and subsequent links to still further tables (for example the algorithm for the given trigger chain) as demonstrated by the blue (L1) or green (HLT) rectangles.
The TriggerDB is also used by a web page where the configuration can be browsed for a specific set of trigger keys, or looked up for a specific run.
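The two design principles of the TriggerDB, no duplicated records and a new top-level key for every changed configuration, can be illustrated with a toy schema. The sketch below uses an in-memory SQLite database rather than Oracle, and the table names, column names and chain names are hypothetical; it is not the actual TriggerDB schema.

```python
import sqlite3

def make_db():
    """Create a toy schema: primitives, top-level keys and a link table."""
    db = sqlite3.connect(":memory:")
    db.executescript("""
        CREATE TABLE chain (id INTEGER PRIMARY KEY, name TEXT, version INTEGER,
                            UNIQUE (name, version));
        CREATE TABLE menu (id INTEGER PRIMARY KEY);
        CREATE TABLE menu_to_chain (menu_id INTEGER REFERENCES menu(id),
                                    chain_id INTEGER REFERENCES chain(id));
    """)
    return db

def get_or_create_chain(db, name, version):
    """Return the key of an existing (name, version) record, inserting only if new."""
    row = db.execute("SELECT id FROM chain WHERE name=? AND version=?",
                     (name, version)).fetchone()
    if row:
        return row[0]
    return db.execute("INSERT INTO chain (name, version) VALUES (?, ?)",
                      (name, version)).lastrowid

def create_menu(db, chains):
    """Store a menu as a new key plus links to (possibly shared) chain records."""
    menu_id = db.execute("INSERT INTO menu DEFAULT VALUES").lastrowid
    for name, version in chains:
        db.execute("INSERT INTO menu_to_chain VALUES (?, ?)",
                   (menu_id, get_or_create_chain(db, name, version)))
    return menu_id
```

Creating a second menu that changes only one chain produces a new menu key, while the unchanged chain record is shared through the link table rather than copied.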

The TriggerTool
The commonly used way to upload the trigger configuration into the database is through the TriggerTool GUI [57], a Java project developed in the NetBeans [58] integrated development environment. Upon launching the tool, the user has to choose their access role and the database connection.
Within the TriggerTool, a variety of options are available to display and edit the trigger configuration through a range of panels. The main panel is shown in figure 11, which shows the configurations which are currently contained in the connected TriggerDB, as well as the L1 and HLT PSKs that are associated with a given SMK. Selecting one of these prescale keys loads the details of that particular key, displaying the L1 items or HLT chains, their prescales and whether they are enabled or not. An example display for the HLT prescales can be seen in figure 12. The table can be used to edit the prescale or enable/disable an item. Changes are then saved as a new prescale key and associated with the same SMK. Various options for displaying differences between prescale sets of the same SMK are given. There are additional panels to view or edit the configuration, which is especially useful for understanding algorithm settings or stream configurations.
In preparing for and during data-taking, the TriggerTool is routinely used by the trigger menu and trigger online on-call experts to prepare the trigger configuration for the upcoming runs or to generate new trigger configurations during an ongoing run. Any change in prescales or bunch group sets can be deployed during a run, while a change in the trigger algorithm parameters (which leads to the generation of a new SMK) requires stopping the ongoing run and starting a new run.

The TriggerPanel
The change of the SMK and PSKs is handled by the TriggerPanel, which is included as part of the TDAQ software controlling the finite-state machine of ATLAS in the integrated run control GUI [59]. The TriggerPanel displays all prescale keys that are associated with the loaded SMK, along with which keys are currently in use, as can be seen in figure 13. The prescale keys can be set for different data-taking scenarios: standby, emittance scans, or physics. To facilitate the selection of a corresponding L1 and HLT PSK assigned to a particular luminosity range, they can be linked via alias tables in the TriggerDB for this luminosity range. Once an alias is selected in the TriggerPanel, it allows the PSKs to be suggested automatically, based on the current measured luminosity, and these are used only when confirmed by the shifter. When the prescale keys are changed in the TriggerPanel, the keys are sent to the CTP, to then be forwarded to the HLT as described in section 8.1.3.
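The suggestion of prescale keys from an alias table amounts to a range lookup on the measured luminosity. A minimal sketch, assuming half-open luminosity ranges and using hypothetical names and key values, is:

```python
def suggest_prescale_keys(alias_table, luminosity):
    """Return the (L1 PSK, HLT PSK) pair whose luminosity range contains the value.

    `alias_table` is a list of (lumi_min, lumi_max, l1_psk, hlt_psk) rows with
    half-open ranges [lumi_min, lumi_max). Returns None if no row matches, in
    which case nothing is suggested.
    """
    for lumi_min, lumi_max, l1_psk, hlt_psk in alias_table:
        if lumi_min <= luminosity < lumi_max:
            return l1_psk, hlt_psk
    return None
```

The returned pair is only a suggestion; as described above, the keys are applied only once the shifter confirms them.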

Automatic prescaling of L1 trigger items
The Auto-Prescaler [60] is an application that was developed to automatically create and apply new prescale keys to L1 trigger items which are particularly sensitive to noise from sub-detectors. These can cause high rates, potentially exceeding the limits of the trigger system. The Auto-Prescaler consists of two parts, a rule checker responsible for evaluating the automation rules against the available monitoring data and requesting any prescale changes, and an orchestrator which is responsible for processing the prescale changes, generating a new L1 prescale set if needed and automatically applying the new prescale set. The prescales are calculated in multiples of an integer value, which is configurable for each L1 trigger item. The functionality of the Auto-Prescaler was extended during Run 2 to also reduce or remove prescales applied to L1 items during non-physics beam modes. This is used for recording beam background data during periods when no collisions are expected. The prescales set by the Auto-Prescaler automatically revert to the previous setting once the beam mode has changed or the rate spikes from detector noise have subsided.
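One way to picture the calculation of prescales in multiples of a configurable integer is sketched below. It assumes that the goal is the smallest such multiple that brings the item's rate back under a given limit; the function name, the rate limit and this interpretation of the rule are assumptions and are not taken from the Auto-Prescaler itself.

```python
from math import ceil

def auto_prescale(unprescaled_rate_hz, max_rate_hz, step):
    """Smallest prescale that is a multiple of `step` and respects the rate limit.

    Returns 1 (no prescale) if the item is already within its limit.
    """
    if step < 1 or max_rate_hz <= 0:
        raise ValueError("step must be >= 1 and max_rate_hz must be positive")
    if unprescaled_rate_hz <= max_rate_hz:
        return 1
    # Minimum reduction factor, rounded up to the next multiple of `step`.
    needed = unprescaled_rate_hz / max_rate_hz
    return step * ceil(needed / step)
```

With a step of 10, an item running at 25 kHz against a 1 kHz limit would receive a prescale of 30 in this sketch.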

Figure 13 (partial caption). In the alias panel, for the case of physics data-taking, the alias table is displayed with the relevant luminosity range, L1 and HLT PSK and additional comment field as a suggestion of which prescale keys to use.

Online release validation
The online trigger software release cycle includes development, validation and deployment of the software to ensure the reliability and predictability of the performance of the HLT. This release is kept separate from the offline reconstruction release to be able to make use of new developments and fixes for data-taking. The final validation of the trigger release is done using a suitable EB dataset (see section 7.3) and mimicking the configuration and execution during the data-taking. The performance evaluation covers aspects of low-level memory and CPU requirements as well as distributions and efficiencies of high-level physics quantities which are analysed by a range of experts. This multifaceted task is crucial for the successful and efficient operation of the trigger system. More details of the development, validation and integration cycle can be found in ref. [61]. Throughout Run 2, about 80 software releases were deployed online for physics data-taking.
The trigger software validation cycle is typically carried out weekly for the most recent trigger development release as well as in preparation for changes in running conditions, e.g. for heavy-ion runs or new pile-up conditions. This cycle requires a coordinated effort between many groups and experts. The process of software development is mainly driven by the improvement of algorithms, the addition of new trigger chains, as well as addressing occasional problems in the software that have been observed online or offline. New software changes are validated daily by automatic nightly tests. These tests run on a few EB data events and are meant to catch software bugs. They are monitored by the online release on-call expert. Typically, a larger-scale validation is carried out once a week, running the new software meant to be deployed online over about 1 million EB data events. This reprocessing is used to identify rare bugs as well as to assess the performance of the new software in terms of trigger selection efficiencies and CPU usage. The reprocessing is also used to estimate the rates of individual trigger chains, groups of chains and the entire menu after correcting for the prescales used during collection of the EB data [42,62].
Two examples of the EB rate predictions are given in figure 14. In figure 14 (left), predicted rates are shown versus transverse energies expressed as L1 thresholds. The smooth spectra, which are obtained via the weighting procedure, illustrate the statistical power of the data sample over several orders of magnitude in rate. In figure 14 (right), the HLT rate predictions are compared with online rates for 957 trigger chains of the trigger menu with a non-zero rate at the time of data-taking at the same luminosity. The recorded rates are corrected for prescales applied at L1 and/or the HLT, and the difference in rate between the prediction and the recording is normalised to the combined statistical uncertainty from both samples. The Gaussian fit in the range −3 to 3 indicates that the prediction for the majority of HLT chains is normally distributed. For a small number of chains where the predicted rate was too low, a tail to the negative significance is visible. This is due to a small bias arising from the chosen set of L1 seeds of the EB dataset in addition to luminosity scaling assumptions. Since the enhanced bias data are collected over a one hour period, the mean number of pile-up interactions changes throughout the sample and this introduces another small bias to chains whose rate is not linear with respect to the instantaneous luminosity, due to pile-up sensitivity. The mean fractional statistical uncertainty per HLT chain is 10% for the predicted rates and 2% for the recorded rates. For the total HLT rate, the statistical uncertainty of the prediction is 1%.
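The quantity entering the pull distribution can be written as a one-line function. The sign convention (prediction minus online rate) is inferred from the statement that chains with too low a predicted rate populate the negative tail; the function name is hypothetical.

```python
from math import hypot

def rate_pull(r_pred, err_pred, r_online, err_online):
    """(prediction - online) divided by the combined statistical uncertainty."""
    return (r_pred - r_online) / hypot(err_pred, err_online)
```

For a predicted rate of 10 ± 3 Hz and an online rate of 15 ± 4 Hz, the combined uncertainty is 5 Hz and the pull is −1.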
Figure 14. Two examples of the EB rate predictions [42]. Left: predicted rates as a function of $p_{\mathrm{T}}$ for various L1 triggers. Right: pull distribution in predicted rates of 957 HLT chains from EB data normalised to the combined statistical error (σ), and fitted by a Gaussian function with mean, $\mu_{\mathrm{Fit}}$, and width, $\sigma_{\mathrm{Fit}}$, where $R_{\mathrm{Prediction}}$ and $R_{\mathrm{Online}}$ refer to the rates from prediction and data-taking for the same luminosity, respectively. Error bars include statistical uncertainties only.
The processing of the trigger software on the enhanced bias dataset is run on the Worldwide LHC Computing Grid [63] using the ATLAS production system, Prodsys2 [6], and can be monitored and managed through a web interface. Using the ATLAS Metadata Interface (AMI) [64], the software validation expert manages and stores the configurations and parameters of the data reprocessing. The reprocessing runs the HLT algorithms over raw data files, which is the same format as produced by the trigger system during data-taking, and produces another raw data file containing the new trigger decisions. In addition to reprocessing with the HLT algorithms, it was also possible to run the L1 simulation on the raw data to obtain updated L1 decisions when new L1 trigger items were introduced in the trigger menu. The raw data files containing the new trigger decisions are then processed with the ATLAS production reconstruction software to produce the output necessary for validation. This process of rerunning the trigger and offline reconstruction on about 1M EB events takes around 24 hours and typically requires on the order of 7000 hours of CPU time, with each job using roughly 4 GB of virtual memory. Throughout Run 2, the HLT reprocessing used about 0.2% of the ATLAS grid resources.
The performance evaluation of a new release is carried out by signature-specific experts who compare various results with those from a previously validated release. The offline and online monitoring is used to sign off a new release similar to the procedure used for the quality assessment of the collected data (see section 13). Once the release is validated, it is deployed to the online machines for data-taking and further tested in an ATLAS test run during an LHC inter-fill period to check that there are no problems configuring the HLT with the new release in the online environment and that the release is compatible with software being used by other subsystems.

Debug stream processing
Sometimes the HLT is unable to make a decision on whether to accept an event due to processing failures. Such failures can arise from algorithms crashing, timeouts, missing data, etc. In such cases, the events are grouped by their failure type and recorded into a corresponding debug stream in order to be further studied offline, and in many cases recovered.
The failure types and corresponding debug streams are:
• HLT Timeout: configurable limits are set on an event's processing time. To allow a graceful termination of the processing, a soft timeout is applied which corresponds to 95% of the value of the hard timeout. If an event exceeds the soft timeout limit, the remaining algorithms are skipped, a partial HLT result² is recorded, and the event is sent to the HLTTimeout stream. These events are recoverable offline where there is no limit on the processing time except in the cases where timeouts are caused by infinite loops which will require further fixes.
• Processing Unit Timeout: if events that have exceeded the soft timeout limit still do not finish processing within the hard timeout limit, they are sent to the PUTimeout stream. The settings for the hard timeout were 3 minutes in 2015, 5 minutes in 2016 and 7 minutes in 2017 and 2018. These events are recoverable offline where there is no limit on the processing time except in the cases where timeouts are caused by infinite loops which may require further fixes.
• HLT Error: if an algorithm aborts processing in an 'error' state, the event is sent to the HLTError debug stream. This can occur if, for example, a certain feature cannot be retrieved and sufficient protection is not put in place. These events are recoverable offline, if the cause of the error is fixed.
• PU Crash: if an algorithm or processing node crashes (due to failure in algorithm reconstruction, an algorithm using too much memory, etc.), the event is sent to the PUCrash debug stream. These events are recoverable offline if the cause of the crash is fixed.
• Missing data: if some data are missing at the HLT, the event is sent to the HLTMissingData debug stream. This can occur when the L1 result fragment is empty, or if there are inconsistencies in the CTP fragments, or problems recording the EventInfo data, which contains information such as run number, luminosity block number, and bunch crossing identifier. These events are not usually recoverable offline unless fixes can be applied which negate the need for this particular portion of data.
• Possible duplicate: if the original PU stops responding then the event may get duplicated by being assigned to a second PU and is therefore sent to the possibleDuplicate debug stream. These events are recoverable offline but further checks of data recorded at this point are required.
• Truncated result: if the HLT result size is larger than a predefined upper limit, the result is truncated and sent to the TruncatedHLTResult stream. These events are usually recoverable offline with the limit on the result size increased.
• Force Accept: any problems not covered above. Such events are usually caused by processing problems in a PU due to application crashes where all events being processed by that PU are sent to the ForceAccept debug stream. These events are recoverable offline.
• Late Events: events that did not arrive at the SFO in time to be sent to permanent storage within the boundaries of their luminosity block. These events are recoverable offline.
Table 2 gives a breakdown of the number of events in each of the debug streams when the stable beams flag was set and data were recorded for physics during $pp$ data-taking. Over the four
² The HLT result contains a header summarising the overall trigger decision, plus more detailed intermediate information from the event processing, e.g. the navigation data structure and some of the features which are useful for debugging.
Table 2. A breakdown of the number of events in the different debug streams for each year of $pp$ data-taking during Run 2. In addition, the total number of events recorded in the physics stream is shown.
While the number of events in the debug stream might be very small, an attempt to recover them is still made. The recovery is done offline by rerunning the trigger using the same configuration as online, including relevant databases, but without the constraints that caused the events to be written to the corresponding debug stream. In the case of algorithmic errors, the events can be recovered by applying a patch to the software release with a fix for the cause of the failure before rerunning. Out of the total number of debug stream events, 97.8%, 99.5%, 98.9% and 96.5% were recovered in 2015, 2016, 2017 and 2018, respectively. The lower recovery rate in 2018 was due to a small number of events that ended up in the debug stream during a special run dedicated to collecting data from the ALFA detectors, because of errors in the monitoring code of the ALFA system.
In the case of a very high number of debug stream events in a single run (over 100k events), the automatic recovery is not attempted, as this typically points to a persistent software issue which must be fixed before the recovery attempt. Events that cannot be recovered offline typically ended up in the debug stream because of significant detector issues that render the data unsuitable for physics analyses. If individual LBs contain a large number of unrecoverable events, the corresponding LBs are not considered for use in physics analyses.
Events that are recovered offline from the debug streams are available for physics analyses, and analysis teams are advised to include the processing of the debug stream events to check if any event passes their selection and can be included in the analysis.
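The routing of failed events to the debug streams listed above can be illustrated with a minimal Python sketch. It is not the actual HLT implementation: the function names and the failure labels (`"error"`, `"crash"`, etc.) are hypothetical, and the Late Events case, which is decided at the SFO rather than in the HLT, is left out. Only the stream names, the 95% soft-timeout fraction, the yearly hard timeouts and the 100k threshold for automatic recovery are taken from the text.

```python
from typing import Optional

# Hard timeout per data-taking year in minutes, as quoted in the text.
HARD_TIMEOUT_MIN = {2015: 3, 2016: 5, 2017: 7, 2018: 7}
SOFT_FRACTION = 0.95  # soft timeout = 95% of the hard timeout

# Hypothetical failure labels mapped to the debug stream names from the text.
STREAM_BY_FAILURE = {
    "error": "HLTError",
    "crash": "PUCrash",
    "missing_data": "HLTMissingData",
    "possible_duplicate": "possibleDuplicate",
    "truncated": "TruncatedHLTResult",
}


def soft_timeout_s(year: int) -> float:
    """Soft timeout in seconds for a given year."""
    return HARD_TIMEOUT_MIN[year] * 60 * SOFT_FRACTION


def debug_stream(year: int, processing_time_s: float,
                 failure: Optional[str] = None) -> Optional[str]:
    """Return the debug stream for an event, or None if it was processed normally."""
    hard_s = HARD_TIMEOUT_MIN[year] * 60
    if processing_time_s >= hard_s:
        return "PUTimeout"
    if processing_time_s >= soft_timeout_s(year):
        return "HLTTimeout"
    if failure is None:
        return None
    # Any problem without a dedicated stream falls back to ForceAccept.
    return STREAM_BY_FAILURE.get(failure, "ForceAccept")


def automatic_recovery_attempted(n_debug_events_in_run: int) -> bool:
    """Automatic recovery is skipped for runs with over 100k debug stream events."""
    return n_debug_events_in_run <= 100_000
```

For example, with the 2018 settings the soft timeout is 399 s, so an event taking 400 s would be routed to HLTTimeout and one reaching 420 s to PUTimeout.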

Online monitoring
During an ongoing run, the trigger shifter and experts have a variety of tools available to monitor the trigger system. These cover trigger and stream rates, as well as online reconstructed objects, which allow an initial assessment of the quality of the data being taken. Many of these tools make use of the ATLAS Information Service (IS) [65], which is the backbone of the information sharing between the various online systems. The data shared through the IS range from simple numbers up to more complex objects such as histograms. The IS is used to share information between applications in a distributed environment, with the information flowing through a repository which holds what the applications provide. Based on the IS, the Online Histogram Service (OHS) handles the sharing of histograms between the online applications. Since the HLT is a system distributed over many PUs, the histograms from the individual PUs need to be summed and handled correctly. This is done by an application called the Gatherer [66]. The monitoring information is permanently stored on a run-wise basis for additional cross-checks and analysis on the offline side. The storage of monitoring information is done by the Monitoring Data Archiving (MDA) [67] application, which stores histograms, and P-BEAST [68], which stores all other time-series data. The histograms produced online are stored in ROOT [69] files and asynchronously transferred to the offline permanent storage.
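The summing step performed by the Gatherer can be sketched as follows. This is only an illustration under simplifying assumptions: histograms are represented as plain lists of bin contents keyed by a hypothetical name, and every histogram is merged by bin-wise addition, whereas a real gatherer may need different merging rules for some histogram types.

```python
from typing import Dict, List


def gather(histograms_per_pu: List[Dict[str, List[float]]]) -> Dict[str, List[float]]:
    """Sum same-named histograms bin by bin over all processing units."""
    merged: Dict[str, List[float]] = {}
    for pu_histograms in histograms_per_pu:
        for name, bins in pu_histograms.items():
            if name not in merged:
                merged[name] = list(bins)  # copy so the input is not modified
                continue
            if len(merged[name]) != len(bins):
                raise ValueError(f"inconsistent binning for histogram {name!r}")
            merged[name] = [a + b for a, b in zip(merged[name], bins)]
    return merged
```

A histogram published by only some of the PUs is simply carried over, so the merged output always contains the union of all histogram names.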

Rate monitoring
Trigger rates are particularly valuable indicators of the trigger performance since they are highly sensitive to detector malfunctions, LHC beam issues, and instantaneous luminosity variations and can therefore often provide a first indication that something is wrong. Trigger rates are primarily monitored in the control room using the Trigger Rate Presenter (TRP) [70]. The TRP is a package that not only displays and monitors the rates, but also archives them for later offline usage. With the TRP it is possible to monitor the individual rates of all L1 and HLT trigger chains included in the trigger menu. For L1 items, the rates before and after prescales are available; similarly for HLT chains, the input and output rate before and after prescale can be monitored. In addition to the rates of individual triggers, the output rates of the different streams are available. An example view of the TRP can be seen in figure 15 displaying four plots of HLT input, output and recording rates, and rates of L1 items and some streams.
Figure 15. The Trigger Rate Presenter is used to monitor the rates of various triggers and streams while taking data. Shown here is an overview which includes various L1 rates and total HLT output rates, as well as rates in important physics and calibration streams during a data-taking run starting at around 15:35 and ending around 17:40.
In addition, a cross-section/trigger rate monitoring tool (Xmon) was used to compare trigger rates from the ongoing run with rate predictions [71]. Xmon provides real-time, luminosity-scaled rate predictions based upon both offline and online trigger cross-section regressions. The rate predictions are derived offline by fitting a functional form to the luminosity and pile-up dependence of a given trigger in past runs. This requires the trigger to have been implemented in these past runs, but it also mitigates issues with predicting rates for triggers which have a potentially non-linear rate dependence on luminosity and pile-up. The predicted rate when running online is then determined using the functional form and the current running parameters (luminosity, trigger prescale, etc.). The rates of several primary triggers (at L1 and the HLT) with the predictions overlaid can be displayed in the TRP, as shown in figure 16. Having such predictions for various primary triggers overlaid with the live rates allows shifters to very quickly spot problems with the trigger system or other parts of the detector and proved to be a very useful tool throughout Run 2.
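The idea behind such a rate prediction can be sketched in a few lines of Python. The functional form used by Xmon is not specified here, so the sketch assumes, purely for illustration, a trigger cross-section that depends linearly on the pile-up $\mu$; the function names and the 20% tolerance are hypothetical as well.

```python
from typing import List, Tuple


def fit_cross_section(mu: List[float], lumi: List[float],
                      rate: List[float]) -> Tuple[float, float]:
    """Least-squares fit of sigma(mu) = a + b * mu, with sigma = unprescaled rate / lumi."""
    sigma = [r / l for r, l in zip(rate, lumi)]
    n = len(mu)
    mean_x = sum(mu) / n
    mean_y = sum(sigma) / n
    sxx = sum((x - mean_x) ** 2 for x in mu)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(mu, sigma))
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b


def predict_rate(lumi: float, mu: float, prescale: float,
                 coeffs: Tuple[float, float]) -> float:
    """Predicted rate after prescale for the current running parameters."""
    a, b = coeffs
    return lumi * (a + b * mu) / prescale


def deviates(observed: float, predicted: float, tolerance: float = 0.2) -> bool:
    """True if the live rate differs from the prediction by more than the tolerance."""
    return abs(observed - predicted) > tolerance * predicted
```

Fitting the cross-section rather than the rate itself factors out the trivial proportionality to luminosity, so that any remaining pile-up dependence is captured by the fitted form.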

Online data quality monitoring
In addition to the trigger rates, histograms of other quantities from the trigger system are monitored [72]. These include kinematic properties of various objects reconstructed by the trigger, as well as more general infrastructure-related information (e.g. number of processing slots available in the HLT farm, the status of the magnets, etc.). The histograms are displayed using the Data Quality Monitoring Display (DQMD) [73]. The display is organised in a tree structure with histograms from different trigger signatures grouped together. The histograms are displayed along with a reference histogram and are flagged with different colours depending on the outcome of automatic tests. These tests can be as simple as checking that a histogram is filled, or can be more complicated, such as checking the level of agreement between the live histogram and its reference using a Kolmogorov-Smirnov test (and flagging the histogram as green for good, yellow for acceptable, red for bad agreement). Other types of quality assessment algorithms may also be executed for the histograms independently of the reference, such as comparing the mean value of a distribution with a certain threshold. The coloured flag is propagated up the tree structure, so even if only one histogram is flagged red, it is easily spotted from the overall view. The histogram references are taken from a good run with similar running conditions, and are updated periodically as running conditions change. An example of the data quality display showing several histograms of the electron candidates reconstructed in the HLT is shown in figure 17.
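The automatic checks and the propagation of flags through the tree can be illustrated with the following Python sketch. It is not the DQMD code: the comparison uses the Kolmogorov-Smirnov distance between binned, normalised distributions (rather than a test probability), and the thresholds of 0.05 and 0.15 as well as all function names are assumptions made for the example.

```python
from typing import Dict, List, Union

SEVERITY = {"green": 0, "yellow": 1, "red": 2}
Tree = Union[str, Dict[str, "Tree"]]  # a flag (leaf) or a folder of sub-trees


def ks_distance(hist: List[float], ref: List[float]) -> float:
    """Maximum distance between the cumulative distributions of two binned histograms."""
    if len(hist) != len(ref):
        raise ValueError("histogram and reference must have the same binning")
    total_h, total_r = sum(hist), sum(ref)
    if total_h == 0 or total_r == 0:
        raise ValueError("cannot compare an empty histogram")
    cum_h = cum_r = 0.0
    distance = 0.0
    for h, r in zip(hist, ref):
        cum_h += h
        cum_r += r
        distance = max(distance, abs(cum_h / total_h - cum_r / total_r))
    return distance


def flag(hist: List[float], ref: List[float],
         yellow_above: float = 0.05, red_above: float = 0.15) -> str:
    """Colour flag for one histogram compared with its reference."""
    if sum(hist) == 0:
        return "red"  # simplest check: the histogram must be filled
    distance = ks_distance(hist, ref)
    if distance > red_above:
        return "red"
    if distance > yellow_above:
        return "yellow"
    return "green"


def propagate(node: Tree) -> str:
    """Return the worst flag found anywhere below this node of the tree."""
    if isinstance(node, str):
        return node
    return max((propagate(child) for child in node.values()),
               key=SEVERITY.__getitem__, default="green")
```

Because each folder takes the worst flag of its children, a single red histogram deep in the tree turns every folder above it red, which is what makes it visible from the top-level view.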
A second GUI called the Online Histogram Presenter (OHP) [74] is available to display histograms. The OHP display does not include the colour-coded flags, but is used to display additional, more detailed histograms that are used by experts to debug problems. It also has an easy-to-use web display that allows the user to interact with the histograms (zoom in, etc.).

P-BEAST and shifter assistant
P-BEAST is a web-based monitoring tool used by shifters and experts in addition to the other available tools to monitor the system and debug problems. P-BEAST displays data from the IS graphically using Grafana [75] dashboards. The dashboard developed for the trigger displays, as a function of time, rates and bandwidths for various streams, the prescale keys in use, the number of free processing slots in the HLT farm, memory usage, read-out request rates, luminosity, dead times from detectors, etc. It provides a very quick overview of the current state of the running system, as well as the history. An example dashboard is shown in figure 18. This dashboard was heavily used during 2017 data-taking with the 8b4e filling scheme where luminosity levelling was required in order to stay within the trigger resource limitations, and provided the shifters an easy overview to check that the limits were respected.

JINST 15 P10004
Figure 17. The Data Quality Monitoring Display (DQMD) used for the HLT. The data (black) are compared with the reference (purple) using a Kolmogorov-Smirnov test to compare the shapes of the distributions. Based on the output of the comparison test, the histograms are flagged either green (good agreement with the reference), yellow (tolerable disagreement with the reference), or red (major disagreement with the reference).

To help shifters catch problems, an alert system called the Shifter Assistant [76] is used in ATLAS to flag certain possible issues. The alerts are displayed on a web page, with a pop-up notification for new alerts. In the trigger, these alerts notify shifters if rates of different streams drop below or rise above certain specified values (depending on the stream). The system also notifies shifters of plots that are flagged red in the DQMD, and sends periodic reminders to the shifters to check rates and histograms.
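A stream-rate alert of this kind amounts to checking each rate against a per-stream window. The Python sketch below shows the logic only; it is not the Shifter Assistant's own rule syntax, and the stream names and limits in the usage example are illustrative values rather than the ones used in operation.

```python
from typing import Dict, List, Tuple


def check_rates(rates_hz: Dict[str, float],
                limits_hz: Dict[str, Tuple[float, float]]) -> List[Tuple[str, str]]:
    """Return (stream, "low" | "high") for every stream outside its allowed window."""
    alerts = []
    for stream in sorted(rates_hz):
        if stream not in limits_hz:
            continue  # no window configured for this stream
        low, high = limits_hz[stream]
        if rates_hz[stream] < low:
            alerts.append((stream, "low"))
        elif rates_hz[stream] > high:
            alerts.append((stream, "high"))
    return alerts


# Illustrative usage with made-up stream names and limits.
limits = {"physics_Main": (500.0, 1200.0), "express": (5.0, 30.0)}
print(check_rates({"physics_Main": 1500.0, "express": 2.0}, limits))
```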

Data-taking anomalies
Although precautions are taken to prevent anomalies in data-taking, quick action is needed when smooth operation can no longer be ensured. In such a situation, the trigger shifter in the control room and the on-call experts work together to resolve the issue. The time taken to solve a given issue can vary widely, and depends on the severity of the problem and the subsystems involved. All the online tools described above are meant to assist and provide as much information as possible to quickly identify the source of the issue.
One of the most common problems is a change in rate, where trigger rates become either too high or too low. If rates are high enough to cause issues with the data flow, the cause needs to be identified and fixed promptly, as it prevents data from being recorded. It is important to understand the circumstances in which the change in rate occurred, and below are some examples of information that can help identify a given problem:
• The time of occurrence, i.e. in the middle of a data-taking run or at the beginning of the run.
• Any changes made prior to the observation of the high rate in the trigger system or sub-detector systems, e.g. a new software release, change of prescales, etc.
• The current status of sub-detector systems and if they are in good condition. Good communication between different members of the shift crew in the control room is essential to efficiently forward information to the experts.
• Changes from the LHC, e.g. luminosity optimisation during a run.
With the help of the TRP, the rates of each stream or individual L1 item or HLT chain can be displayed.
Once the type of trigger has been identified, the underlying problem can be further narrowed down. It is possible that a wrong prescale key or an incorrect combination of L1 and HLT prescale keys together with an incorrect bunch group configuration was loaded during the run. Noisy cells from sub-detectors (e.g. in the calorimeter) can lead to spikes in trigger rates, which can be identified using the DQMD to check the positions of reconstructed objects. A missing or significantly lower rate in, for example, a group of trigger chains typically points to a problem with a sub-detector system or an incorrect algorithm configuration in the HLT software. For example, if some stations of the muon detectors are not working, lower rates in the muon trigger chains can be observed, or similarly, a lower rate in b-jet and electron chains may indicate a problem with the data quality of the inner detector. With the help of the trigger rate predictions, changes in the rates during a run or right from the beginning can be spotted quickly.
High dead time, and consequently severe problems with the data flow, can have several causes and often requires a combined investigation by trigger and DAQ experts. Since none of the DAQ/HLT components [77] can generate a busy signal (unlike the front-end systems), they need to rely on back-pressure between components to temporarily stop the data flow. As the buffer space fills up, the back-pressure builds up until the data flow is suspended completely. Back-pressure can arise in the HLT if algorithms running there require too much CPU time for the HLT farm to process events quickly enough, so that no processing units in the HLT farm remain available to process new events. In this case, chains running CPU-intensive algorithms are typically prescaled or disabled in order to free up CPU resources until the CPU usage of the algorithms can be reduced. The HLT can also cause back-pressure if it tries to write out too much data. This can happen if there is a very high rate of events being written into a debug stream (if, for example, an algorithm is exiting in an error state in every event it runs). Depending on which stream the events are written to, this could indicate a detector problem, or a problem with a specific chain or algorithm running in the HLT. In the latter cases, the usual intermediate solution is to disable or prescale the offending chains until a more permanent solution can be implemented.
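The condition under which the HLT farm runs out of free processing units can be estimated with Little's law: in steady state, the average number of busy slots equals the input rate multiplied by the mean processing time per event. The Python sketch below applies this relation; the function names and the numbers in the usage example are illustrative assumptions and are not taken from the ATLAS configuration.

```python
def busy_slots(input_rate_hz: float, mean_processing_time_s: float) -> float:
    """Average number of occupied processing slots (Little's law)."""
    return input_rate_hz * mean_processing_time_s


def farm_saturated(input_rate_hz: float, mean_processing_time_s: float,
                   n_slots: int) -> bool:
    """True if the farm cannot keep up, i.e. back-pressure is expected to build up."""
    return busy_slots(input_rate_hz, mean_processing_time_s) > n_slots


def max_sustainable_rate_hz(n_slots: int, mean_processing_time_s: float) -> float:
    """Highest input rate the farm can absorb for a given mean processing time."""
    return n_slots / mean_processing_time_s


# Illustrative numbers only: 40 000 slots and 0.5 s per event sustain 80 kHz.
print(max_sustainable_rate_hz(40_000, 0.5))
```

The relation also shows why prescaling or disabling CPU-intensive chains helps: it lowers the mean processing time per event and therefore the number of slots needed at a given input rate.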
The main operational issues encountered during Run 2 that caused data losses were typically too high rates, induced either by changes in the LHC conditions (e.g. increase in bunch intensities, different filling schemes, luminosity optimisation during a run) or sub-detector problems.

Offline monitoring and data quality assessment
In addition to monitoring the data online as it is being collected (section 12), an assessment of the quality of the data is carried out offline after the data are fully reconstructed. This allows more detailed studies of the trigger efficiency relative to offline objects. The data quality (DQ) assessment procedures for all of ATLAS are described in ref. [78].
The express stream is the main data stream used for the offline DQ assessment, and this stream is fully reconstructed within a day after data are recorded. Signature on-call experts (see section 5) assess the performance of the trigger chains by looking at a series of histograms which are chosen to sample all relevant aspects of the particular trigger signature. The histograms can include kinematic distributions, efficiencies, resolutions, and comparisons between trigger objects and the corresponding offline reconstructed objects. As shown in figure 19, the histograms are arranged in a web-based display that includes a reference histogram, and a colour-coded flag based on how well the histogram and its reference agree. Mismatches (yellow in case of slight deviations, or red) can be due to either actual problems, outdated references (e.g. in the luminosity ramp-up phase at the start of a data-taking period), or low sample size and need to be followed up and understood. In some cases, the number of events collected in the express stream is not sufficient to perform a full assessment of the DQ. In these cases, the express stream is used as a preliminary check, and the final assessment is done when looking at the same histograms but produced by processing all events collected in the physics stream. These histograms are typically available a few days after the express stream is processed.
If there are problems spotted during the assessment, a so-called data quality defect is set and recorded in the data quality database. A defect denotes an LB (or several LBs up to the entire run) in which the data-acquisition was not optimal. Many different defects are defined, so that each type of problem can be tracked individually. They can be either tolerable or intolerable. Tolerable defects are used to flag slight abnormalities in the data that do not have a large impact on physics (e.g. spikes in the calorimeter energy, small loss of coverage from some detector, etc.). Intolerable defects are used to flag data that should not be used for physics (e.g. large loss of coverage from some detector, incorrect trigger configuration, etc.).
Figure 19. An example of the web-display for the offline DQ assessment. Shown here are various plots for hadronically decaying τ-lepton candidates as reconstructed in the HLT. The data (black points) are compared with a reference (blue points), and flagged as green, yellow, or red depending on how well they agree, according to predefined algorithms. The reference run is updated periodically to reflect significant condition changes, such as updated trigger menus. This example shows run 352056, which was recorded in June 2018. Error bars include the statistical uncertainties only.

The DQ and debug stream on-call expert is responsible for collecting the assessment outcome from the L1 systems and the trigger signatures and reporting them to the ATLAS DQ group. Any relevant findings from the other ATLAS subsystems are also reported back to the trigger community.
The collection of all defects from the DQ assessments of the trigger system as well as of all other parts of the ATLAS detector constitutes the basis for the creation of the Good Runs List (GRL). The GRL is the list of all LBs of all runs collected by ATLAS which are deemed good for physics analysis. Generally, any LB which has been labelled with an intolerable defect is vetoed from the GRL, as described in ref. [78].
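The construction of a GRL from a set of defects can be sketched as below. The data layout (a `Defect` tuple with an inclusive LB range) and the defect names in the usage example are hypothetical and chosen only for illustration; the veto rule, that any intolerable defect removes the affected LBs while tolerable defects do not, follows the description above.

```python
from typing import Dict, Iterable, List, NamedTuple, Set, Tuple


class Defect(NamedTuple):
    run: int
    first_lb: int
    last_lb: int  # inclusive
    name: str
    tolerable: bool


def good_runs_list(recorded_lbs: Dict[int, Iterable[int]],
                   defects: Iterable[Defect]) -> Dict[int, List[int]]:
    """Return, per run, the LBs that carry no intolerable defect."""
    vetoed: Set[Tuple[int, int]] = set()
    for defect in defects:
        if not defect.tolerable:
            for lb in range(defect.first_lb, defect.last_lb + 1):
                vetoed.add((defect.run, lb))
    return {
        run: [lb for lb in sorted(lbs) if (run, lb) not in vetoed]
        for run, lbs in recorded_lbs.items()
    }


# Illustrative usage with made-up defect names.
defects = [
    Defect(352056, 2, 3, "TRIG_WRONG_CONFIG", tolerable=False),
    Defect(352056, 5, 5, "CALO_ENERGY_SPIKE", tolerable=True),
]
print(good_runs_list({352056: range(1, 6)}, defects))
```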
A summary of the DQ efficiency for the trigger system for each year together with the overall DQ efficiency for the ATLAS experiment is given in table 3, which is extracted from ref. [78]. The DQ efficiency is calculated relative to the recorded integrated luminosity. It accounts for any data that are rejected on the grounds of data quality and therefore does not take into account the events that were not recorded due to detector-related problems. The DQ efficiency is presented in terms of a luminosity-weighted fraction of good quality data recorded during stable beams periods. Only periods during which the recorded data were intended to be used for physics analysis are considered for the DQ efficiency. The trigger-specific and total ATLAS DQ efficiency both improved over the course of Run 2. Over the full Run 2, the trigger and ATLAS DQ efficiencies are 99.6% and 95.6%, respectively, for the standard $\sqrt{s}=13$ TeV $pp$ dataset, with a corresponding $139\,fb^{-1}$ of data being good for physics analysis. The inefficiencies of the trigger system, where the data were marked with defects as not good for ATLAS, are mainly due to either hardware problems (read-out from detectors due to problems further upstream) in the L1 system or an incorrect trigger configuration being loaded. For L1 it was mostly due to detector coverage and a problem with the CTP cables. The data loss from events that were not recorded due to problems in the TDAQ system is very small and was reduced during each successive Run 2 data-taking year. It amounts to a data-taking efficiency loss of 3.16% in 2015, 2.22% in 2016, 1.04% in 2017 and 0.671% in 2018.
Table 3. Luminosity-weighted relative detector uptime and good DQ efficiencies (in %) during stable beams $pp$ collision physics runs at $\sqrt{s}=13$ TeV for 2015 (July to November), 2016 (April to October), 2017 (June to November) and 2018 (April to October) for the trigger system (L1 and HLT) and overall for the ATLAS detector. Numbers were extracted from ref. [78].
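The luminosity-weighted DQ efficiency defined above is the integrated luminosity of the LBs deemed good divided by the integrated luminosity of all recorded LBs. A minimal Python sketch of this definition follows; the function name and the per-LB luminosity values in the usage example are made up for illustration.

```python
from typing import Dict, Iterable


def dq_efficiency(lumi_per_lb: Dict[int, float], good_lbs: Iterable[int]) -> float:
    """Fraction of the recorded integrated luminosity that lies in good LBs."""
    total = sum(lumi_per_lb.values())
    if total <= 0:
        raise ValueError("no recorded luminosity")
    good = set(good_lbs)
    return sum(lumi for lb, lumi in lumi_per_lb.items() if lb in good) / total


# Illustrative usage: LB 2 is rejected, so 80 of 100 units of luminosity remain.
print(dq_efficiency({1: 10.0, 2: 20.0, 3: 30.0, 4: 40.0}, [1, 3, 4]))
```

Weighting by luminosity rather than counting LBs means that a defect during a high-luminosity period costs more efficiency than one of the same duration at the end of a fill.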

Conclusion
The operation of the ATLAS trigger system during Run 2 of the LHC was highly successful for both $pp$ and heavy-ion data-taking as well as for a variety of special runs. The recorded data are of high quality with only minimal losses due to operational problems. The trigger's excellent performance and flexibility allowed collection of $139\,fb^{-1}$ of high-quality $pp$ collision data at $\sqrt{s}=13$ TeV over a wide range of running conditions, with peak instantaneous luminosities ranging from $0.5\times 10^{34} cm^{-2}s^{-1}$ to $2.1\times 10^{34}cm^{-2}s^{-1}$ and 28-60 pile-up collisions. A variety of tools and procedures, as detailed in this paper, have been used in both the offline and online environments to efficiently validate, monitor and assess the operation and the performance of the trigger system. They allowed maximal exploitation of the collisions delivered by the LHC in Run 2 for physics analyses, while respecting the limitations of the trigger system in terms of maximum read-out rates, available HLT computing resources and sustainable storage output bandwidth.