ATLAS computing challenges before the next LHC run

ATLAS software and computing is in a period of intensive evolution. The current long shutdown presents an opportunity to assimilate lessons from the very successful Run 1 (2009-2013) and to prepare for the substantially increased computing requirements for Run 2 (from spring 2015). Run 2 will bring a near doubling of the energy and the data rate, high event pile-up levels, and higher event complexity from detector upgrades, meaning the number and complexity of events to be analyzed will increase dramatically. At the same time operational loads must be reduced through greater automation, a wider array of opportunistic resources must be supported, costly storage must be used with greater efficiency, a sophisticated new analysis model must be integrated, and concurrency features of new processors must be exploited. This paper surveys the distributed computing aspects of the upgrade program and the plans for 2014 to exercise the new capabilities in a large scale Data Challenge.


Introduction
The ATLAS experiment [1] took data at the CERN Large Hadron Collider (LHC) accelerator between 2009 and 2013, collecting over $2 \cdot 10^{9}$ p-p collisions every year of data-taking, plus more than $10^{8}$ Pb-Pb and p-Pb collision events. During this first data-taking period, conventionally called "Run 1", the centre-of-mass energy of p-p collisions increased from 900 GeV to 8 TeV, the instantaneous luminosity reached $7.7 \cdot 10^{33}$ cm$^{-2}$ s$^{-1}$, and the average number of collisions per bunch crossing increased up to 20 (with instantaneous rates up to 40).
During 2013 and 2014 the LHC machine is undergoing a period of maintenance and upgrades, with the aim of restarting operation in 2015 ("Run 2") at higher centre-of-mass energies (13 TeV for p-p collisions), a reduced bunch spacing (25 instead of 50 ns) and higher luminosities, leading to an average number of collisions per bunch crossing around 40. The data-taking rate will also increase up to 1 kHz on average.
The ATLAS software and computing infrastructure was designed in the early 2000s for the conditions of Run 1 and progressively updated to cope with increasing beam energies and luminosities. The long shutdown period (LS1) between Run 1 and Run 2 allows for more fundamental changes, made possible in the meantime by the technology evolution of the last few years. Two examples are the availability of many-core processors, which are best exploited by jobs that run concurrently on several cores, and the massive increase in network bandwidth, which allows the de-localisation of jobs with respect to their input data.

Software environment and performance
The speed of physics analysis is in many cases limited by the data processing (and reprocessing) rate and by the availability of adequate simulated event samples. One of the factors that most affect the processing time is the pile-up, i.e. the number of interactions per bunch crossing. As any pattern recognition algorithm starts with a combinatorial component, higher levels of pile-up lead to very long event processing times, particularly in Inner Detector tracking. Optimising the tracking code can therefore lead to substantial processing time savings.
ATLAS studied several linear algebra packages [3] and decided to replace the CLHEP package with Eigen throughout the code. Eigen is 10 times faster than CLHEP for 5x5 matrix multiplications, which are very common in tracking code. The net result is a factor 2 reduction in total reconstruction time. Further improvements of the code efficiency, such as improved access to the magnetic field information, produced an overall reduction of the processing time by a factor 3 with respect to the code used to process 2012 data. Figure 1 shows the total reconstruction time per event for a top-quark Monte Carlo simulation sample with a pile-up of 40 at 13 TeV and 25 ns bunch spacing. The CPU time is also shown separately for the Inner Detector reconstruction, as tracking dominates the total resource needs. This simulation uses a Run 1 detector geometry. The HS06 scaling factor for the machine used for this study is 11.95.
Monte Carlo simulations of physics events, including detailed simulation of the detector response, are indispensable for every data analysis in high-energy physics experiments. Long before Run 1, ATLAS developed full and fast detector simulation techniques to achieve the production of large datasets of simulated events within the computing limits of the collaboration. The new Integrated Simulation Framework (ISF) [4] is based on the requirement that all simulation types can run in the same job, even within the same sub-detector, for different particles. The ISF is designed to be extensible to new (future) simulation types as well as to the application of parallel computing techniques. It can be easily configured by users to find an optimal balance between precision and execution time, according to the specific physics requirements of their analysis. The default configuration foresees running the full Geant4 simulation [5] for all primary interaction particles and their decay products (electrons, muons, taus, b-jets etc.) and the fast simulation for the other particles. The main advantage is a factor 100 reduction of the processing time per event, while keeping the necessary accuracy for all relevant physics objects.
The evolution of CPU hardware is going in the direction of many-core processors, without a matching increase of the available memory per core. ATLAS software has to match this evolution, and the AthenaMP [6] framework addresses this problem. AthenaMP runs a multi-core job in which the master process manages job initialisation and I/O, and worker processes run the algorithmic code, one event per process. At the end of the job the partial output files are merged together (see Figure 2). In this way most of the memory (containing the actual code to be run, the geometry and the conditions data) can be shared, and the memory needs of a typical reconstruction job are reduced to about 1 GB for the shared part of the memory plus 1 GB/core for the event data, compared to 2 GB for a traditional Athena job that processes events sequentially on a single core. Multi-core queues have been enabled at several ATLAS Grid sites and have been used in 2014 to run the Geant4 simulation production for Data Challenge DC14 (see Section 5). Between January and August 2014 about 6.5 million jobs ran at 53 sites, using up to 50k cores (about 1/3 of the current ATLAS CPU capacity) with a CPU usage efficiency (CPU time over wall-clock time) of 81%. There is a loss of a few percent in CPU efficiency with respect to the same kind of jobs running in single-core mode, because job initialisation and finalisation run on a single core while the other cores are idle. A solution for this problem is under design. In any case AthenaMP allows ATLAS to use computing resources that would otherwise be unavailable because of insufficient memory per core to run single-core Athena jobs, so the balance is positive.
The use of AthenaMP for pile-up and reconstruction jobs is currently under validation and is foreseen to enter production by the end of 2014.
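The memory figures quoted above for AthenaMP can be turned into a rough estimate for a node with N cores. This is only a back-of-the-envelope sketch: it assumes that the quoted 2 GB applies to each single-core Athena job and that N independent single-core jobs would be needed to fill the node; the symbols are introduced here for illustration.

```latex
M_{\mathrm{MP}}(N) \approx 1\,\mathrm{GB} + N \times 1\,\mathrm{GB},
\qquad
M_{\mathrm{single}}(N) \approx N \times 2\,\mathrm{GB}

% Example for N = 8 cores:
M_{\mathrm{MP}}(8) \approx 9\,\mathrm{GB}
\quad \text{versus} \quad
M_{\mathrm{single}}(8) \approx 16\,\mathrm{GB}
```

Under these assumptions the memory needed per core approaches 1 GB as N grows, which is what makes nodes with less than 2 GB/core usable.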
A new analysis data model has been developed, with the goal of creating a data format that is produced by reconstruction and can be conveniently used in analysis tasks. In this way there will no longer be a need to create ROOT n-tuple formats (Derived Physics Data, or D3PD) that are almost a full copy of the old AOD (Analysis Object Data) information. The new format (xAOD) combines the best features of the old AOD and D3PD files: the information can be read in a basic way even using vanilla ROOT, and in a fully functional way after just loading a small number of libraries, and the format provides the same flexibility for slimming that the D3PDs offered (the ability to select which properties of objects one wants to save into a given file, and the ability to decorate objects at the analysis stage with additional information). A new infrastructure has been developed to make it possible to perform on the primary xAODs all the operations that users were doing in their analysis starting from the primary D3PDs. The new data model uses an analysis-motivated optimisation of the I/O settings: primary xAODs are meant mainly for analysis from Athena, providing good performance for reading a large part of the event data for every event in the files, whereas derived xAODs are meant mainly for analysis from ROOT, providing good performance for reading a small number of variables for a large number of events.
[...] and tools of data handling, which in Run 1 were handled by users or user groups, leading to a non-optimal usage of the computing resources, especially for cross-team analyses. Each derivation is defined by a single set of Athena jobOptions defined by physics and/or performance groups. A key part of the derivation framework is the concept of train production, where a single job can produce a number of independent output formats from a single input file.
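The idea of train production can be illustrated with a minimal Python sketch. It is not ATLAS code: the derivation names, the event fields and the selections are hypothetical, and real derivations are configured through Athena jobOptions. The sketch only shows the principle that the input is read once while several independent outputs are filled.

```python
from typing import Any, Callable, Dict, Iterable, List, Tuple

Event = Dict[str, Any]

# Hypothetical derivations: each pairs an event selection (skimming)
# with the list of variables to keep (slimming).
DERIVATIONS: Dict[str, Tuple[Callable[[Event], bool], List[str]]] = {
    "DERIV_A": (lambda ev: ev["n_muons"] >= 2, ["run", "event", "n_muons"]),
    "DERIV_B": (lambda ev: ev["met"] > 100.0, ["run", "event", "met"]),
}


def run_train(events: Iterable[Event]) -> Dict[str, List[Event]]:
    """Read the input once and fill every derivation output in the same pass."""
    outputs: Dict[str, List[Event]] = {name: [] for name in DERIVATIONS}
    for ev in events:  # single pass over the input
        for name, (select, keep) in DERIVATIONS.items():
            if select(ev):
                outputs[name].append({key: ev[key] for key in keep})
    return outputs


sample = [
    {"run": 1, "event": 1, "n_muons": 2, "met": 30.0},
    {"run": 1, "event": 2, "n_muons": 0, "met": 150.0},
    {"run": 1, "event": 3, "n_muons": 3, "met": 120.0},
]
result = run_train(sample)
```

The cost of reading and decoding the input is paid once per train rather than once per output format, which is where the saving comes from when many groups derive from the same primary files.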

Distributed computing tools
The building blocks of the ATLAS Distributed Computing (ADC) architecture were designed and deployed before the start of LHC operations. The existing tools worked very well for ATLAS during Run 1 but at the same time showed some limitations that led to an excessive need for operational manpower. The experience of Run 1 operations led to a redesign of the two major components, the data management and workload management systems, and to the addition of a few other services that will be needed to cope with increased data volumes and different kinds of computing resources.
The Distributed Data Management (DDM) system was completely redesigned in 2012-2013 and the new implementation, Rucio [7,8], is being progressively deployed in 2014. With respect to the previous DDM implementation, Rucio offers data discovery based on name and metadata, has no dependence on an external file catalogue (there is a deterministic relation between logical and physical file names), supports multiple data management protocols in addition to SRM (e.g. WebDAV, xrootd, S3, posix and gridftp), and features smarter and more automated data placement tools (rules and subscriptions).
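A deterministic relation between logical and physical file names means that the storage path can be computed from the name alone, with no catalogue lookup. The sketch below shows one way this could work, using a hash of the scope and file name to spread files over directories. The function name, the example scope and file name, and the exact directory layout are assumptions made for illustration and are not taken from the text.

```python
import hashlib


def deterministic_path(scope: str, name: str) -> str:
    """Compute a storage path from the logical name only (no catalogue lookup).

    Assumed layout: <scope>/<h[0:2]>/<h[2:4]>/<name>, where h is the MD5
    hex digest of "<scope>:<name>". The two hash levels only serve to
    avoid very large directories.
    """
    digest = hashlib.md5(f"{scope}:{name}".encode("utf-8")).hexdigest()
    return f"{scope}/{digest[0:2]}/{digest[2:4]}/{name}"


# Hypothetical scope and file name, for illustration only.
path = deterministic_path("mc14", "EVNT.000001.pool.root")
```

Because the mapping is a pure function, any client that knows the site prefix and the logical name can reconstruct the physical location on its own.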
Data access can be a bottleneck for data analysis. Some datasets can be very popular for short periods of time, for example just at the end of some reprocessing campaign, with several analysis groups accessing them at the same time on a few sites where they are replicated. A way to ease the situation during peak request periods is to create a "data federation", in which data on disk at any site are directly accessible from jobs running at any other federated site. Evidently the data access tools must be clever enough to choose the "best" data replica to access, depending on the bandwidth and latency between the destination and all possible data source sites. A data federation is needed also to allow remote access to data in case of unavailability of a given file in the local storage element, or sparse access to single events.
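As a rough illustration of what choosing the "best" replica could involve, the following sketch ranks replicas by an estimated transfer time built from latency and bandwidth. The cost model, the class and function names, the site names and the numbers are all assumptions made for this example; the text does not specify how the federation actually ranks sources.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Replica:
    site: str
    latency_s: float        # latency between source and destination, seconds
    bandwidth_mb_s: float   # usable bandwidth, MB/s


def estimated_transfer_time(replica: Replica, size_mb: float) -> float:
    """Simple cost model: fixed latency plus size divided by bandwidth."""
    return replica.latency_s + size_mb / replica.bandwidth_mb_s


def best_replica(replicas: List[Replica], size_mb: float) -> Replica:
    """Return the replica with the smallest estimated transfer time."""
    if not replicas:
        raise ValueError("no replica available")
    return min(replicas, key=lambda r: estimated_transfer_time(r, size_mb))


# Hypothetical sites and network figures.
replicas = [
    Replica("SITE_A", latency_s=0.01, bandwidth_mb_s=10.0),
    Replica("SITE_B", latency_s=0.15, bandwidth_mb_s=100.0),
]
small_read = best_replica(replicas, size_mb=1.0)     # sparse, single-event access
large_read = best_replica(replicas, size_mb=1000.0)  # full-file access
```

With these numbers the nearby low-latency site wins for the small read and the high-bandwidth site wins for the large one, which shows why both quantities matter in the choice.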
FAX [9] (Federated Atlas aXess) is the ATLAS implementation of an xrootd-based data federation. It has two top-level redirectors, in Europe and the US; the topology is shown in Fig. 4. So far it covers 56% of ATLAS sites, which contain 85% of the data. Failover works stably: tests showed that all the sites deliver data efficiently. Test tasks are submitted to sites that do not have the data, so that FAX is invoked. The error rate is very satisfactory, as only 0.3% of jobs fail due to FAX issues (typically temporary remote data unavailability or network glitches).
The production and analysis workflows increased in number and complexity during Run 1, and are expected to increase further in the future. The production system had to be redesigned, and a better-layered infrastructure, ProdSys2 [10], completely replaced the front-end part. ProdSys2 consists of four layers of core components:
• the request interface allows production managers to define a workflow request;
• DEfT (Database Engine for Tasks) translates user requests into task definitions;
• JEDI (Job Execution and Definition Interface) generates the job definitions;
• PanDA [11] (Production and Distributed Analysis) executes the jobs in the distributed infrastructure.
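The layering described above (request, task definitions, job definitions, execution) can be mimicked with a small Python sketch. All class names, function names and fields below are hypothetical stand-ins chosen for illustration; the real components are database-driven services and are not reproduced here.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Request:          # what a production manager defines
    steps: List[str]
    n_events: int


@dataclass
class Task:             # produced by the task-definition layer
    step: str
    n_events: int


@dataclass
class Job:              # produced by the job-definition layer
    step: str
    first_event: int
    n_events: int


def define_tasks(request: Request) -> List[Task]:
    """Task-definition layer: one task per processing step of the request."""
    return [Task(step, request.n_events) for step in request.steps]


def define_jobs(task: Task, events_per_job: int) -> List[Job]:
    """Job-definition layer: split a task into jobs of bounded size."""
    jobs = []
    for first in range(0, task.n_events, events_per_job):
        n = min(events_per_job, task.n_events - first)
        jobs.append(Job(task.step, first, n))
    return jobs


def execute(jobs: List[Job]) -> List[str]:
    """Execution layer stand-in: record which event range each job covers."""
    return [f"{j.step}:{j.first_event}-{j.first_event + j.n_events - 1}" for j in jobs]


request = Request(steps=["simul", "recon"], n_events=250)
tasks = define_tasks(request)
jobs = [job for task in tasks for job in define_jobs(task, events_per_job=100)]
log = execute(jobs)
```

Each layer only consumes the output of the layer above it, so a layer can in principle be replaced without touching the others, which is the point of the layered design.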
JEDI+PanDA also provide the new framework for distributed analysis workflows submitted by single users or analysis groups.
The EventIndex [12] is a complete catalogue of all ATLAS events, keeping the references to all files that contain a given event in any processing stage. It is useful to find and retrieve small numbers of selected events, for production completeness checks, and to provide data for the Event Service. It is implemented as three major components [13]: the data collection and transfer system, the core storage (in Hadoop technology), and the web server for data access. Fig. 5 shows the building blocks and the data flow associated with the EventIndex.
The Event Service is a novel way to distribute payload to workers in different computing environments (Clouds, HPCs, ATLAS@home) where CPU cycles are usable but the system has to be used as a "black box", without installing any software component on the worker nodes. It uses AthenaMP, remote I/O (FAX), EventIndex together with JEDI+PanDA to distribute single events, or small groups of events, directly to the process running on the remote facility. In this way it can make efficient use of opportunistic computing resources. The Event Service is currently under commissioning.
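The distribution of single events or small groups of events can be pictured as a pull-based dispatcher that hands out event ranges on request. The sketch below is purely illustrative: the class name, the method names and the bookkeeping are assumptions, and the real Event Service relies on JEDI+PanDA, AthenaMP and remote I/O rather than on an in-memory queue.

```python
from collections import deque
from typing import Deque, Dict, List, Optional, Tuple

EventRange = Tuple[int, int]  # (first_event, last_event), both inclusive


def split_into_ranges(n_events: int, range_size: int) -> Deque[EventRange]:
    """Split n_events into consecutive ranges of at most range_size events."""
    return deque(
        (first, min(first + range_size, n_events) - 1)
        for first in range(0, n_events, range_size)
    )


class EventDispatcher:
    """Hands out small event ranges to whichever worker asks next."""

    def __init__(self, n_events: int, range_size: int) -> None:
        self.pending: Deque[EventRange] = split_into_ranges(n_events, range_size)
        self.assigned: Dict[str, List[EventRange]] = {}

    def next_range(self, worker_id: str) -> Optional[EventRange]:
        """Return the next pending range, or None when nothing is left."""
        if not self.pending:
            return None
        event_range = self.pending.popleft()
        self.assigned.setdefault(worker_id, []).append(event_range)
        return event_range


dispatcher = EventDispatcher(n_events=10, range_size=4)
first = dispatcher.next_range("worker-1")
second = dispatcher.next_range("worker-2")
third = dispatcher.next_range("worker-1")
fourth = dispatcher.next_range("worker-2")
```

Keeping the unit of work small means that a worker that disappears, as can happen on opportunistic resources, takes only a few events with it rather than a whole job.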

Computing Resources
Physics analysis groups are always eager to have as many simulated events as possible, in order to reduce the systematic errors in their analyses to the minimum and be able to compare their measurements with a large number of theoretical and phenomenological models. In addition to the resources that are pledged to the collaboration by the funding agencies that support it, it is now possible to use additional resources that may be available only occasionally, but can provide welcome additions to the base resources.
The first, and easiest to use, of these resources is the farm used by the ATLAS High-Level Trigger (HLT) system to select events in real time while taking data. When the LHC accelerator is not operating, it can be used to produce additional simulated events. Simulation jobs run on virtual machines in the HLT nodes, with an implementation based on OpenStack and CernVM. Only twenty minutes are needed to launch virtual machines for the entire HLT farm, which hosts up to 20k job slots served by PanDA, adding 15% to the total ATLAS computing capacity. Job slots are automatically discovered and no manual action is needed to fill them.
The virtual machines can be killed within ten minutes if a return to HLT operations is needed; killed jobs are retried elsewhere by the PanDA workload management system. Disk I/O and memory considerations so far limit operations to the MC generation of hits (Geant4), but this may no longer be true with a future updated network.
High-Performance Computers (HPCs) are becoming increasingly available at relatively low cost (or in some cases at zero cost but low priority) for scientific applications. Some of them have already been successfully used in ATLAS as part of NorduGrid and in the US and Germany. HPC nodes have little outside connectivity and no local installation possibilities, therefore they need a non-invasive interface like the ARC-CE (or similar) and a way to connect to CVMFS (for software) and Frontier (for database access) through Squids; the ARC control tower (aCT) allows access to HPC resources from the PanDA workload management system (see Fig. 6). Many elements of the HEP software stack (Geant4, ROOT, Alpgen, Sherpa...) have been made to run on many different HPCs. There is strong interest in and support for the ATLAS HPC activity, which has been awarded 63 million CPU hours over the next 12 months. This is 6% of the ATLAS Grid use and half of the event generation budget.
ATLAS@home is a volunteer computing project using the BOINC [14] infrastructure, which also supports a number of other long-running projects (notably SETI@Home, Einstein@Home, LHC@Home). A test server was set up with the ARC-CE and a BOINC server with the ATLAS@Home application. The BOINC PanDA queue runs very low priority MC simulation jobs, with 10 events/job; currently up to 1000 jobs run in parallel. On average so far it produces 6000 events/day, or 0.2% of the total ATLAS Grid capacity, and this is growing with time.

Data Challenge DC14
The overall goal of Data Challenge 2014 (DC14) is to get ATLAS ready for Run 2 physics. To achieve this ATLAS needs to commission the Integrated Simulation Framework (ISF) in the context of physics analyses, run large-scale tests of the updated reconstruction algorithms and of the distributed computing tools, and test the Run 2 analysis model, thus gaining experience with the Run 2 analysis framework. This program is broken down into technical components:
• Partial reprocessing of Run 1 data (for the analysis challenge);
• Production of new MC events with the 2015 geometry and expected run conditions;
• Reconstruction and distribution of produced data, including cosmics from "M" (test) runs (see Fig. 7);
• Data analysis challenge.
The bulk of this program is scheduled for the second half of 2014. Fig. 7 shows the timeline of DC14 tasks in relation to the global ATLAS schedule for 2014.

Conclusions
ATLAS defined at the end of 2012 an ambitious plan for improvements of the software and computing infrastructure and tools. All new components and developments are coming together about now:
• A new simulation framework, improved reconstruction algorithms, faster tools;
• New workload and data management systems;
• A new operation model for analysis and for distributed computing;
• Data Challenge DC14 is testing all components of this improved system.
ATLAS will be ready for taking new LHC data in 2015.