Experience with Intel’s Many Integrated Core architecture in ATLAS software

Motivation for studying Many Integrated Core in ATLAS
The use of Graphics Processing Units (GPUs) for general-purpose programming (GPGPU), meaning non-graphics work, goes back at least to the early '90s, but it took a huge step forward when NVIDIA introduced the Compute Unified Device Architecture (CUDA) platform and programming model in early 2007 and started to market GPUs for high performance computing (HPC). At that time x86 dominated the HPC market and Intel, its main provider, needed to answer the challenge or lose market share.
Intel first worked on a GPGPU chip codenamed Larrabee until it consolidated all manycore research into the Many Integrated Core (MIC) architecture in 2010. The Xeon Phi is the first production chip to come out of this research. In June 2013, the Chinese supercomputer "Tianhe-2" ("Milky Way-2") was announced: it consists of 16,000 compute nodes with two Xeon processors and three Xeon Phis each. It vaulted to the top spot in the Top500 list of supercomputers, dethroning "Titan", which uses NVIDIA's K20x GPGPU accelerators.
The ATLAS detector [1] is a multi-purpose experiment at the Large Hadron Collider (LHC) located at CERN. The LHC is currently shut down for scheduled maintenance. When it restarts, it should quickly ramp up to its original design energy and luminosity. Compared with the runs up to 2012, the higher center-of-mass energy and possibly smaller bunch spacing, combined with higher luminosity, will result in higher particle multiplicities and an increase in overlapping background events (pile-up). ATLAS also expects significantly higher data rates to fully support the physics program. The throughput of the ATLAS reconstruction software must therefore improve by a factor of 2 to 3, depending on the actual performance of the LHC and the experiments. A decade ago, a shutdown of this duration would by itself have allowed CPU performance improvements to catch up with our needs. But that free lunch is over.
Given several successful applications of GPGPU in HEP (see the recent "Graphics Processing Units in HEP" workshop [2]) and the need to improve the throughput of our software, it is natural to ask what Intel's offering means for ATLAS. In this paper, we give our views after having explored Intel's MIC over the past couple of months.
Note: we generally refer to the Xeon Phi as "MIC," which is technically the name of the architecture rather than of the coprocessor product; but it makes for easier reading, as it more clearly distinguishes it from the Xeon line of CPUs.

Overview of the MIC architecture
The most striking features of the MIC (see figure 1(a) for a display of the board) are its huge number of floating point operations (FLOPs) per second and its good memory bandwidth, combined with comparatively small main on-board memory and level-2 cache. The exact numbers depend on the model: we used a 5110P board with stepping B1, which has 60 cores running at 1.05 GHz, 4 hardware threads per core, 8 GB of main memory, and 512 KB of L2 cache per core. With 8-wide vector registers for double-precision floating-point values (the most common type in ATLAS), this results in a theoretical peak performance of 504 GFLOPs, compared with a maximum of 85 GFLOPs for a high-performance Sandy Bridge Core i7, see figure 1(b).
Figure 1. Display of the MIC board in a rack-mounted tray (a), and theoretical peak performance (b) depending on whether vectorization is enabled or not ("no-vec"). The comparison is with the highest performance Sandy Bridge Core i7 CPU.

The MIC cores are based on the original in-order Pentium (P5) design: instructions are executed in order, and there is no hardware prefetching. The chip features a new set of 512-bit wide general and masking registers, with corresponding vector instructions: the VPU, or Vector Processing Unit. These are not compatible with any of the current SSE or AVX implementations. The instruction set includes scatter/gather operations, based on strides, but the actual loads and stores are executed in a loop; thus, data locality is important or cache lines risk being evicted during these operations. The use of vectorization in ATLAS code is currently very limited: developers have historically made little or no use of it, and many of the relevant matrices are small and/or have an odd number of elements (e.g. 5×5 covariance matrices). When the matrix sizes do not match those of the vector registers, the overhead due to padding and to peel and remainder loops becomes expensive. This contrasts with typical HPC code, where matrices have many thousands of elements, which reduces the relative importance of such overheads.
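The cost of the mismatch is easy to see from how a vectorizing compiler typically decomposes a loop: a peel loop up to an aligned boundary, a vectorized main loop over full vector widths, and a scalar remainder loop. The model below is our own illustration, not ATLAS code; for a 5-element row and 8-wide registers, the main loop never executes and all work falls into the scalar remainder.

```cpp
#include <cstddef>

// Illustrative model: number of iterations that run vectorized when a
// loop over n elements is split into peel / main / remainder loops for
// vector width W (peel assumed zero, i.e. aligned data).
std::size_t vectorized_iters(std::size_t n, std::size_t W) {
    std::size_t main_iters = (n / W) * W;  // iterations in full vectors
    return main_iters;                     // remainder = n - main_iters
}
```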
There are 4 hardware threads per core. They are similar to hyperthreads on Xeon CPUs, but designed to mask the latency inherent to in-order cores, rather than to fill empty slots in out-of-order pipelines; the hardware threads can therefore not be disabled. Each hardware thread has a full register file, including the vector registers, and is cycled in a "smart" round-robin fashion: a thread is only scheduled if it is ready to issue. Instructions are issued in bundles of two, making branching more expensive, and the same thread cannot issue on back-to-back cycles. The MIC can thus only achieve peak performance by running at least 2 threads per core, or a minimum of 120 threads on our board. Scheduling more than 2 threads per core only gains further performance if data locality is not optimal.

Compilers, tools, and support
The main selling point of the MIC architecture is that it is x86: existing codes are relatively straightforward to port, and optimization for both coprocessor and CPU can be done on a single code base. It should allow the use of existing compilers, so there is little restriction on programming languages and libraries, including complex C++ and template-heavy libraries such as Intel's TBB. Available tools purport to bring vectorization within reach on both the host and the coprocessor, either automatically or with simple hints inside the code.
We find that these claims are oversold. The available tools are lacking in stability and maturity, but the proposed model is also problematic: automatic vectorization is just about as hard as automatic parallelization, because the compiler has to make inferences about code behavior based on insufficient information. In practice that means it gives up and does not vectorize code unless it encounters specific known use cases of HPC-style code, and the programmer is left doing the work by instructing the compiler to relax constraints for specific code blocks.
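A typical example of such a constraint is pointer aliasing: the compiler cannot prove that input and output do not overlap, gives up, and the programmer must assert independence per loop. The sketch below uses ICC's `#pragma ivdep` hint (GCC has an equivalent `#pragma GCC ivdep`); the function itself is a made-up example.

```cpp
// Without the hint the compiler must assume 'out' may alias 'in' and
// will not vectorize; 'ivdep' asserts there is no loop-carried
// dependence, relaxing that constraint for this loop only.
void scale(float* out, const float* in, float s, int n) {
#pragma ivdep
    for (int i = 0; i < n; ++i)
        out[i] = s * in[i];
}
```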
This contrasts with GPUs, which have direct hardware support for programming models that require large-scale vectorization and parallelization. For example, they can automatically coalesce scalar loads/stores into vector equivalents, predicate instructions for simple branching constructs, and mask individual execution lanes. Changing programming model, away from the x86 that ATLAS codes have targeted so far, means work, but it is a more straightforward task because the underlying model is internally consistent.
Intel is now moving in a similar direction by providing new instructions for the MIC architecture, as well as with AVX-512. These instructions sport operation masks, so that vector instructions can execute conditionally on individual elements. This provides a much more GPU-like programming model for CPUs. Intel also provides a single-program, multiple-data compiler (ISPC [3]) to take advantage of such a programming model. Contrary to OpenCL, ISPC does not restrict its code generation to kernels that iterate over elements only in outer loops, but rather targets the single-instruction, multiple-data (SIMD) vector units of the CPU. We find that this works well, albeit that ISPC is currently C-only (with most ATLAS code in C++) and its support for the MIC's VPU is inadequate to the point of being non-existent.
Support for multi-threading is in better shape than that for vectorization, with TBB and OpenMP being the most prevalent tools. Still, neither of these libraries makes it easy to distribute work across both the host and the coprocessor: each device gets its own thread pool, and it is up to the programmer to divide the work load (see [4] and also the next section). Tasks in TBB are large enough to alleviate locality concerns, but with OpenMP it is important to assign consecutive threads, with consecutive array accesses, to the same core. This "balanced" affinity can make up to a factor of 3 difference in performance; it is documented, but not yet supported.
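The locality issue can be illustrated with a standard OpenMP loop (our own sketch, not ATLAS code): with a static schedule, consecutive threads process consecutive chunks of the array, so placing threads 0-3 on the same core, as the documented `KMP_AFFINITY=balanced` setting of the Intel runtime is meant to do, lets them share that core's cache, whereas a "scatter" placement spreads neighboring chunks across cores.

```cpp
#include <vector>

// With schedule(static), thread t gets the t-th contiguous chunk of v;
// "balanced" affinity puts consecutive threads on the same core, so
// neighboring chunks share that core's L2 cache.
double sum_squares(const std::vector<double>& v) {
    double total = 0.0;
    const int n = static_cast<int>(v.size());
#pragma omp parallel for reduction(+ : total) schedule(static)
    for (int i = 0; i < n; ++i)
        total += v[i] * v[i];
    return total;
}
```

Without OpenMP enabled the pragma is ignored and the loop runs serially with the same result.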

Framework integration
The "offload" programming model is the simplest way to integrate an algorithm that uses the MIC into the ATLAS software framework, Athena. In this model, the programmer marks a section of code (e.g. the body of an algorithm) as a candidate for offloading at run-time. The ICC compiler then generates two versions of the code, plus the required marshaling, within a single binary. Since ICC integrates well with the GCC binaries from the normal ATLAS releases (when it uses the GCC headers), individual developers only have to recompile their own (offload) code.
Data transfers should not be managed by individual developers: naive offloading of data is very slow, due to the time spent allocating and deallocating memory on the coprocessor, see figure 2(a); the use of 2 MB pages helps, but good transfer speeds are only achieved by setting up reusable buffers. A service is therefore needed to set up and manage these buffers.
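With Intel's offload pragmas, buffer reuse is expressed through the `alloc_if`/`free_if` clauses: allocate the device buffer on the first call, keep it across calls, and free it on the last. The sketch below shows the pattern (the function and its arguments are our own; without an offload-capable compiler the pragma is ignored and the block simply runs on the host):

```cpp
// Device buffer for 'data' is allocated only when 'first' is true and
// freed only when 'last' is true; calls in between reuse the buffer
// and pay only for the transfer itself, not for (de)allocation.
void process_chunk(double* data, int n, bool first, bool last) {
#pragma offload target(mic:0) inout(data : length(n) alloc_if(first) free_if(last))
    {
        for (int i = 0; i < n; ++i)
            data[i] *= 2.0;  // stand-in for the real computation
    }
}
```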
Figure 2. Data transfer rates (a): asynchronous, with "huge" pages, and using reusable buffers; and the effect of object-oriented programming and shared libraries on throughput (b), for varying numbers of calls in the loop and core utilization rates. (For the host, 4 and 8 calls overlap.)

We find empirically that indirection, from virtual method calls due to our object-oriented code and from trampolines due to our organization of shared libraries, affects the performance of the MIC less than that of the host CPU, see figure 2(b). Front-end stalls caused by branch misprediction quickly limit the Xeon host (the Core i7 compares a bit more favorably). The MIC does not predict branches but rather uses its multiple threads to cover the latency, to better effect if there is sufficient independent parallelism. Absolute performance is very low for both host and coprocessor because of the lack of vectorization. Still, the next-generation concurrent event processing framework, GaudiHive, executes multiple events in parallel: it could achieve comparatively good throughput on the MIC without changes to our current code.
In a production setting all available resources should be used, even if doing so is theoretically inefficient. Add the need to reuse buffers for performance, as well as integration with the scheduling of our current and future frameworks, and it is clear that a service model is needed to manage coprocessor cards. Such an implementation is described in [5].

ATLAS tracking algorithms
Accelerators could be useful in two places: track finding in the Pixel and SCT detectors, and the ambiguity processing after the seed finding. The former parallelizes well, because the different seeds are independent of each other, and studies on GPUs show big potential improvements [6]. In the latter, final track fits remove duplicate and low-quality measurements that are shared between track candidates. The existing algorithm [7] used in ATLAS reconstruction does not parallelize well, because it scores tracks and iteratively removes shared measurements from them, then re-fits and re-scores. The Multi Track Fitter (MTF [8]), an alternative algorithm, has similar or even better physics performance (inside jets). The MTF is parallelizable because it assigns measurements to tracks probabilistically rather than exclusively, and updates these assignments iteratively and for all tracks at the same time. The updates thus become independent and can be done fully in parallel.
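The independence of the MTF updates can be sketched with a toy model (entirely our own simplification, not the MTF itself: assignment weights are recomputed from the current track-measurement distances, normalized over tracks, so no track ever removes a measurement from another):

```cpp
#include <cmath>
#include <vector>

using Matrix = std::vector<std::vector<double>>;

// One iteration of a toy probabilistic assignment: the weight of
// measurement j for track i depends only on the current distances
// d[i][j], so every measurement column can be updated in parallel.
Matrix update_weights(const Matrix& d) {
    const std::size_t nTracks = d.size(), nMeas = d[0].size();
    Matrix w(nTracks, std::vector<double>(nMeas));
    for (std::size_t j = 0; j < nMeas; ++j) {   // parallelizable over j
        double norm = 0.0;
        for (std::size_t i = 0; i < nTracks; ++i)
            norm += std::exp(-d[i][j]);
        for (std::size_t i = 0; i < nTracks; ++i)
            w[i][j] = std::exp(-d[i][j]) / norm;
    }
    return w;
}
```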
Once the algorithms are designed, the main difficulty is with the geometry, conditions data, and magnetic field map, because these cannot easily be migrated to the accelerator cards. Memory constraints form the foremost technical challenge: it is possible to define a simplified geometry and a parametrized magnetic field that will fit, but these will be insufficient for the final track fit. Another important consideration is the manpower needed to develop and maintain any geometry that differs in implementation from the one used on CPUs. The MIC has the advantage here, albeit that initialization, being non-parallel, is very slow; this simple approach is therefore only acceptable for very long-running jobs.
One approach to track fitting on GPUs (implemented in CUDA and OpenCL) and the MIC is to use a Kalman filter with a so-called reference trajectory instead: the extrapolation, including material effects, is done on the CPU, independently of the actual track fit. Only the track parameters of the reference trajectory and the measurements have to be transferred to the accelerator card. The track fit and the updates of the measurement assignments are done, in parallel, on the GPU/MIC. Extrapolations of different reference trajectories are independent and would be parallelizable, except that there are too many accesses to the magnetic field service.
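Once the reference trajectory is fixed, the per-measurement work on the accelerator reduces to small linear-algebra updates. A one-dimensional Kalman measurement update (the generic textbook form, not the ATLAS implementation) shows the shape of that work:

```cpp
// 1-D Kalman measurement update: state estimate x with variance P,
// measurement m with variance R.
struct State { double x, P; };

State kalman_update(State s, double m, double R) {
    const double K = s.P / (s.P + R);   // Kalman gain
    return { s.x + K * (m - s.x),       // updated estimate
             (1.0 - K) * s.P };         // updated (reduced) variance
}
```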
Vectorization in the parallel fitting suffers from mismatched sizes, which can be solved by padding: adding "zero measurements" to equalize track lengths, and padding the measurements themselves to equalize their dimensions. The latter, however, leads to larger matrices to invert (5×5 instead of 1×1 and 2×2 matrices), which is comparatively much more compute-intensive.
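Equalizing dimensions can be done by zero-padding each measurement to the full parameter dimension; a real fit would also give the padded entries zero weight so they do not affect the result. A minimal sketch (the names are ours):

```cpp
#include <array>
#include <vector>

// Zero-pad a measurement vector to fixed dimension D so that all
// measurements can share identical D x D matrix code in SIMD lanes.
template <std::size_t D>
std::array<double, D> pad_measurement(const std::vector<double>& meas) {
    std::array<double, D> out{};  // value-initialized to zeros
    for (std::size_t i = 0; i < meas.size() && i < D; ++i)
        out[i] = meas[i];
    return out;
}
```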
There is potential in using the MIC in track reconstruction, but we need new algorithms such as the MTF; some of the algorithms currently in use are not well suited for wide vectorization and massive parallelization. The main open question is whether the geometry and magnetic field can be offloaded, given the constraints on memory and the required effort for initial development and subsequent maintenance.