Software-based data acquisition and processing for neutron detectors at European Spallation Source: early experience from four detector designs

European Spallation Source (ESS) will deliver neutrons at high flux for use in diverse neutron scattering techniques. The neutron source facility and the scientific instruments will be located in Lund, and the Data Management and Software Centre (DMSC) in Copenhagen. A number of detector prototypes are being developed at ESS together with its European in-kind partners, for example: SoNDe, Multi-Grid, Multi-Blade and Gd-GEM. These are all position-sensitive detectors but use different techniques for the detection of neutrons. Except for digitization of electronics readout, all neutron data is anticipated to be processed in software. This provides maximum flexibility and adaptability and allows deep inspection of the raw data for commissioning, which will reduce the risk of starting up new detector technologies. But it also requires development of high-performance software processing pipelines and optimized and scalable processing algorithms. This report provides a description of the ESS system architecture for the neutron data path. Special focus is on the interface between the detectors and DMSC, which is based on UDP over Ethernet links. The report also describes the software architecture for detector data processing and the tools we have developed, which have proven very useful for efficient early experimentation and can be run on a single laptop. Processing requirements for the SoNDe, Multi-Grid, Multi-Blade and Gd-GEM detectors are presented and compared to event processing rates achieved so far.


Introduction
The European Spallation Source [1,2] is a spallation neutron source currently being built in Lund, Sweden. ESS will initially support 15 different instruments for neutron scattering. The ESS Data Management and Software Centre (DMSC), located in Copenhagen, provides infrastructure and computational support for the acquisition, event formation, long term storage, and data reduction and analysis of the experimental data. At the heart of each instrument is a neutron detector and its associated readout system. Currently detectors as well as readout systems are in the design or implementation phase and various detector prototypes have already been produced [3][4][5][6]. ESS detectors will operate in event mode [7], meaning that for each detected neutron a (time, pixel) tuple is calculated, providing the detection timestamp (with a resolution of 100 ns or better) and position on the detector where the neutron hit. This allows for later filtering of individual events (vetoing) and flexible refinement of the energy determination as well as of the scattering vector.
ESS detector prototypes have been tested at various neutron facilities and a number of temporary data acquisition systems have been in use so far. When in operation, ESS will use a common readout system which is currently being developed [8]. We are also moving towards a common software platform for the combined activities of data acquisition and event formation. This platform consists of core software functionality common to all detectors and a detector specific plugin architecture.
The main performance indicators of the system are: the neutron rates, the data transport chain from the front-end electronic readout to the event formation system, the parsing requirements for the readout data, and the individual data processing requirements for the different detector technologies.
Good estimates of the neutron flux on the sample and the detectors have been produced by simulations [9][10][11], and estimates of the corresponding data rates have been made, although the precise values will depend upon engineering design decisions that are still to be made or further detector characterizations. Examples of these are: number of triggered readouts per neutron, readout data encapsulation methods, hardware data processing, etc.
The architecture for the ESS data path is described in section 1.2 and the software architecture is described in section 2. Section 3 briefly describes ESS readout architecture and discusses hardware and physical abstractions such as digital and logical geometry. Parsing of readout data and event formation processing is the subject of section 4. Four detectors have been subjected to early testing at neutron sources using a scaled-down version of the anticipated software infrastructure for ESS operations. Performance numbers for event processing are reported in section 5. For the data transport we have chosen Ethernet and UDP. UDP as such is unreliable and packet losses can (and will) occur. It is possible, however, to achieve a high degree of reliability for UDP data and this is discussed in appendix A.

Instrument data rates
ESS is a spallation source, where neutrons are generated by the collision of high energy protons with a suitable target (tungsten in the case of ESS). The proton source is pulsed, with a 14 Hz frequency. The neutron flux generated by this process has been simulated with MCNP from the target/moderator [12][13][14]. The flux is reduced throughout the neutron path by neutron transport components (neutron guides, monitors, beam ports) and instrument-specific components (collimators, choppers, sample enclosures, etc.); these reductions are typically calculated using the Monte Carlo simulation tool McStas [15,16]. The detector properties are determined by a combination of Geant4 simulations [17][18][19] for the initial considerations and experiments at neutron facilities once a prototype has been built.
Typically, rates are reported for neutrons hitting the sample and the detector, as shown in table 1. As mentioned, these rates are not directly convertible into data rates received by the software. Our estimates of the required processing power are, however, based on neutron rates for the detector surface assuming 100% efficiency. On the input side of software processing we use the peak instantaneous rate, measured as the highest number of neutrons received in a 1 ms time bin [10], because we must receive all data without loss. On the output side, we use average rates because event data is buffered up for transmission inside the event formation system, which has the practical effect of load levelling the event rate with time.

Table 1. Current estimates of data rates for selected ESS instruments at 5 MW ESS source power. Global avg. rate is defined in [10]. Data Rate is the corresponding amount of data received for software processing.
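The difference between the peak instantaneous rate (input side) and the average rate (output side) can be illustrated with a short Python sketch. The function name, the use of integer nanosecond timestamps and the example numbers are assumptions made for illustration only; they are not taken from the ESS software.

```python
from collections import Counter


def peak_and_average_rate(timestamps_ns, duration_ns, bin_ns=1_000_000):
    """Return (peak, average) rates in events per second.

    timestamps_ns: event times as integer nanoseconds (hypothetical input).
    duration_ns:   length of the observation window in nanoseconds.
    bin_ns:        width of the time bin used for the peak rate (1 ms default).
    """
    if not timestamps_ns or duration_ns <= 0:
        return 0.0, 0.0
    # Count events per time bin; the fullest bin defines the peak rate.
    counts = Counter(t // bin_ns for t in timestamps_ns)
    peak = max(counts.values()) * 1e9 / bin_ns
    # The average rate is taken over the whole observation window.
    average = len(timestamps_ns) * 1e9 / duration_ns
    return peak, average
```

For a pulsed source the two numbers can differ by orders of magnitude, which is why the receiving side is dimensioned for the peak while the buffered output side can be dimensioned for the average.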

Architecture for the ESS data path
The system architecture for the ESS neutron data path is shown in figure 1. Every neutron scattering instrument has at least one detector: the individual detector technologies vary [20], and this is discussed in section 4, but eventually an electric signal is induced on an electrode. This signal is digitised by the readout electronics and sent via UDP to the event formation system.
The key component of the software event formation system is the Event Formation Unit (EFU), a user-space Linux application written in C++ and targeted to run on Intel x86-64 processors. For each ESS instrument, several EFUs will run in parallel to support the high data rates. The EFU is responsible for processing the digitised readouts and converting these into a stream of event (time, pixel) tuples.
The event tuples are serialised and sent to a scalable data aggregator/streamer providing a publish/subscribe interface. A file writer application subscribes to the data stream and combines the neutron data with data from other sources such as motor positions for collimators and sample, temperature, pressure, magnetic/electric fields, etc. This aggregated data is then written to file in a format suitable for long-term storage. From permanent storage it is then possible to perform offline data reduction and analysis [21].

Event Formation Unit architecture
The EFU architecture, illustrated in figure 2, consists of a main application with common functionality for all detectors and detector-specific processing pipelines. The software is written in C++, and is built using gcc and clang compilers for Ubuntu, macOS and CentOS. CentOS is currently the target Linux distribution for ESS operations, whereas the other operating systems are used during development and implementation.
The main application handles low CPU-intensity tasks such as launch-time configuration via command-line options, run-time configuration using a TCP-based command API, application state logging and periodic reporting of run time statistics and counters.
Detector pipelines are responsible for handling realtime readout data, and must conform to a common software interface definition. The pipelines are implemented as shared libraries that are loaded and launched by the main application as POSIX threads with support for thread affinity, which fixes a thread onto a specific processor core. The plugin must specify at least one processing thread but apart from this no further restrictions are imposed. We have experimented with different configurations for different detectors, but currently the number of threads in a detector pipeline ranges from one to three.
When more than one thread is in use, the data is shared between the producer and the consumer thread by a circular data buffer (FIFO) which preserves the order of the arriving data. The FIFO is based on pre-allocated memory to avoid unnecessary data copying and C++ std::atomic primitives for resource locking. For performance benchmarking we use the rdtsc instruction, which gives a high resolution timestamp counter with low latency.
The data processing part of the detector pipelines generally consists of a tight loop with a BSD socket recvfrom() system call, a parse() function, and a produce() step. These processing steps can be done in a single thread or in multiple threads, depending on specific requirements.
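The indexing behind such a circular FIFO can be sketched in a few lines of Python. This is only an illustration under the assumption of a single producer and a single consumer; the class name RingBuffer and its methods are hypothetical, and the sketch does not model the C++ std::atomic synchronisation used in the EFU.

```python
class RingBuffer:
    """Fixed-capacity FIFO over pre-allocated slots (illustrative only)."""

    def __init__(self, capacity):
        if capacity <= 0:
            raise ValueError("capacity must be positive")
        self._slots = [None] * capacity  # allocated once, reused afterwards
        self._capacity = capacity
        self._head = 0  # total number of items written (producer side)
        self._tail = 0  # total number of items read (consumer side)

    def push(self, item):
        """Store an item; return False if the buffer is full."""
        if self._head - self._tail == self._capacity:
            return False
        self._slots[self._head % self._capacity] = item
        self._head += 1
        return True

    def pop(self):
        """Return the oldest item, or None if the buffer is empty."""
        if self._tail == self._head:
            return None
        item = self._slots[self._tail % self._capacity]
        self._tail += 1
        return item
```

Because the producer only advances the head index and the consumer only advances the tail index, the arrival order is preserved and no slot is reallocated while the buffer is in use.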
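The receive, parse and produce loop can be sketched as follows in Python. The packet layout in parse_readouts (consecutive pairs of little-endian 32-bit time and channel values) is a hypothetical format chosen for the example, and run_pipeline is an illustrative name rather than part of the EFU API; the only requirement on the sock argument is a recvfrom() method such as the one offered by a standard UDP socket.

```python
import struct


def parse_readouts(packet):
    """Parse a packet of (time, channel) pairs, each two little-endian uint32.

    Hypothetical format for illustration; trailing bytes that do not form a
    complete pair are ignored.
    """
    usable = len(packet) - len(packet) % 8
    return [struct.unpack_from("<II", packet, offset)
            for offset in range(0, usable, 8)]


def run_pipeline(sock, parse, produce, max_packets, bufsize=9000):
    """Receive max_packets datagrams, parse each one and hand on the results."""
    received = 0
    while received < max_packets:
        data, _sender = sock.recvfrom(bufsize)  # blocking receive
        received += 1
        for readout in parse(data):
            produce(readout)
    return received
```

In a single-threaded configuration produce() would serialise and transmit events directly; in a multi-threaded configuration it would typically place the parsed data in a FIFO for a separate processing thread.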

FlatBuffers/Kafka
The output of the EFU is a stream of events. We have chosen Apache Kafka [22] as the central technology for transmission, and Google FlatBuffers [23] for serialisation. Apache Kafka is an open source software project for distributed data streaming. Multiple Kafka brokers form a scalable cluster, which supports a publish-subscribe message queue pattern with configurable data persistence.
In Kafka, producers send data to a topic in a cluster. A consumer subscribes to a topic to receive messages, either from the instant the subscription starts or from a previous offset, provided the requested data is still available in storage on the cluster, given a retention policy. Consumers may also be grouped to distribute the processing load among different processes.
Both producers and consumers can be developed using open source Kafka client libraries. The EFU uses librdkafka [24], which offers a C/C++ API. While Kafka offers a scalable and reliable transmission of arbitrary data, FlatBuffers provides a schema-based event serialisation method, and a mechanism for forward and backward compatibility of the schemas. Figure 3 shows the currently used schema for events.

Live detector data visualisation
After writing the first prototype for event formation, it became clear that it would be beneficial to also use the EFU as a DAQ system for early detector experiments and commissioning. One of the easiest ways to validate the event processing is to visualise the detector image and other relevant data, such as channel intensities and ADC distributions.
For this reason the EFU also publishes such information via Kafka, and an application named Daquiri was written for visualising these data. Daquiri subscribes to Kafka topics, collects statistics, and provides plotting functionality, and is planned to be an integral part of the software bundle developed for ESS operations. The Daquiri GUI is based on Qt [25] and is highly configurable in terms of available plotting formats, the dashboard configuration, labels, axes, colour schemes, etc. Daquiri is open source software [26].

Figure 3. FlatBuffers event schema (excerpt): include "is84_isis_events.fbs"; file_identifier "ev42";

Runtime stats and counters
The availability of relevant application and data metrics is essential for both early prototyping and easy monitoring while in operation, for example incoming packet rates, parsing errors, calculated events, discarded readouts, etc. The detector API provides a mechanism for the detector plugin to register a number of named 64-bit counters. These are then periodically queried by the main application and reported to a time-series server.
We have chosen Graphite as the time-series server [27] technology and use Grafana [28] for presentation. Graphite has a simple API for submission of data, which consists of a hierarchical name such as efu.net.udp_rx, a counter value and a UNIX timestamp. The combination Grafana/Graphite has proven to be very useful, not only for monitoring the event processing software. Scripts have been written to check Linux kernel and network card counters, as well as disk and CPU usage, all of which are relevant when running at high data rates while simultaneously writing raw data to disk. We typically publish monotonically increasing counter values, and then use Grafana to transform these into rates. We plan to offer Grafana/Graphite for the software we are developing for ESS operations.
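As a hedged illustration, the sketch below formats one sample in Graphite's plaintext line format (metric path, value and UNIX timestamp separated by spaces and terminated by a newline) and shows how a rate can be derived from two samples of a monotonically increasing counter. The function names are hypothetical, and the rate calculation only mirrors the kind of transformation that is done in Grafana in our setup.

```python
def graphite_line(name, value, timestamp):
    """Format one sample as '<name> <value> <timestamp>' plus a newline."""
    return f"{name} {value} {int(timestamp)}\n"


def rate_per_second(previous, current):
    """Derive a rate from two (counter_value, unix_time) samples.

    Returns None if time did not advance or if the counter decreased,
    for example after an application restart.
    """
    (value0, time0), (value1, time1) = previous, current
    if time1 <= time0 or value1 < value0:
        return None
    return (value1 - value0) / (time1 - time0)
```

Publishing raw counters rather than precomputed rates keeps the application side simple and makes a missed sample harmless, since the next sample still carries the full count.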

Trace and logging
For application logging we have chosen Graylog [29]. In the EFU, Graylog is used for low rate log messages. We use the syslog [30] conventions for logging levels and severities. Graylog is not currently in wide use in the infrastructure, but will be essential for monitoring the ESS data processing chain once in operations, when multiple EFUs are deployed.
During development we use a simple but effective trace system consisting of groups and masks. These currently print directly to the console, which is extremely detrimental to performance when operating at thousands of packets and millions of readouts per second. Therefore, we made the trace macros configurable at compile time, so that no overhead occurs when they are not needed.
Both the log and trace system accept log messages in a printf()-compatible format using variable arguments.

Software development infrastructure
All the software components that are part of the data aggregation and streaming pipeline are being developed collaboratively by the partners as open source projects released under a BSD license. Git [31] is used for version control and all software is available for public scrutiny on GitHub [32]. We use Conan [33] as our C++ package manager and CMake [34] for multi-platform Makefile generation. The projects are built with gcc and clang compilers [35,36].
A Jenkins [37] build server automatically triggers builds and runs commit stage tests each time new code is pushed to a repository: every commit on every branch for every project triggers a Jenkins build and test cycle on multiple operating systems, providing rapid feedback on breaking changes. The tests that are run vary according to the application but for C++ code in general include unit tests with Google Test [38], static analysis with Cppcheck [39], and test coverage reports with gcovr [40]. We also check for code format compliance with clang-format [41], and memory management problems with Valgrind [42].
Individual methods and algorithms can be benchmarked for performance with Google Benchmark [43]. Some of the executables generated from every build cycle are saved as artefacts and can thus be used for quick deployment or integration testing. Configuration of the machines in the build and test environment is done using Ansible [44], with the scripts kept under version control.

Detector readout
This report is mainly concerned with the detector data flowing from the readout system back end to and through the event formation system. Due to the high neutron flux delivered by ESS, the data rates will be correspondingly high. The ESS readout system conceptually consists of a detector specific front end and a generic back end as illustrated in figure 7.
The back end connects to the event formation system via 100 Gb/s optical Ethernet links, which provide more capacity than required for most instruments.However, for the small scale detector prototypes we typically use Gigabit Ethernet.
The ESS readout system is currently under development. Until it becomes generally available a number of different ad-hoc readout systems have been employed for testing of prototypes. The ones relevant for this report are: CAEN, mesytec, RD51 Scalable Readout System and ROSMAP-MP from Integrated Detector Electronics AS [45][46][47][48]. These are controlled either by applications supplied by the manufacturer, by custom Python scripts or by GUI applications. The digitised data is transmitted as binary data over UDP in a similar way as when the instruments are in operation. None of these ad-hoc systems is currently set up to consume the ESS absolute timing information.

Digital geometry
While the common readout back end deals with the connection to the event formation, different detector technologies have different electrical connections to the readout front ends. Multi-Grid, for example, uses a combination of wires and grids whereas Multi-Blade uses wires and strips. Even for a specific detector technology, the different prototypes can have different sizes and therefore different numbers of channels. We need to combine the knowledge about the electrical wiring and how the digitisers are connected in order to know anything about where on the detector a signal was induced. We call this the digital geometry.
An example outlining the digital geometry for Gd-GEM is shown in figure 8. The x-position is a function x(a, c, f) of an ASIC id ranging from 0 to 1, a channel from 0 to 63 and a front end card id from 0 to 9. For each detector pipeline, a digital geometry C++ class is created to handle this mapping. The classes are typically parametrised so they can handle multiple variants.
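A parametrised digital geometry of this kind might look like the following Python sketch. The linear ordering used here (front end cards first, then ASICs, then channels) is an assumption made for illustration and is not necessarily the mapping shown in figure 8; the class name and its fields are likewise hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DigitalGeometry:
    """Maps (asic, channel, front end card) to an x strip number."""

    asics_per_card: int = 2
    channels_per_asic: int = 64
    n_cards: int = 10

    def x(self, asic, channel, card):
        """Return the strip number, or None for ids outside the valid range."""
        valid = (0 <= asic < self.asics_per_card
                 and 0 <= channel < self.channels_per_asic
                 and 0 <= card < self.n_cards)
        if not valid:
            return None
        # Assumed ordering: cards are adjacent, ASICs adjacent within a card.
        return (card * self.asics_per_card + asic) * self.channels_per_asic + channel
```

Keeping the sizes as parameters means that a smaller prototype can be described by constructing the class with different values rather than by writing a new mapping.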

Logical geometry
The end result of event formation is a stream of event tuples. An event tuple (t,p) consists of a timestamp and a pixel_id. Due to their physical construction, the detectors are inherently pixellated and what we calculate is simply which pixel was hit by a neutron, i.e. this step does not need to know anything about the physical size or absolute compositions of the pixels. We call this the logical geometry. We have defined a common convention for the logical geometry for ESS instruments. The convention covers single-panel and multi-panel, 2D and 3D detectors. For example Multi-Grid is a single panel 3D detector (which then has voxels instead of pixels, but we do not make a distinction) and Gd-GEM is a multi-panel 2D detector. In this scheme we also unambiguously define the mapping between the (x,y,z) coordinates of the (logical) positions and a unique number, called the pixel_id.
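One possible mapping between logical coordinates and pixel_id is sketched below in Python for a single panel. The choices made here are assumptions for illustration, not a statement of the ESS convention: x varies fastest, pixel ids start at 1, and 0 is used to signal an invalid position.

```python
def pixel_id(x, y, z, nx, ny, nz):
    """Return a unique id in 1..nx*ny*nz, or 0 for an invalid position."""
    if not (0 <= x < nx and 0 <= y < ny and 0 <= z < nz):
        return 0
    return 1 + x + nx * y + nx * ny * z


def coordinates(pid, nx, ny, nz):
    """Inverse of pixel_id: return (x, y, z), or None for an invalid id."""
    if not (1 <= pid <= nx * ny * nz):
        return None
    index = pid - 1
    return index % nx, (index // nx) % ny, index // (nx * ny)
```

A 2D detector is covered by setting nz to 1, so the same functions serve both pixels and voxels.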

Four ESS detector technologies
Neutrons cannot be directly observed, but are observed as the result of a conversion event where the neutron interacts with a material with a high thermal neutron cross section for absorption. In this process the absorber material converts the neutron into charged particles or light, which can then be detected by conventional methods. For the detectors in this study the conversion materials are based on Li, B and Gd. The detection methods for the individual detectors will be described below.

SoNDe
The Solid-state Neutron Detector (SoNDe) is based on a scintillating material that converts thermal neutrons into light which is detected by a photomultiplier tube. The detector is in an early stage of characterisation and is currently available as a single module demonstrator [3], shown in figure 10. It consists of a pixelated scintillator, a Hamamatsu H8500 series 8 x 8 MaPMT (Multi-anode Photomultiplier Tube) [49] and a SONDE/ROSMAP-MP counting chip-system to read out the MaPMT signals [50]. The chip-system consists of four ASICs each responsible for readout of 16 pixels. The final detector will consist of 400 such modules, arranged in 100 groups of four modules in a 2 by 2 configuration. For a report on the recent progress and patent information, see [51,52].
The ROSMAP module transmits readout data in three different operation modes as UDP data over Ethernet. The supported modes are Multi-Channel Pulse-Height Data, Single-Channel Pulse-Height Data and Trigger Time Hits over threshold Data. For early characterisation and verification it is necessary to extract the charge information for individual channels and thus support for the two "expert mode" data formats has been developed. When in operation at ESS only the event-mode format (Trigger Time) will be relevant.

Processing requirements
SoNDe belongs to a class of detectors requiring little data processing as the readout system already provides event data in the form of (time, asic_id and channel) values. The digital geometry only has to account for the fact that two of the readout ASICs are rotated 180 degrees compared with the others and the fact that they represent a view of the detector surface from the back, which is different from the logical geometry definition we use. For the single module demonstrator, which consists of 8 x 8 pixels, the processing steps are
• parse the binary readout data and extract (time, asic_id, channel)
• combine asic_id and channel to a pixel_id
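The second step can be sketched in Python as below. The layout is entirely hypothetical: each ASIC is assumed to read a 4 x 4 quadrant of the 8 x 8 module with channels in row-major order, ASICs 2 and 3 are assumed to be the rotated ones, and the back view is converted to a front view by mirroring x. The actual SoNDe wiring may differ in every one of these details.

```python
ROTATED_ASICS = {2, 3}  # hypothetical choice of the two rotated ASICs


def sonde_pixel_id(asic_id, channel):
    """Combine asic_id (0..3) and channel (0..15) into a pixel_id in 1..64."""
    if not (0 <= asic_id < 4 and 0 <= channel < 16):
        raise ValueError("asic_id or channel out of range")
    if asic_id in ROTATED_ASICS:
        # A 180 degree rotation of a row-major 4 x 4 block reverses the order.
        channel = 15 - channel
    local_x, local_y = channel % 4, channel // 4
    x = (asic_id % 2) * 4 + local_x
    y = (asic_id // 2) * 4 + local_y
    x = 7 - x  # mirror x to go from the back view to the front view
    return 1 + x + 8 * y
```

Whatever the real layout is, a useful sanity check is that the 64 combinations of asic_id and channel map to 64 distinct pixel ids.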

Multi-Grid
The Multi-Grid (MG) detector has been introduced at ILL and developed in a collaboration between ILL, ESS and Linköping University. The detector is based on thin converter films of boron-10 carbide [53,54] arranged in layers orthogonal to the incoming neutrons. The MG detector uses a stack of grids with a number of wires running through them.
Following the neutron conversion, signals are induced both on grids and wires, which are digitised and read out. The temporal and spatial coincidence of the signals on wires and grids is used to determine neutron positions. Signals can be induced on multiple grids, and for double neutron events also on wires. The detector geometry is three-dimensional, so our visualisation of the detector image consists of projections of the neutron counts onto the xy-, xz-, and yz-planes respectively as shown in figure 11.

Readout
The Multi-Grid readout system used for prototyping and demonstration detectors is based on stacked MMR readout boards supporting 128 channels, a Mesytec VMMR-8/16 VME receiver card supporting up to 16 readout links, and a SIS3153 Ethernet to VME interface card. It is self-triggered: when the Mesytec hardware registers a signal above a certain trigger-threshold, it triggers a readout of all channels with signals above a second threshold. This readout is then transmitted as UDP packets to the EFU. The binary data format is hierarchical as it supports multiple interface cards, each supporting multiple boards with up to 128 channels.

Processing requirements
The Mesytec UDP protocol has been partially reverse-engineered based on captured network traffic and the available documentation. The protocol parser must be able to support multiple triggers in a single packet, and to discard unused or irrelevant data fields. The data fields consist of 32 bit words each containing a command (8 bits), address/channel (12 bits) and ADC values (12 bits). The channel readouts are given in alternating order (1, 0, 3, 2, 5, 4, ...). All channels are assigned a single common 32-bit timestamp, in units of 16 MHz ticks, by the electronics. Thus temporal clustering is performed in hardware, but no continuous global time is currently available.
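Extracting the three fields from a 32 bit data word is a matter of shifting and masking, as in the Python sketch below. The bit positions are an assumption for illustration (command in the most significant 8 bits, followed by the 12 bit channel and the 12 bit ADC value); the text above only specifies the field widths. The function names are hypothetical.

```python
TICK_HZ = 16_000_000  # timestamp unit: 16 MHz ticks


def decode_word(word):
    """Split a 32 bit word into (command, channel, adc).

    Assumed layout: bits 31..24 command, bits 23..12 channel, bits 11..0 ADC.
    """
    command = (word >> 24) & 0xFF
    channel = (word >> 12) & 0xFFF
    adc = word & 0xFFF
    return command, channel, adc


def ticks_to_seconds(ticks):
    """Convert a timestamp in 16 MHz ticks to seconds."""
    return ticks / TICK_HZ
```

With a 32-bit counter at 16 MHz the timestamp wraps after 2^32 / 16 000 000 s, about 268 s, which is one reason a continuous global time has to come from elsewhere.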
The EFU then parses the channel readouts, and applies software thresholds. At this stage it discards inconsistent readouts. Channel readouts for Multi-Grid are then mapped to either a grid or a wire id. The current algorithm for the Multi-Grid event formation simply uses the maximum ADC values for grids and wires to determine the position. The processing steps thus consist of
• parse the binary Mesytec readout format to extract time, channel and ADC
• discard inconsistent readouts
• map channel to either grids or wires
• apply suppression thresholds independently for wires and grids
• check for coincidence (must involve both one wire and one grid)
• combine wire_id and grid_id to pixel_id
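The mapping, threshold, coincidence and pixel steps can be sketched in Python as follows, for readouts that already share one hardware timestamp. The channel layout (the first n_wire_channels channels are wires, the rest are grids), the default of 80 wire channels and the pixel numbering are assumptions for illustration only.

```python
def multigrid_event(time, readouts, n_wire_channels=80,
                    wire_threshold=0, grid_threshold=0):
    """Form one event from (channel, adc) readouts sharing a timestamp.

    Returns (time, pixel_id), or None if there is no wire/grid coincidence.
    """
    # Map channels to wires or grids and apply independent thresholds.
    wires = [(adc, channel) for channel, adc in readouts
             if channel < n_wire_channels and adc >= wire_threshold]
    grids = [(adc, channel - n_wire_channels) for channel, adc in readouts
             if channel >= n_wire_channels and adc >= grid_threshold]
    # A valid event needs at least one wire and one grid.
    if not wires or not grids:
        return None
    # The position is given by the wire and the grid with the largest ADC.
    _, wire_id = max(wires)
    _, grid_id = max(grids)
    return time, 1 + wire_id + n_wire_channels * grid_id
```

Using only the maximum ADC values keeps the algorithm cheap, at the cost of ignoring the additional information carried by the weaker signals on neighbouring grids.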

Multi-Blade
The Multi-Blade detector is a stack of Multi Wire Proportional Chambers operated at atmospheric pressure, with a continuous gas flow. It consists of a number of identical units, called cassettes. Each cassette holds a blade (a substrate coated with ¹⁰B₄C) and a two-dimensional readout system, which consists of a plane of wires and a plane of strips. The cassettes are arranged along a circular arc centred on the sample, and are angled slightly with respect to the neutron beam, for improved counting rate capability and spatial resolution. The operation is based on the temporal and spatial coincidence of signals on strips and wires. Despite inherently being a three-dimensional detector, the visualisations of the detector images typically display an "unfolded" two-dimensional pixel map. For further details of the design and performance of this detector see [56][57][58].

Readout
The Multi-Blade detector prototype currently has nine cassettes, each with 32 wires and 32 strips, for a total of 576 channels. The readout is based on six CAEN V1740D digitisers, and a custom readout application based on the API and software libraries supplied by CAEN. The digitisers each have 64 channels, 32 for wires and 32 for strips. The wires and strips are connected to the digitiser via front-end electronics boards. The final detector will have 32 wires and 64 strips per cassette, and up to 50 cassettes for a total of 4800 channels.
When the CAEN readout system detects a signal above a certain (hardware) threshold it triggers an individual readout of that channel. The readout consists of a channel number, a pulse integral (QDC), a time-stamp, and digitiser id. For each trigger there will be one or more signals from both wires and strips. The readout application continuously reads from the CAEN digitiser's hardware registers using optical links and transmits the raw data over UDP to the event formation unit.

Processing requirements
Readouts are subject to clustering analysis, where they are matched in both time and amplitude. The maximum timespan for which channels can be said to belong to the same cluster is a configurable parameter of the algorithm. For coincidence building there can be up to 2 wires and 4 strips in a cluster, where the typical case is one wire and two strips. Following clustering, we then calculate the pixel where the neutron was detected and add a timestamp. To summarise, the processing steps for Multi-Blade are:
• parse the UDP readout format to extract time, digitiser, channel and QDC values
• collect readouts in clusters
• map channel ids to either strips or wires
• check for coincidence (time and amplitude)
• combine wire_id and strip_id to pixel_id

It is possible to improve the spatial resolution by employing CoG (center of gravity) on strip readouts weighted by the deposited charge (QDC).
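The centre of gravity calculation amounts to a charge-weighted mean of the strip numbers in a cluster. A minimal Python sketch, with a hypothetical function name and input format, is:

```python
def centre_of_gravity(strip_hits):
    """Return the QDC-weighted mean strip position of a cluster.

    strip_hits: iterable of (strip_id, qdc) pairs (hypothetical format).
    Returns None if the total charge is not positive.
    """
    hits = list(strip_hits)
    total_charge = sum(qdc for _, qdc in hits)
    if total_charge <= 0:
        return None
    return sum(strip * qdc for strip, qdc in hits) / total_charge
```

For example, a cluster with QDC values 100 on strip 10 and 300 on strip 11 gives a position of 10.75, i.e. closer to the strip that collected more charge. Mapping such a fractional position to a pixel_id then requires a finer logical pixel grid than one pixel per strip.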
The measured amplitudes on the wires and on the strips are strongly correlated. This means that with sufficient dynamic range double neutron events, which would cause some ambiguity, might be resolved by requiring matching amplitudes [56].
The processing pipeline for Multi-Blade currently differs from the other detectors in that the code responsible for clustering and event formation runs in multiple incarnations, namely one for each cassette. This is a case where we explore the solution space for event processing. The approach has the advantage of supporting individual processing for each blade rather than having to explicitly maintain information about blade ids in the processing algorithm itself.
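The idea of running the clustering separately for each cassette can be sketched in Python as follows. The readout format (cassette, time, plane, channel), the function names and the rule that a cluster spans at most max_timespan from its first readout are assumptions for illustration; amplitude matching is left out to keep the sketch short.

```python
from collections import defaultdict


def cluster_by_time(readouts, max_timespan):
    """Group (time, plane, channel) readouts whose times lie within
    max_timespan of the first readout in the cluster."""
    clusters = []
    for readout in sorted(readouts):
        if clusters and readout[0] - clusters[-1][0][0] <= max_timespan:
            clusters[-1].append(readout)
        else:
            clusters.append([readout])
    return clusters


def has_coincidence(cluster):
    """A valid cluster contains at least one wire and one strip readout."""
    planes = {plane for _, plane, _ in cluster}
    return {"wire", "strip"} <= planes


def clusters_per_cassette(readouts, max_timespan):
    """Cluster (cassette, time, plane, channel) readouts per cassette."""
    by_cassette = defaultdict(list)
    for cassette, time, plane, channel in readouts:
        by_cassette[cassette].append((time, plane, channel))
    return {cassette: [cluster
                       for cluster in cluster_by_time(items, max_timespan)
                       if has_coincidence(cluster)]
            for cassette, items in by_cassette.items()}
```

Once the readouts are separated by cassette, the clustering function itself never needs to know which cassette it is working on, which is the advantage described above.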

Gd-GEM
The NMX macromolecular diffraction instrument will use the Gd-GEM detector technology. The neutron converter is a 25 µm thin foil of gadolinium, which also serves as cathode in a gas volume (Ar/CO₂ 70/30 at atmospheric pressure). After traversing the readout and the GEM foils, the neutron hits the converter where it is captured as shown in figure 12. After the neutron capture, gamma particles and conversion electrons are released into the gas volume.
The conversion electrons lose energy by ionizing the gas atoms, and create secondary electrons along their path. Due to an electric field, those secondary electrons drift away from the cathode to an amplification stage consisting of a stack of two or three GEM foils. Each electron generates a measurable amount of charge by an avalanche in the GEM holes, which induces a signal on a segmented anode.
This segmentation is realised by copper strips with a pitch of 400 µm. The signal on the strips is read out with a timing resolution of the order of 10 ns, such that projections of the tracks in the x-t and y-t plane can be used to combine hits in both planes (clustering) and reconstruct the neutron impact point (micro-TPC method) [59].

Readout
The analogue signals of the strips of the Gd-GEM detector are read out by the VMM ASIC developed by Brookhaven National Laboratory for the New Small Wheel Phase 1 upgrade [60]. The VMM has been implemented in the SRS [61] at CERN and a schematic drawing of the readout chain is shown in figure 13. The so-called front-end hybrids are directly mounted onto the detector. This PCB holds two VMM ASICs, each with 64 input channels connected to the anode strips with a spark protection circuit. For each hit strip where the signal surpasses a configurable threshold, the VMM outputs a 38 bit binary word, see table 2.
For the prototype a Spartan-6 FPGA on the hybrid controls the ASICs and bundles the data, which are transmitted via HDMI cables to the core of SRS, the Front-End Concentrator (FEC) card. Up to eight hybrids can currently be connected to one FEC and the data are encapsulated into UDP packets sent over a 1 Gb/s Ethernet connection to the readout computer [62]. The readout of the Gd-GEM detector is partitioned into 4 sectors. Each of these 4 sectors has 640 strips read out by 5 hybrids in x and y direction, resulting in a total of 5120 strips and 40 hybrids. If a signal is recorded on a detector strip, the VMM on the hybrid generates hit data. With the information tuple (channel, VMM ID, FEC ID), the geometrical position of each hit can be reconstructed. A configuration file that reflects this digital geometry of the detector is loaded during the start up phase of the DAQ. The configuration can be modified for reordering, exchange or extension of physical readout components.

Processing requirements
The Gd-GEM detector has the most demanding processing requirements in terms of the physical processes, the data acquisition and processing power. The steps required are
• parse the binary data from SRS readout and extract (time, channel, adc)-tuples
• queue up (time, channel, adc)-tuples until there is enough data to attempt clustering analysis
• perform clustering analysis and determine if coincidence occurred
• calculate neutron entry position for x and y
• convert positions to pixel_id

Some of the software-related challenges for Gd-GEM are scaling up to full rate and detector size, and discriminating invalid tracks. A neutron event generates a track with extensions in both time and space so it is not possible to just partition the detector in regions for independent parallel processing. Several processing options for distinguishing which tracks from which position can be extracted have been described in [63]. In addition, due to the required buffering of data, memory usage and cache performance may well be a concern.

Performance
The key metric we use for the evaluation of performance is the number of events the detector pipeline can process per second. To benchmark this, we use detector data recorded as Ethernet/UDP packets in a number of measurement campaigns. This data is then sent to the event formation system as fast as possible and the achieved rates are retrieved via Grafana.
The setup uses three machines: a macOS laptop acting as a data generator/detector readout, an Ubuntu workstation hosting the EFU and Kafka, and another Ubuntu workstation which hosts Graphite and Grafana metrics. The hardware specifications are listed in table 3. The tests were made on the latest event formation software [64]. For Gd-GEM we have implemented a performance test based on Google Benchmark, which directly targets the event processing algorithm, and is likely to present an upper bound for the performance in a single processing thread as there is no other overhead involved. Table 4 summarises the results of the performance measurements. It shows that a pipeline can support the reception and processing of around 85,000 UDP packets per second and several million readouts per second using one or two CPU cores. The reported event rates reflect the amount of computational work that has to be performed on the data: Gd-GEM has the most complex algorithm, Multi-Grid and Multi-Blade have medium complexity, and SoNDe requires the least processing.
The large uncertainty for Gd-GEM comes from the fact that for this detector technology a neutron event gives rise to a range of readouts of up to 20 strip hits for both x- and y-strips. Taking into account that a medium performance server can have two CPU sockets, each having 8 cores/16 hyper threads, we can naively scale these numbers to the very high event rates required at ESS by parallelisation. For example, by employing a small number (5 to 10) of servers, each dedicated to processing data from a fraction of the detector surface, we expect to scale the rates by more than an order of magnitude.

Conclusion
The previous sections have given an overview of the ESS software architecture for event processing in general and as implemented for four detector designs specifically. We have discussed the technology choices made and the toolchain used for software development. Finally, we presented recent performance numbers for four detectors which will be used in ESS instruments.
We have shown an architecture that can be scaled up to deal with the high neutron rates ESS will deliver. Without having spent much time on optimisation of the code so far, we have achieved event processing rates of the order of 1 to 25 M events per second, and have shown how this can be scaled to much higher performance using commodity hardware. The detectors in this paper represent a wide range of the processing requirements foreseen at ESS. The toolchain does not have to wait until after 2021, when ESS expects to see first beam on target, to become operational; it is already in use for data acquisition as detector development continues.
Most detectors are constructed by tiling identical and independent units. Scaling up the processing for these is easy, as we can employ multiple event formation units running in parallel.
Not all scalability problems have been solved yet, however. Future work will focus on scaling the Gd-GEM processing, as it is markedly more complicated than for the other detectors. For example, a simple partitioning of the detector surface may not work, because the charge tracks from a single neutron conversion can easily cross partition borders. Collaboration on this topic has already started. Work will also be done on deploying multiple processing pipelines on a multi-core CPU, where resource sharing problems, such as memory and network bottlenecks, will typically become more pronounced than observed so far.
We have anticipated some of these challenges and refer to appendix A for discussions on Ethernet transmission and software packet processing. Finally, further detector designs are under development for ESS, and work is still needed to understand their processing requirements and to implement the processing pipelines. This discussion is highly relevant for the ESS data transport architecture: the planned system for data acquisition at ESS is segmented into an FPGA-based backend transmitting UDP packets to the PC-based event formation software. Thus knowledge of the performance envelope of this interface allows us to make important design decisions on how to partition the system.
The appendix presents measurements of UDP performance and reliability as achieved by employing several optimisations. The measurements were performed on Xeon E5 based CentOS (Linux) servers. The measured data rates are very close to the 10 Gb/s line rate, and zero packet loss was achieved. The performance was obtained utilising a single processor core as transmitter and a single core as receiver. The results show that support for transmitting large data packets is a key parameter for good performance.
Optimizations for throughput are: MTU, packet sizes, tuning Linux kernel parameters, thread affinity, core locality and efficient timers.

A.1 Introduction
During experiments data are produced at high rates: detector data are read out by custom electronics, and the readings are converted into UDP packets by the readout system and sent to event formation servers over 10 Gb/s optical Ethernet links. The event formation servers are based on general purpose CPUs, and it is anticipated that most if not all data reduction at ESS will be done in software. This includes reception of raw readout data, threshold rejection, clustering and event formation. UDP is a simple protocol for connectionless data transmission [65], and packet loss can occur during transmission. Nevertheless UDP is widely used, for example in the RD51 Scalable Readout System [47] and the CMS trigger readout [66], both using 1 Gb/s Ethernet.

The two central components are the readout system and the event formation system. The readout system is a hybrid of analog and digital electronics. The electronics convert deposited charges into electric signals, which are digitised and timestamped. In the digital domain, simple data reduction such as zero suppression and threshold-based rejection can be performed. The event formation system receives these timestamped digital readouts and performs the necessary steps to determine the position of the neutron. These processing steps are different for each detector type.

The performance of UDP over 10G Ethernet has been the subject of previous studies [67,68], which measured TCP and UDP performance and CPU usage on Linux using commodity hardware. Both studies apply a certain set of optimisations but otherwise use standard Linux. In [67] the transmitting process is found to be a bottleneck in terms of CPU usage, whereas a comparison between Ethernet and InfiniBand [68] reinforces the earlier results and concludes that Ethernet is a serious contender for use in a readout system. This study is aimed at characterising the performance of a prototype data acquisition system based on UDP. It is not so much concerned with transmitter performance, as we expect to receive data from an FPGA-based platform capable of transmitting at wire speed at all packet sizes. Instead, comparisons between the measured and theoretically possible throughput, and measurements of packet error ratios, are presented. Finally, this paper presents strategies for optimising the performance of data transmission between the readout system and the event formation system.

A.2 TCP and UDP pros and cons
Since TCP is reliable and has good performance whereas UDP is unreliable, why not always just use TCP? The pros and cons are discussed in the following. Both TCP and UDP are designed to provide end-to-end communications between hosts connected over a network of packet forwarders. Originally these forwarders were routers, but today the group of forwarders includes firewalls, load balancers, switches, Network Address Translator (NAT) devices, etc. TCP is connection oriented, whereas UDP is connectionless. This means that TCP requires a connection to be set up before data can be transmitted. It also implies that TCP data can only be sent from a single transmitter to a single receiver. In contrast, UDP does not have a connection concept, and UDP data can also be transmitted as Internet Protocol (IP) broadcast or IP multicast. As mentioned earlier, the main argument for UDP is that it is often supported on smaller systems where TCP is not. A notable example is FPGA-based systems (see [69] for one case). For a brief overview of efforts to provide TCP/IP support in FPGAs see [70]. Moreover, some TCP features do not actually improve performance and reliability in the case of special network topologies, as explained below.

A.2.1 Congestion
Any forwarder is potentially subject to congestion and can drop packets when unable to cope with the traffic load. TCP was designed to react to this congestion. Firstly, TCP has a slow start algorithm whereby the data rate is ramped up gradually in order not to contribute to the network congestion itself. Secondly, TCP will back off and reduce its transmission rate when congestion is detected. In a readout system such as ours the network consists only of a data sender and a data receiver, with an optional switch connecting them. Thus the only places where congestion occurs are at the sender or receiver. The readout system will typically produce data at near constant rates during measurements, so with TCP, congestion at the receiver will cause the transmitter to reduce its data rate. This first causes buffering at the transmitting application until the buffer is full, and eventually packets are lost anyway.
Also, for some detector readout systems it is not even evident that guaranteed delivery is necessary. In one detector prototype we discarded around 24% of the data due to threshold suppression, so spending extra time making an occasional data retransmission (of order $10^{-4}$) may simply not be worth the added complexity. So while we argue that UDP is sufficient in our case, the determination of whether this holds true for other systems must be subject to further analysis.

A.2.2 Connections
Since TCP requires the establishment of a connection, both the receiving and transmitting applications must implement additional state to detect the possible loss of a connection, for example upon reset of the readout system after a software upgrade or a parameter change. With UDP the receiver will just 'listen' on a specified UDP port whenever it is ready and receive data when it arrives. Correspondingly, the transmitter can send data whenever it is ready. UDP reception supports many-to-one communication, allowing for example two or more readout systems to send to a single receiver. Supporting this with TCP would require handling multiple TCP connections.
Sending data larger than the MTU will result in the data being split into chunks of at most MTU size before transmission. Given a specific link speed and packet size, the packet rate is given by

$$\mathrm{rate}\,[\text{packets per second}] = \frac{ls}{8 \cdot (ps + ifg)}$$

where ls is the link speed in b/s, ps the packet size in bytes and ifg the inter frame gap in bytes. Thus for a 10 Gb/s Ethernet link, the packet rate for 64 byte packets is 14.88 M packets per second (pps), as shown in Table 5. Packets arriving at a data acquisition system are subject to a nearly constant per-packet processing overhead. This is due to interrupt handling, context switching, checksum validations and header processing. At almost 15 M packets per second this processing alone can consume most of the available CPU resources. In order to achieve maximum performance, data from the electronics readout should be bundled into jumbo frames if at all possible. Using the maximum Ethernet packet size of 9018 bytes reduces the packet rate, and with it the total per-packet overhead, by a factor of about 100. This does, however, come at the cost of larger latency. For example, the transmission time of 64 bytes + IFG is 67 ns, whereas for 9018 bytes + IFG it is about 7.2 µs. For applications sensitive to latency a tradeoff must be made between low packet rates and low latency.
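As a quick check of the figures quoted above, the packet rate expression can be evaluated directly. The following is a minimal sketch; the function names are hypothetical, and the per-packet gap of 20 bytes is the IFG value used in this appendix.

```cpp
#include <cassert>

// Per-packet gap on the wire in bytes, as used in the text (IFG = 20 bytes).
constexpr double kIfgBytes = 20.0;

// Packets per second on a link of link_bps bits per second when every
// packet occupies packet_bytes + ifg_bytes bytes on the wire.
constexpr double packet_rate(double link_bps, double packet_bytes,
                             double ifg_bytes = kIfgBytes) {
  return link_bps / (8.0 * (packet_bytes + ifg_bytes));
}

// Time in nanoseconds that one packet plus gap occupies the link.
constexpr double wire_time_ns(double link_bps, double packet_bytes,
                              double ifg_bytes = kIfgBytes) {
  return 8.0 * (packet_bytes + ifg_bytes) / link_bps * 1e9;
}

// Compile-time check against the 14.88 M pps quoted for 64 byte packets.
static_assert(packet_rate(10e9, 64) > 14.88e6 &&
                  packet_rate(10e9, 64) < 14.89e6,
              "64 byte packets at 10 Gb/s give about 14.88 M pps");
```

For a 10 Gb/s link, packet_rate(10e9, 64) evaluates to about 14.88 M pps and packet_rate(10e9, 9018) to about 138 k pps, which is the factor of roughly 100 between minimum size packets and jumbo frames.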
Not all transmitted data are of interest to the receiver; the remainder can be considered overhead. Packet headers are one such example. The Ethernet, IP and UDP headers are always present and take up a total of 46 bytes, as shown in Figure 15 (bottom). The utilisation of an Ethernet link can be calculated as

$$U = \frac{d}{d + 46 + pad + ifg}$$

where U is the link utilisation, d the user data size, ifg the inter frame gap and pad is the padding mentioned earlier (all in bytes). For user data larger than 18 bytes no padding is applied. This means that for small user payloads the overhead can be significant, making it impossible to achieve high throughput. For example, transmitting a 32 bit counter over UDP will take up 84 bytes on the wire (20 bytes IFG + 64 bytes for a minimum Ethernet frame) and the overhead will account for approx. 95% of the available bandwidth. In contrast, when sending 8972 bytes of user data the overhead is as low as 0.73%.
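The two overhead figures quoted above can be reproduced with a few lines of code. This is a minimal sketch that assumes the utilisation is the user data size divided by the total number of bytes occupying the wire (user data, 46 bytes of headers, padding up to the 64 byte minimum frame, and 20 bytes IFG); the constant and function names are hypothetical.

```cpp
#include <cassert>

constexpr unsigned kHeaderBytes = 46;   // Ethernet + IP + UDP headers
constexpr unsigned kMinFrameBytes = 64; // minimum Ethernet frame size
constexpr unsigned kIfgBytes = 20;      // inter frame gap as used in the text

// Padding needed to reach the minimum Ethernet frame size.
constexpr unsigned padding(unsigned user_bytes) {
  return user_bytes + kHeaderBytes < kMinFrameBytes
             ? kMinFrameBytes - (user_bytes + kHeaderBytes)
             : 0;
}

// Fraction of the link bandwidth that carries user data.
constexpr double utilisation(unsigned user_bytes) {
  return static_cast<double>(user_bytes) /
         (user_bytes + kHeaderBytes + padding(user_bytes) + kIfgBytes);
}

// A 4 byte counter is padded by 14 bytes to fill a 64 byte frame.
static_assert(padding(4) == 14, "4 + 46 + 14 = 64 bytes");
// From 18 bytes of user data onwards no padding is needed.
static_assert(padding(18) == 0, "18 + 46 = 64 bytes");
```

With these definitions, 1 - utilisation(4) is about 0.952 and 1 - utilisation(8972) is about 0.0073, matching the approx. 95% and 0.73% overhead stated in the text.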

A.3.2 Network buffers and packet loss
A UDP packet can be dropped in any part of the communications chain: the sender, the receiver, or intermediate systems such as routers, firewalls, switches, load balancers, etc. This makes it difficult in general to rely on UDP for high speed communications. However, for simple network topologies such as the ones found in detector readout systems it is possible to achieve very reliable UDP communications. When for example the system comprises two hosts (sender and receiver) connected via a switch of high quality, packet loss is mainly caused by the Ethernet Network Interface Card (NIC) transmit queue and the socket receive buffer size. Fortunately these can be optimised. The main parameters for controlling socket buffers are rmem_max and wmem_max. The former limits the size of the socket receive buffer, whereas the latter limits the size of the socket transmit buffer. To change the buffer sizes from an application use setsockopt(), for example

    int buffer = 4000000;
    setsockopt(s, SOL_SOCKET, SO_SNDBUF, &buffer, sizeof(buffer));
    setsockopt(s, SOL_SOCKET, SO_RCVBUF, &buffer, sizeof(buffer));

In addition there is an internal queue for packet reception, whose size (in packets) is named netdev_max_backlog, and a network interface parameter, txqueuelen; both were also adjusted.
The default values of these parameters on Linux are not optimized for high speed data links such as 10 Gb/s Ethernet, so for this investigation the following values were used:

    net.core.rmem_max=12582912
    net.core.wmem_max=12582912
    net.core.netdev_max_backlog=5000
    txqueuelen 10000

These values have largely been determined by experimentation. We also configured the systems with an MTU of 9000, allowing user payloads of up to 8972 bytes when taking into account that IP and UDP headers are also transmitted.
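Since the kernel silently limits the requested buffer size to rmem_max, it can be useful to read the value back after setting it. The sketch below requests a receive buffer on a fresh UDP socket and returns the size reported by getsockopt(); the function name granted_rcvbuf is hypothetical. On Linux the reported value is typically twice the accepted request, because the kernel reserves space for its own bookkeeping.

```cpp
#include <cassert>
#include <sys/socket.h>
#include <unistd.h>

// Request a receive buffer of `requested` bytes on a new UDP socket and
// return the size reported back by the kernel, or -1 on error.
int granted_rcvbuf(int requested) {
  const int s = socket(AF_INET, SOCK_DGRAM, 0);
  if (s < 0)
    return -1;
  int granted = -1;
  socklen_t len = sizeof(granted);
  // Note that the option value is passed by address.
  if (setsockopt(s, SOL_SOCKET, SO_RCVBUF, &requested, sizeof(requested)) != 0 ||
      getsockopt(s, SOL_SOCKET, SO_RCVBUF, &granted, &len) != 0)
    granted = -1;
  close(s);
  return granted;
}
```

If granted_rcvbuf(4000000) returns much less than expected, the rmem_max limit is too low for the requested size and should be raised before running high rate measurements.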

A.3.3 Core locality
Modern CPUs rely heavily on cache memories to achieve performance. This holds for both instruction and data access. For Xeon E5 processors there are three levels of cache. Some are shared between instructions and data, some are dedicated. The L3 cache is shared across all cores and hyper-threads, whereas the L1 cache is only shared between the two hyper-threads of a core. The way to ensure that the transmit and receive applications always use the same cache is to 'lock' the applications to specific cores. For this we use the Linux command taskset and the pthread API function pthread_setaffinity_np(). This prevents the application processes from being moved to other cores, which would interrupt the data processing, but it does not prevent other processes from being scheduled onto the same core.
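A minimal sketch of locking a thread to a core is shown below. It uses the Linux call sched_setaffinity() with pid 0, which applies to the calling thread, rather than pthread_setaffinity_np(); this is a substitution made to keep the example short, and the function name pin_to_first_allowed_cpu is hypothetical. The CPU_* macros require _GNU_SOURCE, which g++ defines by default on Linux.

```cpp
#include <cassert>
#include <sched.h>

// Pin the calling thread to the first CPU it is currently allowed to run on.
// Returns the CPU number, or -1 on failure.
int pin_to_first_allowed_cpu() {
  cpu_set_t allowed;
  CPU_ZERO(&allowed);
  if (sched_getaffinity(0, sizeof(allowed), &allowed) != 0)
    return -1;
  for (int cpu = 0; cpu < CPU_SETSIZE; ++cpu) {
    if (CPU_ISSET(cpu, &allowed)) {
      cpu_set_t one;
      CPU_ZERO(&one);
      CPU_SET(cpu, &one);
      // Restrict the affinity mask to this single CPU.
      if (sched_setaffinity(0, sizeof(one), &one) != 0)
        return -1;
      return cpu;
    }
  }
  return -1;
}
```

In a real measurement the core would be chosen explicitly, for example one that shares its L3 cache with the core handling the NIC interrupts, instead of simply taking the first allowed one.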

A.3.4 Timers
The transmitter and receiver applications for this investigation periodically print out the measured data speed, PER and other parameters. Initially the standard C++ chrono class timer was used (version: libstdc++.so.6). However, profiling showed that significant time was spent here, enough to affect the measurements at high loads. Instead we decided to use the CPU's hardware based Time Stamp Counter (TSC). The TSC is a 64 bit counter running at CPU clock frequency. Since processor speeds are subject to throttling, the TSC cannot be directly relied upon to measure time. In this investigation time checking is therefore a two-step process: first we estimate when it is time to do the periodic update based on the inaccurate TSC value. Then we use the more expensive C++ chrono functions to calculate the elapsed time used in the rate calculations. An example of this is shown in the source code, which is publicly available. See Section B for instructions on how to obtain the source code.
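The two-step approach can be sketched as follows. This is an illustration only and not the code from the repository: the class name PeriodicTimer and its methods are hypothetical, the TSC is read with the __rdtsc() intrinsic available in GCC and Clang on x86, and a chrono-based fallback is provided only so that the sketch compiles on other architectures.

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>

#if defined(__x86_64__) || defined(__i386__)
#include <x86intrin.h>
// Cheap but inaccurate tick source: the CPU time stamp counter.
inline uint64_t cheap_ticks() { return __rdtsc(); }
#else
// Fallback for non-x86 builds so the sketch still compiles; not a TSC.
inline uint64_t cheap_ticks() {
  return static_cast<uint64_t>(
      std::chrono::steady_clock::now().time_since_epoch().count());
}
#endif

class PeriodicTimer {
public:
  // tick_interval is the approximate number of ticks between updates.
  explicit PeriodicTimer(uint64_t tick_interval)
      : interval_(tick_interval), last_ticks_(cheap_ticks()),
        last_time_(std::chrono::steady_clock::now()) {}

  // Step 1: cheap test, suitable for calling once per received packet.
  bool maybe_due() const { return cheap_ticks() - last_ticks_ >= interval_; }

  // Step 2: accurate elapsed time in seconds for the rate calculation.
  // Also restarts both the cheap and the accurate reference points.
  double elapsed_seconds_and_restart() {
    const auto now = std::chrono::steady_clock::now();
    const std::chrono::duration<double> elapsed = now - last_time_;
    last_time_ = now;
    last_ticks_ = cheap_ticks();
    return elapsed.count();
  }

private:
  uint64_t interval_;
  uint64_t last_ticks_;
  std::chrono::steady_clock::time_point last_time_;
};
```

In a receive loop, maybe_due() would be called for every packet, and only when it returns true would elapsed_seconds_and_restart() be used to convert the byte and packet counters into rates.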

A.4 Testbed for the experiments
The experimental configuration is shown in Figure 16. It consists of two hosts, one acting as a UDP data generator and the other as a UDP receiver. The hosts are HPE ProLiant DL360 Gen9 servers connected to a 10 Gb/s Ethernet switch using short (2 m) single mode fibre cables. The switch is an HP E5406 switch equipped with a J9538A 8-port SFP+ module. The server specifications are shown in table 6. Except for processor internals the servers are equipped with identical hardware. The data generator is a small C++ program using BSD sockets, specifically the sendto() system call for transmission of UDP data. The data receiver is based on a DAQ and event formation system developed at ESS as a prototype. The system, named the Event Formation Unit (EFU), supports loadable processing pipelines. A special UDP 'instrument' pipeline was created for the purpose of these tests. Both the generator and the receiver use setsockopt() to adjust transmit and receive buffer sizes. Sequence numbers are embedded in the user payload by the transmitter, allowing the receiver to detect packet loss and hence to calculate packet error ratios. Both the transmitting and receiving applications were locked to a specific processor core using the taskset command and the pthread_setaffinity_np() function. The measured user payload data rates were calculated using a combination of fast timestamp counters and microsecond counters from the C++ chrono class. Care was taken not to run other programs that might adversely affect performance while performing the experiments. CPU usage was calculated from the /proc/stat pseudofile, as also done in [67].
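The use of sequence numbers for loss detection can be sketched as follows. The class below is an illustration and not the EFU implementation; the name LossCounter and the definition of the packet error ratio as lost packets divided by the sum of lost and received packets are assumptions. Packets arriving with a sequence number lower than expected are ignored, so a reordered packet is counted as lost.

```cpp
#include <cassert>
#include <cstdint>

class LossCounter {
public:
  // Call once per received packet with the sequence number from its payload.
  void on_packet(uint64_t seq) {
    if (seq < next_)
      return; // late or duplicated packet: the gap was already counted as loss
    lost_ += seq - next_; // every skipped sequence number is a lost packet
    next_ = seq + 1;
    ++received_;
  }

  uint64_t received() const { return received_; }
  uint64_t lost() const { return lost_; }

  // Assumed definition: lost / (lost + received).
  double packet_error_ratio() const {
    const uint64_t sent = received_ + lost_;
    return sent == 0 ? 0.0 : static_cast<double>(lost_) / sent;
  }

private:
  uint64_t next_ = 0;     // next expected sequence number
  uint64_t received_ = 0; // packets accepted in order
  uint64_t lost_ = 0;     // sequence numbers never seen in order
};
```

For example, receiving the sequence 0, 1, 2, 5, 6 gives five received packets, two lost packets and a packet error ratio of 2/7.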
A measurement series typically consisted of the following steps: The above steps were then repeated for measurements of CPU usage using /proc/stat averaged over 10 second intervals.
A series of measurements of speed, packet error ratios and CPU usage were made as a function of user data size, for the reasons discussed in Section A.3.1.

A.4.1 Experimental limitations
The current experiments are subject to some limitations. We do not, however, believe that these pose any significant problems in the evaluation of the results. The main limitations are described below.

Multi user issues:
The servers used for the tests are multi user systems in a shared integration laboratory. Care was taken to ensure that other users were not running applications at the same time, to avoid competition for CPU, memory and network resources. However, a number of standard daemon processes were running in the background, some of which trigger the transmission of data and some of which are triggered by packet reception.

Measuring affects performance:
Several configuration, performance and debugging tools need access to kernel or driver data structures. Examples we encountered are netstat, ethtool and dropwatch. However, the use of these tools can cause additional packet drops when running at high system loads. These tools were therefore not run while measuring packet losses.

Packet reordering:
The test application is unable to detect misordered packets. Packet reordering is highly unlikely in the current setup, but if it occurred it would be falsely reported as packet loss.

Packet checksum errors:
The NICs verify Ethernet and IP checksums in hardware. Packets with wrong checksums will therefore not be delivered to the application and will be counted as packet loss. For the purpose of this study this is the desired behaviour.

A.5 Performance
The performance results cover user data speed, packet error ratios and CPU load. These topics are covered in the following sections.

A.5.1 Data Speed
The result of the measurements of achievable user data speeds is shown in Figure 17 (a). The figure shows both the measured and the theoretical maximum speed. For packets with user data sizes larger than 2000 bytes the achieved rates match the theoretical maximum. However, at smaller data sizes the performance gap increases rapidly. It is clear that either the transmitter or the receiver is unable to cope with the increasing load. This is mainly due to the higher packet arrival rates occurring at smaller packet sizes. The higher rates increase the total per-packet overhead and also the number of interrupts and system calls. At the maximum data size of 8972 bytes the CPU load on the receiver was 20%. This study supplements independent measurements done earlier [67] and reveals differences in performance across different platforms. The observed differences are likely to be caused by differences in CPU generations, Ethernet NIC capabilities and Linux kernel versions. These differences were not the focus of our study and have not been investigated further. But they do indicate that some performance numbers are difficult to compare directly across setups. They also provide a strong hint to DAQ developers: when upgrading hardware or kernel versions in a Linux based DAQ system, performance tests should be done to ensure that specifications are still met.
There are several ways to improve performance in order to achieve 10 Gb/s with smaller packet sizes, but the complexity increases. For example, it is possible to send and receive multiple messages using a single system call, such as sendmmsg() and recvmmsg(), which will reduce the number of system calls and should improve performance. It is also possible to use multiple cores for the receiver instead of only one as we did in this test. This adds complexity, since packets must be distributed across cores in case this cannot be done automatically. One method for automatic load distribution is to use Receive Side Scaling (RSS). However, this requires the transmitter to use several different source ports in the UDP packets instead of the single one currently used. This may require changes to the readout system. It is also possible to move network processing away from the kernel and into user space, avoiding context switches, and to change from interrupt driven reception to polling. These approaches are used in the Intel Data Plane Development Kit (DPDK) software packet processing framework.
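To illustrate the first of these options, the sketch below receives a batch of datagrams with a single recvmmsg() call. It was not part of the measurements reported here; the function name receive_batch and the buffer layout are hypothetical. recvmmsg() is Linux specific and requires _GNU_SOURCE, which g++ defines by default.

```cpp
#include <cassert>
#include <sys/socket.h>
#include <sys/uio.h>
#include <vector>

// Receive up to buffers.size() datagrams from fd with one system call.
// Each datagram is written into its own pre-allocated buffer and its length
// is appended to lengths. Returns the number of datagrams received, or -1
// if none could be read (for example because none was waiting).
int receive_batch(int fd, std::vector<std::vector<char>> &buffers,
                  std::vector<unsigned> &lengths) {
  const unsigned batch = static_cast<unsigned>(buffers.size());
  std::vector<mmsghdr> msgs(batch); // value-initialised, i.e. zeroed
  std::vector<iovec> iovs(batch);
  for (unsigned i = 0; i < batch; ++i) {
    iovs[i].iov_base = buffers[i].data();
    iovs[i].iov_len = buffers[i].size();
    msgs[i].msg_hdr.msg_iov = &iovs[i];
    msgs[i].msg_hdr.msg_iovlen = 1;
  }
  // MSG_DONTWAIT: return whatever is queued instead of blocking.
  const int n = recvmmsg(fd, msgs.data(), batch, MSG_DONTWAIT, nullptr);
  lengths.clear();
  for (int i = 0; i < n; ++i)
    lengths.push_back(msgs[i].msg_len);
  return n;
}
```

With jumbo frames the buffers would be sized to at least 8972 bytes each. Whether batching pays off at small packet sizes would have to be verified by measurement on the target system.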

B Source code
The software for this project is released under a BSD license and is freely available on GitHub [73]. To build the programs used for the UDP performance experiments [74], complete the steps below. To build and start the producer:

    > git clone https://github.com/ess-dmsc/event-formation-unit
    > cd event-formation-unit/udp
    > make
    > taskset -c coreid ./udptx -i ipaddress

To build and start the receiver:

    > git clone https://github.com/ess-dmsc/event-formation-unit
    > mkdir build
    > cd build
    > cmake ..
    > make
    > ./efu2 -d udp -c coreid

The central source files for this paper are udp/udptx.cpp for the generator and prototype2/udp/udp.cpp for the receiver. The programs have been demonstrated to build and run on Mac OS X, Ubuntu 16 and CentOS 7.1. However, some additional libraries need to be installed, such as librdkafka and Google FlatBuffers.

Figure 5. Grafana dashboard used for performance measurements of an implementation of Multi-Blade data processing.

Figure 6. Jenkins monitor view for all data management projects and branches.

Figure 7. High level architecture of the ESS readout system.

Figure 8. An example of a possible mapping of the digital geometry for x-strips for Gd-GEM. Strip 1 corresponds to asic 0, channel 0, front end card id 0 and strip 1280 to asic 1, channel 64, front end card id 9.

Figure 11. Grafana dashboard and live detector images from a recent test run with low neutron intensity at the Source Testing Facility at Lund University [55].

Figure 13. The Gd-GEM readout and data acquisition system, from [61].

Figure 14. Gd-GEM detector visualisation: screen dump of live detector image, a sampled particle track and strip histograms.

Figure 17. Performance measurements. a) User data speed. b) Packet Error Ratio. c) CPU Load. Note that for the optimized values PER is zero for user data larger than or equal to 2200 bytes (solid line).

Table 3. Machine configurations for the performance test setup.

Table 4. Measured performance for detector pipelines.

Table 5. Packet rates as a function of packet size for 10 Gb/s Ethernet.

Table 6. Hardware components for the testbed.