A Deep Learning-based Reconstruction of Cosmic Ray-induced Air Showers

We describe a method of reconstructing air showers induced by cosmic rays using deep learning techniques. We simulate an observatory consisting of ground-based particle detectors with fixed locations on a regular grid. The detector's responses to traversing shower particles are signal amplitudes as a function of time, which provide information on transverse and longitudinal shower properties. In order to take advantage of convolutional network techniques specialized in local pattern recognition, we convert all information to the image-like grid of the detectors. In this way, multiple features, such as arrival times of the first particles and optimized characterizations of time traces, are processed by the network. The reconstruction quality of the cosmic ray arrival direction turns out to be competitive with an analytic reconstruction algorithm. The reconstructed shower direction, energy and shower depth show the expected improvement in resolution for higher cosmic ray energy.


Introduction
The successes of the deep learning ansatz in handwriting recognition [1,2], speech recognition [3], and competitions with humans regarding image identification and playing games [4,5,6] motivates its application in various areas of fundamental research. For an overview of deep learning techniques see [7].
In physics research, tasks like reconstructing particle kinematics or signal separation from background processes often require algorithms that take multiple observables into account in order to obtain a few parameters of interest. Machine learning techniques are frequently used for such complex problems, with boosted decision trees and shallow neural networks being among the most popular methods.
Concepts of deep neural networks have recently been investigated for challenges in astroparticle and particle physics. Various network designs have been successfully applied in simulation studies, e.g., to reconstruct the neutrino flavor in neutrino nucleus interactions [8], to extract a new exotic particle or Email address: erdmann@physik.rwth-aachen.de (M. Erdmann) Higgs boson signal from a background dominated data sample [9,10,11], to identify the underlying parton flavor of a jet or measure jet substructure [12,13], or to assign jets to the underlying partons in top quark-associated Higgs boson production [14].
In this work we investigate deep learning methods for reconstructing cosmic ray properties from simulated air showers. As our observatory of the air showers we use ground-based particle detectors located on a hexagonal grid, each delivering charge measurements as a function of time.
Reconstruction of the cosmic ray properties from air showers includes arrival direction and energy. In addition, the atmospheric depth of the shower maximum is of interest to obtain information on the cosmic ray mass. All this information is commonly extracted from the data using algorithms based on physics arguments which were developed by astroparticle physicists.
As an alternative method, we exploit deep learning techniques to reconstruct the above-mentioned properties of cosmic rays from simulated data. Our aim is to investigate the abilities of the network to learn about various aspects encoded in the data, and to evaluate the reconstruction quality of the cosmic ray arrival direction, energy, and shower maximum.
We designed the network to study the following aspects separately. One aspect is the spatial distribution of signals on the ground. Further information is contained in the time of the first signals arriving at different detector stations on the ground. A third aspect is the time trace of the detector signals which encodes arrival times of different particles from different phases of the shower development.
This work is structured as follows. First we introduce a parametrized simulation of air showers. We then explain the details of the network architecture and discuss how the data are input to the network. We evaluate the quality of air shower reconstructions obtained with the network using various aspects of the data before presenting our conclusions.

Parametrized simulation of air showers
For our investigations we developed a parametrized air shower simulation code, and a corresponding detector simulation inspired by [15,16]. The simulation code is published in [17]. This program enables the efficient generation of a large number of simulated events for network training and evaluation on one hand, and direct control of the information contained in the data on the other hand. On output the simulation delivers the same information as obtained by a ground-based particle detector such that there are no limitations in extending our studies to detailed air shower simulations.
The results of the air shower simulations are parametrized directly in terms of the signal distributions obtained in the detectors. The detectors are placed on a hexagonal grid with a spacing of 1500 m and are located at a height of 1400 m above sea level, motivated by the Pierre Auger Observatory [18].
We consider only the electromagnetic part and the muon part of the air shower, both of which depend on the arrival directions, energies, and nuclear masses of the cosmic rays.
The energy E of the cosmic rays is randomly selected between 3 and 100 EeV following a power law E −1 . The mass composition A of the cosmic rays consists of H, He, N, Fe with equal fractions. The arrival directions (θ, φ) are chosen following an isotropic distribution with zenith angles of up to θ = 60 • . Finally, the impact point of the shower on the surface, the shower core, is randomly sampled around the central detector.
For the chosen values of the energy and mass (E, A), two decisions are taken which influence the detector signals. The first decision is the spatial reference point for calculating distances of the detectors to the shower, see Fig. 1. The distribution of the maximum of the atmospheric shower depth X max can be well approximated by a Gumbel distribution G(E, A) [19]. The randomly selected X max value from G(E, A), together with the arrival direction (θ, φ) and the shower core, defines the reference point for all subsequent geometry calculations. Starting here, the movement of the shower front is approximated in terms of a plane wave.
The second decision to depend on the chosen (E, A) values is the relative energy distribution between the electromagnetic and the muonic components of the shower. For proton primaries, we set the electromagnetic energy to 70% of the cosmic ray energy and the muon part to 30%, respectively. For all other nuclei, the relative energy contained in the muonic component is up-scaled by a factor A 0.15 [20]. The signal distributions in the detectors of both the electromagnetic and the muonic components are calculated separately and are finally superimposed.
Each detector is supposed to have its own clock recording a universal time t 0 when the trigger starts the electronics to record particle signals as a function of time t. The time t 0 is calculated according to the movement of the shower front.
The time trace S(t) of particle signals arriving at the detector contains 80 intervals of 25 ns size. The shape of the signal distributions is approximated by a log-normal distribution for both the electromagnetic and the muon signal (j = µ, em), following [16]: Here t is a reference time to cancel time units, ∆t j a time offset specified below, τ j the location parameter of the log-normal distribution and σ = 0.7 the shape parameter. To simulate the effect of a lateral distribution function of the shower, the location parameter depends on the transverse distance r of the detector to the shower axis. To also include the absorption of shower particles in the atmosphere, the difference ∆X between the atmospheric depth of the detector and the above-mentioned reference point of the shower at X max is calculated. a j contains a global offset, and b j and d j are weights for the lateral distance and the absorption effects, which are different for the muonic and the electromagnetic components. The choice of parameters has been adapted to approximate distributions obtained from full shower simulations [16] with respect to the muonic component, ∆t µ /t = 0. The total energy deposit in the detector, or total signal strength S 0 respectively, respects the same effects of the lateral distribution function and the atmospheric absorption as presented in (2): Here the parameters are α µ = −4.7, α em = −6.1, β µ = 0.1, β em = 0.4, r = 1000 m, X = 100 g/cm 2 . For each detector i, the signal distribution S i (t) as a function of time is obtained by a Monte Carlo method. Random values are chosen according to (1) where the number of drawn values is proportional to the signal strength S j,0 (see eq. 4). To prevent diverging signals a maximum signal is set with respect to the shower core.
Finally, effects of noise contributions are added to the signal distributions. In each interval of the time trace a uniformly distributed noise between 0 − 5% of the signal and a Gaussian background noise are added to the interval (µ = 1.2, σ = 0.6).
Also the trigger times t 0 of each detector are subjected to noise effects expressed in terms of Gaussian distributions where the width varies between 0.06 µs and 0.32 µs depending on the total signal (S 0 = S em,0 + S µ,0 ) in the detector. Fig. 2 shows example distributions of the simulated time traces for a single event in a detector near the shower core (Fig. 2a), and in a detector at some distance from the shower core (Fig. 2b). While the muonic component always arrives early, the electromagnetic component is relatively late for the detector located further away from the shower core.

Data preprocessing
The available data for the reconstruction of cosmic ray-induced air showers consist of the time t 0,i when the first shower particles arrive at each detector i and the time traces S i (t) of the detectors.
In order to fully exploit advantages arising from network techniques specialized in local pattern recognition and to achieve good stability and efficiency in optimizing the network, the data are transformed to suitable formats. These concern geometry aspects on one hand, and the numerical ranges of the data values on the other hand.
Geometry of the detector array. The success of the so-called convolutional network technique [21] in computer vision relies on the analysis of sub-regions of an image with a spatially invariant operation. The convolution technique can be understood as sliding a small filter, say, 3 × 3 pixels, in regular steps over the image and applying it to the respective image subregions.
In a figurative sense, measuring a cosmic rayinduced air shower with an arrangement of identical particle detectors can be considered as taking a pixelized image of the shower footprint. In contrast to a digital camera where the pixels typically form a Cartesian grid, many particle detectors are arranged on a hexagonal grid. Hence, a suitable representation of the hexagonal detector grid is needed in order to apply the convolution technique for investigating subregions in the shower footprint.   3 shows two such representations, using offset and axial coordinates, respectively. While offset coordinates allow for a more compact representation of rectangular sections than axial coordinates, a possible disadvantage arises from the changing filter shape depending on the position where the filter is applied. This is illustrated on the left of Fig. 3 where 3 × 3 filters at two different positions are shown in blue. When comparing the two positions, one can see that in offset coordinates the pixels in the top and bottom row of the filter are at different positions relative to those in the middle row.
Such a changing relationship may be problematic for learning spatial correlations. However, in practice we observed no difference in reconstruction performance between offset and axial coordinates. It is possible that translational invariance is less important in this application because the shower core is always centered close to the image center. Consequently, we choose offset coordinates to allow for a more compact representation. With this mapping, the triggered stations for all showers in the simulated dataset are contained within a 9 × 9 grid, which is the size used in the following.
Normalization of arrival time data. The arrival times of the first particles in all detectors of the observatory provide a two-dimensional map containing information that can be used to extract, e.g., the shower direction. This will be discussed in a later section.
Prior to injecting data to the network it is advantageous to limit the numerical range of the data values by a suitable transformation. The trigger times t 0,i of the first particles arriving at the detectors i are averaged within an event, and are zerocentered by taking this average value τ event as a reference value:t The spread σ t,data set of these time values is obtained by the standard deviation of the transformed time values (t 0,i − τ event ) using all stations and all events of the data sample. Stations without a signal are set to zero,t 0,i = 0.
Normalization of the amplitudes. The lateral energy distribution of the shower approximately follows a power law [18]. In order to obtain a more linear input to the network, the amplitudes S i (t) of each time trace are transformed by a logarithmic function:S Here an offset of 1 is used to map stations without signal to zero, S i (t) = 0 →S i (t) = 0. Also, the total signal of the time trace is calculated for each detector by summing the amplitudes S tot,i = S i (t) of all 80 time intervals. The same transformation as for the single amplitudes is applied here:S tot,i = log 10 [S tot,i + 1] The transformed total signals deliver a twodimensional map which contains information that can be used to extract, e.g., the total energy contained in the shower. This will also be discussed in a later section.
Summary of the input data. The information available to the network on input are the 2 twodimensional maps of the transformed total signals S tot and the transformed arrival timest 0 of the first particles. In addition, we have the time traces of the detectors i with the transformed amplitudes S i (t) which contain information on the longitudinal shower development. Instead of usingS i (t) directly we will use a subnetwork to extract a number of features f j characterizing these distributions. For example, one feature may characterize the initial rise of the distribution, another one the length of the distribution etc. During the training process, the network will learn features of the time traces which are optimal for predicting the requested shower property. The exact choice of features remains hidden. Every feature f j will be evaluated for all detectors such that the f j provide N = 10 two-dimensional maps. These feature maps can be combined directly with the above maps ofS tot andt 0 in further network layers.

Network architecture and training
To reconstruct the individual shower properties, shower direction, energy and X max , we set up and train separate networks of the same structure. The network architecture, depicted in Fig. 4, consists of three functional parts. The first part characterizes the time traces at each detector station. Then follows the main network part performing a series of convolutions on the stacked maps of arrival time, total signal and extracted time trace features. Finally, a dense layer predicts one or three output values depending on the reconstruction task. As deep learning framework we use Keras [22] with Tensor-Flow as backend [23]. Our network implementation is published in [17].
In order to characterize these time traces, three consecutive layers of 1D convolutions are applied to each detector station separately. Technically this is implemented using 3D convolutions with filters of size 1×1 in the spatial dimensions, thus performing the same operations on the time traces.
For the three layers we use {64, 32 and 10} filters of size {7, 7 and 4}, sliding over the time traces with a stride of {4, 4 and 1} without padding. By striding, each convolution sees the time trace at a lower resolution and broader scale. After the third convolution the time dimension is consumed and the output tensor is of shape (9, 9, 1, 10) with the fourth dimension holding the 10 extracted features f j . The consumed time dimension is removed by reshaping to (9,9,10).
This network part can be interpreted as a single function f (S(t)) = f j characterizing the time trace of each detector by a small number of features. Since the entire network is trained as a whole, the function will learn to extract those 10 features which are most useful for the following network in the given task.
Densely connected convolutions. For the main network part, the 9 × 9 images of extracted time trace features are concatenated with arrival time and total signal, yielding a tensor of shape (9,9,12). Two design choices are central to this network part: depth-wise separable convolutions and shortcuts by densely connected layers.
Due to the different physical meaning of each of these 12 feature maps, we expect that spatial correlations within a feature map are sufficiently decoupled from correlations between features, such that it is preferable not to map them jointly. We incorporate this assumption into the network structure by making use of depth-wise separable convolutions [24]. These layers first perform spatial convolutions separately on each feature map, before correlating all feature maps pixel-wise in a second step. This factorization of spatial and cross-feature correlations allows for increased computational and parameter efficiency compared to standard convolutions.
Shortcuts have proven to be an essential feature for successfully training deep neural networks [5,25]. For air shower reconstruction we use a network with densely connected layers based on the DenseNet architecture [26]. In this architecture, the output of a layer is provided as input to each subsequent layer. Technically this is implemented by concatenating the input of a layer with its output, which then forms the input to the following layer.
The benefit is twofold: First, these shortcuts allow the signal and gradients to better propagate in the forward and backward pass, respectively. Second, the unaltered feature maps corresponding to arrival time, total signal and time trace features are input through these dense connections to each weight layer in the main network part as shown in Fig. 4. This allows the network to better represent mathematical expressions from standard reconstruction methods. In the same way, the derived features are input to all following layers, which enforces feature reuse.
Concretely, we use a series of five separable convolutions and concatenations. The number f of output features in each separable convolution is chosen equal to the input features, hence f = {12, 24, 48, 96, 192}, and the output of each concatenation is 2f . The filter size is 3×3 and padding is used to keep the spatial size constant.
Output layer. Finally, the (9,9,192) output tensor of the network's main part is flattened and projected with a single, fully connected weight layer without activation function onto the regression target. For the tasks of reconstructing the energy or the depth of shower maximum, a single output predicts E/EeV and X max /(g/cm 2 ), respectively. For shower direction reconstruction we avoided having the network predict (φ, θ) directly since the spherical angles contain a pole at θ = 0 and a periodicity at φ = 0 = 2π. Instead, the network predicts the Cartesian (x, y, z) unit vector. Our tests showed that normalizing the predicted vector to unit length before the loss function does not result in an improved angular resolution.
Discussion of the network design. We use ReLU as nonlinearity after all weighted layers, except for the output layer [7]. As alternatives we considered tanh and ELU [27], however we found no improvement in terms of training efficiency or final performance.
Tested variations of the main network part include a simple series of convolutions as well as additional residual shortcuts, each of varying depth and number of parameters, and considering both standard and separable convolutions. None of these variations showed improved results.
Interestingly, our tests also showed no benefits of using signal normalization, such as batch normalization [28], in any part of the network. When reconstructing the shower direction, batch normalization even had a contrary effect. We conjecture that reconstructing the shower direction from the arrival times on the ground requires a level of finetuning that is negatively affected by the noise induced through batch normalization. With the training procedure described below, the networks show little generalization error, thus regularization is not required. Using dropout [29] in combination with an increased network capacity showed no significant improvement.
Training. The reconstruction task is set up as a regression problem using the mean squared error between predicted and target value as loss function. Since ReLU activations are used, we initialize all weights according to the MSRA scheme [30]. As optimizer we use ADAM [31] with standard values for the initial learning rate α = 10 −3 and for the first two moments' exponential decay rates β 1 = 0.9, β 2 = 0.999.
The network has 84,547 parameters for reconstructing scalar shower properties (energy, shower maximum), and 115,653 parameters for reconstructing the arrival direction. It is trained on 358,000 showers in batches of 132. Each training is run on a single NVIDIA GeForce GTX 1080 GPU. The training time per epoch is ∼ 100 s and inference on a batch of 132 showers takes 0.04 s.
During training, the validation loss is monitored on a fixed validation set of 2,000 showers and used for reducing the learning rate. Specifically, the learning rate is reduced by a factor of around 0.5 whenever the validation metric reaches a plateau. Each training is run over 120 epochs.
For each task we train five networks and select the best performing network based on the validation set. The reconstruction performance is then evaluated on two independent test sets. One set contains 40,000 simulated showers with an E −1 energy spectrum between 3 and 100 EeV, the other set 80,000 simulated showers of 10 EeV energy. Both sets have a mixed composition of H, He, N, Fe with equal fractions. The performance in these test sets is described in the following sections.

Arrival direction
The reconstruction of the arrival directions is based primarily on the arrival times of the first shower particles at the detectors.
As a benchmark, we perform a plane fit by using the arrival times together with the locations of the detectors. In Fig. 5a, the angular distance of the fitted shower directions and the true directions are shown for 80,000 events of the mixed composition (H, He, N, Fe with equal fractions) and an energy of 10 EeV by the dotted black curve (FIT). The angular resolution amounts to 1.45 • expressed in terms of the 68% quantile of the distribution.
To investigate the performance of the network, initially we provide on input the same arrival times as for the plane fit. The locations of the detectors are not provided and need to be inferred from the training data instead.
In Fig. 5a, the angular distance between the reconstructed and the true directions of the showers are shown by the dashed blue curve (DNN t ). Without explicit information on the physical conditions, e.g. locations of the detectors or velocity of shower particles, the network achieves the same angular resolution as the plane fit.
We also exploit information on the shower development that is contained in the time traces recorded by the detectors in addition to the arrival times of the first particles. In Fig. 5a we show the angular resolution of the combined information of arrival times and time traces (DNN, solid red curve). The angular resolution improves to 1.2 • .
Note that the data set contains a substantial noise component such that the results of the plane fit are disturbed. The network DNN is able to correct for these effects using additional information contained in the lateral signal distribution and the time traces.
In Fig. 5b we also show the energy dependence of the angular resolution. As expected, the angular resolution of the network DNN with the entire available input information provides a consistently better performance than the network DNN t working only on the arrival times. Compared to the plane fit, the network DNN performs equally well or better up to the highest energies.

Cosmic ray energy
Information on the cosmic ray energy is primarily contained in the lateral distribution of the shower particles and the total signals deposited in the detectors [18]. We investigate energy reconstruction in two variants of the input data available to the main network.
Initially, we exclusively provide the twodimensional distribution of the total signals (7) on input. In Fig. 6a we show the reconstructed energy for cosmic rays with the mixed composition and with true energy 10 EeV as the dashed blue curve (DNN s ). The energy reconstruction exhibits a small bias and has a resolution of 0.82 EeV, or σ E /E = 8.2%, respectively. Note that the absolute value of the energy resolution is dependent on the exact details of the shower and the detector simulation, and is used here only as a benchmark for further comparisons.
In a second step, on input we provide all other features in addition to the total signals, namely the arrival times (5) and the 10 features f j extracted from the signal amplitudes of the time traces. In Fig. 6a we show the energy resolution for the 10 EeV cosmic rays by the solid red curve (DNN). Using the full information, the energy resolution is reduced to σ E /E = 5.1%.
In Fig. 6b we show the energy dependence of the energy resolution. As we have parameterized with our simulation a calorimetric energy measurement, the relative energy resolution improves with increasing energy as the number of shower particles increases.
The resolution when using the full information (DNN, red circular symbols) is consistently better than the resolution when using the total signal distributions only (DNN s , blue square symbols). This comparison demonstrates that the network DNN is able to take advantage of information contained in the time traces as well.

Reconstruction of the shower depth
Information on the cosmic ray composition is contained in the details of the shower properties, where a good estimator is the atmospheric depth of the maximum X max of the shower development. Heavy cosmic nuclei typically interact high up in the atmosphere and produce a substantial number of muons. In contrast, showers of light nuclei such as protons penetrate deeper into the atmosphere. They exhibit large fluctuations in X max and produce less muons.
Note that in our air shower simulation, the primary information on X max is encoded in the time traces as we approximate the shower front by a plane wave. We again investigate air showers with the mixed composition of H, He, N, Fe with equal fractions.
In Fig. 7a we show the difference between the reconstructed and the true shower maxima X max for cosmic rays with 10 EeV energy. Here all information, the 10 features f j plus the total signals and the arrival times are provided to the network (DNN). This demonstrates that the network is able to extract the relevant information from the data in order to reconstruct the shower maximum. Note also that the resolution of the X max depends on the details of the air shower simulation.
In Fig. 7b we also show the energy dependence of the X max resolution. The shower footprint increases with increasing energy, allowing to better probe the different lateral distributions of the electromagnetic and muonic components, cf. Fig. 2. Additionally, random fluctuations and relative noise in the time traces decrease with increasing energy. As a result, the reconstruction quality improves slightly with increasing cosmic ray energy.

Conclusions
The footprint of an extended air shower in an Earth-based observatory is like a very short video of the detector responses to traversing shower particles. We developed a parametrized air shower and detector simulation to investigate reconstruction of shower properties from the detector measurements with convolutional network techniques. To provide suitable input to our main network we casted the measured information on a regular two-dimensional (2D) grid of the detector stations.
The arrival times of the first shower particles at the detectors match on such a 2D grid. This time information was sufficient for the network with dense 2D architecture to provide estimates of the shower directions with a resolution competitive to that of a classical fit to the particle shower front. While prior knowledge was input on the exact detector locations and on the particle velocities for the plane fit, the network had no additional information beyond the arrival times.
For reconstructing the shower energy, the timeintegrated signal amplitudes in the detectors can also be casted directly on the 2D grid. To include also the time traces of the signal amplitudes themselves further steps are required. The traces contain information on the longitudinal shower development including the shower depth and the electromagnetic and muonic components of the shower. Instead of directly using the amplitudes in the time bins specified by the hardware, we dedicate a part of the network to extract a small number of characteristic features describing the shape of the time traces. Since the entire network is optimized as a whole, this network part is trained to extract those features which are most useful for the given reconstruction objective. Every single feature was then casted on the 2D grid.
Using on input the above-mentioned arrival times, the time-integrated signal amplitudes, and all features extracted from the time traces, we again reconstructed the shower direction, energy and the shower depth with the dense 2D network. This set of input information yielded the best results. All shower properties were reconstructed with good precision and exhibited the expected dependency on the cosmic ray energy. Without prior information on the physics of air showers, the network extracted all necessary information from the data provided for training purposes.