Particle identification using semi-supervised learning in the PICO-60 dark matter detector

Astrophysical observations over the past decades have demonstrated the existence of dark matter. Experimental efforts in the search for dark matter are largely focused on the well-motivated weakly interacting massive particles (WIMPs) as a dark matter candidate. Current direct detection experiments are producing increasingly competitive limits on the WIMP-nucleon scattering cross section. The main experimental challenge for all direct detection experiments is the presence of background signals, which must either be eliminated with sufficient shielding or discriminated from WIMP signals. In this work, semi-supervised learning techniques are developed to discriminate alpha recoils from nuclear recoils induced by WIMPs in the PICO-60 detector. The two semi-supervised learning techniques, gravitational differentiation and iterative cluster nucleation, maximize the effect of the most confidently predicted data samples on subsequent training iterations. Classifications using both techniques reproduce the traditional acoustic parameter with accuracies above 98%. The best model yields an accuracy of 99.2% and a class-wise standard deviation of 0.11. These techniques can reliably serve as an intermediate verification tool before the acoustic parameter is constructed in future detectors.


Introduction
The presence of dark matter has become evident over the past decades. Its existence is supported by numerous astrophysical observations, including the deviation of galactic rotation curves from Newton's law of universal gravitation [1], the mass distribution of the Bullet Cluster inferred from gravitational lensing [2], and, most strongly, the anisotropy due to baryon acoustic oscillations in the cosmic microwave background [3]. Experiments search for dark matter directly, through scattering with a nucleus, or indirectly, through annihilation and subsequent decay into Standard Model particles [4]. Collider production of dark matter offers another gateway to probe its fundamental interactions. Many dark matter candidates are under consideration, including axions [5] and primordial black holes [6], with current detectors best equipped to search for weakly interacting massive particles (WIMPs) [7][8].
Direct detection experiments have shown promising results as they improve in sensitivity and scale [9]. To search directly for the hypothetical particle, many detectors are designed to be sensitive to nuclear recoils, in which a dark matter particle elastically scatters off a target nucleus, transferring momentum directly to the nucleus and causing it to recoil [11]. However, other particles such as neutrons, gamma rays and neutrinos can produce signals in the detectors that mimic those produced by a dark matter candidate. The experimental challenge is to either eliminate these backgrounds or differentiate them from dark matter signals. In the last decade, experiments have moved to deeper underground facilities to shield their detectors from cosmic rays. The remaining background signals can be rejected using calorimetry or machine learning techniques.

The PICO experiment
PICO, the joint effort between the original PICASSO and COUPP experiments [12][13], uses superheated C3F8 as the target fluid for WIMP-nucleon scattering. The detector is brought to a metastable thermodynamic (superheated) state at a set temperature by lowering the pressure past the saturation curve. The degree of superheat, determined by the pressure and temperature choices, sets the Seitz thermodynamic threshold of the detector. When the liquid is in the superheated state, any energy deposition above this nucleation threshold will induce vaporization of the liquid at the interaction vertex [14].
PICO is located 2073 meters underground at SNOLAB in Sudbury, Canada. At this depth, the cosmic muon flux is sufficiently attenuated that muon-induced neutrons are not a significant background in the detector. Additional neutron shielding is provided by immersing the detector in a water tank approximately 1.5 meters in radius. PICO detectors are extremely insensitive to electron recoils induced by gamma rays: with appropriate threshold choices, the expected gamma background is typically less than 1 event per live year in the current-generation detector. The remaining background of concern is alpha recoils. Alpha particles are decay products of uranium, thorium and radon, which are naturally present in the lab and in the materials surrounding the detector. The decay of uranium or thorium in the materials around the detector can result in radon diffusion into the active liquid; radon can also cling to the walls of the quartz jars holding the target fluid as a result of the jars' exposure to lab air prior to assembly. Alpha recoils, like nuclear recoils, can induce nucleation in the target fluid, but their acoustic properties are distinctly different. PICO detectors use piezos (Figure 1) to record the acoustic information of each event, and reject alphas with the integrated acoustic power (AP) in selected frequency bands [15]. The technique provides excellent rejection of alpha recoils, shown in Figure 2. However, it requires manual tuning (e.g., frequency band selection, position corrections) for each new detector after large sets of calibration data have been collected. In some low-threshold runs during the operation of PICO-60, ambiguous mid-AP events began to emerge, which can obscure the detector's ability to discriminate nuclear recoils from alpha recoils. Early machine learning techniques utilizing supervised learning were developed for PICO-60 to complement the discrimination power of AP.
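To illustrate the idea behind AP, the following sketch integrates the power spectrum of a single piezo trace over selected frequency bands. It is a toy illustration, not PICO's actual analysis chain (which includes position corrections and tuned band selection); the function name, band edges, and sampling rate are assumptions.

```python
import numpy as np

def banded_acoustic_power(waveform, fs, bands):
    """Integrate the power spectrum of a piezo trace over frequency bands.

    waveform: 1-D array of samples; fs: sampling rate in Hz;
    bands: list of (f_lo, f_hi) tuples in Hz. All names are illustrative.
    """
    spectrum = np.abs(np.fft.rfft(waveform)) ** 2        # one-sided power spectrum
    freqs = np.fft.rfftfreq(len(waveform), d=1.0 / fs)   # frequency of each bin
    return [float(spectrum[(freqs >= lo) & (freqs < hi)].sum()) for lo, hi in bands]

# Toy check: a 50 kHz tone deposits its power in the 40-60 kHz band.
fs = 1_000_000
t = np.arange(4096) / fs
tone = np.sin(2 * np.pi * 50_000 * t)
powers = banded_acoustic_power(tone, fs, [(0, 40_000), (40_000, 60_000)])
```

In the real analysis, the band powers from each piezo would be corrected and combined into the single AP discriminant on which the cut is placed.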
In this work, novel techniques exploiting semi-supervised learning on PICO-60 data are shown to reproduce AP with over 98% accuracy, reaching 99.2% for the best model.

Semi-supervised learning for particle identification
Semi-supervised learning is less commonly used than supervised and unsupervised learning. The concept is to train a machine learning model on a set of labeled data together with a set of unlabeled data. In the context of PICO-60, the labeled sets consist of the neutron calibration runs and the background radiation runs (which consist predominantly of alpha recoils). These labeled sets have a strong but not perfect correlation with particle types. The unlabeled data can consist of any mixture of particle types.
[Figure caption: The WIMP search data contain only alpha recoils. Bottom: signal classifications by a neural network using supervised learning (early development, not a part of this work). Nuclear recoil events are assigned a score of 0 and alpha recoil events a score of 1.]
The primary advantage of semi-supervised learning is clear: it requires fewer labels than conventional supervised learning. PICO-60 data sets are relatively small after analysis cuts (on the order of hundreds of events per category), which makes semi-supervised learning particularly advantageous. A less obvious but powerful secondary advantage is that, since the learning process allows the network to reinforce its own decisions, the negative effect of impure training data is minimized, potentially allowing the resulting network to outperform a network trained with supervised learning.
Two techniques for semi-supervised learning have been developed for this study: gravitational differentiation and iterative cluster nucleation. The essence of both algorithms is a positive feedback loop. First, the network is trained on a small amount of imperfectly labeled data (data drawn from the neutron calibration or WIMP search runs). Throughout the training process, the network runs predictions on unlabeled data. By using its most confident predictions in the training process, the desired function separating particle types is continuously reinforced, while less confident predictions are de-emphasized. This effectively draws new, unlabeled data into the training set for further iteration.
In this work, multi-layer perceptrons were used as the basic neural network model. They were trained on a banded, integrated Fourier transform of the acoustic data from the two working piezos of the detector (Figures 3 and 4). All architectures were optimized using grid searches, in which every combination of hyperparameters within a fixed parameter space is tested [16]. The best architecture consists of an input layer of 16 neurons, two fully-connected hidden layers of decreasing size (12 and 8 neurons), and a binary output layer. The dropout probabilities between the last three layers are both 0.25. This architecture was selected using accuracy calculated on a randomly selected validation set.
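The architecture above can be sketched as a forward pass. The text specifies only the layer sizes and dropout probabilities, so the hidden-layer activation (ReLU here), weight initialization, and sigmoid output are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Layer sizes from the text: 16 inputs, hidden layers of 12 and 8, binary output.
sizes = [16, 12, 8, 1]
weights = [rng.normal(0.0, 0.1, (m, n)) for m, n in zip(sizes[:-1], sizes[1:])]
biases = [np.zeros(n) for n in sizes[1:]]

def forward(x, train=False, p_drop=0.25):
    """Forward pass; dropout (p = 0.25) follows each hidden layer during training."""
    h = x
    for i, (W, b) in enumerate(zip(weights, biases)):
        h = h @ W + b
        if i < len(weights) - 1:
            h = np.maximum(h, 0.0)        # ReLU hidden activation (an assumption)
            if train:                     # inverted dropout, applied at train time only
                h *= rng.binomial(1, 1.0 - p_drop, h.shape) / (1.0 - p_drop)
    return 1.0 / (1.0 + np.exp(-h))       # sigmoid score: 0 = nuclear recoil, 1 = alpha

score = forward(rng.normal(size=16))
```

The 16 inputs correspond to the banded, integrated Fourier-transform features from the two piezos.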

Gravitational Differentiation
Gravitational differentiation (GD) is a novel technique for training a neural network in a semi-supervised fashion on a relatively small set of imperfectly labeled data, while simultaneously incorporating the network's changing predictions on a set of unlabeled data to encourage decisive classifications. In general terms, it amplifies the training effect associated with high-confidence predictions (those that most clearly show characteristics of either an alpha particle or a neutron) on unlabeled examples. Meanwhile, it minimizes the training impact of low-confidence predictions, such that the network will not be trained to make incorrect predictions.
Analogous to a gravitational effect, events with high-confidence predictions are attracted closer to their corresponding classifications. They are then used as training data, allowing the patterns they represent to be applied more generally to make predictions on other events. The system accomplishes this using a new method for calculating gradients for the last layer of the neural network. Gradients are calculated using the gravitational differentiation function GravDiff(p, ψ, g), which is parameterized by the network's existing prediction p on the training example in question, the degree ψ of the piecewise exponential function used to distort the response of the gradient, and the gravitational multiplier g. The function is defined as follows:

GravDiff(p, ψ, g) = g · sgn(p − 0.5) · |tanh(2(p − 0.5))|^ψ

It is a transformation of the hyperbolic tangent that flattens the central range and comparatively exaggerates the asymptote on either side. The equation can be described in the following steps, demonstrated further in Figure 5:
(i) Transform the network's prediction p so its range is -1 to 1 rather than 0 to 1.
(ii) Apply the sigmoidal hyperbolic tangent, producing a value of 0 in the center and asymptotic slopes toward -1 and 1 at the edges.
(iii) Apply a relatively large exponent ψ (in the range of 3 to 11) to compress values close to 0 even closer to 0. This ensures a shallow slope in the center, so low-confidence network outputs near 0.5 produce very small output gradients.
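A minimal implementation of the GravDiff function, assuming the sign factor acts on the shifted prediction p − 0.5 (for p in [0, 1] a sign of p itself would always be positive):

```python
import numpy as np

def grav_diff(p, psi, g):
    """Gravitational-differentiation gradient for a prediction p in [0, 1]."""
    t = np.tanh(2.0 * (p - 0.5))              # steps (i)-(ii): shift to (-1, 1), squash
    return g * np.sign(t) * np.abs(t) ** psi  # step (iii): exponent psi flattens the centre

# A borderline prediction yields almost no gradient ...
low = grav_diff(0.55, psi=7, g=1.0)
# ... while a confident one is pulled strongly toward its class.
high = grav_diff(0.95, psi=7, g=1.0)
```

An exactly uncertain prediction (p = 0.5) maps to a gradient of 0, so it has no influence on the next training iteration.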
This function is applied in a training algorithm that is parameterized by the set of imperfectly labeled data ς, the set of unlabeled training data υ, a binary classification neural network NN(x) (where x is the input data format), and the gravitational multiplier increment δ_g. The distortion power ψ is assumed to be constant throughout a training run, and the gravitational multiplier g is initialized at 0, gradually increasing during training. The algorithm iterates as follows:
(i) Train NN(x) for a single epoch on the combined set ς ∪ υ, using the predefined ground truths for ς (with a mean squared error loss function) and the most recently calculated gravitational gradients for each of the examples in υ.
(ii) Calculate predictions p with NN(x) on the entirety of υ, then calculate GravDiff(p, ψ, g) and record the resulting gradients for the next iteration.
(iii) Increment g by δ_g.
At the beginning, when g = 0, training is based entirely on the labeled set ς; progressively, over the course of a run, the gravitational effect increases with g.
Figure 5: Demonstration of the principle of gravitational differentiation, where the GravDiff function distorts the gradient by ψ and pulls confident predictions closer to the binary categories (0 or 1), creating the positive feedback loop in the semi-supervised learning. The less certain predictions will have a gradient close to 0 in the next training iteration.
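The training loop can be sketched with a toy one-dimensional logistic model standing in for NN(x). The data, learning rate, update rules, and hyperparameter values are illustrative assumptions, not those of the PICO-60 analysis; the point is the schedule in which g ramps up from 0 while gravitational gradients from the previous iteration feed each epoch.

```python
import numpy as np

rng = np.random.default_rng(1)

def grav_diff(p, psi, g):
    t = np.tanh(2.0 * (p - 0.5))
    return g * np.sign(t) * np.abs(t) ** psi

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

# Imperfectly labeled set (class 0 near -1, class 1 near +1) plus unlabeled events.
lab_x = np.array([-1.2, -0.9, 1.1, 0.8])
lab_y = np.array([0.0, 0.0, 1.0, 1.0])
unl_x = rng.normal(0.0, 1.0, 200)

w, b = 0.0, 0.0                  # toy 1-D "network": p = sigmoid(w*x + b)
g, delta_g, psi, lr = 0.0, 0.05, 7, 0.5
grads = np.zeros_like(unl_x)     # gravitational gradients; zero while g = 0

for _ in range(60):
    # (i) one epoch on the combined set: MSE gradient step on the labeled data ...
    p = sigmoid(w * lab_x + b)
    err = (p - lab_y) * p * (1.0 - p)          # d(MSE)/d(w*x + b)
    w -= lr * np.mean(err * lab_x)
    b -= lr * np.mean(err)
    # ... plus the recorded gravitational gradients on the unlabeled data
    # (positive gradients push predictions toward 1, negative toward 0).
    w += lr * np.mean(grads * unl_x)
    b += lr * np.mean(grads)
    # (ii) recompute predictions and GravDiff gradients for the next iteration
    grads = grav_diff(sigmoid(w * unl_x + b), psi, g)
    # (iii) increment the gravitational multiplier
    g += delta_g
```

Early epochs are driven almost entirely by the labels; as g grows, confident predictions on the unlabeled pool reinforce the learned boundary.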

Iterative Cluster Nucleation
Iterative cluster nucleation (ICN) is a second semi-supervised learning algorithm developed as part of this study. It takes advantage of some amount of imperfectly labeled data, using it to classify the rest of an unlabeled training set while optimizing to produce an effective discriminator. The basic concept is similar to that of gravitational differentiation: starting with a relatively small amount of labeled data from two classes (nuclear recoils and alpha particles, in this case), predictions are run on a set of unlabeled data, and those predictions are used to produce further training data. However, there are three key differences:
(i) In iterative cluster nucleation, an unlabeled example is not included in the training process at all until a very confident prediction is made on it. Once that occurs, it is added to the training set, expanding one of the two "clusters" of training data.
(ii) Once an example is added to the active training set, it is never removed. This ensures that the highly confident predictions made at the very beginning continue to influence the training process.
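A sketch of the ICN loop on toy one-dimensional data, with a nearest-centroid classifier standing in for the neural network; the data, confidence score, and threshold are illustrative assumptions. Events enter a cluster only once the classifier is very confident about them, and once added they are never removed.

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy 1-D events: a nuclear-recoil-like cluster near -2, an alpha-like cluster near +2.
seed_x = [-2.1, -1.9, 2.0, 2.2]
seed_y = [0, 0, 1, 1]
pool = list(np.concatenate([rng.normal(-2.0, 0.5, 100), rng.normal(2.0, 0.5, 100)]))

train_x, train_y = list(seed_x), list(seed_y)
threshold = 0.9                  # confidence required before an event is nucleated

while pool:
    # Nearest-centroid stand-in for the network: the score is a sigmoid of the
    # difference between an event's distances to the two cluster centroids.
    c0 = np.mean([x for x, y in zip(train_x, train_y) if y == 0])
    c1 = np.mean([x for x, y in zip(train_x, train_y) if y == 1])
    xs = np.array(pool)
    p = 1.0 / (1.0 + np.exp(-(np.abs(xs - c0) - np.abs(xs - c1))))
    conf = np.maximum(p, 1.0 - p)
    k = int(np.argmax(conf))
    if conf[k] < threshold:
        break                    # no confident prediction left: stop nucleating
    # Nucleate: move the most confident event into its cluster, permanently.
    train_x.append(pool.pop(k))
    train_y.append(int(p[k] > 0.5))
```

In the actual algorithm the stand-in classifier would be the multi-layer perceptron, retrained on the growing clusters between nucleation steps.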

Performance Comparison
The criteria that define the success of a network are classification accuracy and confidence. The accuracy of the network is assessed against AP, assuming AP has perfect discrimination power. The confidence of the network's predictions is characterized by the class-wise standard deviation (CWSD):

CWSD = sqrt( [ Σ_{i=1}^{N_n} (x_{n,i} − μ_n)² + Σ_{j=1}^{N_a} (x_{a,j} − μ_a)² ] / N )

where N, N_n, N_a are the numbers of data samples in total, from the neutron calibration set, and from the background set, respectively; x_n are predictions in the neutron calibration set and x_a are predictions in the background set; the μ's are the mean values of their respective data sets. A more confident neural network will produce a CWSD closer to 0 on the scale of 0 to 1.

Conclusion
Two semi-supervised learning techniques, gravitational differentiation and iterative cluster nucleation, were developed based on the principle that the more confident predictions on unlabeled data have a larger effect on training in the next iteration. While gravitational differentiation produced classifications with the highest accuracy and confidence, iterative cluster nucleation has proven to be more data-efficient. In particular, these two semi-supervised learning techniques consistently perform better in both accuracy and confidence compared to supervised learning as initial sample sizes increase. The best network model reproduced the AP predictions with an accuracy of 99.2% and a confidence value (CWSD) of 0.11. These techniques can complement the results predicted by AP and provide intermediate verification on small calibration data sets before AP is calculated in future detectors. The results may also assist in the placement of the AP cuts. The development of effective machine learning techniques can be a powerful tool for background discrimination in dark matter detectors as experiments prepare to deploy ton-scale detectors with the potential for WIMP detection.
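As a concrete reading of the CWSD metric, the sketch below pools the spread of the predictions about their own class means; the exact normalization (dividing the summed squared deviations by the total sample count N) is an assumption consistent with the stated definitions.

```python
import numpy as np

def cwsd(x_n, x_a):
    """Class-wise standard deviation: pooled spread of predictions about
    their own class means (a reconstruction; normalization by N assumed)."""
    x_n, x_a = np.asarray(x_n, float), np.asarray(x_a, float)
    n_total = len(x_n) + len(x_a)                       # N = N_n + N_a
    ss = np.sum((x_n - x_n.mean()) ** 2) + np.sum((x_a - x_a.mean()) ** 2)
    return float(np.sqrt(ss / n_total))

# Tight, well-separated predictions give a CWSD near 0 ...
tight = cwsd([0.01, 0.02, 0.03], [0.98, 0.97, 0.99])
# ... while diffuse predictions give a larger value.
loose = cwsd([0.1, 0.4, 0.3], [0.6, 0.9, 0.5])
```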