FETA: Flow-Enhanced Transportation for Anomaly Detection

Resonant anomaly detection is a promising framework for model-independent searches for new particles. Weakly supervised resonant anomaly detection methods compare data with a potential signal against a template of the Standard Model (SM) background inferred from sideband regions. We propose a means to generate this background template that uses a flow-based model to create a mapping between high-fidelity SM simulations and the data. The flow is trained in sideband regions with the signal region blinded, and the flow is conditioned on the resonant feature (mass) such that it can be interpolated into the signal region. To illustrate this approach, we use simulated collisions from the Large Hadron Collider (LHC) Olympics Dataset. We find that our flow-constructed background method has competitive sensitivity with other recent proposals and can therefore provide complementary information to improve future searches.


I. INTRODUCTION
While the Large Hadron Collider (LHC) has been operational for over a decade, no targeted search for physics beyond the Standard Model has found significant evidence for new particles or forces of nature. As a result, there has recently been a growing interest in modelagnostic searches: in such studies, the goal is to find anomalies while making few assumptions about the underlying physics model that could have produced them. This mode of analysis is highly general, and it has gained momentum from recent advances for modeling, classifying, and finding anomalies in data. For a broad review of the role of modern machine learning (ML) methods in searches for new physics, see [1]; for performance summaries of ML-inspired methods on LHC-like data, see the LHC Olympics [2] and Dark Machines Anomaly Score Challenge [3] reports.
One class of anomaly detection searches is for resonant anomalies. In this case, we assume that the beyond-the-Standard Model (BSM) physics signature would be resonant in at least one feature, typically some sort of mass. Such a signal might correspond to the production of a new on-shell particle. These "bump hunts" search for an excess in the mass feature above the known Standard Model (SM) background in a well-defined signal region. However, such searches require the existence of accurate SM background templates within the signal region. While particle generators for SM physics exist, the resulting simulated data (even with detector effects added) is not accurate enough to be used as a background for bump hunts in hadronic final states: hard process generators carry out perturbative calculations only to NLO, showering simulators make use of heuristic nonperturbative models, and detector simulators make necessary simplifying assumptions about particle-detector interactions in order to speed up runtime.
In response to this problem of subaccurate particle generators, a number of studies have focused on constructing SM background templates for LHC-like detected data within a signal region. Broadly, methods can be categorized on two qualities: 1. Simulation-assisted vs. data-driven. For simulation assisted methods, the background template construction is informed by a set of simulated LHClike collider data representing SM processes; for data-driven methods, data from sidebands mass regions is used, where the sidebands are far enough from the resonant process such that the data can be treated as a proxy for SM background.
2. Likelihood learning vs. feature morphing. For the former, methods learn the likelihood of a SM-only dataset (such as simulation, or detected data in sidebands) where there are no signal events. This learned likelihood is then interpolated into the signal region to act as a SM background template above which signal events might be detected. Alternatively, methods can physically morph features from the SM-only dataset to the detected, signalcontaining dataset.
A number of previous methods for SM background construction are classified visually in Table I. This two-axis scheme is not meant to be all-encompassing, as there exist many methods for anomaly detection that cannot be so neatly classified (see [4][5][6][7] for examples of such methods in practice).
In this paper, we propose the "Flow-Enhanced Transportation for Anomaly Detection" (Feta) method to model SM-like background at the LHC. Feta is Simulation-assisted Data-driven Likelihood learning Salad [8] Overdensity searches [9], Anode [10], (La)Cathode [11,12] Feature morphing Feta (this work) Curtains [13]  simulation-assisted and relies on feature morphing, therefore filling the previously empty section of Table I. In Feta, we train a normalizing flow 1 [15] to learn a mapping between simulation and data in sidebands regions. We then apply the learned mapping to signal region simulation to create a simulation-informed template for signal region SM background. Feta benefits from being simulation-assisted since it can use simulated SM data as a physically-informed prior for the background template; the method further benefits from using feature morphing since it is robust to mapping between feature spaces with non-overlapping support.
The structure of this paper is as follows. In Sec. II, we provide a concise background of normalizing flows and outline how they will be applied to physics-specific datasets. In Sec. III, we illustrate the effectiveness of flow-based models for creating context-dependent mappings with a toy example of triangular datasets. In Sec. IV, we exchange the toy models for LHC-like data and use Feta to create a model for LHC-like detected SM data. In Sec. V, we test the performance of Feta in a series of realistic anomaly detection tasks. In Sec. VI, we conclude and suggest avenues for further study. Normalizing flows are constructed from invertible neural networks between sets of variables sampled from different probability densities. Given a random variable X sampled from a reference distribution p R , one can define a transformation T that produces another random variable Z, i.e. Z = T (X). The density of Z is then given by p Z (Z) = p R (X)|det ∂T ∂X | −1 . By chaining together a number of transformations T i , one can produce an arbitrarily complex mapping between the initial reference density p R and a target distribution p T . Typically, the target density is taken to be a standard normal distribution. (Hence the name "normalizing" flow.) 1 A comprehensive review of flow-based models is given in [14].
FIG. 1: A schematic of the Feta method. We train a flow to learn the mapping between simulation and data in sidebands regions, which are expected to be background-only. We then apply the learned flow to simulation in a signal region to produce an approximation for background in that region.
In this work, we use the normalizing flow both for its density estimation power and its ability to construct morphing functions between nontrivial reference X R ∼ p R and target X T ∼ p T distributions. Our reference density is derived from Standard Model simulation (X SIM ∼ p SIM ), and our target dataset is derived from detected data (X DAT ∼ p DAT ).
More explicitly, we define a set of N event observables such that events in the reference and target are N -dimensional vectors X SIM and X DAT , which respectively are sampled from the N -dimensional feature densities p SIM and p DAT . We then train a flow to learn the mapping T between p SIM and p DAT . In training the flow, we must ensure that the learned mapping is between simulated Standard-Model background and LHC-detected background. For resonant anomalies, we assume that the signal will be localized in one feature m res . This allows us to define sidebands (SB) and a signal region (SR) in m res , where data from the former regions is assumed to follow the SM distribution. We then train the flow only on data from SB, using the resonant feature to condition the mapping T (·|m res ). The learned flow is then applied to simulation in the SR to produce an approximation of X DAT : X * SIM = T (X SIM |m res ) for LHC-detected background in the SR. A schematic of this method is shown in Fig. 1.
There are several advantages to using a flow-based architecture over other architectures such as GANs or VAEs. Normalizing flows are known to be more stable and achieve convergence during training faster, especially in higher dimensions. This property allows for the freedom to choose a larger feature space X, which may be desirable for a model-agnostic study. For GANs specifically, attempting to learn conditional mappings between datasets is not an easily done task. In addition, the density estimating power of the normalizing flow al-lows us to oversample from the reference distribution and reduce statistical uncertainties (explored in more detail in App. A). This is not possible with VAEs, which require the definition of explicit pairs to train the encoderdecoder architecture.

B. Flow construction
All flows were constructed using the nflows package [16] and were trained using PyTorch [17].
As an important procedural note: to learn a flow that maps between two nontrivial densities, we use a two-step procedure. We first train a Base Density normalizing flow to learn the mapping between a normal distribution and the reference density p R across all mass bands (i.e. using reference data from both the SB and SR). We then train a Transport flow to learn the mapping between the Base Density distribution and the target density p T , this time only using data from the SB. This specific method, which allows for the use of flows to map between nontrivial distributions, was proposed and implemented in the Curtains background construction method (and is further explored in [18]). It is thus more accurate to simply call Feta a flow-based method, rather than a normalizing flow-based one.

III. TESTING THE FETA METHOD WITH A TOY MODEL
To concretely illustrate the Feta method, we train a flow to map between two toy triangular datasets. Both datasets contain 100,000 samples in a two-dimensional feature space, with a feature X that is conditionally dependent on a feature M . This conditioning feature can be interpreted as a mass-like feature, or one in which we expect an anomaly to be resonant.

A. Toy model datasets
To construct each dataset, we first generate samples of the conditioning (resonant) feature M , which we take to be uniformly distributed between (0, 1). On this feature, we define the signal region SR ∼ (0.34, 0.66). We also define two sidebands, SB1 ∼ (0.0, 0.34) on the low mass side of the SR, and SB2 ∼ (0.66, 1.0) on the high mass side.
For the corresponding nonresonant feature, we draw samples from a triangular dataset with endpoints at (0, 1). In order to condition this feature X on the mass, for each sample we define the midpoints m R , m    [22].) Flow architecture and training hyperparameters used for the toy (triangular) dataset. Both the Base Density and Transport flow parameters were optimized through manual tuning.
(0, 1), and the toy target dataset X T consists of righttriangular samples. We further shift all the samples X T of the target dataset by 0.5 such that the reference and the target sets have nonidentical support. Samples from the two datasets are shown in Fig. 2.
The flow architecture and hyperparameters are given in Table II. All settings were optimized via manual tuning and were chosen to give the best possible performance on the SB regions (as quantified by the ROC AUC, which is defined in Sec. III B). Flow training is optimized with AdamW [19], and the learning rate is annealed to zero following a cosine schedule [20]. Before training, all features are minmaxscaled to the range (-3, 3), which was found to be optimal with respect to the flow training; further, the samples are split into training (80%) and validation (20%) sets. The model from the epoch with the lowest validation loss is used for evaluation.  TABLE III: ROC AUCs for a binary classifier trained to discriminate the transformed reference X * R from the target dataset X T . For comparison, we also provide the AUCs for a binary classifier trained to discriminate the untransformed reference X R from the target. Uncertainties are the 1σ bounds from retraining the binary classifier 20 times, each with a different random seed.

B. Toy model results
Plots of the reference X R , transformed reference X * R as found by the Feta method, and target X T distributions are shown in Fig. 3. We compare the distributions separately in SB1, SR, and SB2 (recall that the flow is trained only on data from SB1 and SB2 and is applied to the blinded SR data). Qualitatively, there is excellent agreement between the distributions for X * R and X T in SB1 and SB2 and good agreement in the SR. We conclude that the flow has learned to map between the reference and the target data.
In Fig. 3, we also plot the reweighted reference X W R as found through the Salad method. For this method, we train a binary classifier to discriminate between X R and X T . Such a classifier will learn the likelihood ratio between the reference and the target distributions. This likelihood ratio can then be interpolated into the SR and used to reweight X R . Note that for this toy model, feature reweighting is expected to fail due to the fact that there are regions of non-overlapping support between the reference and target. We can quantify the performance of the learned morphing function by training a 5-fold binary classifier neural network to discriminate between X * R and X T . Each classifier is a fully connected (dense) network, with linear layers of sizes (5, 64, 32, 1) and a dropout of 0.1 between each layer. Each classifier is trained for up to 100 epochs with a batch size of 128, learning rate of 10 −3 , and patience of 5 epochs. We evaluate all test data on the classifier from the fold with the best (lowest) validation loss.
As our scoring metric, we use the area under the receiver operating characteristic (ROC) curve. These AUC scores for this classifier are shown in Table III. An ideal morphing function would result in the transformed reference and target datasets being indistinguishable from each other, which would correspond to a AUC close to 0.50. Indeed, the AUCs hit this performance benchmark, and they are also far lower than those for a classifier trained to discriminate between untransformed X R and X T .  We now move to a realistic example: training a flow to learn the morphing function between Standard Model simulation and detected data. For an ideal application of Feta, the reference dataset would consist of simulated Standard Model-like data (X R = X SIM ), and the target dataset would consist of LHC-detected Standard Model data X T = X DAT ). However, we do not have access to LHC data. Therefore in this study we use two distinct sets of simulated data for the reference and target datasets.
For this study, we choose a feature space of six dijet observables m J1 , ∆m JJ , τ 21 J1 , τ 21 J2 , ∆R JJ , and m JJ . Following the example of the Curtains analysis, we include the ∆R JJ dijet observable as a feature that is highly correlated with m JJ .
As with many anomaly detection searches, we assume that the anomaly is resonant in one feature, m JJ . This would correspond to a new resonant particle that would produce the two jets. The assumption of resonance allows us to define a signal region (SR) in m JJ -space, as well as two sidebands (SB1 and SB2) on the low-and high-mass ends of the SR. The region edges are defined in Table IV. While we have chosen to focus on one SR / SB setup, one could in practice perform a sliding bump hunt and scan the SR across a wide range of resonant masses.
The LHC Olympics dataset does not contain any LHCdetected data that would be the obvious choice for X SIM . Therefore we take the LHC Olympics Herwig++ data as our simulation dataset X SIM and the Pythia data Probability distributions for X R , X * R , and X T for the toy model. The X * R samples represent the trained flow acting upon X R . The good agreement between X * R and X T indicates that the flow has successfully learned to map between the two datasets. We also provide the distribution for reweighted reference X W R to illustrate its failure to provide an accurate model for X T , as the reference and target datasets have different regions of support.  parameters were derived from the main architecture from Cathode, but were confirmed to give the best performance through manual tuning. The Transport flow parameters were optimized through manual tuning. as our target dataset X DAT . Histograms of the six dijet observables for X SIM and X DAT are shown in Fig. 4.

B. Training Method
The method for training a flow on the LHC data is the same as for the toy dataset. However, the flow architectures used for LHC-like data are significantly more expressive. Architectures and hyperparameters are outlined in Table V. Notably, the Base Density flow parameters were derived from the main architecture from Cathode (which relies on faithful density estimation of detected collider data in SB), and they were confirmed to give the best performance through manual tuning. Data is minmaxscaled to the range (-3, 3) before flow training, and a training-validation split of 80%-20% is used. All settings were manually optimized to give the bestperforming flow possible, as quantified by the ROC AUCs in SB1 and SB2.  VI: ROC AUCs for a binary classifier trained to discriminate the transformed simulation X * SIM from the target dataset X DAT . For comparison, we also provide the AUCs for a binary classifier trained to discriminate the untransformed simulation X SIM from the target.

C. LHCO results
In Figs. 5a and 5b, we plot the distributions for X SIM , X * SIM , and X DAT for the LHC-like data in SB1 and SB2. For each band, we also plot the ratio of untransformed and transformed simulation distributions to the target distribution. For all features, the transformed simulation X * SIM is visually much closer to the target X DAT than the untransformed simulation.
In Fig. 5c, we provide the same distributions for X SIM , X * SIM , and X DAT in SR. For these plots, we once again see good qualitative agreement between X * SIM and X DAT , despite the fact that the flow was not explicitly trained to morph between SR datasets.
In Table VI, we quantify the performance of the flow through the ROC AUC of a binary classifier (with the same architecture as in Sec. III B) trained to discriminate X * SIM from X DAT in each band. In all bands, the AUC is consistent with below 0.51, so our benchmark for indistinguishability of transformed simulation from the target is achieved.  However, if the LHC-detected data did contain some anomalous events, this would cause X * SIM to differ from X DAT . In this case, an avenue for resonant anomaly detection emerges. We can train a binary classifier to discriminate X * SIM in the SR from data in the SR, and a significant deviation found between the distributions might provide evidence of anomalous events in the detected data. This method for anomaly detection relies on the fact that a binary classifier trained to discriminate between two mixed samples (such as a background-only SM template and a detected mixture of SM background and resonant signal) is in fact the optimal classifier for distinguishing pure signal from pure background [28][29][30].

A. Signal injection procedure
To explore the capability of Feta for anomaly detection, we repeat the flow training method as outlined in Sec. IV B. We inject a known number of signal events into the X DAT (Pythia) dataset. Since our chosen SR extends across the m JJ range [3300, 3700] GeV, this allows for possible detection of the Z ′ resonance centered at 3500 GeV. We test a range of signal injections, scanning over n S ∈ [300, 500, 750, 1000, 1200, 1500, 2000, 2500, 3000] (corresponding to S/B ≈ [0.30%, 0.49%, 0.74%, 0.99%, 1.18%, 1.48%, 1.97%, 2.47%, 2.96%]). As a word of caution: despite the relatively wide SR window chosen, about 20% of the injected events go into the SB regions.
For each signal injection, we rerun the full Feta pipeline and train a flow to learn the mapping between simulation and data with the injected signal. (This retraining is necessary due to the signal contamination in the sidebands.) We then train a binary classifier (with the same architecture as in Sec. III B) to discriminate transformed simulation from data.
We compare the results of the Feta method with those from the Cathode, Curtains, and Salad methods. For each alternative method, we use the architecture cited in the respective paper 2 . To place all methods on an equal footing, we use the same set of training and validation dijet events for all methods. Such event breakdowns are given in Table VII. Note that all methods use the validation loss to select the best-performing model to be used for further analysis. The Feta, Cathode, and Curtains methods all make use of oversampling of the SM background template to achieve better performance, which is a mechanism that allows for the reduction of statistical uncertainties. We investigate the effects of oversampling on the Feta method performance in App. A.
All methods are evaluated by training a binary classifier to discriminate the SM background template samples from a set of 100k dijet data events from the SR. All classifiers are tested on the same set of 20k signal and 20k background SR dijet events, which were not used at any point during the training or validation procedures.

B. Anomaly detection performance summary
In Fig. 6, we show a selection of summary plots from this final binary classifier corresponding to 2500 injected signal events (S/B = 1.97%, S/ √ B ≈ 7.9). Each curve represents the mean performance of 20 different random classifiers trained to discriminate the given SM background template from the "detected" events in the SR. Errorbands represent the spread across the (16, 84) percentiles across these 20 runs. Note that the errorbands are comparable for Feta, Cathode, Curtains, and Salad.
In Fig. 6a, we provide the significance improvement characteristic curves, given by SIC = true positive rate √ false positive rate . The SIC can be interpreted in the limit of large S and B as the gain in signal significance (i.e. the multiplicative factor) over the initial significance that can be achieved by making a well-motivated cut on the dataset. Therefore for an optimally performing classifier, we expect to see the SIC >> 1. In Fig. 6b, we provide the classifier rejection curves, given by the reciprocal of the false positive rate. An optimally performing classifier should see this rejection also be >> 1. Finally, in Fig. 6c, we plot the SIC curves against the rejection curves.
In addition to the curves for the four methods considered, we also provide the curves corresponding to a fully supervised classifier, i.e. a classifier trained on perfectly labeled signal and background events. This curve demonstrates the maximum possible performance to discriminate SM physics from anomalous physics. Importantly, the fully supervised classifier should not be interpreted as the limiting case of an idealized anomaly detection study, which would come from a classifier trained to discrimi-  Note that oversampling (the reduction of statistical uncertainties by drawing more SM samples) is not possible with the Salad method, which does not use a generative model for the reference dataset. The oversampling factor is the number of SM background template samples generated over the number of dijet events used to construct the SM background template samples.
nate detected data from perfectly simulated background. We find that the Feta method is competitive with the performance of a fully supervised classifier at signal efficiencies of around 0.3 and lower. This performance is encouraging as in practice, we would expect Feta to be used primarily in this range (for higher signal-efficiencies, the amount of background is sufficiently large that nonfully-supervised methods are unlikely to effectively find the signal). Indeed, the performance curves of Feta for all three metrics, (SIC, rejection, and SIC vs. rejection) very closely align with those of Cathode, which was demonstrated to be state-of-the-art, and Curtains.
In Fig. 7, for each signal injection, we plot the maximum of the SIC, where the maximum is taken across all signal efficiencies. Since the SIC relates to an increase in signal significance from a well-motivated cut, the max(SIC) is then the best possible cut for a given signal. Across the board, Feta, Cathode, Curtains, and Salad achieve similar performances, all of them becoming just about consistent with a fully supervised classifier at ∼1.97% signal injection. The S/B corresponding to a minimum detectable signal (set to be ∼3 for "observation" of BSM physics) lies between 0.49% and 0.74%, as calculated by max(SIC)×(S/ √ B).

VI. CONCLUSIONS AND OUTLOOK
In this study, we have proposed a new method, Feta, for SM background construction that can be used for resonant anomaly detection. The method is simulationassisted and relies on feature morphing (rather than reweighting) between a reference set of simulated SM data and detected data.
Feta can be seen as a hybrid of Curtains and Salad: like Curtains, Feta uses a flow-based architecture that allows for feature morphing from a reference dataset to construct a SM template. This morphing property performs well in low-density regions of feature space where reweighting methods fail. Like Salad, Feta uses simulated data as the reference dataset. This provides an advantage over data-driven methods as the reference dataset can act as a prior that is free of signal contamination.
For such simulation-assisted methods of background construction, optimality of transport is desirable as the simulated data provides a physics-informed prior that is expected to be close to the detected data. Therefore an efficient reweighting or morphing function should ideally do as little as possible and reduce to the identity when the simulation is exactly correct. One might ask if the Transport flow used in Feta executes the optimal transport between the reference and target datasets, especially as an out-of-the-box normalizing flow contains no loss terms that penalize non-optimal transport. In fact, for the scope of this problem (i.e. morphing between sets of Herwig++ and Pythia simulated LHC data), we found that modifications to the flow training method that enforce optimal transport did not significantly change the performance of Feta [31]. However, these modifications might become important if Feta were applied to morph between a less similar reference and target.
Future avenues for exploration include monitoring the effects of the SB and SR widths on the performance of Feta adding rigorous uncertainty estimates for the performances of all four background construction methods considered in this study, and testing the four methods on a resonant anomaly other than the LHC Olympics one.

CODE AVAILABILITY
The code can be found at https://github.com/rmastand/FETA. In this section, we investigate the effects of oversampling on the performance of the Feta background template. Oversampling was found to greatly improve the performance of the Cathode and Curtains background templates, and similar improvements were found for Feta.
Oversampling refers to the reduction of statistical uncertainties by using a larger number of events. This method is only possible when the background template construction makes use of a density estimator for the reference dataset that can be sampled from multiple times. With Feta, the Base density flow (defined in Sec. II B) plays this role: without oversampling, Feta transports from X SIM to X DAT ; with oversampling, Feta draws samples X BD from the Base density and transforms these to X DAT .
In Fig. 8a and Fig. 8b, we plot the SIC and rejection (respectively) curves corresponding to oversampling factors from one (120k SR events) through six (720k SR events) for a signal injection of 1000 (S/B = 0.99%). We find that the classifier performance generally increases to a peak at the oversampling factor of ×5, after which performance saturates. However, for signal injections of ≥ 2000 (S/B = 1.97%), all oversampling factors perform similarly.

Appendix B: Signal vs. Background Correlation
In this section, we compare in more detail the events produced by the various SM background template construction methods (Feta, Cathode, Curtains, and Salad).
We expect all of the methods to produce background samples that are significantly different from anomalous events. This is what allows us to detect resonant anomalies when training a classifier to discriminate between SR background samples and SR detected data. However, it is not obvious (or expected) that the background samples will be similar between methods. In fact, we might expect classifier scores to be uncorrelated for background events.
To probe how different the phase spaces are between construction methods: we train a binary classifier (with the architecture as described in Sec. III B) to discriminate between SR background samples from SR detected data with 0 injected signal events. We evaluate the trained classifier on the test set of 20k signal and 20k background dijet events to get a score between 0 and 1 for each event.
(Note that these scores are different than the AUC scores considered in the main analysis.) We finally plot these scores (standardized to zero mean and unit variance) for different methods as a function of each other, for Feta vs. Cathode in Fig. 9a, for Feta vs. Curtains in Fig. 9b, and for Feta vs. Salad in Fig. 9c.
For both such cases, we find that the scores assigned to background events derived from classifiers trained on SM background samples from Feta are broadly uncorrelated with those from classifiers trained on SM background samples from the other three methods. There is perhaps a mild degree of correlation between scores for signal events, but the correlation would not be expected to be strong as the classifiers saw no signal events during training.
We then repeat the analysis, this time training the classifiers to discriminate between SR background samples from SR detected data with 2500 injected signal events. We plot the same standardized scores for Feta vs. Cathode in Fig. 10a, for Feta vs. Curtains in Fig. 10b, and for Feta vs. Salad in Fig. 10c.
In these cases, there is a large degree of correlation between the classifier scores assigned to signal events between Feta and all of the other construction methods. This result indicates that all of the SM background construction methods are (roughly) equally sensitive to this particular LHCO anomaly.  FIG. 10: Classifier scores for a binary classifier trained to discriminate a given constructed SM background template from detected SR data with 2500 injected signal events and evaluated on pure signal and pure background. The scores have been normalized to have zero mean and unit variance.