Jet Classification Using High-Level Features from Anatomy of Top Jets

Recent advancements in deep learning models have significantly enhanced jet classification performance by analyzing low-level features (LLFs). However, this approach often leads to less interpretable models, emphasizing the need to understand the decision-making process and to identify the high-level features (HLFs) crucial for explaining jet classification. To address this, we consider the top jet tagging problems and introduce an analysis model (AM) that analyzes selected HLFs designed to capture important features of top jets. Our AM mainly consists of the following three modules: a relation network analyzing two-point energy correlations, mathematical morphology and Minkowski functionals for generalizing jet constituent multiplicities, and a recursive neural network analyzing subjet constituent multiplicity to enhance sensitivity to subjet color charges. We demonstrate that our AM achieves performance comparable to the Particle Transformer (ParT) while requiring fewer computational resources in a comparison of top jet tagging using jets simulated at the hadronic calorimeter angular resolution scale. Furthermore, as a more constrained architecture than ParT, the AM exhibits smaller training uncertainties because of the bias-variance tradeoff. We also compare the information content of AM and ParT by decorrelating the features already learned by AM. Lastly, we briefly comment on the results of AM with finer angular resolution inputs.


Introduction
Understanding how to identify the initiating particle of jets in hadron collisions has been a crucial challenge in pursuing physics beyond the Standard Model. This problem becomes increasingly important as the Large Hadron Collider (LHC) accumulates substantial events for discovering new physics signatures at energies above the electroweak scale. The search involves the identification of the boosted heavy particles, such as W/Z bosons, Higgs bosons, and top quarks, produced in hard processes with high energies. Those boosted particles often create collimated clusters similar to the jets originating from light quarks and gluons (QCD jets). Therefore, designing and understanding taggers separating the heavy-particle-initiated jets from the QCD jets has been a crucial problem of collider physics, essential for maximizing the discovery potential of the LHC and future colliders.
Modern methods for jet tagging are based on deep learning algorithms to maximize the utilization of information in a jet. Deep learning classifiers are based on artificial neural networks capable of handling high-dimensional data and correlation among them. The flexibility and capability allow us to build a tagger directly analyzing low-level features (LLFs) like jet constituents. The early studies have concentrated on network architectures that analyze jets in representations close to those popular in computer science, such as jet images [1–8] and clustering sequences [9–11]. These deep learning-based classifiers have significantly outperformed traditional methods, which rely on a limited number of high-level features (HLFs) designed to capture specific substructural features of heavy-particle-initiated jets [12–19], by orders of magnitude. Consequently, deep learning has been regarded as a promising approach for boosting the sensitivity of the LHC experiments in detecting new physics.
The state-of-the-art (SoTA) models in classification performances are based on graph neural networks [20] and transformers [21]. First examples in jet physics include ParticleNet [22] based on dynamic graph convolutional neural networks [23] and Particle Flow Network [24] based on deep sets [25]. ParticleNet outperformed image-based approaches in top jet tagging [26] and has gained significant attention [27]. Graph neural networks are also convenient in fusing physics constraints, such as imposing IRC-safety in terms of energy correlators [25,28], using symmetry equivariant layers [29,30], and using Lund plane [31,32] inspired correlations between jet constituents [33]. The most advanced taggers currently are graph neural networks that incorporate physics knowledge: Particle Transformer [33], LorentzNet [29], and PELICAN [30], as the additional constraint simplifies the problem while preserving networks' capabilities.
Deep learning-based taggers have shown impressive performance in learning all the detailed patterns of given jets; however, there are concerns about systematic uncertainties, particularly regarding the learning of features from soft QCD. Since the dynamics of the strong interaction at low energies are not perturbatively described, simulations rely on phenomenological models tuned to experimental data instead [34–37]. Consequently, the soft features in simulated jets can vary significantly from one simulation to another. This systematic variance poses a challenge in building taggers, especially the supervised classifiers trained on simulated jets, as they may become sensitive to the choice of simulation setups [28,32,38–41], though in some cases, the simulation dependencies are minor [42]. A thorough understanding of the physics information utilized by each tagger is crucial to assess and minimize the systematic uncertainties and simulation bias of taggers.
To address this problem of understanding deep learning-based jet taggers, we revisit the classifiers based on HLFs. Compared to classifiers analyzing LLFs, taggers focusing on HLFs are more constrained and offer clearer interpretations in terms of physics. These characteristics make HLF-based taggers more interpretable, particularly when analyzing how HLFs influence network predictions [24,39,43]. Additionally, constrained taggers typically exhibit smaller training uncertainties, such as variation in network predictions due to random initialization of networks or train/validation dataset splitting [28,44]. This phenomenon is known as the bias-variance tradeoff, which suggests that constrained models tend to have less variance than more general models at the cost of performance.
What if HLF-based taggers are competitive with more general taggers such as graph neural networks? There would be two immediate advantages: network predictions would have less training uncertainty, and simpler taggers would require fewer computational resources. In principle, if we could design a set of HLFs that fully capture features crucial for jet tagging, we could develop a high-performing jet tagger based on HLFs. The inputs used by such taggers highlight the distributions that should be calibrated with data, or give us clear benchmarks that event simulations should satisfy.
This paper is structured as follows to address the challenges discussed above. Section 2 introduces our analysis model, which consists of modules analyzing groups of HLFs that capture specific characteristic features of top jets. This model is an extension of our previous works [28,39,40,45]. Specifically, we will revisit a set of constituent multiplicities sensitive to subjet color charges, which are also important features in top jet tagging. We will introduce a recursive neural network analyzing these numbers to our framework.
In Section 3, we compare the performance of the analysis model to one of the state-of-the-art jet taggers, Particle Transformer (ParT) [33]. As a benchmark, we focus on top jet classification problems using various simulations, including Pythia [35], Herwig [36], and Vincia [46,47]. Additionally, we compare jets simulated by different generators to assess each network's ability to differentiate simulations. For simplicity, our discussion will concentrate on features at the angular scale above the hadronic calorimeter scale, 0.1. We will demonstrate that our model has a classification performance comparable to that of ParT, emphasizing that all the introduced HLFs are crucial for achieving this level of performance.
In Section 4, we estimate the statistical significance of the difference between our analysis model and ParT using the bootstrap method. Our model exhibits smaller training uncertainties while maintaining performance comparable to ParT, within the range of 2 to 3σ. We also assess the difference in information learned by the two models by penalizing information in ParT that has already been learned by the analysis model.
Section 5 discusses the potential high-level features that start appearing below the hadronic calorimeter scale and presents the classification performance of the AM tweaked for the electromagnetic calorimeter scale analysis without accounting for the additional high-level features. We summarize our findings and conclude in Section 6.

Analysis Model Using High-Level Features

Anatomy of Top Jets and Feature-wise Analysis Modules
We design a neural network analyzing selected HLFs, capturing the most significant characteristic features of top jets, listed in Figure 1. Firstly, the top quark in a top jet decays into three quarks through the process $t \to bW \to b q \bar{q}$, resulting in a jet with a distinctive three-prong substructure. Among these three prongs, two subjets originate from the decay of the W boson, and they should satisfy the W-boson decay kinematics.
In addition to these topological substructures, color substructures appear in various scales. Since the entire system originates from a quark, the top jet exhibits a color triplet radiation pattern at the jet radius scale. We observe two important color features at the subjet level, where three prongs are visible. The W-boson subjet comes from a color singlet, resulting in confined radiation [48], and a color connection between the two prongs is expected [49]. The presence of the three leading quark subjets is also a distinctive feature setting top jets apart from QCD jets, which consist of subjets from both quarks and gluons [50]. Therefore, careful consideration of these color substructures is crucial in the analysis.
Note that we do not explicitly consider the color connections that extend beyond the jet boundary in this paper. While these external color connections are still features of top jets and can enhance the tagging performance in a supervised setup, they are highly process-dependent, potentially making taggers less process-agnostic [51]. Although the SoTA models are generally able to exploit this information to achieve the best tagging performance, our model will not directly address the external color connections. Instead, we will leave our model to use the information encoded in the selected HLFs and evaluate how effectively our setup can match the tagging performance of SoTA models.
For analyzing the selected HLFs derived from top jet anatomy, we introduce our AM, consisting of neural networks that take HLFs designed to fully capture the characteristic features of top jets. The AM consists of analysis modules $\Phi_A$, each of which analyzes a set of HLFs $x_A$, where $A$ is an HLF label. We denote the network output vector as $z_A$, i.e.,
$$z_A = \Phi_A(x_A, x_{\mathrm{kin}}),$$
where $x_{\mathrm{kin}}$ is a set of global features provided to all the analysis modules. The outputs $z_A$ are then converted to a single output $\hat{y}$ by an MLP $\Phi$, which is trained by minimizing the binary cross-entropy loss function.
In the following subsections, we describe each of the HLFs and their analysis modules. Figure 2 shows a schematic diagram of the AM combining the HLFs motivated from the following.
• (Section 2.4) jet color charge: $p_T$-scale conditioned jet constituent multiplicity distribution $x_{\mathrm{count}}$ and its generalization $x_{\mathrm{MF}}$ [28,40].
• (Section 2.5) subjet momenta $x_{\mathrm{subj}}$ and its constituent multiplicity for its color charge $x^{\mathrm{ex}}_{\mathrm{subj}}$.
The $p_T$-conditioned constituent multiplicities $x_{\mathrm{count}}$ and the last subjet analysis module are the main new features of this paper. In Section 3, we will show that the jet classifier built from the AM performs nearly as well as ParT. Namely, the selected HLFs are sufficient for describing the top jet classifier based on ParT. This setup further allows us to determine which features are more relevant in the classification by comparing tagging performance when some HLFs are dropped.

Global Features: Basic Jet Kinematics
We first select the global features $x_{\mathrm{kin}}$ mainly to provide all the analysis modules with the characteristic angular scales of jets, which are given in terms of the jet mass $m$ and transverse momentum $p_T$. In particular, we use the kinematical variables of the whole jet $J$ and the trimmed jet $J_{\mathrm{trim}}$ to capture the angular scale of top quark decay. We additionally provide the variables of the leading $p_T$ subjet $J_{\mathrm{lead}}$ to implicitly capture the W-boson decay scale. The resulting global features consist of the following inputs:
$$x_{\mathrm{kin}} = (p_{T,J},\, m_J,\, p_{T,J_{\mathrm{trim}}},\, m_{J_{\mathrm{trim}}},\, p_{T,J_{\mathrm{lead}}},\, m_{J_{\mathrm{lead}}}). \tag{2.3}$$

IRC-Safe Energy Correlators: Relation Network and Two-Point Energy Correlation Spectrum
The topological features and color charge characteristics are often captured by IRC-safe energy correlators [55], for example, N-subjettiness [16] and jet width/girth [56]. To build an analysis module covering the topological and color charge features described by the energy correlators, we use a relation network (RN) with an IRC safety constraint [45].
The vanilla RN [57,58] is a fully connected graph neural network that only uses edge features between vertices. Since perturbative phenomena in jets, such as particle decays and parton showers, are often described by two-body kinematics, an RN is a good choice for the first feature extraction. The RN consists of two networks: $\Phi_E$ for modeling derived edge features between two vertices and $\Phi_G$ for mapping accumulated edge features to the global output feature. The network output is then written as
$$\Phi_G\Big(\sum_{i \in a,\, j \in b} \Phi_E(v_i, v_j)\Big),$$
where $v_i$ and $v_j$ are the vertex features, and the sum runs over all the vertices representing jet constituents in sets $a$ and $b$.
With IRC safety and additional translation and rotation invariances on the $(\eta, \phi)$ plane, the summand must be a two-point energy correlator with an angular weighting function $W(R)$ that solely depends on the relative distance $R_{ij}$ between two constituents,
$$\Phi_E(v_i, v_j) = p_{T,i}\, p_{T,j}\, W(R_{ij}).$$
This RN can be significantly simplified by introducing an IRC-safe two-point correlation spectrum [28,39,40,45],
$$S_{2,ab}(R) = \sum_{i \in a,\, j \in b} p_{T,i}\, p_{T,j}\, \delta(R - R_{ij}). \tag{2.6}$$
The term $S_{2,ab}(R)\, dR$ can be interpreted as a $p_{T,i}\, p_{T,j}$ weighted histogram of the angular distance of jet constituent pairs. The double summations in edge feature accumulation are then simplified to a single integral,
$$\sum_{i \in a,\, j \in b} p_{T,i}\, p_{T,j}\, W(R_{ij}) = \int dR\, S_{2,ab}(R)\, W(R).$$
Note that this expression covers two-point energy flow polynomials $\mathrm{EFP}^n_2$ [24,43] when the angular weighting function $W(R)$ is a polynomial. We evaluate the integral with a simple rectangle rule with finite bin size $\Delta R = 0.1$; the integral turns into a linear combination of the binned two-point energy correlation spectrum $S_{2,ab}(R; \Delta R) = \int_R^{R+\Delta R} dR'\, S_{2,ab}(R')$. The linear combination coefficients can then be absorbed into the weight matrix of the first linear layer in $\Phi_G$. The RN thus simplifies to an MLP taking the $S_{2,ab}(R; \Delta R)$'s as inputs.
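As an illustration of the binned spectrum described above, the following is a minimal NumPy sketch, not the authors' implementation. The function name `binned_s2`, the maximum distance `r_max`, and the azimuthal wrapping are assumptions made for this example; only the $p_{T,i}\, p_{T,j}$ weighting and the bin size $\Delta R = 0.1$ come from the text.

```python
import numpy as np

def binned_s2(pt_a, eta_a, phi_a, pt_b, eta_b, phi_b, r_max=2.0, delta_r=0.1):
    """pT_i * pT_j weighted histogram of pairwise angular distances
    between constituent sets a and b (binned two-point spectrum)."""
    d_eta = eta_a[:, None] - eta_b[None, :]
    # Wrap azimuthal differences into [-pi, pi).
    d_phi = (phi_a[:, None] - phi_b[None, :] + np.pi) % (2.0 * np.pi) - np.pi
    r_ij = np.hypot(d_eta, d_phi)
    w_ij = pt_a[:, None] * pt_b[None, :]
    # r_max is a hypothetical cutoff; pairs beyond it are not counted.
    n_bins = int(round(r_max / delta_r))
    edges = np.linspace(0.0, n_bins * delta_r, n_bins + 1)
    s2, _ = np.histogram(r_ij.ravel(), bins=edges, weights=w_ij.ravel())
    return s2, edges

# Two constituents separated by 0.25 in eta, with a = b = the whole jet.
pt = np.array([1.0, 2.0])
eta = np.array([0.0, 0.25])
phi = np.array([0.0, 0.0])
s2, edges = binned_s2(pt, eta, phi, pt, eta, phi)
```

In this toy case the self-pairs fall into the first bin and the two cross pairs into the bin covering 0.2 to 0.3, and the bins sum to $(\sum_i p_{T,i})^2$. A first linear layer acting on `s2` then plays the role of the discretized weighting function $W(R)$.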
We again employ the trimmed jet $J_{\mathrm{trim}}$ and leading $p_T$ subjet $J_{\mathrm{lead}}$ for the constituent subsets to explicitly capture correlations in hard vs. soft radiations and the leading jet vs. others, respectively. In the $J_{\mathrm{trim}}$ analysis module, we use $a, b \in \{J_{\mathrm{trim}},\, J^c_{\mathrm{trim}} = J - J_{\mathrm{trim}}\}$ and denote the set of binned $S_{2,ab}$'s as $x_{\mathrm{trim}}$. Similarly, in the $J_{\mathrm{lead}}$ analysis module, we use $a, b \in \{J_{\mathrm{lead}},\, J^c_{\mathrm{lead}} = J - J_{\mathrm{lead}}\}$ and refer to the set of binned $S_{2,ab}$'s as $x_{\mathrm{lead}}$. It is noteworthy that the complementary sets $J^c_{\mathrm{trim}}$ and $J^c_{\mathrm{lead}}$ are formally IRC safe, and they provide a different perspective on the distribution of soft particles. We represent the outputs of the two analysis modules as $z_{\mathrm{trim}}$ and $z_{\mathrm{lead}}$,
$$z_{\mathrm{trim}} = \Phi_{\mathrm{trim}}(x_{\mathrm{trim}}, x_{\mathrm{kin}}), \qquad z_{\mathrm{lead}} = \Phi_{\mathrm{lead}}(x_{\mathrm{lead}}, x_{\mathrm{kin}}).$$
The RNs $\Phi_{\mathrm{trim}}$ and $\Phi_{\mathrm{lead}}$ are modeled using MLPs with two hidden layers. The schematic structure of the networks is illustrated in Figure 3a.

Constituent Multiplicity Distributions and Mathematical Morphology
To fully capture the color-related characteristics of jets, the IRC-safe shape variables alone are not enough; we also need IRC-unsafe variables. For instance, counting variables play a crucial role in the quark vs. gluon jet classification problem, where the distinguishing feature is the color charge of the originating parton: color triplet or octet [56,59,60]. Such counting variables are also essential in top jet tagging, as quark vs. gluon jet classification is needed at the subjet level. Note that the kinematics of subjets are different in top jets and QCD jets; the quark subjets of top jets tend to have a similar starting $p_T$ scale, whereas those of QCD jets do not, due to the kinematics of the parton shower. This kinematic difference in subjets suggests that conditional distributions of the counting variable given jet constituent $p_T$ can be an informative variable for capturing those color substructures of subjets.
Figure 4 shows the expected count histogram of jet constituent $p_T$, averaged over our training samples. (See Section 3.1 for the detailed information on our datasets.) Results from three event generators, Pythia (PY), Herwig (HW), and Vincia (VIN), are represented by solid, dotted, and dashed lines, respectively.
In the figure, QCD jet distributions are shown in blue lines. The HW and VIN expectations have similar trends, while PY produces slightly more constituents overall. The cases of top jets are shown in yellow lines; the PY and HW expectations are similar, while the VIN curve is slightly lower than the others. Note that the difference between event generators is more prominent in the soft region as this region is more sensitive to the parton shower, hadronization modeling, and their simulation tuning.
The differences in expectations between the QCD and top jets are more significant than the above systematic uncertainties between the simulations. In the low $p_T$ region, QCD jets typically have more constituents than top jets due to gluon (sub)jets and the high mass selection. On the other hand, the high $p_T$ region captures features of the hard process. The top jet histogram notably shows a prominent bump at ∼ 50 GeV. This bump is mostly due to the subjets originating from the W boson, and the hadronization of the corresponding boosted $q\bar{q}$ system. This multiplicity distribution conditioned on $p_T$ captures the differences between top and QCD jets and the systematic differences between simulations. Including this histogram and its generalization for the AM will significantly enhance the information encoding performance.
One can further extend the analysis of counting information by using Minkowski functionals (MFs) in mathematical morphology. MFs are complete linear basis functions of geometric measures with a notion of size, including the constituent multiplicity. If we use the MFs as inputs to an MLP, the resulting network effectively becomes a module analyzing geometric measures. The first linear layer models geometric measures, and the subsequent layers take care of their non-linear transformation. This method of constructing an analysis module is similar to the previously discussed simplification of the RN with the IRC safety constraint. For analyzing jet constituent distributions on the two-dimensional $(\eta, \phi)$ plane, we need to consider the three MFs: area $A$, perimeter length $L$, and Euler characteristic $\chi$.
These MFs are used with the dilation operation to encode non-trivial changes in the topology and geometry appearing at various scales, similar to persistent homology in topological data analysis [61]. Let us define a dilated image $I(R)$ as a Minkowski sum of the set of constituents $C$ and a disk $B(R)$ with radius $R$, i.e.,
$$I(R) = C \oplus B(R) = \{\, c + b \mid c \in C,\ b \in B(R) \,\}. \tag{2.10}$$
We illustrate this dilation operation in Figure 5. The corresponding MFs of $I(R)$ are denoted as $A(R)$, $L(R)$, and $\chi(R)$. By construction, these MFs are translation and rotation invariant. The persistence analysis on MFs under the dilation captures changes in topology and geometry. When the dilation is sufficiently small, the MFs of the dilated image are described by Steiner's formula, i.e., the dilated MFs are linear combinations of the MFs without dilation, which here are those of dots. For example, consider the dilation from left to middle of Figure 5 just before the disks touch each other. The MFs, in this case, can be expressed as follows, with $N$ the number of jet constituents:
$$A(R) = \pi R^2 N, \qquad L(R) = 2\pi R\, N, \qquad \chi(R) = N. \tag{2.11}$$
However, as soon as the disks come in contact, this relationship no longer holds, indicating that a non-trivial topological change occurred at the dilation scale. The MFs after the dilations are computed using the shapely package [62], which is based on the GEOS library [63]. For the numerical evaluations, the disks are approximated by regular 64-sided polygons. The Python code used for computing the MFs is available at https://github.com/sunghak-lim/mfjet.
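The dilation and MF evaluation can be sketched with shapely as follows. This is a simplified illustration and not the code in the repository linked above; the function name `minkowski_functionals` and the toy point sets are assumptions. It relies on the default of `buffer()`, which uses 16 segments per quarter circle and therefore produces a 64-sided polygon, and it computes the Euler characteristic as the number of connected components minus the number of holes.

```python
from shapely.geometry import Point
from shapely.ops import unary_union

def minkowski_functionals(points, radius):
    """Area, perimeter, and Euler characteristic of the union of disks
    of the given radius centred on the (eta, phi) points."""
    # buffer() approximates each disk by a regular 64-sided polygon
    # (default of 16 segments per quarter circle).
    disks = [Point(x, y).buffer(radius) for x, y in points]
    image = unary_union(disks)
    # The union is a single Polygon or a multi-part geometry.
    polygons = [image] if image.geom_type == "Polygon" else list(image.geoms)
    area = image.area
    perimeter = image.length  # outer boundaries plus hole boundaries
    n_components = len(polygons)
    n_holes = sum(len(p.interiors) for p in polygons)
    return area, perimeter, n_components - n_holes

# Well-separated points: each disk contributes independently.
isolated = minkowski_functionals([(0.0, 0.0), (1.0, 0.0)], 0.025)
# Four points on a square: the overlapping disks enclose one hole.
square = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1)]
ring = minkowski_functionals(square, 0.06)
```

For the well-separated points the result is close to $N$ disks of area $\pi R^2$ and perimeter $2\pi R$ with $\chi = N$, up to the polygon approximation. In the square configuration the four disks merge into one component enclosing one hole, so $\chi$ drops to zero, which is the kind of topological change the persistence analysis is meant to detect.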
To make this persistence analysis conditioned on constituent $p_T$, we examine the MFs of jet constituents exceeding certain $p_T$ thresholds $p_T^{\mathrm{th}}$. This setup yields the MFs $A(R; p_T^{\mathrm{th}})$, $L(R; p_T^{\mathrm{th}})$, and $\chi(R; p_T^{\mathrm{th}})$ as functions of the disk radius $R$ and the threshold $p_T^{\mathrm{th}}$. These functions can simultaneously capture the morphology of soft constituents and hard substructures within the jet. The MFs are sampled at the following values of $R$ and $p_T^{\mathrm{th}}$ to create a discrete set of network inputs.
Here, $\epsilon$ is a small positive number to avoid the numerical instability appearing when two disks overlap only at the boundary. Note that the diameter $2R$ is the parameter describing the correlation scale in this persistence analysis. The first MFs with $R = 0.025$ are simply the constituent multiplicities since we are using angular resolution 0.1 and no disks are overlapping each other; the MFs will be simply given as Eq. 2.11. For $k > 1$, some disks may overlap, and MFs can capture non-trivial topological changes during the dilation. For $R > 0.2$, we switch to a coarser grid. Most small-scale effects are smeared out at this scale, leaving only large-scale information less sensitive to the $R$ choice, such as hard substructure and large-angle radiations. We limit the angular scale up to $R_{\max} = 0.3$ in this paper. The dependence on the choice of $R_{\max}$ can be found in Section B.
In summary, this jet morphology analysis module covering IRC-unsafe counting observables considers the following inputs:
• $x_{\mathrm{count}}$: the values of jet constituent histogram bins over constituent $p_T$ within 0.1 GeV to 50 GeV. We use bin edges scaling exponentially, with $n_{\mathrm{bins}} = 20$ bins for this analysis module.
• $x_{\mathrm{MF}}$: $A(R; p_T^{\mathrm{th}})$, $L(R; p_T^{\mathrm{th}})$, and $\chi(R; p_T^{\mathrm{th}})$ evaluated at points described in Eq. 2.12 with $R \le 0.3$.
We simply use an MLP $\Phi_{\mathrm{MF}}$ to analyze these inputs. The outputs $z_{\mathrm{MF}}$ are given as follows.
$$z_{\mathrm{MF}} = \Phi_{\mathrm{MF}}(x_{\mathrm{count}}, x_{\mathrm{MF}}, x_{\mathrm{kin}}). \tag{2.14}$$
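The constituent $p_T$ histogram can be sketched as below. The exact bin-edge formula is not reproduced here, so this example assumes geometrically spaced edges between 0.1 GeV and 50 GeV with 20 bins; the function name `pt_count_histogram` and the treatment of constituents outside the range (they are simply not counted) are likewise assumptions.

```python
import numpy as np

def pt_count_histogram(constituent_pt, pt_min=0.1, pt_max=50.0, n_bins=20):
    """Count jet constituents in pT bins whose edges grow exponentially.

    Assumption: edges are geometrically spaced between pt_min and pt_max;
    constituents outside [pt_min, pt_max] are not counted.
    """
    edges = np.geomspace(pt_min, pt_max, n_bins + 1)
    counts, _ = np.histogram(constituent_pt, bins=edges)
    return counts, edges

# Toy constituent pT values in GeV; the first and last lie outside the range.
counts, edges = pt_count_histogram(np.array([0.05, 0.2, 1.0, 49.0, 60.0]))
```

With geometric spacing, every bin covers the same interval in $\log p_T$, so soft and hard constituents are resolved with comparable relative precision.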

Recursive Neural Network for Subjet Constituent Multiplicities and Kinematics
The analysis of constituent multiplicity and Minkowski functionals conditioned on $p_T$ provides information regarding the color charge of jets and its morphological behavior at each $p_T$ scale. This approach implicitly captures the information about subjet color charges. However, a few ambiguous cases exist, especially when two subjets have a similar $p_T$ scale.
To address this ambiguity, we introduce an analysis module for dealing with subjet constituent multiplicity explicitly.
In our subjet analysis module, we focus on the seven leading $p_T$ Cambridge-Aachen subjets [64,65] with radii of 0.1, 0.2, and 0.3. This is because the angular resolution of inputs is 0.1, and other modules handle larger angular scales. The choice of seven subjets is motivated by the results of N-subjettiness-based taggers [66]. As the subjettiness analysis for top jet tagging saturates at 5-body analysis [67], considering seven subjets allows us to account for a few additional soft radiations.
For each $k$-th subjet with radius $R$, we consider the feature sets $x^{(\mathrm{ex})}_{\mathrm{subj},k,R}$, built from the subjet kinematics $p_k$ and, for the extended set, the constituent multiplicity $N_{c,k}$. Here, $p_k$ is the subjet $p_T$, $\eta$, $\phi$, and $m$, and $N_{c,k}$ denotes the number of constituents in the subjet. For simplicity, we do not use the full Minkowski functionals; the $N_{c,k}$ alone provides sufficient information when this module operates together with the others.
We will analyze the above subjet features $x^{(\mathrm{ex})}_{\mathrm{subj},k,R}$ in a sequential setup to emphasize the subjet ordering, particularly because the first three subjets in top jets are likely to be quark jets. For this purpose, we stack the features of the $k$-th subjets with different radii into $x^{(\mathrm{ex})}_{\mathrm{subj},k}$, using $R = 0.1$, 0.2, and 0.3 for $k = 1, \dots, 5$ and only $R = 0.1$ and 0.2 for $k = 6, 7$ (Eq. 2.17). For the 6th and 7th subjets, we omit $R = 0.3$ since the preceding subjets sufficiently cover most of the jet constituents, and the remaining ones are minor clusters. We also consider $N_{c,k}$ only in the first five subjets, as the subsequent subjets are soft and their color charge is less relevant. A simpler input set $x_{\mathrm{subj},k}$ without constituent multiplicities is also introduced for comparison.
To effectively accommodate the subjet features into our AM, we use a recursive neural network for the first five subjets, in which a shared MLP $\Phi_C$ with two hidden layers is applied at each step $k = 2, \dots, 5$.
The 6th and 7th subjets contain less hierarchically important information; we use a separate MLP $\Phi_{67}$ with two hidden layers to process the information.
We denote the output of this module as $z^{(\mathrm{ex})}_{\mathrm{subj}}$, corresponding to the $x^{(\mathrm{ex})}_{\mathrm{subj},k}$ inputs. Figure 3b illustrates the schematic diagram of this network configuration. Note that if we use a deep set architecture [25] to introduce permutation symmetry for this analysis module, the resulting network is a variant of the jet flow network [68] with additional subjet constituent multiplicity information.
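The sequential structure can be sketched in NumPy as follows. The precise recursion is not spelled out in the text above, so this example assumes a simple form: a first MLP encodes the leading subjet, a shared MLP updates a hidden state with each of subjets 2 to 5, and a separate MLP handles subjets 6 and 7. All function names, layer widths, and feature dimensions (15 = 3 radii × 5 features, 8 = 2 radii × 4 features, 6 global features) are hypothetical choices for illustration.

```python
import numpy as np

def init_mlp(rng, n_in, n_hidden, n_out):
    """Random weights for an MLP with two hidden layers."""
    sizes = [n_in, n_hidden, n_hidden, n_out]
    return [(rng.normal(0.0, 1.0 / np.sqrt(m), size=(m, n)), np.zeros(n))
            for m, n in zip(sizes[:-1], sizes[1:])]

def mlp(params, x):
    """ReLU MLP; the last layer is linear."""
    for w, b in params[:-1]:
        x = np.maximum(x @ w + b, 0.0)
    w, b = params[-1]
    return x @ w + b

def init_subjet_module(rng, n_lead, n_soft, n_kin, n_hidden=32, n_state=16):
    return {
        "first": init_mlp(rng, n_lead + n_kin, n_hidden, n_state),
        "shared": init_mlp(rng, n_state + n_lead + n_kin, n_hidden, n_state),
        "soft": init_mlp(rng, 2 * n_soft + n_kin, n_hidden, n_state),
    }

def subjet_module(params, x_lead, x_soft, x_kin):
    """x_lead: (5, n_lead) features of subjets 1-5; x_soft: (2, n_soft)
    features of subjets 6-7; x_kin: global jet features."""
    state = mlp(params["first"], np.concatenate([x_lead[0], x_kin]))
    for k in range(1, 5):  # subjets 2..5 reuse the same weights
        state = mlp(params["shared"], np.concatenate([state, x_lead[k], x_kin]))
    z_soft = mlp(params["soft"], np.concatenate([x_soft.ravel(), x_kin]))
    return np.concatenate([state, z_soft])

rng = np.random.default_rng(0)
params = init_subjet_module(rng, n_lead=15, n_soft=8, n_kin=6)
x_lead, x_soft, x_kin = rng.normal(size=(5, 15)), rng.normal(size=(2, 8)), rng.normal(size=6)
z_subj = subjet_module(params, x_lead, x_soft, x_kin)
```

Because the hidden state is updated in order, swapping two subjets changes the output, unlike a permutation-invariant deep set.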

Datasets
In this section, we explain the top jet and QCD jet datasets used for training jet classifiers. Top jets are sampled from the hard process $pp \to t\bar{t}$, and QCD jets are from $pp \to jj$. Both processes are simulated at a center-of-mass energy $\sqrt{s} = 13$ TeV using MadGraph5 3.4.2 [69]. The simulated top quarks are decayed hadronically using MadSpin [70]. To focus on generating samples for boosted top jet classification, we exclusively generate events where the transverse momentum of top quarks in $t\bar{t}$ events and of the leading $p_T$ quark/gluon in dijet events is greater than 450 GeV.
For the parton shower and hadronization simulations, we use Pythia 8.308 [35] and Herwig 7.2 [71], considering the following parton shower models:
• Pythia, simple shower (denoted as PY)
• Pythia, Vincia shower (denoted as VIN)
• Herwig, default shower (denoted as HW)
The Vincia shower [46,47] is a $p_T$-ordered parton shower model for QCD + QED/EW showers based on the antenna formalism. This model exhibits improved color-coherence effects relative to the default simple shower model of Pythia. We additionally include Vincia to assess whether our AM has enough sensitivity to distinguish different Monte Carlo simulations, particularly in PY vs. VIN comparisons. Detailed results are presented in Appendix A. Regarding the simulation tune settings, the Monash tune [72] is used for Pythia, and the default tune is used for Herwig. Furthermore, we simulate neutral pion decay to ensure accurate modeling of particle distribution at angular scales near 0.1.
The generated particle-level events are processed using the Delphes detector simulation [73] using the default ATLAS detector card with a modification: jets are reconstructed from EFlow objects [73,74]. The jets are reconstructed by the anti-$k_t$ algorithm [54] with radius parameter $R = 1$, which is implemented in fastjet [75,76]. Note that the EFlow objects include information from tracks and electromagnetic calorimeters, offering a better angular resolution than hadronic calorimeter towers. However, to compare network architectures at the same lower angular scale threshold, we will later mask out the shorter-scale information during the construction of HLFs for AM and LLFs for ParT. We select the leading $p_T$ jets with transverse momentum $p_{T,J} \in [500, 600]$ GeV, jet mass $m_J \in [150, 200]$ GeV, and pseudorapidity $|\eta| \le 2$ as our top jet and QCD jet samples. For top jets, we further require that all quarks originating from the top and W boson decay are located within a distance $R = 1$ from the jet axis on the $(\eta, \phi)$-plane. We denote the PY, VIN, and HW simulated QCD jet datasets as $P_Q$, $V_Q$, and $H_Q$, respectively. Similarly, we use $P_T$, $V_T$, and $H_T$ to represent PY, VIN, and HW $t\bar{t}$ datasets, respectively. The training and test dataset sizes are listed in Table 1. While the dataset sizes are not uniform, we ensure the class balance by downsampling the larger dataset so that both classes have the same number of samples during the training. However, during testing, we use the entire dataset without downsampling.
Note that the dataset sizes are bigger than the public top tagging challenge dataset [26,77] in a similar kinematic range. The community challenge dataset, covering a slightly different $p_{T,J}$ range of [550, 650] GeV, does not include preselection based on jet mass. The number of top (QCD) training samples in the jet mass range [150, 200] GeV is 476K (59K), while our QCD training datasets have about 400K samples. Our dataset has more QCD jets in this mass range, which helps to mitigate the class imbalance during the training.

Particle Transformer
We will mainly compare our AM results with the Particle Transformer (ParT) [33]. ParT is a transformer-based neural network [21] that directly analyzes low-level features (LLFs) of jet constituents. The ParT architecture consists of three main modules:
• Particle attention blocks [33], which analyze LLFs. This module is based on the NormFormer architecture [78] with augmented multi-head attention layers. The query-key inner product in the attention layers is augmented by pairwise features inspired by LundNet [32].
• Class attention blocks [79] that transform a trainable class token [80,81] into an output, utilizing the final output of the particle attention blocks.
• An MLP that converts the transformed class token into classifier outputs.
The final output of ParT is then trained by minimizing the binary cross-entropy loss function.
Since our aim is to compare classifier performance with inputs having a finite angular resolution 0.1, we will mask out small-scale information by utilizing jet images [1–3]. The jet constituents are first preprocessed as described in [40]. This preprocessing involves shifting, rotating, and flipping on the $(\eta, \phi)$ coordinate to ensure that the leading subjet is at the origin, the second subjet lies on the positive $\phi$-axis ($\eta = 0$ and $\phi > 0$), and the third subjet is on the right side of the plane ($\eta > 0$). After the preprocessing, the jet constituents are pixelated into bins of size $\Delta\eta = \Delta\phi = 0.1$. Pixels with non-zero energy deposits are converted into massless 4-momenta using their $(\eta, \phi)$ coordinates and energy deposits. This set of 4-momenta is used as ParT inputs. The results using a finer resolution of $\Delta\eta = \Delta\phi = 0.025$ will be shown in Section 5.
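The pixelation step can be sketched as follows, assuming the constituents have already been shifted, rotated, and flipped. The function name `pixelate_to_four_momenta`, the image half-width, and the use of pixel centers as the $(\eta, \phi)$ of each massless 4-momentum are assumptions of this example and are not taken from the text.

```python
import numpy as np

def pixelate_to_four_momenta(eta, phi, energy, pixel=0.1, half_width=1.5):
    """Sum energies in (eta, phi) pixels and return one massless
    4-momentum (E, px, py, pz) per non-empty pixel."""
    n = int(round(2 * half_width / pixel))
    edges = np.linspace(-half_width, half_width, n + 1)
    # Constituents outside the image window are dropped by histogram2d.
    image, _, _ = np.histogram2d(eta, phi, bins=[edges, edges], weights=energy)
    i_eta, i_phi = np.nonzero(image)
    centers = 0.5 * (edges[:-1] + edges[1:])
    e = image[i_eta, i_phi]
    eta_c, phi_c = centers[i_eta], centers[i_phi]
    # Massless kinematics: E = pT * cosh(eta).
    pt = e / np.cosh(eta_c)
    px, py, pz = pt * np.cos(phi_c), pt * np.sin(phi_c), pt * np.sinh(eta_c)
    return np.stack([e, px, py, pz], axis=1)

# Two nearby constituents merge into one pixel; the third stays separate.
p4 = pixelate_to_four_momenta(
    np.array([0.01, 0.02, 0.55]),
    np.array([0.01, 0.03, -0.32]),
    np.array([10.0, 5.0, 20.0]),
)
```

Any structure below the pixel size is removed by construction, which is what allows a like-for-like comparison with HLFs computed at the same angular resolution.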
As a result of the pixelation process, the inputs become relatively simpler than the original 4-momenta; therefore, we use a smaller version of ParT compared to the one presented in the original paper [33]. Given that our sample size is about 380k per class, we target a similar number of network parameters in order to maintain a positive degree of freedom. Specifically, we halve the number of attention blocks and the width of each neural network from the default setting. In this configuration, ParT has 340k parameters.

Comparing Classification Performance
In this subsection, we compare the classification performance between AM and ParT, specifically using samples generated by PY and HW. The results involving samples generated by VIN are presented in Appendix A.
We particularly focus on the following four classification problems: ($P_T$ vs. $P_Q$), ($H_T$ vs. $H_Q$), ($P_Q$ vs. $H_Q$), and ($P_T$ vs. $H_T$). The first two, ($P_T$ vs. $P_Q$) and ($H_T$ vs. $H_Q$), are the top jet tagging problems using samples generated by PY and HW, respectively. Comparing these two will provide insight into the generator dependency of taggers. The latter two, ($P_Q$ vs. $H_Q$) and ($P_T$ vs. $H_T$), focus on detecting differences between jets generated by PY and HW. These classifiers are trained to learn the likelihood ratio between two distinct Monte Carlo simulations, which is crucial in quantifying differences between simulations. In this paper, our initial focus is on evaluating whether the HLFs used in AM are sufficient for matching the performance of ParT, which analyzes LLFs. In-depth analyses of systematic differences using HLFs are left for future work.
We present the classification performance of AM and ParT in terms of AUC in Table 2, and two rejection rates, R_50% and R_30%, in Table 3. The rejection rate R_X% is defined as R_X% = 1/FPR at TPR = X%, where TPR is the true positive rate and FPR is the false positive rate. In this comparison, the first class is taken as the positive class and the second class as the negative class. We report the performance metrics of the AM and ParT trainings that achieved the best AUC, selected from six AM and ten ParT trainings, each with a different random seed.
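As an illustration of the definition R_X% = 1/FPR at TPR = X%, the following sketch evaluates a rejection rate from two lists of classifier scores. The function name and the thresholding convention (the threshold is the score of the positive sample at the X% quantile, and a sample is accepted when its score is at or above it) are assumptions for illustration, not the evaluation code used here.

```python
import math


def rejection_rate(scores_pos, scores_neg, tpr_target):
    """Return 1/FPR at the score threshold that accepts a fraction
    tpr_target of the positive-class samples."""
    pos = sorted(scores_pos, reverse=True)
    # Number of positives that must pass the cut to reach the target TPR.
    n_keep = max(1, round(tpr_target * len(pos)))
    threshold = pos[n_keep - 1]
    # Negatives that also pass the cut are false positives.
    n_fp = sum(1 for s in scores_neg if s >= threshold)
    if n_fp == 0:
        return math.inf  # no background accepted at this working point
    return len(scores_neg) / n_fp
```

With finite samples the achieved TPR is only approximately X%; interpolating the ROC curve is an equally reasonable choice.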
In Table 2, we see that AM performs almost as well as ParT in our classification tasks, except for the (P_T vs. H_T) classification. We will discuss the significance of the differences in (P_T vs. H_T) later. The AUCs for top jet tagging are similar, and AM's AUC is even slightly higher in (P_Q vs. H_Q), even though it only uses a few HLFs. However, when AUCs are close, the corresponding ROC curves may intersect at some point. AUC alone may not clearly determine the performance hierarchy in these cases.
The rejection rates in Table 3 are more precise measures for the performance comparison. The rejection rates of (P_T vs. P_Q), (P_Q vs. H_Q), and (P_T vs. H_T) show the same trends seen in the AUCs. But the HW top tagging problem (H_T vs. H_Q), whose AUC was higher for ParT, now shows a mixed behavior: R_50% is better for ParT, while R_30% is better for AM. In summary, AM and ParT show similar performance in (H_T vs. H_Q), ParT slightly outperforms AM in (P_T vs. H_T) and (P_T vs. P_Q), and AM does slightly better in (P_Q vs. H_Q). These experiments indicate that while AM and ParT are comparable in performance, their exact hierarchy is unclear. In Section 4, we discuss the significance of the differences using bootstrap methods; once the statistical uncertainty and training stochasticity are taken into account, the significance does not exceed 3σ.
We tested several modifications of the AM, especially for closing the small gap between AM and ParT in the PY top tagging problem (P_T vs. P_Q), although HW top tagging and VIN top tagging (see Appendix A) do not have such a performance gap. The modifications include replacing the recursive network model for x^ex_subj with a fully connected network, adding MFs at an energy threshold of 16 GeV, increasing the number of hidden layers from two to three, and concatenating x_kin in each layer for more explicit conditioning on jet kinematics. However, these changes were insufficient to close the AUC gap. The fact that the gap is exclusive to PY suggests that introducing PY-specific HLFs appearing above the hadronic calorimeter resolution could potentially close it. However, delving into HLFs specific to a particular Monte Carlo simulation is beyond the scope of this paper, so we do not attempt to close the gap further here.

Importance of each Sub-module of Analysis Model
In addition to the direct comparison between AM and ParT, we further consider AM with partial inputs (AM-PIPs) for a hierarchical comparison of the inputs. Our AM incorporates selected sets of HLFs described in Section 2. These HLFs, which include jet kinematics x_kin, two-point energy correlations x_S2, p_T-conditioned constituent multiplicity x_count, Minkowski functionals for jet morphology x_MF, and subjet constituent multiplicities and kinematics x^ex_subj, are designed to capture various physics features of top jets illustrated in Figure 1. To evaluate the impact of each HLF set on the classification performance, we systematically activate or deactivate specific sets of inputs and their corresponding analysis modules. This could help gain performances compared to CNN with moderate depth.
The last comparison among E), F), the full AM, and ParT justifies the importance of subjet kinematics and their color structures. The baseline of this comparison is E), which is the AM-PIP without subjet information. Adding the subjet kinematic information x_subj to E) slightly improves performance, as seen in F). The reason is that two-point energy correlations, counting observables, and Minkowski functionals treat all the constituents equally during the feature aggregation, whereas the subjet information does not. However, the performance of F) still does not closely match that of ParT. Further adding subjet constituent multiplicities turns F) into the full AM, whose performance is comparable to ParT as discussed in the previous sections. This shows that the color charges of subjets, i.e., the origin of the subjets (q or g), play a role in the classifications.

Statistical Significance of the Difference between Analysis Model and Particle Transformer
In Section 3, we observed that 1) the performances of AM and ParT are comparable, and 2) AM-PIP improves as more inputs are included. In this section, we estimate the statistical uncertainties of the performance measures to show the significance of our conclusions.

Estimating Uncertainty of Performance Metrics by Bootstrapping
We estimate uncertainties associated with network training by using the bootstrap method. This method involves the generation of bootstrap datasets from the original dataset by sampling with replacement. In this resampling procedure, we treat the empirical distribution of the training dataset as an approximate true distribution. The bootstrap datasets effectively act as representative samples of this approximate true distribution. By training networks again on each of these datasets, the bootstrap method allows us to estimate the variance of training results. For this study, we use ten bootstrap datasets. Note that our AM is computationally cheaper than ParT, so this bootstrap analysis can be done much faster. The computational resources utilized by AM and ParT are explained in Appendix C.
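The resampling procedure can be sketched as follows. This is only an outline under stated assumptions: `train_and_evaluate` is a hypothetical callable standing in for a full training run followed by evaluation on the fixed test set, and the function names are not taken from the analysis code.

```python
import random
import statistics


def bootstrap_indices(n_samples, n_datasets, seed=0):
    """Draw n_datasets index lists, each sampling n_samples indices
    with replacement from range(n_samples)."""
    rng = random.Random(seed)
    return [
        [rng.randrange(n_samples) for _ in range(n_samples)]
        for _ in range(n_datasets)
    ]


def bootstrap_spread(dataset, train_and_evaluate, n_datasets=10, seed=0):
    """Retrain on each bootstrap dataset and return the mean and
    standard deviation of the resulting performance metric."""
    metrics = []
    for indices in bootstrap_indices(len(dataset), n_datasets, seed):
        resampled = [dataset[i] for i in indices]
        # Hypothetical hook: trains a network on `resampled` and returns
        # a metric (e.g. AUC) measured on the original test set.
        metrics.append(train_and_evaluate(resampled))
    return statistics.mean(metrics), statistics.stdev(metrics)
```

Because sampling is done with replacement, each bootstrap dataset contains only about 63% of the distinct original samples on average, which is a generic property of the bootstrap.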
In Table 4, we show the average and standard deviation of AUC, R_50%, and R_30% of AMs and ParTs trained on the bootstrap datasets. Here, we use the original test datasets without bootstrapping to evaluate those performance measures. Note that the performance measures are worse than those in Table 2 and Table 3. The difference arises because each bootstrap dataset is resampled from the original dataset. The resampling does not capture all the training samples, which makes the classifiers slightly suboptimal. Nevertheless, the variance is still a consistent estimator of the uncertainty. For ParT, we further validate this method by repeating the procedure with another set of bootstrap datasets. We denote the bootstrap result with the higher average AUC as ParT-best and the other as ParT-2nd. The ParT-best and ParT-2nd results are compatible within the uncertainties.
Table 4 (caption, continued): We show the averages and standard deviations of these classification performance metrics. For ParT, the classifier is trained twice; the ParT with the higher average AUC is labeled ParT-best, and the other ParT-2nd. The numbers in brackets represent the statistical uncertainties associated with each metric. Note that the bootstrap-estimated uncertainties already include this.

One notable observation is that the bootstrap-estimated uncertainties for ParT are significantly larger than those for AM. Specifically, in the HW top tagging problem, the uncertainties of ParT are twice as large as those of AM. The uncertainty difference is amplified by a factor of 10 in the generator classifications (P_T vs. H_T) and (P_Q vs. H_Q). This phenomenon is due to the bias-variance tradeoff inherent in statistical modeling. More expressive models, like ParT, can model classifiers more accurately, but at the cost of increased variance in the results. Consequently, ParT requires a larger sample size to achieve the same level of statistical precision as AM; for an analysis with a given sample size, the uncertainty of AM is always expected to be smaller than that of ParT.

This precision difference becomes more prominent if we subtract the statistical uncertainty from the bootstrap-estimated uncertainty. Here, we specifically focus on rejection rates, as their statistical uncertainty can be quickly estimated using analytic formulas. A naive estimate of the 1σ statistical uncertainty of the false positive rate R^{-1}_{X%} is given by the Wald interval,

$$\sigma_{R^{-1}_{X\%}} = \sqrt{\frac{R^{-1}_{X\%}\,\bigl(1 - R^{-1}_{X\%}\bigr)}{N}},$$

where N is the number of second-class samples. The uncertainty in the rejection rate, σ_{R_X%}, is then derived as

$$\sigma_{R_{X\%}} = R_{X\%}^{2}\,\sigma_{R^{-1}_{X\%}}.$$

The statistical uncertainty estimates for the rejection rates are listed in Table 4, denoted within brackets.
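A short numerical sketch of this estimate: the false positive rate 1/R_X% is treated as a binomial proportion measured on N negative-class samples (Wald interval), and the result is propagated to R_X% by first-order error propagation. The function name and the example numbers in the usage note are hypothetical.

```python
import math


def rejection_rate_uncertainty(rejection, n_negative):
    """Statistical 1-sigma uncertainty of a rejection rate R = 1/FPR.

    Uses the Wald (normal-approximation) uncertainty of the FPR and
    first-order propagation: sigma_R = R**2 * sigma_FPR.
    """
    fpr = 1.0 / rejection
    sigma_fpr = math.sqrt(fpr * (1.0 - fpr) / n_negative)
    return rejection ** 2 * sigma_fpr
```

For example, a hypothetical R = 100 measured on N = 100,000 negative samples gives an uncertainty of about 3.1. The same expression can be written in closed form as R·sqrt((R − 1)/N).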
We can see that the uncertainty is dominated by the statistical uncertainty in the case of AMs, whereas for ParT, this is not the case. The additional variance observed in ParT comes from the randomness of training, such as network initialization and batch splitting. In the case of AM, the training procedure sufficiently suppresses the dependence on initial conditions and other sources of randomness, so that the bootstrap-estimated uncertainty and the statistical uncertainty are approximately the same. In contrast, ParT training is less successful in reducing this variance to levels below the statistical uncertainties, resulting in higher residual variance in the bootstrap-estimated uncertainties. This stability in the training process is one significant advantage of AMs, especially when their performance is comparable to that of state-of-the-art models like ParT.
For a more quantitative comparison of the difference in performance measures between AM and ParT, we define the significance of the difference in a performance measure X as follows:

$$\sigma_X = \frac{X_{\mathrm{ParT}} - X_{\mathrm{AM}}}{\sqrt{\sigma_{X,\mathrm{ParT}}^{2} + \sigma_{X,\mathrm{AM}}^{2}}}.$$

For X_ParT and σ_{X,ParT}, we use the average of the ParT-best and ParT-2nd results. The significances σ_X are shown in Table 5, all of which are less than 3σ. This indicates that the moderate performance gap between AM and ParT in PY top tagging (P_T vs. P_Q) and top jet simulation comparison (P_T vs. H_T), shown in Table 2 and Table 3, becomes less significant after accounting for both statistical uncertainty and training stochasticity.

Finally, to test the consistency of AM and ParT predictions under the randomness of the training, we revisit the classifiers for (P_Q vs. H_Q) trained with different random seeds, as discussed in Section 3.3.1. Figure 6 illustrates the differences in classifier outputs between the networks with the best and second-best AUCs. A classifier less influenced by training randomness will yield more similar network outputs. The difference histogram will peak narrowly around zero in such a case. In this comparison, the peak for AM (blue dot-dashed line) is narrower than that for ParT (red solid line), indicating more consistent predictions. Specifically, the standard deviation of the histogram is 0.03 for AM and 0.06 for ParT. A simpler model, such as the AM-PIP involving (x_kin, x_MF, x_count), shows an even smaller variation of 0.012, though its performance is suboptimal.
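The significance of a performance difference discussed in this section can be evaluated with a few lines of code. The sketch below assumes that the AM and ParT uncertainties are uncorrelated and combined in quadrature; the function name and the numbers in the usage note are hypothetical and are not values from Table 4 or Table 5.

```python
import math


def significance(x_part, sigma_part, x_am, sigma_am):
    """Difference of a performance metric in units of the combined
    (quadrature-summed) uncertainty of the two models."""
    return (x_part - x_am) / math.hypot(sigma_part, sigma_am)
```

For instance, hypothetical AUCs of 0.950 ± 0.004 and 0.940 ± 0.003 would differ by 2.0 in these units.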

Comparing the Information Content of Classifiers using Odds Ratio
In this section, we directly compare AM and ParT using an odds ratio, which is useful in comparing the information content of two classifiers. Let s_I(x) be the classifier output for a given input x, i.e., the posterior probability of being a signal as modeled by classifier I. The odds ratio ŵ_I/II is defined as the ratio of the odds estimated by models I and II:

$$\hat{w}_{\mathrm{I/II}}(x) = \frac{s_{\mathrm{I}}(x)}{1 - s_{\mathrm{I}}(x)} \bigg/ \frac{s_{\mathrm{II}}(x)}{1 - s_{\mathrm{II}}(x)}.$$

An odds ratio greater than 1 indicates that classifier I is more confident in its signal prediction than classifier II by the factor of the odds ratio. If both models are identical, the odds ratio will be 1. Therefore, deviations from 1 in the odds ratio indicate differences in the predictions of the two classifiers for a given sample.

Note that when there is a clear hierarchy between two models, the odds ratio can also be interpreted as a likelihood ratio between two classes, using only the features exclusive to the more expressive model. If classifier I is more expressive than II, the prediction s_II(x) in the denominator acts as a normalizer countering the information in s_I(x) covered by model II, similar to a conditional probability. With this odds ratio interpretation, we can construct a posterior probability that works as a classifier utilizing only the features exclusive to I:

$$s_{\mathrm{I/II}}(x) = \frac{\hat{w}_{\mathrm{I/II}}(x)}{1 + \hat{w}_{\mathrm{I/II}}(x)}.$$

A more detailed derivation of this interpretation is described in Appendix D. Since the odds ratio has a classifier interpretation, we quantify the difference between AM and ParT in terms of performance metrics (AUC and rejection rates) of ROC curves using the probability s_I/II (equivalently, the odds ratio OR_I/II) as a threshold variable. If two classifiers are equivalent, the AUC will be that of a random guess, 0.5, and the rejection rate R_X% will be 1/X%.
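The odds ratio and the associated posterior can be sketched as follows. The sketch assumes classifier outputs strictly between 0 and 1 and the balanced-class form s = w/(1 + w) for converting the odds ratio back to a probability; the function names are illustrative only.

```python
def odds(p):
    """Odds p / (1 - p) of a probability 0 < p < 1."""
    return p / (1.0 - p)


def odds_ratio(s_one, s_two):
    """Ratio of the signal odds predicted by classifier I and classifier II."""
    return odds(s_one) / odds(s_two)


def odds_ratio_posterior(s_one, s_two):
    """Map the odds ratio w to a probability-like score w / (1 + w),
    usable as a threshold variable for ROC curves."""
    w = odds_ratio(s_one, s_two)
    return w / (1.0 + w)
```

When the two classifiers agree on a sample, the odds ratio is 1 and the score is 0.5, so a pair of equivalent classifiers yields a ROC curve compatible with a random guess. Because w/(1 + w) is monotonic in w, thresholding on the score or on the odds ratio gives the same ROC curve.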
To estimate uncertainties together, we take the average prediction of the ParT-best models from the bootstrap analysis as s_I. We then evaluate the odds ratio against AM and ParT-2nd, each trained on bootstrap datasets. Specifically, for the i-th bootstrap dataset, s_II is either the prediction of AM, s^AM_i, or that of ParT-2nd, s^ParT-2nd_i. We evaluate the mean and variance of the AUC and rejection rates (R_50% and R_30%) across the bootstrap datasets to assess the difference between the two models.
The above uncertainty estimation procedure primarily focuses on variations of model II and does not explicitly account for the uncertainties of model I. To estimate the uncertainties in ParT-best, we can take ParT-2nd as s_II. The uncertainties associated with ParT-2nd can be interpreted as an additional uncertainty due to the variation within ParT-best as model I, because ParT-best and ParT-2nd are essentially the same model trained under different random states.
In Table 6, we show the odds ratio analysis results. Ideally, the results of ParT-2nd should be those of a random guess (AUC = 0.5, R_50% = 2, and R_30% = 3.33). The rejection rates R_30% for the top tagging problems (P_T vs. P_Q) and (H_T vs. H_Q) deviate slightly from 3.33. However, the uncertainty is also sufficiently large, so the results are consistent with the expectation. Overall, all the performance metrics are compatible with those of a random guess, and the expectation holds for all the classification tasks.
The results of AM apparently deviate from those of a random guess, especially in (P_T vs. H_T). The AUC for this classification is 0.546 ± 0.003. However, it is important to remember that the uncertainties of AM's performance metrics are relatively small compared to those of ParT. In this odds ratio analysis between ParT and AM, the uncertainties are dominated by those of ParT-best. Comparing the deviation of 0.046 to the uncertainty in the ParT-2nd analysis, 0.02, the deviation is about twice the uncertainty. Similarly, when considering the uncertainties of ParT-best, all the performance metrics for AM are compatible with a random guess within approximately 2σ.
Finally, Figure 7 shows the average and standard deviations of the ROC curves for the following classifiers trained on the bootstrap datasets: ParT-best (black circles), ParT-2nd (red inverted triangles), and AM (blue left-facing triangles). Additionally, we show the ROC curves from the odds ratio analysis (Eq. 4.2 and Eq. 4.3) of ParT-best against ParT-2nd (red triangles) and ParT-best against AM (blue right-facing triangles). The ROC curves of AM are compatible with those of ParT within 3σ, consistent with the other metrics discussed previously. The ROC curves in the odds ratio analysis between ParT-best and AM look close to the random guess, except for the classification (P_T vs. H_T). But again, comparing the difference to the uncertainties of the odds ratio between ParT-best and ParT-2nd, the gap becomes less significant. In summary, our AM is aligned with ParT sufficiently well for top jet tagging and for distinguishing simulations, while showing smaller training uncertainties.

Discussion on Analysis Model and High-Level Features at Sub-Hadronic Calorimeter Angular Resolution
Throughout the previous analyses, we used datasets pixelated or binned to the angular resolution of the hadronic calorimeter (HCAL), ∆R_H ≃ 0.1. The classification performance of our AM is consistent with that of one of the state-of-the-art models, the Particle Transformer. A natural follow-up question is whether this consistent performance still holds at a finer resolution, such as the angular resolution of the electromagnetic calorimeter (ECAL), ∆R_E ≃ 0.025. This section addresses the expected AM performance at sub-HCAL angular resolution.

Before delving into the analysis at sub-HCAL angular resolution, we must pay attention to new features appearing at the smaller scales. The first important difference is that ECAL towers and tracks are implicitly distinguishable from HCAL towers in the EFlow object analysis at sub-HCAL angular resolution. If there is a constituent cluster with a size below ∆R_H, that cluster is highly likely to consist of ECAL towers and tracks. Similarly, if a cluster is smaller than ∆R_E, the cluster mostly consists of tracks.
This implicit accessibility of constituent type leads to an interesting consequence: particle identification (PID) information is also partially and implicitly accessible in sub-HCAL angular resolution analysis. Comparing energy deposits in the HCAL, ECAL, and tracks partially reveals how much energy is carried by certain types of particles. This is useful information when comparing jets with different PID compositions, and the use of this energy comparison has been studied in the context of jet flavor identification [83,84]. Also, explicitly considering PID information improves jet tagging performance, as has been demonstrated using particle-level analysis, i.e., when there is no implicit utilization of angular resolution [22]. Designing an analysis module and high-level features for capturing this constituent type information may further improve the tagging performance of AM. However, constructing a new module analyzing implicit PID information is beyond the scope of this paper. We will focus on adapting the current AM setup to a finer resolution analysis and compare its performance to that of ParT.
To adjust AM for higher resolution, we first have to be cautious about an inflated number of inputs. For example, the set of two-point energy correlations with trimming, x_trim, described in Section 2.3, originally consisted of 60 inputs, i.e., (3 types of S_2,ab) × (20 bins). But at the ECAL angular resolution level, we now have to consider 80 bins per S_2,ab, which inflates the number of inputs by a factor of 4. Similarly, the set of MFs, x_MF, described in Section 2.4, consists of 150 inputs, i.e., (3 Minkowski functionals) × (10 dilation scales) × (5 p_T thresholds). Again, we have to consider four times more dilation scales to cover the same dilation scale range, so the total number of inputs will be inflated to 600.
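The input counts quoted above follow from simple multiplication, as the sketch below makes explicit. The function name and its `scale_factor` argument (the ratio of HCAL to ECAL angular resolution, 0.1/0.025 = 4) are introduced only for this illustration.

```python
def input_counts(scale_factor):
    """Number of x_trim and x_MF inputs when the angular binning is
    refined by scale_factor relative to the HCAL-resolution setup."""
    n_s2_types = 3          # types of S_2,ab
    n_s2_bins = 20          # angular bins at HCAL resolution
    n_mf_types = 3          # Minkowski functionals
    n_dilation_scales = 10  # dilation scales at HCAL resolution
    n_pt_thresholds = 5     # p_T thresholds

    n_trim = n_s2_types * n_s2_bins * scale_factor
    n_mf = n_mf_types * n_dilation_scales * scale_factor * n_pt_thresholds
    return n_trim, n_mf
```

Here `input_counts(1)` gives the HCAL-resolution counts (60, 150) and `input_counts(4)` gives the ECAL-resolution counts (240, 600); the latter two alone already exceed the 4 × 50 = 200 components of the constituent 4-momenta of a typical jet.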
The reason why we have to be cautious about inflated inputs is that they may make the training dataset more sparsely distributed in the input space. Jets in our dataset contain about 50 constituents, and the dimensionality of the input space of the AM for ECAL angular resolution analysis exceeds the dimension of the 4-momenta of jet constituents, 4 × 50. In this case, a less carefully tuned AM may require more training samples to compensate for the sparsity of the data.
Here, we present the results of AM adapted to the ECAL angular resolution ∆R_E. In this quick analysis, instead of using finer pixels and bins everywhere, we simply augment the previous AM for HCAL angular resolution analysis, in order to avoid an inflation of the input dimension that would make the dataset sparse. Here is the list of changes:
• Pixelated jet constituent multiplicity variables in x_count and x_subj are now evaluated using jet images with a pixel size of ∆R_E × ∆R_E.
• Two-point energy correlation analysis modules Φ_trim and Φ_lead additionally take extra binned S_2,ab's on the domain [0, 0.25) with bin size 0.025.
• The Minkowski functional analysis module Φ_MF additionally accounts for Minkowski functionals with dilation scales in [0.0125, 0.125] with step size 0.0125.
Namely, we count the sub-HCAL angular scale information up to the angular scale R = 0.25, and we consider information above the ECAL scale only using inputs with HCAL angular resolution. With this setup, each of the analysis modules is able to take care of differences between their high-level inputs at ECAL and HCAL angular resolution, around and below the HCAL angular resolution. Also, only 10 new inputs are added for each of the S_2,ab's and MFs, so that the input dimensions are not inflated too much.

The classification performances of AM and ParT at this ECAL angular scale are shown in Table 7 and Table 8. Here, the averages and uncertainties of the performance metrics are estimated using a bootstrap method. The ParT model used here is the same ParT model described in Section 3.2. The difference in the classification metrics between AM and ParT lies between 2σ and 5σ when we only count the variation of the classifiers under a change of random seeds. Again, the error in this comparison is mostly dominated by the uncertainties of ParT training, and P_T vs. H_T shows the largest difference, about 5σ. We leave accounting for additional high-level features at smaller scales, further fine-tuning of the AM setup and the binning of the inputs, and a precise estimate of the full uncertainty for future studies.
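The additional fine-resolution grids listed above each contain ten points, which can be checked with a short sketch. The helper `uniform_grid` is hypothetical; it builds the grids from integer multiples of the step size to avoid accumulating floating-point error.

```python
def uniform_grid(start_index, n_points, step):
    """Return [(start_index + k) * step for k = 0 .. n_points - 1],
    rounded to suppress floating-point noise."""
    return [round((start_index + k) * step, 10) for k in range(n_points)]


# Lower edges of the extra S_2,ab bins on [0, 0.25) with bin size 0.025.
s2_lower_edges = uniform_grid(0, 10, 0.025)

# Extra dilation scales in [0.0125, 0.125] with step size 0.0125.
mf_dilation_scales = uniform_grid(1, 10, 0.0125)
```

Both lists have length 10, matching the statement that 10 new inputs are added per S_2,ab and per Minkowski functional (for each p_T threshold, if the thresholds are kept as in the HCAL setup, which is an assumption here).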

Summary and Conclusion
This paper introduces an analysis model (AM) employing the important high-level features (HLFs) in top jet tagging. The AM simplifies the current state-of-the-art taggers analyzing low-level features (LLFs) while maintaining classification performance. To demonstrate the effectiveness of our model, we revisited the problem of top jet tagging using various Monte Carlo simulated samples at the angular resolution of the hadronic calorimeter. We compared the classification performance of AM with a classifier based on the Particle Transformer (ParT).

In our previous works, we developed AMs using the IRC-safe relation network [28,39,45] and mathematical morphology for jet physics [28,40]. The performance of this model matched that of a convolutional neural network, but it fell short of matching ParT's performance. Nevertheless, once we fully consider the newly introduced features (the p_T-conditioned jet constituent multiplicity distribution, and subjet constituent multiplicities for capturing the color charges of subjets), the performance of AM is aligned with that of ParT within 2σ to 3σ of training uncertainties.

The compatibility between our AM and ParT varies depending on the specific classification problem. For instance, in the top tagging problem using Herwig-generated samples, AM demonstrates performance comparable to that of ParT. Similarly, AM matches ParT's performance in distinguishing Pythia-generated QCD jets from Herwig-generated QCD jets. In the top tagging problem with Pythia-generated samples, and in distinguishing Pythia-generated top jets from Herwig-generated top jets, we observed discrepancies at the level of 2σ to 3σ of training uncertainty. The results suggest that Pythia-simulated jets may have unique features that are not covered by AM and that are absent in Herwig-simulated jets. Identifying the HLFs exclusive to Pythia to fill this minor gap would be interesting, particularly for understanding the differences between simulations and calibrating those with experimental data; we plan to explore this in future studies.
One key advantage of employing simpler classifiers is the reduction in training uncertainty, a direct consequence of the bias-variance tradeoff. We explicitly demonstrated this behavior in AM through bootstrap methods applied to our jet tagging problems, and our AM demonstrated much smaller training uncertainty than ParT. Importantly, the smaller uncertainty is useful when classifiers are utilized for estimating statistical quantities, such as a likelihood ratio for reweighting simulated datasets [85] and decorrelating certain features by planing [3,86]. We expect that our AM and this kind of HLF-analyzing classifier-based approach will be valuable in narrowing down uncertainties of reweighting-based calibration for simulated events [87,88] and synthetic samples from machine-learned generative models [89][90][91].
change of this scale parameter. The choice of this parameter may vary the performance not only in jet tagging but also in simulation comparison, since the angular distribution of the parton shower pattern depends on its modeling. The subsequent emissions in parton showers with the angular ordering of HW and the p_T-ordering of PY are different [92,93], and the influence can also be observed by two-point energy correlation S_2 based analysis [39]. On top of this, the Minkowski functionals capture the geometric features of the parton shower in the angular coordinates (η, ϕ), which are more sensitive to soft physics compared to the energy correlations.
In Figure 8, we show the rejection rates R_30% of the AM for different choices of R_max. The left plot shows the PY vs. HW classification results for top and QCD jets. Each point in the plot is the average and standard deviation of R_30% over the different bootstrap datasets. We can see that too small an R_max results in degraded performance, since topological and geometrical features prominent at larger dilation scales are not yet counted; as R_max increases, the rejection rate approaches that of ParT and starts to saturate from R_max = 0.3. In the same plot, we also show as dashed lines the R_max dependence of AM-PIP E), which takes the following inputs: x_kin, x_S2, x_count, and x_MF. AM shows a weaker dependence on R_max than AM-PIP E). This is because x_MF and the subjet constituent multiplicities in x^ex_subj for R = 0.1, 0.2, and 0.3 are partially correlated. The residual R_max dependence of R_30% of AM comes from the jet constituents outside the leading subjets. The variation of the rejection rates under the scaling of R_max is around 5% for AM and 10% for AM-PIP E) within the range [0.025, 0.3], and it is statistically significant. As discussed in the main text, the estimated error of R_30% for the ParT model is significantly larger than that for the AM(-PIP) models. The solid and dashed horizontal lines show R_30% of ParT and its 1σ deviation estimated by bootstrapping. The deviation is almost as large as the change of R_30% of AM-PIP from R_max = 0.025 to 0.3. This illustrates that AM is beneficial for interpreting classifiers, for example in precision analyses of the influence of input features on the classification performance.
In the right plot, we show the results of the P_T vs. P_Q and H_T vs. H_Q classifications. Unlike the simulation comparison problems, P_Q vs. H_Q and P_T vs. H_T, we do not see a significant R_max dependence. This is expected because the difference in the hard substructure is the dominant part of the classification, and the parton shower emissions and the hadronization model are common between P_T and P_Q, and between H_T and H_Q.

C Comparing Computational Resources utilized by AM and ParT
In this appendix, we describe the computational resources utilized by AM and ParT. During training with a batch size of 1,000, AM requires less than 1 GB of GPU memory, while ParT consumes approximately 14 GB. Additionally, the training process of AM is significantly faster. When training AM on an NVIDIA GeForce 1080 Ti GPU (FP32 performance: 11.3 TFLOPS), the training session takes 30 minutes with an overall GPU utilization of 35%. In comparison, training ParT on an NVIDIA RTX 6000 GPU (FP32 performance: 38.7 TFLOPS) also takes 30 minutes, but with a GPU utilization of 95%. The ratio of the number of floating-point operations, calculated as (FP32 performance) × (GPU utilization) × (training time), between ParT and AM is approximately 18.3 to 2.0. This indicates that AM requires a significantly smaller number of GPU operations than ParT.
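The quoted ratio can be reproduced from the numbers above. The sketch below expresses the work in units of TFLOPS × hours, which is an assumption about the units behind "18.3 to 2.0"; the function name is illustrative.

```python
def gpu_work(fp32_tflops, utilization, minutes):
    """Estimated floating-point work in TFLOPS x hours:
    (FP32 performance) x (GPU utilization) x (training time)."""
    return fp32_tflops * utilization * (minutes / 60.0)


am_work = gpu_work(11.3, 0.35, 30)    # AM on the GeForce 1080 Ti
part_work = gpu_work(38.7, 0.95, 30)  # ParT on the RTX 6000
ratio = part_work / am_work
```

Under this estimate, ParT uses roughly nine times more floating-point operations than AM for one training session.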

Figure 1: Illustrations of characteristic features of top jets. The three diagrams at the bottom are the W-subjet features. Blue-crossed radiation in the color-singlet W-subjet illustration denotes suppressed radiation.

Figure 2: A schematic diagram of our AM combining x_kin (Section 2.2), x_S2 = (x_trim, x_lead) (Section 2.3), x_count and x_MF (Section 2.4), and x_subj (Section 2.5) for the inference.

Figure 3: (a) A schematic diagram of the relation network. (b) A schematic diagram of the recurrent neural network analyzing subjet constituent multiplicities and subjet kinematics.

Figure 4: Expected p_T distribution of pixelated jet constituents in top and QCD jets. The constituents are converted into pixels on the (η, ϕ) plane, with a pixel size of 0.1 × 0.1. The expected counts of the histogram bins are shown, with boundaries at $p_T \in \{0.1,\ 0.5 \times (\sqrt{2})^{k} \mid k = 0, \cdots, 20\}$ GeV. For our event samples and pre-processing details, see Section 3.1.

Figure 5: An illustration of the dilation of point clouds by disks. This dilation is used before evaluating the Minkowski functionals.

Figure 6: Distribution of the difference between the classifier outputs of the best model and the second-best model for the P_Q vs. H_Q classification. The red solid line is for ParT, and the blue dot-dashed line is for AM. The distribution for ParT is significantly broader than that of AM. Additionally, the green dotted line is for the AM-PIP with only the morphology analysis module active, and the orange dashed line is for the AM-PIP with the relation network and subjet analysis modules active. These two extra lines demonstrate output variations in simpler models compared to the full AM.

Figure 7: The average ROC curves and their standard deviations for the AM, ParT-best, and ParT-2nd trained on bootstrap datasets. Additionally, we show the ROC curves for the odds ratios OR_I/II, where I is the ensemble average of ParT-best, and II is AM or ParT-2nd. The ROC curves shown here are for the following classification problems: (top left) P_T vs. P_Q, (top right) H_T vs. H_Q, (bottom left) P_Q vs. H_Q, and (bottom right) P_T vs. H_T.

Table 1: The number of training and testing samples generated by Pythia (P), Herwig (H), and Vincia (V). The subscripts T and Q denote top jet and QCD jet samples, respectively. During the training of a binary classifier, we ensure the use of an equal number of samples for both classes by downsampling the larger dataset. Still, all available samples are used for testing.

Table 2: AUC for CNN, ParT, AM, and various configurations of AM-PIPs. The specific inputs activated in each AM-PIP are listed in the left column. The AUC of AM is closest to ParT in all classifications, except for P_T vs. H_T. Note that x_subj does not include subjet constituent multiplicities, while the full AM uses x^ex_subj, which includes the multiplicities.

Table 3: Rejection rates R_X% for CNN, ParT, AM, and various configurations of AM-PIPs. The specific inputs activated in each AM-PIP are listed in the left column. Note that x_subj does not include subjet constituent multiplicities, while the full AM uses x^ex_subj, which includes the multiplicities.

Table 4: AUCs and rejection rates R_X% of the AM and ParT, both trained on bootstrap datasets.

Table 5: Estimated significance of the difference in classifier performance metrics between AM and ParT, based on the training uncertainties measured by the bootstrap method.

Table 6: Classification performance metrics for the odds ratio OR_I/II, where I is ParT-best and II is AM or ParT-2nd. This table uses the results from the bootstrap analysis, and we show the training uncertainties together. An AUC value closer to 0.5 and rejection rates R_X% closer to 1/X% indicate that the two classifiers have similar information content.

Table 7: The average and standard deviations of AUC, considering ten networks trained using different random number seeds and trained on bootstrapped samples. Here, we use the same ParT model described in Section 3.2.

Table 8: The average and standard deviations of the rejection rates R_50% and R_30%, considering ten networks trained using different random number seeds and trained on bootstrapped samples. Here, we use the same ParT model described in Section 3.2.

Table 11: Rejection rate R_30% of the AM and ParT in classifications involving the VIN-generated datasets V_T and V_Q.

Table 12: Classification performance metrics for the odds ratio OR_I/II, where I is ParT-best and II is AM or ParT-2nd. An AUC value closer to 0.5 and rejection rates R_X% closer to 1/X% indicate that the two classifiers have similar information content.

Figure 8: Left: The R_max dependence of R_30% of the P_T vs. H_T classification (blue lines) and the P_Q vs. H_Q classification (red lines). The solid lines are for the AM, and the dashed lines are for the AM-PIP E) with x_kin, x_S2, x_MF + x_count. The R_30% of the ParT model is also shown as horizontal lines. All errors are estimated by bootstrapping. Right: The same plot for the P_T (H_T) vs. P_Q (H_Q) classification.