Pineline: Industrialization of High-Energy Theory Predictions

We present a collection of tools automating the efficient computation of large sets of theory predictions for high-energy physics. Calculating predictions for different processes often requires dedicated programs. These programs, however, accept inputs and produce outputs that are usually very different from each other. The industrialization of theory predictions is achieved by a framework which harmonizes inputs (runcards, parameter settings), standardizes outputs (in the form of grids), produces reusable intermediate objects, and carefully tracks all metadata required to reproduce the computation. Parameter searches and fits of non-perturbative objects are exemplary use cases that require a full or partial re-computation of theory predictions and will thus benefit from such a toolset. As an example application, we present a study of the impact of replacing NNLO QCD K-factors in a PDF fit with the exact NNLO predictions.


Licensing provisions: GPLv3
Programming language: Python, Rust
Nature of the problem: The computation of theoretical quantities in particle physics often involves computationally intensive tasks, such as the calculation of differential cross sections, which must be carried out in a systematic and reproducible way. Different groups often use different conventions and choices, which makes tasks such as the fitting of physical parameters or quantities very challenging.
Solution method: We create a pipeline of tools such that a user can define an observable and a theory framework and obtain a final object containing all relevant theoretical information. Such objects can then be used in a variety of interchangeable ways (fitting, analysis, experimental comparisons).

Introduction and motivation
Modern particle physics phenomenology increasingly relies on complex theoretical calculations whose accuracy needs to match very precise measurements, chiefly the ones from experiments at the Large Hadron Collider (LHC) [1]. An increase in the accuracy of those predictions is associated with the computation of higher orders in the strong and/or electroweak couplings for partonic cross sections, and is usually performed by numerical programs, which we will call generators throughout this paper. Since the computations are very demanding in runtime, memory and storage, these generators are usually optimized for, and can often only calculate, a small set of observables; furthermore, they often use different conventions and strategies. Being able to generate, store and exchange predictions in well-suited formats for a large set of processes, such that they can be utilized for a variety of analyses, is therefore advantageous.
In this paper we propose a framework, which we call pineline, that aims to generate theory predictions by 1) building a translation layer from a common input format to each of the different generators and 2) implementing a common output format for all of them. This is the idea that we call industrialization: while specific generators are sufficient for the calculation of single processes, there is no single generator that is able to calculate all processes, which are not necessarily limited to processes at the LHC, but may also include deep-inelastic scattering processes, for example. By interfacing to multiple generators, and thus connecting them in an "assembly line" or "pipeline", we can easily run the generator best suited for a particular process, and by having a common input format we can easily perform variations, such as changing parameters for parameter scans.
The initial motivation for this project was the fitting of parton distribution functions (PDFs) [2-5], but the output generated by pineline can be used in any fit or analysis that requires theory predictions. One interesting feature of a PDF fit in this context is that a very large number of predictions enter it. This complicates, for example, keeping track of the theory parameters used. While this is a manageable problem for a few predictions, for a complete PDF fit it is crucial to make sure that different processes use sets of parameters that are compatible with each other. Keeping track of the parameters in a central place then makes it easy to rerun predictions if we want to change some of those parameters.
An important part of this project is the use of interpolation grids [6-8], which store theory predictions independently of the PDFs and the strong coupling; indeed, most of the output of pineline consists of interpolation grids. Interfaces to several generators are available [9-11]. Being independent of the PDFs, interpolation grids are ideally suited for PDF fits, where they have been widely adopted, but their use is not limited to this area.

Input and output formats
Our goal is to build a framework to generate and store theory predictions in a standard format from a common set of inputs. By making the input common across different generators we can enforce consistency in the theory settings, and, by storing the predictions in a unified format, we ensure they can be used and analyzed regardless of how they were originally computed.
The output of the programs is a hadronic observable, which means it has already been folded with non-perturbative objects, such as the PDFs. By standardizing the output of all generators to be an interpolation grid we can reanalyze the same prediction in different scenarios, without requiring an (expensive) recomputation. The evaluation of the results for different sets of PDFs becomes almost instantaneous. As a by-product, it also facilitates parameter fits for objects that depend on those quantities.
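As an illustration, the following minimal sketch re-evaluates a stored grid with two different PDF sets without recomputing the partonic cross sections. It assumes the PineAPPL Python bindings and the LHAPDF Python interface are installed; the file name is a placeholder, and method names may differ between PineAPPL versions.

```python
# A minimal sketch: re-evaluating a stored interpolation grid with two
# different PDF sets, without recomputing the partonic cross sections.
# Assumes the PineAPPL Python bindings and the LHAPDF Python interface;
# method names may differ between PineAPPL versions.
import lhapdf
import pineappl

grid = pineappl.grid.Grid.read("observable.pineappl.lz4")

for pdf_name in ("NNPDF40_nnlo_as_01180", "CT18NNLO"):
    pdf = lhapdf.mkPDF(pdf_name, 0)
    # The expensive phase-space integration is already stored in the
    # grid, so each convolution is essentially a weighted sum.
    predictions = grid.convolute_with_one(2212, pdf.xfxQ2, pdf.alphasQ2)
    print(pdf_name, predictions)
```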
In the context of PDF fitting we can think of two common scenarios:
• the inclusion of new data points into the fit (coming from existing or new experiments [31-33]);
• the investigation of the impact of theory settings (such as the reference value of the strong coupling $\alpha_s(M_Z^2)$ [34]).
Both require us to (re-)compute theory predictions for a large number of data points. To give a concrete example of the scale of the problem, consider NNPDF4.0 [2], which fits more than 4500 data points across almost 100 different datasets. In order to match the increasing demands from the theory side we require more and more automation to avoid time-consuming and error-prone manual processes.
In addition, by re-fitting the PDFs, any observable that depends on them will change. However, the partonic cross sections do not depend on the PDFs. By storing them as interpolation grids, one can update all predictions without recomputing the most computationally heavy part of the observables.
In summary, our goal is to provide a reliable and easy-to-use workflow that connects the necessary intermediate steps and that can be scaled to any amount of data.

Reproducibility
A very important aspect of joining all of these different generators in a pipeline is the reproducibility of the results: it must always be possible to trace every prediction back to its inputs, so that any result can be independently checked by a third party, and so that the impact of a change with respect to a base set of parameters can be gauged. To this end, each interpolation grid and all intermediate objects contain all the (meta)data needed to recalculate them and to verify that they are compatible with each other. In particular, this includes: the programs used, their version numbers and random seeds, the values of the relevant Standard Model parameters, renormalization scheme choices, phase space cuts, and Monte Carlo uncertainties. We note that many interpolation grids publicly available on HEPData [35] and Ploughshare [36] do not include this information, though sometimes it can be inferred from the associated publications. Often, however, these data are not available, making comparisons more difficult and time-consuming. We make this metadata explicitly available in the grids and all other outputs, from which it can be reliably and easily extracted.
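As a sketch of how such metadata can be accessed, the following snippet reads the key-value store of a grid with the PineAPPL Python bindings. We assume a key_values() accessor mirroring the Rust API; both the method name and the keys shown are assumptions that should be checked against the installed version.

```python
# A sketch of extracting the reproducibility metadata stored in a grid.
# We assume the PineAPPL Python bindings expose the grid's key-value
# store via key_values(), mirroring the Rust API; the method name and
# the keys shown here are illustrative assumptions.
import pineappl

grid = pineappl.grid.Grid.read("observable.pineappl.lz4")
metadata = grid.key_values()  # assumed: dict of str -> str

for key in ("runcard", "theory", "pineappl_version"):  # illustrative keys
    if key in metadata:
        print(f"{key}: {metadata[key][:80]}")
```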

Open-source software
All the software used in this framework is open source, to facilitate its distribution, use and maintenance. In addition to the code, the data are also available online in formats that can be analyzed with open-source tools. Specifically, we store all metadata in the widely used YAML format, while interpolation grids are stored as PineAPPL grids, which can be accessed from many programming languages.
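For illustration, a theory runcard in YAML might look as follows when parsed with Python. The keys are a hypothetical sketch of the kind of settings described above, not the actual pineline schema.

```python
# A hypothetical theory runcard, parsed with PyYAML. The keys are an
# illustrative sketch, not the actual pineline schema.
import yaml

runcard = yaml.safe_load("""
ID: 400            # hypothetical theory identifier
PTO: 2             # perturbative order (2 = NNLO)
alphas: 0.118      # reference value of alpha_s(M_Z^2)
Qref: 91.2         # reference scale in GeV
MW: 80.352         # W boson mass in GeV
FNS: FONLL-C       # flavor number scheme
""")

print(runcard["PTO"], runcard["alphas"])
```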
Finally, we note that this work can also be seen as a continuation of the effort already started with the publication of the NNPDF fitting code [37], giving the community all necessary tools to reproduce and perform (theory) variations of NNPDF fits.

Interpolation grids, EKOs and FK tables
In the following we describe the deliverables, i.e. the objects that pineline produces. These are shown as the oval objects in Fig. 1, namely 1) PineAPPL grids, 2) evolution kernel operators (EKOs) and 3) fast-kernel (FK) tables. PineAPPL grids, like APPLgrids and fastNLO tables, store theoretical predictions independently of the PDFs and the strong coupling. EKOs and FK tables are tailored towards PDF fits, and translate interpolation grids to use a single factorization scale.
Let us consider the calculation of a single observable $\sigma$, which for the sake of readability we assume to contain only a single convolution, e.g. for the case of a DIS structure function. The extension to more convolutions is straightforward. Eq. (1) shows the defining property of interpolation grids, namely how convolutions with PDFs $f_a(x, \mu^2)$ are performed:

$$\sigma = \sum_{a,k} \sum_{i,j} \alpha_s^k(\mu_j^2)\, \sigma_a^{(k)}(x_i, \mu_j^2)\, f_a(x_i, \mu_j^2) . \qquad (1)$$

The grid itself is the set of values $\{\sigma_a^{(k)}(x_i, \mu_j^2)\}$ for all partons $a$ and perturbative orders $k$. Note that the PDFs are interpolated, and therefore evaluated at specific momentum fractions $\{x_i\}$ and (squared) factorization scales $\{\mu_j^2\}$, just as the partonic cross sections $\sigma_a^{(k)}$. For simplicity, we also assume here that the renormalization scale equals the factorization scale ($\mu_R^2 = \mu_F^2$), and we refer to this single scale as $\mu^2$.
The interpolation transforms the convolution integral into a sum, making the grid a PDF-independent quantity. In particular, the PDF is expanded over an interpolation basis, with the expansion coefficients being the values of the PDF on a set of nodes. This means the specific interpolation basis is only used in the construction of the grid, but is not relevant for the construction of the PDF table (and is thus of no concern for any PDF user).
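To make this mechanism concrete, the following self-contained sketch reproduces the idea of Eq. (1) in one dimension with a toy PDF, a toy partonic cross section and a Lagrange interpolation basis; none of these choices correspond to the actual bases used by PineAPPL.

```python
# A self-contained toy version of Eq. (1): expanding a "PDF" over a
# Lagrange interpolation basis turns the convolution integral into a
# sum over PDF-independent grid values. The functions and nodes are
# illustrative and do not correspond to the actual PineAPPL bases.
import numpy as np

nodes = np.linspace(0.1, 0.9, 5)  # interpolation nodes x_i

def lagrange_basis(i, x):
    """i-th Lagrange polynomial on the nodes, evaluated at x."""
    result = np.ones_like(x)
    for j, xj in enumerate(nodes):
        if j != i:
            result *= (x - xj) / (nodes[i] - xj)
    return result

def pdf(x):  # toy PDF f(x)
    return x**-0.5 * (1.0 - x) ** 3

def sigma(x):  # toy partonic cross section
    return 1.0 / (1.0 + x)

x = np.linspace(0.1, 0.9, 2001)

# Exact convolution integral int dx sigma(x) f(x):
exact = np.trapz(sigma(x) * pdf(x), x)

# The "grid": integrals of sigma against each basis polynomial. It is
# PDF independent and can be stored once and for all.
grid = np.array(
    [np.trapz(sigma(x) * lagrange_basis(i, x), x) for i in range(len(nodes))]
)

# The convolution reduces to a sum over the PDF values at the nodes:
approx = grid @ pdf(nodes)
print(f"exact = {exact:.6f}, interpolated = {approx:.6f}")
```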
For the special case of PDF fits, interpolation grids are not yet the most efficient representation, given that the factorization-scale dependence of the PDFs is known perturbatively and consequently not fitted. We can therefore rewrite Eq. (1) to refer only to a single factorization scale $\mu_0$, which in PDF fits is known as the initial scale or the fitting scale:

$$\sigma = \sum_{a} \sum_{i} \mathrm{FK}_a(x_i; \mu_0^2)\, f_a(x_i, \mu_0^2) . \qquad (2)$$

The object $\{\mathrm{FK}_a(x_i; \mu_0^2)\}$ is known as a fast-kernel (FK) table [38] and is a special case of an interpolation grid that
• uses a single factorization scale and
• contains the resummed evolution, thus combining various perturbative orders and therefore consuming the dependence on the strong coupling.
An FK table can be computed from an interpolation grid using evolution kernel operators (EKOs):

$$\mathrm{FK}_a(x_i; \mu_0^2) = \sum_{k,b} \sum_{l,j} \alpha_s^k(\mu_j^2)\, \sigma_b^{(k)}(x_l, \mu_j^2)\, \mathrm{EKO}^{b,l,j}_{a,i} , \qquad (3)$$

where $\mathrm{EKO}^{b,l,j}_{a,i}$ are the (linear) operators resulting from the evolution equations. FK tables are ideally suited for PDF fits, because the time- and memory-consuming evolution is done only once and not during the fit.
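As an illustration of Eq. (3), the following sketch contracts a grid with an EKO to obtain an FK table. The array shapes follow the index conventions above, but all contents are random placeholders rather than output of the actual tools.

```python
# A toy version of Eq. (3): contracting an interpolation grid with an
# EKO yields an FK table at the single scale mu_0. All contents are
# random placeholders rather than output of the actual tools.
import numpy as np

rng = np.random.default_rng(0)
n_orders, n_flav, n_x, n_scales = 3, 13, 50, 10

# grid[k, b, l, j]: order k, parton b, x-node x_l, scale mu_j^2
grid = rng.random((n_orders, n_flav, n_x, n_scales))
alpha_s = rng.random(n_scales)  # alpha_s(mu_j^2), factored out of the grid
# eko[b, l, j, a, i]: evolves f_a(x_i, mu_0^2) to f_b(x_l, mu_j^2)
eko = rng.random((n_flav, n_x, n_scales, n_flav, n_x))

# FK_a(x_i) = sum_{k,b,l,j} alpha_s^k(mu_j^2) grid[k,b,l,j] eko[b,l,j,a,i]
couplings = alpha_s[None, :] ** np.arange(n_orders)[:, None]  # shape (k, j)
fk_table = np.einsum("kj,kblj,bljai->ai", couplings, grid, eko)

# During the fit, a prediction then needs only the PDF at mu_0:
pdf_at_mu0 = rng.random((n_flav, n_x))  # f_a(x_i, mu_0^2)
prediction = np.einsum("ai,ai->", fk_table, pdf_at_mu0)
print(prediction)
```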
What we have gained are theoretical predictions $\{\sigma\}$, represented as FK tables, which allow us to perform convolutions with a set of one-dimensional PDFs $f_a(x; \mu_0^2)$ very efficiently. However, the price we have to pay is that we need a set of tools to calculate all the required objects:
1. A numerical calculation must generate interpolation grids for each observable $\sigma$ that we want to incorporate in a fit.
2. Next, we need to calculate the EKOs, matching the choices made in each observable calculated previously and the choices made in the fit.
3. Finally, we need to evolve the interpolation grids using the EKOs to generate FK tables.
In the subsequent sections we briefly review the various programs dedicated to each step.

Generating grids: pinefarm
PineAPPL itself is physics agnostic, and we therefore need a parton-level generator to create and actually fill the grids. This requires a generator to be interfaced to PineAPPL: the generator sends the relevant phase-space information, i.e. $x$, $\mu_F$, $a$, ..., to PineAPPL, which collects it in a space-efficient data structure representing $\sigma_a^{(k)}(x_i, \mu_j^2)$ (see Eq. (1)). Practically, this is done using an interface offered by PineAPPL, available for the programming languages C, C++, Fortran, Python and Rust.
As of now, PineAPPL has been interfaced to the following providers:
• Madgraph5 aMC@NLO [18,19] to calculate LHC processes, including NLO EW and QCD-EW corrections,
• yadism [39] to calculate NC and CC DIS processes,
• a modified version of Vrap [29] for fixed-target Drell-Yan processes, and
• an interface to MATRIX [40] is in progress.
Furthermore, PineAPPL can convert already existing APPLgrids and fastNLO tables into its own format. The program pinefarm abstracts away most of the differences between the generators: for each of the generators listed above it recognizes the respective input files, which specify the requested physical observable. It also performs substitutions from a theory parameters database, and directly runs the generators to produce predictions and collect the desired interpolation grid.
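To give an impression of the generator-side interface, the following sketch creates and fills a toy grid. It follows the spirit of the PineAPPL Python bindings, but the constructors and the fill signature are simplified assumptions that may not match a given version exactly.

```python
# A toy example of the generator-side workflow: create a grid with one
# luminosity channel, one perturbative order and two bins, fill it with
# a single "event", and write it to disk. The constructors and the fill
# signature are simplified assumptions about the Python bindings.
import pineappl

lumi = [pineappl.lumi.LumiEntry([(2, -2, 1.0)])]  # u ubar channel
orders = [pineappl.grid.Order(0, 2, 0, 0)]        # alpha_s^0 alpha^2
bin_limits = [0.0, 1.0, 2.0]                      # two observable bins
params = pineappl.subgrid.SubgridParams()

grid = pineappl.grid.Grid.create(lumi, orders, bin_limits, params)

# Inside the Monte Carlo loop, the generator passes each phase-space
# point along with its weight:
# arguments: x1, x2, q2, order, observable, lumi, weight
grid.fill(0.1, 0.2, 10_000.0, 0, 0.5, 0, 1.0)

grid.write("observable.pineappl.lz4")
```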

Generating evolution kernel operators: EKO
While grids $\sigma_a^{(k)}(x_i, \mu_j^2)$ are convoluted with PDFs evaluated at high scales $\mu_j^2$, FK tables $\mathrm{FK}_a(x_i; \mu_0^2)$ are convoluted with PDFs evaluated at the fitting scale $\mu_0^2$, thus reducing the dimensionality to just two dimensions for DIS observables (parton flavor index and momentum fraction) and four for hadronic observables. This reduction is possible because the scale dependence of the PDFs is given by the DGLAP equations [41-43].
The software package EKO [44,45] has been developed to solve these equations in terms of evolution kernel operators (EKOs):

$$f_b(x_l, \mu_j^2) = \sum_{a,i} \mathrm{EKO}^{b,l,j}_{a,i}\, f_a(x_i, \mu_0^2) . \qquad (4)$$

In contrast to similar programs [12, 46-48], EKO focuses specifically on the direct computation of the operator, which allows the described pipeline to use it to produce FK tables. Since the operator itself is PDF independent, existing operators can be reused, just like reusable tools in the theory factory.
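The following sketch illustrates this reusability in the notation of Eq. (4): a single (placeholder) operator evolves any number of PDF candidates without being recomputed.

```python
# A toy version of Eq. (4): an EKO is a PDF-independent linear operator,
# so once computed it can evolve any number of candidate PDFs from the
# fitting scale to the target scale. All arrays are random placeholders.
import numpy as np

rng = np.random.default_rng(1)
n_flav, n_x = 13, 50

# eko[b, l, a, i] maps f_a(x_i, mu_0^2) to f_b(x_l, mu^2) at one target scale
eko = rng.random((n_flav, n_x, n_flav, n_x))

# The same operator is reused for every PDF candidate, e.g. in a fit:
for _ in range(3):
    pdf_at_mu0 = rng.random((n_flav, n_x))  # f_a(x_i, mu_0^2)
    pdf_evolved = np.einsum("blai,ai->bl", eko, pdf_at_mu0)
    print(pdf_evolved.shape)
```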

Generating FK tables: pineko
Interpolation grids and EKOs are joined together in pineko to produce FK tables according to Eq. (3). Specifically, pineko extracts the relevant information from a grid and a theory runcard (containing all the relevant theory parameters) and then picks or, if it has not been calculated yet, computes the required EKO as described in the previous section. Once the EKO is available, pineko loads the grid and evolves it using the EKO to produce the final FK table.
Since Eq. (2) is a special case of Eq. (1), PineAPPL can also represent FK tables in the same format. This serves an important purpose: at any point in the pipeline, a theory prediction, whether it is an interpolation grid or an FK table, and whether it was created using a Monte Carlo generator or converted from other interpolation grids, is always a PineAPPL grid. Therefore, the same tools can be used on all of them.
The separation of the computation of the EKO from its convolution with the grid is convenient from a computational point of view. To illustrate the problem this separation solves, consider two possible scenarios:
• studies of the variation of $\alpha_s(M_Z)$ [34], which require only the recalculation of the EKOs, but not of the grids (note that in Eq. (1) the strong coupling is factored out);
• studies of the variation of $M_W$, which require only the recalculation of the grids, but not of the EKOs.

Application: K-factors vs. exact predictions
As an application of the previously presented tools, we have integrated Vrap [29] into pinefarm and interfaced it to PineAPPL to produce FK tables for fixed-target Drell-Yan (FTDY) observables with up to next-to-next-to-leading order (NNLO) precision in the strong coupling. In the following we use it to produce fits similar to NNPDF4.0 [2], which however differ in their treatment of the predictions for the FTDY datasets: E605 [49], E866 [50,51] and SeaQuest [52].
In particular, we change these predictions
1. to include only NLO,
2. to include NNLO approximately as K-factors (as in NNPDF4.0; see the sketch below), and finally
3. to include NNLO exactly by using interpolation grids.
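The difference between options 2 and 3 can be made explicit in a small sketch: bin-wise K-factors are NNLO/NLO ratios frozen at a fixed reference PDF, whereas the exact treatment convolutes the NNLO grid with whatever PDF the fit is currently probing. All numbers below are placeholders.

```python
# A toy comparison of the two NNLO treatments. Bin-wise K-factors are
# NNLO/NLO ratios frozen at a fixed reference PDF, while the exact
# treatment convolutes the NNLO grid with the PDF the fit is currently
# probing. All numbers are random placeholders.
import numpy as np

def predict(grid, pdf):
    """Stand-in for a grid-PDF convolution, cf. Eq. (1)."""
    return grid @ pdf

rng = np.random.default_rng(2)
nlo_grid, nnlo_grid = rng.random((2, 4, 10))  # 4 bins, toy PDF of length 10

pdf_reference = rng.random(10)  # PDF used once to compute the K-factors
pdf_in_fit = rng.random(10)     # PDF currently probed by the fit

# Bin-dependent K-factors, frozen at the reference PDF:
k_factors = predict(nnlo_grid, pdf_reference) / predict(nlo_grid, pdf_reference)

approx_nnlo = k_factors * predict(nlo_grid, pdf_in_fit)  # option 2: NLO x K
exact_nnlo = predict(nnlo_grid, pdf_in_fit)              # option 3: exact NNLO
print(approx_nnlo / exact_nnlo)  # deviates from 1 when the PDFs differ
```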
We note that the bulk of the hadron-hadron collider data (in particular all Drell-Yan Z and W production at the LHC) in all PDF fits is still limited to NNLO K-factors. K-factors are known to suffer from accidental cancellations between different partonic channels [53] and should therefore be replaced by interpolation grids to produce a truly NNLO-accurate PDF fit. However, their use is widespread when studying complex observables for which the computation of the exact NNLO predictions as a grid might be very difficult, computationally expensive, or simply not publicly available.
Fig. 2 shows the result of a fit including the FTDY datasets only at NLO QCD (green), normalized to the results of a fit with exact NNLO QCD predictions (orange). In Fig. 3 we address the impact of including the NNLO contribution to the predictions in two different ways: exactly at NNLO (orange) and approximated by multiplying the NLO results by a bin-dependent K-factor (green).
In the particular case of FTDY we note already in Fig. 2 that the effect of the NNLO corrections is constrained to a small portion of the PDF space. In Fig. 3 we can see that performing a fit with K-factors does move the fit in the direction expected from Fig. 2, but that the K-factors are not able to fully capture the nuances of the NNLO contribution. A similar behavior is shown by the plot of the s PDF in the same figures. These contributions are however compatible within uncertainties, and the impact of using the K-factor approximation is in this case negligible. The quantitative difference between PDFs fitted from the exact NNLO contributions or from the K-factors is shown in Fig. 4; the difference is never significant and stays well below half a σ. This is just one example of a phenomenological study facilitated by the framework presented in this paper. From a single run of Vrap we have been able to extract NLO, NLO×K-factor and NNLO (QCD) predictions. All of these predictions have been evolved to FK tables using the same NNLO EKOs, producing three different FK tables for three different fits. On the pineline website (https://nnpdf.github.io/pineline) the reader can find a tutorial for reproducing these results. An independent PineAPPL interface to xFitter [55] is also under development, which means that objects produced by this framework will also be compatible with fitting frameworks beyond NNPDF.

Conclusions and outlook
In this paper we described pineline, a collection of tools that includes pinefarm and pineko. The program pinefarm uses existing generators, such as Vrap, yadism or Madgraph5 aMC@NLO, to generate PineAPPL interpolation grids, which in turn can be converted into FK tables with pineko, which uses EKO. The produced objects are PineAPPL grids and store theory predictions independently of the PDFs, so that convolutions with arbitrary PDFs can be performed almost instantaneously after generation. The grids are useful for phenomenological studies, and we have shown an application in which we estimate the effect of replacing NNLO QCD K-factors with the exact calculation.
To name a few more applications, we expect this framework to be beneficial in systematic studies of the effect of theory settings and theory predictions in PDF studies, in particular for the following use cases:
• First, we need to consistently account for theory uncertainties [56], coming either from the hard-scattering process or from the PDF evolution, and propagate these additional constraints into the final PDF delivery.
• Furthermore, it seems necessary to increase the perturbative order to next-to-next-to-next-to-leading order (N3LO) [57,58] to match the experimental precision.
• Finally, we need to consider the interplay of QCD and QED [59-61] and eventually include EW corrections in a PDF determination.
In addition, understanding the impact of PDF uncertainties on beyond-Standard-Model searches [62] is fundamental in the hunt for new physics. The framework is not restricted to the case of unpolarized proton PDF determination, but can already be applied to the extraction of other factorizable objects. Specifically, the extraction of transverse-momentum-dependent PDFs [1,63,64] as well as the extraction of fragmentation functions [65] can be facilitated with the interpolation grids produced by this pipeline. With the advent of the EIC projects [32,33], refined determinations of nuclear and polarized PDFs [66] will also become available.
The framework also provides a standardized way to compare the theory settings of different PDF groups, and allows for an easy benchmark between the respective settings [67].

Figure 1 :
Figure 1: Flow diagram showing the overall pipeline architecture and deliverables in the case of parameter fits. Arrows in the picture indicate the flow of information (together with the execution order) and the orange insets on other elements indicate an interface to PineAPPL. The programs pinefarm and pineko act as interfaces between other programs and the deliverable objects, represented by ovals. These objects can be PineAPPL grids (orange) or evolution kernel operators (blue).

Figure 2 :
Figure 2: Comparison of PDF fits with and without NNLO contributions for FTDY in the determination. In both cases all other datasets are included at NNLO; the only difference between the fits is the exact NNLO contribution for FTDY.

Figure 3 :
Figure 3: Comparison of PDF fits in which the FTDY datasets are included up to NNLO, either with the exact predictions in the FK tables (orange) or with NLO predictions supplemented by K-factors (green). The orange fit corresponds to that of Fig. 2.

Figure 4 :
Figure 4: Distance plots between the fits using the exact NNLO calculation and the K-factors, as computed per Eq. (48) of Ref. [54]. A distance of 10 units corresponds in this case to a 1σ difference between the two PDF sets.