zfit: scalable pythonic fitting

Statistical modeling is a key element for High-Energy Physics (HEP) analysis. The standard framework to perform this task is the C++ ROOT/RooFit toolkit; with Python bindings that are only loosely integrated into the scientific Python ecosystem. In this paper, zfit, a new alternative to RooFit written in pure Python, is presented. Most of all, zfit provides a well defined high level API and workflow for advanced model building and fitting together with an implementation on top of TensorFlow. It is designed to be extendable in a very simple fashion, allowing the usage of cutting-edge developments from the scientific Python ecosystem in a transparent way. Moreover, the main features of zfit are introduced, and its extension to data analysis, especially in the context of HEP experiments, is discussed.


Introduction
Data collected by experiments such as the Large Hadron Collider (LHC) or ultra-energetic cosmic ray detectors are described in terms of physics-motivated mathematical models. Typically, these models have a parametric form, and the values of their parameters are the quantities of interest in a given analysis. Once a model has been defined, a fitting procedure that minimizes the disagreement between the data and the model is applied to find the optimal values of these parameters, a crucial step in almost every analysis.
Model fitting as done in High Energy Physics (HEP) has some peculiarities and differs from other fields. Models are often multidimensional, built by composing complicated, custom shapes, and have a non-trivial normalization. In addition, two further requirements need to be fulfilled; on one hand, reasonable scaling with data size and model complexity, motivated by the large data samples being collected at the LHC, and, on the other hand, the capability to not only find optimal values for the parameters, but to be able to calculate their uncertainties with enough flexibility to allow for advanced statistical treatment, as uncertainties of measurements are as important as their central values.
Since recent years, HEP analysis is moving more and more towards the usage of Python over the more traditional C++. The moving force behind this paradigm shift is the development of an extremely rich scientific computing Python ecosystem [1,2], which provides a collection of efficient, easy-to-use tools for data analysis that are shared across a very wide community, including many industry leaders, and which one can build new applications on.
The de facto standard for model fitting in HEP is RooFit [3], which is integrated in the ROOT toolkit [4] and provides a very powerful object-oriented architecture, as well as plotting capabilities and an advanced statistics module. It is written in C++ and can be accessed in Python through the ROOT Python bindings, which create a non-negligible set of problems: complex memory management that easily leads to memory leaks, difficult integration with the scientific computing Python stack, lack of extendability from Python, and a complex installation procedure due to the need to install the full ROOT toolkit.
General-purpose Python-based libraries such as SciPy [5], lmfit [6] or Ten-sorFlow Probability [7] do not fulfill the HEP fitting requirements, and therefore cannot be used as-is. Python-based HEP-specific fitting packages, such as probfit [8], TensorProb [9] or TensorFlow Analysis [10], have tried to address these issues and can be considered as proof-of-concept that HEP-centric model fitting is possible in Python; none of them, however, offers the same fitting functionalities and ease of use as RooFit does while still integrating well with the scientific computing Python ecosystem.
The goal of zfit is to fill this important gap in the HEP software ecosystem and provide a well defined API and workflow for model fitting together with a Python-based package that is fast, scalable and extensible. With zfit, the ongoing paradigm shift from specialized C++ to Python will be complete, and scientists will be able to perform HEP analysis purely in Python-something that was essentially impossible before-with seamless access to the full power of the scientific computing Python software stack.

Software description
The structure of a generic model fitting workflow in HEP is illustrated in Fig. 1. Its basic goal is to maximize the agreement of a set of data points to a probability distribution described by a set of parameters. The measurement of the metric of the goodness-or, more precisely, the disagreement-of the predicted model with the experimental data is referred to as loss function. One of the most used loss functions in HEP is based on the likelihood function, in particular the negative log-likelihood, which is minimized in order to estimate the parameters of the model. Once the optimal values of the model parameters have been determined, several statistical techniques can be used to calculate their uncertainties.
The structure defined by model -data-loss-minimize can also be used to describe a typical Deep Learning workflow-even more clearly by replacing model by "Neural Network" and minimize by "Training". This similarity has profound implications in the design of the zfit implementation, as it allows to use an industry standard Deep Learning tool such as TensorFlow [11] (TF) as its computational backend by using the low-level API that consists of mathematical functions. The immediate benefits for zfit arise from key features of TF such as the capacity to be run on heterogeneous architectures, mathematical optimizations such as automatic differentiation (required by the most used minimization algorithms in HEP), and operation caching.

Software Architecture and Functionalities
Unlike RooFit, zfit has a clear, restricted scope: model fitting and sampling with a well defined and consistent API. Tasks such as data processing, plotting and statistical analysis are left for other packages to tackle.
The abstraction of the fitting process in the steps shown in Fig. 1 is one of the main features of the architecture of zfit: by developing each step as a separate "brick" of the library, it becomes possible to tightly integrate with the scientific computing Python ecosystem, and use the available tools to improve performance, reduce code duplication and add extra functionality. Each of these bricks (model, data, loss, minimizer and fit result) provides override hooks for all its core functionality, allowing to completely customize or replace any aspect of their behavior. As previously mentioned, internally these objects deal with TF tensors, but most of these complications are hidden from the user and the library offers a similar user experience as the one found in other model fitting libraries.
The (possibly several) dimensions and corresponding limits in which the model fit is performed are handled by the Space object. This object allows inter-object identification of dimensions, as well as their corresponding axes (used for intra-object manipulations) and limits. Space objects can be singleor multi-dimensional, and, thanks to the use of observables, can be combined using the Python arithmetic operators to tackle arbitrarily complex uses cases.
Models are defined inside a given Space and implement a shape function that returns a value given some input data. Custom models are easily implemented using their functional form and by default receive standard attributes such as numerical integration and sampling. These generic functionalities can also be extended to analytical integration and specific sampling if provided by the user. Models can be combined to build more complex models (including increasing the number of dimensions) while keeping automatic integration, observable handling and event sampling.
Models are parametrized by Parameter objects, which can be used in the implementation of the shape function like any other scalar. Independent parameters can be floated in the minimization process and their value at the optimal point is one of the main outputs of a model fit. Composed parameters can be built by combining parameters through mathematical operators and can be used like independent parameters.
The loading, ordering and processing of data in zfit is handled uniformly, no matter their source, by the Data object. While only the most important formats are currently implemented-ROOT trees, Numpy arrays and Pandas DataFrames, as well as TF tensors-the simple API of the Data object makes it easy to add additional data handling capabilities. Thanks to the internal use of a Space object, columns in datasets are matched to observables, and the resulting object can be treated as a normal TF tensor, i.e., it is possible to apply operations to it.
Data and Models are combined to build a Loss function. Having the loss as an independent part of the fitting workflow-acting as the connection between model-data structure and minimization process-is a crucial design feature in zfit. As a consequence of this decoupling, the minimization procedure is completely agnostic to the underlying models, data or the actual definition of the loss. This abstraction adds even more flexibility to zfit, making it possible to provide an external loss calculation through a simple TF tensor wrapper 1 and just use the minimization capabilities of the library.
The computation of the value of the loss for a given configuration of the model is efficiently implemented in zfit thanks to TF. It is also possible to add constraints to the loss in order to modify its behavior: while the most used constraints in HEP are directly implemented in zfit, custom functions can be also used.
Minimization is performed through stateless Minimizer objects with a very simple API. Since the loss calculation is usually the most intensive part of the minimization problem, in practice it is possible to wrap any existing Python minimizer to be used by zfit. Several minimizers are included in zfit, such as the SciPy optimizers, the most-used minimizer in HEP, called Minuit [12], or the TF implementation of the Adam optimizer [13].
The results of the minimization process are returned in a FitResult object, which stores all the information about the fit parameters, the minimization process, as well as the used instances of the Minimizer and the Loss.
While the approximate uncertainties related to the fit parameters can be calculated directly inside zfit using the Hessian matrix or a simple profiling method, the calculation of precise parameter uncertainties and confidence intervals are left to specialized packages, such as those in Refs. [14,15], which don't need any object other than the FitResult.

Illustrative examples
The architecture and features of zfit are better visualized with a typical HEP example: an invariant mass fit including both signal and background events, extended to multiple dimensions, and introducing a control channel.
Let's assume the region of interest of the mass observable, in this case the B 0 meson mass, is in the range [4.5, 6.0] GeV. In zfit this is expressed with a Space defining our domain: obs = zfit . Space ( obs = " mass " , limits = ( 4 .5 , 6 . 0 ) ) The best interpretation of the observable at this stage is that it defines the name and range of the observable axis.
Data are loaded from a numpy array, where tobs Space defines the names of the columns: data = zfit . Data . from_numpy ( array = numpy_array_data , obs = obs ) The signal component of the fit will be modelled using a Gaussian, which has two free parameters-its mean and its width: mu = zfit . Parameter ( " mu " , 5 .2 , 4 .5 , 6 . 0 ) sigma = zfit . Parameter ( " sigma " , 20 , 1 , 40 ) and can be used to instantiate the Model as: signal = zfit . pdf . Gauss ( obs = obs , mu = mu , sigma = sigma ) While most common models are provided within the pdf module, the philosophy of zfit is to provide a clear API with powerful mechanisms to facilitate that the users and the community can implement their own. Making use of another built in model, the background that is parametrized as an exponential: bkg = zfit . pdf . Exponential ( obs = obs , lambda_ = -0 . 001 ) In order to build the sum of these two models, an additional parameter frac is used to describe the fraction of each species: 2 frac_sig = zfit . Parameter ( " signal fraction " , 0 .5 , 0 , 1 ) model = zfit . pdf . SumPDF ( pdfs = [ signal , bkg ] , fracs = frac_sig ) With the model and the data built, a loss function can be constructed. In this case, the goal is to perform an unbinned maximum likelihood fit, so the corresponding Loss object is initialized as nll = zfit . loss . UnbinnedNLL ( model = model , data = data ) Finally, the minimization process is executed, first instantiating a Minimizer object-in this case using the Minuit minimizer-and then running the minimization algorithm: This class provides a simple interface to implement a custom model, internally handles the free parameters, which can be accessed by name through the params attribute, and provides a simple way to access the input data. To create a two-dimensional model, the models describing the different dimensions can be simply multiplied together. Given that their observables are different, the new model will automatically be defined in the combined space of the two: combi ned_mod el = angular * model This example can be extended to simultaneously fit a control dataset from which a more precise determination of the parameter mu can be obtained. To do so, a second Gaussian is created to model this control mode: obs_control = zfit . Space ( ' control ' , limits = ( 5 .2 , 5 . 6 ) ) sigma_control = zfit . Parameter ( " sigma control channel " , 2 , 0 , 5 ) gauss_control = zfit . pdf . Gauss ( obs = obs_control , mu = mu # common mu sigma = sigma_control ) A simultaneous Loss can be built by either two independent Loss or by providing both models and datasets in a list as where data control has been created analogously to data. This simultanoeus Loss can be minimized as shown before.

Impact and conclusions
The zfit package is a crucial piece required to complete the transition of HEP analysis to a stack fully based on the scientific computing Python ecosystem. The importance of this paradigm change should not be underestimated: given the limited availability of resources in HEP, especially in areas such as software development, the possibility of integrating physics analysis within a welldeveloped and well-supported software ecosystem-allowing to focus mainly on the HEP-specific parts of the analysis software-will have long term repercussions in the quality and usability of HEP analysis software.
Additionally, the integration with the scientific computing Python ecosystem and the multiple API entry points favor the use of cutting-edge libraries in conjunction with zfit, giving analysts access to the latest developments in the computing community; with large, monolithic multi-purpose packages like ROOT, the integration of new algorithms and techniques is not always possible and takes significant time.
The package itself is not only an implementation of a complete model fitting library, but also specifies a carefully designed and agreed upon API of the most crucial components and the workflow. With the aim to be a standard in its kind, other packages such as high level statistics and plotting tools, can be built against the standardized API instead of a concrete implementation, strongly reducing package dependencies. In addition, other packages can focus on the specific implementation of a certain task, such as building a specialized Loss, Model or a Minimizer, without the need to also implement the other parts of the workflow themselves.
Another key feature of zfit is the ability to easily build custom models while being able to adjust the behavior of methods in an assisted or even fully usercontrolled way. This core design feature allows the fast and safe implementation of arbitrary models and therefore promotes the synergy between theory/experimental communities as illustrated in Refs. [16,17].
Finally, fits in HEP analysis often consist of large datasets and complicated models, and require far more computing power than a single core. Scalability in terms of size and complexity of the data and models and the possibility to run on large computing infrastructures, including accelerators such as GPUs, is therefore a necessity. TensorFlow, the graph based computing backend in zfit, was built for the sole purpose of efficient large scale computing and takes care of the non-trivial task of optimal workload parallelization. This does not only yield state-of-the-art performance, but also enables the HEP community to keep up with the latest algorithm implementations and low-level computing instructions at virtually no cost.
To summarize, this paper has presented zfit, a high-level statistical modeling and fitting library purely designed within the pythonic ecosystem. Its modular architecture and API allows users to not only formulate complex analyses within a simple formalism, but also easily to integrate developments from the scientific Python community and seamlessly use heterogeneous architectures. In light of the increasing and unprecedented statistics that will be collected in HEP experiments in the coming years, zfit will enable more rapid prototyping and will allow more efficient usage of the limited resources within academia. This will result in the completion of a paradigm change in data analysis that started with the eclosion of the scientific Python ecosystem.  -There is no advantage using parallelization for very few calculations, since the overhead of splitting and collecting the results is dominant. With increasing number of events this gets negligible and thus the execution time of zfit increases way slower than for RooFit. This also comes from the fact that more events mean a more stable loss shape, so the minimizer used in zfit performs better.
-The GPU is highly efficient in computing thousands of events in parallel.
For only a few data points though, the overhead of moving data back and forth dominates, resulting in a worse performance in this regime.
As a conclusion, the speed for a multicore CPU system is comparable between zfit and RooFit. For large number of events, zfit typically outperforms with a similar response for both GPU and a multicore system.