Analysis of counting data: Development of the SATLAS Python package

Introduction
Research on exotic nuclei is one of the cornerstones of contemporary nuclear physics [1]. These exotic nuclei are produced with very low yields, resulting in recorded spectra with low count rates (e.g. hyperfine structure spectra, gamma spectra, etc.). In order to extract accurate results from these low statistics datasets, adopting a correct analysis and error determination method is essential.
Traditional data fitting relies on the use of χ² minimization techniques. One of the assumptions underlying these techniques
is that the distribution of the experimental uncertainties is approximately Gaussian. Due to the Poisson distribution of counting data [2], the validity of this assumption has to be investigated, especially in the low-statistics limit of interest. In this article, we review the literature on the equations used to fit data and perform an uncertainty analysis. The suitability of the different approaches is investigated using the custom-written Python package "Statistical Analysis Toolbox for LAser Spectroscopy" (SATLAS). SATLAS started as a project to provide a single package combining the statistical formulas typically used in the analysis of laser spectroscopic results. It grew into a cohesive unit for fitting data and analyzing results, complemented by convenience functions for visualization, gathering all functionality in one location. The selected code architecture allows easy integration of user-defined model functions, so the toolbox is easily extended to other physics problems, such as the analysis of low-statistics gamma-ray spectra. Emphasis lies on ease of use and clear documentation. A major advantage of this common framework for fitting and model definition is that the results of different experiments can be compared directly.
In the following section, the equation optimized when fitting a model to experimental data is derived from the theory of maximum likelihood. Section 3 describes the different components of the SATLAS package, which implement the statistical formulas derived in Section 2. Several approaches for calculating parameter uncertainties are explored on a toy hyperfine structure dataset, after which the package is tested on recent hyperfine structure spectra of 203Fr. This illustrates both the capabilities of the package and the differences that can arise between minimization and error analysis approaches. The results of both investigations are summarized in the conclusion.

Fitting of counting experiments
As already argued in 1984 by Baker [3], fitting histograms of counting data requires a different χ² formula than other types of data. Given the continuous push towards the study of weakly produced radioactive isotopes, these results are becoming ever more relevant. We therefore briefly summarize the results of Baker in order to introduce the different minimization procedures integrated into the SATLAS package.

Cost function
When extracting results from data, the objective is to find the best-fitting model function. This function describes the expected response given an input variable and a number of parameters, and is denoted by f(x_i). To select the best-fitting curve, a cost function is calculated, which assigns a numerical value to the agreement between the model function with the current set of parameters and the experimental data. The choice of model function depends on the physics involved in the experiment, and is thus not treated here. In the SATLAS package, the user can insert the model function(s) appropriate for the problem under investigation.
The traditional cost function for performing nonlinear least-squares fitting is the weighted χ² formula

\chi^2 = \sum_i \frac{(y_i - f(x_i))^2}{\sigma_i^2},    (1)

where χ² is to be minimized. Here, y_i is the experimental value and σ_i its uncertainty. This formula can be derived from the formalism of maximum likelihood [4] by constructing a Gaussian probability distribution around each data point, with the width of the distribution given by the experimental uncertainty. The deviation of the model function from each data point, calculated in units of σ_i, is then optimized. This maximizes the probability that the data is described by the model. In counting experiments, the uncertainty on each data point is taken as the square root of its value, which is an approximation of the underlying Poisson distribution [2]. The formula then becomes

\chi^2 = \sum_i \frac{(y_i - f(x_i))^2}{y_i}.    (2)

However, for spectra with a limited number of counts, approximating σ_i in this way poses an inherent problem. In that limit, a data point of 0 counts is likely to be observed, which would imply σ_i = 0, giving rise to infinities in the formula. To avoid this problem, the minimum value of σ_i is set to 1.
Another interpretation of Eq. (1) enables a different way of calculating σ_i. The model function f(x_i) can be seen as an estimate of the mean of the distribution from which the data point is drawn, so that σ_i = √f(x_i) and

\chi^2 = \sum_i \frac{(y_i - f(x_i))^2}{f(x_i)}.    (3)

However, in the limit where many data points report 0 counts (such as in a background-free experiment with a low number of signal counts), neither approach holds. Rather than using cost functions which approximate the Poisson distribution by a Gaussian, a more suitable cost function has to be derived. Calculating the log-likelihood with a Poisson distribution around f(x_i) instead of a Gaussian, and dropping all terms that do not depend on the parameters, the simplified formula is found to be

\ln L = \sum_i y_i \ln f(x_i) - f(x_i).    (4)

Since most algorithms are developed to find a minimum rather than a maximum, the negative log-likelihood is used in computer programs.
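For illustration, the three cost functions translate into only a few lines of NumPy. This is a generic sketch rather than the SATLAS implementation; model stands for any callable evaluating f(x_i) for a given parameter vector, and the clipping of σ_i at 1 follows the convention mentioned above.

    import numpy as np

    def chisquare_data(params, x, y, model):
        # Eq. (2): sigma_i estimated from the data; empty bins are
        # clipped to sigma_i = 1 to avoid division by zero.
        sigma2 = np.clip(y, 1, None)
        return np.sum((y - model(x, params))**2 / sigma2)

    def chisquare_model(params, x, y, model):
        # Eq. (3): sigma_i estimated from the model value f(x_i).
        f = model(x, params)
        return np.sum((y - f)**2 / f)

    def negative_loglikelihood(params, x, y, model):
        # Eq. (4): Poisson log-likelihood with constant terms dropped,
        # negated so that standard minimizers can be used.
        f = model(x, params)
        return -np.sum(y * np.log(f) - f)

Any of these functions can be passed directly to a general-purpose minimizer such as scipy.optimize.minimize.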
One of the issues with this formula is its form: as it cannot be expressed as a sum of squares, the standard Levenberg-Marquardt (LM) algorithm cannot be used [5]. This negatively affects the required computation time. Indeed, as summarized in Table 1, the LM algorithm is a factor of 15 faster than the runner-up, and a factor of 25 faster than the well-known Nelder-Mead method. Fig. 1 shows the result of Eqs. (2)-(4) when calculating the best-fit function to a Gaussian dataset with different statistics.
The original χ² formula (2) builds an expected distribution around the data point, while the modified formula (3) and the likelihood formula (4) take advantage of the knowledge of the Poissonian nature of a counting experiment. In the dataset with 10 samples, it is clear that the drawn probability distributions each contribute differently to the cost function, while the contributions are practically identical in the dataset with 500 samples. This illustrates that low-statistics data need to be treated with non-standard minimization techniques.

Cost function modifications
Several reasons exist for further modifying the cost function. Two cases are discussed here: taking uncertainties on the variable into account (errors on the x-axis), and adjusting the fit procedure to take into account a known literature value for one or more parameters.

Errors-in-variables
For some datasets, the measurement uncertainty on the variable is non-negligible. In the current literature [6,7], the method of effective variance is the most common. Linearizing the response function and assuming no correlation between the scatter of the variable and that of the response observable, the effective variance is calculated as

\sigma_{i,\text{eff}}^2 = \sigma_{y_i}^2 + \left(\frac{\partial f}{\partial x}\Big|_{x_i}\right)^2 \sigma_{x_i}^2,    (5)

and the weights in the χ² formula are adapted to this new calculation. An alternative description relies on probability theory instead of calculus, as used by the effective variance method. This approach is based on the work of Hsiao [8]. Given the true value x of the variable, the joint distribution of the measured x_i and y_i is

P(x_i, y_i \mid x) = P_\theta(y_i \mid x)\, P(x_i \mid x).    (6)

In order to convert this expression into a usable fitting formula, a likelihood has to be calculated. Integrating over the unobserved true value x and taking the experimental data into account gives

L_i = \int P_\theta(y_i \mid x)\, P(x_i \mid x)\, \mathrm{d}x.    (7)

Assuming the scatter of the measured x_i depends only on the distance to the true value, this integral can be converted into a convolution integral of the form

L_i = \int P_\theta(y_i \mid x)\, G(x_i - x)\, \mathrm{d}x,    (8)

with G describing the scatter of the variable. This convolution integral is implemented in SATLAS, with the assumption that P_θ is given by a Gaussian distribution. Due to the high computational cost of evaluating this convolution integral, this approach is not used by default.
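As a sketch of how the effective variance of Eq. (5) can be computed, the derivative of the model function can be approximated numerically; the function below is illustrative and not part of the SATLAS API, and follows the same model-callable convention as the earlier sketches.

    import numpy as np

    def effective_sigma(params, x, sigma_x, sigma_y, model, h=1e-6):
        # Eq. (5): linearize f around each x_i and fold the uncertainty
        # on x into an effective uncertainty on y.
        dfdx = (model(x + h, params) - model(x - h, params)) / (2 * h)
        return np.sqrt(sigma_y**2 + (dfdx * sigma_x)**2)

These effective uncertainties then replace σ_i in Eq. (1).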

Known values of parameters
In some cases, the value of one of the parameters is already known from other measurements. When this occurs, there is the option of either fixing the parameter in question to the known value (thus reducing the number of free parameters) or including the known value in the cost function. By doing so, the fit takes the previous data into account.
In the χ² formula, this is achieved by adding each of the known parameter values as an additional data point to the experimental data:

\chi^2_{\text{total}} = \chi^2 + \sum_i \frac{(a_i - a_{i,\text{lit}})^2}{\sigma_{i,\text{lit}}^2},    (9)

with i running over each parameter a for which the literature value and uncertainty are known. For the likelihood formula, the modification has the same purpose and a similar form. The log-likelihood becomes

\ln L_{\text{total}} = \ln L - \frac{1}{2} \sum_i \frac{(a_i - a_{i,\text{lit}})^2}{\sigma_{i,\text{lit}}^2}.    (10)

Modifications like these are well known in Bayesian statistics, where they are referred to as priors [9].
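In code, including literature values amounts to one extra penalty term per known parameter, as in Eq. (9); a sketch building on the chisquare_model function defined earlier:

    def chisquare_with_priors(params, x, y, model, priors):
        # priors: list of tuples (parameter index, literature value,
        # literature uncertainty).
        cost = chisquare_model(params, x, y, model)  # Eq. (3)
        for i, a_lit, sigma_lit in priors:
            cost += ((params[i] - a_lit) / sigma_lit)**2  # Eq. (9)
        return cost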

Uncertainty boundaries
When fitting the parameters of the model function to data, the uncertainty associated with each parameter has to be calculated. Unless stated otherwise, these uncertainties are typically taken to be the standard deviation σ of a Gaussian distribution around the optimal parameter value. Calculating these uncertainties can be done in several ways [4,10]:
• The uncertainties can be calculated by numerically solving the associated equations (see 2.3.1).
• The Hessian matrix at the optimal point can estimate the uncertainties for a locally Gaussian distributed parameter.
• A histogram of parameter values from a random walk can be generated. The (log-)likelihood value is used as a sampling criterion in this case.
The first two methods are different ways of performing the same calculation, the Hessian estimate merely being a computationally cheap approximation of the full numerical solution. Sampling the distribution is a completely different approach. Here, we discuss all three options.

Numerical evaluation
Consider the likelihood function as depending only on the parameter a_j; the values of all other parameters are re-optimized to achieve the best fit for each particular value of a_j. The assumption is then made that this likelihood has a Gaussian shape [4]:

L(a_j) = L_{\max} \exp\left(-\frac{(a_j - a'_j)^2}{2\sigma_j^2}\right),    (11)

where a'_j is the optimal value of the parameter. In the χ² formulation, this is given as

\chi^2(a_j) = \chi^2_{\min} + \left(\frac{a_j - a'_j}{\sigma_j}\right)^2.    (12)

It is then easy to see that Δχ² = 1 corresponds to a value of a_j which is exactly one standard deviation away from the optimal value. This can thus be taken as the criterion to find the uncertainty on the determined parameter value.
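Numerically, the 1σ bound is the root of Δχ²(a_j) − 1 = 0, where all other parameters are re-optimized at each fixed value of a_j. A sketch of the search for the upper bound, assuming a user-supplied profile_chisquare function that performs this re-optimization:

    from scipy.optimize import brentq

    def upper_bound(a_best, chi2_min, profile_chisquare, step=0.1):
        # Walk to the right until Delta chi^2 exceeds 1, then find the
        # crossing point by bisection.
        a = a_best + step
        while profile_chisquare(a) - chi2_min < 1:
            a += step
        return brentq(lambda v: profile_chisquare(v) - chi2_min - 1,
                      a_best, a)

The lower bound follows by walking in the opposite direction.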

Hessian approximation
Instead of numerically finding the parameter values which satisfy Δχ² = 1, it is possible to make a parabolic estimate. From Eq. (12), it follows that

\sigma_j^2 = \left(\frac{1}{2}\frac{\partial^2 \chi^2}{\partial a_j^2}\right)^{-1}.    (13)

Theoretically, the second derivative is thus sufficient to calculate the uncertainty bounds. One pitfall is that it is not just the second derivative with respect to the desired parameter that has to be calculated. Rather, the full Hessian matrix is needed:

H_{jk} = \frac{1}{2}\frac{\partial^2 \chi^2}{\partial a_j \partial a_k}.    (14)

Inverting this matrix and using the diagonal elements also takes the correlations between parameters into account.
When using the log-likelihood formula, the factor −1/2 which is dropped in the χ² formulas has to be reintroduced. This means that the criterion for a 1σ error range on the log-likelihood is ΔL = 1/2. It is possible that non-linear effects perturb the assumed Gaussian shape. This can result in either a different distribution altogether or an asymmetric shape, with the derived σ differing between the upper and the lower bound. This can only be detected by calculating the likelihood for a series of parameter values and plotting the result. In either case, the estimate based on the inverted Hessian matrix falls short, since this approach does not detect such non-linearities.
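A sketch of the Hessian-based estimate using the numdifftools package (also used by SATLAS, see Section 3); here negloglike is the negative log-likelihood of Eq. (4), for which the inverse Hessian is directly the covariance matrix, so the factor discussed above is already accounted for.

    import numpy as np
    import numdifftools as nd

    def hessian_uncertainties(negloglike, best_params):
        # Hessian of the negative log-likelihood at the optimum.
        hessian = nd.Hessian(negloglike)(best_params)
        # Inversion fails when a parameter sits near a boundary and the
        # matrix becomes singular (see Section 3).
        covariance = np.linalg.inv(hessian)
        return np.sqrt(np.diag(covariance))  # 1-sigma uncertainties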

Distribution sampling
Instead of assuming a Gaussian distribution for the parameters, a Bayesian view of the problem can be adopted. In this language, we are interested in the posterior distribution of the parameters: the product of the probability distribution of the data given the model parameters and the prior distribution of the parameters, divided by a normalization factor:

P(\theta \mid D) = \frac{P(D \mid \theta)\, P(\theta)}{P(D)}.    (15)

In the context of this article, the (log-)likelihood function gives the probability distribution of the data, and any previous knowledge about a parameter (such as previous measurements or boundaries) is encoded in the prior distribution. For a more complete treatment of Bayesian inference, the reader is referred to Refs. [11,12]. To characterize the posterior, random steps are made in the high-dimensional parameter space and the value of the log-likelihood is calculated at each step. Depending on this value, the step is accepted or rejected according to the criterion of the sampling algorithm [13,14]. After sufficient sampling, a histogram of the accepted steps shows the probability distribution for each parameter. From the properties of this distribution, the final uncertainties on the parameters are estimated.
The advantage of this method is that it makes no assumptions about the details of the parameter probability distribution. The generated samples can then be analyzed as preferred; in analogy with the common practice of reporting 1σ uncertainties, reporting the 16th, 50th and 84th percentiles of the parameter histograms is a common choice. The disadvantage is that the computational effort involved, mostly in calculation time and disk space, is significant. Another drawback of this sampling approach is that it is not known a priori when the walk has converged, making the choice of a good walk length a question of both experience and data quality.
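A minimal sampling sketch with the emcee package used by SATLAS (version 3 interface assumed here); log_posterior stands for the sum of the log-likelihood of Eq. (4) and the log-priors, and the walkers are started in a small ball around the optimum, which causes the burn-in behavior discussed later.

    import numpy as np
    import emcee

    def sample_posterior(log_posterior, best_params, nwalkers=50,
                         nsteps=2000):
        ndim = len(best_params)
        # All walkers start close to the optimum.
        p0 = best_params + 1e-4 * np.random.randn(nwalkers, ndim)
        sampler = emcee.EnsembleSampler(nwalkers, ndim, log_posterior)
        sampler.run_mcmc(p0, nsteps)
        # Discard a crude burn-in fraction and flatten the walkers.
        samples = sampler.get_chain(discard=nsteps // 10, flat=True)
        # 16th, 50th and 84th percentiles for each parameter.
        return np.percentile(samples, [16, 50, 84], axis=0)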

Systematic deviations
As a measure of goodness-of-fit, the reduced chi-square (χ²_red) value is often quoted. This is calculated as χ²/n_d, with n_d the number of degrees of freedom, taken to be the number of data points minus the number of parameters. For large n_d, this value is expected to be 1 on average, with a variance of 2/n_d [15]. This implies an uncertainty on the calculated χ²_red which can be quite significant. In the case of likelihood calculations, there is a statistical measure which is asymptotically distributed like a χ² distribution [3]:

\chi^2_\lambda = 2 \sum_i \left[ f(x_i) - Y_i + Y_i \ln\frac{Y_i}{f(x_i)} \right],    (16)

where Y_i is the true value at x_i. Since this value is unknown, it is replaced by the measured values:

\chi^2_\lambda = 2 \sum_i \left[ f(x_i) - y_i + y_i \ln\frac{y_i}{f(x_i)} \right].    (17)

From this value, χ²_red can then be calculated. As a rule of thumb, uncertainties are often multiplied by √χ²_red, with the argument that this accounts for systematic fluctuations not included in the model. Enlarging the error bars on the data points by this factor forces χ²_red = 1. The factor can be propagated through the (linear) fit parameters, thereby negating the need for a refit. However, as argued in Refs. [15,16], this method makes some strong implicit assumptions:
1. The uncertainty distribution on the parameters is Gaussian.
2. The model is linear in all fit parameters.
3. Demanding χ²_red = 1 implies that the model used is the correct model underlying the data.
Apart from these requirements, the expected spread on χ²_red makes a value deviating from 1 very likely, even when the model and the uncertainty distributions satisfy all requirements. There appears to be no formal justification for this practice. Taking this into account, a χ²_red value that is too large or too small should prompt an investigation of the shortcomings of the model, rather than an adjustment of the uncertainties. In the end, it is up to the experimentalist to judge the goodness-of-fit with the above comments in mind.
One specific application where this approach is justified is the calculation of weighted averages [12]. Assuming a Gaussian uncertainty distribution on the individual measurements, all three requirements are met by fitting a constant to the data.
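A sketch of this special case, following the common convention of inflating the uncertainty by √χ²_red only when χ²_red > 1:

    import numpy as np

    def weighted_average(values, sigmas):
        w = 1 / sigmas**2
        mean = np.sum(w * values) / np.sum(w)
        sigma_mean = np.sqrt(1 / np.sum(w))
        # chi^2_red of a constant model: n_d = number of points - 1.
        chi2_red = np.sum(w * (values - mean)**2) / (len(values) - 1)
        # Inflate the uncertainty on the mean when chi2_red > 1.
        return mean, sigma_mean * np.sqrt(max(chi2_red, 1))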

SATLAS
For a detailed explanation of the SATLAS package, the reader is referred to the online documentation, or to the docstrings accompanying all classes and methods for a technical description of the required input and expected output. Here, a short overview of the structure and possibilities of the package is presented.
Beyond the scientific Python stack (NumPy and SciPy), SATLAS makes use of the LMFIT package [17] for the implementation of parameter boundaries and expressions linking several parameters, and of the emcee package [14] for exploring the likelihood surface via a Monte Carlo walk.

Available models
In this section, the implementation and design of the models in SATLAS will be explored. These models are used to define the model function as presented in Section 2. The modular design gives the user complete freedom concerning the definition of the response function. This includes convenience functions that allow extension of predefined models, as well as an easy way to define a new model.

General use
BaseModel is the abstract class that defines common behavior for all models. It implements the methods for viewing, extracting and restricting parameters, for accounting for literature values, and for saving and adding models. This means that models can be defined individually, and their sum (using the standard "+" operator) returns the expected combined model.
SumModel and LinkedModel are subclasses of the BaseModel. The input and output for these classes are such that SumModel creates a response function which is the sum of the underlying models. LinkedModel allows simultaneous fitting of different datasets and models with optional links between the parameters of the different models.
Since polynomial fits are encountered very often in all branches of science, an implementation of such a model is already provided under the name PolynomialModel. To accommodate quick generation of models, MiscModel can be supplied with a user-defined function and, optionally, a list of parameter names. The supplied function is then used by the generated object as its response function (see the sketch below). For more involved calculations, where the response function is either too complicated for a single function or some optimization tricks can be used, the source code of MiscModel provides a minimal example of how BaseModel should be subclassed.
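As an illustration of the quick route, a user-defined response function can be wrapped in a MiscModel. The exact signature should be checked against the documentation; it is assumed here to take the callable, starting values and an optional list of parameter names, as described above, and the decay function and all numbers are purely illustrative.

    import numpy as np
    import satlas as sat

    # Hypothetical response: an exponential decay on a flat background.
    def decay(x, amplitude, tau, background):
        return amplitude * np.exp(-x / tau) + background

    # Assumed signature: callable, starting values, parameter names.
    model = sat.MiscModel(decay, [100.0, 5.0, 1.0],
                          name_list=['Amplitude', 'Tau', 'Background'])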

Laser spectroscopy specific
For laser spectroscopic purposes, an HFSModel was implemented. This model calculates the hyperfine spectrum (HFS) of the transition between two atomic levels, given the nuclear and electronic spins involved and the magnetic dipole A and electric quadrupole B parameters. For a level with quantum number F, resulting from the coupling of a nuclear spin I and an electronic spin J, we define the following constants:

K = F(F+1) - I(I+1) - J(J+1),    (18)

\beta = \frac{\tfrac{3}{4}K(K+1) - I(I+1)J(J+1)}{2I(2I-1)J(2J-1)}.    (19)

The resonant transition frequencies ν_{i,j} between a lower and an upper hyperfine level are then given by their (frequency) energy difference [18] and an offset labeled 'centroid':

\nu_{i,j} = \nu_{F_j} - \nu_{F_i} + \text{centroid},    (20)

with the hyperfine level frequency defined as

\nu_F = \frac{A}{2}K + B\beta.    (21)

A higher-order correction for the octupole contribution is also available in SATLAS, but is typically not used. These transition frequencies are automatically recalculated by the model object when the parameters change, and are used when the model function is evaluated. For the lineshapes of the resonances, implementations of the Gaussian, Lorentzian and Voigt profiles, a Crystal Ball shape (a Gaussian with a low-energy power-law tail) [19] and an asymmetric pseudo-Voigt [20] are available. The initial intensities of the resonances are set according to the Racah intensities (I_Racah) between the F states, which follow from the coupling of the angular momenta [21]. The intensities can either be fixed to these calculated ratios, left as free parameters, or varied between I_Racah and an intensity proportional to 2F_lower + 1 [22], with the transition between the two regimes governed by a saturation parameter.
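Eqs. (18)-(21) translate directly into code. The following standalone sketch (independent of SATLAS, first-order A and B terms only) computes a transition frequency:

    def K(F, I, J):
        # Eq. (18)
        return F * (F + 1) - I * (I + 1) - J * (J + 1)

    def level_shift(F, I, J, A, B):
        # Eqs. (19) and (21): dipole term plus quadrupole term, the
        # latter only defined for I, J > 1/2.
        k = K(F, I, J)
        shift = 0.5 * A * k
        if I > 0.5 and J > 0.5:
            shift += B * (0.75 * k * (k + 1) - I * (I + 1) * J * (J + 1)) \
                       / (2 * I * (2 * I - 1) * J * (2 * J - 1))
        return shift

    def transition_frequency(F_l, F_u, I, J_l, J_u,
                             A_l, A_u, B_l, B_u, centroid):
        # Eq. (20): centroid plus the difference in level shifts.
        return (centroid + level_shift(F_u, I, J_u, A_u, B_u)
                         - level_shift(F_l, I, J_l, A_l, B_l))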
Additionally, a TransformHFSModel is provided in order to perform model-dependent transformations of the x_i input data and the f(x_i) output data. Applications include, for example, the mass-dependent Doppler shift from an input voltage to a laser frequency.

Cost function selection
The fitting routines have been built on top of the LMFIT package, so boundaries on the parameters and expressions linking them (through the corresponding methods implemented in BaseModel) are taken into account. The behavior of the parameters can be controlled after the creation of a model through the following methods (a usage sketch is given after this list):
• set_expr, to make the value of a parameter depend on other parameters, e.g. Ampl1 = 2 · Ampl2,
• set_variation, to set a parameter as a constant or a variable,
• set_boundaries, to restrict the possible values of a parameter,
• set_lnprior_mapping, to define a prior function for a parameter value, and
• set_literature_values, to define the prior for a parameter with a known literature value.
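A short usage sketch of these methods; the dictionary-style arguments shown here are an assumption about the interface and should be checked against the documentation, and all numbers are illustrative.

    model.set_variation({'Bu': False})  # fix the upper-state B parameter
    model.set_boundaries({'FWHM': {'min': 10, 'max': 200}})
    model.set_literature_values({'Au': {'value': 100.0,
                                        'uncertainty': 2.0}})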
After setting the desired parameter options, two fitting routines are available: chisquare_fit and likelihood_fit. Both these functions require a model to be supplied, along with the measurement data. The chosen cost function is then optimized in order to find the best fitting model function and parameters.
When fitting with the χ² routine, either the uncertainties on the data points have to be given, or a function that is applied to f(x_i) in order to obtain σ_i. When using the likelihood routines, the function calculating the log-likelihood has to be supplied; if this is not done, the Poissonian log-likelihood of Eq. (4) is used. As a convenience function, chisquare_spectroscopic_fit automatically ensures that √f(x_i) is used to calculate σ_i. When supplied with the keyword xerr=value, these routines automatically transform their respective cost functions to incorporate the errors-in-variables treatment with the given value.
The loglikelihood module is used for all log-likelihood calculations. It also contains a log-likelihood function for Gaussian uncertainties, through which normal weighted least-squares fits can also benefit from the random-walk implementation for the uncertainty calculation.
Note that the minimization routine for chisquare_fit uses the well-documented Levenberg-Marquardt algorithm, a quickly converging algorithm that takes advantage of the sum-of-squares form of the cost function. This algorithm cannot be used for likelihood_fit, but a selection of other options is available. Convergence will be slower, or in some cases not achieved at all, with these algorithms, so some experimentation may be required to find the best algorithm for a given dataset.
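A typical fitting sequence might then look as follows; the call signatures follow the function names quoted above and the usage in Listing 1.

    import satlas as sat

    # Chi-square fit with sigma_i = sqrt(f(x_i)), i.e. Eq. (3):
    sat.chisquare_spectroscopic_fit(model, x, y)

    # Poisson likelihood fit, Eq. (4); xerr folds an uncertainty on the
    # x values into the cost function:
    sat.likelihood_fit(model, x, y, xerr=0.5)
    model.display_mle_fit()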

Uncertainty calculators
The Hessian matrix is automatically calculated after fitting, and Eq. (13) is used to estimate the uncertainties. The advantage of the Levenberg-Marquardt algorithm is that the Hessian matrix is estimated during the fitting procedure, eliminating the need for extra calculations. For likelihood_fit, the Hessian matrix is calculated using the numdifftools package, which is designed to calculate gradients and Jacobians of functions. Depending on the number of parameters and data points, this external calculation of the Hessian matrix can be quite time consuming. The matrix is then inverted and uncertainties are assigned. In certain cases this inversion is impossible, as the result of at least one of the parameters sitting near its specified boundary, which turns the Hessian into a singular matrix.
The calculate_analytical_uncertainties function is used to calculate the Δχ² = 1 or ΔL = 1/2 boundaries. Keywords can be passed to the selected fitting routine in order to modify it appropriately. This method takes the given parameter boundaries into account by not allowing the parameters to exceed the given values. For the random walk, the emcee package is employed. It implements an affine-invariant sampler [14], in which "walkers" in parameter space can slingshot each other, so the space is explored more quickly than with a traditional Metropolis-Hastings walk. This functionality is available either as the function likelihood_walk or in likelihood_fit with the keyword walking=True. The random walk routine creates an HDF5 file containing all the accepted positions for all the parameters. This file can be processed and plotted through generate_correlation_plot. An overview of the likelihood shape can be made with the function generate_correlation_map. This creates a plot similar to the output of the processed random walk: 1D curves representing the likelihood as a function of one parameter are drawn on the diagonal of an array of plots, while 2D contour maps are made by fixing two parameters and refitting the rest. Example figures are given in the next section.

Uncertainty calculation testing
As a test of the different approaches to the uncertainty calculation, a toy dataset has been generated and fitted. A hyperfine structure spectrum was generated for a nuclear spin I = 1/2, with a lower electronic state J_l = 0 and an upper state J_u = 1, resulting in two hyperfine peaks. The spectrum was then sampled at 30 MHz intervals around the peak positions defined by Eq. (20), using predefined A and centroid values.
Using these values as mean values, Poisson-distributed counts (1230 in total) have been generated. The generated data can be seen in Fig. 2. Table 2 presents the results of the full analysis for all approaches given in Section 2.3. For the χ² formalism, Eq. (3) has been used, while the likelihood has been calculated according to Eq. (4). The code for this analysis is given in Listing 1 and will serve as a worked example.

Listing 1: Usage example of SATLAS, as applied to the toy data. The comment lines (starting with #) describe the function of the following lines. More details on the code are given in the text.

     1 import satlas as sat
     2 import numpy as np
     3 import matplotlib.pyplot as plt
     4 # Parameter and model initialization
     5 # ... (lines 5-22: model creation, toy-data loading, chi-square fit and display)
    23 fig2, ax2, cbar = sat.generate_correlation_map(model, x, y, method='chisquare', filter=['Au', 'Centroid'], distance=3.1)
    24 # Likelihood analysis
    25 sat.likelihood_fit(model, x, y, hessian=True, walking=True, walk_kws={'filename': 'random_walk.h5', 'nsteps': 12000})
    26 model.display_mle_fit()
    27 fig3, ax3, cbar2 = sat.generate_correlation_map(model, x, y, method='mle', distance=3.5)
    28 fig4, ax4, cbar = sat.generate_correlation_plot('random_walk.h5', selection=(5, 100))
    29 fig5, ax5 = sat.generate_walk_plot('random_walk.h5', selection=(0, 5))
The code starts out (lines 1-16) by loading the needed packages, followed by parameter and model initialization. After the χ² analysis, the likelihood analysis (lines 24-29) loads the saved random walk file (random_walk.h5) to generate the correlation and walk plots. As can be seen in Table 2, all approaches give very comparable results. This shows the equivalence of all approaches when their underlying assumptions are satisfied. If the correlation plots (see the lower-left panels in Figs. 3 and 4) show a non-ellipsoidal correlation between parameters, or the 1σ boundaries are very asymmetric, the Hessian estimate is not justified. In that case, the numerical approach is only valid if the left and right sides of the curves on the diagonals of Fig. 3 can be approximated by a parabola. If this is not the case, the Δχ² = 1 criterion loses its meaning, since there is no Gaussian approximation in likelihood space. Uncertainties can then only be extracted through the random walk, by characterizing the posterior distribution. Such is the case here for the background, amplitude and FWHM parameters, with the background parameter especially presenting a strong deviation from a Gaussian distribution.
Also of interest is Fig. 4(b), where the trajectories of the walkers generated by the emcee package are plotted. At the plotted scale, it is clear that all walkers start from the same point in parameter space and need some steps to spread out. This phenomenon is known as burn-in and is a result of the tuning of the sampler settings [14]. The calculated autocorrelation time of the walk can be used to estimate how much burn-in is present and has to be removed. In practice, the far easier method is plotting the trajectories and judging by eye.

Application to low-statistics data: 203Fr
To demonstrate the applicability of the different approaches, data taken by the Collinear Resonance Ionization Spectroscopy (CRIS) team at ISOLDE, CERN during their November 2015 campaign on 203Fr is used [23]. The acquisition at the CRIS setup is of a nearly event-by-event type, which allows later reconstruction of the spectrum for an arbitrary number of scans. 203Fr has a nuclear spin I = 9/2, and the atomic transition at 423 nm coupling an S_1/2 to a P_3/2 state was probed; for more details on this type of experiment, see [24,25]. This gives rise to 6 possible hyperfine transitions between the levels. Due to the large ground-state splitting, the spectrum is split into a left and a right multiplet. The spectra, reconstructed for an increasing number of scans, were fitted using the different cost functions. Considering the possibility of systematic deviations as discussed in Section 2.4, the low values of χ²_red were judged to be due to the low statistics of the dataset and the invalidity of the Gaussian model for low count rates; the uncertainties were therefore not enlarged. The result is given in Fig. 6.
Several striking features are immediately apparent. The uncertainties calculated using Eq. (2) are roughly a factor of 2 smaller than those reported by either the modified formula, Eq. (3), or the Poissonian approach, Eq. (4). Although the same value is reached after 27 scans by all approaches for the B parameter and the centroid, there is a 1σ deviation for the A parameters when using the normal χ² procedure. The normal formula gives an inconsistent picture and deviates sharply from all other methods below 20 scans for the other parameters. The disagreement between the modified formula and the Poissonian likelihood is much smaller and smoother, with deviations starting only below 10 scans.

Conclusion
Based on the statistical formalism of maximum likelihood, the SATLAS package has been written to provide an easy interface to these formulas. The focus on user-friendliness has resulted in simple model definition and cost function selection. Advanced routines to calculate the uncertainties are available through intuitive interfaces.
SATLAS allows both easy and advanced analysis of experimental data with different cost functions, requiring only a minimum of input from the user. The results of the tests show the importance of selecting the correct cost function, as already suggested in 1984 [3]: the choice impacts both the extracted parameter values and their uncertainties.
From the generated data, we see that the uncertainty calculations are consistent with each other when their individual assumptions are met.
From the CRIS data, it is shown that even at high statistics the traditional χ² formula underestimates the statistical uncertainty on the parameter values. Furthermore, it yields a discrepancy between the analysis results obtained for the low- and high-statistics datasets. This inconsistency is removed by using the modified χ² equation, which uses the model value to estimate the uncertainty on the counted number. At lower statistics, the difference between the modified equation and the likelihood is negligible until a very low number of counts is reached. Between these two approaches, the modified equation is therefore preferred where feasible, due to its lower computational cost. When the spectrum contains only very few counts, however, the likelihood equation must be used in the minimization procedure.