Pygpc: a sensitivity and uncertainty analysis toolbox for Python

We present a novel Python package for the uncertainty and sensitivity analysis of computational models. The mathematical background is based on the non-intrusive generalized polynomial chaos method, which allows the investigated models to be treated as black-box systems without interfering with their legacy code. Pygpc is optimized to analyze models with complex and possibly discontinuous transfer functions that are computationally costly to evaluate. The toolbox determines the uncertainty of multiple quantities of interest in parallel, given the uncertainties of the system parameters and inputs. It also yields gradient-based sensitivity measures and Sobol indices to reveal the relative importance of model parameters.


Motivation and significance
Computational modeling is crucial for the investigation of a broad range of natural and artificial systems, such as technical appliances, climate and weather systems, economic and social systems, as well as the brain. The output of such models depends on parameters that are usually not exactly known for various reasons, but which can be described by probability density functions. Therefore, it is crucial to know the degree of uncertainty (quantified by a probability density function) of the output variables, given the uncertainties of the parameters. Furthermore, it is important to reveal how sensitive the output is to each parameter, as this indicates which parameters must be determined most accurately and which are less critical.
In order to answer these questions, the parameter space is usually sampled in a random or systematic fashion, possibly guided by known or assumed probability density functions. Classical methods, such as naïve Monte Carlo approaches or brute-force search [1], require many evaluations of the computationally expensive models. However, in many relevant models the parameter spaces are large, the model evaluations can be time-consuming, and these methods quickly generate prohibitive computational costs. Alternative approaches that seek to reduce the computational effort, such as worst-case evaluation [2] or linearization [3], are much more limited in their conclusions. For these reasons, pygpc is based on the more efficient, non-intrusive, generalized polynomial chaos method (gPC) [4]. It is capable of analyzing black-box systems by virtue of a highly efficient meta-model of the original transfer function, from which the stochastic properties and sensitivities of the quantities of interest (QOI) are derived. gPC has been applied in a variety of fields, such as computational fluid dynamics [5][6][7], heat transfer [8,9], multibody dynamics [10,11], and robust design optimization [12].
Pygpc is a Python-based gPC library with high-performance C and CUDA-C extensions, offering a user-friendly interface for analyzing one's own models written in Python or MATLAB, or any program that can be interfaced via the shell. Its efficient parallel CPU/GPU implementation allows the analysis of very complex models, including those containing discontinuities, on computer hardware of different performance ranging from laptops to large compute clusters. The implemented sensitivity analysis allows the identification of the most important parameters of the model under investigation and considerably accelerates prototyping and model analysis. Existing uncertainty quantification software packages, such as the UQ Toolkit (UQTK) [13,14] (https://github.com/sandialabs/UQTk), the MIT Uncertainty Quantification (MUQ) C++ Library (http://muq.mit.edu/), and UQLab [15] (https://www.uqlab.com/), do not offer this combination of advanced features.
Thus far, we have applied pygpc in the fields of non-destructive testing [16] and neuroscience [17][18][19][20]. However, it can be applied to a broad range of scientific and engineering fields.

Software description
The core concept of the gPC method is to find a functional dependence between the random variables ξ (input parameters) and the quantity of interest q by means of an orthogonal polynomial basis Ψ:

$$q(\boldsymbol{\xi}) \approx \sum_{\boldsymbol{\alpha} \in \mathcal{A}} u_{\boldsymbol{\alpha}} \Psi_{\boldsymbol{\alpha}}(\boldsymbol{\xi}) \tag{1}$$

Pygpc offers the possibility to use a regression or a quadrature approach to obtain the gPC coefficients u_α. For example, writing (1) in matrix form yields the system of equations solved in the regression approach:

$$[\Psi][U] = [Q] \tag{2}$$

where [Ψ] is the gPC matrix of size N_g × N_c, obtained by evaluating the N_c basis functions at the N_g sampling points, [U] is the coefficient matrix of size N_c × N_q containing the gPC coefficients of the N_q QOIs, and [Q] is the solution matrix of size N_g × N_q containing the results of the model evaluations. Solving (2) for [U] yields the polynomial surrogates of the QOIs as functions of the random input parameters ξ. This enables computationally efficient investigations of their statistics and sensitivities. For example, with an orthonormal basis, the expectation (i.e., mean) µ and the variance ν follow directly from the coefficients:

$$\mu = u_{\boldsymbol{0}}, \qquad \nu = \sum_{\boldsymbol{\alpha} \in \mathcal{A} \setminus \{\boldsymbol{0}\}} u_{\boldsymbol{\alpha}}^{2}$$

The Sobol indices S_i^{(υ)} decompose the total variance ν of the quantity of interest into components that can be attributed to individual random variables ξ_i or combinations thereof [22,23]:

$$S_i^{(\upsilon)} = \frac{1}{\nu} \sum_{\boldsymbol{\alpha} \in \mathcal{A}_i} u_{\boldsymbol{\alpha}}^{2}$$

For each S_i^{(υ)}, only the squared coefficients u_α² whose multi-indices α belong to the set A_i, i.e., with non-zero entries only for the ξ_i of interest, are considered.

The global derivative-based sensitivity coefficients S_i' are measures of the average change of the quantity of interest with respect to the i-th random variable. They are determined by means of the gPC coefficients and the corresponding partial derivatives of the basis functions [24]:

$$S_i' = \int_{\Theta} \frac{\partial q}{\partial \xi_i}\, p(\boldsymbol{\xi})\, \mathrm{d}\boldsymbol{\xi} = \sum_{\boldsymbol{\alpha} \in \mathcal{A}} u_{\boldsymbol{\alpha}} \int_{\Theta} \frac{\partial \Psi_{\boldsymbol{\alpha}}(\boldsymbol{\xi})}{\partial \xi_i}\, p(\boldsymbol{\xi})\, \mathrm{d}\boldsymbol{\xi}$$

Finally, the PDFs of the QOIs are obtained by sampling the gPC surrogate using traditional Monte Carlo methods. This is possible without much computational effort because evaluating the polynomial surrogate is far less time-consuming than the original computation.
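To make the regression step concrete, the following sketch solves (2) with plain NumPy for a toy model with two uniform inputs and a tensor-product Legendre basis, and then evaluates the mean, the variance, and a first-order Sobol index directly from the coefficients. The model, the basis size, and the multi-indices are illustrative assumptions; this is not pygpc code.

```python
# Toy illustration of the regression approach (2), not pygpc's internal API.
import numpy as np
from numpy.polynomial import legendre

rng = np.random.default_rng(1)

def f(xi):
    """Toy model with two uniform random inputs on [-1, 1]."""
    return 0.5 + 2.0 * xi[:, 0] + 0.3 * xi[:, 0] * xi[:, 1]

# Multi-indices alpha of a 2D basis up to total order 2
alphas = [(0, 0), (1, 0), (0, 1), (2, 0), (1, 1), (0, 2)]

def psi(alpha, xi):
    """Tensor-product Legendre basis function Psi_alpha(xi)."""
    out = np.ones(xi.shape[0])
    for d, a in enumerate(alpha):
        c = np.zeros(a + 1); c[a] = 1.0
        out *= legendre.legval(xi[:, d], c)
    return out

# Build the gPC matrix [Psi] (N_g x N_c) and solve [Psi][U] = [Q] for [U]
xi = rng.uniform(-1, 1, size=(100, 2))               # N_g sampling points
Psi = np.column_stack([psi(a, xi) for a in alphas])  # N_g x N_c
q = f(xi)                                            # N_g model solutions
u, *_ = np.linalg.lstsq(Psi, q, rcond=None)          # gPC coefficients

# Legendre polynomials are orthogonal but not normalized, so rescale the
# coefficients before applying the orthonormal-basis moment formulas.
norms = np.array([np.prod([1.0 / (2 * a + 1) for a in al]) for al in alphas])
u_n = u * np.sqrt(norms)
mu, nu = u_n[0], np.sum(u_n[1:] ** 2)

# First-order Sobol index of xi_1: sum over multi-indices that are non-zero
# only in the first position, divided by the total variance.
S1 = sum(u_n[i] ** 2 for i, a in enumerate(alphas)
         if a[0] > 0 and a[1] == 0) / nu
print(mu, nu, S1)
```

For this toy model, the first-order Sobol index of the first input evaluates to about 0.99, i.e., almost the entire variance is attributable to ξ₁.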
The foundation of gPC is the construction of a basis capable of emulating the model behavior while choosing the most informative sampling points to determine the associated gPC coefficients. Several approaches to increase the efficiency of the gPC method, published in the literature and partly developed by the authors of pygpc, are discussed in more detail below.

Software architecture
A simplified overview of the software architecture of pygpc is given in Fig. 1. After the model is set up by the user with the Model class, the uncertainty problem is defined by initializing the Problem class and assigning the random parameters via the RandomParameter class. The gPC algorithm is the core element and determines how the surrogate model is constructed. It handles the construction of the basis and manages the sampling in the Grid class. Subsequently, the Computation class calls the model and runs the calculations necessary to determine the gPC coefficients in the GPC class. In the case of discontinuous transfer functions, the random space is divided into sub-regions and the classifier assigns the sampling points to them. The analysis is completed when a certain approximation order is reached or a previously defined convergence criterion is satisfied. Thereafter, all metadata is saved in a gPC Session object, while the gPC coefficients, simulation results, etc., are saved in an associated file using the hierarchical data format (HDF5, www.hdfgroup.org).

Software functionalities
Pygpc offers the possibility to assign uniform, beta, gamma, and normal distributions to the random variables. The surrogate model is constructed using the corresponding families of orthogonal polynomials, i.e., Legendre, Jacobi, Laguerre, and Hermite polynomials. Besides models written in Python, it is possible to investigate MATLAB models by installing the MATLAB Engine API for Python [25].
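This distribution/polynomial pairing follows the Wiener-Askey scheme; for instance, probabilists' Hermite polynomials are orthogonal with respect to the standard normal density. The following NumPy check illustrates this property numerically and is independent of pygpc's implementation:

```python
# Numerical check of the Hermite/normal pairing, not pygpc code.
import numpy as np
from numpy.polynomial import hermite_e
from math import factorial, sqrt, pi

# Gauss-Hermite nodes/weights for the weight function exp(-x^2 / 2)
x, w = hermite_e.hermegauss(20)
for m in range(4):
    for n in range(4):
        cm = np.zeros(m + 1); cm[m] = 1.0
        cn = np.zeros(n + 1); cn[n] = 1.0
        inner = np.sum(w * hermite_e.hermeval(x, cm) * hermite_e.hermeval(x, cn))
        inner /= sqrt(2 * pi)          # normalize the weight to the normal PDF
        expected = factorial(n) if m == n else 0.0   # E[He_m He_n] = n! delta_mn
        assert abs(inner - expected) < 1e-8
```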
An important feature of pygpc is the large variety of implemented gPC algorithms. In general, one can either use static algorithms [17] with a predefined approximation order and number of sampling points, or adaptive algorithms [19], which successively extend the basis and the grid of sampling points until a certain error criterion is met. So far, standard random sampling and Latin hypercube sampling (LHS) [26,27] are implemented (see the sketch below); further sampling methods will be added in the future.
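The difference between the two sampling schemes can be illustrated with SciPy's QMC module (pygpc uses its own Grid classes; this is only a conceptual sketch): unlike plain random sampling, LHS places exactly one point in each of the n equal-probability strata of every marginal.

```python
# Conceptual comparison of random sampling and LHS, not pygpc's Grid classes.
import numpy as np
from scipy.stats import qmc

rng = np.random.default_rng(0)
n, dim = 64, 2

random_grid = rng.uniform(0.0, 1.0, size=(n, dim))  # standard random sampling

sampler = qmc.LatinHypercube(d=dim, seed=0)         # Latin hypercube sampling
lhs_grid = sampler.random(n=n)

# LHS stratifies each marginal: exactly one point per 1/n interval and axis
for d in range(dim):
    bins = (lhs_grid[:, d] * n).astype(int)
    assert np.all(np.bincount(bins, minlength=n) == 1)
```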
In pygpc, it is possible to use the gradient of the transfer function to enhance the accuracy and efficiency of the gPC [28,29]. Besides that, different solvers can be chosen to determine the gPC coefficients, including classical smooth l2 minimizers, such as the Moore-Penrose pseudo-inverse, and sparse l1 minimizers, such as least-angle regression (LARS) and orthogonal matching pursuit (OMP) [30]. The latter, in combination with the implemented random sampling methods, constitute a compressed sensing scheme [30,31].
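The following sketch contrasts the two solver families on an under-determined instance of (2), using scikit-learn's OMP implementation as a stand-in for pygpc's built-in solvers. The gPC matrix and the sparse ground truth are synthetic assumptions:

```python
# l2 vs. sparse l1 solvers on a synthetic under-determined system (2).
import numpy as np
from sklearn.linear_model import OrthogonalMatchingPursuit

rng = np.random.default_rng(2)
n_grid, n_coeffs = 30, 60                     # under-determined: N_g < N_c
Psi = rng.standard_normal((n_grid, n_coeffs))
u_true = np.zeros(n_coeffs)
u_true[[3, 17, 42]] = [1.5, -2.0, 0.7]        # sparse ground-truth coefficients
q = Psi @ u_true

u_l2 = np.linalg.pinv(Psi) @ q                # Moore-Penrose pseudo-inverse (l2)
omp = OrthogonalMatchingPursuit(n_nonzero_coefs=3, fit_intercept=False)
u_l1 = omp.fit(Psi, q).coef_                  # orthogonal matching pursuit (l1)

print(np.linalg.norm(u_l2 - u_true))          # minimum-norm, but not sparse
print(np.linalg.norm(u_l1 - u_true))          # recovers the sparse solution
```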
As the number of random variables increases, the curse of dimensionality can make the analysis considerably more difficult or even impossible. To counteract this, pygpc allows the adaptive reduction of the number of random variables by re-parameterizing the original model. An optimal rotation and reduction of the gPC basis [32,33] is performed by identifying the principal components of the Jacobian of the QOI.
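The idea can be sketched with a plain SVD of sampled gradients: directions with negligible singular values are directions in which the QOI barely varies and can be dropped after rotation. The toy model below is an illustrative assumption, not pygpc's implementation:

```python
# Conceptual sketch of basis rotation via principal components of gradients.
import numpy as np

rng = np.random.default_rng(3)

# Toy model in 5 nominal dimensions that only varies along 2 directions
W = rng.standard_normal((5, 2))
def grad_q(xi):
    """Gradient dq/dxi of the toy QOI at one sample."""
    return W @ np.array([np.cos(W[:, 0] @ xi), np.sin(W[:, 1] @ xi)])

xi = rng.uniform(-1, 1, size=(200, 5))
J = np.array([grad_q(x) for x in xi])          # gradient samples, 200 x 5

# Principal components of the gradient samples define the rotation; small
# singular values mark directions the QOI is insensitive to.
_, s, Vt = np.linalg.svd(J, full_matrices=False)
print(np.round(s, 2))                          # only 2 non-negligible values
eta = xi @ Vt[:2].T                            # re-parameterized inputs, 200 x 2
```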
When the quantity of interest depends discontinuously on the input parameters, the standard gPC may lack sufficient accuracy because it uses polynomials defined over the entire domain. To overcome this, pygpc supports a multi-element gPC approach (MEGPC), which includes a domain decomposition method based on k-means clustering [34]. The latter subdivides the random space and provides the training data for a multi-layer perceptron classifier [35][36][37], which assigns new sampling points to the corresponding domains. Both the clustering and the classification algorithms can be replaced by other algorithms implemented in the machine learning package scikit-learn [38].
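A minimal sketch of this decomposition, built from the same scikit-learn components named above (the discontinuous toy QOI and the data are assumptions):

```python
# Sketch of MEGPC-style domain decomposition with scikit-learn components.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(4)
xi = rng.uniform(0, 1, size=(400, 2))              # existing sampling points
q = np.where(xi[:, 0] + xi[:, 1] > 1.0, 5.0, 0.0)  # discontinuous toy QOI

# 1) Cluster the observed model output to subdivide the random space
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(
    np.column_stack([xi, q]))

# 2) Train a classifier on (xi, label) so that *new* sampling points can be
#    assigned to a sub-domain before the model is even evaluated there
clf = MLPClassifier(hidden_layer_sizes=(32,), max_iter=2000,
                    random_state=0).fit(xi, labels)
xi_new = rng.uniform(0, 1, size=(10, 2))
print(clf.predict(xi_new))                         # domain index per new point
```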
After the gPC is finished and the surrogate is constructed, the model is post-processed by determining the output PDFs and the first four statistical moments, i.e., mean, standard deviation, skewness, and kurtosis, of the QOI(s). In a subsequent sensitivity analysis, the Sobol indices [22,23] and the global derivative-based sensitivity coefficients [9] are determined for every QOI.
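Because the surrogate is cheap to evaluate, these statistics can be estimated by brute-force Monte Carlo sampling of the surrogate itself. A minimal sketch, with a placeholder polynomial standing in for a fitted gPC surrogate:

```python
# Post-processing sketch; the surrogate below is a placeholder assumption.
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)

def surrogate(xi):
    """Placeholder for a fitted gPC surrogate."""
    return 1.0 + 0.5 * xi[:, 0] + 0.2 * xi[:, 0] * xi[:, 1] ** 2

xi = rng.uniform(-1, 1, size=(10**6, 2))  # cheap: no model calls involved
q = surrogate(xi)

print("mean    :", np.mean(q))
print("std     :", np.std(q))
print("skewness:", stats.skew(q))
print("kurtosis:", stats.kurtosis(q))     # excess kurtosis, SciPy convention
hist, edges = np.histogram(q, bins=200, density=True)  # empirical PDF
```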
The construction of the gPC matrix, the solution of the linear system for the coefficient matrix, and the multiplication of the gPC matrix with the coefficient matrix are the operations with the highest computational cost. This is especially true when the surrogate consists of a large number of basis functions or when a large number of surrogate model evaluations has to be performed. To increase the performance of pygpc, we developed a highly efficient C++ extension and used OpenMP and CUDA to perform the computationally demanding assembly of the (possibly large) gPC matrix in parallel. Depending on the hardware used, the performance can be increased significantly compared to pure Python implementations.

Sample code snippet analysis
As the first step of every gPC analysis, the user is required to provide a wrapper of their model based on the AbstractModel class (Fig. 2). In a second step, the gPC session is set up by loading the model, defining the uncertain parameters and their distributions (problem definition), choosing an algorithm, and running the analysis by starting the gPC session. The frontend of pygpc has a modular structure of the form model → problem → algorithm → session. In this way, the model can easily be replaced by another one, the problem can be redefined by assigning different properties to the random variables or by defining some of them as constants, and the algorithm used to construct the surrogate model can be exchanged with little effort.
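A hedged sketch of this model → problem → algorithm → session flow is given below. It follows the publicly documented pygpc examples, but option names and exact signatures may differ between package versions, so the package documentation should be consulted before use:

```python
# Sketch of the pygpc frontend flow; names follow the public examples but
# may differ between versions of the package.
from collections import OrderedDict
import pygpc

# 1) Model: a built-in test function stands in for a user-defined
#    AbstractModel subclass
model = pygpc.testfunctions.Peaks()

# 2) Problem: random parameters (here beta-distributed) and constants
parameters = OrderedDict()
parameters["x1"] = pygpc.Beta(pdf_shape=[1, 1], pdf_limits=[1.2, 2.0])
parameters["x2"] = 1.25                      # treated as a constant
parameters["x3"] = pygpc.Beta(pdf_shape=[1, 1], pdf_limits=[0.6, 1.4])
problem = pygpc.Problem(model, parameters)

# 3) Algorithm: a static regression-based gPC of fixed order
options = {"method": "reg", "solver": "Moore-Penrose", "settings": None,
           "order": [9, 9], "order_max": 9, "interaction_order": 2}
grid = pygpc.Random(parameters_random=problem.parameters_random,
                    n_grid=100, seed=1)
algorithm = pygpc.Static(problem=problem, options=options, grid=grid)

# 4) Session: run the analysis and obtain the gPC coefficients
session = pygpc.Session(algorithm=algorithm)
session, coeffs, results = session.run()
```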

Illustrative example
In biological systems, large inter-individual differences translate to considerable uncertainties in the model parameters. A typical problem in neuroscience is to link unknown structural parameters of the brain to recordings of brain activity [39][40][41][42][43]. The Jansen-Rit neural mass model (NMM) is a non-linear model of the dynamic interaction between two excitatory and one inhibitory population of neurons within a cortical column [44]. It is described by a set of six nonlinearly coupled differential equations and has previously been used to infer the connection strength and direction between brain areas from electrophysiological recordings of macroscopic brain activity [45,46]. To link this model to experimental data, an understanding of the model's parameter dependencies is essential. In this example, we examine the input/output behavior of a Jansen-Rit NMM (Fig. 3b). We varied the amplitude and the frequency of the external stimulation as beta-distributed random variables (Fig. 3a). The QOI is defined as the dominant frequency f_PC of the post-synaptic potential of the pyramidal cells, i.e., the frequency with the highest power spectral density. The model was set up using the Python package PyRates [47]. Since the model behavior shows rapid phase transitions (Fig. 3d), we used an MEGPC approach to approximate it. The random space was split into three domains covering the low (blue), medium (green), and high frequency regions (red). The mean and the standard deviation of the dominant frequency in the parameter space under investigation, considering the probability densities of the input parameters, were 10.806 Hz and 0.305 Hz, respectively. The smooth transition between the low and high frequency domains (f ≈ 11 Hz) is approximated by the gPC as a step. Overall, the gPC approximated the complex model behavior quite well; the difference between the original model and the gPC was about 5% (Fig. 3c). It was computed from an independent evaluation dataset of size N using the normalized root mean square deviation (NRMSD) between the original model solutions q_i and the gPC approximations q̂_i:

$$\mathrm{NRMSD} = \frac{1}{\max_i q_i - \min_i q_i} \sqrt{\frac{1}{N} \sum_{i=1}^{N} \left(q_i - \hat{q}_i\right)^2}$$

The resulting sensitivity measures are summarized in Table 1. Both demonstrate that the dominant frequency is insensitive to the stimulation amplitude α but very sensitive to the stimulation frequency f. The high second-order Sobol coefficient, in combination with the first-order coefficients, indicates a pronounced discontinuity in the parameter space.
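For reference, the NRMSD defined above is a one-liner (a small helper, not part of pygpc's public API):

```python
# Direct transcription of the NRMSD definition above.
import numpy as np

def nrmsd(q, q_hat):
    """Normalized root mean square deviation between model and surrogate."""
    rmsd = np.sqrt(np.mean((q - q_hat) ** 2))
    return rmsd / (np.max(q) - np.min(q))
```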

Impact
Data errors resulting from, for example, errors in input variables or model parameters are among the largest sources of error besides model and numerical errors. In most situations, the model parameters cannot be specified exactly, due to limitations in the available experimental data or because of inherent case-to-case variability of the systems studied. Inaccurate knowledge of the model parameters may lead to considerable differences between the studied real-world system and the numerical simulations. The development of new models should therefore always be accompanied by a thorough sensitivity analysis to investigate and quantify their stability and robustness. Such analyses have to run alongside the modeling process and should not become the main task of the modeler. The black-box character of pygpc allows rapid integration of user-defined models without the need for time-consuming adjustments. The combination of projection techniques with state-of-the-art l1 minimization allows the computation of sparse gPC representations that counteract the curse of dimensionality, thus offering the possibility of investigating high-dimensional systems. Moreover, the implemented MEGPC allows the analysis of discontinuous transfer functions, which are very common in real-world systems.
In engineering, the identification of the most important sources of uncertainty, prior to production, makes it possible to decrease the number of iterations during prototyping. This shortens development time and increases cost efficiency.
Besides that, pygpc can be used to validate simulations against real-world measurements. A comprehensive validation study must carefully take into consideration both the experimental and the computational uncertainty ranges. The former are inherently affected by measurement errors and system imperfections, which are typically represented using error bars. The post-processing routines of pygpc allow the computation of the same measures for the numerical model, making measurements comparable to simulations. A practical example, where pygpc was applied, is non-destructive Lorentz force eddy current testing [16,48]. By means of pygpc, it was possible to quantify the impact of multiple unknown input parameters, to cost-efficiently improve the laboratory setup in terms of reliability and reproducibility, and to validate the simulation results [16].
Another area of application is reliability and risk analysis, where the aim is to determine the probabilities, and the associated parameter combinations, with which certain critical values or operating thresholds of the system are exceeded.
Besides the aforementioned applications, the derived surrogates can be used to optimize the model behavior. In this case, the QOI is used to compute the goal function to be minimized. After applying gPC, the optimization can be performed on the computationally efficient polynomial surrogate without the need to run expensive model calculations.

Conclusions
The presented software can be used in numerous areas of application. We demonstrated the capabilities of pygpc on a neural mass model with discontinuities. It was possible to construct a surrogate model of the complex transfer function with an accuracy of about 95%. With the help of the sensitivity analysis, it was possible to gain insight into the parameter dependencies and to identify the most important parameters influencing the dominant frequency.
Our goal is to provide a versatile tool for the efficient uncertainty and sensitivity analysis of black-box systems. Over the next years, it will be continuously maintained and further developed by incorporating new sampling methods, matrix solvers, post-processing routines, and adaptive algorithms to further enhance its efficacy. The modular structure of pygpc allows for contributions and fast adoption in any of the aforementioned fields.

Declaration of competing interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.