Korali: Efficient and Scalable Software Framework for Bayesian Uncertainty Quantification and Stochastic Optimization

We present Korali, an open-source framework for large-scale Bayesian uncertainty quantification and stochastic optimization. The framework relies on non-intrusive sampling of complex multiphysics models and enables their exploitation for optimization and decision-making. In addition, its distributed sampling engine makes efficient use of massively-parallel architectures while introducing novel fault tolerance and load balancing mechanisms. We demonstrate these features by interfacing Korali with existing high-performance software such as Aphros, LAMMPS (CPU-based), and Mirheo (GPU-based), and show efficient scaling for up to 512 nodes of the CSCS Piz Daint supercomputer. Finally, we present benchmarks demonstrating that Korali outperforms related state-of-the-art software frameworks.


Introduction
Over the last thirty years, High-Performance Computing (HPC) architectures have enabled high-resolution simulations of physical systems ranging from atoms to galaxies. HPC has also reduced the cost and turnaround time of such simulations, making them invaluable predictive and design tools across all fields of science and engineering. Multiple simulations at resolutions that would have been impossible a decade ago are routinely employed in optimization and design. The transformation of these simulations into actionable decisions requires the quantification of their uncertainties. In recent years, HPC has become central to the way we conduct science with massive amounts of data. Such data are used to develop and calibrate physical models as well as to quantify the uncertainties of their predictions. The integration of data and physical models has a history of over 300 years, dating back to Laplace and Copernicus and to the framework known as Bayesian inference. However, due to its computational cost, the application of Bayesian inference has been, until recently, limited to simple models or to inexpensive approximations.
Bayesian inference requires sampling of distributions with dimensionality greater than or equal to the number of model parameters. The sampling necessitates numerous evaluations, making the process computationally demanding, particularly when the underlying model requires hundreds of compute hours per evaluation. Moreover, special care is necessary to develop sampling algorithms that harness the capabilities of modern supercomputers [1]. The sampling involved in Bayesian inference serves as a bridge to stochastic optimization algorithms [2] that aim to identify the probability distribution of the parameters that maximize a particular cost function. Stochastic optimization algorithms such as CMA-ES, MBOA, and Natural Gradient Optimization [3,4] are non-intrusive and rely on handling the physical models through input/output relations.
The need for efficient deployment of optimization and uncertainty quantification algorithms has motivated the development of several statistical frameworks [5,6,7,8,9,10,11]. However, to the best of our knowledge, only a few such frameworks are well-suited for deployment on massively parallel computer architectures [12,13]. In this paper, we present Korali, a new framework for Bayesian uncertainty quantification (UQ) and optimization. The framework enables efficient large-scale sampling while providing mechanisms for fault tolerance, load balancing, and reproducibility, which are essential requirements for emerging supercomputers [14]. We demonstrate these capabilities guided by three motivating studies: first, a high-dimensional optimization study of the shape of fluid transporting pipes; second, a hierarchical Bayesian analysis of the dissipation parameter of human red blood cell (RBC) membranes; and lastly, a Bayesian study on the parameters of a coarse-grained (CG) water model. In addition, we provide a synthetic benchmark that compares the performance of Korali with that of other state-of-the-art frameworks for optimization on up to 512 nodes of the CSCS Piz Daint supercomputer.
The rest of this paper is organized as follows: in Section 2, we present the principles behind the unifying framework; in Section 3, we present the framework's design; in Section 4, we present the results of three experimental cases; in Section 5, we discuss state-of-the-art UQ frameworks and compare their efficiency; and in Section 6, we present conclusions and future work.

Unified Approach to Optimization and Sampling
We designed Korali to exploit the common patterns in optimization and sampling while exposing a unifying interface. Consider a computational model, represented by a function f, that depends on parameters ϑ with unknown values and possibly other parameters x with known values. We wish to infer the values of the parameters ϑ such that the model evaluated at known parameters x_i approximates given values y_i for i = 1, …, N. The variables y_i are called the data and usually correspond to experimental measurements. Since the measurements are typically affected by random errors, we introduce a probabilistic model that associates the measurements y_i with the model evaluations f(ϑ, x_i). This model is represented by a probability density function p(y | ϑ; x), where y = {y_1, …, y_N} and x = {x_1, …, x_N}. For fixed x and y, the density is a function of the parameters ϑ and is called the likelihood function. Furthermore, any information on the parameters ϑ that is known prior to observing any data is encoded in a density function p(ϑ). The distribution of the parameters ϑ conditioned on the known values y_i is given by Bayes' theorem: p(ϑ | y; x) ∝ p(y | ϑ; x) p(ϑ). The posterior density function can either be optimized or sampled.
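The setup above can be made concrete with a short sketch. Assuming a Gaussian error model for the measurements and a hypothetical linear model f (both illustrative choices, not taken from the paper), the log-posterior follows directly from Bayes' theorem:

```python
import math

def f(theta, x):
    # Hypothetical computational model: a straight line f(theta, x) = theta_1*x + theta_2
    return theta[0] * x + theta[1]

def log_likelihood(theta, xs, ys, sigma=1.0):
    # log p(y | theta; x) for i.i.d. Gaussian measurement errors around f(theta, x_i)
    return sum(-0.5 * math.log(2 * math.pi * sigma**2)
               - (y - f(theta, x))**2 / (2 * sigma**2)
               for x, y in zip(xs, ys))

def log_prior(theta):
    # Standard-normal prior p(theta), up to an additive constant
    return -0.5 * sum(t * t for t in theta)

def log_posterior(theta, xs, ys):
    # Bayes' theorem: p(theta | y; x) ∝ p(y | theta; x) p(theta)
    return log_likelihood(theta, xs, ys) + log_prior(theta)
```

This `log_posterior` is exactly the quantity that the optimization and sampling algorithms discussed next evaluate repeatedly.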
By optimizing the posterior density we obtain a single value of the vector ϑ that represents the parameters with the highest probability. If the derivatives of the posterior with respect to ϑ are available, a local optimization method can be used, e.g., the Adam algorithm [15]. Otherwise, a derivative-free optimization algorithm can be used, e.g., evolution strategy (ES) algorithms. At every iteration, algorithms of this type draw samples from a parametrized distribution family and rank them from highest to lowest value. For example, CMA-ES [2] uses this ranking and the history of the evolution to update the mean and the covariance matrix of a normal distribution. In the limit, the normal distribution converges to a Dirac distribution located at an optimal value. The covariance at each iteration can be interpreted as a scaled approximation of the Hessian of the objective function at the mean of the normal distribution.
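The sample-rank-update loop can be illustrated with a minimal toy evolution strategy. This sketch is not CMA-ES: it adapts only a scalar mean and step size (no covariance matrix, no evolution path), and all parameter values are illustrative.

```python
import random

def toy_es(objective, mean=0.0, sigma=5.0, lam=16, mu=4, generations=60, seed=42):
    """A (mu, lambda) evolution strategy in one dimension."""
    rng = random.Random(seed)
    for _ in range(generations):
        samples = [rng.gauss(mean, sigma) for _ in range(lam)]
        samples.sort(key=objective, reverse=True)  # rank from highest to lowest value
        elite = samples[:mu]
        mean = sum(elite) / mu                     # recombine the best mu samples
        sigma *= 0.9                               # crude geometric step-size decay
    return mean

# Maximize a concave toy objective with optimum at x = 3
best = toy_es(lambda x: -(x - 3.0) ** 2)
```

Note that each generation's `lam` objective evaluations are independent, which is precisely what enables the parallel evaluation discussed later.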
If we sample the posterior distribution instead of optimizing it, we obtain a set of samples that represent the volume of the distribution in the parameter space. This extended information, compared to the single parameter vector obtained with optimization, allows us to estimate the uncertainty in the inference of the parameters. Moreover, the uncertainty can be propagated to quantities of interest of the computational model, thereby assessing the uncertainty in the predictions of the model. If derivatives are available, algorithms like Hamiltonian Monte Carlo (HMC) [16] can be utilized, which accelerate the sampling and are efficient in high-dimensional spaces. Otherwise, derivative-free algorithms such as Metropolis-Hastings, nested sampling (NS) [17], or Transitional Markov Chain Monte Carlo (TMCMC) [18] can be utilized. In particular, the NS and TMCMC algorithms have the benefit of parallel evaluation of samples, allowing the acceleration of sampling for problems that involve computationally demanding models.
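As a concrete example of the derivative-free option, here is a minimal random-walk Metropolis-Hastings sketch operating on a log-density; the target distribution and tuning constants are illustrative.

```python
import math
import random

def metropolis(log_p, x0=0.0, step=1.0, n=20000, seed=7):
    """Random-walk Metropolis-Hastings on a 1-D log-density log_p."""
    rng = random.Random(seed)
    x, lp = x0, log_p(x0)
    chain = []
    for _ in range(n):
        y = x + rng.gauss(0.0, step)           # symmetric Gaussian proposal
        lpy = log_p(y)
        if math.log(rng.random()) < lpy - lp:  # accept with prob min(1, p(y)/p(x))
            x, lp = y, lpy
        chain.append(x)
    return chain

# Example: sample a standard normal; the chain approximates N(0, 1)
chain = metropolis(lambda x: -0.5 * x * x)
```

Unlike the population-based NS and TMCMC algorithms mentioned above, this basic chain is sequential, which is why those algorithms are preferred when model evaluations are expensive.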
The common pattern between optimization and sampling algorithms is the iterating cycle of evaluations of the posterior density function. Additionally, for algorithms like ES, NS, and TMCMC, many function evaluations can be executed in parallel within the same iteration.

Framework Design
The framework is specially tailored for the execution of massively parallel population-based algorithms that rely on generating and evaluating sets of parameters (samples) iteratively. At each iteration, a set of n samples S_i is evaluated by a statistical model. The statistical model to use is given by the choice of problem, e.g., Optimization, to search for the optimum of an objective function; Sampling, to sample an unnormalized density function; or Bayesian Inference, to sample the posterior distribution and quantify the uncertainty of an inverse Bayesian problem. The statistical model evaluation may require the execution of a computational model (f) representing, e.g., the simulation of a physical system.
To solve a given problem, the framework runs an experiment (see Fig. 1). Experiments are user-defined (for a detailed description of the user interface, see Appendix A) and contain the required configuration for the problem, the solver algorithm, and the variable space. The variable space represents the range of values within which the solution is to be identified. Variables are uniquely identified by their name and can be restricted either through an upper and a lower bound, or described by a prior distribution.
Experiments run under the control of the framework's sampling engine. The engine coordinates the exchange of samples between the experiment and the computational model until the solver terminates. Figure 1 shows the workflow of the engine when executing a given experiment. A generation represents a cycle in which the experiment produces and evaluates a population of samples. We define a sample S_i as a particular selection of values within the feasible variable space.
Figure 1: Generation-based workflow of the sampling engine. The solver module generates samples S_i. The problem module preprocesses the samples and passes them to a computational model. The results of the computational model f(S_i) are processed into derived quantities, and the state of the solver is updated.
The first generation starts when the user runs k.run(e), where e is the user-defined experiment object and k represents an instance of the engine. The first step in every generation is to check whether any of the defined termination criteria has been met. If not, the engine yields execution to the solver algorithm, which generates a population of samples {S_1, …, S_n} and relays them to the problem module for pre-processing. During this stage, the samples are transformed into the correct input format for the computational model. The samples are then passed on to the computational model for evaluation ({f(S_i)}, i = 1, …, n). Upon receiving the results from the computational model, the problem module calculates a derived quantity, e.g., the log-likelihood for a problem of type Bayesian inference, and passes it back to the solver module. The solver uses these quantities to update its internal state and produce partial results, which serve as the basis for creating the next sample population in the following generation.
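The generation loop described above can be sketched structurally as follows. The `RandomSearchSolver` and `OptimizationProblem` classes below are illustrative stand-ins for the framework's solver and problem modules (a trivial random-search solver, not one of Korali's actual algorithms), chosen to show which module produces and which consumes the samples at each step.

```python
import random

class RandomSearchSolver:
    def __init__(self, n, seed=0):
        self.n, self.rng, self.best = n, random.Random(seed), float("-inf")
    def generate(self):            # produce the population {S_1, ..., S_n}
        return [self.rng.uniform(-5, 5) for _ in range(self.n)]
    def update(self, evaluations): # update internal state from derived quantities
        self.best = max(self.best, max(evaluations))
    def done(self, generation):    # termination criterion
        return generation >= 10

class OptimizationProblem:
    def preprocess(self, samples): # transform S_i into model inputs
        return samples
    def postprocess(self, results):# derive quantities from f(S_i)
        return results

def run(solver, problem, f):
    """Generation-based driver: check termination, generate, evaluate, update."""
    generation = 0
    while not solver.done(generation):
        samples = problem.preprocess(solver.generate())
        results = [f(s) for s in samples]   # model evaluations (parallelizable)
        solver.update(problem.postprocess(results))
        generation += 1
    return solver.best

best = run(RandomSearchSolver(16), OptimizationProblem(), lambda x: -(x - 1.0) ** 2)
```

The list comprehension evaluating `f` is the point where the distributed engine of the next section dispatches samples to workers instead of evaluating them serially.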

Distributed Sampling Engine
The sampling engine supports the parallel evaluation of multiple computational models. To enable parallel sampling, the user runs multiple instances of the Korali application, typically via the mpiexec command. The engine determines the number k of processes that have been instantiated and assigns roles to them. The first process assumes the role of the engine, managing sample production and evaluation, as shown in Fig. 1. The remaining k − 1 processes assume the role of workers, whose task is to wait for incoming samples, run the computational model f, and return the results.
The distribution of samples and the collection of results are managed by a distribution conduit module between the experiment and the computational model. This conduit keeps track of the state of each worker (i.e., idle, busy, or pending) and assigns incoming samples to workers that are in the idle state. As soon as a sample is received, a worker transitions to the busy state, during which it executes the model f. When the worker finishes executing f, it returns the results and sets its state to pending, indicating that the engine can collect the result. The worker transitions back to the idle state only after the conduit has retrieved the result. The engine employs an opportunistic strategy for work distribution, maintaining a common queue of pending samples from which all workers are served. The conduit distributes samples one by one, in real time, as soon as a worker becomes idle.
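The benefit of this opportunistic, one-by-one distribution can be seen in a small event-driven sketch: a common queue of samples is served to whichever worker becomes idle first. The sample run times below are hypothetical.

```python
import heapq

def schedule(sample_times, n_workers):
    """Dynamic one-by-one assignment of queued samples to the first idle worker."""
    idle_at = [(0.0, w) for w in range(n_workers)]  # (time worker becomes idle, id)
    heapq.heapify(idle_at)
    finish = []
    for t in sample_times:                 # samples served in queue order
        free_time, w = heapq.heappop(idle_at)
        heapq.heappush(idle_at, (free_time + t, w))
        finish.append(free_time + t)
    return max(finish)                     # makespan of the whole population

# Four samples of unequal cost on two workers: worker 0 runs the 3.0 s sample,
# worker 1 runs 2.0 s + 2.0 s, and the 1.0 s sample goes to worker 0 when it
# frees up, giving a makespan of 4.0 s. A static half/half split of the same
# queue (3.0 + 2.0 vs. 2.0 + 1.0) would take 5.0 s.
makespan = schedule([3.0, 2.0, 2.0, 1.0], 2)
```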
The engine also supports the evaluation of parallel (MPI-based) models. In this case, the execution conduit creates a set of worker teams, each assigned a subset of MPI ranks. All ranks from the same team work together on the evaluation of incoming samples. Users define the number of MPI ranks per team (m) through the "Ranks Per Worker Team" configuration parameter. For a run with N MPI ranks (as specified in the MPI launch command), the conduit assigns one rank to the sampling engine and creates k = (N − 1)/m worker teams, each with m ranks. Every worker team owns a private MPI communicator, which allows message passing between the ranks contained therein. Any MPI-based computational model passed to Korali should use this team-specific communicator for message exchanges. To identify the ranks in a given team, the conduit module appends an MPI Communicator field to the sample, indicating which group corresponds to the receiving worker. With this value, the model can determine the number m of ranks in the team and the rank identifiers therein. The model can then operate like a regular MPI application and produce a result collaboratively.
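The rank-to-team bookkeeping can be sketched as below, assuming (as in the text) that one rank runs the sampling engine and the remaining N − 1 ranks are grouped into teams of m. In a real MPI run, the team index would serve as the color of a communicator split; the function names here are our own.

```python
def team_of(rank, m):
    """Worker-team index of an MPI rank; None for the engine rank (rank 0)."""
    if rank == 0:
        return None
    return (rank - 1) // m

def num_teams(N, m):
    """k = (N - 1) / m worker teams, after reserving one rank for the engine."""
    return (N - 1) // m

# Example: N = 9 ranks with m = 4 ranks per team -> 1 engine rank + 2 teams.
teams = [team_of(r, 4) for r in range(9)]
```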
A novelty of the sampling engine is the ability to execute multiple independent experiments concurrently. The goal is to maximize the pool of pending samples at any moment during execution, maximizing worker utilization in the presence of load imbalance (see Section 4.2). Figure 2 shows the engine's dataflow when executing two experiments simultaneously (e_0 and e_1) on a supercomputer cluster. The engine switches its execution context between the two experiments, continuously polling whether either is ready to advance to the next generation or to return partial results for storage. During execution, each experiment produces and evaluates its own set of samples (S for e_0 and T for e_1).
The distribution conduit manages each experiment's samples independently, distributing them among the common set of workers. Depending on which experiment issued the sample, the conduit indicates to the worker which computational model to run: f, if the sample belongs to e_0, or g, if the sample belongs to e_1. The results are asynchronously returned to the collection module, which distributes them back to the corresponding experiment. The engine evaluates each experiment's termination criteria after the given experiment reaches the end of its current generation. Experiments advance independently from each other, storing their results in separate folders, and the engine returns when all of them complete.

Modularity and Fault-Tolerance
The framework can be extended by adding new problem types and solver modules. To integrate a new module, developers create a new folder in the source code containing three files: the module's base C++ class declaration header (.hpp) file, its method definitions in the source (.cpp) file, and a configuration (.config) file. Although the module class may contain any number of additional method definitions, it must satisfy a common interface of abstract virtual methods. These methods are necessary to run the generation-based paradigm and depend on whether the new module is of problem or solver type.
The purpose of the configuration file is to automate the generation of serialization routines. The engine calls these routines at the end of every generation to save the internal state of a module into a file. The state file serves as a checkpoint from which the execution can be resumed in case of failure, or to split large jobs into shorter stints. The engine also stores the internal state of all random number generators to guarantee that the same intermediate and final results are obtained when resuming from a file. This approach guarantees reproducible results in long-running experiments that require more than one job to complete.
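The reproducibility guarantee hinges on serializing the random-generator state together with the solver state. The following minimal sketch demonstrates the principle with Python's standard-library generator (Korali itself serializes its own C++ generators; this is only an illustration of the mechanism):

```python
import pickle
import random

rng = random.Random(1234)
first_half = [rng.random() for _ in range(5)]       # samples before the checkpoint

checkpoint = pickle.dumps(rng.getstate())           # serialize the generator state

uninterrupted = [rng.random() for _ in range(5)]    # continue without interruption

resumed = random.Random()
resumed.setstate(pickle.loads(checkpoint))          # "restart" from the state file
after_restart = [resumed.random() for _ in range(5)]

# after_restart reproduces uninterrupted exactly: the restarted run yields the
# same sample stream, and hence the same intermediate and final results.
```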

The engine's source pre-processor creates the serialization routines automatically before compilation, based on the fields declared in the configuration file, as shown in Fig. 3. The pre-processor enforces that no class attributes are declared in the header (.hpp) file. Instead, attributes should be inserted as entries in the configuration file, specifying for each one a field name, a C++ datatype, a description, and a default value. In this way, the framework ensures that the entire internal state of the module is accounted for during serialization. The pre-processor automatically adds these configuration fields as class members to the module's base .hpp header file, so that they can be referenced from the .cpp file. The declaration and definition of the serialization methods are automatically generated and added to the module's source. A secondary product of code pre-processing is the automatic production of documentation pages for the module's entry in the user manual [19].

Experimental Evaluation
We tested Korali on three research studies. First, an optimization study of the shape of fluid transporting pipes; this study shows the use of Korali in a large-scale, high-dimensional optimization job. Second, a hierarchical Bayesian analysis of the dissipation parameter of human RBC membranes; this study exemplifies the efficiency gains from scheduling simultaneous experiments in the presence of load imbalance. Lastly, a Bayesian study on the parameters of a coarse-grained water model; this study demonstrates the fault-tolerance and reproducibility mechanisms in Korali.
As computational platform, we used both the XC40 and XC50 partitions of Piz Daint [20], a Cray supercomputer located at the Swiss National Supercomputing Centre (CSCS). The XC40 partition comprises 1'813 compute nodes, each equipped with two Intel Xeon E5-2695-v4 18-core processors running at 2.10GHz and 128 GB RAM. The XC50 partition comprises 5'704 compute nodes, each equipped with a single Intel Xeon E5-2690-v3 12-core processor running at 2.60GHz, 64 GB RAM, and an NVIDIA "Tesla" P100 graphics processing unit (GPU) with 16 GB of device memory. In Appendix B, we provide all resources required to reproduce the results presented in this paper.

Study 1: Shape Optimization of Fluid Transporting Pipes

Pipe networks are commonly used to convey flowing substances at various scales, ranging from microfluidics to industrial facilities. The flow pattern in a pipe is determined by its shape and the flow conditions. Here we apply Korali to optimize the shape of a two-dimensional pipe which transports liquid with bubbles. The pipe consists of two straight segments connected with a U-turn, as illustrated in Fig. 4, where circulating bubbles can coalesce into larger bubbles. We consider three cases of optimization with different objectives: (i) minimize the transport time of bubbles, i.e., the time for all bubbles to exit the pipe; (ii) maximize the maximum bubble volume at the time when it exits the pipe; and (iii) minimize the maximum bubble volume and therefore achieve a state without coalescence.
The shape of the pipe is defined by 10 parameters that specify the width and radial offset of the U-turn along five directions used as knots of two cubic splines. The flow parameters are the Reynolds number Re = wVρ_l/µ_l = 100 and the capillary number Ca = µ_l V/σ = 0.1, defined from the liquid density ρ_l, liquid viscosity µ_l, surface tension σ, pipe width w, and mean inlet velocity V. The gas density and viscosity are set to 0.1ρ_l and 0.1µ_l.
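For concreteness, the two dimensionless groups can be evaluated for a hypothetical set of fluid properties (the numerical values below are illustrative, chosen only so that Re = 100 and Ca = 0.1 as in the study):

```python
def reynolds(w, V, rho_l, mu_l):
    # Re = w * V * rho_l / mu_l
    return w * V * rho_l / mu_l

def capillary(mu_l, V, sigma):
    # Ca = mu_l * V / sigma
    return mu_l * V / sigma

# Hypothetical (non-dimensionalized) fluid properties
rho_l, mu_l, sigma, w, V = 1.0, 0.01, 0.1, 1.0, 1.0
Re = reynolds(w, V, rho_l, mu_l)   # ≈ 100
Ca = capillary(mu_l, V, sigma)     # ≈ 0.1
```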
We find the optimal shape of the pipe for the three cases using the Covariance Matrix Adaptation Evolution Strategy (CMA-ES) as optimization algorithm, with a population size of 32 samples per generation. As computational model, we used a finite-volume solver (Aphros [21,22]) with embedded boundaries [23] to treat complex geometries and a particle method [24] for surface tension forces. Each instance of Aphros ran on 4 nodes (72 cores) of the XC40 partition, for a total of 128 nodes (2'304 cores) per case. Each case ran for an average of 15 hours and consumed approximately 2k node hours.
Results of the optimization are shown in Fig. 5. Case (i) results in a shape where the U-turn contracts and then expands to redirect the bubbles towards the centerline, where the velocity is maximized. The transport time of the optimized shape decreased by a factor of 2.1 compared to the baseline. Case (ii) forms a cavity at the end of the U-turn where the flow stagnates and all bubbles but one coalesce into a single elongated bubble. Finally, case (iii) results in a wide shape where bubbles circulate in parallel lanes, achieving a state without coalescence.

RBCs are highly deformable objects that undergo complex dynamical transitions in response to external disturbances. These transitions lay the foundation for understanding the rheology of blood in flows through capillaries, vascular networks, and medical devices. There is still significant uncertainty in the choice of the mechanical law to describe the RBC properties, as well as in the parameter values of each model [27].

Study 2: Red Blood Cell Membrane Viscosity
In this study, we infer the membrane viscosity, which controls the relaxation time of an RBC membrane. Here, the RBC membrane is modeled as a collection of particles placed on the nodes of a triangular network [28]. We used data from five experimental observations (four from Hochmuth [25] and one from Henon [29]) on the relaxation of a stretched RBC to its equilibrium shape in order to infer the posterior distribution of the membrane viscosity (η_m, see Fig. 6) and its associated uncertainty (σ). Due to the presence of heterogeneous data, we employed a hierarchical Bayesian model. The sampling of the posterior distribution is approximated by a two-stage importance sampling algorithm. A detailed description of the statistical model, the experimental data, and the results of the hierarchical model can be found in a previous work [26].
Here, we analyze the performance of the framework during the first stage, where the parameters were sampled individually, conditioned on each experimental data set. For sampling, we employed BASIS, a reduced-bias variant of the TMCMC algorithm. BASIS is a sampling algorithm tailored for Bayesian uncertainty quantification and targeted at parallel architectures [30]. We configured BASIS to run a population size of 512 samples per generation. For the computational model, we used Mirheo [31], a high-performance GPU-based library for microfluidic simulations.
We ran the five experiments on 512 nodes of the XC40 partition using 512 MPI ranks, each node running one instance of Mirheo. In a previous work [26], we had found that the RBC membrane relaxation model shows a high variance in running times (40–100 min per sample). This variance causes a workload imbalance among the workers, with detrimental impact on the performance of the BASIS algorithm. The effect can be appreciated when running each of the five BASIS sampling experiments individually, as shown in Fig. 7 (top). The five experiments took 48.1 hours to complete on 512 nodes, requiring a total of 24.6k node hours. This approach yielded a sampling efficiency (e) of 72.7%; that is, nodes remained idle for 27.3% of their running time. Table 1 (row: Single) shows that this alone resulted in a loss of 6.6k (idle) node hours. In total, the energy usage, as reported by Piz Daint's job scheduler, was 10.45 GJ.
To alleviate the effect of load imbalance, we configured Korali to schedule all five experiments simultaneously. The timeline in Figure 7 (bottom) shows that, with the multi-experiment variant, nodes remained busy during most of the run. The results, summarized in Table 1, indicate that this approach yields a superior efficiency (98.9%) compared to the former approach, wasting far fewer node hours (186) and requiring less energy (7.80 GJ). Furthermore, it also reduced the run-to-completion time from 47.32 to 34.78 hours.
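The bookkeeping behind these figures is straightforward: with sampling efficiency e defined as the busy fraction of the consumed node hours, the idle node hours are total × (1 − e). A quick check against the numbers quoted above (rounding explains the small gap to the ~6.6k reported in Table 1):

```python
def idle_node_hours(total_node_hours, efficiency):
    """Node hours lost to idling, given the sampling efficiency e in [0, 1]."""
    return total_node_hours * (1.0 - efficiency)

# Single-experiment runs: 24.6k node hours at e = 72.7% -> roughly 6.7k idle
idle_single = idle_node_hours(24600, 0.727)

# Multi-experiment run: 34.78 h on 512 nodes at e = 98.9% -> roughly 196 idle,
# close to the 186 node hours reported
idle_multi = idle_node_hours(34.78 * 512, 0.989)
```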

Study 3: Coarse-Grained Water Model
In this study, we apply Bayesian inference to the assumed Lennard-Jones potential between particles with two parameters (ε and σ) for a coarse-grained water model where a single water molecule is represented by a solid sphere (see Fig. 8). We reproduced the computational setup of a previous work [32], where the parameters of the model are calibrated to experimental data (density, dielectric constant, surface tension, isothermal compressibility, and shear viscosity) at fixed temperature and pressure. To find the parameters that maximize the posterior distribution, we use CMA-ES as optimization algorithm, with a population size of 16 samples per generation. For the computational model, we used LAMMPS (Large-scale Atomic/Molecular Massively Parallel Simulator) [33], a well-known molecular dynamics simulation library that models atoms or ensembles of particles in solid, liquid, or gaseous state. We ran the experiment using 16 compute nodes of the XC50 partition and two workers per node. Each worker ran an instance of LAMMPS using 2 MPI ranks over 12 OpenMP threads.
Here, we validate the framework's fault tolerance and reproducibility by running the same study twice. For the first run, we allow Korali to complete without interruptions. For the second run, we allow it to run for only 15 minutes at a time before forcing the job scheduler to terminate it. To reach the final results with the interrupted run, we re-schedule its launcher job after each interruption, reloading the internal state of CMA-ES upon restart. Since both runs use the same random seed, we expect to observe the exact same intermediate and final results. Figure 9 shows the comparison of the per-generation evolution of parameter optima between the single-run execution (continuous line) and the interrupted execution (markers) for the two optimization parameters. In the figure, vertical grid lines indicate the generations at which the interruptions occurred for the latter, for a total of 16 restarts. The results show that the optimal parameters and their convergence path were identical for both runs.
State-of-the-Art UQ Frameworks

Here, we analyze in further detail the Π4U and Dakota frameworks, as they offer support for distributed sampling and thus relate most closely to Korali's goal. The Π4U [12] framework is one of the first efforts to provide support for distributed UQ experiments and, to the best of our knowledge, the only one to have reported detailed performance metrics at scale. Π4U employs the TORC tasking library [48] to distribute the execution of samples to workers. Dakota [13] is a well-established C++-based framework for uncertainty quantification that interfaces with simulation software through a multi-level MPI-based parallelism interface. Dakota is used in a wide range of applications for the US Department of Energy [49].
To compare the efficiency of the frameworks, we created a synthetic benchmark that runs a single-variable optimization experiment on a computational model that passively waits for a given number of seconds (T_wait) before returning a random result. By employing a passive wait, we can fix the running time of each sample, ruling out the time variances common to compute-intensive computational models. To drive the sampling, we employ evolutionary optimization algorithms (CMA-ES [50] for Korali and Π4U, and COLIN-EA [51] for Dakota), configured to run 5 generations with 4N samples per generation, where N is the number of workers, and one node per worker. This configuration allows us to conduct weak scaling studies, evaluating the impact of node scaling on efficiency while keeping the workload per worker constant. We configured the average sample times to add up to 40 seconds in total per experiment, which represents the ideal running time. We measure the efficiency (e) of each framework by dividing the ideal time by the actual running time.
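The benchmark's structure can be modeled in a few lines: N workers process 4N samples per generation through a dynamic queue, with a barrier at each generation boundary. With constant 2 s waits the run matches the 40 s ideal exactly (e = 1); drawing waits from U(1, 3) keeps the same mean cost but introduces imbalance, so e drops below 1. This toy model ignores the communication, I/O, and scheduling overheads that the real benchmark measures.

```python
import heapq
import random

def run_time(wait, N, generations=5, seed=0):
    """Total time for `generations` rounds of 4N dynamically scheduled samples."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(generations):
        idle = [0.0] * N                       # per-generation worker free times
        heapq.heapify(idle)
        for _ in range(4 * N):                 # 4N samples, first-idle-worker first
            t = heapq.heappop(idle)
            heapq.heappush(idle, t + wait(rng))
        total += max(idle)                     # generation barrier
    return total

ideal = 40.0
e_balanced = ideal / run_time(lambda r: 2.0, N=64)
e_imbalanced = ideal / run_time(lambda r: r.uniform(1.0, 3.0), N=64)
```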
We test the following 5 variants: Korali, Π4U, Dakota-M (master/worker scheduler), Dakota-PS (static scheduler), and Dakota-PD (dynamic scheduler). We use the synthetic benchmark to run two weak scaling studies, with and without load imbalance. To account for the effect of stochastic waiting times, we ran 10 repetitions of each experiment. In Appendix B, we provide the experimental setup and data necessary to replicate the results.
The first study represents a scenario with no load imbalance (T_wait = 2.0 seconds for all samples). Here, we measure the inherent efficiency of the frameworks in distributing samples to workers without the detrimental effect of imbalance. The gap between the attained and the ideal efficiency therefore illustrates the time spent on communication, I/O operations, and scheduling overhead only. Fig. 10 (left) shows the results of weak scaling when running the 5 variants on 64 to 512 nodes of the Piz Daint supercomputer. All variants provide high efficiencies (86% < e < 97%) at the smaller scales (64 and 128 nodes), with Korali the most efficient by a small margin. At the largest scale (512 nodes), the differences are evident: both Π4U (e = 62%) and Dakota (e = 67%) appear to be especially susceptible to the increasing scheduling costs, compared to Korali (e = 86%).
The second study simulates experiments with a high load imbalance. Here, the waiting time for each sample is drawn from a random variable T_wait ∼ U(1.0, 3.0) seconds. We consider the same ideal time (on average, 40 seconds per experiment) as the basis for the calculation of efficiency. Fig. 10 (right) shows the results of this study. We observe that load imbalance has a detrimental effect on the efficiency of all variants. However, Korali is the least affected of them across all scales. The largest difference between the variants appears when running on 512 nodes, where Π4U and Dakota show a low efficiency (e = 41% and e = 52%, respectively), while Korali sustains a higher performance (e = 78%).

Conclusions and Future Work
We introduced Korali, a unifying framework for the efficient deployment of Bayesian UQ and optimization methods on large-scale supercomputing systems. The software enables a scalable, fault-tolerant, and reproducible execution of experiments from a wide range of problem types and solver methods. The experimental results have shown that Korali is capable of attaining high sampling efficiencies on up to 512 nodes of Piz Daint. Furthermore, we have shown that running multiple experiments simultaneously on a common set of computational resources minimizes load imbalance and achieves almost perfect node usage, even in the presence of high sample runtime variance. We have also shown that the framework can run out-of-the-box libraries, such as LAMMPS, while providing fault tolerance mechanisms. Finally, we demonstrated that Korali attains higher efficiencies than other prominent distributed sampling frameworks.
We are currently integrating support for distributed Reinforcement Learning (RL) methods [53] that target the incremental optimization of a policy for multiple concurrent agents exploring and acting in a virtual environment to collect and maximize rewards. The architecture of these methods closely resembles the workflow provided by Korali, thus making them suitable candidates for integration. We believe that a platform integrating stochastic optimization, UQ, and RL will be of great value for the broad scientific community.

Appendix A. User Interface

Fig. A.11 shows an example application that performs the calibration of a computational model on experimental data. We configured the example to describe a Bayesian inference problem where the uncertainty in the parameters is quantified by sampling the posterior distribution of the parameters conditioned on the data.
To better explain the software interface, we first define the statistical problem and then show its correspondence with the code through user-defined functions. Korali works by defining and running an experiment (Line 10). An experiment consists of the description of a statistical problem to be solved, a description of the involved variables and their distributions, and the configuration of the desired solver. In this application, all lines of code between Line 13 and Line 40 that describe the problem are written entirely via dictionary-tree accesses. The rest of the code consists of importing libraries (Lines 1 and 4), initializing the experiment (Line 10), initializing the engine (Line 43), and running Korali (Line 45). The type of the problem is defined in Line 13, and the likelihood function is defined in Line 14. The observations Y are passed to Korali in Line 16 and the computational model in Line 15. At the top of Fig. A.12, an example of a computational model is given, where f(x_i; ϑ) = ϑ_1 x_i + ϑ_2. Next, the variable vector ϑ is defined by the experiment's variables. Each variable is identified by a unique name and represents one entry of the variable vector. The example code contains three variables, P1, P2, and Sigma (Lines 19 to 21). The variables are passed to the user-defined model F and used to compute the likelihood function, through the keywords "Reference Evaluation" and "Standard Deviation", respectively. To complete the description of the problem, each variable requires the definition of a prior distribution p(ϑ). Here, we specify that the prior distributions of the variables corresponding to the parameters P1 and P2 are normal distributions (Lines 22 and 23), and that of the variable Sigma is a uniform distribution (Line 24). Finally, we set the solver to the TMCMC sampler [54].
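As an illustrative sketch of the statistics behind this configuration (our own code, not the paper's listing or Korali's internals), the Normal likelihood evaluates the linear model f(x_i; ϑ) = ϑ_1 x_i + ϑ_2 against the observations Y, with Sigma playing the role of the noise standard deviation:

```python
import math

def log_likelihood(theta, X, Y):
    """Gaussian log-likelihood of observations Y given the linear model
    f(x; theta) = theta[0]*x + theta[1] with noise std theta[2]."""
    p1, p2, sigma = theta
    ll = 0.0
    for x, y in zip(X, Y):
        f = p1 * x + p2                      # model evaluation at x
        ll += -0.5 * math.log(2 * math.pi * sigma**2) \
              - (y - f)**2 / (2 * sigma**2)  # Normal log-density
    return ll

X = [0.0, 1.0, 2.0]
Y = [1.0, 3.0, 5.0]
# (P1, P2, Sigma) = (2, 1, 1) fits the data exactly, so it scores a
# higher log-likelihood than a mismatched parameter set.
exact = log_likelihood((2.0, 1.0, 1.0), X, Y)
off = log_likelihood((0.0, 0.0, 1.0), X, Y)
```

A sampler such as TMCMC explores exactly this quantity, combined with the log-prior, to draw samples from the posterior p(ϑ | Y).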

Appendix A.1. Computational Model Support
The user specifies the computational model by passing a function as part of the problem configuration. Such a function should expect the sample's information as its argument. In the example in Fig. A.11, the computational model is passed as a lambda function that calls the computational model F, imported from the myLibrary module.
Functions passed as computational models do not return a value. Instead, they save their results into the sample container. The results expected from the execution of the computational model depend on the selected problem type.
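A minimal sketch of this pattern is shown below. The key names follow the keywords quoted earlier ("Reference Evaluation", "Standard Deviation"), but their exact spelling in Korali's API may differ, the "Parameters" key is an assumption, and a plain dict stands in for the sample container that the framework passes to the function:

```python
def F(sample, X):
    """Computational model f(x; theta) = p1*x + p2: writes its results
    into the sample container instead of returning them."""
    p1, p2, sigma = sample["Parameters"]
    sample["Reference Evaluation"] = [p1 * x + p2 for x in X]
    sample["Standard Deviation"] = [sigma] * len(X)

# Usage sketch: a dict stands in for the sample container.
sample = {"Parameters": [2.0, 1.0, 0.5]}
F(sample, [0.0, 1.0, 2.0])
# sample now holds one model evaluation per input point plus the
# per-point noise level used by the likelihood.
```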

Figure 4: Pipe shape parametrized by width and radial offset along five equiangular directions. Offsets are relative to the baseline shape (dashed lines). Circles show the initial positions of bubbles, and the arrows indicate the flow direction.

Figure 5: Visualization of the baseline pipe shape and the results for the three optimization cases. The vorticity field is shown in blue, for clockwise, and; red, for counterclockwise. The snapshots are taken at time tV/w = 14.

Figure 8: The Lennard-Jones potential is a function of the distance (r) between two water molecules, represented as solid spheres. It is repulsive for distances smaller than r_min and attractive for larger distances. The potential is parameterized by ε, which controls the depth of the well (V_min), and by σ, which controls the point where the potential changes from repulsive to attractive.

Figure 9: Evolution of the parameter values that maximize the posterior distribution of the Bayesian problem, showing: in solid lines, the run in which CMA-ES was uninterrupted, and; in markers, the run that was interrupted every 15 minutes. Vertical grid lines indicate the generations at which the interrupted run was restarted.

Figure 10: Weak scaling studies comparing the sampling efficiency (e) of Korali, Π4U, and Dakota (three variants) for a synthetic benchmark (left) without (T_wait = 2.0), and (right) with load imbalance (T_wait ∼ U(1.0, 3.0)) on 64, 128, 256, and 512 nodes. The colored bars highlight the median efficiencies, and the black intervals indicate the maximum and minimum.

Figure A.11: The vectors x and y correspond to the variables X and Y in the code, which are initialized in Lines 6 and 7.

Figure A.12: Examples of computational models. (Top): A model that has two parameters P1 and P2 and produces as its result a vector of evaluations, one for each value of the input vector X. (Middle): A model that requires a single variable X and produces a single function evaluation f(x) = −x², to be maximized using a derivative-free method. (Bottom): A model that executes an external application and returns its output as the result.
Figure A.12 (Top) shows the function F, as specified in the example in Fig. A.11.
Figure 3: Korali's source pre-processor takes the module's .cpp and .hpp files as inputs and appends to them automatically-generated class attributes and serialization routines, based on its .config file. The output source is compiled into an object file and linked to the framework.
Figure 2: Dataflow of the framework's sampling engine running two experiments, e0 and e1, concurrently. Samples from both experiments are passed through the distribution conduit, which assigns them to any available idle worker. Here, workers represent teams of m MPI ranks, which collaborate to compute an assigned sample. Workers evaluate the model corresponding to the sample's experiment (f, for e0, or; g, for e1). Upon receiving results, the distribution conduit assigns them to the corresponding experiment.

Table 1: Performance comparison between the two scheduling strategies employed in the RBC stretching experiment: Single, with individually scheduled experiments, and; Multiple, with all experiments scheduled simultaneously. Here, e represents sampling efficiency. The energy usage measurements were obtained from Piz Daint's job scheduler.
Core usage timelines of Korali during sampling of the experimental datasets, starting from Henon (darkest shade), followed by Hochmuth01, Hochmuth02, Hochmuth03, and ending on the right with Hochmuth04 (lightest shade). The figure shows two timelines: on top, with sequentially scheduled Bayesian Annealed Sequential Importance Sampling (BASIS) experiments, and; on bottom, multiple experiments scheduled simultaneously. The horizontal axis represents the elapsed time (minutes) from the start of the experiment. On the vertical axis, each line represents a different node. Solid lines represent the execution of the model. Blank spaces represent times where a node is idle. The black line indicates the cumulative sampling efficiency (e) across time. A higher efficiency reflects better node usage.