hopsy — a methods marketplace for convex polytope sampling in Python

Abstract Summary Effective collaboration between developers of Bayesian inference methods and users is key to advance our quantitative understanding of biosystems. We here present hopsy, a versatile open-source platform designed to provide convenient access to powerful Markov chain Monte Carlo sampling algorithms tailored to models defined on convex polytopes (CP). Based on the high-performance C++ sampling library HOPS, hopsy inherits its strengths and extends its functionalities with the accessibility of the Python programming language. A versatile plugin-mechanism enables seamless integration with domain-specific models, providing method developers with a framework for testing, benchmarking, and distributing CP samplers to approach real-world inference tasks. We showcase hopsy by solving common and newly composed domain-specific sampling problems, highlighting important design choices. By likening hopsy to a marketplace, we emphasize its role in bringing together users and developers, where users get access to state-of-the-art methods, and developers contribute their own innovative solutions for challenging domain-specific inference problems. Availability and implementation Sources, documentation and a continuously updated list of sampling algorithms are available at https://jugit.fz-juelich.de/IBG-1/ModSim/hopsy, with Linux, Windows and MacOS binaries at https://pypi.org/project/hopsy/.


S.2 Constrained vs. Unconstrained Sampling
We emphasize the need for specialized convex polytope (CP) samplers by comparing the performance of unconstrained off-the-shelf samplers implemented in the popular Bayesian inference library PyMC against some of those provided in hopsy.For this, the density given by the negative exponential of the d-dimensional Rosenbrock function serves as test case [5]: with s = 100 and a = 0.For the constrained test-case, we limit θ to the unit cube [0, 1] d .In both cases, i.e., constrained and unconstrained, we test for d ∈ {2, 4, . . ., 16, 18}. Figure S.1 presents the performance comparison of the commonly used No-U-Turn sampler (NUTS) [9] with Gaussian proposal as implemented in PyMC [1], against the performance achieved with the Gaussian Hit-and-Run algorithm, designed for CP sampling, and the unconstrained Gaussian proposal implemented in hopsy.For both, we drew 10, 000 • d 2 samples and ran 4 parallel Markov chains.The performance is measured in terms of relative effective sample size (ESS) per sample (n eff /n), ESS per second (n eff /s), and rank normalized split R [16].Moreover, we report the achieved acceptance rate α.
For the unconstrained problem, we observe in Figure S.1 (left column) that our implementation of the unconstrained Gaussian proposal does not yield satisfying convergence ( R > 2 for 8 dimensions and higher), most likely due to a lack of step size tuning.PyMC, in contrast, performs some hyperparameter tuning, yielding better convergence for both variants tested.
For the constrained problem, we observe in Figure S.1 (right column) that the Gaussian sampler implemented in PyMC achieves the highest relative ESS per sample when the dimensionality is 8 or higher, but is superseded in terms of ESS per second by hopsy's constrained Gaussian Hit-and-Run sampler in all tested dimensions.Importantly, with increasing dimension, the acceptance rates of PyMC's Gaussian sampler progressively decline, which suggests that this sampler eventually will get stuck as the problem dimension further increases.Interestingly, the No-U-Turn sampler, while working well for unconstrained sampling, turns out to be a bad choice for constrained sampling, indicated by its high R statistics as well as very low ESS values.This is most likely due to the discontinuity along the CP borders.
Concluding, our numerical experiments with simple (orthogonal) constraints demonstrate the need for specialized constrained samplers, even in the case of relatively low problem dimensions.

S.3 Uniform Convex Polytope Sampling
As an example for deriving an explicit CP formulation from an implicit one, we consider the case of Metabolic Flux Analysis (MFA).MFA metabolic networks consist of a set of biochemical reactions.The evolution of intracellular metabolite concentrations X over time is described by a system of ordinary differential equations (ODE).These ODEs are derived from mass balancing of each reaction and parameterized by the metabolic reaction rates ν, which are also referred to as fluxes.The stoichiometric coefficients of the metabolites in each reaction are collected in the stoichiometric matrix S. Consequently, the ODE system is given as At metabolic steady state, the intracellular metabolic concentrations are considered constant.In this case, the dynamic system becomes an underdetermined linear equation system for the unknown fluxes ν: The null space of S is parameterized by a subset of free fluxes θ, which are confined to a CP Any θ ∈ P is a feasible solution for Equation (3).Given no additional information, all solutions of Equation ( 3) are equally valid.In this case, from a probabilistic perspective, the feasible solutions are encoded by the uniform distribution over P. In the case of Bayesian inference, the uniform distribution over P is a proper prior, if and only if the CP is bounded, which is generally ensured in practice.Within the specification of a hopsy.Problem, this corresponds to choosing a constant density f (θ) = const.Notice that this case is implicitly assumed, if no hopsy.Model is passed to the hopsy.Problem.Here, the metabolite M is considered "intracellular" and therefore mass-balanced, with a constant concentration M , whereas the pools S, P, Q, and R are considered extracellular metabolites with time-dependent concentrations S, P, Q, R.

S.3.1 A Toy Example
A toy example network is depicted in Figure S.2.Balancing the in-and outgoing fluxes of the internal metabolite M gives the stoichiometric matrix S u x y z Normalizing the intracellular fluxes by the uptake rate u and assuming the fluxes x, y, z to go only in the nominal direction (arrow), we get 1 = x + ŷ + ẑ and 0 ≤ x, ŷ, ẑ ≤ 1 (6) as the solution of Equation (3) with x = x/u, ŷ = y/u, ẑ = z/u.Selecting x and ŷ as free fluxes, the space of feasible solutions is given by the inequality which results in the CP P being a two-dimensional simplex

S.3.2 Numerical Experiment
We used the problem of feasible flux space sampling to highlight hopsy's capabilities to rapidly prototype and implement MCMC algorithms in Python.In particular, we implemented a updated version of the little-considered Over-relaxed Hit-&-Run algorithm by [4] and a very recent iterative MCMC algorithm [3]: • Over-relaxed Hit-&-Run with Rounding (ORHR), • Multiphase Monte-Carlo algorithm using a Billiard walk proposal (MMBW) We benchmarked both algorithms alongside hopsy's native Coordinate Hit-&-Run with Rounding and Thinning algorithm (CHRRT) by uniformly sample three CP problems: • a 16-dimensional Birkhoff polytope (BP5) • a 18-dimensional metabolic flux model of E. coli from [20] (zamboni) • a 24-dimensional constraint-based E. coli model from [13] (e coli core) See https://gitlab-public.fz-juelich.de/IBG-1/ModSim/Fluxomics/hopsy-publication/-/blob/main/Eval-use-case-1.ipynbfor definitions of the CPs and further details regarding this experiment.All calculations were run on a system equipped with AMD 5950x CPU and running Ubuntu 22.04.For every of the three sampling algorithms, we ran four parallel chains, parallelized using Python's multiprocessing library.For the comparison of the three algorithms, each chain was run until an effective sample size (ESS) [6] of at least 1,000 was achieved.According to [10], the selection of the thinning factor impacts the algorithm performance considerably when rounding is involved, primarily due to the de-rounding step that maps the set of thinned samples back to the original parameter space.Here, following the guideline derived in [10], the thinning factor was set to problem dimension 2 ×1./6 for CHRRT and ORHR.Because the MMBW algorithm rounding works differently, thinning was not applied in this case.

S.3.3 Recon3D Benchmark
We used the CHRR algorithm [7] and the preprocessed Recon3D model files from [10] to benchmark hopsy for sampling large genome-scale models uniformly on a desktop PC (Intel i7-6700) and on the JURECA supercomputer at Jülich Supercomputing Centre (JSC).On the desktop PC, we used Python multiprocessing to run four independent chains in parallel on the four cores of the processor.On the supercomputer, we used MPI to parallelize the sampling on two so-called standard compute nodes, each equipped with 2 AMD EPYC 7742 processors, giving us 256 independent and parallel chains.

S.4 Bayesian 13 C-Metabolic Flux Analysis
Uniform flux space sampling, as described in Sec.S.3, represents the constraining impact of the model formulation on the fluxes, without considering measurements.In Bayesian 13 C-Metabolic Flux Analysis (MFA), two data types enter the inference problem.One data type is provided by the so-called extracellular rates.These rates represent the biomass-specific substrate uptake and product excretion rates, as well as the growth rate of the organism under study.The extracellular rates are extracted from exometabolomic concentration data, either by metabolite-wise linear regression [12] or by consistent inference using bioprocess models [8] (see also Sec.S.5).
While the knowledge of the extracellular rates is helpful, they are insufficient for determining the intracellular fluxes in parallel and cyclic metabolic pathways [19].For this, the second data type is 13 C labeling data of intracellular metabolites that are acquired by administering carbon-based substrates in isotope labeling experiments.Specifically, 13 C-labeled substances are taken up by the cells, the label is propagated through the metabolic pathways according to the underlying fluxes, and chemical analytics (mass spectrometry or NMR) allows for the quantification of fractional enrichments of internal metabolites.On the other hand, the metabolic model is complemented with a carbon atom transition network, describing the fate of the individual carbon atoms through the metabolic network.This model is used to simulate the measurable isotope labeling states.Together, this allows estimating the fluxes by means of a computational procedure.Bayesian 13 C-MFA uses Bayes theorem to produce flux posterior probability distributions, see [15] for details.
The measurement errors on extracellular rate and labeling data are both modelled to be additive Gaussian.Formally, let Y 13C be the data vector of extracellular rate and isotope enrichment measurements, then where g(θ) is the vector of computed extracellular rates and simulated isotope labeling, given known fluxes θ, and Σ 13C is the associated diagonal measurement covariance matrix.Note, that the definition space of θ is restricted to a CP as detailed in Sec.S.3.Using the uniform distribution on the flux CP as prior, the log-density function for Bayesian 13 C-MFA (up to a constant) then is Since simulation of isotope data by evaluating g(θ) is computationally expensive, we rely on the highperformance simulator 13CFLUX2 [18] for fast log-density evaluation.

S.4.1 Numerical Experiment
Using the zamboni model, we simulated isotope labeling data for an input tracer of [1,2]-13 C-labeled glucose.
We used an Intel(R) Xeon(R) CPU E5-2683 v4 @ 2.10GHz running Ubuntu 22.04 to simulate 64 chains in parallel twice.In the first (non-PT) run, all chains were independent.In the second (PT) run, we let the chains communicate by binding the chains together in sets of 16.In both experiments, we sampled 1e6 samples with a thinning of 20 using the Hit-and-Run with Rounding algorithm and discarded the first half of the chains as burn-in.

S.4.1.1 Results
Both MCMC runs, PT and non-PT, converged, as indicated by the potential scale reduction factor R [6], and the trace plots in Figure S.6.For the independent chains, the R was below 1.02 after 10.8 h, indicating convergence.The PT run achieved an R below 1.002 in 12.8 h, therefore mixed better for the same number of samples drawn with around 1.2 times longer run-time.Finally, we show how sampling results contribute to experimental design decisions.As an example, Figure S.8 shows the isotope labeling data predicted for [1,2]-13 C-labeled glucose as tracer, by re-using the results generated by uniform sampling before (cf.Sec.S.3).In the same way, arbitrary tracer mixture can be tested, predicting likely isotope labeling patterns to emerge, being a valuable information source for analytical experts [14].

S.5 Composite Bioprocess and Metabolic Flux Modeling
In 13 C-MFA, intracellular fluxes are traditionally determined sequentially: first, extracellular rates are estimated using linear regression or a bioprocess model, then, the inferred rates along with isotope labeling data are used for estimating the intracellular flux using a 13 C-MFA model (cf.Section S.4).This sequential process, however, neglects the correlation between those parameters that are shared between the bioprocess and the 13 C-MFA models, i.e. the extracellular rates.In turn, the neglection might lead to an information bias.To improve the reliability of the parameter estimates, we here, for the first time, simultaneously inferred bioprocess and 13 C-MFA parameters, exploiting hopsy's trampoline mechanism.Finally, we compared the inferences using the composite model with those of the sequential approach.
A bioprocess model is a kinetic model that balances the dynamic substance concentrations X in a bioreactor, starting from initial concentrations X 0 : where h is a nonlinear function of kinetic parameters θ bp , which describes the time-dependent "mechanistics" of biomass and extracellular compound concentrations in a cultivation, such as exponential biomass growth or Monod substrate uptake kinetics.Given a time-series of extracellular biomass and concentration measurements Y bp with additive Gaussian errors, we have with X(θ bp ) being the solution of Equation ( 11) for given kinetic parameters θ bp and the (diagonal) measurement covariance matrix Σ bp .
In contrast to bioprocess models that balance time-dependent extracellular concentrations within a bioreactor, 13 C-MFA models balance the metabolite concentrations within a cell that is in a metabolic (pseudo)steady state (cf.Section S.4).The precondition to connect bioprocess and 13 C-MFA models is, thus, that in the time interval in which isotope labeling is administered, the rate of change in extracellular concentrations h(X, θ bp ) is approximately constant, suggesting that the cells' uptake, secretion, and growth rates remain reasonably constant.We check this criterion by computing the maximal deviation from the mean rates during the labeling interval.If the maximal deviation does not exceed a tolerable threshold, we expect that the bioprocess shows constant growth, and thus, the cells are in a metabolic steady state.
If this precondition is fulfilled, 13 C-MFA (Equation ( 9)) and bioprocess (Equation ( 12)) measurements are combined to the joint measurement vector Y and the joint measurement covariance matrix, resulting in the composite statistical model With the joint parameter vector θ now constituted by the parameters of the bioprocess and 13 C-MFA models, this gives the composite posterior log density where g(θ 13C ) is the forward simulation of rate and labeling measurements of an isotope labeling experiment with the 13 C-MFA model, as described in Section S. 4. Note, that the composite model introduces an implicit dependency between the parameters of the 13 C-MFA model and the parameters of the bioprocess model, namely θ 13C = θ 13C (θ bp ).That is, extracellular fluxes are computed from the bioprocess parameters, whereas intracellular fluxes remain free parameters that are to be inferred directly.We make this more explicit for actual models in Equation (18).Here, θ 13C has to satisfy the linear constraints introduced in Section S.3.Since the dependency is nonlinear in θ bp , the condition θ 13C ∈ P has to be tested for every new parameter estimate θ bp .

S.5.1 Numerical Experiment
We combined the previously used zamboni model of E. coli with a chemostat bioprocess model, which is given by the following system of ordinary differential equations (adapted from [2]), where S is the substrate (glucose), X the biomass, and P (acetate) as well as Q (CO 2 ) the extracellular product concentrations.Hence, the parameters θ bp of the bioprocess model are the growth rate µ [1/h], yield coefficients Y XS [gCDW/mmol], Y P S [gCDW/mmol], and Y QS [gCDW/mmol], as well as product formation rates q p [mmol/gCDW/h] and q q [mmol/gCDW/h]: On the other hand, the 13 C-MFA model is parameterized with the same free fluxes as selected before in Section S.
To combine the bioprocess and the 13 C-MFA models, we associate extracellular substrate decline and product formation rates with the specific uptake rate, upt.n, and the two specific secretion net fluxes, acOut.nand coOut.n, of the 13 C-MFA model.That is, we map yielding the joint parameter vector θ = ( µ, Y CS , Y P S , Y QS , q q , q p , emp1.n, emp6.n,tca1.n,ana1.n,BM oaa6 aux.n, emp1.x,emp5.x,ppp2.x,ppp3.x,ppp4.x,ppp5.x,ppp6.x,BM pga1.x,BM oaa1.x,BM oaa3.x) As the rates of the chemostat model are constant with respect to the concentration changes, constancy in growth conditions is ascertained, rendering the test in (14) naturally fulfilled.
Labeling measurements and extracellular rate measurements for the 13 C-MFA model were simulated as described before in Section S.4.Bioprocess concentrations were simulated with parameters θ bp that reproduces the previous 13 C-MFA parameters using Equation (18).The simulated values were then perturbed with normally distributed ϵ ∼ N (0, σ), where σ was set to 0.1 for X, P and Q, 0.02 for S, and the 13 C-MFA measurements as described in Section S.4.All bioprocess parameters were bounded within a range of [0, 20]. 13C-MFA model bounds were as before in Section S.3 and Section S.4.The bioprocess model in Equation ( 11) is solved by a SciPy [17] integrator, the zamboni model is simulated with 13CFLUX2.Glue code to implement the composite model was written in Python and can be found at https://jugit.fz-juelich.de/IBG-1/ModSim/Fluxomics/hopsy-publication/-/blob/main/src/bioprocciso_pt.py?ref_type=heads.
We used an Intel(R) Xeon(R) Gold 6130 CPU @ 2.10GHz running Ubuntu 22.04 to simulate 64 chains in parallel.Because parallel tempering performed well for inference of the 13 C-MFA model, we also use parallel tempering and let chains communicate in sets of 16, obtaining four replicates.We generated 1e6 samples with a thinning of 20 using the Hit-and-Run algorithm and discarded the first half of the chains as burn-in.
The run converged with a potential scale reduction factor ( R) below 1.08 in around 55 hours.Since our case study was focused on rapid prototyping, less emphasis was laid on run time performance.Clearly, the run time can be improved by optimizing the simulation code.

S.5.1.1 Results
In Figure S.9,marginal distributions of fluxes obtained from sampling Equation ( 14) are shown.Note, that the fluxes upt.n, acOut.nand coOut.nare functions of the sampled bioprocess parameters θ bp (cf.Equation ( 18)).Thus, for these fluxes we depict the marginal forward distributions obtained from applying the map Equation (18) to the posterior samples of the bioprocess parameters.Notably, we observe slightly improved identifiability for some fluxes, in particular the rates coOut.nand upt.n, but also the intracellular fluxes tca1.n,emp6.n and emp1.n.These results suggest that combining models as proposed here can improve flux estimations.

Figure S. 1 :
Figure S.1: Performance comparison of unconstrained off-the-shelf samplers implemented in PyMC against those implemented in hopsy.

Figure S. 2 :
Figure S.2:A simple metabolic network.Upper-case letters denote metabolites and lower-case letters are metabolic reactions.Nominal reaction directions are indicated by the arrowhead.Here, the metabolite M is considered "intracellular" and therefore mass-balanced, with a constant concentration M , whereas the pools S, P, Q, and R are considered extracellular metabolites with time-dependent concentrations S, P, Q, R.

Figure S. 3 Figure S. 3 :
Figure S.3 shows the relative ESS per sample (ESS/n) and per unit of time (in seconds, ESS/s) as measures of efficiency of three sampling algorithms used.Considering both measures, the Multiphase Billiard algorithm MMBW has the highest performance for BP5 and e coli core.However, for the zamboni model, the CHRRT algorithm, which is built-in in hopsy, is most efficient.The marginal flux distributions of the three uniform sampling problems are shown in Figure S.4.In these plots the influence of the geometry of the CPs, on the marginal distributions becomes evident.Both e coli core and zamboni are core models of E. coli, yet different algorithms perform best, showing that the performance of MCMC algorithms is problem specific.This motivates the development and inclusion of new algorithms in hopsy.Indeed, in view of the results in Figure S.3, the MMBW algorithm was natively included in hopsy version 1.4.0.

(a) 16 -
Figure S.4: Marginal distributions of random variables uniformly distributed on CP test problems.In (c) the full flux space, consisting of 95 fluxes, is shown.8

Figure S. 5 :
Figure S.5: Recon3D benchmark.Left: Run times for producing 40, 000 samples per chain in the preprocessed space with a thinning constant of 10 * dim(Recon3D) = 10 * 4, 861.The back-transform to the full flux space was done by a one shot transformation, taking up only a fraction of the time (28 seconds for the desktop PC samples, and 235 seconds for the JURECA samples).In total, the desktop PC produced 40, 000 * 4 = 360, 000 samples in 16 hours and the supercomputer produced 40, 000 * 256 = 10, 240, 000 samples in 11 hours.Middle: ESS obtained on the desktop PC and JURECA.A higher ESS is better.Because the chains are independent, the ESS scales with the number of chains.However, ESS estimates are only reliable for converged chains, which we checked by computing the rank-normalized split potential scale reduction factor R[16]. Right: R values are below 1.1 for the desktop PC run and JURECA, indicating acceptable mixing.

Figure S. 6 :
Figure S.6: Trace plots of MCMC simulations without (blue) and with parallel tempering (orange).The horizontal red line represents the reference fluxes.High confidence in MCMC convergence is visible as a 'lack of structure" in the trace plots, i.e. little auto-correlation of the fluxes.For some fluxes, such as ppp2.x, the reduction in auto-correlation from using parallel tempering is clearly visible.For each flux, the rank normalized potential split scale reduction factors R were computed using Arviz[11].

Figure S. 7 :Figure S. 8 :
Figure S.7: Comparison of marginal distributions for the independent fluxes of the zamboni model.Uniform (prior) marginal distributions (blue) and posterior marginal distributions (orange).The reference fluxes are indicated by a vertical dashed line.The comparison emphasizes the information gain in major fluxes (primarily net fluxes), obtained from taking 13 C-MFA data into account.

Figure S. 9 :
Figure S.9: Comparison of posterior marginal distributions of fluxes of the zamboni 13 C-MFA model obtained using the sequential approach, described in Section S.4, versus marginal distributions obtained from the composite modeling approach (cf.Section S.5).The black solid line indicates the true value used to generate the simulated data, the black dashed line indicates the maximum a posteriori (MAP) estimate of the composite model.The comparison emphasizes additional information gain in major fluxes (primarily net fluxes), obtained from taking the coupling of bioprocess and 13 C-MFA modeling into account.