PYLFIRE: Python implementation of likelihood-free inference by ratio estimation

Likelihood-free inference for simulator-based models is an emerging methodological branch of statistics which has attracted considerable attention in applications across diverse fields such as population genetics, astronomy and economics. Recently, the power of statistical classifiers has been harnessed in likelihood-free inference to obtain either point estimates or even posterior distributions of model parameters. Here we introduce PYLFIRE, an open-source Python implementation of the inference method LFIRE (likelihood-free inference by ratio estimation) that uses penalised logistic regression. PYLFIRE is made available as part of the general ELFI inference software http://elfi.ai to benefit both the user and developer communities for likelihood-free inference.


Introduction
Computer-simulator-based models are becoming increasingly popular across a wide spectrum of scientific applications as they allow a more flexible framework for encoding expert knowledge than typical statistical model families. A simulator model must nevertheless under most circumstances be carefully tuned to produce realistic outcomes when compared with observed data from some process that the simulator is trying to mimic. Likelihood-free inference deals with the need to estimate parameters and quantify uncertainties in parameter values in the light of the observations x when using simulator models. The most common approaches in likelihood-free inference include approximate Bayesian computation (ABC) 1,2 and synthetic likelihood (SL) 3,4 . In a general statistical setting, Bayesian inference combines a probability model for the observations x with the parameter(s) θ and a prior distribution for the parameters p(θ) to define the posterior distribution over parameters θ as where p(x|θ) denotes the likelihood function and p(x) denotes the marginal likelihood defined as p(x) = ∫ p(θ)p(x|θ )dθ. The present work considers posterior estimation when the likelihood function is not available, but assuming that synthetic data can be be generated from the model given a configuration of its parameters. Posterior distribution or at least some summaries for it must then be obtained with likelihood-free inference methods, such as ABC or SL. The methods typically use summary statistics ψ(x) that are engineered to capture the relevant information about the model parameters present in the real or simulated observations x.
Recent advances in likelihood-free methods include likelihood-free inference by ratio estimation (LFIRE) 5 . LFIRE converts the posterior estimation problem into a density-ratio estimation problem that can be solved with logistic regression 6 . Assume that the dataset { } n x θ includes N observations simulated using particular parameter values θ and that { }, m n x includes observations simulated from the marginal data distribution, which is obtained by averaging the forward simulation process over random parameter values sampled from the prior distribution θ ~ p(θ). Logistic regression is then used in LFIRE to model the log-ratio which approximates the log-ratio between the likelihood and marginal likelihood functions evaluated at θ. Thomas et al. 5 propose modelling the log-ratio as a sparse linear combination of summary statistics ψ i (x) calculated from the observations x: where β denotes the linear model parameters and ψ(x) contains the summary statistics and a constant term. Estimation of the posterior is then based on the idea that we: (1) find the linear model parameters ˆ( ) β θ that minimise the l1-penalised logistic loss function evaluated over observation sets { } n x θ and { }, and (2), use the estimated log-ratio model to approximate the posterior density with To summarise, LFIRE uses lasso logistic regression to approximate the intractable likelihood function and to achieve data-driven selection of summary statistics in likelihood-free inference. Related works include sparse precision matrix estimation in synthetic likelihood approaches 7 and semi-parametric synthetic likelihood 8 . LFIRE implementations are currently available in MATLAB 5 and in ABCpy (Python) 9 and the related synthetic likelihood methods in BSL (R) 10 , while most other likelihood-free inference tools are focussed on ABC and related summary statistic selection methods. General-purpose tools most relevant to the current contribution are reviewed in 11 and available summary statistics selection methods in 12.
The present work introduces a new LFIRE implementation that is compatible with models constructed with the ELFI software 11 . ELFI is a general-purpose likelihood-free inference software that provides tools for modelling inference problems and includes various ABC approaches. Our PYLFIRE implementation introduced here extends ELFI by adding LFIRE to its pool of available inference methods. PYLFIRE implementation is discussed in closer detail in the next section, after which we provide operational instructions and demonstrate the workflow for estimating a posterior distribution with the software. PYLFIRE constructs the simulated datasets with ELFI 0.7.4 11 and estimates the sparse logistic regression model with glmnet 2.1.1 13 . ELFI models inference problems as networks that are used in PYLFIRE to generate observations with random parameters that follow the user-specified prior distribution or with fixed parameter values. Logistic regression parameters are estimated with the glmnet implementation that utilises Fortran subroutines for fast execution. Estimation is based on cyclical coordinate descent to minimise the penalised loss function with respect to regression parameters and cross-validation to optimise the penalisation level. When multiple computation cores are available, PYLFIRE can parallellise dataset construction or cross-validation.

Implementation
Operation PYLFIRE requires Python 3.6.0 (or a later version) and a Fortran compiler. However we also provide a Docker container image to run PYLFIRE with the requirements pre-installed. The package and installation instructions are available in ELFI zoo https://github.com/elfi-dev/zoo/tree/master/pylfire. PYLFIRE is provided with a makefile and installation options as follows. To install PYLFIRE in an environment that has Python 3.6 and a Fortran compiler, run installation:

make install
We then recommend testing the PYLFIRE framework with:

make test
The alternative is to build and run the PYLFIRE Docker image: make docker-build make docker-run This requires that Docker is installed, but avoids possible problems with the Fortran compiler. When PYLFIRE is installed, it is available with: import pylfire Running posterior inference with LFIRE implemented in the PYLFIRE package then includes (1) ELFI model construction and (2) running LFIRE to estimate posterior probabilities at predetermined parameter combinations. ELFI model construction means that the user adds their parameter priors, simulator, and summarisation rules into a network structure. While model construction does not require observed data, observations can be added in the model. The process is demonstrated in ELFI documentation and tutorials: https://elfi.readthedocs.io/.
ELFI model provides PYLFIRE means to generate observations from the marginal and conditional distributions. In addition to the network model, the user must determine the parameter combinations where approximate posterior is evaluated and the dataset size used in logistic regression. The observed data x must also be added in the network model, and the model, parameter combinations, and dataset size are then used to initialise an LFIRE instance that calculates the approximate posterior probabilities. The process is demonstrated in the next section.

Use case
We demonstrate how ELFI and PYLFIRE can be used to estimate the posterior distribution over model parameters in a lag-one autoregressive model with conditional heteroscedasticity (ARCH(1)). The model describes dependencies between observations in a time series as where y (0) = 0, and e (0) and ξ (t) are independent standard normal random variables. The time series used as observed data in the current example is simulated with parameters (θ 1 , θ 2 ) = (0.3, 0.7) and has T = 100 observations. The simulator is available in the PYLFIRE package: from pylfire.models import arch Dependencies between parameters, observations, and summary statistics are described in a predetermined ELFI model that is loaded with: Here the ARCH(1) simulator parameters are associated with prior distributions θ 1 ~ U (-1, 1) and θ 2 ~ U (0, 1). In our codes we denote the parameters θ 1 and θ 2 as t1 and t2, respectively. The parameters are mapped into time series observations with the simulator, and observations are in turn reduced into summaries that include the time series mean, variance, autocorrelations, and pairwise combinations between the autocorrelations, as described in previous work 5 . Figure 1 illustrates the network structure of the ARCH(1) ELFI model. LFIRE can additionally take precomputed marginal data and custom logistic regression parameters as optional inputs, and allows the user to control whether cross-validation or dataset generation is run in parallel. By default LFIRE generates the marginal data when initialised, uses lasso logistic regression, and runs cross-validation in parallel.
After the LFIRE method is initialised, one can run inference and extract results with:

Summary
We have introduced PYLFIRE, an open-source Python package running on ELFI and implementing the ratioestimation-based LFIRE method for likelihood-free inference with automatic summary statistic selection is implemented. PYLFIRE seeks to minimise the computation time in LFIRE with parallelisation and using the external glmnet package 13 where key components are written in Fortran. PYLFIRE uses glmnet to fit lasso logistic regression. For convenience, PYLFIRE provides summarised inference results and two built-in plotting methods for visualising the estimated posterior distribution.

Data availability
Underlying data All data underlying the results are available as part of the article and no additional source data are required.

Summary of the paper
In modern statistical analysis, the adoption of complex probabilistic models frequently results in likelihood functions that are computationally costly to evaluate ( ). A vast collection of methods has been intractable engineered to enable statistical inference in such situations. In particular, Approximate Bayesian Computation (ABC) and Synthetic likelihood (SL) are well established classes of likelihood-free algorithms. As an alternative approach, Thomas (2016) proposed a novel technique, named LFIRE et al.
(likelihood-free inference by ratio estimation), which converts the problem of estimating the posterior distribution as a problem of modeling (by lasso logistic regression fitted over samples) synthetic/pseudo the ratio between the data generating distribution and the marginal distribution. A convenient byproduct of such formulation is the resulting semi-automatic selection of summary statistics implied by the lasso (least absolute shrinkage and selection operator) regularization. This paper introduces PYLFIRE, an open-source Python implementation of LFIRE that provides for practitioners and developers an easy-to-operate toll for parameter estimation using LFIRE. In an introduction section, the paper describes the general problem of likelihood-free inference in the Bayesian framework and then outline how LFIRE approaches such problem. Next, in a methodological section, it elaborates on the implementation and operational aspects of the package, including details about the installation process and the support offered for parallel computing. A simple bi-dimensional autoregressive model with conditional heteroscedasticity (ARCH(1)) model is then fitted with PYLFIRE for illustrational purposes. To conclude, the paper summarizes the package main features and implementation decisions.

Major comments:
The paper is generally well-written and serves its purpose as a helpful reference source for the PYLFIRE package documentation. The package itself also provides a valuable contribution to the initiative of offering efficient open-source tools that allows immediate exploitation of some of the most advanced solutions for the likelihood-free estimation problem.
Albeit the focus of the paper is, understandably, the computational aspects of the package, it 1 2 3 1.

5.
Albeit the focus of the paper is, understandably, the computational aspects of the package, it would be beneficial for the reader if the paper was self-contained, in the sense that all information regarding the practical use of the package was present. The description of the LFIRE method, especially, should be meaningfully extended. In this regard, I recommend the inclusion of a pseudo-algorithm that lists the inputs, outputs and the steps involved in the referred approximate posterior sampling algorithm. Moreover, throughout the introduction, a more explicit distinction between and data should be pursued.

observed synthetic
It would also be of great use to include some guidance on how to specify the parameters of the algorithm (e.g. the number of simulated samples from the marginal distribution and from the data generating distribution) and to discuss when the method is suitable. For instance, models with a moderate number of parameters or where the prior is diffuse (as compared to the posterior) seem to be out of reach.
Considering that another Python routine is currently available, namely, ABCpy, a deeper comparison of the relative merits and deficiencies of each implementation seems necessary. A well-designed numerical simulation would certainly contribute to the robustness of the comparison. Minor comments: The notation used for the sample sizes is a bit confusing and could be improved. Is there a reason to use lowercase (n) and uppercase (N) letters? In addition, the number of simulated samples from the marginal distribution and from the data generating distribution are necessarily the same?
The term "is implemented" should be suppressed from the first paragraph of the Summary section.
It is not sufficiently clear how the observed summary statistics are passed on to the function when the user itself provides the simulated data sets.
I wonder how feasible it would be to embed the provided functionalities in an optimization process that aims to find the MAP (maximum a posterior probability). If possible, a brief discussion on how to accomplish that could extend the reach of the package when estimating the whole posterior distribution is impracticable.
The lasso regression is a reasonable and convenient classifier for estimating the density ratio. However, other classification algorithms could, at least in principle, be employed, either for better accuracy or for computational speed. As such, have the authors considered giving the user control of the classification process, for example, allowing them to pass on a custom python function that returns the estimated classification probabilities (instead of the reported optional "custom logistic regression parameters")? If feasible, this extension would readily benefit from advances on the probabilistic classification area (e.g. whenever a new package for classification becomes available or in cases where the glmnet 2.1.1 package underperforms).

Conclusions
The paper is well-structured, generally clear and represents a valuable addition to the practitioner's toolbox. For the readers convenience, an extra effort should be made to ensure the paper is self-contained. A list of suggestions was included in the comments above to indicate some options that could be explored in this regard.

Is sufficient information provided to allow interpretation of the expected output datasets and any results generated using the tool? Yes
Are the conclusions about the tool and its performance adequately supported by the findings presented in the article? Partly No competing interests were disclosed.

Competing Interests:
Reviewer Expertise: Approximate Bayesian computation; Bayesian Dynamic models.
I confirm that I have read this submission and believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard.

Introduction to the topic
The problem that is considered in this paper is parameter inference for implicit statistical models (i.e. computer-simulator model) in the Bayesian framework. In particular, what the authors aim at is posterior estimation when the likelihood function is intractable, but data-simulation from the model is available, i.e. what is considered is a likelihood-free inference method. The paper presents the PYLFIRE package (here estimation when the likelihood function is intractable, but data-simulation from the model is available, i.e. what is considered is a likelihood-free inference method. The paper presents the PYLFIRE package (here ), which is an open-source Python package that implements the likelihood-free inference by ratio estimation (LFIRE) method [6]. The LFIRE method is a simulation-based inference method for implicit Bayesian models, where the inference problem is transformed into a density-ration estimation problem, which is solved using Lasso logistic regression [6]. Thus, LFIRE has similarities to approximate Bayesian computation (ABC) [5] and in-particular synthetic likelihood (SL) [4]. However, in contrast to SL, LFIRE does not require any distribution assumptions on the summary statistics [6]. Another feature of the LFIRE method is that it allows for automatic selection of summary statistics by employing Lasso logistic regression [6]. This is an interesting feature since selecting proper summary statistics is one of the main challenges when using ABC in practice.

Summary of paper
The article starts by introducing the implicit statistical model (i.e., computer-simulator model); in the introduction, the LFIRE method is also briefly introduced. Then the authors explain how PYLFIRE is structured: PYLFIRE extends the ELFI software [1] and utilizes the ELFI framework, and the glmnet 2.1.1 package [2] is utilized for logistic regression modeling. The authors stress that PYLFIRE is developed with performance in mind; therefore, fast FORTRAN subroutines and parallelization are supported. Instructions on how to install PYLFIRE are provided, and a simple case study (simulation study for an ARCH(1) model) highlights the capabilities of PYLFIRE. The paper ends with a summary of the most important features of PYLFIRE.

Comments
The introduction in the paper is quite useful to the reader since it contains a short general introduction to the LFIRE method. However, this could be extended, which would make it easier for a reader not already familiar with the LFIRE method to follow the paper. An idea could be to try to include in the introduction a similar figure as to Figure 1 in [6].
The authors' reason for developing PYLFIRE could be further processed. Is PYLFIRE developed primarily to extend the capabilities of the ELFI software, to provide an easy-to-use package for LFIRE based inference, or to provide an LFIRE implementation with better computational performance than other implementations?
Another minor comment is that the authors could include a link to the Jupyter notebook ( ) with the here ARCH(1) example in the beginning of the section. This would make it easier for a reader to get Use case started using PYLFIRE.
Python implementation (ABCpy [1]) of the LFIRE method. A computational performance comparison with these implementations (or at least the ABCpy implementation) would be of interest, in order to investigate if PYLFIRE indeed has a better computational performance. A comparison with the other implementations could also investigate how easy the packages are to use, and possibly show the reader the potential advantages of using PYLFIRE.
Furthermore, PYLFIRE utilizes parallelization by default. However, in the paper, the authors do not investigate the potential performance gains from utilizing parallelization. Setting up such a case study is, of course, quite tricky since the results will be highly dependent on the model considered and the particular computer-system used. However, including such a case study could still be of interest since it could highlight: 1) how the user can run the PYLFIRE package in parallel, 2) some idea on what performance gains are plausible.
The only case study provided in the paper is a simulation study for the ARCH(1) model. In particular, since computational performance is of interest, would it be interesting to see how the PYLFIRE performs for a more complicated model with a higher-dimensional parameter space. Such a case study could potentially also highlight the parallel computation capabilities of PYLFIRE.

Conclusion
This is an overall clear and well-written paper that for the most parts is easy to follow. However, some parts of the paper could be extended for clarification, and the paper would be improved by including some further analyses (for details, see the section). The paper fulfils article approval status -Comments approved, but it could be further improved by incorporating the suggestions presented in the Comments section.

Is sufficient information provided to allow interpretation of the expected output datasets and any results generated using the tool? Yes
Are the conclusions about the tool and its performance adequately supported by the findings presented in the article? Partly No competing interests were disclosed.

Competing Interests:
Reviewer Expertise: Likelihood-free inference methods I confirm that I have read this submission and believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard.