Complete parameter inference for GW150914 using deep learning

The LIGO and Virgo gravitational-wave observatories have detected many exciting events over the past five years. As the rate of detections grows with detector sensitivity, this poses a growing computational challenge for data analysis. With this in mind, in this work we apply deep learning techniques to perform fast likelihood-free Bayesian inference for gravitational waves. We train a neural-network conditional density estimator to model posterior probability distributions over the full 15-dimensional space of binary black hole system parameters, given detector strain data from multiple detectors. We use the method of normalizing flows---specifically, a neural spline normalizing flow---which allows for rapid sampling and density estimation. Training the network is likelihood-free, requiring samples from the data generative process, but no likelihood evaluations. Through training, the network learns a global set of posteriors: it can generate thousands of independent posterior samples per second for any strain data consistent with the prior and detector noise characteristics used for training. By training with the detector noise power spectral density estimated at the time of GW150914, and conditioning on the event strain data, we use the neural network to generate accurate posterior samples consistent with analyses using conventional sampling techniques.

Introduction.-Since the first detection in September 2015 [1], the LIGO/Virgo Collaboration has published observations of gravitational waves from more than a dozen compact binary coalescences [2][3][4][5], primarily binary black hole mergers, but also two binary neutron star mergers. In addition, the LIGO/Virgo Collaboration has publicly released around fifty additional triggers [6] of events of interest, the details of which have so far not been published. These observations have had a transformative impact on our understanding of compact objects in the Universe, facilitated by inferring the parameters of the system using accurate physical models of the emitted gravitational waves. This inference is extremely computationally expensive, as posterior distributions of the parameters are usually obtained using more than one waveform model to probe any potential systematic effects, and using the physically most complete (and thus computationally most costly) waveforms available. LIGO/Virgo currently employ Markov Chain Monte Carlo and nested-sampling algorithms to obtain posterior samples [7,8]. Run times for single posterior calculations typically take days for binary black hole systems and weeks for binary neutron stars [2,9]. These long run times will become increasingly problematic as the sensitivity of the instruments improves and event rates reach one per day or higher [10].
There is an urgent need for new approaches that can generate scientific inferences much more rapidly than the existing pipelines [7,8]. Deep-learning methods are one promising approach to increase the speed of gravitational wave inference by several orders of magnitude, that has been receiving increasing focus in recent years [11]. These techniques attempt to train a neural-network conditional density estimator q(θ|s) to approximate the Bayesian posterior distribution p(θ|s) of parameter values θ given detector strain data s. Neural networks typically have millions of parameters, which are optimized stochastically during training to minimize an appropriate loss function. With a "likelihood-free" training algorithm, it is never necessary to draw samples from the posterior or evaluate a likelihood, rather the procedure is generative and just requires an ability to simulate data sets. Consequently, training can be done in a time comparable to that taken to obtain posterior samples using a standard method. There have been several previous works on this topic, but these either simplified the description of the posterior, e.g., by using a Gaussian approximation [12], or simplified the input, e.g., using a reduced space of parameters and a single detector [13].
In a previous paper [14], we used a type of neural network known as a conditional variational autoencoder (CVAE) [15,16] combined with normalizing flows [17][18][19][20] to learn the posterior distribution for inference of all parameters of an aligned-spin quasi-circular merger observed with a single gravitational-wave detector. With a single detector we could not recover the full set of waveform parameters, and all data sets analyzed were artificially generated with Advanced LIGO design-sensitivity noise [21]. In this Letter, we extend that work into a tool that can, for the first time, be used to analyze real data from the LIGO/Virgo interferometers. We describe a neural network architecture, based on normalizing flows alone, that is able to generate posteriors on the full D = 15 dimensional parameter space of quasi-circular binary inspirals, using input data from multiple gravitational-wave detectors. We apply this network to analyze observed interferometer data surrounding the first observed gravitationalwave event, GW150914, and show that we can successfully recover posterior distributions consistent with conventional methods. This is the first demonstration that these methods can be used in a realistic setting to produce fast and accurate scientific inference on real data. This Letter thus establishes a new benchmark in fast-and-accurate gravitational wave inference, as well as describing methods that could also be applied to other inference problems in experimental physics.
Neural network model.-Our aim is to train a neural conditional density estimator q(θ|s) to approximate the gravitational-wave posterior p(θ|s). To this end, q(θ|s) must have sufficient flexibility to capture the detailed shape of the true posterior over parameters θ, as well as the dependence on the complicated strain data s. We use the method of normalizing flows.
A normalizing flow f is an invertible mapping on a sample space with simple Jacobian determinant [17]. For a conditional distribution, the flow must depend on s, so we denote it f s . The idea is to train the flow so that it maps a simple "base" distribution π(u) into the far more complex q(θ|s). We define the conditional distribution in terms of the flow by which is based on the change of variables rule for probability distributions. The base distribution π(u) should be chosen such that it can be easily sampled and its density evaluated; we will always take it to be standard multivariate normal of the same dimension D as the sample space.
By the properties of a normalizing flow, q(θ|s) inherits the nice properties of the base distribution. Indeed, to draw a sample, one first samples u ∼ π(u), and then sets θ = f s (u); it follows that θ ∼ q(θ|s). To evaluate the conditional density, one uses (1); the right hand side may be evaluated by the defining properties that f s is invertible and has simple Jacobian determinant.
Normalizing flows are under active development in computer science, and are usually represented by neural networks. Neural networks are very flexible trainable function approximators, so they can give rise to complex conditional densities. Our previous work [14] used a Masked Autoregressive Flow [19] with affine transformations. In the present work, we use a much more powerful flow called a neural spline flow [22]. We use directly the original neural spline flow implementation [26], which we illustrate in figure 1. We now give a brief summary.
The flow is a composition of "coupling transforms" c s (u), each of which transform elementwise half of the parameters (say, u d+1:D ) conditional upon the other half (u 1:d ) as well as the strain data s [27], i.e., If c i is invertible and differentiable with respect to u i , then it follows immediately that the coupling transform is a normalizing flow. By composing n flows of these transforms, and permuting the indices of u in between, a much more flexible flow can be obtained.  [23], we inserted batch normalization layers to speed training [24] and Exponential Linear Units for nonlinearity [25]. Each block is also conditioned on the strain data s.
The neural spline coupling transform [22] takes each c i to be a monotonically-increasing piecewise function, defined by a set of knots {(u , between which are interpolated rational-quadratic (RQ) functions. The knots and derivatives are output from a residual neural network [28], which takes as input u 1:d and s; details of the network are given in figure 1. The RQ spline is differentiable and has analytic inverse, so it satisfies the properties of a coupling transform.
Training.-The conditional density estimator q(θ|s) must be trained to approximate as closely as possible the gravitational-wave posterior p(θ|s). We do this by tuning the neural network parameters to minimize a loss function, the expected value (over s) of the cross-entropy between the true and model distributions, On the second line we used Bayes' theorem to express L in a form that involves an integral over the likelihood rather than the posterior; this is a key simplification which means posterior samples are not needed for training. We evaluate the integral (3) on a minibatch of training data with a Monte Carlo approximation, where θ (i) ∼ p(θ), s (i) ∼ p(s|θ (i) ), and N is the number of samples in the minibatch. We then use backpropagation (the chain rule) to compute the gradient with respect to network parameters, and minimize L stochastically on minibatches using the Adam optimizer [29].
To obtain a training pair (θ (i) , s (i) ), we draw θ (i) from the prior, we generate a waveform h(θ (i) ), and then we add a noise realization to obtain s (i) . Waveform generation is too costly to perform in real time during training, so we adopt a hybrid approach: we sample "intrinsic" parameters in advance and save associated waveform polarizations h (i) +,× ; at train time we sample "extrinsic" parameters, project onto detectors, and add noise. We used 10 6 sets of intrinsic parameters, which was sufficient to avoid overfitting.
Prior.-We perform inference over the full D = 15 dimensional set of precessing quasi-circular binary black hole parameters: detector-frame masses m i (i = 1, 2), reference phase φ c , time of coalescence t c, geocent , luminosity distance d L , spin magnitudes a i , spin angles (θ i , φ 12 , φ JL ) [30], inclination angle θ JN , polarization angle ψ, and sky position (α, δ). Of these, we consider (m i , φ c , a i , θ i , φ 12 , φ JL , θ JN ) to be intrinsic, so they are sampled in advance of training. To analyze GW150914, we take a uniform prior over The prior is standard over the remaining quantities. We take t c, geocent = 0 to be the trigger time, and we require m 1 ≥ m 2 . Although a prior uniform in the comoving source frame [8] would be most physical, we adopted a uniform prior over d L and an upper bound of 1000 Mpc to more uniformly cover the parameter space and improve training. We applied the physical prior in postprocessing by reweighting samples. Also to improve training, we rescaled all parameters to have zero mean and unit variance.
Strain data.-For likelihood-free training, we require simulated strain data sets s (i) that arise from the data generative process, s (i) ∼ p(s|θ (i) ). We assume stationary Gaussian noise, so the gravitational-wave likelihood function is known explicitly, but the likelihood-free approach applies even when this is not the case-e.g., in the presence of non-Gaussian noise-as long as it is possible to simulate data.
We generate training waveforms using the IMRPhenomPv2 frequency-domain precessing model [31][32][33]. We take a frequency range of [20,1024] Hz, and a waveform duration of 8 s. We then whiten h (i) +,× using the noise PSD estimated from 1024 s of detector data prior to the event. Following [12], we compress the whitened waveforms to a reduced-order representation; we use a singular value decomposition (SVD), and keep the first n SVD = 100 components.
At train time, we sample extrinsic parameters and generate detector signals. This requires a trivial rescaling to apply d L , and linear combinations of h (i) +,× to project onto the antenna patterns for the two LIGO detectors. To apply time delays in the reduced-basis (RB) representation, we follow the approach of [34,35] of pre-preparing a grid of time-translation matrix operators that act on vectors of RB coefficients, using cubic interpolation for intermediate times. Since the transformation to RB space is a rotation, we add white noise directly to the RB coefficients of the whitened waveforms to obtain s (i) . Finally, data is standardized to have zero mean and unit variance in each component.
Results.-We trained for 500 epochs with a batch size of 512. The initial learning rate was 0.0002, and we used cosine annealing [36] to reduce the learning rate to zero over the course of training. We performed a search over network hyperparameters, and have listed those with best performance (as measured by final validation loss) in figure 1. During training, we reserved 10% of our training set for validation, and found no evidence of overfitting. With an NVIDIA Quadro P4000 GPU, training took ≈ 6 days.
To perform inference on GW150914, we took 8 s of detector data containing the signal and expressed it in the RB representation. We then drew samples from the base space, and applied the normalizing flow conditioned on the strain data to obtain samples from q(θ|s). This produced samples at a rate of 5,000 per second. We benchmarked these against samples produced by bilby [8,37] with the dynesty sampler [38].
Our main result is presented in figure 2, which compares the neural network and bilby posteriors. Both distributions are clearly in very close agreement. There are minor differences in the inclination angle θ JN , where the neural network gives more support to the secondary bilby dynesty neural network mode, and the sky position, where the neural network also develops a small secondary mode. With more training or a larger network, we expect even better convergence.
Although our demonstration has focused on GW150914, the neural network has been trained to generate any posterior consistent with the prior and the given noise PSD. To demonstrate this, performed inference on 100 injec-tions of waveforms drawn from the prior with added noise. Performance is summarized in the P-P plot shown in figure 3. This shows the cumulative distribution function (CDF) of the percentile scores of each injection parameter within the one-dimensional marginalized posteriors. The percentiles should be distributed uniformly between 0 and 1; since the CDF lies close to the diagonal, we conclude that the network is properly sampling the posteriors. This is confirmed by a Kolmogorov-Smirnov (KS) test.
In our experiments, we also varied n SVD , and we found slightly reduced performance as this was increased. This indicates that, although with less compression it should be possible to produce tighter posteriors, better network optimization is required to take full advantage. Indeed, subleading SVD elements contain mostly noise, which makes training more difficult.
Conclusions.-In this Letter, we have demonstrated for the first time that deep neural networks can accurately infer all 15 binary black hole parameters from real gravitational-wave strain data. Once the network is trained, inference is extremely fast, producing 5,000 independent samples per second.
Rapid parameter estimation is critical for multimessenger followup and for confronting the expected high rate of future detections. An advantage of likelihood-free methods is that waveform generation is done in advance of training and inference, rather than at sampling time as for conventional methods. Thus, waveform models that include more physics but may be slower to evaluate [39] can be used to analyze data in the same time as faster models.
The network we presented is tuned to a particular noise PSD-in this case, estimated just prior to GW150914. However, the noise characteristics of the LIGO and Virgo detectors vary from event to event, and ultimately we would like amortize training costs by building a conditional density estimator that can do inference on any event without retraining for each PSD. One approach would be to condition the model on PSD information: during training, waveforms would be whitened with respect to a PSD drawn from a distribution representing the variation in detector noise from event to event, and (a summary of) this PSD information would be passed to the network as additional context. (PSD samples can be obtained from detector data at random times.) For inference, PSD information would then be passed along with the whitened strain data. Similar approaches could also be used to treat non-Gaussian noise artifacts.
In contrast to CVAEs used in past work [13,14], normalizing flows have the advantage of estimating the density directly, without any need to marginalize over latent variables. This means that the loss function can be taken to be the cross-entropy (4) rather than an upper bound [15,16]. Moreover, since q(θ|s) is a normalized probability distribution, the Bayesian evidence can be obtained as a byproduct. The performance we achieved without latent variables in this work was made possible by the use of a more powerful normalizing flow [22] compared to [14]. As new and more powerful normalizing flows are developed by computer scientists in the future, they will be straightforward to deploy to further improve the performance and capabilities of deep learning for gravitational-wave parameter estimation.
We would like to thank C. Simpson for helpful discussions and for performing training runs during the early stages of this work. Our code was implemented in PyTorch [40], and with the neural spline flow implementation of [26]. Plots were produced with matplotlib [41], ChainConsumer [42] and ligo.skymap [43].