Binary Vision: The Mass Distribution of Merging Binary Black Holes via Iterative Density Estimation

Jam Sadiq; Thomas Dent; Mark Gieles

doi:10.3847/1538-4357/ad0ce6

1. Introduction

Ever since the first gravitational-wave (GW) detection revealed a binary black hole (BBH) source with the—previously unsuspected—component masses of around 35 M_⊙ (Abbott et al. 2016a, 2016b), LIGO–Virgo–KAGRA (LVK) observations of compact binaries have continued to yield surprises, of which the binary mass distribution arguably contains the most information bearing on formation environments and channels. In the first three observing runs of Advanced LIGO (Aasi et al. 2015) and Advanced Virgo (Acernese et al. 2015) the better part of 100 detections of mergers of binary compact objects via GW emission were made, as cataloged in the GWTC releases (Abbott et al. 2019b, 2021a, 2021c, 2023a). Having a set of detected events, it is possible to study population properties of these compact binaries and eventually draw implications from these properties for binary astrophysical formation and evolution. Detailed investigations of the population properties of BBH mergers, the most commonly detected source type, were undertaken in Abbott et al. (2021b, 2023b), focusing on several population characteristics including their component masses and spins and possible dependence on redshift.

Among parameters estimated and studied in connection with the population properties of binary compact objects, the component masses are obtained with least uncertainty. Many parameterized and semi- or nonparametric models have been proposed to study the mass dependence of the compact binary merger rate or the mass distribution of the merger population (see Abbott et al. 2023b, and references therein). In parametric models, Bayesian hierarchical techniques are used to infer model hyperparameter posteriors, and thus the population distribution (e.g., Mandel 2010; Thrane & Talbot 2019). On the other hand, nonparametric models are data-driven methods that learn population properties either without requiring any specific functional form or (for semiparametric models) allowing for generalized deviations from a given parametric model (Powell et al. 2019; Edelman et al. 2022, 2023; Rinaldi & Del Pozzo 2022; Tiwari & Fairhurst 2021; Tiwari 2021, 2022; Veske et al. 2021; Callister & Farr 2023; Ray et al. 2023; Toubiana et al. 2023).

In Sadiq et al. (2022) we introduced a fast and flexible adaptive-width kernel density estimation (awKDE) as a nonparametric estimation method for population reconstructions of BBH distribution from observed gravitational-wave data. A limitation of this method arose from the measurement uncertainty in each individual event's parameters. Given the relatively low signal-to-noise ratios of typical detections, the binary component masses have significant uncertainties (e.g., Veitch et al. 2015) that can bias the overall population distribution if not properly accounted for. Specifically, for typical binary components of order 35 M_⊙ and above, detector noise results in ≲10% mass uncertainties (e.g., Abbott et al. 2021c); for lower-mass BBHs, the chirp mass ${ \mathcal M }={({m}_{1}{m}_{2})}^{3/5}{({m}_{1}+{m}_{2})}^{-1/5}$ is measured more precisely, but the mass ratio and component masses are in general more uncertain than for heavier binaries. These uncertainties are quantified by Bayesian parameter estimation (PE) techniques that generate thousands of samples for each detection, representing the posterior probability distribution over m₁, m₂ given an uninformative (flat) prior.

In Sadiq et al. (2022), we incorporated such uncertainties either by using the median PE sample mass as a point estimate or by randomizing over samples. However, these procedures inevitably broaden or overdisperse any rapid variations in the population distribution, yielding a population estimate that is biased toward being too "smooth" or slowly varying. In addition, the use of an uninformative prior introduces a bias in mass measurements as compared to a prior informed by our knowledge of the population distribution (e.g., Abbott et al. 2023c, Appendix E).

In this work we describe a new method to treat measurement uncertainties of source properties and substantially reduce such biases in the estimated population distribution. We propose an iterative scheme for reweighting the PE samples of each observed event based on a current population estimate, similar in spirit to the standard expectation-maximization algorithm (Dempster et al. 1977). We demonstrate that this reweighting significantly reduces biases in the population density estimate, although some bias remains, as is unavoidable for a kernel density estimation (KDE) with a relatively small number of observations from an unknown true distribution.

As an application of this new method, we estimate the full two-dimensional component mass distribution, without assumptions on its functional form aside from the use of Gaussian kernels. While attention has often focused on the primary component mass or on the more precisely measured chirp mass (see among others Dominik et al. 2015; Tiwari & Fairhurst 2021; Tiwari 2022, 2023; Farah et al. 2023a; Edelman et al. 2023; Schneider et al. 2023), less attention has been paid to the full binary distribution, considered via either the secondary m₂ or the mass ratio q ≡ m₂/m₁. We expect these parameters to bear traces of the possible BBH formation channels (Kovetz et al. 2017), in that for dynamical (cluster) formation the two masses may be independent variates, up to a factor modeling probability of binary formation and merger (Fishbach & Holz 2020; Farah et al. 2023b; Antonini et al. 2023) that typically favors near-equal masses (e.g., O'Leary et al. 2016; Rodriguez et al. 2016). Conversely, for isolated binary evolution, some nontrivial though probably highly model-dependent correlation of component masses may arise (e.g., van Son et al. 2022).

Typically, parametric models have assumed power-law m₂ dependence at fixed m₁ (Fishbach & Holz 2017; Kovetz et al. 2017; Talbot & Thrane 2018; Abbott et al. 2019a), recovering mildly positive powers indicating a preference for equal masses. A more detailed study using GWTC-1 events indicated a preference for the two BHs of a given binary to be of comparable mass (Fishbach & Holz 2020). More recent nonparametric or semiparametric studies have relaxed these assumptions, either through allowing the power-law index to vary over chirp mass (Tiwari 2022) or by allowing p(q) to be a free (data-driven) function (Callister & Farr 2023; Edelman et al. 2023), though enforcing the same dependence over all primary masses. Tiwari (2023) introduced a more flexible approach with p(q) modeled by a truncated Gaussian whose parameters depend on chirp mass, finding some significant variation. Ray et al. (2023) measured the full 2D distribution with a binned (piecewise-constant) model over m₁, m₂ (including possible redshift dependence), although they did not consider the q distribution.

Note that the mass ratio distribution presents nontrivial technical issues since (at least for lower-mass BBHs) typical event measurement uncertainties are both large and correlated with the BH orbit-aligned spin components (Cutler & Flanagan 1994; Baird et al. 2013). As the iterative reweighting scheme is designed to address such uncertainties, we expect it to yield a more accurate reconstruction of the full mass distribution than our previous KDE method.

The remainder of the paper is organized as follows: in Section 2 we motivate and explain our method, and demonstrate it using simple one- and two-dimensional mock data. In Section 3 we apply our method to detected BBHs in GWTC-3; we compare the resulting primary mass distribution with our previous studies and extend its application to the full two-dimensional mass plane. In Section 4 we discuss the implications of our results and consider extensions of the method.

2. Method

2.1. Statistical Framework

Our general approach to population inference can be considered as similar to maximum likelihood, with uncertainties quantified via empirical bootstrap methods (Efron 1979). Given a set of observed events, if we neglect measurement uncertainty in each event's parameters, our population estimate is a KDE where the kernel bandwidth for each event is adjusted (Breiman et al. 1977; Abramson 1982; Terrell & Scott 1992; Sain & Scott 1996) using an adaptive scheme to maximize the cross-validated likelihood (Sadiq et al. 2022). The adaptive-bandwidth KDE (Wang & Wang 2011) computes a density estimate $\hat{f}$ from observations X_i, i = 1, ..., n via

$\begin{eqnarray}&&\hat{f}(x)={n}^{-1}\displaystyle \sum _{i=1}^{n}\displaystyle \frac{1}{h{\lambda }_{i}}K\left(\displaystyle \frac{x-{X}_{i}}{h{\lambda }_{i}}\right),\end{eqnarray} \tag{ 1 }$

where K( · ) is the standard Gaussian kernel,

$\begin{eqnarray}&&K(z)=\displaystyle \frac{1}{\sqrt{2\pi }}\exp \left(\displaystyle \frac{-{z}^{2}}{2}\right),\end{eqnarray} \tag{ 2 }$

n is the total number of samples, and h λ_i takes the role of a local bandwidth, with h being the global bandwidth.

The local bandwidths are determined by first computing a pilot estimate ${\hat{f}}_{0}$ setting λ_i = 1 for all i, a standard fixed-bandwidth KDE; based on this pilot density, we then set

$\begin{eqnarray}&&{\lambda }_{i}={\left(\displaystyle \frac{{\hat{f}}_{0}({X}_{i})}{g}\right)}^{-\alpha },\end{eqnarray} \tag{ 3 }$

where α is the sensitivity parameter of the local bandwidth (0 < α ≤ 1) and g is a normalization factor,

$\begin{eqnarray}&&\mathrm{log}g={n}^{-1}\displaystyle \sum _{i=1}^{n}\mathrm{log}{\hat{f}}_{0}({X}_{i}).\end{eqnarray} \tag{ 4 }$

Finally the adaptive KDE $\hat{f}(x)$ is obtained by evaluating Equation (1) with the variable (local) bandwidth h λ_i. The equations are written for the case of one-dimensional data X_i, but the method may be applied in more dimensions by linearly transforming the data to have zero mean and unit covariance along each dimension and using an N-dimensional unit Gaussian kernel scaled by h λ_i.

The KDE hyperparameters, global bandwidth h, and sensitivity parameter α are determined by grid search using maximum likelihood as a figure of merit. A naïve "maximum likelihood" KDE is not well defined, as the likelihood increases indefinitely in the limit of small-bandwidth kernels centered on the observations, i.e., delta functions (Silverman 1986, Section 2.8). We prevent this collapse and address the bias–variance tradeoff via leave-one-out cross-validation (Silverman 1986, Section 3.4.4; see also Hastie et al. 2001). The cross-validated (log) likelihood is

$\begin{eqnarray}&&\mathrm{log}{{ \mathcal L }}_{\mathrm{LOO}}=\displaystyle \sum _{i=1}^{n}\mathrm{log}{\hat{f}}_{\mathrm{LOO},i}({X}_{i}),\end{eqnarray} \tag{ 5 }$

where ${\hat{f}}_{\mathrm{LOO},i}$ is the KDE constructed from all samples except X_i. Being linear in the logarithm of the estimate at observed values, this choice will penalize relative errors. Since we wish to obtain an accurate estimate of densities over a large dynamic range, the log likelihood is more suitable than considering absolute error or squared absolute error.

We then quantify counting uncertainties for the underlying inhomogeneous Poisson process using generalized bootstrap resampling (Chamandy et al. 2012): we take a number of PE samples from each detected event that is a random variable distributed as Poisson(1).

The new aspect of this work concerns the choice of mass sample values to treat as KDE input data X_i. We noted in Sadiq et al. (2022) that for nontrivial uncertainties in component masses this method, in addition to possible intrinsic biases due to the choice of a Gaussian KDE, will be biased toward an overdispersed estimate of the true distribution. This is expected because, with the uninformed priors used in PE, the posterior over mass for any given event will be randomly displaced relative to the (unknown) true mass due to detector noise. Here we motivate and present our strategy for correcting this overdispersion bias.

Our motivation is linked to Bayesian hierarchical population inference (Mandel 2010), where measurement errors are treated by considering the true event properties θ _i for detected events labeled i = 1, ..., N as nuisance parameters. Here, the likelihood of a set of gravitational-wave data segments d_i corresponding to the events is

$\begin{eqnarray}&&{P}_{N}(\{d\}| {\boldsymbol{\lambda }})=\displaystyle \prod _{i=1}^{N}\int P({d}_{i}| {{\boldsymbol{\theta }}}_{i}){p}_{\mathrm{pop}}({{\boldsymbol{\theta }}}_{i}| {\boldsymbol{\lambda }})\,d{{\boldsymbol{\theta }}}_{i},\end{eqnarray} \tag{ 6 }$

where P(d_i∣...) is the likelihood for a single data segment and p_pop( θ _i∣ λ ) is the population distribution over θ _i for population model hyperparameters λ (here, for simplicity, we omit selection effects). Inference is implemented using PE samples that were generated using a standard or fiducial prior π_PE( θ ), often chosen as uniform over parameters of interest (see, e.g., Veitch et al. 2015; Thrane & Talbot 2019). Samples (labeled by k) are distributed as the posterior density p( θ _i∣...) using this prior, hence

$\begin{eqnarray}&&{{\boldsymbol{\theta }}}_{i}^{k}\sim p({{\boldsymbol{\theta }}}_{i}| {d}_{i},{\pi }_{\mathrm{PE}}({\boldsymbol{\theta }}))\propto P({d}_{i}| {{\boldsymbol{\theta }}}_{i}){p}_{\mathrm{pop}}({{\boldsymbol{\theta }}}_{i}| {\boldsymbol{\lambda }})\displaystyle \frac{{\pi }_{\mathrm{PE}}({\boldsymbol{\theta }})}{{p}_{\mathrm{pop}}({{\boldsymbol{\theta }}}_{i}| {\boldsymbol{\lambda }})}.\end{eqnarray} \tag{ 7 }$

Hence, integrals over θ _i, as in Equation (6) may be performed (up to a constant factor) by summing over samples ${{\boldsymbol{\theta }}}_{i}^{k}$ reweighted by the ratio of the population distribution to the PE prior, p_pop( θ _i∣ λ )/π_PE( θ ).

Here, while not making use of this hierarchical likelihood, we remark that such reweighted PE sample sets give an unbiased estimate of event properties if p_pop( θ _i) is equal to the true population distribution; conversely, the unweighted PE samples give a biased estimate if π_PE( θ ) differs from the true population distribution. This observation is the basic motivation for our improved method.

A KDE trained on points drawn randomly from PE samples will be biased because these samples are themselves biased by the "uninformative" prior. If we have access to an estimated population distribution ${\hat{p}}_{\mathrm{pop}}({\boldsymbol{\theta }})$ that is closer to the true distribution than the PE prior is, we will obtain more accurate estimates of event properties by drawing samples weighted proportional to ${\hat{p}}_{\mathrm{pop}}({\boldsymbol{\theta }})/{\pi }_{\mathrm{PE}}({\boldsymbol{\theta }})$ , as detailed below. The better an estimate of the true distribution we are able to obtain, the smaller will be the bias in event parameters using reweighted PE samples, and ultimately the smaller will be the bias of the KDE.

2.2. Iterative Reweighting

The above discussion suggests an iterative procedure where, beginning with both biased PE samples and a biased population KDE, one may be improved in turn using the other, finally reaching a stationary state where—ideally—both the sample draws and the corresponding population estimates are unbiased. This iterative strategy is similar to the expectation-maximization algorithm (Dempster et al. 1977), a popular method to estimate parameters for statistical models when there are missing or incomplete data.

Our basic algorithm follows these steps:

1.
for each GW event, draw Poisson-distributed (with mean 1) PE samples weighted by the current estimate of population density ${\hat{p}}_{\mathrm{pop}}$ ;
2.
create an awKDE from this sample set, optimizing the global bandwidth (and sensitivity parameter α, if not fixed);
3.
update the current population estimate using one or more KDEs and the selection function, and go to step 1.

In more detail, in step 1 we draw PE samples with probability proportional to the ratio of ${\hat{p}}_{\mathrm{pop}}({{\boldsymbol{\theta }}}_{i}^{k})$ to the PE prior distribution. Step 2 reproduces our previous awKDE method. Step 3 relates the KDE of detected events to an estimate of the true population distribution, hence in general it requires us to compensate for the selection function over the event parameter space: i.e., we estimate the true distribution as the KDE of detected events divided by the probability of detection, as detailed in Sadiq et al. (2022, Section 3.1).

In step 3 we may choose to derive the updated population density ${\hat{p}}_{\mathrm{pop}}({\boldsymbol{\theta }})$ from only the most recently calculated KDE; then the iterative process is a Markov chain,⁶ and we may characterize it via the autocorrelation of various scalar quantities computed at each iteration. We use the optimized global bandwidth h (and adaptive sensitivity parameter α, if not fixed to unity) for this purpose.

After discarding a small number of initial iterations and then accumulating a number significantly greater than the autocorrelation time, we expect the collection of iterations to provide unbiased (though not necessarily independent) estimates of the population distribution. For subsequent iterations we then use the median of ${\hat{p}}_{\mathrm{pop}}({{\boldsymbol{\theta }}}_{i}^{k})$ over a buffer of previous iterations (usually the previous 100) to determine the sample draw probabilities for the next iteration. This population estimate should be more precise than one using only a single previous KDE; in addition using the buffer estimate, the samples for each successive iteration are essentially independent.

In reality, we do not expect the resulting population density estimate to be entirely unbiased, due to more fundamental limitations in both the PE input (e.g., inaccuracies in the merger waveform model used, see Abbott et al. 2023b) and in the KDE procedure itself. Specifically, the expected bias of the KDE is proportional to the second derivative of the true distribution function (Silverman 1986, Section 3.3), thus, depending on the eventual choice of bandwidth, sharp peak or gap features may not be well represented. This bias might be reduced by replacing the KDE with a more general estimate, for instance Gaussian mixture models that allow the positions and weights of kernels to vary (e.g., Rinaldi & Del Pozzo 2022), although this implies higher complexity and computational cost.

2.3. One-dimensional Mock Data Demonstration

We first test this iterative reweighting method on a simple mock data set. We generate true event parameters for a mixture, drawing 30 events each from a truncated power law p_pop(x) ∼ x^−0.5 and from a Gaussian with mean (standard deviation) of μ = 35 (σ_p = 3). We then simulate "measured" event values with a Gaussian random scatter relative to the true parameters of standard deviation σ_m = 5 (broader than the true Gaussian peak); for each measured value we generate 100 mock parameter samples with mean equal to the measured value and with the same uncertainty σ_m = 5. This procedure mimics PE using an uninformative (flat) prior over x.

First, applying awKDE as in Sadiq et al. (2022) to random draws from these mock parameter samples, we find as expected an overdispersed estimate around the peak (Figure 1, top). Second, applying awKDE with our iterative reweighting algorithm we obtain the bottom plot of Figure 1: here the Gaussian height and standard deviation appear accurately reconstructed and the true distribution is well within the 90% percentiles of iteration samples, except near the step-function truncations of the power law, which cannot be accurately represented by a Gaussian KDE.

**Figure 1.** awKDE for 60 events from the distribution of a mock data mixture with 50% power-law (α = −0.5) and 50% Gaussian (μ = 35, σ = 3) samples. One hundred mock parameter samples are generated for each event with measurement error σ_m = 5. Top: awKDE drawing random parameter samples without reweighting. Bottom: applying iterative reweighting. The solid (dotted–dashed) lines represent the median (symmetric 90% confidence band) from 900 bootstrap iterations.
Download figure:
Standard image High-resolution image

We verify that the initial Markov process has accumulated several independent samples by plotting the autocorrelation of the optimized global bandwidth h versus lag (separation along the chain) in Figure 2.⁷ The autocorrelation drops near zero by a lag of ≲10 iterations, thus a buffer of 100 iterations contains several independent population estimates. (The estimate is more noisy for larger lags as fewer iterations are available.)

**Figure 2.** Autocorrelation of the (log) optimized global bandwidth series for the iterative Markov chain. The x-axis shows lag, i.e., separation of the iterations between which autocorrelation is calculated.
Download figure:
Standard image High-resolution image

To further investigate possible statistical biases, we performed the same two awKDE analyses (unweighted and iterative reweighted) for 50 realizations of mock data with the same distribution. At any given x value, we can find the percentile (p-value) of the true p_pop relative to the awKDE Monte Carlo samples, and thus produce a probability-probability (P–P) plot. The first panel in Figure 3 for unweighted awKDE shows consistent estimates for values well away from the population peak, but catastrophic under- or overestimation close to the peak, as expected from strong overdispersion. For reweighted awKDE, the second panel shows much smaller biases close to the peak. (We do not show x values close to 0 or 100, as these are subject to large edge effects for both methods.) While not entirely free of bias, the reweighted method offers much improved reconstruction of population features.

**Figure 3.** P–P plots for 50 realizations of mock data as in Figure 1, for the unweighted awKDE (left) and for the iterative reweighted awKDE (right). The x-axis is the percentile of the true population density relative to the KDE Monte Carlo samples at each of several chosen parameter values. The y-axis is the fraction of mock data experiments below the given percentile, while the gray area is an approximate 95 percentile confidence band. Plots were created with https://lscsoft.docs.ligo.org/ligo.skymap/plot/pp.html.
Download figure:
Standard image High-resolution image

In the next subsection we verify the improved behavior of the reweighted awKDE for two-dimensional mock data in the presence of correlated parameter uncertainties.

2.4. Two-dimensional Mock Data Test of Iterative Reweighting

To investigate the performance of iterative reweighted KDE over 2D data, we start with 60 mock events, 50% from a two-dimensional uniform distribution and 50% from a bivariate normal (Gaussian) distribution. The mean of the normal distribution is μ = (50, 50) in arbitrary units and we take the two dimensions to be uncorrelated, each with a variance of 9. We then simulate measurement uncertainty in our mock data values with an anticorrelation between the two dimensions using covariance ${{\rm{\Sigma }}}_{p}=\left(\begin{array}{cc}32.5 & -31.5\\ -31.5 & 32.5\end{array}\right)$ , corresponding to an error ellipse with 8:1 axes at 45°. This is a simple choice to mimic the anticorrelation between m₁ and m₂ along a contour of constant chirp mass for lower-mass BBHs. As for the 1D mock data, 100 parameter samples with the same covariance are generated around each measured value.

We then investigate whether the reweighted awKDE can reconstruct the true (uncorrelated) population peak by reducing the correlation of the artificial measurement uncertainty. The top panel of Figure 4 for the unweighted awKDE (median over 900 iterations) clearly shows a biased reconstruction with anticorrelation around the peak at (50, 50).⁸ The bottom panel shows the results using the iterative reweighting scheme: the anticorrelation appears to be removed and the reconstructed peak is as expected for the true distribution.

We also compute 1D results from these 2D reweighted awKDE iteration samples by integrating over the second parameter (labeled m₂ in Figure 4) and compare with the corresponding 1D true distribution. Figure 5 shows this comparison: the true Gaussian peak height and width are well recovered by the reweighted awKDE, although as in the 1D case there are evident edge artifacts near the step-function truncations of the uniform distribution.

**Figure 5.** Integrated 1D component distribution derived from 2D mock data results. The black dashed curve is the true 1D distribution composed of uniform and Gaussian components with a Gaussian mean μ = 50. The red curves are derived from numerical integration of 2D mock data results using iterative reweighted awKDE, as in Figure 4, bottom plot.
Download figure:
Standard image High-resolution image

3. Results from GWTC-3

We now apply the reweighted awKDE method to LVK observations. As in Sadiq et al. (2022), as input to our analysis we use parameter estimation samples (LIGO Scientific Collaboration et al. 2021a) for the set of BBH events cataloged in GWTC-3 (Abbott et al. 2023a, 2023c) with false-alarm rate below 1 per year. For the sensitivity estimate we employ a fit to search injection (simulated signal) results (Wysocki et al. 2019; Wysocki 2020) released with the catalog (LIGO Scientific Collaboration et al. 2021b).

3.1. One-dimensional Mass Distribution

We start by evaluating the effect of the iterative reweighting method on the 1D primary mass distribution, taking 100 random PE samples for each of the 69 BBH events; here, we assume a power-law distribution for secondary mass p(m₂) ∼ q^1.5. We reproduce the awKDE results from Sadiq et al. (2022) and use this estimate to seed the reweighted iteration algorithm. After ∼1000 reweighting iterations we compute the median and symmetric 90% interval from the last 900 rate estimates (the first 100 are used to set up a buffer for population weighting, as above): results are presented in Figure 6.

**Figure 6.** Differential rate over BBH primary mass using the iterative weighted KDE on GWTC-3 PE samples, assuming a power-law secondary distribution. We overplot our estimates on two models, Flexible mixtures (FM) and Power Law + Spline (PS) in Abbott et al. (2023b). Our estimate is generally consistent with other nonparametric methods, though with a higher and narrower peak around 35 M_⊙ and lacking any feature between the 10 M_⊙ and 35 M_⊙ peaks.
Download figure:
Standard image High-resolution image

Our estimate is generally consistent with other nonparametric or semiparametric approaches, represented by the Flexible mixtures and Power Law + Spline models in Abbott et al. (2023b), and does not show the overdispersion apparent in Figure 8 of Sadiq et al. (2022); specifically, we find a slightly higher and narrower peak around 35 M_⊙, but no identifiable feature around 20 M_⊙ (compare Tiwari 2022; Toubiana et al. 2023).

3.2. Two-dimensional Mass Reconstruction

Next, we apply our reweighting scheme on PE samples for both component masses and compute the two-dimensional merger rate, using the estimated sensitive volume × time as a function of the two masses. We will first discuss various technical aspects of extending the 1D calculation without assuming any power-law dependence for m₂.

Binary exchange symmetry. Typically, the convention m₁ > m₂ is applied when presenting binary parameter estimates. However, all aspects of binary formation physics and event detection and parameter estimation will be invariant under swapping the component labels, i.e., exchanging m₁ ↔ m₂ (at the same time exchanging the spins). Thus, considering the differential merger rate ${ \mathcal R }({m}_{1},{m}_{2})$ as a function over the whole plane, it must have a reflection symmetry about the line m₁ = m₂. To respect this symmetry and remove biases resulting from the apparent lack of support at m₂ > m₁, if using samples with the m₁ > m₂ convention, we train and evaluate KDEs on reflected sample sets, i.e., after adding copies of the samples with swapped components. (Note also that a power-law m₂ distribution implies the rate is a nondifferentiable function at the equal-mass line, whereas a KDE by construction is smooth and differentiable everywhere.)

Choice of KDE parameters. In Sadiq et al. (2022), we mainly considered a KDE constructed over linear mass (or distance) parameters; however, here we choose the logarithms of component masses. While this choice is not expected to have a large impact on the results, since the kernel bandwidth is free to locally adapt in either case, it is technically preferable for a few reasons. We avoid any possible KDE support at negative masses; there is a generally higher density of events toward lower masses considering the entire 3–100 M_⊙ range; the density of observed events also shows less overall variation over log coordinates; and when evaluating the KDE on a grid with equal spacing, fewer points are required to maintain precision for the low-mass region.

For a 2D KDE we also have a choice of kernel parameters, i.e., the Gaussian covariance matrix: given the similar or identical physical interpretation and range of values between lnm₁ and lnm₂, we choose a covariance proportional to the unit matrix, with an overall factor determined by the local adaptive bandwidth for each event.

PE prior. The PE samples released by LVK use a prior uniform in component masses (LIGO Scientific Collaboration et al. 2021a) up to a factor dependent on cosmological redshift; we currently do not consider reweighting relative to the default cosmological model. As the prior is a density, it transforms with a Jacobian factor when changing variables to $\mathrm{ln}{m}_{1},\mathrm{ln}{m}_{2}$ ; thus, we must divide the estimated rate ${ \mathcal R }(\mathrm{ln}{m}_{1},\mathrm{ln}{m}_{2})$ by a prior ∝m₁ m₂ when obtaining reweighted draw probabilities.

With these technical choices, we perform 1500 reweighting iterations in total, the first 600 using the Markov chain (i.e., the immediately preceding rate estimate) for sample draw weights, and the remaining 900 using the buffer median estimate. Figure 7 shows the rate estimate computed with iterative reweighting for BBH events in GWTC-3 (median over the last 900 iterations). The autocorrelations of optimal global bandwidth and sensitivity parameter α for the first 600 iterations are shown in Figure 8: the correlation drops close to 0 at a lag of ∼30 iterations.

**Figure 7.** Merger rate over binary component masses, estimated using 69 BBH events from GWTC-3. Red + symbols show the median mass samples for each event. The contours and color scale show the two-dimensional differential merger rate over $\mathrm{ln}{m}_{1},\mathrm{ln}{m}_{2}$ from iterative reweighted KDE (median over 900 iterations). Two main maxima are visible at m₁ (m₂) ∼ 10 (8) M_⊙ and ∼35 (32) M_⊙ with a possible less significant overdensity around m₁ ∼ 20 M_⊙.
Download figure:
Standard image High-resolution image

**Figure 7.** Merger rate over binary component masses, estimated using 69 BBH events from GWTC-3. Red + symbols show the median mass samples for each event. The contours and color scale show the two-dimensional differential merger rate over $\mathrm{ln}{m}_{1},\mathrm{ln}{m}_{2}$ from iterative reweighted KDE (median over 900 iterations). Two main maxima are visible at m₁ (m₂) ∼ 10 (8) M_⊙ and ∼35 (32) M_⊙ with a possible less significant overdensity around m₁ ∼ 20 M_⊙.
Download figure:
Standard image High-resolution image

**Figure 8.** Autocorrelations of the KDE (log) global bandwidth and adaptive parameter α using the initial 600 Markov chain iterations. The autocorrelation drops close to 0 by a lag of ∼30 iterations. The estimate becomes noisy at high lags as a smaller number of iterations is available.
Download figure:
Standard image High-resolution image

The mass distribution shows several interesting features in addition to the expected peaks (overdensities) around primary masses of ∼10 M_⊙ and ∼35 M_⊙, with corresponding peaks over secondary mass. For primary masses of ∼35 M_⊙ up to 80 M_⊙, the most likely secondary mass is ∼30–35 M_⊙. Thus, over this range the two component masses appear almost independently chosen. Around the m₁ ∼ 10 M_⊙ peak, some anticorrelation of the two components appears, i.e., higher m₁ favors lower m₂. We investigate the significance of this feature by calculating the median of the m₂ distribution as a function of primary mass, i.e., ${m}_{2}^{50 \% }({m}_{1});$ this is a decreasing function over a range of m₁ from ∼10 through ∼14 M_⊙ for 90%–95% of our sample estimates, depending on the exact range of m₁ chosen. The anticorrelation is thus moderately but not highly significant.

Between the two peaks the distribution of mass ratios appears broader than at either one (as hinted at in Tiwari 2023), although the apparent trend is based on a small number of events. We also see a narrow lower-density region just above the ∼10 M_⊙ peak (cf. the local minimum at chirp mass ∼11 M_⊙ in Tiwari 2023).

We also integrated the 2D KDE rate estimate numerically over both m₁ and m₂ to obtain merger rates over component mass. As shown in Figure 9, we recover features consistent with the 2D estimates and with other methods. Each component mass distribution appears well modeled by a combination of two Gaussian peaks and (broken) power laws.

To elucidate features in the 2D distribution, we choose various representative values of primary mass to plot the distribution of m₂ in Figure 10. The similarity between secondary distributions for m₁ ≳ 35 M_⊙ is evident.

**Figure 10.** Distributions of secondary mass m₂ estimated via iterative reweighted KDE for various fixed values of primary mass: for each m₁ (in M_⊙) we plot the median and symmetric 90% confidence band. To better visualize features in the distributions we use a log (linear) m₂ scale in the upper (lower) plot.
Download figure:
Standard image High-resolution image

We may also derive the distribution of mass ratio q from our 2D rate estimate. We plot this for various representative values of primary mass in Figure 11, and compare to a typical power law ∝q^1.5. First, we see that the q distribution varies over primary mass; hence, models where it is forced to the same form over the whole mass range are likely to have nontrivial bias. For some primary masses, p(q) is consistent with a monotonically increasing function such as a positive power, but for others it clearly decreases over some of the range. Roughly, if m₁ is close to a peak then p(q) is consistent with an increasing power law, but for other values the mass ratio rather shows a maximum at intermediate values, down to q ∼ 0.5 for m₁ = 15 or m₁ = 70.

**Figure 11.** Distributions of mass ratio q estimated via iterative reweighted KDE for various fixed values of primary mass: for each m₁ (in M_⊙) we plot the median and symmetric 90% confidence band. For comparison we overplot a power-law dependence ∝q^1.5.
Download figure:
Standard image High-resolution image

This behavior may suggest that the primary and secondary masses are independently drawn from similar distributions, modulo a q-dependent "pairing factor" (Fishbach & Holz 2020) that influences the relative probability of binary merger (recent work by Farah et al. 2023b further explores the case of similar or identical underlying primary and secondary mass distributions). In any case, the preference toward q ∼ 1 seen in previous work is not confirmed here. Callister & Farr (2023) reached a similar conclusion, although assuming a "universal" distribution p(q) over all primary masses.

Results including GW190814. We also estimate the 2D and 1D integrated merger rates using our iterative reweighted KDE method when the outlier event GW190814 (Abbott et al. 2020), which has a mass ratio ${ \mathcal O }(10)$ and a secondary mass barely above the likely neutron star maximum mass, is included in the BBH population. Detailed results are presented in the Appendix: roughly summarizing the trends seen there, the bulk of the estimated distribution remains little changed by the addition of the extra event, although the peak in secondary mass below 10 M_⊙ is shifted toward lower values and becomes both higher and broader; this is likely due to a general increase in KDE bandwidth when optimized with cross-validation. (In parameterized models the estimated mass distribution is also highly sensitive to inclusion of GW190814 ; Abbott et al. 2021b, 2023b.) It is not clear whether other methods for choosing the bandwidth would yield more accurate estimates; higher event statistics in the regime of low m₂ are clearly desirable.

4. Discussion

Summary of results. In this work we undertook a detailed investigation of the full 2D mass distribution of merging compact binary black holes observed by LIGO–Virgo–KAGRA up to the O3 run without assuming any specific functional form for the secondary mass or mass ratio, enabled by a new method of iterative density estimation to address uncertainties in mass measurements. Although we reproduce the broad features and local maxima seen in other parametric and nonparametric analyses, we find significantly less preference for near-equal masses than in most previous works; we also find that the mass ratio distribution cannot be described by a single function over the whole population (compare Tiwari 2023). For a range of primary masses, we find nonmonotonically varying distributions of secondary mass and mass ratio, thus a power-law dependence is ruled out. Furthermore, we find that for primary masses above 35 M_⊙ the secondary mass distribution is nearly independent of m₁, with a "preferred partner" mass of m₂ ≃ 30–35 M_⊙. Conversely, near the low-mass peak m₁ ≃ 10 M_⊙ we observe an anticorrelation between the two components, i.e., higher m₁ implies lower m₂.

Possible astrophysical interpretations. Our new estimate of the joint m₁–m₂ distribution may be compared to model predictions in the literature; because our marginal distributions for individual components are similar to previous findings, we focus here on the mass ratio. Broadly speaking, we can distinguish model predictions from the isolated binary and dynamical channels.

The isolated binary channel predicts relatively flat p(q) distributions (e.g., Belczynski et al. 2020; Olejak et al. 2021) compared to the dynamical channel (see Figure 2 in Baibhav et al. 2023 and Figure 1 in Zevin et al. 2021). Our estimates show generally flatter q distributions than the GWTC-3 results presented in Abbott et al. (2023b).

Because predictions for p(q) in the isolated binary channel depend strongly on the adopted parameters (see, e.g., Broekgaarden et al. 2022), our results provide an important step toward constraining astrophysical parameters with GWs. For example, the steep p(q) found for very small common-envelope efficiency parameter (α_CE ≃ 0.2, Baibhav et al. 2023) and the chemically homogeneous evolution model (Mandel & de Mink 2016) seem disfavored, implying that these routes cannot account for the majority of the observed population.

The stable mass transfer channel is efficient for primary masses near ∼10 M_⊙. van Son et al. (2022) predicts a dearth of near-equal-mass mergers, which is because the binary needs to be relatively unequal in mass during the second mass transfer phase for the orbit to shrink, but not too unequal to avoid unstable mass transfer. This is only partly supported by our Figure 7 in that the low-mass peak has support from equal mass out to q ≃ 0.5. For some parameter choices their models predict bimodality in p(q), with peaks at q ≃ 0.35 and q ≃ 0.75: our results suggest a peak at q ≃ 0.8 for m₁ ≃ 10 M_⊙ and at q ≃ 0.45 for m₁ ≃ 15 M_⊙ (see Figure 11), which suggests that a more detailed comparison may yield interesting constraints.

For the dynamical channel, it is interesting to consider whether models now predict q distributions that are too steeply rising. Rodriguez et al. (2016) modeled BBH mergers that formed dynamically in globular clusters: they find a median mass ratio of 0.87, with 68% of sources having mass ratios q > 0.8. As shown in Figure 11 we find comparable support for near-equal masses only at m₁ ∼ 10 M_⊙ or ∼35 M_⊙; elsewhere our median q is significantly lower.

Antonini et al. (2023) model BBH mergers in globular clusters in comparison to the GWTC-3 q distribution: their model distributions are flatter and underestimate the power-law LVK fits by an order of magnitude at q ≃ 1. They find a final q distribution flatter than the q distribution of dynamically formed BBHs (p(q) ∝ q⁴ for metal-poor clusters) because the BH mass function is not always sufficiently sampled, such that a secondary BH with a mass similar to the primary is present in a cluster, and due to a slight bias against equal-mass BBHs due to their lower inspiral probability. The reported p(q) in their Figure 1 is relatively flat for q ≳ 0.7, qualitatively in agreement with our findings for m₁ ≳ 20 M_⊙ (Figure 11, lower panel; their models cannot reproduce observed rates for lower-mass primaries). Due to the predicted pair instability gap, all BBH mergers in their models with m₁ ≳ 50 M_⊙ are hierarchical mergers, i.e., a BBH in which at least one of the components is a BBH merger remnant that was retained in the cluster (e.g., Antonini & Rasio 2016; Rodriguez et al. 2019; Kimball et al. 2021). Mergers with second-generation primaries are expected to have a mass ratio q ≃ 0.5, which is supported by our distribution for m₁ = 70 M_⊙ (Figure 11).

The p(q) distribution is expected to be slightly flatter for dynamically formed BBHs in young (open) star clusters, because they have fewer BHs per cluster and their higher metallicities lead to steeper BH mass functions and therefore lower companion masses (e.g., Banerjee 2021). An accurate picture of the mass ratio distribution is therefore important to understand the relative contribution of dynamically formed BBHs in young (and metal-rich) and old (and metal-poor) star clusters, and also more generally for understanding the relative contributions of isolated and dynamically formed binaries in the population as a whole (Zevin et al. 2021; Baibhav et al. 2023).

An intriguing apparent feature in our reconstruction, the anticorrelation between m₁ and m₂ in the low-mass (m ∼ 10 M_⊙) peak, suggests a connection to isolated binary dynamics, though it would be premature to link it with a specific mechanism.

Technical issues and biases. As noted in the introduction, measurement errors of the binary mass ratio are correlated with those in (orbit-aligned) spins. Since we have so far not attempted to reconstruct or estimate the spin distribution of merging binaries, we implicitly assume that distribution is equal to the prior used for parameter estimation (uniform in magnitude and isotropic in direction): this is a potential source of bias that remains to be addressed by future work. The distribution of aligned spins has been found to be concentrated near zero, with a slight preference for positive aligned spin (Abbott et al. 2020, 2023c; Miller et al. 2020); hence, the degree of bias may be limited. Callister et al. (2021) also note the intriguing possibility that the true mass ratio and aligned spin (after allowing for measurement errors) are anticorrelated.

A converse question concerns inferences of BH spin distributions, which assume either a specific distribution in q or a power law with index as a hyperparameter: if the p(q) model is significantly inaccurate, are such spin inferences biased? (Ng et al. (2018) and Miller et al. (2020) contain detailed discussion of potential biases in estimates of aligned spin population.) The effect may not be large, as most BBH events by necessity have parameter values close to the observed peaks, for which we find a q distribution that is not far from power law.

Extensions of the method. As already noted, here we restricted the application of our KDE to the binary mass distribution; component spins, and distance or redshift are then the next relevant parameters for population analysis. We expect to encounter a technical issue in optimizing the Gaussian kernel for a multidimensional data set, where it will not be appropriate (or even meaningful, given the different units) to impose equal variances over different parameters as we currently do for (log) m₁ and m₂. For more than two dimensions a grid search may not be practicable; more sophisticated methods may be required in order to realize the potential of iterative KDE over a full set of population parameters.

Acknowledgments

We thank Daniel Wysocki for making available fitted sensitivity estimates for binary mergers in the O1–O3 data. We also benefited from conversations with Lieke van Son, Floor Broekgaarden, and Fabio Antonini, and with Will Farr, Thomas Callister, Amanda Farah, Vaibhav Tiwari, and others in the LVK Binary Rates & Populations group. This work has received financial support from Xunta de Galicia (CIGUS Network of research centers), by European Union ERDF and by the "María de Maeztu" Units of Excellence program CEX2020-001035-M and the Spanish Research State Agency. T.D. and J.S. are supported by research grant PID2020-118635GB-I00 from the Spanish Ministerio de Ciencia e Innovación. J.S. also acknowledges support from the European Union's H2020 ERC Consolidator Grant "GRavity from Astrophysical to Microscopic Scales" (GRAMS-815673) and the EU Horizon 2020 Research and Innovation Programme under the Marie Sklodowska-Curie grant Agreement No. 101007855. M.G. acknowledges support from the Ministry of Science and Innovation (EUR2020-112157, PID2021-125485NB-C22, CEX2019-000918-M funded by MCIN/AEI/10.13039/501100011033) and from AGAUR (SGR-2021-01069).

The authors are grateful for computational resources provided by the LIGO Laboratory and supported by National Science Foundation grants PHY-0757058 and PHY-0823459. This research has made use of data or software obtained from the Gravitational Wave Open Science Center (gwosc.org), a service of the LIGO Scientific Collaboration, the Virgo Collaboration, and KAGRA. This material is based upon work supported by NSF's LIGO Laboratory which is a major facility fully funded by the National Science Foundation, as well as the Science and Technology Facilities Council (STFC) of the United Kingdom, the Max-Planck-Society (MPS), and the State of Niedersachsen/Germany for support of the construction of Advanced LIGO and construction and operation of the GEO600 detector. Additional support for Advanced LIGO was provided by the Australian Research Council. Virgo is funded, through the European Gravitational Observatory (EGO), by the French Centre National de Recherche Scientifique (CNRS), the Italian Istituto Nazionale di Fisica Nucleare (INFN) and the Dutch Nikhef, with contributions by institutions from Belgium, Germany, Greece, Hungary, Ireland, Japan, Monaco, Poland, Portugal, and Spain. KAGRA is supported by Ministry of Education, Culture, Sports, Science and Technology (MEXT), Japan Society for the Promotion of Science (JSPS) in Japan; National Research Foundation (NRF) and Ministry of Science and ICT (MSIT) in Korea; Academia Sinica (AS) and National Science and Technology Council (NSTC) in Taiwan.

Appendix: Results Including GW190814

In addition to our main results from significant BBH events in GWTC-3, we examine the effect of including the outlier event GW190814 (Abbott et al. 2020) with low secondary mass m₂ = 2.59 M_⊙ (q = 0.11), which may be consistent with either a very massive neutron star or a light black hole. We slightly extend the range of m₂ over which the KDE is evaluated in order to include the additional event. Figure 12 summarizes our results, which are also briefly discussed in the main text.

**Figure 12.** Left: two-dimensional merger rate density estimated via reweighted awKDE for 70 events (69 BBH plus GW190814) from GWTC-3. Red + symbols show the median mass samples for each event. Right: the corresponding one-dimensional component mass distributions from numerical integration of 2D results.
Download figure:
Standard image High-resolution image

Binary Vision: The Mass Distribution of Merging Binary Black Holes via Iterative Density Estimation

Article metrics

Author e-mails

Author affiliations

ORCID iDs

Dates

Abstract

1. Introduction