Physics-inspired deep learning to characterize the signal manifold of quasi-circular, spinning, non-precessing binary black hole mergers

The spin distribution of binary black hole mergers contains key information concerning the formation channels of these objects, and the astrophysical environments where they form, evolve and coalesce. To quantify the suitability of deep learning to characterize the signal manifold of quasi-circular, spinning, non-precessing binary black hole mergers, we introduce a modified version of WaveNet trained with a novel optimization scheme that incorporates general relativistic constraints of the spin properties of astrophysical black holes. The neural network model is trained, validated and tested with 1.5 million $\ell=|m|=2$ waveforms generated within the regime of validity of NRHybSur3dq8, i.e., mass-ratios $q\leq8$ and individual black hole spins $ | s^z_{\{1,\,2\}} | \leq 0.8$. Using this neural network model, we quantify how accurately we can infer the astrophysical parameters of black hole mergers in the absence of noise. We do this by computing the overlap between waveforms in the testing data set and the corresponding signals whose mass-ratio and individual spins are predicted by our neural network. We find that the convergence of high performance computing and physics-inspired optimization algorithms enable an accurate reconstruction of the mass-ratio and individual spins of binary black hole mergers across the parameter space under consideration. This is a significant step towards an informed utilization of physics-inspired deep learning models to reconstruct the spin distribution of binary black hole mergers in realistic detection scenarios.


Introduction
Gravitational wave (GW) observations provide unique insights into the formation channels of compact binary systems. For instance, it is expected that inferring orbital eccentricity in GW observations may provide the most conclusive evidence for the existence of compact binary systems in dense stellar environments. Similarly, in the case of binary black hole (BBH) mergers, the spins of BBHs formed in dense stellar environments are expected to be distributed isotropically, whereas BBHs formed through massive stellar evolution in isolation may have spins that are aligned with the binary's orbital angular momentum [50,51].
As the number of GW observations of BBH mergers continues to grow in years to come, it will be possible to infer the astrophysical properties of these sources and elucidate their formation history.
The goal of this paper is to quantify the adequacy of deep learning to characterize the signal manifold of quasi-circular, spinning, non-precessing BBH mergers. In other words, we use a large catalog of time-series waveforms that are parameterized in natural units, i.e., numerical relativity (NR) type waveforms, such that the GWs that describe BBH mergers may be fully described in terms of the mass-ratio, q, and the individual spins of the binary components, s^z_i with i = {1, 2}. The neural network is trained to cover this 3-D signal manifold (q, s^z_1, s^z_2), with the aim of accurately inferring these three parameters when unlabelled waveforms are fed into the neural network. This work builds upon an emergent field of deep learning research which has thus far provided new methodologies for detection and point-parameter estimation of GW sources in the context of simulated noise [52] and real advanced LIGO noise [53,54]; detection-only methods in the context of simulated noise [55,56]; and denoising of GW signals [57][58][59], among others (see [60] and references therein).
As we describe in this manuscript, naive methods to train deep learning architectures lead to rather sub-optimal results. However, we show that when we use physics-inspired optimization algorithms, which incorporate general relativistic constraints on the spins of BBHs, we are able to accurately recover the individual spins and mass-ratio of BBH signals across the parameter space under consideration. This analysis provides benchmarks for the performance of deep learning models when applied to the reconstruction of these parameters in the absence of noise, and provides a baseline of accuracy for the case when real noise from GW observations is taken into consideration. That study will be presented in a follow-up paper.
This study is organized as follows. Section 2 describes the data sets used to train, validate and test our neural network model. We also describe therein the deep learning model used, and show how to build a physics-inspired optimization algorithm that significantly improves the predictive accuracy of our neural network. Our findings are presented and discussed in Section 3. We summarize this work and outline future directions of work in Section 4.

Methods
In this section we describe the data sets used to train, validate and test our neural network model. We also describe the neural network architecture used for this work, and the construction of a physics-inspired optimization algorithm that significantly improves the ability of our neural network to accurately infer the astrophysical properties of BBH mergers from the time-series NR waveforms that describe them.

Data Curation
We generate our training, validation and test sets using NRHybSur3dq8, a surrogate model for hybridized non-precessing NR waveforms. While this surrogate model may be extrapolated up to mass-ratio q = 10 and |s^z_i| = 0.998, with i = {1, 2}, it has only been trained with 104 NR waveforms in the parameter range q ≤ 8 and |s^z_i| ≤ 0.8, and hence we restrict our data sets to the same parameter span. Furthermore, we consider the ℓ = |m| = 2 mode to train, validate and test our neural network model.
The signals produced for this analysis are such that the amplitude peak occurs at t = 0 M, covering a time span t ∈ [−10,000 M, 130 M] with a time step ∆t = 0.1 M. A sample waveform is shown in Fig. 1.
Figure 1: Sample waveform for a binary black hole system with mass-ratio q = 7.9, and whose binary components have spins s^z_1 = s^z_2 = 0.8. All waveforms used in this analysis cover the time range t ∈ [−10,000 M, 130 M], and are sampled with a time step ∆t = 0.1 M.
The training set is generated by sampling the mass-ratio q ∈ [1, 8] and the individual spins s^z_{1,2} ∈ [−0.8, 0.8]; the validation and test sets comprise ∼190,000 waveforms each. The distributions of parameters for the training, validation and test sets are summarized in Fig 2. The entire data set is ∼1.5 TB in size, and we make use of mpi4py to parallelize the data generation using the Campus Cluster at the University of Illinois at Urbana-Champaign [61].
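As a concrete, simplified illustration of how such a data-generation campaign can be split across workers, the sketch below partitions a toy (q, s^z_1, s^z_2) grid in the round-robin fashion an mpi4py driver script might use. The grid spacings and the round-robin scheme are illustrative assumptions, not the paper's actual values, and the surrogate evaluation each worker would perform is omitted.

```python
# Sketch of splitting the (q, s1z, s2z) parameter space across MPI ranks.
# In practice each rank would evaluate NRHybSur3dq8 on its share of
# parameter triples; here we only show the work division.
import itertools

def parameter_grid(n_q=8, n_spin=5):
    """Illustrative grid over q in [1, 8] and s^z_i in [-0.8, 0.8]."""
    qs = [1.0 + 7.0 * i / (n_q - 1) for i in range(n_q)]
    spins = [-0.8 + 1.6 * j / (n_spin - 1) for j in range(n_spin)]
    return list(itertools.product(qs, spins, spins))

def chunk_for_rank(params, rank, size):
    """Round-robin assignment of parameter triples to one MPI rank."""
    return params[rank::size]

params = parameter_grid()
# With 4 "ranks", every triple is assigned to exactly one worker.
chunks = [chunk_for_rank(params, r, 4) for r in range(4)]
assert sum(len(c) for c in chunks) == len(params)
```

With mpi4py, `rank` and `size` would come from `MPI.COMM_WORLD`; the chunking logic itself is independent of the communication layer.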

Neural Network Model Architecture and Loss Function
The neural network architecture consists of two fundamental components, a shared base/root consisting of layers slightly modified from the WaveNet architecture, and two branches consisting of fully connected layers that take in features extracted from the root to predict the mass ratio and the individual spins of the binary components, respectively.
WaveNet is a probabilistic and auto-regressive model originally released by DeepMind for generating raw audio waveforms, with the predictive distribution for each audio sample conditioned on all previous ones. Trained on data with tens of thousands of samples per second of audio, it not only exhibited state-of-the-art performance on text-to-speech, but also showed promising results in music generation and as a discriminative model for phoneme/speech recognition. Additionally, WaveNet-inspired architectures have been demonstrated to successfully denoise recent GW observations [58]. Inspired by these recent successes, as well as by key architectural features (described in more detail below) suited to processing wideband raw waveforms, we modify the WaveNet architecture and test it as a discriminative model to predict key parameters of GWs from quasi-circular, spinning, non-precessing binary black hole mergers.
The key architectural components of a WaveNet are dilated causal convolutions, gated activation units, and the usage of residual and skip connections, which we describe below.

Root Layers
Dilated Causal Convolutions. The original WaveNet was used as a generative model to predict the audio sample x_t conditioned on all previous timesteps, where the joint probability of the waveform x = {x_1, x_2, ..., x_T} is factorized as a product of conditional probabilities p(x) = \prod_{t=1}^{T} p(x_t | x_1, ..., x_{t-1}). Traditionally, recurrent neural networks (RNNs) and Long Short-Term Memory (LSTM) models have been used to model conditional probabilities of sequences, but they are typically considerably slower than convolutional networks and also suffer from the vanishing gradient problem for very long sequences. WaveNet addresses these issues using causal convolutions, which are implemented by shifting the output of a normal convolution by a few timesteps, ensuring that the prediction made by the model at time step t does not depend on the future timesteps x_{t+1}, ..., x_T. Since ours is a discriminative model, we turn off the causality of the WaveNet architecture, so that our model can simultaneously process all time steps of the waveform.
A dilated convolution is an effective way to increase the receptive field of a convolutional network by orders of magnitude with only a small computational overhead. This is achieved by applying a convolutional filter over the input sequence by skipping input values with a certain step size (called the 'dilation'). Figure 3 shows dilated convolutions for dilations of 1, 2, 4, and 8 respectively. Similar to the original WaveNet architecture, we double the dilation for every layer up to a limit and then repeat the same stack of layers, to efficiently capture long range structure of the waveform.
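The effect of doubling dilations on the receptive field can be checked with a short back-of-the-envelope calculation; the kernel size and dilation schedule below are illustrative assumptions, not the paper's actual hyperparameters.

```python
# Receptive-field arithmetic for a stack of dilated convolutions:
# each layer extends the receptive field by (kernel_size - 1) * dilation
# input steps, starting from a single time step.
def receptive_field(kernel_size, dilations):
    return 1 + sum((kernel_size - 1) * d for d in dilations)

# WaveNet-style schedule: dilations 1, 2, 4, ..., 512, repeated in blocks.
block = [2 ** i for i in range(10)]
print(receptive_field(2, block * 3))  # 3070
```

Thirty layers thus see over three thousand input steps, whereas thirty undilated kernel-2 layers would see only 31; this is the "orders of magnitude with small overhead" trade-off.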
Gated Activation Units. A potential advantage of LSTMs is that they have multiplicative units (memory gates) that may help with modeling complex inter-dependencies between time steps. The original WaveNet architecture compensates for this by replacing the rectified linear units (ReLU) between convolutions with the following gated activation unit: z = \tanh(W_{f,k} * x) ⊙ \sigma(W_{g,k} * x), where * is a convolution operator, ⊙ is an element-wise multiplication operator, σ(·) is the sigmoid function, k is the layer index, and W_f and W_g denote filter and gate convolutions, respectively. We keep the same gated activation units in our network.
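A minimal sketch of this gated activation in plain Python, with 'same'-padded 1-D convolutions standing in for the learned multi-channel filter and gate convolutions (the kernels here are illustrative, not trained weights):

```python
import math

def gated_activation(x, w_f, w_g):
    """z = tanh(w_f * x) ⊙ σ(w_g * x): '*' is a 1-D convolution,
    ⊙ an element-wise product. Single-channel toy version."""
    def conv_same(seq, kernel):
        half = len(kernel) // 2
        padded = [0.0] * half + list(seq) + [0.0] * half
        return [sum(k * padded[i + j] for j, k in enumerate(kernel))
                for i in range(len(seq))]
    filt = [math.tanh(v) for v in conv_same(x, w_f)]          # bounded in (-1, 1)
    gate = [1.0 / (1.0 + math.exp(-v)) for v in conv_same(x, w_g)]  # in (0, 1)
    return [f * g for f, g in zip(filt, gate)]                # gate scales filter

z = gated_activation([0.5, -1.0, 2.0, 0.0], [0.2, 0.5, 0.2], [1.0, 0.0, -1.0])
assert all(abs(v) < 1.0 for v in z)  # product of bounded factors
```

The sigmoid gate acts like an LSTM memory gate: it softly selects which filter responses pass to the next layer.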
Residual and Skip Connections. Similar to the original WaveNet architecture, both residual and skip connections are used to speed up convergence and mitigate vanishing gradients when training deep neural networks. Fig 4 shows a residual block of the model, which is stacked many times in the network. A more in-depth discussion of the structure of WaveNet may be found in the original paper [62].

Leaf Layers
At the end of the root layers the model outputs a sequence with the same time dimensionality as the input sequence. Since the input sequence has 101,300 time-steps, we pick the last 10,000 time-steps from the output sequence, flatten them, and feed them into two leaves of fully connected layers that predict the mass ratio q and the spins s^z_i, respectively, as shown in Fig 4. Hence the shared root can be thought of as a feature extractor, and the leaves as sub-networks specializing in predicting different parameters from the extracted features. The motivation behind this root/leaf structure is two-fold: firstly, the mass ratio q ∈ [1, 8] and the spins s^z_i ∈ [−0.8, 0.8] have different numerical ranges; and secondly, experiments without a leaf structure empirically produced sub-optimal results.
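The root/leaf split can be sketched with a toy forward pass: keep the final time-steps of the root's output, flatten, and apply two independent linear heads. All sizes and weights below are toy values chosen for illustration (the real model keeps the last 10,000 of 101,300 steps and uses learned fully connected layers):

```python
# Toy root/leaf forward pass: truncate, flatten, then two linear heads.
def leaf_forward(features, keep_last, w_q, w_s):
    tail = features[-keep_last:]                  # keep final time-steps
    flat = [v for step in tail for v in step]     # flatten (T, C) -> (T*C,)
    q_pred = sum(w * v for w, v in zip(w_q, flat))                  # mass-ratio head
    s_pred = [sum(w * v for w, v in zip(ws, flat)) for ws in w_s]   # spin head
    return q_pred, s_pred

features = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]   # (T=3, C=2) toy root output
q, s = leaf_forward(features, keep_last=2, w_q=[1, 0, 0, 1],
                    w_s=[[0, 1, 0, 0], [0, 0, 1, 0]])
assert (q, s) == (0.3 + 0.6, [0.4, 0.5])
```

Both heads read the same extracted features but learn independent mappings, which is what lets each leaf adapt to the different numerical ranges of q and s^z_i.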

Loss Function
We performed several experiments to find an optimal way to constrain the s^z_i predictions from the second leaf sub-network. In the first iteration, we predicted s^z_i directly using the mean-squared error as the loss function. This resulted in reasonable predictions for s^z_1, but the s^z_2 predictions could not be constrained with continued training. Consequently, we explored the use of the effective one-body general relativistic dynamics of two spinning black holes as delineated in [63]. Specifically, we focus on the derivation of a spin-dependent effective one-body Hamiltonian for small and moderate spins as a ν-deformation of a Kerr metric of mass M ≡ m_1 + m_2 and effective spin

S_eff ≡ σ_1 S^z_1 + σ_2 S^z_2,    (2)

where S^z_i = m_i^2 s^z_i (in units of the total mass M = 1), σ_1 ≡ 1 + 3/(4q), and σ_2 ≡ 1 + 3q/4. In addition, we also consider the effective spin parameter [50,64]

σ_eff = (q s^z_1 + s^z_2)/(1 + q).    (3)

Predicting S_eff and σ_eff, and using our prediction for q to solve Equations (2) and (3), we were able to tightly constrain the s^z_i predictions.
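If Equation (2) takes the Damour-style form S_eff = σ_1 S^z_1 + σ_2 S^z_2 with S^z_i = m_i^2 s^z_i (an assumption of this sketch, in units M = 1), and Equation (3) is the standard σ_eff = (q s^z_1 + s^z_2)/(1 + q), then recovering the individual spins from the predicted (q, S_eff, σ_eff) is a 2×2 linear solve, e.g. by Cramer's rule:

```python
# Recover s^z_1, s^z_2 from predicted q, S_eff, sigma_eff by solving the
# two linear relations above.  Note the system degenerates (det -> 0) at
# q = 1, consistent with the m1 <-> m2 symmetry discussed in the text.
def spins_from_effective(q, s_eff, sigma_eff):
    m1, m2 = q / (1 + q), 1 / (1 + q)
    a11, a12 = (1 + 3 / (4 * q)) * m1**2, (1 + 3 * q / 4) * m2**2  # Eq. (2)
    a21, a22 = q / (1 + q), 1 / (1 + q)                            # Eq. (3)
    det = a11 * a22 - a12 * a21
    s1 = (s_eff * a22 - a12 * sigma_eff) / det
    s2 = (a11 * sigma_eff - s_eff * a21) / det
    return s1, s2

# Round trip: build S_eff, sigma_eff from known spins, then invert.
q, s1, s2 = 4.0, 0.6, -0.2
m1, m2 = q / (1 + q), 1 / (1 + q)
s_eff = (1 + 3 / (4 * q)) * m1**2 * s1 + (1 + 3 * q / 4) * m2**2 * s2
sigma_eff = (q * s1 + s2) / (1 + q)
r1, r2 = spins_from_effective(q, s_eff, sigma_eff)
assert abs(r1 - s1) < 1e-12 and abs(r2 - s2) < 1e-12
```

This is only a sketch of the post-processing step implied by the text; the paper's implementation details (normalizations, loss weighting) may differ.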

Results
This section presents our main findings in the following format: we first discuss specific challenges to be addressed, namely, the parameter space degeneracy of the signal manifold under consideration and the need to use a large training data set to densely sample this 3-D signal manifold. We then present qualitative results of the performance of our neural network model, and finalize with a quantitative set of results that provide a thorough description of the realm of applicability of our neural network model.

Parameter space degeneracy and convergence of high performance computing with deep learning
The characterization of the signal manifold of quasi-circular, spinning, non-precessing BBH mergers presents a number of complications, given that different time-series NR waveforms, say h(t) and s(t), that describe different astrophysical systems are remarkably similar. To illustrate this property we have selected a few astrophysical systems and then computed the overlap, O(h, s), between them and fixed mass-ratio slices of the signal manifold under consideration using the relation

O(h, s) = max_{t_c, φ_c} ( ĥ | ŝ[t_c, φ_c] ),  with  ĥ = h (h|h)^{−1/2},

where ŝ[t_c, φ_c] indicates that the normalized waveform ŝ has been time- and phase-shifted. Using this metric, Figure 5 shows that the overlap between a given signal and other BBH systems with remarkably different spin combinations renders them effectively indistinguishable. Another observation we draw from these results is that inferring the individual spins of near-equal mass-ratio systems is very hard given the intrinsic symmetry of the system, i.e., the binary components may be interchanged, m_1 ⇐⇒ m_2, as shown in the top panels of Figure 5. Indeed, near-equal mass-ratio BBH systems present the largest regions of degeneracy across the signal manifold. We also observe that systems with small to moderate spins present large regions of degeneracy, as shown in the second and third rows of panels in Figure 5. These features are also present in BBHs whose binary components have large spin values, as shown in the bottom panels of Figure 5. In summary, inferring the individual spins of BBH mergers is a rather challenging endeavor.
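A simplified, flat-noise implementation of this overlap, with the phase maximization handled analytically via the complex modulus and the time shift scanned with an FFT cross-correlation, might look as follows (a stand-in sketch, not the paper's actual code):

```python
import numpy as np

def overlap(h, s):
    """Flat-noise overlap maximized over time and phase shifts.
    h, s are complex strain series h_+ - i h_x on a common uniform grid.
    Maximizing |<h, s e^{i phi_c}>| over phi_c gives the modulus; the
    time shift t_c is scanned over all circular shifts via the FFT."""
    h = h / np.sqrt(np.vdot(h, h).real)   # normalize: (h|h) = 1
    s = s / np.sqrt(np.vdot(s, s).real)
    corr = np.fft.ifft(np.fft.fft(h) * np.conj(np.fft.fft(s)))
    return np.abs(corr).max()

t = np.linspace(0, 20 * np.pi, 4096, endpoint=False)
h = np.exp(1j * t)                      # toy monochromatic "waveform"
s = np.exp(1j * (t + 0.7)) * 1.3        # phase-shifted, rescaled copy
assert abs(overlap(h, s) - 1.0) < 1e-8  # identical up to time/phase/amplitude
```

By Cauchy-Schwarz the result lies in [0, 1], with 1 meaning the two waveforms are indistinguishable up to time and phase shifts, which is exactly the degeneracy probed in Figure 5.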
To try to address these challenges, we begin by densely sampling the signal manifold under consideration following the methodology described in Section 2.1, which leads to the construction of a data set that includes over 1.5×10^6 waveforms. Training the model described in Section 2.2 with this large data set using a single V100 GPU would take about 30 days to ensure that the model converges and attains optimal performance, which is quantified by computing the overlap between ground-truth signals and signals whose parameters are predicted by the neural network. In order to minimize time-to-insight and to test multiple physics-inspired optimization algorithms, we designed and deployed a distributed training scheme on the Hardware-Accelerated Learning (HAL) deep learning cluster at the National Center for Supercomputing Applications [65,66]. This cluster has 64 NVIDIA V100 GPUs distributed evenly across 16 nodes, connected by NVLink 2.0 inside the nodes and EDR InfiniBand across the nodes. As shown in Figure 6, this approach reduces the training stage to only 12.4 hours.

Qualitative Analysis
The first assessment we have conducted to examine the performance of our fully trained model is presented in Figure 7. Therein we show the absolute errors in the estimation of the mass-ratio and individual spins for several fixed mass-ratio slices of the signal manifold. The top left panel shows again the degeneracy expected for near-equal mass-ratio systems. Note that while we can obtain a good estimate of the mass-ratio of these systems, it is hard to infer the individual spins of the binary components. Once this degeneracy is broken, i.e., when we consider q > 1 BBH mergers, it becomes possible to infer with fair accuracy both the individual spins and the mass-ratio of the BBH mergers, as shown in the top-right and bottom-left panels of Figure 7. As expected, as the mass-ratio of the BBH increases, it becomes increasingly difficult to infer the spin of the secondary (see bottom right panel of Figure 7). This is expected, since for asymmetric mass-ratio BBH mergers the effect of the secondary on the dynamics of the binary system becomes less dominant, i.e., for a fixed mass-ratio and spin of the primary, it is difficult to tell apart signals when we vary the spin of the secondary.

Quantitative Analysis
While this qualitative analysis suggests that our neural network may be correctly characterizing the signal manifold under consideration, we have also conducted a quantitative analysis, i.e., we collect the predictions of the neural network for the mass-ratio and individual spins for each waveform in the testing data set. Thereafter, we generate waveforms with these predicted parameters using NRHybSur3dq8. Finally, we compute the overlap between the ground-truth signals in the testing data set and the waveforms whose parameters are predicted by our neural network. We summarize the results of this analysis for a sample of mass-ratio slices in Figure 8. These results indicate that we are able to accurately infer the mass-ratio and individual spins over a broad range of the parameter space under consideration; indeed, both the median and mean overlaps remain close to unity across these slices. We provide a more detailed analysis of the overlap results in Figure 9, where we show overlap results at every single point of the testing data set for a number of mass-ratio slices. These results show that our neural network model can predict the mass-ratio and individual spins of BBH mergers with excellent accuracy. Close to the edge of the parameter space under consideration, i.e., q → 8, the accuracy of the neural network gradually decreases. Figure 10 presents a visual representation of high, median and low overlap samples, i.e., two sigma below the median overlap. These random samples from the testing data set show that: (i) our neural network can identify signals that reproduce with excellent accuracy the dynamics of near-equal mass-ratio BBH mergers.
However, given the parameter space degeneracy of the signal manifold for q ∼ 1 systems, it is difficult to accurately recover the individual spins of these systems (see left column in Figure 10); (ii) for BBH systems with q > 1 (middle column in Figure 10) we notice that our network can recover with excellent accuracy both the mass-ratio and individual spins; and (iii) systems with asymmetric mass-ratios (right column in Figure 10) can be characterized with different levels of accuracy. We notice that even for systems whose overlap is two sigma below the median overlap, the neural network is able to provide a fair estimate of their astrophysical parameters. Our findings are in good accord with the expectation that it is hard to estimate the individual spins of BBH mergers whose mass-ratios are near the edges of the signal manifold, i.e., q ∼ 1 and q ∼ 8. This is expected given the symmetry of the problem at hand for q ∼ 1, and the fact that the spin of the secondary is difficult to reconstruct for asymmetric mass-ratio BBH mergers. All these results may be interactively perused at [67].
We provide a careful follow up of BBH systems that are not accurately reconstructed by our neural network in Appendix A, which tend to be concentrated at the edges of the signal manifold.
While we have found that our deep learning model performs very well to characterize q ∼ 1 BBH mergers, as shown in the top left panel of Figure 8, we leave to future work the construction of deep learning models that include higher-order modes to explore whether spin reconstruction may be improved for the most asymmetric mass-ratio BBH mergers considered in this study.
Figure 9: Each point in these panels represents the overlap between a signal in the testing data set and its counterpart whose individual spins and mass-ratio are predicted by our neural network model. The mass-ratio slices presented in this figure were randomly selected from the testing data set.

Conclusion
Inferring the spin distribution of BBH mergers will unveil new and detailed information about the formation channels of these objects. It is then timely and relevant to design signal-processing algorithms that are scalable, and which may readily handle a dense sampling of this higher-dimensional signal manifold.
To contribute to this effort, in this paper we introduced a deep learning model to characterize the signal manifold of quasi-circular, spinning, non-precessing BBH mergers. To do this, we densely sampled the (q, s^z_1, s^z_2) parameter space with over 1.5M waveforms produced with the NRHybSur3dq8 model. We then designed and deployed a distributed training algorithm to reduce the training stage from one month, using a single V100 GPU, to just 12.4 hours using 64 V100 GPUs in the HAL cluster at the University of Illinois. The convergence of deep learning and high performance computing proved critical to reduce time-to-insight, and enabled us to test a variety of optimization algorithms that incorporate general relativistic constraints. This approach enabled us to demonstrate that, in the absence of noise, physics-inspired deep learning models can effectively reconstruct the mass-ratio and individual spins of the binary components of BBH systems.
Our study sheds light on the ability of deep learning to characterize waveform signals that span a degenerate signal manifold. For instance, in the case of equal or comparable mass-ratio systems, we showed that our neural network can reconstruct the mass-ratio of these systems with excellent accuracy. However, the intrinsic symmetry of these systems, i.e., the binary components are indistinguishable under m_1 ↔ m_2, reduces the ability of the neural network to provide good estimates of the individual spins of the system. Despite these difficulties, we have found that the signals predicted by our network reproduce the dynamics of the ground-truth signals with excellent accuracy. On the other hand, the neural network can accurately reconstruct the mass-ratio and individual spins for asymmetric mass-ratio systems. Near the edge of the signal manifold, i.e., for q ∼ 8 BBH mergers, the accuracy with which the neural network recovers the spin of the secondary drops. This is expected because the spin of the secondary has a less dominant effect on the overall dynamics of the waveform signal. We leave to future work a systematic assessment of the importance of including higher-order modes to better constrain the spin of the secondary.
In summary, these results represent a significant step towards the design of scalable, physics-inspired neural network models that may be used to infer the spin distribution of actual GW observations. This analysis will enable a firm understanding on the impact of real GW noise when deep learning is used to infer the spin distribution of BBH mergers. An analysis of this nature will be presented in a forthcoming publication.

Acknowledgements
AK and EAH gratefully acknowledge National Science Foundation (NSF) awards OAC-1931561 and OAC-1934757. We are grateful to NVIDIA for donating several Tesla P100 and V100 GPUs that we used for our analysis. AD acknowledges support from the National Center for Supercomputing Applications (NCSA) Students Pushing INnovation (SPIN) program. The authors thank Dawei Mu and Chit Khin for their superb support using the HAL cluster and the ICCP, respectively.
This work utilized resources supported by the NSF's Major Research Instrumentation program, the Hardware-Accelerated Learning (HAL) cluster, grant OAC-1725729, as well as the University of Illinois at Urbana-Champaign.
This work made use of the Illinois Campus Cluster, a computing resource that is operated by the Illinois Campus Cluster Program (ICCP) in conjunction with the NCSA and which is supported by funds from the University of Illinois at Urbana-Champaign.

Appendix A. Overlap distribution
We have conducted a detailed analysis of the regions of parameter space where our neural network model does not perform optimally. By the numbers, the test data set has 197,516 waveforms. Our neural network predicts signals whose overlap with the corresponding ground-truth signals in the test set follows the distribution presented at a glance in Figure A.11. We have also found that if we excise the edges of the signal manifold, i.e., we consider the ranges q ∈ [1, 8] and s^z_{1,2} ∈ (−0.8, 0.8), then most of the low overlaps are removed. The reader may conduct a more detailed inspection of these results at [67]. Furthermore, as mentioned in the main body of this article, it is worth exploring whether the inclusion of higher-order modes may improve the performance of this model for asymmetric mass-ratio BBH mergers. Such a study will be pursued in the near future.