Calibrating gravitational-wave detectors with GW170817

The waveform of a compact binary coalescence is predicted by general relativity. It is therefore possible to directly constrain the response of a gravitational-wave (GW) detector by analyzing a signal's observed amplitude and phase evolution as a function of frequency. GW signals alone constrain the relative amplitude and phase between different frequencies within the same detector and between different detectors. We analyze GW170817's ability to calibrate the LIGO/Virgo detectors, finding a relative amplitude calibration precision of approximately $\pm20\%$ and relative phase precision of $\pm15^\circ$ (1-$\sigma$ uncertainty) between the LIGO Hanford and Livingston detectors. Incorporating additional information about the distance and inclination of the source from electromagnetic observations, the relative amplitude of the LIGO detectors can be tightened to $\sim\pm15\%$. We investigate the ability of future events to improve astronomical calibration. By simulating the cumulative uncertainties from an ensemble of detections, we find that with several hundred events with electromagnetic counterparts, or several thousand events without counterparts, we reach percent-level astronomical calibration. This corresponds to $\sim$5-10 years of operation at advanced LIGO and Virgo design sensitivity. It is to be emphasized that direct {\em in-situ}\/ measurements of detector calibration provide significantly higher precision than astronomical sources, and already constrain the calibration to a few percent in amplitude and a few degrees in phase. In this sense, our astronomical calibrators only corroborate existing calibration measurements. Nonetheless, astrophysical calibration may become an important corroboration of existing calibration methods, providing a completely independent constraint of potential systematics.


I. INTRODUCTION
Gravitational-wave (GW) astronomy has become an important new discipline within modern astrophysics. In particular, the LIGO/Virgo detection of the coalescence of two neutron stars (GW170817; [1]) constrained the supra-nuclear equation of state within neutron star cores [2][3][4] and multi-messenger observations of the associated electromagnetic (EM) counterparts in the host galaxy NGC 4993 (GRB170817a [5] and AT 2017gf0 [6]) informed GRB and kilonova phenomenology (see, e.g., [7,8]) and provided a direct standard-siren measurement of the Hubble constant (H 0 ; [9]). Although the full breadth of the science enabled by the advanced LIGO [10] and Virgo [11] detectors is too long to enumerate here (see [12,13] for summaries), we note that these discoveries were possible only with both accurate and precise interferometric calibration [14][15][16]. Because most, if not all, inferences made with GW data compare the observed signal to theoretical predictions, it is vital that the recorded data is correctly calibrated. Errors in calibration of the detectors will result in errors in derived parameters, including distance (due to amplitude errors) and sky position (due to relative timing, phase, and amplitude errors). Indeed, accurate high-precision calibration is an essential component of GW astronomy, and in what follows we focus on an independent method for characterizing the precision of GW instruments.
Calibration uncertainty in the current network of GW detectors was limited to a few percent in amplitude and a few degrees in phase during the first two observing runs [14][15][16]. While the uncertainty in detectors' responses has roughly improved with their sensitivity to date, existing calibration techniques are quickly approaching their fundamental limits. These techniques are based upon either phyiscally displacing the detectors' test masses, or causing an equivalent displacements, by independent "first-principles" systems with verifiable accuracy and high precision. Many such fundamental references have been used, including the primary laser's wavelength, as was done for initial LIGO [17] and advanced VIRGO [16], auxiliary laser systems currently used by advanced LIGO, such as photon calibrators [18], and corroborating techniques using the detectors' frequency stabilization system [14,19], or local gravity gradient generators [20]. Each of these systems are used either directly or indirectly to induce strain within the detector and thus measure the uncertainty in the detector's response. Each is fundamentally limited by the uncertainty in its reference.
There exists another, completely independent, way to induce strain on a GW detector: astrophysical sources. If we assume that the theory of gravity is known to sufficient precision (i.e., general relativity accurately describes our sources and GW propagation), we can use observed astrophysical sources to infer the calibration of each detector within the global network. The basic idea behind astrophysical calibration is that the amplitude and phase evolution of a detected source is constrained by general relativity. By carefully analyzing the detected waveform, it is possible to directly measure how the detector responds to impinging gravitational waves. There are a range of potential constraints, including i) the relative phase and amplitude, as a function of frequency, for a single detector, ii) the relative phase and amplitude between multiple detectors, and iii) the absolute amplitude and phase calibration of a detector network. We explore all of these in turn. We emphasize that the remarkable accuracy and precision of existing in-situ calibration techniques are essential to LIGO/Virgo science (see, for example, [2,3,9,21]). Astronomical calibration will primarily provide an independent corroboration of these techniques.
This idea was first proposed in a prescient paper by Pitkin, Messenger, and Wright [22]. They focused on the possibility of constraining the overall amplitude calibration, and argued that a BNS within 100 Mpc, with an EM counterpart that somewhat optimistically determines the distance to the source exactly and places a strong prior on the inclination, would constrain the amplitude response to 10%. We extend their work, focusing on both absolute and relative responses of GW detectors. In addition, we generalize their approach, and consider both amplitude and phase errors, and incorporate possible frequency dependence. Furthermore, we apply our approach to real data, deriving constraints on systematic errors from the observation of GW170817. For GW170817, we apply conservative priors based on EM observations and find similarly looser constraints on detector calibration. Finally, similar to [22], we also perform a projection of future constraints.
We describe our formalism in §II, and then derive calibration constraints from the observation of GW170817 in §III. We also describe the key features of astronomical calibration and how these depend on the nature of the sources (e.g., GW170817 was near a null in Virgo's antenna response; more favorable source locations may result in better polarization measurements, and therefore better inclination constraints). We estimate how astronomical calibration scales with an ensemble of detections in §IV and conclude in §V.

II. FORMALISM
We assume additive stationary Gaussian noise (n) in each detector and model the data recorded (d) as a function of frequency as where δA is the amplitude calibration error, δψ the phase error, F +,× are the antenna response functions, 1 and h +,× the astrophysical strain incident on the detectors. The strain incident on the detector can be written as where D L is the luminosity distance to the source, θ jn the angle between the source's total angular momentum and our line of sight, φ o the orbital phase at coalescence, and h the intrinsic waveform generated by the source. If the recorded data's calibration is correct, then δA = δψ = 0. Calibration errors are modeled as a function of frequency by spline interpolation, as implemented by LALInference [26,27]. Anchor points are logarithmically spaced in frequency separately for both amplitude and phase errors, and the interpolation is performed with the log-frequency as the abscissa. The values from the posterior for the calibration errors in each detector at these anchor points are sampled simultaneously with the rest of the source's parameters. This spline approach, albeit with significantly tighter priors on the calibration uncertainty derived from direct measurements of the detector response, has been used extensively in LIGO/Virgo analyses (e.g., [2,3,9,21]). As we detail below, there are a number of challenges associated with using GW signals to calibrate GW detectors. Measurement uncertainties between calibration errors at different frequencies are correlated within each detector and between detectors by the fact that all detectors observe the same signal. In particular, D L and θ jn correlate with the absolute amplitude and φ o correlates with the absolute phase. Inference based on GW data alone can only constrain the relative amplitude and phase, both between different frequencies in each detector and between detectors.
Furthermore, due to the three-fold degeneracy between δA, D L , and θ jn , using EM constraints for either D L or θ jn alone does not significantly reduce the posterior uncertainty for δA; only when EM constraints for both D L and θ jn are imposed do we see a significant increase in δA's precision. Furthermore, EM information constrains the absolute calibration amplitude uncertainty, and therefore the amplitude uncertainty in each detector can be separately bounded. Constraints on D L and θ jn do not significantly affect the uncertainty in δψ, much like they do not affect inference of other intrinsic parameters [28]. This is because EM data does not observe the overall phase, whereas it does observe the overall amplitude.

III. CALIBRATION FROM GW170817
We re-analyze GW170817 with loose priors on the detectors' calibration. Using LALInference [26] (publicly available within LALSuite [27]), standard powerspectral-density estimation techniques, and a TaylorF2 frequency-domain waveform that includes the effects of aligned spins and linear tides (see the review in [29]), we investigate the extent to which the astrophysical strain from a reasonably well-known source can be used to calibrate our detectors.
Previous analyses used priors based on direct measurements of detector calibrations [15] with careful quantification of the uncertainty in both δA and δψ as a function of frequency [2,3,9,21]. Like this work, these analyses model the calibration uncertainties as a function of frequency via spline interpolation between logarithmically spaced anchor points. Priors for δA and δψ at each anchor point are independent Gaussians with means and standard deviations taken from the measured uncertainties, typically corresponding to uncertainties of a few percent in amplitude and a few degrees in phase (see Figure 2). As we will show, these in-situ calibration measurements are significantly more precise than the constraints inferred from GW170817.
In addition to analyzing GW170817 with directlymeasured calibration uncertainties, we impose independent zero-mean Gaussian priors with standard deviations significantly wider than would ever be encountered in science-quality data. To wit, we assume loose a priori 1-σ calibration precision of 50% in amplitude and 45 • in phase. As we will see, GW170817 constrains the precision to be roughly a factor of two better a posteriori, and does so with a non-trivial frequency dependence.
While the detection of the EM counterpart to GW170817, AT 2017gfo, determines the source's location to high precision, we find that the three-interferometer LIGO+Virgo network already provides sufficient precision in GW170817's right ascension (α) and declination (δ) such that additional refinements in (α, δ) do not significantly affect our results. That is to say, the threedetector network limits the source's location to a small enough region that the antenna response functions do not vary significantly over the posterior's support. Therefore, the impact of refined localization on the projected strain within each detector is relatively small. This was observed empirically, as the posteriors for our calibration parameters were virtually unchanged when we fixed (α, δ) to the known values for NGC 4993 [6]. Indeed, previous studies have shown that precise localization does not necessarily improve inference of other parameters [28].
We find interesting constraints a posteriori only when the projected strain in a given detector is above the noise floor (∼40-300 Hz, see Figure 1 of [1]). The calibration at high or low frequencies remains uninformed by the data. This is expected, since the additional freedom allowed by frequency-dependent calibration uncertainty gives the signal model less constraining power. Our infer-ence therefore acts like an unmodeled transient search, in which coherence between detectors and projected strains larger than the noise determine where the data is most informative.
The combination of GW and EM data for a given source allows for qualitatively different calibration constraints. As mentioned in the previous section, the degeneracy between δA, D L , and θ jn , means that GW data alone cannot inform the absolute amplitude calibration. Instead, knowledge of the signal's frequency and amplitude evolution informs the relative amplitude calibration, as shown in Figure 1 and Appendix A. 2 Conversely, EM constraints on both D L and θ jn primarily inform the absolute amplitude, typically limited by uncertainty in H 0 from D L and relativistic ejecta dynamics and energetics for θ jn , which acts orthogonally to the GW data. Figure 1 shows how both GW and EM data combine to constrain δA. Figures 2 and 3 show the constraints on amplitude and phase calibration in the two LIGO detectors from GW170817. 3 We show several posteriors • (GW): GW data alone with loose calibration priors. We assume standard (uninformative) priors for all intrinsic and extrinsic parameters. In particular, we assume sources are uniformly distributed in volume ignoring local cosmological effects: • (GW+D L ): loose calibration priors and the source's location constrained by the approximate uncertainty in NGC 4993's location. We assume conservative bounds of D L ∈ [32, 50] Mpc with p(D L ) ∝ D 2 L .
These figures elucidate several important points. We find that while EM constraints impact the amplitude uncertainty, δA, all posterior processes for δA assuming loose calibration priors are significantly larger than the directly-measured calibration uncertainties. In addition, (left) Toy model showing contraints obtained with only GW data with our prior (blue), likelihood (green), and posterior (red). The GW likelihood only constraints the relative amplitude (the difference) but not the overall amplitude (the sum). (right) The same toy model showing how EM constraints (yellow) on DL and θjn can further constrain the amplitude. Knoweldge of both DL and θjn primarily constrains the overall sum. This behavior is born out in Figure 4. See Appendix A for more details.
the EM constraints do not meaningfully affect the posterior uncertainty estimation of δψ. However, EM constraints do affect the inferred source-frame chirp mass; this is because it depends on both the detector frame chirp mass and D L . In this sense, we find astronomical calibration is, in practice, only a corroboration of in-situ, directly measured calibration.
Because all detectors observe the same signal, uncertainties in the parameters of that signal manifest as correlations between neighboring frequencies in the calibration model, both within a single detector and between detectors. We do not observe measurable correlations between the amplitude and phase uncertainties, although each separately correlates with a few signal parameters.
We first focus on the amplitude calibration. δA primarily correlates with D L and θ jn because the amplitude observed by each detector is a combination of all three (Eqs. 1, 2, and 3). δA tends to move coherently at all frequencies, with larger δA corresponding to larger D L or smaller cos θ jn . This degeneracy means that GW data alone constrains only the relative (frequency-dependent) calibration; constraints on the absolute (frequency-independent) calibration remains dominated by our prior. We note that precise knowledge of either D L or θ jn is insufficient to improve the posterior constraints on δA. Instead, we only see significantly tighter δA constraints when we include knowledge of both D L and θ jn from EM observations. This is because GW170817 was primarily witnessed by the two LIGO de-tectors, and they are effectively sensitive to only a single polarization because they are nearly aligned. This means that the LIGOs alone cannot constrain the inclination significantly, and the uncertainty in θ jn is large enough that, even with relatively precise knowledge of D L , we still observe large posterior uncertainty for δA. Similarly, knowledge of θ jn without knowledge of D L means the uncertainty in D L limits the precision of our inference of δA. More favorable source locations, in which more than two detectors witness the signal with large signal-to-noise ratios, may allow the GW data alone to constrain θ jn . In this case, we would only rely on EM measurements of D L (alternatively the redshift, assuming H 0 is well known), which likely have smaller systematic errors than the associated constraints on θ jn from GRB and kilonova modeling (see, for example, [33][34][35][36][37]).
Nonetheless, even with GW data alone, GW170817 provides non-trivial posterior constraints on the relative δA between ∼40-300 Hz, with roughly ±20% precision (see Fig. 3). The precision of the absolute amplitude calibration is similar to the relative precision for GW170817, although the absolute calibration is likely strongly influenced by our prior. We re-emphasize that the directlymeasured calibration priors are significantly tighter, and we do not learn anything about the detector calibration if we use those measurements as priors regardless of which EM constraints are imposed. Figure 4 shows the marginal and joint posterior distributions for the amplitude calibration uncertainties at four representative fre- quencies. The amplitude calibration posteriors' shapes and the imposed constraints can be understood through the toy model in Figure 1.
We now turn to the phase calibration. While δψ does not correlate with extrinsic parameters or δA, it does correlate with M. In contrast to δA, which tends to move in the same direction at all frequencies, δψ tends to rotate around the "best measured" phase at the detectors' most sensitive frequency (∼100 Hz). This is because M determines the rate at which the signal's phase changes, and the signal's phase is reasonably well measured at that frequency. Therefore, if M changes, the signals phase rotates around the well measured point and δψ changes to match the observed data; note the inverted correlations between M and δψ(54 Hz) and M and δψ(430 Hz) in Fig-ure 5. While neighboring frequencies are still highly correlated, we additionally observe slight anti-correlations between δψ(54 Hz) and δψ(430 Hz) whereas we only see positive correlations between δA(54 Hz) and δA(430 Hz). The in-situ calibration uncertainties are tight, and the likelihood is flat across their entire support. We do not further constrain δψ with GW170817 when using the official priors. Again, while the GW data primarily constrains the relative phase calibration, we note that EM observations do not constrain φ o and therefore do not constrain the absolute phase further.
We also note that the calibration uncertainties in each detector could be composed of several distinct components. For example, each LIGO PCAL system relies on photodiodes calibrated against the same NIST-traceable  Figure 2. We note that, unlike in Figure 2, additional constraints on DL and θjn only marginally impact our ability to measure the relative calibration between detectors.
"gold standard" [18]. Therefore, miscalibration within the gold standard would manifest as calibration systematics shared across both detectors. We decompose the calibration error in each detector into separate components (e.g. δA H,tot = δA + δA H , where δA is shared between all detectors and the δA H , δA L , etc. are independent; see Appendix B for more details). Generally, the additional model freedom associated with marginalizing over independent components in each detector to constrain the shared uncertainty inflates the uncertainty relative to purely independent calibration errors, 4 and 4 δAtot is composed of the sum of multiple terms and the data only constrains the sum; uncertainty in any one of the terms will correspond to larger marginal uncertainty in the others.
GW170817 constrains the shared absolute amplitude uncertainty to ±25% and the absolute phase to ±27 • when assuming identical, independent zero-mean Gaussian priors on both the shared and independent components. Interestingly, the precision of the shared parameters appears to be much less affected by EM constraints imposed on D L and θ jn .

IV. CALIBRATION FROM FUTURE EVENTS
GW170817 serves as a useful proof-of-principle of an astronomical calibrator. We note that, while it has the largest signal-to-noise ratio (ρ) of any GW source detected to date, it still produces constraints more than an order of magnitude worse than the official calibration uncertainties. Competitive astrophysical calibration will only come from combining information from multiple events-as each event provides marginally more information, the cumulative effect will reduce the overall calibration uncertainty. In particular, this may identify unknown or unexpected calibration systematics.
To investigate this further, we adopt simple models for how astronomical calibration uncertainties will scale with additional events. The calibration likelihoods are nearly Gaussian, so we combine multiple events by multiplying Gaussian likelihoods along with a prior. When we have only GW data, we will only constrain the relative calibration, and the uncertainty is expected to scale as (see Appendix A): where the cumulative likelihood is determined by summing the likelihoods from each event in quadrature: and σ prior is the a priori uncertainty for each δA separately. The marginal uncertainty on the absolute calibration, on the other hand, should scale as In the limit of many events, the relative calibration precision will scale as σ ∝ 1/ √ N whereas the marginal absolute calibration will still be prior dominated.
For each event we draw ρ i from a distribution ∼ ρ −4 [38] and require ρ i ≥ 12. We draw many independent events, stack them, and repeat this process for many trials. Using parameters characteristic of our loose calibration priors and the amplitude constraints obtained from GW170817, Figure 6 shows that we will need only 10-20 events before we constrain the relative amplitude calibration uncertainty to less than 10%, but 1000-2000 events before we reach 1%.  Figure 2. Contours represent 50% and 90% credible regions. We note that all δA are correlated, although the correlations' strength decreases as the frequency separation increases. We also note that the δA constraints are essentially unchanged when we only include EM information for DL. Only EM constraints for both DL and θjn improve the precision of δA.
If we assume EM constraints comparable to those from GW170817, the nature of the scaling changes. We expect the joint constraints from both GW and EM data to meaningfully constrain both the relative and absolute amplitude calibration, so the cumulative uncertainty for the amplitude uncertainty would then separately scale as In this case, we need 400-600 events to reach 1% calibration uncertainty.
We note that these cumulative measurements will not be dominated by a single event. For a single, loud event to constrain δA to percent-level uncertainty, it would need to be ∼10 times closer than GW170817 assuming comparable EM constraints and similar GW detector sensitivity. This means the source would need to be within  Figure 2. Contours show the 50% and 90% credible regions. Note that EM information about DL and θjn does not affect δψ significantly. Furthermore, we observe slight negative correlations between δψ(54 Hz) and δψ(430 Hz) and a reversal in the correlations between M and δψ, suggesting the phase errors rotate around the "best measured frequency" (∼100 Hz).
∼4 Mpc, approximately the radius of the local group.
Following the approach in Section III, we decompose the calibration uncertainty into several pieces. Because the LIGO gold standard photodiodes are only periodically recalibrated, any systematic due to miscalibration of the gold standard would be shared between the LIGOs and could persist for several detections. As we saw previously, marginalizing over the independent components of the calibration uncertainty typically only increases the uncertainty in the shared parameters, but we expect the same scalings to hold with a larger σ o (Eq. 5). Based on the values found for GW170817, it will take roughly 50% more events to constrain a calibration parameter shared between detectors while marginalizing over independent components to the same precision as purely independent calibration errors. This rule of thumb should hold for FIG. 6. Expected scaling of the cumulative amplitude uncertainty with a population of events. Each curve shows one realization of our simulation assuming parameters consistent with our wide calibration priors: (light blue: GW) relative calibration uncertainty with no EM constraints and (purple: GW+DL+θjn) absolute uncertainty with both GW and EM constraints.
both relative calibration uncertainty (with or without EM data) and absolute calibration (with EM data).

V. DISCUSSION
Having established that astrophysical calibration of GW detectors is indeed possible and determining the approximate number of events needed before this method will be competitive with existing calibration schemes, the obvious question remaining is how long it will take to detect that many events. Based on current rate estimates, the advanced LIGO and Virgo detectors are expected to detect 4-80 BNS and several hundred BBH per year at design sensitivity [39]. While the uncertainties on these detection rates are quite broad, this suggests ∼ 6 years of BNS detections (assuming each has an informative EM counterpart) and nearly a decade of BBH detections (only constraining relative calibration uncertainty assuming GW data alone) before astrophysical calibration can constrain amplitude uncertainty below 1%. It is therefore possible, although not altogether likely, that astrophysical calibration could become competitive within the LIGO facilities' lifetimes. Even at that point, though, it is quite likely that astronomical calibration will only corroborate existing in-situ techniques, which may become more precise themselves.
It's worth noting, however, that the gold standard photodiodes used in the LIGO PCALs are recalibrated approximately once per year [40]. If astrophysical calibra-tion were to constrain unknown systematics within those photodiodes, we would only be able to use the detections made between photodiode recalibrations. This implies astrophysical calibration could only ever constrain amplitude uncertainties to somewhere between 2-15%.
We note that better models of possible calibration errors, rather than simple spline interpolation, could prove more constraining. This could also provide informative constraints when the signal is below the noise, such as is the case above ∼ 300 Hz for GW170817. However, this is not our primary goal. Instead, the model freedom associated with spline interpolation should allow astrophysical calibration to detect a more general class of calibration errors, not just those described by a particular model. Indeed, it is possible, although not certain, that astrophysical calibration could be the only way to detect such deviations if they manifest within the calibration of LIGO's gold standard photodiode.
Confirmation binaries expected for the Laser Interferometer Space Antenna (LISA, [41]) are in many ways analogous to the astronomical calibrators discussed here, and our approach could be used to construct a similar calibration for space-based gravitational-wave detectors. However, LISA's calibration is expected to be significantly better than that of ground-based detectors, since it is fundamentally limited by the ability to isolate the test masses, and this has already been established by LISA Pathfinder to work well past the 1% level [42,43].
Finally, we turn our attention to the implications for H 0 measurements with GW standard sirens. GW170817 and AT 2017gfo produced the first such measurement: H 0 = 70 +12 −8 km s −1 Mpc −1 [9]. While the precision is remarkable given it is based on a single event, it is not yet precise enough to resolve the percent-level discrepancy between the Planck [44] and SHoES [45] results. Similarly, recent work constrains H 0 using a statistical approach in the absence of EM counterparts [46]. Typically, the statistical approach will provide weaker constraints than events with EM counterparts [47]. However, the precision of both methods will be fundamentally limited by our ability to precisely calibrate GW detectors.
Unfortunately, we will not be able to simultaneously calibrate the absolute amplitude response of our detectors and infer H 0 because H 0 is degenerate with D L (assuming a fixed redshift) and therefore the absolute amplitude uncertainty within our detectors. Indeed, the EM constraints we applied in our analysis of GW170817 are primarily limited by uncertainty in H 0 . This holds for BBH without EM counterparts and BNS with electromagnetic counterparts, as we've shown that knowledge of only θ jn is not enough to constrain the absolute calibration.
Nonetheless, we will be able to constrain the frequencydependent relative amplitude and phase calibration of our detectors regardless of our prior knowledge of H 0 . While it is unlikely astrophysical astrophysical constraints will ever surpass direct in-situ measurements, they could provide a meaningful independent corrobora-tion with uncertainties only a factor of a few wider within the advanced detectors' lifetimes.

ACKNOWLEDGMENTS
The authors are thankful for many useful conversations with Jeff Kissel and the entire the LIGO Scientific Collaboration Calibration Group. We also acknowledge dis-cussions with Curt Cutler and Pete Bender about LISA calibration. R.E. and D.E.H. are supported at the University of Chicago by the Kavli Institute for Cosmological Physics through an endowment from the Kavli Foundation and its founder Fred Kavli. D.E.H. is also supported by NSF grant PHY-1708081, and gratefully acknowledges the Marion and Stuart Rice Award. The authors also gratefully acknowledge the computational resources provided by the LIGO Laboratory and supported by NSF grants PHY-0757058 and PHY-0823459.