iDQ: Statistical inference of non-gaussian noise with auxiliary degrees of freedom in gravitational-wave detectors

Gravitational-wave detectors are exquisitely sensitive instruments and routinely enable ground-breaking observations of novel astronomical phenomena. However, they also witness non-stationary, non-Gaussian noise that can be mistaken for astrophysical sources, lower detection confidence, or simply complicate the extraction of signal parameters from noisy data. To address this, we present iDQ, a supervised learning framework to autonomously detect noise artifacts in gravitational-wave detectors based only on auxiliary degrees of freedom insensitive to gravitational waves. iDQ has operated in low latency throughout the advanced detector era at each of the two LIGO interferometers, providing invaluable data quality information about each detection to date in real-time. We document the algorithm, describing the statistical framework and possible applications within gravitational-wave searches. In particular, we construct a likelihood-ratio test that simultaneously accounts for the presence of non-Gaussian noise artifacts and utilizes information from both the observed gravitational-wave strain signal and thousands of auxiliary degrees of freedom. We also present several examples of iDQ’s performance with modern interferometers, showing iDQ’s ability to autonomously reproduce known data quality monitors and identify noise artifacts not flagged by other analyses.

Gravitational-wave detectors are exquisitely sensitive instruments and routinely enable groundbreaking observations of novel astronomical phenomena. However, they also witness non-stationary, non-Gaussian noise that can be mistaken for astrophysical sources, lower detection confidence, or simply complicate the extraction of signal parameters from noisy data. To address this, we present iDQ, a supervised learning framework to autonomously detect noise artifacts in gravitational-wave detectors based only on auxiliary degrees of freedom insensitive to gravitational waves. iDQ has operated in low latency throughout the advanced detector era at each of the two LIGO interferometers, providing invaluable data quality information about each detection to date in real-time. We document the algorithm, describing the statistical framework and possible applications within gravitational-wave searches. In particular, we construct a likelihood-ratio test that simultaneously accounts for the presence of non-Gaussian noise artifacts and utilizes information from both the observed gravitational-wave strain signal and thousands of auxiliary degrees of freedom. We also present several examples of iDQ's performance with modern interferometers, showing iDQ's ability to autonomously reproduce known data quality monitors and identify noise artifacts not flagged by other analyses.

I. INTRODUCTION
Gravitational-wave (GW) detectors, like the advanced LIGO [1] and Virgo [2] interferometers (IFOs), are exquisitely sensitive machines. This sensitivity requires complex control schemes to isolate the instruments from their surroundings [3,4] and detailed calibration to infer the correct astrophysical strain incident on the detectors [5]. Their success, including the first direct detection of GWs [6], the now routine detection of binary black hole coalescences [7], and the detection of coalescing neutron stars [8,9], which enabled ground-breaking multi-messenger observations [10][11][12][13], is due to a combination of the detectors' sensitivity and advanced signal processing techniques.
However, several sources of noise still limit the detectors' sensitivity. The most fundamental is stationary Gaussian noise [14], which can be completely characterized by a power spectral density (PSD) and describes the detectors' behavior reasonably well most of the time. Another common noise source, referred to as non-Gaussian noise transients (colloquially termed glitches [15]), manifests as bursts of excess power in the detectors above and beyond what would be expected from stationary Gaussian noise alone. Because this is also the hallmark of a GW signal, non-Gaussian noise transients can be mistaken for real GW signals if they occur simultaneously in multiple detectors and currently limit searches' sensitivity to many expected astrophysical signals (e.g., [7,16,17]). Throughout the advanced detector era, an extensive zoology has been developed to categorize and mitigate the impact of non-Gaussian noise transients. This includes examining the morphology of the noise transients themselves in GW strain data as well as searching for correlations between the noise transients and other degrees of freedom that are not sensitive to GWs. Many other works have explored the former [18][19][20]. We focus on the latter.
Information from the detectors is recorded in a set of discretely sampled timeseries, referred to as channels. These channels observe many different degrees of freedom within the interferometers, including control signals used to stabilize the device [3,4,21,22] and monitors of the physical environment [23]. In total, there are more than 2 × 10 5 channels recorded at each LIGO detector, although only O (10 4 ) are sampled at frequencies high enough to be within the detectors' sensitive band. Typically the auxiliary features used within iDQ are derived from this subset of channels, which are sampled at 256 Hz. Each channel may witness a variety of possible noise sources, some of which may couple to the measurement of GWs and some of which may not. Because of the large number of channels, it is impractical to measure the couplings between all of them directly via targeted injection campaigns. Instead, we rely on statistical correlations between channels to infer the noise's source. If channels insensitive to GWs routinely glitch in close proximity to non-Gaussian transients within the GW channel, we infer that the transients in the GW channel are mostly likely due to terrestrial noise rather than being of astrophysical origin. Fig. 1 depicts a probabilistic graphical model representing this inference.
iDQ [24,25], a statistical framework for this inference, has operated throughout the advanced detector era and continues to provide robust, real-time measures of correlations between thousands of degrees of freedom within each detector and non-Gaussian noise in the GW channel. The speed and reliability of this information has proven invaluable for several GW detections (e.g., GW170817 [8] and examples in Section V). With the expected increases in detector sensitivity and corresponding elevated detection rates over the next few years [26], real-time data quality information will only become more important.
iDQ, first described in Ref. [27], was developed as an extension of Ref. [28] in preparation for the first observing run, which began in September 2015. Although there is a long history of algorithmic development in the field, including hierarchical veto application schemes based on approximations of the likelihood ratio [29], the Poisson significance of coincident noise [30], the percentage of time witnesses remove noise [31], and applications of more general machine learning algorithms [28,32,33], many of these algorithms fail to produce probabilistic statements about their predictions. Furthermore, such algorithms did not operate in real-time, a key component of searches for multi-messenger astrophysical events [34,35]. Additionally, GW interferometers are not stationary over long periods of time; the characteristics of the noise change. This means that correlations measured by any particular algorithm at any particular time may not generalize well to data recorded later, complicating the inference process. We note that there is a distinction between non-Gaussian noise and non-stationarity. For example, non-Gaussian noise may be described by stationary Poisson processes. In fact, many algorithms make this assumption [29,30], although it is not guaranteed to be the case.
iDQ provides a framework in which any supervised learning algorithm can be run in real-time. It calibrates their output into statistical statements about the confidence that non-Gaussian noise is present in the GW channel. This is accomplished via two-class classification, amenable to many machine learning algorithms. Additionally, iDQ automatically re-trains and re-calibrates the algorithms to capture non-stationarity within the detectors. This means that iDQ autonomously adapts to new sources of non-Gaussian noise within the detectors, identifying witnesses of previously unseen noise sources and flagging data as problematic without human intervention.
We describe iDQ's formalism in Section II, including our use of supervised learning in Section II A. Section III describes how we structure the inference, including how we construct vectorized representations of a detector's state (Section III B), how we train machine learning algorithms, including ways to extract feature importance from the trained models (Section III C), as well as how iDQ manages cross-validation (Section III D) and calibrates its predictions into probabilistic statements (Section III E). A few examples of iDQ's performance are shown in Section V and possible applications within searches are discussed in Section VI, including a likelihood ratio test based on first-principles noise models which account for our imperfect knowledge of the presence of non-Gaussian noise in our detectors. We conclude in Section VII.

II. FORMALISM
We couch our statistical inference as two-class classification, which we approach within a supervised learning framework. This produces predictions for the presence or absence of non-Gaussian noise. It is worth noting that this is not the only approach, and one could instead attempt to regress the full waveform of the non-Gaussian noise based on auxiliary degrees of freedom (e.g., [36,37]). However, classification is more tractable at this time, and we construct our inference in that framework using a vectorized representation of the detector's auxiliary state.
First, a bit of nomenclature. The target channel h(t) refers to the degree of freedom containing non-Gaussian noise we would like to identify, typically a proxy for the GW channel. This is shown in grey in Fig. 1. Auxiliary channels refer to all other channels. Because GW detectors are complicated devices that require active control, several auxiliary channels may be nearly identical to the target channel (i.e., contain signals derived from h used to control the interferometer). These are referred to as unsafe auxiliary channels, as they are likely to witness real GW signals, and it would be unsafe to use them to construct data quality flags as they could systematically veto real GW signals [31,38]. These are shown in red in Fig. 1. The remaining auxiliary channels, which are not sensitive to GWs, are referred to as safe auxiliary channels a(t), shown in blue in Fig. 1. Safety is typically determined through a series of hardware injections in which excitations are injected into the interferometer to mimic the effect of a real GW. Auxiliary channels which correlate strongly with the hardware injections may similarly correlate with real GW signals and are deemed unsafe. iDQ only uses information from safe auxiliary channels, although labels for supervised learning are derived from h.
To put this more formally, we assume the target channel is composed of three independent components where n is stationary Gaussian noise, s is the astrophysical strain induced in the detector, and g represents non-Gaussian noise artifacts. We can only observe h and therefore model n, s, and g as latent processes which may or may not be correlated with other degrees of freedom.
observed processes latent processes that influence h latent processes that do not influence h astrophysical signals non-Gaussian noise sources non-Gaussian noise sources target channel unsafe auxiliary channels safe auxiliary channels FIG. 1. Probabilistic graphical model representing different sources of noise within GW interferometers. Each circle represents a different process, some of which are observed and some of which are not, and conditional dependencies are represented by directed arrows. We assume that each separate process additionally includes independent stationary additive Gaussian noise, which we omit for clarity. Latent processes, which we cannot observe directly, are shown as unshaded circles. The top row corresponds to processes that can influence the target channel (h), including astrophysical signals (s) and a subset of possible sources of non-Gaussian noise (G1 -GN ). Conditional dependencies from these channels to h (thick arrows) are how non-Gaussian noise couples into the target channel. The bottom row (GN +1 -GN +M) corresponds to possible sources of non-Gaussian noise that do not affect h but may affect auxiliary witnesses. Observed processes are shown in the middle row, which include h (grey shaded circle) as well as N unsafe auxiliary channels (red shaded circles), which may be influenced by s or h (or both), and M safe auxiliary channels (blue shaded circles), which only witness sources of non-Gaussian noise. If we marginalize over the latent processes in this graph, we will introduce conditional dependencies between h and the auxiliary channels as well as between different sets of auxiliary channels. iDQ only uses information from safe auxiliary channels to infer these induced correlations, thereby predicting the presence or absence of non-Gaussian noise in the target channel without observing the target channel itself.
Furthermore, we assume n is stationary over timescales much longer than either s or g and can therefore be completely described by a single PSD. Because we adopt a two-class classification scheme, we must define our classes. We take G to be the union of all possible types of non-Gaussian noise (G i ) without explicitly enumerating each class (as opposed to, e.g., Gravi-tySpy's explicit multi-class classification [20]).
Typically, this is defined by a set of thresholds on h, such as signal-to-noise ratio (ρ) and a frequency range. For instance, one may target only loud, low-frequency noise relevant for high-mass binary black hole searches.
G then consists of all time samples that correspond to h within these thresholds. The complement of G, referred to as C, corresponds to clean times when the detector does not display non-Gaussian noise such that and Furthermore, clean states axiomatically imply g(t) = 0 ∀ t ∈ C, in that there are no non-Gaussian noise transients within clean times.
Because we derive labels based on h instead of g, true signals may also fall within the thresholds defining G. However, the true signal rate is expected to be orders of magnitude less than the rate of non-Gaussian noise artifacts ( 1/day as opposed to ∼ 1/minute), and we do not expect GWs to significantly pollute our training set [39]. Furthermore, because we only use safe auxiliary channels, defined by their insensitivity to GW signals, we expect s to be independent of a, and h containing GW signals is indistinguishable from h without GW signals based on a alone. Nonetheless, one could remove almost all true signals by removing any element of G that is coincident between multiple detectors [40]. This would also accidentally remove elements of G due to g that just happened to be coincident, but we expect the processes producing g to be independent in each detector [41]. Removing such a random subset of G would only decrease our sample size without biasing the training sets. However, because of the additional complexity associated with synchronizing processes running at geographically disparate locations, and the fact that the impact on our training sets is negligible, iDQ does not currently remove elements coincident between detectors from its training set.
Again, this is not the only way to construct the inference. Instead of classification, one could use the fact that ρ measures the probability that Gaussian noise alone could have produced h, and can therefore be used to estimate the probability that a non-Gaussian transient is present. Instead of classifying samples separated by hard thresholds, one could regress ρ directly or use ρ to define weighed training sets. We leave such extensions to future work, but note that similar model comparisons are implicit within our marginal-maximized likelihood ratio test (Section VI B).
iDQ infers the probability of the presence of non-Gaussian noise artifacts (g = 0) within h based on safe auxiliary channels. Specifically, iDQ estimates p G (t) = p(G| a(t)) = p( a|G)p(G) p( a|G)p(G) + p( a|C)p(C) .
using supervised learning to estimate the likelihoods p( a|G) and p( a|C).

A. Supervised Learning as Dimensional Reduction
As described in Section I, GW interferometers monitor a large number of auxiliary degrees of freedom as discretely sampled timeseries recorded at different rates. iDQ represents the information in these auxiliary channels as a set of features extracted from each channel separately and compiled into an array of fixed dimension.
Several example feature extractors are described in Ref. [42,43], and these generally rely on a wavelet decomposition to identify excess power beyond what is expected from stationary Gaussian noise alone. These feature extractors map discretely sampled timeseries into tabular data, such as the frequency, amplitude, and duration of non-Gaussian transients. iDQ constructs highdimensional representations of the detector's auxiliary state based on this tabular data, typically recording O (5) features for each of O(10 3 ) auxiliary channels. Although efforts to extract better feature sets are on-going [44], iDQ implicitly assumes that features extracted in this way from each channel are sufficient statistics. Indeed, the wavelet decompositions at the core of many feature extractors form overcomplete bases and contain all information available in the original channel. There is also evidence that the precise algorithmic details of the feature extractor may not significantly impact the overall inference (see Section 6.2 of Ref. [45]).
The Neyman-Pearson lemma [46] states that an optimal classification scheme orders samples by their likelihood ratio However, we do not know the functional form of the likelihoods a priori and must estimate them from observed samples. Compounding this, the dimensionality of a can be very large, typically O(10 4 ) or more. This drives us to supervised machine learning as a way to approximate Λ G C and reduce the dimensionality to something tractable. We therefore consider the main product of supervised learning algorithms, referred to as classifiers, to be a map from the high dimensional input space to the unit interval M( a) : R N 1 → R ∈ [0, 1]. The precise functional form of the map is determined by the details of the algorithm and is unimportant for the rest of the inference, but we expect elements of G to be mapped to values near 1 and elements of C to be mapped to values near 0. We then construct a likelihood ratio in this lower dimensional space for each classifier separately, de facto estimating Estimates of p(M( a)|G) and p(M( a)|C) are obtained by evaluating labeled samples with trained classifiers and directly modeling the resulting distributions (Section III E and Appendix A). Optimal classifiers will order samples according to Λ G C , so we expect Λ G C to be monotonic in the classifier's output M( a). Therefore iDQ also calculates the efficiency and false alarm probability as cumulative conditioned likelihoods integrated over classifier predictions efficiency = P (M( a) ≥ r|G) These cumulative statistics define the receiver operating characteristic (ROC) curve for a classifier, the standard metric for classification performance. Likelihood ratio tests optimize the efficiency at all false alarm probabilities.
We also note that iDQ can run multiple classifiers in parallel over the same data. Because each classifier produces a different map, and therefore may be able to better identify different subsets of glitches, we should be able to extract more information by combining classifiers. De facto, this would amount to using supervised learning to map a very high-dimensional space into a more manageable size, which may be amenable to direct modeling of the likelihood. This is the case for one-dimensional output from a single classifier. Combining a handful of classifiers should not pose a more complicated conceptual issue, although running enough classifiers in parallel may require us to use supervised learning again to model the joint likelihood over their output. This type of boosted classifier, previously explored within iDQ [27], would again reduce the problem to likelihoods defined over a one-dimensional space.

III. DECOMPOSITION OF THE STATISTICAL INFERENCE
iDQ divides the workflow into several asynchronous processes which communicate to share updated models and calibration. As Figure 2 shows, features are generated for each IFO and retrieved by various processes (Section III A). Vectorized representations of the detector's auxiliary state are constructed as needed in each process (Section III B). Training (Section III C) produces models for each classifier, which are then used in both evaluation (Section III D) and timeseries production (Section III F). The evaluated output is used to calibrate the model (Section III E), and the resulting map transforms the lowlatency predictions made during timeseries production into probabilistic measures with associated uncertainties. These measures, such as p G , can then be ingested by GW searches in real-time.
A. Data Discovery iDQ relies upon an external source of features, typically taken to be tabular data denoting the location and properties of non-Gaussian transients in a set of channels.
Although not strictly necessary (e.g., iDQ could ingest the raw timeseries directly from the detectors), we find that this preprocessing efficiently extracts the features relevant for our classification problem. Figure 2 represents these feature streams as directed arrows exiting the feature extractor and entering the iDQ workflow. Pragmatically, these are implemented as abstractions that manage data discovery and produce a consistent tabular output format regardless of the features' source, thereby simplifying any client interactions throughout the pipeline.
Most feature extractors operate in one of several wavelet domains, extracting excess power as collections of time-frequency pixels. Common choices are the Haar wavelet transform and the Q-transform [42]. However, the precise form of the feature extractor is unimportant beyond the fact that different wavelet transforms are able to better resolve different aspects of non-Gaussian noise transients. What's more, not all feature extractors produce the same set of features, although all provide some measure of the transient's central time, duration, frequency content, and amplitude or significance (typically measuring how rare the transient would be in stationary Gaussian noise). Ref. [28] explored the relative importance of features in the Haar domain, finding the relative time offset and significance to be most important.
Samples from G and C are identified based on the features present in h. Specifically, any transient which meets the criteria for G (see Section II) is recorded as a target time. Samples from C, called random times, are drawn according to a Poisson process in stretches of data sufficiently far away from target times. These clean segments are defined by another set of thresholds on h, which are typically chosen to be slightly looser than the bounds defining target times. This creates an effective buffer between samples in G and C, which is believed to avoid threshold effects and improve classifiers' ability to distinguish the sets. At the end of the day, any time segment not declared clean is considered dirty (i.e., t ∈ G), and this dirty time is accounted for within the calibration's prior odds (Section III E).
Additionally, iDQ gathers IFO-state information from a remote database [47]. Segments produced outside of iDQ record high-level state information about the detectors, such as when the IFOs record science-quality data, but without fine enough temporal resolution to flag subsecond non-Gaussian noise transients. iDQ polls the database for such segments, filtering the training and evaluation samples to retain only times within sciencequality data. This may not be strictly necessary in all cases and can introduce additional latency before segments are available. Segment information is, therefore, optional (directed arrows in Figure 2 are dashed instead of solid), but is almost always used in practice except during low-latency timeseries production. Timeseries production only applies existing models and calibration to IFO data. Therefore, it does not care whether the detectors are currently recording science-quality data.

B. Feature Vector Construction
Given a stream of features, we must still determine which features to use. Most, if not all, supervised classification schemes require consistent dimensionality in the input feature space ( a ∈ R N with fixed N ). Therefore iDQ must downselect features if too many auxiliary tran- sients are nearby or fill in default values if no auxiliary transients are available. Mapping the streams of features from many auxiliary channels into an array of fixed dimension is referred to as vectorization, and is carried out on-the-fly as needed within the pipeline.
Although other types of features have been investigated in the literature, such as averaging over small neighboring time windows [33], iDQ implements the following vectorization scheme following Ref. [28], which was also employed in [32]. For each auxiliary channel, iDQ queries all transients in that channel with central times within some window surrounding the time of interest. If no auxiliary transients are available, default values are returned for all requested features, thereby denoting auxiliary channels that were inactive. Otherwise, iDQ will extract features from the loudest auxiliary transient (largest ρ) within the window. We note that this is not the only choice, and quiet transients in closer coincidence with the time of interest may be more relevant than louder transients further away [44]. Nonetheless, this select-loudest algorithm works well in practice [48] with coincidence windows ∼ 100 ms.
The resulting feature vectors are collected into sets for training, evaluation, and timeseries generation. Each vector is labeled according to whether the time is associated with an element of either G or C based on h, thereby constituting a supervised learning training set. These labels are completely ignored during evaluation and timeseries generation. As implemented, each feature vector retains a reference to the data discovery abstraction used to retrieve the features (see Section III A). In this way, classifiers can access the full feature set if desired, although most rely on the vectorization scheme articulated above. A notable exception is OVL [29], which directly ingests the data discovery abstraction during training. This is done to avoid additional overhead associated with vectorization and is peculiar to the OVL algorithm, although this type of behavior is more broadly supported within iDQ.

C. Training
Given vectorized representations of the auxiliary features for each time of interest, iDQ then trains classifiers to separate the labeled samples. Again, classifiers are only given features extracted from safe auxiliary channels and cannot construct decision surfaces based on the vectors' G or C labels. The details of each classifier's training algorithm are specific to each classifier, and iDQ generally relies on external libraries for their implementation (e.g., scikit-learn [49], keras [50], and xgboost [51]). However, a few algorithms are implemented directly within iDQ, such as OVL [29].
In this way, iDQ uses supervised learning on labeled auxiliary feature vectors to generate maps from highdimensional input spaces to a single scalar rank. We note that vectorization itself also introduces dimensional reduction, as we extract features from at most a single auxiliary transient per channel, but beyond that it is the classifiers themselves that determine which features are relevant and which are not. This establishes a model for each classifier (α): Each classifier generates a separate model, and different models may be able to separate the training sets to different degrees. iDQ requires ranks to be within the unit interval, although this choice is arbitrary. The important aspect of the model is the relative ordering of samples, not the precise value of the rank. Therefore, any monotonic mapping from the unit interval to another range will preserve all relevant information.
Each classifier additionally manages internal crossvalidation or provides techniques to prevent over-fitting (see Appendix B for an example). Again, the details may be specific to each classifier, but iDQ also provides a cross-validation scheme independent of the classifiers themselves. Section III D describes the various techniques used to evaluate a classifier's performance fairly, making sure to account for any generalization error associated with the derived models.
Each trained model additionally records the range of data used during training as a unique hash. These hashes are used in evaluation (Section III D) and timeseries generation (Section III F) to track how data was manipulated as it progressed through the pipeline.
Some algorithms also support measures of feature importance within trained models. These are often related to the directional derivative of the model with respect to each auxiliary feature: ∂M/∂a i | a . Features with larger directional derivatives tend to be more important, although each classifier's measure of feature importance may adopt different specific details (Appendix B describes how OVL extracts the feature importance measures shown in Figures 7 and 8). Not all classifiers provide this information. Many that do only provide global estimates averaged over all samples instead of the local estimates at a particular auxiliary vector. However, we leave further investigations of standardized measures of feature importance, such as Ref. [52], to future work.

D. Evaluation
Supervised learning relies on cross-validation to evaluate generalization errors. This typically consists of subdividing the data into distinct sets, one of which is used for training and the other to evaluate performance. iDQ supports two main ways to subdivide the data into different bins for cross-validation.
Acausal or round-robin binning divides the data into a sequence of small segments, mixing samples independently of their time ordering. Fig. 3 demonstrates how segments are assigned to different data sets. Briefly, for N different bins with M segments per bin, iDQ generates N segs = N × M segments of equal duration. Data from the first segment is assigned to the first bin, the second segment to the second bin, and so on. Data from the (M + 1) th segment is assigned to the first bin, the (M + 2) th segment to the second bin, and this repeats until all data has been assigned. We then train over all data outside of the i th bin to generate a model used to evaluate data inside the i th bin, repeating the procedure for all bins. This approach provides features drawn from consistent distributions in both training and evaluation, even if the instantaneous feature distributions change over time. With features drawn from consistent distributions in both training and evaluation, acausal binning measures the best performance that should ever be expected from a classifier.
Causal binning again divides the data into many small segments. Fig. 4 shows this schematically. Each analysis then trains on a cumulative set of historical segments and uses the resulting model to evaluate the next segment, preserving the relative time ordering. Again, this is repeated for all segments, each using all historical data available during training. This allows users to investigate how detector non-stationarity affects their algorithm.
Each approach has its uses, and both produce ROC curves that measure classifiers' performance in a fair manner. Importantly, iDQ can simultaneously run multiple classifiers, guaranteeing they see identical data sets and that comparisons are as fair as possible. Evaluation also records each model's hash within each feature vector to maintain a record of how each feature vector was classified.

E. Calibration
While training and evaluation are the foundation of iDQ's supervised learning approach, calibrating the resulting ranks into probabilistic statements is of equal importance. iDQ does this by directly modeling the observed conditioned likelihoods for each classifier's rank: p(r α |G) and p(r α |C). Just like each classifier's model retains a hash to track provenance, each pair of conditioned likelihoods, or calibration map, records a unique hash to denote the evaluated samples from which it was generated.
iDQ can model the conditioned likelihoods in two ways, related to different assumptions about the nature of ranks produced by a classifier's model. Both quantify the uncertainty in their estimates, and both work well in practice. Figure 5 shows how they correctly model non-trivial conditioned likelihoods. Calibration maps further esti-eval train · · · train train eval train · · · train train eval train · · · train train train eval · · · train train train eval · · · train train train eval · · · train train . train train · · · eval train train train · · · eval train train train · · · eval train train train · · · train eval train train · · · train eval train train · · · train eval bin 1 bin 2 Schematic of data segmentation for causal batch analyses. Each row illustrates the data included in a single bin, and columns represent different segments of data. Each bin uses historical data cumulatively when training (grey shaded boxes), including an initial lookback period before the first evaluation segment (red boxes). Bins progressively train on more and more time-ordered data, thereby testing algorithmic sensitivity to detector non-stationarity in a similar way to what is experienced during streaming operation. mate prior odds for G and C samples. We describe these procedures in more detail below.

Continuous Calibration Maps
If classifiers produce ranks that can take any real value within the unit interval, with no discrete lumps of prob- then it is appropriate to model the underlying distribution with a continuous kernel density estimate (KDE). iDQ implements a Gaussian kernel and dynamically optimizes the kernel's bandwidth to maximize a crossvalidation likelihood quantifying how well the KDE reproduces the observed sample set. iDQ reflects observed samples around rank=0 and 1 to avoid edge effects within the KDE while numerically enforcing proper normalization. Uncertainty in the KDE model for the true likelihood at each rank is modeled by a β-distribution (see Appendix A). This procedure accurately predicts the observed variance in KDEs obtained from different realizations of sample sets drawn from the same underlying distributions, regardless of the true distribution.

Discrete Calibration Maps
If the trained model only produces a finite number of possible ranks, the resulting distribution may be better modeled as a weighed sum of δ-functions equivalent to our continuous Gaussian KDE in the limit of vanishingly small bandwidths. Weights are estimated as the fraction of observed samples assigned to that rank, and uncertainty in the weights is again modeled as a βdistribution such that where n i out of N total samples were assigned rank r i .

Prior Odds
While iDQ's conditioned likelihoods, and therefore the likelihood ratio Λ G C , do not depend on the prior odds between G and C, most applications instead rely on p G (Eqn. 5). While p G is monotonic in Λ G C , the exact value can be made arbitrarily large or small based on the prior odds assumed. This forces us to carefully consider our assumptions a priori about the relative frequencies of G and C. iDQ implements several choices, either allowing users to specify fixed prior odds, estimating them based on the relative fractions of samples within training sets, or estimating them based on the fraction of time declared clean within a training set. All these are based on the premise that the prior odds are approximately the ratio of the rates at which each type of sample occurs, adopting different techniques for approximating the rates of G (R G ) and C (R C ) samples.
As appropriate for inference over tabulated data, one approximation is This asks what the chance is that one would select either a G or C sample when randomly choosing an element of the fixed training set. Generally, N G is set by the true R G in the detector and the amount of time over which we collect samples. Similarly, N C depends on the rate at which we generate clean samples, which is an arbitrary choice typically chosen to balance training accuracy and computational expense. While formally correct for tabulated data, and therefore useful in some applications, this model does not necessarily represent the prior odds relevant for timeseries production.
Alternatively, we can model R G /R C ≈ T G /T C = (T /T C ) − 1 where T C /T is the fraction of analysis time declared clean when constructing the sample set. We measure T C directly from the conditions defining C based on h (Section III A). This approach models the relative frequency of times declared glitchy or clean, rather than selecting an element from tabular data, as is more appropriate for timeseries production.
F. Timeseries Production iDQ produces streaming estimates of statistical quantities as the culmination of training and calibration. These timeseries should be thought of as the main data product generated within iDQ and are the most applicable to GW searches. iDQ generates vectorized feature sets on a regular grid in time, typically sampled at ≥ 128 Hz. The regularly spaced vectors are then evaluated using a trained model, and the resulting array of ranks is calibrated into several statistical quantities using a calibration map. These quantities are then distributed in realtime to GW searches, and we discuss ways to incorporate them within searches in Section VI.
Section IV describes the differences between offline (batch) and online (stream) modes of operation in more detail, but both utilize asynchronous processes to manage training, evaluation, calibration, and timeseries production. Because some tasks require the output from other tasks, iDQ synchronizes them by polling for specific models and calibration maps from various repositories, referred to as modular data servers (MDSs). In this way, timeseries jobs can obtain the most relevant model and calibration map for any stretch of data. As with data discovery, iDQ does not depend on the particular implementation of a MDS as long as it allows tasks to get and put results through a consistent interface. De facto, most MDSs are implemented via local filesystems.
As noted in Section III E, the prior odds assumed during calibration can have a significant impact on the interpretability of the resulting timeseries. In particular, we expect p(G) p(C) because the typical duration of non-Gaussian noise transients is usually much less than their separation: O(10 −1 sec) O(10 2 sec). Because of the ambiguity associated with the choice of prior odds, iDQ produces timeseries for multiple statistical quantiles besides p G , including Λ G C , the conditioned survival functions (efficiency and false alarm probability), as well as the raw rank produced by the model. Access to Λ G C allows users to estimate p G with whatever prior odds they choose. Furthermore, the false alarm probability approximates the amount of time discarded by the classifier, and therefore could be used to set a convenient working point. What's more, the calibration map also provides uncertainty estimates based on the finite calibration sample size.

IV. BATCH VS. STREAM MODES
In addition to the decomposition described in Section III, iDQ supports two modes of operation related to how it synchronizes processes.
The offline, or batch, mode targets specific stretches of data and can support both causal and acausal binning schemes (Section III D). Additionally, batch jobs run tasks synchronously within each bin, which is to say that training must complete before evaluation begins, evaluation must complete before calibration begins, and calibration must complete before timeseries are produced. However, separate bins are independent and can be processed in parallel. Batch analyses have loose latency requirements.
The online, or stream, mode instead runs in low-latency, typically producing timeseries within O(10 −1 sec) of receiving features. The dominant source of latency for iDQ is feature generation and vectorization. The time required for feature extractors to process O(10 3 ) channels has recently been reduced to ∼ 5 sec [43], as opposed to ∼ 32 sec for the implementation of the Haar transform (KleineWelle [53]) used throughout the first two observing runs. Vectorization can also limit iDQ's latency, but the pipeline can generate vectors consisting of O(5) features for each of O(10 3 ) channels at rates above 128 Hz, which is typical of production configurations as most glitches have durations ∼ 100 ms.
Online jobs process all data causally and manage tasks asynchronously. However, because some tasks require input from others before they can begin, the streaming iDQ pipeline will run small batch pipelines if models and/or calibration maps are not initially available for all classifiers. Once initial models and calibration maps are available, separate processes for training, evaluation, calibration, and timeseries production run in parallel and interact through put and get requests in MDSs (Section III F). This means that re-training and re-calibration happen continuously, with a new data set defined and fed to classifiers as soon as they finish processing their previous sets. When each task begins processing a new data set, it polls the relevant MDS to obtain all the required data products without waiting for the asynchronous processes to complete. For example, several evaluation strides may use the same set of models because the training jobs generally take longer to complete than evaluation. Nonetheless, as soon as a training job completes, the evaluation jobs will automatically retrieve the new model.
Training jobs usually take the longest to complete, with runtimes of O(hours), and it is possible, then, that trained models will respond relatively slowly to detector non-stationarity. However, if the model becomes out-of-date, the much faster evaluation and calibration jobs, which complete in O(sec), will detect the decreased performance and update iDQ's probabilistic statements accordingly. For example, if a model suddenly cannot distinguish between G and C samples, the calibration jobs will update the conditioned likelihoods to p(r|G) ∼

Time [sec]
FIG. 6. Example iDQ timeseries for OVL [29] with 3 different ranking metrics (see Appendix B) from a batch analysis of O(10 2 ) sec during a period of elevated radio-frequency noise at the LIGO Hanford detector on Monday December 21, 2015. From the top to the bottom, we show the estimated probability of a glitch given the auxiliary data, the log likelihood-ratio between the glitch and clean models, the ratio of the detection efficiency to the false alarm probability (FAP), the FAP itself, and the raw rank produced by the classifier. For reference, central times for glitches derived from h are shown with light grey lines, occuring at a rate of ∼ 10 Hz at this time. We see that OVL correctly identifies the times at which glitches are present in h and that our prior (p(C) p(G)) correctly drive pG to small values when the likelihood is uninformative.
p(r|C), and therefore return the prior odds. Asynchronous, parallel processing lowers the latency required to produce timeseries at the cost additional complexity in data discovery. Because data may occasionally be dropped before it reaches iDQ, each process analyzes data in small strides and internally manages timeout logic, skipping data if it takes too long to arrive. In production, we generally find iDQ achieves duty cycles above 99%, thereby essentially guaranteeing results will be reliably provided to GW searches in low-latency. In fact, iDQ's information was distributed at the same time as the calibrated GW strain during the third observing run.
While many analyses will utilize the batch workflow, we expect streaming processes to be the most relevant in the coming years as iDQ's predictions are incorporated further into low-latency GW searches. Indeed, the examples in Figs. 7 and 8 were all derived from streaming analyses. Section VI enumerates a few other possible applications.

V. EXAMPLES
We present a few examples of iDQ's behavior with real detector data from the first two observing runs. First, and perhaps most importantly given the context, is the non-Gaussian noise artifact coincident with GW170817 in the LIGO Livingston interferometer [55]. Although similar noise transients are witnessed several times per day in each IFO, their exact cause is not known. They are, however, often associated with saturations within the interferometric control systems, and monitors exist to flag such saturations. Figure 7  (GraceDB [56]), thereby informing decisions in real-time about the candidate's probability of being astrophysical in origin and the resulting announcement to the broader astronomical community [57]. GW170817 serves as an example of how iDQ can independently identify noise sources already known to human analysts. Additionally, the witnesses iDQ identifies sometimes flag problematic time associated with loud noise transients that go unnoticed by more conventional monitors, including saturation-like glitches that happen to not saturate the control signals being monitored (e.g. [58]).
Pursuing this further, Figure 8 presents a radiofrequency whistle identified as a possible GW candidate by oLIB, a search for unmodeled GW bursts [54], on 4 February 2017. iDQ vetoed this event within 7 seconds. As with GW170817, iDQ autonomously developed witnesses for such noise and clearly identifies the time as glitchy. We note that canonical monitors for satu-rations did not flag this event, as expected, and iDQ's low-latency predictions were the only data quality products available at the time that could reject this candidate as noise without relying on the human inspection of the signal morphology in h(t). Rejecting candidates from unmodeled transient searches based on h(t) morphology itself is risky, and therefore iDQ's auxiliary witnesses provide much greater confidence the transient was of terrestrial origin. At the time of writing, no low-latency monitors exist for such radio-frequency whistles besides iDQ. Figures 7 and 8 show two examples of noise transients that are typically witnessed well by the auxiliary degrees of freedom used within iDQ. Both were identified in lowlatency the OVL algorithm ( [29]; Appendix B) running within iDQ's framework. We typically find that, in agreement with Ref. [28], OVL performs as well as, if not better than, more complex algorithms, and we primarily uses its predictions to identify noise in low-latency. In general, different classifiers may witness different noise sources, and combining the ranks from multiple classifiers (creating a boosted classifier ) is an active area of research. Figure 9 presents a few ROC curves showing iDQ's typical performance, also demonstrating iDQ's ability to run multiple classifiers simultaneously over the same data. We focus on OVL with 3 different choices for the ranking metric (Appendix B 1). OVL, running within iDQ's framework, typically identifies ∼ 10%-40% of non-Gaussian noise artifacts (KleineWelle [53] triggers between 32 and 2048 Hz with ρ 8) at the cost of 0.1%-1% false alarm probability. This depends on the mixture of noise sources present within the detectors, though, as a high fraction of well-witnessed noise will lead to correspondingly more impressive ROC curves. Figure 9 shows one such example from the LIGO Hanford detector a few days before GW151226 [59]. An intense radio-frequency glitch-storm produced a large number of clearly witnessed noise transients, and OVL identified ≥ 90% of them with ≤ 1% false alarm probability. It is worth noting that, at that time, offline canonical monitors for radio-frequency noise had become less effective due to detector non-stationarity [17,60]. iDQ automatically detected new witnesses without human intervention and retained a high glitch-detection efficiency in low-latency.

VI. APPLICATIONS WITHIN GRAVITATIONAL-WAVE SEARCHES
The optimal incorporation of probabilistic data quality information within GW searches remains largely unaddressed in the literature. In this section, we present a likelihood ratio test that incorporates imperfect knowledge of the presence of non-Gaussian noise within our detectors. We first review the current state of the field and how data quality information is often incorporated into searches, demonstrating how iDQ's products could be used within existing methodologies while discussing different approaches' relative advantages and drawbacks. We then formulate a search based on first-principles models of noise processes within detectors and the incorporation our imperfect knowledge of data quality based on both auxiliary and GW strain information.

A. Current Veto and Gating Strategies
Data quality products are currently applied within searches in two main ways. Data quality vetoes are applied after filtering, meaning after a search has produced a list of candidates. Usually, vetoes are specified as a list of segments and any candidate that falls within these segment is rejected. Gating, on the other hand, attempts to remove problematic data before filtering, thereby preventing false positives from appearing in candidate lists at any point. Importantly, both approaches assume binary data quality information. That is, the data is either declared clean or glitchy with complete certainty. iDQ extends this by providing probabilistic measures of data quality.
Let us begin with vetoes applied after candidates have already been identified. A naive approach is to perform a simple coincidence experiment between the search's candidates and data quality monitors. If, for example, iDQ's false alarm rate timeseries dips below a threshold anywhere within a coincidence window surrounding a GW candidate, this may indicate that a non-Gaussian noise artifact is present near the candidate, which may suggest it is of terrestrial origin. Other methods to identify problematic data based on auxiliary channels, often constructed by hand, are often used to define such veto segments. Of course, this depends on the precise way GW candidate reference times are recorded, as low-mass compact binary coalescences can sometimes coalesce several seconds after the non-Gaussian noise that caused the false alarm. In effect, this maps timeseries output into binary veto segments with a window and a threshold.
iDQ's output was used in this way with modest success during the first two observing runs [61], but vetoes suffer from several limitations. First, searching for extreme excursions in any of iDQ's timeseries within a coincidence window naturally introduces an additional trials factor. For example, if the coincidence window is longer than the typical separation between glitches, then the probability of obtaining p G ∼ 1 for at least one point in that window is almost surely 1. This complicates the statistical interpretation of iDQ's predictions, since, for example, what iDQ reports as the false alarm probability will not generally correspond to the false alarm probability of finding a large excursion within a large window. Furthermore, the mapping from threshold-window pairs to the effective false alarm probability will depend strongly on the quantity used, and there is no single obvious choice. One may threshold on the false alarm probability in an Shaded boxes approximate 1-σ uncertainty from the counting experiment used to measure the ROC curves at the points where they are measured, with linear interpolation between. Grey dashed lines correspond to uninformative classifier, and colored lines correspond to different ranking metrics within OVL (Appendix B 1). Each of these analyses used iDQ's internal acausal cross-validation scheme with two bins and two segments per bin, and these ROC curves therefore account for generalization error. These two examples represent approximate limits on the performance seen within iDQ, and primarily differ by the type of noise present within the detector. Nonstationary changes in non-Gaussian noise can be sudden and severe, necessitating automatic re-training and re-calibration.
attempt to bound the probability of false alarms, but one may alternatively threshold on the likelihood ratio Λ G C as this may be more appropriate for likelihood ratio tests. This could be tuned by hand, but that then negates any efforts to calibrate model predictions within iDQ, thereby defeating the purpose of a large part of the pipeline. Nonetheless, the conceptual simplicity and ease of implementation make this attractive for practical applications. For this reason, veto segments remain a core data quality product within GW searches.
A perhaps more sophisticated approach to vetoing based on extremized timeseries is to incorporate the extremum value within a window as part of a likelihood ratio test. While this may remove some ambiguity about the effective false alarm probability (the likelihood ratio test will naturally account for the probability of seeing such extrema within clean data) and remove the need to tune thresholds by hand, the ambiguities associated with the choice of window and statistic still remain. Although properly constructed likelihood ratios should be able to simultaneously incorporate arbitrarily many window-statistic pairs, one quickly encounters the pragmatic issues with modeling high-dimensional probability distributions that led us to employ machine learning as dimensional reduction in the first place (Section II A).
Ref. [62] explores a somewhat simpler construction, in which the iDQ log Λ G C timeseries is maximized over 1second windows, slightly transformed and then applied directly as a multiplicative factor to a likelihood ratio detection statistic. The transformation for iDQ's log Λ G C was empirically determined and fixed a priori, essentially assuming the functional form for the trials factor introduced by the maximization and how that modified the likelihood of a signal being present. While the assumed mapping may not be optimal, iDQ was found to moderately benefit current searches even with this simple approach.
Another approach is the idea of gating in some form, in that we should remove all problematic (glitchy) times before filtering, thereby removing the need to select a specific extremization procedure to veto post hoc. Heuristically, the logic is that candidates are generated with large ρ by glitches ringing up templates, and we can model the correlations between ρ and the presence of a glitch either explicitly within a likelihood ratio test over many variates or implicitly by removing the contribution of h(t) due to a glitch from the matched filter response altogether.
Such gating schemes are now ubiquitous within the field and trivial to implement within white noise. However, IFOs generate colored noise and the presence of loud glitches can ring whitening filters, polluting surrounding data that would otherwise be unaffected by the noisetransient. This prompted the development of inverse-Tukey window gating [63,64] as well as more complicated in-painting techniques designed to zero the filter response within a specified window after whitening [65]. We derive similar approaches to in-painting in Section VI B from first-principles.
While gating mechanisms become more complex, we note that current approaches all rely on the same premise, that there is a predefined set of times declared glitchy that must be removed from the analysis. These gates are typically defined by extremization processes over monitors and hard thresholds, and therefore suffer from the same ambiguities in determining appropriate settings as post hoc veto segments, although without the ability to simultaneously consider multiple choices as would be possible in a likelihood ratio test. Nonetheless, it is sometimes the case that filtering artifacts from gating or the ambiguity in defining gates are preferable to the original noise transient.
One could define gating conditions based on iDQ timeseries, but again we must face the selection of thresholds with no more obvious metric than guessing and checking how this affects searches' sensitivities. One could be tempted to soften the hard thresholds with an adaptive gate, such that the matched filter response of a timeseries h with a filter f would be modified to (14) in effect estimating ρ by probabilistically keeping the times expected to be clean based on auxiliary degrees of freedom. Again, while heuristically appealing, it is not clear that this approach is optimal. Indeed, it suffers from the same issues of whether to apply the adaptive gate before or after whitening the data as normal gates. While GW searches have benefited from data quality information made available in the past (e.g. [17]), the issues associated with choosing or optimizing ad hoc prescriptions for applying that information beg the question of whether there is a self-consistent framework that would provide a natural motivation for a particular approach. We present such a framework, and additionally prescribe how the greater information available from probabilistic knowledge of data quality can be used without the need to cast that information into binary flags.

B. Optimal Searches with Imperfect Knowledge of non-Gaussian Noise
We now formulate the problem from a first-principles model of the noise processes within our detectors. As a reminder, we assume linear additive noise so that the detector output is given by h = n + s + g, where we only observe h and the auxiliary state a, meaning we must marginalize over the unobserved latent processes n, s, and g. We begin by formulating probability distributions for detector noise in the target channel and auxiliary features conditioned on whether the IFO is in a glitchy or clean state.
In clean states, we assume the noise is generated by a stationary process, at least over timescale relevant to the filter, such that the autocorrelation function is given by (15) where S(f ) is the PSD. Adopting the Einstein summation convention, we then obtain by assuming Gaussianity. Furthermore, we expect n to be independent of a and declare g = 0 ∀ t ∈ C such that p(n, s, g, h, a|C) = p(n|C)p( a|C)δ(g)p(s)δ(h−(n+s+g)) (17) where the astrophysical strain induced in the detector is assumed to be independent of the instantaneous detector behavior. We use iDQ's conditioned likelihood to model p( a|C) ∼ p(M( a)|C) and assume an astrophysicallymotivated prior for signals p(s).
In glitchy states, we still assume n is distributed as in clean times and that n is independent of ( a, s, g). p(n, s, g, h, a|G) = p(n|C)p(g| a, G)p( a|G)p(s)δ(h−(n+s+g)) (18) We note that the only difference is that we demand p(g|C) = δ(g) but leave p(g| a, G) as a completely unknown function. Assuming that the set of data {g(t) | t ∈ G} is distributed somehow, we can construct a combined likelihood spanning predefined labeling of both G and C samples as p(n, s, g, h, a) = p(n|C)δ(h − (n + s + g))p(s) i∈C p( a i |C)δ(g i ) Marginalizing over latent processes yields p(h, a) = DnDsDg p(n, s, g, h, a) The result, in general, depends on how g ∈ G is distributed, which is unknown. We note that assuming g is Gaussiandistributed with divergent variances, so that any g is equally likely, incurs a large Occam factor from the normalization.
While not a problem when considering a single known permutation of which times are glitchy and which are clean, as was done in Ref. [65], this Occam factor can present issues when comparing different permutations, as we do below. Instead, we note that where p C is the distribution for n restricted to n(t) = 0 iff t ∈ C, which depends on the specific permutation of which times are clean and which are glitchy. This, then, implies which is not a proper distribution (i.e., normalizable) because of the maximization, but instead is an upper bound on the marginal likelihood for h and a given a permutation. We, in effect, construct an inference only using times declared clean. This is necessary because we do not know how g is distributed within glitchy times. Alternatively, machine learning models that infer g directly from a, or other assumptions about how g is distributed (e.g., [66]), would allow us to retain observations of h ∈ G and marginalize over g directly. We note that maximized likelihood does not break the assumption of Gaussianity or stationarity; we simply restrict our selves to times known to be clean and treat glitchy times as if they were not observed. In spirit, then, this is similar to gating.
However, we only have probabilistic knowledge of which times are glitchy and which are clean based on h and a. We therefore must marginalize over all permutations weighed by their relative prior probabilities. That is with so that the resulting likelihood is and we remember that the specific sets C and G depend on the permutation. Note, beyond the conditioned likelihoods from iDQ, even non-trivial prior odds alone can affect the resulting inference. For example, this tells us that if p(G) = p(C)/10 and p( a|G) = p( a|C) ∀ t, then one should marginalize over all possible covariance matrices that account for the prior knowledge that one glitch occurs for every 10 clean samples, on average.
Existing gating approaches assume exact data quality knowledge (p(C| a) is either exactly 0 or exactly 1), thereby selecting a single permutation and a single noise model. In Appendix D, we show how marginalization allowing for the presence of glitches, but without a priori knowledge of when the glitches occur, can effectively gate glitches automatically. This recovers and indeed expands upon current data quality approaches even without knowledge of auxiliary data. Proper prior odds also allow us to infer how many glitches are likely to be present and where they are likely to be, all without a priori knowledge beyond the relative frequencies of G and C samples.
With these marginal-maximized likelihoods in hand, we construct a likelihood ratio for the presence or absence of a signal in the data.
Appendix D 1 shows how the statistic behaves in the presence of glitches with either quiet or loud signals; in both cases it remains sensitive to GWs while being insensitive to the presence of glitches. However, there is always the chance that our noise model remains insufficient to properly characterize IFO behavior, in which case other well-developed ad hoc signal consistency tests, such as χ 2 goodness-of-fit tests, could be combined with Λ S !S within a larger likelihood ratio test, similar to how some searches currently include the matched filter ρ [63,67], itself a proxy for the likelihood ratio in stationary Gaussian noise. If Λ S !S is sufficient, additional variates like ad hoc χ 2 tests will not hurt the search's sensitivity. If it is not, they may continue to be vital.
However, Appendix D 2 shows how Λ S !S 's marginalization acts as an explicit signal consistency test without the need for additional ad hoc χ 2 tests. In effect, the marginalization over whether any particular data sample is declared glitchy or clean constitutes a signal consistency test conditioned on the rest of the clean data in that permutation: given the other clean data, is it more likely that the observed datum in question was generated by stationary Gaussian noise or a glitch?
The computational cost of direct marginalization may be prohibitively high, though, as the combinatorics of the number of different permutations grow exponentially with the length of the signal. Appendix D 3 discusses a few possible approximations that may render this more tractable.
We note that many pipelines may not assume the GW signal comes from a known template family and instead search for unmodeled transients [54,[68][69][70]. Such searches are often constructed by finding a maximum likelihood estimator for s and the corresponding maximum likelihood, subject to loose constraints on the GW waveform morphology or polarization to avoid degenera-cies from underconstrained inferences. While we leave the explicit development of analogous searches to future work, we note that similar maximum likelihood techniques should work just as well with our marginalmaximized Λ S !S . Similarly, one could formulate the probability of observing Gaussian noise in terms of a time-frequency decomposition of the data instead of the time-domain formulation provided here. In this case, one could construct an analogous marginal-maximized Λ S !S where the individual time-frequency pixels were assigned glitch or clean labels, marginalizing over all permutations. Several parallel streams of probabilistic data quality, each targeting a specific frequency band, could be produced with parallel instances of the existing iDQ framework. Such a formulation is perhaps closer in spirit to the timefrequency glitch model considered in Ref. [66] than it is to gating, although it would only exacerbate any computational issues associated with direct marginalization already present in our time-domain formulation.

VII. DISCUSSION
We present a statistical learning framework to infer the presence of non-Gaussian noise within gravitational-wave detectors: iDQ. Using a supervised learning approach to classification, we decompose the inference into training, evaluation, calibration, and finally timeseries production. iDQ accounts for non-stationarity in the detectors by continuously re-training classifiers to autonomously detect and exploit new witnesses for non-Gaussian noise without human intervention. iDQ's framework can accommodate any supervised learning algorithm that operates on tabular data, and iDQ supports multiple modes of operation. Offline, or batch operation reproduces and expands upon previous functionality in the litera-ture, fairly evaluating different algorithms' relative performance. Online, or stream operation manages multiple asynchronous processes to continuously re-train and re-calibrate statistical data quality inferences, producing robust probabilistic data quality information in real-time.
While some issues remain open, such as the best way to recover from detector non-stationarity (possibly through the use of weighted training sets) as well as the development and incorporation of novel feature sets and boosted classifiers, we show how iDQ has already proven invaluable within real GW searches. Specifically, we show several examples in which iDQ either autonomously reproduced the behavior of canonical data quality monitors without a priori knowledge of the type of noise in GW detectors or robustly identified noise sources otherwise unflagged by canonical monitors. Ref. [62] describes how iDQ has already been included within some searches. We reiterate that iDQ has operated in low-latency throughout the entire advanced detector era and provided robust data quality information in low-latency for all detections to-date [7].
We also explore current methods of incorporating data quality information within GW searches, discussing their relative merits and drawbacks, and note that all rely on absolute knowledge of data quality. That is, most existing techniques implicitly require analysts to assume they know whether the detector is in a glitchy or clean state with absolute certainty at all times. iDQ moves beyond this assumption. We instead introduce a way to incorporate probabilistic data quality information that accounts for our imperfect knowledge of the presence or absence of non-Gaussian noise based on safe auxiliary channels alone. This approach presents several attractive features, automatically incorporating optimally located gates without a priori knowledge of where gates should be placed, as well as providing signal-consistency tests from first-principles noise models rather than ad hoc χ 2 tests. While we remark that the computational expense of direct marginalization over imperfect data quality information may be large, we also suggest several possible solutions, leaving their full development to future work.
With increasing detection rates and improved detector sensitivity, robust data quality information will only become more important over the next few years. Indeed, the importance of low-latency information for multimessenger astronomy cannot be overstated. iDQ has provided robust low-latency probabilistic data quality information throughout the advanced detector era, and will continue to do so as GWs reveal new astrophysical phenomena in the most extreme environments found anywhere in the universe.
where the observed samples x i are explicitly reflected around the prior bounds. This is the basic estimator used within iDQ's continuous calibration maps (Section III E 1) to model the conditioned likelihoods given observed evaluated sample sets. Furthermore, iDQ estimates the corresponding survival functionŝ with cumulative normal distributions computed during a single iteration over the sample set rather than numerical integration ofp. Given a bandwidth, iDQ generates a dense grid of ranks and evaluates these estimators once for each grid point. Calibration during timeseries production (Section III F) can then be performed rapidly via linear interpolation without requiring repeated iteration over the sample set, which can be quite large: O(10 4 ). We now consider the choice of bandwidth and how to represent the uncertainty in our KDE representation given different realizations of the observed sample set. While all KDEs are biased estimators (E[p(y)] = p(y)) because they smooth the true distribution according to K(x, y; b), our bandwidth optimization procedure finds the best representation of p possible given our kernel. In practice, then, iDQ produces correct coverage (e.g., 50% of samples have nominal survival functions ≤ 50%) to within the expected statistical uncertainty for stationary distributions.

Bandwidth Optimization
Common practice is to define a leave-one-out cross-validation likelihood and use this to optimize the bandwidth. We adopt the following likelihood As shown in Ref. [54], maximizing this likelihood is equivalent to minimizing the Kullback-Leibler divergence between the true distribution and our estimator, approximating an integral over the the measure defined by the true distribution p(x) with a Monte-Carlo integral over the observed sample set. The astute reader will note that we do not impose reflecting boundary conditions within log L. We expect the impact to be minor and the computational complexity is significantly lessened. iDQ optimizes b separately for G and C samples through direct bisection searches within prespecified prior bounds. While this generates reliable estimators, we also note that log L is often quite flat near its maximum and nearby bandwidths may produce similarly well behaved estimators. In principle, one could marginalize over the choice of bandwidth with respect to a priorp We expect marginalization to produce more robust estimators [73], although we find that using the b that maximizes log L works well enough in practice and avoids the additional computational burden of direct numerical marginalization.

Representating an Estimator's Uncertainty as a β-distribution
Because iDQ has access to only a finite number of samples,p(x) will not perfectly reproduce p(x). iDQ models this sample uncertainty with β-distributions. Let us consider the function Because this statistic depends on the set of random variables {x i }, it will also follow some distribution. We note that f ∈ (0, 1], and therefore a β-distribution is a good candidate for compactly representing the expected uncertainty. Specifically, we fit a β-distribution to estimates of the mean and variance of f , based on the observed sample set. where in the last line we approximate the integral over p(x) as a weighed sum of our observed samples. We also calculate the second moment as and thereby obtain the variance directly via In the case of equal weights with N samples, this yields , as expected for a Fisher-efficient estimator. Now, the β-distribution defined by has mean and variance We choose α and β to reproduce E[f ] and V [f ]. The distribution of our estimatorp(y; b, {x i }) is therefore given bŷ where the fit parameters are a function of the bandwidth and observed samples. We note that some estimates of the mean and variance have no corresponding solution in terms of α and β, typically corresponding to expectation values very close to 0 or 1. We therefore impose a minimum expectation value of ∼ 10 −6 for numerical stability. iDQ estimates the best-fit α and β parameters at every point on the dense grid over ranks used to computep, interpolating between neighboring grid points as needed. This provides rapid estimates of sampling uncertainty at the same time as point estimates within timeseries production. A similar procedure is used to model uncertainty in the cumulative distributionP (x). Sample uncertainty for the likelihood ratio Λ G C (Eqn. 6) is obtained via Monte-Carlo sampling from the β-distributions representingp(r|G) andp(r|C). Again, these uncertainty measures are computed once for each grid point and interpolated as needed.

Training
OVL's training scheme has also been slightly updated, although it still remains largely as described in Ref. [29]. Specifically, OVL still trains via a nested iteration, evaluating veto configurations' performance hierarchically with a given ordering, pruning ineffective configurations, and re-ordering the list within each epoch. Pruning is done to avoid over-training and is accomplished by setting minima on various veto configuration performance metrics, like the efficiency-deadtime ratio or Poisson significance. The precise impact of these minima has not been quantified, but typically analysts select values to balance the loss in efficiency associated with restricting the vetolist to only the most exceptionally well-ranked veto configurations and the generalization error introduced by over-fitting, as efficient vetoes can sometimes correspond to statistically rare accidental coincidences within the training set that do not generalize well. For this reason, it is thought that pruning based on the Poisson significance has the largest impact on over-training.
OVL learns when it re-orders its list, as this places higher ranked veto conditions first. Because of the hierarchical nature of OVL, we take care to re-order the veto configurations to preserve as much information as possible. Specifically, within each re-ordering, we first sort the list to place high-threshold, small-window configurations first. All else being equal, these should produce better veto configurations. Only then do we order the configurations by their metrics, so that configurations with the same score are ordered to prefer high thresholds and short windows. Pragmatically, we find this makes a small but noticeable difference in the final ordering produced by the algorithm.

Ranks
Because iDQ requires classifiers to generate ranks within the unit interval and the metrics used within OVL typically span the positive real line, we map the metric scores (m) into ranks (r) according to where the scale (ξ) is fixed for each metric separately to account for their very different dynamic ranges seen with typical interferometric data. We note that this mapping is not unique, and other functional forms would accomplish the same task. However, Eqn. B1 distributes the ranks more uniformly over the unit interval than some other mappings, and this can help iDQ's calibration build accurate representations of the resulting conditional likelihoods.

Feature Importance
Finally, we would be remiss if we did not discuss OVL's notion of feature importance, one of the most attractive aspects of the algorithm besides its robust performance. Because OVL only considers a single auxiliary channel at a time and applies them in a specific order, it is straightforward to determine which auxiliary features (channel, threshold, and window) are responsible for OVL's predicted rank at any time. In this way, OVL reports which veto configurations were active as a function of time while simultaneously reporting their relative importances as their ranks. Indeed, feature importance as a function of time is shown in Figures 7 and 8.
OVL also measures correlations between veto conditions. Specifically, it can report the intersection of segments created by each veto configuration as a symmetric matrix. Diagonal elements correspond to the time contained within of each set of veto segments separately. Nearly redundant veto configurations, then, will produce intersection times close to the times of each configuration separately. We note that OVL's training algorithm, by design, will remove redundant configurations and therefore reduce the amount of overlap in the list by applying configurations hierarchically, removing vetoed transients and time before proceeding to the next configuration. This, combined with pruning, will tend to select a single witness configuration out of sets of highly correlated configurations. Nonetheless, such "covariance matrices" between veto configurations may prove useful when diagnosing the source of specific noise transients identified by the selected auxiliary witnesses. Figure 10 shows examples of correlations between the veto configurations used surrounding GW170817.

Appendix C: Synthetic Feature Generation
In addition to supporting multiple possible feature sources, each supplying distinct sets of features, iDQ can also generate synthetic data on-the-fly. This is done for testing purposes and to benchmark algorithms. Briefly, iDQ simulates an arbitrary number of stationary Poisson processes representing sources of noise. Each of these synthetic processes is described by a separate rate as well as distributions over frequency and ρ. Synthetic processes are then witnessed by user-specified sets of channels, and each witness records values scattered around the true value within the process (e.g., central times recorded in witness channels are Gaussian-distributed around the central times produced by the process, with separate standard deviations for each witness). In this way, the synthetic processes entangle the features witnessed by several channels, and noise in a single channel can be modeled by a separate process witnessed only by that channel. Each channel may witness multiple streams, generating arbitrarily complex correlations within the feature set. This implementation realizes the probabilistic graphical model depicted in Fig. 1. gates based on the relative frequency of G and C samples and the data's consistency with Gaussian noise without any other a priori knowledge. We note that, if the signal is quiet (h − s ∼ h ∼ σ), then we obtain log Λ S !S ≈ − |h 1 − s 1 | 2 + |h 3 − s 3 | 2 2σ 2 + |h 1 | 2 + |h 3 | 2 2σ 2 (D8) exactly as expected for stationary white Gaussian noise with the glitch gated with precise a priori knowledge of the glitch's location. If instead the signal is loud, the inference is more complicated as the noise-only model may confuse what is really a loud signal with a loud glitch, although we have which is still large due to the prior odds and therefore still strongly in favor of a signal. While the full solution with more samples and colored noise is more challenging technically, is follows the same basic principles. We again note that the benefits of marginalization seen within this toy model assume uninformative auxiliary information and simply accounts for the possibility that glitches exist within the detectors and our imperfect knowledge of data quality. Informative supervised learning models based on a with correct calibration, such as those provided by iDQ, can only further improve the inference. In the case of our toy model, this would simply add additional weight to the correct permutation that gated the second sample.

Signal Consistency Tests
We remark that signal consistency tests, like χ 2 goodness-of-fit, require the data to be consistent over several smaller, independent trials, looking at the distribution of t ρ(t) 2 rather than ( t ρ(t)) 2 . We note that such χ 2 statistics are ad hoc and not uniquely defined. Therefore, there is no particular reason we should expect to derive the form of any such statistic from first-principles considerations. Nonetheless, we show that marginalization naturally defines a signal consistency requirement similar in spirit to, but different in detail from, existing χ 2 tests.
Let us further consider the model comparisons implicit with the marginalization over permutations. As an example, let us assume that the G or C assignments are known perfectly for all samples except one: h k . The explicit marginalization over this single unknown sample is then p(h, a) = p( a k |C)p(C) We note that which is the likelihood of observing (h k − s k ) as part of the stationary Gaussian noise process conditioned on the observations of the rest of the Gaussian noise process known to be clean (h i − s i ∀ i ∈ C) multiplied by the prior volume allowed by the additional degrees of freedom. This is a signal consistency test that checks whether the k th sample agrees with the signal seen in the other N C samples. Marginalization compares this consistency test against the posterior odds that the k th sample was a glitch based on a k , effectively placing a lower bound on the probability of seeing any h k − s k . Within likelihood ratio tests, this prevents the noise-only model from becoming vanishingly small in the presence of loud glitches, thereby preventing the likelihood ratio from diverging and rendering the search much less sensitive to glitches. Indeed, this is exactly the behavior seen in our toy model.

Computational Cost of Marginalization
We note that the marginalization over all permutations proposed within Λ S !S is combinatorially expensive. Efficiently implementing such a calculation is an open problem, but we discuss a few possible solutions below.
First, one could Monte-Carlo integrate instead of performing the entire sum. However, Monte-Carlo integrals have variances that scale as The integral's variance, then, could be quite large in the presence of loud glitches as the variance between different permutations would be large. Such integrals may require many samples to converge.
Alternatively, one could sample from the sum in a scheme similar to the Metropolis-Hasting algorithm [75]. Jump proposals would consist of flipping the label of one sample from G to C or vice versa, much like an Ising spin model [76]. However, Markov-Chain Monte-Carlo estimates of marginal likelihoods are not without their own computational challenges and may not scale well in practice.
Regardless of the sampling procedure, we note that the number of possible permutations could be exponentially reduced by labeling small contiguous segments as G or C instead of labeling every time sample separately. This would greatly reduce the computational cost, perhaps to something tractable, but introduces issues of how to select the window size. Based on our considerations of model comparisons occurring within the marginalization, windows of comparable size to the stationary Gaussian noise's autocorrelation times may be appropriate, as this is the relevant timescale over which h k |h i∈C becomes less in formed by h i∈C and therefore less stringent of a test.
We would need to compute the probability that there was a glitch present at any time within each segment. This coarse-graining should be straightforward, though, as p(G ∈ window| a) = 1 − ti∈window p(C| a(t i )) (D13) A related approach would be to round the probabilities up or down based on some threshold. That is, if p(C| a) is above some threshold, we only consider permutations where that sample is labeled C. Similarly, if p(C| a) is below another threshold, we only consider permutations where that sample is labeled G, marginalizing over only the samples in the middle. In effect, this would define There is no single obvious choice of thresholds, however, so care would be needed. We note that this is similar to the auto-gating implemented within some existing searches [63,67], although the thresholds are placed on ρ and are not currently determined by p(G| a)/p(C| a). What's more, they ignore the conditioning on other data already declared clean. Indeed, this could be considered a conservative choice as the threshold to declare a sample clean based on the conditioned likelihood h k |h i∈C should only be more stringent than a threshold derived without the observations of h i∈C .
Such coarse-graining is likely to result in information loss and therefore less sensitive searches. However, the impact may be small enough and the computational speed-ups large enough to make this tractable, thereby improving GW search sensitivity compared to current approaches that do not marginalize over imperfect data quality information. Ref. [62] implements one such coarse-graining procedure, although they do not attempt to marginalize over probabilistic data quality and instead directly modify their likelihood ratio with a multiplicative factor that depends on iDQ's output. Even this simple approach already shows modest improvements in search sensitivity.