1 Introduction

Gravitational waves (GWs), ripples in the fabric of space and time that are a key prediction of Albert Einstein’s century-old theory of general relativity [1, 2], were directly observed for the first time by the Advanced Laser Interferometer Gravitational-wave Observatory (LIGO) instruments in September 2015 [3]. Since this groundbreaking discovery, the two LIGO detectors [4] and their partner, the Virgo detector [5] have measured nearly 100 GW candidates from coalescing black holes and neutron stars during their first three observing runs [6]. The fourth observing run (O4) of the LIGO, Virgo, and now KAGRA [7] GW detector network is under way and expected to result in hundreds more GW detections over the next 2 years [8].

To observe GWs from ground-based detectors such as LIGO and Virgo, the instruments need to be sensitive to changes in the length of the detector arms on the order of \(10^{-18}~\textrm{m}\) [9]. State-of-the-art instrumentation and data analysis techniques have been developed to achieve these sensitivity requirements. However, non-astrophysical noise sources, originating from both instrumental and environmental factors, affect the detectors and lessen their sensitivity to the GW universe. Of particular concern are transient, non-Gaussian noise sources known as glitches that appear in the detectors at a high rate [10,11,12]. Though some glitches have known causes, others show no obvious correlation with instrumental and environmental noise monitors, making their root causes difficult to diagnose [13, 14]. The classification and characterization of glitches are paramount to minimizing their effect on GW measurements. Classification is the crucial first step because many glitches are morphologically similar, as can be identified through their spectrograms. Figure 1 shows spectrograms of three glitches that exemplify common glitch classes seen in the LIGO detectors: Whistles, Blips and Koi Fish [15, 16]. Although these glitches can be differentiated by eye in their spectrograms (and by their sound, if played through a speaker), classification is a daunting task because of the sheer volume of glitch data that is accumulated throughout an observing run.

Such large-scale data analysis challenges are not unique to GW astronomy. Fundamental challenges and opportunities for twenty-first-century science lie in developing machinery and techniques to characterize and use expansive data. The paradigm in which individual researchers or even teams of researchers can analyze data has shifted to relying on novel methods for analysis of large-scale scientific datasets.

Machine learning (ML) techniques are heavily embedded in scientific data analysis, becoming a standard approach for analyzing large-scale data (e.g., [17, 18]); however, there are many instances where analysis by humans is still critical. One example is identifying new data classes not initially accounted for in training sets for ML algorithms. This novelty is indeed the case for GW detector characterization, as new glitches regularly appear due to environmental and instrumental changes as the detectors evolve [14, 19]. To this end, the Gravity Spy project was created to characterize glitches in GW detectors by combining computer- and human-based classification schemes [15].

At its core, Gravity Spy is a citizen-science project.Footnote 1 As a citizen-science project, it involves members of the general public in the scientific process, including formulating research questions, collecting or analyzing data, observing and recording natural phenomena, and disseminating results [20]. As Internet-enabled devices have become increasingly ubiquitous, citizen-science projects have become a feasible approach to providing access to human insights at a large scale. For example, Galaxy Zoo [21] invites volunteers to engage in the morphological classification of images of galaxies produced by the Sloan Digital Sky Survey: the project has successfully increased engagement with the public and led to novel discoveries in astronomy.Footnote 2

Gravity Spy has a web-based interface through which anyone can provide analysis of LIGO glitches and interact with LIGO scientists about the state of the detectors. It is hosted on the Zooniverse citizen-science platform.Footnote 3 The Zooniverse platform has fielded a workable crowdsourcing model (involving over 2.5 million people in more than four hundred projects since its inception) through which volunteers provide large-scale scientific data analysis. Zooniverse provides tools for systematically performing data analysis tasks on data collections, making it an ideal platform for Gravity Spy.

Fig. 1
figure 1

Spectrograms of example glitches seen in LIGO detectors. Time is along the horizontal axis, and frequency along the vertical axis. The color denotes the energy in each time–frequency bin. These glitches exemplify a subset of glitch classes. The left panel shows a Whistle glitch, middle panel a Blip glitch, and right panel a Koi Fish glitch

Building a citizen-science project using Zooniverse would not solve the specific challenges GW detectors face alone. The glitches in the detectors can change over the course of a run as GW scientists adjust the detectors or from other transient terrestrial events. Understanding and mitigating glitches requires innovative solutions, testing, and meticulous analysis of the data that only an interdisciplinary team of citizen-science volunteers, LIGO detector-characterization experts, computer scientists trained in machine learning, and social scientists could provide.

Since its inception in October 2016, Gravity Spy has accumulated over 7 million classifications of glitches by tens of thousands of Zooniverse users, significantly contributing to the characterization and identification of glitches in LIGO data. Gravity Spy also provides a testbed for fostering a symbiotic relationship between human and machine classification approaches. This connection goes far deeper than the classification task itself; the interplay between the human and machine components of the project allows for a customized training regimen for project volunteers, a more efficient and individually-tailored classification task, and the ability of volunteers to actively improve the underlying ML algorithms.

This article details activities undertaken as part of the 3-year multi-institution National Science Foundation grant (INSPIRE 1547880) to develop Gravity Spy as a prototype for the next generation of citizen-science projects. We describe the Gravity Spy project, its novel approaches to citizen science, its impact on GW detector characterization over the past half-decade, and the citizen-science techniques exclusively developed for this project. We will argue why such techniques are necessary for citizen science to remain highly relevant even as datasets grow exponentially and machine-based classification algorithms advance. Following an overview of the project, results, and lessons learned, we will look ahead to future advancements and adaptations to the Gravity Spy project, particularly focusing on the latest extension known as Gravity Spy 2.0.

Fig. 2
figure 2

Components of the interconnected Gravity Spy system. Gray arrows show the movement of glitches throughout the project, red arrows the progression of volunteers, and blue arrows the training of ML models. We note that there are multiple levels that volunteers progress through in the Beginner and Intermediate workflows with differing number of glitch classes, though the ML confidence of glitches are consistent within all Beginner workflows and within all Intermediate workflows

2 Building Gravity Spy

The Gravity Spy project relies on an intricate interconnection between GW data analysis, citizen science, and machine learning. An overview of the main Gravity Spy system is shown in Fig. 2 and summarized as follows:

  1. 1.

    Glitches are identified using the Omicron pipeline [22], which identifies moments with excess power in the data stream of each LIGO detector individually, referred to as Omicron triggers. Omicron triggers that exceed a signal-to-noise threshold and are from times of a suitable detector state are selected for study.

  2. 2.

    Spectrograms like those in Fig. 1 are created for four different time durations around each trigger. Each particular trigger’s collection of spectrograms in the project is referred to as a subject.

  3. 3.

    Each new subject is sent through a trained ML algorithm and assigned a probability of being an instance of one of the predetermined morphological classes in the Gravity Spy project.

  4. 4.

    New subjects are distributed to the Gravity Spy volunteer workflows based on the confidence score from the initial ML classification, with glitches likely to be of a known glitch class being provided to new volunteers and less certain glitches to more advanced volunteers.

  5. 5.

    Volunteers morphologically classify subjects until the combined machine and human classification score reaches a predetermined threshold, at which point that particular subject is retired from the project (i.e., removed from the classification workflows). If no consensus is reached after a set number of classifications, the glitch is migrated to a more advanced workflow.

  6. 6.

    Volunteers advance through the project and gain the ability to access workflows with additional classes of glitches as they correctly classify golden subjects (i.e., subjects that experts have already classified).Footnote 4

  7. 7.

    Volunteers in the most advanced workflows examine glitches that neither human nor machine classification has been able to confidently identify, in order to look for possible new classes of glitches that can be proposed for follow-up by LIGO scientists.

  8. 8.

    All machine learning, volunteer, and combined classification results for active and retired subjects are provided to LIGO–Virgo–KAGRA (LVK) scientists to aid glitch studies. Retired images are actively added to the ML training set, which can be retrained at a predetermined latency to improve its initial classification ability.Footnote 5

The following subsections provide more details about the building and performance of each component of the Gravity Spy system.

2.1 The dataset: transient noise in GW detectors

As summarized above, Gravity Spy classifications operate on spectrograms of GW data around the times of Omicron triggers [22]. The Omicron software relies on the Q-transform, an analog of the short Fourier transform [23]. Times of excess power are identified through the Q-transform, and the Q-transformed data are represented as spectrogram images (also known as Q-scans, see Fig. 1).

A glitch in one of the LIGO detectors will trigger Omicron to produce spectrograms for four different time windows: \(0.5~\textrm{s}\), \(1.0~\textrm{s}\), \(2.0~\textrm{s}\), and \(4.0~\textrm{s}\) [15]. The reason multiple time windows are used is so that both humans and ML models can examine glitch morphologies that occur on different characteristic timescales. These images are then exported to an ML model and for human vetting, as described in later sections of this article. The classifications from the ML and humans are then added to the Gravity Spy dataset for use by analysts.

Since the start of Advanced LIGO observations in the fall of 2015 through the end of the third observing run (O3) in spring 2020, the Gravity Spy dataset has accumulated \(\gtrsim 1.4 \times 10^6\) glitch triggers from the Omicron pipeline with a signal-to-noise ratio \(>7.5\), corresponding on average to a new glitch being added to the project per every minute of observing time. Compared to the \(\sim 10^2\) astrophysical events observed over this timespan [6], the number of glitches that have been uploaded to Gravity Spy is \(>10^4\) times larger. As of the start of the fourth observing run of the Advanced LIGO, Advanced Virgo and KAGRA detectors, 27 morphological classes [16] including the catch-all None-of-the-Above (NOTA) and No Glitch classes are available for volunteer classification of LIGO glitches, with 23 incorporated into the ML classifier [19].

2.2 The Gravity Spy classification task

The Gravity Spy classification task on the Zooniverse user interface involves citizen-science volunteers reviewing images of glitches and classifying them into morphological categories. The different levels at which volunteers can classify glitches are called workflows. As volunteers learn the classification task and classify glitches in accordance with glitches pre-classified by experts, they unlock and advance through workflows that introduce more glitch classes, more options in the classification interface, and glitches that are classified with less confidence by the ML algorithm.

The classification interface for the most advanced workflow is shown in Fig. 3. The left side of the interface presents the glitch to be classified, and the right side, a list of possible classes, along with NOTA for glitches that do not match any listed classes. To help volunteers pick a class, a Field Guide with exemplary images and text descriptions is available. Metadata (e.g., the date on which the glitch occurred) and image filtering options are present in the interface. Volunteers can also save favorite subjects or create collections of subjects that they can view after the glitch has been classified.

Upon selecting a class, volunteers are presented with two options: Done or Done & Talk. Both options submit the response to the response database. If the subject was used to evaluate promotion (gold-standard data), feedback is provided that the volunteer agreed or disagreed with the expert evaluation of the glitch. Selecting Done & Talk directs the volunteer to a thread on the Gravity Spy Talk discussion board where they can review comments posted about the image by other volunteers or post a new comment.Footnote 6 The Talk feature is extensively used to comment on NOTA glitches that might represent a new glitch class.

Fig. 3
figure 3

The Gravity Spy classification interface for advanced workflows, named (Binary Black Hole Merger (Level 5) and Inflationary Gravitational Waves (Level 6)). These workflows allow for volunteers to pick from all of the 24 morphological classes that are currently accounted for in the Gravity Spy project

2.3 Machine learning for Gravity Spy: classification and volunteer training

ML processing is an integral part of the Gravity Spy architecture and is used for classification and volunteer training. The ML model used for initial classification is a convolutional neural network (CNN), a class of deep learning algorithms that shows exceptional performance in image recognition endeavors [24]. The detailed architecture of the CNN used in the Gravity Spy classification task and how glitch spectrograms with differing temporal durations are combined can be found in [15, 25].

A training set of \(\simeq 7700\) labeled glitches across the 19 initial morphological classes was synthesized for the initial Gravity Spy launch, and this training set has been supplemented over time to include nearly \(10^4\) labeled glitches over 23 classes after O3 [16]. This dataset continues to be expanded as new data are being taken. The initial training set was constructed by experts: first, \(\sim 1000\) labeled glitches were used to train a preliminary ML model (with relatively poor accuracy), and then these ML labels were vetted by experts to expand the initial training set.

Gravity Spy uses a novel approach for volunteer training and advancement that relies on ML to provide task scaffolding. The design is informed by learning theories, namely the zone of proximal development [26,27,28]. Specifically, new volunteers to the project progress through a series of increasingly challenging and complex workflows to develop their knowledge of the classification task and the zoo of glitch morphologies, with more advanced workflows and more difficult-to-classify glitches opening to volunteers as they proceed through the project. The ML classifications and confidence scores determine the workflow to which a particular subject is assigned. Subjects that the ML confidently classified are assigned to beginner workflows, while subjects that the ML is uncertain about, which may thus represent new categories of glitches, are assigned to the more advanced workflows. There are multiple beginner workflows containing an increasing number of glitch classes. Glitches are assigned to the appropriate workflows so that new volunteers see glitches that are likely (but not certain) to be examples of one of the included classes. In the initial level, volunteers begin with only two options and, upon meeting performance thresholds as assessed by classification of gold-standard data, are promoted to subsequent levels in which additional glitch classes are presented. Also, instances of the glitch classes introduced to the volunteer in the previous level may be more challenging to classify, as indicated by lower ML confidence scores. At each level, volunteers have the option of NOTA to handle the case where the ML has confidently misclassified a glitch. In the most advanced workflow, most glitches have low ML confidence and may be representative of new morphological classes. The task of the volunteers thus shifts from classifying glitches into known classes to searching for similarities in the images that indicate a possible novel class of glitch.

As volunteers classify images, their classifications (which are coupled to a unique confusion matrix for each user that determines which classes a particular user commonly confuses with another) are combined with the initial ML confidence score to determine whether a particular subject has been classified with a high enough pre-set accuracy to be retired from the project and added to the ML training set; these glitches that are added to the ML training set improve its morphological coverage. Theoretically, this iterative retraining of the ML model could be automated. However, several variables govern this retirement procedure, and investigation into the optimal settings for these retirement criteria is ongoing.

2.4 Volunteers of Gravity Spy: supporting the machine and making discoveries

Hand-labeling glitches with the predefined glitch classes represents the lion’s share of work in Gravity Spy, without which the ML could not be trained nor benchmarked. Gravity Spy volunteers are essential in providing these by-eye classifications. As discussed earlier, the input provided by volunteers goes beyond the initial classification task; volunteers help identify and characterize glitches that do not fit into previously known glitch classes. When glitches do not fit one of the known glitch classes in the primary labeling, volunteers can begin by labeling them as NOTA, and proceed to create collections of such glitches that exhibit similar morphologies.

Large numbers in a new glitch class indicate that this morphology may be particularly detrimental to GW detector sensitivity. Collecting large samples of such glitches can allow for LIGO scientists to identify trends (e.g., in the times that they occur or in the auxiliary sensors that are triggered at the time of the glitch) [14, 29]. In addition to supplementing glitch collections by continuing through the classification task and identifying similar images, which can be cumbersome and time consuming, we enable volunteers to perform additional data analysis tasks using additional tools side by side with Zooniverse, collectively known as Gravity Spy Tools.Footnote 7 One such tool that has proved to be highly useful is the Similarity Search, in which volunteers (and project scientists) are able to query similar-looking glitches within the full dataset.Footnote 8 The search utilizes transfer learning, first modeling the properties of existing glitch classes in a high-dimensional feature space and then relying on a clustering algorithm to identify images that are morphologically similar in the database [30, 31]. To use the tool, project scientists or volunteers input a particular glitch subject (each glitch has a unique subject label), and query other glitches in the dataset that most closely resemble that subject morphologically. After running the Similarity Search (a screenshot of the output is shown in Fig. 4), users can evaluate the metadata of resulting glitches, decide which images to include or exclude, and export the results of the search to a new collection. The Similarity Search plays a crucial role by effectively filtering out the majority of the non-matching glitches, significantly enhancing the purity of the set that the volunteer will examine. The ability of volunteers to search the entire dataset for similar subjects is a methodology unique to citizen-science projects that proved highly useful; rather than proceeding through the classification task until similar-looking subjects appeared and could be added to a collection, this tool allowed project users to quickly build morphologically similar classes of glitches that could be vetted and added to the ML training. Details of the clustering algorithm and similarity search can be found in [30,31,32].

Fig. 4
figure 4

Results from a query of the Similarity Search tool. The glitch at the top of the image is was glitch that was queried on, and the four glitches below are some of the most similar glitches in the database. Similarity scores are in the rightmost column of the table

Following the identification and curation of a new glitch class, volunteers can submit a New Glitch Proposal (Fig. 5, bottom), including a proposed name for the glitch, a description of its typical morphological features, a single exemplar or reference glitch, and their collections of similar images. The proposal is evaluated by a LIGO member who assesses the robustness and usefulness of the proposed new class and then communicates whether it should be included in the list of glitch options in the classification interface.

Fig. 5
figure 5

New glitch proposal by Gravity Spy user EcceruElme

3 Results from Gravity Spy

Gravity Spy’s impact has been three-fold: it has improved glitch mitigation in GW detectors and, thereby, the quality of GW observations; it has explored new approaches to ML and human-computer interaction, and it has increased scientific engagement from community members. In the following subsections, we describe these impacts in detail.

3.1 Improving the sensitivity of GW detectors and analyses

One key investigation that Gravity Spy has enabled is monitoring the rate of glitches by glitch type. The disappearance of a specific glitch class can, for example, indicate that a specific mitigation strategy undertaken by the detector specialists is working [12, 33]. Monitoring different glitch types over time provides both a high-level assessment of the detectors’ state and some indication of whether a specific glitch type is responsible for increased detector noise. For example, Fig. 5 of [16] shows the hourly glitch rates for four types of glitches at LIGO Hanford and LIGO Livingston during O3.

Beyond a high-level assessment of detector noise, detector-characterization experts also use Gravity Spy classifications to search for correlations between individual glitch classes and features in auxiliary channels that monitor the state of the detectors and related instruments. Gravity Spy classifications are easily accessible to LIGO scientists through the internal database LigoDV-Web [34]. Investigations by detector experts, which could involve Gravity Spy classifications, are typically described in aLogs, LIGO’s records on detector functioning and data quality. One such example can be seen in [35], where potential correlations are identified between a sub-class of the Low-Frequency Blip class and noise in an auxiliary channel monitoring the suspension systems in LIGO Hanford. Similarly, one can investigate the effects of adjusting the instruments on the rate of different classes of glitches, as exemplified in another aLog [36].

Gravity Spy has also had a major impact on the understanding of LIGO noise through the identification of new glitch classes. For example, LIGO scientists identified the Low-Frequency Blip class through investigations of many glitches classified by Gravity Spy as regular Blips; the sub-class of Blips at lower frequencies prompted the creation of a new class in the Gravity Spy project [19]. Gravity Spy volunteers play a key role in this process; they can propose new glitch classes, some of which are eventually incorporated into the ML and classifications. One of the first instances of this was the identification of the Paired Doves glitch class by Gravity Spy volunteers during beta testing [15, 37]. Paired Doves are glitches that are morphologically similar to the GW signals expected from merging compact objects, so identifying this new class was critical for mitigating false positive GW detections.

Another notable example is when Gravity Spy volunteers identified the Crown glitch class, which was contemporaneously identified by LIGO scientists as Fast Scattering [19, 38]. These glitches were originally classified into the NOTA class or the Scattered Light class, but further investigation noted they should be their own class, separate from another sub-class of scattered light now known as Slow Scattering. The independent work of the volunteers and LIGO scientists enabled us to verify that the ML algorithm trained using either the volunteers’ Crown training set or the experts’ Fast Scattering training set produced comparable results [19], demonstrating the volunteers’ capability in identifying new classes. This Fast Scattering/Crown class was found to be associated with ground motion and accounted for a major fraction of glitches in LIGO Livingston during O3 [6, 16]. The combination of quick classification by ML and by-eye investigations by both LIGO scientists and community members was essential to identifying and understanding this new glitch class.

Beyond direct application to characterizing glitches, Gravity Spy classifications have also been used to understand the effects of detector noise on GW measurements and as inputs to other GW software pipelines. Gravity Spy data products, including both ML and volunteer classifications of the entire set of glitches up through O3, are available for public use [39, 40], with proprietary data from ongoing observing runs similarly available to LVK scientists. Gravity Spy classifications have been used to examine how different glitch classes affect data analysis [41,42,43,44], for example, how they may contaminate GW searches [45]; the development of algorithms to distinguish GW signals from glitches [46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63]; the development of techniques to remove glitches from the data [64,65,66]; investigations of potential environmental or instrumental origins of noise [12, 67,68,69], such as what triggers the appearance of light-scattering noise [19, 33]; simulating synthetic (noisy) GW data [70,71,72]; and training or testing alternative ML glitch classification algorithms [57, 73,74,75,76,77,78,79]. Overall, Gravity Spy classifications have been instrumental in GW detector characterization, spanning from informal investigations to a key part of GW data analysis software [14].

3.2 Development of machine learning

As described earlier, Gravity Spy incorporates CNNs for machine learning-based glitch classification. CNNs can automatically learn hierarchical patterns and features from images to effectively distinguish between different glitch classes. The CNN architecture presented in Fig. 6 of [15] serves as the foundation, comprising two convolutional (Conv) and max-pooling layers, followed by two fully-connected (FC) layers. The final FC layer employs a softmax activation function, generating scores for each class based on a given set of input images. To create an input, spectrograms with four durations generated with different time windows (\(0.5~\textrm{s}\), \(1.0~\textrm{s}\), \(2.0~\textrm{s}\), and \(4.0~\textrm{s}\), the same that are shown to the volunteers of the project) are combined to form a square image. This merging process ensures the convolutional kernels effectively slide over all four durations, enabling the model to learn distinct features from both long- and short-duration glitches. This architecture has been trained multiple times in different contexts as shown in Table 1, as we describe below.

The initial model was trained using a dataset of 7718 glitches from 20 classes observed during the first observing run (O1) of Advanced LIGO [15, 25]. This dataset was curated and labeled via a collaboration between detector-characterization experts and citizen-science volunteers from Gravity Spy. The initial model achieved an average accuracy of \(97.1\%\) on the testing set. Later, two new glitch classes, the 1080 Line and 1400 Ripple classes, were discovered by citizen-science volunteers during the second observing run (O2) of Advanced LIGO. Therefore, the model was retrained using an expanded dataset of 7932 glitches from 22 classes, encompassing glitches from both of the first two observing runs [32].

The model was later improved with two additional Conv layers and max-pooling layers before the existing two FC layers. These deeper layers allowed the model to capture more intricate patterns in the glitch data. This retrained model achieved an accuracy of \(98.2\%\) on the testing set. During O3, LIGO detector-characterization experts and the Gravity Spy volunteers identified two new glitch classes: Fast Scattering/Crown and Low-frequency Blips [19]. In addition, the NOTA class was removed from the glitch classes defined in the ML model. This particular class is useful for volunteers to flag potential new glitch types but held limited significance for the model’s classification process due to the large variety of morphological features it encompassed. Therefore, the model was retrained again on a dataset of 9631 glitches from 23 classes in O3 [16, 19]. This most recent model reported training and validation accuracies of \(99.9\%\) and \(99.8\%\), respectively.

While each of the ML classifiers mentioned above demonstrated strong performance, they have some common challenges and limitations. First is the imbalanced distribution of glitches across classes. For instance, certain classes like Chirp and Wandering Line suffer from a scarcity of training samples, which can introduce bias during model training. To address this issue, it is recommended to incorporate class-specific evaluation metrics, such as precision and recall, when assessing the model. Second is the presence of noisy glitch labels. The training labels are derived from a combination of ML and volunteer classification, making it likely that the model relies on partially incorrect labels during the training phase. Furthermore, the model architecture remains relatively simple, relying on the direct combination of a few convolutional layers. This simplicity imposes constraints on the model’s ability to perform effective feature extraction, particularly when dealing with morphologically similar glitch classes. Future studies could potentially explore more advanced model architectures to further enhance the performance of the glitch classifiers. Finally, due to its fully supervised nature, the ML method faces challenges in recognizing new glitch classes that fall outside the predefined classes during training, which highlights the importance of involving volunteers in the process of identifying new glitch classes.

Table 1 The development of ML models for glitch classification in the Gravity Spy system

3.3 Engagement with project volunteers

While a significant amount of work has studied both machine- and human-based classification schemes, we know little about how to use machine-coded data to improve human learning and performance. To this end, we devoted substantial effort to researching how to recruit volunteers, train volunteers with the support of ML scaffolding, and assess volunteer engagement.

3.3.1 Volunteer recruitment

Zooniverse hosts hundreds of projects, allowing participants to be recruited from the existing user base. We engaged in supplemental recruitment through blog posts and announcements about GW discoveries and empirical research on attracting and retaining participants (i.e., motivation). A persistent challenge in citizen-science projects is that most volunteers participate infrequently, oftentimes only once. Research in citizen science has focused on strategies to motivate contributions, with studies on volunteer motivation in citizen-science finding that user motivation differs person-to-person and changes over time [80, 81].

To test theories of motivation, we conducted one study focused on increasing classification volume using novelty theory [82]. In this experiment, we showed participants a unique message when they were the first person to see a particular glitch. While the theory was tested in other projects, the findings revealed novelty messages had no effect on volunteer behaviors. We hypothesize that other aspects of the project (e.g., gamified elements through leveling not found in other projects) might be sufficiently motivating. We also conducted an experiment to evaluate the efficacy of recruitment messages appealing to four types of motivations known to be salient in citizen-science projects: science learning, contribution to science, joining a community of practice, and helping scientists [83]. Messages about contributing to science resulted in more classifications; however, statements about helping scientists generated more initial interest from participants.

3.3.2 Volunteer training

Participants of the Gravity Spy project may have little classification task experience, so training volunteers is crucial for ensuring data quality. The Zooniverse platform supports training through text- and image-based tutorials. We conducted several investigations to gauge the efficacy of our training and learning regimens. One such study was conducting an online A/B field experiment to evaluate the training procedure described in Sect. 2.2 [28]: Group A received all glitch classes (with a wide range ML confidence) without training when they started the project, whereas Group B received the default scaffolded training through the workflow levels that increase in complexity and difficulty as they proceed through the project. As anticipated, we found that volunteers who received the training were more accurate in their classifications (as indicated by agreement on gold-standard data and with the eventual consensus decision on the glitch) and contributed significantly more classifications to the project.

Another study used digital trace data (produced as a by-product of interactions with computer systems) to evaluate how volunteers use learning resources when given feedback about their classification (e.g., “You answered Blip, but our experts classified this image as Whistle”). The results demonstrated that authoritative knowledge provided by the project scientists improved learning during early workflows. However, as the challenge increases in advanced workflows, volunteers rely on resources constructed by the community (e.g., discussion boards) to learn to identify glitches more accurately [84]. The results of our investigation into the learning process have informed design decisions for integrating scaffolded tasks in subsequent citizen-science initiatives. Our findings outline optimal strategies to blend human and ML schemes to train volunteers and identify the temporal importance of learning resources. Additionally, this research underscores the significance of learning resources generated by volunteers in the absence of expert guidance.

3.3.3 Volunteer engagement

Much research has been conducted about people engaging in virtual citizen-science projects. It tends to demonstrate that many volunteers contribute once or infrequently while a handful of volunteers perform the majority of work and limit their engagement to image classification [85, 86]. This observation is corroborated by a Gini coefficient of 0.85, indicative of a high level of inequality in participation as found in other citizen-science projects [87]. In Gravity Spy, on average, a volunteer makes 235 classifications (median = 2) before dropping out, and less than 12% of volunteers have contributed to the project’s discussion boards. While engagement is skewed, a small cadre of highly motivated and engaged volunteers is active on the site. Our analysis of digital trace data shows how volunteers engage beyond submitting classifications, engaging with discussion boards and other project infrastructure. Our research describes how volunteers share new knowledge by linking external resources (e.g., arXiv preprints) and sharing results of independent investigations speculating about the causes of glitches in the data [88, 89]. As noted above, many glitch class proposals are the product of intense data curation and evaluation on the discussion boards. We find both independent and collaborative interactions [90]. Overall, our research on engagement demonstrates the ability of volunteers to engage in more complex work when provided the means to do so.

3.3.4 Volunteer contributions

Volunteer efforts have been crucial to Gravity Spy. At present, over 7.4 million classifications have been performed by project volunteers, with more than 32, 000 registered users (and many more users that did not register with a Zooniverse account) contributing to these classifications. Volunteers have drafted over 30 new glitch proposals: in O3, new glitch classes from these proposals were included in the glitch classification interface and used in retraining the ML model [16, 19].

4 Challenges and future considerations

Gravity Spy has faced several challenges that can be traced back to the intersection between the science team’s and volunteers’ work: (1) science team–volunteer engagement; (2) mismatching temporal rhythms and divided priorities across the science team and volunteers; (3) retraining of ML models, and (4) the constant evolution of the GW detectors.

First, the science team often lacked the capacity for sustained volunteer interactions, a challenge for many citizen-science projects [91]. In Gravity Spy, this was evidenced through bottle necks in approving new glitch class proposals and questions posted to discussion boards. Glitch proposals required extensive efforts by the science team to evaluate the veracity of the proposed glitch in the data stream, whether proposed glitch classes are still impacting GW detectors and relevant to detector-characterization science, and, when proposals were rejected, justifying the decision to the volunteers. For some proposals, many months can pass between an initial glitch proposal by a volunteer and a resolution to the proposal by the science team.Footnote 9 In evaluating glitch proposals, we trained physics undergraduate and graduate students to review new glitch class proposals. However, student schedules quickly fill up, or they move on to other projects, so solely relying on students for glitch proposal review is not a long-term solution. Meanwhile, some moderators know as much or more about the detectors as students and even members of the science team, but lack formal training, easy access to experienced science team members, and password-protected LIGO resources. In such instances, we have asked moderators to tag science team members who they believe have expertise in the question. Another challenge is that volunteers do not always realize that seemingly simple questions posted to discussion boards may take hours to address satisfactorily.

We have experimented with different methods to triage questions to the relevant experts and enrolled detector-characterization group members (not already a part of the Gravity Spy research team) to be on call to respond to questions on the discussion boards. We quickly discovered that many questions pertained to the project infrastructure (e.g., debugging the promotion algorithm) or similar issues that could only be handled and addressed by a small number of people or a single person on the project team.

In the latest version of the project (see Sect. 5), we strive to make volunteers more self-reliant by giving them access to more background knowledge about the detectors when needed (the same knowledge science teams use to evaluate glitch class proposals). One element of this solution involves building a wiki that gradually expands volunteers’ access to expert knowledge about the detectors [92, 93].

Second, the temporal rhythms and priorities guiding science team members’ and volunteers’ work are not always consistent. For instance, the investigations by the detector-characterization group and volunteers do not always align. While the detector-characterization team works on priorities for an entire observing run, which may involve time-sensitive investigations and fixes to the instruments, these may occur before volunteers can identify a new glitch, given the time required to build collections and develop solid proposals about a new glitch class. In contrast, volunteers may submit new glitch class proposals involving glitches unknown to the detector-characterization group (e.g., the Pizzicato glitch). To solve this issue, we are considering a real-time citizen-science format where volunteers would work on urgent problems and new data that call for quick turnarounds.

Similarly, the science team often needs nearly real-time classification of glitches. As a result, they have come to rely on the automated ML classifications, rather than waiting for the theoretically better human-classified results. Our realization of this dynamic has led to a refocusing of the project: more quickly retiring images rather than expending volunteer effort to refine them, using the volunteer results for model retraining rather than as immediate feedback to LIGO scientists, and emphasizing the role of volunteers in the discovery and characterization of novel glitch classes.

The third major challenge in the Gravity Spy project has been that retraining the ML model based on volunteer classifications has taken longer than initially planned. The issue has multiple elements. The dataset containing volunteer classifications has impurities, and we have had to experiment with different ways to assess its accuracy. Rare glitch classes have few examples, making the integration of them into a retrained model difficult. From a ML perspective, dealing with four versions of the same glitch images in different time durations can also be challenging. While it is helpful for the volunteers to move between different time durations, the ML can sometimes regard some time durations as more important than others, leading to potential classification problems.

Fourth, GW detectors continually evolve [94,95,96,97]; with it, new glitch classes emerge, and known classes disappear or become less prevalent. One cannot leave the ML model unattended for long. At the beginning of O4, the volunteers and, soon after, the science team found that the ML model was confidently mislabeling instances of a novel glitch as a Whistle glitch, which is one of the two glitch classes presented to volunteers in their first workflow. During this initial portion of O4, the detectors did not produce many Whistles, even though they used to be common enough that Gravity Spy used them for training purposes in the initial workflow. As such, we are considering a replacement for the Whistle at the introductory volunteer level as we retrain the ML model to distinguish Whistles from the new glitch.

The constant evolution of the detectors emphasizes the importance of clearly distinguishing the architecture of the ML model from the different retrained versions of a specific model. We have had to carefully track how the model changes as one experiments with new, more effective architectures. Equally important, we have had to verify the provenance of the different versions of a model as it gets retrained on new classification data, some of which might include new glitches. This is not always easy in a distributed science team with many stakeholders interested in the continuous improvement of the model.

Finally, making all the improvements deemed necessary is not always possible. For instance, we have yet to complete a volunteer performance assessment system that moves beyond relying on gold-standard data. The current ML model would guide the presentation of image classification tasks to newcomers to help them learn how to do the task more quickly while contributing to the project’s work. We designed but never implemented a Bayesian model that would both estimate volunteers’ ability and decide the classification of an image. The reason for not implementing this scheme was due to occasional high-confidence misclassifications from the ML model, which, when the volunteers made an alternative selection, would hinder their ability to progress through the project workflows. A simulation of the model applied to volunteer promotion and image retirement suggests that the model would require fewer classifications than the current system and, in the process, save volunteers’ time [98].

The challenges that Gravity Spy faces need to be understood in the context of its successes. The Gravity Spy project has led to substantial innovations in combining machine learning and volunteer-based classification schemes, and demonstrated how the facilitation of a symbiotic relationship between these methodologies leads to significant improvement in classification and characterization tasks. The Gravity Spy classifications and associated models have benefited GW detector characterization, and the project has engaged more than 30 thousand volunteers in the scientific process. This success was tied to the substantial user base and promotion that Zooniverse has developed over the past 15 years, as well as the public interest in the growing field of GW science since the start of O1 in 2015. The project also benefits from an active group of advanced volunteers. Led by a dedicated team of volunteer moderators, Gravity Spy continues to discover, collect, characterize, and name new glitch classes. The volunteer moderators have also played a critical role in responding to the many questions of other users in the project, a feat that would not have been possible by the science team alone. The scaffolded training where volunteers gradually level up has greatly improved volunteer engagement and retention [28], and is being implemented across other Zooniverse projects. To a certain extent, the success of the Gravity Spy classification model and the high volunteer engagement has removed the urgency of addressing some the outstanding issues in the project itself.

5 Gravity Spy 2.0

The success and challenges of Gravity Spy have spurred us to explore how volunteers can engage in more complicated analysis of LIGO data that would help detector-characterization scientists identify the causes of glitches and isolate these signals in the data stream. Along with the main strain channel that is sensitive to GWs, the LIGO detectors record more than 200, 000 auxiliary channels of data per detector from a diverse set of sensors that continuously measure every aspect of the detectors and their environment (e.g., equipment functioning, activation of components, seismic activity, or weather) [11, 99]. To explore the cause of glitches (i.e., what is happening in the detector or the environment that causes particular glitches), detector-characterization scientists conduct investigations using the temporal correlation of these auxiliary channels and the main GW channel. Using these data, researchers further isolate noise signals in the data stream, often analyzing output data using statistical software and conducting visual inspections of GW and auxiliary channel glitch images.

In 2020, the Gravity Spy collaboration began work funded by a new multi-year National Science Foundation grant (HCC 2106865) focused on developing the capacity for amateur volunteers to conduct similar activities and develop causal insights about glitches. This project, known as Gravity Spy 2.0, is currently live as part of the broader Gravity Spy infrastructure. It is an extension of our efforts in the initial Gravity Spy and is designed to help volunteers develop their knowledge of the data over time by progressing through three stages (also Phases of the grant).

Stage 1:

Volunteers will compare glitches in the main GW channel to glitches in auxiliary channels. At first, we provide users with 3 auxiliary channels per glitch subject, totaling in \(\sim ~20\) distinct auxiliary channels being used across the glitch classes considered in the first phase of the project. Human input is valuable because there could be morphological similarities between the glitches in the channels that point to some association. However, there is no simple rule to determine which channels are relevant or how the channels are coupled. Engagement with the data and the channel ontology in this phase will also support volunteers in learning about the auxiliary channels and their potential relationship to a glitch in the GW channel. A similar approach has been tried using data from the Virgo detector in the GWitchHunters project [100].Footnote 10

Stage 2:

The primary task will shift from individual glitches to collections, building on the filtered dataset of glitches and auxiliary channels created in Stage 1. Volunteers are expected to (1) create collections of glitches that might represent a novel glitch class with hypothesized common causality, not just similar appearance; (2) search for relations between those glitches and groups of auxiliary channels over time; and (3) note possible causes behind the correlated glitches and auxiliary channels.

Stage 3:

Advanced volunteers will analyze the network compiled at Stage 2 to identify the root causes of glitches. A supporting activity is adding paths among auxiliary channels, e.g., by examining patterns that appear across multiple classes of glitches. For instance, if several channels are involved simultaneously in many classes of glitches, they are likely interrelated.

In each stage, we will use novel interfaces and tools to support volunteer tasks. Our research adopts a human-centered design approach involving volunteers and detector-characterization scientists in developing Gravity Spy 2.0.

Development of Gravity Spy 2.0 and research into how best to use its output is ongoing. Discussions with detector-characterization scientists are needed to understand and translate their work making causal inferences. To that end, we need to understand what background knowledge about the glitches and the detector is necessary and how to structure the tasks. On the volunteer side, we need to know how best to align the system support with volunteers’ current knowledge and capabilities. To that end, we recently completed interviews and design critiques with long-time Gravity Spy volunteers. During the user studies, we gained insights into volunteers’ glitch classification practices, which will inform how we design the three stages.

Much like the original Gravity Spy, the output of this investigation will contribute new knowledge about cutting-edge citizen-science techniques that involve amateur volunteers in more advanced scientific tasks.

6 Citizen science in physics and beyond

Gravity Spy’s use and advancement of strategies that combine human and machine effort have produced an important blueprint for future citizen-science projects in physics and a broad range of other research domains. The capabilities of human crowdsourcing alone cannot keep up with the demands of ever-growing data volumes, even when assuming continued access to large communities of volunteers. However, the value of human input is unquestionably high, particularly in applications such as out-of-sample classification, serendipitous discovery, and other advanced analyses. Therefore, the continued development and application of hybrid machine–human workflows is essential for unlocking the full potential of many datasets.

Gravity Spy has demonstrated paths forward for capitalizing on the capabilities and strengths of human crowdsourcing. First, the ML-informed assignment of human effort in Gravity Spy, where subjects with high machine confidence are tasked to beginners and retired quickly and subjects with uncertain or unidentified machine classifications are assigned to experienced volunteers, is a model applicable to many projects. If ML models can expedite the classification of easy and routine data, that leaves greater effort for humans to contribute in more meaningful and advanced ways. Moreover, if projects successfully combine machine and human classifications for use in improving and refining the machine model, the resulting classification quality and system efficiency can be improved and optimized. The training set augmentation and model retraining strategy of Gravity Spy or the active learning strategy employed by a recent iteration of the Galaxy Zoo project [101], show the benefits of this human–machine interplay and its applicability in multiple research domains.

Second, Gravity Spy affirms a broader trend across Zooniverse citizen-science projects that a subset of volunteers is willing to contribute in advanced ways that go well beyond the project’s primary classification task. The investigation and proposal of new glitch classes and the use of the project’s Similarity Search tool show the willingness of certain volunteers to perform information synthesis activities, which augment the typical, more narrow scope of machine classifiers. This behavior mirrors that seen in other projects, such as volunteers performing follow-up analysis of previously unknown galaxy sub-types via the Galaxy Zoo project (e.g., Green Pea galaxies [102]) or the use of publicly available external data and software to characterize exoplanets as part of the Planet Hunters TESS project [103]. A common thread among these examples is the ability for volunteers to access data and tools that allow additional contributions toward the project science goals, and the resulting reward in the form of new findings to project teams that facilitate and support these volunteer activities.

Third, the patterns of interaction associated with Gravity Spy’s glitch proposals demonstrate the importance of setting up effective processes for interaction and feedback between research teams and volunteers, especially those carrying out advanced work. While ambitious volunteers may take it upon themselves to reach out and communicate with the project team using basic means (email, Zooniverse Talk boards, etc.), both the research team and volunteers often benefit from creating structures and processes to facilitate useful and beneficial interactions. This conclusion motivates a recommendation that citizen-science projects craft a plan for volunteer interaction that includes tools for both communication and analysis, especially to encourage more independent forms of volunteer contribution. We highlight that the Gravity Spy team’s interdisciplinary composition was an important asset in assembling a successful plan for volunteer engagement and advocate that research teams that include members spanning a broad range of project-enabling skills will better position their projects for potential success.

In summary, Gravity Spy has pioneered how ML approaches and citizen science can be synthesized across various domains to produce useful results. GW science has been supported in the process through large-scale classification and characterization of glitch classes and identification of new glitch morphologies, thereby enabling the identification of glitch origins and the removal of glitches from the GW datastream or eliminating the root cause of glitches from the detectors entirely. The next phase of the Gravity Spy project and the continued flow of new observations from LIGO, Virgo and KAGRA show that this project will continue to meaningfully contribute to scientific discovery for years to come.