Gaia Data Release 3: The extragalactic content

The Gaia Galactic survey mission is designed and optimized to obtain astrometry, photometry, and spectroscopy of nearly two billion stars in our Galaxy. Yet as an all-sky multi-epoch survey, Gaia also observes several million extragalactic objects down to a magnitude of G~21 mag. Due to the nature of the Gaia onboard selection algorithms, these are mostly point-source-like objects. Using data provided by the satellite, we have identified quasar and galaxy candidates via supervised machine learning methods, and estimated their redshifts using the low-resolution BP/RP spectra. We further characterise the surface brightness profiles of host galaxies of quasars and of galaxies from pre-defined input lists. Here we give an overview of the processing of extragalactic objects, describe the data products in Gaia DR3, and analyse their properties. Two integrated tables contain the main results for a high completeness, but low purity (50-70%), set of 6.6 million candidate quasars and 4.8 million candidate galaxies. We provide queries that select purer sub-samples of these containing 1.9 million probable quasars and 2.9 million probable galaxies (both 95% purity). We also use high quality BP/RP spectra of 43 thousand high probability quasars over the redshift range 0.05-4.36 to construct a composite quasar spectrum spanning restframe wavelengths from 72-1000 nm.


Introduction
The primary objective of the Gaia mission is to study the structure and origin of our Galaxy by measuring the distribution, kinematics, and physical properties of its constituent stars (Gaia Collaboration et al. 2016). The satellite and its observing strategy were therefore designed to optimize the measurement of astrometry, photometry, and spectroscopy of point sources. Nonetheless, by observing the entire sky multiple times down to a limiting magnitude of G ≈ 21 mag, Gaia has observed millions of extragalactic objects since it started observing in mid 2014. Various data on many of these objects are provided as part of the third Gaia data release (DR3), covering both previously identified objects and new candidate objects identified using the Gaia data. The purpose of this paper is to summarize how extragalactic objects were identified, what their properties are, and what data on them are provided in Gaia DR3.
Extragalactic objects are classified or analysed by several modules in the Gaia data processing system. These modules were provided by different coordination units (CUs) within the Data Processing and Analysis Consortium (DPAC) and operate largely independently. They are as follows: CU3 Astrometry, which assembled a list of extragalactic point sources from external catalogues to use in defining the astrometric reference frame (Gaia Collaboration & Klioner et al. 2022); CU4 Extended Objects (EO), which analyses the surface brightness profiles of an input list of objects to look for physical extension; CU7 Variability, which uses photometric light curves to characterise variability; CU8 Astrophysical Parameters, which uses astrometry, photometry, and the BP/RP spectra to classify objects and to estimate redshifts. Whereas the modules from CU3 and CU4 work on predefined lists of extragalactic objects identified in other surveys, the Vari module in CU7 and the Discrete Source Classifier (DSC) module in CU8 use supervised machine learning to discover new objects. These classifiers use only Gaia data. The inclusion of additional data, such as infrared photometry, should improve the classification performance (sample completeness and purity). However, a key principle of the DPAC is to provide homogeneous classifications based only on the Gaia data, unaffected by issues with other catalogues, such as incompleteness.
Table 8 is only available in electronic form at the CDS at http://cdsweb.u-strasbg.fr/cgi-bin/qcat?J/A+A/
It is important to realise that there is no common definition of quasar or galaxy across the various Gaia modules. A common definition is also not possible, because each module uses different data to classify or select objects, including different training sets. But broadly speaking, the term 'extragalactic' in the context of this paper refers to unresolved or barely resolved individual objects more than 50 Mpc from the Sun.
If Gaia were to obtain noise-free, unbiased parallaxes, then identifying extragalactic objects would be simple: they would be all the objects with parallaxes below some threshold. Yet we do not have this luxury: despite the high precision of Gaia DR3 parallaxes, around 0.5 mas at G = 20 mag and 0.25 mas at G = 19 mag (Lindegren et al. 2021b), this is not nearly enough to reliably identify extragalactic objects through a simple cut on parallaxes (or proper motions). Indeed, 657 million objects in Gaia DR3 have raw parallaxes below 0.25 mas, the vast majority of which are of course stars in our Galaxy. This is not to say that parallaxes, and moreover proper motions, are not useful, however, and we do indeed make use of them in our classifications and analyses.
Most of the extragalactic candidates we have identified are bundled into two integrated tables in Gaia DR3, called qso_candidates and galaxy_candidates. As their names make clear, the construction of these tables has been driven primarily by the desire to be complete, rather than pure. Together these tables contain around 11.3 million unique objects and have global purities of 50-70%, although the purities are significantly higher when we exclude the Galactic plane, high density regions around clusters and galaxies, and the faintest sources. These tables are nonetheless a significant improvement over the gaia_source table, which has 1.8 billion objects and an extragalactic purity of around 0.2%. Our rationale for producing completeness-driven integrated tables is that it is easier for users to then select a sub-sample of purer objects (according to their own criteria) from our integrated tables, than it would be to find objects (in gaia_source) that had been removed from purity-driven tables. In Sect. 8 we recommend how to extract a purer sub-sample (∼95%) from the two integrated tables.
This paper is not the first to deal with classifying extragalactic objects using Gaia data. Initial studies cross-matched Gaia positions to other catalogues to analyse the properties of quasars and galaxies (e.g. Paine et al. 2018; Souchay et al. 2019). Several studies have made cuts on the astrometry (e.g. Heintz et al. 2018; Gaia Collaboration 2018), sometimes combined with classification using non-Gaia data (e.g. Fu et al. 2021), and others have applied machine learning methods to a number of Gaia metrics (e.g. Bailer-Jones et al. 2019) to identify extragalactic objects. Purer samples should be attainable when combining Gaia data with more discriminatory data, albeit at the loss of completeness if Gaia is the larger survey, and some studies report good results here (e.g. Wu et al. 2021). Other studies have used the Gaia data to characterise specific types of extragalactic object, such as gravitational lenses (e.g. Krone-Martins et al. 2018; Delchambre et al. 2019).
This paper is structured as follows. Section 2 summarizes the extragalactic processing modules that deliver results in Gaia DR3 and Sect. 3 describes the various tables that provide these results. Section 4 presents the properties of the extragalactic objects, such as sky distributions, spectra, surface brightnesses, and light curves. In Sect. 5 we provide some basic comparisons between the results of the different modules, and in Sect. 6 we compare the results to external surveys. In Sect. 7 we compute composite quasar spectra from individual quasar spectra at a range of redshifts. Section 8 describes a purer, and necessarily less complete, sub-sample of the integrated extragalactic tables. We conclude in Sect. 9 with some suggested use cases.
Many more details on the topics discussed here can be found in the extensive online documentation that accompanies this data release. We point in particular to the table and field descriptions there. Several other release papers provide details that are not in the documentation. These are Delchambre et al. (2022) for the CU8 classification and redshift estimation modules, Rimoldini et al. (2022) for the CU7 variability classifier and Carnerero et al. (2022) for the resulting selection of Active Galactic Nuclei (AGN), Ducourant et al. (2022) for the CU4 surface brightness profile analysis, and Gaia Collaboration & Klioner et al. (2022) for Gaia-CRF3. Readers may also want to consult De Angeli et al. (2022) for a description of the BP/RP spectrophotometry and Lindegren et al. (2021b) for the astrometric (parallax and proper motion) processing (the latter unchanged from Gaia EDR3).

Extragalactic processing modules
The modules in the Gaia data processing system that deal explicitly with extragalactic objects are as follows. DSC and Vari classify Gaia objects using supervised machine learning methods. Vari additionally provides characterisations of the light curves. UGC (Unresolved Galaxy Classifier), QSOC (Quasar Classifier), and OA (Outlier Analyser) analyse the results from DSC, the first two computing redshifts. EO analyses the surface brightness profiles of an input source list. We summarize these modules here, leaving more detailed descriptions to the individual processing papers cited below. We also include in our analysis the list of quasars identified for Gaia-CRF3.
Some sources, in particular galaxies, are partially resolved by Gaia. Their two-dimensional structure, combined with the fact that Gaia observes sources over a range of position angles, can induce a spurious (non-intrinsic) photometric variability or an apparent astrometric variability, the latter potentially being interpreted by the astrometric processing (Lindegren et al. 2021b) as spuriously large parallaxes and proper motions. The DSC and Vari modules take advantage of these spurious measurements to help them classify extragalactic sources.

Discrete Source Classifier (CU8-DSC)
The Discrete Source Classifier uses the BP/RP spectrum together with the mean G-band magnitude, the variability in this band, the parallax, and the proper motion to classify each Gaia source probabilistically into five classes: quasar; galaxy; anonymous (essentially single star); white dwarf; binary star. DSC is trained empirically on Gaia data with labels for the quasar and galaxy classes coming from Sloan Digital Sky Survey (SDSS) spectroscopic classifications. The distributions of the training data in colour and magnitude are shown in Fig. 1. The training data define the classes (see Sect. 6.2), so these are not the same class definitions adopted by other modules that contribute extragalactic source identifications to Gaia DR3. DSC comprises three classifiers. Specmod uses the BP/RP spectrum only and gives results for all five classes in DSC. Allosmod uses various photometric and astrometric features and only gives results for quasars, galaxies, and single stars. Specmod and Allosmod are nonetheless trained on a common set of data that has complete data for both classifiers. One consequence of this is that Specmod is also applied to some types of sources it was not trained on, for example galaxies that lack measured parallaxes and proper motions. Combmod combines the Specmod and Allosmod classification probabilities in a Bayesian manner to give probabilities for all five classes (using the algorithm described in the appendix of Delchambre et al. 2022). Probabilities from all three classifiers are provided in the astrophysical_parameters table. DSC is described in more detail in Delchambre et al. (2022) and in the online documentation.
DSC incorporates a global class prior that reflects the rareness of quasars and galaxies. This makes it hard to achieve a high purity even for a good classifier. For example, if only one in every thousand sources were extragalactic, then even if a classifier had a 99.9% accuracy, the resulting sample would only be around 50% pure. For this reason one must report results not on a balanced validation set, but on one that reflects this prior. In addition to posterior probabilities, DSC also provides two class labels. The first, classlabel_dsc, is assigned the name of the class that achieves the highest posterior probability in Combmod that is greater than 0.5. If none of the output probabilities are above 0.5 then this class label is unclassified. This tends to produce a complete but impure sample of objects when we properly account for extragalactic rareness. The analyses in Delchambre et al. (2022) and Bailer-Jones (2021) using SDSS spectroscopically-confirmed objects show a completeness for quasars and galaxies of over 90%, but a global purity of only about 20-25%. For Galactic latitudes above 11.5° the purities increase to 41%. Additional filtering increases this further (see Sect. 8). The second class label, classlabel_dsc_joint, defines a purer set of quasars and galaxies, and is assigned by requiring both Specmod and Allosmod probabilities to be above 0.5 for the corresponding class. This gives completenesses of 38% on quasars and 83% on galaxies, and purities on both classes of 63%. For Galactic latitudes above 11.5° the purities increase to about 80%.
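The effect of the class prior on purity described above can be checked with a few lines of arithmetic. The following is a minimal sketch, not part of DSC: the function name and its arguments are illustrative, and "99.9% accuracy" is interpreted here as a 99.9% completeness on the target class together with a 0.1% false-positive rate on everything else.

```python
def sample_purity(prior: float, completeness: float, false_positive_rate: float) -> float:
    """Expected purity of a selected sample for a given class prior.

    prior: fraction of all sources that truly belong to the target class.
    completeness: fraction of true targets that the classifier selects.
    false_positive_rate: fraction of non-targets that are wrongly selected.
    """
    true_positives = prior * completeness
    false_positives = (1.0 - prior) * false_positive_rate
    return true_positives / (true_positives + false_positives)


# One source in a thousand is extragalactic (assumed interpretation of the
# example in the text); the classifier is right 99.9% of the time.
rare = sample_purity(prior=1e-3, completeness=0.999, false_positive_rate=0.001)

# The same classifier evaluated on a balanced validation set.
balanced = sample_purity(prior=0.5, completeness=0.999, false_positive_rate=0.001)

print(f"purity with rare targets: {rare:.3f}")
print(f"purity on a balanced set: {balanced:.3f}")
```

The same classifier looks almost perfect on a balanced set but yields a sample that is only half genuine once the rarity of the target class is taken into account, which is why results must be reported on a set that reflects the prior.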

Quasar Classifier (CU8-QSOC)
The Quasar Classifier module (QSOC) estimates the redshift of sources classified as quasars by DSC-Combmod using their BP/RP spectra. For this selection, QSOC uses a very loose cut on the DSC quasar probability, classprob_dsc_combmod_quasar ≥ 0.01. This prioritizes completeness at the expense of purity to ensure that most of the objects that are suspected to be quasars are given a redshift estimate. The QSOC redshifts are inferred with a chi-square approach, whereby the BP and RP spectra are compared to a composite quasar spectrum taken at various trial redshifts in the range z = 0 to 6. The composite spectrum is built upon a semi-empirical library of quasars from the SDSS DR12Q sample (Pâris et al. 2017). Each SDSS spectrum is first extrapolated to the wavelength range covered by BP/RP before being converted into a BP/RP spectrum using the available instrument model. More details of the algorithm can be found in Delchambre et al. (2022). In addition to the best point estimate of the redshift, QSOC also estimates lower and upper confidence limits, redshift_qsoc_lower and redshift_qsoc_upper, which are the 15.9% and 84.1% quantiles of a log-normal distribution. The module also sets various processing flags in flags_qsoc, reflecting potential issues and/or degeneracies that may occur during the prediction phase.
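The general idea of a chi-square redshift search over trial redshifts can be illustrated with a toy example. This is only a sketch under strong assumptions and is not the QSOC algorithm: the template (a flat continuum with Lyα and C IV lines), the 330-1050 nm wavelength grid, the noise level, and the free flux scaling are all choices made here for illustration, and no instrument model is applied.

```python
import math


def template(rest_wavelength_nm: float) -> float:
    """Toy rest-frame quasar spectrum: flat continuum plus two emission lines."""
    lines = [(121.6, 5.0, 2.0), (154.9, 2.0, 2.0)]  # (centre, height, width) in nm
    flux = 1.0
    for centre, height, width in lines:
        flux += height * math.exp(-0.5 * ((rest_wavelength_nm - centre) / width) ** 2)
    return flux


def chi_square(observed, sigma, wavelengths_nm, z):
    """Chi-square of the redshifted template, fitted with a free flux scaling."""
    model = [template(w / (1.0 + z)) for w in wavelengths_nm]
    num = sum(o * m / s**2 for o, m, s in zip(observed, model, sigma))
    den = sum(m * m / s**2 for m, s in zip(model, sigma))
    scale = num / den  # least-squares amplitude of the template
    return sum(((o - scale * m) / s) ** 2 for o, m, s in zip(observed, model, sigma))


# Assumed observed-frame grid, 330-1050 nm in 2 nm steps.
wavelengths = [330.0 + 2.0 * i for i in range(361)]
true_z = 2.3
observed = [3.0 * template(w / (1.0 + true_z)) for w in wavelengths]  # noise-free
sigma = [0.1] * len(wavelengths)

# Trial redshifts from 0 to 6 in steps of 0.01.
trial_redshifts = [0.01 * k for k in range(601)]
best_z = min(trial_redshifts, key=lambda z: chi_square(observed, sigma, wavelengths, z))
print(f"best-fitting redshift: {best_z:.2f}")
```

With real spectra the chi-square curve typically has several local minima, for example when one emission line is mistaken for another, which is the kind of degeneracy the processing flags are meant to signal.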

Unresolved Galaxy Classifier (CU8-UGC)
The Unresolved Galaxy Classifier (UGC) estimates the redshift of sources classified as galaxies by DSC-Combmod with probability classprob_dsc_combmod_galaxy ≥ 0.25. UGC uses the BP/RP spectrum together with a supervised machine learning algorithm, the Support Vector Machine (SVM) (Cortes & Vapnik 1995; Chang & Lin 2011). A regression model (t-SVM) is trained on a set of 6000 sources selected from galaxies in the SDSS DR16 archive (Ahumada et al. 2020; Blanton et al. 2017) that are cross-matched to sources observed by Gaia. The BP/RP spectra and the SDSS redshifts of the sources in this set are used as training input and output, respectively. The SDSS galaxies were selected to have redshifts in the range 0 ≤ z ≤ 0.6 and magnitudes 17 ≤ G ≤ 21. Additional conditions were applied to specific parameters that influence the quality of the observed spectra. A test set of 250 000 galaxies, selected in a similar manner as the training set, was used to estimate the performance of the model, as reported in Delchambre et al. (2022).
This set was also used to estimate statistical uncertainties of the redshift predictions in redshift bins of width 0.02. The bias, that is, the mean difference between predicted and observed redshifts, was found to be −0.006 with a root mean squared error of 0.039 for the entire redshift range 0.0 ≤ z ≤ 0.6. However, the uneven distribution of redshift and magnitude causes the performance to be better for lower redshifts than for higher ones. For each estimated redshift_ugc we further determine the lower and upper prediction level, redshift_ugc_lower and redshift_ugc_upper, corresponding to the bias and the 1σ error of the SVM model in the closest bin. See the online documentation for more details.
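The per-bin statistics described above can be sketched as follows. This is an illustration only: the function name is hypothetical, the tiny data set is made up, and binning on the predicted (rather than the observed) redshift is an assumption of this sketch. How the resulting bias and scatter are combined into redshift_ugc_lower and redshift_ugc_upper is defined in the online documentation and is not reproduced here.

```python
import math
from collections import defaultdict


def binned_bias_and_sigma(predicted, observed, bin_width=0.02):
    """Per-bin bias and scatter of (predicted - observed) redshift.

    Sources are binned on their predicted redshift (an assumption of this
    sketch). Returns {bin_index: (bias, sigma)}.
    """
    residuals = defaultdict(list)
    for z_pred, z_obs in zip(predicted, observed):
        residuals[int(z_pred // bin_width)].append(z_pred - z_obs)
    stats = {}
    for index, values in residuals.items():
        bias = sum(values) / len(values)
        variance = sum((v - bias) ** 2 for v in values) / len(values)
        stats[index] = (bias, math.sqrt(variance))
    return stats


# Made-up test set: two sources in bin 5 (0.10-0.12) and one in bin 15.
predicted = [0.110, 0.115, 0.310]
observed = [0.120, 0.115, 0.300]
stats = binned_bias_and_sigma(predicted, observed)
for index in sorted(stats):
    bias, sigma = stats[index]
    print(f"bin {index}: bias = {bias:+.3f}, sigma = {sigma:.3f}")
```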

Outlier Analysis (CU8-OA)
The Outlier Analysis (OA) module was originally intended to analyse those sources that receive low classification probabilities for all DSC classes. As DSC-Combmod tends to give rather extreme probabilities (near to 0.0 or 1.0), we used OA to analyse all sources that have all DSC-Combmod probabilities less than 0.999. This corresponds to 56 million sources. OA uses a Self-Organizing Map (Kohonen 1982), an unsupervised neural network that groups together similar data on a two-dimensional grid of neurons, in our case 30 × 30. The data here are the BP/RP spectra. From this we compute a prototype spectrum of each neuron as the mean of all spectra assigned to that neuron. We further compute various statistics for each neuron, such as the mean G, G_BP, G_RP, parallax, and Galactic latitude. We also compute a quality index that is based on the intra-neuron distance distribution; it takes seven discrete values from 0 to 6, where 0 represents the best quality neurons and 6 the poorest ones. The method of allocation to these is described in the online documentation. Finally, we compute a class label for each neuron by finding the best match between its prototype and a series of labelled templates, although neurons with quality index 6 are not assigned a label. This information is given in the oa_neuron_information and oa_neuron_xp_spectra tables, and an interactive visualization tool that can explore these tables is available (Álvarez et al. 2021).
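The prototype computation mentioned above, the mean of all spectra assigned to a neuron, can be sketched in a few lines. This is not the OA code: the function name and the toy three-pixel spectra are illustrative, and the assignment of sources to neurons is taken as given rather than derived by training a Self-Organizing Map.

```python
from collections import defaultdict


def neuron_prototypes(spectra, assignments):
    """Mean spectrum of the sources assigned to each neuron.

    spectra: list of equal-length flux lists, one per source.
    assignments: list of (row, column) grid positions, one per source.
    """
    members = defaultdict(list)
    for spectrum, neuron in zip(spectra, assignments):
        members[neuron].append(spectrum)
    return {
        neuron: [sum(column) / len(group) for column in zip(*group)]
        for neuron, group in members.items()
    }


# Three toy spectra placed on a 30 x 30 grid (positions are made up).
spectra = [[1.0, 2.0, 3.0], [3.0, 2.0, 1.0], [0.0, 4.0, 8.0]]
assignments = [(0, 0), (0, 0), (29, 29)]
prototypes = neuron_prototypes(spectra, assignments)
print(prototypes[(0, 0)])
print(prototypes[(29, 29)])
```

The per-neuron statistics (mean magnitudes, parallax, Galactic latitude) follow the same pattern, with scalar quantities averaged in place of spectra.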

Variability (CU7)
Extragalactic objects can also be identified via their photometric variability. Galaxies with active nuclei show variability in their accretion, such as in Seyfert galaxies and quasars, and in the case of blazars variability can be intrinsic or geometrical, related to a relativistic plasma jet directed towards us.
Using a supervised classification method, Vari-Classification, described in Rimoldini et al. (2022), we identified 1.0 million Active Galactic Nuclei (AGN) and 2.5 million galaxy candidates from the variability of the Gaia light curves. Epoch photometry in the G, G_BP, and G_RP bands is published for AGN candidates, and for those galaxies that are part of the Gaia Andromeda Photometric Survey (GAPS; Evans et al. 2022) or that might be misclassified as real variables in Gaia DR3 (and so published in one of the variability tables). Indeed, the apparent variability of galaxies in the Gaia data is mostly an artefact of their extension combined with the Gaia on-board detection algorithm and scanning law (see Sect. 4.4.2) and so does not justify the release of their time series (which are meant only for genuine variable objects). Nevertheless, the characteristics of these artificial brightness variations made it possible to identify galaxies as if they [...]
Further analysis and characterisation of the variable AGN classifications (by the module Vari-AGN) led to a higher purity selection of about 872 000 objects (Carnerero et al. 2022), whose AGN-specific metrics are published in the vari_agn table, and repeated in the qso_candidates table (see Sect. 3). The purity of this AGN sample was estimated to be about 95%. The galaxy sample in galaxy_candidates is perhaps even purer, estimated at 99%, although with a lower completeness at around 40% (Rimoldini et al. 2022).

Surface brightness profile (CU4)
A source is recorded by Gaia only if the on-board video processing unit determines its light profile to be sufficiently steep at its centre (Gaia Collaboration et al. 2016). While this is intended to accept only point sources, it does pick up some extended objects (see Sect. 2). The resulting selection function has been assessed theoretically by de Bruijne et al. (2015) and de Souza et al. (2014).
The CU4 surface brightness profile module attempts to reconstruct the two-dimensional light profile of extragalactic sources in the following way (see Ducourant et al. 2022 for more details). Gaia scans each source at a range of transit angles during the course of its mission. These observations are mainly one-dimensional (nine one-dimensional Astro Field (AF) windows plus the two-dimensional Sky Mapper (SM) window), but after a sufficient number of transits, most of the surface of the source has been covered by these transits. The CU4 module attempts to reproduce these observed windows from a large number of simulations of images of galaxies, each with different shape parameters from which Gaia-like windows are extracted. The parameters that produce the best fit to the observations are taken as the profile of the source.
The module is only applied to a pre-selected list of extragalactic sources (summarized below). Fits are made for the flux profiles for two types of objects: quasars and their decomposition into quasar and host galaxy; and galaxies.
For the quasars, the module first compares the mean integrated flux of the source in the small AF window (707 mas × 2121 mas) to the mean integrated flux in the large SM window (4715 mas × 2121 mas) (Gaia Collaboration et al. 2016). A larger flux in the SM window is interpreted as a detectable host galaxy, and the surface brightness profile is fit as a combination of an exponential circular profile for the central active nucleus and a Sérsic profile (including ellipticity and position angle) for the host galaxy (see Fig. 2). The surface brightness profile parameters of the host galaxy are produced only when there is no other source present within 2.5″, and only for those sources with a half light radius smaller than 2.5″, in order to avoid too large an extrapolation of the profile and so to increase the reliability of the parameters.
For the galaxies, all the objects processed exhibit flux excess in the SM window when compared to the mean flux in the AF window (see Fig. 2), indicating that these sources are clearly extended. Two independent surface brightness profiles are fit for all objects: a Sérsic and a de Vaucouleurs profile.
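For readers unfamiliar with the profiles named above, the following sketch evaluates a circular Sérsic profile, of which the exponential (n = 1) and de Vaucouleurs (n = 4) profiles are special cases. It is not the CU4 fitting code: ellipticity and position angle are omitted, the function names are illustrative, and b_n is computed with the common approximation b_n ≈ 2n − 1/3 rather than solved exactly.

```python
import math


def sersic_b(n: float) -> float:
    """Approximate b_n, chosen so that r_e encloses half of the total light."""
    return 2.0 * n - 1.0 / 3.0  # common approximation for n of order 1 or larger


def sersic_intensity(r: float, r_e: float, n: float, i_e: float = 1.0) -> float:
    """Surface brightness at radius r for a circular Sersic profile.

    r_e is the half-light radius, n the Sersic index, and i_e the surface
    brightness at r_e. r and r_e must be in the same units (e.g. mas).
    """
    return i_e * math.exp(-sersic_b(n) * ((r / r_e) ** (1.0 / n) - 1.0))


r_e = 1000.0  # half-light radius in mas (illustrative value)
for r in (0.0, 500.0, 1000.0, 2000.0):
    exponential = sersic_intensity(r, r_e, n=1.0)
    de_vaucouleurs = sersic_intensity(r, r_e, n=4.0)
    print(f"r = {r:6.0f} mas  n=1: {exponential:8.3f}  n=4: {de_vaucouleurs:8.3f}")
```

The de Vaucouleurs profile is both more centrally peaked and more extended at large radii than the exponential one, which is why the two fits can respond differently to the flux excess in the larger SM window.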
The pre-defined list of extragalactic sources for these two types of processing was determined as follows. For quasars, several major catalogues of quasars and candidates were compiled: AllWISE (Assef et al. 2018; Secrest et al. 2015), HMQ (Flesch 2015), LQAC3 (Souchay et al. 2015), SDSS-DR12Q (Pâris et al. 2017), ICRF2 (Ma et al. 2009), and a selection of unpublished classifications of Gaia DR2 quasars based on photometric variability (Rimoldini et al. 2019). Together this gave a list of 1.4 million sources. Of these, we retained for analysis in Gaia DR3 a subset of 1 103 691 sources, each of which has at least 25 Gaia observations that together cover at least 86% of the surface area of the source. For the galaxies, a machine learning analysis of Gaia DR2 combined with the WISE survey (Cutri et al. 2012) was used to identify 1.9 million galaxy candidates (Krone-Martins et al. 2022). The same filtering of sources as for the quasars reduced this to 914 837 galaxies to be analysed.

Gaia-CRF3 (CU3)
One of the outputs of the astrometric solution in Gaia DR3 is the selection of a set of sources whose positions and proper motions define the celestial reference frame of Gaia DR3, called Gaia-CRF3. These correspond to sources cross-matched between Gaia and several external quasar catalogues, and selected according to specific quality metrics. The procedure to define this source list is described in Gaia Collaboration & Klioner et al. (2022) and Gaia Collaboration et al. (2021). This source sample also represents an official realisation of the International Celestial Reference System (ICRS) at optical wavelengths, as acknowledged by Resolution B3 of the IAU (2021).

Gaia DR3 tables with extragalactic content
The extragalactic content of Gaia DR3 is provided through a number of tables and fields. These list, among other measures, the outputs of the modules described in the previous section.
The gaia_source table provides two dedicated flags (in_qso_candidates and in_galaxy_candidates) that indicate the presence of a given source in the respective tables of the same name (described below). It also lists the DSC-Combmod probabilities for the quasar and galaxy classes (Sect. 2.1). The table astrophysical_parameters lists all the parameters produced by the modules in CU8, namely DSC, QSOC, UGC, and OA. Further results from OA (Sect. 2.4) are provided in the oa_neuron_information table. These tables contain all sorts of objects, not just (candidate) extragalactic ones. The tables vari_classification_result and vari_agn provide information on AGN identified through the photometric light curves (Sect. 2.5). As a complement to the Gaia-CRF3 table carried over from Gaia EDR3 (table agn_cross_id), there is a new table gaia_crf3_xm in Gaia DR3 that provides the complete cross-match information between the Gaia-CRF3 sources and the external catalogues in which they were identified (Gaia Collaboration & Klioner et al. 2022).

Integrated tables: qso_candidates and galaxy_candidates
In addition to the above tables, two integrated tables, qso_candidates and galaxy_candidates, are a compilation of the results from all processing modules that have classified or analysed extragalactic objects. While some of their columns are copies of information available in the above-mentioned tables, the rest are provided exclusively through these integrated tables. This is the case for the DSC class labels and the redshifts stemming from QSOC and UGC, as well as the results from the surface brightness profile analysis. These two integrated tables are limited to sources that are more likely to be extragalactic, and have been selected using a number of different selection rules that are defined in the online documentation. Below we provide just a summary of these rules.
The qso_candidates table is constructed as follows.
-Sources for which the quasar class probability was larger than 0.5 for any of the three DSC classifiers (Specmod, Allosmod, and Combmod; see Sect. 2.1) are included. In addition to this, QSOC sources with reliable redshifts were also added (Sect. 2.2). This reliability is determined from a combination of rules involving quality flags and Gaia photometry thresholds (for details see Delchambre et al. 2022).
-Sources based on the analysis of photometric light curves (Vari-Classification, Sect. 2.5) were selected when their class label was set to AGN. This class label is defined in Rimoldini et al. (2022). Almost all of these sources are also part of the Vari-AGN sample, but a handful are not and they have also been added to the integrated quasar table.
-Quasars for which the surface brightness profile was analysed as described in Sect. [...]
-We simply add class labels from OA to sources that are included by the above selections. These labels are not necessarily limited to be extragalactic source labels.
The galaxy_candidates table is constructed as follows.
-Sources for which the galaxy class probability was larger than 0.5 for any of the three DSC classifiers (Specmod, Allosmod, and Combmod; see Sect. 2.1) are included. In addition to this, UGC sources with reliable redshifts were also added (Sect. 2.3). The reliability is determined by a combination of two sets of rules, one concerning the quality of the BP/RP spectrum of the source, the other involving the comparison of outputs from three models estimating the redshift (for details see Delchambre et al. 2022).
-Sources identified by Vari-Classification (Sect. 2.5) were selected if their class label was set to GALAXY. For a description of how this class label was defined, see Rimoldini et al. (2022).
-Galaxies for which the surface brightness profile was analysed as described in Sect. 2.6 were included if the light profile parameters could be derived with sufficient quality. To complement this, an ancillary table galaxy_catalogue_name provides the name of the external catalogues that were used to select the sources that entered this pipeline. In Gaia DR3 the only applicable catalogue is that described in Krone-Martins et al. (2022).
-As for the qso_candidates table, OA does not contribute additional sources. It only provides additional columns, which are filled just for those sources that were processed by OA.
The source lists according to the above selection criteria were concatenated into two lists, one for the qso_candidates table and one for the galaxy_candidates table. A complete list of the parameters (table columns) available in each table is given in the online documentation. Columns are filled for all sources regardless of how they are selected; thus a source may have a DSC probability that does not meet the above DSC selection criteria, for example (see Table 1). Not all parameters are available for all sources, as not all sources were treated by all modules. There are 6 649 162 sources in the qso_candidates table and 4 842 342 in the galaxy_candidates table. This large number of sources is mostly due to the selection rules of the DSC module, which favour completeness over purity (see Sect. 2.1 and Delchambre et al. 2022). Users should therefore be aware that there is significant stellar contamination in these tables. For DSC this can be addressed using the classlabel_dsc_joint field. We address more generally how to build purer sub-samples in Sect. 8. There are 174 146 sources in common between the two tables, and their union contains 11.3 million sources.
Table 1 gives an overview of how many sources from each module contribute to the integrated tables. Source overlaps between the modules within each table are shown in Tables 2 and 3, and graphically represented in the Venn diagrams in Figs. 3 and 4. Information about the distribution of the parameters featured in the tables is provided in the next section.
To estimate the overall purity of the integrated tables, we must be aware that modules with different purities can contribute the same source to a table. The estimation can be simplified, however, when we consider that all modules except DSC have similar high purities. Specifically, for the qso_candidates we assume that the modules other than DSC have an average purity of 96%, compared to a global DSC purity of 24%. From Fig. 3 we see that 4.1 million sources are contributed only by DSC, with the remaining 2.6 million contributed by the other modules. This gives an overall purity of the qso_candidates table of 52%. In a similar way, we estimate the overall purity of the galaxy_candidates to be 69%. We show how to obtain a purer sub-sample in Sect. 8.
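The purity estimate for the qso_candidates table quoted above is a count-weighted mean of the purities of the two groups of sources. A minimal sketch of that arithmetic, using only the rounded numbers given in the text (the function name is illustrative):

```python
def mixed_purity(groups):
    """Overall purity of a sample built from (count, purity) groups."""
    total = sum(count for count, _ in groups)
    genuine = sum(count * purity for count, purity in groups)
    return genuine / total


# 4.1 million sources contributed only by DSC (purity 24%) and
# 2.6 million contributed by the other modules (assumed purity 96%).
qso_purity = mixed_purity([(4.1e6, 0.24), (2.6e6, 0.96)])
print(f"overall purity of qso_candidates: {qso_purity:.2f}")
```

The corresponding source counts for the galaxy_candidates table are not quoted in the text, so the 69% figure is not reproduced here.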
Table 1. Number of sources from each of the extragalactic processing modules contributing to the qso_candidates and galaxy_candidates tables (second column), or to the set of parameters featured for the respective modules (third column). The difference between the two columns indicates the number of sources where parameters are provided despite the sources not being eligible according to the selection rules of that module. A given source can be contributed by more than one module.


Integrated tables
Figure 5 shows the sky distribution on a logarithmic density scale of all sources in the qso_candidates and galaxy_candidates tables.As already noted, there is considerable contamination in these due to misclassifications and the completeness-driven nature of the tables (i.e. the absence of filtering in some modules).This is apparent from the overdensities around the Large and Small Magellanic Clouds (LMC and   ).Some patterns are also an artefact of the use of input lists for some of the modules.Many of these sources are also faint, with poorer data in Gaia DR3, as can be seen in Fig. 6.There is also a small fraction of sources that are too bright to be genuine quasars or galaxies, which is an inevitable consequence of even a small misclassification probability and limited filtering.
The Gaia colour-colour diagram (CCD) and colour-magnitude diagram (CMD) are shown in Fig. 7. Quasars and galaxies separate quite well, but recall that Gaia observes primarily those galaxies with point-source-like cores. What is not seen in these diagrams is the distribution of the stars, which outnumber true quasars and galaxies by a factor of 500-1000 in Gaia DR3, and which make it hard to identify extragalactic objects based only on their Gaia colours.

DSC subset of the integrated tables
DSC is the dominant contributor to the qso_candidates and galaxy_candidates tables, so we look here at two subsets for each table defined by the DSC class labels (Sect. 2.1). The first is selected by classlabel_dsc, which gives 5 243 012 quasars in the qso_candidates table (class quasar) and 3 566 085 galaxies in the galaxy_candidates table (class galaxy). Through comparison to SDSS spectroscopic classifications, and accounting for the significant contamination by stars, we estimate these samples to have rather low purities of 24% and 22% respectively (see Bailer-Jones 2021, summarized in Delchambre et al. 2022, and Sect. 6.2 below). The second subset is the purer one identified using classlabel_dsc_joint, which gives 547 201 quasars in the qso_candidates table and 251 063 galaxies in the galaxy_candidates table. These two sets are estimated to have higher purities of 62% and 64% respectively, and of 79% and 82% respectively if we look only at higher latitudes (|b| > 11.54°).
Figure 8 shows the Gaia colour-colour diagrams for quasars in the qso_candidates table according to these two subsets. The upper panels show the DSC-Combmod probabilities. In the upper left panel we see that there are sources far away from the main clump of quasars, but the lower panel reveals that there are very few of them. These are all removed in the classlabel_dsc_joint = quasar set (right column), which shows only high Combmod probabilities. Figure 9 shows the corresponding colour-colour diagrams for the galaxy_candidates table. Again we see how the set defined by classlabel_dsc_joint = galaxy has a tighter distribution and higher Combmod probabilities than the less pure set defined by classlabel_dsc = galaxy. Similar figures showing the quasar and galaxy populations together are shown in Delchambre et al. (2022). These also show that use of the joint label preferentially removes fainter, lower signal-to-noise sources, as these are less likely to get a high probability classification in both Specmod and Allosmod.
One thing to bear in mind is that Specmod and Allosmod do not deal with identical sets of sources, because these classifiers require different input data. In particular, Allosmod requires parallaxes and proper motions, that is 5p or 6p astrometric solutions (see Lindegren et al. 2021a for the definition of these solutions). Galaxies often only get 2p solutions (no parallax or proper motion) on account of their physical extent. Of the 3 566 085 sources in the galaxy_candidates table with classlabel_dsc = galaxy, 3 367 211 have all three photometric bands, but of these, only 1 015 462 have parallaxes and proper motions and so can be classified by Allosmod (these numbers are for the whole sky, so including the LMC and SMC). As classlabel_dsc_joint can only be set to galaxy when Allosmod results are present, the change in the distribution we see in Fig. 9 for the two class labels is partially due to this. Plots in Delchambre et al. (2022) show the change when only considering the subset with 5p or 6p solutions. Most quasars, in contrast, do have 5p or 6p solutions: of the 5 243 012 sources in the qso_candidates table with classlabel_dsc = quasar, 5 086 531 have all three photometric bands, of which 4 815 212 have parallaxes and proper motions.
Because DSC is not the only contributor to the integrated tables, some of the sources in these tables have DSC class labels that are not the class of the table. In the qso_candidates table, 156 970 sources have classlabel_dsc set to galaxy, and 12 302 have classlabel_dsc_joint set to galaxy. In the galaxy_candidates table, the numbers with these two class labels set to quasar are 12 933 and 234 respectively.

BP/RP spectra
Gaia observes all of its targets with the low resolution (30 ≤ λ/∆λ ≤ 100) BP/RP slitless spectrograph (Carrasco et al. 2021). Of these, 1.6 billion were used by DSC-Specmod for classification (Sect. 2.1; Delchambre et al. 2022), but only a fraction are published in Gaia DR3. Spectra for all sources brighter than G = 17.35 mag with at least 15 retained observations in each of BP and RP are published in Gaia DR3, amounting to 220 million sources. This includes few extragalactic sources, so a small set of these were added. In total, BP/RP spectra of 163 000 quasar candidates and 26 500 galaxy candidates in the integrated tables are published in Gaia DR3. Of these, 119 000 and 12 600 respectively are in the purer sub-samples defined in Sect. 8.
As described in De Angeli et al. (2022), spectra are published as a set of coefficients of basis functions, from which spectra at arbitrary samplings can be produced using a published software tool. Internal to CU8, the spectra were sampled using the tool SMS-gen (Creevey et al. 2022), which is what we used to produce the spectra shown in this section. In all cases the spectra are the mean (epoch-averaged) spectra over a time span of up to 34 months.

[...] to lower the contrast of these emission lines compared to the continuum. These wiggles smooth out faint spectral features, and can be confused with emission lines, as both have comparable strength in low S/N spectra. Typically, though, the strongest spectral features (Lyα, C iv, Hβ, and Hα) are retained in G < 20 mag spectra. We also see in Fig. 10 that regions at wavelengths below 430 nm and above 650 nm in BP, and below 630 nm and above 950 nm in RP, contain little flux: spectral features in these regions generally have low S/N, complicating their detection by the DSC and QSOC algorithms.

[...] spectrum, due to the much lower resolution and the already mentioned wiggles, shows much less prominent features.

Quasars
The majority of the 1 103 691 quasars analysed in terms of surface brightness lie on the diagonal of Fig. 2. These sources are considered point-like with no host galaxy detectable by Gaia. A group of 64 498 exhibit a clear extension, indicative of a host galaxy, as evidenced by larger fluxes in the SM window than in the AF window (Sect. 2.6). For these sources the flag host_galaxy_detected = true is set. Among these, a robust solution from the fitting process was derived for 15 867 sources and their surface brightness profile is given in the catalogue. The flag host_galaxy_flag indicates the outcome of the fitting process for all sources considered. Values of 1 and 2 are good fits, indicating detection of a host galaxy. A value of 3 indicates that no host could be found, whereas 4 is a poor fit. Sources with host_galaxy_flag = 5 or 6 show no evidence of a host galaxy in our analysis, due to non-convergence of the algorithm or the presence of a close neighbour, respectively.
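The host_galaxy_flag values described above can be kept in a small lookup when filtering the catalogue. This is a minimal sketch: the descriptions are paraphrased from the text, and the dictionary and helper names are illustrative rather than part of any Gaia tool.

```python
# Meanings paraphrased from the text; names below are illustrative only.
HOST_GALAXY_FLAG = {
    1: "good fit, host galaxy detected",
    2: "good fit, host galaxy detected",
    3: "no host galaxy found",
    4: "poor fit",
    5: "no evidence of a host (algorithm did not converge)",
    6: "no evidence of a host (close neighbour present)",
}


def has_reliable_host(flag):
    """True only for the flag values the text describes as good fits."""
    return flag in (1, 2)


# Example: keep only the sources whose profile fit can be trusted.
flags = [1, 3, 2, 6, 4]
reliable = [f for f in flags if has_reliable_host(f)]
```

Selecting on host_galaxy_flag rather than on host_galaxy_detected alone separates sources with a usable profile fit from those that merely appear extended.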
Figure 12 shows the spatial distribution on the sky of the 1 103 691 quasars analysed. The coverage is inhomogeneous due to the limited sky coverage of the catalogues that constitute the quasar input list (Sect. 2.6), but it also reflects the scanning law of Gaia, as we only analyse sources that have at least 25 focal plane transits. The empty zones correspond either to the Galactic plane or to zones of lower frequency of scanning in Gaia DR3.
The distribution of the Sérsic index for all the quasars has a mode at 0.9 and a mean of 1.9. These values are consistent with quasars hosted by galaxies with disk-like light profiles, in agreement with a recent study of the surface brightness of host galaxies from the Hyper Suprime-Cam Subaru Strategic Program (Li et al. 2021).
The distribution of position angles of host galaxies is roughly uniform, as expected, although there is a small excess at around 90°. These are sources with negligible ellipticity for which the position angle is meaningless. In such cases, our fitting algorithm favours a 90° position angle. The same is true for the galaxy sample discussed in the next section.
Of the quasars processed, 226 160 have a spectroscopic redshift listed in Milliquas 7.2 (Flesch 2021; selection TYPE = Q), of which 2084 have a host galaxy detected by Gaia. Figure 13 shows the distribution of these redshifts. As expected, the quasars with a host galaxy have small redshifts (mean z = 0.54) whereas those without a visible host galaxy have larger redshifts (mean z = 1.71). In a few cases the host is detected for larger redshifts. These sources are usually very faint (G > 20 mag) and suffer from uncertainties either in the light profile fit or in the redshift measurement. The host galaxies resolved by Gaia have an effective radius (encompassing half of the total light) distribution with a peak at around 800 mas.

Galaxies
The surface brightness profile module processed 914 837 galaxies. We see from Fig. 2 that all of these have a clear spatial extension.
The distribution of the effective radius of the de Vaucouleurs profile as measured by Gaia is shown in Fig. 14 as a function of the Gaia redshifts (given by redshift_ugc in the galaxy_candidates table). The redshifts are all below about 0.5, with a mean value of 0.16. As expected, the closer a source is to us, the larger its effective radius. There is a slight accumulation of effective radii at 8000 mas, which corresponds to the bound of the parameter search domain, with the result that for larger galaxies the radius remains at 8000 mas.
Figure 15 shows the distribution of these galaxies on the sky. As with the quasars in the previous section, we see an uneven distribution due primarily to the required minimum number of observations.
The distribution of the Sérsic index peaks at around 4.5, which is consistent with the fact that the on-board detection algorithm favours elliptical types (de Souza et al. 2014). A few thousand galaxies have a Sérsic index below 2, indicative of disk galaxies. A visual inspection of a fraction of these reveals that most of them exhibit a compact bright bulge. The effective radius distribution peaks at around 1800 mas for the Sérsic profile and at around 1000 mas for the de Vaucouleurs profile, which is typical of sources with a mean redshift of 0.13.
The ellipticities derived from Gaia exhibit a peak value around 0.25. This is more or less what is expected from the projection of oblate ellipsoids (representative of ellipticals) onto the plane of the sky and is also observed in other surveys, such as Padilla & Strauss (2008).

Of the one million G-band light curves of the variable AGN, 90% contain between 20 and 244 focal plane transits covering 795 to 1038 days (after applying time series filters described in Sect. 10.2 of the online documentation). On average they have 39 focal plane transits over 925 days, which is sufficient to follow the long-term variability of most AGN. Figure 16 shows the light curves in the G, G_BP, and G_RP bands of three sources belonging to different AGN classes: a) the type 1 Seyfert galaxy

Galaxies
About 2.5 million galaxies in the galaxy_candidates table were selected based on the properties of their light curves. (Only a subset of these light curves are published in Gaia DR3; see Sect. 2.5.) Gaia scans individual objects multiple times at different position angles. For extended objects this can produce an apparent, but spurious, photometric variability, because on each scan only part of the total flux is collected within the limited size of the allocated window (Holl et al. 2022). Figure 17 shows the light curve of a known galaxy, in which we see variations in excess of 0.6 mag in G. Figure 18 shows the distribution (normalized by area) of the parameter ipd_gof_harmonic_amplitude for galaxies, AGN, Gaia DR3 variable stars, and other objects in the Gaia Andromeda Photometric Survey. This parameter measures the amplitude of the variation of the Image Parameters Determination goodness-of-fit statistic as a function of the scan direction angle. Because galaxies are often extended objects at the Gaia resolution, they tend to have a larger value of this parameter than other types of objects. The galaxies that are detected by variability are based on this type of spurious signal. Figure 19 shows how the magnitude variability distribution of galaxies within 5.5° of the Andromeda Galaxy (M31) compares to that of other sources in the same classes as in Fig. 18. We note that Gaia DR3 variable stars still amount to a relatively small fraction of all variables detected in Gaia. The brightness variations of galaxies overlap with high-amplitude tails of the distributions of other classes.

Classification
The various extragalactic modules (Sect. 2) use different methods and data. This leads to a given source being classified differently in different modules, which is apparent in the qso_candidates and galaxy_candidates tables that collate results from all modules. On top of this comes the fact that different modules use different definitions of quasar and galaxy, in particular in the case of supervised learning algorithms, where the class is defined by the training data set. Tables 4 and 5 show the percentage of different classifications of the overlapping sources between modules based on the class labels where they exist (DSC, DSC-Joint, OA, Vari-Classification) or the existence of parameters (from UGC, QSOC, Vari-AGN, Surface brightness). For example, classlabel_dsc and Variability give different classes for 9.5% of their common sources. Such disagreements also come about because some modules focus more on high completeness, whereas others focus more on high purity (partially achieved by filtering). Recall also that the classification from a module appears in the table even if that source would not have been selected for inclusion in the table by that particular module (see Table 1). QSOC for quasars and UGC for galaxies are subsets of DSC selected with the properties described in Sections 2.2 and 2.3. Both use much lower thresholds on the DSC probabilities than do the DSC class labels.
Gaia-CRF3 does not distinguish between galaxies and quasars. Most are expected to be quasars, so all are in the qso_candidates table. OA works with a small fraction of sources that are generally faint and noisy, so the comparison between OA and other modules should be interpreted carefully.

[...] probabilities. vari_best_class_score (from Vari) provides the median normalized rank, which also increases from 0 to 1 with increasing reliability, but it is not a probability. To compare these quantities, we map the DSC probabilities into normalized ranks.
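Mapping probabilities into normalized ranks puts them on the same 0 to 1 ordinal scale as a rank-based score. The sketch below uses one common definition (zero-based rank divided by n − 1, with tied values sharing their mean rank). This definition is an assumption for illustration; the exact convention used for the comparison is not stated here.

```python
def normalized_ranks(values):
    """Map values to ranks scaled to [0, 1]; tied values share their mean rank."""
    n = len(values)
    if n == 1:
        return [0.0]  # a single value has no meaningful rank; avoid dividing by zero
    order = sorted(range(n), key=lambda i: values[i])
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < n and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2.0  # zero-based mean rank of the tie group
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank / (n - 1)
        i = j + 1
    return ranks


# Hypothetical class probabilities for four sources.
probs = [0.10, 0.95, 0.50, 0.95]
ranks = normalized_ranks(probs)
```

The result preserves only the ordering of the probabilities, which is what makes it comparable to a score that is itself a rank and not a probability.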

Redshift
Redshifts are derived by two modules, the results of which are reported in the qso_candidates table (from QSOC) and the galaxy_candidates table (from UGC). Of the 174 146 sources in common between the two tables, 16 534 have a redshift derived by both modules. These are compared in Fig. 21. Of these sources, 7 469 have predictions with |∆z| < 0.1, and the correlation improves when restricting the comparison to QSOC redshifts with higher reliability (black dots in Fig. 21): in this subset, 105 of 166 sources have |∆z| < 0.1. Specific discrepancies arise from emission line mismatches in the QSOC redshift determination. As QSOC aims to be complete, it processes galaxies, even though UGC, by design, generally gets better predictions on these objects (see Delchambre et al. 2022 for a more detailed explanation of these emission line mismatches). UGC, in contrast, aims to be pure and is accordingly not expected to process a significant number of quasars. Figure 21 shows loci of constant QSOC redshifts. These are probably erroneous matches at the BP/RP spectral borders, where wiggles from the Hermite polynomials are confused with quasar emission lines in the templates.
Figure 22 shows the colour-colour diagram for all sources for which UGC provides a redshift value, colour-coded by redshift. We see that galaxies generally become redder in G_BP−G, but bluer in G−G_RP as redshift increases from 0 to 0.4.

Sources with stellar parameters
The extragalactic tables contain sources for which stellar astrophysical parameters are also reported in Gaia DR3. This is expected, because stellar parameters were inferred for sources in- [...]

There are 255 948 sources in the qso_candidates table and 7069 sources in the galaxy_candidates table that have effective temperatures derived by the CU8 GSP-Phot module (Fouesneau et al. 2022). Checking a variety of metrics such as magnitude, sky distribution, and effective temperature itself, there is nothing apparently peculiar about these sources. Their presence is an inevitable consequence of the known stellar contamination. It is also important to remember that DSC, which is the single largest contributor to these integrated tables, did not filter out sources simply because they were bright (only DSC-Allosmod classifies sources with G < 14.5 mag to be stars).
There are also 4027 sources with valid radial velocities in the qso_candidates table, and 160 in the galaxy_candidates table.Considering that the extragalactic tables are mostly populated with faint sources, these small numbers are essentially due to the intrinsic magnitude limit of sources for which radial velocities could be derived in Gaia DR3 (Katz et al. 2022).Those featuring valid radial velocities have magnitudes that are usually incompatible with extragalactic sources, so it is fair to assume that they are stars.

Astrometric selection
Additional insight into the classified sources can be gained by analysing their astrometric parameters. As has been demonstrated in Gaia Collaboration & Klioner et al. (2022) and Gaia Collaboration et al. (2021), astrometry can be used to improve the purity of a sample of quasar candidates. It is clear, however, that this can only be achieved at the cost of reducing the completeness.
The procedure here is similar to that used in the construction of Gaia-CRF3, namely a two-step astrometric filtering of a sample of candidates (Gaia Collaboration & Klioner et al. 2022). In the case of Gaia-CRF3, the sample was obtained by cross-matching the Gaia EDR3 catalogue with several external quasar catalogues. In the present study, each of the Gaia classifiers contributing to the qso_candidates table is considered as an additional catalogue, and the same procedure is applied to all the external and Gaia-own selections of quasar candidates.
The first step of the astrometric filtering is to select individual sources that have high-quality astrometric solutions in Gaia EDR3 and statistically insignificant parallaxes and proper motions (see Gaia Collaboration & Klioner et al. 2022, Sect. 2.1 for the exact mathematical formulations). This step alone is insufficient to find genuine quasars (or extragalactic objects), as about 214 million sources in Gaia EDR3, dubbed 'confusion sources', satisfy these astrometric criteria. These are mostly stars of our Galaxy (Gaia Collaboration & Klioner et al. 2022, appendix C). At least at this stage of the Gaia project, astrometry cannot be used as an independent quasar classifier, although this may change in the future (see e.g. Heintz et al. 2015, 2018).
A second step of filtering is therefore needed. In this step, only those samples of sources are retained that show near-Gaussian distributions in the uncertainty-normalized parallaxes and proper motions. Since extragalactic sources are faint, the typical uncertainties of their astrometric parameters in Gaia DR3 are about two orders of magnitude larger than either the known level of systematic errors in Gaia DR3 (Lindegren et al. 2021a) or the known physical systematic effects (Gaia Collaboration et al. 2021). Bearing in mind that the true parallaxes and proper motions of genuine extragalactic sources should be zero, one expects Gaussian distributions of the normalized parameters. This requirement has proven to be very useful to distinguish genuine quasars from the confusion sources.
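The idea behind this second step can be illustrated with synthetic data: if the true parallax is zero, the measured parallax divided by its uncertainty should follow a standard normal distribution. The sketch below is a crude stand-in for that test, not the criterion actually used (see Gaia Collaboration & Klioner et al. 2022 for the exact formulation). The tolerance, sample size, and simulated values are all assumptions chosen for illustration.

```python
import random
import statistics


def normalized(values, errors):
    """Uncertainty-normalized quantities, e.g. parallax / parallax_error."""
    return [v / e for v, e in zip(values, errors)]


def is_near_standard_normal(z, tol=0.1):
    """Crude check: sample mean close to 0 and standard deviation close to 1.

    The tolerance is a hypothetical choice, not a value from the paper.
    """
    return abs(statistics.fmean(z)) < tol and abs(statistics.stdev(z) - 1.0) < tol


rng = random.Random(42)
n = 20000
errors = [rng.uniform(0.2, 1.0) for _ in range(n)]  # synthetic uncertainties in mas

# Quasar-like sample: true parallax is zero, so measurements are pure noise.
qso_plx = [rng.gauss(0.0, e) for e in errors]
# Star-like sample: the same noise on top of a true parallax of 0.5 mas.
star_plx = [0.5 + rng.gauss(0.0, e) for e in errors]

qso_ok = is_near_standard_normal(normalized(qso_plx, errors))
star_ok = is_near_standard_normal(normalized(star_plx, errors))
```

In this toy setup the quasar-like sample passes and the star-like sample fails, because a non-zero true parallax shifts and broadens the normalized distribution. The same check applies to the normalized proper motion components.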
Both steps of the astrometric filtering obviously reject some genuine quasars that have considerable measured, but spurious, proper motions due to time-varying source structure (see Sect. 2). A prominent example here is 3C273, which is not part of Gaia-CRF3 for this reason. The samples considered at the second step of the astrometric filtering can come from a particular external catalogue or from one of the Gaia classifiers, but could also be selections according to various criteria (e.g. avoiding the crowded areas on the sky) or intersections of such selections (e.g. sources that were found to be quasars by two classifiers).
An additional characteristic of a sample of genuine extragalactic objects is that its sky distribution should not show overdensities in known stellar structures in our Galaxy and its environments, such as clusters, although it could still be influenced by such structures, for example variable Galactic extinction.This can also be used to help decide whether a particular sample of sources should be retained.
Using this two-step selection procedure we have identified a set of 1 897 754 quasar candidates, which we refer to as the 'astrometric selection'. They are indicated by the astrometric_selection_flag in the qso_candidates table. The purity of this sample is difficult to estimate, but we believe it to be 98% or perhaps better. The vast majority of these [...]

We attempted a similar astrometric selection for the galaxy_candidates table. However, since most of the galaxies have only two-parameter astrometry in Gaia DR3, and as more problems in Gaia astrometry can be expected for extended sources, the astrometric selection for the galaxy table turned out to be less useful, so we decided not to publish it. Nonetheless, this analysis did reveal the properties of the population of sources in the astrometric selection that were classified as both quasars and galaxies: the astrometric selection from the qso_candidates table contains 54 892 sources that are also present in the galaxy_candidates table (cf. the overall overlap of these tables of 174 146 sources). Of those sources, 99% have 6p astrometric solutions. The normalized parallaxes and proper motions of this set of sources also have near-Gaussian distributions, but with standard deviations of 1.13-1.25, which is about 10% larger than for the astrometric selection as a whole. This set of sources is probably dominated by AGN for which source structure (i.e. the host galaxy) notably affects the astrometric solution. Indeed, a host galaxy was detected by Gaia for 23 805 of these sources (43%). Similar statistics of the normalized astrometric parameters can also be found for the set of sources in the astrometric selection for which a host galaxy was detected by Gaia (see Sect. 4.3), which contains 51 586 sources.
Thus we encounter in the optical a problem that is well known in radio astrometry (e.g. Charlot et al. 2020), namely the influence of the source structure on the quality of the astrometry. This topic will need special attention in future Gaia data releases.

Analysis of objects with lower probability classifications
The unsupervised algorithm OA was used to analyse the sources with lower DSC class probabilities (Sect. 2.4). Here we focus on those neurons that were assigned to an extragalactic class label (QSO or GAL). These are shown in Fig. 25 for two different subsets: high quality neurons (HQN), which represent quality categories 0 to 3, and low quality neurons (LQN), which represent categories 4 to 6. We further limit our analyses to those sources that appear in the integrated tables. Figure 26 shows the number of neurons and objects assigned to each quality category. Approximately 80% of the sources are assigned to a high quality neuron.
All OA sources were processed by DSC, as well as QSOC or UGC depending on their DSC probabilities. Table 6 is a contingency table showing the fraction of objects in common between these classifiers and the various OA neurons. For this we use classlabel_dsc. QSOC and UGC do not classify sources; for this purpose we just look at the ones they provide redshifts for. Among the galaxies identified by DSC, 83% were also found to be galaxies by the OA module, of which 77% landed in a high quality neuron and 6% in a low quality one. The coincidence increases for UGC, with 89% of its galaxies found in a galaxy neuron, of which 83% have high quality. The coincidence for the quasars is substantially lower, around 35% for both DSC and QSOC, with no substantial difference between high and low quality neurons. We also see that a large fraction of those objects that were not classified as a quasar or galaxy by DSC, or that were not analysed by QSOC, are classified as galaxies by OA: 54% and 90%, respectively, where most of them belong to a high quality neuron.
OA processes sources that tend to be faint with noisy BP/RP spectra, some of which OA had to modify (e.g. remove negative fluxes) so that it could process them. Table 6 suggests that the OA classification complements the results from the other modules. OA coincides with DSC and UGC when identifying galaxies in particular, and identifies objects rejected by those modules that may be real galaxies. OA could also potentially help to identify extragalactic candidates that are not in the qso_candidates or galaxy_candidates tables.

WISE and proper motions
To investigate the infrared colours of the sources in the integrated tables, we cross-matched them to the catWISE2020 catalogue (Marocco et al. 2021, including the 2021 catalogue updates) using a 1 arcsec matching radius. We found 4.31 million matches (65%) in the qso_candidates table, and 4 [...] Figure 27 shows the distribution of these sources in a Gaia-catWISE colour-colour diagram. We see that most galaxy candidates have W1−W2 colours between 0.0 and 0.5 mag. This agrees with the range identified by Stern et al. (2012) for galaxies without an active nucleus and redshifts below 0.6. The quasar candidates, in contrast, show two overdensities in the catWISE colour. We explore this further by looking at the quasar candidates in the proper motion space, as shown in Fig. 28. The upper panel is for all sources with 5p or 6p solutions (2.87 million sources). We see that the bluer clump at around W1−W2 ≈ 0 mag shows the full range of proper motions. Recall that non-zero proper motions of true quasars are spurious, either due to noise or to time-variable source structure. Nonetheless, the larger proper motions in the bluer clump compared to the redder clump are indicative of contamination by stars (and some galaxies), and the W1−W2 colour would seem to confirm that. Indeed, the lower panel of Fig. 28 is for the purer subset defined by classlabel_dsc_joint = quasar (0.50 million sources), and this retains just the redder sources with proper motions that are more consistent with zero (plus noise).
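A positional cross-match of this kind pairs each source with its nearest catalogue neighbour and keeps the pair only if the angular separation is within the matching radius. The following is a minimal brute-force sketch in plain Python, assuming coordinates in degrees. The function names and the toy coordinates are hypothetical, and a real cross-match of millions of sources would use a spatial index, not a double loop.

```python
import math


def angular_separation_arcsec(ra1, dec1, ra2, dec2):
    """Great-circle separation via the haversine formula; inputs in degrees."""
    ra1, dec1, ra2, dec2 = map(math.radians, (ra1, dec1, ra2, dec2))
    sin_ddec = math.sin((dec2 - dec1) / 2.0)
    sin_dra = math.sin((ra2 - ra1) / 2.0)
    a = sin_ddec ** 2 + math.cos(dec1) * math.cos(dec2) * sin_dra ** 2
    return math.degrees(2.0 * math.asin(min(1.0, math.sqrt(a)))) * 3600.0


def cross_match(sources, catalogue, radius_arcsec=1.0):
    """Return {source index: nearest catalogue index} for pairs within the radius.

    Brute force, O(N*M): suitable for small illustrative lists only.
    """
    matches = {}
    for i, (ra, dec) in enumerate(sources):
        best_j, best_sep = None, radius_arcsec
        for j, (cra, cdec) in enumerate(catalogue):
            sep = angular_separation_arcsec(ra, dec, cra, cdec)
            if sep <= best_sep:
                best_j, best_sep = j, sep
        if best_j is not None:
            matches[i] = best_j
    return matches


# Toy positions (RA, Dec in degrees): one counterpart 0.5 arcsec away, one 5 arcsec away.
gaia = [(10.0, 20.0), (150.0, -30.0)]
wise = [(10.0, 20.0 + 0.5 / 3600.0), (150.0, -30.0 + 5.0 / 3600.0)]
matches = cross_match(gaia, wise)
```

Only the first toy source is matched, since the second counterpart lies outside the 1 arcsec radius. The haversine form is used because it stays numerically stable at the sub-arcsecond separations relevant here.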

Quasars
Here we look in more detail at the properties of known quasars in Gaia. For this purpose we cross-matched quasars from SDSS-DR14 (Pâris et al. 2018) that have visually confirmed redshifts (source_z = VI or source_z = DR7Q) to all Gaia sources (those in gaia_source) using a 1 arcsec matching radius. Such a match is nominally identical to the quasars selected for training DSC (see Delchambre et al. 2022). However, we further limited this set to those with complete data in all photometric bands, at least five observations in BP and RP, complete astrometry (i.e. 5p or 6p solutions), and with G < 20.75 mag. This gave 232 794 sources covering the redshift range 0.038-5.305.
A quasar in SDSS-DR14 is defined according to spectroscopic criteria. Specifically, quasars are sources with: (a) either at least one broad emission line with a full width at half maximum larger than 500 km s⁻¹, or interesting or complex absorption features; and (b) sufficiently large intrinsic luminosity (M_i[z = 2] < −20.5). Since only one broad emission line is required, some objects may otherwise be classified as type 2 AGNs (those with predominantly narrow emission lines). The second part of the first condition aims to include Broad Absorption Line (BAL) quasars. This definition is free of morphological criteria.
The sample defined above is similar to the superset from which the DSC-Allosmod training set was drawn. However, DSC did not force classifications on them, so we can use it to assess DSC's completeness as a function of magnitude and redshift (further assessments can be found in Bailer-Jones 2021). This is shown in Fig. 29, using the two class labels from DSC. The dependence on redshift is expected because of its weak correlation with G_BP−G_RP, which increases the confusion with stars at high redshifts and with galaxies at low redshifts. Nonetheless, the completeness is above 80% for redshifts between 0.3 and 3.6 and G ≤ 20.25 mag. The lower completeness at fainter magnitudes is also expected, because lower quality data are more likely to be classified by DSC as the majority class of stars, according to the global prior (Sect. 2.1), especially for the more conservative classlabel_dsc_joint label. This also explains why the overall completeness is much lower for this label, although it is still above 60% from z = 0 to z = 2.5 for G ≤ 19.25 mag. The overall completeness of classlabel_dsc is 215 721/232 794 = 93% and of classlabel_dsc_joint is 97 995/232 794 = 42%. However, given the non-uniform selection function of SDSS for obtaining spectra, we should be careful not to over-interpret this specific assessment of DSC's completeness.
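The overall completeness values above are simple ratios of recovered to known quasars, and the magnitude dependence comes from evaluating the same ratio per bin. The sketch below shows both steps. The two overall ratios use the counts quoted in the text, whereas the binned example runs on a handful of made-up magnitudes and labels purely to show the mechanics; the function names and bin width are illustrative.

```python
from collections import defaultdict


def completeness(n_recovered, n_true):
    """Fraction of known objects that received the expected class label."""
    return n_recovered / n_true


def binned_completeness(mags, recovered, bin_width=0.5):
    """Completeness per magnitude bin, returned as {bin lower edge: fraction}."""
    totals, hits = defaultdict(int), defaultdict(int)
    for g, ok in zip(mags, recovered):
        edge = (g // bin_width) * bin_width
        totals[edge] += 1
        hits[edge] += bool(ok)
    return {edge: hits[edge] / totals[edge] for edge in sorted(totals)}


# Counts quoted in the text for the SDSS-DR14 comparison sample.
n_sdss = 232_794
c_dsc = completeness(215_721, n_sdss)
c_joint = completeness(97_995, n_sdss)

# Made-up example: five known quasars, three of them recovered.
toy_mags = [19.1, 19.3, 20.2, 20.4, 20.3]
toy_recovered = [True, True, True, False, False]
toy_bins = binned_completeness(toy_mags, toy_recovered)
```

Extending the binning to two dimensions (magnitude and redshift) gives the kind of completeness map shown in Fig. 29.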
The G_BP−G_RP vs redshift relation for the sources classified as quasars by classlabel_dsc and classlabel_dsc_joint is shown in Fig. 30 (top and bottom panels). The pattern of undulations is expected, and is due to the quasar emission lines moving across the bands with redshift. The tail to the red for z > 3.5 corresponds to the Lyα forest entering and then filling the BP band. We see that some low redshift objects classified by SDSS as quasars are classified as galaxies by DSC. These quasars likely have a higher contribution of the host galaxy to the total emission, making them redder. The quasars with classlabel_dsc_joint = quasar follow neatly the bluest part of the colour-z relation, avoiding the regions with G_BP−G_RP above the median (compare top and bottom panels). This result and Fig. 29 indicate that this class selects mainly bright quasars, and from these only the bluest ones. The classlabel_dsc = quasar class complements the classlabel_dsc_joint = quasar class by covering the quasars that are redder due to their intrinsic emission, Galactic extinction, or local absorption (as in BALs, for example). The middle panel of Fig. 30 shows that the incompleteness of classlabel_dsc at high z is due to the misclassification of many of these quasars as stars (see also Fig. 29). Similarly, this plot shows that the quasars in the envelope of the reddest colours over the range z = 0.5-3.5 are also classified as stars.
Figure 31 shows the colour-colour diagram for the sample colour-coded by SDSS redshift (top panel) and by Gaia phot_bp_rp_excess_factor (bottom panel). In the upper panel we see a clear trend of colour with redshift, with low redshifts located in the upper left and redshift increasing as we descend, but with the highest redshifts on the far right. The bottom panel shows that the region of low-z sources corresponds to sources with larger phot_bp_rp_excess_factor, which indicates that the combined BP and RP bands contain more flux than the G band. This region overlaps with the location of the galaxies (see plots of the DSC source densities in Delchambre et al. 2022). This suggests that the excess in G_BP and G_RP with respect to G is due to the wider photometric windows of G_BP and G_RP compared to G, which allow detection of the quasar's host galaxy emission over a wider region than in the G band. This, together with the red G_BP−G_RP colours, indicates a shift from quasars dominated by the nucleus to quasars with an important contribution from the host galaxy. This transition can also be seen in the colour-colour diagram of the purer sub-samples from the qso_candidates and galaxy_candidates tables (Fig. 37), followed by the transition to the general galaxy population.

Galaxies
Table 7. SDSS spectral classes for objects in the galaxy_candidates table. The first row gives the number of sources found in SDSS for the whole table (first column) and for each module (subsequent columns). The following rows give the percentage of sources of each SDSS class among these. The GALAXY row can be thought of as a measure of the purity against sources with SDSS spectral classifications, which by design is dominated by extragalactic sources, so is an overestimate of the purity of a sample selected at random from Gaia.

We compared the content of the galaxy_candidates table with the spectral classes in SDSS DR16. We cross-matched the catalogues with a 1 arcsec radius and removed duplicated sources or ones where zWarning was not equal to zero. Of the 4 842 342 sources in the galaxy_candidates table, we found 534 154 matches in SDSS. Of these, 98.0% are classified by SDSS as GALAXY, 1.6% as QSO, and 0.4% as STAR. Table 7 shows these percentages for each module that contributed to the galaxy_candidates table. For example, 48 460 sources have classlabel_dsc_joint = galaxy in the matched set, and 95.1% of these have SDSS class GALAXY. This percentage is a measure of the purity of classlabel_dsc_joint but only against those sources that have SDSS spectral classifications. It is considerably higher than the purity of this DSC class label reported in Sect. 2.1 (and detailed in Delchambre et al. 2022), even though this purity estimate was also based on SDSS.

The reason is that this earlier purity estimate was computed for a set of sources selected at random from Gaia: it includes the significant contamination from non-galaxies, which outnumber galaxies by a factor of about one thousand in Gaia. The higher figure reported in Table 7, in contrast, is just for those sources that have SDSS spectral classifications. By design of SDSS this includes proportionally very few stars and so very few potential contaminants. The numbers in Table 7 are therefore a significant overestimate of the true purity and should be treated with caution.
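The dependence of purity on the class mix of the parent sample can be illustrated with a short sketch. The classifier rates used here (90% of galaxies selected, 0.1% of non-galaxies wrongly selected) and the 1:1 mix of the spectroscopic-like sample are purely hypothetical values chosen for illustration; only the factor of about one thousand is taken from the text. The function name `purity` is likewise our own.

```python
def purity(completeness, contamination_rate, n_true, n_other):
    """Fraction of selected sources that genuinely belong to the target class.

    completeness       -- fraction of true class members that are selected
    contamination_rate -- fraction of non-members that are wrongly selected
    n_true, n_other    -- numbers of members and non-members in the parent sample
    """
    selected_true = completeness * n_true
    selected_other = contamination_rate * n_other
    return selected_true / (selected_true + selected_other)


# Hypothetical classifier rates (assumptions, not values from the paper).
COMPLETENESS = 0.90
CONTAMINATION_RATE = 0.001

# Spectroscopic-like parent sample: few potential contaminants (illustrative 1:1 mix).
p_spectro = purity(COMPLETENESS, CONTAMINATION_RATE, n_true=1.0, n_other=1.0)

# Random Gaia-like parent sample: non-galaxies outnumber galaxies by ~1000.
p_random = purity(COMPLETENESS, CONTAMINATION_RATE, n_true=1.0, n_other=1000.0)

print(f"purity in spectroscopic-like sample: {p_spectro:.3f}")
print(f"purity in random Gaia-like sample:   {p_random:.3f}")
```

With identical classifier behaviour, the purity drops from close to unity to below one half once the contaminating class dominates the parent sample, which is the effect described above.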
Of the 1 367 153 sources in the galaxy_candidates table that have redshifts provided by UGC, 248 356 match to sources in the SDSS-DR16 specObj table that are classified as GALAXY. The small discrepancy with the number in Table 7 is due to slightly different cross-match criteria. Figure 32 shows the difference between redshift_ugc and the SDSS-DR16 redshift as a function of the latter. The average of this difference is 0.06 with a standard deviation of 0.054 (which reduces to 0.029 when the 67 sources with redshifts above 0.6 are excluded). Generally the agreement is good, although UGC seems to systematically overestimate very low redshifts.

Composite quasar spectrum
Composite quasar spectra have many uses. First and foremost they are used as a reference in cross-correlations of individual spectra in order to classify these and determine their redshifts. Composite spectra are also used to identify faint spectral features that would otherwise be undetectable, to calibrate absolute magnitudes through the k-correction, as well as to construct colour-colour relations for identifying and characterising quasars based on photometry. Here we construct composite spectra to unveil the capability offered by the Gaia BP/RP spectrophotometers to characterise quasars.
In order to build composite BP/RP spectra we use the quasar sample described in Sect. 4.2.1, which is based on the Milliquas 7.2 quasar catalogue of Flesch (2021). Our sample comprises 42 944 sources for which we use the Milliquas redshifts. We rely  8. We nonetheless also compute a composite Gaia-only spectrum based on 111 563 sources coming from the Gaia classifications and input lists together with the redshifts from QSOC. The exact sample used for this is defined in appendix B.2. The method by which we compute composite spectra is described in appendix C. The method relies on a single parameter, the logarithmic wavelength sampling of the composite spectra, chosen here to be log S = 0.003 (i.e. S ≈ 1.003) as a compromise between S/N and execution time. This sampling also applies to the transformation matrices (M_i in Eq. C.1) that cover the observed wavelength region 309.5-1100.5 nm.
The resulting composite spectrum inferred using the Milliquas sample is shown in Fig. 33 as the solid black line and the values are listed in Table 8. It covers a rest-frame wavelength range from 75.67 nm to 992.64 nm. The Gaia-only composite spectrum is shown as the dotted line and covers the wavelength range 47.08 nm to 962.83 nm. The composite spectra are trimmed in order to discard wavelength regions having flux density S/N less than one. After a multiplicative re-scaling of the flux densities so that their continua align, we found absolute differences between these two composites (relative to the Lyα flux density in the Milliquas composite spectrum) of less than 4% over the rest-frame wavelength region 100-900 nm, but up to 60% for regions bluewards of this range. The cause of these larger deviations is either contamination by sources with erroneous redshift estimates, as a consequence of the low purity of QSOC in this very high redshift region (see Delchambre et al. 2022), or border effects in the externally calibrated BP/RP spectra that we describe below. Figure 33 also shows, for comparison, a median composite spectrum from SDSS (Vanden Berk et al. 2001) that covers a rest-frame wavelength range similar to that of our full composite spectrum.
Figure 34 shows the redshift and magnitude distributions of the sources used to build the Milliquas-based composite spectrum. While the redshift distribution is as expected, with imprints of the selection and observational bias of each survey composing the Milliquas catalogue, the sharp drop at G > 19 mag is due to the filtering of the set of BP/RP spectra published in Gaia DR3. The 17 sources with G ≥ 19 mag are present because: 11 are associated with a best-matching node of the OA module (see Sect. 2.4), four are used as external calibrators by CU5, and two are white dwarf candidates used in Bellazzini et al. (2022).
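The logarithmic sampling can be sketched as follows. We assume here that "log" denotes the natural logarithm, which is what the quoted S ≈ 1.003 for log S = 0.003 implies, and we use for illustration the rest-frame limits of the Milliquas-based composite (75.67-992.64 nm). The function name `log_grid` is hypothetical and is not taken from the released code.

```python
import math

LOG_S = 0.003  # logarithmic step quoted in the text; exp(0.003) is about 1.003


def log_grid(lambda_min, lambda_max, log_step=LOG_S):
    """Wavelengths spaced by a constant factor exp(log_step), starting at lambda_min.

    The last sample is the largest grid point that does not exceed lambda_max.
    """
    n_samples = int(math.floor(math.log(lambda_max / lambda_min) / log_step)) + 1
    return [lambda_min * math.exp(k * log_step) for k in range(n_samples)]


# Illustrative rest-frame limits (nm) of the Milliquas-based composite spectrum.
grid = log_grid(75.67, 992.64)
print(len(grid), grid[0], grid[-1])
```

On such a grid a redshift corresponds to a simple shift by a fixed number of samples, since log λ_obs = log λ_rest + log(1 + z).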
In addition to the full composite spectrum computed over the whole redshift range from 0.052 to 4.358, we also computed composite spectra over several narrower redshift ranges, chosen such that their logarithmic rest-frame wavelength coverages are of approximately equal size: 0.052 ≤ z < 0.5, 0.5 ≤ z < 1, 1 ≤ z < 2, 2 ≤ z < 3 and 3 ≤ z ≤ 4.358. Composite spectra associated with each redshift range are shown in Fig. 33. The number of sources used in each redshift range, as well as the resulting reduced chi-square derived from Eq. C.1, are provided in Table 9 along with the frequency (ν) continuum slope, α_ν, and some information on the strongest observed emission lines: the rest-frame wavelengths, the relative flux density at their peak compared to either Lyα or Mg ii, and the full width at half maximum (FWHM). Unsurprisingly, all reduced chi-squares are larger than unity: each composite spectrum evidently does not completely model the intrinsic variance seen in the observations. However, the moderate values that are observed are indicative of reasonable fits to the observations, which is corroborated by visual inspection and by the good agreement with the median composite spectrum of Vanden Berk et al. (2001). The full composite spectrum (with χ²_ν = 6.6) also explains 99.2% of the observed variance in the entire set of spectra, although most of this variance comes from the spectral continuum. The maximum S/N ranges from 224 per log S interval in the composite spectrum with 3 ≤ z < 4, to 645 per log S interval in the composite spectrum associated with the 0.5 ≤ z < 1 redshift range.
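The near-equal logarithmic coverage of the redshift ranges can be checked with a rough sketch. It assumes that the rest-frame coverage of a bin is simply the observed wavelength region 309.5-1100.5 nm divided by (1 + z) at the two bin edges; this ignores the S/N trimming applied to the published composites, so the numbers are indicative only. The function name `rest_coverage` is our own.

```python
import math

# Observed wavelength region (nm) quoted for the transformation matrices.
OBS_MIN_NM, OBS_MAX_NM = 309.5, 1100.5

# Redshift ranges used for the narrower composite spectra.
REDSHIFT_BINS = [(0.052, 0.5), (0.5, 1.0), (1.0, 2.0), (2.0, 3.0), (3.0, 4.358)]


def rest_coverage(z_lo, z_hi):
    """Rest-frame span (nm) reachable by sources with z_lo <= z <= z_hi.

    The bluest rest wavelength comes from the highest redshift and the
    reddest one from the lowest redshift.
    """
    return OBS_MIN_NM / (1.0 + z_hi), OBS_MAX_NM / (1.0 + z_lo)


log_widths = []
for z_lo, z_hi in REDSHIFT_BINS:
    blue, red = rest_coverage(z_lo, z_hi)
    log_widths.append(math.log(red / blue))
    print(f"{z_lo:5.3f} <= z <= {z_hi:5.3f}: {blue:7.2f}-{red:7.2f} nm, "
          f"log width = {log_widths[-1]:.3f}")
```

Under these assumptions the five logarithmic widths differ from one another by only a few percent.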
The highest S/N of 610 per log S interval of the full composite spectrum allows us to identify many more emission lines than are otherwise visible in the BP/RP spectra of individual sources. Whereas only the Lyα, C iv, C iii], Mg ii, Hβ, and Hα emission lines are commonly seen in the observed BP/RP spectra of quasars, a visual inspection of the full composite spectrum reveals 22 emission lines, which are listed in Table 10. Despite the low resolution of BP/RP spectra (Sect. 4.2), all common quasar emission lines were retrieved during this inspection procedure, in addition to some weak or rarely-seen emission lines.
Emission lines from the wavelength region covering the Lyα forest are similarly recovered in an unambiguous way. We achieve good agreement with laboratory wavelength positions, with a maximum absolute difference of |∆λ| = |λ_lab − λ_obs| = 0.951 nm for the C iii] emission line, where we consider only the nearest emission line in case these are blended. The apparent blueshift of the C iii] emission line arises from its asymmetry, which is due to the presence of the Si iii] λ 189.203 nm line in its neighbourhood. The same rationale applies to the shift of the Lyα and Hβ emission lines (∆λ = −0.683 nm and −0.433 nm, respectively) due to the presence of the N v λ 124.014 nm and [O iii] doublets respectively.
Our composite spectra are highly coherent with one another, with little variation depending on the redshift ranges that were used. After a multiplicative re-scaling of the flux densities to align the continua, we found differences relative to the Lyα flux density of less than 6% when compared to the full composite spectrum over several rest-frame wavelength regions (230-900 nm for the 0.052 ≤ z < 0.5 composite spectrum; 180-680 nm for the 0.5 ≤ z < 1.0 composite spectrum; 115-500 nm for the 1 ≤ z < 2 composite spectrum; 85-320 nm for the 2 ≤ z < 3 composite spectrum and 85-240 nm for the 3 ≤ z < 4 composite spectrum). Some border effects that were ignored in the previous comparison can still be seen in all composite spectra. These are due to the low S/N as well as systematic errors at the borders of the externally calibrated BP/RP spectra of quasars that we used to build the composite spectra. Such an effect is particularly noticeable in Fig. 33 redwards of 900 nm, where a sharp rise is visible that is also seen in about 90% of the individual spectra of quasars we used. This artefact is consequently taken as a genuine signal by our method.
The differences noticed when comparing the full composite spectrum to the SDSS one of Vanden Berk et al. (2001) redwards of the Hβ emission line are presumably due to the different angular scales of the Gaia BP/RP footprint and the SDSS fibres, which are 2.1 arcsec (across scan direction; Carrasco et al. 2021) and 3 arcsec, respectively. As a result, contamination of the composite spectrum by light from the host galaxy is more strongly suppressed in the Gaia observations. Figure 33 and Table 9 reveal differences in the continuum slopes, α_ν, over the various redshift ranges. These are mostly a result of the different rest-frame wavelength coverage of each composite spectrum. Indeed, the presence of broad Fe multiplets and the Balmer continuum in the wavelength region 200-550 nm (the so-called 300 nm bump) complicates the continuum fitting in this range due to the lack of pure continuum. Consequently the values of α_ν decrease once we consider quasars with redshifts z > 1. The value we found on the full composite spectrum, α_ν = −0.464 ± 0.005, agrees reasonably well with literature values. The median composite spectrum from Vanden Berk et al. (2001) shown in Fig. 33, for example, has α_ν = −0.46.
We inspected the measurements associated with the dominant quasar emission lines given in Table 10. We did not find any meaningful correlation between their values and the redshift range that was used for computing each composite spectrum.
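The line positions discussed above are obtained from a quadratic polynomial fitted in the vicinity of each peak (five samples of flux density are used for Table 10). The following is a minimal sketch of such a fit, not the code used for the paper: it performs an ordinary unweighted least-squares fit, and the function names `solve3` and `fit_parabola_peak` as well as the sample values are hypothetical.

```python
def solve3(matrix, rhs):
    """Solve a 3x3 linear system by Gaussian elimination with partial pivoting."""
    m = [row[:] + [value] for row, value in zip(matrix, rhs)]
    n = 3
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    sol = [0.0] * n
    for r in range(n - 1, -1, -1):
        sol[r] = (m[r][n] - sum(m[r][c] * sol[c] for c in range(r + 1, n))) / m[r][r]
    return sol


def fit_parabola_peak(wavelengths, fluxes):
    """Least-squares parabola through the samples; returns (peak position, peak flux)."""
    centre = sum(wavelengths) / len(wavelengths)
    u = [w - centre for w in wavelengths]  # centring improves conditioning
    s = [sum(ui ** k for ui in u) for k in range(5)]  # s[k] = sum of u^k
    t = [sum(fi * ui ** k for ui, fi in zip(u, fluxes)) for k in range(3)]
    # Normal equations for flux = a*u^2 + b*u + c.
    a, b, c = solve3([[s[4], s[3], s[2]],
                      [s[3], s[2], s[1]],
                      [s[2], s[1], s[0]]],
                     [t[2], t[1], t[0]])
    if a >= 0.0:
        raise ValueError("fitted parabola has no maximum")
    u_peak = -b / (2.0 * a)
    return centre + u_peak, c - b * b / (4.0 * a)


# Five hypothetical flux samples around a peak located at 121.2 nm.
sample_wavelengths = [120.0, 120.5, 121.0, 121.5, 122.0]
sample_fluxes = [3.0 - 0.5 * (w - 121.2) ** 2 for w in sample_wavelengths]
peak_wavelength, peak_flux = fit_parabola_peak(sample_wavelengths, sample_fluxes)
print(peak_wavelength, peak_flux)
```

Because the vertex of the parabola is generally located between samples, the recovered position is not restricted to the wavelength grid of the composite spectrum.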
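For reference, the relation between the frequency and wavelength continuum slopes quoted in the caption of Table 9 follows from conservation of flux per unit interval. The short derivation below is a standard textbook argument, added here for convenience, with the numerical value taken from the full composite spectrum.

```latex
\begin{align}
F_\nu \,\mathrm{d}\nu &= F_\lambda \,\mathrm{d}\lambda, \qquad \nu = c/\lambda
  \;\Rightarrow\; F_\lambda = F_\nu \,\frac{c}{\lambda^{2}}, \\
F_\nu \propto \nu^{\alpha_\nu} \propto \lambda^{-\alpha_\nu}
  &\;\Rightarrow\; F_\lambda \propto \lambda^{-\alpha_\nu - 2}
  \equiv \lambda^{\alpha_\lambda}, \\
\alpha_\lambda &= -\alpha_\nu - 2, \qquad
\alpha_\nu = -0.464 \;\Rightarrow\; \alpha_\lambda = -1.536 .
\end{align}
```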

How to select purer sub-samples
Table 9. Physical quantities derived from the composite spectra of quasars from the Milliquas catalogue shown in Fig. 33. The reduced chi-square, χ²_ν, is obtained from Eq. C.1. The frequency continuum slope, α_ν, is computed from the wavelength continuum slope, α_λ = −α_ν − 2, where the continuum is modelled through a power law of the form C_λ ∝ λ^(α_λ). In Fig. 33, the continuum is plotted as the lowest line joining the two most widely separated points in the range 121-600 nm without crossing the spectrum in this range. The emission line location, λ_obs, and maximal line flux density, F, are retrieved from the fit of a quadratic polynomial in the vicinity of the laboratory wavelength, λ_lab, after a local continuum has been subtracted that was computed in the same way as for α_λ. The full-width at half maximum (FWHM) is retrieved from these continuum-subtracted emission lines using a linear interpolation of the spectral flux densities. All uncertainties are calculated, to first order, using the formal uncertainties on the composite spectra obtained from Eq. C.4.

The qso_candidates and galaxy_candidates tables collate together results for most extragalactic candidates in Gaia DR3 from a number of modules, as described in Sect. 3.1. Overall these tables aim for high completeness, rather than high purity. We have done this intentionally to allow the user to select their own sub-samples using the quality indicators and class probabilities. Here we describe how to select purer sub-samples from these tables.
For the quasars, EO and CRF3 used input lists of quasars identified by other surveys, so their samples are believed to be quite pure, above 90%. From the EO sample we exclude those with close neighbours (host_galaxy_flag = 6). For DSC, the joint subset has a purity of 62%, increasing to 79% when the Galactic plane (|b| < 11.54 deg) is avoided (Delchambre et al. 2022). The Vari classifier results for AGN, which already exclude the Galactic plane, have been assessed to have a purity of over 90% (Rimoldini et al. 2022). We therefore recommend the query in Table 11 to select a purer sub-sample of quasars. This returns 1 942 825 sources, which is 29% of the original table. Using the same approach as at the end of Sect. 3.1, and assuming a 96% purity for the non-DSC modules, we estimate the overall purity of this sub-sample to be 95%. Of these, 1.7 million have published redshifts from QSOC.
We use similar criteria to define a purer galaxy sub-sample, except that here there is no contribution from CRF3. Again we take all of the sources from EO (provided by an input list), the purer subset of DSC (64% pure; 82% outside the Galactic plane), and all of Vari. This query, in Table 12, returns 2 891 132 sources, 60% of the original table. We estimate the purity of this sub-sample to be 94%. Of these, 1.1 million have published redshifts from UGC.
There are 14 471 sources in common between these two purer sub-samples, and their union contains 4.8 million sources.
Table 10. Quasar emission lines found in the Milliquas-based composite spectrum and covering the redshift range 0.052 ≤ z ≤ 4.35 (Fig. 33). Each emission line is visually inspected before a quadratic polynomial is fit in the vicinity of its apparent peak using five samples of flux density. The maximum of the quadratic curve provides the observed rest-frame wavelength position, λ_obs, of the line and its maximum flux density compared to Lyα, F/F_Lyα (where the number of significant digits is provided in parentheses). Because of the intricacies inherent in the fit of a local continuum to faint, broad, and/or blended emission lines, such a continuum was not subtracted from the flux densities reported here. This explains the differences between this table and the values found in

The sky distributions for these purer sub-samples are shown in Fig. 35 and can be compared with those for the full tables in Fig. 5. We immediately see how the purer sub-samples have lower densities in the Galactic plane. There are still artefacts from the Gaia scanning law, which is an indication of the less than perfect purity. We also still see overdensities at the LMC and SMC. This comes mostly from DSC, because unlike the other modules DSC did not do any sky-position-dependent filtering.
The magnitude distributions of the purer sub-samples are shown in Fig. 36. Compared to the distribution for the full set (dotted lines), we see that the purer sub-sample has excluded the brightest sources (the presence of which appears exaggerated, however, due to the logarithmic number scale). The faintest quasars have also been removed.
Colour-magnitude and colour-colour diagrams of the purer sub-samples are shown in Fig. 37 and can be compared with the same diagrams for the full tables in Fig. 7. This shows that the purer sub-samples have a tighter colour distribution, and remove many of the fainter galaxies. There are other ways to obtain purer sub-samples of the integrated tables. One could, for example, select on a higher probability threshold of the DSC probabilities (the probabilities for all three DSC classifiers are listed in the astrophysical_parameters table). An example is shown in appendix B.3. The variation of purity with threshold is explored in Bailer-Jones (2021). The joint flag used in the purer sub-samples (Tables 11 and 12) corresponds to a threshold of 0.5. One could also use a higher threshold on the vari_best_class_score and vari_agn_membership_score rankings from the Vari-Classification and Vari-AGN modules listed in the qso_candidates table.
In Sect. 5.4, we identified a purer sub-sample of the qso_candidates table via an analysis of astrometric distributions that are consistent with a uniform sky distribution of infinitely distant objects. The sources in this astrometric selection are indicated by the boolean flag astrometric_selection_flag in the qso_candidates table. Although this certainly excludes genuine quasars, it should be a reasonably pure list. Of the 1 897 754 sources in the astrometric sample, 1 801 255 are in common with the purer quasar sub-sample defined in Table 11, and the union of these two sets contains 2 039 324 sources. An equivalent flag is not available for the galaxy_candidates table, for reasons discussed in Sect. 5.4.
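The source counts quoted in this section are related by simple inclusion-exclusion, which the following sketch verifies. All numbers are taken from the text; the variable and function names are our own.

```python
def union_size(n_a, n_b, n_common):
    """Size of the union of two sets from their sizes and the size of their overlap."""
    return n_a + n_b - n_common


# Counts quoted in the text.
QUASAR_PURER = 1_942_825        # purer quasar sub-sample (Table 11 query)
GALAXY_PURER = 2_891_132        # purer galaxy sub-sample (Table 12 query)
COMMON_QSO_GALAXY = 14_471      # sources in both purer sub-samples
ASTROMETRIC = 1_897_754         # sources with astrometric_selection_flag set
COMMON_ASTRO_PURER = 1_801_255  # astrometric sources also in the purer quasar sub-sample
GALAXY_TABLE = 4_842_342        # size of the galaxy_candidates table

purer_union = union_size(QUASAR_PURER, GALAXY_PURER, COMMON_QSO_GALAXY)
astro_union = union_size(ASTROMETRIC, QUASAR_PURER, COMMON_ASTRO_PURER)
galaxy_fraction = GALAXY_PURER / GALAXY_TABLE

print(purer_union)               # the "4.8 million" purer quasars and galaxies
print(astro_union)               # union of astrometric and purer quasar samples
print(f"{galaxy_fraction:.1%}")  # fraction of galaxy_candidates retained
```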

Conclusions
We have described the data products released in Gaia DR3 for 11.3 million candidate quasars and galaxies. This set arises from both a classification using the Gaia data and from an analysis of sources identified by external surveys cross-matched to Gaia. The information on these sources is presented in the qso_candidates and galaxy_candidates integrated tables. Further information, also on additional lower probability candidates, is provided in several other tables (see Sect. 3). Our integrated tables are completeness driven, so many sources in them will not be true extragalactic objects. We therefore also provide a purer sub-sample of 4.8 million quasars and galaxies (see Sect. 8).
We foresee a number of use cases for our results, including: aiding confirmation of candidates found in other surveys; identifying unusual or rare objects; providing targets for spectroscopic follow-up; providing input data for more focused classifications or characterisations. It is our hope and expectation that the community can build on and improve our results also by combining them with data from other surveys.
As with previous data releases, Gaia DR3 is an intermediate data release, this time based on 34 months of mission data. The next data release will be based on 66 months of data, with correspondingly higher S/N and lower systematic errors. We plan to use those data, along with improvements in our algorithms, to update both our classifications and our characterisations of extragalactic objects.
3. Spectra often have correlated noise in their fluxes that should be taken into account.
4. Quasars can have different continuum slopes (or any other background signal) that we may want to subtract in order to produce a pure emission line composite spectrum.
5. Spectra used to build the composite spectrum may have different resolutions, sampling and line spread function, as in BP/RP spectra, that should first be homogenized so as to model the sole signal of interest.
With all of these arguments in mind, we developed a new method for computing a composite BP/RP spectrum based on maximum likelihood estimation through the minimization of Eq. C.1 (in our notation, ‖x‖²_W = ‖Wx‖² = xᵀWᵀWx), where
- N is the number of observations (spectra).
- x_i is the i-th observation vector, here taken as a concatenation of the BP and RP spectral coefficients, which are coefficients associated with linear spline basis functions that represent the BP/RP spectra (Carrasco et al. 2021).
- W_i is the weight matrix associated with x_i, derived from L_i, the Cholesky decomposition of the covariance matrix associated with x_i.
- m is the composite spectrum we are inferring. We additionally infer the scaling factors, s_i, one associated with each observation i.
- P is a matrix composed of a set of basis functions used to model a background signal to be subtracted from the spectra. The linear coefficients associated with P for the i-th observation are given by the column vector f_i. An example use of P would be to model the quasar continua as a low order polynomial (whose coefficients are computed in f_i) and to subtract these continua from the spectra in Eq. C.1. This produces a pure emission line composite spectrum in m. The matrix P could be a set of vectors resulting from a previous minimization of Eq. C.1. This method can be seen as a weighted principal component analysis, in the sense that P_t are the minimal set of t components that minimize χ²_t. Although mentioned here for completeness, we decided not to subtract the quasar continua in the present study, so we set P = 0 and f_i = 0.
- M_i is a transformation matrix, associated with x_i, that projects P and m into the space of x_i. In the present application, the goal of M_i is twofold. First, it isolates the rest-frame wavelength regions from P and m that correspond to the observed wavelength region in x_i (the source redshift must therefore be taken into account). Second, the shifted and resampled spectrum is converted into BP/RP spectral coefficients through the use of the GaiaXPy simulator. See the documentation on simulate_continuous for more information on the calibration procedure.

In the minimization of Eq. C.1, m and a_i = f_i s_i are free to vary. However, for a given value of m, we can differentiate Eq. C.1 with respect to a_i and set the resulting gradient to zero to get Eq. C.2. Substituting this last equation into Eq. C.1 makes it depend only on m, such that any (global) optimization algorithm can be used with a number of unknowns given by the number of fluxes in m. In the present study, we use an expectation-maximization algorithm with momentum in batch mode. The steps of the expectation-maximization algorithm are: (i) compute a_i using Eq. C.2; (ii) fit m to all x_i given the previously computed a_i. Steps (i) and (ii) are then iterated until the reduced chi-square improves by no more than 0.001 for 16 consecutive iterations.
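The alternation between steps (i) and (ii) can be illustrated with a strongly simplified sketch. It is not the released Octave/Matlab implementation and makes several assumptions that should be kept in mind: the objective is taken to be the sum over i of ‖x_i − s_i M_i m‖²_W with P = 0; each M_i is replaced by a plain selection of a contiguous window of rest-frame pixels; the weights are diagonal (inverse variances); and plain alternating least squares with a fixed number of iterations is used instead of the momentum scheme and stopping rule described above. The function name `composite_als` and the toy data are hypothetical.

```python
def composite_als(spectra, n_rest, n_iter=100):
    """Alternating least-squares estimate of a composite spectrum.

    spectra -- list of (offset, fluxes, weights): the observation covers rest-frame
               pixels offset .. offset + len(fluxes) - 1 (stand-in for M_i), and
               weights are inverse variances (diagonal stand-in for W_i^T W_i)
    n_rest  -- number of pixels in the composite spectrum m
    Returns (m, scales) with m normalised to a peak absolute value of 1.
    """
    m = [1.0] * n_rest
    scales = [1.0] * len(spectra)
    for _ in range(n_iter):
        # Step (i): closed-form scale of each observation for the current m.
        for i, (off, x, w) in enumerate(spectra):
            num = sum(wk * xk * m[off + k] for k, (xk, wk) in enumerate(zip(x, w)))
            den = sum(wk * m[off + k] ** 2 for k, wk in enumerate(w))
            scales[i] = num / den
        # Step (ii): closed-form composite for the current scales.
        num_m = [0.0] * n_rest
        den_m = [0.0] * n_rest
        for s, (off, x, w) in zip(scales, spectra):
            for k, (xk, wk) in enumerate(zip(x, w)):
                num_m[off + k] += wk * s * xk
                den_m[off + k] += wk * s * s
        m = [n / d if d > 0.0 else 0.0 for n, d in zip(num_m, den_m)]
        # Remove the scale degeneracy between m and the s_i.
        peak = max(abs(v) for v in m)
        m = [v / peak for v in m]
        scales = [s * peak for s in scales]
    return m, scales


# Toy, noise-free data: three overlapping windows onto a six-pixel composite.
TRUE_M = [0.5, 1.0, 2.0, 4.0, 3.0, 1.5]
TRUE_SCALES = [2.0, 0.5, 3.0]
toy_spectra = [(off, [s * TRUE_M[off + k] for k in range(4)], [1.0] * 4)
               for off, s in zip([0, 1, 2], TRUE_SCALES)]
m_est, s_est = composite_als(toy_spectra, n_rest=6)
print(m_est)
print(s_est)
```

The composite and the scaling factors are only determined up to a common factor, which is why the sketch fixes the peak of m to unity. Overlap between the rest-frame windows of different sources is what ties the individual scales together.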
To first order, the covariance matrix associated with the formal uncertainties of the computed composite spectrum can be approximated through the asymptotic normality property of the maximum likelihood estimator (Eq. C.4). Equation C.4 is only valid for large values of N. How large N should be is problem dependent. We performed simulations using 65 536 noisy realisations of a problem with N = 64 and eight variables in m, where all matrices, except W_i, are uniformly distributed in [−1, 1] and W_i is an orthogonal transformation of matrices whose eigenvalues are uniformly drawn in [0.001, 1]. This led to a maximum absolute error in the median correlation coefficients of 0.007. The Octave/Matlab source code for minimizing Equation C.1 and for computing the approximate covariance matrix from Equation C.4 is available at https://github.com/ldelchambre/gls_mean/.

Fig. 1. Colour-magnitude diagram (top) and colour-colour diagram (bottom) of the DSC training data for the quasars (blue) and galaxies (orange) as well as stars (black). The contours in each panel show the variation in source density on a linear scale. The points are equal-sized random subsets of sources from each class. There is significant overlap, in particular between stars and quasars: in reality the former dominate by a factor of about a thousand, and so overlap much more than is shown here. Plots for each class separately are provided in the online documentation.

Fig. 2. Comparison of the flux collected in the AF and SM windows in the Gaia focal plane for quasars (with and without a detected host galaxy in Gaia) and for galaxies. Objects with an extension detectable by Gaia lie above the turquoise diagonal of quasars with no host galaxy.

Fig. 5. Galactic sky distribution of all the sources in the qso_candidates table (left) and galaxy_candidates table (right). The plot is shown at HEALpixel level 7 (0.210 sq. deg.) in Hammer-Aitoff projection. The colour scale, which is logarithmic, covers the full range for each panel, so is different for each panel.

Fig. 6. G-band magnitude distribution of all objects in the qso_candidates (blue) and galaxy_candidates (orange) tables on a logarithmic scale. The brightest known quasar (3C273, source_id 3700386905605055360) has a G magnitude of 12.8.

Fig. 7. Colour-colour diagram (top) and colour-magnitude diagram (bottom) for all sources in the qso_candidates table (blue) and galaxy_candidates table (orange). The contours show density on a linear scale. The points are a random selection of 10 000 sources for each class.

Fig. 8. Colour-colour diagram for sources in the qso_candidates table, excluding regions around the LMC and SMC. The left column shows sources with classlabel_dsc = quasar (2.77 million sources), the right column shows sources with classlabel_dsc_joint = quasar (the purer subset, 0.52 million sources). These numbers refer to the number of sources plotted, which are those that have all Gaia bands. The upper panel shows the mean DSC-Combmod probability for the quasar class (the field classprob_dsc_combmod_quasar). The lower panel shows the density of sources on a log scale relative to the peak density in that panel (densities 1000 times lower than the peak are not shown).

Fig. 9. As Fig. 8, but for sources in the galaxy_candidates table. There are 3.24 million sources with classlabel_dsc = galaxy and 0.25 million sources with classlabel_dsc_joint = galaxy (in both cases excluding the regions around the LMC and SMC, and requiring all three Gaia bands).
Figure 10 shows the BP/RP spectra for 42 944 quasars with published BP/RP coefficients (field has_xp_continuous = true in gaia_source), classprob_dsc_combmod_quasar > 0.01 and spectroscopically confirmed redshift in the Milliquas 7.2 quasar catalogue of Flesch (2021) (type = Q). A search radius of 1 arcsec was used to match the Gaia sources to their Milliquas counterparts, leading to a redshift coverage of 0.052 ≤ z ≤ 4.358. The cut on the DSC Combmod quasar probability ensures that obvious stellar contaminants contained in our cross-match are discarded. The median magnitude of the sources in Fig. 10 is G = 18.53 mag. Gaia observes much fainter quasars, but the BP/RP spectra of many of these will only be released in Gaia DR4. While we clearly see common quasar emission lines in this averaged plot, they are not necessarily visible in the low signal-to-noise ratio (S/N) spectra of individual faint quasars. Similarly, wiggles that are an artefact of the Hermite spline representation of the BP/RP spectra (De Angeli et al. 2022) tend to lower the contrast of these emission lines compared to the continuum. These wiggles smooth out faint spectral features, and can be confused with emission lines, as both have comparable strength in low S/N spectra. Typically, though, the strongest spectral features (Lyα, C iv, Hβ, and Hα) are retained in G < 20 mag spectra. We also see in Fig. 10 that regions at wavelengths below 430 nm and above 650 nm in BP, and below 630 nm and above 950 nm in RP, contain little flux: spectral features in these regions generally have low S/N, complicating their detection by the DSC and QSOC algorithms.

Figure 11 shows four representative spectra of galaxies as observed by Gaia (top row) and their corresponding SDSS spectra (bottom row). The first SDSS spectrum on the left shows only absorption lines, suggesting an early type galaxy with little or no star formation activity (the few spikes are caused by cosmic rays). These lines are barely detectable, if at all, in the low-resolution BP/RP spectrum. The two middle spectra show strong emission lines characteristic of active star formation. The strongest is the Hα emission with [N ii] lines on either side. This set of three lines is unresolved in the RP spectrum where it merges into a single and wide emission feature. Similarly, in the BP spectrum the Hβ and [O iii] emission lines are merged into another wide peak. The last spectrum on the right is classified as a 'GALAXY AGN' in SDSS. The corresponding BP/RP

Fig. 10. Distribution of the BP flux (left) and RP flux (right) as sampled by SMS-gen (Creevey et al. 2022) of 42 944 quasars published in Gaia DR3 that have spectroscopically confirmed redshifts in the Milliquas 7.2 quasar catalogue of Flesch (2021) (type = Q). Dotted lines show the dominant quasar emission lines. Spectra are individually normalized in order to have a maximum flux of 1.0 and are then averaged in redshift bins of 0.01, with the inverse variance of the sampled fluxes used as the weight during the computation of the mean.

Fig. 11. Galaxy spectra. Top row: Representative mean BP and RP Gaia spectra for four galaxies. Bottom row: The spectra for the same galaxies as observed with the SDSS-BOSS spectrograph (the SDSS class and subclass, if defined, are shown).

Fig. 12. Distribution in Galactic coordinates (Hammer-Aitoff projection) of the quasars processed by the surface brightness profile module. Blue points are quasars with a host galaxy detected (host_galaxy_detected = true) and turquoise points are those without a host galaxy.

Fig. 13. Normalized distribution of the redshifts from Milliquas v7.2aa (Flesch 2021) of quasars that were analysed for surface brightness profiles. Blue shows the 2000 quasars for which a host galaxy was detected by Gaia, and turquoise the remaining 224 000 quasars for which no host galaxy was detected.

Fig. 14. Distribution of the effective radius (de Vaucouleurs profile) of galaxies processed by the surface brightness profile module as a function of the redshifts measured by Gaia (redshift_ugc).

Fig. 15. Distribution in Galactic coordinates (Hammer-Aitoff projection) of the galaxies processed by the surface brightness profile module. The colours show the density on a linear scale.
al. 2022), and finally on the variability probability (to deal with clearly variable objects).

Fig. 18. Distributions (normalized by area) of the field ipd_gof_harmonic_amplitude for sources of various classes in the Gaia Andromeda Photometric Survey. 'Other' includes constants and those variable objects that were not targeted in Gaia DR3.

Fig. 19. Statistics of light curves of objects in the Gaia Andromeda Photometric Survey. Top: Standard deviation versus median G magnitude. Bottom: Normalized distribution of minimum-to-maximum variability range for G band light curves. Both panels are colour-coded as in Fig. 18. The distributions overlap in the upper panel, with galaxies covering AGN, for example.

Fig. 20. Comparison of DSC quasar classification probabilities (transformed to normalized ranks) with scores from the variability analysis. Darker colours depict higher densities, and the white line indicates the median rank. We see a broad agreement between the highest and lowest ranked quasars.
Figure 20 compares this for classprob_dsc_combmod_quasar to vari_best_class_score.The deviation from a perfect correlation reflects the difference in input data types, training sets and class definitions, and classification methods in general.

Fig. 21. Comparison between the UGC and QSOC redshifts. Grey dots correspond to all redshifts in common between the two tables, while black dots are restricted to those with flags_qsoc = 0, which corresponds to a higher reliability subset. The red curve denotes identical predictions in the two modules. Yellow curves highlight mismatches between common quasar or AGN emission lines, as explained in Delchambre et al. (2022), while the blue vertical lines show constant predictions by QSOC.

Fig. 22. Colour-colour diagram for the 1 367 153 galaxies for which redshifts are provided by UGC, colour-coded by redshift. A small number of sources have redshifts extending up to 0.6.

Fig. 23. Distribution in Galactic coordinates (Hammer-Aitoff projection) of the 1 897 754 sources from the astrometric selection (i.e. sources in the qso_candidates table with astrometric_selection_flag set). The plot shows the density of sources per square degree computed from the source counts per pixel at HEALPix level 7 (pixel size 0.21 deg²).
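
The quoted pixel size follows from the HEALPix scheme, in which level k divides the sphere into 12 × 4^k equal-area pixels. The short sketch below reproduces the number; the function names are illustrative and not part of any Gaia software.

```python
import math


def healpix_npix(level):
    """Number of equal-area HEALPix pixels at a given level (nside = 2**level)."""
    nside = 2 ** level
    return 12 * nside ** 2


def healpix_pixel_area_deg2(level):
    """Area of one HEALPix pixel in square degrees."""
    # Full sky: 4*pi steradians converted to square degrees.
    full_sky_deg2 = 4.0 * math.pi * (180.0 / math.pi) ** 2
    return full_sky_deg2 / healpix_npix(level)


area_level7 = healpix_pixel_area_deg2(7)  # approximately 0.21 deg^2
```

Dividing the source count in each pixel by this area gives the density per square degree plotted in the map.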

Fig. 24. Distributions of the normalized parallaxes and proper motion components for the sources in the astrometric selection with 5p (blue) and 6p (green) solutions. The red curves show the corresponding best-fit Gaussian distributions. The global parallax zero point of −0.017 mas of Gaia DR3 is taken into account (Lindegren et al. 2021b,a). The standard deviations of the best-fit Gaussian distributions for the sources with 5p (6p) solutions are 1.048 (1.068), 1.054 (1.092) and 1.063 (1.109) for the parallaxes, and proper motions in right ascension and declination, respectively. As usual in Gaia, the asterisk in α* in the middle panel indicates the implicit factor cos δ, that is μ_α* = μ_α cos δ.

Figure 24 shows the distributions of the normalized parallaxes and proper motions of the astrometric selection. They are close to Gaussian, which suggests a reasonably low level of stellar contamination. The standard deviations of the best-fit Gaussian distributions range from 1.05 to 1.11 and indicate by how much the formal uncertainties of the corresponding astrometric parameters may be underestimated in Gaia DR3.
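
As an illustration of this reasoning, the sketch below normalizes parallaxes by their formal uncertainties after subtracting the zero point, and then measures the width of the resulting distribution. It is a sketch under stated assumptions, not the procedure used for Fig. 24: the data are synthetic (not Gaia measurements), the sample standard deviation stands in for a Gaussian fit, and all names are hypothetical.

```python
import random
import statistics

ZERO_POINT_MAS = -0.017  # global parallax zero point quoted for Gaia DR3


def normalized_parallaxes(parallaxes, errors, zero_point=ZERO_POINT_MAS):
    """Return (parallax - zero_point) / formal uncertainty for each source."""
    return [(p - zero_point) / e for p, e in zip(parallaxes, errors)]


def inflation_factor(normalized):
    """Width of the normalized distribution (stand-in for a Gaussian fit).

    A value of 1 means the formal uncertainties match the scatter;
    a value above 1 suggests they are underestimated by that factor.
    """
    return statistics.pstdev(normalized)


# Synthetic quasars: true parallax is zero, and the real scatter is
# 5% larger than the formal uncertainty (an assumed value).
rng = random.Random(0)
errors = [rng.uniform(0.1, 1.0) for _ in range(20000)]
parallaxes = [ZERO_POINT_MAS + rng.gauss(0.0, 1.05 * e) for e in errors]

norm = normalized_parallaxes(parallaxes, errors)
factor = inflation_factor(norm)  # close to the injected 1.05
```

In this toy case the recovered factor is close to the injected value, which mirrors how the standard deviations of 1.05 to 1.11 are read as the amount by which the formal uncertainties may be underestimated.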

Fig. 26. Distribution of OA neurons labelled as extragalactic for each quality category. The number on each bar gives the number of sources in the qso_candidates or galaxy_candidates tables. No extragalactic objects appear in any of the best (0) or worst (6) quality neurons.

Fig. 27. Gaia-catWISE colour-colour diagrams. Top: All sources in the qso_candidates table. Bottom: All sources in the galaxy_candidates table. In both cases regions around the LMC and SMC have been excluded. The colour scale shows the density of sources on a log scale relative to the peak density (densities 1000 times lower than the peak are not shown).

Fig. 28. Gaia proper motion vs catWISE W1−W2 colour for all sources in the qso_candidates table (top) and the subset with classlabel_dsc_joint = quasar (bottom). Regions around the LMC and SMC are excluded. The colour scale shows the density of sources on a log scale relative to the peak density (densities 1000 times lower than the peak are not shown).

Fig. 29. Completeness of classlabel_dsc = quasar (top) and classlabel_dsc_joint = quasar (bottom) with respect to the SDSS-DR14Q-Gaia DR3 cross-match as a function of G and SDSS redshift. Empty bins (white) have fewer than 25 sources.

Fig. 30. G_BP − G_RP vs redshift for the subsets classlabel_dsc (top) and classlabel_dsc_joint (bottom) of the SDSS-DR14Q-Gaia DR3 cross-match. The orange points are classlabel_dsc = galaxy (top) and classlabel_dsc_joint = galaxy (bottom). The middle panel shows the quasars with classprob_dsc_combmod > 0.5 for any of the three stellar classes used in DSC-Combmod. The red curves show the median G_BP − G_RP and the 16% and 84% quantiles for all sources in the cross-match, regardless of the DSC class (so they are the same in all panels).

Fig. 32. Difference between redshift_ugc and SDSS DR16 redshift, as a function of the latter, for the 248 356 sources in common with SDSS spectral class GALAXY.

Fig. 33. Composite quasar spectra. The thick solid lines show composites made from the 42 944 BP/RP spectra with spectroscopically-confirmed redshifts from the Milliquas 7.2 quasar catalogue (Flesch 2021; i.e. type = Q in Milliquas). The different colours are for different redshift ranges. The thick dotted black line shows the composite made from 111 563 BP/RP spectra with reliable QSOC redshift estimates (identified using the query given in appendix B.2). The diagonal dotted line under each spectrum shows the quasar continuum, as described in Sect. 7 and defined in Table 9. Vertical dotted lines indicate common quasar emission lines. For comparison purposes we also show the median SDSS composite spectrum of Vanden Berk et al. (2001) (orange line). The flux densities are tabulated in Table 8.

Table 8. Composite quasar spectra shown in Figure 33. The columns are the arbitrarily scaled flux densities, F, and associated uncertainties, F_err, computed at rest-frame wavelengths λ. The values below are for the Milliquas composite covering the redshift range 0.052 to 4.358. A full electronic version of this table is available with the online version of this article, along with the quasar composite spectra for the narrower redshift ranges shown in Figure 33 as well as the Gaia-only composite.

Fig. 34. Distribution of the literature redshift from the Milliquas 7.2 catalogue (Flesch 2021) and Gaia G-band magnitudes for the 42 944 sources used in the computation of the composite quasar spectra in Fig. 33.

Fig. 35. Galactic sky distribution of the purer sub-sample of sources in the qso_candidates table (left) and galaxy_candidates table (right). The plot is shown at HEALPix level 7 (0.210 sq. deg.) in Hammer-Aitoff projection. The colour scale, which is logarithmic, covers the full range of each panel, so it differs between panels. Compare to Fig. 5 for the full tables.

Fig. 36. G-band magnitude distribution of the purer sub-sample of objects in the qso_candidates (blue) and galaxy_candidates (orange) tables on a logarithmic scale. The dotted lines show the distributions for the full tables.

Fig. 37. Colour-colour diagram (top) and colour-magnitude diagram (bottom) for the purer sub-sample of sources in the qso_candidates table (blue) and galaxy_candidates table (orange). The contours show density on a linear scale. The points are a random selection of 10 000 sources for each class. Compare to Fig. 7 for the full tables.

Table 2. Source overlaps between the modules contributing to the qso_candidates table. See text for details about the module names.

Table 3. As Table 2 but for modules contributing to the galaxy_candidates table.

Quadruple Venn diagram for contributions to the galaxy_candidates table from DSC, the Surface brightness sample, Vari-Classification, and UGC.

SMC). If we exclude generous regions around the LMC and SMC (defined in appendix B), then the number of sources in the qso_candidates table drops to 3.95 million (59% of the full table) and the number of sources in the galaxy_candidates table drops to 4.67 million (96% of the full table

Carnerero et al. (2022) a million variable AGN candidates in the vari_classifier_result table, which were selected mainly on the basis of their variability properties. For these, the epoch photometry in the G, G_BP, and G_RP bands is published in the light_curve datalink table. A complete description of the selection methods can be found in Rimoldini et al. (2022); the methods are summarized below. More restrictive criteria were applied to achieve the higher purity sample comprising 872 228 candidates in the vari_agn table (Sect. 2.5), the characteristics of which are analysed in Carnerero et al. (2022).

Table 4. Comparison of the classes of sources in the qso_candidates table according to its contributing modules. Each element gives the number of sources with different classifications between any two modules, expressed as a fraction of the number of sources in common between those two modules. Sources labelled unclassified in classlabel_dsc and classlabel_dsc_joint are excluded. The columns list all the modules that provide classifications. The rows list all modules that add sources to the table: the last four of these are not classifiers, but provide sources based on other labels.

Table 5. As Table 4 but for the galaxy_candidates table.

.59 million (95%) in the galaxy_candidates table. Excluding the regions around the LMC and SMC (defined in appendix B) left 2.99 million matches in the qso_candidates table and 4.46 million in the galaxy_candidates table. Figure

Table 6. Contingency table for OA classifications. Each entry gives the percentage of objects classified by DSC (using classlabel_dsc), or processed by QSOC or UGC, that are assigned to OA high-quality neurons (HQN) or low-quality neurons (LQN), for sources in the qso_candidates and galaxy_candidates tables.

If Lyα is covered by the composite spectrum, it is used as a reference flux density, F_ref; otherwise Mg II is used.

Table 9.
Broad feature composed of Fe multiplets.

tering. Such sources can be easily removed by the user, as shown in appendix B.3.

Table 11. ADQL query to select the purer quasar sub-sample.

Table 12. ADQL query to select the purer galaxy sub-sample.