VarIabiLity seLection of AstrophysIcal sources iN PTF (VILLAIN) I. Structure function fits to 71 million objects

Context. Lightcurve variability is well-suited for characterising objects in surveys with high cadence and long baseline. This is especially relevant in view of the large datasets to be produced by the Vera C. Rubin Observatory Legacy Survey of Space and Time (LSST). Aims. We aim to determine variability parameters for objects in the Palomar Transient Factory (PTF) and explore differences between quasars (QSOs), stars and galaxies. We will relate variability and colour information in preparation for future surveys. Methods. We fit joint likelihoods to structure functions (SFs) of 71 million PTF lightcurves with a Markov Chain Monte Carlo method. For each object, we assume a power law SF and extract two parameters: the amplitude on timescales of one year, $A$, and a power law index, $\gamma$. With these parameters and colours in the optical (Pan-STARRS1) and mid infrared (WISE), we identify regions of parameter space dominated by different types of spectroscopically confirmed objects from SDSS. Candidate QSOs, stars and galaxies are selected to show their parameter distributions. Results. QSOs have high amplitude variations in the $R$ band, and the strongest timescale dependence of variability. Galaxies have a broader range of amplitudes and low timescale dependency. With variability and colours, we achieve a photometric selection purity of 99.3 % for QSOs. Even though hard cuts in monochromatic variability alone are not as effective as seven-band magnitude cuts, variability is useful in characterising object sub-classes. Through variability, we also find QSOs that were erroneously classified as stars in the SDSS. We discuss perspectives and computational solutions in view of the upcoming LSST.


Introduction
Large, wide-field surveys allow us to identify rare objects, study parameter distributions of different object classes and select sources for further examination.Spectroscopy can produce high quality classifications, but often only photometry is available.With the prospect of deep, 10-year lightcurves from the Vera C. Rubin Observatory Legacy Survey of Space and Time (LSST; Ivezić et al. 2019), it is important to explore photometric selection based on lightcurve variability as a function of timescale for a large dataset.
This method is particularly suitable for analysing certain types of objects with distinctive variability parameters, e.g. for distinguishing quasars (QSO) from stars (van den Bergh et al. 1973;Schmidt et al. 2010;Ulrich et al. 1997;Myers et al. 2015).Variability can also advance understanding of the physical nature of the objects, as the mechanisms behind variability operate on different timescales (Schmidt et al. 2012) -such as QSO accretion disk instabilities (Rees 1984) or changes in accretion rate (Hopkins et al. 2006).
With identification of distant QSOs, we can study structure formation in the early universe and the cosmological parameters determining its expansion, e.g. through measurements of baryon acoustic oscillations (BAO) in the Lyα forest (Alam et al. 2021;du Mas des Bourboux et al. 2020;Turner 1991;Song et al. 2016;Secrest et al. 2021).Myers et al. (2015) selected a fraction of the SDSS eBOSS QSO targets for BAO based on variability in lightcurves from the Palomar Transient Factory (PTF; Law et al. 2009;Rau et al. 2009).QSOs are also useful for redshift-drift tests of cosmic acceleration (Sandage 1962;Kim et al. 2015;Alves et al. 2019;Loeb 1998).Especially lensed QSOs are interesting because they can be used for time-delay cosmography, which works better with highly variable sources (Refsdal 1964;Treu & Marshall 2016).Lensed QSOs can be discovered as extended objects with high variability (Kochanek et al. 2006).
Faint photometric standards are also of special interest.Large future observatories such as the Vera C. Rubin Observatory, the Extremely Large Telescope (Tamai et al. 2016), the Thirty Meter Telescope (Skidmore et al. 2015) and the Giant Magellan Telescope (Johns et al. 2012) will observe to higher magnitudes than the typical magnitudes of standard stars.These have magnitudes of 11.5 < V < 16 in the Landolt (1992) sample, which has, along with the Stetson database of secondary standards (Stetson 2000;Stetson et al. 2019), since been expanded and curated as described in Pancino et al. (2022).The combined sample mainly includes sources with 13 < V < 21.The distribution on the sky is not uniform, however, and knowing variable from non-variable sources is useful for calibration even at lower magnitudes, such as the 20.6 R-band limit of the PTF.
There are several ways of characterising variability, and they can be even more useful than colours for selection of AGN (De Cicco et al. 2021).Baldassare et al. (2020) demonstrated the potential in PTF variability for selection of AGN in low mass galaxies by fitting 50,000 lightcurves with a damped random walk (DRW) model.Ward et al. (2021) found variable AGN in the Zwicky Transient Facility (ZTF; Bellm et al. 2019) using a combination of variability measures.
In this paper, we apply structure functions (SF) to explore the variability of objects.This is a well-known technique, first applied in astronomy by Simonetti et al. (1985), which is computationally efficient for large sets of data with gaps (Moreno et al. 2019).SF models describe the difference in magnitude ∆m across a time interval ∆t.We specifically consider a power law model, as done by e.g.Hook et al. (1994), which is often applied for AGN and QSOs.Both SF and DRW models pose challenges -even 20 year baselines cannot constrain DRW models completely (Suberlak et al. 2021;Stone et al. 2022).Sánchez et al. (2017) found the Bayesian SF defined by Schmidt et al. (2010) to be the best and most stable one for noisy lightcurves with irregular sampling.Schmidt et al. (2010) applied power-law SFs to lightcurves from The Sloan Digital Sky Survey (SDSS) and defined a variability parameter region for selection of QSO candidates with criteria chosen by eye.This gave relatively complete and pure candidate sets for SDSS S82 lightcurves, but with worse performance for lightcurves from Pan-STARRS1 (PS1; Chambers et al. 2016).The performance was achieved on sets of known QSOs, F/G stars and RR Lyrae, with stars outnumbering QSOs by a factor of 30.
In this paper, we aim to analyse the results and performance of a similar method but applied to the entire PTF survey.We will show where variability is most effective in breaking degeneracies and which forms of variability are common among different types of objects.Specifically, matching with spectroscopic SDSS classifications of QSOs, stars and galaxies, which we assume as ground truth, we can evaluate new variability selection criteria and compare with those of Schmidt et al. (2010) 1 .By including as many objects from the PTF survey as possible, the properties of the total lightcurve sample are more representative of PTF sources than if we only included e.g.specific stellar subtypes.
Another approach to photometric selections is based on colours.We will examine colour-colour and colour-magnitude diagrams of a spectroscopically confirmed sample by matching magnitudes in mid-IR from the Wide-field Infrared Survey Explorer (WISE; Wright et al. 2010) and in the optical from PS1.This will allow us to examine simple selection criteria based on these.
We describe the datasets (and the data cleaning process) used in this work in Sect. 2. Sect. 3 introduces the methodology.In Sect.3.1, we define models of the PTF lightcurve SFs, and we fit them in Sect.3.2.The metrics used for evaluation of selection of objects are defined in in Sect.3.3.The results of the SF fitting are plotted and described in Sect. 4. Based on the variability parameters and matched colour information, we choose photometric selection criteria and examine the properties of selected objects in Sect. 5. We discuss them in Sect.6 and summarise the findings in Sect.7.

PTF
The Palomar Transient Factory was a project with the Palomar 48 Schmidt telescope at Palomar Observatory in California.This includes imaging in DR1, DR2 and DR32 from 1 March 2009 to 28 January 2015 covering 20 000 deg 2 (Law et al. 2009;Rau et al. 2009) 3 .The PTF lightcurve database is based on a subset of these images with data from 598 975 024 objects in R and g.
Since most data points are in R, these are chosen for the analysis of this paper (as mag_autocorr and magerr_auto).The R band is adapted from the 658 nm Mould-R filter described in Cuillandre et al. (2000), has a limiting magnitude of 20.6 and photometry is in the AB magnitude system (Law et al. 2009).From this database, we also consider the timestamps, PTF object ID, RA, Dec and median R -queried in 648 patches of 10 • ×10 • via the IRSA Catalog Search Tool 4 .310 large patch files are split into 6340 files, keeping data points with the same object ID together.File by file, we group data points by ID to assemble lightcurves.These are cleaned and fitted in Sects.2.5 and 3.1 in parallel using 14 computing nodes for six months.

SDSS
The Sloan Digital Sky Survey DR17 (Blanton et al. 2017;Abdurro'uf et al. 2022) includes 5.789 million spectra (Smee et al. 2013;Wilson et al. 2019;Drory et al. 2015) from the Apache Point Observatory in New Mexico, USA (Gunn et al. 2006).866 338 QSOs, 962 162 stars and 2 790 253 galaxies are spectroscopically confirmed based on DR17 from February 2021 (with data both in PhotoObj and SpecObj).Their coordinates, redshifts and spectroscopic classes are queried via CasJobs on SciServer (Taghizadeh-Popp et al. 2020).We cross-match by creating a k-d tree (Maneewongvatana & Mount 1999) of the PTF coordinates.Then we query the tree to find the nearest SDSS object within two arcseconds, if it exists.By inspecting SDSS images 5 , we find that this radius best avoids spurious matches.

WISE
The Wide-field Infrared Survey Explorer is a space telescope observing in the mid infrared (Wright et al. 2010).It had a cryogenic phase from Dec. 14, 2009 to Feb. 2011 and a postcryogenic one since 2013 (NEOWISE).With the latest update from February 2021, the combined AllWISE program (Cutri et al. 2021) data release II/328 contains 748 million objects.
We query WISE objects with the CDS X-Match Service (Boch et al. 2012) within five arcseconds of the mean PTF coordinates to each object.From WISE ('vizier:II/328/allwise'; Cutri et al. 2021;Ochsenbein et al. 2000), we get W1 (3.4 µm), W2 (4.6 µm), their errors and coordinates.W3 and W4 are not included, as they are not deep enough to include most of the PTF sources.WISE magnitudes are in the Vega magnitude system (Wright et al. 2010).For each PTF object, only the closest WISE source is selected via a k-d tree like in Sect.2.2.If this source contains multiple data points, we choose the lowest magnitude, as that is the brightest detection, and save the mean RA and Dec.

PS1
The Panoramic Survey Telescope and Rapid Response System includes two telescopes in Hawaii; Pan-STARRS1 and Pan-STARRS2.We use the g (481 nm), r (617 nm) and z (866 nm) bands (Tonry et al. 2012) from the Pan-STARRS1 survey DR1 with 1.92 billion objects ('vizier:II/349/ps1';Chambers et al. 2016;Ochsenbein et al. 2000).This covers the sky at J2000.0 declination > −30 • , including the PTF footprint.The PS1 magnitudes are in the AB system.Like for WISE, in PS1 we query within five arcseconds of PTF objects and select one set of magnitudes, magnitude errors and coordinates per source.

Data cleaning
Data cleaning is needed for reliable fitting results.For PTF, extreme magnitudes outside the range of 12 < R < 22 are removed.If a lightcurve contains too few data points, fit parameters are difficult to constrain, and if the time span is too short, we cannot detect variability over long timescales.Therefore, we require that each lightcurve must contain at least 20 data points and span at least one year.At 80 • < RA < 90 • and 0 • < Dec < +10 • , lightcurves are so well sampled that to speed up the analysis in Sect.3.2, we select a maximum of 1500 random data points for some sources.
Magnitude outliers affect variability fits, so we want to remove them, while preserving the real variability.A moving weighted median (MWM) is computed for each data point i in each lightcurve.This is based on the magnitudes R close,i of the seven closest neighbouring data points in time within a window of five days (2.5 days to each side).Data points more than three standard deviations away from the MWM are removed, taking into account both the error of the data point and of the weighted median.A data point with magnitude R and error σ R is removed if where MAD is the Median Absolute Deviation.We compute the MAD of every part of the MWM as the weighted median of the absolute distances of the data points used in the computation of the MWM.That is, the MAD of a specific value of MWM is based on the up to seven data points within the time window of data point i: If fewer than seven data points are close enough to i in time, they are still used, and the data point i itself has a relatively higher weight compared to those few data points, leading to a higher probability of acceptance.This is desirable, as we have less information to base rejection on.After data cleaning, the remaining lightcurves are distributed on the sky according to

Methodology
We describe the variability of each object by defining power law models of their structure functions.The models are fit to structure functions of individual objects to extract variability descriptors as fit parameters.These will be used for exploring relations in the data and photometric selection of object classes in Sects.4-5; but first, we define metrics for evaluation of selection quality in Sect.3.3.Notes.Sample sizes before and after selecting and cleaning PTF lightcurves and matching with sources in SDSS, WISE and PS1.

Structure functions
For each object, we analyse variability by comparing every pair of data points in its lightcurve.For each pair of data points i and j, we have a difference in magnitude ∆m i j and in time ∆t i j .By comparing all pairs, we find the timescale dependence of differences in magnitude with a SF (Simonetti et al. 1985), and model this variability.The total effective variability V eff of each object consists of intrinsic variability, which we describe with a model V mod , and noise, σ m .Assuming the model describes the data well, V eff,i j is close to the observed ∆m i j .For V mod , we choose a power law, so Eq. 4 has two free parameters -A and γ.A quantifies the amplitude, in units of magnitudes, of the variations on the timescale t 0 , and the power law index γ describes how the amplitudes depend on timescale.We choose t 0 = 1 year (observed frame).One could correct for the factor of 1 + z to find rest frame ∆t i j , but redshift information is limited, and SDSS redshifts are not independent of the spectroscopic labels we use for comparison.
The SF is simply the square of the model variability, following the notation of Schmidt et al. (2010), We expect most objects to have 0 < γ < 1. γ > 0 shows that variability increases with timescale, and for γ > 1 this is accelerating, which we would mostly expect to see for short lightcurves with highly uncertain γ.This could be objects with an overall positive or negative trend.γ < 0 means most variability is found at short timescales, e.g. in transient tails, but we do not expect to include these objects due to the minimum observed time span of one year.Only A consistently shows correlation with physical black hole parameters in the literature, namely an anti-correlation with the rest-frame emission wavelength, λ rest , and the Eddington ratio, L/L Edd (De Cicco et al. 2022).

Fitting
SF models are fitted to data using the emcee python package for each pair of data points, assuming a gaussian distribution of ∆m i j .For the total log posterior, LP, of the lightcurve, we use the same prior ln(p) as in Schmidt et al. (2010): The MCMC runs with eight walkers for 500 steps where the first 200 are discarded as burn-in.We find these values to balance accuracy and speed.From each fit, we retain the median values of A and γ and their 16th and 84th percentiles as 1σ uncertainties.

Evaluation metrics
To study parameter distributions of different object types, we photometrically select QSOs, stars and galaxies.The quality of selection criteria is evaluated on purity and completeness, estimated using the spectroscopically confirmed subset.We compute these with the number of true positives, TP, false negatives, FN, and false positives, FP:

Variability and colour distributions
We present plots and statistics for two datasets: all fitted objects and those with spectroscopic classifications.In Sect.5, we also assemble photometric selections.The datasets allow us to compare distributions of three classes: QSOs, stars and galaxies.The plots include variability parameters (A and γ) and colour diagrams of W2 vs. W1 − W2 and g − r vs. z − W1.We know stars, galaxies and QSOs to have good separation in optical vs. infrared colours (Maddox & Hewett 2006;Lang et al. 2016).W1 − W2 was also used by Wright et al. (2010), and Assef et al. (2013) combined W2 and W1 − W2 for selecting AGN.
We query and fit the data in batches as described in Sect.2.1.For 6340 files with approximately a million lightcurve data points per file, the batch processing time is about an hour on a computing node with 32 cores.The maximum time is 11 days for one file.

Full PTF sample
In Fig. 1, we show parameter distributions for the full PTF sample.The top panel is a 2D-histogram of A and γ values, illustrating the variability of the 70 920 904 fitted lightcurves.We see two large clusters, at log A ∼ −0.8 and log A ∼ −5 (base 10).Variations of just 10 −5 magnitudes over timescales of one year indicate that the values of the latter cluster are spurious.
The middle and bottom panels of Fig. 1 display a colourcolour and a colour-magnitude diagram.As these are based on parameters in WISE and PS1, they show distributions for the 80 % (56 991 591) of the PTF objects with matches in both of these surveys (see Table 1).93 % of QSOs, 75 % of stars and 98 % of galaxies have these matches.

Spectroscopically confirmed sample
Fig. 2 contains similar plots to Fig. 1, but only for data with spectroscopic SDSS classes.In the variability parameter plot, this is 2.5 % of the fitted sources (1 748 047), and the colour diagrams show the 2.276 % (1 613 916) with matches in SDSS, WISE and PS1.The colours denote different classes: red for QSOs, green for stars and blue for galaxies.These are scaled based on relative density, with full saturation for areas of parameter space with one class and maximum relative density, white for areas without data and black or grey for areas with high relative densities of multiple classes.Note that this means a space that is more green than blue can still have more galaxies than stars.The full dataset includes more galaxies than stars and QSOs combined (see Sect.

2.2).
In the top panel of A-γ space, we notice the two clusters of Fig. 1.The one at log(A) ∼ −0.8 has high relative densities of all classes, but they are spread out along different axes in the A-γ plane -more galaxies have log(A) < −1 and more QSOs have γ > 0.1.Stars are mostly in the same areas as galaxies, but with more spread in γ and higher relative density at log(A) < −3.In the middle panel, the stellar locus is immediately recognisable, as well as the main distributions of QSOs and galaxies (Ansari et al. 2021;Lang et al. 2016).The latter two include an overlap most apparent at g − r ∼ 0.4 and z − W1 > 0.3.The bottom panel shows that the three classes do have different distributions in W1 − W2 vs. W2 too, but with a large overlap between stars and galaxies at W2 > 15.

Photometric selection
We next explore how variability relates to broad band colours through photometric selections using colours, variability or both.Simple photometric selections of QSOs, stars and galaxies allow us to inspect differences between parameter distributions of photometrically and spectroscopically selected objects.Based on the distributions of Fig. 2, we define regions of parameter space dominated by QSOs, stars or galaxies.Schmidt et al. (2010) used a similar approach in A and γ.For reliable selections, we focus on purity and accept a low completeness.We avoid areas of high degeneracy between classes and achieve separate selection criteria for variability parameters and for colours.All criteria are listed in Table 2.While these are simple, linear criteria in parameter space to study relations between colour and variability, more advanced techniques exist and have been deployed e.g. by Ansari et al. (2021) in colour space.Such a study is presented in a companion paper (Bruun et al. 2023).
We also examine properties of their QSO selection criteria, which were originally chosen by eye, for sources with large fit parameters compared to their uncertainties:

Selection properties
The selection criteria of Table 2 are based on purity and completeness of the labeled objects in Fig. 2 and then applied to the larger, unlabeled datasets of Notes.Criteria for selection of candidate QSOs, stars and galaxies.These are based on the SDSS distributions of Fig. 2  Assuming a maximum of one Gaia match per source, only 3.7% of SDSS galaxies have a match in Gaia with a measured proper motion, and the fraction decreases to 0.5% for galaxy candidates selected by colour and variability.Further details are given in Appendix E.
If instead we use the Schmidt criteria and require the sources to have A and γ significantly different from zero (Eq.10), we find the most common SDSS class to be QSOs.However, these criteria lead to more contamination from stars and galaxies than the QSO criteria of Table 2.

Colour selection
Fig. 3 shows distributions of colour-based selections from the large dataset of fitted sources in Fig. 1.The overall pattern is the same as in Fig. 2, but the stars are more spread out.More of them have very low A or high γ, and relatively fewer are found at log(A) ∼ −0.8 and γ ∼ 0. According to Table 3, the colour selected candidates have similar, high purities of 98.6 − 99.7 % for all classes, but with varying completeness -6.82 % for galaxies, 7.92 % for stars and 38.3 % for QSOs.

Variability selection
We plot the variability-selected sources in Fig. 4. Relying solely on A and γ is challenging according to the SDSS labels.For example, while the variability criteria for stars catch 53.4 % of stars and 19.8 % of galaxies, they select more galaxies due to different population sizes.Naturally, criteria based on overlapping classes in A-γ-space lead to overlapping selections in colour space.

Ambiguous sources
Given the purity of the photometric selections in Table 3, the contaminating sources are expected to have a high rate of misclassifications by SDSS.Checking the most confident photometric predictions with differing spectroscopic classifications, we find examples of spectra appearing more typical of the photometric candidate class.We inspect the spectra of the 32 photometric QSO candidates with stellar spectroscopic classifications, and judge ∼ 20 % to be QSOs and ∼ 50 % as possible QSOs.16 of the objects are in SIMBAD, registered as eight QSOs, four BL Lacertae objects, one blazar and just three stars.For example, SDSS J120429.34+495814.4,with the spectrum of Fig. B.3, has the characteristic broad lines of a QSO.This object is part of the Data Release 12 Quasar catalog from SDSS (Paris et al. 2017).However, it is spectroscopically confirmed as a star in SDSS DR17.This misclassification analysis shows the power of the photometric selection criteria of this work, although we do not expect high rates of misclassifications in the full SDSS spectroscopically confirmed sample.
Variable galaxies in the Table 2 QSO region of A and γ have higher redshifts, W2 and z − W1 than other galaxies (typically redshift 0.5 vs. 0.1, see Figs.C.1 and C.2) .In SIMBAD, variable galaxies are more often registered as the brightest galaxy in a cluster than other galaxies are.The fraction increases from 9 to 15 %.There is also a low, but relatively much higher, fraction of radio sources which goes from 1.6 to 3.7 %.SDSS QSOs have similar redshifts independently of variability, except more non-QSO-like variability at z < 0.3 (see Fig. C.1).
Spectroscopic QSOs variability-selected as stars or galaxies are more often found at W1 − W2 < 0.75 and W2 < 14 than other QSOs.This difference in WISE colours is smaller than for spectroscopic galaxies and stars in different photometric classes, though, as illustrated in Figs.C.2 -C.4.

Discussion
QSOs, stars and galaxies are distributed differently in A and γ and in colours, but with overlaps limiting the quality of simple selections.We see this in differences between the SDSS class distributions of Fig. 2 and the corresponding candidate class plots in Fig. 3 and 4 based on colour and variability, respectively.The overall patterns can still be recognised.Class information in the SF variability parameters is especially interesting for objects without spectroscopy and limited colour measurements.Variability alone does not give as reliable candidates as colour selection, but colour and variability combined can break degeneracies and select classes better than separately.Notes.Sample statistics of selections in variability and colour, including counts, completeness and purity (see Eq. 9).In row 1 (All), we compare spectroscopically classified objects to the full fitted source count and to the full set with spectroscopic classes to get relative frequencies in each set.In row 5-7 and 11, for computing purity and completeness, selection counts are compared to the total spectroscopic counts.In row 2-4 and 8-10, comparisons also require matches in WISE and PS1, since colours are used.

A-γ clusters
In A-γ space, we see two large clusters.One of them is likely an artefact from objects with undetectable variability, leading the MCMC to suggest arbitrarily small A and a large range of γ values.There is only one clear cluster at higher A (log(A) ∼ −0.8), but the QSOs and galaxies are spread along different axes.This is most apparent in Fig. 2. The galaxies cover a broad range of A, but their variability is rarely timescale dependent, at least not on most relevant scales.This is shown by the low γ values.In contrast, QSOs are even more variable (high A) with a clearer timescale dependence.Stars are spread out more evenly in A and γ, limiting selection purity and reflecting the diverse nature of stellar variability.

Comparison with SDSS lightcurve analysis
The selection criteria of Table 2 are simple and focused on purity.For QSOs, they differ from those by Schmidt et al. (2010).
Table 3 shows a purer QSO set with slightly lower completeness, than if we apply the Schmidt criteria.This may be due to slightly different fitting and outlier removal or differences in noise and measurements between PTF and SDSS.The dataset presented in this paper is more representative of all observed object types, as it includes all sources from the PTF survey that pass through the data cleaning of Sect.2.5, whereas Schmidt et al. (2010) introduced specific types of contaminants and in specific ratios.

Spectroscopic selection bias
When only 2.5 % of sources have spectroscopic classifications, it shows a need for other methods for identification.It also allows for significant selection bias in sources with SDSS spectra compared to sources with long PTF lightcurves.This affects purity estimates, and the choice and evaluation of selection regions.
There are differences in the distributions of sources with and without spectroscopic classes.This is especially clear if we compare Fig. 1 to Fig. B.2.For example, SDSS is missing spectra for objects at low W2, including a band of sources at low W1 − W2.At 10 < W2 < 12 and −0.5 < W2 − W1 < −0.25 we have a very mixed group, with sources typically marked as stars (21 %) or binary star systems in SIMBAD.SDSS also has a bias in favour of sources with z − W1 > 3.This is part of the reason why in Fig. 4, QSOs and galaxies are relatively infrequent at those values compared to Fig. 2. If more objects are included in star-dominated areas, they also include a larger fraction of the galaxies and QSOs.Many of these are actually stars, though, but wrongly selected by variability.Especially variable stars late in the main sequence can be difficult to distinguish from QSOs and galaxies; the parameter region at 2.1 < z − W1 < 2.6 and 1.1 < g − r < 1.4 is dominated by stars for all three variability selections.Variability-selected star candidates in the area are at least 86 % stars judging by the most common stellar classifications registered in SIMBAD ("Star", "lowmass*" and "PM*").Galaxy and QSO candidates are at least 79 % and 72 % stars, respectively, showing a small difference in the nature of these objects.For spectroscopic stars, those with QSO-or galaxy-like variability are more spread out in z − W1 and found at higher values of W1 − W2.This is illustrated in In W2 vs. W1 − W2, the spectroscopic classes overlap at W2 > 15, and the variability selections overlap even more.Objects at −0.25 < W2 − W1 < 0 and 10 < W2 < 12 are mostly registered as stars in SIMBAD, and SDSS does not classify any of them as QSOs, although many have QSO-like variability.
SDSS matched sources have relatively longer time spans and more epochs per lightcurve, as shown in Fig. B.1.Hence, variability estimation of the full sample might be less accuratespreading out sources in A-γ space.Therefore, removing the most sparsely sampled sources is important.To both include large data sets and be confident in the results, the balance of data cleaning will also be important in future surveys such as the LSST.Even for sources with >100 epochs over >5 years , the spectroscopic classes still have an overlap at log A ∼ −0.8 and γ ∼ 0, but it is about 50 % smaller in both log A and γ.Longer timescales may change the selected populations, for example by increasing the fraction of type II to type I AGN (De Cicco et al. 2015, 2019).

Photometric selection bias
We select pure sets of each class, but with low completeness and a bias for sources with specific parameters.With variability, we only select stars with low A -but we know variable stars exist.They are just difficult to isolate, and so reduce completeness for stars and purity for galaxies and QSOs.Type I and Type II AGN differ in colours and SFs, and so are not best selected by one simple set of criteria either (De Cicco et al. 2022).The most densely populated variability region, at log A ∼ −0.8 and γ ∼ 0, is not covered by the criteria at all.The same goes for the dense colour diagram area at high W2.In Fig. 2, galaxies dominate a triangular area of g − r vs. z − W1 with two large clusters.The criteria only cover one of them.In SIMBAD, the cluster at high g − r has more galaxies labeled as being part of a cluster, and especially as the brightest galaxy in a cluster.A more advanced selection method could solve these issues (Bruun et al. 2023).
The surveys are not completely representative of the sky, which is not observed uniformly (see Fig. as stars.E.g. including more variable stars could lower the QSO selection purity -both due to the overlap with QSO variability and the increased prior probability of a classified object being a star.However, including more data would also mean there is more data to learn from.The balance of object type frequencies is relevant if we apply Table 2 criteria to other datasets, and in evaluation of criteria of Schmidt et al. (2010) on PTF data.Completeness is computed independently for each class, but could also be affected by a change in the fraction of subtypes.

Perspectives
If we were to base selection criteria on distributions from SDSS confirmed objects and evaluate on the same set, we would be overfitting.However, here the goal is only to estimate class distributions and relations between variability and colour.Machine learning could automatically classify all objects based on the SDSS labeled subset.This would allow us to select more sources and examine probabilities of belonging to each class.It would quantify how variability breaks degeneracies and improves selections by colours and magnitudes.This will be performed in paper two of VILLAIN (Bruun et al. 2023), including a table of all variability parameters and classifications.Accurate photometric selections can identify new QSO candidates and prepare us for analysis of large surveys like the LSST.Optical variability of galaxies, including in the PTF R-band, is also known be useful for identifying AGN missed by other techniques (Baldassare et al. 2020).It would be interesting to study subtypes, such as by variability and redshift differences of type I and II AGN, like De Cicco et al. (2022).Intermediate-redshift QSOs can have colours comparable to those of stars, so variability could be more valuable for those (Yang et al. 2017).Another prospect is selecting non-variable stars for photometric calibration or for a homogeneous study of variability across stellar subtypes.
We have assumed that the objects have simple power law variability, but this is not necessarily a good model for all sources or on all timescales.One could fit e.g.exponential or DRW models instead and analyse the differences, though we expect the overall selections and challenges to be similar.Advanced models can capture more variability information but require more resources (Moreno et al. 2019).More parameters could be used for selection, such as proper motions or photometric redshifts.

Conclusion
We have devised a procedure for the homogeneous analysis of 71 million PTF lightcurves.We have fitted them with jointlikelihood SF models and studied regions in both variability and colour space.SF power law variability is most useful outside the log(A) ∼ −0.8 and γ ∼ 0 region.We select photometric sets of 99.3 % spectroscopic purity for QSOs, 99 % for stars and 99.8 % for galaxies.The spectroscopic classifications are, however, wrong for 20 -50 % of objects photometrically identified as QSOs but spectroscopically as stars.The large PTF sample allows us to discover these rare cases and assemble a set of 31 404 QSO candidates by colour and variability.With only variability, spectroscopic purity drops to 59.3 % and with only colour, it is 98.7 %.
Using SF joint likelihoods on the entire PTF survey, we have shown how variability might be used on future large datasets including the LSST.When new measurements are added to a lightcurve, complete reprocessing can be avoided, as likelihood information on the previous segment of the lightcurve is already stored in A and γ.In a survey with the LSST foreseen depth, a colour-plus-variability method can provide a large sample of faint astrometric standards for the internal calibration of extremely large telescopes, which require objects beyond the depth of Gaia.
In each survey, and depending on computational resources, one must balance sample size and fitting accuracy via data cleaning and selection methods.The value of variability and colours depend on the survey and sources, but in general for PTF, crossmatching colours should be prioritised.
Considering simple cuts in both variability and colour, the completeness is at 12.5 %, so a machine learning method that balances purity and completeness has potential for creating larger QSO samples for studying e.g.cosmology.This is examined in the companion VILLAIN paper (Bruun et al. 2023), where we also release a table of classifications and parameters for all fitted PTF sources.Such a large data-set would allow e.g. a complementary selection of rare objects, such as lensed QSOs.
tropy, 6 a community-developed core Python package for Astronomy (Astropy Collaboration et al., 2013;Astropy Collaboration et al., 2018).This research has made use of the NASA/IPAC Infrared Science Archive, which is funded by the National Aeronautics and Space Administration and operated by the California Institute of Technology.This research makes use of the SciServer science platform (www.sciserver.org).SciServer is a collaborative research environment for large-scale data-driven science.It is being developed at, and administered by, the Institute for Data Intensive Engineering and Science at Johns Hopkins University.SciServer is funded by the National Science Foundation through the Data Infrastructure Building Blocks (DIBBs) program and others, as well as by the Alfred P. Sloan Foundation and the Gordon and Betty Moore Foundation.This research has made use of the VizieR catalogue access tool, CDS, Strasbourg, France (DOI : 10.26093/cds/vizier).The original description of the VizieR service was published in 2000, A&AS 143, 23.This work has made use of data from the European Space Agency (ESA) mission Gaia (https://www.cosmos.esa.int/gaia),processed by the Gaia Data Processing and Analysis Consortium (DPAC, https://www.cosmos.esa.int/web/gaia/dpac/consortium).Funding for the DPAC has been provided by national institutions, in particular the institutions participating in the Gaia Multilateral Agreement.Gaia proper motions for sources selected by variability (top), colour and variability (middle) and spectroscopy (bottom).The histograms are normalised individually.All variability-selected sources approximately follow the distribution of SDSS stars, but when we also select by colour, the results are generally close to those of the spectroscopically confirmed objects.A high-purity and high-completeness cut at 2 mas/y, justified by the bottom panel, would yield ≈ 50% purity in a sample of quasar candidates selected only through variability.
Fig. A.1, mostly covering the northern hemisphere and avoiding low Galactic latitudes.Table 1 contains sample statistics including objects matched in SDSS, WISE and PS1.70 920 904 objects are analysed, spanning 365 − 2147 days and containing 20 − 5579 data points.

Fig. 1 .
The criteria are illustrated in Fig. B.2 with those of Schmidt et al. (2010) for comparison.

Fig. 1 .
Fig. 1.Heatmaps of variability parameters (top), g−r vs. z−W1 (middle) and W2 vs. W1 − W2 (bottom).These are the full parameter distributions of objects with lightcurves in PTF after cleaning and matching with other surveys as necessary.Two large clusters are observed in A-γspace.Selection criteria are applied to this data for analysis of candidate QSOs, stars and galaxies.log(A) is in base 10 and based on A units of in magnitudes.

Fig. 2 .
Fig. 2. All spectroscopically confirmed stars (green), QSOs (red) and galaxies (blue) from SDSS plotted in variability fit parameters (top) and colours: g − r vs. z − W1 (middle) and W2 vs. W1 − W2 (bottom).Marker colours show the object class and blend to grey or black when multiple classes occupy the same parameter region.A heatmap of all spectroscopically confirmed objects is found in Fig. B.2 for comparison with Fig. 1.

Fig. 3 .
Fig. 3. Variability of objects of Fig. 1 selected by the colour criteria from Table 2.These are based on the colour distributions of Fig. 2.

Fig. 4 .
Fig. 4. Colour distributions of variability-selected objects from Fig. 1.These fulfil the variability criteria of Table 2 based on the upper diagram of in Fig. 2.
Fig. C.4.The variability does indicate a physical difference in these cases.
A.1).We include fewer stars at low Galactic latitudes, where Galactic extinction affects colours more and there is a higher risk of mismatches with nearby sources.Stars have the fewest colour matches, indicating an under-representation in WISE or PS1.A change in ratios of object types and stellar subtypes affects purity.The reason is that the number of true or false positives depends on the selection of objects that are not equally difficult to distinguish from or Article number, page 8 of 16 S. H. Bruun et al.: Variability selection in PTF: I. Processing of lightcurves Fig. E.1.Gaia proper motions for sources selected by variability (top), colour and variability (middle) and spectroscopy (bottom).The histograms are normalised individually.All variability-selected sources approximately follow the distribution of SDSS stars, but when we also select by colour, the results are generally close to those of the spectroscopically confirmed objects.A high-purity and high-completeness cut at 2 mas/y, justified by the bottom panel, would yield ≈ 50% purity in a sample of quasar candidates selected only through variability.
Table 3 lists sample statistics of candidates selected within these parameter space regions of variability, colour or both.The Schmidt criteria are included for comparison.To estimate completeness, colour-based selections are given as percentages of
Gaia DR3(Gaia Collaboration et al. 2016, 2022)sources within one arcsecond of PTF sources have lower proper motions for QSO candidates.This is especially clear when colour information is included in QSO selection, as shown in Fig.E.1.Objects with stellar colours and variability have the highest proper motions, as expected.