TESTING TESTS ON ACTIVE GALACTIC NUCLEI MICROVARIABILITY

Published 2010 February 11. © 2010 The American Astronomical Society. All rights reserved.
José A. de Diego 2010 AJ 139 1269. DOI 10.1088/0004-6256/139/3/1269


ABSTRACT

Literature on optical and infrared microvariability in active galactic nuclei (AGNs) reflects a diversity of statistical tests and strategies to detect tiny variations in the light curves of these sources. Comparison between results obtained using different methodologies is difficult, and the pros and cons of each statistical method are often badly understood or even ignored. Even worse, improperly tested methodologies are becoming more and more common, and biased results may be misleading with regard to the origin of AGN microvariability. This paper intends to point future research on AGN microvariability toward the use of powerful and well-tested statistical methodologies, providing a reference for choosing the best strategy to obtain unbiased results. Light curve monitoring has been simulated for quasars and for reference and comparison stars. Changes in the quasar light curves include both Gaussian fluctuations and linear variations. Simulated light curves have been analyzed using χ2 tests, F tests for variances, one-way analyses of variance, and C-statistics. Statistical Type I and Type II errors, which indicate the robustness and the power of the tests, have been obtained in each case. One-way analyses of variance and χ2 prove to be powerful and robust estimators of microvariations, while the C-statistic is not a reliable methodology and its use should be avoided.


1. INTRODUCTION

Observational techniques for optical monitoring of active galactic nuclei (AGNs) and photometric studies of variable stars have many similarities, but they differ in the sense that variable stars often show periodic light curve fluctuations, while AGNs do not. Although most variable AGNs (blazars) show microvariability or transient large amplitude variations in a few hours, the unpredictability of these changes in brightness and the difficulty of obtaining confirmation by other observers have been a perennial cause of incredulity and skepticism since the first report by Matthews & Sandage (1963) of a 15 minute microvariability event of amplitude ΔV = 0.044 in 3C 48. In less variable objects, such as quasars, the amplitude of the microvariations is usually lower and close to the limit of detection.

In order to increase confidence in the validity of variability reports, a number of statistical tests have been proposed to prove the reliability of the measurements. A widely used methodology is the χ2 test for variances, which compares the sample variance obtained for a suspected variable target with a theoretically calculated variance for a non-variable object, taking into account all the possible sources of error. For example, Pica & Smith (1983) used a χ2 test and the so-called Q-statistic, based on the difference between the brightest and the dimmest observations, to investigate long-term variability in approximately 6000 photographic observations of 130 AGNs of different types monitored over 13 yr. After the introduction of CCDs in astronomical observations at the end of the 1970s, differential photometry became a very reliable technique for short, time-resolved light curve studies. In differential photometry, the flux of the target object is divided by that of a reference star in the same CCD frame. As the target and reference star images are obtained simultaneously, at the same air mass and under identical instrumental and weather conditions, the flux ratio is considered very reliable. It is common practice to also compare the differential light curves of the target and at least one non-variable field star, denoted the comparison star.
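As a concrete illustration, the χ2 test for variances just described can be sketched in a few lines (a minimal Python sketch, not the paper's code; function names and the injected 0.05 mag trend are illustrative, and the per-point errors are assumed known):

```python
import numpy as np
from scipy import stats

def chi2_variability_test(mags, errors, alpha=0.001):
    """Chi-square test for variances: compares the scatter of a light
    curve with the theoretical per-point measurement errors."""
    mags = np.asarray(mags, dtype=float)
    errors = np.asarray(errors, dtype=float)
    nu = mags.size - 1                          # degrees of freedom
    chi2 = np.sum((mags - mags.mean()) ** 2 / errors ** 2)
    crit = stats.chi2.ppf(1.0 - alpha, nu)      # one-sided critical value
    return chi2, crit, chi2 > crit

# Pure noise: the non-variability hypothesis should almost always be retained.
rng = np.random.default_rng(0)
sigma = 0.01                                    # ~0.01 mag photometric accuracy
quiet = rng.normal(0.0, sigma, 150)
stat_q, crit, var_q = chi2_variability_test(quiet, np.full(150, sigma))

# A 0.05 mag linear trend on top of the same noise is easily detected.
trend = quiet + np.linspace(-0.025, 0.025, 150)
stat_v, _, var_v = chi2_variability_test(trend, np.full(150, sigma))
```

For 149 degrees of freedom and α = 0.001, the critical value is about 208, the figure used later in the paper.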

A remarkable effort to update the statistical techniques used with photoelectric photometers and photographic plates, taking advantage of the better and faster response of CCD detectors, was carried out by Howell et al. (1988), who compared the variances of the object and a comparison star using the F distribution. The F test compares two sample variances, one for the suspected variable source and the other for a non-variable comparison star. These authors also introduced the Γ factor to account for scale differences between the variances of the target and the comparison star due to photon noise. Nevertheless, χ2-based tests have endured, as they are well-known and simple estimators of flux variations.
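The F test for variances can be sketched as follows (a hedged illustration: the Γ scaling factor of Howell et al. is omitted, which is only valid when target and comparison star have similar magnitudes, as in the simulations of this paper):

```python
import numpy as np
from scipy import stats

def f_test_variances(target, comparison, alpha=0.001):
    """F test: ratio of the target's sample variance to the comparison
    star's sample variance, compared with the F-distribution critical
    value.  The Gamma factor of Howell et al. (1988) is omitted here."""
    F = np.var(target, ddof=1) / np.var(comparison, ddof=1)
    nu1, nu2 = len(target) - 1, len(comparison) - 1
    crit = stats.f.ppf(1.0 - alpha, nu1, nu2)
    return F, crit, F > crit

rng = np.random.default_rng(1)
star = rng.normal(0.0, 0.01, 150)       # non-variable comparison star
quasar = rng.normal(0.0, 0.03, 150)     # strongly variable target (3x scatter)
F, crit, variable = f_test_variances(quasar, star)
```

With 149 and 149 degrees of freedom, a variance ratio near 9 far exceeds the α = 0.001 critical value.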

Jang & Miller (1997) and Romero et al. (1999) proposed a new test to analyze microvariability based on the ratio C between the target and comparison star standard deviations, rather than variances. This C-statistic resembles Pica & Smith's (1983) Q-statistic, but is less sensitive to the spurious effect of a single discrepant point. In the last decade, the C-statistic has become very popular, and several researchers have adopted this methodology to study quasars (e.g., Stalin et al. 2004a; Gupta et al. 2008) and blazars (e.g., Xie et al. 2004; Andruchow et al. 2005).
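The C criterion is simple enough to state in code (a minimal sketch under the convention used in the cited papers: C is judged against 2.576 as if it were a standard normal deviate, which the present paper argues against):

```python
import numpy as np

def c_statistic(target, comparison, critical=2.576):
    """C-statistic (Jang & Miller 1997; Romero et al. 1999): the ratio
    of the target's to the comparison star's standard deviation,
    declared variable when it exceeds the nominal 1% normal critical
    value 2.576."""
    C = np.std(target, ddof=1) / np.std(comparison, ddof=1)
    return C, C > critical

rng = np.random.default_rng(2)
star = rng.normal(0.0, 0.01, 150)
quasar = rng.normal(0.0, 0.02, 150)     # twice the scatter of the star
C, variable = c_statistic(quasar, star)
```

Note that even a target with twice the comparison star's scatter (four times the variance) yields C ≈ 2 and goes undetected by this criterion, foreshadowing the conservativeness demonstrated in Section 3.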

A different methodology to analyze differential light curves has been proposed by de Diego et al. (1998). These authors use the one-way analysis of variance (ANOVA) test for studies of quasar microvariability. Using this technique, de Diego et al. were the first to claim that microvariability events are as frequent in radio-quiet as in radio-loud quasars (excluding blazars). ANOVA has also been applied in other studies of AGN variability (Ramírez et al. 2004; Villforth et al. 2009; Ramírez et al. 2009). However, because the novel results reported by de Diego et al. (1998) were unexpected at the time, and because they were difficult to compare with previous studies, the ANOVA test has still not gained full acceptance (Romero et al. 1999; Carini et al. 2007). In this paper, independent runs of data are simulated using the Monte Carlo technique and analyzed using one-way ANOVAs as well as other statistical techniques.

Despite their generic name, which can be misleading, ANOVA tests are designed to detect differences between several sample means, rather than between sample variances. Thus, such tests can be considered a generalization of the Student's t-test for the difference between two sample means. It is worth noting that tests for means can distinguish smaller differences than tests for variances (see Section 4), and thus an ANOVA is expected to improve the detection of microvariability events. ANOVA tests are used, for instance, in experimental design methodologies (e.g., Box et al. 2005), which deliberately impose one or more conditions on different groups of data in order to observe the response. These methodologies differ radically from those common in astronomy, where data are collected and analyzed as they come and external conditions cannot be manipulated. Therefore, after applying the same objective method to this and the other statistical tests, we will be able to effectively establish the reliability and advantages of the ANOVA test.
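Applied to a light curve, the one-way ANOVA amounts to binning consecutive observations into groups and testing whether the group means differ (a sketch under the grouping used later in the paper, five points per group; the drift amplitude below is illustrative):

```python
import numpy as np
from scipy import stats

def anova_variability_test(mags, group_size=5, alpha=0.001):
    """One-way ANOVA on a light curve: bin consecutive observations
    into groups and test whether the group means differ.  With N points
    in k groups, nu1 = k - 1 and nu2 = N - k degrees of freedom."""
    mags = np.asarray(mags, dtype=float)
    k = mags.size // group_size
    groups = mags[:k * group_size].reshape(k, group_size)
    F, p = stats.f_oneway(*groups)      # one group per argument
    return F, p, p < alpha

rng = np.random.default_rng(3)
noise = rng.normal(0.0, 0.01, 150)
_, p_quiet, var_quiet = anova_variability_test(noise)

# A slow 0.05 mag drift shifts the group means and is easily detected.
_, p_drift, var_drift = anova_variability_test(noise + np.linspace(0.0, 0.05, 150))
```

Because the variability signal moves the group means while the within-group scatter estimates the noise, this arrangement is sensitive to exactly the slow, low-amplitude changes that microvariability studies target.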

This paper presents a comparison of the outcomes of different analysis strategies for the detection of low-amplitude microvariations when reliable astronomical differential photometric data are available. Reliable data are understood as data characterized by random errors alone, unaffected by systematic errors. Dealing properly with systematic observational errors would require, first, detecting that the data are indeed affected by such errors; second, understanding their cause; and third, correcting for these systematics by changing and fine-tuning the observational setup, to the extent that this is possible. In this paper, systematic effects are entirely disregarded.

To analyze the data, several procedures based on χ2 tests, F tests for variances, one-way ANOVAs, and C-statistics will be considered in turn. Although these are common statistical methodologies, there exist many different implementation strategies for these tests and, in principle, new alternative tests could be devised. Strictly speaking, some of the inferences derived, or comparisons established between the various tests, may be valid only for the particular cases considered here. Furthermore, we cannot rule out that observational circumstances or setup implementations different from those envisaged in this paper might produce altogether different results.

This paper is organized in the following way. The simulation procedure is described in Section 2, results are shown in Section 3, a discussion and comparisons between the tests are presented in Section 4, and the conclusions are summarized in Section 5. In Appendix A, the interested reader can also find the mathematical description of each test, along with some comments on their use and validity. An interesting implementation of the one-way ANOVA to improve the detection of microvariability in the AGN light curves is discussed in Appendix B.

2. LIGHT CURVE SIMULATIONS

In order to accurately analyze the power and robustness of the different statistical methodologies, light curves were simulated for a quasar, a reference star, and two comparison stars. For convenience, the Instrument Simulator software of the Mexican Observatorio Astronómico Nacional was used to obtain basic data for the simulations. The input arguments were the 1.5 m telescope, the SITe1 detector, 2 × 2 binning, V filter, 1.5 arcsec seeing, 3 arcsec aperture, 60 s exposure time, and magnitudes V = 17 and V = 15 for the quasar and the reference star, respectively. For simplicity, the comparison stars were chosen to have the same magnitude as the quasar. The output includes, among other parameters, the signal-to-noise ratio (S/N) and the object, sky, and detector (total read-out) noises. From these parameters, the object and sky electron counts were obtained assuming Poisson distributions (i.e., photon shot noise). The parameters used in the simulations are shown in Table 1. Column 1 describes the source associated with the parameters; Columns 2 and 3 list the electron counts for the signal and noise, respectively; Column 4 gives the S/N.

Table 1. Parameters Used in the Simulations

Description   Signal (electrons)   Noise (electrons)   S/N
Detector      ...                  103                 ...
Sky           23,287               152                 ...
V = 17        32,292               179                 126
V = 15        203,766              451                 418


For a 1.5 m telescope and an S/N larger than 100, 1 minute exposures are reasonable; longer exposures may saturate bright objects and the stars used as references in differential photometry. Thus, every simulation comprises 150 × 1 minute exposures during 5 hr of monitoring, with a 1 minute lag between exposures to account for CCD read-out. For more realistic simulations, atmospheric attenuation in the V band (AV = 0.2 mag/air mass) has been taken into account. The object is assumed to pass through the zenith in the middle of the run (i.e., 2.5 hr after the monitoring begins). Photon noise was generated for the object, the reference and comparison stars, and for a constant-brightness sky. Finally, white Gaussian noise was added to account for the CCD read-out. The photometric accuracy for the differential light curves of the quasar and the comparison star is about 0.01 mag.
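The noise model just described can be sketched as follows (a simplified Python illustration using the Table 1 counts; the sky is subtracted at its known mean value and the atmospheric attenuation and zenith crossing are omitted, so this is not the paper's full simulation):

```python
import numpy as np

rng = np.random.default_rng(42)

def simulate_series(n_exposures, obj_counts, sky_counts=23_287,
                    read_noise=103):
    """Simplified noise model per 60 s exposure, in electrons: Poisson
    photon noise for object and sky, plus white Gaussian read-out
    noise.  Sky subtraction at the known mean is an idealization."""
    obj = rng.poisson(obj_counts, n_exposures).astype(float)
    sky = rng.poisson(sky_counts, n_exposures).astype(float)
    read = rng.normal(0.0, read_noise, n_exposures)
    return obj + sky + read - sky_counts

quasar = simulate_series(150, 32_292)            # V = 17 target
reference = simulate_series(150, 203_766)        # V = 15 reference star
diff_mag = -2.5 * np.log10(quasar / reference)   # differential light curve
```

With these counts, the scatter of `diff_mag` comes out near the 0.01 mag accuracy quoted above, dominated by the fainter object's photon, sky, and read-out noise.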

Two runs of simulations were performed. One run considered a Gaussian-shaped variation in the quasar flux. The duration of the variations is described by their FWHMs, which take 60 discrete values in the range between 1 and 60 minutes (i.e., in steps of 1 minute). This range of durations is also appropriate to investigate the possibility of detecting spikes (see Sagar et al. 1996; de Diego et al. 1998; Gopal-Krishna et al. 2000; Stalin et al. 2004a). The peak of the variation was allowed to be centered between 120 and 180 minutes from the beginning of the monitoring; this range ensures that the whole variation is contained in the data set. Finally, for each of the 60 values set for the duration of the variation, 100 simulations were performed allowing random amplitudes up to 3% of the quasar flux (i.e., ∼0.03 mag). Therefore, the total number of simulations was 6000 (60 × 100), of 150 photometric points each. Figure 1(a) shows the final simulated raw curves for the quasar and comparison star for a Gaussian variation, while Figure 1(b) shows the same curves after calibration by the observations of the reference star. The standard deviation of the differential curve of the comparison stars shows that the photometry is accurate to 0.009 mag.
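The injected Gaussian variation can be written as a multiplicative flux profile (a hypothetical parameterization; the paper specifies only the amplitude, the peak center, and the FWHM, so the exact functional form below is an assumption):

```python
import numpy as np

def gaussian_peak(t, amplitude, center, fwhm):
    """Fractional flux modulation: 1 at baseline, 1 + amplitude at the
    peak.  t, center, and fwhm are in minutes; sigma is derived from
    the FWHM of a Gaussian."""
    sigma = fwhm / (2.0 * np.sqrt(2.0 * np.log(2.0)))
    return 1.0 + amplitude * np.exp(-0.5 * ((t - center) / sigma) ** 2)

t = np.arange(0.0, 300.0, 2.0)   # 150 exposures over 5 hr, 2 min cadence
flux = gaussian_peak(t, amplitude=0.03, center=150.0, fwhm=60.0)
```

By construction, the profile reaches 1 + amplitude at the center and 1 + amplitude/2 at half the FWHM away from it.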


Figure 1. Simulated raw curves (a) and differential light curves (b) for a V = 17 quasar (✩) and comparison star (+, with an offset of 0.1 mag). At the middle of the simulation, the objects are crossing the zenith, where the atmospheric absorption has been set at 20%. A reference star with V = 15 has been used to calibrate the observations. The amplitude of the variability in this case was 0.0267, and the peak has a FWHM of 60 minutes.


For the other run, the light curves of the objects present a constant (linear) flux variation, as shown in Figure 2. The amplitudes of these variations, measured as the difference between the first and last data points, were also drawn at random up to 3% of the quasar flux. As in the Gaussian peak case, a total of 6000 simulations of 150 data points each were performed. In both cases, Gaussian peak and linear variation, all fluxes and estimated errors were converted into magnitudes for the analysis.


Figure 2. Simulated raw curves (a) and differential light curves (b) for a V = 17 quasar (✩) and comparison star (+, with an offset of 0.1 mag). At the middle of the simulation, the objects are crossing the zenith, where the atmospheric absorption has been set at 20%. A reference star with V = 15 has been used to calibrate the observations. The amplitude of the variability in this case was 0.0113, and it has a steady increase in flux during the monitoring.


3. RESULTS

The results of the statistical analyses of the simulations are summarized in Table 2. Column 1 identifies the test (see below); Column 2 indicates the significance level α; Columns 3–5 show the number of Type I errors found in the analysis of one of the comparison stars, the number of detections of microvariations in the quasar light curves, and the number of Type II errors for the quasars, respectively, for the 6000 simulations for the Gaussian peak variation; Columns 6–8 repeat the same numbers but for linear variations.

Table 2. Results of the Simulations

                                Gaussian Peak Variation         Linear Variation
Test                     α      Type I  Detections  Type II    Type I  Detections  Type II
χ2 test                  0.001    10      1614       4386        2       2278       3722
Idem (std)               0.001   125      1748       4252      129       2385       3615
One-way ANOVA            0.001     3      2187       3813        4       3043       2957
Idem (30 minute lags)    0.001     2      1086       4914       11       1668       4332
F test (α = 0.1%)        0.001     9       837       5163        5       1302       4698
Idem (α = 1%)            0.01     63      1443       4557       61       2050       3950
C-statistic              0.01      0         0       6000        0          0       6000


The first item in Table 2 is the χ2 test described in Section A.2. To perform this test, the true error distributions introduced in the simulations (photon and white Gaussian noises) were used to estimate the error of each individual data point. In the second test, a similar χ2 analysis was performed, but using the standard deviation of the comparison star instead of the individual errors of each measurement (note that in this case Type I errors have been calculated from the other comparison star). The third test is a one-way ANOVA, performed by grouping the data in sets of five individual observations (see the description in Section A.3). The fourth test is also a one-way ANOVA, but with a 30 minute lag between group samplings (and thus it is the only test that uses only one-third of the simulated data). The results of two F tests for variances (see Section A.1 for a description) are reported in the fifth and sixth items, the former at a significance level α = 0.001 (or 0.1%) to compare with the previous tests, and the latter at α = 0.01 (or 1%) to compare with the C-statistic. Finally, the last item is the C-statistic described in Section A.4.

The significance level α is a probability, set a priori by the researcher, that a test yields only by chance a result at least as extreme as the one observed. Note that the ANOVA and χ2 tests have been performed at a significance level of 0.1%, which corresponds to the usual detection limit of 3σ. On the other hand, the significance level of the C-statistic is set at 1%, or a detection limit of 2.576σ, which is the level commonly adopted for this test (e.g., Jang & Miller 1997; Romero et al. 1999). For the F test, both significance levels are considered: α = 0.1% to compare with the χ2 and ANOVA tests, and α = 1% to compare with the C-statistic, as the F test is considered an alternative to that methodology (see Section A.4).

Type I errors are due to the rejection of a true null hypothesis (i.e., rejecting the non-variability hypothesis for a non-variable object). For an unbiased test, the actual number of Type I errors depends only on the number of data sets examined and the significance level of the test. In this case, the number of Type I errors has been obtained by testing the simulated differential curves of a (non-variable) comparison star. For a significance level of, say, 0.1% and 6000 simulations, we should expect around six spurious detections. Considering a binomial distribution for the number of Type I errors, its actual number for a given test is expected to lie between 0 and 13, and in most cases within the 6 ± 3 interval. If the number of Type I errors for the comparison star differs significantly from these expected frequencies, this is evidence that the actual significance of the test departs from its nominal value and that the test is not reliable.
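The expected range of Type I errors quoted above can be checked directly (a quick sketch treating the 6000 tests as independent binomial trials):

```python
from scipy import stats

# 6000 independent tests, each with a 0.1% chance of a spurious detection.
n, alpha = 6000, 0.001
expected = n * alpha                           # about 6 spurious detections
lo, hi = stats.binom.interval(0.99, n, alpha)  # central 99% range of counts
```

The central 99% interval of Binomial(6000, 0.001) spans roughly 1 to 13 spurious detections, consistent with the 0–13 range stated in the text.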

The number of detections in Table 2 indicates how many tests succeeded in detecting variability in the simulated quasar differential light curves, and it is also a measure of the power of each test, which generally varies as a function of the data set characteristics. On the other hand, Type II errors are due to the acceptance of a false null hypothesis (i.e., accepting the non-variability hypothesis for a variable object). Note that the quasar varied in all the simulations. Therefore, as the number of Type I errors is low with respect to the number of detections, the number of Type II errors is (approximately) 6000 minus the number of detections.

From Table 2 we see that the χ2 test performed taking into account the actual error distribution, the one-way ANOVA, the one-way ANOVA with 30 minute lags, and the F test at α = 0.1% all show numbers of Type I errors in accordance with expectations. The F test at α = 1% and the C-statistic have a lower significance level, and the expected number of Type I errors is accordingly 60. The results for the F test at α = 1% agree with this expectation; the C-statistic, however, consistently produced neither Type I errors nor detections, in accordance with the arguments presented in Section A.4. In fact, in these simulations, the C-statistic is always a factor ≳2 below the critical value (2.576) that marks the boundary between rejecting and accepting the null hypothesis.

The χ2 test performed using the standard deviation of the comparison star, instead of the actual errors of each observation in the quasar light curve, shows a number of Type I errors much larger than expected (Table 2). This is a consequence of employing a wrong methodology: as explained in Section A.4, failing to consider the number of degrees of freedom in the estimation of the standard deviation of the comparison star light curve produces a biased statistic. The increase in the number of detections with respect to the previous χ2 test is not significant. As the number of Type I errors is still much smaller than the number of Type II errors, it is possible to estimate the number of spurious detections of variability in the quasar light curves (i.e., detections not related to the actual variation of the light curves). In this case, the number of spurious detections is approximately 125 × 4252/6000 ≈ 89, while a similar calculation yields seven spurious detections for the previous χ2 test. Thus, even if the biased statistic affected only the number of Type I errors and not the number of Type II errors (which are probably affected as well), we would expect a difference of about 80 detections between both tests, accounting for most of the actual difference of 134.

In contrast, for the F tests the number of Type I errors agrees with the expected number, as mentioned above. However, they show a rather small number of detections in comparison with the other tests (see Table 2). Thus, the detections for the F test at α = 0.1% are well below the results for the one-way ANOVA with 30 minute lags, even though the number of data points for the latter test is a factor of 3 smaller than for the F test. In the case of the F test at α = 1%, the significance level is low enough that a larger number of detections would be expected than, for example, for the χ2 test at a significance level of 0.1%; yet the opposite is true. There are two possible explanations for these results: one is the non-robustness of the F test for non-Gaussian distributed data (Lehmann 1986, Section 5.4), and the other is that the test has intrinsically less power than the χ2 and ANOVA tests. Indeed, some of the comparison star differential light curves are not well fitted by a Gaussian profile, as shown in Figure 3, perhaps as a consequence of the underlying Poisson distribution associated with flux measurements. To investigate this possibility, a set of Kolmogorov–Smirnov goodness-of-fit tests for Gaussianity (e.g., Wall & Jenkins 2003, Section 5.3.2) was performed on the distribution of the simulated Gaussian peak data for each of the 6000 light curves of the comparison star. The significance level of the test was fixed at 5%; therefore, around 300 Type I errors were expected if the data were fairly Gaussian distributed. The actual number of differences found by the Kolmogorov–Smirnov test was 296, which rules out the non-Gaussianity explanation. The differences from the χ2 and ANOVA results in detecting light curve variations should therefore be imputed to the comparatively diminished power of the F test.
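A Gaussianity check of this kind can be sketched as follows (one simplification relative to a strict treatment: the mean and standard deviation are estimated from the same data, which makes the plain Kolmogorov–Smirnov test conservative; strictly, Lilliefors' correction would apply):

```python
import numpy as np
from scipy import stats

def ks_gaussianity_test(curve, alpha=0.05):
    """Kolmogorov-Smirnov goodness-of-fit test of a light curve against
    a Gaussian, with mean and standard deviation estimated from the
    data (a conservative simplification)."""
    z = (curve - curve.mean()) / curve.std(ddof=1)
    d, p = stats.kstest(z, "norm")
    return d, p, p < alpha

rng = np.random.default_rng(5)
gaussian_curve = rng.normal(0.0, 0.01, 150)      # well-behaved light curve
skewed_curve = rng.lognormal(0.0, 1.0, 150)      # strongly skewed, for contrast
d_g, p_g, _ = ks_gaussianity_test(gaussian_curve)
d_s, p_s, rejected = ks_gaussianity_test(skewed_curve)
```

For 150 points, a strongly skewed distribution such as the lognormal one above yields a much larger KS distance than Gaussian noise and is rejected at the 5% level.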


Figure 3. Histogram of the differential light curve of a comparison star. In some cases, as in this simulation, which shows a strong skewness, the distribution of the differential light curve departs from a Gaussian.


The number of detections for the one-way ANOVA tests is significantly larger than for the χ2 tests. The 150 observations computed for each run of simulations are divided into 30 groups of five observations each, and the time interval covered by each group spans 10 minutes (recall that exposures and read-outs last 1 minute each). Even if the number of groups is reduced to 10 (i.e., sampling the object every 30 minutes, as in the case of the one-way ANOVA with 30 minute lags), the ANOVA maintains its robustness and, to a large extent, its power to detect microvariations.

4. DISCUSSION

To perform the ANOVA test it is necessary to bin the data. Although the results for the χ2 test using the standard deviation of the comparison star show the risks of introducing what might a priori be considered reasonable changes in the strict test procedure, it is still questionable whether the χ2 test would perform as well as an ANOVA if the data were binned in the same way. However, just as there is a loss of information in going from a list of observations to a histogram, binning applied to the χ2 test produces an immediate loss of test power. The theoretical grounds for rejecting the χ2 binning methodology may be traced back at least to the lossy compression methods considered in Shannon's information theory: binning reduces the signal as well as the noise, and it is mathematically impossible to recover additional information from the binned data. Thus, analyzing the raw signal and the binned data with the same χ2 statistical procedure results in an unavoidable loss of sensitivity of the test.

Things get even worse when the signal is low, close to the 3σ limit, as is frequent in microvariability studies. Besides blurring the signal, binning the data in groups of n observations in a χ2 procedure also reduces the degrees of freedom of the statistical analysis by a factor of n. Combined, both effects are disastrous for detecting tiny variations. For the simulations presented in this paper, the raw-data χ2 procedure was able to detect 1614 variations out of a total of 6000 cases, while a side calculation based on the binned procedure for n = 5 yielded only 284 detections. In comparison, the ANOVA produced 2187 detections. Although the ANOVA also groups the data, producing a loss of signal, it redistributes the degrees of freedom between group estimates and error estimates within groups, rather than canceling them. Thus, if N is the number of observations and k the number of groups (k = N/n), the degrees of freedom are ν1 = k − 1 for the groups and ν2 = N − k for the errors, and therefore ν1 + ν2 = N − 1, which corresponds to the degrees of freedom of the original data set. Besides, as stated above, an ANOVA tests (group) means, while χ2 tests variances, and it is well known that tests for means are more powerful than tests for variances, among other reasons because the actual values of means are more tightly constrained. This is a result of the squaring of each term, which effectively weights outliers and large errors more heavily than small ones. For example, a few simulations generating samples of size 100 drawn from a normally distributed population with mean μ = 0 and standard deviation σ = 1 show that the ratio between the respective 95% confidence intervals is C.I.(σ2)/C.I.(μ) ≈ 1.5.
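The quoted ratio of interval widths can be reproduced in a few lines (a Monte Carlo sketch of the small simulation mentioned in the text; the trial count is an arbitrary choice large enough to stabilize the percentiles):

```python
import numpy as np

rng = np.random.default_rng(7)

# Draw many samples of size 100 from N(0, 1) and compare the empirical
# 95% interval widths of the sample mean and the sample variance.
n, trials = 100, 20_000
samples = rng.normal(0.0, 1.0, size=(trials, n))
means = samples.mean(axis=1)
variances = samples.var(axis=1, ddof=1)

def width95(x):
    """Width of the central 95% interval of an empirical distribution."""
    lo, hi = np.percentile(x, [2.5, 97.5])
    return hi - lo

ratio = width95(variances) / width95(means)
```

The sample mean is pinned down to a width of about 2 × 1.96/√100 ≈ 0.39, while the sample variance (a scaled χ2 with 99 degrees of freedom) spans roughly 0.56, giving a ratio in the neighborhood of 1.5.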

In the case of the ANOVA and the ANOVA with 30 minute lags, the time interval within each group has been chosen not to exceed 10 minutes. This limit is imposed by previous experience showing that optical microvariations on timescales of less than 20 minutes in quasars are rare, hard to detect, or both. On the other hand, when the monitoring is performed in several optical bands that will later be compared, it should satisfy simultaneity criteria of variability between the bands involved. Thus, Villata et al. (2004) have considered time intervals of around 10 minutes for photometric sequences between bands V and I in blazars, while Papadakis et al. (2004) and Hu et al. (2006) have calculated lags of around 20 and 11 minutes between bands B and I for AGNs and blazars, respectively. The difficulty of detecting even large microvariability events lasting less than 10 minutes is also evident in the results of these simulations, no matter which statistical methodology is used (see, for example, Figure 4). Thus, 10 minutes is a safe time interval to bin data sharing similar flux characteristics, statistically indistinguishable from the noise, and will be appropriate for many studies of quasar microvariability. However, for the ANOVA continuous monitoring strategy, it is still possible to improve the test power by trying out different bin sizes after the observations have been made (see Appendix B).


Figure 4. Detection frequencies against temporal FWHM of the variation for the Gaussian peaks: χ2 test (•), one-way ANOVA (△), and one-way ANOVA with 30 minute lags (□).


The choice of 1 minute exposures was justified in Section 2. Thus, a group of about five such exposures, lasting less than 10 minutes including CCD read-out, is a reasonable methodological choice for the ANOVA with 30 minute lags observational strategy, for which the bin size of the groups is set before the observations and cannot be changed afterward. For a larger telescope, the exposures might be shorter, but the total number of observations in each group is still limited by the read-out dead time. Besides, the gain in the power of the test would be relatively small for more than five exposures per group.

In the rest of this section, only the χ2 test, the one-way ANOVA with and without time lags between group observations, and the C-statistic will be discussed. The results of the simulations presented in Section 3 for the χ2 test using the standard deviation of the comparison star, and for the F test for sample variances, are descriptive enough to show the risks of relying on these procedures.

4.1. Test Comparisons

It is expected that the probability of detecting a change in the brightness of a source depends on both the amplitude of the variation and its duration. The range of amplitudes is the same for the Gaussian peak and the linear simulations, but the duration of the Gaussian peak variations is restricted to an FWHM of 60 minutes or less, while the linear variations spread over the 5 hr of monitoring. Therefore, it is not surprising that linear variations are detected more easily than the shorter Gaussian peaks considered in the simulations (see Table 2). The effect of the duration of the Gaussian peak variations on the number of detections is shown in Figure 4 for the χ2 test, the one-way ANOVA, and the one-way ANOVA with 30 minute lags.

Figure 5(a) compares the distributions of the χ2 and the ANOVA F statistics for the non-variable comparison star. This plot yields a circular-shaped cloud of points (allowing a certain amount of distortion because of the use of different scales), as expected for unbiased statistics. The points are centered at coordinates (1, 149), as the expected values of the ANOVA F and χ2 statistics with 149 degrees of freedom when the null hypothesis is true are 1 and 149, respectively. Figure 5(b) shows a similar plot for the quasar statistics with Gaussian peak variations. The clear correlation between both statistics indicates that they are measuring the same variable phenomenon. Note that the maximum range of the χ2 statistic is about two times its critical value, while the ANOVA F statistic spreads out to approximately five or six times its critical value. This is a consequence of the different powers of the tests under the conditions of these simulations. The same arguments apply in the case of the linear variations. In this case, the difference in power between the ANOVA and the χ2 tests can be illustrated by the results for the quasar light curve shown in Figure 2, where the ANOVA F statistic is larger than the critical value set to detect variations (F = 2.8 > 2.3 = F^(0.001)_(29,120)), while χ2 lies below its critical value (χ2 = 164 < 208 = χ2_(0.001,149)).


Figure 5. Comparison between the χ2 and the one-way ANOVA F statistics for (a) a reference star and (b) a quasar with Gaussian variations. Thick lines show the critical values for each statistic. The white cross in (a) marks the expected value of both statistics for a non-variable reference star. The distribution of data points in (b) corresponds to 3682 cases where neither χ2 nor ANOVA detects variations (lower left corner), 131 cases where χ2 detects variations and ANOVA does not (upper left corner), 704 cases where ANOVA detects variations and χ2 does not (lower right corner), and 1483 cases where both tests detect variations (upper right corner).


Figure 6 shows the distributions of (a) the χ2 statistic, (b) the ANOVA F, (c) the ANOVA F with 30 minute lags between group observations, and (d) the C-statistic, against absolute amplitude of the variations for temporal Gaussian peak variations, along with the critical values for each test (indicated by thick horizontal lines). It is clear that for (a), (b), and (c) the number of detections (points above the lines indicating critical values) increases with the variability amplitude. Remember that the ANOVA, the χ2, and the C-statistic were calculated using the same data set, while for the one-way ANOVA with 30 minute lags the data set is resampled to one-third of the simulated observations. From Figure 6(d), it is obvious that the C-statistic is about a factor of 2 below any detection even though the nominal significance level for this test (α = 1%) is less tight than for the ANOVA and χ2 tests (α = 0.1%).


Figure 6. Distribution of statistics against absolute amplitudes of the Gaussian peak variations and critical values; (a) points indicate the values of the χ2 statistic with α1 = 0.001, ν = 149, and the solid line the critical value for the one-sided test $\chi ^2_{\nu,\alpha _1} = 208$; (b) idem for the ANOVA statistic with α1 = 0.001, ν1 = 29, ν2 = 120, and $F_{(\nu _1,\nu _2)}^{(\alpha _1)} = 2.28$; (c) idem for the ANOVA statistic with 30 minute lags, α1 = 0.001, ν1 = 9, ν2 = 40, and $F_{(\nu _1,\nu _2)}^{(\alpha _1)} = 4.02$; (d) idem for the C-statistic with α2 = 0.01 and its critical value for the two-sided normal test $z_{\alpha _2} = 2.576$.


The results for the linear variations shown in Figure 7 are similar to those for the Gaussian peak variations, but the statistics are less scattered because there is no dependence on the duration of the variation. For a significance level of α = 0.1% and the conditions of the simulations, all the linear variations with amplitudes ≳0.027 are detected by the χ2 test, while the one-way ANOVA detects all the variations with amplitudes ≳0.022. In comparison, the one-way ANOVA with 30 minute lags detects around 90% of the variations with amplitudes near 0.03. As in the Gaussian peak case, the C-statistic is again a factor of 2 below any detection at the significance level of α = 1%.
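This power difference can be illustrated with a small Monte Carlo sketch. The parameters mirror the simulation setup described above (150 points with σ = 0.01 mag, analyzed in 30 groups of five), but the code, the random seed, and the choice of a 0.03 mag linear ramp as the "variable" signal are illustrative, not the paper's actual implementation:

```python
# Monte Carlo sketch comparing chi^2 and one-way ANOVA power.
# Illustrative parameters: 150 points, sigma = 0.01 mag, 30 groups of 5,
# and a hypothetical 0.03 mag linear ramp as the "variable" signal.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
N, sigma, k, nj = 150, 0.01, 30, 5
chi2_crit = stats.chi2.ppf(1 - 0.001, N - 1)      # ~208
f_crit = stats.f.ppf(1 - 0.001, k - 1, N - k)     # ~2.28

def anova_f(y):
    """One-way ANOVA F statistic over k consecutive groups of nj points."""
    return stats.f_oneway(*y.reshape(k, nj)).statistic

def chi2_stat(y):
    """Chi^2 statistic with the true sigma known (Equation (A2))."""
    return np.sum((y - y.mean()) ** 2) / sigma ** 2

trials = 500
ramp = np.linspace(0.0, 0.03, N)                  # linear variation
flat_f_alarms = f_hits = chi2_hits = 0
for _ in range(trials):
    # Type I errors on a flat (non-variable) light curve.
    flat_f_alarms += anova_f(rng.normal(0.0, sigma, N)) > f_crit
    # Detections on a light curve with a linear variation.
    y = ramp + rng.normal(0.0, sigma, N)
    f_hits += anova_f(y) > f_crit
    chi2_hits += chi2_stat(y) > chi2_crit

print(flat_f_alarms, f_hits, chi2_hits)
```

With these settings, false alarms for the flat light curve are rare, while both tests detect the 0.03 mag ramp in the large majority of trials, the ANOVA somewhat more often than χ2.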


Figure 7. Distribution of statistics against absolute amplitudes of the linear variations and critical values; (a) points indicate the values of the χ2 statistic with α1 = 0.001, ν = 149, and the solid line the critical value for the one-sided test $\chi ^2_{\nu,\alpha _1} = 208$; (b) idem for the ANOVA statistic with α1 = 0.001, ν1 = 29, ν2 = 120, and $F_{(\nu _1,\nu _2)}^{(\alpha _1)} = 2.28$; (c) idem for the ANOVA statistic with 30 minute lags, α1 = 0.001, ν1 = 9, ν2 = 40, and $F_{(\nu _1,\nu _2)}^{(\alpha _1)} = 4.02$; (d) idem for the C-statistic with α2 = 0.01 and its critical value for the two-sided normal test $z_{\alpha _2} = 2.576$.


Percentages of detections per amplitude range are shown for the Gaussian peak variations in Figure 8, and for the linear variations in Figure 9. Percentages of Type II errors per amplitude range can easily be derived from these figures by subtracting the detection percentages from 100. In the case of the linear variations considered in these simulations, the dependence of the number of detections on the amplitude of the variations is straightforward. In the case of the Gaussian peak, however, the double dependence on the amplitude and the temporal length of the variation makes the relationship less evident. After smoothing, this double dependence can be shown as a contour plot of the probability of detection as a function of the amplitude and duration of the microvariability event, as in Figure 10. Note that the probability of detection increases with both the amplitude and the duration of the variation.


Figure 8. Percentage of detections per amplitude of Gaussian peak variation. (a) χ2 statistic; (b) ANOVA; (c) ANOVA with 30 min lags.


Figure 9. Percentage of detections per amplitude of linear variation: (a) χ2 statistic; (b) ANOVA; (c) ANOVA with 30 minute lags.


Figure 10. Example of the double dependence of the Gaussian peak detections on the amplitude and the temporal length of the variation. The smoothed χ2 statistic of the simulated data is shown as a contour plot of the probability of detection, with the FWHM of the variations in minutes on the x-axis and the amplitude in magnitudes on the y-axis. Plots for the one-way ANOVA statistics would look similar.


Taken together, these results show that both the χ2 test and the one-way ANOVA are robust methodologies for studying variability in the light curves of quasars. However, the χ2 test relies on an accurate theoretical estimate of the error of each single data point. As noted in Section A.2, this is not usually available, and a number of ad hoc correction factors have been used to compensate for the notorious lack of agreement between the IRAF estimated photometric errors and the actual dispersion of the data (e.g., Gopal-Krishna et al. 2003; Stalin et al. 2004a; Bachev et al. 2005).

The large increase in the number of Type I errors obtained when a sample standard deviation drawn from the comparison star is used instead of the actual error (Table 2), together with the arguments given above about the loss of power when the data are binned, warns against any simple workaround for this problem. The ANOVA tests, in contrast, sidestep the issue because they estimate the errors internally. For the χ2 test, it is reasonable to suppose that an accurate estimate of the actual error of each data point could be achieved by measuring the data dispersion for a large number of foreground stars. However, as already noted, dispersion measurements such as variances and standard deviations are poorly constrained. Let us investigate the possible gain in accuracy from using a set of comparison stars.

In fact, when the true photometric error of the data is unknown and is estimated from one or more comparison stars, the actual significance level of the χ2 test is easy to calculate, because the resulting statistic follows an F distribution. In our case, each light curve comprises 150 observations (ν = 149 degrees of freedom) and the nominal significance level of the test is α = 0.1%, which corresponds to a critical value χ20.001,149 = 208. Dividing this value by ν gives the reduced χ2 critical value χ2r = 1.40, and the actual significance level α follows from solving $F^{(\alpha)}_{149,149 N_*} = 1.40$ for the F statistic with ν1 = 149 and ν2 = 149 × N* degrees of freedom. Some of these calculations are presented in Table 3: Column 1 indicates the number of stars N* used to estimate the error, Column 2 the actual significance level α of the test, and Column 3 the relative error in the variance estimate.

Table 3. Effect of the Number of Comparison Stars on the Test Accuracy

Number of Stars      α       Err(s2)/σ2
       1           0.0212      1.4142
       5           0.0029      0.6325
      10           0.0018      0.4472
      20           0.0014      0.3162
      30           0.0012      0.2582
      40           0.0012      0.2236


Note the value for a single comparison star (N* = 1): the actual significance level is 2.12% rather than 0.1%; therefore, for 6000 light curve simulations this test should produce 0.0212 × 6000 ≈ 127 Type I errors, in agreement with the result of the simulations for the χ2 test performed using the standard deviation of the comparison star (see the second item in Table 2). Table 3 also shows that even when the data for 10 stars are combined the actual significance level is still almost twice the nominal value, and that 30 or 40 stars must be combined to obtain a significance level accurate to within 20%.
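The α column of Table 3 can be reproduced numerically with a few lines of scipy (a sketch of the calculation described above, under the same assumptions: ν = 149 and a nominal α = 0.1%):

```python
# Sketch of the Table 3 calculation: the actual significance level of a
# chi^2 test whose errors are estimated from N_* comparison stars,
# treating the reduced statistic as F(149, 149 * N_*).
from scipy import stats

nu = 149
chi2_red_crit = stats.chi2.ppf(1 - 0.001, nu) / nu   # reduced chi^2, ~1.40

alphas = {}
for n_stars in (1, 5, 10, 20, 30, 40):
    alphas[n_stars] = stats.f.sf(chi2_red_crit, nu, nu * n_stars)
    print(n_stars, round(alphas[n_stars], 4))
```

As N* grows, α approaches the nominal 0.001 from above, but slowly, which is the point made in the text.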

Sample variances s2 are χ2 distributed random variables (Equation (A9)), and therefore it can easily be shown that, in our case, the sample variance obtained for each photometric data point from measuring N* stars also has an associated variance (i.e., a variance of the variance) given by

$$\mathrm{Var}[s^2] = \frac{2\,\sigma ^4}{N_*-1} \qquad (1)$$

where σ is the true photometric error and N* − 1 the degrees of freedom. Thus, the error of the measured variance is given by the square root of the previous equation: $\mathrm{Err}[s^2] = \sigma ^2 \sqrt{2/(N_*-1)}$. For observations analogous to the simulations reported in this paper, σ ≈ 0.01 mag. The relative error of the variance estimate (Err[s2]/σ2) obtained from N* stars is shown in the third column of Table 3. Even for N* = 40 the variance estimate has an accuracy worse than 20%, which implies that the photometric error estimate is uncertain by approximately 50% (s = 0.010 ± 0.005). Moreover, even if it were possible to observe tens of stars simultaneously, thereby approaching the controlled conditions of the simulations where the errors are completely known, the ANOVA would still outperform the χ2 test in finding tiny variations in the simulated quasar differential light curves. Although the actual differences in power may vary with the test implementation, the light curve characteristics, or the observational methodology, the effort of combining the light curves of tens of foreground stars will be pointless in most cases.
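The standard-error formula above holds for any normal sample and is easy to verify numerically (the sample sizes, seed, and repetition count below are illustrative):

```python
# Numerical check that, for normal samples of size n, the sample
# variance s^2 has standard error sigma^2 * sqrt(2 / (n - 1)).
# The sample sizes and seed are illustrative.
import numpy as np

rng = np.random.default_rng(0)
sigma, reps = 0.01, 200_000

ratios = {}
for n in (5, 10, 40):
    # Empirical spread of s^2 over many samples of size n.
    s2 = rng.normal(0.0, sigma, (reps, n)).var(axis=1, ddof=1)
    predicted = sigma ** 2 * np.sqrt(2.0 / (n - 1))
    ratios[n] = s2.std() / predicted                 # should be ~1
    print(n, round(ratios[n], 3))
```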

Another concern is that measuring errors from a large sample of suitable foreground stars is not always possible. In fact, the number density of stars around quasars that are bright enough for differential photometry studies is small. Photometrically, quasars are blue objects; for the most reliable differential photometry, at least in the bluest optical bands, they should preferably be compared with nearby white dwarfs or blue main-sequence stars to avoid color effects that may arise at different air masses. Besides, quasars are observed at high galactic latitudes and thus, unless the telescope field is large enough, the foreground stars around them are usually scarce. Moreover, many of these stars are old, low luminosity objects in the thick galactic disk or the galactic halo, so many of the brightest foreground stars will be luminous Population II red giants or evolved, short-period variable stars. Although the effect of observing objects with moderate color differences at low air masses is negligible in broadband studies, red giants should be avoided in differential optical photometry of quasars, and short-period variables are of course unsuitable as reference and comparison stars. Therefore, around most quasars there are only a few nearby stars that are non-variable, not too red, and bright enough to be used for comparison purposes.

Another interesting subject is the possibility of detecting very fast variations. It has been noted above that the statistical power of the tests is limited by both the duration of the microvariability event and its amplitude. For the observational parameters considered in these simulations, the effort to detect microvariations lasting less than ≈30 minutes and with amplitudes of less than 0.02 mag would be very inefficient. The relatively large number of spikes reported in microvariability literature (Sagar et al. 1996; de Diego et al. 1998; Gopal-Krishna et al. 2000; Stalin et al. 2004a) suggests that they may be a common phenomenon and worthy of investigation using very fast and accurate photometry with large telescopes.

The one-way ANOVA with 30 minute lags procedure is analogous to the methodology used by de Diego et al. (1998). Simulations show that this procedure is also robust, but the smaller number of points in each data set (50 observations rather than 150), due to the discontinuity in the sampling, reduces the number of detections to about one half of those obtained with continuous one-way ANOVA sampling. In practice, the final result will depend on the actual timescales of the microvariations. Stalin et al. (2004a) report three kinds of variations, namely small gradual variations lasting several hours, time-resolved microvariability on hour-like timescales, and single-point fluctuations (spikes). In such cases, the one-way ANOVA with 30 minute lags will perform almost as well (to within 20%) as continuous sampling in detecting the first two types of microvariations. Moreover, since this technique permits the monitoring of other target fields during the gaps between the end of one group of observations and the beginning of the next, it turns out to be a very powerful exploratory methodology for detecting microvariations. For example, in the one-way ANOVA with 30 minute lags case, each group of five observations lasts 10 minutes (accounting for exposures and read-outs), leaving a 20 minute gap until the first exposure of the next group begins. Depending on the telescope setup, it would be possible to switch to one, maybe two, nearby target fields, increasing the number of monitored objects by a factor of 2 or 3.

The results for the ANOVA and χ2 tests discussed above do not demonstrate that, under all circumstances, the one-way ANOVA performs better than a well-implemented χ2 test. Arguably, results similar to those presented in this paper would be obtained for light curve shapes not considered here, such as sawtooth variations. However, there are many possible test implementations and observational techniques. One requirement of ANOVA tests is homogeneity of variances; that is, the errors should be equally distributed in all the groups. An ANOVA is robust against moderate violations of this requirement, so this is not a serious concern when comparing data obtained with the same telescope and equipment during the same night under ordinary atmospheric conditions. However, combining data from two telescopes with different characteristics may ruin the one-way ANOVA's performance.

4.2. Remarks on the C-statistic Results

Tests of the reliability of the C-statistic methodology have confirmed the severity of the problems pointed out in Section A.4. This methodology has been widely accepted as a standard test, and thus the confusion generated in the research on microvariability in quasars is not surprising. Several researchers have used the C-statistic to study heterogeneous samples of radio quiet and radio loud quasars mixed with blazars. For example, Romero et al. (1999) include several BL Lac objects, which account for nearly 70% of their microvariability detections. In most cases, only blazar-like objects will show variations extreme enough to lie above the proposed C-statistic critical value. A detailed discussion of the relevance of careful sample selection can be found in Ramírez et al. (2009).

In the case of high accuracy photometry, such as the observations presented in Romero et al. (1999, $\sigma _e \sim 0.001\,\rm {mag}$), rare microvariability events might also be detectable with the C-statistic in non-blazar objects. For example, Gupta & Joshi (2005) detect microvariation events when the errors are about 0.005 mag, but usually fail to detect them when the errors are 0.01 mag or larger. Note that this result is consistent with those presented in Section 3, in the sense that for the 0.01 mag accuracy considered in the simulations, the C-statistic is always a factor ≳2 below its critical value.

There is contradictory evidence that some radio quiet quasars (RQQs) may have a weak blazar component (Czerny et al. 2008; Chand et al. 2009) which might explain at least some reported microvariability detections. If this blazar component is present, it might account for flux fluctuations in RQQs above 0.05 mag and the extreme variations above 0.1 mag reported for PG 0026+129 (Jang 2005) and US 995 (de Diego et al. 1998, however, note that large amplitude variations are not considered in the simulations presented in this paper).

When a more appropriate methodology is used, the results converge toward microvariability in RQQs also being a common phenomenon. Thus, Gupta & Joshi (2005) used the F test and found that two, probably three, RQQs out of a small sample of six presented microvariations (i.e., between 30% and 50%), but when they combined their results with those obtained by other groups using C-statistics, the totals are six, possibly 14, variable objects out of 49 (i.e., between 12% and 29%). On the other hand, Stalin et al. (2005) separate the BL Lacs in their sample and, using a combination of the C-statistic and the structure function, find that the duty cycles for radio loud and radio quiet quasars are not statistically different (18% and 22%, respectively). This result is also analogous to the duty cycle (∼25%) calculated by Stalin et al. (2004a) from the data of de Diego et al. (1998).

5. CONCLUSIONS

In this paper, several statistical methodologies and their variants for microvariability studies in quasars have been tested. Among these techniques, the one-way ANOVA and χ2 tests have proven to be the most robust statistics. However, some "reasonable" χ2 test variants, particularly those that estimate errors from the light curves of comparison stars rather than relying on accurate physical models to characterize the data quality, are much less robust. These χ2 variants should be avoided, as they can easily be replaced by other robust tests; at the very least, they should be studied in detail, using simulations to understand their limitations, before being used in statistical analyses. The error estimation problem does not affect the one-way ANOVA statistic, which achieves the highest performance in detecting variations in the simulated differential light curves of quasars presented in this paper. In fact, the results presented here for the one-way ANOVA test can be further improved by trying different group sizes (see Appendix B). Moreover, a discontinuous sampling based on the one-way ANOVA methodology proves to be a powerful exploratory technique for detecting microvariations. The relative loss of power of discontinuous monitoring depends on the sampling frequency, but it is possible to trade this sampling frequency against an increase in the number of monitored objects by switching target fields during the gaps between groups of observations, resulting in a larger total number of variability detections.

The F test for variances has less power than the one-way ANOVA and χ2 tests, but it is still a valid option for detecting flux variations in AGNs. The C-statistic, however, is based on several misconceptions and cannot be considered a true statistical test.

The author thanks Dr. Ignacio Méndez and the anonymous referee for helpful suggestions. This work has been supported by CONACyT grant 50296 and has been partially funded by the Spanish MICINN under the Consolider-Ingenio 2010 Program grant CSD2006-00070: first Science with the GTC (http://www.iac.es/consolider-ingenio-gtc).

APPENDIX A: TEST DESCRIPTIONS

A1. F Test for Sampled Variances

Given two sample variances, $s^2_Q$ for the quasar differential light curve measurements and $s^2_*$ for the comparison star, the number of degrees of freedom of each sample, νQ and ν* respectively, will usually be the same and equal to the number of measurements N minus 1 (ν = N − 1). Then, if there is no difference between the population variances, the sample variance ratio follows the F distribution with νQ and ν* degrees of freedom:

$$F = \frac{s^2_Q}{s^2_*} \qquad (A1)$$

This F statistic is compared with the $F^{(\alpha)}_{\nu _Q,\nu _*}$ critical value, where α is the significance level set for the test, and νQ and ν* are the degrees of freedom of the quasar and comparison star samples. The smaller the α value, the less probable it is that the result is produced by chance. Thus, a value α = 0.1% or 1%, as assumed in this paper, roughly corresponds to a 3σ or a 2.6σ detection, respectively. If F is larger than the critical value, the null hypothesis (no variability) is discarded. In this paper, the F test has been performed at two significance levels (0.1% and 1%) to allow comparison with other statistical procedures. The respective critical values are F(0.001)149,149 = 1.6651 and F(0.01)149,149 = 1.4666.

A well-known problem that may affect the outcomes of the F test for variances is that it is non-robust to non-normality (Lehmann 1986, Section 5.4).
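The test above can be sketched in a few lines. The two light curves here are illustrative simulated data (the standard deviations of 0.012 and 0.010 mag are made-up values, not from the paper):

```python
# Minimal sketch of the F test for two sample variances (Equation (A1)).
# Illustrative data: 150 normal points per light curve, with made-up
# standard deviations of 0.012 (quasar) and 0.010 (star) mag.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
N = 150
quasar = rng.normal(0.0, 0.012, N)   # differential quasar light curve
star = rng.normal(0.0, 0.010, N)     # comparison star light curve

F = quasar.var(ddof=1) / star.var(ddof=1)
crit_1 = stats.f.ppf(1 - 0.01, N - 1, N - 1)     # ~1.4666 (alpha = 1%)
crit_01 = stats.f.ppf(1 - 0.001, N - 1, N - 1)   # ~1.6651 (alpha = 0.1%)
print(F > crit_1, F > crit_01)
```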

A2. χ2 Test for Variance in a Normal Population

Given a number N of observations of a source over a given period of time, these observations are assumed to be drawn from a normally distributed population of possible observations. For this sample, the mean magnitude is $\overline{V}$, and the ith observation yields a magnitude Vi with corresponding standard error σi. Then, the χ2 statistic is expressed by

$$\chi ^2 = \sum _{i=1}^{N} \frac{(V_i - \overline{V})^2}{\sigma _i^2} \qquad (A2)$$

This statistic is compared against a critical value χ2α,ν obtained from the χ2 probability function, where α is the significance level and ν = N − 1 the degrees of freedom. If χ2 > χ2α,ν, the test indicates a larger than expected scattering of the data points (i.e., evidence of variability). Each simulation comprises N = 150 observations, and thus ν = 149. Therefore, the critical value adopted for this test is χ20.001,149 = 208.

Pica & Smith (1983), Heidt & Wagner (1996), Gopal-Krishna et al. (2000), Andruchow et al. (2005, see references therein), and others follow the χ2 test proposed by Penston & Cannon (1970) and Kesteven et al. (1976). This procedure uses a "weighted" average defined by

$$\overline{V} = \frac{\sum _{i=1}^{N} V_i/\sigma _i^2}{\sum _{i=1}^{N} 1/\sigma _i^2} \qquad (A3)$$

Note that σi is the expected error, i.e., the error derived from the photon noise of the source and sky, the CCD read-out, and all possible non-systematic sources of error, some of them probably unknown in practice. As these individual errors are unknown, estimates si are used instead of σi. Thus, errors are often calculated from the usually underestimated value yielded by the IRAF reduction package, multiplied by a correction factor. This error rescaling is necessary because χ2 and other tests (but not ANOVAs) assume that the distribution of the real errors is known.

For example, Bachev et al. (2005) use a factor of 1.3; Stalin et al. (2004a, 2004b, 2005), Gopal-Krishna et al. (2003), and Gupta et al. (2008, in the near infrared) use 1.5; Garcia et al. (1999) 1.73; and Gopal-Krishna et al. (1995) 1.75. In fact, standard IRAF photometric packages do not properly propagate errors during the image processing. In particular, de Diego et al. (1998) have argued that these IRAF packages do not consider the possible spurious enhancement of the S/N after the flat-field correction, due to changes in sensitivity across the detector (for example, the sensitivity near the borders may be lower than at the center), so the error estimated directly from the corrected signal may differ from the original error. Note that this affects only the error estimation and not the measured flux level. Presently, no astronomical reduction software deals adequately with this problem, and implementing a solution may not be a trivial issue. An ANOVA (see Section A.3) overcomes this problem by empirically measuring the dispersion within each group of observations (directly associated with the true S/N), while the group means are preserved by the flat-fielding of the images.

Usually, images of the object and of the comparison stars (one or more) are recorded in the same CCD frames. However, combining several comparison stars to estimate the individual errors σi in Equation (A2) does not solve the problem. It is worth noting here that the reduced χ2ν statistic with ν degrees of freedom is distributed as Fν,∞. Thus, even if σi is obtained using several comparison stars to get a better estimate of the errors, the precision is still limited by the number of these stars, which will probably be less than a dozen. In contrast, the χ2 test requires that each σi be calculated from all the possible measurements of an infinite population of stars. This issue is exemplified in Section 4.1.

Some authors replace the individual errors with a common error σ estimated from the dispersion in the comparison star data. This is the standard procedure when computing the C-statistic (see Section A.4). However, in differential photometry the number of images of the object and of the comparison stars is the same, and therefore the estimates of the standard deviations of their light curves have the same number of degrees of freedom. The χ2 test is then biased because it only takes into account the degrees of freedom of the estimate for the quasar. The correct procedure is to use the F test for two sample variances, as proposed by Howell et al. (1988).

In conclusion, any procedure that cannot rely on theoretically known (not merely estimated) error values compromises the reliability of the χ2 test. Whenever the true errors are unknown, other statistical methodologies should be used.

In the case of the simulations presented in this paper, all the parameters are controlled and the true errors can be effectively computed. Equation (A2) has been used to compute the χ2 statistics. Note that data quality is ensured by the simulation design; thus there is no need to use weighted averages.
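Under such controlled conditions, the χ2 computation of Equation (A2) reduces to a few lines. The sketch below uses an illustrative non-variable light curve with known per-point errors (the seed and values are not the paper's actual code):

```python
# Sketch of the chi^2 test of Equation (A2) with known (not estimated)
# per-point errors, as in the controlled simulations; values illustrative.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
N = 150
sigma_i = np.full(N, 0.01)               # true per-point errors, known
V = rng.normal(0.0, sigma_i)             # a non-variable light curve

chi2 = np.sum(((V - V.mean()) / sigma_i) ** 2)
crit = stats.chi2.ppf(1 - 0.001, N - 1)  # ~208
print(round(chi2, 1), chi2 > crit)
```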

A3. One-way ANOVA Test

ANOVA tests are used to compare the means of a number of samples. Due to the Central Limit Theorem, no matter what the shape of the original distribution is, the sampling distribution of the mean approaches a normal distribution. Therefore, tests to compare the means (t-test and ANOVA) are more robust than their counterparts to compare variances (χ2 and F tests). This offsets the problem of the non-robustness of the F test.

A one-way ANOVA was applied by de Diego et al. (1998) to investigate variability in the light curves of quasars. The methodology consists of measuring k groups of nj = 5 consecutive short (1 minute) exposures, with the groups ideally separated by 20–30 minutes. Larger time lags are common; they affect the time resolution but not the statistical significance of the test.

For the mathematical description of the one-way ANOVA test, if yij represents the ith observation (with i = 1,  2,  ..., nj) on the jth group (with j = 1,  2,  ..., k), the linear model describing every observation is

$$y_{ij} = \overline{y} + g_j + \epsilon _{ij} \qquad (A4)$$

where $\overline{y}$ represents the mean of the whole data set, $g_j = \overline{y}_j - \overline{y}$ the between-groups deviation, and $\epsilon _{ij} = y_{ij} - \overline{y}_j$ the within-groups deviation, also called the residual or measurement error. The size of the data set is $N = \sum _{j=1}^{k} n_j$. If the number of observations nj per group is constant, N = k × nj.

As noted previously, an ANOVA tests the null hypothesis that the means of the different groups are equal. If the test yields a probability smaller than the adopted significance level α, the null hypothesis is rejected and the alternative hypothesis (at least one group mean differs from the others) is accepted, which in this case implies detection of variability in the quasar light curve.

From Equation (A4), the total sample variation can be separated into variations between and within groups

$$\sum _{j=1}^{k}\sum _{i=1}^{n_j} (y_{ij}-\overline{y})^2 = \sum _{j=1}^{k} n_j\,(\overline{y}_j-\overline{y})^2 + \sum _{j=1}^{k}\sum _{i=1}^{n_j} (y_{ij}-\overline{y}_j)^2 \qquad (A5)$$

where the term on the left-hand side describes the total deviations of the data with respect to the mean. The first term on the right-hand side of the equation represents the total variation between groups, and the last term the total errors. Equation (A5) can be shortened to

$$\mathrm{SST} = \mathrm{SSG} + \mathrm{SSR} \qquad (A6)$$

where SST stands for the total sum of squares. Similarly SSG and SSR stand for group sum of squares and residual sum of squares.

Whenever the null hypothesis is true, the k groups of sampled data will be normally and independently distributed, with mean μ and variance σ2. Then, the statistic

$$F = \frac{\mathrm{MSG}}{\mathrm{MSR}} = \frac{\mathrm{SSG}/(k-1)}{\mathrm{SSR}/(N-k)} \qquad (A7)$$

corresponds to the F distribution with ν1 = k − 1 and ν2 = N − k degrees of freedom. The mean squares MSG and MSR estimate the between-group and residual variances, respectively. For a given significance level α, if F exceeds the critical value $F^{(\alpha)}_{\nu _1,\nu _2}$ the null hypothesis will be rejected.

The F critical values employed in this paper for the ANOVA tests are F(0.001)29,120 = 2.2819 and F(0.001)9,40 = 4.0243 for the full and the 30 minute lag sample simulations, respectively.
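The full computation can be sketched for one illustrative simulated non-variable light curve, with the explicit sums of squares checked against the scipy library implementation (the parameters follow the setup described above; the seed and data are illustrative):

```python
# Sketch of the one-way ANOVA of Appendix A3 for one illustrative
# simulated non-variable light curve: k = 30 groups of nj = 5 points.
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
k, nj = 30, 5
y = rng.normal(0.0, 0.01, k * nj).reshape(k, nj)

# Explicit sums of squares, following the decomposition in the text.
grand = y.mean()
ssg = nj * np.sum((y.mean(axis=1) - grand) ** 2)        # between groups
ssr = np.sum((y - y.mean(axis=1, keepdims=True)) ** 2)  # residuals
F = (ssg / (k - 1)) / (ssr / (k * nj - k))

F_lib = stats.f_oneway(*y).statistic                    # cross-check
crit = stats.f.ppf(1 - 0.001, k - 1, k * nj - k)        # ~2.28
print(round(F, 3), F > crit)
```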

A4. C-statistic

The C-statistic was first employed by Jang & Miller (1997) and generalized by Romero et al. (1999). The statistical parameter used is

$$C = \frac{\sigma _T}{\sigma} \qquad (A8)$$

where σT and σ are the standard deviations of the quasar and the comparison star differential light curves, respectively. The adopted variability criterion requires C ⩾ 2.576, which corresponds to a 99% confidence level, or a 1% significance level in the notation of this paper.

There are two pitfalls in this criterion. First, the critical value 2.576 corresponds to the 1% significance level of the normal distribution for a two-sided test, rather than for the appropriate one-sided comparison. A two-sided test would be relevant if the aim were to test whether the dispersion in the quasar light curve is either larger or smaller than the dispersion of the comparison star light curve. Note also that the C-statistic is always positive, and that its expected value when σT = σ is close to 1, rather than 0 as would be the case for a genuinely normal variable.

The other pitfall, probably the most important one, is that two standard deviations cannot be compared using the normal distribution. In fact, standard deviations are unsuitable for most calculations because they are not linear statistical operators (for example, given two independent random variables A and B with standard deviations σA and σB, respectively, the standard deviation of the sum A + B is not σA + σB). That is why variances must be used instead: variances are the second moments of the statistical distributions and therefore linear operators (in the previous example, the variance of the sum A + B is σ2A + σ2B).

If we draw all possible samples of a given size N from a normally distributed population and compute their variances, we obtain a distribution of sample variances that starts at s2 = 0 and has a mean of σ2. Thus, even though the distribution of all possible sample means drawn from a normally distributed population is approximately symmetrical (a consequence of the central limit theorem), the corresponding distribution of sample variances is not symmetrical, but is distributed as χ2 with N − 1 degrees of freedom (a consequence of Cochran's theorem for the sums of squares of linear combinations of independent standard normally distributed random variables):

$$\frac{(N-1)\,s^2}{\sigma ^2} \sim \chi ^2_{N-1} \qquad (A9)$$

One important consequence of this equation is that the shape of the χ2 distribution differs for different sample sizes. Thus, the dispersion of the distribution of sample variances depends on how many degrees of freedom the estimate has. Note that this dependence on the number of degrees of freedom carries over to the standard deviation, although the χ2 statistic itself does not apply in that case.

Using the F statistic it is easy to calculate the "real" significance of the C-statistic criterion. If the critical value of Equation (A8) is 2.576, the corresponding critical F value is 2.5762 = 6.636. For F with (20, 20) and (50, 50) degrees of freedom, the significance levels of the test are 4.4 × 10−5 and 2.1 × 10−10, respectively, i.e., much smaller than the 10−2 value considered by Jang & Miller (1997) and Romero et al. (1999). Note that Romero et al. report that variations in radio loud quasars occur more often than in RQQs, but most sources in their radio loud sample are strongly variable BL Lac objects that can easily reach high significance levels of microvariability detection.
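This conversion is straightforward to reproduce with scipy (a sketch of the calculation just described; the degrees of freedom are the two examples used above):

```python
# Sketch of the "real" significance of the C >= 2.576 criterion when the
# squared ratio of dispersions is treated, correctly, as an F statistic.
from scipy import stats

F_equiv = 2.576 ** 2                          # ~6.636
ps = {}
for dof in (20, 50):
    ps[dof] = stats.f.sf(F_equiv, dof, dof)   # one-sided tail probability
    print(dof, ps[dof])
```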

To summarize this discussion, the critical values for the C-statistic criterion are wrongly established: Equation (A8) does not describe a normally distributed variable, because it is neither properly centered (the expected value of its mean is not zero), nor does the parameter σ in the denominator represent an unbiased measurement of the dispersion of the sample standard deviation. The square of the C-statistic can, however, be used to perform an unbiased F test, provided the degrees of freedom used for the σT and σ estimates are given.

APPENDIX B: IMPROVING THE ANOVA POWER

The power of any statistical test to study variability during a given time interval depends on the number of observations recorded to improve the time resolution and the measurement precision to improve the amplitude resolution. In practice, a compromise is attained to hold reasonable resolutions for both factors before actually observing the target. Would it not be useful to have the possibility of swapping between the time and amplitude resolutions after the observations? Binning data might help, but it is shown in Section 4 that it does not work for a χ² methodology because of the loss of degrees of freedom.

In the ANOVA methodology, if the target has been monitored discontinuously, as in the case of an ANOVA with 30 minute lags, the binning is fixed and consequently the time and amplitude resolutions cannot be exchanged either. But if the monitoring was continuous, it is possible to try out different bin sizes to improve the amplitude resolution at the expense of the time resolution. In this paper, a binning of five observations was used because it was a priori a reasonable value for time resolution and allowed direct comparison with the ANOVA with 30 minute lags strategy. Now, different bins of n observations will be considered to achieve the maximum power of the ANOVA test for the light curve simulations considered in this paper. The bin size n will always be a divisor of the total number N = 150 of observations of a light curve simulation.
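A minimal sketch of this binning strategy (using scipy's one-way ANOVA; the function name and the simulated non-variable light curve are illustrative assumptions, not the paper's pipeline):

```python
import numpy as np
from scipy import stats

def anova_binned(flux, n):
    """One-way ANOVA on a light curve split into consecutive bins of n points.

    Returns the F statistic and its p-value; n must divide len(flux).
    Illustrative sketch only.
    """
    groups = np.reshape(flux, (-1, n))   # k = len(flux) // n groups of size n
    return stats.f_oneway(*groups)

rng = np.random.default_rng(2)
N = 150
# A hypothetical non-variable light curve: pure Gaussian noise with
# sigma = 0.01 mag around a constant magnitude.
flux = rng.normal(18.0, 0.01, size=N)

for n in (5, 15, 75):                    # divisors of N, as in Table 4
    f, p = anova_binned(flux, n)
    print(n, round(f, 2), round(p, 3))
```

Because the monitoring is continuous, the same data array can be re-binned after the fact with any divisor of N, trading time resolution for amplitude resolution exactly as described above.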

Table 4 shows the ANOVA results for the Gaussian peak variations considering different bin size strategies. Column 1 indicates the bin size n, Column 2 shows the number of degrees of freedom ν1 for the k groups (ν1 = k − 1), Column 3 displays the number of degrees of freedom ν2 for the residuals (ν2 = Nk), Column 4 shows the F critical value for a significance level α = 0.1% and ν1 and ν2 degrees of freedom ($F^{(0.001)}_{\nu _1,\nu _2}$), Column 5 indicates the number of Type I errors, Column 6 shows the number of detections of variability, and Column 7 displays the relative power for the ANOVA tests normalized to the results for n = 15.

Table 4. ANOVA Performance for Different Bin Sizes

 n   ν1   ν2   $F^{(0.001)}_{\nu _1,\nu _2}$   Type I   Detections   Relative Power
 2   74    75    2.0657                           2        1178          0.48
 3   49   100    2.0835                           5        1713          0.69
 5   29   120    2.2819                           3        2187          0.88
 6   24   125    2.3907                           6        2268          0.92
10   14   135    2.8199                           4        2444          0.99
15    9   140    3.3374                           1        2478          1.00
25    5   144    4.3617                           3        2285          0.92
30    4   145    4.8882                           7        2143          0.86
50    2   147    7.2428                           5        1996          0.81
75    1   148   11.2727                           4         430          0.17

Note. The Type I and Detections columns refer to the Gaussian peak variation simulations.


A careful inspection of Table 4 shows that, for the simulations presented in this paper, the test power to detect variations increases with the bin size until reaching a maximum of 2478 detections for n = 15, and then decreases for larger bin sizes. The smooth overall detection tendency shows that the differences in power for these tests with different bin sizes are not an artifact. For n = 5 the relative power is almost 0.9, and all the bins between n = 5 and 50 have relative power larger than 0.8. These results show that the ANOVA bin size is not a fine tuning parameter, and that binning the data in groups of five observations is, also from the statistical point of view, a reasonable choice.

The smallest bin sizes improve the time resolution, but the detections are biased toward large amplitude variations. Conversely, the largest bin sizes can detect small variations, but the time resolution is degraded and fast variations are missed. Thus, the tests for n = 5 detect 65 variations with time FWHM up to 13 minutes, while the tests for n = 15 detect only 37. However, the percentages of detections for an amplitude of variation of approximately 0.01 mag are around 10% and 20% for the n = 5 and n = 15 tests, respectively.

In the case of linear variations, or more generally monotonic variations, the results are simpler. As the duration of the variability event is not an issue in this case, the only factor that matters is the amplitude resolution. Therefore, the most powerful tests are those with only two groups, one for each half of the light curve, which corresponds to n = 75, ν1 = 1, and ν2 = 148. The actual number of detections for this test is 4080, instead of the 3043 detections obtained for n = 5.

These results are easily translated to the case of analyzing a single quasar light curve. The researcher does not know in advance the specific characteristics of a possible microvariability event. But by planning the observations to be analyzed with an ANOVA experimental design, the astronomer can enhance the temporal resolution of the observations to detect very fast, large amplitude variations, and still preserve the amplitude resolution needed to detect longer timescale, low amplitude variations. Of course, this also has a cost. The final significance level αf when performing a set of nt statistical tests is different from the significance level αt of the individual tests. A number of adjustments for multiple tests are applied in the statistical literature, the most common being the Bonferroni correction:

$\alpha _f = n_t\,\alpha _t$.    (B1)

If the set of tests performed is the same as in the case presented above for the simulations, nt = 10 and αt = 0.1%, the final significance level will be αf = 1%. Similarly, the significance level of the individual tests could be set to αt = 0.01% to reach a final value of αf = 0.1%. However, the results shown in Table 4 suggest that performing tests for bins of size 5, 15, and 30 may be enough to ensure detection over a wide range of timescales and amplitudes. If only three tests are performed, the significance levels are αf = 0.3% (if αt is set at 0.1%) or αt = 0.03% (if αf is maintained at 0.1%).
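The adjusted levels quoted above reduce to a simple product and can be checked in a few lines (a trivial sketch; the function name is an illustrative choice):

```python
# Bonferroni correction: with n_t tests each run at per-test level alpha_t,
# the family-wise significance level is alpha_f = n_t * alpha_t.
def bonferroni_family_level(alpha_t, n_t):
    return n_t * alpha_t

print(bonferroni_family_level(0.001, 10))    # ten tests at 0.1% -> about 1%
print(bonferroni_family_level(0.001, 3))     # three tests at 0.1% -> about 0.3%
print(bonferroni_family_level(0.0001, 10))   # 0.01% per test keeps alpha_f near 0.1%
```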

Footnotes

  • The power of a statistical test is the probability that the test will reject a false null hypothesis. A robust statistical technique is one that performs well even if its assumptions are somewhat violated by the inherent properties of the sampled population.

  • The Instrument Simulator of the OAN has been developed by Alan Watson and can be accessed through the OAN Web site http://132.248.4.250/resast/simulador.

  • A more accurate calculation would take into account the fraction of Type I errors included in the number of detections. If Type I errors were frequent, mixtures of honest detections and Type I errors should also be present.
