TESTING TESTS ON ACTIVE GALACTIC NUCLEI MICROVARIABILITY

Published 2010 February 11. © 2010 The American Astronomical Society. All rights reserved.
José A. de Diego 2010 AJ 139 1269. DOI 10.1088/0004-6256/139/3/1269


ABSTRACT

Literature on optical and infrared microvariability in active galactic nuclei (AGNs) reflects a diversity of statistical tests and strategies to detect tiny variations in the light curves of these sources. Comparison between results obtained using different methodologies is difficult, and the pros and cons of each statistical method are often badly understood or even ignored. Even worse, improperly tested methodologies are becoming more and more common, and biased results may be misleading with regard to the origin of AGN microvariability. This paper intends to point future research on AGN microvariability toward the use of powerful and well-tested statistical methodologies, providing a reference for choosing the best strategy to obtain unbiased results. Light curve monitoring has been simulated for quasars and for reference and comparison stars. Changes in the quasar light curves include both Gaussian fluctuations and linear variations. Simulated light curves have been analyzed using χ2 tests, F tests for variances, one-way analyses of variance, and C-statistics. Statistical Type I and Type II errors, which indicate the robustness and the power of the tests, have been obtained in each case. One-way analyses of variance and χ2 prove to be powerful and robust estimators of microvariations, while the C-statistic is not a reliable methodology and its use should be avoided.


1. INTRODUCTION

Observational techniques for optical monitoring of active galactic nuclei (AGNs) and photometric studies of variable stars have many similarities, but they differ in the sense that variable stars often show periodic light curve fluctuations, while AGNs do not. Although most variable AGNs (blazars) show microvariability or transient large amplitude variations in a few hours, the unpredictability of these changes in brightness and the difficulty of obtaining confirmation by other observers have been a perennial cause of incredulity and skepticism since the first report by Matthews & Sandage (1963) of a 15 minute microvariability event of amplitude ΔV = 0.044 in 3C 48. In less variable objects, such as quasars, the amplitude of the microvariations is usually lower and close to the limit of detection.

In order to increase confidence in the validity of variability reports, a number of statistical tests have been proposed to prove the reliability of the measurements. A widely used methodology is the χ2 test for variances, which compares the sample variance obtained for a suspected variable target with a theoretically calculated variance for a non-variable object, taking into account all the possible sources of error. For example, Pica & Smith (1983) used a χ2 test and the so-called Q-statistic, based on the difference between the brightest and the dimmest observations, to investigate long-term variability in approximately 6000 photographic observations of 130 AGNs of different types monitored over 13 yr. After the introduction of CCDs in astronomical observations at the end of the 1970s, differential photometry became a very reliable technique for short, time-resolved light curve studies. In differential photometry, the flux of the target object is divided by that of a reference star in the same CCD frame. As the target and reference star images are obtained simultaneously, at the same air mass and under identical instrumental and weather conditions, the flux ratio is considered very reliable. It is common practice to also compare the differential light curves of the target and at least one non-variable field star, denoted the comparison star.
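As a concrete illustration, the χ2 test for variances just described can be sketched in a few lines (a minimal Python sketch, not the paper's code; function names and the injected 0.05 mag trend are illustrative, and the per-point errors are assumed known):

```python
import numpy as np
from scipy import stats

def chi2_variability_test(mags, errors, alpha=0.001):
    """Chi-square test for variances: compares the scatter of a light
    curve with the theoretical per-point measurement errors."""
    mags = np.asarray(mags, dtype=float)
    errors = np.asarray(errors, dtype=float)
    nu = mags.size - 1                          # degrees of freedom
    chi2 = np.sum((mags - mags.mean()) ** 2 / errors ** 2)
    crit = stats.chi2.ppf(1.0 - alpha, nu)      # one-sided critical value
    return chi2, crit, chi2 > crit

# Pure noise: the non-variability hypothesis should almost always be retained.
rng = np.random.default_rng(0)
sigma = 0.01                                    # ~0.01 mag photometric accuracy
quiet = rng.normal(0.0, sigma, 150)
stat_q, crit, var_q = chi2_variability_test(quiet, np.full(150, sigma))

# A 0.05 mag linear trend on top of the same noise is easily detected.
trend = quiet + np.linspace(-0.025, 0.025, 150)
stat_v, _, var_v = chi2_variability_test(trend, np.full(150, sigma))
```

For 149 degrees of freedom and α = 0.001, the critical value is about 208, the figure used later in the paper.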

A remarkable effort to update the statistical techniques used with photoelectric photometers and photographic plates, taking advantage of the better and faster response of CCD detectors, was carried out by Howell et al. (1988), who compared the variances of the object and a comparison star using the F distribution. The F test compares two sample variances, one for the suspected variable source and the other for a non-variable comparison star. These authors also introduced the Γ factor to account for scale differences between the variances of the target and the comparison star due to photon noise. Nevertheless, χ2-based tests have endured, as they are well-known and simple estimators of flux variations.
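The F test for variances can be sketched as follows (a hedged illustration: the Γ scaling factor of Howell et al. is omitted, which is only valid when target and comparison star have similar magnitudes, as in the simulations of this paper):

```python
import numpy as np
from scipy import stats

def f_test_variances(target, comparison, alpha=0.001):
    """F test: ratio of the target's sample variance to the comparison
    star's sample variance, compared with the F-distribution critical
    value.  The Gamma factor of Howell et al. (1988) is omitted here."""
    F = np.var(target, ddof=1) / np.var(comparison, ddof=1)
    nu1, nu2 = len(target) - 1, len(comparison) - 1
    crit = stats.f.ppf(1.0 - alpha, nu1, nu2)
    return F, crit, F > crit

rng = np.random.default_rng(1)
star = rng.normal(0.0, 0.01, 150)       # non-variable comparison star
quasar = rng.normal(0.0, 0.03, 150)     # strongly variable target (3x scatter)
F, crit, variable = f_test_variances(quasar, star)
```

With 149 and 149 degrees of freedom, a variance ratio near 9 far exceeds the α = 0.001 critical value.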

Jang & Miller (1997) and Romero et al. (1999) proposed a new test to analyze microvariability based on the ratio C between the target and comparison star standard deviations, rather than variances. This C-statistic resembles Pica & Smith's (1983) Q-statistic, but is less sensitive to the spurious effect of a single discrepant point. In the last decade, the C-statistic has become very popular, and several researchers have adopted this methodology to study quasars (e.g., Stalin et al. 2004a; Gupta et al. 2008) and blazars (e.g., Xie et al. 2004; Andruchow et al. 2005).
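The C criterion is simple enough to state in code (a minimal sketch under the convention used in the cited papers: C is judged against 2.576 as if it were a standard normal deviate, which the present paper argues against):

```python
import numpy as np

def c_statistic(target, comparison, critical=2.576):
    """C-statistic (Jang & Miller 1997; Romero et al. 1999): the ratio
    of the target's to the comparison star's standard deviation,
    declared variable when it exceeds the nominal 1% normal critical
    value 2.576."""
    C = np.std(target, ddof=1) / np.std(comparison, ddof=1)
    return C, C > critical

rng = np.random.default_rng(2)
star = rng.normal(0.0, 0.01, 150)
quasar = rng.normal(0.0, 0.02, 150)     # twice the scatter of the star
C, variable = c_statistic(quasar, star)
```

Note that even a target with twice the comparison star's scatter (four times the variance) yields C ≈ 2 and goes undetected by this criterion, foreshadowing the conservativeness demonstrated in Section 3.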

A different methodology to analyze differential light curves has been proposed by de Diego et al. (1998). These authors use the one-way analysis of variance (ANOVA) test for studies of quasar microvariability. Using this technique, de Diego et al. were the first to claim that microvariability events are as frequent in radio-quiet as in radio-loud quasars (excluding blazars). ANOVA has also been applied in other studies of AGN variability (Ramírez et al. 2004; Villforth et al. 2009; Ramírez et al. 2009). However, because the novel results reported by de Diego et al. (1998) were unexpected at the time, and because they were difficult to compare with previous studies, the ANOVA test has still not gained full acceptance (Romero et al. 1999; Carini et al. 2007). In this paper, independent runs of data are simulated using the Monte Carlo technique and analyzed using one-way ANOVAs as well as other statistical techniques.

Despite their generic name, which can be misleading, ANOVA tests are designed to detect differences between several sample means, rather than between sample variances. Thus, such tests can be considered a generalization of the Student's t-test for the difference between two sample means. It is worth noting that tests for means can distinguish smaller differences than tests for variances (see Section 4), and thus an ANOVA is expected to improve the detection of microvariability events. ANOVA tests are used, for instance, in experimental design methodologies (e.g., Box et al. 2005), which deliberately impose one or more conditions on different groups of data in order to observe the response. These methodologies differ radically from those common in astronomy, where data are collected and analyzed as they come and external conditions cannot be manipulated. Therefore, after applying the same objective method to this and the other statistical tests, we will be able to effectively establish the reliability and advantages of the ANOVA test.
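Applied to a light curve, the one-way ANOVA amounts to binning consecutive observations into groups and testing whether the group means differ (a sketch under the grouping used later in the paper, five points per group; the drift amplitude below is illustrative):

```python
import numpy as np
from scipy import stats

def anova_variability_test(mags, group_size=5, alpha=0.001):
    """One-way ANOVA on a light curve: bin consecutive observations
    into groups and test whether the group means differ.  With N points
    in k groups, nu1 = k - 1 and nu2 = N - k degrees of freedom."""
    mags = np.asarray(mags, dtype=float)
    k = mags.size // group_size
    groups = mags[:k * group_size].reshape(k, group_size)
    F, p = stats.f_oneway(*groups)      # one group per argument
    return F, p, p < alpha

rng = np.random.default_rng(3)
noise = rng.normal(0.0, 0.01, 150)
_, p_quiet, var_quiet = anova_variability_test(noise)

# A slow 0.05 mag drift shifts the group means and is easily detected.
_, p_drift, var_drift = anova_variability_test(noise + np.linspace(0.0, 0.05, 150))
```

Because the variability signal moves the group means while the within-group scatter estimates the noise, this arrangement is sensitive to exactly the slow, low-amplitude changes that microvariability studies target.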

This paper presents a comparison of the outcomes of different analysis strategies for the detection of low-amplitude microvariations when reliable astronomical differential photometric data are available. Reliable data are understood as data characterized by random errors alone, unaffected by systematic errors. Dealing properly with systematic observational errors would require, first, detecting that the data are indeed affected by such errors; second, understanding their cause; and third, correcting for these systematics by changing and fine-tuning the observational setup, to the extent that this is possible. In this paper, systematic effects are entirely disregarded.

To analyze the data, several procedures based on χ2 tests, F tests for variances, one-way ANOVAs, and C-statistics will be considered in turn. Although these are common statistical methodologies, there exist many different implementation strategies for these tests and, in principle, new alternative tests could be devised. Strictly speaking, some of the inferences derived, or comparisons established between the various tests, may be valid only for the particular cases considered here. Furthermore, we cannot rule out that observational circumstances or setup implementations different from those envisaged in this paper might produce altogether different results.

This paper is organized in the following way. The simulation procedure is described in Section 2, results are shown in Section 3, a discussion and comparisons between the tests are presented in Section 4, and the conclusions are summarized in Section 5. In Appendix A, the interested reader can also find the mathematical description of each test, along with some comments on their use and validity. An interesting implementation of the one-way ANOVA to improve the detection of microvariability in the AGN light curves is discussed in Appendix B.

2. LIGHT CURVE SIMULATIONS

In order to accurately analyze the power and robustness of the different statistical methodologies, light curves were simulated for a quasar, a reference star, and two comparison stars. For convenience, the Instrument Simulator software of the Mexican Observatorio Astronómico Nacional was used to obtain basic data for the simulations. The input arguments were the 1.5 m telescope, the SITe1 detector, 2 × 2 binning, V filter, 1.5 arcsec seeing, 3 arcsec aperture, 60 s exposure time, and magnitudes V = 17 and V = 15 for the quasar and the reference star, respectively. For simplicity, the comparison stars were chosen to have the same magnitude as the quasar. The output includes, among other parameters, the signal-to-noise ratio (S/N) and the object, sky, and detector (total read-out) noises. From these parameters, the object and sky electron counts were obtained assuming Poisson distributions (i.e., photon shot noise). The parameters used in the simulations are shown in Table 1. Column 1 describes the source associated with the parameters; Columns 2 and 3 list the electron counts for the signal and noise, respectively; Column 4 gives the S/N.

Table 1. Parameters Used in the Simulations

Description   Signal (electrons)   Noise (electrons)   S/N
Detector      ...                  103                 ...
Sky           23,287               152                 ...
V = 17        32,292               179                 126
V = 15        203,766              451                 418


For a 1.5 m telescope and an S/N larger than 100, 1 minute exposures are reasonable; longer exposures may saturate bright objects and the stars used as references in differential photometry. Thus, every simulation comprises 150 × 1 minute exposures during 5 hr of monitoring, with a 1 minute lag between exposures to account for CCD read-out. For more realistic simulations, atmospheric attenuation in the V band (AV = 0.2 mag/air mass) has been taken into account. The object is assumed to pass through the zenith in the middle of the run (i.e., 2.5 hr after the monitoring begins). Photon noise was generated for the object, the reference and comparison stars, and for a constant-brightness sky. Finally, white Gaussian noise was added to account for the CCD read-out. The photometric accuracy for the differential light curves of the quasar and the comparison star is about 0.01 mag.
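The noise model just described can be sketched as follows (a simplified Python illustration using the Table 1 counts; the sky is subtracted at its known mean value and the atmospheric attenuation and zenith crossing are omitted, so this is not the paper's full simulation):

```python
import numpy as np

rng = np.random.default_rng(42)

def simulate_series(n_exposures, obj_counts, sky_counts=23_287,
                    read_noise=103):
    """Simplified noise model per 60 s exposure, in electrons: Poisson
    photon noise for object and sky, plus white Gaussian read-out
    noise.  Sky subtraction at the known mean is an idealization."""
    obj = rng.poisson(obj_counts, n_exposures).astype(float)
    sky = rng.poisson(sky_counts, n_exposures).astype(float)
    read = rng.normal(0.0, read_noise, n_exposures)
    return obj + sky + read - sky_counts

quasar = simulate_series(150, 32_292)            # V = 17 target
reference = simulate_series(150, 203_766)        # V = 15 reference star
diff_mag = -2.5 * np.log10(quasar / reference)   # differential light curve
```

With these counts, the scatter of `diff_mag` comes out near the 0.01 mag accuracy quoted above, dominated by the fainter object's photon, sky, and read-out noise.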

Two runs of simulations were performed. One run considered a Gaussian-shaped variation in the quasar flux. The duration of the variations is described by their FWHMs, which take 60 discrete values in the range between 1 and 60 minutes (i.e., in steps of 1 minute). This range of durations is also appropriate to investigate the possibility of detecting spikes (see Sagar et al. 1996; de Diego et al. 1998; Gopal-Krishna et al. 2000; Stalin et al. 2004a). The peak of the variation was allowed to be centered between 120 and 180 minutes from the beginning of the monitoring; this range ensures that the whole variation is contained in the data set. Finally, for each of the 60 values set for the duration of the variation, 100 simulations were performed allowing random amplitudes up to 3% of the quasar flux (i.e., ∼0.03 mag). Therefore, the total number of simulations was 6000 (60 × 100), of 150 photometric points each. Figure 1(a) shows the final simulated raw curves for the quasar and comparison star for a Gaussian variation, while Figure 1(b) shows the same curves after calibration by the observations of the reference star. The standard deviation of the differential curve of the comparison stars shows that the photometry is accurate to 0.009 mag.
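The injected Gaussian variation can be written as a multiplicative flux profile (a hypothetical parameterization; the paper specifies only the amplitude, the peak center, and the FWHM, so the exact functional form below is an assumption):

```python
import numpy as np

def gaussian_peak(t, amplitude, center, fwhm):
    """Fractional flux modulation: 1 at baseline, 1 + amplitude at the
    peak.  t, center, and fwhm are in minutes; sigma is derived from
    the FWHM of a Gaussian."""
    sigma = fwhm / (2.0 * np.sqrt(2.0 * np.log(2.0)))
    return 1.0 + amplitude * np.exp(-0.5 * ((t - center) / sigma) ** 2)

t = np.arange(0.0, 300.0, 2.0)   # 150 exposures over 5 hr, 2 min cadence
flux = gaussian_peak(t, amplitude=0.03, center=150.0, fwhm=60.0)
```

By construction, the profile reaches 1 + amplitude at the center and 1 + amplitude/2 at half the FWHM away from it.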


Figure 1. Simulated raw curves (a) and differential light curves (b) for a V = 17 quasar (✩) and comparison star (+, with an offset of 0.1 mag). At the middle of the simulation, the objects are crossing the zenith, where the atmospheric absorption has been set at 20%. A reference star with V = 15 has been used to calibrate the observations. The amplitude of the variability in this case was 0.0267, and the peak has a FWHM of 60 minutes.


For the other run, the light curves of the objects present a constant (linear) flux variation, as shown in Figure 2. The amplitudes of these variations, measured as the difference between the first and last data points, were also drawn at random up to 3% of the quasar flux. As in the Gaussian peak case, a total of 6000 simulations of 150 data points each were performed. In both cases, Gaussian peak and linear variation, all fluxes and estimated errors were converted into magnitudes for the analysis.


Figure 2. Simulated raw curves (a) and differential light curves (b) for a V = 17 quasar (✩) and comparison star (+, with an offset of 0.1 mag). At the middle of the simulation, the objects are crossing the zenith, where the atmospheric absorption has been set at 20%. A reference star with V = 15 has been used to calibrate the observations. The amplitude of the variability in this case was 0.0113, and it has a steady increase in flux during the monitoring.


3. RESULTS

The results of the statistical analyses of the simulations are summarized in Table 2. Column 1 identifies the test (see below); Column 2 indicates the significance level α; Columns 3–5 show the number of Type I errors found in the analysis of one of the comparison stars, the number of detections of microvariations in the quasar light curves, and the number of Type II errors for the quasars, respectively, for the 6000 simulations for the Gaussian peak variation; Columns 6–8 repeat the same numbers but for linear variations.

Table 2. Results of the Simulations

                                Gaussian Peak Variation         Linear Variation
Test                     α      Type I  Detections  Type II    Type I  Detections  Type II
χ2 test                  0.001    10      1614       4386        2       2278       3722
Idem (std)               0.001   125      1748       4252      129       2385       3615
One-way ANOVA            0.001     3      2187       3813        4       3043       2957
Idem (30 minute lags)    0.001     2      1086       4914       11       1668       4332
F test (α = 0.1%)        0.001     9       837       5163        5       1302       4698
Idem (α = 1%)            0.01     63      1443       4557       61       2050       3950
C-statistic              0.01      0         0       6000        0          0       6000


The first item in Table 2 is the χ2 test described in Section A.2. To perform this test, the true error distributions introduced in the simulations (photon and white Gaussian noises) were used to estimate the error of each individual data point. In the second test, a similar χ2 analysis was performed, but using the standard deviation of the comparison star instead of the individual errors of each measurement (note that in this case Type I errors have been calculated from the other comparison star). The third test is a one-way ANOVA, performed by grouping the data in sets of five individual observations (see the description in Section A.3). The fourth test is also a one-way ANOVA, but with a 30 minute lag between group samplings (and thus it is the only test that uses only one-third of the simulated data). The results of two F tests for variances (see Section A.1 for a description) are reported in the fifth and sixth items, the former at a significance level α = 0.001 (or 0.1%) to compare with the previous tests, and the latter at α = 0.01 (or 1%) to compare with the C-statistic. Finally, the last item is the C-statistic described in Section A.4.

The significance level α is a probability, set a priori by the researcher, that a test yields only by chance a result at least as extreme as the one observed. Note that the ANOVA and χ2 tests have been performed at a significance level of 0.1%, which corresponds to the usual detection limit of 3σ. On the other hand, the significance level of the C-statistic is set at 1%, or a detection limit of 2.576σ, which is the level commonly adopted for this test (e.g., Jang & Miller 1997; Romero et al. 1999). For the F test, both significance levels are considered: α = 0.1% to compare with the χ2 and ANOVA tests, and α = 1% to compare with the C-statistic, as the F test is considered an alternative to that methodology (see Section A.4).

Type I errors are due to the rejection of a true null hypothesis (i.e., rejecting the non-variability hypothesis for a non-variable object). For an unbiased test, the actual number of Type I errors depends only on the number of data sets examined and the significance level of the test. In this case, the number of Type I errors has been obtained by testing the simulated differential curves of a (non-variable) comparison star. For a significance level of, say, 0.1% and 6000 simulations, we should expect around six spurious detections. Considering a binomial distribution for the number of Type I errors, its actual number for a given test is expected to lie between 0 and 13, and in most cases within the 6 ± 3 interval. If the number of Type I errors for the comparison star differs significantly from these expected frequencies, this is evidence that the actual significance of the test departs from its nominal value and that the test is not reliable.
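The expected range of Type I errors quoted above can be checked directly (a quick sketch treating the 6000 tests as independent binomial trials):

```python
from scipy import stats

# 6000 independent tests, each with a 0.1% chance of a spurious detection.
n, alpha = 6000, 0.001
expected = n * alpha                           # about 6 spurious detections
lo, hi = stats.binom.interval(0.99, n, alpha)  # central 99% range of counts
```

The central 99% interval of Binomial(6000, 0.001) spans roughly 1 to 13 spurious detections, consistent with the 0–13 range stated in the text.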

The number of detections in Table 2 indicates how many tests succeeded in detecting variability in the simulated quasar differential light curves, and it is also a measure of the power of each test, which generally varies as a function of the data set characteristics. On the other hand, Type II errors are due to the acceptance of a false null hypothesis (i.e., accepting the non-variability hypothesis for a variable object). Note that the quasar varied in all the simulations. Therefore, as the number of Type I errors is low with respect to the number of detections, the number of Type II errors is (approximately) 6000 minus the number of detections.

From Table 2 we see that the χ2 test performed taking into account the actual error distribution, the one-way ANOVA, the one-way ANOVA with 30 minute lags, and the F test at α = 0.1% all show numbers of Type I errors in accordance with expectations. The F test at α = 1% and the C-statistic have a lower significance level, and the expected number of Type I errors is accordingly 60. The results for the F test at α = 1% agree with this expectation; the C-statistic, however, consistently produced neither Type I errors nor detections, in accordance with the arguments presented in Section A.4. In fact, in these simulations, the C-statistic is always a factor ≳2 below the critical value (2.576) that marks the boundary between rejecting and accepting the null hypothesis.

The χ2 test performed using the standard deviation of the comparison star, instead of the actual errors of each observation in the quasar light curve, shows a number of Type I errors much larger than expected (Table 2). This is a consequence of employing a wrong methodology: as explained in Section A.4, failing to consider the number of degrees of freedom in the estimation of the standard deviation of the comparison star light curve produces a biased statistic. The increase in the number of detections with respect to the previous χ2 test is not significant. As the number of Type I errors is still much smaller than the number of Type II errors, it is possible to estimate the number of spurious detections of variability in the quasar light curves (i.e., detections not related to the actual variation of the light curves). In this case, the number of spurious detections is approximately 125 × 4252/6000 ≈ 89, while a similar calculation yields seven spurious detections for the previous χ2 test. Thus, even if the biased statistic affected only the number of Type I errors and not the number of Type II errors (which are probably affected as well), we would expect a difference of about 80 detections between both tests, accounting for most of the actual difference of 134.

In contrast, for the F tests the number of Type I errors agrees with the expected number, as mentioned above. However, they show a rather small number of detections in comparison with the other tests (see Table 2). Thus, the detections for the F test at α = 0.1% are well below the results for the one-way ANOVA with 30 minute lags, even though the number of data points for the latter test is a factor of 3 smaller than for the F test. In the case of the F test at α = 1%, the significance level is low enough that a larger number of detections would be expected than, for example, for the χ2 test at a significance level of 0.1%; yet the opposite is true. There are two possible explanations for these results: one is the non-robustness of the F test for non-Gaussian distributed data (Lehmann 1986, Section 5.4), and the other is that the test has intrinsically less power than the χ2 and ANOVA tests. Indeed, some of the comparison star differential light curves are not well fitted by a Gaussian profile, as shown in Figure 3, perhaps as a consequence of the underlying Poisson distribution associated with flux measurements. To investigate this possibility, a set of Kolmogorov–Smirnov goodness-of-fit tests for Gaussianity (e.g., Wall & Jenkins 2003, Section 5.3.2) was performed on the distribution of the simulated Gaussian peak data for each of the 6000 light curves of the comparison star. The significance level of the test was fixed at 5%; therefore, around 300 Type I errors were expected if the data were fairly Gaussian distributed. The actual number of differences found by the Kolmogorov–Smirnov test was 296, which rules out the non-Gaussianity explanation. The differences from the χ2 and ANOVA results in detecting light curve variations should therefore be imputed to the comparatively diminished power of the F test.
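A Gaussianity check of this kind can be sketched as follows (one simplification relative to a strict treatment: the mean and standard deviation are estimated from the same data, which makes the plain Kolmogorov–Smirnov test conservative; strictly, Lilliefors' correction would apply):

```python
import numpy as np
from scipy import stats

def ks_gaussianity_test(curve, alpha=0.05):
    """Kolmogorov-Smirnov goodness-of-fit test of a light curve against
    a Gaussian, with mean and standard deviation estimated from the
    data (a conservative simplification)."""
    z = (curve - curve.mean()) / curve.std(ddof=1)
    d, p = stats.kstest(z, "norm")
    return d, p, p < alpha

rng = np.random.default_rng(5)
gaussian_curve = rng.normal(0.0, 0.01, 150)      # well-behaved light curve
skewed_curve = rng.lognormal(0.0, 1.0, 150)      # strongly skewed, for contrast
d_g, p_g, _ = ks_gaussianity_test(gaussian_curve)
d_s, p_s, rejected = ks_gaussianity_test(skewed_curve)
```

For 150 points, a strongly skewed distribution such as the lognormal one above yields a much larger KS distance than Gaussian noise and is rejected at the 5% level.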


Figure 3. Histogram of the differential light curve of a comparison star. In some cases, as in this simulation, which shows a strong skewness, the distribution of the differential light curve departs from a Gaussian.


The number of detections for the one-way ANOVA tests is significantly larger than for the χ2 tests. The 150 observations computed for each run of simulations are divided into 30 groups of five observations each, and the time interval covered by each group spans 10 minutes (recall that exposures and read-outs last 1 minute each). Even if the number of groups is reduced to 10 (i.e., sampling the object every 30 minutes, as in the case of the one-way ANOVA with 30 minute lags), the ANOVA maintains its robustness and, to a large extent, its power to detect microvariations.

4. DISCUSSION

To perform the ANOVA test it is necessary to bin the data. Although the results for the χ2 test using the standard deviation of the comparison star show the risks of introducing what might a priori be considered reasonable changes in the strict test procedure, it is still questionable whether the χ2 test would perform as well as an ANOVA if the data were binned in the same way. However, just as there is a loss of information in going from a list of observations to a histogram, binning applied to the χ2 test produces an immediate loss of test power. The theoretical grounds for rejecting the χ2 binning methodology may be traced back at least to the lossy compression methods considered in Shannon's information theory: binning reduces the signal as well as the noise, and it is mathematically impossible to recover additional information from the binned data. Thus, analyzing the raw signal and the binned data with the same χ2 statistical procedure results in an unavoidable loss of sensitivity of the test.

Things get even worse when the signal is low, close to the 3σ limit, as is frequent in microvariability studies. Besides blurring the signal, binning the data in groups of n observations in a χ2 procedure also reduces the degrees of freedom of the statistical analysis by a factor of n. Combined, both effects are disastrous for detecting tiny variations. For the simulations presented in this paper, the raw-data χ2 procedure was able to detect 1614 variations out of a total of 6000 cases, while a side calculation based on the binned procedure for n = 5 yielded only 284 detections. In comparison, the ANOVA produced 2187 detections. Although the ANOVA also groups the data, producing a loss of signal, it redistributes the degrees of freedom between group estimates and error estimates within groups, rather than canceling them. Thus, if N is the number of observations and k the number of groups (k = N/n), the degrees of freedom are ν1 = k − 1 for the groups and ν2 = N − k for the errors, and therefore ν1 + ν2 = N − 1, which corresponds to the degrees of freedom of the original data set. Besides, as stated above, an ANOVA tests (group) means, while χ2 tests variances, and it is well known that tests for means are more powerful than tests for variances, among other reasons because the actual values of means are more tightly constrained. This is a result of the squaring of each term, which effectively weights outliers and large errors more heavily than small ones. For example, a few simulations generating samples of size 100 drawn from a normally distributed population with mean μ = 0 and standard deviation σ = 1 show that the ratio between the respective 95% confidence intervals is C.I.(σ2)/C.I.(μ) ≈ 1.5.
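The quoted ratio of interval widths can be reproduced in a few lines (a Monte Carlo sketch of the small simulation mentioned in the text; the trial count is an arbitrary choice large enough to stabilize the percentiles):

```python
import numpy as np

rng = np.random.default_rng(7)

# Draw many samples of size 100 from N(0, 1) and compare the empirical
# 95% interval widths of the sample mean and the sample variance.
n, trials = 100, 20_000
samples = rng.normal(0.0, 1.0, size=(trials, n))
means = samples.mean(axis=1)
variances = samples.var(axis=1, ddof=1)

def width95(x):
    """Width of the central 95% interval of an empirical distribution."""
    lo, hi = np.percentile(x, [2.5, 97.5])
    return hi - lo

ratio = width95(variances) / width95(means)
```

The sample mean is pinned down to a width of about 2 × 1.96/√100 ≈ 0.39, while the sample variance (a scaled χ2 with 99 degrees of freedom) spans roughly 0.56, giving a ratio in the neighborhood of 1.5.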

In the case of the ANOVA and the ANOVA with 30 minute lags, the time interval within each group has been chosen not to exceed 10 minutes. This limit is imposed by previous experience showing that optical microvariations on timescales of less than 20 minutes in quasars are rare, hard to detect, or both. On the other hand, when the monitoring is performed in several optical bands that will later be compared, it should satisfy simultaneity criteria of variability between the bands involved. Thus, Villata et al. (2004) have considered time intervals of around 10 minutes for photometric sequences between bands V and I in blazars, while Papadakis et al. (2004) and Hu et al. (2006) have calculated lags of around 20 and 11 minutes between bands B and I for AGNs and blazars, respectively. The difficulty of detecting even large microvariability events lasting less than 10 minutes is also evident in the results of these simulations, no matter which statistical methodology is used (see, for example, Figure 4). Thus, 10 minutes is a safe time interval to bin data sharing similar flux characteristics, statistically indistinguishable from the noise, and will be appropriate for many studies of quasar microvariability. However, for the ANOVA continuous monitoring strategy, it is still possible to improve the test power by trying out different bin sizes after the observations have been made (see Appendix B).


Figure 4. Detection frequencies against temporal FWHM of the variation for the Gaussian peaks: χ2 test (•), one-way ANOVA (△), and one-way ANOVA with 30 minute lags (□).


The choice of 1 minute exposures was justified in Section 2. Thus, a group of about five such exposures, lasting less than 10 minutes including CCD read-out, is a reasonable methodological choice for the ANOVA with 30 minute lags observational strategy, for which the bin size of the groups is set before the observations and cannot be changed afterward. For a larger telescope, the exposures might be shorter, but the total number of observations in each group is still limited by the read-out dead time. Besides, the gain in the power of the test would be relatively small for more than five exposures per group.

In the rest of this section, only the χ2 test, the one-way ANOVA with and without time lags between group observations, and the C-statistic will be discussed. The results of the simulations presented in Section 3 for the χ2 test using the standard deviation of the comparison star, and for the F test for sample variances, are descriptive enough to show the risks of relying on these procedures.

4.1. Test Comparisons

It is expected that the probability of detecting a change in the brightness of a source depends on both the amplitude of the variation and its duration. The range of amplitudes is the same for the Gaussian peak and the linear simulations, but the duration of the Gaussian peak variations is restricted to an FWHM of 60 minutes or less, while the linear variations spread over the 5 hr of monitoring. Therefore, it is not surprising that linear variations are detected more easily than the shorter Gaussian peaks considered in the simulations (see Table 2). The effect of the duration of the Gaussian peak variations on the number of detections is shown in Figure 4 for the χ2 test, the one-way ANOVA, and the one-way ANOVA with 30 minute lags.

Figure 5(a) compares the distributions of the χ2 and the ANOVA F statistics for the non-variable comparison star. This plot yields a circular-shaped cloud of points (allowing a certain amount of distortion because of the use of different scales), as expected for unbiased statistics. The points are centered at coordinates (1, 149), as the expected values of the ANOVA F and χ2 statistics with 149 degrees of freedom when the null hypothesis is true are 1 and 149, respectively. Figure 5(b) shows a similar plot for the quasar statistics with Gaussian peak variations. The clear correlation between both statistics indicates that they are measuring the same variable phenomenon. Note that the maximum range of the χ2 statistic is about two times its critical value, while the ANOVA F statistic spreads out to approximately five or six times its critical value. This is a consequence of the different powers of the tests under the conditions of these simulations. The same arguments apply in the case of the linear variations. In this case, the difference in power between the ANOVA and the χ2 tests can be illustrated by the results for the quasar light curve shown in Figure 2, where the ANOVA F statistic is larger than the critical value set to detect variations (F = 2.8 > 2.3 = F^(0.001)_(29,120)), while χ2 lies below its critical value (χ2 = 164 < 208 = χ2_(0.001,149)).


Figure 5. Comparison between the χ2 and the one-way ANOVA F statistics for (a) a reference star and (b) a quasar with Gaussian variations. Thick lines show the critical values for each statistic. The white cross in (a) marks the expected value of both statistics for a non-variable reference star. The distribution of data points in (b) corresponds to 3682 cases where neither χ2 nor ANOVA detects variations (lower left corner), 131 cases where χ2 detects variations and ANOVA does not (upper left corner), 704 cases where ANOVA detects variations and χ2 does not (lower right corner), and 1483 cases where both tests detect variations (upper right corner).


Figure 6 shows the distributions of (a) the χ2 statistic, (b) the ANOVA F, (c) the ANOVA F with 30 minute lags between group observations, and (d) the C-statistic, against absolute amplitude of the variations for temporal Gaussian peak variations, along with the critical values for each test (indicated by thick horizontal lines). It is clear that for (a), (b), and (c) the number of detections (points above the lines indicating critical values) increases with the variability amplitude. Remember that the ANOVA, the χ2, and the C-statistic were calculated using the same data set, while for the one-way ANOVA with 30 minute lags the data set is resampled to one-third of the simulated observations. From Figure 6(d), it is obvious that the C-statistic is about a factor of 2 below any detection even though the nominal significance level for this test (α = 1%) is less tight than for the ANOVA and χ2 tests (α = 0.1%).


Figure 6. Distribution of statistics against absolute amplitudes of the Gaussian peak variations and critical values; (a) points indicate the values of the χ2 statistic with α1 = 0.001, ν = 149, and the solid line the critical value for the one-sided test $\chi ^2_{\nu,\alpha _1} = 208$; (b) idem for the ANOVA statistic with α1 = 0.001, ν1 = 29, ν2 = 120, and $F_{(\nu _1,\nu _2)}^{(\alpha _1)} = 2.28$; (c) idem for the ANOVA statistic with 30 minute lags, α1 = 0.001, ν1 = 9, ν2 = 40, and $F_{(\nu _1,\nu _2)}^{(\alpha _1)} = 4.02$; (d) idem for the C-statistic with α2 = 0.01 and its critical value for the two-sided normal test $z_{\alpha _2} = 2.576$.


The results for the linear variations shown in Figure 7 are similar to those for the Gaussian peak variations, but the statistics are less scattered because there is no dependence on the duration of the variation. For a significance level of α = 0.1% and the conditions of the simulations, all the linear variations with amplitudes ≳0.027 are detected by the χ2 test, while the one-way ANOVA detects all the variations with amplitudes ≳0.022. In comparison, the one-way ANOVA with 30 minute lags detects around 90% of the variations with amplitudes near 0.03. As in the Gaussian peak case, the C-statistic is again a factor of 2 below any detection at the significance level of α = 1%.
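This power difference can be illustrated with a small Monte Carlo sketch. The parameters mirror the simulation setup described above (150 points with σ = 0.01 mag, analyzed in 30 groups of five), but the code, the random seed, and the choice of a 0.03 mag linear ramp as the "variable" signal are illustrative, not the paper's actual implementation:

```python
# Monte Carlo sketch comparing chi^2 and one-way ANOVA power.
# Illustrative parameters: 150 points, sigma = 0.01 mag, 30 groups of 5,
# and a hypothetical 0.03 mag linear ramp as the "variable" signal.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
N, sigma, k, nj = 150, 0.01, 30, 5
chi2_crit = stats.chi2.ppf(1 - 0.001, N - 1)      # ~208
f_crit = stats.f.ppf(1 - 0.001, k - 1, N - k)     # ~2.28

def anova_f(y):
    """One-way ANOVA F statistic over k consecutive groups of nj points."""
    return stats.f_oneway(*y.reshape(k, nj)).statistic

def chi2_stat(y):
    """Chi^2 statistic with the true sigma known (Equation (A2))."""
    return np.sum((y - y.mean()) ** 2) / sigma ** 2

trials = 500
ramp = np.linspace(0.0, 0.03, N)                  # linear variation
flat_f_alarms = f_hits = chi2_hits = 0
for _ in range(trials):
    # Type I errors on a flat (non-variable) light curve.
    flat_f_alarms += anova_f(rng.normal(0.0, sigma, N)) > f_crit
    # Detections on a light curve with a linear variation.
    y = ramp + rng.normal(0.0, sigma, N)
    f_hits += anova_f(y) > f_crit
    chi2_hits += chi2_stat(y) > chi2_crit

print(flat_f_alarms, f_hits, chi2_hits)
```

With these settings, false alarms for the flat light curve are rare, while both tests detect the 0.03 mag ramp in the large majority of trials, the ANOVA somewhat more often than χ2.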


Figure 7. Distribution of statistics against absolute amplitudes of the linear variations and critical values; (a) points indicate the values of the χ2 statistic with α1 = 0.001, ν = 149, and the solid line the critical value for the one-sided test $\chi ^2_{\nu,\alpha _1} = 208$; (b) idem for the ANOVA statistic with α1 = 0.001, ν1 = 29, ν2 = 120, and $F_{(\nu _1,\nu _2)}^{(\alpha _1)} = 2.28$; (c) idem for the ANOVA statistic with 30 minute lags, α1 = 0.001, ν1 = 9, ν2 = 40, and $F_{(\nu _1,\nu _2)}^{(\alpha _1)} = 4.02$; (d) idem for the C-statistic with α2 = 0.01 and its critical value for the two-sided normal test $z_{\alpha _2} = 2.576$.


Percentages of detections per amplitude range are shown for the Gaussian peak variations in Figure 8, and for the linear variations in Figure 9. Percentages of Type II errors per amplitude range can easily be derived from these figures by subtracting the detection percentages from 100. In the case of the linear variations considered in these simulations, the dependence of the number of detections on the amplitude of the variations is straightforward. In the case of the Gaussian peak, however, the double dependence on the amplitude and the temporal length of the variation makes the relationship less evident. After smoothing, this double dependence can be shown as a contour plot of the probability of detection as a function of the amplitude and duration of the microvariability event, as in Figure 10. Note that the probability of detection increases with both the amplitude and the duration of the variation.


Figure 8. Percentage of detections per amplitude of Gaussian peak variation. (a) χ2 statistic; (b) ANOVA; (c) ANOVA with 30 min lags.


Figure 9. Percentage of detections per amplitude of linear variation: (a) χ2 statistic; (b) ANOVA; (c) ANOVA with 30 minute lags.


Figure 10. Example of the double dependence of the Gaussian peak detections on the amplitude and the temporal length of the variation. The smoothed χ2 statistic of the simulated data is shown as a contour plot of the probability of detection, with the FWHM of the variations in minutes on the x-axis and the amplitude in magnitudes on the y-axis. Plots for the one-way ANOVA statistics would look similar.


Taken together, these results show that both the χ2 test and the one-way ANOVA are robust methodologies for studying variability in the light curves of quasars. However, the χ2 test relies on an accurate theoretical estimate of the error of each single data point. As noted in Section A.2, this is not usually available, and a number of ad hoc correction factors have been used to compensate for the notorious lack of agreement between the IRAF estimated photometric errors and the actual dispersion of the data (e.g., Gopal-Krishna et al. 2003; Stalin et al. 2004a; Bachev et al. 2005).

The large increase in the number of Type I errors obtained when a sample standard deviation drawn from the comparison star is used instead of the actual error (Table 2), together with the arguments given above about the loss of power when the data are binned, warns against any simple workaround for this problem. The ANOVA tests, in contrast, sidestep the issue because they estimate the errors internally. For the χ2 test, it is reasonable to suppose that an accurate estimate of the actual error of each data point could be achieved by measuring the data dispersion for a large number of foreground stars. However, as already noted, dispersion measurements such as variances and standard deviations are poorly constrained. Let us investigate the possible gain in accuracy from using a set of comparison stars.

In fact, when the true photometric error of the data is unknown and is estimated from one or more comparison stars, the actual significance level of the χ2 test is easy to calculate, because the resulting statistic follows an F distribution. In our case, each light curve comprises 150 observations (ν = 149 degrees of freedom) and the nominal significance level of the test is α = 0.1%, which corresponds to a critical value χ20.001,149 = 208. Dividing this value by ν gives the reduced χ2 critical value χ2r = 1.40, and the actual significance level α follows from solving $F^{(\alpha)}_{149,149 N_*} = 1.40$ for the F statistic with ν1 = 149 and ν2 = 149 × N* degrees of freedom. Some of these calculations are presented in Table 3: Column 1 indicates the number of stars N* used to estimate the error, Column 2 the actual significance level α of the test, and Column 3 the relative error in the variance estimate.

Table 3. Effect of the Number of Comparison Stars on the Test Accuracy

Number of Stars      α       Err(s2)/σ2
       1           0.0212      1.4142
       5           0.0029      0.6325
      10           0.0018      0.4472
      20           0.0014      0.3162
      30           0.0012      0.2582
      40           0.0012      0.2236


Note the value for a single comparison star (N* = 1): the actual significance level is 2.12% rather than 0.1%; therefore, for 6000 light curve simulations this test should produce 0.0212 × 6000 ≈ 127 Type I errors, in agreement with the result of the simulations for the χ2 test performed using the standard deviation of the comparison star (see the second item in Table 2). Table 3 also shows that even when the data for 10 stars are combined the actual significance level is still almost twice the nominal value, and that 30 or 40 stars must be combined to obtain a significance level accurate to within 20%.
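The α column of Table 3 can be reproduced numerically with a few lines of scipy (a sketch of the calculation described above, under the same assumptions: ν = 149 and a nominal α = 0.1%):

```python
# Sketch of the Table 3 calculation: the actual significance level of a
# chi^2 test whose errors are estimated from N_* comparison stars,
# treating the reduced statistic as F(149, 149 * N_*).
from scipy import stats

nu = 149
chi2_red_crit = stats.chi2.ppf(1 - 0.001, nu) / nu   # reduced chi^2, ~1.40

alphas = {}
for n_stars in (1, 5, 10, 20, 30, 40):
    alphas[n_stars] = stats.f.sf(chi2_red_crit, nu, nu * n_stars)
    print(n_stars, round(alphas[n_stars], 4))
```

As N* grows, α approaches the nominal 0.001 from above, but slowly, which is the point made in the text.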

Sample variances s2 are χ2 distributed random variables (Equation (A9)), and therefore it can easily be shown that, in our case, the sample variance obtained for each photometric data point from measuring N* stars also has an associated variance (i.e., a variance of the variance) given by

$$\mathrm{Var}[s^2] = \frac{2\,\sigma ^4}{N_*-1} \qquad (1)$$

where σ is the true photometric error and N* − 1 the degrees of freedom. Thus, the error of the measured variance is given by the square root of the previous equation: $\mathrm{Err}[s^2] = \sigma ^2 \sqrt{2/(N_*-1)}$. For observations analogous to the simulations reported in this paper, σ ≈ 0.01 mag. The relative error of the variance estimate (Err[s2]/σ2) obtained from N* stars is shown in the third column of Table 3. Even for N* = 40 the variance estimate has an accuracy worse than 20%, which implies that the photometric error estimate is uncertain by approximately 50% (s = 0.010 ± 0.005). Moreover, even if it were possible to observe tens of stars simultaneously, thereby approaching the controlled conditions of the simulations where the errors are completely known, the ANOVA would still outperform the χ2 test in finding tiny variations in the simulated quasar differential light curves. Although the actual differences in power may vary with the test implementation, the light curve characteristics, or the observational methodology, the effort of combining the light curves of tens of foreground stars will be pointless in most cases.
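The standard-error formula above holds for any normal sample and is easy to verify numerically (the sample sizes, seed, and repetition count below are illustrative):

```python
# Numerical check that, for normal samples of size n, the sample
# variance s^2 has standard error sigma^2 * sqrt(2 / (n - 1)).
# The sample sizes and seed are illustrative.
import numpy as np

rng = np.random.default_rng(0)
sigma, reps = 0.01, 200_000

ratios = {}
for n in (5, 10, 40):
    # Empirical spread of s^2 over many samples of size n.
    s2 = rng.normal(0.0, sigma, (reps, n)).var(axis=1, ddof=1)
    predicted = sigma ** 2 * np.sqrt(2.0 / (n - 1))
    ratios[n] = s2.std() / predicted                 # should be ~1
    print(n, round(ratios[n], 3))
```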

Another concern is that measuring errors from a large sample of suitable foreground stars is not always possible. In fact, the number density of stars around quasars that are bright enough for differential photometry studies is small. Photometrically, quasars are blue objects; for the most reliable differential photometry, at least in the bluest optical bands, they should preferably be compared with nearby white dwarfs or blue main-sequence stars to avoid color effects that may arise at different air masses. Besides, quasars are observed at high galactic latitudes and thus, unless the telescope field is large enough, the foreground stars around them are usually scarce. Moreover, many of these stars are old, low luminosity objects in the thick galactic disk or the galactic halo, so many of the brightest foreground stars will be luminous Population II red giants or evolved, short-period variable stars. Although the effect of observing objects with moderate color differences at low air masses is negligible in broadband studies, red giants should be avoided in differential optical photometry of quasars, and short-period variables are of course unsuitable as reference and comparison stars. Therefore, around most quasars there are only a few nearby stars that are non-variable, not too red, and bright enough to be used for comparison purposes.

Another interesting subject is the possibility of detecting very fast variations. It has been noted above that the statistical power of the tests is limited by both the duration of the microvariability event and its amplitude. For the observational parameters considered in these simulations, the effort to detect microvariations lasting less than ≈30 minutes and with amplitudes of less than 0.02 mag would be very inefficient. The relatively large number of spikes reported in microvariability literature (Sagar et al. 1996; de Diego et al. 1998; Gopal-Krishna et al. 2000; Stalin et al. 2004a) suggests that they may be a common phenomenon and worthy of investigation using very fast and accurate photometry with large telescopes.

The one-way ANOVA with 30 minute lags procedure is analogous to the methodology used by de Diego et al. (1998). Simulations show that this procedure is also robust, but the smaller number of points in each data set (50 observations rather than 150), due to the discontinuity in the sampling, reduces the number of detections to about one half of those obtained with continuous one-way ANOVA sampling. In practice, the final result will depend on the actual timescales of the microvariations. Stalin et al. (2004a) report three kinds of variations, namely small gradual variations lasting several hours, time-resolved microvariability on hour-like timescales, and single-point fluctuations (spikes). In such cases, the one-way ANOVA with 30 minute lags will perform almost as well (to within 20%) as continuous sampling in detecting the first two types of microvariations. Moreover, since this technique permits the monitoring of other target fields during the gaps between the end of one group of observations and the beginning of the next, it turns out to be a very powerful exploratory methodology for detecting microvariations. For example, in the one-way ANOVA with 30 minute lags case, each group of five observations lasts 10 minutes (accounting for exposures and read-outs), leaving a 20 minute gap until the first exposure of the next group begins. Depending on the telescope setup, it would be possible to switch to one, maybe two, nearby target fields, increasing the number of monitored objects by a factor of 2 or 3.

The results for the ANOVA and χ2 tests discussed above do not demonstrate that, under all circumstances, the one-way ANOVA performs better than a well-implemented χ2 test. Arguably, results similar to those presented in this paper would be obtained for light curve shapes not considered here, such as sawtooth variations. However, there are many possible test implementations and observational techniques. One requirement of ANOVA tests is homogeneity of variances; that is, the errors should be equally distributed in all the groups. An ANOVA is robust against moderate violations of this requirement, so this is not a serious concern when comparing data obtained with the same telescope and equipment during the same night under ordinary atmospheric conditions. However, combining data from two telescopes with different characteristics may ruin the one-way ANOVA's performance.

4.2. Remarks on the C-statistic Results

Tests of the reliability of the C-statistic methodology have confirmed the severity of the problems pointed out in Section A.4. This methodology has been widely accepted as a standard test, and thus the confusion generated in the research on microvariability in quasars is not surprising. Several researchers have used the C-statistic to study heterogeneous samples of radio quiet and radio loud quasars mixed with blazars. For example, Romero et al. (1999) include several BL Lac objects, which account for nearly 70% of their microvariability detections. In most cases, only blazar-like objects will show variations extreme enough to lie above the proposed C-statistic critical value. A detailed discussion of the relevance of careful sample selection can be found in Ramírez et al. (2009).

In the case of high accuracy photometry, such as the observations presented in Romero et al. (1999, $\sigma _e \sim 0.001\,\rm {mag}$), rare microvariability events might also be detectable with the C-statistic in non-blazar objects. For example, Gupta & Joshi (2005) detect microvariation events when the errors are about 0.005 mag, but usually fail to detect them when the errors are 0.01 mag or larger. Note that this result is consistent with those presented in Section 3, in the sense that for the 0.01 mag accuracy considered in the simulations, the C-statistic is always a factor ≳2 below its critical value.

There is contradictory evidence that some radio quiet quasars (RQQs) may have a weak blazar component (Czerny et al. 2008; Chand et al. 2009) which might explain at least some reported microvariability detections. If this blazar component is present, it might account for flux fluctuations in RQQs above 0.05 mag and the extreme variations above 0.1 mag reported for PG 0026+129 (Jang 2005) and US 995 (de Diego et al. 1998, however, note that large amplitude variations are not considered in the simulations presented in this paper).

When a more appropriate methodology is used, the results converge toward microvariability in RQQs also being a common phenomenon. Thus, Gupta & Joshi (2005) used the F test and found that two, probably three, RQQs out of a small sample of six presented microvariations (i.e., between 30% and 50%), but when they combined their results with those obtained by other groups using C-statistics, the totals are six, possibly 14, variable objects out of 49 (i.e., between 12% and 29%). On the other hand, Stalin et al. (2005) separate the BL Lacs in their sample and, using a combination of the C-statistic and the structure function, find that the duty cycles for radio loud and radio quiet quasars are not statistically different (18% and 22%, respectively). This result is also analogous to the duty cycle (∼25%) calculated by Stalin et al. (2004a) from the data of de Diego et al. (1998).

5. CONCLUSIONS

In this paper, several statistical methodologies and their variants for microvariability studies in quasars have been tested. Among these techniques, the one-way ANOVA and χ2 tests have proven to be the most robust statistics. However, some "reasonable" χ2 test variants, particularly those that estimate errors from the light curves of comparison stars rather than relying on accurate physical models to characterize the data quality, are much less robust. These χ2 variants should be avoided, as they can easily be replaced by other robust tests; at the very least, they should be studied in detail, using simulations to understand their limitations, before being used in statistical analyses. The error estimation problem does not affect the one-way ANOVA statistic, which achieves the highest performance in detecting variations in the simulated differential light curves of quasars presented in this paper. In fact, the results presented here for the one-way ANOVA test can be further improved by trying different group sizes (see Appendix B). Moreover, a discontinuous sampling based on the one-way ANOVA methodology proves to be a powerful exploratory technique for detecting microvariations. The relative loss of power of discontinuous monitoring depends on the sampling frequency, but it is possible to trade this sampling frequency against an increase in the number of monitored objects by switching target fields during the gaps between groups of observations, resulting in a larger total number of variability detections.

The F test for variances has less power than the one-way ANOVA and χ2 tests, but it is still a valid option for detecting flux variations in AGNs. The C-statistic, however, is based on several misconceptions and cannot be considered a true statistical test.

The author thanks Dr. Ignacio Méndez and the anonymous referee for helpful suggestions. This work has been supported by CONACyT grant 50296 and has been partially funded by the Spanish MICINN under the Consolider-Ingenio 2010 Program grant CSD2006-00070: first Science with the GTC (http://www.iac.es/consolider-ingenio-gtc).

APPENDIX A: TEST DESCRIPTIONS

A1. F Test for Sampled Variances

Given two sample variances, $s^2_Q$ for the quasar differential light curve measurements and $s^2_*$ for the comparison star, the number of degrees of freedom of each sample, νQ and ν* respectively, will usually be the same and equal to the number of measurements N minus 1 (ν = N − 1). Then, if there is no difference between the population variances, the sample variance ratio follows the F distribution with νQ and ν* degrees of freedom:

$$F = \frac{s^2_Q}{s^2_*} \qquad (A1)$$

This F statistic is compared with the $F^{(\alpha)}_{\nu _Q,\nu _*}$ critical value, where α is the significance level set for the test, and νQ and ν* are the degrees of freedom of the quasar and comparison star samples. The smaller the α value, the less probable it is that the result is produced by chance. Thus, a value α = 0.1% or 1%, as assumed in this paper, roughly corresponds to a 3σ or a 2.6σ detection, respectively. If F is larger than the critical value, the null hypothesis (no variability) is discarded. In this paper, the F test has been performed at two significance levels (0.1% and 1%) to allow comparison with other statistical procedures. The respective critical values are F(0.001)149,149 = 1.6651 and F(0.01)149,149 = 1.4666.

A well-known problem that may affect the outcomes of the F test for variances is that it is non-robust to non-normality (Lehmann 1986, Section 5.4).
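The test above can be sketched in a few lines. The two light curves here are illustrative simulated data (the standard deviations of 0.012 and 0.010 mag are made-up values, not from the paper):

```python
# Minimal sketch of the F test for two sample variances (Equation (A1)).
# Illustrative data: 150 normal points per light curve, with made-up
# standard deviations of 0.012 (quasar) and 0.010 (star) mag.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
N = 150
quasar = rng.normal(0.0, 0.012, N)   # differential quasar light curve
star = rng.normal(0.0, 0.010, N)     # comparison star light curve

F = quasar.var(ddof=1) / star.var(ddof=1)
crit_1 = stats.f.ppf(1 - 0.01, N - 1, N - 1)     # ~1.4666 (alpha = 1%)
crit_01 = stats.f.ppf(1 - 0.001, N - 1, N - 1)   # ~1.6651 (alpha = 0.1%)
print(F > crit_1, F > crit_01)
```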

A2. χ2 Test for Variance in a Normal Population

Given a number N of observations of a source over a given period of time, these observations are assumed to be drawn from a normally distributed population of possible observations. For this sample, the mean magnitude is $\overline{V}$, and the ith observation yields a magnitude Vi with corresponding standard error σi. Then, the χ2 statistic is expressed by

$$\chi ^2 = \sum _{i=1}^{N} \frac{(V_i - \overline{V})^2}{\sigma _i^2} \qquad (A2)$$

This statistic is compared against a critical value χ2α,ν obtained from the χ2 probability function, where α is the significance level and ν = N − 1 the degrees of freedom. If χ2 > χ2α,ν, the test indicates a larger than expected scattering of the data points (i.e., evidence of variability). Each simulation comprises N = 150 observations, and thus ν = 149. Therefore, the critical value adopted for this test is χ20.001,149 = 208.

Pica & Smith (1983), Heidt & Wagner (1996), Gopal-Krishna et al. (2000), Andruchow et al. (2005, see references therein), and others follow the χ2 test proposed by Penston & Cannon (1970) and Kesteven et al. (1976). This procedure uses a "weighted" average defined by

$$\overline{V} = \frac{\sum _{i=1}^{N} V_i/\sigma _i^2}{\sum _{i=1}^{N} 1/\sigma _i^2} \qquad (A3)$$

Note that σi is the expected error, i.e., the error derived from the photon noise of the source and sky, the CCD read-out, and all possible non-systematic sources of error, some of them probably unknown in practice. As these individual errors are unknown, estimates si are used instead of σi. Thus, errors are often calculated from the usually underestimated value yielded by the IRAF reduction package, multiplied by a correction factor. This error rescaling is necessary because χ2 and other tests (but not ANOVAs) assume that the distribution of the real errors is known.

For example, Bachev et al. (2005) use a factor of 1.3; Stalin et al. (2004a, 2004b, 2005), Gopal-Krishna et al. (2003), and Gupta et al. (2008, in the near infrared) use 1.5; Garcia et al. (1999) 1.73; and Gopal-Krishna et al. (1995) 1.75. In fact, standard IRAF photometric packages do not properly propagate errors during the image processing. In particular, de Diego et al. (1998) have argued that these IRAF packages do not consider the possible spurious enhancement of the S/N after the flat-field correction, due to changes in sensitivity across the detector (for example, the sensitivity near the borders may be lower than at the center), so the error estimated directly from the corrected signal may differ from the original error. Note that this affects only the error estimation and not the measured flux level. Presently, no astronomical reduction software deals adequately with this problem, and implementing a solution may not be a trivial issue. An ANOVA (see Section A.3) overcomes this problem by empirically measuring the dispersion within each group of observations (directly associated with the true S/N), while the group means are preserved by the flat-fielding of the images.

Usually, images of the object and of the comparison stars (one or more) are recorded in the same CCD frames. However, combining several comparison stars to estimate the individual errors σi in Equation (A2) does not solve the problem. It is worth noting here that the reduced χ2ν statistic with ν degrees of freedom is distributed as Fν,∞. Thus, even if σi is obtained using several comparison stars to get a better estimate of the errors, the precision is still limited by the number of these stars, which will probably be less than a dozen. In contrast, the χ2 test requires that each σi be calculated from all the possible measurements of an infinite population of stars. This issue is exemplified in Section 4.1.

Some authors replace the individual errors with a common error σ estimated from the dispersion in the comparison star data. This is the standard procedure when computing the C-statistic (see Section A.4). However, in differential photometry the number of images of the object and of the comparison stars is the same, and therefore the estimates of the standard deviations of their light curves have the same number of degrees of freedom. The χ2 test is then biased because it only takes into account the degrees of freedom of the estimate for the quasar. The correct procedure is to use the F test for two sample variances, as proposed by Howell et al. (1988).

In conclusion, any procedure that cannot rely on theoretically known (not merely estimated) error values compromises the reliability of the χ2 test. Whenever the true errors are unknown, other statistical methodologies should be used.

In the case of the simulations presented in this paper, all the parameters are controlled and the true errors can be effectively computed. Equation (A2) has been used to compute the χ2 statistics. Note that data quality is ensured by the simulation design; thus there is no need to use weighted averages.
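Under such controlled conditions, the χ2 computation of Equation (A2) reduces to a few lines. The sketch below uses an illustrative non-variable light curve with known per-point errors (the seed and values are not the paper's actual code):

```python
# Sketch of the chi^2 test of Equation (A2) with known (not estimated)
# per-point errors, as in the controlled simulations; values illustrative.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
N = 150
sigma_i = np.full(N, 0.01)               # true per-point errors, known
V = rng.normal(0.0, sigma_i)             # a non-variable light curve

chi2 = np.sum(((V - V.mean()) / sigma_i) ** 2)
crit = stats.chi2.ppf(1 - 0.001, N - 1)  # ~208
print(round(chi2, 1), chi2 > crit)
```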

A3. One-way ANOVA Test

ANOVA tests are used to compare the means of a number of samples. Due to the Central Limit Theorem, no matter what the shape of the original distribution is, the sampling distribution of the mean approaches a normal distribution. Therefore, tests to compare the means (t-test and ANOVA) are more robust than their counterparts to compare variances (χ2 and F tests). This offsets the problem of the non-robustness of the F test.

A one-way ANOVA was applied by de Diego et al. (1998) to investigate variability in the light curves of quasars. The methodology consists of measuring k groups of nj = 5 consecutive short (1 minute) exposures, with the groups ideally separated by 20–30 minutes. Larger time lags are common; they affect the time resolution but not the statistical significance of the test.

For the mathematical description of the one-way ANOVA test, if yij represents the ith observation (with i = 1,  2,  ..., nj) on the jth group (with j = 1,  2,  ..., k), the linear model describing every observation is

$$y_{ij} = \overline{y} + g_j + \epsilon _{ij} \qquad (A4)$$

where $\overline{y}$ represents the mean of the whole data set, $g_j = \overline{y}_j - \overline{y}$ the between-groups deviation, and $\epsilon _{ij} = y_{ij} - \overline{y}_j$ the within-groups deviation, also called the residual or measurement error. The size of the data set is $N = \sum _{j=1}^{k} n_j$. If the number of observations nj per group is constant, N = k × nj.

As noted previously, an ANOVA tests the null hypothesis that the means of the different groups are equal. If the test yields a probability smaller than the adopted significance level α, the null hypothesis is rejected and the alternative hypothesis (at least one group mean differs from the others) is accepted, which in this case implies detection of variability in the quasar light curve.

From Equation (A4), the total sample variation can be separated into variations between and within groups

$$\sum _{j=1}^{k}\sum _{i=1}^{n_j} (y_{ij}-\overline{y})^2 = \sum _{j=1}^{k} n_j\,(\overline{y}_j-\overline{y})^2 + \sum _{j=1}^{k}\sum _{i=1}^{n_j} (y_{ij}-\overline{y}_j)^2 \qquad (A5)$$

where the term on the left-hand side describes the total deviations of the data with respect to the mean. The first term on the right-hand side of the equation represents the total variation between groups, and the last term the total errors. Equation (A5) can be shortened to

$$\mathrm{SST} = \mathrm{SSG} + \mathrm{SSR} \qquad (A6)$$

where SST stands for the total sum of squares. Similarly SSG and SSR stand for group sum of squares and residual sum of squares.

Whenever the null hypothesis is true, the k groups of sampled data will be normally and independently distributed, with mean μ and variance σ2. Then, the statistic

$$F = \frac{\mathrm{MSG}}{\mathrm{MSR}} = \frac{\mathrm{SSG}/(k-1)}{\mathrm{SSR}/(N-k)} \qquad (A7)$$

corresponds to the F distribution with ν1 = k − 1 and ν2 = N − k degrees of freedom. The mean squares MSG and MSR estimate the between-group and residual variances, respectively. For a given significance level α, if F exceeds the critical value $F^{(\alpha)}_{\nu _1,\nu _2}$ the null hypothesis will be rejected.

The F critical values employed in this paper for the ANOVA tests are F(0.001)29,120 = 2.2819 and F(0.001)9,40 = 4.0243 for the full and the 30 minute lag sample simulations, respectively.
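The full computation can be sketched for one illustrative simulated non-variable light curve, with the explicit sums of squares checked against the scipy library implementation (the parameters follow the setup described above; the seed and data are illustrative):

```python
# Sketch of the one-way ANOVA of Appendix A3 for one illustrative
# simulated non-variable light curve: k = 30 groups of nj = 5 points.
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
k, nj = 30, 5
y = rng.normal(0.0, 0.01, k * nj).reshape(k, nj)

# Explicit sums of squares, following the decomposition in the text.
grand = y.mean()
ssg = nj * np.sum((y.mean(axis=1) - grand) ** 2)        # between groups
ssr = np.sum((y - y.mean(axis=1, keepdims=True)) ** 2)  # residuals
F = (ssg / (k - 1)) / (ssr / (k * nj - k))

F_lib = stats.f_oneway(*y).statistic                    # cross-check
crit = stats.f.ppf(1 - 0.001, k - 1, k * nj - k)        # ~2.28
print(round(F, 3), F > crit)
```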

A4. C-statistic

The C-statistic was first employed by Jang & Miller (1997) and generalized by Romero et al. (1999). The statistical parameter used is

$$C = \frac{\sigma _T}{\sigma} \qquad (A8)$$

where σT and σ are the standard deviations of the quasar and the comparison star differential light curves, respectively. The adopted variability criterion requires C ⩾ 2.576, which corresponds to a 99% confidence level, or a 1% significance level in the notation of this paper.

There are two pitfalls in this criterion. First, the critical value 2.576 corresponds to the 1% significance level of the normal distribution for a two-sided test, rather than for the appropriate one-sided comparison. A two-sided test would be relevant if the aim were to test whether the dispersion in the quasar light curve is either larger or smaller than the dispersion of the comparison star light curve. Note also that the C-statistic is always positive, and that its expected value when σT = σ is close to 1, rather than 0 as would be the case for a genuinely normal variable.

The other pitfall, probably the most important one, is that two standard deviations cannot be compared using the normal distribution. In fact, standard deviations are unsuitable for most calculations because they are not linear statistical operators (for example, given two independent random variables A and B with standard deviations σA and σB, respectively, the standard deviation of the sum A + B is not σA + σB). That is why variances must be used instead: variances are the second moments of the statistical distributions and therefore linear operators (in the previous example, the variance of the sum A + B is σ2A + σ2B).

If we draw all possible samples of a given size N from a normally distributed population and compute their variances, we obtain a distribution of sample variances that starts at s2 = 0 and has a mean of σ2. Thus, even though the distribution of all possible sample means drawn from a normally distributed population is approximately symmetrical (a consequence of the central limit theorem), the corresponding distribution of sample variances is not symmetrical, but is distributed as χ2 with N − 1 degrees of freedom (a consequence of Cochran's theorem for the sums of squares of linear combinations of independent standard normally distributed random variables):

$$\frac{(N-1)\,s^2}{\sigma ^2} \sim \chi ^2_{N-1} \qquad (A9)$$

One important consequence of this equation is that the shape of the χ2 distribution differs for different sample sizes. Thus, the dispersion of the distribution of sample variances depends on how many degrees of freedom the estimate has. Note that this dependence on the number of degrees of freedom carries over to the standard deviation, although the χ2 statistic itself does not apply in that case.

Using the F statistic it is easy to calculate the "real" significance of the C-statistic criterion. If the critical value of Equation (A8) is 2.576, the corresponding critical F value is 2.5762 = 6.636. For F with (20, 20) and (50, 50) degrees of freedom, the significance levels of the test are 4.4 × 10−5 and 2.1 × 10−10, respectively, i.e., much smaller than the 10−2 value considered by Jang & Miller (1997) and Romero et al. (1999). Note that Romero et al. report that variations in radio loud quasars occur more often than in RQQs, but most sources in their radio loud sample are strongly variable BL Lac objects that can easily reach high significance levels of microvariability detection.
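This conversion is straightforward to reproduce with scipy (a sketch of the calculation just described; the degrees of freedom are the two examples used above):

```python
# Sketch of the "real" significance of the C >= 2.576 criterion when the
# squared ratio of dispersions is treated, correctly, as an F statistic.
from scipy import stats

F_equiv = 2.576 ** 2                          # ~6.636
ps = {}
for dof in (20, 50):
    ps[dof] = stats.f.sf(F_equiv, dof, dof)   # one-sided tail probability
    print(dof, ps[dof])
```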

To summarize this discussion, the critical values for the C-statistic criterion are wrongly established: Equation (A8) does not describe a normally distributed variable, because it is neither properly centered (the expected value of its mean is not zero), nor does the parameter σ in the denominator represent an unbiased measurement of the dispersion of the sample standard deviation. The square of the C-statistic can, however, be used to perform an unbiased F test, provided the degrees of freedom used for the σT and σ estimates are given.

APPENDIX B: IMPROVING THE ANOVA POWER

The power of any statistical test to study variability during a given time interval depends on the number of observations recorded to improve the time resolution and the measurement precision to improve the amplitude resolution. In practice, a compromise is attained to hold reasonable resolutions for both factors before actually observing the target. Would it not be useful to have the possibility of swapping between the time and amplitude resolutions after the observations? Binning data might help, but it is shown in Section 4 that it does not work for a χ² methodology because of the loss of degrees of freedom.

In the ANOVA methodology, if the target has been monitored discontinuously, as in the case of an ANOVA with 30 minute lags, the binning is fixed and consequently the time and amplitude resolutions cannot be exchanged either. But if the monitoring was continuous, it is possible to try out different bin sizes to improve the amplitude resolution at the expense of the time resolution. In this paper, a binning of five observations was used because it was a priori a reasonable value for time resolution and allowed direct comparison with the ANOVA with 30 minute lags strategy. Now, different bins of n observations will be considered to achieve the maximum power of the ANOVA test for the light curve simulations considered in this paper. The bin size n will always be a divisor of the total number N = 150 of observations of a light curve simulation.
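A minimal sketch of this binning strategy (using scipy's one-way ANOVA; the function name and the simulated non-variable light curve are illustrative assumptions, not the paper's pipeline):

```python
import numpy as np
from scipy import stats

def anova_binned(flux, n):
    """One-way ANOVA on a light curve split into consecutive bins of n points.

    Returns the F statistic and its p-value; n must divide len(flux).
    Illustrative sketch only.
    """
    groups = np.reshape(flux, (-1, n))   # k = len(flux) // n groups of size n
    return stats.f_oneway(*groups)

rng = np.random.default_rng(2)
N = 150
# A hypothetical non-variable light curve: pure Gaussian noise with
# sigma = 0.01 mag around a constant magnitude.
flux = rng.normal(18.0, 0.01, size=N)

for n in (5, 15, 75):                    # divisors of N, as in Table 4
    f, p = anova_binned(flux, n)
    print(n, round(f, 2), round(p, 3))
```

Because the monitoring is continuous, the same data array can be re-binned after the fact with any divisor of N, trading time resolution for amplitude resolution exactly as described above.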

Table 4 shows the ANOVA results for the Gaussian peak variations considering different bin size strategies. Column 1 indicates the bin size n, Column 2 shows the number of degrees of freedom ν1 for the k groups (ν1 = k − 1), Column 3 displays the number of degrees of freedom ν2 for the residuals (ν2 = Nk), Column 4 shows the F critical value for a significance level α = 0.1% and ν1 and ν2 degrees of freedom ($F^{(0.001)}_{\nu _1,\nu _2}$), Column 5 indicates the number of Type I errors, Column 6 shows the number of detections of variability, and Column 7 displays the relative power for the ANOVA tests normalized to the results for n = 15.

Table 4. ANOVA Performance for Different Bin Sizes

 n   ν1   ν2   $F^{(0.001)}_{\nu _1,\nu _2}$   Type I   Detections   Relative Power
 2   74    75    2.0657                           2        1178          0.48
 3   49   100    2.0835                           5        1713          0.69
 5   29   120    2.2819                           3        2187          0.88
 6   24   125    2.3907                           6        2268          0.92
10   14   135    2.8199                           4        2444          0.99
15    9   140    3.3374                           1        2478          1.00
25    5   144    4.3617                           3        2285          0.92
30    4   145    4.8882                           7        2143          0.86
50    2   147    7.2428                           5        1996          0.81
75    1   148   11.2727                           4         430          0.17

Note. The Type I and Detections columns refer to the Gaussian peak variation simulations.


A careful inspection of Table 4 shows that, for the simulations presented in this paper, the test power to detect variations increases with the bin size until reaching a maximum of 2478 detections for n = 15, and then decreases for larger bin sizes. The smooth overall detection tendency shows that the differences in power for these tests with different bin sizes are not an artifact. For n = 5 the relative power is almost 0.9, and all the bins between n = 5 and 50 have relative power larger than 0.8. These results show that the ANOVA bin size is not a fine tuning parameter, and that binning the data in groups of five observations is, also from the statistical point of view, a reasonable choice.

The smallest bin sizes improve the time resolution, but the detections are biased toward large amplitude variations. Conversely, the largest bin sizes can detect small variations, but the time resolution is degraded and fast variations are missed. Thus, the tests for n = 5 detect 65 variations with time FWHM up to 13 minutes, while the tests for n = 15 detect only 37. However, the percentages of detections for an amplitude of variation of approximately 0.01 mag are around 10% and 20% for the n = 5 and n = 15 tests, respectively.

In the case of linear variations, or more generally monotonic variations, the results are simpler. As the duration of the variability event is not an issue in this case, the only factor that matters is the amplitude resolution. Therefore, the most powerful tests are those with only two groups, one for each half of the light curve, which corresponds to n = 75, ν1 = 1, and ν2 = 148. The actual number of detections for this test is 4080, instead of the 3043 detections obtained for n = 5.

These results are easily translated to the case of analyzing a single quasar light curve. The researcher does not know in advance the specific characteristics of a possible microvariability event. But by planning the observations to be analyzed with an ANOVA experimental design, the astronomer can enhance the temporal resolution of the observations to detect very fast, large amplitude variations, and still preserve the amplitude resolution needed to detect longer timescale, low amplitude variations. Of course, this also has a cost. The final significance level αf when performing a set of nt statistical tests is different from the significance level αt of the individual tests. A number of adjustments for multiple tests are applied in the statistical literature, the most common being the Bonferroni correction:

$\alpha _f = n_t\,\alpha _t$.    (B1)

If the set of tests performed is the same as in the case presented above for the simulations, nt = 10 and αt = 0.1%, the final significance level will be αf = 1%. Similarly, the significance level of the individual tests could be set to αt = 0.01% to reach a final value of αf = 0.1%. However, the results shown in Table 4 suggest that performing tests for bins of size 5, 15, and 30 may be enough to ensure detection over a wide range of timescales and amplitudes. If only three tests are performed, the significance levels are αf = 0.3% (if αt is set at 0.1%) or αt = 0.03% (if αf is maintained at 0.1%).
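The adjusted levels quoted above reduce to a simple product and can be checked in a few lines (a trivial sketch; the function name is an illustrative choice):

```python
# Bonferroni correction: with n_t tests each run at per-test level alpha_t,
# the family-wise significance level is alpha_f = n_t * alpha_t.
def bonferroni_family_level(alpha_t, n_t):
    return n_t * alpha_t

print(bonferroni_family_level(0.001, 10))    # ten tests at 0.1% -> about 1%
print(bonferroni_family_level(0.001, 3))     # three tests at 0.1% -> about 0.3%
print(bonferroni_family_level(0.0001, 10))   # 0.01% per test keeps alpha_f near 0.1%
```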

Footnotes

  • The power of a statistical test is the probability that the test will reject a false null hypothesis. A robust statistical technique is one that performs well even if its assumptions are somewhat violated by the inherent properties of the sampled population.

  • The Instrument Simulator of the OAN has been developed by Alan Watson and can be accessed through the OAN Web site http://132.248.4.250/resast/simulador.

  • A more accurate calculation would take into account the fraction of Type I errors included in the number of detections. If Type I errors were frequent, mixtures of honest detections and Type I errors should also be present.
