Stimulus-Induced Narrowband Gamma Oscillations are Test–Retest Reliable in Human EEG

Abstract Visual stimulus-induced gamma oscillations in electroencephalogram (EEG) recordings have been recently shown to be compromised in subjects with preclinical Alzheimer’s Disease (AD), suggesting that gamma could be an inexpensive biomarker for AD diagnosis provided its characteristics remain consistent across multiple recordings. Previous magnetoencephalography studies in young subjects have reported consistent gamma power over recordings separated by a few weeks to months. Here, we assessed the consistency of stimulus-induced slow (20–35 Hz) and fast gamma (36–66 Hz) oscillations in subjects (n = 40) (age: 50–88 years) in EEG recordings separated by a year, and tested the consistency in the magnitude of gamma power, its temporal evolution and spectral profile. Gamma had distinct spectral/temporal characteristics across subjects, which remained consistent across recordings (average intraclass correlation of ~0.7). Alpha (8–12 Hz) and steady-state-visually evoked-potentials were also reliable. We further tested how EEG features can be used to identify 2 recordings as belonging to the same versus different subjects and found high classifier performance (AUC of ~0.89), with temporal evolution of slow gamma and spectral profile being most informative. These results suggest that EEG gamma oscillations are reliable across sessions separated over long durations and can also be a potential tool for subject identification.


Introduction
Gamma rhythm corresponds to narrowband oscillatory activity between 30 Hz and 70 Hz in the brain, which has been shown to be involved in many higher cognitive functions like attention (Fries et al. 2001;Jensen et al. 2007) and memory (Herrmann et al. 2004). Gamma oscillations have also been shown to be affected in cognitive disorders such as schizophrenia, autism Singer 2010, 2012;Kitchigina 2018), bipolar disorder (Garcia-Rill et al. 2019) and Alzheimer's Disease (AD) (Stam et al., 2002;Kitchigina 2018). In addition, gamma oscillations can be induced by presenting certain visual stimuli such as bars and gratings, and their amplitude and center frequency dependent on the stimulus properties such as size, contrast, and spatial frequency (Henrie and Shapley 2005;Gieselmann and Thiele 2008;Ray and Maunsell 2010;Jia et al. 2013;Murty et al. 2018). Recently, we have shown that these stimulus-induced narrowband gamma oscillations recorded using electroencephalogram (EEG) in middle-aged/elderly subjects (>49 years) weaken with age (Murty et al. 2020) and are weaker in subjects with mild cognitive impairment (MCI) and early AD compared with their age and gender matched controls (Murty et al. 2021). These results suggest that stimulusinduced gamma oscillations recorded using EEG could potentially be an inexpensive and easily accessible biomarker for diagnosis of mental disorders such as AD.
For the induced gamma to serve as a clinical biomarker, it is necessary to confirm its consistency across multiple recording sessions. Previous studies have shown that stimulus-induced visual gamma oscillations are test-retest reliable, but there are several limitations. First, such test-retest reliability of induced visual gamma oscillations is limited to magnetoencephalogram (MEG) recordings (Hoogenboom et al. 2006;Muthukumaraswamy et al. 2010;Tan et al. 2016). The only EEG study to our knowledge that reported induced gamma oscillations used a checkerboard stimulus that mainly induced broadband gamma (Keil et al. 2003); another study that used gratings to induce gamma oscillations reported reliability of evoked, not induced, gamma oscillations (Fründ et al. 2007). Second, although having a test-retest interval of a year allows for checking its feasibility as an annual check-up, most of these studies have tested reliability with test-retest interval spanning from a few weeks to months (but see McCusker et al. 2021 who had a test-retest interval of 3 years, although they used checkerboard patterns and their paradigm also included a behavioral task). Third, most studies have been conducted on young subjects (20-40 years), but it is also important to study the reliability in older subjects who are prone to cognitive disorders. Further, steady-state-visually evoked-potentials (SSVEPs) at gamma frequencies have recently been shown to have therapeutic effect on rodent models of AD (Iaccarano et al., 2016), and to be useful in brain computer interfacing (BCI) applications (Vialatte et al. 2010). Therefore, it is important to compare the test-retest reliability of SSVEPs with induced gamma oscillations. Finally, most previous studies have focused on power and center frequency of induced gamma oscillations. However, these oscillations also have a characteristic temporal evolution, and in addition, the peak frequencies of these and other oscillations vary across individuals, thus generating a characteristic spectral profile as well. Since these spectral and temporal features could provide additional information beyond simply the power and center frequency, it is important to study test-retest reliability of these features. Such power and center frequency based features have been studied in the context of subject identifiability (Näpf lin et al. 2007) and biometric authentication (Thomas and Vinod 2018) in resting state EEG data as well as visual evoked potentials (Sarnthein et al. 2009), but not for stimulusinduced gamma oscillations.
To address these limitations, we tested the reliability of visually induced narrowband gamma oscillations recorded using EEG in healthy middle-aged/elderly (aged >49 years) with an interval of 1 year or more. Since several previous studies have reported reliability of alpha oscillations (Salinsky et al. 1991;Corsi-Cabrera et al., 2007;McCusker et al. 2021), we performed these analyses on the alpha band as well. This dataset is a part of double-blind case-control EEG study (n = 257; 106 females), part of which has been used earlier to show that gamma oscillations weaken with healthy aging (Murty et al. 2020) and with MCI/AD (Murty et al. 2021). Fifty-two subjects were retested with the same visual paradigm after a period of approximately 1 year. We employed large visual stationary Cartesian grating stimuli that induced 2 distinct gamma oscillations in both monkeys and humans (Murty et al. 2018), called slow [20-35 Hz] and fast gamma . We estimated reliability of power in alpha and both gamma bands using a widely used measure called Intraclass Correlation Coefficient (ICC; (Shrout and Fleiss 1979)). We examined the testretest reliability of SSVEPs at 32 Hz as well. We also studied correlations in the temporal and spectral profiles across same versus different individuals. In addition, we studied intersubject variability (subject distinctiveness) through a linear classifier that was trained to discern same subject pairs from the all possible cross-subject pairs and compared the contribution of various features toward classification.

Human Subjects
The full dataset was collected from 257 middle-aged/ elderly human subjects (females: 106) aged 50-88 years as part of the Tata Longitudinal study of aging (TLSA). For brevity, we refer to the whole population as "elderly," even though there is considerable age variation in the population. The subjects were recruited from urban Bengaluru communities, with careful evaluation by trained psychiatrists, psychologists, and neurologists affiliated with National Institute of Mental Health and Neuroscience (NIMHANS) and M.S. Ramaiah Hospital, Bengaluru. Their cognitive performance was evaluated by a set of tests-Addenbrooke's Cognitive Examination (ACE), Clinical Dementia Rating (CDR), and Hindi Mental State Examination (HMSE). More details have been provided in previous reports (Murty et al. 2020(Murty et al. , 2021. Out of the 257 subjects recruited in the first year, we discarded data of 10 subjects due to noise. Out of the remaining 247 subjects, 52 subjects were retested after approximately a year (this test-retest interval was based on constraints not related to EEG recordings, since this was part of a multi-investigator project involving many types of recordings). Among these, 4 subjects were either unhealthy (CDR > 0) or had invalid CDR score, leading to usable data of 48 subjects who were healthy in both sessions. The test-retest sessions of a subject are referred to as baseline and follow-up.
All subjects received monetary compensation for participating in both the sessions. Informed consent was obtained from the subjects prior to the experiment. All procedures were approved by The Institute Human Ethics Committees of Indian Institute of Science, NIMHANS, Bengaluru and M.S. Ramaiah Hospital, Bengaluru.

Experimental Setting and Behavioral Task
Experimental setup and details of recordings have been explained in detail previously (Murty et al. 2020(Murty et al. , 2021 and are summarized here. EEG was recorded from 64channel active electrodes (actiCap) using BrainAmp DC EEG acquisition system (Brain Products GMbH) and were placed according to the international 10-10 system. Raw signals were filtered online between 0.016 Hz (first-order filter) and 1 kHz (fifth-order Butterworth filter), sampled at 2.5 kHz, and digitized at 16-bit resolution (0.1 μV/bit). Samples were decimated to 250 Hz using the Matlab command "resample," which applies an finite impulse response antialiasing lowpass Kaiser filter before downsampling by 10. Average impedance of the final set of electrodes was (mean ± SEM) 7.89 ± 0.67 kΩ and 7.73 ± 0.54 kΩ for baseline and follow-up, respectively. All the EEG signals recorded were referenced to electrode FCz during acquisition (unipolar reference scheme).
All subjects sat in a dark room facing a gammacorrected LCD monitor (dimensions: 20.92 × 11.77 inches; resolution: 1280 × 720 pixels; refresh rate: 100 Hz; BenQ XL2411) with their head supported by a chin rest.
It was placed at (mean ± SD) 58 ± 0.7 cm from the subjects (range: 54.9-61.0 cm) based on their comfort and subtended 52 • × 30 • of visual field for full screen gratings.
In the "gamma" experiment, subjects performed a visual fixation task, wherein 2-3 full screen grating stimuli were presented for 800 ms with an interstimulus interval of 700 ms after a brief fixation period of 1000 ms in each trial using a customized software running on MAC OS. The stimuli were full contrast sinusoidal luminance achromatic gratings with either of the 3 spatial frequencies (1, 2, and 4 cycles per degree (cpd)) and 4 orientations (0 • , 45 • , 90 • , and 135 • ). Out of the 48 subjects that were retested, 1 subject had noisy followup data, leaving with 47 pairs of subjects. With further constraints on the number of good electrodes for analysis (see Artifact Rejection below) over both baseline and follow-up, we finally had 40 pairs of subjects (20 males and 20 females). Subjects with data available on both the sessions (n = 40) aged in the range of (mean ± SD) 64.3 ± 7.17 (min: 50, max: 85 years). The test-retest interval was (mean + SD) was 380.4 ± 10.12 days (min: 256, max: 519), as reported in the plots in Figures 1 and 2.
The "SSVEP" experiment was performed on a subset of these subjects. This was always conducted after the gamma experiment, on the same day because the stimulus settings were dependent on the gamma experiment results. Here, subjects viewed a random presentation of grating counter-phasing at 16 cycles per second (cps), along with a static grating, both of spatial frequency, and orientation that produced the maximum gamma in the gamma experiment, but only the data from counter-phased grating was used. These stimuli were presented in a similar stimulus paradigm as the gamma experiment. Out of the 48 subjects that were retested, 3 subjects had unusable baseline or follow-up data and 9 subjects did not have enough number of good electrodes (see Artifact Rejection below), leaving us with 36 pairs of subjects (20 males and 16 females). Of these, subjects with the number of trials less than 15 were further discarded because an appreciable SSVEP peak was often not visible with fewer trials (4 subjects), ending up with 32 pairs (17 males and 15 females).
Eye position was monitored using EyeLink 1000 (SR Research Ltd), sampled at 500 Hz. Any eye-blinks or change in eye position outside a 5 • fixation window during −0.5 s to 0.75 s from stimulus onset were noted as fixation breaks, which were removed off line. This led to a rejection of 14.6 ± 2.8% (mean ± SEM) and 17.8 ± 2.9% (mean ± SEM) repeats for the baseline and the followup groups for the gamma experiment, and 14.9 ± 2.4% and 16.5 ± 2.7% for the SSVEP experiment. We have previously shown that eye-movements and microsaccades (Hassler et al. 2011) have a negligible effect on gamma power in this dataset (Murty et al. 2020).

Artifact Rejection
We implemented a fully automated artifact rejection framework for further details, see (Murty et al. 2020(Murty et al. , 2021, in which outliers were detected as repeats with deviation from the mean signal in either time or frequency domains by more than 6 standard deviations, and subsequently electrodes with too many (30%) outliers were discarded. Subsequently, stimulus repeats that were deemed bad in the visual electrodes (see the section on EEG data analysis below) or in more than 10% of the other electrodes were considered bad, eventually yielding a set of good electrodes and common bad repeats for each subject. Overall, this led us to reject (mean ± SEM) 14.7 ± 0.8% and 14.6 ± 0.8% of repeats for the baseline and follow-up groups for the gamma experiment, and 6.9 ± 0.6% and 7.3 ± 0.7% for the SSVEP experiment. Finally, we computed slopes for the baseline power spectrum between 56 Hz and 84 Hz range for each unipolar electrode and rejected electrodes whose slopes were less than 0. We further discarded protocols which did not have any of the bipolar electrode pair (see EEG data analysis) in the visual anterolateral electrode groups (both left and right anterolateral group) after electrode rejection. Subjects with no good protocols (in either baseline or follow-up group or both) were rejected from analysis, yielding 40 subjects (20 females) for the gamma experiment and 32 subjects for the SSVEP experiment.
In addition to this artifact detection pipeline that was used in previous studies, 2 additional conditions were included. First, for each subject, we computed the union of bad electrodes in baseline and follow-up sessions and rejected this union from both sessions, such that analyses for baseline and follow-up were over the same set of electrodes. Second, we noticed the presence of a large discontinuity in the signal in a small fraction (∼3%) of stimulus repeats and added a pipeline to remove those. Specifically, stimulus repeats with discontinuities (in any of the electrodes) were detected by thresholding the derivative of signal (>120 μV/sec) in time within [−0.5 1.5] s with respect to stimulus onset. This led to the rejection of (mean ± SEM) 5.07 ± 0.89 (maximum: 21) repeats in baseline and 4.55 ± 1.04 (maximum: 23) repeats in follow-up subjects. Electrodes with more than 20 such discontinuities were considered noisy (maximum: 5 electrodes). These additional conditions were added mainly to improve the time-frequency plots (Figs 1 and 2) that were sensitive to these large discontinuities; the main results (Figs 5 and 8) remained similar even without these additional conditions. Overall, after discarding all bad repeats, 291.6 ± 12.8 and 298.9 ± 14.12 stimulus repeats were available for the gamma experiment, and 30.1 ± 1.1 and 29.5 ± 1.1 for the SSVEP experiment, for the baseline and the follow-up groups, respectively.
The exclusion criteria are summarized in Table 1 below. Note that the reduction in the size of the SSVEP dataset is simply because it was always done after the Gamma experiment and for a shorter duration, since the main aim of this project was to investigate stimulus-induced gamma oscillations. However, even for the Gamma study, data from 8/48 (∼17%) of subjects were discarded. This relatively high number of rejections is because of the very stringent criteria for subject selection. For example, the visual electrodes that were used for analysis (see "EEG data analysis" section below) were divided into 2 anterolateral groups and one posteromedial group. We rejected any subject for which even one of the 2 anterolateral groups did not have any good electrodes. This was done so that we could test for potential asymmetry in gamma distribution on the scalp (as shown in Supplementary Figs S1 and S2). Relaxing this requirement such as usable electrodes in either one (instead of both) of the 2 anterolateral groups lead to subject selection would have yielded 4 additional subjects (and a rejection ratio of ∼8%).
All the data analyses were done using custom codes written in MATLAB (MathWorks. Inc; RRID:SCR_001622). Power spectrum and time-frequency spectrograms were obtained using the Chronux Toolbox ((Bokil et al. 2010), RRID:SCR_005547). Both were calculated for individual Change in power between stimulus period and baseline period for a frequency band was computed using the following formula: where ST is the stimulus power spectra and BL is the baseline power spectra, both averaged within relevant frequency bins (f ), across all the analyzable repeats. Band powers are specifically computed in 3 frequency bands, namely slow gamma (20-35 Hz), fast gamma (36-66 Hz), and alpha (8-12 Hz). The gamma frequency band limits were set based on the localization of 2 gamma peaks in the change in PSD profile, and are the same as used in our previous studies (Murty et al. 2018(Murty et al. , 2020(Murty et al. , 2021. These were averaged across all the 3 electrode groups. Scalp maps were generated using the topoplot function of EEGLAB toolbox ( (Delorme and Makeig 2004), RRID:SCR_007292) with custom bipolar montage of 112 channels. Test-retest reliability was assessed by Intraclass Correlation Coefficient (ICC) measure. Suppose the dataset is arranged in a matrix with each column representing a session and each row a different subject. ICC incorporates both intersubject and intrasubject variability in terms of mean square (variance) for rows (MSR) [variance of sessions' means across subjects] and mean square (variance) within raters (MSW) [mean of subjects' variance across sessions], respectively. A specific type of ICC, called 1-1 ICC, which is based on one-way analysis of variance model, is defined as follows (see (Shrout and Fleiss 1979) for details): In our case, since N sessions = 2, this reduces to ICC = (MSR − MSW)/(MSR + MSW). This measure increases with decreasing within-subject variability (MSW) and with increasing between-subject variability (MSR), with values approaching 1 with perfect reliability (MSW = 0). This measure was used for the reliability of spectral power in sensor and source domains in MEG (Martín-Buro et al. 2016). Also, the null-hypothesis H 0 : ICC = 0 was tested for significance by F-statistical test, F-value = MSR/MSW, with (N subjects − 1) and N subjects (N sessions − 1) degrees of freedom (Shrout and Fleiss 1979), reported with 95% confidence interval, F-value, and P-value. Larger the variation of session means across subjects, relative to the variation within sessions, larger is the F-value. Given the F-value, degrees of freedom of MSR and MSW, P-value is computed based on the F cumulative distribution function. Two related types of ICC, called Consistency and Agreement ICC (ICC(C,1) and ICC(A,1)) were also computed for comparing our results with other studies. These ICC definitions involve mean square for error which encompasses variance across both subjects and sessions (for details, see (McGraw and Wong 1996)). We computed the ICC using MatlabCentral file-exchange codes provided by Salarian (2016), which implemented the statistical testing described by (McGraw and Wong 1996).

Estimating Subject Distinctiveness Using Linear Discriminant Analysis
A measure of subject distinctiveness reflects the separability between the self-pairs and cross-pairs-which is approximated here, in terms of classifier performance (Linear Discriminant classifier) in discriminating the 2 categories of pairs. Input to the classifier is a pair of baseline and follow-up session recordings from 2 subjects. For each pair, we defined 7 comparative measures (stimulusinduced features), namely, absolute difference in relative band powers within slow gamma (p sγ ) [20-35 Hz], fast gamma (p fγ ) [36-66 Hz], and alpha (p α ) [8-12 Hz] (3 features), baseline corrected band power temporal correlations (Pearson's correlation) in the slow gamma (tc sγ ), fast gamma (tc fγ ), and alpha (tc α ) between [0 0.8] s (3 features; correlation between the traces shown in Fig. 3), and spectral correlation (sc) in [0-100 Hz] (excluding 50 Hz and 100 Hz peaks that represented line noise and monitor refresh rate; 1 feature; correlation between the traces shown in Fig. 4). To study the importance of the features in the baseline period, we also considered baseline alpha difference power (p blα ) and raw PSD correlation (bl sc) features (correlation between the baseline PSDs shown in Supplementary Fig. S3). Three binary classifiers (using only stimulus-induced features, only baseline features, and both feature sets combined) were trained to classify a pair into either self-pair or cross-pair and was tested according to 5-fold stratified cross-validation.
Because the cross-pairs (40 × 39 = 1560) far outnumbered the self-pairs (40), classifier performance could  not be assessed using accuracy, since even a null classifier that categorizes every observation as a crosspair would have an accuracy of 97.5%. Instead, we assessed the classifier performance through the area under the receiver operating characteristic (ROC) curve (AUC), where the ROC curve represents the true positive rate against false positive rate. AUC calculation does not need the class label output of classifier, as it considers various thresholds for classification. Only the weighted projection along with the true class labels are needed for true positive rate and false positive rate. AUC values vary between 0.5 (chance level classification) and 1 (perfect classification). Feature importance was determined by the weight of the feature computed using linear discriminant analysis (LDA) classifier on the z-scored feature values, as well as using single feature AUC method wherein each feature was individually considered to classify self versus cross-pair. We used MATLAB statistics and machine learning toolbox function perfcurve.m for computing AUC.

Statistical Analysis
All the statistical analysis and results obtained, are using the power spectrum or correlation measures. Appropriate tests (t-statistic; F-statistic; Wilcoxon rank sum) were used to interpret our findings.

Data and Code Availability
All spectral analyses were performed using Chronux toolbox (version 2.10), available at http://chronux.org. ICC analysis functions were taken from MatlabCentral file-exchange, provided by Salarian (2016). Figure data and codes used here are available at https://github.com/ supratimray/TLSAEEGProjectPrograms under the con-sistencyProject folder. Raw data will be made available to readers upon reasonable request and made publicly available at a later time.

Time Frequency Spectra Remain Reliable over 1-Year Interval
We first examined the change in time-frequency power spectra relative to baseline power (−0.5 to 0 s) of EEG signals for the "gamma" experiment, averaged across the 3 bipolar electrode groups, for the baseline and follow-up sessions (see Methods). Figures 1 and 2 show the results for female (n = 20) and male subjects (n = 20), sorted based on decreasing total gamma power. Both females and males show visually reliable timefrequency spectra within a subject across baseline (left panel) and follow-up (right panel). Specifically, the powers in the 3 frequency bands, slow gamma, fast gamma, and alpha were consistent in timefrequency spectra, as were the temporal evolution of these rhythms. Similarly, the topographic scalp maps of average gamma band power (mean of slow and fast gamma) appeared consistent between baseline and follow-up sessions in females (Supplementary Fig. S1) and males (Supplementary Fig. S2). Figure 3 shows the temporal evolution of band powers in alpha, slow gamma, fast gamma range, computed by averaging time frequency spectra across relevant frequency bins, separately for males and females. Remarkably, these profiles were distinct across subjects but appeared highly correlated across the 2 sessions within subjects. Figure 4 shows the change in power during the stimulus period (0.25-0.75 s, where 0 is the stimulus onset) relative to baseline for the 2 sessions. These traces show a prominent suppression at alpha range, and elevation of power at slow and fast gamma bands, and together represent a "spectral profile" for each subject. These spectral profiles were also highly similar for each subject and distinct across subjects. Consistency was found in the spectral profile of baseline period absolute power as well, although for several subjects there was substantial difference in the noise floor ( Supplementary Fig. S3). For these subjects, the stimulus-induced PSDs were also different in a similar way (data not shown), such that the change in PSD plots shown in Figure 3 remained highly overlapping. Further, in spite of the difference in noise floor, the spectral shape during baseline periods was informative about subject identity (see Fig. 8 for details).  Figure S4 shows similar scatter plot of average resting state alpha power (baseline period). SSVEP power (increase in power at 32 Hz from baseline) computed from the SSVEP experiment (see Methods) is also shown (Fig. 5d). In all cases, the powers were highly correlated (significance results are indicated in the plots). Overall, we did not find age dependence or the dominance of any age group in the observed correlation pattern (not shown).

Feature-Based Subject Identification
From the data collected under "gamma" experiment, we computed subject distinctness measures, ref lecting the separation between the self-pairs (40/1600) and cross-pairs (1560/1600), using spectral and temporal profiles of the signal. The separation was assessed within a classifier framework, based on 7 stimulus-induced features including band power differences (slow gamma, fast gamma, and alpha bands; 3 features), temporal profiles (3 features), and spectral profile (1 feature; Fig. 7; see Methods for details) using LDA. Feature weights were computed by 5-fold stratified training and was used for classifying test data. Figure 7a shows the mean values of the 7 features for the cross and self-pairs. As expected, the band power differences were lower for self-pairs than cross-pairs (plots on the left). To use the correlation-based features in the same way to ref lect dissimilarity, we subtracted 1 from the correlation values so that they were also lower for self-pairs than cross-pairs (right plots). To compare the distinctness of these features, and for data standardization, we z-scored each feature value across all the 1600 (40 × 40) pairs (Fig. 7b). All features had lower z-scores for self-pairs compared with crosspairs, with the temporal profile of slow-gamma and the spectral profile having the largest separation in zscores.
Next, we performed LDA on these z-scored data. Because data are z-scored, the weights of the classifier (shown in the legends of Fig. 7b) can be compared directly and reflect the importance of the features toward classification. These weights were also the highest for the temporal profile of slow-gamma and the spectral profile. Note that these weights are not always proportional to the difference in z-scores between the 2 classes, since some features may be highly correlated with others and may therefore carry redundant information (for example, temporal variation in fast gamma had a weight of almost zero). Further, we assessed the performance of each feature separately by calculating the AUC, which is a measure of the separation between the data in the 2 classes that does not depend on any explicit threshold (Fig. 8). Again, AUC values were highest for the temporal correlation of slowgamma (0.84) and spectral correlation (0.83). Temporal correlation features remained dominant even when calculated over [0.25-0.75] s to exclude the transient activity. However, including all 7 features improved the AUC only marginally to 0.89, suggesting that these features had high redundancy. We also considered the baseline features (features obtained from resting state) including absolute average alpha power (1 feature; from the PSD in baseline period, Supplementary Fig. S3) and spectral profile (1 feature; correlation between the PSDs shown in Supplementary Fig. S3; see Methods for details) to determine the improvement in classification with inclusion of stimulus-induced features. Features obtained from resting state data also performed well (AUC = 0.82), with resting state alpha power (AUC = 0.73) performing better than spectral correlation of baseline PSDs (AUC = 0.67). The features in the baseline period were redundant with the stimulus features; adding these baseline features only improved the AUC to 0.90 (Fig. 8). These observations establish slow gamma temporal correlation and spectral correlation as the dominant features.

Discussion
We studied reliability of various spectral and temporal neural markers on visual stimulus-induced responses in EEG recorded from middle-aged/elderly (50-88 years) healthy human subjects, with a test-retest interval of about 1 year. Time frequency spectra were consistent across baseline and follow-up sessions for both males and females. Similar consistency was observed for temporal traces of alpha, gamma band powers as well as change in PSD profiles. We used ICC to measure reliability and found power in alpha, slow, and fast gamma to be significantly reliable. SSVEP power at 32 Hz were also found to be reliable. Temporal and spectral profiles were tested for reliability through Pearson's correlation between the profiles of same individual pairs ("self-pair correlation") and were found to be significantly higher than the correlations between cross individual pairs ("cross-pair correlation"). We trained a linear classifier to distinguish pairs of datasets as belonging to the same subject (self) versus different (cross) using 7 features including absolute difference of powers (3 features), temporal band power correlations (3), and spectral correlation (1). The feature weights as well as the separation in individual feature z-scores were highest for slow gamma temporal correlation and spectral correlation. We measured classifier performance using area under ROC curve and found that each of these 2 features also performed the best (AUC of ∼0.84). Overall classification performance increased only modestly when all features were used (AUC of 0.89), suggesting high redundancy between features. Classification using only resting state (baseline) features was also high (AUC of 0.82 when resting alpha and baseline PSDs were considered) and combining resting state and stimulus-induced features yielded AUC of 0.90. Overall, our results suggest that stimulus-induced spectral and temporal changes in various frequency bands are reliable and distinct, allowing their use in subject identification as well.

Reliability of Fast and Slow Gamma
Visually induced gamma oscillations in fast gamma range were shown to be reliable in human MEG with substantially higher ICC(C,1) of 0.89 (Hoogenboom et al. 2006;Tan et al. 2016) compared with our results (ICC(C,1) of 0.72 for fast gamma). However, these 2 studies had much lower test-retest period of 1-11 day (Tan et al. 2016) and 4 days-4 months (Hoogenboom et al. 2006), and the stimulus was a contracting circular sine wave grating. Hence, the low ICC in our case could be due to multiple reasons, such as the stimulus, recording modality and the test-retest period. Test-retest interval seems to be an important factor, since McCusker et al. (2021) who studied induced gamma in MEG after a period of up to 3 years reported much lower agreement ICC(A,1) of 0.55 (McCusker et al. 2021). Apart from the duration, the lower values could be due to the ICC type used, as the agreement type ICCs are usually lower than the consistency and classical definitions in presence of systematic bias in the data (although we did not find this in our data). Moreover, it could be due to the use of a checkerboard stimulus. Another human MEG study showed test-retest reliability in induced gamma oscillations with 1 week gap (Muthukumaraswamy et al. 2010), but they did not report ICC values. This is the first time that the reliability of induced slow gamma oscillations was addressed in human EEG with a gap of approximately a year. Slow gamma was observable in a single subject (S02) in the study by Tan et al. (2016) in human MEG, but they do not follow-up on it further. They used many stimuli with differences in spatial structure (circular or Cartesian grating), temporal structure (moving or stationary), and size (large or small), however, only some of which produced slow gamma. We have previously shown that slow gamma oscillations are induced in human EEG with stationary Cartesian gratings of large size (Murty et al. 2018) and hence we employed similar stimulus parameters for this study.

Reliability of Alpha Band
Corsi-Cabrera et al. (2007) found less intersession reliability in absolute alpha band power with eyes open compared with beta band. They attributed less intersession reliability to the higher reactivity of alpha to external stimulation modifying the subjects' internal state. Several studies have compared absolute versus relative power (percentage value of power within the band w.r.t total power across all the bands), with mixed results. Some studies found relative alpha power to be more reliable than absolute power (John et al. 1983;Kondacs and Szabó 1999), others found comparable reliability (Gasser et al. 1985;Salinsky et al. 1991), whereas a recent study (McCusker et al. 2021) reported more robust reliability of absolute alpha activity in resting state than in the stimulus period (alpha desynchronization). In our study, relative power (ICC(1) = 0.81) measures were more reliable than absolute power (ICC(1) = 0.76) in the alpha band, within the stimulus period. Moreover, in line with the study by McCusker et al. (2021), the resting state alpha reliability of both the relative power (ICC(1) = 0.83) and absolute power (ICC(1) = 0.78) were higher than the stimulus period. Apart from alpha power, alpha peak frequency also has been investigated and found to be test-retest reliable (Maltez et al. 2004;Näpf lin et al. 2007Näpf lin et al. , 2008Grandy et al. 2013).

Reliability of Other Signals
Test-retest reliability has been investigated in signals such as event-related synchronization and desynchronization (ERS/ERD) (Neuper et al. 2005), theta and alpha oscillations during a working memory task (McEvoy et al. 2000), functional EEG network characteristics (Velde et al., 2019) and evoked oscillations with visual stimuli (Keil et al. 2003;Fründ et al. 2007). Although the degree of reliability is difficult to compare across studies, all measures have shown some degree of test-retest reliability. Similarly, spectral features like the spectral exponent (slope of the PSD in log-log coordinates) have been shown to convey test-retest reliability (Demuru and Fraschini 2020).
Although, to our knowledge, reliability of SSVEPs in gamma range has not been studied, there are several studies on its auditory equivalent, the auditory steadystate response (ASSR) (McFadden et al. 2014;Legget et al. 2017;Hirano et al. 2020). Interestingly, ASSRs have been shown to be abnormal in patients with autism (Wilson et al. 2007) and schizophrenia (Brenner et al. 2003). Hence, the aforementioned reliability studies have validated the use of ASSRs as a neuropsychiatric biomarker, including those employing MEG recordings (Tan et al. 2015). A recent study further showed ASSRs to be reliable in schizophrenia patients (Roach et al. 2019).

Subject Distinctness
Näpf lin et al. (2007) used alpha power, center frequency, and spectral correlation (in [2-32 Hz]) for data collected under eyes closed and eyes open paradigm to separate out same individual pair from interindividual pair using a Generalized Linear Model (GLM) and reported 88% sensitivity or probability of detection with 40 subjects. In their study, GLM was fitted for each person separately. They also reported remarkable performance by the spectral shape feature. Another study used short time variability of the EEG spectral pattern for subject recognition using automatic speaker identification method (Stassen 1980). This study used supervised clustering with a cluster for each subject. We instead formulated a classifier over all the subjects and measured performance using AUC. Kondacs and Szabo (1999) found lower intraindividual than interindividual variability mainly in alpha power (7.5-12.5 Hz) and alpha mean frequency in eyes closed condition. We chose spectral shape feature (spectral correlation) (Fig. 7a) and alpha power during resting state ( Supplementary Fig. S4). Computation of alpha mean frequency was not informative in our case since the spectral resolution was 2 Hz (due to analysis duration of 0.5 s), giving rise to only 3 bins in alpha range (8-12 Hz). Hence, we did not consider alpha mean frequency feature in this study.

Dominance of Spectral Profile and Slow Gamma Temporal Correlation and in Classification
High performance of spectral profile feature is not unexpected, since this feature captures the changes in power at all frequencies including alpha and gamma bands, as well as features such as the shape of the PSD, and has been previously shown to perform well in classification tasks (Näpflin et al. 2007). More surprising was the high performance of the temporal profile of slow gamma. As shown in Figure 3, first column, slow gamma temporal profile was more variable across subjects than other measures such as fast gamma. We have previously shown that unlike fast gamma which is induced soon after stimulus onset and then is subsequently maintained throughout the stimulus duration, slow gamma builds up slowly over time (Murty et al. 2018). On the other hand, the transient broadband component is high early after stimulus onset (Figs 1 and 2). Depending on the ratio of transient and steady-state components in the temporal profile, slow gamma temporal profile showed higher variability across subjects than fast gamma, which generally was highest after stimulus onset and then decayed (Fig. 3).
Another reason could be the nature of slow gamma itself. Slow gamma has larger functional spread, that is, larger coherent neural population in comparison with fast gamma, as shown in monkey LFP (Murty et al. 2018). Such large-scale synchrony might contribute to its relatively higher reliability. The dominance of slow gamma over fast gamma also has important practical considerations. In EEG, higher frequencies (>30 Hz) often get corrupted by a high noise floor (as observed in a few cases in our data as well; see Supplementary Fig. S3). In addition, the line noise at 50 or 60 Hz is within the fast gamma range. These factors impede accurate estimation of fast gamma in EEG (which could have contributed to a lower performance in our data as well). Slow gamma does not suffer from these practical limitations. Therefore, using static, full field gratings that induce strong slow gamma could be a convenient way to obtain a richer spectral signature that is also more reliable with age than fast gamma (Murty et al. 2020) and is equally compromised in AD as fast gamma (Murty et al. 2021).

Factors that Influence Reliability
The reliability of gamma oscillations could depend on the stability of inhibitory interneuron networks that are thought to generate these rhythms (Buzsáki and Wang 2012). In addition, oscillations recorded on the scalp could be influenced by various subject dependent parameters like head geometry, scalp conductivity, and other low-level factors. Further, some studies have shown cortical thickness to be correlated with gamma peak frequency, but not with power values (Muthukumaraswamy et al. 2010;Gaetz et al. 2012). Auditory steady-state activity (ASSR) at 40 Hz was also shown to be positively associated with left superior temporal gyrus cortical thickness in healthy controls (Edgar et al. 2014). Since these external factors may also change with time, the reliability of the recorded signal ref lects both the stability of the sources and these confounding factors. Some of these factors such as cortical folding (Schultz et al. 2013), cortical thickness (Gaetz et al. 2012), and surface area (Herting et al. 2015) can be estimated using subjectspecific magnetic resonance imaging (MRI). Moreover, reliability estimates could be improved by increasing the signal to noise ratio by using accurate head models allowing us to go from the sensor to source space (Hoogenboom et al. 2006;Muthukumaraswamy et al. 2010;Martín-Buro et al. 2016;Tan et al. 2016).

Supplementary Material
Supplementary material can be found at Cerebral Cortex Communications online.