Analysis of Rare, Exonic Variation amongst Subjects with Autism Spectrum Disorders and Population Controls

We report results from whole-exome sequencing (WES) of 1,039 subjects diagnosed with autism spectrum disorders (ASD) and 870 controls selected from the NIMH repository to be of similar ancestry to cases. The WES data came from two centers using different methods to produce sequence and to call variants from it. Therefore, an initial goal was to ensure that the distribution of rare variation was similar for data from the different centers. This proved straightforward: we filtered called variants by fraction of missing data, read depth, and balance of alternative to reference reads. Results were evaluated using seven samples sequenced at both centers and by results from the association study. Next we addressed how the data and/or results from the centers should be combined. Gene-based analysis of association was an obvious choice, but should statistics for association be combined across centers (meta-analysis), or should the data be combined and then analyzed (mega-analysis)? Because of the nature of many gene-based tests, we showed by theory and simulations that mega-analysis has better power than meta-analysis. Finally, before analyzing the data for association, we explored the impact of population structure on rare variant analysis in these data. Like other recent studies, we found evidence that population structure can confound case-control studies through the clustering of rare variants in ancestry space; yet, unlike some recent studies, for these data we found that principal component-based analyses were sufficient to control for ancestry and produce test statistics with appropriate distributions. After using a variety of gene-based tests and both meta- and mega-analysis, we found no new risk genes for ASD in this sample. Our results suggest that standard gene-based tests will require much larger samples of cases and controls before being effective for gene discovery, even for a disorder like ASD.

Picard, and reads were mapped to hg18 using Bfast [?]. Quality score recalibration and indel realignment were performed using GATK, followed by single-nucleotide variant (SNV) calling with the AtlasSNP 2 software [?].
Prior to filtering, we compared the minor allele frequency (MAF) of SNVs for the Baylor and Broad data sets (Fig. S6). The two platforms differ in read depth (Fig. S7).
A model for filtering SNVs was developed to improve data quality and to facilitate combining data from the two platforms. Misclassifying a common homozygote as a heterozygote can substantially reduce the power of rare-variant association tests [?]. In our data, some heterozygous calls were based on very low and unbalanced read depth; filtering out these low-quality heterozygous calls should improve power.
To create a suitable filter based on the data, we chose the 104 trios from the Broad data set [?].
Prior to filtering, we found 2,272 potential de novo mutations among the 104 trios; this greatly exceeds expectation, so most of these must be false calls. Because low-quality parental genotype calls can lead to false de novo discoveries, we first removed candidates likely due to errors at this level. If a parent is truly heterozygous but his or her genotype is called common homozygous, a false-positive de novo is created at that position. We therefore applied a simple filter to delete de novo candidates likely tracing to low-quality homozygous calls in the parents: the GQ score must be larger than 30 and the balance must be larger than 95%. This filter reduced the number of potential de novos from 2,272 to 522.
Based on a parsimonious filter for complete trios, Neale et al. [?] greatly reduced this list and identified 87 true de novos, which were confirmed by Sanger sequencing. We labeled the remaining 435 heterozygote calls as false. We then used data from the ASD probands of these trios to develop a filter, building a prediction model from the quality variables available to us: depth, balance, and genotype quality (GQ). Using classification-tree methods, we identified thresholds that separate false minor-allele calls from true ones. The filter selected by 10-fold cross-validation keeps minor-allele calls with depth larger than 17 and balance smaller than 0.66. Its misclassification rate is 0.09, but 4 of the 87 validated de novo variants were missed by this stringent filter. A slightly less stringent filter, depth larger than 10 and balance smaller than 0.75, also gave good classification results, with no false negatives. Table S2 shows the classification results of the stringent and lenient filters.
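As an illustration, the stringent cutoff reduces to a one-line predicate. The thresholds are the cross-validated values quoted above; the call records and the helper name `keep_het_call` are hypothetical, a minimal sketch rather than the pipeline's actual code:

```python
def keep_het_call(depth, balance, min_depth=17, max_balance=0.66):
    """Stringent filter from the classification tree: retain a minor-allele
    (heterozygous) call only when it has adequate read support and the
    alternate/reference balance is not skewed."""
    return depth > min_depth and balance < max_balance

# Hypothetical heterozygous calls with (depth, balance) quality metrics.
calls = [
    {"depth": 25, "balance": 0.50},  # well supported -> keep
    {"depth": 8,  "balance": 0.50},  # too shallow    -> drop
    {"depth": 25, "balance": 0.90},  # too skewed     -> drop
]
kept = [c for c in calls if keep_het_call(**c)]
print(len(kept))  # 1
```

Swapping in `min_depth=10, max_balance=0.75` gives the lenient variant described above.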
We compared the seven samples processed at both sites using 4 varieties of filters plus 2 very minimal filters (Table S1). First we identified all non-synonymous rare variants. Applying the stringent filter described previously (depth η > 17 and balance ξ < .66) to both data sets, we found that Baylor had far too few heterozygous calls and that the concordance rate across sites was greatly reduced, presumably because of the lower depth of the Baylor reads overall. Clearly a less stringent cutoff for depth and balance was needed to obtain comparable call rates; however, we did not want a cutoff that produced a high rate of false heterozygous calls in Baylor relative to the filtered Broad calls. We considered 4 filters to select a cutoff that produced a high concordance rate with a low rate of potential false positives in Baylor versus Broad. A Baylor filter of depth > 10 and balance < .85 achieved this goal, reducing the mismatch rate to .017%, with only 9 heterozygotes called in Baylor but not in Broad.
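The cross-site bookkeeping behind these rates amounts to set operations on heterozygous calls at shared positions. A sketch of the concordant and Baylor-only counts described above, with hypothetical site identifiers:

```python
def het_concordance(broad, baylor):
    """Compare heterozygous-call sets from two centers.
    Returns (concordant, baylor_only): calls made at both centers, and
    calls made by Baylor but absent from the filtered Broad set."""
    concordant = broad & baylor
    baylor_only = baylor - broad
    return len(concordant), len(baylor_only)

# Hypothetical variant positions called heterozygous at each center.
broad_hets = {"chr1:1000", "chr2:500", "chr3:42"}
baylor_hets = {"chr1:1000", "chr3:42", "chr7:99"}
print(het_concordance(broad_hets, baylor_hets))  # (2, 1)
```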

B. Mathematical exposition of mega- and meta-analysis
To obtain equation (Eqn. 4), note that Q = Σ_{j=1}^{d} S_j², where S_j is the score of the jth variant. The jth score can be expressed as S_j = Σ_i T_{ij} Ỹ_i. Under the null hypothesis, β_j = 0 for all j, and therefore Q ∼ χ²_d; under the alternative hypothesis, Q follows a non-central χ²_d(δ) distribution with non-centrality parameter δ.
To obtain equation (Eqn. 5), note that the score test statistic can be written as Q = Σ_j (Σ_i T_{ij} Ỹ_i)². Under the null hypothesis, by the Lindeberg central limit theorem (CLT), if T_j is not dominated by a few elements, then each Σ_i T_{ij} Ỹ_i is approximately an independent standard normal variable, so Q ∼ χ²_d. Under the alternative hypothesis, by the Lindeberg CLT, S_j is approximately normal with non-zero mean, so Q is approximately non-central χ².
Next we seek an expression for the power of meta- and mega-analysis. For the meta-analysis, we first calculate the p-values Pr_1 and Pr_2 of Q_1 and Q_2 under the null distribution; because Q_1 + Q_2 ∼ χ²_{2d} under the null hypothesis, this sum can be used as the test statistic of the meta-analysis. Since the pdf of χ²_d is complicated, we use a normal approximation due to R. A. Fisher: √(2χ²_d) is approximately N(√(2d − 1), 1). Therefore Pr_1 and Pr_2 can be rewritten in terms of standard normal tail probabilities, and we obtain two approximate statistics for meta- and mega-analysis. Meta: √(2(Q_1 + Q_2)) − √(4d − 1); Mega: √(2Q_t) − √(2d − 1). At type I error level α, the power of meta- and mega-analysis follows by integrating the corresponding densities beyond the rejection threshold, where q_{1−α} is the 1 − α quantile of the standard normal distribution, and f_{Q_t}, f_{Q_g} are the densities of Q_t and Q_g.
Because the pdf of a non-central chi distribution is very complicated, we apply another approximation due to [?], where b = δ/(d + δ).
Let c = δ/d and substitute it into this approximation; the resulting mean and variance are functions of (2c + 1)/(c + 1)² and (c + 1)d, where x follows a standard normal distribution. We thereby obtain the power of meta- and mega-analysis at level α. Some insight can be gained by noting that Q_t is well approximated by Q_1 + Q_2; hence the meta statistic is approximately √(2Q_t) − √(4d − 1), smaller than the mega statistic √(2Q_t) − √(2d − 1), consistent with the greater power of mega-analysis.
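The power comparison can be checked with a small Monte Carlo sketch (not the paper's simulation code). We assume two equal-size studies whose gene statistics Q_1, Q_2 are non-central χ²_d with non-centrality δ/2 each, while the pooled (mega) statistic Q_t is non-central χ²_d with non-centrality δ; the meta-analysis tests Q_1 + Q_2 against a χ²_{2d} null, the mega-analysis tests Q_t against a χ²_d null:

```python
import numpy as np

def empirical_power(d=10, delta=10.0, alpha=0.05, n=50_000, seed=1):
    """Monte Carlo power of meta- vs mega-analysis for a d-variant
    score statistic split across two equal studies."""
    rng = np.random.default_rng(seed)
    # Empirical null critical values: chi2_d for mega, chi2_{2d} for meta.
    crit_mega = np.quantile(rng.chisquare(d, n), 1 - alpha)
    crit_meta = np.quantile(rng.chisquare(2 * d, n), 1 - alpha)
    # Alternative: each study carries non-centrality delta/2; pooling the
    # data keeps the full non-centrality delta but only d degrees of freedom.
    q1 = rng.noncentral_chisquare(d, delta / 2, n)
    q2 = rng.noncentral_chisquare(d, delta / 2, n)
    qt = rng.noncentral_chisquare(d, delta, n)
    meta_power = np.mean(q1 + q2 > crit_meta)
    mega_power = np.mean(qt > crit_mega)
    return meta_power, mega_power

meta, mega = empirical_power()
print(f"meta power ~ {meta:.3f}, mega power ~ {mega:.3f}")
```

The mega test concentrates the same signal in half the degrees of freedom, so its power is uniformly higher in this setup.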
When merging data files from separate sources, a substantial fraction of the observations can be missing because of the filtering process. However, due to the nature of the test statistic, missing values have a negligible effect on inference. This is best explained in the context of the C-alpha test, which is equivalent to the SKAT test under certain conditions. Recall that if a variant is observed z times in cases and m − z times in controls, the C-alpha statistic compares the distribution of (z, m − z) to a binomial distribution with probability of success equal to the fraction of cases in the sample. The test then accumulates this information across variants in the gene. Missing data would have no effect on the test statistic provided missingness were equally distributed over cases and controls. In our analysis, missingness varied by site (Baylor and Broad) and by case/control status (Table 3). In practice, SKAT imputes missing values using single imputation. Because our analysis is restricted to variants with MAF < .01, this is essentially equivalent to imputing missing values with the reference allele. We compared results obtained with and without imputation and found that imputation had a negligible effect on the outcome.
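The accumulation described above can be sketched as follows; the counts are illustrative and this shows only the binomial comparison at the heart of the C-alpha statistic, not a full implementation with its variance normalization:

```python
def c_alpha_statistic(counts, p0):
    """C-alpha numerator: for a variant seen z times in cases out of m
    carriers total, compare (z, m - z) to Binomial(m, p0), where p0 is the
    fraction of cases in the sample; accumulate the excess over-dispersion
    (z - m*p0)^2 - m*p0*(1 - p0) across variants in the gene."""
    total = 0.0
    for z, m in counts:
        expected = m * p0
        variance = m * p0 * (1 - p0)
        total += (z - expected) ** 2 - variance
    return total

# Variants whose carriers cluster in cases inflate the statistic:
print(c_alpha_statistic([(5, 5), (4, 4), (0, 3)], p0=0.5))  # 9.5
```

A variant missing equally often in cases and controls shrinks z and m proportionally, leaving the binomial comparison essentially unchanged, which is why balanced missingness is harmless here.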
Because p-values are inaccurate for genes with very few variants when numerous covariates are included, we retain only p-values for genes with minor allele counts (MACs) greater than 4. With this filter, we have 12,676 genes in Baylor and 13,119 genes in Broad. For genes clearly not associated with the phenotype (p-values > .5), we found that SKAT tended to report p-values biased downward toward .5, causing an apparent, but uninteresting, inflation in the genomic control (GC) inflation factor.
Hence we calculate the GC factor using the third quartile of {F⁻¹(1 − p-value)}, where F⁻¹ is the inverse CDF of the χ²₁ distribution, rather than the median. Dividing this quantity by its null expectation, which is equal to 1.32, yields a more robust but equivalent measure of the GC factor.
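A sketch of this quartile-based GC factor in plain Python, assuming 1.3233 (the third quartile of χ²₁) as the null expectation, the identity F⁻¹(1 − p) = Φ⁻¹(1 − p/2)², and a simple empirical quantile; `gc_lambda_q3` is a hypothetical helper name:

```python
import random
from statistics import NormalDist

def gc_lambda_q3(pvalues):
    """Genomic-control factor from the third quartile rather than the
    median: map each p-value to a chi-square(1 df) statistic via
    F^-1(1 - p) = Phi^-1(1 - p/2)**2, take the 0.75 empirical quantile,
    and divide by its null expectation (~1.3233)."""
    nd = NormalDist()
    stats = sorted(nd.inv_cdf(1 - p / 2) ** 2 for p in pvalues)
    q3 = stats[int(0.75 * (len(stats) - 1))]
    return q3 / 1.3233

random.seed(0)
uniform_p = [random.random() for _ in range(20_000)]
print(round(gc_lambda_q3(uniform_p), 2))  # ~1 for null (uniform) p-values
```

Because the upper quartile sits away from .5, the downward bias of SKAT p-values toward .5 no longer distorts the estimate.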
For the AASC data analysis, weights are set at √ω_j = Beta(p_j; 1, 25), the Beta(1, 25) density evaluated at the minor allele frequency p_j.
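For illustration, the Beta(1, 25) density weight can be computed directly; this is a sketch of the standard Beta-density weighting, and `skat_beta_weight` is a hypothetical helper name:

```python
from math import gamma

def skat_beta_weight(maf, a=1.0, b=25.0):
    """Beta(a, b) density evaluated at the minor allele frequency: the
    sqrt of the variant weight omega_j. With (a, b) = (1, 25) this
    strongly up-weights rarer variants."""
    const = gamma(a + b) / (gamma(a) * gamma(b))
    return const * maf ** (a - 1) * (1 - maf) ** (b - 1)

print(skat_beta_weight(0.0))   # 25.0: maximal weight as MAF -> 0
print(skat_beta_weight(0.01))  # smaller weight for a more common variant
```

With a = 1 the density reduces to 25(1 − p)²⁴, so the weight decays smoothly from 25 as the MAF grows.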

Supplemental Acknowledgement
The collection of data and biomaterials in one project that participated in the National Institute of