Histone acetylome-wide associations in immune cells from individuals with active Mycobacterium tuberculosis infection

Host cell chromatin changes are thought to play an important role in the pathogenesis of infectious diseases. Here we describe a histone acetylome-wide association study (HAWAS) of an infectious disease, on the basis of genome-wide H3K27 acetylation profiling of peripheral blood granulocytes and monocytes from persons with active Mycobacterium tuberculosis (Mtb) infection and healthy controls. We detected >2,000 differentially acetylated loci in either cell type in a Singapore Chinese discovery cohort (n = 46), which were validated in a subsequent multi-ethnic Singapore cohort (n = 29), as well as a longitudinal cohort from South Africa (n = 26), thus demonstrating that HAWAS can be independently corroborated. Acetylation changes were correlated with differential gene expression. Differential acetylation was enriched near potassium channel genes, including KCNJ15, which modulates apoptosis and promotes Mtb clearance in vitro. We performed histone acetylation quantitative trait locus (haQTL) analysis on the dataset and identified 69 candidate causal variants for immune phenotypes among granulocyte haQTLs and 83 among monocyte haQTLs. Our study provides proof-of-principle for HAWAS to infer mechanisms of host response to pathogens.


SNP Calling. All discovery samples indicated in Supplementary
were used for SNP calling as described previously 1,2 . Following the genome analysis toolkit (GATK) best practices, the BAM file for each ChIP-seq sample was realigned at given indel sites and base qualities were then recalibrated 3 . For each sample, GATK's Haplotype Caller was run in reference confidence model (GVCF) mode, and subsequently multi-sample SNP calling was performed on all samples (granulocytes plus monocytes) using GATK's Genotype GVCFs tool. Low-confidence SNP calls were filtered by applying thresholds to GATK parameter values. We filtered out SNPs if they satisfied any of the following: QD<4.185, Inbreeding Coefficient<-0.398, within a SNP cluster of more than 19 SNPs in a 100bp window, Homopolymer run≥18, and Hardy-Weinberg equilibrium binomial test P-value≤1e-3. We additionally imposed the criterion that a SNP should be supported by at least 5 non-reference reads across all datasets, and at least one of the datasets should have 3 or more non-reference reads. We also discarded SNPs contained within UCSC Genome Browser self-chains blocks with a normalized score>90. Based on this pipeline, we discovered 362,021 SNPs in ChIP-seq peaks, of which 247,411 were polymorphic in the granulocyte dataset and 273,753 in the monocyte dataset.
Genotyping. All samples were genotyped using the Illumina OmniExpress v1.1 Beadarray, which assesses >700,000 representative genetic markers in a genome-wide manner. A subset of SNPs was further validated using the Sequenom MassArray primer extension platform. Both methods have been used extensively by others and us and have been found to result in >99.99% concordance 4 . Genotypes inferred were compared against genotype calls from ChIP-seq reads to identify and correct sample swaps in the ChIP-seq pipeline.
Histone Acetylation QTLs. We used the genotype-independent signal correlation and imbalance (G-SCI) test to call haQTLs in granulocytes and monocytes 1 . Only high-quality samples from the ethnically homogeneous discovery cohort were used in this analysis (46 granulocyte samples, 32 monocyte samples). The G-SCI test does not require prior knowledge of genotypes. Rather, it infers genotype likelihoods from ChIP-seq read counts and then integrates over all possible genotypes to calculate the statistical significance of an haQTL. This test combines information from peak height vs. genotype correlation across samples and allelic imbalance in read count within heterozygous individuals. These two signatures of an haQTL are combined into a single Pvalue. To infer haQTLs independently of disease state, disease status was regressed out from the peak-height matrix used for DA peak calling, and the residual was used for G-SCI analysis 2 . In this study, the G-SCI test procedure was refined in two ways. Firstly, for each SNP k, we replaced the uniform allele frequency prior used in the original study with a SNP-specific allele frequency fk. Thus, for each SNP, we estimated four parameters by maximum-likelihood (fk, a, b, and s) instead of three. Secondly, we modified the method for permutation testing to convert raw P-values to final P-values. The original permutation approach 1 randomized the data by permuting peak heights across samples for the same SNP. In addition, reference and non-reference base calls were randomly flipped with probability 0.5 at heterozygous sites. In the current study, this procedure was generalized and accelerated by randomly swapping data between ChIP-seq units. A unit was defined as the peak height and allelic counts for a SNP in a single individual. Thus, the total number of units would equal the number of SNPs times the number of individuals. Units were then binned in three dimensions according to their peak height, number of reads covering the SNP and fraction of non-reference reads. Units within the same three-dimensional bin were randomly permuted, and the haQTL P-value was then calculated for each SNP. As before, the permutation-test P-value was then calculated based on the rank of the raw P-value relative to the P-values of permuted data. Again, as before, SNPs with an adjusted P-value of 0 after 1 million permutations were assigned an adjusted P-value of 5e-7. FDR correction and filtering for QTL effect size was performed as before 1 .
Overlap between haQTLs and eQTLs. We downloaded the granulocyte eQTLs 5 and obtained the subset that was in tight linkage with SNPs we called within granulocyte ChIP-seq peaks. The genotypes from the European population of 1000 Genomes were used in the LD calculation, with a threshold of R 2 ≥0.8 and maximum distance of 500 Kb. As before, we corrected for LD within the eQTL set by constructing the LD network of the eQTL set and replaced each connected component in the network by the SNP with the best eQTL P-value 1 . This resulted in 1,257 Naranbhai eQTLs used for analysis. We downloaded the naïve monocyte eQTLs 6 and used the same procedure to obtain a final set of 2,216 eQTLs for further analysis. We also downloaded neutrophil and monocyte eQTLs 7 and retained those with an FDR Q-value≤0.05. If a gene had more than one eQTL, the one with the best P-value was kept and the others were discarded. Chen et al. eQTLs were further filtered based on LD as described above, resulting in 3,069 neutrophil and 3,460 monocyte eQTLs for further analysis. Note that the vast majority of granulocytes in peripheral blood are neutrophils. For each SNP from the four eQTL sets, the LD with haQTLs from the corresponding cell type was calculated using the same 1000 Genome population and LD thresholds as above. To calculate the statistical significance of the LD, we randomly chose 1.5 million SNPs as a control set, and used a method as before wherein the likelihood of linkage of the control set and eQTL set with haQTLs was calculated while adjusting for allele frequencies, genomic location biases (distance from nearest TSS of a gene), biases in LD block sizes between the eQTL SNP and control SNP, and correcting for LD within each of the four eQTL sets 1 .
Overlap with published haQTL datasets. We downloaded neutrophil and monocyte H3K27 haQTLs 7 and processed them in a similar manner to eQTLs. First, we filtered for FDR Q-value≤0.05 and retained only the most significant haQTL for each ChIP-seq peak. We then discarded haQTLs that were not in LD with any SNP within our ChIP-seq peaks (LD was defined as R 2 ≥0.8, distance≤500kb). As above, these were further filtered to retain only one haQTL per LD cluster, resulting in 3,625 neutrophil H3K27ac QTLs and 4,357 monocyte H3K27ac QTLs. We then calculated the statistical significance of the linkage of these haQTLs with our own haQTLs using the method described above for eQTLs. Partitioned Heritability. Using the GWAS results from Luo and colleagues 8 , we performed linkage-disequilibrium (LD) score regression 9 to quantify proportion of heritability of both differentially and non-differentially acetylated peaks within peripheral blood monocytes and granulocytes. For all acetylated sites described earlier, we created annotations by extending sites by 500 bp at both ends. We then partitioned heritability by functional category using the full BaselineLD v2.2 model containing 83 functional annotations (using the --h2 flag) and the 1000 Genomes Project Phase 3 East Asian reference haplotypes 10 . We excluded variants in the HLA region (chr6:26 Mb-34 Mb). All peripheral LDSC files used in this analysis are available at (https://data.broadinstitute.org/alkesgroup/LDSCORE/).

Supplementary Tables
Supplementary Table 1. Number of high-quality granulocyte and monocyte ChIP-seq datasets from discovery and validation cohorts remaining after QC. Sizes of corresponding final core sets are indicated in parentheses. Number of consensus ChIP-seq peaks in high-quality datasets. Number of differentially acetylated (DA) peaks in core set.

Legends of Supplementary Tables in Excel File
Supplementary