Multiancestry analysis of the HLA locus in Alzheimer’s and Parkinson’s diseases uncovers a shared adaptive immune response mediated by HLA-DRB1*04 subtypes

Significance
We report that specific HLA-DRB1*04 alleles are protective against Alzheimer’s dementia (AD), Parkinson’s disease (PD), and other neurodegenerative disorders. Further, we found that these HLA (Human Leukocyte Antigen) subtypes selectively bind a piece of Tau crucial to aggregation but only when it is acetylated (a-PHF6). This a-PHF6 piece is significant as it is a common posttranslational modification of Tau found in Alzheimer’s brains. Only when someone is HLA-DRB1*04:04 or HLA-DRB1*04:01 can PHF6 be presented as a T cell epitope to T cell receptors and mount a memory immune response against this pro-aggregation fragment. This immune response would protect against AD, PD, and neurodegeneration, explaining the HLA association. Vaccination with a-PHF6 in HLA-DRB1*04 individuals could have preventive effects.


1) Participants and sources of data
Phenotypic information and genotypes were obtained from publicly released genome-wide association study datasets assembled by the Alzheimer's Disease Genetics Consortium (ADGC) and derived from whole-genome sequencing (WGS) data generated by the Alzheimer's Disease Sequencing Project (ADSP), with phenotype and genotype ascertainment described elsewhere. The accession numbers of the queried cohorts, as well as the sequencing technology or single nucleotide polymorphism (SNP) genotyping platforms, are described in Supplementary Tables 17 and 18. The microarray datasets are largely part of the ADGC and are therefore referred to hereafter as the ADGC.

2) Quality control procedures
Prior to HLA imputation and to ancestry, principal component, and relatedness determination, variants were excluded in each cohort-platform based on genotyping rate (< 95%), MAF < 1%, and deviation from Hardy-Weinberg equilibrium in controls (p < 10⁻⁶) using PLINK v1.9 (1). Information derived from the gnomAD database (2) was used to filter out SNPs that met any of the following exclusion criteria (3,4): (i) located in a low-complexity region, (ii) located within common structural variants (MAF > 1%), (iii) multiallelic SNPs with MAF > 1% for at least two alternate alleles, (iv) located within a common insertion/deletion, (v) having any flag other than PASS in gnomAD v3, (vi) having potential probe polymorphisms. The latter are defined as SNPs for which the probe may have variable affinity due to the presence of other SNP(s) within 20 bp and with MAF > 1%. Individuals with more than 5% genotype missingness were excluded. Duplicate individuals were identified with KING (5), and their clinical, diagnostic, and pathological data (including age at onset of cognitive symptoms, age at examination for clinical diagnosis, age at last exam, and age at death), as well as sex, race, and APOE genotype, were cross-referenced across cohorts. Duplicate entries with irreconcilable phenotypes or discordant sex were flagged for exclusion.
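As an illustration only, the variant-level and sample-level thresholds stated above can be sketched as simple predicates. This is not the PLINK workflow that was actually run; the class and function names (VariantStats, passes_variant_qc, passes_sample_qc) are hypothetical, and the summary statistics are assumed to have been computed beforehand.

```python
from dataclasses import dataclass


@dataclass
class VariantStats:
    """Per-variant summary statistics (hypothetical container)."""
    call_rate: float        # fraction of samples with a genotype call
    maf: float              # minor allele frequency
    hwe_p_controls: float   # Hardy-Weinberg test p-value in controls


def passes_variant_qc(v, min_call_rate=0.95, min_maf=0.01, min_hwe_p=1e-6):
    """Keep a variant unless it fails any of the three stated thresholds."""
    return (
        v.call_rate >= min_call_rate      # exclude genotyping rate < 95%
        and v.maf >= min_maf              # exclude MAF < 1%
        and v.hwe_p_controls >= min_hwe_p  # exclude HWE p < 1e-6 in controls
    )


def passes_sample_qc(missing_rate, max_missing=0.05):
    """Keep an individual unless genotype missingness exceeds 5%."""
    return missing_rate <= max_missing
```

The gnomAD-based exclusions (i) to (vi) are not covered by this sketch because they depend on external annotation.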

3) Ancestry determination
For each cohort, we first determined the ancestry of each individual with SNPWeights v2 (6) using reference populations from the 1000 Genomes Consortium (7). By applying an ancestry percentage cut-off > 75%, the samples were stratified into five super populations (South-Asians, East-Asians, Amerindians, Africans, and Europeans) and an Admixed group composed of individuals not passing the 75% cut-off in any single ancestry (Supplementary Table 18) (3). The analyses were split into three ancestry groups: Europeans, Africans, and Amerindian-Latinos. The first two groups are composed of individuals passing the 75% threshold in their respective ancestry. The Amerindian-Latino group includes individuals in the Amerindians ancestry group (75% cut-off) and individuals in the Admixed group with at least 15% Amerindian ancestry who identified as being of Hispanic/Latino ethnicity. The rationale for including these additional individuals is to compensate for the small size of the Amerindians-only group and to obtain an ancestry composition similar to that of the Latin American Research Consortium on the Genetics of Parkinson's Disease (LARGE-PD, see below). Last, enriching for Amerindian ancestry enables us to assess the effect of HLA-DRB1*04:07, since the HLA haplotype DRB1*04:07~DQA1*03:01~DQB1*03:02 is common in this ancestry group.
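The stratification rules above can be sketched as follows. This is a minimal illustration, assuming that ancestry fractions per individual are already available from SNPWeights; the population labels (SAS, EAS, AMR, AFR, EUR) and function names are hypothetical and are not taken from the original pipeline.

```python
# Hypothetical short labels for the five super populations.
SUPER_POPULATIONS = ("SAS", "EAS", "AMR", "AFR", "EUR")


def assign_super_population(fractions, cutoff=0.75):
    """Return the single ancestry exceeding the cut-off, else 'Admixed'.

    Because the cut-off is above 50%, at most one ancestry can pass it.
    """
    for pop in SUPER_POPULATIONS:
        if fractions.get(pop, 0.0) > cutoff:
            return pop
    return "Admixed"


def assign_analysis_group(fractions, hispanic_latino, cutoff=0.75, amr_min=0.15):
    """Map an individual to one of the three analysis groups, or None."""
    pop = assign_super_population(fractions, cutoff)
    if pop == "EUR":
        return "European"
    if pop == "AFR":
        return "African"
    if pop == "AMR":
        return "Amerindian-Latino"
    # Admixed individuals join the Amerindian-Latino group only when they
    # have at least 15% Amerindian ancestry and report Hispanic/Latino ethnicity.
    if pop == "Admixed" and fractions.get("AMR", 0.0) >= amr_min and hispanic_latino:
        return "Amerindian-Latino"
    return None  # not part of the three analysis groups
```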

4) Imputation
Each cohort-genotyping platform was imputed on the TOPMed imputation server per ancestry group to obtain an imputation quality (R²) per ancestry group. For the local-GWAS at the HLA locus, we retained variants with R² > 0.30, MAF > 1%, and present in 50% of the imputed cohorts. HLA alleles and amino acids were imputed on platform- and ancestry-specific reference panels available through HIBAG (8) or trained in-house as previously described (9). In all allele-level analyses, alleles with an imputation posterior probability lower than 0.5 were considered undetermined, as recommended by the HIBAG developers, and only alleles with a carrier frequency above 1% were retained for analysis. For haplotype-level analyses, only individuals with non-missing allele genotypes were included. Three-locus HLA class I or class II haplotypes were determined using the haplo.em function from the R haplo.stats package. Only haplotypes with a posterior probability > 0.5 and a carrier frequency > 1% were included in the analysis. In the amino-acid-level analyses, HIBAG (8) was used to convert P-coded alleles to amino acid sequences for exons 1-3 of class II genes.
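The allele-level filters described above (posterior probability and carrier frequency) can be illustrated with the short sketch below. It assumes that each imputed call is available as a triple of two alleles and one posterior probability, and that carrier frequency is computed among individuals with a determined genotype; both the data layout and the function names are assumptions made for illustration, not the HIBAG interface.

```python
from collections import Counter


def mask_low_confidence(calls, min_prob=0.5):
    """Set calls with posterior probability below 0.5 to None (undetermined).

    calls: list of (allele1, allele2, posterior_probability) tuples.
    """
    return [(a1, a2) if prob >= min_prob else None for a1, a2, prob in calls]


def carrier_frequencies(genotypes):
    """Fraction of determined individuals carrying at least one copy of each allele."""
    determined = [g for g in genotypes if g is not None]
    counts = Counter()
    for g in determined:
        for allele in set(g):  # homozygotes count once as carriers
            counts[allele] += 1
    n = len(determined)
    return {allele: c / n for allele, c in counts.items()} if n else {}


def retained_alleles(genotypes, min_carrier_freq=0.01):
    """Alleles whose carrier frequency is above 1%."""
    freqs = carrier_frequencies(genotypes)
    return {allele for allele, f in freqs.items() if f > min_carrier_freq}
```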

5) Samples retained for analysis
Supplementary Table 19 describes the demographics of individuals retained for analysis. Analyses were run in six groups, separating WGS data from TOPMed-imputed data and stratifying by ancestry group: ADSP-European, ADSP-African, ADSP-Amerindian-Latino, ADGC-European, ADGC-African, ADGC-Amerindian-Latino.

6) Statistical analyses
In the following paragraph, a variable refers interchangeably to a variant in the local-GWAS at the HLA locus, an HLA allele, an HLA haplotype, or an HLA amino acid. The AD risk associated with each variable was estimated using a linear mixed model regression on case-control diagnosis. The HLA allele-, haplotype-, and amino-acid-level analyses were run under a dominant model (phenotype frequency, collapsing homozygotes for the minor frequency variable with heterozygotes). All statistical analyses were performed in R (v4.0.2) and adjusted for sex, six genetic principal components estimated with the PC-AiR method (10) implemented in GENESIS (11), and covaried by a sparse genetic relationship matrix estimated with the PC-Relate method (12) implemented in GENESIS. Case-control analyses were not adjusted for age given that controls were older than cases in some subgroups. Correcting for age when cases are younger than controls leads the model to incorrectly infer the effect of age on AD risk, resulting in a loss of statistical power (3).
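The dominant coding used for the allele-, haplotype-, and amino-acid-level analyses can be sketched as below. This only illustrates how homozygotes and heterozygotes collapse into a single carrier category and how a phenotype (carrier) frequency is obtained by diagnosis; it does not reproduce the mixed-model regression run in GENESIS, and the function names are hypothetical.

```python
def dominant_code(genotype, target):
    """Return 1 for carriers of `target` (one or two copies), 0 for
    non-carriers, and None when the genotype is undetermined."""
    if genotype is None:
        return None
    return 1 if target in genotype else 0


def carrier_frequency_by_status(genotypes, is_case, target):
    """Carrier (phenotype) frequency of `target` in cases and in controls.

    Returns a dict keyed by case status (True for cases, False for controls).
    """
    counts = {True: [0, 0], False: [0, 0]}  # status -> [carriers, total]
    for genotype, case in zip(genotypes, is_case):
        code = dominant_code(genotype, target)
        if code is None:
            continue  # undetermined genotypes do not contribute
        counts[bool(case)][0] += code
        counts[bool(case)][1] += 1
    return {
        status: (carriers / total if total else float("nan"))
        for status, (carriers, total) in counts.items()
    }
```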

1) Participants, quality control and variant imputation
The UK Biobank data include 488,377 participants who were genotyped on SNP microarrays and imputed at high resolution using two reference panels: (i) the Haplotype Reference Consortium (HRC) for most variants with minor allele frequency > 0.001 and (ii) the UK10K+1000Genomes panel for variants not in the HRC panel (13). The quality control prior to imputation has been extensively described in Bycroft et al. (13). We used the proxy-AD phenotype defined in Bellenguez et al. (14) (i.e., cases are individuals who have an ICD-10 code linked to AD in their medical record (15) or who reported a first-degree relative with Alzheimer's disease; March 2021 release). We restricted our analysis to 388,051 unrelated individuals after pruning for 3rd-degree relatedness, using the following criteria to rank individuals for removal: (i) highest number of relatives, (ii) not a proxy-AD case, and (iii) youngest individual.
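One way to apply the three ranking criteria is a greedy procedure that repeatedly removes the top-ranked individual until no related pair remains. The sketch below is an assumption about how such pruning could be implemented, not a description of the code actually used; in particular, recomputing the number of relatives after each removal and breaking remaining ties by identifier are choices made here for illustration.

```python
def prune_related(pairs, is_case, age):
    """Return the set of individuals to remove so that no related pair remains.

    pairs:   iterable of (id_a, id_b) related pairs (up to 3rd degree)
    is_case: dict id -> bool (proxy-AD case status)
    age:     dict id -> age
    """
    relatives = {}
    for a, b in pairs:
        relatives.setdefault(a, set()).add(b)
        relatives.setdefault(b, set()).add(a)

    removed = set()
    while any(relatives.values()):
        # Rank: most relatives first, then non-cases, then youngest.
        candidate = min(
            (i for i, rel in relatives.items() if rel),
            key=lambda i: (-len(relatives[i]), is_case[i], age[i], i),
        )
        for other in relatives.pop(candidate):
            relatives[other].discard(candidate)
        removed.add(candidate)
    return removed
```

For example, in a trio where B is related to both A and C, only B is removed, which keeps two unrelated individuals instead of one.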

2) Ancestry determination
Unrelated individuals of the UK Biobank were split into two groups: British and non-British/other ancestries. The British ancestry group corresponds to individuals who self-identified as white British and who clustered together in the principal component analysis of ancestry performed in Bycroft et al. (13) (field ID: 22006). The British ancestry group was composed of 52,426 proxy-AD cases and 272,624 controls. The non-British/other ancestries group was composed of 7,840 proxy-AD cases and 55,161 controls. This last group was heterogeneous in terms of ancestral origin, but most individuals identified as non-British European.

3) HLA Imputation
HLA alleles and amino acids were imputed on platform- and ancestry-specific reference panels available through HIBAG (8) or trained in-house as previously described (9). In allele-level analyses, alleles with an imputation posterior probability lower than 0.5 were considered undetermined, as recommended by the HIBAG developers, and only alleles with a carrier frequency above 1% were retained for analysis. In the haplotype-level analyses, only individuals with non-missing allele genotypes were included. Three-locus HLA class I or class II haplotypes were determined using the haplo.em function from the R haplo.stats package. Only haplotypes with a posterior probability > 0.5 and a carrier frequency > 1% were included in the analysis. In the amino-acid-level analyses, HIBAG (8) was used to convert P-coded alleles to amino acid sequences for exons 1-3 of class II genes.

4) Statistical analyses
In the following paragraph, a variable refers interchangeably to a variant in the local-GWAS at the HLA locus, an HLA allele, an HLA haplotype, or a specific HLA amino acid. HLA allele-, haplotype-, and amino-acid-level analyses were run under a dominant model (phenotype frequency, collapsing homozygotes for the minor frequency variable with heterozygotes). Proxy-AD associations were tested with plink2 (v2.00a2LM) using the --glm flag, covarying for age at last visit, sex, genotyping array, assessment center, and the first 20 PCs provided by the UK Biobank.

Alzheimer's Disease - NCGG dataset
The National Center for Geriatrics and Gerontology (NCGG) Biobank was established as a geriatric hospital-based biobank in 2012. The NCGG Biobank is one of the facilities belonging to the National Center Biobank Network. The NCGG Biobank cohort of the study consisted of 2974 patients with LOAD (female, 64%; mean age, 78.0) and 3096 controls (female, 53%; mean age, 71.1) who were recruited from the NCGG Biobank. All subjects were of Japanese origin. Genotyping data were downloaded from the NCGG Biobank database. All subjects were genotyped using the Affymetrix Japonica Array. Demographics, quality control, and GWAS analysis are fully described in Shigemizu et al. (16). The HLA analyses were conducted with plink2 (v2.00a2LM) using the --glm flag; the covariates used per cohort are described in Shigemizu et al. (16).

Alzheimer's Disease - GARD dataset
Phenotypic information and genotypes were obtained from the Gwangju Alzheimer's & Related Dementias (GARD) cohort database portal (http://gard.nrcd.re.kr:8080/), with phenotype and genotype ascertainment, as well as ethical review, described elsewhere. Briefly, all cases were LOAD, fulfilled the NINCDS-ADRDA criteria, and met the pathological criteria (amyloid-beta PET scan). Genotyping was conducted on blood specimens using the Korea Biobank Array, a microarray platform customized for Koreans. Demographics, quality control, and GWAS analysis are fully described in Kang et al. (17) and summarized in Supplementary Table 21. The HLA analyses were conducted with plink2 (v2.00a2LM) using the --glm flag; the covariates used per cohort are described in Kang et al. (17).

Alzheimer's Disease - JGSCAD dataset
Demographics, quality control, and GWAS analysis are fully described in Miyashita et al. (18) and summarized in Supplementary Table 21. The HLA analyses were conducted with plink2 (v2.00a2LM) using the --glm flag; the covariates used per cohort are described in Miyashita et al. (18).

1) Participants and sources of data
Participants were enrolled and followed up at one of the Alzheimer's Disease Centers (ADCs) across the US. Genetic data were obtained from the Rush Religious Orders Study and Memory and Aging Project (ROSMAP) (19) and from the Alzheimer's Disease Center (ADC) cohorts 1 to 7, which are part of the ADGC (20) (see Supplementary Table 17 for data accession numbers). ROSMAP samples were assessed by the Rush ADC, and their neuropathological assessment followed the procedures described in Schneider et al. (21). Neuropathological assessment for samples with genotyping from the ADGC was obtained from the National Alzheimer's Coordinating Center (NACC) and followed the postmortem evaluation protocol (22).

2) Quality control procedures, ancestry determination, and imputation
The content of this section is identical to the corresponding sections in "Alzheimer's Disease - ADSP & ADGC datasets", given that these samples were included in the association analysis with AD status.

3) Samples retained for analysis
Supplementary Table 22 describes the demographics of individuals retained for neuropathology analyses: Tau Braak staging and neuritic plaque density. We also defined three categories, AD pathology only, Lewy body (LB) pathology only, and dual pathology (AD and LB), and compared these against controls without AD and LB pathologies. The schematic below describes these categories and follows the classification defined in Tsuang et al. (23).
Supplementary Table 23 provides the demographics and number of individuals per category.

4) Statistical analyses
The statistical analyses follow the method described in the corresponding section of "Alzheimer's Disease - ADSP & ADGC datasets".

Alzheimer's Disease Cerebrospinal Fluid - EADB and Swedish datasets
1) Participants and sources of data
EADB participants (as described above) for whom cerebrospinal fluid (CSF) amyloid beta and/or (phosphorylated) tau measurements were available were included. The Swedish cohorts originate from the Gothenburg H70 Birth Cohort Studies and from clinical AD samples from Sweden, all gathered and analyzed in Gothenburg. Genetic data for the EADB cohorts have been processed using a consistent approach (14), in which the Illumina Infinium Global Screening Array (GSA, GSAsharedCUSTOM_24+v1.0) was predominantly used, in addition to the Axiom 815K Spanish biobank array (Thermo Fisher). The genetic data for the Swedish cohorts were generated with the Illumina Neurochip array.

2) Quality control procedures and imputation
Quality control procedures for the EADB datasets are described in Bellenguez et al. (14). For the Swedish datasets, QC and imputation procedures are described elsewhere (24). In short, low-quality variants were excluded based on call rate, minor allele frequency (MAF < 0.01), and Hardy-Weinberg disequilibrium (P < 1 × 10⁻⁶). Individuals were removed based on per-sample call rate, sex mismatch, excessive heterozygosity, or non-European ancestry. The Sanger imputation service was used to impute the post-QC data, using the Haplotype Reference Consortium reference panel (HRC1.1). The UCSC LiftOver program (https://genome-store.ucsc.edu/) and PLINK v2.0 (www.cog-genomics.org/plink/2.0/) were used to lift the GRCh37 genomic positions to GRCh38, the genomic build for all other datasets.

3) Samples retained for analysis
Supplementary Table 24 describes the demographics of individuals retained for analysis. The association analyses with HLA haplotypes, alleles, and amino acids were performed only for individuals for whom genotype-level data (rather than GWAS summary statistics) were available. For rs601945 association analyses, all cohorts were included.

4) Statistical analyses
For HLA-locus, -allele, -haplotype, and -amino-acid association analyses, similar procedures were used. For the continuous phenotypes Aβ42, tau, and pTau, linear regression was performed within each cohort using PLINK v2.0. Association tests were adjusted for gender, age, assay type (if applicable), and ten ancestry principal components. METAL was used for meta-analysis of the per-cohort association results, applying the default approach that uses the p-value and direction of effect, weighted according to sample size. Association analyses were repeated for subgroups stratified by diagnosis status, resulting in one group including only AD subjects and one including individuals with no or mild cognitive impairment. Covariates were those described for the main analyses above.
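For reference, the sample-size-weighted scheme combines per-cohort results by converting each p-value and direction of effect into a signed Z-score, weighting it by the square root of the cohort sample size, and normalizing by the square root of the sum of squared weights. The sketch below illustrates that calculation with the Python standard library; it is not METAL itself, and the function name and input layout are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # standard normal distribution


def sample_size_weighted_meta(results):
    """Combine per-cohort results into a meta-analysis Z-score and p-value.

    results: list of (p_value, direction, sample_size), where p_value is
             two-sided (0 < p <= 1) and direction is +1 or -1.
    """
    numerator = 0.0
    sum_sq_weights = 0.0
    for p_value, direction, n in results:
        # Signed Z-score from a two-sided p-value; using the lower tail
        # avoids precision loss for very small p-values.
        z = -_STD_NORMAL.inv_cdf(p_value / 2.0) * direction
        weight = sqrt(n)
        numerator += weight * z
        sum_sq_weights += weight * weight
    z_meta = numerator / sqrt(sum_sq_weights)
    p_meta = 2.0 * _STD_NORMAL.cdf(-abs(z_meta))
    return z_meta, p_meta
```

With a single cohort the combined p-value equals the input p-value, and two equally sized cohorts with identical p-values but opposite directions cancel to Z = 0.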

Parkinson's Disease - IPDGC, McGill, NINDS, NGRC, Oslo, PPMI, APDGC, UK Biobank datasets
Demographics, phenotyping, quality control, imputation, and analysis of the European-ancestry cohorts included in the local-GWAS at HLA summary statistics have been extensively described in Nalls et al. (25). Similarly, the phenotyping, quality control, and HLA imputation of the PD cohorts used in the HLA allele-, haplotype-, and amino-acid-level analyses were previously described in Yu et al. (9). Demographics are presented in Supplementary Table 25.

Parkinson's Disease - EastAsians-PD and 23andMe datasets
For the EastAsians-PD and 23andMe cohorts, HLA allele, haplotype, and amino acid statistics were derived from GWAS summary statistics using the DISH software (26) as described in Naito et al. (27). Demographics, quality control, and GWAS analysis were previously described (27-29), and available demographics are reported in Supplementary Table 26.

Parkinson's Disease - LARGE-PD dataset
Demographics, quality control and GWAS analysis are fully described in Loesch et al. (30) and demographics are also shown in Supplementary Table 26.