Amino acid signatures of HLA Class-I and II molecules are strongly associated with SLE susceptibility and autoantibody production in Eastern Asians

Human leukocyte antigen (HLA) is a key genetic factor conferring risk of systemic lupus erythematosus (SLE), but precise independent localization of HLA effects is extremely challenging. As a result, the contribution of specific HLA alleles and amino-acid residues to the overall risk of SLE and to risk of specific autoantibodies are far from completely understood. Here, we dissected (a) overall SLE association signals across HLA, (b) HLA-peptide interaction, and (c) residue-autoantibody association. Classical alleles, SNPs, and amino-acid residues of eight HLA genes were imputed across 4,915 SLE cases and 13,513 controls from Eastern Asia. We performed association followed by conditional analysis across HLA, assessing both overall SLE risk and risk of autoantibody production. DR15 alleles HLA-DRB1*15:01 (P = 1.4x10-27, odds ratio (OR) = 1.57) and HLA-DQB1*06:02 (P = 7.4x10-23, OR = 1.55) formed the most significant haplotype (OR = 2.33). Conditioned protein-residue signals were stronger than allele signals and mapped predominantly to HLA-DRB1 residue 13 (P = 2.2x10-75) and its proxy position 11 (P = 1.1x10-67), followed by HLA-DRB1-37 (P = 4.5x10-24). After conditioning on HLA-DRB1, novel associations at HLA-A-70 (P = 1.4x10-8), HLA-DPB1-35 (P = 9.0x10-16), HLA-DQB1-37 (P = 2.7x10-14), and HLA-B-9 (P = 6.5x10-15) emerged. Together, these seven residues increased the proportion of explained heritability due to HLA to 2.6%. Risk residues for both overall disease and hallmark autoantibodies (i.e., nRNP: DRB1-11, P = 2.0x10-14; DRB1-13, P = 2.9x10-13; DRB1-30, P = 3.9x10-14) localized to the peptide-binding groove of HLA-DRB1. Enrichment for specific amino-acid characteristics in the peptide-binding groove correlated with overall SLE risk and with autoantibody presence. Risk residues were in primarily negatively charged side-chains, in contrast with rheumatoid arthritis. We identified novel SLE signals in HLA Class I loci (HLA-A, HLA-B), and localized primary Class II signals to five residues in HLA-DRB1, HLA-DPB1, and HLA-DQB1. These findings provide insights about the mechanisms by which the risk residues interact with each other to produce autoantibodies and are involved in SLE pathophysiology.

Stepwise conditional analysis adjusting for the entire effect of each amino-acid residue and classical allele at each of HLA-DRB1, HLA Of note is intergenic SNP rs2860580 (close to the 5' of HLA-A), highly correlated with HLA-A-70 (r 2 = 0.90), and not significantly stronger than the residue. The effect of this SNP did not replicate across cohorts (S5 Table). Also of note are three HLA-B residues (67, 80 and 81) that passed our conditional association threshold after conditional analysis of all alleles but did not pass the unconditioned association threshold (S2l Fig), and did not replicate.

Accumulation of risk residues
We observed a significant effect of accumulation of risk residues (P = 8.8x10 -5 ) between cases and controls (Fig 2), in both discovery (P discovery = 4.1x10 -5 ) and replication (P replication = 1.4x10 -3 ) cohorts, and complementary to the accumulation of protective residues in controls/ cases (S3 Fig, S7 Table). Risk odds ratios increased linearly, suggesting a model of additive risk. We found no significant difference in overall risk prediction between models using odds-ratio weighted versus unweighted risk residue counts (P = 0.45).

Expression analysis
To assess the impact of each of the seven proposed independent residues on expression of their respective genes, we performed eQTL analysis, followed by conditional analysis of publicly available data (European samples from 1000Genomes). We observed that several SNPs, residues and classical alleles were eQTLs for their respective gene. Following our study framework, conditioning the expression of each gene on its respective classical alleles reduced the signal below the genome-wide significance threshold (P = 5x10 -8 ). Interestingly, conditioning on their respective residues did not explain gene expression fully except for DRB1. Conditioning for the independent residues (DRB1-11+13+37) explained all of DRB1 expression (S4 Fig).

Risk residue signatures
SLE risk residue positions (Table 2) mapped overwhelmingly to the peptide-binding groove for each of the eight proteins. Furthermore, risk positions predominantly interacted with the peptide C-terminus, specifically "anchor" positions [21] p4/6 (7/20 most-risk positions over the six major subunits) and p7/9 (9/20). Most risk residues occurred at identical positions among multiple subunits, indicating that residues promoting peptide binding and T-cell receptor activation tend to cluster at several key locations in the interface. peptide C-terminus (S6n Fig). At these positions, the most-risk amino acids tended to be large, hydrophobic or negatively charged (Table 2); only most-risk DRB1-13R (in complete LD with DRB1-11P) was positively charged.

Contrasting residues between SLE and rheumatoid arthritis
Specific autoimmune diseases have different features of their autoantigens and thus peptides recognized by the MHC that eventually lead to autoantibody production. For comparison, we also looked at rheumatoid arthritis (RA), where HLA is the predominant risk locus (particularly HLA-DRB1), but with different risk alleles and residues 29 . Principal SLE and RA autoantigens were compiled and characterized (S9 Table). SLE autoantigens are overwhelmingly positively charged (S9a Table), whereas RA autoantigens are negatively charged (S9b Table). A hallmark of RA is the presence of anti-citrullinated protein antibodies (ACPA) [26]. Citrullination deiminates arginine residues (removing their positive charge), leaving the sidechain neutral; poly-citrullinated proteins become very negatively-charged (S9b Table).
Based on these very large differences in autoantigen charge, we developed a simple model to predict the interaction of MHC binding-groove side-chains with largely positivelycharged SLE autoantigens and with largely negatively-charged citrullinated RA autoantigens (S1 Text). This model computationally divided the 20 naturally occurring amino acids into: those likely to interact strongly with SLE autoantigens, those likely to interact weakly, and those that are neutral (S10a Table). A similar analysis was performed for RA autoantigens (S10b Table). Importantly the model explicitly convolves peptide-MHC-T-cell receptor tripartite interactions, as it is based on experimental data of immunogenic/non-immunogenic peptides [27]. The model largely contained the set of observed SLE risk amino acids (Table 2), and prominently represented negatively-charged MHC side-chains. Meanwhile, for RA, the model predicted positively-charged MHC side-chains (among others) as risk. In support of these RA-specific predictions, the strongest source of RA risk comes from the HLA-DRB1 "shared epitope" motif [28] aa-70[(Gln/Arg)-(Lys/Arg)-Arg-Ala-Ala]aa-74, composed almost entirely of predicted RA risk residues. Indeed, upon residue-mapping of MHC risk for RA, the signal primarily localized to DRB1-11-Val/Leu, DRB1-71-Lys/Arg, DRB1-74Ala, B-9Asp and DPB1-9Phe [29], with 6/7 experimentally determined RA-risk residues coming from the predicted list (S10b Table). The lone protective-predicted residue, B-9Asp, is the only one specifically associated with ACPA -RA [30], in which antigens are likely to be more neutrally charged.

HLA variants associated with autoantibodies
Individual profiles of autoantibodies were available for most SLE (Korean and Han Chinese; S11 Table) patients (n = 2,164). nRNP and Ro/La antibodies had genome-wide significant (P omnibus <10 −8 ) residue associations (S12 Table). We queried whether the contribution of SLE-associated amino-acid positions and/or other variants was heterogeneous according to autoantibody types.
The strongest risk and protective amino acids for autoantibody association overwhelmingly mapped to Class II loci, particularly HLA-DRB1 (nRNP, Ro/La and ACL) and HLA-DPB1 (Sm), in a case-only analysis (Table 3). This is consistent with these MHC subunits playing critical peptide sequence-dependent roles in selection of antigens displayed to the T-cell receptor. Furthermore, all significant residues occupy the middle of the peptide-binding pocket (6/7 most significant residues bind the p4/6 pocket; one binds p7).

Antibody
Taken together, these results show that MHC subunits, particularly Class II, carry strong, residue-specific risk signals for overall SLE susceptibility and specific antibodies to common SLE antigens. All signals concentrate in the peptide-binding grooves, at sites and with aminoacid properties that broadly overlap between all subunits tested. Many signals are consistent between autoantibody development and SLE risk (e.g. HLA-DRB1 � 09:01 and DRB1-11Asp, shared between SLE, nRNP + and ACL + risk; HLA-DRB1 � 13:02 and DRB1-13Ser, strongly protective for both SLE and nRNP + ).

Discussion
In this study, we confirmed SLE associations in six East Asian cohorts for the HLA-DRB1 locus and identified additional independent Class I and II association signals. The main association signal fell within the same region identified by a meta-analysis on largely European populations [15], with the most-risk alleles (e.g. DR15) conserved. Amino-acid positions carried significantly more signal than classical alleles (which feature many mutations, only some of which are involved in binding and signaling) and SNPs. Independent residue signals localized to HLA-DRB1-13/11-37/26, DQB1-37, DPB1-35, B-9, and A-70. The most significant signals in each subunit localized to the peptide-binding groove. DPB1 was suggested earlier as an SLE risk locus in Japanese [24]; the statistical power of this study confirmed that report in subjects ascertained independently and pinpointed responsible residues. Our diverse cohort set pointed to DRB1-37 being equally responsible for the signal previously assigned to DRB1-26 [7]. In our analysis of the HLA aggregate effect, we identified significant interaction between residues of DRB1 (11,13,26), and Class I HLA-B (9) (P = 8.7x10 -15 ); the nature of these interactions requires further study and confirmation. In addition, we identified a significant cumulative effect of risk residues, where the increase in risk is positively and incrementally correlated to an increase in the number of risk residues.
Interestingly, we observed that several classical alleles and residues were eQTLs for expression of their own genes. In most cases, compared to the amino acid residues, classical alleles together better explained own-gene expression. Only DRB1 expression was explainable by amino-acid residues (11, 13 and 37), whereas classical alleles explained the entire expression signals for all genes. However, it is important to note that these observations are based on the imputation and expression data from European samples (1000Genomes) and not from Asians. Additionally, we did not consider a full conditional analysis of expression using SNPs and additional residues. Therefore, the degree to which we can extrapolate these findings on gene expression remains to be determined in future studies.
Most immunogenic peptides are enriched in large, hydrophobic and charged residues [27]. Accordingly, in each HLA subunit, risk residues were typically themselves large and hydrophobic or charged, which would facilitate the best interaction with the peptide to facilitate both MHC binding and T-cell receptor binding and activation. A simple model of MHC-peptide-T-cell receptor interactions was consistent with the observed risk residues. We adapted the model to deal specifically with SLE (autoantigens largely positively-charged) as well as rheumatoid arthritis (autoantigens largely negatively-charged). The model performed well for both diseases, and could prove useful in other studies as well.
Our study is the first to systematically associate risk of specific autoantibodies across large SLE case/control cohorts. The most-risk and most-protective residues for autoantibody association overwhelmingly map to Class II loci (particularly DRB1 for nRNP, Ro/La, and cardiolipin and DPB1 for Sm; Table 3), consistent with the role of CD4 + T-helper cells in autoantibody generation by B-cells (rather than CD8 + cytotoxic T-lymphocytes, which signal through T-cell receptors specific for Class I proteins).
Intriguingly, though amino acid properties were largely conserved between overall SLE and specific autoantibody risk, the positions were fairly different. DRB1-30 was a primary determinant of anti-nRNP risk, DRB1-70 of anti-Ro/La risk, and DPB1-11 of anti-Sm risk; none of these residues were GWS for overall SLE risk (conversely, DRB1-11/13 was shared between overall SLE risk and nRNP + , Ro/La + , and ACL + status). These last, along with other observations (e.g. HLA-DRB1-09:01 and DRB1-13Phe, risk for SLE and nRNP + , but protective for Ro/ La + ; HLA-DRB1 � 16:02, strongly risk for SLE but no detectible risk for tested autoantibodies), suggest that SLE and specific autoantibody development share some genetic risk elements but diverge at others.
Autoantibody risk mapped entirely to HLA residues contacting the middle of the peptide (p4/6 pocket, 6 of the 7 most significant residues; p7, 1/7). These recognize the C-terminal portion of the peptide, which binds to the MHC before the N-terminus [41]; p4/6-7/9 therefore serve a key role in peptide selection. Sjögren's syndrome risk also concentrates at MHC positions recognizing p7/9 [31]. Individual autoantigens present a small set of immunogenic peptides to the HLA, with autoantibody risk arising from recognition of those specific peptides. Given the enormous diversity of SLE antigens, overall SLE risk might stem from the binding Protective alleles https://doi.org/10.1371/journal.pgen.1008092.t005 of many peptides (alternatively, these antibodies might arise as a downstream consequence of pathogenesis set off by fewer peptide-autoantibody pairs). Our data strongly suggest that these hallmark SLE autoantigens are selected in a predictable fashion by the HLA subunit (particularly HLA-DRB1 and HLA-DPB1) peptide-binding grooves. Similar patterns are seen across other MHC subunits and autoimmune diseases. Negative HLA-DPB1 p1/7/9 peptide-binding groove charge (aa-55[Asp-Glu-Glu]aa-57; aa-67[Glu-Glu-Glu]aa-68; aa-82[Glu-Asp-Glu]aa-85) is associated with risk of anti-topoisomerase antibodies in systemic sclerosis [42]; negative HLA-DRB1 p4 pocket charge (aa-70[Asp-Glu]aa-71) is associated with anti-desmoglein antibodies in pemphigus vulgaris [35].
The development of antibodies to RNA, DNA and cardiolipin is thought to come from nucleic acid-binding protein/nucleic acid complexes breaking tolerance through recruitment of T-cell help [43]; these molecules are otherwise very poorly immunogenic. The fact that overall SLE-risk alleles and amino acids differ somewhat from those for specific autoantibodies tested here, suggests that the queried autoantibodies might result from a progression of SLE through a more general breakdown of immune processes, rather than being causal.
In summary, we identified novel SLE signals in HLA Class I loci (HLA-A-70, HLA-B-9), and localized primary Class II signals to five residues in HLA-DRB1, HLA-DPB1, and HLA-DQB1. These seven residues not only increase the proportion of HLA heritability explained to 2.6%, but also significantly increase overall risk, particularly with risk-allele accumulation. Detailed analysis expanded this to 20 risk positions across the six major HLA subunits. We demonstrate how these positions and amino-acid properties (large, aromatic, charged) correlate with peptide-MHC binding and T-cell receptor activation, for both general SLE risk residues and for risk of specific autoantibodies (nRNP, Ro/La, Sm, cardiolipin). It is of note that our study does not consider other interacting loci or individual variation in the autoantigens themselves [44], and there may be small signals remaining at other HLA loci. Our association results on Asians complement previous reports from European, African, and Hispanic populations. Importantly, our observations generalize across MHC subunits and various other autoimmune diseases. Our data and analysis present a framework for modeling peptide antigen presentation to both the MHC and the T-cell receptor, tolerance breakage, and autoantibody development.

Ethics statement
This study was approved by the Oklahoma Medical Research Foundation's Institutional Review Board, IRB 10-23. IRB approval was granted for use of either de-identified or coded materials collected from previous studies in which original consent included a provision for sharing; because of this, no additional informed consent was required.

Study participants
Our study was conducted in two phases: discovery and replication. The discovery phase included primarily the participants from our previously published study (Korean, KR; Han Chinese, HC; Malaysian Chinese, MC) [8]. Details about recruitment and phenotyping for our discovery phase individuals (n = 10,142; 2,490 cases and 7,652 controls) can be found elsewhere [8]. To increase statistical power of our Han Chinese (HC) samples, we incorporated 392 out-of-study controls from dbGaP (phs00431.v1.p1) [45]. Our discovery set thus included 2,490 cases and 8,044 controls (S1 Table).
In order to replicate our initial findings, we included 2,425 cases and 5,469 controls from two Japanese cohorts (JP1, JP2) and one HC cohort (HC2) (S1 Table). Our first Japanese replication cohort (JP1) was collected under the support of the Autoimmune Disease Study Group of Research in Intractable Diseases, Japanese Ministry of Health, Labor and Welfare, and the BioBank Japan Project. Details about the subjects and study design are described elsewhere [46]. Samples for our second Japanese cohort (JP2) were obtained at Kyoto University, Japan. Our HC replication set (HC2) included primarily the participants recruited by UCLA, as described elsewhere [47].
All patients satisfied American College of Rheumatology criteria for SLE classification [48,49]. Controls were geographically matched to SLE cases. Participants provided written consent at study enrollment, and the Institutional Review Boards or ethical committees of participating institutions approved this study. Potentially identifying information was removed for all participants.

Genotyping and quality control
Genotyping and quality control of the discovery samples, as well as JP1 details, can be found in the source publications [8,46,50]. HC2 replication samples were genotyped on the ImmunoChip platform following the same quality-control protocol as our discovery samples. Details about the genotyping and normal HC quality controls included in the analysis are described elsewhere [50]. JP2 replication samples were genotyped on the Illumina HumanCoreExome BeadChip platform at Tokyo Medical and Dental University and Kyoto University.

HLA imputation
To identify association at the MHC region, we extracted the SNPs within the extended MHC region (chr6: 25-35 MB) and imputed classical alleles, untyped SNPs, and amino-acid residues using SNP2HLA [51]. In order to capture the appropriate genetic background for imputation, we used two different Asian HLA imputation reference panels: a merged panel [52] made of Korean [53] and Asian [54] imputation panels was used on our Korean, Han Chinese and Malaysian samples, whereas our Japanese replication cohorts were imputed using a Japanese imputation panel [40]. All reference panels included SNPs, classical alleles, and amino acids for eight HLA loci (HLA-A, -B, -C, -DPA1, -DPB1, -DQA1, -DQB1 and -DRB1; DRA1 is practically invariant and was not genotyped in our study). Imputation accuracy using our two panels (~90%) has been described in the parent publications [40,52]. Imputation of cases and controls was performed together for each cohort. In order to reduce the uncertainty of imputation, we restricted our analysis to variants with high imputation quality (r 2 >0.7) in each cohort. Imputed dosage results were used in all subsequent analyses.

Association analysis
For the assessment of SLE associated signals in each cohort, we performed two types of regression analysis: logistic regression analysis for each bi-allelic marker and omnibus test (log-likelihood ratio test) for each multi-allelic marker. Each SNP, classical allele, and amino-acid residue was regressed and corrected by sex and the first three principal components (PCs). Corrected P-values after conditioning are presented throughout this paper. To estimate the association strength at amino-acid positions, we carried out omnibus tests [29,54]. Omnibus tests for each position were set up as logistic regressions of all residues within the position (except the most frequent one), and corrected by sex and PCs. Omnibus P-values (P omnibus ) were estimated as likelihood ratio tests of the base model A 0 versus the expanded model A 1 : where the θ term is the intercept. β and γ are the logistic regression coefficients, P i,j is the j-th principal component of individual i, S i is the sex of individual i recoded as a binary categorical variable (male = 0, female = 1), V i,k is the dosage of the k th amino-acid residue in individual i among the n possible side-chains (ordered from k = 1, least frequent, to k = n, most frequent), � is an error term, df is the degrees of freedom, and D is the omnibus statistic. All conditional analyses were performed as omnibus association tests and were nested into two stages, A: conditioning for the strongest residue, and B: conditioning for the full effect of the corresponding alleles (S1 Fig). Conditional analyses were cumulative, where each step includes the condition for the full effect of the genes in the previous steps. We used the significance threshold 5x10 -5 for all omnibus tests. The complete analysis consisted of six steps as described below.
In the first step of the analysis, we conducted an unconditional omnibus test of all SNPs, residues and alleles, and used this as the basis to identify the strongest signal (S1 and S2a Figs). In the second step, we fine-mapped the residues within HLA-DRB1 (A, S2b and S2c

Combined P-values
In order to combine cohort-specific association P-values while preserving effect size and direction, data for individual SNPs, amino acids, and classical alleles were combined through sample size-corrected meta-analysis, as implemented in Metal [55]. Odds ratios for the combined P-values were estimated using the standard error approach. This method was ultimately used because only summary statistics for the Japanese cohorts were available to us. To combine Pvalues from omnibus tests, we used Fisher's method, which is recommended when effect sizes are not available [56,57].

Explained proportion of heritability
In order to estimate the proportion of explained heritability contributed by each of our independent residues (amino-acid positions), we used the liability model described by So and Sham [58], which estimates the effect of risk alleles on genetic liability. In this case, we used the odds ratios of all risk amino acids within a residue. HLA-DRB1 residue 11 (henceforth referred to as DRB1-11) was removed from the estimation of the total effect on liability because DRB1-13 is tightly linked to it; we included only DRB1-13 in this total calculation to avoid inflation.

Accumulation of risk residues
In order to assess the effect of accumulating risk/protective alleles (for this case, allele is meant for residue), we used the best-guess (phased) genotypes for our discovery and replication sets. For each individual, we counted the number of risk and protective amino acids present for all seven independent residues (A-70; B-9; DPB1-35; DQB1-37; DRB1-11/13, and either 37 or correlated residue 26). We estimated the odds ratios of having 1 to >8 risk residues versus none. Individuals whose best-guess genotype was uncertain were removed from this analysis. We regressed the number of risk alleles versus odds ratio and identified the best fitting model. In order to assess additivity of the effects, we performed logistic regression for all combination of single and multiplicative effects (modeled as interactions). Additive models were compared to interaction models through the Akaike information criterion. In order to investigate if the effect of the accumulation of risk amino-acids weighted by their odds ratios was a better predictor of SLE risk versus the unweighted count of risk amino-acids, we estimated a genetic risk score for each imputed haplotype, and estimated the area under the curve. Comparison of the weighted versus unweighted ROC curves was performed using the Bonferroni test [59] in easyROC.
To estimate the most common protective and risk haplotype between HLA alleles, we used the best-guess genotypes. Haplotype construction was performed in the haplo/stats library in R through the expectation-minimization algorithm. Analysis of the haplotypes derived from the phased imputed data yielded similar results. Odds ratios for each haplotype were estimated through a generalized linear model. Linkage disequilibrium between pairs of alleles was estimated as specified by Lewontin [60].

Expression analysis
To investigate the effect of the imputed alleles, residue amino-acids and SNPs on HLA gene expression, we imputed 373 European individuals from the 1000Genomes Project using the T1DGC European reference panel [51] using SNP2HLA. We extracted RNA-seq expression data for those same individuals from the GEUVADIS project [61], and estimated the linear model for each SNP, residue and allele (with r2>0.8). We assessed gene expression of each gene conditioned on the effect of each of our identified independent residues.

Protein structural representations
Structural representations of HLA-A, HLA-B, HLA-C, HLA-DPA1/DPB1, HLA-DQA1/DQB1, and HLA-DRA1/DRB1 were produced (PyMOL), using PBD files 4HWZ, 3VCL, 1IM9, 3LQZ, 1JK8, and 2SEB, respectively. For display of overall SLE risk across each protein, appropriately conditioned -log(P omnibus ) was linearly normalized to the interval [0, 1], with the least-associated position mapping to 0 and the most-associated position mapping to 1. Then each normalized value was converted to the RGB color (x, 0, 1 -x). Thus, the most highly associated position is shown as deep red (1, 0, 0), and the most weakly-associated position as deep blue (0, 0, 1). Intermediate values map linearly (according to -logP) between blue and red. This creates a simple visualization of the 3-dimensional distribution of risk across each HLA subunit, with concentrations of red positions highlighting the regions of strongest association. Significantly associated positions are indicated by text labels.

Autoantigens in SLE and rheumatoid arthritis (RA)
Based on literature review, protein sequences were collected from NCBI for principal autoantigens of both SLE and RA. For SLE, the four protein autoantigens that were experimentally characterized in this study (nRNP, Ro, La, and Sm), as well as histones H1 and H2B [62], were catalogued. Multiple protein subunits were studied for two proteins: nRNP (U70, U1A, C) and Sm (B, B', N, D1, D2, D3, E, F, G). For RA, the key autoantigens fibrinogen (fibrin precursor; A and B subunits), vinculin, collagen type II, filaggrin, vimentin, and keratin were catalogued.
For each of these proteins, total charge and isoelectric point (pI) were computed (DNAS-TAR 14.0.0 EditSeq). The RA autoantigens were also modeled in their poly-citrullinated forms, and the charge and pI were recalculated. For sake of these calculations, citrulline was approximated by glutamine, the naturally occurring amino acid to which it is most similar. Total charge was then normalized to a per-residue charge by dividing by the length of the protein.

Statistical model of HLA-epitope interaction
In order to generalize the binding of MHC subunits to arbitrary peptides, we used a statistical potential [63], which represents the favorability of specific amino acids contacting one another, derived from an analysis of solved structures in the Protein Data Bank (Web Resources). Given the extraordinary diversity of antigens in SLE, this was deemed a practical way to address the numerous potential peptide/MHC combinations, rather than considering specific peptide/allele pairs with an atomistic (e.g. molecular mechanical) description of the interaction. This also has the simplifying assumption that MHC risk is additive across binding-groove residues (when in reality protein side-chains interact both structurally, through atomic interactions with each other and with the peptide, and genetically, in that side-chain combinations are linked together in alleles). To create a simple MHC allele "immunogenic peptide preference" score, amino acids statistically over-and under-represented in peptides initiating T-cell activation by the interaction between peptide/MHC and T-cell receptor (TCR) were taken from Calis et al. [27]. The statistical potential gives a score to each possible MHC side-chain interacting with each possible peptide amino acid. For each possible MHC side-chain, the metric computes the difference between the statistical potential values for antigenic-over-represented amino acids and antigenic-under-represented ones. In this way, a ranked list of MHC side-chains at positions in the peptide-binding groove (where the side-chains could contact the peptide) that would be predicted to favor peptide recognition and receptor activation was created (S1 Text).
Because RA autoantigens were quite negatively charged, and SLE autoantigens quite positively charged, we made a simple change to the statistical model for the two different autoimmune diseases: for RA, the Calis et al. [27] dataset was used as is; whereas for SLE, Glu (negatively charged) was replaced by Arg (positively charged) in the list of immunogenic peptide residues, and Lys (positively charged) was removed from the list of non-immunogenic peptide residues (Table 6 differences bold italic): As an example of how calculations were performed, Trp is enriched in immunogenic peptides [27] (in both our SLE and RA models), and Pro has a strongly favorable interaction with Trp in the statistical potential [63] (S10 Table); thus a Pro at an MHC binding-groove position might be expected to interact with Trp-bearing peptides in such a way as to contribute to an immune response (given that the enriched residues are taken from T-cell activation studies, this convolves both MHC-peptide and MHC-peptide-TCR interactions). The "preference" of a potential MHC binding-groove side-chain for immunogenic peptides was computed as the difference in the interactions with the immunogenic and non-immunogenic amino acids. For instance, an MHC binding-groove Phe is estimated to strongly prefer immunogenic peptides, based on the strength of its favorable interactions with immunogenic Trp/Phe/Ile and on unfavorable interactions with non-immunogenic Ser/Met/Gln (S10 Table). Similarly, Thr is estimated to strongly prefer non-immunogenic peptides, based on the strength of its exact opposite preference (i.e. unfavorable interactions with immunogenic Trp/Phe/Ile and on favorable interactions with non-immunogenic Ser/Met/Gln).

Autoantibody associations
We performed sub-phenotype association analysis with six autoantibody profiles (antibodies against nuclear ribonuclear protein, nRNP; Ro/SSA; La/SSB; Smith, Sm; double-stranded DNA, dsDNA; and cardiolipin phospholipid, ACL). nRNP, Ro, La and Sm are binding proteins for negatively-charged nucleic acids [64]. Cardiolipin is a negatively-charged phospholipid. Autoantibodies develop first against the positively-charged nucleic acid-binding proteins, and through tolerance breakage, antibodies subsequently develop against their ligands (ssRNA and ssDNA) and mimetics such as dsDNA and cardiolipin. All analyses were carried out using logistic regression and omnibus tests as described above for the case/control analysis. Data for case-only analysis was available only for the Korean cohort and one Han Chinese cohort (S11 Table).
For each of the autoantibody profiles, the residue positions with significant omnibus association P-values (P omnibus <5x10 -8 ; S12 Table) were selected for further study. Within this list, the most statistically significant side-chain associations were tabulated and evaluated according to the statistical model above.     Table. Primary autoantigens for SLE (a) and RA (b). Shown are protein sequence, isoelectric point (pI), length, # and fraction of residues that are Arg (R), Lys (K), Asp (D) and Glu (E), charge, charge per residue, and pI and charge per residue following poly-citrullination. For sake of easily modeling charge and pI, citrulline was approximated by glutamine. Red colors represent basic (positively-charged) proteins. Blue colors represent acidic (negatively-charged) proteins.

Supporting information
(XLSX) S10 Table. Simple model of MHC side-chains interacting with (S10 A) SLE and (S10 B) RA peptide antigen side-chains. A. As SLE autoantigens are frequently highly positively-charged, Arg was added to and Glu omitted from the list of "over-represented in antigenic peptides list". Similarly, Lys was omitted from the "under-represented in antigenic peptides" list. Preference for antigenic versus non-antigenic side-chains was calculated both by the difference in SUM and the MIN of RIFW versus QMS. The MHC side-chains found to most prefer SLE-immunogenic peptide side-chains include Trp, Ile, Phe, Leu, Glu, Tyr, Pro, Met and Asp. Those found to least prefer SLE antigenic side-chains include Thr, Gly, Ser, Cys, Asn, Lys, Gln, His and Arg.