Single-cell transcriptome landscape of circulating CD4+ T cell populations in autoimmune diseases

Summary CD4+ T cells are key mediators of various autoimmune diseases; however, their role in disease progression remains unclear due to cellular heterogeneity. Here, we evaluated CD4+ T cell subpopulations using decomposition-based transcriptome characterization and canonical clustering strategies. This approach identified 12 independent gene programs governing whole CD4+ T cell heterogeneity, which can explain the ambiguity of canonical clustering. In addition, we performed a meta-analysis using public single-cell datasets of over 1.8 million peripheral CD4+ T cells from 953 individuals by projecting cells onto the reference and cataloging cell frequency and qualitative alterations of the populations in 20 diseases. The analyses revealed that the 12 transcriptional programs were useful in characterizing each autoimmune disease and predicting its clinical status. Moreover, genetic variants associated with autoimmune diseases showed disease-specific enrichment within the 12 gene programs. The results collectively provide a landscape of single-cell transcriptomes of CD4+ T cell subpopulations involved in autoimmune disease.


INTRODUCTION
Autoimmune disorders encompass various conditions where the immune cells display abnormal reactivity toward normal tissues.These diseases are multifactorial, polygenic, and prevalent, affecting 3%-5% of the population. 1,2Numerous studies have highlighted the role of CD4 + T cells in the onset or exacerbation of these diseases. 2,3D4 + T cells exhibit a variety of states (e.g., naive, memory), polarizations (e.g., Th1, Th2, Th17, T follicular helper or Tfh), and also include a distinct subpopulation engaged in the mainte-nance of self-tolerance and homeostasis (regulatory T cells or Tregs). 4,5While a great deal of effort has been devoted to the detailed classification of CD4 + T cells, the complete picture of heterogeneity and its relationship to diseases is still controversial.Furthermore, making consistent assessments across reports is challenging since these reports were based on inconsistent cellular classifications.
3][14][15] Furthermore, in the field of T cell research, a number of tools have enabled precise T cell profiling across various contexts such as chronic infection and tumor-infiltrating T cells. 16,17On the other hand, conventional clustering and marker gene detection strategies for single-cell analysis possess the following weaknesses: (1) cell fraction definition requires arbitrary boundaries, (2) marker genes for clusters can be occupied by redundant genes or uninterpretable genes, such as long noncoding or ribosomal genes, due to the influence of larger cell population structures, and (3) pairwise differentially expressed gene detection cannot capture global gene variation across multiple clusters.Several studies have attempted to tackle these issues. 18,19Especially, in some previous reports, non-negative matrix factorization (NMF) 20 has been utilized to extract both cell-type identity and cellular activity programs, thereby addressing the profiling of complex cell populations. 21,22However, it still harbored complexity in determining the parameters and in the flexible integration between different datasets.
Here, we construct a consensus reference for CD4 + T cells in peripheral blood from autoimmune and healthy individuals covering various inflammatory conditions.The reference consists of 18 cell types defined by a conventional clustering strategy and 12 transcriptomic gene programs extracted by conducting decomposition using NMF without boundaries, which overcame the weakness of existing clustering-based singlecell analyses.The results show that diverse CD4 + T cell features were formed by a combination of 12 independent gene programs.We also illustrate that the gene features obtained by NMF could be projected to other bulk/scRNA-seq data to help interpret various datasets.Using these frameworks to examine the genetic contribution and subsequent changes of CD4 + T cells in autoimmunity, we performed a meta-analysis that enrolled over 1.8 million CD4 + T cells using published singlecell data of 20 diseases and integrated genome-wide association study (GWAS) statistics for 180 traits with our dataset.These analyses provided a full picture of CD4 + T cells in autoimmune diseases from the perspective of phenotypes and genetics.

TCR features across CD4 + T cells reflect cellular properties
Because TCR responses shape T cell functions and differentiation, TCR diversities and overlaps provide useful information for the properties and relationships of populations.Therefore, we analyzed single-cell TCR features sequenced along with gene expression.Clonotype sizes and diversity across cluster L1 populations revealed that Temra was most clonally expanded, followed by Tem, Tcm, Treg, and Tnaive (Figures 1F and 1G).Similarly, in cluster L2 populations, Temra (Th1), Tem (Th1), and Tem (K) Degree centrality of TCR networks.Significance across clusters was calculated by one-way analysis of variance and, after multiple test corrections by false discovery rate, Tcm (Tfh) and Treg Act were retained as significant cell types.Then, pairwise Tukey-HSD post hoc tests were performed.*p adj < 0.05 in comparison with HCs.(L) TCR-intrinsic regulatory potential (TiRP) score distributions on UMAP plot.Mean scores for each cluster L2 were shown.(M) TiRP score distribution across cluster L2.The dot shows the mean, and the confidence interval (CI) shows 95% CI of the bootstrap distribution of means (n = 1,000).Adjusted p values of significant clusters, Tnaive MX1 1.29 3 10 À2 , Tem (Th1) 5.82 3 10 À11 , Temra (Th1) 4.84 3 10 À22 , Treg Naive 2.14 3 10 À13 , Treg Act 2.14 3 10 À13 , Treg Eff 3.82 3 10 À5 (two-sided Mann-Whitney U test was performed for one cluster vs. the other clusters iteratively).
(Th1/17) possessed a limited number of clonotypes, whereas Tnaive and Treg Naive maintained diverse clonotype pools (Figure 1H).The TCR similarity network showed a repertoire sharing between neighboring clusters, Tnaive and Tcm, Tcm and Tem, and Tem and Temra, while the distal connection such as from Tnaive to Temra was not observed, suggesting stepwise development from Tnaive to Tcm, Tem, and Temra (Figures 1I, 1J,  and S2B).The repertoires were also mutually shared within Tcm cell populations, suggesting the plasticity of T cell polarization against the same epitopes.In addition, no substantial overlap was observed between the repertoires of Treg Naive and conventional T cells (Tconvs), whereas Treg Act and Treg Eff demonstrated shared repertoires with Tcm populations (Figures 1I, 1J,  and S2B).We also measured the centrality of TCR networks for each cell type to evaluate the differentiation potential of each cluster.The centrality of Tcm (Th0), Tem (Th1) pre, and Tnaive were consistently high, suggesting that these cells possess the possibility to differentiate into a variety of cell types (Figure S2C).In addition, TCR networks differed depending on the disease states (Figure 1I).Especially the centrality of Tcm (Tfh) and Treg Act were higher in SLE (Figures 1K and S2D).These results indicated that the kinetics of CD4 + T cell differentiation varied depending on the disease state.
Previous studies have shown that T cells with stronger TCR stimulation within the thymus are more likely to differentiate into Tregs than Tconvs. 27Therefore, Tregs have specific TCR properties, such as hydrophobicity in complementary determining region 3 regions. 28We measured the Tregness of TCRb chains (TCR-intrinsic regulatory potential [TiRP] score 28 ) and found that the mean TiRP score was higher in Treg cells compared with Tconv cells (Figures 1L and 1M).On the contrary, Tem (Th1) showed a low TiRP score, indicating that Tem (Th1) has experienced stimulation with non-self antigens.Among Treg cells, Treg Naive and Treg Act showed higher TiRP scores than Treg Eff.It has been thought that naive Tregs contain predominantly thymic differentiated Tregs (tTregs), while effector Tregs are compensated by peripherally differentiated Tregs in addition to tTregs. 29This notion was concordant with our observations that naive Tregs had the strongest Treg characteristics in the TCRs and that Treg Act and Eff shared TCRs with Tconvs (Figures 1I, 1J, and S2E).Furthermore, in MS patients, the TiRP scores of the Treg Act were significantly low, reflecting disease-dependent Treg compensation by Tconvs (Figure S2F).Overall, TCR repertoires provided valuable insights into T cell characteristics and relationships during the differentiation.
Decomposition of cellular programs using NMF Next, we attempted to identify cellular programs within and across cell types.We noticed that conventional clustering and marker gene detections could fail to capture meaningful clusters and genes.For example, differentially expressed genes in our reference included overlapping genes among Th1 cell populations and nonsense genes in Tnaive cells, suggesting that the conventional marker gene detection is insufficient for CD4 + T cells (Figure S1C).We suspected that artificially delineating in the clustering process is unsuitable for a gradual population such as CD4 + T cells.In addition, because marker gene detections are performed by pairwise comparison, global representa-tions across cell types cannot be detected.To overcome these limitations, we applied non-negative matrix factorization 20 to normalized gene expression of our scRNA-seq data.NMF and its derivative method, consensus NMF (cNMF), 22 have been widely used in single-cell analysis as a technique to extract both cell-type identity and cellular activity as gene programs.NMF unbiasedly dissected gene expression profiles into a gene feature matrix W and a cell feature matrix H (Figure 2A).To determine the number of components, we assessed the explained variances and maximum inter-component correlations and selected 12 for the number of components as they kept sufficient information and were not redundant (Figure S3A; STAR Methods).Based on the gene feature profiles and the enriched pathways, we annotated the NMF components (Figures 2B-2D and S3B; Tables S4 and S5).Several factors were related to T cell polarization, such as Treg-Feature (Treg-F) (NMF1; genes with high weights; IKZF2, FOXP3), Th17-F (NMF 2; RORC, CCR6), TregEff/Th2-F (NMF5; HLA class II genes, CCR10, CCR4), Tfh-F (NMF6; TIGIT, CXCR5), Th1-F (NMF11; GZMK, EOMES, CXCR3), and differentiations such as Naive-F (NMF3; CCR7, TCF7), Central Memory-F (NMF8; S100A8, ANXA1), and Cytotoxic-F (NMF0; GZMB, NKG7).NMF5 was enriched in both Th2 and Treg Eff, suggesting that effector Treg cells and Th2 cells may be controlled by the shared program as previously suggested 30 (Figure 2B).NMF6 (Tfh-F) also demonstrated moderate activity in Treg Act, suggesting an overlap between Treg Act and T-follicular regulatory (Tfr) cells 31 (Figure 2B).NMF11 high cells were enriched in Tem (Tph), Tem (Th1), and Tem (Th1/17) cells, showing a wide range of Th1ness gene usage across these subtypes.Moreover, NMF7 was a type I interferon signature gene component enriched in Tnaive MX1 (Figures 1C and S3B).Intriguingly, NMF10 captured a global feature across cell types consisting of AP-1 family genes (JUNB, FOS), NFKBIA, CD69, and CXCR4 (Figure 2D).This feature was concordant with tissue-homing T cells observed in the thymoma of MG patients 7 and the central nervous system of neurodegenerative disease patients, 32 and was labeled as Tissue-F.NMF4 (Act-F) was related to IL7R signaling, which is an essential survival and differentiation signal. 33The proportion of explained variance (Evar) showed the most drastic variations in the peripheral CD4 + T cells were differentiation from Tnaive to Tcm, Tem, and Temra, and the polarizations were relatively smaller changes and independent of the differentiation programs (Figure 2B).Altogether, NMF succeeded in the decomposition of peripheral CD4 + T cell gene programs into 12 components and showed that complex CD4 + T cell populations were represented by a simple combination of the 12 components.

NMF projection enables fast interpretation of various CD4 + T transcriptome datasets
One of the biggest challenges in single-cell analysis is the integration of datasets.To achieve a simple integration, we expanded the NMF framework to allow the projection of the pre-computed gene feature matrix onto other datasets by developing a bioinformatics tool, NMFproj (Figure 2A, https://github.com/yyoshiaki/NMFprojection).Recognizing the prevalent use of cNMF, we also designed NMFproj to be directly compatible with cNMF outputs.To measure how the NMF features explain the variance of the query dataset, we introduced a QC metric named the proportion of overlapped highly variable genes (POH) and Evar for all factors (Evar all) (Figure S3C).A low POH indicates that the query dataset has much variability other than the NMF features evaluated by NMFproj.We applied NMFproj to various datasets to validate the scalability (supplemental note).Analysis of bulk RNA-seq of sorted peripheral CD4 + T cells provided by the DICE project 23 demonstrated that each fraction was well represented by the 12 NMF gene features (POH: 0.272; Evar all: 0.839, Figure S3D).Miyara classification 34 categorizes CD4 + T cells into six distinct fractions (Fr.I to Fr. VI) based on the expression of CD45RA and CD25.This classification enables the distinction between Tregs and Tconvs, as well as naive and activated cell states.However, the identity of borderline populations, such as Fr.III (CD45RA À CD25 int ), remains less understood.Using NMFproj, we re-evaluated this classification and found that Fr.III possesses Th17 type characteristics in line with the original report 34 (POH: 0.542; Evar all: 0.823, Figure S3E; supplemental note).In addition, profiling of circulating Tph cells 24 revealed that Tph cells possessed both NMF6 (Tfh-F) and NMF11 (Th1-F) in concordance with Tem (Tph) we defined as cluster L2 (POH: 0.134; Evar all: 0.816, Figure S3F; supplemental note).We also attempted to utilize NMFproj for the QC of in vitro induced Treg (iTreg) cells of mouse 35 and found that iTreg + cells induced in optimized conditions for enhancing Treg functionality showed higher NMF1  (A) Strategy for meta-analysis of peripheral CD4 + T cells across diseases.First, CD4 + T cells were extracted from peripheral blood mononuclear cell (PBMC) scRNA-seq datasets using Azimuth. 62Extracted CD4 + T cells were mapped on our reference using Symphony 14 with a batch correction.Mapped cells were used to assess cell frequency and NMF cell features for each cluster.
(legend continued on next page) (Treg-F) values than conventional iTreg cells (POH: 0.182; Evar all: 0.844, Figure S3G; supplemental note).We also applied NMFproj to scRNA-seq datasets of cross-tissue immune cells 36 (Figure S6, POH: 0.560; Evar all: 0.332 in CD4 + T cells, supplemental note), pan-cancer tumor-infiltrating CD4 + T cells 15 (Figure S4A, POH: 0.530; Evar all: 0.338, supplemental note), and mouse splenocytes 37 (Figure S4B, POH: 0.394; Evar all: 0.146, supplemental note).CD4 + T programs defined by circulating CD4 + T cells were adequately projected to various conditions, with permissive POH values and lower Evar all.Given the variance observed in Evar all between circulating and tissue-derived datasets, we sought to explain this by comparing the ab initio NMF in a pan-cancer dataset with the NMF defined in circulating CD4 + T cells.We found that the majority of circulating CD4 + T factors had correlating factors in the NMF defined by the CD4 + tumor-infiltrating lymphocyte (TIL) data (Figures S5A-S5D).For example, NMF7 (IFN-F) in circulating CD4 + T corresponded to NMF15 in ab initio NMF defined by the pan-cancer dataset (Figures S4A and S5D).However, NMF8 (Cent.Mem.-F) and NMF4 (Act-F), which were broadly elevated or almost inactive within the tumor, were not included in the NMF defined for TILs.In addition, some factors, including NMF17, which contains CXCL13 and IL21, were only included in NMF components defined by TILs (Figure S5D).From these results, it seems that the NMF defined in circulating CD4 + T cells can be correctly projected by NMFproj even to TILs, but some tissue-specific factors may need to be carefully evaluated.Furthermore, cell-specific qualitative changes have been reported in autoimmune diseases, such as Treg dysfunction in SLE, 38 and we hypothesized that NMFproj could be used to detect these changes in individual cell populations.To test this, we applied NMFproj to bulk RNAseq data of sorted peripheral CD4 + T cell fractions from various autoimmune patients. 39NMFproj detected a subset-specific gene program robustly even in a variety of autoimmune disease conditions (Figure S7A; supplemental note).The results showed cell-type-wide enhancement of NMF7 (IFN-F) in SLE and mixed connective tissue disease and hampered NMF1 (Treg-F) in Fr.I nTregs (CD45RA + CD25 + ) in SLE patients as previously reported 38 (Figure S7B).These results indicated that NMFproj could robustly assess the qualities of CD4 + T cells in various tissues and disease states, regardless of bulk/single cell or human/ mouse.We were also able to perform analyses while maintaining interpretability in a large dataset spanning multiple diseases and cell types.

Meta-analysis of CD4 + T cells in various autoimmune diseases
To extend CD4 + T cell profiling to various autoimmune and infectious diseases, we performed a meta-analysis using publicly available single-cell data. 6,8, We itegrated publicly available datasets with two strategies: (1) quantitative evaluation of cell frequencies by mapping to our reference and (2) evaluation of qualitative changes per cell type using NMFproj.We extracted CD4 + T cells from peripheral blood mononuclear cells (PBMCs) using Azimuth 62 and then mapped them to our reference using Symphony 14 (Figure 3A, the pipeline is available at https:// github.com/yyoshiaki/screfmapping).Reference mapping using Symphony achieved accurate label transfer at both cluster L1 and L2 resolutions (Figures S8A-S8D, accuracy = 0.923 and 0.781, adjusted Rand index = 0.825 and 0.631 in cluster L1 and L2, respectively), and the result was concordant with those by other tools, CellTypist 36 and ProjecTILs 16 (Figures S8E and  S8F).The suitability of Symphony mapping for conditions such as infectious diseases was also verified using silhouette scores (Figure S8G).We collected 1,809,668 CD4 + T cells collected from 647 cases and 306 controls from 25 projects (Figures 3B and S9A; Table S6).For quality assurance, only datasets in which both HC and patients were present and at least three cases were included were used.As a prominent change, Tnaive decreased and Temra increased in various autoimmune diseases (Figure S9B; Table S7).It has been reported that Temra increased in the peripheral blood of rheumatoid arthritis (RA), MS, ulcerative colitis (UC), and Crohn's disease (CD) patients, [63][64][65][66] which was consistent with the present data.Kawasaki disease and type 1 diabetes (T1D) were exceptions among autoimmune diseases, with a slight increase in Tnaive and no significant change in Tcm, Tem, and Temra (Figure S9B), as reported previously. 67,68At cluster L2 resolution, we found that Tnaive MX1 increased in COVID-19, SLE, T1D, and primary Sjo ¨gren syndrome (pSS) patients (Figure 3C).The type I IFN response is essential for viral elimination and has been reported to be associated with COVID-19 pathology 69 and also known to be  associated with SLE, 70 pSS, 71 and T1D. 72Our meta-analysis could detect these effects as the increase of Tnaive MX1.Moreover, in our meta-analysis, Tcm (Th17) was increased in various diseases, including previously reported diseases such as MG, 73 MS, 74,75 and psoriasis. 76Regarding Tregs, we reported that Fr.II (CD45RA À CD25 + ) is increased in sarcoidosis, while Fr.I and III are increased in active SLE. 34Another group reported an increase of Tregs in pSS. 77In concordance with these observations, Treg increased in SLE, neurosarcoidosis, sarcoidosis, and pSS, especially for Treg Eff in neurosarcoidosis and for Treg Act and Treg Eff in SLE patients (Figures 3C and S9B).Interestingly, the acute infection response to COVID-19 showed an increase in Tnaive, whereas influenza infection showed an increase in Temra.We also found age-dependent Tnaive decrease and Temra, Treg Eff increases concordant with previous reports 78 (Figure 3C).Sex differences in immunity are critical, especially for autoimmune diseases, because 80% of autoimmune disease occurs in females. 79Previous reports have addressed several changes in females, such as the increase of recent thymic emigrants 80 and greater activation responses by in vitro stimulation. 81As for gender differences, we observed a decrease in Tcm (Th2), Tem (Th1/17), Temra (Th1), and Treg Eff and an increase in Tnaive Act, Tnaive MX1, Tnaive SOX4, Tem (Tph), and Treg Naive in females.These alterations depending on disease, gender, and age were also observed as specific distributions on the principal-component analysis plot (Figures 3D-3F, S9C, and S3D).Overall, we profiled numerical features of CD4 + T cells in broad autoimmune status, age, and gender.
Next, to measure the quality changes in autoimmune diseases, we applied NMFproj to the datasets and investigated NMF cell feature changes in each cluster L2 population (Figures 3G,  S10A, and S11; Table S9).To add measurements for the biological significance, we also calculated a scaled intercept as the baseline gene program activity (Figure S11).The strongest skews were enriched in NMF7 (INF-F) in SLE and COVID-19 patients in a cell-type-wide manner.We also found that even neutral populations such as Tnaive, Tnaive (Act), and Tcm (Th0) showed disease-specific propensities.For example, NMF0 (Cytotoxic-F) increased in RA, MS, and pSS, NMF10 (Tissue-F) was increased in MS, COVID-19, SLE, and neurosarcoidosis, while NMF3 (Naive-F) decreased in a broad range of autoimmune diseases.In Treg cells, NMF1 (Treg-F) decreased in T1D, MG, and MS, indicating the dysfunction of Treg in these diseases independently of the number of Treg cells.In particular, for T1D and MS, where the sample size was large, we investigated the differentially expressed genes in each Treg subtype in each diseases and examined the genes with high feature values for NMF1 (Treg-F) (Figure S10B).Variations in the NMF1 (Treg-F)-related genes were most pronounced in Treg Eff, with T1D showing a decrease in functional molecules such as IL2RA, TNFRSF1B, and Treg regulators such as FOXP3, IKZF2, while MS showed a decrease in IL2RA, IL10RA, IKZF4, along with an increase in CTLA4, which is one of the prominent Treg effector molecules.Age-dependent increases of NMF8 (Cent.Mem.-F) and NMF4 (Act-F) were also observed.In females, NMF4 (Act-F) was enhanced broadly.Other findings, such as the upregulation of an activation molecule, CD69, which was the feature gene of NMF10 (Tissue-F), in MS and SLE, 82 and the reduction of effector Tregs and their decreased function in T1D, 83 are also consistent with our metaanalysis.In conclusion, by meta-analysis, in addition to quantitative changes, we identified qualitative changes depending on the disease, sex, and age at a resolution that is difficult to observe in existing methodologies.
Predictive potential of peripheral CD4 + T profiles for autoimmune diseases Given that each autoimmune disease disorder possessed a characteristic CD4 + T cell profile, we hypothesized that disease status might be predicted solely from CD4 + T profiles by utilizing machine learning techniques and our autoimmune-wide scRNAseq dataset (Figure S12A).To confirm this, we created three models for the prediction of disease status.The first model took the frequency of each cell population in cluster L2, the second model took the cellular features of the NMF in subsets, and the last model took both as input parameters (Figure 3H).We note that, for NMFproj features, only Tnaive and Tcm (Th0) cell features, which were affected by various conditions as discussed previously (Figures 3G and S10A), were used to avoid over-fitting.First, binary classification models were constructed for SLE, COVID-19, and MS to distinguish between certain diseases and healthy individuals using logistic regression where equal numbers of diseased and healthy subjects were used for the training, and the models were evaluated using samples from independent projects from these used for the training (Figure 3I).SLE and COVID-19 yielded relatively good predictions from cell frequencies alone (area under the curve [AUC] = 0.84, 0.82 in SLE and COVID-19, respectively).NMF cell features alone also could predict SLE and COVID-19 well (AUC -= 0.94, 0.85 in SLE and COVID-19, respectively), and the perdition using both cell frequencies and NMFproj values further improved accuracy (AUC = 0.94, 0.91 in SLE and COVID-19 respectively).For MS, for which hematological biomarkers have not yet been well established, prediction from cell frequencies or from NMFproj values alone was not successful (AUC = 0.61, 0.47 in cell frequency and NMFproj, respectively), while both cell frequencies and NMFproj values resulted in better predictions (AUC = 0.75) than previous plasma based methods. 84Next, we built multiclass classification models that more closely resemble real-world clinical practice.The multiclass classification was assumed to be a more difficult task due to the similarities among autoimmune diseases and the imbalance of training sample sizes.We trained a gradient boosting model with 5-fold crossvalidation on data from 714 samples from 8 diseases or healthy for which at least 10 samples were available for training and then evaluated the model using the data from the independent dataset (Figures S12C and S12D).The model trained only by cell frequencies or by NMFproj values could predict COVID-19, SLE, HC, and MS (area under the precision-recall curve [PR-AUC] = 0.72, 0.68, 0.48, 0.12 in the cell frequency model and 0.91, 0.75, 0.47, 0.35 in the NMFproj model, for COVID-19, SLE, HC, and MS, respectively), and the model trained by both cell frequencies and NMFproj values in Tnaive and Tcm (Th0) was marked with superior accuracy (PR-AUC = 0.92, 0.69, 050, 0.33 for COVID-19, SLE, HC, and MS, respectively).The combined cell frequency and NMF model also achieved the highest accuracy (the cell frequency model: 0.413, the NMF model: 0.484, and the cell frequency + NMF model: 0.527).These results highlight that the disease-specific changes in CD4 + T cells, both in terms of quantitative and qualitative alterations, contribute to the prediction of disease status.

Partitioned heritability of autoimmune diseases on CD4 + T cell NMF features
We examined the association between CD4 + T characteristics and genetic factors for each disease and trait.Studies using GWAS statistics have reported associations between autoimmune diseases and immune cells. 2,85In particular, stratified linkage disequilibrium score regression (S-LDSC) has revealed the association between CD4 + T cells and autoimmune diseases by stratifying the heritability of polygenic autoimmune diseases by genetic features. 86,87Since we captured elaborate CD4 + T cell features, we investigated which of these features was associated with each trait using the S-LDSC framework.In addition to the cell-type-specific genes of cluster L2, 12-dimensional features extracted by NMF were used as genetic features.The Roadmap enhancer-gene linking (Roadmap) and activity-bycontact strategies introduced in the sc-linker framework 88 were used for linking genes and SNPs.We note that as some gene programs were shared with other immune cells (Figure S6G), some associations below may potentially be influenced a broader range of cell types.Among 180 traits, autoimmune diseases showed significantly high enrichment in NMF features (p = 8.51 3 10 À12 ), suggesting autoimmunity is closely associated with CD4 + T cells (Figure 4A).Cross-sectional disease association revealed that many diseases, such as inflammatory bowel disease (IBD), RA, and MG, have an enrichment of heritability on NMF1 (Treg-F) (Figure 4B).By focusing on the most accumulated factors for each disease, we found that autoimmune diseases can be divided into several groups.For each disease, the most enriched gene features were observed as NMF1 (Treg-F): RA, UC (deLange), hypothyroidism; NMF2 (Th17-F): CD (deLange), IBD (deLange), MG; NMF5 (TregEff/ Th2-F): celiac disease, T1D; NMF7 (IFN-F): SLE, primary biliary cirrhosis; NMF10 (Tissue-F): MS, psoriasis.In MS, accumulation was observed in various features, including NMF2 (Th17-F) and NMF11 (Th1-F).The heritability of each autoimmune disease was accumulated in several factors, suggesting that autoimmune diseases have multiple susceptibilities.In other traits, weak accumulation on NMF1 (Treg-F), NMF2 (Th17-F), and NMF11 (Th1-F) was common in COVID-19 in both severe symptoms and infection, while NMF7 (IFN-F) was infection specific (Figure S13A).Lymphocyte counts were also susceptible to NMF4 (Act-F) (Figure S13A).We also examined enrichment in celltype-specific genes.Similar enrichment patterns, such as Treg Naive in most autoimmune diseases and Tnaive MX1 in SLE, were observed, while caution should be taken as marker gene detection is not optimal for CD4 + T cells as in the previous section (Figure S13B).Taken together, we comprehensively profiled heritability enrichment on CD4 + T cell gene features across autoimmune diseases.
Partitioned heritability is associated with qualitative and/or quantitative changes in CD4 + T in a diseasespecific manner Finally, we compared partitioned heritability and observed changes in CD4 + T in terms of quantity (cell frequency) and quality (NMFproj) to assess the genetic effect on phenotypic changes (Figure 5A).We first investigated the correlation between enriched heritability and changes in cell frequency and NMF features for diseases that were enrolled in our analysis for both GWASs and meta-analysis.We found several patterns depending on the disease (Figure 5B).First, in MG and psoriasis, both changes in cell frequency and NMF feature correlated with Next, we examined MS, MG, and SLE, whose samples were collected in this study, in more detail (Figure 5C).In MG, NMF2 (Th17-F), which accumulated heritability most significantly, was increased in Tcm (Th17) and Tem (Th1/17) in correlation with genetic factor accumulation, and the cell frequency was also correlated with genetic factors (Figures 5C and S13).On the other hand, NMF1 (Treg-F), which showed the next highest accumulation of heritability, was negatively correlated with the genetic effect and lower in Treg cells (Figure S13), indicating that the dysfunction of Treg cells might be enhanced by the genetic effect.In MS, the highest heritability accumulation was observed in NMF10 (Tissue-F).This factor was increased in all cell populations without cell specificity, resulting in a low correlation with the heritability.SLE susceptibility was most accumulated in NMF7 (IFN-F).In our meta-analysis, an enhancement of NMF7 (IFN-F) was observed in all cell populations, especially in Tnaive MX1.
Overall, our study cataloged heritability enrichment and phenotypic changes across autoimmune diseases, enabling elucidation of the disease-specific effect of underlying genetic factors on CD4 + T cell phenotypes.

DISCUSSION
The classifications and characterizations of CD4 + T cells have been challenging, with cellular heterogeneity being a major obstacle. 4In this study, by performing single-cell analysis on CD4 + T cells from autoimmune and healthy subjects, we succeeded in mutually exclusive and collectively exhaustive subtype identifications of peripheral CD4 + T cells.Moreover, in contrast to the conventional dualistic comparisons, such as Th1 vs. Th2 and Treg vs. Th17, the NMF-based decomposition revealed that CD4 + T cells are formed by a combination of 12 features rather than simple contradistinction.While qualitative profiling by NMF was not suitable for numerical evaluation, it allowed for a more robust assessment of gradual cell populations.Moreover, our results can also be extrapolated for other singlecell and bulk RNA-seq studies by using a label transfer and the projection of NMF features.These analytical frameworks also allowed us to perform autoimmune-wide single-cell meta-analyses and integration of CD4 + T cell features with GWASs.As a result, we comprehensively cataloged CD4 + T cell alterations in 20 diseases, providing a valuable resource for a broad range of disease research.The assessment of qualitative changes through NMFproj enabled us to explore biological insights, such as Treg functional abnormalities, that were previously unattainable using cytometry.Furthermore, the decomposition of gene programs using NMF was beneficial not only for T cell profiling but also for interpreting GWAS results.We found that genetic factors can have both disease-specific and cross-disease impacts on autoimmune conditions.The accumulation of heritability on Tregs across diseases and the disease specificity of other features may be potential clues for future therapeutic development.
By examining genetic factors, CD4 + T cell changes, and TCR characteristics in a disease-specific manner, we gained valuable insights into various diseases.For example, MG is caused by autoantibodies against the neuromuscular junction, with germinal center responses involving Tfh cells and B cells within the thymus. 7,89While Th17 function enhancement has been reported in MG, 73,90 we also observed a heritability enrichment in NMF2 (Th17-F), suggesting that tissue damage by Th17 cells may contribute to symptom completion and persistence.In MS, we observed an increase in NMF10 (Tissue-F) and heritability enrichment in addition to the previously known Th17 and Th1 increase and functional enhancement. 74,75,91These results suggest that a strong tissue inflammatory response is involved in MS and emphasizes that the tissue-specific gene program centered on the AP-1 family may be a novel MS-specific therapeutic target. 92In addition, in MS, while Treg Eff slightly increased, the quality of Treg cells in terms of transcriptome and TCR was low, indicating that the compensation of Tregs from Tconvs is an explanation for Treg dysfunction in MS. 93 In SLE, our analysis of heritability enrichment and qualitative alterations supports the traditional belief that type I IFN is central to the disease. 94Type I IFN drives differentiation into Tregs and Th1 cells, 95,96 and our result suggests that the pleiotropic effect of type I IFN contributed to the complicated cell frequency changes observed in this study.Furthermore, TCR overlaps between Tcm (Tfh) and Treg Act were observed specifically in SLE, suggesting potential Treg-Tfh plasticity in SLE similar to reported Tfh-Treg plasticity under certain inflammatory conditions. 97,98We identified distinct CD4 + T cell responses between COVID-19 and influenza infection, with an increase in Tnaive cells in COVID-19 and Temra cells in flu.This divergence may reflect differences between pretrained immunity to influenza and initial responses to SARS-CoV-2, as most COVID-19 samples were collected before the vaccine rollout.In addition, our meta-analysis revealed sexand age-related CD4 + T cell changes, with new observations such as increased Tnaive MX1 and Tem (Tph) in females, potentially contributing to gender differences in autoimmune disease incidence.Thus, our study highlights the CD4 + T cell features of each disease and condition, providing new insights for consideration.
In addition, this study created a comprehensive single-cell circulating CD4 + T cell catalog across various diseases, providing the opportunity to tackle the challenging task of assessing whether disease prediction is feasible using CD4 + T cell profiles.Our framework offers an approach to measuring CD4 + T profiles from limited cell counts or sparse gene expression data through the pre-defined reference and NMFproj.However, the predictive accuracy of the machine learning model utilizing CD4 + T profiles in this study was still limited, which could be attributed to factors such as limited sample sizes, sample imbalance, and variations between cohorts.Nevertheless, this underscores the potential for predicting disease status using these profiles.Moreover, the distinct variations in CD4 + T profiles across diseases emphasize the necessity for individualized treatments tailored to specific conditions.
In summary, we constructed the frameworks for extracting the CD4 + T cell programs, enabling a comprehensive interpretation of CD4 + T cells.Moreover, the landscape of disease-specific CD4 + T cell alterations and genetic effects provides biological insights for potential precision medicine.

Limitations of the study
There are several limitations worth noting.Firstly, since our meta-analysis contained data from not only whole PBMCs but also pre-enriched CD4 + T cell samples, we could not consider the proportional alterations of CD4 + T cells in overall PBMCs.Furthermore, our study focused on peripheral blood and did not evaluate tissue-specific alterations, such as barrier tissuespecific programs. 99This might be the reason that we could not capture some known vital phenomena for CD4 + T cells, such as anergy and exhaustion.For example, when we compared with ab initio NMF for TILs, while the factors defined in blood were largely preserved CD4 + T cells in tissues, it is worth noting that certain factors, including those encompassing RPA-T4, Thermo Fisher Scientific), PE-labeled anti-CD19 mAb (HIB19, BioLegend), Live/Dead (Thermo Fisher Scientific).Live-CD3 + CD4 + CD19 À cells were isolated using BD Biosciences FACS Aria II or BD Biosciences FACS Aria III.CD4 + T cells and B cells were mixed in equal numbers in some samples.
The sorted cells were loaded to Chromium Next GEM Chip K (10x Genomics) on Chromium Controller (10x Genomics) for barcoding and cDNA synthesis.The library construction was performed using Chromium Next GEM Single Cell 5 0 Kit v2 and Chromium Single Cell Human TCR Amplification Kit (10x Genomics) for 5 0 according to the manufacturer's protocol.The libraries were sequenced on NovaSeq6000 (Illumina).
Integration with bulk RNA-seq dataset Fastq files were processed using an RNA-seq integrative pipeline, ikra (v2.0.1), 102 composed of Trim Galore! 0.6.7, 110Salmon 1.4.0, 111 tximport 1.6.0 112with the reference Gencode M26 for mice and 37 for humans.Datasets for which the TPM matrix was provided were downloaded and used directly for analyses.For the ImmuNexUT (E-GEAD-397) dataset, the downloaded count matrix was converted to TPM.

Gene expression decomposition using NMF
To decompose cellular processes, we applied NMF implemented in scikit-learn (v0.24.2) to normalize gene expression of HVGs.Using the non-negative matrix factorization (NMF) method, the normalized gene expression matrix X, with elements represented as x ij , was decomposed into a gene feature matrix W (with elements represented as w ic ) and a cell feature matrix H (with elements represented as h cj ).In this context, i refers to the gene index, j represents the cell index, and c denotes the component.The decomposition is given by: X = W,H For our analysis, the number of components was determined to be 12 based on two criteria: i) it exceeded the elbow in the distribution of explained variance and ii) it was just before a sharp increase in the maximum inter-components Spearman's correlation.The explained variance for the given number of components c is denoted as Evar c and is derived using the residual sum of squares (RSS) for component c, represented as RSS c : The explained variance Evar c is then:

Figure 1 .
Figure 1.Global profiling of CD4 + T cells (A) Sample collection strategy.HC, healthy control; MG, myasthenia gravis; MS, multiple sclerosis; SLE, systemic lupus erythematosus.(B and C) Cluster layer 1 (L1) and layer 2 (L2) on Uniform Manifold Approximation and Projection (UMAP) embeddings.(D) Dot plot depicting signature genes' mean expression levels and percentage of cells expressing them across clusters.Marker genes for the plot were manually selected.See also Figure S1C for automatically extracted marker genes.(E) Expression correlation of clusters with bulk RNA-seq for sorted CD4 + T cell fractions generated by the DICE project.(F) UMAP plot showing clonotype size.The color indicates the number of cells sharing the same T cell receptor (TCR) sequences.(G and H) Clonotype size distributions across clusters.(I and J) TCR similarity networks in autoimmune patients, healthy donors (I), and all donors (J).TCR similarity was calculated for each sample, and only edges where overlapping clonotypes were detected in R2 (I) or R4 (J) samples are depicted as robust overlaps.The edge color indicates the average TCR similarity of all samples.

Figure 2 .
Figure 2. NMF captured 12 CD4 + T cell features (A) Schematic view of NMF and NMF projection (NMFproj).(B) Matrixplot showing the mean scaled NMF feature weight for each cluster L2 population.The explained variance (Evar) is also shown on the right.The NMF feature weight is scaled by the maximum value for each feature for visualization.(C) NMF cell feature value on UMAP plots.(D)Gene features for each component.The top 10 genes for each feature were selected.The 12 gene features are annotated using top genes and previous reports as NMF0 Cytotoxic-Feature (F), NMF1 Treg-F, NMF2 Th17-F, NMF3 Naive-F, NMF4 Activation-F (Act-F), NMF5 TregEff/Th2-F, NMF6 Tfh-F, NMF7 IFN-F, NMF8 Central Memory-F, NMF9 Thymic emigrant-F, NMF10 Tissue-F, and NMF11 Th1-F.The Reactome pathway with the smallest p value for each gene feature is shown below the heatmap (FigureS3B; TableS5).

(
B) Bar plots showing the number of samples (left) and the number of CD4 + T cells (right) enrolled in the meta-analysis.The dashed line in the left plot indicates a sample size of 10.Celiac, celiac disease; Kawasaki, Kawasaki disease; T1D, type 1 diabetes; AD, Alzheimer's disease; PPP, palmoplantar pustulosis; Parkin-sonDis., Parkinson's disease; BD, Behc ¸et's disease; CD, Crohn's disease; Flu, influenza; pSS, primary Sjo ¨gren syndrome; RA, rheumatoid arthritis; UC, ulcerative colitis.(C) Dot plot showing changes in cell frequency at cluster L2 resolution.Dot colors show coefficients, and sizes show the significance of the generalized linear model (GLM) (STAR Methods).Detailed statistics can be found in Table S8.Only significant dots (p adj < 0.05) are shown.(D-F) Principal-component analysis (PCA) plots of samples based on cell frequencies.Sample distributions for each disease state (D), loading vectors for each cell type (E), and sample characteristics in healthy donors (F) are shown.(G) Chord diagram showing the top 100 significant associations with positive coefficients between NMF features and cells in each condition, calculated by GLM (STAR Methods).Detailed statistics are shown in

Figure 4 .
Figure 4. Partitioned heritability of autoimmune diseases by CD4 + T cell features

Figure 5 .
Figure 5. Relationship between genetic factors and phenotypic changes in CD4 + T cells (A) Model of genetic effect on phenotypic changes in CD4 + T cells.CD4 + T cell changes are observed as qualitative (NMFproj cell features) and quantitative (celltype frequencies) changes.(B) Scatterplot showing the genetic effect on cell frequencies (x axis) and NMF features (y axis).Sc-linker weight per cell was calculated by dot products of sclinker outcome (NMF) and NMF cell features.For cell frequencies and NMF cell features, coefficients of GLM output for each cluster L2 population were used.Spearman's correlation of sc-linker weight and cell frequency/NMF cell feature changes were calculated.For the correlation with sc-linker and NMF cell feature changes, we used the maximum R among NMF features for the visualization.COVID19-A, very severe respiratory symptom; COVID19-B, hospitalized; COVID19-C, SARS-CoV-2 infection.(C) Individual sc-linker weights, cell frequency changes (coefficient for each cluster L2), and NMF cell feature changes in the factor with the highest magnitude (E score) (coefficient for each cluster L2) of MS, MG, and SLE were visualized on the UMAP embeddings (left panel).For the coefficient of the NMF cell feature changes, only one representative factor with the highest E score for each disease is shown.The bar plot of Spearman's correlation of cell frequency and NMFproj changes with partitioned heritability is shown in the right panel.The colors of the bars, except for cell frequency, indicate the E score calculated using sc-linker.

Table S9
. The thickness of edges indicates the coefficient of GLM, and colors indicate conditions such as diseases, gender, and age.(H) Strategy for predicting autoimmune states from CD4 + T cell profiles using machine learning framework.As the input parameters, one model took only cell frequency, age, and gender (without NMFproj), while the other took cell frequency, NMF cell features in Tcm (Th0) and Tnaive, age, and gender (with NMFproj).(I) Receiver operating characteristic curves of logistic regression models trained by cell frequencies (top left), by NMFproj values in Tnaive and Tcm (Th0) (top right), and both cell frequencies and NMFproj values (bottom).SLE, COVID-19, and MS patients were trained on 159, 116, and 35 patients with the same number of healthy subjects, and evaluated on 40, 89, and 17 patients and the same number of healthy subjects from independent datasets.Numbers in parentheses indicate the area under the curve (AUC).