Detection of disease-specific signatures in B cell repertoires of lymphomas using machine learning

The classification of B cell lymphomas—mainly based on light microscopy evaluation by a pathologist—requires many years of training. Since the B cell receptor (BCR) of the lymphoma clonotype and the microenvironmental immune architecture are important features discriminating different lymphoma subsets, we asked whether BCR repertoire next-generation sequencing (NGS) of lymphoma-infiltrated tissues in conjunction with machine learning algorithms could have diagnostic utility in the subclassification of these cancers. We trained a random forest and a linear classifier via logistic regression based on patterns of clonal distribution, VDJ gene usage and physico-chemical properties of the top-n most frequently represented clonotypes in the BCR repertoires of 620 paradigmatic lymphoma samples—nodular lymphocyte predominant B cell lymphoma (NLPBL), diffuse large B cell lymphoma (DLBCL) and chronic lymphocytic leukemia (CLL)—alongside with 291 control samples. With regard to DLBCL and CLL, the models demonstrated optimal performance when utilizing only the most prevalent clonotype for classification, while in NLPBL—that has a dominant background of non-malignant bystander cells—a broader array of clonotypes enhanced model accuracy. Surprisingly, the straightforward logistic regression model performed best in this seemingly complex classification problem, suggesting linear separability in our chosen dimensions. It achieved a weighted F1-score of 0.84 on a test cohort including 125 samples from all three lymphoma entities and 58 samples from healthy individuals. Together, we provide proof-of-concept that at least the 3 studied lymphoma entities can be differentiated from each other using BCR repertoire NGS on lymphoma-infiltrated tissues by a trained machine learning model.


Introduction
B cells are one of the essential pillars of the adaptive immune system that generates highly specific and also long-lasting immunity [1,2].They originate from hematopoietic precursor cells in the bone marrow and acquire their characterizing feature-the B cell receptor (BCR)-in a multistep recombination and selection process [3,4].Each BCR consists of a unique configuration of paired immunoglobulin heavy (IGH) and light (IGL) chains that mediate antigen binding specificities and upon engagement trigger, in concert with coactivator molecules, a cascade of signaling events that result in activation and proliferation [5,6].Activated B cells can then differentiate into plasma cells, which produce and secrete immunoglobulins or antibodies to neutralize cognate antigen, or long-lived memory B cells that are capable to quickly mount high-affinity recall responses [7].To guarantee an adequate arsenal of binders for the enormous breadth of foreign antigens, the immune system uses the process of immunoglobulin VDJ recombination to generate maximal sequence diversity [3,8].During VDJ-recombination in developing immature B cells, randomly chosen variable (V), diversity (D) and joining (J) gene segments within the immunoglobulin loci are recombined to chromosomal sequences encoding a functional BCR [8].This recombination process is facilitated via induced double-strand breaks and DNA repair/ligation mechanisms that may result in additional deletions or insertions that further increase sequence variance of single BCRs [8].On the repertoire level, most of the immunoglobulin diversity is generated in the complementarity-determining region 3 (CDR3) sequence which spans the joined VDJ regions [9,10].In addition, BCR diversity is boosted by somatic hypermutation (SHM), an iterative affinity maturation process that is initiated in response to antigen in the germinal centers (GCs) of secondary lymphoid tissues [7,11].These transient but highly specialized microanatomical structures provide a dynamic environment that enables the proper coordination of repeated SHM and selection cycles to evolve polyreactive low-affinity BCRs into antibodies with maximum epitope selectivity [6,7,11].
Lymphomas represent hematological neoplasms of differentiated lymphocytes which typically originate in lymphatic tissue [12].Interestingly, 95% of all lymphomas are found in B lineage cells [13].Lymphomas are classified based on histopathological and clinical features PLOS COMPUTATIONAL BIOLOGY that mostly depend on the putative cell of origin that has undergone malignant transformation [14].The BCR takes a prominent role as oncogenic driver and mediator of sustained growth in these malignancies [14].This is not only exemplified by the fact that a specific VDJ-rearrangement characterizes the malignant clonotype in a given patient but also by the acquisition of mutations that mimic chronic active BCR signaling [15].Moreover, comparative analyses of a large number of prior sequencing studies have revealed prominent repertoire restrictions up to the extent of very similar or even identical CDR3 sequences across different patients with the same disease that are now recognized as subtype-specific sequence features for classification [16][17][18].Furthermore, non-malignant bystander cells occupy varying space within the heterogenous tumor microenvironments of different lymphomas [19,20].Some lymphomas bear very dominant malignant clonotypes, while in other lymphomas the malignant clonotype is less frequently found in a background of non-malignant bystander cells.These bystander immune cells such as T cells and B cells show a diverse range of other T-cell receptor-gene-or VDJ-rearrangements.
It is generally accepted that the correct diagnostic evaluation of lymphomas by the pathologist is complex in light of the numerous WHO-defined-sometimes rare-entities.Since recognition of the patterns of malignant and bystander cells needs a lot of expertise, evaluation by a consultation pathologist represents a standard in most centers.Here, we hypothesized that high-throughput analysis of the B cell architecture of different types of lymphomas may provide an additional basis for diagnosing disease subtypes.In this context, we set out to employ machine learning algorithms, which have demonstrated their capability to enhance the classification of immune states in antigen receptor-repertoire sequencing data, as valuable tools [21][22][23][24][25]. Since BCR next-generation sequencing (NGS) technology relies on unbiased amplification of all VDJ-rearrangements in the tissue of interest, also benign bystander lymphocytes are detected.Here, we present proof-of-concept that a logistic regression model is capable of differentiating between three paradigmatic lymphomas-nodular lymphocyte predominant B cell lymphomas (NLPBL), diffuse large B cell lymphoma (DLBCL), and chronic lymphocytic leukemia (CLL).This data shows the great potential of finding signatures in such large repertoire datasets consisting of the malignant clonotype and benign bystander cells that-clinicallyuntil now remain largely unexploited.

Characteristics of the lymphoma and control cohorts
In our study, we analyzed a total of 620 lymphoma BCR repertoire samples, comprising 90 from NLPBL (formerly nodular lymphocyte predominant Hodgkin lymphoma; NLPBL [26]), 182 from DLBCL, and 348 from CLL cases.We enriched the data with 291 control BCR repertoires derived from the blood of healthy donors (HD).Basic characteristics of the cohort are outlined in Table 1.Detailed information on individual BCR repertoire samples, including corresponding subjects, is provided in S1 Table.

Broad repertoire metrics in the lymphoma cohorts
In a first step, we compared general repertoire metrics between the four different groups.Blood BCR repertoires of HD were the most diverse and less clonal (Fig 1A -1C).Of the lymphoma cases, NLPBL patients showed lowest clonality, followed by patients with DLBCL and CLL (Fig 1A -1C).Of note, some DLBCL and CLL samples contained only the malignant clone and therefore exhibited a clonality of 1.While the term "repertoire" may not be entirely suitable in these cases, for consistency, we maintained this annotation throughout the paper.The percentage of somatically hypermutated clonotypes provides a rough estimate of how many clonotypes in the repertoire have undergone antigenic selection.The average somatic hypermutation across all BCR repertoires within each cohort was as follows: 19.2% in HD, 18.4% in NLPBL, 44.9% in DLBCL and 22.4% in CLL (Fig 1D).In addition to assessing the overall somatic hypermutation in BCR repertoires, we compared the somatic hypermutation levels of the top 10 clonotypes within each sample (S1 Fig).

Training of machine learning models on BCR repertoires of lymphoma tissue
In our pursuit of accurately classifying lymphoma subtypes based on BCR repertoire characteristics and as an initial benchmarking effort, we developed and trained two machine-learning models.In consideration of interpretability, we opted for two well-established and straightforward models: Logistic regression and random forest.These models were trained using a comprehensive feature set encompassing various aspects of clonotypes, including clonotype fractions, CDR3 sequence lengths, VDJ genes, and Kidera factors [27].Additionally, we incorporated repertoire metrics such as clonality, Shannon diversity index, richness and the fraction of somatic hypermutation within each repertoire into the feature set.Although some of the metrics were correlated, as shown in the correlation matrix (S2 Fig) , we included all relevant metrics to ensure a comprehensive and full understanding of the data, as each metric provides distinct non-overlapping information.
We applied different scenarios in which we challenged the models to discriminate between HD, NLPBL, DLBCL and CLL as well as different combinations thereof.For each scenario we started an individual training phase.For each training phase, the entire dataset comprising the individual classes was partitioned into a training subset (80%) and a test subset (20%).The splitting was performed in a stratified manner, ensuring the proportions of classes were maintained in both subsets, thus guaranteeing an unbiased evaluation of the developed models (Table 2).To account for the imbalance of the classes we used random oversampling, resampling all classes but the majority class.
Within the training phase, we implemented a robust model optimization strategy to avoid model overfitting.This incorporated a stratified k-fold cross-validation approach with k set to 3 folds.The stratification within the cross-validation approach ensured that the ratio of classes remained consistent across each fold and throughout the entire model training process.A grid search strategy was employed for hyperparameter tuning, traversing a predefined hyperparameter space.A list of resulting hyperparameters and the corresponding search spaces can be found in S2 Table .To assess the impact of bystander cells, we varied the number of top clonotypes considered in the calculation, ranging from 1 to 10, and subsequently extending to 20, 50 or 100 clonotypes.

Validation of machine learning algorithm
During the training phase, the models were trained on ⅔ and validated on ⅓ of the training data.This form of validation is crucial in affirming the reliability, predictive capacity, and effectiveness of the models in differentiating the subsets based on the featured BCR repertoire [28].
The validation assessment was undertaken through the analysis of key metrics including accuracy, recall, precision and the F1 score.The evaluation of these performance metrics provided a comprehensive understanding of each model's ability to correctly classify the different cases.and HD on the validation sets using information from the top two to three clonotypes alone, logistic regression exhibited improved performance with an increasing number of top repertoire clonotypes for all other cases.However, its performance began to decline when utilizing 50 or 100 clonotypes.In all comparisons including NLPBL, logistic regression demonstrated superior performance to random forest when considering information from the top 4-20 clonotypes.The observation that especially discriminating NLPBL benefits from the inclusion of a larger number of top repertoire clonotypes aligns well with the recognized significance of the bystander lymphocyte repertoire in this entity.

Final testing of the machine learning models
Upon completing the training phase, the model that exhibited the highest performance based on validation results was further trained on the entire training dataset and evaluated on a previously unseen test dataset.Table 3 showcases crucial metrics, encompassing accuracy, recall, precision, and the F1 score.
Fig 2B shows the results of the best validated models on the test data not available during the training phase.Similar to the validation results, we saw superior performance of the logistic regression models.While the first scenario of CLL, DLBCL and HD could be accurately classified by a logistic model only using information of the top three clonotypes, in the other scenarios, the logistic regression heavily benefits from the information of the top 4-20 clonotypes.Although the best performing models with respect to the validation results used 20 clonotypes, models using the information of only 10 clonotypes performed equally or even better on the test set.

Data separation for n = 1 to n = 100 clonotypes
To gain a more comprehensive insight into the model's performance, we conducted Principal Component Analysis (PCA) on our feature list while varying the number of top repertoire clonotypes included in the analysis.Given the clinical relevance of distinguishing between multiple lymphoma subtypes, our primary focus was on the scenario that encompassed all four groups.When visualizing the data in the first two dimensions, we observed overlapping clusters of lymphoma subtypes across different numbers of top repertoire clonotypes (Fig 3A -3D).Notably, NLPBL BCR repertoires samples exhibited significant overlap with those from HD, whereas DLBCL repertoire samples appeared to overlap more with those from CLL. Examining the various embeddings, it became evident that the problem was not perfectly linearly separable along the dimensions of the first and second principal components.However, we did observe the formation of distinct clusters within these two dimensions.Corroborating our performance findings, the most significant separations were achieved when considering the top 4-20 repertoire clonotypes.

Exploration of the factors contributing most to correct lymphoma classification
Next, we wished to explore, which of the features contributed most to correct lymphoma classification.We show the 20 predictors with greatest coefficient magnitude averaged over all classes in the best performing model for the comparison of NLPBL, DLBCL with CLL and HD (Fig 4A).This overview helps to understand the overall importance and strength of each predictor variable for the model across all classes.We further analyze the contribution of each predictor between pairs of cohorts (Fig 4B).The three most important predictors of the logistic regression were the frequency of the most abundant clonotype in the repertoire and crude repertoire metrics (clonality, Shannon diversity).While the fraction of somatic hypermutation within a repertoire belonged to the 20 most dominant predictors, repertoire richness seemed to play a minor role in the discrimination between NLPBL, DLBCL, CLL and HD.Moreover, the model predictions relied on features such as biochemical properties (Kidera factors) and length of the CDR3 sequence of the most dominant clonotypes.Interestingly, we found that Kidera Factor 7, which is associated with flat and extended conformations, and Kidera Factor 9, which represents the partial positive charge of the side chain, served as predictors for DLBCL.In contrast, Kidera factor 8, which reflects the alpha-helical secondary structure, was associated with CLL.The length of the CDR3 of the most dominant clonotype was associated with NLPBL.
Finally, we found specific gene usages of the most dominant clonotypes to add to the discriminative power of the model.For instance, the expression of IGHV4/OR15-8 and IGH4-34 by the most dominant clonotype was predictive of DLBCL and NLPBL, while IGHJ3 was associated with NLPBL and CLL.

Discussion
We set out to develop a machine-learning tool capable of distinguishing between different types of lymphomas by analyzing the B-cell architecture within lymphoma-infiltrated tissue using BCR-repertoire NGS data.Interestingly, our data revealed that a simple logistic regression with the right set of predictors performed exceptionally well in achieving this goal.It demonstrated a high degree of accuracy when discriminating between the three lymphoma types in our test dataset.
The factors contributing to this successful discrimination were, to some extent, as we anticipated.For example, we found that broad repertoire metrics carried significant discriminatory value.This result aligns with our expectations, especially in diseases characterized by varying levels of immune bystander cells, where such metrics naturally play a role.For this proof-ofconcept study, we intentionally selected lymphoma entities characterized by distinct microenvironments.It is conceivable that fitting the algorithm on lymphoma entities with greater similarities could potentially diminish the discriminatory capacity of the basic repertoire metrics, while highlighting the significance of more specific features.
What made our findings particularly intriguing was the discovery that while we saw some discriminative features, which might have been selected based on prior knowledge of the malignant clonotype, some were not as much prioritized by this regression model.For instance, the classical VDJ-rearrangement known to be present in the malignant NLPBL clonotype (V3D3J6) [29,30] coupled with an unusually long CDR3 sequence length was not prioritized by the algorithm to the extent we would have expected.While one might have anticipated a clear separation based on these features, the discrimination between NLPBL and other lymphomas, such as DLBCL, relied instead quite strongly on other immune repertoire metrics.Overall, most of the discriminatory power came from features of the most dominant clonotypes in the repertoire, yet there was quite some weight on global repertoire metrics underscoring the complexity of lymphoid malignancies.The analysis of patterns of non-malignant bystander cells-unrecognized by traditional bioinformatic analysis-may perspectively shed more light on the immunobiology of lymphoma and may therefore even have broader implications beyond diagnostics.
From a technical perspective, it needs to be noted that several key predictors in our model exhibited correlation with each other, notably observed in the broad repertoire metrics of diversity and clonality.Generally, higher diversity tends to correlate with lower clonality, and vice versa, within biological systems.This relationship arises from the fact that higher diversity suggests a wider range of distinct clones within a population, naturally resulting in a decreased dominance of any single clone (lower clonality).Conversely, lower diversity implies a greater proportion of cells originating from a limited number of ancestral clones (higher clonality).
However, it's crucial to recognize that the association between diversity and clonality can vary depending on the specific context and biological system under investigation.For instance, in certain disease states or immune responses, an increase in diversity might coincide with an increase in clonality if certain clones are selectively expanded in response to specific stimuli.Thus, although correlated, these metrics provide distinct and non-overlapping information.To ensure a comprehensive understanding of the data, we included all relevant metrics in our analysis.While focusing on the comparison between logistic Regression and Random Forest models for fixed numbers of clonotypes, we performed hyperparameter tuning using a grid search approach.We opted for a predefined comprehensive feature set over other feature selection methods to manage computational resource effectively.While acknowledging the potential risk of overfitting, we mitigated this concern by employing a cross-validation regime.
In terms of model complexity, the robust performance of logistic regression and random forest models suggests that the problem at hand may initially appear linearly separable.However, it is noteworthy that logistic regression starts to show limitations when the number of clonotypes exceeds 20, rendering the problem non-linearly separable.This trend prompts the exploration of deep learning models, which are well-suited for handling complex, non-linear relationships within the data.Deep models, such as neural networks, have the capacity to capture intricate patterns and relationships within the data, making them a promising avenue for further exploration.Yet, larger sequencing datasets would be needed to be able to apply deep learning models.In this context, the role of data quantity in optimizing predictive power cannot be overstated.Accumulating more data, particularly diverse and representative samples, is instrumental in bolstering the performance of machine learning models.Additionally, careful consideration of sampling strategies is crucial.Properly balanced and stratified sampling can mitigate issues related to class imbalances and ensure that the model is trained on a comprehensive spectrum of cases.By addressing these aspects, we could refine our machine-learning approach to achieve higher diagnostic accuracy in lymphoma subtyping and potentially even uncover biologically valuable insights into lymphoma clonotypes and their microenvironment.
Our dataset should be considered within the broader framework of evolving research on the application of machine-learning algorithms for lymphoma diagnosis and subtyping.Numerous studies are emerging in the field of digital hematopathology [31][32][33][34][35], and an increasing body of data underscores the pivotal role of machine-learning in harmonizing sequencing data related to lymphoma driver-genes [21][22][23][24][25].Our approach introduces a novel dimension by incorporating immune architecture as an additional layer of analysis.In this respect, the potential of AI may be particularly significant when dealing with challenging scenarios such as suboptimal samples characterized by their small or squeezed nature.In such cases, the BCR repertoire analysis may prove to be less susceptible to interpretation errors compared to traditional morphological assessments.Moreover, specific situations may stand to gain significant benefits, e.g.Richter's transformation that can sometimes be quite complex to diagnose, while in other instances discerning CLL from DLBCL poses fewer challenges.As an added diagnostic tool in complex cases, it also holds the potential to address the widely recognized shortage of personnel in the field of pathology, which has garnered significant attention within the medical community [36].
Together, our study represents a significant advancement in the field as we demonstrate a compelling proof-of-concept: the ability to differentiate between distinct lymphoma entities by leveraging BCR repertoire NGS on lymphoma-infiltrated tissues through the application of a trained machine learning model.This paves the way for further research and potential clinical applications.In the future, our approach may become a component of digital lymphoma diagnostics as an efficient resource to complement conventional techniques.categorical features, we applied one-hot-encoding.Representing the theoretical physico-chemical properties, we calculated the ten Kidera Factors from the individual amino acid sequence of each CDR3 region and augmented the dataset with the clonality, Shannon diversity index and richness for each repertoire as described.In order to deal with the variable number of clonotypes we set a fixed number of clonotypes per repertoire ranging from n = 1 to 10 and subsequently extending to 20, 50 or 100 clonotypes.Based on their clonotype fractions, we extracted the n most dominant clonotypes within each repertoire while discarding the remaining clonotypes.For repertoires with fewer than n clonotypes we applied zero-padding to ensure consistent input dimensions across all samples.We concatenated the features of the n clonotypes to a single vector representing a single repertoire within 111 (n = 1) up to 10151 (n = 100 dimensions.Depending on a freely varying hyperparameter we scaled the numerical feature across all vectors to have zero mean and unit variance.We partitioned the data into a training (80%) and test set (20%) ensuring the proportions of classes were maintained in both subsets.

Training
Within the training process we fitted two different model types in a supervised manner.The first model was a random forest consisting of an ensemble of single decision trees.The second model consists of a logistic regression minimizing a multinomial loss.While focusing on the comparison between both models we performed hyperparameter tuning using a grid search approach.We opted for a predefined comprehensive feature set over feature selection methods to manage computational resource effectively.While acknowledging the potential risk of overfitting due to feature correlation (S2 Fig) , we incorporated all relevant features available to ensure the most comprehensive information about the data.We tried to mitigated potential pitfalls by using a stratified k-fold cross-validation approach with k set to 3 folds fitting each model with a different set of hyperparameters (S2 Table ) to a subset of the training data.We calculated the performance in form of the F1 score on the remaining part of the training set and averaged the obtained scores over all folds.The best performing settings of hyperparameters were chosen and used to train the model on the entire training set.

Testing
The best performing models with respect to the training phase were used to predict unseen data from the separate test set.We compared the predictions of each model to the known labels and calculated a final F1 score.
All analyses and data plotting were conducted using RStudio (version 1.1.456)or Python (version 3.11.5)within a Conda environment.Model fitting and evaluation were executed using scikit-learn 1.3.0.All computations were carried out on a MacBook Pro featuring an M1 processor, with Kernel Version Darwin 22.6.0 and macOS 13.5.2.
All code is available on github.com/paulovic96.

Fig 1 .
Fig 1. Broad BCR repertoire metrics in lymphoma and control cohorts.In the four panels, the essential repertoire metrics (A) clonality, (B) richness, (C) diversity, and (D) somatic hypermutation rate are shown with corresponding quantiles (Q 0.25 , Median, Q 0.75 ).We pairwise performed a two-sided Mann-Whitney-U-Test with α = 0.05 (*** p < 0.001).https://doi.org/10.1371/journal.pcbi.1011570.g001 Fig 2. F1 scores of logistic regression models in training and test sets.(A) shows F1 scores averaged over validation folds during training in all three scenarios and (B) those of best validated logistic regression model on the test set in all three scenarios.As a comparison, the performance of the best random forest is displayed in green.https://doi.org/10.1371/journal.pcbi.1011570.g002

Fig 4 .
Fig 4. 20 lymphoma subset predictors with greatest coefficient magnitude.(A) Predictors were averaged over all classes in the best performing model for discrimination of HD vs. NLPBL vs. DLBCL vs. CLL.(B) Contribution of each predictor to the discrimination between pairs of cohorts.https://doi.org/10.1371/journal.pcbi.1011570.g004