Prediction of linear B-cell epitopes based on protein sequence features and BERT embeddings

Linear B-cell epitopes (BCEs) play a key role in the development of peptide vaccines and immunodiagnostic reagents. The accurate identification of linear BCEs is therefore of great importance for the prevention of infectious diseases and the diagnosis of related diseases. Experimental methods for identifying BCEs are expensive and time-consuming and cannot keep pace with the demand to screen large-scale protein sequence data, so efficient and accurate computational methods are needed to rapidly identify linear BCE sequences. In this work, we developed LBCE-BERT, a new linear BCE prediction method that combines peptide sequence features with embeddings from the natural language model BERT and uses an XGBoost classifier. The model was trained on three benchmark datasets for hyperparameter selection and was subsequently evaluated on several independent test datasets. The results indicate that our proposed method outperforms existing methods in terms of AUROC and accuracy. The LBCE-BERT model is publicly available at: https://github.com/Lfang111/LBCE-BERT.


Materials and methods
This study involved collecting datasets, extracting protein sequence information and converting it into matrix representations, and feeding the resulting matrices into a machine learning model for training and hyperparameter optimization. The method was then compared with existing approaches. The experimental workflow of this work is shown in Fig. 1.

Benchmark datasets
Most of the datasets used in the BCE predictors developed to date have been obtained from the Immune Epitope Database (IEDB) 32 or the Bcipep 33 database, where the data have been experimentally validated. We collected the main benchmark datasets used by existing prediction models. Among them, the early methods ABCPred 29 , Chen 17 and BCPreds 19 all collected data from the Bcipep database to construct their respective benchmark datasets, whereas later methods such as SVMTrip 34 , iBCE-EL 9 , BepiPred2.0 4 , DLBEpitope 30 and EpiDope 31 used data collected from the IEDB. To ensure a fair and comprehensive comparison with previous methods, we used the benchmark datasets shown in Table 1. The datasets of ABCPred, Chen and BCPreds all come from the Bcipep database and have fixed sequence lengths, whereas the data of LBtope and iBCE-EL come from the IEDB database and only LBtope has a fixed sequence length. We therefore chose BCPreds, LBtope and iBCE-EL_training as the training datasets and used the remaining datasets as independent test sets.

BCPreds dataset
The BCPreds dataset was created by El-Manzalawy et al. 19 . First, sequence data for B-cell epitopes were collected from the Bcipep database. Amino acid residues were then added or deleted at the ends of the original sequences so that each peptide chain was 20 residues long. Next, using 80% sequence identity as the threshold, duplicate or highly homologous sequences were removed with CD-HIT, leaving 701 epitope sequences as the final positive samples. The same number of non-epitope sequences of the same length were then obtained from Swiss-Prot 35 as negative samples.

LBtope dataset
The LBtope dataset was one of the first datasets to obtain BCEs from the IEDB database, with five datasets created by Singh et al. 23 . In this work we used only the 'Lbtope_Fixed_non_redundant' dataset, which was processed with the same techniques as the BCPreds dataset.

iBCE-EL dataset
The iBCE-EL dataset was collected from the IEDB database and includes only sequences that have been experimentally confirmed as BCEs twice or more. The original sequence lengths are maintained, and CD-HIT is applied with a threshold of 70% to reduce sequence homology. Unlike LBtope, this dataset does not add or remove residues to fix the sequence length.

Feature representation of peptides
Amino acid composition
Amino acid composition (AAC) 26 represents the frequency of occurrence of each amino acid in a peptide chain and can be expressed as

$$f_i = \frac{R_i}{N}, \quad i = 1, 2, 3, \ldots, 20,$$

where $f_i$ is the proportion of the $i$th amino acid in the peptide chain, $R_i$ is the number of occurrences of the $i$th amino acid and $N$ is the length of the peptide chain.
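As a worked illustration (not taken from the LBCE-BERT code), the AAC vector of a peptide can be computed in a few lines of Python; the example peptide below is an arbitrary toy sequence:

```python
# Minimal sketch: amino acid composition (AAC) of a peptide as a 20-dimensional vector.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues in a fixed order

def aac(peptide: str) -> list[float]:
    """Return f_i = R_i / N for each of the 20 amino acids."""
    n = len(peptide)
    return [peptide.count(aa) / n for aa in AMINO_ACIDS]

print(aac("ACDKACDK"))  # toy peptide; proportions of A, C, D and K are each 0.25
```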

Amino acid pair antigenicity scale
The Amino Acid Pair (AAP) antigenicity scale was first proposed and used by Chen et al. 17 . First, the ratio of the frequency of occurrence of each amino acid pair in positive samples to its frequency of occurrence in negative samples was calculated, and the logarithm of this ratio was normalised to [−1, 1]. Positive samples were obtained from the Bcipep database and negative samples from the Swiss-Prot 36 database, where the negative samples were drawn from all protein sequences in the database except known B-cell epitope sequences. The AAP antigenicity scale is calculated as

$$\mathrm{AAP\ scale} = \log\!\left(\frac{f^{+}_{\mathrm{AAP}}}{f^{-}_{\mathrm{AAP}}}\right),$$

where $f^{+}_{\mathrm{AAP}}$ and $f^{-}_{\mathrm{AAP}}$ are the occurrence frequencies of a specific dipeptide in BCEs and non-BCEs, respectively; the resulting values are then normalised to [−1, 1].

Amino acid trimer antigenicity scale
The Amino Acid Trimer (AAT) antigenicity scale was first used in SVMTrip 34 and is analogous to the AAP antigenicity scale, except that it targets amino acid trimers rather than amino acid pairs; the results are likewise normalised to [−1, 1] on the propensity scale. The AAT antigenicity scale is calculated as

$$\mathrm{AAT\ scale} = \log\!\left(\frac{f^{+}_{\mathrm{AAT}}}{f^{-}_{\mathrm{AAT}}}\right),$$

where $f^{+}_{\mathrm{AAT}}$ and $f^{-}_{\mathrm{AAT}}$ are the occurrence frequencies of a specific tripeptide in BCEs and non-BCEs, respectively.
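To make the log-ratio construction concrete, here is a hedged Python sketch that derives an AAP-style scale (k = 2) or an AAT-style scale (k = 3) from toy positive and negative peptide sets; the pseudocount `eps` and the linear rescaling to [−1, 1] are illustrative assumptions, and the peptides are placeholders:

```python
import math
from collections import Counter
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def kmer_freq(peptides, k):
    """Relative frequency of each k-mer over a set of peptides."""
    counts = Counter()
    for p in peptides:
        counts.update(p[i:i + k] for i in range(len(p) - k + 1))
    total = sum(counts.values())
    return {kmer: c / total for kmer, c in counts.items()}

def antigenicity_scale(pos_peptides, neg_peptides, k=2, eps=1e-6):
    """log(f+ / f-) per k-mer, linearly rescaled to [-1, 1]."""
    f_pos = kmer_freq(pos_peptides, k)
    f_neg = kmer_freq(neg_peptides, k)
    kmers = ["".join(t) for t in product(AMINO_ACIDS, repeat=k)]
    raw = {m: math.log((f_pos.get(m, 0.0) + eps) / (f_neg.get(m, 0.0) + eps)) for m in kmers}
    lo, hi = min(raw.values()), max(raw.values())
    return {m: 2 * (v - lo) / (hi - lo) - 1 for m, v in raw.items()}

aap_scale = antigenicity_scale(["ACDKLM", "KKLMPQ"], ["GGSTVA", "PQRSTV"], k=2)  # toy AAP scale
```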

Sequence embeddings of BERT
The BERT (bidirectional encoder representations from transformers) model proposed by Devlin et al. 37 has gained significant recognition in the field of natural language processing due to its exceptional performance. As a result, it has been widely adopted across various domains, including bioinformatics. For example, Qiao et al. 38 developed BERT-Kcr, a predictor for identifying protein lysine crotonylation sites using the BERT model. Similarly, Liu et al. 39 created BERT-Kgly, a novel predictor for lysine glycosylation sites, by combining features extracted from a pre-trained protein language BERT model with a deep learning model. BERT uses a masked language model (MLM) objective to train a bidirectional transformer that produces deep bidirectional language representations. Its encoder layers use multi-head self-attention to process the left and right contexts simultaneously, allowing all words in a sentence to be processed in parallel. The attention mechanism is built on three vectors: the Query (the target word, or the word to be generated), the Key (the key representation of each word in the context) and the Value (the original value representation of each word in the context). In multi-head self-attention, the Query, Key and Value are first projected by h linear transformations with different parameters, and each of the h transformed sets of vectors is then fed into a self-attention layer. The attention results obtained from the h heads are concatenated to form a representation that combines information from the different subspaces. The main calculation procedure is

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V,$$

$$\mathrm{head}_i = \mathrm{Attention}(QW^{Q}_{i}, KW^{K}_{i}, VW^{V}_{i}),$$

$$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h)\,W^{O},$$

where Q, K and V are the Query, Key and Value matrices, $d_k$ is the dimension of the Key, and $W^{Q}_{i}$, $W^{K}_{i}$, $W^{V}_{i}$ and $W^{O}$ are parameter matrices 40 . When a sentence is fed into the BERT model, the input vector of each word consists of three components: a token embedding, a segment embedding and a position embedding. The token embedding represents each word in the dictionary, encoded according to the chosen tokenisation scheme. The segment embedding indicates whether the word belongs to the first or the second sentence of the input. The position embedding, which encodes the position of a word in the sentence, is computed as

$$PE_{(pos,\,2i)} = \sin\!\left(\frac{pos}{10000^{2i/d_{model}}}\right), \qquad PE_{(pos,\,2i+1)} = \cos\!\left(\frac{pos}{10000^{2i/d_{model}}}\right),$$

where pos is the position, i is the component index of the vector and $d_{model}$ is the dimension of the vector 37 . Context-dependent features can then be obtained from the various encoder layers of the model.
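As a worked example of the positional-encoding formula above (illustrative only, not the LBCE-BERT implementation), the sinusoidal position embeddings can be computed directly with NumPy; the sequence length and model dimension below are arbitrary choices:

```python
import numpy as np

def positional_encoding(seq_len: int, d_model: int) -> np.ndarray:
    """PE[pos, 2i] = sin(pos / 10000^(2i/d_model)); PE[pos, 2i+1] uses cos (d_model assumed even)."""
    pos = np.arange(seq_len)[:, None]                 # (seq_len, 1)
    i = np.arange(d_model // 2)[None, :]              # (1, d_model/2)
    angles = pos / np.power(10000.0, 2 * i / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)                      # even components
    pe[:, 1::2] = np.cos(angles)                      # odd components
    return pe

pe = positional_encoding(seq_len=20, d_model=768)     # one row per token position
```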

Machine learning methods
The XGBoost (eXtreme Gradient Boosting) 41 classifier was used in this study to build the model. XGBoost is a modified version of the gradient boosting decision tree (GBDT) 42 : it combines multiple regression trees so that the predicted values are as close to the true values as possible and has strong generalisation power. The model has two main advantages: it is regularised to prevent overfitting, and it supports parallelisation, which greatly speeds up training.
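For orientation, a minimal XGBoost classifier of the kind described above could be set up as follows; the feature matrix, labels and hyperparameter values are placeholders for illustration, not the settings tuned in this work:

```python
import numpy as np
from xgboost import XGBClassifier

# Toy data standing in for the peptide feature matrix (rows: peptides, columns: features).
X = np.random.rand(200, 50)
y = np.random.randint(0, 2, size=200)  # 1 = epitope, 0 = non-epitope

clf = XGBClassifier(
    n_estimators=300,    # number of boosted regression trees
    max_depth=6,         # depth of each tree
    learning_rate=0.1,   # shrinkage applied to each tree's contribution
    reg_lambda=1.0,      # L2 regularisation term, one of the overfitting safeguards
    n_jobs=-1,           # parallel tree construction
)
clf.fit(X, y)
epitope_prob = clf.predict_proba(X)[:, 1]  # predicted probability of being an epitope
```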

Evaluation metrics
This study employed five commonly used evaluation metrics to assess the model: accuracy (ACC), precision (Pre), sensitivity (Sn), F1 score (F1) and Matthews correlation coefficient (MCC). In addition, we plotted the receiver operating characteristic (ROC) curve and calculated the area under it (AUROC). The metrics are defined as

$$ACC = \frac{TP + TN}{TP + TN + FP + FN}, \qquad Pre = \frac{TP}{TP + FP}, \qquad Sn = \frac{TP}{TP + FN},$$

$$F1 = \frac{2 \times TP}{2 \times TP + FP + FN}, \qquad MCC = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}},$$

where TP, TN, FP and FN signify the numbers of true positives, true negatives, false positives and false negatives, respectively.
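As a quick reference (not taken from the original code), the same metrics can be obtained with scikit-learn from true labels and predicted scores; the small arrays below are placeholders:

```python
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, matthews_corrcoef, roc_auc_score)

y_true = [1, 0, 1, 1, 0, 0, 1, 0]                    # ground-truth epitope labels
y_prob = [0.9, 0.2, 0.7, 0.4, 0.1, 0.6, 0.8, 0.3]    # predicted epitope probabilities
y_pred = [int(p >= 0.5) for p in y_prob]             # hard labels at a 0.5 cut-off

print("ACC  ", accuracy_score(y_true, y_pred))
print("Pre  ", precision_score(y_true, y_pred))
print("Sn   ", recall_score(y_true, y_pred))
print("F1   ", f1_score(y_true, y_pred))
print("MCC  ", matthews_corrcoef(y_true, y_pred))
print("AUROC", roc_auc_score(y_true, y_prob))
```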

Results
Sequence discrepancy between positive and negative samples in the benchmark dataset
We developed a machine learning method based on peptide sequence information to distinguish between BCEs and non-BCEs, under the assumption that positive and negative samples exhibit different sequence patterns. We visualised the overall pattern differences, i.e. the distribution and preference of epitope and non-epitope residues, with Two Sample Logo 43 . Figure 2 illustrates that residue P is significantly enriched at several positions in the positive samples, while residues T, G and D are enriched to varying degrees. In contrast, residue L is significantly depleted at several positions in the negative samples, along with residues I and F. The overall enrichment or depletion rate for specific sequence positions was 9.4%. Figure 3 shows an enrichment or depletion ratio of 6.9% for specific sequence positions in the benchmark dataset LBtope: residues P and Q are enriched to a higher degree in positive samples, while residues L, V and A are depleted to a higher degree in negative samples. It is worth noting that the sum of the enrichment and depletion rates of the different residues at each position reflects the difference between the positive and negative samples at that position.

Model based on embeddings of pretrained BERT models
In this work, we directly used the BERT pre-trained model constructed by Zhang et al. 44 from 556,603 protein sequences to encode peptide sequences and extract their token embeddings as features. To test the effectiveness of this feature encoding approach, we extracted the embedding of the token 'CLS' and the average embedding of all amino acids in the entire peptide chain as features from the epitope and non-epitope sequences in the BCPreds, iBCE-EL_training and LBtope datasets, respectively. The 'CLS' token is always the first token of each classification sequence, as described by Devlin et al. 45 . Next, we built a model to predict linear BCEs using the XGBoost classifier and trained it with the different features. Table 2 shows the results of the five-fold cross-validation of the different models. The model trained with the embedding of the token 'CLS' consistently outperforms the model trained with the average embedding of all amino acids in the peptide chain, regardless of whether it is trained on BCPreds, iBCE-EL_training or LBtope. Therefore, in the subsequent work we used only the embedding of the token 'CLS' from the protein BERT pre-trained model as a feature. Of the three datasets, the results based on the BCPreds dataset were the best: using the embedding of the token 'CLS' as a feature, the model achieved a five-fold cross-validation AUC of 0.693 and an ACC of 0.629.
Using the average embedding of all amino acids as a feature, the model achieved a five-fold cross-validation AUC value of 0.671 and an ACC value of 0.619.
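For illustration only, the snippet below shows one common way to obtain a 'CLS' embedding (and, for comparison, the average residue embedding) from a protein BERT checkpoint through the Hugging Face transformers API; the checkpoint name and the space-separated input format are assumptions and do not necessarily match the pre-trained model of Zhang et al.:

```python
import torch
from transformers import AutoModel, AutoTokenizer

MODEL_NAME = "Rostlab/prot_bert"  # placeholder protein BERT checkpoint, assumed for this example

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModel.from_pretrained(MODEL_NAME)
model.eval()

peptide = "A C D E F G H I K L M N P Q R S T V W Y"  # residues separated by spaces, as this tokenizer expects

with torch.no_grad():
    inputs = tokenizer(peptide, return_tensors="pt")
    hidden = model(**inputs).last_hidden_state         # shape: (1, seq_len, hidden_dim)

cls_embedding = hidden[:, 0, :]                        # embedding of the first token ('CLS')
avg_embedding = hidden[:, 1:-1, :].mean(dim=1)         # mean over residue tokens, excluding special tokens
```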

Evaluation of model performance on different datasets
The AAC, AAP and AAT features are combined with the embedding of the token 'CLS' from the protein BERT pre-trained model and input into XGBoost for training and hyperparameter search. It is important to note that, although variable-length linear BCEs exist in these datasets, these four feature representations encode a variable-length linear BCE as fixed-dimension matrix information, so variable sequence lengths do not affect model training. We trained three different BCE prediction models on the BCPreds, LBtope and iBCE-EL_training datasets, labelled LBCE-BERT(BCPreds), LBCE-BERT(LBtope) and LBCE-BERT(iBCE-EL). To evaluate model performance, we performed five-fold cross-validation and independent validation on seven datasets (BCPreds, Chen, ABCPred, Blind387, LBtope, iBCE-EL_training and iBCE-EL_independent), with the evaluation metrics calculated as described in section "Evaluation metrics". The following subsections report the five-fold cross-validation results and the corresponding analysis for the three models.
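To make the training pipeline concrete, the hedged sketch below concatenates the four feature blocks and runs a five-fold cross-validation with XGBoost; the array shapes, random data and hyperparameters are illustrative assumptions rather than the exact configuration or search space used for LBCE-BERT:

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from xgboost import XGBClassifier

# Placeholder feature blocks standing in for the extractors described above
# (one row per peptide; column counts are arbitrary for this example).
n_peptides = 1402
aac_feats = np.random.rand(n_peptides, 20)     # amino acid composition
aap_feats = np.random.rand(n_peptides, 400)    # AAP-derived features
aat_feats = np.random.rand(n_peptides, 8000)   # AAT-derived features
cls_feats = np.random.rand(n_peptides, 1024)   # BERT 'CLS' embeddings
labels = np.random.randint(0, 2, n_peptides)   # 1 = epitope, 0 = non-epitope

# Concatenate into one fixed-dimension matrix regardless of peptide length.
X = np.hstack([aac_feats, aap_feats, aat_feats, cls_feats])

clf = XGBClassifier(n_estimators=500, max_depth=6, learning_rate=0.05, n_jobs=-1)
auc_scores = cross_val_score(clf, X, labels, cv=5, scoring="roc_auc")
print("five-fold AUROC:", auc_scores.mean())
```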

Evaluation of model LBCE-BERT(BCPreds)
Table 3 shows that our proposed method, trained on the BCPreds dataset, achieves an AUROC of 0.924 under five-fold cross-validation, a 3% improvement over the EpitopeVec model trained on the same dataset. We then tested this model on other datasets and found that it performed well on both Chen and ABCPred. As shown in Table 4, our method achieves an AUROC of 0.990 on the Chen dataset, compared with 0.959 for EpitopeVec (previously the best method). Additionally, our method has an ACC of 0.941, a 21.6% improvement over the original method. On the Chen dataset, prediction accuracy across the models ranged from 0.494 to 0.941; the lowest-performing methods were iBCE-EL and LBtope, with ACC values of 0.494 and 0.533, respectively. On the ABCPred dataset (Table 5), the accuracy of the original method (ABCPred) is only 0.6593, lower than that of AAP (0.7314), BCPreds (0.7457) and EpitopeVec (0.856). Additionally, LBCE-BERT(BCPreds) has the highest AUROC value of 0.934 among all methods, indicating that our model's performance is stable and efficient. The Blind387 dataset was also tested, and the results are shown in Table 6. This dataset consists of epitopes derived from viruses and was obtained from the published literature. Our model's performance on this dataset was moderate, with an ACC of 0.705, which is 4.1% higher than the original method used to build the dataset but slightly lower than the best method, EpitopeVec; similarly, its AUROC of 0.757 is only 2.2% lower than that of the best method. Compared with BCPreds, Chen, ABCPred, LBtope and iBCE-EL are among the earlier methods for obtaining BCE datasets from the IEDB database. According to the results in Table 7, LBCE-BERT(BCPreds), BCPreds, iBCE-EL and EpitopeVec(BCPreds) have similar accuracy on the LBtope dataset. However, on the training and independent test sets of iBCE-EL (Table 8), EpitopeVec(BCPreds) slightly outperformed LBCE-BERT(BCPreds).

Evaluation of model LBCE-BERT(LBtope)
Table 7 shows the results of the five-fold cross-validation based on the LBtope dataset. The model trained on this dataset performs slightly worse. On the BCPreds dataset (Table 3), the AUROC values for LBCE-BERT(LBtope), EpitopeVec(LBtope) and ABCPred were 0.641, 0.645 and 0.643, respectively, and the ACC values for LBtope and iBCE-EL were 0.5157 and 0.4871, respectively. On the Chen dataset (Table 4), our model is the second-best performing method, following the originally proposed method; the lowest-performing method is iBCE-EL, with an accuracy of only 0.494. LBCE-BERT(LBtope) did not perform well on either ABCPred or Blind387, with AUROC values of 0.617 and 0.656 on these two datasets, respectively. Additionally, on the LBtope dataset, LBCE-BERT(LBtope) achieved an AUROC of 0.733 and an ACC of 0.671, while the other methods achieved ACC values ranging from 52.2 to 69%. Furthermore, as shown in Fig. 3, our model's accuracy was slightly higher than EpitopeVec(LBtope) but lower than the original method on both the training and independent test sets of iBCE-EL. However, our model's AUROC was the lowest among the three models.

Evaluation of model LBCE-BERT(iBCE-EL)
In the five-fold cross-validation of our model on the iBCE-EL training set, Table 8 shows that the AUROC of 0.820 is 2.7% higher than the 0.793 achieved by the current best model, EpitopeVec, while the original method that introduced this dataset reaches only 0.782, indicating that LBCE-BERT(iBCE-EL) is currently the best-performing model on this dataset. On the iBCE-EL independent test set, the original method (iBCE-EL) achieved an AUROC of 0.786 and EpitopeVec(iBCE-EL) achieved an AUROC of 0.785, whereas LBCE-BERT(iBCE-EL) achieved an AUROC of 0.828 and an ACC of 0.757. These results suggest that LBCE-BERT(iBCE-EL) has better generalisation ability. This model was also evaluated on the other datasets; its results on BCPreds (Table 3), Chen (Table 4), ABCPred (Table 5), Blind387 (Table 6) and LBtope (Table 7) are reported in the corresponding tables.

Figure 2. Overall sequence pattern discrepancy between positive and negative samples illustrated by Two Sample Logo 43 in the benchmark dataset BCPreds.

Figure 3. Overall sequence pattern discrepancy between positive and negative samples illustrated by Two Sample Logo 43 in the benchmark dataset LBtope.

Table 1. The dataset used in the benchmarking of our method.

Table 2. Cross-validation results based on two kinds of BERT embedding. a 'avg' means the average of the embeddings of residues in the epitope.

Table 3. Multi-method prediction results on the BCPreds dataset. a To facilitate understanding, the highest value in each dataset is shown in bold. The methods shown in bold are those proposed in this work. b Blank cells indicate that the indicator scores reported in the original publication do not exist.

Table 4. Multi-method prediction results on the Chen dataset. a To facilitate understanding, the highest value in each dataset is shown in bold. The methods shown in bold are those proposed in this work. b Blank cells indicate that the indicator scores reported in the original publication do not exist.

Table 5. Multi-method prediction results on the ABCPred dataset. a To facilitate understanding, the highest value in each dataset is shown in bold. The methods shown in bold are those proposed in this work. b Blank cells indicate that the indicator scores reported in the original publication do not exist.

Table 6. Multi-method prediction results on the Blind387 dataset. a To facilitate understanding, the highest value in each dataset is shown in bold. b Blank cells indicate that the indicator scores reported in the original publication do not exist.

Table 7. Multi-method prediction results on the LBtope dataset. a To facilitate understanding, the highest value in each dataset is shown in bold. The methods shown in bold are those proposed in this work. b Blank cells indicate that the indicator scores reported in the original publication do not exist.