Novel machine learning method allerStat identifies statistically significant allergen-specific patterns in protein sequences

Cutting-edge technologies such as genome editing and synthetic biology allow us to produce novel foods and functional proteins. However, their toxicity and allergenicity must be accurately evaluated. It is known that specific amino acid sequences in proteins make some proteins allergic, but many of these sequences remain uncharacterized. In this study, we introduce a data-driven approach and a machine-learning method to find undiscovered allergen-specific patterns (ASPs) among amino acid sequences. The proposed method enables an exhaustive search for amino acid subsequences whose frequencies are statistically significantly higher in allergenic proteins. As a proof-of-concept, we created a database containing 21,154 proteins of which the presence or absence of allergic reactions are already known and applied the proposed method to the database. The detected ASPs in this proof-of-concept study were consistent with known biological findings, and the allergenicity prediction performance using the detected ASPs was higher than extant approaches, indicating this method may be useful in evaluating the utility of synthetic foods and proteins.

Food allergies are atopic disorders, which can be classified into immunoglobulin E (IgE)-mediated and non-IgE-mediated disorders. Allergies to cow's milk, egg, wheat, and peanuts, for examples, are IgE-mediated. Food allergies are typically caused by hypersensitivity of the immune system to specific proteins in foods, which brings about various allergic reactions ranging from itchiness, swelling of the tongue, vomiting, diarrhea, hives, trouble breathing, low blood pressure, and systematic anaphylaxis in severe cases (1,2). Gupta et al. (3) showed that children under 18 years in the United States yielded an estimated food allergy prevalence of 8% and that approximately 40% of patients with food allergies have experienced a lifethreatening allergic reaction. Food antigen-specific IgE antibodies play a pivotal role in most of immediate hypersensitivity to food components. The cross-linking of IgE receptors on mast cells and basophils triggers the release of mediators such as histamines and proteases (4). Antigen-specific IgE production from B cells requires for T-cell help, and T-cell-derived cytokine induces the class-switch of B cells into IgE-producing plasma cells (5). The production of such cytokines from helper T cells is regulated by T-cell receptors, which mediate signaling through the affinity to T-cell epitope peptides presented on human leukocyte antigen (HLA) molecules expressed in professional antigen-presenting cells, and thymic microenvironments formed by medullary thymic epithelial cells (mTECs) enable T-cells lineage commitment by presenting self-antigens (6).
Some of the proteins contained in foods are the cause of food allergies. Currently, the allergenicity of proteins has been evaluated by a method based on the homogeneity to known allergenic proteins. In many countries, the method according to the Food and Agriculture Organization-World Health Organization criterion is used, in which proteins having >35% similarity in the 80-aa sliding window of allergen proteins or those identical to six-to-eight contiguous amino acids that are contained in allergen proteins are regarded as allergenic proteins (7). However, novel foods and functional proteins can now be created using genome editing technologies and synthetic biology. The goal of this study is to develop a prediction method that is more reliable even for such novel proteins. Recently, machine learning (ML) has been applied to various research field, and various bioinformatic methods have also been proposed to improve the prediction performance of Food and Agriculture Organization/World Health Organization rules. Most of them are built on whether the protein contains a similar amino acid subsequence to known allergen-specific peptides, which we call allergen-specific patterns (ASPs) (8)(9)(10)(11)(12)(13)(14)(15)(16). Unfortunately, these bioinformatic methods are still limited in their ability to identify allergen proteins that does not contain known ASPs; therefore, we developed a ML method to identify unknown ASPs by data-driven approach.
The goal of this study is to identify new ASPs that are responsible for allergenicity from a database of allergen proteins and nonallergen proteins. We propose an ML method called allerStat to efficiently identify statistically reliable ASPs and use them to predict food allergic reactions. This problem is both computationally and statistically challenging. The computational challenge is that, since the number of all possible subsequences in protein amino acid sequences is extremely huge, we need to develop a computational trick that enables an efficient exhaustive search. The statistical challenge is that, when a selection is made from a huge number of candidates, it is difficult to properly evaluate the reliability (p values and confidence intervals) of the selection, due to selection bias (c.f., multiple-comparison bias). Hence, it is necessary to develop a method to properly mitigate the bias. The main novelty of allerStat is that it overcomes these computational and statistical challenges by effectively combining sequence mining (17)(18)(19)(20)(21)(22)(23)(24) and multiple testing correction methods (25)(26)(27)(28)(29)(30)(31)(32)(33)(34)(35)(36)(37).
As a proof-of-concept (PoC) study, we developed a dataset consisting of 21,154 proteins classified according to whether they have been reported for allergic reactions (2248 allergic and 18,906 nonallergic proteins) by collecting them from multiple databases. We applied allerStat to the dataset and successfully identified 5994 statistically significant ASPs after correcting the selection bias at the significance level of α = 0.05. We observe that a part of the identified ASPs have high HLA type II binding activity which plays a significant role in atopic diseases and food allergies (38) and are consistent with known IgE epitopes (39). Furthermore, we develop an allergic reaction prediction method based on the identified ASPs and demonstrate that its prediction performance of the method is better than existing methods (40)(41)(42). We provide the database for the PoC study in the supplementary materials and the code is provided at https://github.com/takeuchi-lab/allerStat.

Results
In this paper, we propose an ML method called allerStat for identifying amino acid subsequences highly associated with allergic reactions, which we call ASPs. As a PoC, we constructed a protein dataset that contains both allergic and nonallergic proteins in various biological categories, and we applied allerStat to the dataset.

Allergen protein dataset
The input format of the dataset for allerStat is illustrated in Figure 1A. Each row of the table represents an individual protein. Each protein consists of three pieces of information: amino acid sequence, biological category, and the presence or absence of allergic reaction (label). We call a protein that causes or does not cause allergic reaction as an allergenic protein or a nonallergenic protein, respectively. Biological category information is needed to avoid misidentifying category-specific patterns as ASPs. From the original dataset, we extracted the information as illustrated in Figure 1B, in which each biological category in the dataset is classified into three types. The first type contains both allergen and nonallergenic proteins (called paired category), the second type only contains allergenic proteins (called positive-only category), and the third type only contains nonallergenic proteins (called negative-only category). The distinction of these three types of categories is needed because, e.g., amino acid subsequences frequently observed in a positive-only category cannot be distinguished whether they are specific to the category or to an allergic reaction.
As a PoC, we developed a dataset of 21,154 proteins whose presence or absence of allergic reactions have been already verified (see the Experimental procedures section). The whole dataset in the form of Figure 1A is given in Data S1, and the biological category information of the dataset in the form of Figure 1B is given in Data S2. The PoC dataset consists of 2248 allergenic proteins and 18,906 nonallergenic proteins. The average, minimum, and maximum lengths of the proteins are 421, 5, and 34,350, respectively. There are 20 paired categories, 204 positive-only categories, and one negative-only category. It is generally difficult to discriminate between allergenic and nonallergenic proteins. Thus, we focused on 20 foods that have been well analyzed for allergens. Allergens and their family proteins were deleted based on reviewed protein information obtained from UniProt, and 17,372 nonallergenic proteins were prepared. Furthermore, the thymic medulla provides a unique microenvironment in which virtually all the selfantigens are presented so that autoreactive T cells are eliminated from the T cell repertoire before emigrating to the periphery, thus establishing central T cell tolerance. mTECs play a pivotal role in self-antigen presentation process by virtue of their promiscuous expression of tissue-restricted proteins. Therefore, such a protein expressed in mTECs basically cannot induce allergic reactions. The published gene and protein expressions were integrated to create 1534 nonallergenic proteins. When this PoC dataset is used for prediction performance evaluations, we need to take care of the fact that many similar proteins are contained in each category. To avoid data leakage issue, we employed, what we call, leave-categoryout cross-validation (LCO-CV). Fig. S1 shows the distributions of several physico-chemical features of allergenic and nonallergenic proteins, where, for each protein, these feature values are computed by the average for amino acids in it, with the feature values for each amino acid being presented in Data S3. There is no significant difference in the distributions of any physico-chemical features between them, which suggests that it is impossible to predict allergenicity by simply using those features. Figure 1C shows examples of amino acid sequences and patterns (subsequences). We call contiguous amino acid subsequences of various lengths patterns. The goal of this study is to find the patterns highly associated with allergic reactions as ASPs. The challenges stem from the fact that there exists an extremely large number of candidate patterns to be considered. For example, if we consider patterns up to length 50, because there are 20 amino acids, there are 20 + 20 2 + ⋅⋅⋅ + 20 50 = 10 65 possible patterns. In fact, we have to consider only patterns contained in the dataset, but even so, the total number of patterns in the PoC dataset is as large as 3,783,825,994. Such an extremely large number of candidate patterns causes not only computational but also statistical difficulties. The main contribution in this paper is to develop a method for overcoming these computational and statistical difficulties. The number of patterns contained in the PoC dataset by their lengths is shown in Fig. S2.

Allergen-specific patterns
In this study, we call the patterns that satisfy all the following three conditions as ASPs. The definition of an ASP is that (1) the pattern is observed statistically significantly more frequently in allergenic proteins than in nonallergenic proteins, (2) the pattern is observed in none of the nonallergenic proteins, and (3) the pattern is not specific to particular biological category.
The first condition is needed to guarantee the statistical reliability of the identified ASPs. The second condition means that, if we expect that an ASP is considered as a candidate of a cause of allergic reaction, it should not be observed in any  nonallergenic proteins. The third condition suggests that patterns specific to a particular biological category should not be mistakenly identified as ASPs. In this study, the third condition is verified by checking whether one of the following conditions is met: (3a) the pattern is observed in at least one paired category or (3b) the pattern is observed in multiple positive-only categories.
Because a paired category contains both allergenic and nonallergenic proteins, a pattern observed more frequently in allergenic proteins can be considered allergen-specific, rather than specific to the biological category (condition 3a). On the other hand, because it is not possible to distinguish whether the pattern contained in a single positive-only category is specific to the biological category or the allergic reaction, we only regard the patterns as ASPs if they are commonly contained in multiple positive-only categories (condition 3b). Figure 2A shows the definition of ASPs, and Figure 2B shows examples of ASPs and non-ASPs.

Statistically significant pattern mining method
Because the number of candidate patterns is huge, it is necessary to introduce a computational trick to reduce the computational cost. Efficient algorithms for handling large number of patterns have been studied extensively in the data mining community. In particular, methods for efficiently handling sequential patterns are called sequence mining (17)(18)(19)(20)(21)(22)(23)(24), and our method is built on one of the sequence mining methods called prefix-span (20). The large number of candidate patterns raises not only a computational challenge but also a statistical one. Over the past decade, several methods have been proposed for evaluating the statistical significance of discovered patterns in several data mining tasks (33,(43)(44)(45)(46)(47). In this paper, we evaluate the statistical reliability of ASPs by adapting the techniques developed in these studies to our problem setup.
As mentioned above, because the number of candidate patterns is huge, the computational cost will be extremely large if all the candidate patterns are explicitly handled in the method. In sequence mining, a tree structure of subsequence patterns is constructed, and efficient computation is made possible by pruning the branches of the tree structure, as illustrated in Figure 3A. In the most basic sequence mining task called frequent sequence mining, one can efficiently find patterns whose frequency (called support) is greater than a certain threshold. Frequent sequence mining takes advantage of the fact that the frequency of the parent pattern in the tree structure is no less than the frequency of the child patterns. Figure 3A illustrates the concepts of tree structure and pruning in a frequent sequence mining task.
Statistical hypothesis testing based on contingency tables can be used to statistically evaluate whether certain amino acid subsequence is more frequently observed in allergenic proteins. We employed the Fisher Exact Test (FET) for testing the significance of a contingency table. Because we consider a large number of candidate patterns, we need to consider testing the statistical significance of a large number of contingency tables by multiple FETs.
When multiple tests are conducted, a multiple testing correction is needed to adequately control the risk of false positive (FP) findings. In the context of multiple testing, one is often required to control the probability of finding one or more FPs below the significance level, a criterion known as the family-wise error rate (FWER). If we employ FWER, we usually specify the desired FWER level α (like α = 0.05) and then calculate the significance level per test (called nominal significance level) δ. The most basic multiple testing correction method for controlling FWER is called the Bonferroni correction δ = α/T, where T is the number of tests. This can assure that FWER becomes less than α. However, unfortunately, in our problem, T is the number of candidate patterns and is so large that the Bonferroni correction is overconservative. In fact, because there are 3,783,825,994 candidate patterns in the PoC dataset, only those with the nominal FET p value smaller than δ =0.05/3,783,825,994 < 1.3 × 10 −11 can be considered statistically significant for FWER < α = 0.05. Figure 3B illustrates that the statistical reliabilities of the identified ASPs can be quantified by considering multiple testing problems with multiple FETs each of which corresponds to each pattern in the dataset. Various multiple testing correction and other selection bias correction methods have been studied (25)(26)(27)(28)(29)(30)(31)(32)(34)(35)(36)(37).
In this study, we introduce a method for multiple testing correction for large number of contingency tables by effectively combining a randomized test called the Westfall-Young (WY) method (48) with a sequence mining method. What we expect for δ is that FWER is less than α when the data is randomly generated. So, in WY method, the labels (i.e., allergen/nonallergen) are randomly shuffled to create a randomized dataset which does not contain any ASPs by construction. Then, we conduct multiple tests for the randomized dataset and calculate the smallest (nominal) FET p value. Because all selected patterns from the shuffled dataset in WY method are interpreted as FP findings, to avoid them, the significance threshold for FWER control must be smaller than all the (nominal) FET p values obtained from the shuffled datasets. By generating multiple (e.g., 10,000) shuffled datasets with different random seeds, we can estimate the distribution of minimal (nominal) FET p values. The adjusted significance level δ is defined to be the lower 100α% point of the minimal (nominal) FET p value distribution. Figure 3C schematically illustrates the multiple testing correction for evaluating the statistical reliability of the selected ASPs by WY method.
Unfortunately, because the WY method requires the calculation of (nominal) FET p values for all possible patterns, it cannot be used as it is. Thus, we introduce a trick to avoid computations that does not affect the lower 100α% of the minimal (nominal) FET p value distribution. Concretely, we compute the lower bound of the (nominal) FET p values of the patterns included in each branch of the tree structure shown in Figure 3C. Then, it is possible to avoid the computation for the patterns which do not affect the lower 100α% point of the minimum (nominal) FET p value distribution by pruning the branches based on the lower bound. The main technical difference of allerStat from existing sequence mining methods is that the tree pruning strategy is designed so that branches only containing patterns not affecting the estimation of the FWER distribution in the WY method are pruned. This computational trick was first introduced in (44), in which the goal was to quantify the statistical significance of itemset mining tasks.
In this study, we adapted the method to our sequence mining task.

Statistical properties of the ASPs in the PoC dataset
The ASPs identified in the PoC dataset are listed in Data S4. In total, 5064 and 5994 statistically significant ASPs were AllerStat: Significant allergen-specific sequences found when the statistical significance thresholds were set to be α = 0.01 and 0.05, respectively. Figure 4, A-C show the distributions of the lengths, the adjusted p values, and the supports of the identified ASPs, respectively. Figure 4D shows the distributions of the number of ASPs in each allergenic protein. Figure 4E shows the distribution of how many different biological categories each ASP is included in.

Biological analysis of ASPs
The 5994 ASPs detected by allerStat with α = 0.05 were reprocessed to remove overlapped sequences and were concatenated to 1072 patterns (ConcASPs; Data S5) to make it easier for further analysis (see the Experimental procedures section and Text S1). Among the ConcASPs, 687 patterns had a length of 15 amino acids or more. Therefore, we next examined HLA type II binding activities. A number of recent studies suggest that HLA plays a significant role in atopic diseases and food allergies (49)(50)(51). Protein antigens in foods are internalized by antigen-presenting cells such as dendric cells and B cells. These antigens are processed into short peptides (12-25 aa long) that can bind to HLA-II and presented to T cells. The binding to HLA-II molecules, such as HLA-DR, is considered as the first step to antigen-specific IgE antibody production. The association of the ConcASP and typical 15 HLA-DR alleles such as HLA-DRB1 was investigated using NetMHCIIpan 4.0 (38). We used the default setting of % Rank of Strong (2%) or Weak (10%), which means, according to NetMHCIIpan's instructions, "The percentile rank for a peptide is generated by comparing its score against the scores of 200,000 random natural peptides of the same length of the query peptide. For example, if a peptide is assigned a rank of 1%, it means that its predicted affinity is among the top 1% scores for the specified molecule."A total of 82% of the 687 ConcASPs with ≥15 aa long carried a HLA-DRB1-binding activity we examined (Fig. 5A). Of the 687 ConcASPs, 40.8% for HLA-DRB1*13.01 to 59.1% for HLA-DRB1*04.01 showed HLA-DRB1-binding activity for each major allele. Consensus motifs of core sequences to each HLA-DRB1 allele are shown in Figure 5B as Web logo. The full result of the existences of binding for the 687 ConcASPs and 15 HLA-DR alleles is presented in Data S6.
Furthermore, we examined the relationship between identified ASPs and known IgE epitopes. The web server program called allergen database for food safety provides analytical tools for searching the similarity with these validated B-cell epitopes for which site binds to the IgE from patients with food allergy. We performed a homology search with online A B C D Figure 3. Schematic illustrations of the proposed method. A, illustration of a sequence mining problem and tree pruning. The support of a pattern is the number of sequences in the database that contains the pattern. Suppose that we find all patterns with support 2 or more. Then, if we find a pattern whose support is less than 2, the support of any extended pattern must be less than 2, and therefore we can prune (stop extending) the pattern. B, illustration of multiple testing problem with multiple Fisher Exact Test each of which corresponds to each pattern. C, family-wise error rate (FWER) controlled by the Westfall-Young (WY) method. D, illustration of a fast computation of WY method by exploiting the tree structure and its pruning.
alignment search program called Protein Blast (BlastP). As a result, 24.3% of ConcASP were highly homologous to the sequences of B-cell epitopes (Data S7). For instance, pattern 537 had a high degree of sequence homology with seven epitopes identified from tree nut and bean allergens, suggesting an allergen cross reactivity (Fig. 5C). Remarkably, patterns 80, 497, and 539 shared homologies with epitope sequences identified in two completely different species: bee and peanut.
Some more specific examples are presented in Fig. S3. Additionally, 225 out of 1072 ConcASP (21.0%) were perfectly matched with the known food allergen epitopes.

Predicting allergic reaction using ASPs
We considered the problem of predicting allergic reactions of unknown proteins using the ASPs identified by allerStat. Allergic reaction prediction from amino acid sequences is needed for safety assessments of biotechnology-based synthetic foods such as genome editing. Cross-validation (CV) is often employed in the evaluation of predictive modeling. CV evaluates predictive performance by removing some instances from the dataset, training an ML model with the remaining instances, and applying the trained model to the removed instances for performance evaluation. It is important to note that CV is based on the assumption that each instance is independently identically distributed (i.i.d.  Figure 5. Biological characteristics of the sequence patterns specific to allergenic proteins. Protein sequences that have amino acids with a length of 15 or more were examined to predict human leukocyte antigen (HLA)-DRB1 binding activities. A, percentage of HLA-binding motifs contained in the extracted sequence patterns. B, consensus motifs of core sequences to each HLA-DRB1 allele are shown as sequence logos. C, ConcASP patterns contain validated B-cell epitope sequence. BLAST search was performed to find similarity with B-cell epitope. ASP, allergen-specific pattern.
Because the proteins used in this study are strongly associated within the same biological category, we used what we call LCO-CV. In LCO-CV, all proteins in each category were removed, an ML model is trained with the proteins in the remaining categories, and the prediction performance is investigated by applying the trained model to the proteins in the removed category. We considered LCO-CV for each paired category. Figure 6A is a schematic illustration of LCO-CV. To predict the presence or absence of an allergic reaction, it is desirable to use not only the ASPs but also the patterns that AllerStat: Significant allergen-specific sequences are more frequently observed in nonallergens. Therefore, we also considered non-ASPs which were similarly identified as ASPs. Non-ASPs differ from ASPs, in that non-ASPs should be significantly more frequently appeared in nonallergenic proteins but may also be appeared in a small number of allergenic proteins. Therefore, without the condition corresponding to the second condition 2) in the definition of ASP, a non-ASP is defined as (1 0 ) the pattern is observed statistically significantly more frequently in nonallergenic proteins than in allergenic proteins and (3 0 ) the pattern is not specific to particular biological category.
We note that, from the reasons above, we employ non-ASPs only for allergenicity prediction tasks.
When predicting an allergic reaction of a new protein, it is desirable to interpret why the protein is or is not determined to have allergic reaction. Some ML models, such as deep neural networks, are called black-box models in that are too complex to interpret how the predictions are made. Because the interpretation of the prediction process is important in the current problem, we adopted a linear prediction model in which a feature is defined by the existence of each ASP or non-ASP in the amino acid sequence. In the training set, each row corresponds to a protein and each column corresponds to each ASP or non-ASP, where the ASPs and non-ASPs in the training set are patterns that were identified without using the proteins in the removed category by LCO-CV. Figure 6B illustrates the concept of a predictive model based on the ASPs and the non-ASPs selected by support vector machine (SVM). See the Methods section for the detail of prediction model development by SVM. Figure 6, C and D show the results of the predictive analysis based on the LCO-CV for the PoC dataset. To evaluate the effectiveness of the ASPs identified by allerStat, we compared the prediction performances with the methods proposed in previous studies. There are several previous studies on MLbased allergen prediction from amino acid sequences (10-16, 40-42, 52, 53). An ML-based allergen prediction model consists of two steps: Step 1) feature extraction from amino acid sequences and Step 2) model training based on the extracted features. Because the goal of the current study is to evaluate the effectiveness of the ASPs identified by allerStat, we only compared the feature extraction method at Step 1 with those of previous studies and used the same ML model (sparse SVM) at Step 2.
We compared the ASPs obtained by allerStat with other sequence-derived features used in existing allergenicity prediction studies: Allerdictor (41), AlgPred2 (53), Allertop (42), and MEME (40). Note that, for fair comparison, we only used information that can be obtained from amino acid sequences in the PoC dataset. The sequence-derived features in Allerdictor and AlgPred2 are contiguous sequence of amino acid with fixed length k, which is often called k-mer. The main drawback of k-mers is that, different from allerStat, it could only consider patterns of fixed length. We investigated the performances of k-mer with k = 1,2,6, because Allerdictor uses 6-mer features, while AlgPred2 uses either 1-mer features or 2-mer features. MEME is a method for extracting frequent amino acid subsequences based on a probabilistic model called the hidden Markov model. The main drawback of the feature extraction method by MEME is that it is an unsupervised method and therefore cannot extract patterns observed at different frequencies in allergenic and nonallergenic proteins. Allertop differs from allerStat, Allerdictor, and MEME in that the features extracted by Allertop are constructed from the physico-chemical features of the entire amino acid sequence. The main drawback of the feature extraction method by Allertop is that it is difficult to interpret which parts of the amino acid subsequence are associated with the allergic reaction. Note that, unlike allerStat, the statistical reliabilities of the patterns extracted in these three previous methods cannot be properly evaluated. Figure 6C shows the receiver operating characteristic (ROC) curves of the prediction results based on LCO-CV. Figure 6D shows the area-under-curve (AUC) scores of the ROC for various prediction models. Also, we evaluated three other criteria to evaluate binary classification performances: AUC-10%, F1 score, and Matthews Correlation Coefficient (MCC) for the average of 20 categories. Details of these criteria are presented in Methods section, and the full table of the results is presented in Data S8. When interpreting the results, it is important to note that the sample size of each category varies greatly, and the numbers of allergenic and nonallergenic proteins are highly unbalanced. We employed these four criteria because they are commonly used for evaluating prediction performances for such unbalanced data. It can be observed that the prediction performances of allerStat are consistently better than the three existing methods. The information on the prediction model trained with the entire PoC dataset is provided in Data S9.

Discussions
The problem of finding ASPs based on a database of allergenic/nonallergenic proteins can be regarded as a typical feature selection problem for a binary classification problem. However, it is important to note that the proteins in the database are not i.i.d., which is a common assumption taken in conventional classification problem. In particular, proteins belonging to the same category (apple, bovine, etc.) have amino acid sequences that are specific to that category, and if the search is not conducted carefully, category-specific amino acid sequences will be identified instead of allergen-specific ones. To address this problem, we divided the proteins in the database into categories and used different criteria for ASPs depending on whether each category was paired, positive-only, or negative-only. Furthermore, in the allergen prediction task, care must be taken not to make predictions based on category-specific amino acid sequences by introducing LCO-CVs in which each category is excluded from the training data. In the protein allergen study, it is important to carefully take into account for the "non-i.i.d.ness" of proteins.
Our results show that 82% of the sequence patterns specific to allergenic proteins obtained by allerStat carried the binding activity to any of the 15 major HLA-DRB1 alleles. Almost 50% of the patterns had the HLA-DRB1-binding affinity for either of the 15 HLA-DRB1 alleles. This indicates that allerStat can extract biologically significant patterns. Sequence logo analysis indicates that the core sequences had characteristic motifs at the position of 1, 4, 6, and 9 in 9 core sequences, especially DRB1*01.01, *04.01, *04.02, *07.01, and *11.01 (Fig. 5B). This suggests that our extracted patterns include many core sequences important for major HLA-DRB1 binding. We did not evaluate HLA-binding activities of the ConcASPs having a length less than 15 aas. The remaining 18% of the patterns that do not have any HLA-DRB1-binding activity need to be investigated to clarify their importance, because they may include biologically significant sequences. On the other hand, nonallergenic patterns have sequences with three or six amino acid long, which do not have any consensus motif.
Among the identified ASPs, 1083 allergen-specific IgE epitopes on 247 food allergens have been experimentally elucidated. Among the obtained 1072 ConcASPs from allerStat, 225 of which (21%) included any of the entire length of 125 epitope sequences. On the other hand, only 19 epitope sequences were found in 17,372 nonallergen sequence (0.1%). Although three epitopes including Pen c 3 epitope (ISSK and YGVA) and Fag e 1 epitope (QQPGQ) had overlapping in between, all 19 epitopes length is in three to seven amino acids considering to have low specificity. Moreover, 24.3% of ConcASP showed high homology with known food allergen epitopes. Interestingly, there are many patterns homologous with epitopes among different species (Fig. 5C). These results indicated that allerStat could properly extract B-cell epitope sequences characteristic to the allergen from our dataset, even though epitope information was not held in the training data. The remaining 75% patterns are thought to contain T-cell epitopes and derived unexpected sequences, such as those related to immunological tolerance and antibody production, in addition to unknown T-and B-cell epitopes.

Allergen and nonallergen data collection
Allergen data were retrieved from full-length amino acid sequence list in the COMprehensive Protein Allergen Resource (COMPARE) 2020 data (https://comparedatabase. org/). The COMPARE database, a collaborative effort of the HESI Protein Allergenicity Technical Committee, is a curated database comprising allergen-sequence-associated peerreviewed publications. Nonallergen data were collected by reviewing protein sequence expressed in 20 foods including Wheat (Triticum aestivum), Bovine (Bos taurus), Chicken (Gallus gallus), Soybean (Glycine max), Crab (Scylla serrata), Shrimp(Penaeus monodon), Peanut (Arachis hypogaea), Buckwheat (Fagopyrum esculentum), Salmon (Salmo salar), Kiwi (Actinidia deliciosa), Mustard (Sinapis alba), Olive (Olea europaea), Carrot (Daucus carota), Apple (Malus domestica), Tomato (Solanum lycopersicum), Peach (Prunus persica), Potato (Solanum tuberosum), Corn (Zea mays), Rice (Oryza sativa), and Oyster (Crassostrea gigas) obtained from UniProt (https://www.uniprot.org/). It is known that there are many food-allergic patients for these selected foods, and allergens in them that commonly cause an allergic reaction have been well analyzed historically. Therefore, not only major allergen but also minor allergens have been shown in previous studies such as component analysis. It is indicated that the presence of unknown allergens is considered low as compared with foods that few food allergy patients are reported. Since the prediction performance of ML strongly relay on the data quality, known allergens in these foods were listed by adding the data from Allergen Database for Food Safety to the COMPARE database. Furthermore, to eliminate the possibility of cross-reactivity induced by proteins belonging to the same family for allergens in each of the foods, nonallergen dataset was finally created by excluding allergen and its similar protein from the total protein. mTEC dataset were generated by integrating gene and protein expression profiles (54,55). Data reduction of gene expression in SGLT1+ mature mTECs (GEO# GSE49625) were selected with the cutoff level of t = 5 (adjp ≈ 0.01).

HLA-II-binding activity of motifs specific to allergenic proteins
Among 1072 concatenated ASPs, 687 of which had a length of 15 aa or more. HLA-DRB1-binding activity was predicted by NetMHCIIpan server ver.4 (https://services.healthtech.dtu.dk/ service.php?NetMHCIIpan-4.0), which requires amino acid sequences with 15 aa or longer. The HLA-DRB1-binding activity of the allergen-specific sequences was examined at 15 aa windows from the N terminus to the C terminus. ASPs having strong and weak binding activities were analyzed using sequence logo to clarify specific motifs to each HLA-DRB1 allele using Web Logo (https://weblogo.berkeley.edu/logo.cgi).

Similarity between the identified ASPs and known IgEepitopes
Sequence similarity was searched by running BlastP with following parameters: E value = 2000 and cutoff of 10 −4 . The sequences of exactly matched targets were extracted from BlastP results (E value: 20,000). As a preprocessing for this purpose, we concatenated ASPs in the following criteria. For two sequences x and y, let x ⋄ y indicates the concatenated sequence. We define overlapped concatenation between two sequences x and y as x 0 ⋄ w ⋄ y 0 if there exists x 0 ; y 0 ; w 2 S (S: set of all sequences) such that x = x 0 ⋄ w and y = w ⋄ y 0 . Note that the overlapped concatenation is just the ordinary concatenation x ⋄ y when we take w as an empty sequence. The overlapped concatenation of three or more sequences is defined as the repetition of an overlapped concatenation for two sequences. Then, we define a sequence x as a concatenated ASP (ConcASP) if (i) there exists an allergen sequence y in the dataset such that x is included in y as a subsequence (denoted by x ⊑ y) and (ii) x is represented as an overlapped concatenation of any number of ASPs. Finally, after retrieving all ConcASPs, we extracted only the maximal ones, that is, we excluded those that are included in other concatenated ASPs. In order to find completely matching IgE epitopes to concatenated ASPs, we simply examined the inclusion, that is, checked whether x ⊑ y for each x in IgE epitope sequences and each y in ConcASPs. Note that, since the database of IgE epitopes we employed includes both sequence and nonsequence (e.g., conformational) epitopes, we extracted only sequence epitopes of total 1104. Among the 1072 concatenated ASPs, 225 completely matched at least one epitope sequence. We present the algorithm for overlapped concatenation as Algorithm 1 in Text S1.

Fisher Exact Test
Let D, D + , and D − represent the sets of all proteins, allergenic proteins, and nonallergenic proteins, respectively. Furthermore, the sizes of D, D + , and D − are respectively written as n, n + , and n − . To quantify whether there is a difference in the frequencies of patterns between allergenic and nonallergenic proteins, we consider the following contingency table: where

Multiple testing correction by WY method
When we conduct only one hypothesis test, the type-I error, that is, the probability of FP finding, can be controlled at a significance level α 2 (0,1) (typically α = 0.05) by rejecting the null hypothesis when the p value is less than α. However, when we conduct many hypothesis tests simultaneously, we often need to control the FWER: the probability of having one or more FP findings. To control the FWER at the significance level of α, we need to reject the null hypothesis more conservatively, that is, a null hypothesis is rejected when the p value is less than the adjusted significance level δ 2 (0,1), where δ is usually much smaller than α. Methods for finding the adjusted significance level δ such that the FWER can be controlled to α is discussed in the context of multiple comparisons. The most commonly used multiple comparison method is Bonferroni correction in which the adjusted significance level δ is set to be α/{# of hypotheses}, where {# of hypotheses} is the number of hypotheses to be tested simultaneously. Bonferroni correction is known to be overly conservative, especially when the number of hypotheses is large. In allerStat, we employ a random permutation test called the WY method (48) as a multiple test correction method. In the WY method, multiple randomized datasets are constructed by permuting the labels (allergenic or nonallergenic in our problem setup) at random. For each randomized dataset, the minimum p value is computed. Then, the adjusted significance level δ is set to be the ⌈(α + 1)/M⌉ th smallest value among M smallest p values, where M is the number of randomized datasets (typically M = 10,000). Figure 3C illustrates the notion of WY method.

Sequence mining and efficient computation of the WY method
Unfortunately, since allerStat needs to handle an extremely large number of patterns, it is impossible to calculate the minimum p value for each of the M randomized datasets for the WY method. In allerStat, we consider a tree structure among patterns as shown in Figure 3 for efficiently computing the minimum p value. Specifically, we consider a node corresponding to a pattern q and the set of its descendant sequence patterns SðqÞ. Then, our basic idea is to compute a lower bound of the FET p values of the descendant sequence patterns in the form of p low ðD; qÞ ≤ pðD; q 0 Þ for any q 0 2 SðqÞ: If we can find such a lower bound, we can reduce the computational cost because we do not need to calculate the p values of all descendant nodes when we encounter a node whose lower bound p low (D,q) is greater than the current minimum p value. Here, we employed a lower bound that satisfies the property in (2), written as p low ðD; qÞ ¼ min ( gðn þ ; n − ; minfD½q; n þ g; minfD½q; n þ gÞ; gðn þ ; n − ; minfD½q; n − g; 0Þ

)
: The lower bound in (3) was proposed in FastWY method in (43,44), in which the authors studied itemset mining tasks to Containing pattern q Not containing pattern q Sum find statistically significant combinations of multiple genetic factors. In allerStat, we adapted the technique in the FastWY method for sequence mining setting. Therefore, the proof that the bound (3) satisfies the property in (2) can be simply done as described in (43,44). We present the algorithm for AllerStat as Algorithm 2 in Text S2.
Prediction model for allergic reaction using ASPs and non-ASPs In order to predict allergic reactions by proteins using ASPs and non-ASPs, we employed the SVM with features being defined by ASPs and non-ASPs. We follow the formulation of SVM employed by SVC class with linear kernel in scikit-learn (https://scikit-learn.org/stable/modules/ generated/sklearn.svm.LinearSVC.html), a python implementation that we used. For an overview of SVM, see, for example, Chapter 7 of (56).
SVM assumes that we are given n instances, where each instance consists of d input variables x i 2 R d and binary output variable y i 2 {−1,1} (1 ≤ i ≤ n). Here, R denotes the set of all real numbers, and R d denotes the set of all ddimensional real vectors. Then, given a test instance whose d-dimensional input variables are x 0 while its output variable is unknown, suppose that its prediction result is given by the linear model where w Ã 2 R d and b Ã 2 R are to be learned by SVM. If f w Ã ;b Ã ðx 0 Þ is larger than (resp. smaller than) zero, the prediction function conjectures that the output variable for x 0 is 1 (resp. −1). From the training dataset fðx i ; y i Þg n i¼1 defined above, SVM learns optimal w Ã and b Ã by where C is a hyperparameter to control overfitting or underfitting. For C, in this computational experiment, we chose the best one in the specified prediction performance (AUC, AUC-10%, F1 score, or MCC; explained later) by LFO-CV, except that we used common C for AUC and AUC-10% where C is chosen by AUC. We chose C from log 10 C 2 {−27/9, −23/ 9, −19/9, ..., 9/9} (10 cases), however, we extended to smaller or larger C as long as the smallest or the largest C is the best. Note that, since this LFO-CV is conducted for the training set composed by an LFO-CV, this LFO-CV for the selection of C is conducted for a remained set of biological categories. Then, we show how to apply ASPs by allerStat and protein sequences to the formulation of SVM above. Let Q j 2 S (1 ≤ j ≤ d) be an ASP or a non-ASP, z i 2 S (1 ≤ i ≤ n) be training sequences, and y i 2 {−1,1} (1 ≤ i ≤ n) be their allergenicity (1 if allergenic or −1 otherwise). Then we define x i 2 R d as the indicator variables of including patterns in z i , that is, x ij ¼ 1 if Q j ⊑ z i or x ij ¼ 0 otherwise. As a result, we can employ SVM to learn an allergenicity prediction function according to ASPs and non-ASPs. Here, we can see that the j th element in w Ã is the importance of the corresponding pattern: if it is a large positive number, the pattern is expected to contribute the allergenicity. Note that the training sequences are defined by LFO-CV stated in the Results section.
Then, we show how the existing methods are compared to allerStat stated in the Results section. For k-mers and MEME, we employed the same process as that for allerStat, except for the retrievals of patterns. For 6-mers (following Allerdictor), we used length-six patterns whose supports (number of proteins that includes the pattern) are 15 or more, rather than ASPs and non-ASPs. The threshold of the support 15 is chosen so that the total number of patterns is similar to that of ASPs and non-ASPs. This is almost the same for 1-mers and 2-mers, however, we did not apply the threshold of the support, and we specified x ij as the "ratio of j (k-mer) among all k-mers in z i " (i.e., 0 ≤ x ij ≤ 1) instead of specifying x ij 2 {0,1}, following AlgPred2. For MEME, since it is an unsupervised method, we ran MEME by providing only allergenic proteins to retrieve ASPs, then by providing only nonallergenic proteins to retrieve non-ASPs. For Allertop, since it provides a fixed-length vector from a protein sequence, we just used it for the SVM training and the prediction.
Finally, we show the prediction performance criteria used in the experiment. First, for a set of test proteins fðx (converted from sequence to d-dimensional vector as above) and a prediction function (we use (4) in this experiment), we calculate true positive (TP), FP, true negative (TN), false negative (FN), true positive rate (TPR), and false positive rate (FPR) as follows: TPRðθÞ :¼ TPðθÞ = ½TPðθÞ þ FNðθÞ; FPRðθÞ :¼ FPðθÞ = ½FPðθÞ þ TNðθÞ: Then, the ROC curve is defined as the curve in a twodimensional space fðFPRðθÞ; TPRðθÞÞg þ∞ θ¼−∞ , and its AUC score is defined by the integral of the curve over [0,1] (note that FPR must be between 0 and 1). AUC-10% is defined by the integral of the curve over [0,0.1] multiplied by 10: it observes the AUC for low FPR, and the multiplication of 10 is employed to ensure the AUC-10% is 1 for completely correct classification. The F1 score and MCC are defined as follows: Where TPFP 0 :¼ TPð0Þ þ FPð0Þ, TPFN 0 :¼ TPð0Þ þ FNð0Þ, TNFP 0 :¼ TNð0Þ þ FPð0Þ, and TNFN 0 :¼ TNð0Þ þ FNð0Þ. As shown above, the F1 score and MCC assumes that the threshold θ is fixed as zero. So, in order to improve them, the prediction function f must be learned to classify proteins correctly at θ = 0. On the other hand, in order to improve AUC and AUC-10%, the prediction function f must be learned to classify proteins correctly for changing θ.

Data availability
All the data are provided as Supplementary Data. The software is available at https://github.com/takeuchi-lab/ allerStat. Conflicts of interest-The authors declare that they have no conflicts of interest with the contents of this article.