Evaluating potential risks of food allergy of novel food sources based on comparison of proteins predicted from genomes and compared to www.AllergenOnline.org

Potential proteins from three novel food sources (Chlorella variabilis, Galdieria sulphuraria, and Fusarium strain flavolapis) were predicted from genomic sequences and were evaluated for potential risks of allergic cross-reactivity by comparing the predicted amino acid sequences against the allergens in the www.AllergenOnline.org (AOL) database. The preliminary analysis used CODEX Alimentarius limits of >35% identity over 80 amino acids to evaluate the predicted proteins, which include many evolutionarily conserved proteins. Regulators might expect clinical serum IgE tests based on identity matches above the criteria if the proteins were introduced in genetically engineered crops. Some regulators have the same expectations for proteins in novel foods. To address the inequality of extensively conserved sequences, we compared the predicted proteins from curated genomes of 23 highly diverse allergenic species from animals, plants and arthropods as well as humans to AOL sequences and compiled identities. Identity matches greater than CODEX limits (>35% ID over 80 AA) are common for many proteins that are conserved through extensive evolution but are not predictive of published allergy risks based on observed taxonomic cross-reactivity. Therefore, we recommend changes in the allergen databases or methods of identifying matches for risk evaluation of new food sources. Our results provide critical data for redefining allergens in AOL or for providing guidance on more predictive sequence identity matches for assessment of possible risks of food allergy.

The history of human exposure to the relevant food source and records related to the developed food (novel food source, or gene/protein donor) including the published history of allergy is important in judging potential hazards and risks.
Understanding food allergy risks requires knowledge of the proteins in various foods that commonly or less-commonly cause food allergy as well as the mechanisms of the allergic response. For example, peanut (Arachis hypogaea) is a source of severe allergic reactions in many countries. The dominant allergens in peanut are the most abundant seed storage proteins: Ara h 1, a vicilin; Ara h 2 and Ara h 6, 2S albumins; and Ara h 3, a legumin-like protein (Porterfield et al., 2009; Palladino and Breiteneder, 2018; Cabanillas et al., 2018). Of these, two are highly soluble proteins that are not rapidly digested at acidic pH by pepsin and so are readily available for immediate reactivity in the mouth or the intestinal tract when consumed. Two are less soluble in water yet are present in sufficient quantities that they still present significant risk. Twelve other peanut proteins are recognized as allergens, though clearly less potent clinically than Ara h 1, Ara h 2, Ara h 3 and Ara h 6. These proteins have been reported to be bound by IgE from some allergic subjects, and in some cases, when presented at unnaturally high abundance in basophil assays, they may stimulate histamine release. The proteins that are low in abundance in the natural food have not been identified as major allergens, except possibly the peanut oleosins (Schwager et al., 2017). The majority of proteins from peanut, however, are not recognized as allergens.
Risks of allergy are dose dependent so identification of a protein as an allergen does not mean it represents a significant risk of food allergy unless it is common and abundant. Risks of allergy also vary markedly among people allergic to the same source (Westerhout et al., 2019). Protein homologues of the dominant peanut allergens are found in other legumes and tree nuts and are the major allergens for most people with clinical allergy to those sources (Cabanillas et al., 2018). Proteins that cause cross-reactions can usually be grouped into protein families, although there are many non-allergic proteins within any of the identified biochemical protein groups. For example, the important muscle allergen tropomyosin from crustaceans is highly conserved. The sequence homology between allergenic crustaceans, mollusks and insects such as mealworm is over 60% identity by BLASTP or FASTA and there is IgE cross-reactivity from the proteins of these organisms using sera from many shrimp allergic subjects. However, homologues in birds and mammals including humans are more than 52% identical to shrimp tropomyosin and while some in vitro IgE cross-reactivity is observed for some subjects' sera, there is little evidence of shared allergy (Faber et al., 2017;Ruethers et al., 2018).
Novel food ingredient sources are being developed to meet the growing demand for dietary proteins in industrialized countries due to the increasing human population, concerns for animal welfare, and environmental impacts of traditional sources of protein (Bleakley and Hayes, 2017;Frigerio et al., 2020). Many diverse food sources have been consumed in some geographic regions with a history of safe use, although the use and safety or risk are rarely well documented in less industrially developed regions. Some potential food sources are truly novel, with no history of safe human consumption including microbial sources such as specific microalgae, fungi or yeasts as whole foods or ingredients. Since there are no validated methods for predicting de novo sensitization, the allergenicity assessment for these truly novel foods is focused on immediate risks to consumers due to the presence of existing IgE that could arise either from unexpected exposure to an allergen to which they are already allergic, or to a likely cross-reactive protein. A sound risk assessment process will have the primary focus on judging knowledge of history of allergy to the source, and similarity of the proteins of the source to known allergens.
The safety assessment of genetically engineered (GE) organisms has served as a model for assessing allergenicity risk of some new foods in the United States (US). Hazard identification and risk assessment steps for GE organisms were broadly discussed in the early 1990s (Federal Register Docket No. 92N-0139, Vol 57, No. 104, May 29, 1992; Metcalfe et al., 1996). A primary health-related concern has always been whether a new gene in a GE organism encodes an allergen or a potentially cross-reactive protein that would act as an allergen for those who are already allergic. Advisory panels were convened by the Food and Agriculture Organization (FAO) and World Health Organization (WHO) in 1996 and 2000. In 2001 the FAO/WHO held a meeting and recommended untested steps including looking for peptide matches of 6 contiguous amino acids and targeted serum IgE binding studies (FAO/WHO, 2001). In 1996 only a few hundred allergenic protein amino acid (AA) sequences were known in publications. The AA sequences of the new protein in the GE crops were compared to allergens in private databases of the developer, or in the NCBI Protein non-redundant (nr) database using keyword search limits. Searches were accomplished by FASTA in small databases or BLASTP in NCBI (Pearson, 2000; Pearson, 2014). Alignments that might represent an allergen were searched for identity matches of eight contiguous amino acids to any segment of any allergen. If matched, serum IgE binding tests would be conducted focusing on those with allergies to the source of the new protein. However, in practical terms, developers often abandon those as potential products.
Evaluation of the short segment amino acid comparisons (6-8 amino acid matches) was later shown to be non-predictive (Hileman et al., 2002). The CODEX Alimentarius guideline from the 2001 meeting, published in 2003 and reaffirmed in 2009 (CODEX, CAC/GL 44, 2003; CODEX Alimentarius Commission, 2009), considered those criteria and other information, and the consensus was that a FASTA search looking for minimum identity matches of >35% over 80 amino acids was a more predictive test (Goodman et al., 2008).
It has been suggested that the current CODEX guideline of >35% identity over at least 80 amino acids threshold be considered in conjunction with E-scores (expectation scores) generated from the FASTA algorithm to make a more informed decision as to whether a protein has the potential to cause allergenic cross-reactivity (Thomas et al., 2005; Ladics et al., 2007; Silvanovich et al., 2009; Cressman et al., 2009). The E-score reflects the measure of relatedness among protein sequences and can help separate the potential random occurrence of aligned sequences from those alignments that may share structurally relevant similarities. A small E-score (e.g., less than 1e-7) reflects a likely functional similarity and may suggest a biologically relevant similarity for allergy or potential cross-reactivity, while large E-scores (>1.0) are typically associated with alignments that do not represent a biologically relevant similarity (Pearson, 2000, 2014, 2016; Henikoff, 1992, 1996).
However, this guidance should be viewed as highly conservative and precautionary based on historical experiences of cross-reactivity and clinical co-reactivity. Clinically important IgE cross-reactivity is common for proteins sharing >70% AA identity over nearly their full lengths, yet cross-reactivity is extremely rare for proteins sharing less than 50% identity (Aalberse, 2000). Other aspects of protein structure and IgE binding are also important to consider when evaluating cross-reactivity (Aalberse et al., 2001).
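The thresholds discussed above can be combined into a simple triage label for a single alignment. The following is only an illustrative sketch: the function name `triage_match` and the label strings are hypothetical and are not part of CODEX, FASTA or AllergenOnline.org. It also ignores alignment coverage, although the >70% observation applies to identity over nearly the full protein length.

```python
def triage_match(percent_identity: float, alignment_length: int, e_value: float) -> str:
    """Label one alignment using the thresholds quoted in the text.

    Hypothetical helper for illustration; labels are not official categories.
    """
    # CODEX criterion: more than 35% identity over at least 80 amino acids.
    if alignment_length < 80 or percent_identity <= 35.0:
        return "below CODEX limit"
    # Large E-scores (>1.0) typically do not indicate biological relevance.
    if e_value > 1.0:
        return "CODEX match, likely chance alignment"
    # Cross-reactivity is common above 70% identity (Aalberse, 2000).
    if percent_identity > 70.0:
        return "CODEX match, cross-reactivity plausible"
    # Cross-reactivity is extremely rare below 50% identity (Aalberse, 2000).
    if percent_identity < 50.0:
        return "CODEX match, cross-reactivity rare"
    return "CODEX match, intermediate identity"


if __name__ == "__main__":
    # Example values only: 42.3% identity, 90 AA alignment, E-score 3.7e-19.
    print(triage_match(42.3, 90, 3.7e-19))
```

Such a label would only flag matches for closer expert review; it is not a substitute for evaluating protein structure or published clinical evidence.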
The AllergenOnline.org (AOL) database at the Food Allergy Research and Resource Program (FARRP) at the University of Nebraska was started in 2004-2005. It is a public, peer-reviewed database of allergens based on protein AA sequences in the NCBI Protein database following evaluation of published evidence in peer-reviewed literature (Goodman et al., 2005, 2016). The AOL database includes proteins from studies of airway, contact, food, venom and salivary allergen sources with IgE binding. When provided in publications, evidence of histamine release and clinical reactivity adds confidence to calling the proteins an allergen. The AllergenOnline.org database has been updated annually by adding newly published allergens every year from 2006 through 2020 by a review process with a panel of allergen experts that include researchers and clinicians (Goodman et al., 2016). It has been used for evaluating risks of food allergy for many GE crops and can be used for evaluating new foods.
AOL uses FASTA comparison with the criteria of matches being >35% identity over 80 amino acids, as was set by the CODEX Allergenicity guideline in 2003. However, some proteins or alignments might be shorter than 80 AA if fragments of allergens were transferred into other species, and short sections might contain high identity segments that could cause severe cross-reactivity. AOL therefore also adjusts the calculation by normalizing alignments shorter than 80 AA. As an example, an N-terminal segment of 77 AA of Ara h 2 includes two or three IgE binding epitopes and, if transferred to a non-peanut food, could cause severe clinical reactions in some peanut allergic consumers (Dreskin et al., 2019). As described online (www.allergenonline.org) in a support page for sequence searches, the number of AA identity matches of any alignment shorter than 80 AA is recalculated by dividing by 0.80 to normalize to an 80 AA length. The minimum identity match to consider as possibly cross-reactive is 29 identical AA in any FASTA alignment, which is calculated as 36.25%. This modified FASTA search provides a more reliable evaluation of potential risks than either a strict FASTA search eliminating sequences shorter than 80 AA or a short (8 AA) alignment.
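The normalization described above can be written out as a short calculation. This is a minimal sketch rather than the AOL implementation: the function names are hypothetical, and it assumes that alignments of 80 AA or longer use the ordinary percent identity over the alignment length.

```python
def normalized_identity(identical_aa: int, alignment_length: int, window: int = 80) -> float:
    """Percent identity, normalized to an 80 AA window for short alignments.

    Hypothetical helper; assumes longer alignments use their own length.
    """
    if alignment_length <= 0 or not 0 <= identical_aa <= alignment_length:
        raise ValueError("identical_aa must be between 0 and alignment_length")
    # For alignments shorter than the window, divide by the window length,
    # which is the same as dividing the identical count by 0.80.
    denominator = max(alignment_length, window)
    return 100.0 * identical_aa / denominator


def exceeds_codex_limit(identical_aa: int, alignment_length: int, threshold: float = 35.0) -> bool:
    """True if the normalized identity is strictly greater than the threshold."""
    return normalized_identity(identical_aa, alignment_length) > threshold


if __name__ == "__main__":
    # 29 identical AA in a 77 AA alignment: 29 / 0.80 = 36.25%
    print(normalized_identity(29, 77))
    # 28 identical AA gives exactly 35.0%, which does not exceed the limit.
    print(exceeds_codex_limit(28, 77))
```

With these assumptions, 29 identical residues is the smallest count that exceeds the 35% limit for any alignment of 80 AA or fewer, which agrees with the minimum stated in the text.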
For truly new foods it is now possible to use modern techniques of proteomics and genomics to predict all potential proteins from new sources. Evaluating all proven proteins of a whole organism for potential risks of food allergy would not be efficient or effective if that required identification of each individual protein in the food with tests of possible IgE binding, or clinical reactivity. Therefore, evaluation of potential risks of food allergy from an organism such as an alga, fungus or new plant that does not have a history of human consumption requires new evaluation steps. Some regulators and scientific advisors have recommended using predicted proteins from the whole organism's genome or transcriptome for comparison to allergen databases using the CODEX guidelines to predict risks of food allergy. Importantly, the CODEX guideline was not intended to evaluate the full-proteome or predicted protein dataset of a whole organism as the criteria of >35% identity over 80 has not been validated for whole proteome comparisons.
The end result of the bioinformatics comparison of proteins with allergens is a decision about the need for specific serum testing and, if so, the specific allergic population that should be used to collect serum samples (Goodman et al., 2005). However, since appropriate serum testing is not trivial, correct interpretation of bioinformatics findings is important. Many genes and their expressed proteins, including many genes that encode "minor" allergens, are highly conserved across species, and so it is highly probable that these will trigger a match using the CODEX guidelines. Predictions of protein sequences from genomic and transcriptomic evaluations therefore require quality checks to understand relevance before deciding on the need for clinical testing, and critical evaluation of the criteria used for decision making is required (Siruguri et al., 2015).
Based on our years of use and development of AllergenOnline.org, it appears that the CODEX guidelines are far too conservative to judge proteins that match evolutionarily conserved allergens, especially when applied to whole genomes. We have therefore performed this study in part to understand the extent of over-predictions. We have evaluated protein sequence identity matches between three diverse species (a green alga Chlorella sp., a red alga Galdieria sulphuraria and a Fusarium strain flavolapis) searching the AllergenOnline.org (AOL) database and the NCBI Protein database to consider matches to likely allergens.

There are three objectives in this study:
First, to evaluate identities of all possible proteins from the genomes of three species intended for food use based on comparison of the predicted proteins against allergens in the AllergenOnline.org database using the CODEX guidelines of >35% identity over 80 AA.
Second, to address the inequality of extensively conserved sequences, we compared the predicted proteins from the genomes of 23 highly diverse allergenic and non-allergenic species; including human, animals, plants and arthropods to all AOL sequences, compiled identities to understand how common high identity matches are, and evaluated patterns of identity across protein types.
Third, to critically evaluate the limits of the CODEX guidelines when used as a whole genome analysis, using all types of proteins. The overall goal is to determine what, in addition to the CODEX criteria, is reasonable for risk assessment of whole foods.

Tests of three species based on genomic predictions of proteins
We chose to use a green alga Chlorella variabilis, a red alga Galdieria sulphuraria, and a newly identified Fusarium strain flavolapis fungus as test organisms. These organisms are being developed as single-cell food protein resources. Chlorella is a genus of single-celled green algae which contains high concentrations of protein (51%-60% of dry matter), amino acids, vitamins, dietary fiber, and a variety of antioxidants, bioactive materials, and chlorophylls. Green algae have a history of sustainable production and consumption (Klamcyzynska and Mooney, 2017). Chlorella vulgaris and Chlorella pyrenoidosa are not considered novel in the EU since they have been historically consumed by humans (Regulation EC No. 258/97). In the US they are recognized as GRAS by the FDA as algae commonly consumed in foods in many countries (Wells et al., 2017). Recently the genome of Chlorella variabilis NC64A was completed, and it was used here as a model genome (Blanc et al., 2010).
The unicellular red alga, Galdieria sulphuraria, was isolated by developers from extreme environments (from pH 0 to 4, and up to 56 °C) and has been proposed as an edible alga with a high content of protein and other dietary important nutrients. This alga can be grown via fermentation and is being developed for use in food products (Schonknecht et al., 2013), but has not yet been consumed by humans.
A single species of Fusarium is already used in several food products with the brand name Quorn. Quorn is produced and marketed as a human food by Marlow Foods, Ltd. Quorn foods contain mycoprotein derived from Fusarium venenatum, which is grown by fermentation (Finnigan et al., 2019). Products of Quorn have been consumed as a non-meat protein source in the United Kingdom for 30 years and since 2002 in the US. There are a few case reports of food allergy to Quorn (Katona and Kaminski, 2002; Hoff et al., 2003a). Some of those may be due to inhalation allergy to proteins of Fusarium sp. (Weber and Levetin, 2014). Some consumers of Quorn have experienced transient GI symptoms without IgE antibody production. A very small number have experienced possible IgE mediated food allergic reactions, including one reported fatal reaction (Tee et al., 1993; Hoff et al., 2003a, 2003b; Yeh et al., 2016; Jacobson and DePorter, 2018). To put this in perspective, many common food sources have caused at least one fatal food allergic reaction, and as long as packaged food is labeled clearly, consumers with allergies can avoid consumption of foods that may cause allergic reactions (Ramsey et al., 2019; Gowland and Walker, 2015). Other strains or species of Fusarium with different compositions are now under development as possible food sources, including Fusarium strain flavolapis, the strain we are using here, for which the developers have performed whole genome sequencing.

Preparation of protein sequences of the three targeted genomes
The predicted proteins for the genome of Chlorella variabilis NC64A were downloaded from the NCBI genome library (https://www.ncbi.nlm.nih.gov/genome/?term=Chlorella+variabilis+%5Borgn%5D). For Galdieria sulphuraria, the company Fermentalg provided the DNA sequences which were identified using Illumina sequencing (2x150 bp reads). The sequencing quality was checked using FastQC (Andrews 2010) and cleaned using PRINSEQ (prinseq.sourceforge.net) by trimming off low quality bases. Two assemblers were used, SPAdes with 21, 33, 55 and 77 k-mer values (Bankevich et al., 2012), and Trinity using 25 k-mer (https://github.com/trinityrnaseq/trinityrnaseq/wiki). Post assembly polishing was performed using Pilon (Walker et al., 2014). The quality of assembly was checked using Quast (Gurevich et al., 2013). The percentage of mapping was evaluated using BWA mapper (Li and Durbin, 2009). Genes were predicted using the Galdieria model from AUGUSTUS (Stanke and Morgenstern, 2005). Sequences to exclude included tRNA sequences, which were predicted using tRNAscan-SE (Lowe and Eddy, 1997), and rRNA, which were predicted using barrnap (https://github.com/tseemann/barrnap). Functional annotation was conducted by a combination of AUGUSTUS software and BLASTP comparison for the predicted proteins against the published Galdieria sulphuraria genome (https://www.ncbi.nlm.nih.gov/genome/?term=Galdieria+sulphuraria) from the NCBI library. Sequences were compiled into FASTA format files for comparison to the AllergenOnline.org database. The compiled sequences were also compared to the published Galdieria sulphuraria genomic sequences filed by Schonknecht et al., as described in 2013 as ASM34128v1, using alignment tools in order to check for potential sources of inaccuracy.
Nature's Fynd provided the genomic sequences for Fusarium strain flavolapis, which they are developing for use as a food ingredient. They performed genomic sequencing using PacBio (for long reads) and Illumina (2x250 bp reads) for short, high quality reads of this cultured species. These sequences were evaluated for accuracy and completeness using FastQC. Sequences were assembled using MaSuRCA with a k-mer value of 22 (Zimin et al., 2013) and SPAdes with k-mer values of 21, 33, 55, 77, 99 and 127 (Bankevich et al., 2012). Post assembly polishing used Pilon (Walker et al., 2014). PacBio reads were mapped using Minimap2 (Li 2016), and Illumina reads were mapped using Bowtie2 (Langmead and Salzberg, 2012). Genes were predicted using the Fusarium model from AUGUSTUS (King et al., 2015), mitochondrial genes were predicted using Prodigal (Hyatt et al., 2010), tRNA were predicted using tRNAscan-SE (Lowe and Chan, 2016) and rRNA were predicted using Barrnap software (https://github.com/tseemann/barrnap/). Functional annotation was done using the ERGO software package of IgenBio (Wilder et al., 2016). The overall sequence completeness was further evaluated by comparison to the genomes of strains of Fusarium sp. which had been previously characterized to provide a framework for understanding completeness (Niehaus et al., 2016).
To provide reasonable comparisons, we obtained the predicted proteins for the genomes of 23 species representing foods of diverse allergenic risks, including those of human, other animals and plants. The sequences were downloaded from public databases including the NCBI genome library (https://www.ncbi.nlm.nih.gov/genome), EnsemblPlants (http://plants.ensembl.org/index.html), and Phytozome V. 12, the Plant Genomics Resource (https://phytozome.jgi.doe.gov/pz/portal.html#) as summarized in Table 1. For species without published genomes as of October 2018, we downloaded the predicted protein sequences from the NCBI protein library. All protein sequences were downloaded in October 2018. The bioinformatics pipeline was completed using our lab cluster on the Holland Computing Center server at the University of Nebraska.

FASTA comparison of the predicted protein sequences of the genomes to AllergenOnline.org versions 16 and 18B
Predicted protein sequences from the three proposed novel food species and the 23 diverse species were compared to allergens in versions 16 and 18B of www.AllergenOnline.org by overall FASTA, version 35. FASTA version 35 was installed on the Holland Computing Center (HCC) server to allow batch searches that mimic the individual protein searches available on our AllergenOnline.org website, but reporting the best identity matches over alignments of 80 AA or longer. Different E-score thresholds (10, 1, 0.001, 1e-7, 1e-30, 1e-50, 1e-75, 1e-100) were used to check the significance of matches in the private HCC searches. The same scoring matrix (BLOSUM 50) was used as on the public AllergenOnline.org database. The sequence matches to proteins in AllergenOnline.org were compiled in an Excel worksheet with a record of the highest match identity. The resulting matches were evaluated to identify matches of >35% identity over 80 or more amino acid segments.
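The tallying step described above can be sketched as follows. This is an illustration under stated assumptions, not the pipeline used in the study: the `Hit` record, its field names and the `summarize` function are hypothetical, and "unique" is taken to mean the number of distinct AOL allergen entries matched at each E-score limit.

```python
from dataclasses import dataclass

# E-score limits listed in the text.
E_THRESHOLDS = (10, 1, 0.001, 1e-7, 1e-30, 1e-50, 1e-75, 1e-100)


@dataclass(frozen=True)
class Hit:
    """One alignment of a predicted protein to an allergen (hypothetical record)."""
    query_id: str
    allergen_id: str
    percent_identity: float
    alignment_length: int
    e_value: float


def passes_codex(hit: Hit, min_identity: float = 35.0, min_length: int = 80) -> bool:
    """CODEX criterion: more than 35% identity over at least 80 amino acids."""
    return hit.percent_identity > min_identity and hit.alignment_length >= min_length


def summarize(hits, thresholds=E_THRESHOLDS):
    """Count total hits and distinct matched allergens at each E-score limit."""
    summary = {}
    for limit in thresholds:
        kept = [h for h in hits if passes_codex(h) and h.e_value <= limit]
        summary[limit] = {
            "total": len(kept),
            "unique": len({h.allergen_id for h in kept}),
        }
    return summary


if __name__ == "__main__":
    # Toy data for illustration only; these are not results from the study.
    toy_hits = [
        Hit("q1", "A1", 40.0, 120, 1e-120),
        Hit("q2", "A1", 38.0, 90, 1e-10),
        Hit("q3", "A2", 36.0, 85, 0.5),
    ]
    for limit, counts in summarize(toy_hits).items():
        print(limit, counts)
```

Keeping the identity filter separate from the E-score filter makes it straightforward to rerun the same hit list at each limit, which is how the total and unique counts per threshold can be compared.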

BLASTP comparison of predicted protein sequences within the NCBI non-redundant protein sequences database that includes annotated protein sequences from GenBank, RefSeq and TPA as well as SwissProt, PIR, PRF and PDB
Predicted protein sequences of Chlorella variabilis, Galdieria sp. and Fusarium strain flavolapis, as well as the 23 other species used in this study, were used to search the general protein database using the then-current version of BLASTP in 2018 and early 2019. The website is https://blast.ncbi.nlm.nih.gov/BLAST.cgi. BLASTP changed markedly in July 2019, removing the ability to use keyword limits in BLASTP searches to restrict matches to particular categories of sequences based on keywords. In addition, the output of BLASTP has changed, and we used the Traditional Results for historical comparisons. Searches without keyword limits allow the highest identity matches to be viewed for evaluation of the common conservation of the protein sequences. The previous selection criteria using keyword limits such as "allergy" or "allergen" were removed. Those changes speed the searches but eliminate useful screening decisions. We also used BLASTP searches of species targets from the 23 species and of the matched allergens from our AllergenOnline.org to provide guidance on the relevance of low-identity matches, including >35% identity over 80 amino acids.
For Fusarium strain flavolapis, sequencing yielded 340k reads from PacBio after trimming and correcting, and 56.5 M read pairs from Illumina. The assembly included 89 contigs, with the largest contig being 4.9 Mb, an N50 of 3.2 Mb, an N75 of 2.3 Mb, an L50 of 6, an L75 of 10, no ambiguous bases (0 Ns), and a GC content of 48.3%. PacBio reads mapped at 99.95% using Minimap2 software. Illumina reads mapped at 99.81% using Bowtie2 software. The number of predicted proteins was 14,239.
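For readers less familiar with the assembly statistics quoted above, the standard definitions of N50/L50 (and N75/L75) can be illustrated with a short calculation. This sketch is not taken from Quast or from the study's pipeline; the function name `nx_lx` and the contig lengths in the example are invented for illustration.

```python
def nx_lx(contig_lengths, fraction=0.5):
    """Return (Nx, Lx) for a list of contig lengths.

    Nx is the length of the shortest contig in the smallest set of longest
    contigs that together cover at least `fraction` of the assembly; Lx is
    the number of contigs in that set. fraction=0.5 gives N50 and L50.
    """
    if not contig_lengths:
        raise ValueError("contig_lengths must not be empty")
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in the interval (0, 1]")
    ordered = sorted(contig_lengths, reverse=True)
    target = fraction * sum(ordered)
    running = 0
    for count, length in enumerate(ordered, start=1):
        running += length
        if running >= target:
            return length, count
    # Not reachable for fraction <= 1, kept as a safeguard.
    return ordered[-1], len(ordered)


if __name__ == "__main__":
    # Invented contig lengths (arbitrary units), total = 30.
    lengths = [10, 8, 6, 4, 2]
    print(nx_lx(lengths, 0.5))   # N50 and L50
    print(nx_lx(lengths, 0.75))  # N75 and L75
```

Under these definitions, an L50 of 6 means that the six longest contigs together contain at least half of the assembled bases.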

Comparison of all possible proteins from the genome of the three novel foods against allergens in AOL
The total and unique numbers of matches to allergens for predicted proteins from the three potential food species, over a range of E-scores and exceeding the CODEX limit of >35% identity over 80 AA, are shown in Table 2. The normal default E-score for FASTA or for BLAST is 10, but smaller E-score limits restrict the output to more stringent alignments. The purpose of these comparisons was to evaluate whether the CODEX criteria are reasonable for risk assessment of the three species, using >35% identity over 80 AA as the benchmark for a need for serum IgE tests or other additional evaluations. As shown in Table 2, the three species of interest have not been consumed (widely) by humans and are thus not known to cause allergies, yet they show very high numbers of matches greater than 35% identity over 80 AA to allergens in AOL, even at an E-score limit of 1e-07, which is much smaller than the default of BLASTP. More realistic numbers of alignments, meaning identities between species that have been reported as possibly being cross-reactive, were found when the E-score was set to 1e-100.
For comparison, we tested all predicted proteins from the genomes of 23 species ranging from humans to fungi, fish, mammals and many species of plants to evaluate the number of possible risky proteins. These matches are summarized in Table 3 for comparison to the three species of interest. Matches following CODEX guidelines are intended to identify proteins that may be sufficiently similar to an allergen to suspect possible IgE cross-reactivity and the possibility of triggering a clinically important allergic reaction. As shown in Table 3, a significant number of matches >35% identity to multiple allergens was found for proteins from all 23 species with E-scores of 10 or even 1. Even using an E-score of 1e-100, the number of uniquely matched proteins seems far higher than expected based on numbers of allergenic proteins in commonly allergenic sources. Experience in clinical research demonstrates that even the most commonly allergenic species, such as peanut, produce only 4 to 6 commonly allergenic unique proteins (Ara h 1, Ara h 2, Ara h 3, Ara h 6 and possibly Ara h 8 and Ara h 9) and a total of <20 allergens. Many other commonly allergenic species, such as shrimp, list fewer than 10 allergenic proteins that elicit symptoms from human exposure by the airway, contact or ingestion (www.allergen.org). A few sources of airway allergy, such as the common house dust mite (HDM) Dermatophagoides farinae and the evolutionarily related Dermatophagoides pteronyssinus, have nearly 40 different proteins that may be bound by IgE of people with inhalation allergies. However, only three proteins from HDM (Der f 1, Der f 2 and Der f 23, or Der p 1, Der p 2 and Der p 23) are considered major allergens and four others (Der f 4, Der f 5, Der f 7 and Der f 21) are considered mid-level allergens (Thomas, 2015). The other HDM proteins are unlikely to be clinically important because of low level expression, high instability, and unlikely inhalation exposure.
Interestingly, many of the allergens that have been identified are commonly conserved proteins that share high identity scores across relatively unrelated taxa, such as profilins, heat shock proteins and beta-expansins. Reports of allergy range from rare to fairly common for some of these species, while for many species there are only a few clear reports of allergy. Our intent in testing 23 species including human proteins was to identify an E-score limit that might be valuable for risk assessment, to test percent identity scores that might be more predictive than the CODEX limit of >35% identity over 80 AA, and to consider the relevance of >35% identity.

Identities of all possible proteins from the genome of the three novel food sources and 23 common species matches to AOL
The results in Table 2 illustrate that the alga (Chlorella variabilis NC64A) has sequence matches of >35% identity to between 14 and 991 unique proteins in AOL, depending on which E-score limit was used. Even at the moderate E-score of 1e-7 there were 159 proteins that suggest potential cross-reactivity. By comparing all predicted proteins from the 23 diverse species including humans (Homo sapiens) in Table 3, we found similarly high numbers of matches of the predicted proteins to allergens across the species. Pistachio had the lowest number of matches, but few total proteins have been predicted from nucleotide sequences for pistachio or pecan (Table 3). We compared the highest scoring aligned proteins of Chlorella variabilis to all proteins in AOL version 18B, as shown in Table 4. The highest scoring matched allergen was cyclophilin of Daucus carota, but that protein is highly conserved, with matching sequences in all 23 species. Heat shock protein 70 of the Aedes aegypti mosquito is highly conserved, as shown by sequence matches to proteins in 22 species. The lowest scoring matches in Table 4 include a few bona fide allergens with identity matches close to 35% identity, and with modest E-scores. Those include matches to thioredoxin of fungi at 39-40% identity and venom allergen 5 of a wasp at 35.8% identity. Most of the matched allergens are conserved across many species of the 23 chosen here. Many are house-keeping proteins including cyclophilins, heat shock proteins, 60S ribosomal protein, triosephosphate isomerase, aldolase and gliadins.
However, the percent identities are not high compared to BLASTP matches to homologues from a variety of protein sources and from species that are not likely to represent risks. For example, BLASTP comparison of triosephosphate isomerase (EFN53775.1) in Chlorella variabilis to the non-redundant protein database had the top 100 matches to triosephosphate isomerase in diverse species, with sequence identities ranging from 69 to 100%. Similarly, Chlorella heat shock protein 70 (EFN57963.1) had matches to heat shock proteins in different species with sequence identities of 77.5-100%. This shows the conservation of these proteins among diverse allergenic and non-allergenic species. Similarly, Table 5 shows that Galdieria sulphuraria had matches to 59 weak or putative allergens and 6 very weak matches to food allergens (tropomyosin, vicilin, and convicilin) with E-scores >0.02. Due to high sequence identity of evolutionary homologues, these identity matches were over-predictive for possible risks of allergic cross-reactivity. The searches were rerun using an E-score of 1e-7, which removed proteins that are clearly unlikely to be cross-reactive. The results are shown in Table 5. The identified food allergens represent important protein classes of allergens, yet the matches shown in this study have very low identities, as with those from Chlorella, meaning they are unlikely to be significant risks for cross-reactivity. That can be demonstrated by comparing the matched allergens to the NCBI Protein database using BLASTP.

Table 2 Total and unique matches to allergens in AOL (>35% identity over 80 AA) for predicted proteins from the three novel food species at each E-score limit.

Species | Matches | 10 | 1 | 0.001 | 1e-7 | 1e-30 | 1e-50 | 1e-75 | 1e-100
Chlorella variabilis | Total | 277988 | 82613 | 9043 | 3201 | 413 | 119 | 57 | 35
Chlorella variabilis | Unique | 991 | 752 | 297 | 159 | 64 | 39 | 21 | 14
Galdieria sulphuraria | Total | 67989 | 17792 | 3202 | 1222 | 170 | 97 | 50 | 32
Galdieria sulphuraria | Unique | 101 | 96 | 85 | 73 | 39 | 32 | 12 | 8
Fusarium strain flavolapis | Total | 192772 | 65321 | 13320 | 5867 | 646 | 317 | 135 | 88
Fusarium strain flavolapis | Unique | 508 | 466 | 326 | 232 | 125 | 95 | 44 | 30
The genome-predicted proteome of the Quorn fungus, another species of Fusarium, was also compared by FASTA as a background evaluation. Quorn has been used as a food source in the United Kingdom for >30 years. The results, shown in Supplementary Table 1, identified 181 matches to weak or putative allergens and 12 matches to food allergens with very low sequence identity over short AA segments.
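The total and unique counts reported per E-score limit can be tallied along the following lines. This is a minimal sketch, not the pipeline used in the study: the `Hit` record, its field names, and the example values are hypothetical, it assumes the FASTA output has already been parsed into one record per alignment, and it reads "unique" as the number of distinct AOL entries matched.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    """One parsed alignment (hypothetical record layout)."""
    query: str       # ID of the genome-predicted protein
    allergen: str    # ID of the matched AOL entry
    identity: float  # percent identity of the alignment
    evalue: float    # E-score reported for the alignment


def count_matches(hits, evalue_limit, min_identity=35.0):
    """Return (total, unique) matches above min_identity at a given E-score limit."""
    kept = [h for h in hits
            if h.identity > min_identity and h.evalue <= evalue_limit]
    unique_allergens = {h.allergen for h in kept}
    return len(kept), len(unique_allergens)


# Illustrative data only; these are not values from the study.
example_hits = [
    Hit("q1", "A1", 80.0, 1e-50),
    Hit("q2", "A1", 60.0, 1e-10),
    Hit("q3", "A2", 36.0, 1e-3),
    Hit("q4", "A3", 30.0, 1e-40),  # below the identity limit, never counted
]

for limit in (1.0, 1e-7, 1e-30):
    total, unique = count_matches(example_hits, limit)
    print(f"E <= {limit:g}: total={total}, unique={unique}")
```

Tightening the E-score limit can only shrink both counts, which is the pattern seen across the columns of Table 2.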

Summary examples of FASTA comparisons using all predicted proteins from the 23 studied species
Predicted proteins from the public genomes of all 23 species were compared to AllergenOnline.org looking for matches of >35% identity, using an E-score cutoff of 1e-7. Wheat genome-predicted proteins matched 312 putative allergens, but only eight major allergens. Soybean genome-predicted proteins matched 243 putative allergens and had 32 matches to major allergens (vicilins and conglycinins of soybean, walnut, pecan and pistachio). Genome-predicted human proteins matched 206 weak or putative allergens; one matched the major allergen tropomyosin from a variety of sources, including crustaceans, fruit flies (Drosophila sp.), and fish (salmon and cod). Another human protein matched lipid transfer proteins (LTP), with a modest identity match to LTP from pomegranate (42.3% identity with an E-score of 3.7e-19). Searching AllergenOnline.org with the pomegranate LTP shows many higher identity matches, often >55% identity with E-scores smaller than 1e-20, down to 1.1e-25. LTPs from a variety of sources have evidence of cross-reactive laboratory IgE binding, but there are fewer reports of multiple allergic reactions to diverse sources of LTPs. This search identified many proteins that are unlikely to represent major risks of cross-reactivity, as the protein sequences are conserved across broad taxonomic categories with no history of cross-reactivity.

Table 3. Total and unique matches to allergens in AOL for predicted proteins from 23 different allergenic and non-allergenic species.

Evaluation of the limits of CODEX guidelines looking for matches of >35% identity
3.5.1. Identification of known allergens in the AllergenOnline.org database using FASTA at specific E-score limits for significance
The predicted proteins from some allergenic species were compared to the AllergenOnline.org database at different E-scores, and we focused on finding the best E-score threshold for identifying known allergens, using the official WHO/IUIS Allergen Nomenclature in the AOL database. Table 6 illustrates the allergens identified using FASTA in different allergenic species at representative E-scores of 1e-7, 1e-30, and 1e-100. All known allergens of major and minor allergenic sources in the AllergenOnline.org database were detected using E-scores of 10, 1, 0.001, and 1e-7. However, some potentially important matches to allergens were missed in the FASTA searches when the E-score limit was reduced below 1e-7.
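The effect of the E-score limit on detection of known allergens can be illustrated with a small sketch. It assumes, hypothetically, that the best (smallest) E-score found for each known allergen of a species has already been extracted from the FASTA output; the function name, the allergen labels, and the values are placeholders rather than data from Table 6.

```python
def missed_allergens(best_evalue, limits):
    """For each E-score limit, list the known allergens whose best match fails it.

    best_evalue: dict mapping an allergen name to the smallest E-score
                 observed for any predicted protein matching it.
    limits:      iterable of E-score limits to test.
    """
    return {
        limit: sorted(name for name, e in best_evalue.items() if e > limit)
        for limit in limits
    }


# Placeholder values for illustration only.
best = {
    "allergen_A": 1e-120,
    "allergen_B": 1e-45,
    "allergen_C": 1e-12,
}

for limit, missed in missed_allergens(best, (1e-7, 1e-30, 1e-100)).items():
    print(f"E <= {limit:g}: missed {missed}")
```

In this toy case nothing is missed at 1e-7, while progressively more allergens drop out at 1e-30 and 1e-100, mirroring the trade-off described in the text.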

Major allergens with a high risk of cross-reactivity
Proteins predicted from the 23 genomes, including the human genome, were searched against major allergens that have a relatively high risk of clinical cross-reactivity. The distribution of taxa having matches to clinically important major allergens, including lipid transfer proteins, vicilins, glycinins, 2S albumins, tropomyosin, and arginine kinase, is shown in Table 7. The matches were related to the taxonomic relationships of the species as well as to the protein families, yet the identity matches are broadly diverse. Lipid transfer proteins, vicilins and glycinins are highly conserved in beans, soybeans, apple, peach, and papaya. Yet published reports of cross-reactivity among these protein families are limited, and it appears that true clinical cross-reactivity is very limited. Major allergens in crustacean shellfish, e.g. tropomyosin and arginine kinase, are generally cross-reactive between crustaceans and occasionally with insect proteins. However, matches of >35% identity were commonly found for those two proteins in human, drosophila, bovine, salmon, and cod, and there is no evidence of clinical cross-reactivity between those taxa and crustaceans. Importantly, human proteins are not considered to be allergenic for humans.

Minor allergens and noise of CODEX limits
To consider protein identity matches to minor allergens, predicted proteins of the 23 species were compared to AOL version 18B by FASTA using the HCC supercomputer. Allergens that had a match of >35% identity to proteins from at least 10 of these species were considered evolutionarily conserved minor allergens. Most of the minor allergens represented had sequence identities of less than 50% when compared within the protein type. Matches of >35% identity were found to 170 allergens listed in AOL, and those are also considered minor because they do not have published evidence of causing clinical reactions, only IgE binding. They all matched at least 10 different species out of the 23 species. Table 2 lists these minor allergens and shows the number of species matched out of 23. Allergens that have identity matches to proteins in 10 or more diverse species must be evolutionarily conserved and are unlikely to represent real risks of cross-reactivity.

Table 6. Identification of known allergens listed in the AllergenOnline.org database using full-length FASTA. Predicted proteins of the allergenic species listed in this table were compared to the AOL database. Matches above CODEX limits to known allergens were identified, using the official WHO/IUIS Allergen Nomenclature in the AOL database, at E-scores of 1e-7, 1e-30 and 1e-100. A few allergens were missed between E-scores of 1e-7 and 1e-30.
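The conservation criterion used in this section (a match of >35% identity in at least 10 of the 23 species) amounts to counting distinct species per allergen. The sketch below shows one way to do that; the tuple layout, function name, and example data are assumptions made for illustration and are not taken from the study's pipeline.

```python
from collections import defaultdict


def conserved_allergens(hits, min_species=10, min_identity=35.0):
    """Return {allergen: species_count} for allergens matched in >= min_species species.

    hits: iterable of (species, allergen, percent_identity) tuples,
          one per alignment of a predicted protein to an AOL entry.
    """
    species_by_allergen = defaultdict(set)
    for species, allergen, identity in hits:
        if identity > min_identity:
            # A set ensures each species is counted once per allergen.
            species_by_allergen[allergen].add(species)
    return {
        allergen: len(species)
        for allergen, species in species_by_allergen.items()
        if len(species) >= min_species
    }


# Illustrative data only: allergen_X is matched in 12 species, allergen_Y in 3.
example = (
    [(f"species_{i}", "allergen_X", 60.0) for i in range(12)]
    + [("species_0", "allergen_X", 70.0)]  # second hit from the same species
    + [(f"species_{i}", "allergen_Y", 40.0) for i in range(3)]
)

print(conserved_allergens(example))
```

Counting species rather than hits keeps large proteomes with many paralogues from inflating the apparent conservation of an allergen.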

Conclusion
It is becoming more common to use a whole genome or a proteome bioinformatics approach to identify potential proteins in a wide variety of species. Some regulatory agencies or risk assessment scientists have suggested comparing these predicted proteins against allergen databases to identify possible risks of allergenicity for food safety. The CODEX guideline (>35% identity over 80 amino acids to any known allergen) has been a standard for identifying possible risks of cross-reactivity since 2003. The comparison to www.AllergenOnline.org was made available to the public in 2005 to assess individual proteins. The database is updated annually. An identity match of over 35% over 80 amino acids, or the equivalent, is assumed to be a positive match that would require serum IgE binding tests using sera from subjects allergic to the matched allergen. Since we know that matches at that identity level can occur by random chance, we tested the use of protein sequences predicted from genomes, transcriptomes, or proteomes against AOL to estimate the commonality of false positive matches.
We compared the predicted proteins from the genomes of 23 diverse allergenic and low- or non-allergenic species, including plant sources, fungi, fish, insect and other animal sources as well as human sequences, against the AOL database using standard CODEX criteria as well as full-FASTA alignments to provide identity matches. We also used a wide variety of E-score criteria to consider that as a variable. A number of housekeeping proteins across many species had moderate to high identities to minor putative allergens in AOL. However, many of these proteins are highly conserved in most eukaryotes and as a consequence would be expected to be found in any search using the standard CODEX criteria. In contrast, major allergens are not highly conserved in sequence and structure and were not identified using the search parameters except in closely related species.
For those highly conserved proteins identified across many species, there are nonetheless differences in the levels of AA sequence identity that affect their potential for shared clinical cross-reactivity. Moreover, protein abundance and potency differ significantly between species, affecting the allergenic potential of the species.
We have used a wide range of E-score thresholds to test search methods. We propose that an E-score threshold of 1e-7 may be needed for identification of a few important allergens in this type of study, yet identity matches of >35% are still common for highly conserved proteins at 1e-7.
As examples, the predicted proteomes of three novel foods were assessed against the AOL database, and many identity matches were seen. The comparison of predicted proteins from 23 test species demonstrated conclusively that the low-level match of >35% identity over 80 amino acids over-predicts potential risks of allergy. We have concluded that Chlorella variabilis, Galdieria sulphuraria and Fusarium strain flavolapis do not represent a significant risk of food allergy to the general population, as matches to similar proteins from many diverse species are very common.
Alternative strategies may be needed, such as increasing the match criterion above 35% identity, possibly to 45% identity, or decreasing the E-score limit below 1e-7, although matches to a few allergens may be missed at 1e-20. Ranking the allergens in AOL by their risk of disease could also markedly improve this assessment strategy. Other investigators should use similar strategies, and risk assessors should consider the broad questions of whole food safety for novel or new foods to establish more predictive assessment limits.

Author contributions
MA performed the initial literature reviews, performed bioinformatics comparisons to AllergenOnline.org, and drafted the manuscript. CZ oversaw the design of the bioinformatics pipeline and edited the manuscript. MK oversaw generation of the genomic sequences of Fusarium strain flavolapis, and BF contributed to the allergenicity discussion of Fusarium strain flavolapis and to related edits. MC oversaw and provided genomic data for Galdieria sulphuraria. HG provided overall suggestions and edited the manuscript. REG developed the original concept for the study, oversaw its completion, and provided allergenicity risk review.

Funding
Some funds were provided by Fermentalg and by Nature's Fynd. Some funding was provided by the AllergenOnline.org sponsors (Unilever and NuSeed). The majority was from revolving research accounts of Professor Goodman.

Declaration of competing interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Table 7
Distribution of matches of proteins predicted from the genomes of the 23 species to clinically important allergens in AllergenOnline.org. The matches were identified based on CODEX guidelines of >35% identity over 80 AA and would be considered as possibly cross-reactive, yet human proteins matched in this search are clearly not allergens, demonstrating over-prediction.