1 Allergen Bioinformatics: Recent Trends and Developments


Introduction
Allergy is a major cause of morbidity worldwide. Allergic reactions result from maladaptive immune responses, in predisposed subjects, to otherwise harmless molecules. These allergenic molecules, usually proteins or glycoproteins, can not only elicit specific immunoglobulin E (IgE) in susceptible subjects but also crosslink effector cell-bound IgE molecules, leading to the release of mediators (e.g. histamine) and the onset of symptoms. From clinical and molecular biological data available in several publicly accessible databases, it is now evident that among the hundreds of thousands of proteins that exist in nature, only a few can cause allergy. For example, of more than 500,000 entries (71,345 documented at the protein level; Nov, 2010) in the Swiss-Prot/UniProt database (http://www.uniprot.org), only 686 proteins have been listed in the IUIS allergen nomenclature database (www.allergen.org) as documented allergens. Although about 1500 allergens (including isoallergens) have been listed in the Allergome database (www.allergome.org), it has been shown that they are distributed into a very limited number of protein families. However, the critical features that make proteins allergenic are not fully understood. In the present article, we discuss recent applications of bioinformatic tools that have shaped our current understanding of the allergenicity of proteins.

Allergen bioinformatics - a need of the hour
Genetic engineering experiments during the last few decades have led to the production of numerous genetically modified (GM) organisms. Proteins introduced into GM organisms through genetic engineering must therefore be evaluated for their potential to cause allergic diseases. As a classical example, transgenic soy that had been genetically engineered to express Brazil nut 2S albumin was found to elicit hypersensitivity reactions in Brazil nut-allergic people (Nordlee et al., 1996). In 2001, the FAO/WHO suggested a procedure for performing FASTA or BLAST (Basic Local Alignment Search Tool) searches, with a threshold of greater than 35% identity over a window of 80 or more amino acids, to identify potential allergenic cross-reactivity of transgene-encoded proteins in genetically enhanced crops (Silvanovich et
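The 35% identity over 80 amino acids criterion can be illustrated with a minimal sketch. It is a simplification made for illustration: every 80-residue window of a query is compared with every 80-residue window of a known allergen without gaps, whereas the FAO/WHO procedure relies on FASTA or BLAST alignments. The function names and the default parameters are assumptions of this sketch, not part of any published tool.

```python
def count_matches(a: str, b: str) -> int:
    """Count identical residues at matching positions of two equal-length strings."""
    if len(a) != len(b):
        raise ValueError("windows must have equal length")
    return sum(x == y for x, y in zip(a, b))


def exceeds_identity_threshold(query: str, allergen: str,
                               window: int = 80, percent: int = 35) -> bool:
    """Return True if any ungapped window pair shares more than `percent` % identity.

    Simplified stand-in for the FAO/WHO sliding-window check: no gaps and no
    substitution matrix are used (assumption of this sketch).
    """
    if len(query) < window or len(allergen) < window:
        return False
    for i in range(len(query) - window + 1):
        query_window = query[i:i + window]
        for j in range(len(allergen) - window + 1):
            matches = count_matches(query_window, allergen[j:j + window])
            # Integer comparison avoids floating-point issues at the boundary.
            if matches * 100 > percent * window:
                return True
    return False
```

With the default settings, a window needs at least 29 identical positions out of 80 to be flagged, since 28/80 is exactly 35% and the criterion is "greater than".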

Allergen databases
Exponential growth of molecular and clinical data on allergens has created a huge demand for efficient storage, retrieval and analysis of the available information. Numerous allergen databases are available on the Internet. Their aims range from easy accessibility of data to novel allergen prediction. A few examples are provided in Table 1.
The IUIS (International Union of Immunological Societies) allergen nomenclature subcommittee has created a unique, unambiguous nomenclature system for allergenic proteins. It maintains an allergen database (www.allergen.org) containing an expandable list of WHO/IUIS-recognized allergen molecules arranged according to the Linnaean classification of the source organism (Kingdoms: Plantae, Fungi and Animalia, subdivided into lower orders) (Chapman et al., 2007). This database is a precise and convenient source for researchers, since it contains the biochemical name and molecular weight of the allergens and isoallergens (multiple molecular forms of the same allergen showing ≥67% sequence identity). It is searchable by allergen name, source and taxonomic group. For example, a search using the keyword 'Bet v 1' shows about 36 variants (isoallergens) of this allergen, each with GenBank and UniProt accession numbers and, if available, PDB IDs. Each UniProt ID is linked to the original entry in the UniProt database. Moreover, once the UniProt IDs are obtained, their sequences can be retrieved in batches using UniProt's 'retrieve' tab.
Allergome (Mari et al., 2006) is a vast repository of data related to all allergen molecules. It contains data about a larger number of allergens than are actually recognized by IUIS/WHO. It also contains links to other databases (e.g. UniProt, PDB) and computational resources, with additional extensive links to the literature.
The AllFam database is a useful resource for grouping allergens into protein families. It utilizes the allergen information from the Allergome database and protein family information from the Pfam database. It can be sorted by source (plants/animals/bacteria/fungi) and route of exposure (inhalation/ingestion/contact/sting etc.) or searched for specific protein families. Allergen entries are linked to the corresponding records in the Allergome database. In addition, each allergen family is linked to a family fact sheet describing the biochemical properties and the allergological significance of the family members.
For each allergen, an appropriate species-specific non-allergenic control homologue was included. It was found that 25 out of 30 allergens do not have any bacterial homologues and two other allergens have only a few, whereas all the non-allergenic controls retrieved numerous bacterial homologues. Moreover, major allergens such as Bet v 1 also lack a human homologue. The authors thus concluded that allergens are usually foreign proteins that lack bacterial homologues (Emanuelsson and Spangfort, 2007).

Allergenic proteins can be organized into families
The first definite indication that allergens can be grouped came from arranging allergens into Pfam protein families. Pfam classifies proteins into families based on the presence of specific domains (Pfam domains) identified through multiple sequence alignments and hidden Markov models. Pfam 25.0 (latest version; March 2011) contains over 100,000 protein sequences classified into 12,275 families (Finn et al., 2010). The allergen database that contains Pfam domain information is 'AllFam' (http://www.meduniwien.ac.at/allergens/allfam), where allergen sequences are classified into protein families using the Pfam database and its associated database, SwissPfam. AllFam includes all allergens that can be assigned to at least one Pfam family. Many allergens, however, are multi-domain proteins. The domains of such a protein are merged into a single AllFam family if its Pfam domains occur only in combination with a single other Pfam domain. AllFam makes it possible to retrieve and sort allergen data according to source (plant/animal/fungi/bacteria), route of exposure (inhalation/ingestion/contact etc.) and Pfam/AllFam family identities. This analysis, combined with the study of evolutionary relationships among the proteins, has led to the following valuable insights:
i. Pollen allergens (inhalant plant allergens) are restricted to a few protein families (Radauer and Breiteneder, 2006). They populate only 29 out of more than 7000 protein families, with (a) expansins, (b) profilins and (c) calcium-binding proteins (with EF-hand domains) comprising most of the pollen allergens, followed by Bet v 1-related/pathogenesis-related proteins (PR10 family). Figure 2 shows the evolutionary relationship between several allergenic and non-allergenic members of the (a) expansin and (b) profilin families. The evolutionary history was inferred using the Neighbor-Joining method (Saitou and Nei, 1987). The evolutionary distances were computed using the Poisson correction method (Zuckerkandl and Pauling, 1965) and the phylogenetic analyses were conducted in MEGA4 (Tamura et al., 2007). The same method is followed in the subsequent sections of the present article. Allergens of the expansin family cluster as highly identical proteins, as shown in the figure. Allergenic plant profilins also constitute a conserved homologous group with high sequence identities (70-85%) among themselves, while showing low identities (30-40%) with non-allergenic profilins from other eukaryotes, including human (Radauer and Breiteneder, 2006). About 10 of the 29 pollen allergen families are also present in plant-derived foods.
ii. In the major animal food protein families, evolutionary distance from the human homologue reflects allergenicity (Jenkins et al., 2007). This has been demonstrated in major food allergen families such as (a) parvalbumins, (b) caseins and (c) tropomyosins.
iii. Most plant food allergens belong to only four protein families or superfamilies (Radauer and Breiteneder, 2007). They are (a) the prolamin superfamily with the PF00234 domain, (b) the cupin superfamily with the PF00190 and PF04702 domains, (c) the profilins with the PF00235 domain and (d) the Bet v 1-like proteins containing the PF00407 domain. Prolamins are seed storage proteins and include about 82 characterized allergens, 65 of which are listed as ingestants. Figure 4 shows the evolutionary relationship among members of the Bet v 1-homologous protein family. Twenty-four proteins of this group are known allergens present in pollen and plant-derived foods, responsible for allergic sensitization in a large number of people.
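The Poisson correction used for the distance estimates converts the observed proportion p of differing amino acid sites between two aligned sequences into an estimated number of substitutions per site, d = -ln(1 - p). The sketch below computes this for aligned sequences; it is not the MEGA4 implementation. Skipping any column where either sequence has a gap (pairwise deletion) and the function names are assumptions made here for illustration.

```python
import math


def p_distance(seq_a: str, seq_b: str, gap: str = "-") -> float:
    """Proportion of differing sites between two aligned sequences.

    Columns where either sequence has a gap are skipped (pairwise deletion,
    an assumption of this sketch).
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    compared = 0
    differing = 0
    for a, b in zip(seq_a, seq_b):
        if a == gap or b == gap:
            continue
        compared += 1
        if a != b:
            differing += 1
    if compared == 0:
        raise ValueError("no comparable sites")
    return differing / compared


def poisson_distance(seq_a: str, seq_b: str) -> float:
    """Poisson-corrected distance d = -ln(1 - p)."""
    p = p_distance(seq_a, seq_b)
    if p >= 1.0:
        raise ValueError("distance is undefined when all sites differ")
    return -math.log(1.0 - p)


def distance_matrix(aligned: dict) -> dict:
    """Pairwise Poisson distances keyed by (name_a, name_b)."""
    return {
        (name_a, name_b): poisson_distance(seq_a, seq_b)
        for name_a, seq_a in aligned.items()
        for name_b, seq_b in aligned.items()
    }
```

A matrix of this kind is the input a neighbor-joining routine would take; the correction matters more as sequences diverge, because multiple substitutions at the same site make p underestimate the true distance.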

Allergen-associated protein domains
The other allergen database that utilizes the Pfam protein family information is Motifmate (http://born.utmb.edu/motifmate/index.php) (Ivanciuc et al., 2009a). Motifmate assigns Pfam domains to the allergens listed in SDAP (Structural Database of Allergenic Proteins), a database developed and maintained by the University of Texas (http://fermi.utmb.edu/SDAP) (Ivanciuc et al., 2003). The current version of this database contains 679 proteins (May, 2011). The authors pointed out that all the allergenic protein entries in SDAP could be associated with only 130 Pfam families (out of a total of 9318), and that only about 31 Pfam families contain four or more allergens. This outcome supports the previous finding that allergenic proteins are clustered in a few Pfam families.
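The kind of tally behind these figures can be sketched as follows: given a mapping from allergens to their Pfam accessions, count the allergens per family, list the families that reach a minimum size, and express the families used as a percentage of all families. The small mapping is illustrative only; the domain assignments shown are assumptions for the sake of the example and are not an export from SDAP or Motifmate.

```python
from collections import Counter


def allergens_per_family(allergen_to_pfam: dict) -> Counter:
    """Count how many allergens carry each Pfam accession."""
    counts = Counter()
    for pfam_ids in allergen_to_pfam.values():
        # A set ensures an allergen counts once per family, even with repeated domains.
        counts.update(set(pfam_ids))
    return counts


def families_with_at_least(counts: Counter, minimum: int) -> list:
    """Sorted Pfam accessions containing at least `minimum` allergens."""
    return sorted(family for family, n in counts.items() if n >= minimum)


def percent_of_families(n_used: int, n_total: int) -> float:
    """Families containing allergens as a percentage of all families."""
    return 100.0 * n_used / n_total


# Illustrative toy mapping (assumed assignments, not database output).
toy_mapping = {
    "Bet v 1": ["PF00407"],
    "Ara h 8": ["PF00407"],
    "Gly m 4": ["PF00407"],
    "Mala s 6": ["PF00160"],
    "Asp f 11": ["PF00160"],
    "Ara h 1": ["PF00190"],
}
toy_counts = allergens_per_family(toy_mapping)
```

Applied to the numbers quoted above, `percent_of_families(130, 9318)` gives roughly 1.4%, which is the sense in which allergens occupy only a small corner of protein family space.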

Insights from structural bioinformatics
After the elucidation of the X-ray crystal structure of the birch pollen allergen Bet v 1 (Gajhede et al., 1996), the structures of several allergens have been solved. Searching the Protein Data Bank with the keyword "allergens" returns 321 entries, occasionally with multiple entries for a single allergen. Although protein structures give valuable insight into function, the structures of several allergens are still not known. More importantly, some allergen families have several members with known structures, while others have few or none. Allergen structures are particularly useful for elucidating molecular features related to allergenicity and cross-reactivity and for designing hypoallergenic derivatives. For example, several structures in the Protein Data Bank correspond to Bet v 1, the major birch pollen allergen: the X-ray structure (1BV1.pdb), the NMR structure (1BTV.pdb), mutants (1B6F.pdb and 1QMR.pdb), a complex with an IgG Fab (1FSK.pdb) and the hypoallergenic isoform Bet v 1d (3K78.pdb). In contrast, several groups, such as the cupin family of seed storage protein allergens, are under-represented. Knowledge of allergen structures is important because it is the overall structure, not the sequence, that determines the biochemical and immunological properties.
Molecular modeling can help when the experimentally determined structure of an allergen is not available. Homology modeling, also known as comparative molecular modeling, predicts the 3D model of a given protein from its amino acid sequence using the experimentally derived structure(s) (X-ray/NMR) of one or more related homologous proteins (called templates). This technique is becoming increasingly popular because, if the required template selection and alignment criteria are met, it is believed to be the most reliable modeling technique to date (Marti-Renom et al., 2000). It is also becoming increasingly useful because, although there are millions of proteins in nature, the number of structural folds they can assume is limited (Zhang, 1997), and the number of X-ray/NMR protein structures is increasing exponentially, improving the chance of finding a suitable template. Several authors have successfully used this technique to predict allergen structures and to elucidate the structural basis of cross-reactivity between allergens.
Ara h 1 (vicilin) and Ara h 2 (2S albumin) are seed storage proteins of peanut (Arachis hypogaea). They are recognized by serum IgE of >90% of peanut-allergic people, which shows their importance as major peanut allergens (Shin et al., 1998; Stanley et al., 1997). Ara h 1 shows IgE-mediated cross-reactivity with other vicilin allergens such as Len c 1 (from lentil) and Pis s 1 (from pea). Following sequence alignment using ClustalX, structural models of Ara h 1, Len c 1 and Pis s 1 were generated from the experimentally derived structure of beta-conglycinin (RCSB Protein Data Bank code: 1IPJ) using the programs InsightII, Homology and Discover3 (Accelrys, USA). Electrostatic surfaces of these proteins were also generated using the program GRASP (Nicholls et al., 1991). Mapping of linear epitope sequences revealed that nine of the 23 linear B-cell epitopes are located in the N-terminal region and are unique to Ara h 1. The remaining B-cell epitopes, situated in the C-terminal part, are well exposed on the surface and share a high degree of homology and 3D conformation with Len c 1 and Pis s 1. They might be responsible for cross-reactivity among the Ara h 1, Len c 1 and Pis s 1 food proteins. Similarly, Ara h 2 and other dietary allergenic 2S albumins, Jug r 1 (walnut), Car i 1 (pecan nut) and Ber e 1 (Brazil nut), were modeled using the atomic coordinates of the homologous Ric c 1 (castor bean 2S albumin). Mapping of known epitope sequences on the template and modeled structures revealed no structural homology between the allergenic 2S albumins of peanut, walnut, pecan and Brazil nut. This indicates that cross-reactivity between Ara h 2 and other 2S albumins, which is less likely, might be mediated not by protein epitopes but by CCDs (cross-reactive carbohydrate determinants). However, the C-terminal epitope region of Jug r 1 showed a clear structural homology with Car i 1, indicating the possibility of their cross-reactivity.
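The sequence-level part of such an epitope-mapping analysis can be sketched in a few lines: locate known linear epitope peptides in an allergen sequence, split them into N-terminal and C-terminal groups, and score how well each epitope is conserved in a homologue. This is only a rough illustration and not the procedure of the cited studies, which also compared 3D conformation and surface exposure. The sequence, the epitope peptides and the region boundary below are hypothetical placeholders, not data for Ara h 1.

```python
def locate_epitopes(sequence: str, epitopes: list) -> dict:
    """Map each epitope peptide to its 1-based start positions in `sequence`."""
    hits = {}
    for epitope in epitopes:
        starts = []
        index = sequence.find(epitope)
        while index != -1:
            starts.append(index + 1)
            index = sequence.find(epitope, index + 1)
        hits[epitope] = starts
    return hits


def split_by_region(hits: dict, boundary: int):
    """Split hits into N-terminal (start <= boundary) and C-terminal lists."""
    n_terminal, c_terminal = [], []
    for epitope, starts in hits.items():
        for start in starts:
            target = n_terminal if start <= boundary else c_terminal
            target.append((epitope, start))
    return n_terminal, c_terminal


def best_match_identity(epitope: str, homologue: str) -> float:
    """Highest ungapped fraction of identical residues over all windows."""
    width = len(epitope)
    if width == 0 or len(homologue) < width:
        return 0.0
    best = 0
    for offset in range(len(homologue) - width + 1):
        window = homologue[offset:offset + width]
        best = max(best, sum(a == b for a, b in zip(epitope, window)))
    return best / width


# Hypothetical toy data for illustration only.
toy_allergen = "MKTAYIAKQRQISFVKSHFSRQ"
toy_epitopes = ["AYIAK", "SHFSR"]
toy_hits = locate_epitopes(toy_allergen, toy_epitopes)
toy_n_term, toy_c_term = split_by_region(toy_hits, boundary=10)
```

Epitopes that score highly with `best_match_identity` against a homologue are candidates for cross-reactivity, but sequence conservation alone does not show that the residues are surface-exposed or adopt the same conformation.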
Another important insight was obtained from the homology modeling of allergenic cyclophilins (Roy et al., 2003). Groups of highly homologous cross-reactive allergens such as cyclophilins, profilins and MnSOD are known as pan-allergens. They often cross-react with their respective human homologues (Crameri et al., 1996), and such cross-reactivity might be responsible for the severity and perpetuation of symptoms in the absence of exogenous allergen exposure (Fluckiger et al., 2002). Allergenic cyclophilins (peptidyl-prolyl cis-trans isomerase; PF00160) have been identified from several organisms, including periwinkle (pollen allergen Cat r 1), birch (pollen allergen Bet v 7), Aspergillus fumigatus (Asp f 11, Asp f 27), Psilocybe cubensis, Malassezia furfur (Mala s 6, formerly known as Mal f 6) and carrot. IgE-mediated cross-reactivity between Mala s 6, Asp f 11, yeast cyclophilin and human cyclophilins has been demonstrated (Fluckiger et al., 2002). The structure of human cyclophilin, which shows high sequence identity to Asp f 11, Mala s 6 and yeast cyclophilin, was known from crystallography (PDB code: 2RMB). Taking this as the template, molecular models of the three other cyclophilins were generated and compared with the human homologue to understand the structural basis of their cross-reactivity. Molecular modeling was done using the program Modeller (Sali and Blundell, 1993). The structures were energy-minimized using the program Discover with the Consistent Valence Force Field (Hagler et al., 1979), and their stereochemical quality was checked using Procheck (Laskowski, 1993). Several empirical and semi-empirical programs were used to predict the antibody binding sites (B-cell epitopes) on these proteins, and residue-wise solvent accessibility values of the predicted epitopes were calculated using the program NACCESS (Hubber, 1992). The cyclosporine-binding sites of these proteins were also identified by aligning the sequences (using ClustalW) and structures. This study revealed large conserved solvent-exposed patches on the surfaces of these proteins, strongly suggesting their cross-reactivity. The X-ray crystal structure of Mala s 6 (PDB code: 2CFE), published three years later (Glaser et al., 2006), closely resembles its predicted model. The domain-swapped structure of the Asp f 11 dimer (PDB code: 2C3B), published at the same time, showed a similar structural fold. The Asp f 11 dimer (resulting from increased protein concentration) seems to be enzymatically inactive, since the active sites of both its subunits are blocked by dimerization. However, the constituent monomers retained the basic cyclophilin structure.
More recently, comprehensive 3D structural modeling of allergens with no known structure has been conducted, followed by surface accessibility calculation and mapping of known IgE-binding epitope sequences. It was found that Ala, Asn, Gly and Lys have a high propensity to occur in the IgE-binding sites on the surface of allergenic proteins (Oezguen et al., 2008).
Finally, techniques of structural bioinformatics have also been applied to assess features critically required for allergenicity and cross-reactivity. This has been done by analyzing the predicted structure of protein T1, the naturally occurring non-allergenic member of the Bet v 1 allergen family (Ghosh and Gupta-Bhattacharya, 2008). Protein T1 shows considerable sequence similarity with the proteins of the Bet v 1 allergen family, but it is neither allergenic nor cross-reactive with the Bet v 1 group (Laffer et al., 2003). Comparative molecular modeling, solvent accessibility calculations and mapping of surface electrostatic potential showed a substantial difference in the antigenic surface that could be responsible for the loss of cross-reactivity. Solvent-accessible surface area and electrostatics calculations were done using the programs DSSP (Dictionary of Secondary Structure of Proteins) and APBS (Adaptive Poisson-Boltzmann Solver), respectively (Baker et al., 2001; Kabsch and Sander, 1983). Nevertheless, ligand docking suggests that protein T1 should still be able to perform its biological function as a brassinosteroid carrier.
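A residue propensity of the kind reported for IgE-binding sites can be expressed as the frequency of an amino acid among epitope residues divided by its frequency among all surface residues, so that values above 1 indicate enrichment. The sketch below assumes that relative solvent accessibilities (for example from NACCESS or DSSP output) are already available as a list of numbers. The 25% surface cut-off, the function names and the toy inputs are assumptions of this sketch and are not taken from the cited studies.

```python
from collections import Counter


def surface_residues(sequence: str, rel_accessibility: list, cutoff: float = 25.0) -> str:
    """Residues whose relative accessibility (%) is at least `cutoff`.

    The default cut-off is an arbitrary choice made for this sketch.
    """
    if len(sequence) != len(rel_accessibility):
        raise ValueError("one accessibility value is required per residue")
    return "".join(aa for aa, acc in zip(sequence, rel_accessibility) if acc >= cutoff)


def residue_propensity(epitope_residues: str, background_residues: str) -> dict:
    """Frequency in epitopes divided by frequency in the background, per residue type."""
    epitope_counts = Counter(epitope_residues)
    background_counts = Counter(background_residues)
    n_epitope = sum(epitope_counts.values())
    n_background = sum(background_counts.values())
    if n_epitope == 0 or n_background == 0:
        raise ValueError("both residue sets must be non-empty")
    propensity = {}
    for aa, count in background_counts.items():
        epitope_freq = epitope_counts.get(aa, 0) / n_epitope
        background_freq = count / n_background
        propensity[aa] = epitope_freq / background_freq
    return propensity
```

In practice the background would be the exposed residues of many allergen structures or models pooled together; with small samples such ratios are noisy and should be read with caution.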

Conclusion
Allergy is a worldwide problem. Allergic symptoms are elicited following exposure to a structurally diverse group of proteins known as allergens. Understanding allergenicity at the molecular level has wide application in food safety and in treating allergic diseases. What makes a protein allergenic is not yet understood. However, advanced bioinformatics tools have been applied to address this problem. It has been found that allergens are usually foreign proteins with few or no bacterial homologues. They are clustered into a few protein families (associated with a limited number of protein domains), which argues against the idea that any protein can be allergenic. Methods have been developed to predict probable allergenicity from protein sequence, although more work needs to be done to achieve better and more precise prediction. The structural difference between IgG-binding and IgE-binding epitopes is still not very clear, but homology modelling, in combination with residue-wise solvent accessibility of monomers and biological assemblies of allergens, gives valuable information about the antigenic determinants on protein allergens.
Figure 1 shows the distribution of allergenic proteins in different AllFam families. The major allergen families (containing 10 or more allergens) with their corresponding Pfam domains are shown in Table 2.

Fig. 2. Phylogenetic trees showing the relationships of two major pollen allergen families, (a) expansins and (b) profilins, and their respective non-allergenic homologues. Pollen-related plant food allergens such as Ara h 5, Dau c 1 etc. are also included. UniProt accession numbers are shown. Positions of allergens are indicated by dotted lines.

Fig. 3. Dendrogram showing the evolutionary relationship among (a) 12 different parvalbumins and (b) 13 different tropomyosins from animals and human. Allergenic proteins and their non-allergenic homologues, as well as the closest human homologues, were chosen. The UniProt accession numbers and positions of allergen clusters are indicated.

Fig. 4. Unrooted neighbor-joining tree showing the evolutionary relationship among the members of the Bet v 1-related plant protein family (containing Pfam domain PF00407). UniProt accession numbers and the position of the allergen cluster are indicated.

Fig. 6. Structure diagram of the allergenic cyclophilin Mala s 6 as predicted in 2003 from homology modeling (left) and as determined by X-ray crystallography in 2006 (right; 2CFE.pdb). Alpha helices are shown in red, beta sheets in yellow and loops (smoothed) in green.

Table 2. Major allergen families (AllFam families) that contain 10 or more allergens are shown with their corresponding Pfam domains and examples. The number of allergenic members per allergen family is also shown.
AllFam takes its allergen information from Allergome, the comprehensive allergen database. In the latest version of AllFam (May, 2011), 950 allergens have been arranged into 150 allergen families (AllFam families). The allergens are distributed in a highly skewed manner, with about 30% of members belonging to only 5 families (prolamins, profilins, EF-hand proteins, tropomyosins and cupins) and showing a few restricted biological functions such as hydrolysis, storage or binding to the cytoskeleton [6]. Moreover, allergens contain about 245 Pfam domains in total, which is only about 2.0% of all domains identified to date.
Fig. 1. Pie chart showing the distribution of allergenic proteins in different AllFam families. Numbers of constituent allergens are indicated within brackets.