Human genomic diversity, viral genomics and proteomics, as exemplified by human papillomaviruses and H5N1 influenza viruses

The diversity of hosts, pathogens and host-pathogen relationships reflects the influence of selective pressures that fuel diversity through ongoing interactions with other rapidly evolving molecules in the environment. This paper discusses specific examples illustrating the phenomenon of diversity of hosts and pathogens, with special reference to human papillomaviruses and H5N1 influenza viruses. We also review the influence of diverse host-pathogen interactions that determine the pathophysiology of infections, and their responses to drugs or vaccines.


Introduction
The availability of complete genome sequences and the wealth of large-scale biological datasets provide an unprecedented opportunity to elucidate the genetic bases of human diseases and host-pathogen interactions. The diversity of hosts, pathogens and host-pathogen relationships reflects the influence of selective pressures that fuel diversity through ongoing interactions with other rapidly evolving molecules in the environment. Such influences add another source of genetic adaptability as cells adjust to new environments and out-manoeuvre pathogenic threats. The study of genetic variation in pathogens and hosts has practical significance for developing strategies to combat and control infectious diseases. Vaccines based on highly polymorphic antigens may be confounded by allelic restriction of the host immune response. In addition, the study of the distribution of genomic polymorphisms among different hosts may provide information on responses to drugs or vaccines. Based on the concept that host damage is the most relevant outcome of the host-pathogen interaction, we need a better understanding of both host and pathogen polymorphisms in drug or vaccine design. A deeper knowledge of the diversity and nature of pathogens will provide valuable insights into genetic markers that may be useful for detection, identification and forensics. The ability to discriminate between virulent pathogens and their less virulent or avirulent counterparts is another major challenge. It is important to discover genetic factors that mediate the virulence process in order to devise novel methods to prevent or treat disease. Similarly, understanding how the host responds to microbial invasion, and how the pathogen evades or manipulates the immune response to subvert the host, will contribute to the development of vaccines and other prophylactic strategies. 
Concomitantly, the unravelling of associations between hosts and pathogens will be highly relevant to the modelling of the population biology of multi-host pathogens and their impact on co-infections. This paper reviews and discusses specific examples concerning the issue of diversity of hosts and pathogens, and the influence of diverse host-pathogen interactions that determine the pathophysiology of infections, and their responses to drugs or vaccines.

Human genome diversity and 'SNPshots'
The genetic blueprint of an individual not only determines disease susceptibility, but also his or her response to drug treatment. Numerous genes are involved in drug response and toxicity, introducing a daunting level of complexity in the search for candidate genes. Thus, genetics, particularly gene polymorphisms, exerts a significant impact on target discovery.
The HapMap constitutes a catalogue of common genetic variants that exist in humans. It describes what these variants are, where they occur in our DNA and how they are distributed among individuals within a specific population and among populations in different parts of the world. This project provides information that can link genetic variants to the risks for specific illnesses, which can lead to new methods for preventing, diagnosing and treating diseases. 1,2 Differences in individual bases are by far the most common type of genetic variation. These genetic differences, known as single nucleotide polymorphisms (SNPs), represent DNA sequence variations that occur when a single nucleotide in the genome sequence is altered. SNPs are more common than other types of polymorphisms, and occur at a frequency of approximately one in 1,000 base pairs 3 throughout the genome (including promoter regions and coding and intronic sequences). Some of these differences may alter gene products in ways that confer susceptibility or resistance to diseases, or contribute to disease severity or progression. Although over 99 per cent of human DNA sequences are the same across the human population, DNA sequence variations affect how humans respond to disease; to environmental stresses such as infections, toxins and chemicals; and to drugs and other therapies. 4 Since genetic factors also affect response to drug therapy, SNPs can help to determine why individuals differ in their abilities to absorb or clear certain drugs, as well as to ascertain the mechanisms of adverse drug effects. Moreover, by affecting drug-target proteins such as G-protein-coupled receptors, enzymes, ion channels and proteins involved in detoxification pathways, non-synonymous coding SNPs (cSNPs) (namely substitutions resulting in alterations of encoded amino acids) significantly influence the efficacy and toxicity of therapeutic agents across the human population. 5
In general, only about 5 per cent of the disease-causing non-synonymous mutations hitherto identified have direct effects on the catalytic or ligand-binding properties of enzymes and receptors. 6-8 An interesting example is retinol-binding protein 4 (RBP4), the retinol-specific transport protein present in plasma. Elucidation of the crystal structures of different forms of RBPs has revealed their interactions with retinol, retinoids and transthyretin (TTR; one of the plasma carriers of thyroid hormones). 9 The core of RBP is a beta-barrel whose cavity accommodates retinol. The retinol hydroxyl group is near the protein surface, in the region of the entrance loops surrounding the opening of the binding cavity, and participates in polar interactions. The G75D mutant introduces a negative charge into the cavity, thereby interfering with retinol binding both electrostatically and sterically. The result is vitamin A deficiency with a phenotype of night blindness (Figure 1a, b). 10 Furthermore, RBP4 is expressed and secreted by adipose tissue and is strongly associated with insulin resistance. A strong positive correlation also exists between RBP4 mRNA and adipose inflammation (monocyte chemoattractant protein-1 and CD68), and glucose transporter 4 mRNA. 11 Some non-synonymous cSNPs associated with human disorders disrupt important structural features of the affected protein. For example, a polymorphic variant that disrupts a critical disulphide bond is C260Y in HLA-H, culminating in hereditary haemochromatosis (Figure 1d). 12
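The distinction between synonymous and non-synonymous cSNPs drawn above can be sketched in a few lines of Python using the standard genetic code. This is a minimal illustration only: the function name `classify_snp` is hypothetical, and the codons used for the G75D and C260Y examples (GGC and TGC) are assumed for demonstration, since the actual codons in RBP4 and HLA-H are not given in the text.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order (first base varies slowest).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def classify_snp(codon, position, new_base):
    """Classify a single-base change within one codon.

    codon    -- reference codon, e.g. "GGC"
    position -- 0-based index of the altered base within the codon
    new_base -- the substituted nucleotide
    Returns (reference amino acid, variant amino acid, class of change).
    """
    codon = codon.upper()
    mutated = codon[:position] + new_base.upper() + codon[position + 1:]
    ref_aa, alt_aa = CODON_TABLE[codon], CODON_TABLE[mutated]
    if ref_aa == alt_aa:
        kind = "synonymous"
    elif alt_aa == "*":
        kind = "nonsense"
    else:
        kind = "non-synonymous"
    return ref_aa, alt_aa, kind


# Assumed codons, for illustration only.
print(classify_snp("GGC", 1, "A"))  # a Gly -> Asp change, as in G75D
print(classify_snp("TGC", 1, "A"))  # a Cys -> Tyr change, as in C260Y
print(classify_snp("GGC", 2, "T"))  # third-position change that keeps Gly
```

Only the non-synonymous class alters the encoded protein; whether such a change matters functionally still depends on where it falls in the structure, as the RBP4 example shows.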

Human genome diversity and drug responses
The total number of SNPs reported in public SNP databases currently exceeds 9 million. 13 Occasionally, an SNP may actually cause a disease and can therefore be exploited to search for and isolate the disease-causing gene. Since SNPs are genetic variations that occur at regular intervals and at high frequency throughout the human genome, they can be used as markers within the genome. If a particular marker is found to be common among individuals with a particular disease, it suggests that the gene involved is probably located near the marker. 14 This renders SNPs of great value to biomedical research, and for developing pharmaceutical products or diagnostics. 15 For virtually all medications, inter-patient variability in response to drug therapy is the rule rather than the exception. Tremendous progress has been achieved in understanding the molecular basis of drug action, and in elucidating genetic determinants of disease pathogenesis and drug response. 16,17 This inter-patient variability is potentially regulated by processes such as drug transport, drug metabolism, cellular signalling pathways (eg G-protein-coupled receptors) and response pathways (eg apoptosis, cell cycle control). Polymorphisms of drug-metabolising enzymes and/or pharmacological targets are frequently associated with adverse drug reactions or failure of efficacy. 17-20 Drug responses may also be modulated by non-genetic factors, however, especially co-medications and co-morbidities.
Thus, SNPs can potentially be applied in the development of individualised medicine and can provide an important source of information for studying the relationship between genotypes and phenotypes of human diseases. 21 A bottleneck occurs when linking information about the variation in human genes to the variation in drug responses (pharmacogenetics) and understanding how interacting systems of genes determine individual drug responses (pharmacogenomics). 22 A fundamental challenge in analysing disease cSNPs is the relative scarcity of alleles that can be mapped to three-dimensional protein structures. In the future, it is envisioned that knowledge of an individual's SNP genotype may provide a basis for assessing susceptibility to a disease and the optimal choice of therapies. 6

Viral genome diversity exemplified by papillomaviruses and influenza viruses
Viruses are exceptionally diverse in morphology, genetic organisation, replication strategies, virulence and many other characteristics. Viral genome sequences represent a treasure trove of essential information for understanding pathogenesis better, as well as developing novel diagnostics and antiviral therapies. The ability of various medically important viruses to develop high degrees of genetic diversity, and to acquire mutations to escape immune pressures, contributes to the difficulties in vaccine development.

Diversity of human papillomaviruses
There are more than 100 different types of human papillomavirus (HPV), the causative agent of benign papillomas or warts, and a cofactor in the development of carcinomas of the genital tract, skin, head and neck. HPVs are broadly divided into cutaneous and mucosal HPV types. Eight major proteins, designated as early (E) or late (L) gene products, are encoded by the HPV DNA genome. Proteins E1 and E2 are involved in viral replication, as well as the regulation of early transcription. E1 binds to the origin of viral replication (ORI) and exhibits ATPase as well as helicase activity, whereas E2 forms a complex with E1, facilitating its binding to the ORI. 23 Furthermore, E2 acts as a transcription factor that regulates early gene expression by binding to specific E2 recognition sites. 24 E4 plays important roles in promoting the differentiation-dependent productive phase of the viral life cycle. 25 The E5 protein supports HPV late functions 26 and disrupts major histocompatibility complex (MHC) class II maturation. 27 The E6 and E7 oncoproteins are mainly responsible for HPV-mediated malignant cell progression, leading ultimately to invasive carcinoma. 28 Finally, L1 and L2 are the major and minor capsid proteins, respectively, and HPV vaccines based on L1 are already in clinical use. 29 In 1999, we determined the complete nucleotide sequence of a novel genital HPV type (namely, HLT7474-S) from a female sex worker with a wart virus infection in Singapore, which was designated as candidate HPV type 85 (HPV-85) by the Reference Center for Papillomaviruses, German Cancer Research Center, Heidelberg, Germany. Its genomic organisation and phylogenetic relationships were analysed. 30,31
The DNA sequence of the L1 open reading frame (ORF) of HPV-85 shares similarities of 78.3 per cent, 78.1 per cent and 78.0 per cent with those of the most closely related known types (HPV types 39, 70 and 45, respectively), thus satisfying the criteria for a new HPV type, which is defined on the basis of a dissimilarity exceeding 10 per cent in the L1 gene. 32 In addition, the E6 and E7 ORFs of HPV-85 exhibit highest percentage similarities of 79.7 per cent and 77.9 per cent to E6 of HPV-18 and E7 of HPV-59, respectively, thus reiterating the relatedness between HPV-85 and known genital HPVs belonging to group A7. 33 Phylogenetic trees 34,35 based on the individual ORFs, putative proteins and long control regions (LCRs) reveal the relationships between HPV-85 and the high-risk HPVs from group A7. The E1, E2, E5 and L2 proteins of HPV-85 are more closely associated with those of HPV-70 and HPV-39. However, the E4, E6 and L1 proteins of HPV-85 are more similar to those of HPV-18 and HPV-45, and the HPV-85 E7 protein and LCR are more similar to their HPV-59 counterparts. These data exemplify the diversity of HPVs, with particular reference to HPV-85 as a co-evolved member of the A7 group of genital HPVs.
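The typing criterion described above (a new HPV type must differ by more than 10 per cent in L1 from every known type) lends itself to a short worked check. The Python sketch below is illustrative only: the function names `percent_identity` and `is_new_type` are hypothetical, the identity calculation assumes the two sequences have already been aligned to equal length, and the exact similarity measure used in the original analysis is not specified in the text.

```python
def percent_identity(seq_a, seq_b):
    """Percentage of identical positions between two pre-aligned sequences.

    Columns where both sequences carry a gap ('-') are skipped;
    a gap opposite a residue counts as a mismatch.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    compared = matches = 0
    for x, y in zip(seq_a.upper(), seq_b.upper()):
        if x == "-" and y == "-":
            continue
        compared += 1
        matches += x == y
    if compared == 0:
        raise ValueError("no comparable positions")
    return 100.0 * matches / compared


def is_new_type(l1_similarities, min_dissimilarity=10.0):
    """True if L1 dissimilarity to every known type exceeds the threshold."""
    return all(100.0 - s > min_dissimilarity for s in l1_similarities)


# L1 similarities of HPV-85 to its closest relatives, as reported in the text.
reported = {"HPV-39": 78.3, "HPV-70": 78.1, "HPV-45": 78.0}
print(is_new_type(reported.values()))
```

With the reported values, the smallest dissimilarity is 21.7 per cent (against HPV-39), which is well above the 10 per cent threshold.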

HPVs and protein disorder
It is now recognised that many functional proteins or their long segments are devoid of stable secondary and/or tertiary structure, and exist instead as very dynamic ensembles of conformations. They are known by different names, including natively unfolded, intrinsically disordered, intrinsically unstructured, rheomorphic, pliable and different combinations thereof. Disordered proteins have high flexibility and are reported to be involved in regulation, signalling and control pathways in which interactions with multiple partners, as well as high-specificity and low-affinity interactions, are often required. 36 To elucidate whether intrinsic disorder plays a role in the oncogenic potential of different HPV types, we performed a detailed bioinformatics analysis concentrating on the E6 and E7 oncoproteins of high-risk and low-risk HPVs. Three high-risk (HPV-16, HPV-18 and HPV-85) and two low-risk (HPV-6 and HPV-11) HPV types were analysed in order to compare the extent of intrinsic protein disorder in these virus types. The amino acid sequences of the different HPV types were extracted from the Los Alamos National Laboratory (ftp://ftp-t10.lanl.gov/pub/papilloma/SWISS-PROT-files). Predictions of intrinsic disorder in HPV proteins were performed using two disorder predictors, namely POODLE-S 37 and POODLE-L. 38 We employed POODLE-S to analyse the E6, E7 and L2 proteins, since their sequences are only 100 residues long, whereas POODLE-L was used for other proteins. The results are presented in Figure 2.
We conducted Tukey's multiple comparison test to compare the residue disorder values for each of the two HPV groups. The one-way analysis of variance (ANOVA) indicates that the E6 oncoproteins of oncogenic HPVs (HPV-16, HPV-18 and HPV-85) are significantly more disordered (p < 0.001) than those of non-oncogenic HPVs (HPV-6 and HPV-11). Thus, the results of this analysis are consistent with the conclusion that high-risk HPVs are characterised by an increased degree of intrinsic disorder of the E6 protein. The molecular basis of this disorder, in terms of protein sequence variation of virulent HPV types, is supported by experimental evidence for the transforming ability of E6 proteins of oncogenic HPVs. 28 Furthermore, the disorder trend is more significant for E6 than for E7, consistent with a previous report using the commercial software PONDR. The data also highlight the diversity of high-risk and low-risk HPVs at the protein structure level. 39 These intrinsic differences in E6 protein dynamics may be exploited and targeted pharmacologically, for example, using monoclonal antibodies or chemical inhibitors.
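As a rough illustration of the statistical comparison described above, the following Python sketch computes the one-way ANOVA F statistic for per-residue disorder scores grouped by risk class. It is a sketch under stated assumptions: the scores are invented placeholders rather than POODLE output, the function name `one_way_anova_f` is hypothetical, and neither the p-value calculation nor the Tukey post-hoc step of the original analysis is reproduced.

```python
from statistics import fmean


def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a list of score lists.

    Assumes at least two groups and non-zero within-group variance.
    """
    all_scores = [x for g in groups for x in g]
    grand_mean = fmean(all_scores)
    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (fmean(g) - grand_mean) ** 2 for g in groups)
    # Variation of individual scores around their own group mean.
    ss_within = sum((x - fmean(g)) ** 2 for g in groups for x in g)
    df_between = len(groups) - 1
    df_within = len(all_scores) - len(groups)
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


# Hypothetical per-residue disorder scores (placeholders, not real data).
high_risk_e6 = [0.62, 0.71, 0.58, 0.66, 0.69]
low_risk_e6 = [0.41, 0.39, 0.47, 0.44, 0.36]

f_stat, df_b, df_w = one_way_anova_f([high_risk_e6, low_risk_e6])
print(f"F({df_b}, {df_w}) = {f_stat:.2f}")
```

A large F value indicates that the difference between group means is large relative to the scatter within each group; converting it to a p-value requires the F distribution, which is available in statistical packages but is not part of the Python standard library.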
Certain non-oncoproteins of non-oncogenic HPVs exhibit greater 'disorder', for example, L1 and L2 of HPV-6 and HPV-11, and E2 of HPV-6. This observation suggests that it is the disorder of critical viral protein(s) that defines the oncogenic capability and risk level of HPV. Our data illustrate an alternative computational approach to distinguishing the non-oncogenic from the high-risk oncogenic HPV types. It will be interesting to determine whether the phenomenon of increased protein disorder of more virulent virus types can be generalised beyond the HPV family.

REVIEW Sakharkar, Sakharkar and Chow
[Table caption: The table illustrates Tukey's multiple comparison test for comparing low-risk and high-risk HPVs. One-way ANOVA reveals the differences in disorder scores for the HPV proteins.]

Genetic diversity of H5N1 influenza viruses
Since 2003, highly pathogenic avian influenza A H5N1 viruses have spread from Asia to Europe and Africa, infecting wild birds, commercial poultry and humans with alarming fatality rates. Scrupulous surveillance and multidisciplinary interrogation of H5N1 evolution and 'migration patterns' are crucial for preventing further casualties in humans and poultry. 40 The escalating number of human cases of H5N1 virus infection has raised serious concerns about the potential emergence of an influenza pandemic attributed to a mutated H5N1 strain with efficient human-to-human transmissibility. 41 The two major surface glycoproteins encoded by the segmented influenza virus RNA genome are haemagglutinin (HA) and neuraminidase (NA). HA is the major antigen for neutralising antibodies and is involved in the binding of virions to sialic acid-linked receptors on host cells. Infectivity of influenza virus depends on the cleavage of HA by specific host proteases, whereas NA mediates the release of progeny virions from the cell surface and prevents clumping of newly formed virus particles. 42,43 HA is a 550 amino acid polypeptide that forms homotrimers (spikes) on the exterior of the influenza virus particle. Nascent HA is directed to the cell membrane in an infected host cell and is anchored to the cell membrane by a short transmembrane region at the C-terminus. Its biological activation involves proteolytic cleavage of a specific region by host enzymes. The nascent HA is also subject to extensive post-translational glycosylation that serves as a mechanism for immune evasion. Introducing new mutations in these two proteins represents the major strategy used by H5N1 to expand its host range and to avoid recognition by the host immune system. Here, we illustrate this diversity of influenza viruses by comparing HA proteins from over 270 H5N1 strains. 
A total of 272 HA sequences were downloaded from the National Center for Biotechnology Information (NCBI) Influenza Resource database, and phylogenetic analyses were performed using the PHYLIP package and the neighbour-joining (NJ) method, with 1,000 bootstrap replicates. This phyloinformatics analysis revealed a distinct pattern of spatial clustering of the strains based on their geographical origin, rather than temporal clustering or clustering according to host range (Figure 3). The dataset includes samples isolated up to 2006, and represents globally distributed locations ranging from Thailand, Vietnam, Indonesia, Japan, Mongolia, Russia and Europe to Africa. Interestingly, the host range across these H5N1 clades extends from chickens to humans and includes several mammalian species. The clustering of strains does not show any bias towards the host species; for example, the Thailand clade includes isolates from chickens, cats and tigers. Multiple sequence alignments using the ClustalW program reveal that HA is highly polymorphic. A total of 312 positions exhibit polymorphisms over two-thirds of the protein length (Figure 4a). The mapping of these positions onto the Protein Data Bank (PDB) structure 2FK0, used as the template, is depicted in Figure 4b. Polymorphisms in the various residues of H5N1 HA occur individually and not in tandem; that is, two or more polymorphic residues may change independently in different HA variants. It is noteworthy, however, that the polymorphisms are concentrated in the receptor-binding domain of HA. These data can facilitate monitoring of the receptor-binding specificity of modern influenza viruses, especially H5N1. The reasons why H5N1 HA mutants segregate among more than one host species, and why there are differences in causing widespread disease, may be due in part to the differences in occurrence and distribution of cellular receptors for H5N1 (ie α2,3 or α2,6 sialic acid-linked receptors). The differences in the infectivity of various H5N1 strains in different hosts are ultimately determined by the compatibility and binding between the virus strain and the host cellular receptor. The polymorphism in the receptor-binding region also reflects the host immune response against the virus and the high mutating capacity of the virus that spawns escape mutants. These polymorphisms contribute to a greater diversity in viral virulence and to the expanded host range of H5N1. 
Such molecular-based surveillance can aid the design of potential influenza vaccines, as well as a better understanding of the mechanisms of virus evolution and inter-species transfer.
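The identification of polymorphic positions from a multiple sequence alignment, as described above for HA, can be sketched as follows. This is a minimal Python illustration, not the procedure used in the original analysis: the function name `polymorphic_positions` is hypothetical, the toy sequences are invented, and the input is assumed to be already aligned (for instance, by ClustalW) with '-' marking gaps.

```python
def polymorphic_positions(aligned_seqs):
    """Return 1-based alignment columns showing more than one residue.

    Gap characters ('-') are ignored, so a column containing a single
    residue type plus gaps is not counted as polymorphic.
    """
    if len({len(s) for s in aligned_seqs}) != 1:
        raise ValueError("all sequences must have the same aligned length")
    positions = []
    for index, column in enumerate(zip(*aligned_seqs), start=1):
        residues = {r for r in column if r != "-"}
        if len(residues) > 1:
            positions.append(index)
    return positions


# Invented toy alignment (not real HA sequences).
toy_alignment = [
    "MEKIVLLFAI",
    "MEKIVLLLAI",
    "MERIVLLFAT",
    "MEKIV-LFAI",
]
print(polymorphic_positions(toy_alignment))
```

Applied to a full alignment, the resulting column list could then be mapped onto residue numbers of a template structure to see whether the variable sites cluster in a particular domain.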

Genetic determinants of host susceptibility to infection
Advances in genomics and understanding of pathogen variability, as well as the diversity of the human immune system, have led to new trends in vaccine development that focus on epitope-based vaccines. An epitope is a small peptide fragment from an infectious agent that can induce a host immune response to eliminate the pathogen. Such a vaccine strategy shows promise in dealing with host and pathogen diversity. Compared with traditional vaccines, epitope-based vaccines are more specific, safer and more easily produced and controlled. The keys to the success of such approaches are the prediction models for rapidly scanning pathogen genomes to identify effective T-cell epitopes. A recent review article focuses on different methods available for MHC-peptide binding prediction for epitope-based vaccine design. 44 The virulence of the pathogen and the susceptibility of the host determine the occurrence and severity of an infectious disease. The highly polymorphic glycoproteins of the MHC are the key proteins involved in the host immune response. MHC class I glycoproteins are expressed on the surface of every nucleated human cell and play important roles in viral infections. They present endogenous peptides derived from the cell itself to cytotoxic T cells. Since human viruses use their host's cellular machinery for replication, infected cells present viral proteins on their surfaces by using human leukocyte antigen (HLA) class I glycoproteins. This co-presentation of viral peptides elicits a cell-mediated immune response that destroys the virally infected cell. Conversely, HLA class II glycoproteins expressed on antigen-presenting cells display antigenic peptides derived from the pathogen. T cells recognise these antigenic peptides as foreign and initiate an immune response to the antigen. Each antigenic peptide must fit into the peptide-binding cleft; both peptide size and composition determine the fit. 
Each peptide is typically nine to 14 amino acids long, and its sequence is determined by the pathogen. At the host level, depending on the particular surface of the peptide-binding cleft, some antigenic peptides may be preferentially presented, while others may not be presented at all. The great diversity of clefts across the human population translates into the ability to recognise and generate an immune response to virtually any pathogen. The type of antigenic peptide displayed in the cleft is an important factor in the immune response generated. Thus, several physical, chemical and genetic factors determine whether a given peptide will fit into the peptide-binding cleft and elicit an immune response (Figure 5). Furthermore, class I and class II MHC molecules are the most polymorphic human proteins; some of these have over 200 allelic variants. This extreme polymorphism is driven and maintained by the long-standing battle for supremacy between our immune system and infectious pathogens. The polymorphisms within both HLA class I and class II glycoproteins occur almost exclusively in the region of the glycoprotein that constitutes the peptide-binding cleft.
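The first step in scanning a pathogen protein for candidate T-cell epitopes is simply to enumerate every peptide of a plausible length, which can then be passed to a binding predictor. The Python sketch below illustrates only that enumeration step, using the nine to 14 residue range mentioned above. The function name `candidate_peptides` and the example sequence are hypothetical, and no MHC-binding prediction is attempted; real predictors score each peptide against allele-specific models.

```python
def candidate_peptides(protein, min_len=9, max_len=14):
    """Yield (1-based start position, peptide) for every window of each length.

    Lengths longer than the protein simply produce no windows.
    """
    protein = protein.upper()
    for length in range(min_len, max_len + 1):
        for start in range(len(protein) - length + 1):
            yield start + 1, protein[start:start + length]


# Invented example sequence (not a real viral protein).
example = "MKTIIALSYIFCLVFA"
peptides = list(candidate_peptides(example))
print(len(peptides), "candidate peptides")
for start, pep in peptides[:3]:
    print(start, pep)
```

Because the number of windows grows with protein length and with the number of HLA alleles considered, this enumeration is where fast prediction models become important for genome-scale scans.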

Conclusions and future prospects
The pharmaceutical industry is currently grappling with a tremendous number of potential drug targets against infectious diseases identified through sequence data for the human genome and many important pathogens. The challenge ahead is to delineate the factors involved in host-pathogen interactions, and to translate the enormous discovery potential of human and pathogen genomes into real products for therapeutic intervention. One of the major bottlenecks in bringing new drugs to market is our incomplete understanding of the genes and proteins central to host-pathogen interactions and the mechanisms underlying certain human diseases. Other limitations that need to be addressed include patient heterogeneity and pathogen polymorphisms in clinical trials, the existence of multiple molecular targets and the shortage of experimental models of therapeutic efficacy with good predictive validity and objective surrogate measurements of disease progression. Integration of computational biology with experimental approaches can accelerate our ability to identify, rapidly and reliably, target proteins that can be harnessed for therapeutic intervention. 45 The effective application of such complementary analytical methods will blaze trails for the exploitation of available genetic and molecular information. When supplemented with 'individualised therapy' based on a patient's genetic profile, these developments may lead to fewer serious adverse effects and improved responses to drug treatments and vaccine regimens. Finally, the ultimate goal must be to bridge theoretical biological, genetic and molecular phenomena with cellular and organismal biology and medicinal chemistry in the relentless search for new cures in the battle against pathogens that plague humankind.