Identifying Influenza Viruses with Resequencing Microarrays

Resequencing microarrays rapidly identify influenza viruses.


Identification of genetic variations of influenza viruses is essential for epidemic and pandemic outbreak surveillance and determination of vaccine strain selection. In this study, we combined a random amplification strategy with high-density resequencing microarray technology to demonstrate simultaneous detection and sequence-based typing of 25 geographically distributed human influenza virus strains collected in 2004 and 2005. In addition to identification, this method provided primary sequence information, which suggested that distinct lineages of influenza viruses co-circulated during the 2004-2005 season, and simultaneously identified and typed all component strains of the trivalent FluMist intranasal vaccine. The results demonstrate a novel, timely, and unbiased method for the molecular epidemiologic surveillance of influenza viruses.
Influenza viruses are a major cause of respiratory infections in humans and result in substantial illness, death, and economic problems throughout the world. Along with regular seasonal epidemic outbreaks caused by common circulating strains, novel strains emerge sporadically because of reassortment in the segmented influenza RNA genome and have resulted in devastating influenza pandemics (1-3). Since mutations and reassortments are often determinants for infectious potential, antiviral drug susceptibility, and viral escape from vaccine-elicited immunity, continually surveying the genetic composition (i.e., primary sequence) of circulating and emerging variants is necessary. These needs have become increasingly relevant recently because the World Health Organization (WHO) has reported 85 human deaths caused by avian A/H5N1 influenza viruses throughout Asia since 2003 and raised concerns about the potential for another influenza pandemic (4).
Automated Sanger/electrophoresis-based sequencing technology has been used as the standard platform for DNA and genome sequencing. Although conventional sequencing produces accurate data, the requirement for knowledge of template sequences and the inability to quickly process multiple targets hinder its practical application in epidemiologic and diagnostic investigations. As an alternative, high-density oligonucleotide resequencing microarrays represent a promising new technology that has been used to rapidly and accurately identify nucleotide sequence variants (5-7) from viral, bacterial, and eukaryotic genomes (8-13). Use of resequencing microarrays to detect single nucleotide polymorphisms and generate primary sequences enables identification of genetic variants and provides valuable epidemiologic information that is critical for outbreak surveillance. In most cases, however, this technology has relied on specific amplification of a limited number of target sequences before hybridization, thus restricting throughput and limiting final identification to strains that retain primer-targeted sequences.
In an attempt to adapt resequencing microarray technology to surveillance and diagnostics, we developed the respiratory pathogen microarray (RPM) version 1 for detection and sequence typing of 20 common respiratory and 6 category A biothreat pathogens known to cause febrile respiratory illness (14). A large portion of RPM version 1 is focused on a subset of tiled sequences corresponding to partial fragments from the hemagglutinin (HA), neuraminidase (NA), and matrix (M) genes for detection of influenza A and B viruses. In this study, we demonstrate unbiased determination of viral subtype and lineage by generation of primary sequence using random nucleic acid amplification and resequencing microarray technology.

RPM Version 1 Design
Each tiled prototype sequence was selected to have an intermediate level of sequence homology across a group of microbial or viral strains, which allowed for efficient hybridization and unique identification of most or all subtypes of targeted pathogenic species. For each relevant base of a given prototype sequence, the array contains eight 25mer probes (4 sense and 4 antisense). Two of 8 probes represent perfect matches, while the others correspond to possible mismatches at the central (13th) position of the 25mers. The prototype regions targeting influenza viruses were composed of partial sequences from HA genes of influenza A virus subtypes (H1, H3, and H5) and influenza B virus, NA genes of influenza A virus subtypes (N1 and N2) and influenza B virus, and the M genes of influenza A virus (full-length M1 and partial M2) and influenza B virus (Table 1). Both HA and NA regions encompassed a sufficient number of polymorphic sites to define subtypes. These regions were combined with prototype sequences for 22 other pathogens and tiled on 12.8µm chips (Affymetrix Inc., Santa Clara, CA, USA), which contain ≈240 K 25mer probes and have the capacity to resolve 30,000 nucleotides. The design and content of RPM version 1 array have been previously described (14).
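The tiling scheme described above can be illustrated with a short sketch. This is a minimal illustration, not the array manufacturer's design software: the function names and the toy prototype sequence are hypothetical, and probe orientation relative to the labeled target strand is simplified. It builds the eight 25mers that interrogate one base and checks the stated capacity arithmetic (about 240,000 probes at 8 probes per base gives 30,000 resolvable nucleotides).

```python
# Minimal sketch of resequencing-array tiling (hypothetical helper names).
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def probes_for_base(prototype, i, probe_len=25):
    """Return the 8 probes interrogating 0-based position i of the prototype.

    Each entry is (strand, central_base, probe_sequence, is_perfect_match).
    """
    half = probe_len // 2  # 12 flanking bases on each side of the 13th base
    if i < half or i + half >= len(prototype):
        raise ValueError("position too close to an end for a full-length probe")
    window = prototype[i - half : i + half + 1]
    probes = []
    for strand, seq in (("sense", window),
                        ("antisense", reverse_complement(window))):
        for base in "ACGT":
            # Substitute each of the 4 bases at the central (13th) position.
            probe = seq[:half] + base + seq[half + 1:]
            probes.append((strand, base, probe, probe == seq))
    return probes


# Toy 35-base prototype (invented for illustration only).
prototype = "ACGTTGCAAGGCTTAACCGGTTAGCATCGATCGGA"
probe_set = probes_for_base(prototype, 15)
perfect = sum(1 for p in probe_set if p[3])
print(len(probe_set), "probes,", perfect, "perfect matches")

# Capacity check: ~240,000 probes / 8 probes per base.
print(240_000 // 8, "nucleotides")
```

Because each strand contributes exactly one probe whose central base matches the prototype, 2 of the 8 probes are perfect matches and the remaining 6 represent the possible mismatches.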

Sample Collection and Nucleic Acid Isolation
The influenza clinical specimens used in this study were collected through the Department of Defense Global Emerging Infections System during the 2004-2005 influenza season. Influenza throat swab specimens were collected in accordance with the case criteria previously described (15). Throat swabs were obtained within the first 72 h of the onset of symptoms, placed in viral transport medium (MicroTest M4, Remel Inc., Lenexa, KS, USA), and delivered by commercial carrier to the Air Force Institute for Operational Health in Brooks City Base, San Antonio, Texas, for culturing and molecular characterization. Specimens were passaged once through primary rhesus monkey kidney tissue culture (BioWhittaker, Walkersville, MD, USA). Cultures were tested for influenza A or B viruses by using the centrifugation-enhanced shell-vial technique with monoclonal antibody detection as previously described (16). Cultures testing positive for influenza A or B viruses were confirmed by using reverse transcription-polymerase chain reaction (RT-PCR) analysis with previously reported protocols (16,17). Total nucleic acids were extracted from 90-µL cultured samples or aliquots of live trivalent nasally administered influenza vaccine (FluMist 2004/05, MedImmune, Inc., Gaithersburg, MD, USA) by using the MasterPure DNA purification kit (Epicentre Technologies, Madison, WI, USA) and dissolved in 30 µL of nuclease-free water.

RPM Version 1 Hybridization and Processing
Purified DNA amplicons were adjusted to 2 µg in 35 µL of EB buffer, mixed with 15.1 µL of fragmentation cocktail buffer (5 µL NEB buffer 4, 5 µL 10 mmol/L Tris, pH 7. The hybridization intensities were analyzed with the GeneChip operating software to generate raw image files (.DAT) and simplified image files (.CEL) with intensities assigned to each of the corresponding probe positions. GeneChip DNA analysis software version 3.0 (GDAS), which implements the ABACUS algorithm (7), was used to produce a file of estimated, corrected base calls (.CHP). Base calls generated from each tiled region of the array were then exported from GDAS as FASTA-formatted sequences.

DNA Sequencing
Automated DNA sequencing was performed as previously described (17). HA nucleotide sequences for influenza strains used in this study are available at GenBank (accession nos. DQ265706-DQ265730). The nucleotide sequences of primers used for amplification and sequencing are available upon request.

Sequence Analysis
DNA sequences generated from RPM version 1 were searched against the Influenza Sequence Database (http://www.flu.lanl.gov/) (19) by using the BLAST algorithm (20). Advanced options for blastn search were set as follows: -W (word size) 7, -r (reward for a nucleotide match) 1, -q (penalty for a nucleotide mismatch) -1. These parameters were chosen to maximize sensitivity and allow sequences with as many as 50% ambiguous calls to still produce full-length searches. Sequence alignments were performed with the ClustalX program (ftp://ftp-igbmc.u-strasbg.fr/pub/ClustalX/).
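The effect of a mild mismatch penalty on reads that contain many ambiguous calls can be sketched with a toy ungapped scoring function. This is an illustration under stated assumptions, not a reimplementation of BLAST: the function name and sequences are hypothetical, gaps and word seeding are ignored, and an N is simply scored as a mismatch, which is a simplification of how ambiguity codes are actually handled. The penalty of -3 is included only as a harsher value for comparison.

```python
def ungapped_score(query, subject, reward=1, penalty=-1):
    """Sum match/mismatch scores over two pre-aligned, equal-length sequences.

    Assumption for illustration: an ambiguous call (N) counts as a mismatch.
    """
    if len(query) != len(subject):
        raise ValueError("sequences must be pre-aligned to equal length")
    score = 0
    for q, s in zip(query.upper(), subject.upper()):
        score += reward if (q == s and q != "N") else penalty
    return score


# Hypothetical 100-base database sequence.
subject = "ACGTACGTAC" * 10
# Replace 4 of every 10 calls with N to mimic a read with 40% ambiguous calls.
query = "".join("N" if i % 10 < 4 else b for i, b in enumerate(subject))

print(ungapped_score(query, subject, penalty=-1))  # 60 matches - 40 = 20
print(ungapped_score(query, subject, penalty=-3))  # 60 matches - 120 = -60
```

With reward 1 and penalty -1, the cumulative score of this toy read stays positive across its full length, whereas the harsher penalty drives it negative, which is the intuition behind choosing a lenient mismatch penalty for reads with many Ns.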

Microarray Hybridization
To assess the performance of RPM version 1 with a real-world clinical isolate set, we tested 25 cultured strains collected from 4 continents during the 2004-2005 influenza season and previously diagnosed by culture and RT-PCR as influenza. One influenza subtype was identified in each tested sample based on the RPM version 1 hybridization profiles and sequence reads shown in Figure 1A-C and E. DNA fragments of HA1, NA1, and M genes randomly amplified from an H1N1 isolate specifically hybridized to their corresponding prototype regions on RPM version 1 (Figure 1A). Prototype regions of 1 influenza subtype exhibited no interference from other subtypes (Figure 1A-C), and prototype regions of other pathogens on RPM version 1 showed no cross-hybridization with any influenza virus segments (Figure 1D). Of the 25 isolates tested, we identified 12 A/H3N2, 12 influenza B, and 1 A/H1N1 (Table 2). The A/H1N1 and A/H3N2 subtypes effectively hybridized to the same prototype M sequence (derived from the A/NWS/33 H1N1 strain), confirming that M genes are sufficiently conserved among different H/N subtypes of influenza A to allow the universal identification of influenza A subtypes with a single tiled prototype region. A computational hybridization simulation model we developed confirms this suggestion (A. Malanoski, unpub. data). Aside from the highly conserved matrix region, no cross-hybridization was observed between subtypes, which suggests that the more variable HA and NA tiles are subtype specific. GDAS generated DNA sequences for 3 genes (HA, NA, and M) from each sample, with 42%-92% of the bases in the tiled prototype sequences resulting in unambiguous calls by the microarray (Table 2). To demonstrate the accuracy of microarray resequencing reads, the HA genes from all 25 samples were amplified by a specifically primed RT-PCR and subjected to conventional sequencing. The sequences produced by random amplification and RPM version 1 were identical to those identified by the conventional sequencing method with the exception of ambiguous base calls (Ns). That is, in cases where both methods assigned a base identity at a particular sequence position, those assignments were always identical (data not shown).
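The two quantities used in this comparison, the fraction of unambiguous calls and the agreement between methods at positions where both assigned a base, can be computed as in the following sketch. The function names and toy reads are hypothetical, and the reads are assumed to be already aligned position by position; this is not the analysis code used in the study.

```python
def call_rate(read):
    """Fraction of positions with an unambiguous base call (A, C, G, or T)."""
    if not read:
        raise ValueError("read must not be empty")
    called = sum(1 for b in read.upper() if b in "ACGT")
    return called / len(read)


def concordance(array_read, reference_read):
    """Compare two pre-aligned reads, skipping positions where either is ambiguous.

    Returns (positions_compared, mismatches).
    """
    if len(array_read) != len(reference_read):
        raise ValueError("reads must be aligned to equal length")
    compared = mismatches = 0
    for a, r in zip(array_read.upper(), reference_read.upper()):
        if a in "ACGT" and r in "ACGT":
            compared += 1
            if a != r:
                mismatches += 1
    return compared, mismatches


# Toy example: a microarray read with 3 Ns versus a conventional read.
array_read = "ACGNNTGCANGT"
sanger_read = "ACGTTTGCAAGT"
print(call_rate(array_read))                 # 9 of 12 positions called: 0.75
print(concordance(array_read, sanger_read))  # (9, 0)
```

A result of zero mismatches over all jointly called positions corresponds to the statement that the two methods never disagreed where both assigned a base.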

Sequence Analysis and Strain Identification
Microarray resequencing data and conventional sequencing data were searched by using the Influenza Sequence Database with the BLAST algorithm. Results for the highest bit scores were taken as strain identifications and are shown in Table 2.

Influenza A
Based on sequences of HA genes, which are routinely used for genetic and antigenic characterization, microarray strain identifications of all 13 influenza A isolates correlated with identifications from the conventional sequencing method. Although A/H3N2 isolates were sometimes matched with different specific strain sequences from the Influenza Sequence Database based on the top BLAST hits for each isolate, all were redundant representatives of the same A/Fujian/411/02 lineage identified by conventional sequencing. These results indicate that ambiguous calls (Ns) did not affect the accuracy of BLAST identification. At most, only 6 mismatches occurred between the actual sequence of each isolate and sequence of its top BLAST search hit (Table 2, column M1).
Alignment of the HA peptide sequences translated from RPM version 1-obtained DNA sequences for 12 A/H3N2 isolates (Figure 2) showed that they all shared signature Fujian-like lineage amino acid substitutions (threonine and histidine) at positions 155 and 156 (17). Serine (position 227), which is located within antibody binding site D, was also conserved in these isolates, distinguishing them from the A/California/7/04 strain, which has proline at this position (17). In addition, isolate A/Ecuador/1968/04 shared similar amino acids with those observed in the A/Fujian/411/02 strain at antigenic sites A (lysine, position 145) and B (serine, position 189). Because of a more limited collection of NA and M gene sequences in the Influenza Sequence Database, strain identifications based on these 2 genes could only place them into clade A strains of H3N2 influenza A viruses sampled from New York State, which caused the A/Fujian/411/2002-like epidemic of the 2003-2004 influenza season (data not shown) (21). Although the only tiled M sequence was adopted from an A/H1N1 strain (A/NWS/33), M results generated from the H3N2 isolates were still clearly identifiable as belonging to the A/H3N2 subtype and more specifically to the Fujian-like strain. The A/England/400/05 isolate was the only isolate appropriately identified as A/H1N1, and all 3 sequences (HA, NA, and M) generated from RPM version 1 and conventional sequencing for this isolate matched A/New York/227/2003 (H1N1). This is an A/New Caledonia/20/99-like strain that has been consistently circulating globally since 1999 (16).

Genotyping
RPM version 1 can differentiate a broad range of variants based on a single-tiled "prototype" probe region without relying on predetermined hybridization patterns (9). A number of nucleotide mismatches that distinguished tested isolates from tiled prototype probe sequences were identified in each sample (Table 2, column M2). Some were unique with respect to existing influenza database-recorded sequences. All of these polymorphisms were verified by conventional sequencing (Table 3). Analysis of HA sequences generated from 12 A/H3N2 isolates by RPM version 1 showed that 4 of these nucleotide variations are common to 11 of the samples, excluding the outlying A/Ecuador/1968/04 isolate. Two of these common base substitutions, 313 G→A and 352 A→C, are at the third nucleotide of their respective codons and represent synonymous mutations. Such mutations do not code for amino acid changes and are usually selectively neutral and much more likely to be shared by common ancestry than by parallel evolution. These facts strongly support phylogenetic grouping of these 11 strains (Figure 3). In contrast, 393 A→T and 483 G→A are nonsynonymous mutations and code for critical amino acid changes. Analysis of conventional sequencing data confirmed that these 2 positions are in the antigenic site B and that the affected amino acids were changed from tyrosine to phenylalanine and from serine to asparagine, respectively. These 2 substitutions are both characteristic features of the A/California/7/04 strain that distinguish this group at both sequence and antigenic levels from other Fujian-like strains. The identified polymorphisms show that 11 of the 12 A/H3N2 isolates, although collected from 4 continents, are members of the same A/California/7/04 lineage, while the lone outlier, A/Ecuador/1968/04, is clearly identified as a member of the older A/Fujian/411/02 lineage. These observations demonstrate that RPM version 1 data can be effectively used for molecular epidemiologic tracking.
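The distinction between synonymous and nonsynonymous substitutions can be checked mechanically against the standard genetic code, as in the sketch below. The function name is hypothetical, and the codons are not taken from the isolates' sequences: they are illustrative codons chosen so that the base changes named in the text (G→A and A→C at a third position, A→T and G→A at a second position) reproduce the reported outcomes, including tyrosine to phenylalanine and serine to asparagine.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order (third base fastest).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def classify_substitution(codon, offset, new_base):
    """Classify a single-base change at 0-based offset within a codon.

    Returns (kind, amino_acid_before, amino_acid_after).
    """
    codon = codon.upper()
    mutated = codon[:offset] + new_base.upper() + codon[offset + 1:]
    before, after = CODON_TABLE[codon], CODON_TABLE[mutated]
    kind = "synonymous" if before == after else "nonsynonymous"
    return kind, before, after


# Illustrative codons only (assumed, not read from the isolates).
print(classify_substitution("GGG", 2, "A"))  # third position G->A: Gly stays Gly
print(classify_substitution("CCA", 2, "C"))  # third position A->C: Pro stays Pro
print(classify_substitution("TAC", 1, "T"))  # second position A->T: Tyr to Phe
print(classify_substitution("AGC", 1, "A"))  # second position G->A: Ser to Asn
```

Third-position changes are frequently, though not always, synonymous because of the redundancy of the genetic code, which is why the two shared third-position substitutions carry phylogenetic rather than antigenic information.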
Nearly every isolate was shown to have unique base mutations, many of which resulted in amino acid substitutions. Identification of these mutations reaffirms common knowledge that genetic drift is a frequent event during circulation of influenza viruses and that the RPM version 1 gene chip is an effective tool for tracking unique genetic changes within influenza strains.

Detection of Multiple Targets
To test the capability of RPM version 1 to detect multiple pathogens with the random amplification protocol, we analyzed total nucleic acid isolated from trivalent FluMist intranasal vaccine. Figure 1D shows that 8 tiled influenza sequences on RPM version 1 were strongly hybridized by randomly amplified FluMist nucleic acids, and the resulting sequence data confirmed that FluMist includes immunogenic surface protein (HA and NA) genes from influenza H1N1, H3N2, and influenza B strains. Sequence analysis showed that these antigen-encoding genes matched those of 3 wild-type influenza strains recommended by WHO for making vaccine for the 2004-2005 season (Table 4). Two types of M genes from FluMist were identified by RPM version 1 as those in the cold-adapted Ann Arbor strains of influenza A and B, both of which are essential components in the cold-adapted master donor virus vaccine strain (23).

Discussion
Because of the relative ease of transmission of respiratory pathogens, tremendous pressure exists to develop rapid and sensitive tools to identify them. The surveillance of influenza virus outbreaks requires identification not only on the species level but also on the subtype or strain level. Current molecular methods, such as PCR and multiplex PCR, have dramatically improved detection sensitivities and efficiency compared with culture and serologic methods (24). However, they require multiple diagnostic tests to discriminate between organisms at multiple phylogenetic levels and are inherently limited in scope and resolution (i.e., increases in resolution necessitate corresponding decreases in scope). Furthermore, these tests rely on the conservation of primer-targeted sequences and as such can be rendered completely ineffective by as little as a single base mutation.
Currently, most microarrays used for microbial detection are spotted arrays that use redundant oligonucleotides as independent probes. For these methods, 2 types of probe targets are usually considered. The first are conserved gene sequences such as 16S rRNA and gyrase (25,26), which are chosen for identification at the genus or family levels. The second are relatively unique sequences such as virulence factor genes and antigenic determinant genes (27,28), which are used for species or serotype identification. In this way, pathogen recognition by microarray becomes as reliant on specific hybridization patterns as PCR is on primer-target conservation. Thus, a microarray is only able to resolve identity to the level of divergence represented by the diversity of probes present on the array. With resequencing arrays such as RPM version 1, multiple contiguous sequences (range 100 bp to 2 kb) containing both conserved and unique target genes from each species or subtype can be selected as prototype regions, and every nucleotide from the hybridized target regions can be potentially read as an independent data point using resequencing algorithms (5). The key advantage of the resequencing array is that it does not require a specific match between the analyzed sample and the probe, and mismatches actually add value because they can be identified and used as strain-specific markers.
Since the antigen-encoding HA and NA genes are highly variable between different subtypes, sequences specific for HA1, HA3, HA5, NA1, and NA2 were all tiled on RPM version 1 independently so that influenza A H3N2, H1N1, and H5N1 viruses could be identified and resequenced. Further analysis of the generated sequences showed variations between target and prototype sequences and accurately identified tested isolates at the strain level and as members of recognized circulating variants (Table 2). With its capability to identify strains, the resequencing microarray is a powerful tool for analysis of genetic characteristics of circulating and emerging influenza viruses and can be used to track movement of known variants. Although only 1 type of M gene (H1N1), which is relatively conserved among influenza A viruses, was tiled on RPM version 1, it was still able to cross-hybridize and differentiate M genes from different subtypes (Figure 1 and Table 4). This tiled gene would theoretically allow detection of any other type of influenza virus for which antigenic HA and NA sequences were not tiled on the array. Another powerful feature of RPM version 1 is its broad-spectrum detection capability, allowing simultaneous resequencing of dozens of gene targets from multiple pathogens in 1 assay. This capability, however, is dependent on an equally broad-spectrum amplification method. With 66 diverse gene probes tiled on RPM version 1 covering 20 common respiratory and 6 biothreat pathogens (14), it was logical to use a generic, sequence-independent PCR strategy to amplify all potentially pathogen-derived sequences in an unbiased fashion before hybridization. By adopting a random amplification protocol (18) for use with RPM version 1, we could simultaneously detect multiple microorganisms, as shown with trivalent FluMist vaccine.
Correctly identifying 4 different influenza subtypes and their corresponding genes provided a simultaneous demonstration of 3 features of the resequencing microarray: strain identification through pattern recognition, sequence determination, and broad-spectrum capability. Conventional sequencing can determine DNA sequence and has been routinely used for genetic typing in surveillance investigations (16,17,21,29). However, it requires designing specific primers and multiple RT-PCRs to determine and amplify individual genes (such as HA, NA and M) before proceeding with sequencing reactions (this is especially true for highly polymorphic RNA viruses such as influenza virus). This requires initial use of other lower resolution techniques to identify strain type. All of these steps are time-consuming and labor-intensive. RPM version 1, combined with a random amplification protocol, can provide sequence information about a wide variety of genes representing many pathogens simultaneously and rapidly without knowledge of the identity of the tested sample. With the current possibility of an avian influenza virus A/H5N1 pandemic (30), surveillance for and characterization of emerging variants are essential to the rapid implementation of control measures.
In conclusion, we have combined a random amplification strategy with a resequencing microarray to efficiently and simultaneously detect, type, and genetically characterize geographically diverse influenza viruses. Application of this and similar methods may aid in a better understanding of the incidence, prevalence, and epidemiology of influenza infections and simultaneously allow more rapid identification of epidemic and pandemic outbreaks.
Support was provided by the Air Force Medical Services (Office of HQ USAF Surgeon General) and the Office of Naval Research.
Dr Wang is a molecular biologist at the Naval Research Laboratory, Washington DC. His research interests include molecular diagnosis of infectious diseases, genomics, and bioinformatics.