A comparative analysis of the information content in long and short SAGE libraries

Background Serial Analysis of Gene Expression (SAGE) is a powerful tool to determine gene expression profiles. Two types of SAGE libraries, ShortSAGE and LongSAGE, are distinguished by the length of the SAGE tag (10 vs. 17 basepairs). LongSAGE libraries are thought to be more useful than ShortSAGE libraries, but their information content has not been widely compared. To dissect the differences between these two types of libraries, we utilized four libraries (two LongSAGE and two ShortSAGE libraries) generated from the hippocampus of Alzheimer and control samples. In addition, we generated two additional short SAGE libraries, the truncated long SAGE libraries (tSAGE), from LongSAGE libraries by deleting seven 5' basepairs from each LongSAGE tag. Results One problem in SAGE studies is that an individual tag may match multiple different genes because of its short length. We found that a LongSAGE tag maps to up to 15 UniGene clusters, while ShortSAGE and tSAGE tags map to up to 279 UniGene clusters. Both long and short SAGE libraries exhibit a large number of orphan tags (no gene information in UniGene), reflecting the limitation of the UniGene database. Among 100 orphan LongSAGE tags, the complete sequences (17 basepairs) of nine orphan tags match 17 genomic sequences; four of the orphan tags match a single genomic sequence. Our data show the potential to resolve 4–9% of orphan LongSAGE tags. Finally, among 400 tSAGE tags showing significant differential expression between AD and control, 79 tags (19.8%) were derived from multiple non-significant LongSAGE tags, suggesting false positive results. Conclusion Our data show that LongSAGE tags have higher specificity in gene mapping than ShortSAGE tags. LongSAGE tags show an advantage over ShortSAGE tags in identifying novel genes by BLAST analysis. Most importantly, the chance of obtaining false positive results is higher for ShortSAGE than for LongSAGE libraries because of the lower specificity of ShortSAGE tags in gene mapping. Therefore, we recommend that the number of UniGene clusters (genes or ESTs) corresponding to a tag be considered when prioritizing significant results.


Background
Serial Analysis of Gene Expression (SAGE), introduced by Velculescu et al. [1], is a powerful open source method for profiling transcripts expressed in a given tissue. In this technique, mRNA transcripts are converted to cDNA and then processed 5' to the poly A+ tail to isolate short cDNA fragments called "tags." These tags are linked together into long concatemers and sequenced. The length of a SAGE tag is either 10 (short SAGE tag) or 17 (long SAGE tag) basepairs (bps) following a known restriction site. SAGE results are recorded as a list of distinct tags whose frequencies can be tabulated to yield a quantitative measure of gene expression. The frequency count of each SAGE tag reflects the abundance of the respective mRNA transcript in the transcriptome of the tissue or cell type under study. Unlike microarray technology, which is limited to a finite number of known gene sequences arrayed on a chip, SAGE detects all transcripts expressed in a tissue sample and provides more quantitative information than microarrays. However, SAGE is expensive, time- and labor-intensive, and prone to sequencing errors [2]. Therefore, the total number of SAGE libraries produced for a study is generally smaller than the number of arrays in a microarray study.
Annotation of SAGE tags is a major task in SAGE data analysis. Many resources have been developed for mapping SAGE tags to genes, for instance, SAGEmap from the National Center for Biotechnology Information (NCBI) [3] and SAGE Genie from the National Institutes of Health Cancer Genome Anatomy Project [4]. Although these tools are useful, they rely on high-quality databases to make confident tag-to-gene mappings. With only 14 bps (10 bps + restriction site) per short SAGE (ShortSAGE) tag, it is impossible to directly screen a tag against the whole genome, since 14 bps are insufficient to identify a unique genomic locus. UniGene Clusters (http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=unigene) is the most frequently used database for searching for the transcripts (e.g. genes or ESTs) corresponding to a SAGE tag. If a tag cannot be mapped to a UniGene cluster, it is impossible to determine whether the tag is spurious (i.e. missequenced, misincorporation of a nucleotide, not an mRNA) or represents a rare or novel gene not found in the UniGene database. This defeats the purpose of detecting unknown genes using SAGE tags. On the other hand, a LongSAGE tag (21 bps: 17 bps + restriction site) is sufficiently long to be screened directly against the whole genome to identify its unique locus with a reasonable chance of success.
Due to the short length of a SAGE tag, it is common for a SAGE tag, especially a ShortSAGE tag, to map to multiple UniGene clusters, which may be genes or ESTs. When multiple genes or ESTs are found for a single tag, it is impossible to apportion the tag count among the genes/ESTs that share the same SAGE tag sequence. Therefore, when such a ShortSAGE tag is found to be differentially expressed between two samples, it cannot be determined which gene(s) or EST(s) are differentially expressed. This can lead to serious problems in interpreting gene expression levels between different tissues or states. The longer tags from LongSAGE libraries may help correct this problem, in addition to providing the opportunity to identify new and unique genes.
Although LongSAGE libraries possess several inherent advantages over ShortSAGE libraries, studies comparing the information content of ShortSAGE and LongSAGE remain limited [2,5]. In addition, previous studies focused more on tag annotation than on other topics. Lu et al. [5] generated four LongSAGE libraries using colon cell lines with/without a p53 mutation under either normal oxygen or hypoxia conditions. Based on these four LongSAGE libraries, they generated four ShortSAGE libraries by extracting the 10-bp tags from the LongSAGE tags. They limited their analyses to the confident tags, that is, the tags with counts > 1. They concluded that ShortSAGE identifies differentially expressed genes more efficiently than LongSAGE. They also found that only 4-7% of the redundant confident ShortSAGE tags can be resolved by confident LongSAGE tags. Similarly, van Ruissen et al. [2] did not find that LongSAGE tags improved SAGE tag annotation: both ShortSAGE and LongSAGE have about 30% of tags with reliable annotation. Overall, these studies seem to favor ShortSAGE libraries.
In this study, we investigated various issues related to the information content of LongSAGE and ShortSAGE libraries. Unlike Lu et al. [5], we utilized two types of ShortSAGE libraries. One is derived from the LongSAGE libraries, as in Lu et al. The other is the real ShortSAGE library sequenced from the samples. We generated four SAGE libraries (two LongSAGE and two ShortSAGE) using human brain tissue samples of two Alzheimer cases and two controls. We attempted to address the following: (1) determine the number of tags that can be matched to UniGene clusters using LongSAGE and ShortSAGE tags; (2) evaluate tags that we were unable to assign to UniGene clusters; (3) compare the number of significant differentially expressed genes that can be derived from LongSAGE and ShortSAGE libraries; and (4) investigate the use and potential advantages of LongSAGE tags in identifying novel genes not listed in the UniGene database.

Results
Table 1 summarizes the basic tag information for each SAGE library. More than 70,000 tags were extracted from both LongSAGE and ShortSAGE libraries. The tag count per tag ranges from one to 2,202 for long SAGE tags, and from one to 1,098 for short SAGE tags. Interestingly, the total tag counts and the numbers of distinct tags (unique tags) were higher in AD than in control samples in both LongSAGE and ShortSAGE libraries. For instance, there are 34,475 unique tags in L_AD and 30,581 in L_Ctrl, indicating that more tags are expressed in the AD than in the control tissues. Since not all tags are expressed in both the AD and control libraries, the number of tags that are expressed in at least one of the libraries increases to 55,093 for the LongSAGE, 43,937 for the tSAGE, and 37,900 for the ShortSAGE compared datasets. Furthermore, the overall frequency of SAGE tags mapped to UniGene build 182 for each library is not very high. For instance, we found 14,643 tags (42.5%) in L_AD and 11,646 tags (38.1%) in L_Ctrl that map to the UniGene database, which leaves a large number of orphan tags (no UniGene IDs) in each library (Table 1).
Applying the same strategy described in Lu et al. [5], we evaluated the tag-to-gene relationship using confident LongSAGE tags, defined as tags with counts > 1. Under this constraint, we still observed more LongSAGE tags in L_AD than in L_Ctrl. Interestingly, we observed similar frequencies of redundant short tags. We found that only about 4.9-5.7% of tSAGE tags mapped to multiple LongSAGE tags (Table 2). Further, more than 70% of confident tags can be mapped to UniGene cluster(s), indicating that the overall low tag-to-gene mapping rate for each library mainly comes from tags with tag counts < 2 (non-confident tags).
As expected, the tag-gene relationship is more specific for the LongSAGE tags than for the short SAGE tags. Figure 1 depicts the distribution of tags based on the number of their corresponding UniGene clusters for each compared dataset. The LongSAGE library shows a large percentage of orphan tags (65%) in comparison to tSAGE and ShortSAGE, which have about 18% orphan tags. This is expected, as the probability of mapping to a UniGene cluster is much smaller for a long SAGE tag due to the extra seven bps. The three compared libraries show a similar percentage of tags mapping to a single UniGene cluster, that is, 32.3% for the LongSAGE, 32.7% for the tSAGE, and 33.1% for the ShortSAGE libraries. However, 97.3% of LongSAGE tags are either orphan tags or map to a single UniGene cluster, while both tSAGE and ShortSAGE libraries still have about 50% of tags mapping to more than one UniGene cluster. The maximum number of UniGene clusters that correspond to a single tag was 15 for the LongSAGE tags, and 279 for both tSAGE and ShortSAGE tags. This may imply that there is a higher chance of obtaining false matches for a ShortSAGE tag than for a LongSAGE tag. For instance, of the 17,793 LongSAGE tags that map to a single UniGene cluster, only 5,749 tags map to a single UniGene cluster after conversion to tSAGE tags; the rest contribute to the pool of tags that map to more than one cluster, which may represent false matches. As theorized, the increase in gene mapping specificity offered by the LongSAGE tags over ShortSAGE tags is substantial.
When we compared the expression pattern between AD and control for the three types of libraries (LongSAGE, tSAGE, and ShortSAGE), the LongSAGE and tSAGE libraries showed strong similarity (Figure 2). This is reasonable as they were based on the same samples. Unexpectedly, S_AD and S_Ctrl show very similar expression levels for the majority of genes, which is different from the case and control samples used for the LongSAGE and tSAGE libraries. Our testing results reflected the expression patterns in Figure 2. We detected 380 LongSAGE tags, 400 tSAGE tags, and 156 ShortSAGE tags with significant differential expression between AD and control (P < 0.05). Clearly, we detected fewer tags in the ShortSAGE dataset than in the other two. Although significant, this difference could be due to gene expression variation between samples with the same disease status.
Since both LongSAGE and tSAGE libraries were derived from the same samples, we used these two datasets to measure the relative ability of long and short SAGE libraries to detect altered gene expression. We found that the 400 significant differentially expressed tSAGE tags were × 10^9 bps (14%) and equal frequency of each nucleotide occurred at a base. The number of matched gene sequences for an orphan tag increases as the number of matched bps decreases (Table 3). A total of 39 gene sequences were identified through this approach. Since the tag sequence used in the BLAST analysis consists of four bps (nucleotide positions one to four) from the restriction site and 17 bps (nucleotide positions five to 21) from the SAGE tag, we also restricted our selection to tags for which all 17 bps in the tag region match a gene sequence. The reason for this is that sequencing errors are more likely in the restriction site than in the tag region. Under these criteria, the ending position of the matched segment in the tag sequence is always 21 and the starting position needs to be less than or equal to five. We found nine orphan tags that met these criteria (Table 4). Four of the nine orphan tags matched a single human gene sequence, with 21, 20, and 18 matched bps; these sequences are more likely to be the real transcripts for these four orphan tags.
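The selection criteria above (the matched segment ends at position 21 and starts at or before position five) can be illustrated with a minimal Python sketch. The `Hit` record, its field names, and the toy coordinates are hypothetical and are not taken from the BLAST output used in this study; the sketch only assumes gapless alignments reported in 1-based positions of the 21-bp tag sequence.

```python
from typing import NamedTuple

RESTRICTION_SITE_END = 4  # positions 1-4: restriction site
TAG_END = 21              # positions 5-21: 17-bp tag region


class Hit(NamedTuple):
    """Hypothetical record of one gapless alignment (1-based tag coordinates)."""
    subject_id: str
    query_start: int
    query_end: int


def covers_tag_region(hit):
    """True if the alignment spans all 17 bps of the tag region."""
    return hit.query_end == TAG_END and hit.query_start <= RESTRICTION_SITE_END + 1


def matched_bps(hit):
    """Number of matched bps in a gapless alignment."""
    return hit.query_end - hit.query_start + 1


# Toy alignments (hypothetical values for illustration only).
toy_hits = [
    Hit("seq1", 1, 21),  # full 21-bp match
    Hit("seq2", 4, 21),  # 18 bps, covers the whole tag region
    Hit("seq3", 6, 21),  # misses the first bp of the tag region
    Hit("seq4", 3, 19),  # does not reach position 21
]
kept = [hit for hit in toy_hits if covers_tag_region(hit)]
```

In this toy example only the first two alignments are kept, with 21 and 18 matched bps respectively.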

Discussion
The use of SAGE libraries has been advocated, but technical complexity has limited their use. In addition, the value of long vs. short tag SAGE has not been widely explored. A few facts for a SAGE study are listed below. First, the tSAGE libraries share similar numbers of unique tags and tag counts with the "real" ShortSAGE libraries. The small differences between tSAGE and ShortSAGE libraries may be simply due to the variation between samples. These outcomes imply one advantage for the LongSAGE libraries as they can be analyzed in two ways (as long or short tags
Figure 2 Tag frequency comparison. Comparisons of tag frequencies between AD and controls of LongSAGE, ShortSAGE, and tSAGE libraries.
counts from its corresponding LongSAGE tags, a false positive result of a tSAGE tag may simply be due to its mapping to multiple LongSAGE tags. In a real setting, this problem will exist for a tag that maps to multiple genes or ESTs. When only ShortSAGE data are available, we will not be able to dissect the tag-gene relationship as described here. We may make a wrong decision by concluding that a short SAGE tag is significant simply by looking at the p-value, even if the p-value is very small. Since all 156 tSAGE tags in the Positive group (the presumed true significant tSAGE tags) map to a single LongSAGE tag that has high specificity in tag-to-gene mapping, one potential solution is to take into account the number of UniGene clusters mapped to a tag in the decision making process. Among the 156 tSAGE tags in the Positive group (the presumed true significant ones), 67% of tags match to two UniGene clusters. On the other hand, 53% of tSAGE tags in the Negative group (the false ones) mapped to more than two UniGene clusters. If we treat the tags that map to two or fewer UniGene clusters as the presumed true significant tags, we will only include 47% of false ones, which is better than including all tags with false positive results.
Throughout this paper, our tag-to-gene mapping analysis relies on the UniGene database. However, a UniGene cluster does not always imply a gene. It is possible that multiple UniGene clusters refer to the same gene. In our LongSAGE tag analysis, we found that 97.3% of LongSAGE tags are either orphan tags or mapped to a single UniGene cluster, which is less likely to produce ambiguity in tag-to-gene mapping. Of the remaining 2.7% of LongSAGE tags, 1.9% (1044 tags) map to two UniGene clusters. While it is not our main focus to dissect the properties of each UniGene cluster in this paper, we found that 10.7% of the 1044 LongSAGE tags have the same description for the two clusters even though their UniGene IDs are different. Therefore, it is possible that some of these LongSAGE tags in fact map to a single gene, which may increase the specificity of tag-to-gene mapping for LongSAGE tags.
The large number of orphan tags also reflects the limitation of the UniGene database. We showed that there is a potential to use long SAGE tags to identify novel genes that are not listed in the UniGene database. Unlike the short SAGE tag, the long SAGE tag has a sufficient number of nucleotides, allowing us to perform BLAST analysis to
Figure 3 The property of significantly differentially expressed tSAGE tags. A diagram to relate the LongSAGE tags to 400 tSAGE tags that are significantly differentially expressed between AD and control. The distribution of the tSAGE tags is summarized based on the number of their corresponding LongSAGE tags.

Human brain samples and pathological assessment
Human brain tissues were collected in the Kathleen Price Bryan Brain Bank at the Duke University Alzheimer Disease Research Center (ADRC) and in the Brain Bank of the Center for Human Genetics (CHG) at Duke University Medical Center (DUMC), following the rapid autopsy protocol [7]. The hippocampus was dissected at the time of autopsy, and matching 100-200 mg portions of CA 1-4 were removed and used for RNA isolation and expression studies. Four brain tissue samples, including two AD (sample IDs: 470 and 589) and two controls (sample IDs: 673 and 707), used in this study were previously described in Xu et al. [8]. All four samples have the same apolipoprotein E 3/3 genotype (APOE3/3). The pathological diagnosis of AD was established according to CERAD criteria [9], and the degree of AD pathological changes was staged according to Braak [10]. The AD patients used in this study have pathological changes at Braak and Braak stages IV and V (B&B stage IV and V), and the controls were cognitively and pathologically normal with B&B stage I. Post-mortem delay times ranged from 1:10 to 4:15 hours [8].

RNA isolation for SAGE library construction
Total RNA was isolated from frozen hippocampus samples of AD patients and controls using TRIzol reagent (Invitrogen) according to the manufacturer's instructions. Briefly, brain tissue was homogenized in TRIzol reagent by Dounce homogenization and the homogenized samples were incubated for five minutes at room temperature. After the addition of chloroform, the mixture was centrifuged to separate the RNA-containing aqueous phase from the TRIzol reagent. The aqueous phase was transferred to a fresh tube and the RNA was precipitated by adding 0.5 volume of isopropyl alcohol. The RNA pellet was washed once with 75% ethanol, dried, resuspended in DEPC-treated water, and stored at -80°C.

Construction of human hippocampus SAGE libraries
For ShortSAGE library construction, standard protocols as described by Velculescu et al. [1] and Basrai and Hieter [11] were used with minor modifications. Briefly, SAGE was performed with 10 μg total RNA isolated from human brain hippocampus samples as outlined above. The cDNA was prepared using the SuperscriptII cDNA synthesis kit (Invitrogen) with gel-purified 5'-biotinylated oligo(dT)18 (Integrated DNA Technologies, Coralville, IA), according to the manufacturer's protocol. NlaIII and BsmFI restriction enzymes (New England Biolabs, Beverly, MA) were used for tag generation. BsmFI digestion was performed at 37°C for 2.5 h (instead of 65°C) using 40 units BsmFI in a 300 μl reaction volume with the supplied buffer. After a three-hour concatemerization step, the concatemers were heated at 65°C for 10 minutes, followed by two minutes on ice to enhance cloning efficiency. Purified concatemers were subsequently cloned into the SphI site of pZero-1 (Invitrogen) and transformed into competent ElectroMax DH10B cells (Invitrogen) using a 0.1 cm cuvette and the Gene Pulser II (BioRad). Individual SAGE library clones were selected and PCR amplified using 96-well format Qiagen Real minipreps, and sequenced with an ABI 3700 capillary sequencer using BigDye chemistry.
LongSAGE library construction was performed with 10 μg total RNA using the standard SAGE protocol with the modifications according to Saha et al. [12]. We used the MmeI type IIS restriction endonuclease (New England Biolabs) to release the linker tag molecules from the cDNA.

SAGE tag extraction
ShortSAGE tags (10 bps) were extracted from the PHD files with eSAGE software, using a threshold value of PHRED 20 for each base (Margulies and Innis 2000). The SAGE tags were compared between the ShortSAGE AD (S_AD) and ShortSAGE control (S_Ctrl) libraries using eSAGE software to form a compared ShortSAGE database. LongSAGE tags (17 bps) were extracted from raw sequence data of the LongSAGE libraries using SAGE2000 version 4.5 Analysis Software. We directly merged the SAGE tags from the LongSAGE AD (L_AD) and LongSAGE control (L_Ctrl) libraries to generate a compared LongSAGE database. Both the compared ShortSAGE and LongSAGE databases were mapped to UniGene build 182 (National Center for Biotechnology Information, NCBI).

SAGE data analysis
In addition to the four SAGE libraries described above, we used the same strategy employed by Lu et al. [5] to generate two additional short SAGE libraries based on the LongSAGE libraries. We truncated the seven 5' bps of each long SAGE tag to generate a truncated LongSAGE (tSAGE) library, which is analogous to the ShortSAGE library, as each tSAGE tag has only 10 bps. The tag count of a tSAGE tag is the sum of the tag counts of the LongSAGE tags that have the same first 10 bps. Hereafter, we refer to the two tSAGE libraries as T_AD for the tSAGE AD library and T_Ctrl for the tSAGE control library. Similarly, we generated a compared SAGE database for T_AD and T_Ctrl, and mapped tSAGE tags to UniGene build 182. This allows us to directly compare results for long and short SAGE (i.e. LongSAGE and tSAGE) tags derived from the same tissue samples. We utilized these six libraries (three compared SAGE databases) to investigate the information content of long and short SAGE libraries.
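As an illustration of the truncation step, the following minimal Python sketch collapses a LongSAGE library into a tSAGE library. It follows the statement that the count of a tSAGE tag is the sum of the counts of the LongSAGE tags sharing the same first 10 bps. The function name, the dictionary representation of a library, and the toy tags are assumptions made for illustration and are not part of the software used in this study.

```python
from collections import Counter

SHORT_TAG_LENGTH = 10  # tSAGE/ShortSAGE tag length in bps, as stated in the text


def truncate_library(long_tag_counts):
    """Collapse LongSAGE tag counts into tSAGE tag counts.

    long_tag_counts maps a 17-bp LongSAGE tag to its tag count. Each tag is
    reduced to its first 10 bps, and the counts of all LongSAGE tags sharing
    that 10-bp prefix are summed.
    """
    tsage_counts = Counter()
    for long_tag, count in long_tag_counts.items():
        tsage_counts[long_tag[:SHORT_TAG_LENGTH]] += count
    return dict(tsage_counts)


# Hypothetical toy library: the first two tags share their first 10 bps.
toy_l_ad = {
    "AAAAAAAAAACCCCCCC": 3,
    "AAAAAAAAAAGGGGGGG": 2,
    "TTTTTTTTTTCCCCCCC": 5,
}
toy_t_ad = truncate_library(toy_l_ad)
```

In the toy library, the two LongSAGE tags that share a 10-bp prefix are merged into a single tSAGE tag whose count is the sum of the two original counts.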
First, the data were summarized for these six SAGE libraries. We computed the number of unique tags, the total tag counts, the number of tags that map to UniGene, and the number of tags with no UniGene information (i.e. the orphan tags) for each library. We also evaluated the specificity of the long and short SAGE tags for gene mapping. We computed the number of genes corresponding to each tag for the three compared SAGE datasets. To estimate the percentage of redundant short SAGE tags that can be resolved by the long SAGE tags, we mimicked the approach of Lu et al. [5] using the LongSAGE and tSAGE libraries. We obtained a set of unique LongSAGE tags with tag counts greater than one. Then, we computed the numbers of unique and redundant tSAGE tags that correspond to these LongSAGE tags. In other words, these redundant tSAGE tags can be resolved if their corresponding LongSAGE tags are known. Further, we investigated the tag-to-gene mapping pattern of the tSAGE tags that originally map to a single UniGene cluster under the LongSAGE tag format.
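The redundancy computation described above can be sketched as follows. This is a minimal illustration under stated assumptions: a library is represented as a dictionary of tag counts, confident tags are those with counts greater than one, and a tSAGE tag is called redundant when more than one confident LongSAGE tag shares its first 10 bps. The function name and toy data are hypothetical.

```python
from collections import defaultdict


def redundant_tsage_fraction(long_tag_counts, min_count=2, short_len=10):
    """Fraction of tSAGE tags derived from more than one confident LongSAGE tag."""
    long_tags_per_short = defaultdict(set)
    for long_tag, count in long_tag_counts.items():
        if count >= min_count:  # confident tag: tag count > 1
            long_tags_per_short[long_tag[:short_len]].add(long_tag)
    if not long_tags_per_short:
        return 0.0
    redundant = sum(1 for tags in long_tags_per_short.values() if len(tags) > 1)
    return redundant / len(long_tags_per_short)


# Hypothetical toy library for illustration only.
toy_library = {
    "AAAAAAAAAACCCCCCC": 3,
    "AAAAAAAAAAGGGGGGG": 2,  # shares its first 10 bps with the tag above
    "TTTTTTTTTTCCCCCCC": 5,
    "GGGGGGGGGGAAAAAAA": 1,  # count of one: not confident, ignored
    "CCCCCCCCCCAAAAAAA": 4,
}
toy_fraction = redundant_tsage_fraction(toy_library)
```

Here three tSAGE tags remain after filtering, and one of them corresponds to two confident LongSAGE tags, so one third of the toy tSAGE tags are redundant.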
Second, we examined the performance of the LongSAGE, tSAGE, and ShortSAGE libraries in identifying differentially expressed genes. Chi-square and Fisher exact tests, as previously described [13], were used to test differences in expression levels between AD and control for each tag in each compared SAGE dataset. Since our goal is not to provide a set of candidate genes, but rather to use the results to compare the relationship between significant short and long SAGE tags, we applied a nominal significance level of 0.05 to declare significant tags without correcting for multiple testing. We summarized the number of significant tags for each compared SAGE dataset. For each significant tSAGE tag, we investigated the number of its corresponding long tags. We compared the LongSAGE tag counts per tSAGE tag among three groups.
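For illustration, a per-tag test of this kind can be sketched as a Pearson chi-square test on a 2 x 2 table (tag count vs. all other tag counts, AD vs. control). The exact form of the tests described in [13] is not given here, so the sketch below is an assumption: it uses the standard statistic with one degree of freedom and no continuity correction, and it does not implement the Fisher exact test. The function name and the library totals in the example are hypothetical.

```python
import math


def chi_square_2x2_p(tag_a, total_a, tag_b, total_b):
    """P-value of a Pearson chi-square test (1 df, no continuity correction).

    tag_a, tag_b: counts of one tag in libraries A and B.
    total_a, total_b: total tag counts of libraries A and B.
    """
    table = [[tag_a, total_a - tag_a], [tag_b, total_b - tag_b]]
    row_sums = [sum(row) for row in table]
    col_sums = [table[0][j] + table[1][j] for j in range(2)]
    grand = sum(row_sums)
    if 0 in row_sums or 0 in col_sums:
        return 1.0  # degenerate table: no evidence of a difference
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_sums[i] * col_sums[j] / grand
            chi2 += (table[i][j] - expected) ** 2 / expected
    # Survival function of the chi-square distribution with 1 df.
    return math.erfc(math.sqrt(chi2 / 2.0))


# Hypothetical example: a tag seen 30 times in AD and 10 times in control,
# with 70,000 total tags assumed in each library.
p_value = chi_square_2x2_p(30, 70000, 10, 70000)
is_significant = p_value < 0.05  # nominal level used in the text
```

A tag with equal counts in two libraries of equal size yields a statistic of zero and a p-value of one, which is a quick sanity check on the implementation.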
Finally, UniGene serves as a database to interpret the SAGE tags. Each UniGene cluster contains sequences that represent a unique gene or EST. Since the UniGene set is based on expressed mRNAs, it represents only a small portion of the genome. Although there are more than 53,000 unique UniGene entries, a large number of orphan tags are still found in both ShortSAGE and LongSAGE libraries. Here, we investigate whether LongSAGE tags can help us identify genes corresponding to these orphan tags and whether they represent real genes or are artifacts of library construction and analysis.
Since the maximum length of a LongSAGE tag is up to 21 bps (including the cut site), it is possible to search for genes corresponding to these long tags using sequence alignment tools, such as BLAST. BLAST finds regions of local similarity between DNA sequences. Under the assumption of equal probability of sampling a nucleotide at each base, the probability of obtaining an exact matched sequence with k bps is (1/4)^k. Assuming that the human genome consists of N bps, the probability of obtaining a matched chromosomal segment with k bps can be approximated by treating all chromosomal segments of k bps as independent, and the expected number of chromosomal segments that match a tag with k bps is (N-k+1)(1/4)^k. The number of genes matching a given tag decreases as the number of required matched bps (k) in the tag increases. If we assume that the human genome consists of 2.864 × 10^9 bps (Golden path length at http://www.ensembl.org/Homo_sapiens/index.html), we may expect to find about 10 sequence segments matched to a 14-bp tag sequence. This number falls below one when we require 16 or more bps to match a tag. Clearly, a larger k gives higher accuracy in gene identification than a smaller k. Based on the above calculations, we used 17+ bps as our search criterion in BLAST analysis. However, this computation did not take into account genes that may be highly homologous to each other. Here, we examine the frequencies of obtaining perfectly matched gene sequences for orphan tags through BLAST analysis. A gene sequence is considered a perfect match with an orphan tag if it has a segment matching a complete portion of the tag, that is, no gaps (unmatched nucleotides) within the sequence are allowed. We randomly selected 100 orphan LongSAGE tags from the L_Ctrl library and screened the 21-bp LongSAGE tag sequences by BLAST. We selected the tags that show a perfect match to human genes with at least 17 bps.
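The expected number of matching segments, (N-k+1)(1/4)^k, can be checked numerically with a short Python sketch. The genome length is the value quoted in the text. The second function, which gives the probability of at least one match as 1 - (1 - (1/4)^k)^(N-k+1), is an assumption that follows from treating the N-k+1 segments as independent; it is not a formula taken from the text.

```python
import math

GENOME_LENGTH = 2.864e9  # N in bps, as quoted in the text


def expected_matches(k, n=GENOME_LENGTH):
    """Expected number of k-bp genomic segments matching a tag exactly."""
    return (n - k + 1) * 0.25 ** k


def prob_at_least_one_match(k, n=GENOME_LENGTH):
    """Probability of at least one exact match among n - k + 1 segments.

    Assumption: segments are independent with match probability (1/4)^k.
    Computed as 1 - (1 - p)^(n - k + 1) using log1p/expm1 for accuracy.
    """
    p = 0.25 ** k
    return -math.expm1((n - k + 1) * math.log1p(-p))


for k in (14, 16, 17, 21):
    print(k, expected_matches(k), prob_at_least_one_match(k))
```

Under these assumptions, the expected number of random matches is between 10 and 11 for k = 14 and drops below one for k = 16, in line with the figures given above.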

Authors' contributions
YJL supervised statistical analysis, drafted and revised the manuscript, and is responsible for the content of the paper. PX generated the SAGE libraries used in this study and also helped with manuscript preparation. XQ performed the data analysis. DES was involved in patient ascertainment. CMH was involved in autopsy work. JLH and MAP are the PIs of the Alzheimer studies and grants which funded part of the research. They provided the samples used in this study. JRG supervised the molecular biology components and helped with manuscript preparation.