Genomic Organization and Chromosomal Localization of the Human Nucleolin Gene*

Nucleolin, a eukaryotic nucleolar phosphoprotein, is involved in the synthesis and maturation of ribosomes. To characterize the genomic organization and regulatory sequences of this gene, two overlapping lambda clones containing the human nucleolin gene plus flanking regions were isolated from a genomic library using human nucleolin cDNA. Southern blots of genomic DNA from human, several mammals, chicken, and yeast revealed that the nucleolin gene is well conserved across these species. The gene consists of 14 exons with 13 intervening sequences and spans approximately 11 kilobases of DNA. Analysis of the splice junctions indicated that the amino-terminal domain and the four RNA binding domains plus the nuclear localization signal are split into adjacent exons. Sequences from the 5'-flanking region and the first intron contain a high content of GC residues, which is consistent with nucleolin being a "housekeeping" gene. Promoter elements include an atypical TATA box (GTTA), one CCAAT box much further from the initiation site, three reverse complements of CCAAT (ATTGG), and two pyrimidine-rich nucleotide stretches. In addition, this region and the first intron contain numerous potential Sp1, GCF, CRE-fos, GCN, AP-1, AP-2, and UCE sites, and sequences similar to the glucocorticoid receptor binding site. The transcription start site was determined by primer extension and S1 nuclease mapping of RNA from human liver. One Kpn and three Alu repeats were found within two of the middle introns. The 3'-untranslated portion of the gene contains five homology blocks in a 100-base pair region that are highly conserved among human, mouse, and hamster genomes. Finally, we have determined that the human nucleolin gene is located on chromosome 2q12-qter and is present at one copy per haploid genome. A restriction fragment length polymorphism with EcoRI has been detected in the gene.

* The costs of publication of this article were defrayed in part by the payment of page charges. This article must therefore be hereby marked "advertisement" in accordance with 18 U.S.C. Section 1734 solely to indicate this fact.
The nucleotide sequence(s) reported in this paper has been submitted to the GenBank™/EMBL Data Bank with accession number(s) 505584.
Ribosomal biogenesis involves the synthesis and maturation of preribosomal RNA molecules within the nucleolus of eukaryotes in association with transiently bound, specific proteins, such as nucleolin and RNA polymerase I. Nucleolin is a 100-kDa phosphoprotein which is activated by selective proteolysis, binds to rDNA spacer regions and nascent rRNA in the nucleolus, and remains attached to ribosomes during transport to the cytoplasm (1-6). Although the mechanism of nucleolin action has not yet been completely defined, recent analyses of rodent (7) and our human nucleolin cDNA sequences (8) have revealed the presence of three major types of domains in nucleolin.
These may correlate with some of nucleolin's several activities. First, the amino-terminal region contains four highly acidic, phosphorylated segments that can interact with histones and decondense chromatin by releasing histones (9, 10). These less well conserved and variable domains may be responsible for the observed ability of nucleolin to bind to rDNA spacer regions (11). Second, four well conserved RNA binding domains are probably the sequences responsible for nucleolin's contact with rRNA (12-14). Finally, a polyglycine region is found at the C-terminal end and may be involved in protein-protein interactions (15). It is interesting to note that the synthesis of nucleolin is positively correlated with increased rates of cell division, and thus the amount of nucleolin is highest in tumor or other rapidly dividing cells (16).
Inasmuch as ribosomal RNA (rRNA) synthesis is critical for the regulation of cell division and can itself be regulated by such diverse stimuli as nutrient starvation, stage of embryogenesis, nucleogenesis and viral infection (17-24), we have devoted considerable energy to understanding the structure and function of human nucleolin.
To this end, we have recently reported the cloning of human nucleolin cDNA and showed that the predicted amino acid sequence of nucleolin contains a multiple domain structure and is similar in many respects to the hamster and Xenopus nucleolins. Furthermore, in an attempt to gain insight into the transcription of the nucleolin gene, we now report the complete sequence of the human nucleolin gene and the analysis of regulatory elements in the 5'-flanking sequences. We were particularly interested in the possible conservation of splice junctions in relation to the protein domains and also the similarity of transcriptional regulatory sequences in different species. In addition, we have determined that the nucleolin gene is located on chromosome 2q12-qter and is not syntenic with other known ribosome-related genes.

Isolation of Genomic Clones-We screened a human genomic library with inserts generated by partial Sau3AI digestion of lung fibroblast DNA in the lambda Fix vector (Stratagene). The hybridization probe was a random primer-labeled (25) EcoRI fragment of the full length nucleolin cDNA (8). Thirty clones were identified on the primary screen

Identification of the Human Nucleolin Gene-A complete cDNA encoding human nucleolin (8) was used to isolate the nucleolin gene from a human genomic library. Two out of the thirty initially positive clones (HG3 and HG10) hybridized to an oligonucleotide from the 5'-noncoding region of the mRNA. DNA from these clones was further analyzed by restriction enzyme mapping and hybridization with various oligonucleotides corresponding to different regions of the cDNA. HG3 contained the intact gene plus approximately 3 kb and 1.2 kb of 5'- and 3'-flanking sequences, respectively (Fig. 1). HG10 extended 3 kb further in the 5' region than HG3 and continued in the 3' direction to the third intron.
Characterization of the Nucleotide Sequence and Splice Junctions-Most of our results were derived by sequencing seven restriction fragments from genomic clone HG3. The 1.2-kb 5'-flanking sequences, both untranslated regions, and exons were sequenced on both strands and correspond to approximately 60% of the HG3 clone (Fig. 1). Sequences for the remaining portions of the clone, which represent intron material, were determined on one strand. The human nucleolin gene is organized into 14 exons spanning approximately 11 kb (Figs. 1 and 2). Sizes of the exons range from 87 to 478 bp, while introns vary in length from 104 to about 1164 bp.
As shown in Table I, the intron/exon boundary sequences conform to published consensus sequences (35). Thus, each splice donor begins with GT and each acceptor site ends with an AG preceded by a polypyrimidine tract. All introns also contain potential lariat acceptor sites (36) upstream from the 3' splice junctions. The intron splice phase is type 0 (the intron occurs between codons) for the first and second introns, type I (the intron falls between the first and second bases of the codon) for introns 3, 4, 5, 7, 9, 11, and 13, and type II (the intron falls between the second and third bases of the codon) for introns 6, 8, 10, and 12 (37). Interestingly, the intron splice phase is type I for odd-numbered exons and type II for even-numbered exons in the two-thirds of the molecule comprising the RNA binding domain.
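The phase classification described above can be illustrated with a short sketch. It assumes that the phase of an intron is the cumulative number of coding nucleotides upstream of it, taken modulo 3 (0, 1, and 2 corresponding to types 0, I, and II). The function name and the exon lengths are hypothetical and are not the nucleolin exon sizes.

```python
from itertools import accumulate


def intron_phases(exon_coding_lengths):
    """Return the phase (0, 1, or 2) of each intron.

    exon_coding_lengths lists the coding nucleotides contributed by each
    exon, in 5' to 3' order. The intron after exon i has phase equal to the
    cumulative coding length through exon i, modulo 3:
      0 = intron lies between codons (type 0)
      1 = intron lies after the first base of a codon (type I)
      2 = intron lies after the second base of a codon (type II)
    """
    cumulative = list(accumulate(exon_coding_lengths))
    # The last exon is not followed by an intron, so drop its total.
    return [total % 3 for total in cumulative[:-1]]


# Hypothetical coding lengths chosen only to show all three phases.
example_lengths = [18, 120, 124, 124, 100]
print(intron_phases(example_lengths))  # [0, 0, 1, 2]
```

With real data, the first and last entries would need to count only the translated portion of the first and last coding exons.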
Southern blot analysis of human genomic DNA using the cDNA as probe was performed to determine the conservation and redundancy of the nucleolin gene (Fig. 3B). The restriction digest patterns obtained with BamHI, BglII, EcoRI, HindIII, and PstI agree with the sequence data from the HG3 clone (Fig. 2) and suggest that nucleolin is a single copy gene. In addition, we hybridized the nucleolin cDNA to EcoRI fragments of DNA from human, monkey, rat, mouse, canine, bovine, rabbit, chicken, and yeast (Fig. 3A). The autoradiographic pattern is different for each species, usually due to variation in intron sequences. However, the strong hybridization signal in all lanes except in canine and rabbit suggests that the nucleolin gene is well conserved.
control experiments when yeast tRNA was substituted for liver RNA (Fig. 4A).
To confirm the above data, S1 nuclease protection studies were performed using liver RNA and a 90-base oligonucleotide complementary to the region shown in Fig. 2. Again, a single band was identified which corresponds to a 5'-noncoding region on the mRNA of 112 base pairs (Fig. 4B). Although the site of initiation in many eukaryotic genes is frequently an adenine (35), the human nucleolin gene apparently initiates at cytosine.
The nucleolin gene is unusually rich in G and C residues in a region of about 750 bp upstream and downstream of the start of transcription. In this region, the CG dinucleotide content is approximately equal to the GC dinucleotide content, which is characteristic of other housekeeping genes (38). These areas have been called CpG islands (39). The GC-rich region has 11 GC boxes, which are potential Sp1 binding sites. The pyrimidine-rich region which is present in all the eukaryotic (40) and yeast (41) ribosomal genes is present in the human nucleolin gene throughout the mRNA leader sequence as well as at -30 and -198 from the transcription initiation site. To identify sequences that potentially control nucleolin gene expression, we utilized a computer program to locate regulatory elements in the 5'-flanking sequences and first intron of the human nucleolin gene (Fig. 5). There are no obvious TATA box or CCAAT consensus sequences in the human nucleolin gene in this area. However, the sequence GTTACTG at -49 may be used for transcriptional regulation, by analogy with the mouse nucleolin gene where the sequence GATTACTG was suggested to be a putative TATA box (42). The CCAAT element at -257 and its reverse complement at positions -127, -360, and -832 are located much further from the transcription initiation site than the usual -70 position. The former and latter sequences are recognized by CP1/CP2 and by CP1/CP2-RC (reverse complement) factors, respectively. We noted that the CCAAT element at -126 with the consensus TTGGCTNNNAGCCAA is recognized by nuclear factor I (NF-1/CTF) (43).
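The comparison of CG and GC dinucleotide content can be sketched as follows. This is a minimal illustration, not the computer program used in the study; the function names and the test sequences are hypothetical. The observed/expected CpG ratio shown here is a commonly used summary statistic for CpG islands and is included as an assumption about how such a region might be scored.

```python
def dinucleotide_counts(seq):
    """Return (CG count, GC count) for a DNA sequence on one strand."""
    seq = seq.upper()
    # Neither "CG" nor "GC" can overlap with itself, so str.count is exact.
    return seq.count("CG"), seq.count("GC")


def cpg_observed_expected(seq):
    """Return the observed/expected CpG ratio: CG * N / (C * G).

    Returns 0.0 when the sequence has no C or no G, so the ratio is
    always defined.
    """
    seq = seq.upper()
    c_count = seq.count("C")
    g_count = seq.count("G")
    if c_count == 0 or g_count == 0:
        return 0.0
    cg_count, _ = dinucleotide_counts(seq)
    return cg_count * len(seq) / (c_count * g_count)


# Toy sequences, not taken from the nucleolin gene.
print(dinucleotide_counts("CGCG"))    # (2, 1)
print(cpg_observed_expected("CGCG"))  # 2.0
```

In bulk vertebrate DNA the CG count is typically well below the GC count, so a ratio of CG to GC near 1 over several hundred base pairs is the pattern the text describes.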
These sequences in reverse orientation relative to the gene are followed by 9-nucleotide poly(A) tracts. A partial Alu repeat in intron 6 at position 5829 forms a direct repeat. Intron 6 also has a partial KpnI repeat at position 5497 having 78% homology to published sequences and a mouse L1Md-A2 repetitive element at position 5831 having 75% homology to published sequences (54).
Relation of Splice Junctions to Protein Domains-Some interesting features of the splice junctions in relation to the coding sequences have been derived from the data. The first [...] sequences encoding the first six amino acids. Information for the four acidic stretches containing aspartate and glutamate residues interspersed with serine and methionine and interrupted with basic amino acids is contained in exons 2 to 4. The nucleolar localization signal (Pro-Gly-Lys-Arg-Lys-Lys) which is involved in the transport of protein to the nucleus (55) [...] (56). The RNP2 sequences are wholly within the even exons (6, 8, 10, and 12), whereas RNP1 sequences are split into separate exons of almost equal length (Fig. 1B). For instance, RNA binding domain 1 is the result of exons 6 and 7 joining. The 3' end of exon 13 contains the information for two-thirds of the glycine-rich region and exon 14 encodes the remaining portion (Fig. 1B). This region might be involved

[Fig. 2. Nucleotide sequence of the human nucleolin gene; the sequence text is not recoverable.]

Fig. 3. A, EcoRI-treated genomic DNA (5 µg) from human placenta, rhesus monkey, Sprague-Dawley rat, BALB/c mouse, canine, bovine, and yeast (S. cerevisiae) was hybridized with the full length nucleolin cDNA. B, human placental genomic DNA (6 µg) was digested with several restriction enzymes (BamHI, BglII, EcoRI, HindIII, and PstI), hybridized with the above probe, and washed under stringent conditions (see "Materials and Methods"). The sizes (in kilobase pairs) of HindIII-treated lambda markers are shown on the left.

Chromosomal Localization of the Human Nucleolin Gene-DNA isolated from human-rodent somatic hybrid cell lines and their parental cells was analyzed for the presence or absence of the human nucleolin gene by Southern blot techniques. The human-hamster hybrids consisted of 28 primary clones and 14 subclones (10 of 42 positive) and the human-mouse hybrids represented 15 primary clones and 37 subclones (17 of 52 positive). EcoRI bands from human DNA of hybrid cells were readily resolved from most hamster and mouse cross-hybridizing bands. The nucleolin gene segregated concordantly with chromosome 2 and discordantly (~10%) with all other human chromosomes (Table II). These analyses permitted the unambiguous assignment of sequences hybridizing with the nucleolin cDNA probe to human chromosome 2. The gene was further regionally localized by examining results from Southern analysis of hybrids containing spontaneous breaks or well characterized translocations. Most informative were two hybrids isolated after fusing human parental cells (GM2658) containing a t(2;6)(q12;q15) reciprocal chromosome translocation with Chinese hamster fibroblasts (72). One hybrid retaining the 2q12-qter translocation chromosome in the absence of the reciprocal translocation chromosome or a normal human chromosome 2 also retained the human nucleolin gene, whereas another hybrid retaining only the reciprocal translocation chromosome did not contain the nucleolin gene. Examination of two additional human-hamster hybrids and one human-mouse hybrid containing spontaneous breaks involving chromosome 2 with loss of the short arm or long arm confirmed the finding that the nucleolin gene is located in the region 2q12-qter.
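The segregation analysis can be illustrated with a small sketch of the percent discordancy calculation: the number of hybrids in which the gene is present without the chromosome, or absent despite the chromosome, multiplied by 100 and divided by the total number of hybrids examined. The function name and the hybrid panel below are hypothetical and do not reproduce the data of Table II.

```python
def percent_discordancy(gene_present, chromosome_present):
    """Percent of hybrids in which gene and chromosome status disagree.

    Both arguments are equal-length sequences of booleans, one entry per
    hybrid cell line. A hybrid is discordant when the gene is detected
    without the chromosome (+/-) or the chromosome is present without
    the gene (-/+).
    """
    if len(gene_present) != len(chromosome_present):
        raise ValueError("both lists must describe the same hybrids")
    if not gene_present:
        raise ValueError("at least one hybrid is required")
    discordant = sum(
        1 for gene, chrom in zip(gene_present, chromosome_present)
        if gene != chrom
    )
    return 100.0 * discordant / len(gene_present)


# Hypothetical panel of four hybrids scored for the gene and two chromosomes.
gene = [True, True, False, False]
chromosome_a = [True, True, False, False]  # always agrees with the gene
chromosome_b = [True, False, False, True]  # disagrees in two hybrids
print(percent_discordancy(gene, chromosome_a))  # 0.0
print(percent_discordancy(gene, chromosome_b))  # 50.0
```

A gene is assigned to the chromosome with which it segregates concordantly, while every other chromosome shows a clearly nonzero discordancy.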
Fig. 4. A, two 32P-labeled oligonucleotide primers (primers 1 and 2, as described under "Materials and Methods"), complementary to the 5'-untranslated end of the human nucleolin mRNA, were hybridized to 50 µg of total RNA from human liver or to tRNA as a control. The products resulting from extension with avian myeloblastosis virus reverse transcriptase were electrophoresed adjacent to a DNA sequencing ladder also using primer 1 with a subclone of the human nucleolin gene. B, a single-stranded 32P-labeled primer (primer 3, Fig. 2) was hybridized to human liver RNA or to tRNA as a control. Noncomplementary single-stranded nucleic acids were digested with S1 nuclease, and the protected DNA fragment (indicated by an arrow) was detected as described above.
0.17:0.83 (95 individuals). Invariant bands of 0.8, 2.1, 4.7, 7.7, and 25 kb were also present. Using a subfragment of the nucleolin cDNA, the polymorphic EcoRI site was determined to be the most 3' EcoRI site within intron 11 of the gene (Fig. 1). No other restriction fragment length polymorphisms were detected.

DISCUSSION
In this report, we characterize the genomic structure of the human nucleolin gene. This highly conserved gene has 14 exons and a structure typical of a eukaryotic protein-coding gene. We also describe the relationship of splice junctions to three different types of amino acid domains found in nucleolin. While the acidic and glycine-rich regions have splice junctions interrupting domains at irregular positions, each of the four RNA binding domains is split into separate exons at identical places. In addition, several putative regulatory elements have been identified in the 5'-flanking sequences and the first intron. The unique transcription initiation site has been mapped by primer extension analysis and S1 nuclease protection assay. Finally, the nucleolin gene is single copy.
Our results with the human nucleolin gene correlate well with the data previously reported on this gene from the mouse (42). Both contain 14 exons and 13 introns interrupting the coding sequences at identical positions. However, the sizes and sequences of the introns have little or no similarity, except for 300 bp in the 5' end of the first intron. The locations of the splice junctions of the RNP1 consensus sequences are conserved at the 3' ends of the even exons (6, 8, 10, and 12) and at the 5' ends of the odd exons (7, 9, 11, and 13). In addition, the RNP2 consensus sequences, which are also part of the RNA binding domains, occur at specific locations within the even exons. On the other hand, positions of the splice junctions for exons 2-4 vary in relation to the four acidic domains, even though these domains and the associated splice junctions are at similar places in both genes. The glycine-rich region at the carboxyl-terminal end of both nucleolin proteins is encoded by information from two adjoining exons.
Several features of the human nucleolin mRNA differ from previously reported mRNAs of other species. Primer extension and S1 nuclease protection analysis using total human [...] sequences are not conserved as expected, the region in the 100-bp region showed five similar blocks in both human and rodents (Fig. 2). These conserved areas could be important in regulating nucleolin mRNA levels.
In order to begin addressing questions about the regulation of nucleolin transcription, we examined the 5'-flanking sequences for promoter elements and transcription factor binding sites (Sp1, GCF, AP-1, AP-2, CP1/CP2, NF-1, and CACCC) and for regions of unusual DNA structures.
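A search of this kind can be sketched as a scan for a motif and its reverse complement, with hits reported relative to the transcription start site. This is only an illustration under stated assumptions: it is not the program used in the study, the helper names are hypothetical, the toy sequence is not the nucleolin promoter, and the coordinate convention (start site as +1, no position 0) is assumed.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an unambiguous DNA sequence."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def find_motif(seq, motif):
    """Return 0-based start indices of every (possibly overlapping) match."""
    seq, motif = seq.upper(), motif.upper()
    positions = []
    start = seq.find(motif)
    while start != -1:
        positions.append(start)
        start = seq.find(motif, start + 1)
    return positions


def to_relative(index, tss_index):
    """Convert a 0-based index to a coordinate relative to the start site.

    The base at tss_index is +1 and the base just upstream is -1.
    """
    offset = index - tss_index
    return offset + 1 if offset >= 0 else offset


# Toy 5'-flanking sequence; the start site is assumed to follow its last base.
flank = "GGATTGGCTCCAATGG"
tss_index = len(flank)

print(reverse_complement("CCAAT"))                     # ATTGG
print(find_motif(flank, "CCAAT"))                      # [9]
print(find_motif(flank, reverse_complement("CCAAT")))  # [2]
print(to_relative(9, tss_index))                       # -7
```

Degenerate consensus sites such as TTGGCTNNNAGCCAA would need pattern matching with wildcards, which this exact-match sketch does not attempt.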
The most conserved regions among human, mouse, hamster, and rat occur between 230 bp upstream from the transcription initiation site and 335 bp downstream in the first intron (58), except for a short region of 50 bp upstream from the translation initiation codon. It is interesting to note that typical CCAAT or TATA consensus promoter sequences are not found at their usual positions of -75 and -25, respectively.
In addition to the nucleolin gene, several cellular growth control or housekeeping genes also lack traditional CCAAT or TATA consensus sequences (38). The human gene possibly has a CCAAT sequence at -127 and a TATA-related sequence GTTACTG at -49, both of which are also conserved in rodents. The CCAAT consensus sequence observed in the nucleolin gene could be the target site for numerous related factors that affect transcriptional efficiency (43, 59).

Table II. Full length or portions of the human nucleolin cDNA were hybridized to Southern blots containing EcoRI-digested DNA from human-rodent hybrid cell lines. Detection of the gene is correlated with the presence or absence of each human chromosome in the entire group of hybrids. Discordancy represents presence of the gene in the absence of the chromosome (+/-) or absence of the gene despite the presence of the chromosome (-/+), and the sum of these numbers (× 100) divided by total hybrids examined represents percent discordancy. Nucleolin cDNA mapped to human chromosome 2.
Columns: Chromosome; Gene/chromosome (+/+, +/-, -/+, -/-); % discordancy.
GC-rich sequences play a critical role in controlling the expression of various genes, including housekeeping genes and cellular oncogenes (60-62). Regions with high GC content can exhibit transcriptional enhancer activity when activating factors such as Sp1 and ETF bind (63, 64) and repressor activity when the inhibitory factor GCF binds (44). The level of transcription of the human nucleolin gene could be regulated by the cell, depending on the positive or negative factors present in different cell types, since both Sp1 and GCF elements are present in almost equal numbers around the transcription initiation site. Although most Sp1 and GCF elements do not appear at the same positions in the known mammalian nucleolin genes, the Sp1 binding sites at positions -160 and -172 are conserved between human and rodents and thus may relate to a common feature of nucleolin gene regulation.
Nucleolin expression and rRNA synthesis are stimulated when cells are induced to proliferate by basic fibroblast growth factor (65) and, conversely, are decreased upon entry into the stationary phase (66). Recently, it was shown that the levels of both nucleolin pre-mRNA and mRNA were reduced 3-fold and that the synthesis of rRNA was strongly inhibited in dexamethasone-arrested P1798 murine lymphosarcoma cells (67). Also, Suzuki et al. (68) showed that administration of androgens to castrated rats resulted in an increase in phosphorylation of nucleolin as well as in RNA polymerase I activity. It will be of interest to determine which control elements and trans-activating factors are involved in the regulation of nucleolin gene expression after treatment with specific hormones or growth factors.
Although the RNA binding and glycine-rich domains in nucleolin are highly conserved, the position of phosphorylation sites and the number of acidic amino acids in the fourth acidic segment are variable between species. Recently, the amino-terminal domain, which contains the acidic regions, was shown to be responsible for modulation of nucleic acid binding activity, the self-aggregation of protein molecules, and possibly the interaction with other molecules (69). It has been proposed that acidic activation domains may facilitate transcription initiation by interacting with a general component of the initiation complex such as the TATA binding TFIID or possibly polymerase II (70). For instance, the variable, acidic domains of nucleolin may be conferring species-specific promoter recognition by stabilizing the complex with UCE binding factors UBF1 (sequence-specific DNA-binding factor) and SL1 (confers species-specific promoter selectivity on RNA polymerase I) on promoters in rDNA (71).
Finally, a 30-bp potential Z-DNA structure in the rodent nucleolin genes is absent in the human one. However, sequences (20 bp) flanking the potential Z-DNA on both sides are present in all of the known nucleolin genes, suggesting that the Z-DNA may be involved in species-specific transcription, and that the highly conserved 40-bp segment is important for general control functions of the gene.
Our studies have defined the genomic structure and chromosomal location of the human nucleolin gene. Nucleolin plays a pivotal role in ribosome biogenesis and cell proliferation, and hence understanding the regulation of nucleolin may yield insights into cell division, embryogenesis, and viral infection.