The Mouse Gene Coding for High Mobility Group 1 Protein (HMG1)*

We have isolated an active gene encoding the mouse HMG1 protein among a multitude of cross-hybridizing sequences, which most likely are retrotransposed pseudogenes. The hmg1 gene contains five exons, of which the first is not translated, and the last contains a long 3'-untranslated sequence and three alternative polyadenylation sites. We found no evidence for a sequence encoding a membrane localization signal in the hmg1 gene, despite the presence of HMG1 protein on the surface of several cell types. The hmg1 promoter coincides with a CpG island, contains no TATA sequence, and drives the expression of reporter genes placed under its control. The hmg1 gene may be a member of a family of closely related genes but appears to be the major or the only active gene coding for HMG1 protein.

High mobility group 1 protein (HMG1)¹ is a very abundant and highly conserved chromatin protein, which is present in all vertebrate nuclei. HMG1-like proteins exist in invertebrates, yeast, protozoa, and plants and are probably present in all eukaryotic cells (reviewed in Refs. 1 and 2). The ubiquity and sequence conservation of HMG1-like proteins suggest that they play fundamental functions; roles in DNA replication, chromatin assembly, and transcription have been proposed but so far have not been proved unequivocally. In vitro, mammalian HMG1 and its two DNA-binding domains bind with low affinity and no specificity to single-stranded, linear duplex, and supercoiled DNA (3, 4). They also bind with high specificity and in a sequence-independent manner to DNA containing sharp bends or kinks (5-7). More generally, HMG1 has the ability to introduce bends or kinks into linear DNA and therefore is functionally (but not structurally) similar to the prokaryotic proteins HU and IHF (reviewed in Ref. 8). The main role of HMG1 may be to facilitate the formation of specific nucleoprotein complexes (9) and perhaps to modulate the structure of chromatin (10).
Finally, HMG1 has been shown to be present on the surface of neurons and other cell types (11), where it is probably bound to the polysaccharide moiety of proteoglycans and may play roles in adhesion and tissue remodeling. Several cell types also display on their surface other proteins that differ from HMG1 only by a few amino acids (12). Thus, the hmg1 gene may be a member of a gene family.
* This work was supported by Telethon Grant A.07 and funds from Istituto Superiore di Sanità, Progetto AIDS 1994 (to M. E. B.). The costs of publication of this article were defrayed in part by the payment of page charges. This article must therefore be hereby marked "advertisement" in accordance with 18 U.S.C. Section 1734 solely to indicate this fact.
¹ The abbreviations used are: HMG, high mobility group; HMG1-R, HMG1-related; kb, kilobase(s); nt, nucleotide(s); PCR, polymerase chain reaction; RSV, Rous sarcoma virus.
Despite this wealth of biochemical information, the analysis of the physiological role of HMG1 has been impaired by the lack of mutations in the gene coding for it. In fact, whereas HMG1 cDNAs have been cloned from a variety of sources (1), the genomic locus coding for HMG1 had not been identified so far. In this paper we describe the identification and the organization of the mouse hmg1 gene.

MATERIALS AND METHODS
Oligonucleotides and Enzymes-All oligonucleotides were purchased from Genset. DNA modification and restriction enzymes were from Boehringer Mannheim, Promega, and New England Biolabs.
PCR-Intron 3 and intron 4 of the hmg1 gene were obtained by PCR on genomic DNA from mouse NIH3T3 cells, using the following oligonucleotides: INT3for (coding strand), 5'-ACCCAAGAGGCCTCCG-3'; INT3rev [...]
Isolation of Genomic hmg1 Clones-10⁶ phage plaques of the 129SV mouse genomic library in the λ-FIXII vector (5 × 10⁵ primary recombinant phages, insert size of 9-22 kb, provided by Stratagene) were screened initially with the 3'-untranslated sequence of the mouse HMG1 cDNA, obtained by PCR and labeled by random priming. The same number of plaques was later screened with two probes obtained from PCR products INT3 and INT4. Fragments from the positive clones were subcloned into Bluescript KS(+) and sequenced with T7 DNA polymerase (Pharmacia Biotech Inc.).
Plasmid Construction-To obtain the pHMG1-neo plasmid, the 4.5-kb NsiI-NsiI fragment (containing a 2-kb region upstream of the transcription start site, exon 1, intron 1, and a part of exon 2) and a 2-kb XhoI-XhoI fragment (containing the aph gene from the Tn5 transposon) were cloned into the pBlueScript KS(+) vector. This plasmid expresses a chimeric protein in which the first 2 residues of the bacterial aph gene product are substituted by 15 amino acids from HMG1 protein, followed by 11 amino acids coded by polylinker sequences. The pΔBamHI-neo and the pΔEcoRI-neo constructs were obtained from pHMG1-neo by internal deletion of a 1.5-kb BamHI-BamHI fragment and a 3-kb EcoRI-EcoRI fragment, respectively.
Cell Culture and Transfection-The mouse NIH3T3 cell line was grown in high glucose Dulbecco's modified Eagle's medium, supplemented with 10% newborn calf serum and antibiotics. Cells were transfected by calcium phosphate co-precipitation in 6-cm dishes as indicated in the legend to Fig. 5.
RNA Extraction, Primer Extension, and RNase Protection-Total RNA was prepared as described by Chomczynski and Sacchi (13). For primer extension, the 32P-labeled oligonucleotide MnlI (5'-GCCTCTCGGCTTCTTAG-3') was hybridized to 20 µg of total RNA extracted from NIH3T3 cells in 10 µl of hybridization buffer containing 10 mM Tris-Cl, pH 7.5, 2 mM EDTA, and 60 mM NaCl. [...] by the addition of 1 µl of 0.5 M EDTA, reverse transcripts were ethanol-precipitated and run on a 6.5% sequencing gel. For RNase protection, total RNAs extracted from NIH3T3 cells 48 h after transfection were treated with RNase-free DNase (Boehringer Mannheim) and analyzed with a ribonuclease protection assay kit (Ambion). Approximately 10⁵ cpm of riboprobe (obtained by in vitro transcription of a 1.3-kb NotI-PstI fragment derived from the pHMG1-neo plasmid) was incubated overnight at 45 °C with 10 µg of total RNA. Following RNase treatment, protected fragments were separated on a 6.5% sequencing gel.

RESULTS AND DISCUSSION
The Mouse Genome Contains a Large Number of Sequences Similar to the HMG1 cDNA-The presence of several sequences with homology to the HMG1 cDNA has been previously reported in human and pig genomes (14, 15). Southern blot analysis using as probes different regions of the rat HMG1 cDNA (HMG1 box A and the 3'-untranslated region) confirmed the presence of many HMG1-related sequences in mouse, rat, and hamster (Fig. 1 and data not shown). We reasoned that hmg1 and related genes might be distinguished from processed retropseudogenes by the presence of introns. We performed polymerase chain reactions on total mouse genomic DNA using as primers several pairs of oligonucleotides corresponding to the mouse cDNA sequence (16), but despite extensive variation of the reaction conditions the PCR products were always identical to those obtained using the cDNA clone as template (data not shown). We concluded that the mouse genome must contain a large number of processed retropseudogenes that compete with the putative intron-containing gene for amplification or that the presumptive introns are too long for efficient amplification.
We then screened a phage bank of mouse genomic DNA for plaques hybridizing to the mouse cDNA sequence and repeated the PCR reactions with several pairs of primers on the material picked from the positive plaques. We obtained 240 positive plaques from 4 genome equivalents, but none of them yielded PCR products longer than the cDNA controls. We continued the analysis only on 15 plaques that contained both ends of the cDNA sequence and gave no PCR products on amplification. We obtained partial sequences from 9 phages; all were highly similar to the cDNA sequence but contained no intron. Seven of them cannot code for HMG1 protein because they contain premature stop codons or frameshifts. Phage λ-154 contains no stop codon or frameshift, but its conceptual translation differs from HMG1. Phage λ-227 contains a sequence potentially coding for HMG1, but we did not verify whether this particular clone corresponded to the hmg1 gene because an alternative strategy had yielded a more promising candidate (see below).
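The screen for premature stop codons and frameshifts described above can be illustrated with a minimal sketch. It assumes the standard genetic code and reading frame 0; the function names and the toy sequences are invented for illustration and are not HMG1 sequences or the procedure actually used in this work.

```python
# Standard genetic code, built from the conventional T, C, A, G ordering.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def translate(dna):
    """Translate a DNA string in frame 0; '*' marks a stop, 'X' an unknown codon."""
    dna = dna.upper()
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def first_premature_stop(dna):
    """Return the 0-based codon index of the first internal stop codon, or None."""
    protein = translate(dna)
    # A stop at the very end is the normal terminator, not a premature stop.
    body = protein[:-1] if protein.endswith("*") else protein
    pos = body.find("*")
    return None if pos == -1 else pos


# Toy sequences (hypothetical, for illustration only).
intact = "ATGGGCAAAGGATAA"            # reads M G K G stop
nonsense = "ATGTAAGGCAAA"             # stop codon at the second codon
frameshift = intact[:4] + intact[5:]  # single-base deletion shifts the frame

print(translate(intact), first_premature_stop(intact))      # MGKG* None
print(translate(nonsense), first_premature_stop(nonsense))  # M*GK 1
print(translate(frameshift))                                # MAKD
```

A frameshift does not always introduce a stop immediately, which is why the conceptual translation has to be compared with the expected protein as well as checked for stops.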
Isolation and Characterization of the hmg1 Gene-Recently Shirakawa and Yoshida (17) reported the cloning of the functional human gene coding for HMG2 protein, which is closely related to HMG1. We supposed that genes that belong to the same family and that show high sequence similarity might have a conserved intron-exon organization, as is the case for the globin family. We then decided to test whether the presumptive hmg1 gene contained introns in the same positions as the hmg2 gene.
We designed two pairs of oligonucleotides with sequences corresponding to the HMG1 cDNA sequence flanking the presumptive intron 3 and intron 4 and ending at their 3' terminus with one or two bases corresponding to the canonical splicing junction 5'-GT and 3'-CT. PCR reactions on genomic mouse DNA yielded two fragments of about 350 and 850 base pairs, which we named INT3 and INT4, respectively. Control PCRs performed on the cDNA clone gave no amplified band. Direct sequencing of the 350-base pair fragment showed that it had features characteristic of introns (AT-rich sequence, lack of any ORF, and presence of a 3'-splicing consensus site). Finally, a PCR reaction performed with the primers 5' to the presumptive intron 3 and 3' to the presumptive intron 4 gave an amplified band of 1.4 kb, indicating that INT3 and INT4 are colinear on the mouse genome.
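The primer design described above can be sketched as follows. The sketch assumes the canonical GT donor and AG acceptor dinucleotides; the function names and the exon sequences are hypothetical and are not the primers used in this work. It shows why the reverse primer ends in CT at its 3' terminus: CT is the reverse complement of the AG that closes the intron.

```python
def reverse_complement(seq):
    """Reverse complement of an unambiguous DNA string."""
    comp = {"A": "T", "C": "G", "G": "C", "T": "A"}
    return "".join(comp[b] for b in reversed(seq.upper()))


def junction_primers(upstream_exon_end, downstream_exon_start, n_intron=2):
    """Build a primer pair whose 3' ends reach into the intron.

    forward: end of the upstream exon plus the first n_intron bases of
             the donor site (GT).
    reverse: reverse complement of the last n_intron bases of the
             acceptor site (AG) plus the start of the downstream exon.
    """
    donor, acceptor = "GT", "AG"
    forward = upstream_exon_end.upper() + donor[:n_intron]
    reverse = reverse_complement(
        acceptor[len(acceptor) - n_intron:] + downstream_exon_start.upper()
    )
    return forward, reverse


# Hypothetical exon flanks, for illustration only.
fwd, rev = junction_primers("ACGTAC", "TTGCAA")
print(fwd)  # ACGTACGT  (3' end is GT)
print(rev)  # TTGCAACT  (3' end is CT)
```

Because the last one or two bases of each primer match only intron sequence, the primers are expected to extend poorly on the cDNA and on intronless retropseudogenes, which is consistent with the absence of a band in the cDNA control.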
Evidence that INT3 and INT4 fragments were represented by a single-copy sequence in the mouse genome was provided by Southern blot analysis of mouse genomic DNA, shown in Fig. 1. We then isolated from the phage mouse genomic library four positive clones (HMG1-Φ1 to HMG1-Φ4), which hybridized to both INT3 and INT4 probes and had overlapping restriction maps (Fig. 3). HindIII and NotI fragments from phages HMG1-Φ1 and HMG1-Φ4 were subcloned in plasmid pBlueScript KS(+). Fig. 2 shows the sequence we obtained, which includes the entire 5'-untranslated region and the coding region of the mouse HMG1 cDNA (16). The cDNA sequence contains four base substitutions, one of which causes the conservative replacement of a glutamic acid for an aspartic acid residue in the acidic tail of HMG1 protein. These sequence divergences most probably represent genetic polymorphisms between inbred mouse strains; the cDNA was derived from P19 cells (corresponding to the C3H strain), while our gene was isolated from a bank made with the DNA of SV129 mice.
Structure of the Mouse hmg1 Gene-The hmg1 gene contains five exons, as indicated by the comparison of the cDNA and genomic sequences (Fig. 3). An untranslated first exon falls in a region of very high C and G content with the features of a CpG island. Exon 2, which is located about 2.5 kb 3' to exon 1, contains the translation start site. The DNA binding domain A is encoded by exon 2 and exon 3, and the DNA binding domain B is encoded by exons 3, 4, and 5. The relative positions of the introns within the segments coding for the two HMG boxes are different, which is somewhat unexpected if one supposes that the vertebrate hmg1 gene arose by internal duplication of an ancestral gene containing a single HMG box, similar to the modern gene for HMG1-like proteins in lower eukaryotes, plants, and insects (18-22). The terminal acidic tail is encoded by exon 5, which is the longest and contains a long untranslated region and multiple polyadenylation sites.
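The statement that exon 1 lies in a CpG island can be made concrete with a short sketch. The text does not state which criteria were applied; the thresholds below (length of at least 200 bp, G + C content above 50%, observed/expected CpG ratio above 0.6) are the commonly used ones and are an assumption here, as are the function names and the toy sequences.

```python
def cpg_stats(seq):
    """Return (G + C fraction, observed/expected CpG ratio) for a DNA string."""
    seq = seq.upper()
    n = len(seq)
    c = seq.count("C")
    g = seq.count("G")
    cpg = seq.count("CG")  # observed CpG dinucleotides
    gc_fraction = (c + g) / n if n else 0.0
    expected = c * g / n if n else 0.0  # expected CpG count from base composition
    obs_exp = cpg / expected if expected else 0.0
    return gc_fraction, obs_exp


def looks_like_cpg_island(seq, min_len=200, min_gc=0.5, min_obs_exp=0.6):
    """Apply the assumed length, G + C, and observed/expected thresholds."""
    gc_fraction, obs_exp = cpg_stats(seq)
    return len(seq) >= min_len and gc_fraction > min_gc and obs_exp > min_obs_exp


# Toy sequences (hypothetical, for illustration only).
print(cpg_stats("CG" * 100))              # (1.0, 2.0)
print(looks_like_cpg_island("CG" * 100))  # True
print(looks_like_cpg_island("AT" * 100))  # False
```

In practice the statistics would be computed in a sliding window along the genomic sequence rather than on one fixed segment.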
In order to map the transcription start site of the hmg1 gene, we performed a primer extension on total RNA extracted from NIH3T3 cells (Fig. 4), using an oligonucleotide that maps immediately downstream of the translation start site in exon 2. One major extended product and two minor ones were obtained. Two of them are not far upstream of the 5' terminus of the longest known cDNA for mouse HMG1 (16); the third, shorter extended product may correspond to a weak start site or to a strong pause site for reverse transcriptase.
We found no evidence of alternatively spliced variants of the HMG1 mRNAs, and we found no sequence in the genomic locus that could code for membrane-targeting signals. Thus, the presence of the HMG1 protein outside of the cell membrane in several cell types probably does not depend on classical protein secretion routes.
The Promoter of the hmg1 Gene-To qualify as the authentic hmg1 gene, the genomic fragment must contain a region in cis capable of driving its transcription. We constructed a plasmid, pHMG1-neo, in which a fragment encompassing 2 kb upstream of the transcription start sites, exon 1, the entire intron 1, and part of exon 2 was fused in frame to the prokaryotic aminoglycoside phosphotransferase (aph) gene. [...] introduced in NIH3T3 cells, and G418-resistant clones were selected (Fig. 5). Plasmids pRSVneo and pSV2neo, in which the aph gene is under the control of the Rous sarcoma virus LTR or the SV40 promoter, respectively, were used as positive controls. In several independent experiments, transfection with the pHMG1-neo plasmid gave rise to a large number of resistant clones, almost comparable with the number of clones arising after transfection with the positive controls. Transfection with the bacterial aph gene with no eukaryotic sequences in cis, or with promoterless derivatives of the pHMG1-neo plasmid, gave rise to no resistant clones or to very few. Northern blot analysis of total RNA extracted from three clones stably transfected with pHMG1-neo and three clones stably transfected with pRSVneo confirmed that transcripts of the expected length and containing the aph sequence were indeed present (Fig. 5B and results not shown).
In order to estimate the strength of the hmg1 promoter, we compared by RNase protection the relative abundance of the transcripts produced under its control with the abundance of those produced under the control of the RSV long terminal repeat (Fig. 6). Total RNAs were isolated from NIH3T3 cells 48 h after transfection. Transcripts from the endogenous hmg1 genes give rise to a 45-nt band (not shown in Fig. 6), the RSVneo mRNA gives rise to a 177-nt band, and the HMG1-neo chimeric mRNA gives rise to a 284-nt band. The intensities of the signals from a cotransfection (lane 1) point to a high level of expression of the chimeric mRNA and indirectly to a high activity of the hmg1 promoter.
The region immediately upstream of the proposed transcription start sites does not contain any sequence conforming to the consensus TATAA; this promoter is therefore probably TATA-less, as is often the case for promoters of housekeeping genes (23). However, it does contain several CAAT boxes, which might promote transcription by binding any one of several factor types (24).
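The search for TATAA and CAAT elements described above amounts to scanning both strands of the upstream region for short consensus motifs. The following is a minimal sketch under stated assumptions: exact matching only, the CCAAT core taken as the motif for the CAAT box, and a made-up toy sequence rather than the actual promoter; the function names are hypothetical.

```python
def reverse_complement(seq):
    """Reverse complement of a DNA string (A, C, G, T, N only)."""
    comp = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}
    return "".join(comp[b] for b in reversed(seq.upper()))


def find_motif(seq, motif):
    """Return 0-based start positions where motif occurs on either strand.

    Positions are given on the input (top) strand; a hit on the bottom
    strand is detected as the reverse complement of the motif.
    """
    seq = seq.upper()
    motif = motif.upper()
    targets = {motif, reverse_complement(motif)}
    width = len(motif)
    return [
        i for i in range(len(seq) - width + 1) if seq[i:i + width] in targets
    ]


# Toy upstream region (hypothetical, for illustration only).
upstream = "GGCCAATGGCGCGATTGGCC"
print(find_motif(upstream, "TATAA"))  # []
print(find_motif(upstream, "CCAAT"))  # [2, 13]
```

Exact matching is the simplest choice; a position weight matrix would tolerate degenerate sites but needs a scoring threshold that is not given in the text.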
Since its promoter can direct the expression of reporter genes placed under its control, we have no doubt that the gene we have isolated is active and encodes the HMG1 protein. The hmg1 gene is expressed at high but not identical levels in all tissues of the mouse embryo at day 10.5 post coitum; likewise, we and others found in all tissues that were investigated the typical set of three HMG1 mRNAs due to differential usage of three alternative polyadenylation signals (14, 25). Being quite active and compact, the hmg1 promoter should be useful to direct the ubiquitous expression of transgenes.
FIG. 5. [...] Plasmid pHMG1-neo contains the aph gene driven by the NsiI-NsiI promoter fragment; plasmids pΔBamHI-neo and pΔEcoRI-neo are derived from pHMG1-neo by deletion of a 1.5-kb BamHI fragment and a 3-kb EcoRI fragment, respectively. pRSV-neo and pSV2-neo contain the aph gene driven by the RSV and SV40 promoters. B, 10 µg of total RNA derived from stably transfected clones was separated in a 1.2% agarose/formaldehyde gel, transferred to a Genescreen Plus membrane, and probed with the 2-kb XhoI-XhoI fragment containing the aph gene. The migration of ribosomal RNAs is indicated. Lane 1, untransfected NIH3T3 cell line; lane 2, NIH3T3 clone stably transfected with pRSV-neo; lanes 3 and 4, two independent NIH3T3 clones stably transfected with pHMG1-neo.
How Many hmg1 Genes Are Present?-Although the data reported in the preceding paragraphs indicate that we have isolated a bona fide hmg1 gene, additional hmg1-related genes may exist, especially if they are intronless like the genes coding for SRY and SOX HMG box proteins (26). We did not find direct evidence for additional active genes; the fragments amplified by PCR were all colinear and belonged to the same hmg1 gene we isolated. Additional evidence, however, may be gathered from the examination of the sequences of hmg1 pseudogenes. If all the pseudogenes are derived by retrotranscription from the single hmg1 gene we isolated, they should all originally have had the same sequence, identical to the exons of the gene. Later, they should have accumulated base changes and additions or deletions, but these should have occurred independently in the various pseudogenes. In other words, the different pseudogenes are not expected to contain the same mutations.
We aligned the HMG1 cDNA to the intronless HMG1-related sequences we had partially characterized during the quest for the gene. Starting from the translation start site and up to the codon corresponding to amino acid 65 of HMG1, the degree of similarity was between 90 and 99.5%. However, several mutations were identical in a few of the HMG1-related sequences. In particular, a group of five sequences contained the same pattern of 10 deviations from the cDNA sequence (Fig. 7A); this finding rules out an independent origin for each one in this group of sequences.
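The comparison described above, looking for deviations from the cDNA that recur in several related sequences, can be sketched as follows. The sketch assumes the sequences are already aligned to the reference and have equal length, and treats a dash as an undetermined position; the function names, sequence names, and toy data are hypothetical.

```python
from collections import defaultdict


def deviations(reference, sequence):
    """Set of (position, base) pairs where an aligned sequence differs from the reference."""
    return {
        (i, b)
        for i, (a, b) in enumerate(zip(reference, sequence))
        if a != b and a != "-" and b != "-"
    }


def shared_deviations(reference, aligned):
    """Map each deviation found in two or more sequences to the names carrying it."""
    carriers = defaultdict(set)
    for name, seq in aligned.items():
        for dev in deviations(reference, seq):
            carriers[dev].add(name)
    return {dev: names for dev, names in carriers.items() if len(names) >= 2}


# Toy alignment (hypothetical, for illustration only).
cdna = "ATGGGCAAAG"
related = {
    "R1": "ATGGTCAAAG",
    "R2": "ATGGTCAAAA",
    "R3": "ATGGGCAAAG",
    "R4": "AT-GTCACAG",
}
print(shared_deviations(cdna, related))
```

In this toy example the G-to-T change at position 4 is carried by R1, R2, and R4, whereas the other changes are private to one sequence. A recurring pattern of many such shared changes is hard to explain by independent mutation of copies of one source sequence.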
We applied to the HMG1 cDNA and to the nine related sequences a maximum parsimony algorithm, which tries to arrange nucleotide sequences in a topology that minimizes the number of substitutions from an ancestral sequence. It is clear that the HMG1 cDNA sequence could not correspond to the sequence from which the other ones were derived independently (an example of the trees generated is shown in Fig. 7B). By default, we conclude that some or all of the HMG1-related sequences may derive from a gene distinct from hmg1, which we code-name hmg-Z (Fig. 7C). hmg-Z cannot code for the HMG1-related cell surface proteins; it is nonetheless much more closely related to the hmg1 than to the hmg2 gene, which diverged from a common ancestor at least 200 million years ago (27). Northern analysis of total RNA from adult or embryonic mouse tissues gives no evidence of additional mRNA species beyond those deriving from the hmg1 gene, and the selection of HMG1-related clones from cDNA libraries consistently leads to the identification of sequences deriving from the hmg1 gene. This implies that the expression of the hmg-Z gene and/or of the genes coding for cell surface proteins very closely related to HMG1 may be restricted to limited districts or times or may be totally absent nowadays.
FIG. 7. [...] translation of the codons with non-silent substitutions with respect to the HMG1 cDNA is given here; most of the HMG1-R sequences contain frameshifts too. The nucleotide sequences are deposited at the EMBL data library with accession numbers X80459-X80467, and their similarity to the cDNA ranges between 90 and 99.5%. Z represents a stop codon; a dash indicates that the sequence corresponding to the indicated position was not determined. A group of five sequences contains the same pattern of mutations. B, the maximum parsimony algorithm TREECOM was applied to the HMG1 cDNA and HMG1-R sequences.
The predicted evolutionary tree shows that the HMG1 cDNA does not correspond to the sequence from which the HMG1-R sequences were derived, with the exception of HMG1-R-227. C, a model for the evolution of HMG1-R pseudogenes. The hmg1 and hmg2 genes probably were derived from a common ancestral gene, hmg-X, as indicated by the high sequence similarity and the identical exon-intron organization. We suppose that another gene, hmg-Z, itself derived from hmg1 or directly from hmg-X, might have originated some of the HMG1-R sequences.
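The parsimony criterion invoked for Fig. 7B, preferring the topology that requires the fewest substitutions, can be illustrated with Fitch's small-parsimony count on a fixed rooted binary tree. This is a minimal sketch and not the program used in this work; the trees (nested tuples of leaf names), the sequence names, and the toy data are hypothetical.

```python
def fitch_site(tree, states):
    """Return (candidate ancestral states, substitution count) for one alignment column.

    tree is either a leaf name (str) or a pair (left, right) of subtrees.
    """
    if isinstance(tree, str):
        return {states[tree]}, 0
    left, right = tree
    left_set, left_cost = fitch_site(left, states)
    right_set, right_cost = fitch_site(right, states)
    common = left_set & right_set
    if common:
        # The children agree on at least one state: no extra substitution.
        return common, left_cost + right_cost
    # The children disagree: one substitution is needed on this branch pair.
    return left_set | right_set, left_cost + right_cost + 1


def parsimony_score(tree, seqs):
    """Minimum number of substitutions needed to explain aligned seqs on tree."""
    length = len(next(iter(seqs.values())))
    return sum(
        fitch_site(tree, {name: s[i] for name, s in seqs.items()})[1]
        for i in range(length)
    )


# Toy data (hypothetical): R1 and R2 share a change that the cDNA lacks.
seqs = {"cDNA": "AAAA", "R1": "ATAA", "R2": "ATAA", "R3": "AAAG"}
grouped = (("R1", "R2"), ("cDNA", "R3"))    # R1 and R2 share an ancestor
separate = (("R1", "cDNA"), ("R2", "R3"))   # R1 and R2 arise separately

print(parsimony_score(grouped, seqs))   # 2
print(parsimony_score(separate, seqs))  # 3
```

The tree that groups the sequences sharing a mutation needs fewer substitutions, which is the same logic that places the group of five HMG1-R sequences on a common branch away from the cDNA. A full analysis would also search over topologies, which this sketch does not attempt.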