Molecular Cloning and Analysis of the Mouse Homologue of the Tumor-associated Mucin, MUC1, Reveals Conservation of Potential O-Glycosylation Sites, Transmembrane, and Cytoplasmic Domains and a Loss of Minisatellite-like Polymorphism*

We present here the full-length cDNA sequence and genomic structure of the mouse homologue of the tumor-associated mucin, MUC1. This mucin (previously called polymorphic epithelial mucin) is present at the apical surface of most glandular epithelial cells. The mouse gene, Muc-1, encodes an integral membrane protein with 40% of its coding capacity made up of serine, threonine, and proline, a composition typical of a highly O-glycosylated protein. The mucin core protein consists of an amino-terminal signal sequence, a tandem repeat domain encoding 16 repeats of 20-21 amino acids, and unique sequence containing transmembrane and cytoplasmic domains. Homology with the human protein is only 34% in the tandem repeat domain, mainly showing conservation of serines and threonines, presumed sites of O-linked carbohydrate attachment. Homology rises to 87% in the transmembrane and cytoplasmic domains, suggesting that these regions may be functionally important. The pattern of expression of the mouse mucin is very similar to that of its human counterpart, and accordingly the two promoter regions share high homology, 74%, although previously identified potential hormone-responsive elements are not conserved. Interestingly, the mouse homologue, unlike its human counterpart, does not exhibit a variable number tandem repeat polymorphism. We present evidence that suggests that the mouse gene was at one time polymorphic but has mutated away from this state.
* The costs of publication of this article were defrayed in part by the payment of page charges. This article must therefore be hereby marked "advertisement" in accordance with 18 U.S.C. Section 1734 solely to indicate this fact.
The nucleotide sequence(s) reported in this paper has been submitted to the GenBank/EMBL Data Bank with accession number(s) M64928 and M65132.
To whom reprint requests should be addressed.
High molecular weight mucin glycoproteins are expressed by a wide variety of epithelial tissues and are often important differentiation markers in the development of these tissues. MUC1, previously referred to as the polymorphic epithelial mucin, is a highly glycosylated membrane glycoprotein expressed by a large number of simple secretory epithelial tissues, e.g. mammary gland, pancreas, lung, fallopian tube, salivary gland, and chief cells of stomach, as well as by certain carcinomas (Zotter et al., 1988), where it shows aberrant expression (Burchell et al., 1987; Girling et al., 1989). Monoclonal antibodies (mAbs) HMFG-1, HMFG-2, and SM-3, directed to normal and/or malignant human mammary epithelial cells, have been found to recognize epitopes present within this mucin (Taylor-Papadimitriou et al., 1981; Burchell et al., 1983, 1989; Gendler et al., 1988; Girling et al., 1989). Subsequently, partial cDNA clones revealed the presence of a 60-bp GC-rich tandem repeat coding for a potentially highly glycosylated 20-amino acid repeat (Gendler et al., 1987, 1988; Siddiqui et al., 1988). The high degree of polymorphism observed at the DNA, RNA, and protein level, which led to the protein initially being called the polymorphic epithelial mucin, was found to be due to differing numbers of tandem repeats, the numbers/allele ranging from 20 to 125 (Swallow et al., 1987; Gendler et al., 1988, 1990); the gene can thus be described as an expressed variable number tandem repeat gene. Recent cloning of the porcine submaxillary (Timpte et al., 1988), human intestinal MUC2 and MUC3 (Gum et al., 1989), human tracheobronchial MUC4 (Porchet et al., 1991), and Xenopus integumentary mucin cDNAs (Hoffmann, 1988; Probst et al., 1990) has revealed the presence of tandemly repeated domains made up of 81, 23, 17, 16, 9, and 11 amino acid repeats, respectively. The presence of a tandemly repeated domain thus appears to be a characteristic feature of mucins, and this feature has been shown to give several of the mucin genes a variable number tandem repeat polymorphism (Swallow et al., 1987; Gendler et al., 1988; Griffiths et al., 1990; Hauser et al., 1990; Porchet et al., 1991).
Periodic acid-Schiff silver-stained gels identify an equivalent high molecular weight glycosylated protein, present in the milk fat globule, in a wide spectrum of mammals ranging from human to mouse (Patton et al., 1989; Patton and Patton, 1990). However, mAbs directed to the human MUC1 tandem repeat core protein have shown cross-reactivity only with the mucin of higher primates (S. Patton, personal communication), suggesting that the sequence of the repeat unit has altered during evolution. As much as 50% of the molecular weight of the human MUC1 protein is made up of carbohydrate attached mainly to sites present within the tandem repeats by O-linked glycosylation. If the primary function of mucin tandemly repeated domains is to provide a carbohydrate scaffold, then divergence of the repeat sequence through evolution should be "allowed" within certain limits, those limits being the maintenance of potential O-glycosylation. The MUC1 protein is located exclusively in the apical domain of the plasma membrane of highly polarized epithelial cells. Full-length human cDNA clones revealed regions encoding a 31-amino acid transmembrane and 69-amino acid cytoplasmic tail domain (Lan et al., 1990; Ligtenberg et al., 1990; Wreschner et al., 1990). Recent studies on apical membrane polarity, examining in particular the distribution of MUC1 in mammary epithelial cell cultures, have suggested that the protein interacts either directly or indirectly with the actin cytoskeleton, presumably by means of its cytoplasmic tail.
The abbreviations used are: mAbs, monoclonal antibodies; PCR, polymerase chain reaction; GTC, guanidinium isothiocyanate; MFG, milk-fat-globule; bp, base pair(s); kb, kilobase(s); kbp, kilobase pair(s); SDS, sodium dodecyl sulfate.
In order to investigate the function of this protein and to look at evolutionary changes in the tandem repeat, the mouse homologue has been cloned. Full-length cDNA and genomic sequencing has revealed high overall homology between the two sequences. In particular, the mouse sequence contains a repeat domain encoding a variable 20-21-amino acid tandem repeat with a very high potential for O-glycosylation, although homology between the human and mouse repeats is less than 40% at the protein level. Regions of highest homology between the human and mouse sequences include the 31-amino acid transmembrane and 69-amino acid cytoplasmic tail domain. This result suggests that this region may be very important functionally in its interaction with the actin cytoskeleton. The promoter region immediately upstream of the TATAA box also exhibits high homology, 74%, which may indicate that this region plays a role in the epithelial-specific expression of this gene.
Southern blots indicate that the rodent MUC1 homologue is not polymorphic. However, sequence analysis within the repeat domain of the mouse Muc-1 gene suggests that at one time the rodent gene may have exhibited a variable number tandem repeat polymorphism and that this polymorphism has subsequently been lost.

MATERIALS AND METHODS
Genomic DNA Isolation and Southern Blot Analysis-Genomic DNA was prepared from T47D (human mammary carcinoma (Keydar et al., 1979)), C57 MG (mouse mammary carcinoma (Vaidya et al., 1978)), and HC11 (COMMA-1D mouse mammary epithelial (Ball et al., 1988)) cell lines with an Applied Biosystems (Foster City, CA) 340A DNA Extractor. Samples (10-15 µg) were digested with the appropriate restriction endonucleases (New England Biolabs and Northumbria Biologicals Ltd.) under conditions recommended by the manufacturer, prior to electrophoresis through a 0.7% agarose gel (ICN Biochemicals). Southern blotting onto nylon membranes (Pall Biodyne, Glen Cove, NY) and subsequent hybridization/washing procedures were carried out according to the manufacturer's instructions. The human probes pMUC7, corresponding to 500 bp of the tandem repeat domain (Gendler et al., 1987), and pGEM-16.2, corresponding to ~1 kb of 3'-cDNA including regions coding for a 31-amino acid transmembrane and 69-amino acid cytoplasmic tail domain, were utilized. Probes were labeled by random priming (Feinberg and Vogelstein, 1984) in the presence of [α-32P]dCTP (Amersham International plc) to a specific activity of >1 × 10' dpm/µg.
cDNA Library Screening with pGEM-16.2-Primary and amplified λgt10 cDNA libraries constructed from oligo(dT)-primed poly(A)+ BALB/c mouse lactating mammary gland RNA as described (Stubbs et al., 1990) were utilized. Approximately 6 × 10^6 pfu of the amplified library were plated out on NM514 cells to a density of approximately 3 × 10^5 plaques/24 × 24-cm agar plate. Double lifts were taken from each plate onto Biodyne nylon membrane. Filters were pretreated prior to hybridization according to standard protocol (Maniatis et al., 1982). Hybridization/washing procedures were as recommended by the manufacturer except that 43% deionized formamide was utilized rather than 50% and hybridization proceeded for 48 h.
Twenty-two positive plaques were picked and subjected to three rounds of plaque purification. All positive plaques were found to have the same insert of approximately 1.2 kb including a poly(A) tail. Inserts were purified onto DEAE membrane (Schleicher & Schuell, Dassel, Germany), ligated into pBluescript-SKII+ (Stratagene Cloning Systems, La Jolla, CA), and transformed into XL-1 bacteria. This plasmid will be referred to as pMuc10.
Rescreening Primary cDNA Library with Initial Clones and PCR-The total primary library was plated out, as before, to a density of ~1 × 10' plaques/24 × 24-cm plate. Double lifts were taken, as previously described, and screened using randomly primed pMuc10. Positive plaques were picked into 250 µl of a 2 × PCR buffer of composition 20 mM Tris-HCl, pH 8.7, 100 mM KCl, 3 mM MgCl2 and vortexed vigorously for 15 s before taking 25-µl aliquots for PCR amplification in a total volume of 50 µl. Combinations of Clontech λgt10 PRI-MATE amplimers, directed on either side of the EcoRI cloning site, and a synthetic oligonucleotide specific to the mouse mucin gene, 5' CCA AGC TTG ACT AGA CTG GTA GCT GAG CC 3', corresponding to nucleotides 1864 to 1844 (Fig. 2) on the antisense strand and containing a cloning site for HindIII (synthesized on an Applied Biosystems 380B DNA Synthesizer), were used to determine if any of these positive clones contained sequence further 5' of pMuc10. Amplification was carried out on a Hybaid thermal reactor using AmpliTaq polymerase (Perkin-Elmer Cetus) with segments as follows: 94 °C 10 min, 45 °C 10 min, 72 °C 1 min; then 25 cycles of 94 °C 15 s, 50 °C 15 s, and 72 °C 1 min, followed by a final extension of 10 min at 72 °C. Amplified fragments were size-separated through a 1.0% agarose gel, purified onto DEAE membrane, restriction digested, and ligated into the appropriately cut pBluescript vector. The largest clone contained a further 550 bp of sequence 5' to pMuc10. This fragment was cloned into pBluescriptKSII+ and will be referred to as pMuc2TR. The full-length insert of ~1.75 kb (1.2 kb + 550 bp) was also subcloned into pBS-KSII+ by conventional means, as previously described, to account for any point mutations introduced through PCR, and this plasmid will be referred to as pMuc2.
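The design of the antisense oligonucleotide quoted above can be checked with a few lines of Python. The sketch below is illustrative only: the helper name is hypothetical, AAGCTT is the standard HindIII recognition sequence, and the split of the primer into a 5' cloning tail and a gene-specific 3' part is an assumption inferred from the quoted coordinates (nucleotides 1864 to 1844).

```python
# Sanity check of the antisense primer quoted in the text.
# Helper and variable names are illustrative, not taken from the paper.

HINDIII_SITE = "AAGCTT"  # standard HindIII recognition sequence


def split_primer(primer: str, site: str):
    """Return (5' tail including the site, remaining 3' part) of a primer."""
    seq = primer.replace(" ", "").upper()
    pos = seq.find(site)
    if pos < 0:
        raise ValueError("restriction site not found in primer")
    end = pos + len(site)
    return seq[:end], seq[end:]


primer = "CCA AGC TTG ACT AGA CTG GTA GCT GAG CC"
tail, specific = split_primer(primer, HINDIII_SITE)

# Coordinates 1864 down to 1844 inclusive cover this many nucleotides.
expected_len = 1864 - 1844 + 1

print("5' tail with HindIII site:", tail)
print("gene-specific part:", specific, len(specific), "nt; expected", expected_len)
```

Under these assumptions the gene-specific part of the primer has the same length as the coordinate range given in the text.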
Genomic Cloning-Cosmid clones were isolated from a mouse genomic library constructed from a partial Sau3A digest of mouse genomic DNA cloned into the BamHI site of cos203, an Epstein-Barr virus-based shuttle vector (Kioussis et al., 1987). Replica filters of this library, constructed by Dimitris Kioussis (National Institute for Medical Research, London), were kindly provided by Dr. Alistair Lammie (Imperial Cancer Research Fund, London) and screened with the pMuc2TR probe according to the method of Church and Gilbert (1984). Purified clones, referred to as cosmo 1.21, 1.22, 1.23 and 2.21, 2.22, 2.23, were cut with various restriction endonucleases. Subsequently, ~15-kb BamHI and ~10-kb EcoRI fragments containing the mouse mucin gene were cloned into pBluescriptKSII+. These plasmids will be referred to as pMucBam and pMucEco. PstI and TaqI restriction fragments containing all or part of the repeat domain were further subcloned into pBS-KSII+ for sequencing.
5'-cDNA Cloning-Total RNA was isolated from C57BL mouse lactating mammary gland by pulverizing with a Braun (Melsungen, Federal Republic of Germany) Mikro-Dismembrator II followed by guanidinium isothiocyanate (GTC) extraction (Chirgwin et al., 1979). First-strand cDNA synthesis was carried out using the cDNA synthesis system (Amersham International plc) and an antisense primer directed to part of mouse repeat number seven, 5' CCC AAG CTT GTC TGG AGA GCT GGT GGA GTC 3', corresponding to nucleotides 1318-1298 (Fig. 2) on the antisense strand and containing a site for HindIII. The product was amplified by PCR using the mouse repeat antisense primer and a primer containing a site for EcoRI and directed to the translation start site, 5' CCC GAAT TCA TGA CCC CGG GCA TTC GGG CT 3', corresponding to nucleotides 73-93 (Fig. 2) (as determined from the genomic clones). The amplified band of around 300 bp was digested with EcoRI and HindIII and ligated into pBluescript.
A number of plasmids were isolated with inserts in the range 150-300 bp. Insert sizes varied due to the repeat primer priming at different repeats. These will be referred to collectively as pMuc5'.
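The sense primer described above is said to carry an EcoRI site and to be directed to the translation start site. A minimal Python sketch, under the assumption that everything 3' of the EcoRI site (GAATTC, the standard recognition sequence) is gene sequence, checks that this part begins with ATG and shows the peptide it would encode. The codon table is the standard genetic code; the helper names are hypothetical.

```python
# Illustrative check of the sense primer: EcoRI site followed by the start codon.

BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... GGG.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def translate(dna: str) -> str:
    """Translate complete codons of a DNA string, ignoring a trailing partial codon."""
    dna = dna.upper()
    usable = len(dna) - len(dna) % 3
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, usable, 3))


ECORI_SITE = "GAATTC"  # standard EcoRI recognition sequence
sense_primer = "CCC GAAT TCA TGA CCC CGG GCA TTC GGG CT".replace(" ", "")

site_end = sense_primer.index(ECORI_SITE) + len(ECORI_SITE)
gene_part = sense_primer[site_end:]   # assumed gene-specific portion
peptide = translate(gene_part)

print(gene_part, len(gene_part), peptide)
```

The length of the assumed gene-specific portion agrees with the quoted coordinates (nucleotides 73-93), and its first codon is the initiator ATG.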
Northern Analysis-Total RNA was prepared from T47D, ICRF 23 (human embryonic lung fibroblast cell line), and mouse lactating mammary gland by the GTC method. Northern analysis was carried out as described previously using [α-32P]dCTP-labeled probes corresponding to the mouse Muc-1 5' (probe a), repeat (probe b), and 3' (probe c) regions (Fig. 1).
Analysis of Polymorphism-Mouse tail genomic DNA was prepared according to Hogan et al. (1986) from Mus musculus isolated from wild populations on the Isle of May in the Firth of Forth; Westray, Sanday, Faray, and Eday (in the Orkney Isles); Skokholm Island (off the Welsh coast); Burton-On-Trent; Taunton; John O'Groats; Belfast; and Denmark. DNA was also prepared from inbred laboratory strains, C57BL × CBA; the rat, Rattus norvegicus; Mus spretus from southern Spain; the bank vole, Clethrionomys glareolus; and the short-tailed vole, Microtus agrestis. Genomic DNA was also prepared from C57 MG, HC11, and C6 (rat glial fibroblast) cell lines as previously described. Samples (10-15 µg) were digested with TaqI restriction endonuclease under conditions recommended by the manufacturer, prior to electrophoresis through a 1.2% agarose gel. Southern blotting and washing conditions were as previously described. The mouse Muc-1 probe utilized, pMuc2TR, consisted of 550 bp of repeat and was labeled by random priming as described previously.
Milk-fat-globule (MFG) proteins were isolated as described (Patton and Huston, 1986) from human, rhesus monkey, bovine, mouse, guinea pig, cat, dog, horse, and goat milk. Approximately 5-µg samples of total MFG protein from the various species were size-separated as described (Patton and Patton, 1990). Proteins were visualized according to Morrissey (1981).
DNA Sequencing-Denatured double-stranded DNA was sequenced fully in both directions by the dideoxy chain termination method using primers directed to vector sequence or synthetic oligonucleotides. Computer analyses of DNA and amino acid sequences were performed on a VAX computer using Intelligenetics and Genetics Computer Group software.

RESULTS
Cloning Strategy-Southern blots of mouse genomic DNA cut with various restriction endonucleases were screened with cDNA probes corresponding to the human mucin repeat domain, pMUC7, and the human mucin 3'-coding region. No cross-hybridization was observed with the repeat probe at low stringency, whereas the 3'-probe cross-hybridized at high stringency, revealing a single EcoRI fragment at 10 kb and a single BamHI fragment at 15 kb (data not shown). The human 3'-probe, pGEM-16.2, was used to screen an amplified λgt10 cDNA library constructed from poly(A)+ mouse lactating mammary gland RNA, a tissue in which the mucin is known to be expressed at high levels in other animals. All the initial clones obtained represented 3'-clones stretching from the poly(A) tail to a position corresponding to 3' of the repeat domain of the human cDNA sequence. The primary cDNA library was rescreened with this initial clone, and resulting positives were analyzed using a PCR approach to determine if they contained sequence further 5' of the original clone. Clones containing approximately 550 bp of further 5'-sequence were identified. To obtain the remaining 5'-cDNA sequence, a mouse cosmid library was screened with the mouse cDNA clone, pMuc2TR, in order to obtain genomic sequence in the 5'-region from which to construct specific oligonucleotide primers for PCR amplification. First-strand cDNA synthesis was carried out using C57BL mouse lactating mammary gland RNA and a synthetic antisense oligonucleotide primer directed to the furthest 5'-sequence of the cDNA clone, pMuc2. Sequence obtained from the genomic clones, corresponding to the potential translation start site, was used to synthesize a specific primer in order to amplify the product of the first-strand reaction by PCR, using this primer and the antisense primer mentioned.
3'- and Repeat cDNA Clones-Twenty-two positive clones were isolated from the amplified library, all of which were observed to contain an insert of 1.2 kb (Fig. 1B). Two of these clones were selected for sequencing, and their sequences were found to be identical. The initial clone, when used to screen the primary library, identified five positive clones. These were analyzed by a PCR approach using synthetic oligonucleotide primers directed both to vector sequence on either side of the cloning site and to the mouse mucin sequence previously determined. In this way it was deduced that two of the five clones contained further 5'-sequence amounting to about 550 bp (Fig. 1B). Sequencing of the 550-bp fragment revealed that it was entirely made up of a 60-63-bp degenerate repeat, the degeneracy of which increased from 5' to 3'. The repeat encoded a 20-21-amino acid repeat extremely rich in serine and threonine, >50% in some repeats. Sequence obtained from the PCR-generated clones and the clone obtained by conventional methods matched exactly.
5'-cDNA and Genomic Clones-Genomic clones were obtained from a mouse cos203 cosmid library constructed from a partial Sau3A digest of mouse genomic DNA. Six positive clones were selected and characterized, two of which appeared to be rearranged, perhaps because of the repetitive region. The 10-kb EcoRI and 15-kb BamHI fragments were isolated and subcloned into pBS-KSII+ for sequencing. Sequence obtained from the genomic clones indicated that these fragments contain the entire coding region (Fig. 1C). Sequencing of the genomic clones enabled the synthesis of a potential translation start site oligonucleotide. First-strand cDNA synthesis using C57BL lactating mammary gland RNA followed by PCR amplification, as described, resulted in the amplification of a diffuse band at about 150-300 bp. Restriction enzyme sites located at the ends of the primers enabled the product to be digested and cloned into pBS-KSII+ for sequencing. Six colonies were picked, three of which were identical in sequence, except for a PCR-introduced point mutation in one (C to A at nucleotide 78), and three of which differed in the number of repeats (Fig. 1B). Apart from the single point mutation mentioned, all sequences from PCR-derived clones matched both the genomic sequence and the cDNA sequence previously described.
To confirm that the three clones, 5', 3', and repeat, were part of the same transcript, Northern blots of mouse lactating mammary gland RNA, T47D human mammary carcinoma cell line RNA, and ICRF23 human lung fibroblast cell line RNA were probed with regions of each of the clones (Fig. 1). An identical pattern of hybridization was observed in the mouse mammary RNA, with a single transcript at about 2.3 kb in each case. The 3'-probe also showed weak cross-reactivity with T47D at high stringency, where transcripts from the two different sized alleles were observed; this is consistent with the 3'-clones having been selected by their strong cross-hybridization with the human probe pGEM-16.2.
Nucleotide Sequence and Genomic Structure-The composite DNA sequence from the 5'-PCR-derived clone, the tandem repeat clone, the genomic clones, and the 3' λgt10 clone is shown in Fig. 2. Sequence shown was determined in both directions. The predicted leader sequence was derived from sequence of the two genomic clones. Genomic sequence suggests that the transcription start site of the mouse gene occurs some 72 bp away from the translation start site, as is the case in the human mucin. This is suggested by the fact that the alignment between the two sequences is good in this region and that a TATAA box is located at the same position 23 bp away (Figs. 2 and 6). Repeated attempts at precisely mapping the transcription start site using a variety of oligonucleotide primers and conditions in primer extension and nuclease protection experiments met with no success. All cDNA sequence corresponded exactly with sequence determined from genomic clones apart from the point mutation previously mentioned.
The genomic structure of the mouse homologue (Fig. 3) is very similar to that of its human counterpart, there being seven exons and five introns of similar sizes in both genes; mouse intron one, however, is significantly longer than its human equivalent. All exon/intron boundaries are conserved. A dot-matrix plot comparing the human and mouse genomic sequences, performed using the COMPARE and DOTPLOT Genetics Computer Group software with a window of 21 and stringency of 14, illustrated the high overall similarity and clearly showed the difference in conservation between introns on the one hand and exons and promoter on the other; introns were observed to correspond to gaps in the main diagonal (Fig. 4).
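The window/stringency idea behind such a dot-matrix comparison can be sketched in Python as follows. This is a simplified stand-in and not the Genetics Computer Group implementation: it assumes that a point is recorded whenever two windows of the given length share at least `stringency` identical positions, and it records the window start rather than any particular position within the window. The toy sequences are hypothetical.

```python
def dot_matrix(seq_a, seq_b, window=21, stringency=14):
    """List (i, j) window starts whose windows share >= `stringency` identical bases."""
    seq_a, seq_b = seq_a.upper(), seq_b.upper()
    hits = []
    for i in range(len(seq_a) - window + 1):
        for j in range(len(seq_b) - window + 1):
            matches = sum(seq_a[i + k] == seq_b[j + k] for k in range(window))
            if matches >= stringency:
                hits.append((i, j))
    return hits


# Toy data (hypothetical): seq_b is seq_a with a 10-base insertion after position 8,
# mimicking a stretch present in one sequence but not in the other.
seq_a = "ACGTACGGTCAATGCC"
seq_b = seq_a[:8] + "TTTTTTTTTT" + seq_a[8:]

# A small window keeps the toy example readable; the text quotes 21 and 14.
hits = dot_matrix(seq_a, seq_b, window=5, stringency=5)
for i, j in hits:
    print(i, j)
```

In the toy output the hits before the insertion lie on the main diagonal (i equals j), while those after it are shifted by the length of the insertion, which is how an unmatched stretch shows up as a break and offset in the diagonal.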
Predicted Amino Acid Sequence and Composition of the Mouse MUC1 Core Protein-The deduced sequence of the mouse mucin core protein (Fig. 2) is that of an integral membrane protein with 40% of its coding capacity made up of serine, threonine, and proline, a composition typical of a protein that is highly O-glycosylated. The protein appears to consist of three distinct regions: (a) an amino-terminal region containing a hydrophobic signal sequence preceding a short stretch of unique sequence; (b) a tandem repeat region encoding 16 degenerate repeats (underlined), five of which are 21 amino acids in length, the remaining 11 repeats being 20 amino acids; and (c) a carboxyl-terminal region containing unique sequence followed by a hydrophobic membrane-spanning domain of 31 amino acids and a 69-amino acid cytoplasmic tail. According to the predictive method of von Heijne (1986), the signal peptide is 11 amino acids long and is cleaved between the glycine and phenylalanine residues (Fig. 5A).
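The composition figures quoted above are simple counts, as the following Python sketch illustrates. The 20-residue fragment is a stretch of the deduced amino acid sequence as printed with Fig. 2; it is used here only as sample input and is not assumed to coincide with a repeat boundary. The function name is hypothetical.

```python
def fraction_of(residues: str, protein: str) -> float:
    """Fraction of `protein` made up of any of the one-letter codes in `residues`."""
    protein = protein.upper()
    return sum(protein.count(r) for r in residues) / len(protein)


# Sample stretch of the deduced sequence (one-letter code), used for illustration.
fragment = "TSLSEDSASSPVAHGGTSSP"

stp_fraction = fraction_of("STP", fragment)            # Ser + Thr + Pro
ser_thr_count = fragment.count("S") + fragment.count("T")

# Size of the repeat domain implied by the text: 11 repeats of 20 residues
# plus 5 repeats of 21 residues.
repeat_domain_aa = 11 * 20 + 5 * 21
repeat_domain_bp = repeat_domain_aa * 3

print(stp_fraction, ser_thr_count, repeat_domain_aa, repeat_domain_bp)
```

The same function applied to the complete deduced sequence would give the whole-protein figure; the fragment merely shows how locally enriched in serine, threonine, and proline the repeat region is.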
The sequence of the 20-21-amino acid tandem repeat unit corresponds to what might be expected for a protein that is extensively glycosylated. On average there are 9 serine/threonine residues/repeat, with eight of these being found as doublets. The predicted molecular mass for this core protein is 65 kDa, yet mouse milk fat globule proteins when run on protein gels indicate that the fully glycosylated protein is greater than 200 kDa in size (see Fig. 8). This would imply that up to 75% of the molecular weight of the fully glycosylated protein is made up of carbohydrate. As well as there being multiple potential O-linked glycosylation sites, the sequence also contains 10 possible N-linked glycosylation sites (Asn-X-Ser/Thr) in the extracellular domain, five of which are found within the last six repeats.
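Potential N-linked sites of the Asn-X-Ser/Thr type can be located with a short pattern search. The Python sketch below is an illustration only: the exclusion of proline at position X is a widely used convention that the text does not state, and the test peptide is hypothetical rather than taken from the Muc-1 sequence.

```python
import re

# Asn, any residue except Pro (an assumed convention), then Ser or Thr.
# The lookahead lets overlapping candidate sites be reported individually.
SEQUON = re.compile(r"(?=N[^P][ST])")


def n_glycosylation_sites(protein: str):
    """Return 0-based positions of Asn residues in Asn-X-Ser/Thr motifs (X != Pro)."""
    return [m.start() for m in SEQUON.finditer(protein.upper())]


# Hypothetical peptide for demonstration.
peptide = "MNGTSVNPSANISGLNNTT"
sites = n_glycosylation_sites(peptide)
print(sites)
```

Applied to the extracellular part of the deduced sequence, a search of this kind would list candidate sites for comparison with the count given in the text.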
Homology with Human MUC1 Protein-Alignment of the amino acid sequence of the human and mouse genes revealed the most significant homology to be centered around the membrane-spanning and cytoplasmic tail domains, 87%.

[Fig. 2. Composite nucleotide sequence of the mouse Muc-1 gene with the deduced amino acid sequence shown beneath.]
Other significant areas of homology noted were the NH2-terminal signal sequence, the serines and threonines (potential attachment sites for O-linked carbohydrate) found within the repeated region, and the potential N-linked glycosylation sites. The most significant difference between the two sequences occurred at the 5'-end, where a large region of the human sequence is not represented in the mouse sequence. Fig. 5A depicts the alignment between the two amino acid sequences, and Fig. 5B summarizes homology levels over the various domains. Alignment of the mouse amino acid sequence to a human sequence containing 12 consensus repeats was carried out on a VAX computer using the Genetics Computer Group GAP program.
Promoter Analysis-Several potential hormone response elements have been identified within 500 bp of the human mucin transcription start site, including potential progesterone, glucocorticoid, and estrogen (Tsarfaty et al., 1990) response elements. An alignment of the corresponding sequence of the mouse mucin promoter suggests that these potential elements are not functional. In addition, a potential enhancer sequence, described by Tsarfaty et al. (1990) within the first intron of the human MUC1 and showing significant homology to a murine cellular enhancer, was also not conserved. The two promoter regions within 500 bp of the transcription start site are well conserved, however, displaying 74% identity (Fig. 6). In particular, a region immediately upstream of the TATAA box is very highly conserved, displaying close to 100% identity. This may imply that this region functions in the epithelial-specific expression of this gene. No significant homology was found with any other promoter sequences in the EMBL or GenBank databases.
Analysis of Polymorphism-Southern blots containing both wild and inbred rodent DNA samples revealed that all individuals possessed Muc-1 alleles of the same length (16 repeats) (Fig. 7). The restriction endonuclease TaqI, from a knowledge of the sequence data, should yield a fragment of approximately 1.6 kbp, and indeed this fragment was observed in all mouse, M. musculus, samples as well as in Mus spretus, the rat, and two other rodent species, the bank vole and short-tailed vole. This fragment has been observed in a sample of greater than 40 wild rodents (data not shown), suggesting that the Muc-1 gene may exhibit a fixed length in rodents in general. This appears to be in contrast to the situation observed for this locus in a large number of other mammalian species. As the polymorphic region of this gene occurs in a region coding for protein, it is possible to visualize the length polymorphism at the protein level as well as at the nucleic acid level. Using milk-fat-globule preparations and silver-stained SDS-protein gels, it was possible to visualize the equivalent mammary mucin from a variety of mammals, including human, rhesus monkey, bovine, horse, goat, dog, cat, guinea pig, and mouse (Fig. 8). These studies indicated that all species considered, apart from mouse and guinea pig (members of two rodent subgroups), appear to be polymorphic at this locus, i.e. exhibited two protein bands as opposed to a single band.
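The reasoning behind the Southern analysis, that a fixed repeat number gives one restriction fragment length whereas a change in repeat number alters it, can be illustrated with a small in silico digest. The Python sketch below assumes a linear molecule and uses the standard TaqI specificity (recognition sequence TCGA, cut after the first T). The toy sequences are hypothetical and far shorter than the 1.6-kbp fragment discussed in the text.

```python
TAQI_SITE = "TCGA"   # TaqI recognition sequence
TAQI_CUT_OFFSET = 1  # TaqI cuts between T and CGA


def digest_linear(seq, site=TAQI_SITE, cut_offset=TAQI_CUT_OFFSET):
    """Return fragment lengths, in order, from digesting a linear sequence."""
    seq = seq.upper()
    cuts = []
    pos = seq.find(site)
    while pos != -1:
        cuts.append(pos + cut_offset)
        pos = seq.find(site, pos + 1)
    bounds = [0] + cuts + [len(seq)]
    return [end - start for start, end in zip(bounds, bounds[1:])]


# Hypothetical allele: two TaqI sites flanking an 8-base "repeat domain".
allele = "AAAA" + "TCGA" + "GGGGGGGG" + "TCGA" + "CC"
fragments = digest_linear(allele)

# Hypothetical longer allele: 60 extra bases (one repeat unit) between the sites.
longer_allele = allele[:10] + "G" * 60 + allele[10:]
longer_fragments = digest_linear(longer_allele)

print(fragments, longer_fragments)
```

In this toy case the internal fragment grows by exactly the length of the inserted unit, which is the kind of size shift that would be expected on a Southern blot if repeat number varied between alleles.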
Fig. 4 (legend, partial). ... passing through the diagonal would be expected if there were a perfect match between the two sequences. Lines parallel to the main diagonal are indicative of repetitive domains. From this plot the various levels of homology can be clearly seen. In particular, regions corresponding to intron sequences can be seen as gaps in the diagonal (intron VI has been deleted for the purposes of this plot). Analysis was performed using the Genetics Computer Group COMPARE and DOTPLOT programs, with a window of 21 and a stringency of 14.0. Axes are in base pairs.
Detailed sequence analysis of the mouse Muc-1 repeat region indicated that on average repeats shared 75% homology, as opposed to 97-100% in the human MUC1 gene (Gendler et al.; Siddiqui et al., 1988; Abe et al., 1989; Ligtenberg et al., 1990; Wreschner et al., 1990; Lan et al., 1990), and that five of the 16 repeats possessed an extra codon at the same position (Fig. 9). Comparison between repeats strongly suggested that an unequal exchange event has occurred within the repeat domain, resulting in the duplication of five repeats, and this event can be localized quite finely to have taken place between the ends of "ancestral repeats" number 1 and 6 (Fig. 10).
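An average repeat-to-repeat homology of the kind quoted above can be computed as the mean identity over all pairs of repeats. The Python sketch below assumes that the repeats have already been aligned to a common length (the mouse repeats are 60 or 63 bp, so real data would first need gaps introduced at the extra codon); the three short sequences are hypothetical and the function names are illustrative.

```python
from itertools import combinations


def identity(a: str, b: str) -> float:
    """Fraction of identical positions between two pre-aligned, equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def mean_pairwise_identity(repeats) -> float:
    """Average identity over every unordered pair of repeats."""
    pairs = list(combinations(repeats, 2))
    return sum(identity(a, b) for a, b in pairs) / len(pairs)


# Hypothetical aligned repeat units for demonstration.
toy_repeats = ["ACGTACGT", "ACGTACGA", "ACGAACGT"]
mean_identity = mean_pairwise_identity(toy_repeats)
print(round(mean_identity, 3))
```

A matrix of the individual pairwise values, rather than only their mean, would also show which repeats are most alike, which is the sort of comparison on which the proposed duplication event rests.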

DISCUSSION
We report here the full-length sequence of the mouse homologue of the tumor-associated mucin, MUC1. Our data show that the amino acid composition of the mouse mucin is typical of that for a mucin, with serine, threonine, and proline making up a large proportion of the total amino acid content. Alignment between the human and mouse sequences revealed the most significant homologies to be centered around the transmembrane and cytoplasmic tail domains. Interaction of this domain with the actin cytoskeleton has been demonstrated recently, and in combination with our data this would imply an important role for this region in the function of the protein. It is feasible that this region of the protein binds to a link protein which in turn binds to the actin cytoskeleton. Other short stretches of high homology occur between the repeat and transmembrane/cytoplasmic tail region (see Fig. 5). Analysis of the full-length cDNA revealed the presence of what can be described as a degenerate tandem repeat domain, repeats being 60 or 63 bp, which would allow for up to half the amino acids to be O-glycosylated (the serines and threonines). The predicted size of the mouse mucin core protein is 65 kDa, yet SDS-protein gels indicate that the mature glycosylated protein present in mouse milk is greater than 200 kDa, suggesting that up to 75% of the molecular mass of the mature protein is made up of carbohydrate. Alignment between the human and mouse repeat domains revealed significant homologies to occur at the serine/threonine doublets, the prolines, the single histidine, and the central single threonine found in the human PDTRP motif (Fig. 5, A and B). From the deduced mouse mucin amino acid sequence and the alignment with the human sequence we can see why mAbs HMFG-2 and SM-3, directed to the human repeat core protein, showed no cross-reactivity with the mouse mucin. The epitopes for these two mAbs are DTR and PDTRP, respectively (see Fig. 5, A and B); the 2 charged residues contained within these motifs, aspartate (D) and arginine (R), are not conserved, these residues becoming uncharged alanine (A) and serine (S) or threonine (T) in the mouse.
The 5'-portion of the mouse cDNA shows significant differences from the human 5'-cDNA (see Fig. 5, A and B).
Where the human 5' region encodes a hydrophobic signal peptide, a short stretch of unique sequence followed by three degenerate repeats, the mouse sequence codes for a signal peptide, which shows high homology with that of the human, and a much shorter stretch of unique sequence which has no significant homology with that region of the human cDNA. The splice site at the end of the mouse mucin first intron appears to correspond to the alternate splice site reported for the human mucin by several groups (Ligtenberg et al., 1990; Wreschner et al., 1990). However, our sequence suggests that an alternative splicing mechanism does not operate in the mouse homologue, as there does not appear to be any potential alternative splice acceptor within the next 20-40 bp which would result in the reading frame being maintained.
In addition to the many potential O-glycosylation sites, the deduced amino acid sequence we describe contains 10 potential N-glycosylation sites, five of which are found within the repeat domain. By comparison, five such sites are found in the human mucin. Four of these five sites are precisely conserved between the two sequences, the other one having a potential site in the mouse sequence located within 1 amino acid residue of its position in the human sequence.
The tandem repeat domain in the human MUC1 gene shows allelic variations in length which result in such a high degree of polymorphism that the sequence can be considered to be a variable number tandem repeat locus (Swallow et al., 1987). However, when the mouse repeat probe was used to investigate polymorphisms occurring in the mouse, none were found in either laboratory or wild mice (Fig. 7). These results indicate that the polymorphic nature of the gene has been lost in the mouse and other rodents. Milk-fat-globule proteins from a variety of other mammalian species, when run on SDS-protein gels and silver stained, identified an equivalent highly glycosylated protein migrating in the size range >180 kDa. Such gels indicated that species such as chimpanzee, rhesus monkey, horse, cow, goat, dog, and cat are polymorphic with respect to this protein (Patton et al., 1989; Patton and Patton, 1990), whereas, in good agreement with our Southern blot data, the rodents appeared to be non-polymorphic (Fig. 8). In order to confirm that the proteins we were observing were indeed homologous, the polyclonal antibody, CT-1, directed to the last 17 amino acids of the human MUC1 cytoplasmic tail,³ was utilized. However, although this antibody reacts very strongly against the equivalent mucin from a variety of other species on sections, it does not blot well (even with the human mucin).

... nucleotides that are conserved in the human promoter region; dashes indicate missing bases in the homologue. It is of interest to note that two sequences present within the human mucin promoter previously described as being potential response elements for progesterone and estrogen are not conserved. Twelve extra bases (-414 to -403) are present in the mouse promoter corresponding to the position of the potential progesterone response element whereas the potential estrogen response element has been deleted (-311). The highly conserved sequence located between two potential Sp1-binding sites has been underlined.

FIG. 7. Variation at the mouse Muc-1 locus. 10-15 µg of cell line or mouse tail genomic DNA was digested with TaqI restriction endonuclease, treated as described under "Materials and Methods," and hybridized to mouse Muc-1 probe pMuc2TR. In all cases a single hybridizing fragment can be seen migrating at 1.6 kbp. This fragment was also observed in greater than 40 wild mouse samples (data not shown), indicating that the mouse Muc-1 gene is not polymorphic. Panel is a composite of two autoradiographs. Alignment of all bands was justified as both gels contained the samples C57 MG, HC11, and C6 at each side.

FIG. 8. Milk-fat-globule protein polymorphism. Approximately 5 µg of milk-fat-globule proteins isolated from the milk of a variety of species were analyzed as described under "Materials and Methods." All lanes reveal a high molecular mass milk fat globule protein migrating in the range >180 kDa. In all lanes except mouse and guinea pig, the protein is characterized by the presence of two bands, presumably from two respective different length alleles. In order to confirm that this protein was homologous to MUC1, the polyclonal antibody, CT-1, directed to the human cytoplasmic tail,³ was utilized. However, although this antibody cross-reacted strongly on sections, it did not perform well on Western blots.

³ L. Pemberton, manuscript in preparation.
Sequence comparison within the mouse mucin repeat domain revealed the presence of a degenerate repeat with on average 15/60 mismatches, or approximately 75% homology between repeats (Fig. 9). This is in contrast to the human MUC1 gene in which repeat homology levels range from 97 to 100% (Gendler et al., 1987, 1988; Siddiqui et al., 1988; Abe et al., 1989; Ligtenberg et al., 1990; Wreschner et al., 1990; Lan et al., 1990). The generation of new length minisatellite alleles is based upon misalignment of repeat arrays followed by an unequal exchange event (Jarman and Wells, 1989). This is now thought most likely to take place during sister-chromatid exchange. In order for repeat arrays to recognize each other, a minimum level of repeat homology is required, and in general the most polymorphic minisatellite loci (of which the human MUC1 gene is one) are those with the most precise repeats (Jeffreys et al., 1985).
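The repeat homology figure quoted above follows directly from the mismatch count, and the same measure can be averaged over all pairs of aligned repeats. The sketch below is illustrative only: the function names and the toy repeats are hypothetical and are not the Muc-1 sequences.

```python
from itertools import combinations


def percent_identity(a, b):
    """Percentage of identical positions between two pre-aligned repeats."""
    if len(a) != len(b):
        raise ValueError("repeats must be aligned to equal length")
    matches = sum(x == y for x, y in zip(a, b))
    return 100.0 * matches / len(a)


def mean_pairwise_identity(repeats):
    """Average percent identity over every pair of repeats."""
    pairs = list(combinations(repeats, 2))
    return sum(percent_identity(a, b) for a, b in pairs) / len(pairs)


# Arithmetic from the text: on average 15 mismatches per 60-bp repeat.
identity_from_text = 100.0 * (60 - 15) / 60

# Hypothetical toy repeats, used only to exercise the functions.
toy_repeats = ["ACGTACGT", "ACGTACGA", "ACGAACGT"]
toy_mean = mean_pairwise_identity(toy_repeats)
```

Repeats that carry the extra codon (63 bp rather than 60 bp) would need to be gapped to a common length before such a comparison.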
Although the mouse Muc-1 gene is not polymorphic, sequence comparison of repeats provides very strong evidence for the gene having at one time undergone one or more unequal exchange events. For instance, five of the 16 repeats possess an extra codon, in each case at the same position within the repeat (Fig. 9), and several copies of this length variant could only have been generated by a process of duplication. Indeed, one such duplication event can be quite finely localized to have occurred between the ends of ancestral repeats numbers one and six, resulting in a duplication of five repeats (Fig. 10). From the number of mismatches over this region, eight out of 204 (or, extrapolating, approximately 40/kilobase pair), we estimate that this duplication occurred approximately 8-10 million years ago (assuming an average mutation rate of approximately 1 amino acid substitution/kbp every 200,000 years (Alberts et al., 1989)). It may be that the accumulation of five repeats containing an extra codon exerted pairing constraints on any subsequent misaligned duplexes and thus reduced the rate of unequal exchange to such an extent that base substitutions accumulated and the level of repeat homology dropped to a point where polymorphism was lost. The rate of nucleotide substitution in the rodents has been found to be significantly higher than that observed in humans or bovine (Wu and Li, 1985), but whether or not this is a factor in the rodent gene losing polymorphism is as yet unknown.
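The age estimate for the duplication can be reproduced as simple arithmetic from the figures given in the text. The sketch below takes the quoted rate (one substitution per kbp every 200,000 years) at face value; the function and variable names are hypothetical.

```python
def duplication_age_years(mismatches, region_bp, years_per_sub_per_kbp):
    """Scale the observed mismatches to a per-kbp density, then convert
    that density to elapsed time using the assumed substitution rate."""
    subs_per_kbp = mismatches / region_bp * 1000
    return subs_per_kbp * years_per_sub_per_kbp


# Values stated in the text.
MISMATCHES = 8
REGION_BP = 204
YEARS_PER_SUB_PER_KBP = 200_000

# Mismatch density extrapolated to one kilobase pair (about 39, i.e. ~40).
density_per_kbp = MISMATCHES / REGION_BP * 1000

# Age implied by the unrounded density.
age_years = duplication_age_years(MISMATCHES, REGION_BP, YEARS_PER_SUB_PER_KBP)

# Age implied by the rounded density of 40/kbp used in the text.
age_rounded = 40 * YEARS_PER_SUB_PER_KBP
```

With the rounded density this gives roughly 8 million years, the lower end of the range quoted above.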
In other large structures with a high content of O-linked carbohydrate, exact repeats of short stretches of amino acids have also been found to occur. This is the case for the human intestinal mucin, MUC2 (Gum et al., 1989), the porcine submaxillary gland mucin (Timpte et al., 1988), and the two Xenopus integumentary mucins (Hoffmann, 1988; Probst et al., 1990). These proteins show a large variation in size, and in the case of the human intestinal and Xenopus integumentary mucins this has been found to be due to a repeat polymorphism similar to that observed for the human MUC1 gene (Griffiths et al., 1990; Hauser et al., 1990). Gendler et al. (1990) suggest that variations in size mean that length is not crucial to the function, but rather that the core exists in an extended form as a scaffold for the attachment of O-linked carbohydrate. We hypothesized that one of the factors that might be acting to regulate the evolutionary divergence of mucin repeat sequences is the maintenance of these potential O-glycosylation sites. The amino acid alignment of the human and mouse mucin repeat domains suggests that this may indeed be a factor, with greater than 80% of the conserved amino acids in the repeat domain being serine/threonine or proline. Maintenance of the potential O-glycosylation sites within the repeats suggests that the attached carbohydrate side chains are an important functional part of the external domain of this protein. This idea is reinforced by the observation that where antibodies directed to the human mucin core protein show no cross-reactivity with the mouse mammary mucin, antibodies directed to the carbohydrate moieties of the human mammary mucin do cross-react.⁴ The recently cloned human pancreatic mucin (Lan et al., 1990), which is identical in sequence to MUC1, has been shown to be differently glycosylated as compared with the mammary mucin (Khorrami et al., 1989).
From these different lines of evidence we can envision a system in which the external repeat domain is glycosylated in a manner specific to its function or the glycosylation machinery in that particular tissue and the transmembrane and cytoplasmic tail domains act to localize and maintain the apical membrane distribution of the mucin in the epithelial cells by way of an interaction with the actin cytoskeleton.
Northern blots and antibody staining with antibodies directed to the human MUC1 cytoplasmic tail³ indicate that the pattern of expression of the mouse mucin is very similar to that observed in the human. An analysis of the two promoter regions within 500 bp of the TATAA box indicated a very high level of similarity, greater than 70% (Fig. 6). However, potential responsive elements described for this region of the human promoter do not appear to be conserved, except for potential binding sites for Sp1. In particular, the potential estrogen response element identified by Tsarfaty et al. (1990) is absent, as is the potential sequence in the first intron which shows homology to a murine cellular enhancer. Potential progesterone-responsive elements and glucocorticoid-responsive elements described by Lancaster et al. (1990) are also not conserved in the mouse homologue. However, a region immediately upstream of the TATAA box located between two potential Sp1-binding sites (-92 to -34) was found to be almost 100% conserved. A search of the KeyBank promoter and enhancer element sequences found no significant matches within this region. The possibility therefore exists that this region may be playing a role in the tissue-specific expression of this gene.

⁴ L. Pemberton, personal communication.
Clones of the mouse homologue of the tumor-associated mucin, MUC1, will allow us to investigate the expression of this gene during embryonic and mammary gland development using in situ hybridization and immunohistochemistry. It is hoped that they will also allow us to elucidate the function of this protein through gene-targeting experiments in mouse embryonic stem cells.