Episialin, a Carcinoma-associated Mucin, Generated by a Polymorphic Gene Encoding Splice Variants with Alternative Amino Termini*

Episialin is a mucin-type glycoprotein present at the luminal side of most glandular epithelial cells. We have isolated cDNA clones encoding episialin and determined the structure of the gene. The gene encodes a transmembrane protein which consists, for the greater part, of tandem repeats of 20 amino acids. The number of these repeats varies between 40 and 90 among different alleles. The repeats and most of the remainder of the protein are very rich in potential O-linked glycosylation sites. Two different splice variants were found. The proteins encoded by these two variants differ in their signal sequences and in the extreme amino-terminal parts of the mature proteins, suggesting alternative processing of the two forms.

Episialin, previously called MAM-6 by our group, is a mucin which is abundantly present at the apical surface of epithelial cells. An increase in episialin expression is often found in carcinoma cells, where the glycoprotein is detected both intracellularly and on the entire cell surface, including the membranes lining the adjacent cells (Hilkens et al., 1984). The mucin has an apparent molecular mass of over 400 kDa on SDS-polyacrylamide gel electrophoresis. It contains many sialic acid residues, which results in a high net negative charge of the molecule (Hilkens and Buys, 1988). In combination with its high level of expression, this negative charge of episialin could have a significant effect on cell-cell interactions of carcinoma cells.
Previously, episialin has been defined by five monoclonal antibodies directed against different epitopes on the antigen (Hilkens et al., 1984, 1985). These antibodies detect episialin molecules of either one or two size classes in all episialin-synthesizing cell lines (Hilkens et al., 1989). The size of these molecules varies significantly between different individuals or cell lines. It has been shown that this variation in size is caused by a genetic polymorphism, which is inherited in a Mendelian fashion (Swallow et al., 1987; Hayes et al., 1988).
The biosynthesis of episialin in mammary tumor cell lines has been studied extensively (Hilkens and Buys, 1988; Linsley et al., 1988). In ZR-75-1 cells, the molecular weights of the protein backbones of the two allelic forms were estimated to be 160,000 and 310,000, respectively, which is much lower than those of the mature glycoproteins.
* The costs of publication of this article were defrayed in part by the payment of page charges.
We have shown that the difference in molecular weight between the mature molecules and their precursors is caused by extensive glycosylation (Hilkens and Buys, 1988). Since most of the carbohydrate side chains are attached to the protein backbone by O-glycosidic linkages to threonine or serine residues, episialin is classified as a mucin.
Mucins are glycoproteins that are difficult to study at the protein level due to their large size and their high number of glycosidic side chains. Therefore, molecular cloning of mucin genes is required to elucidate the complex structure and the biological function of these glycoproteins.
However, no complete sequence for any mucin has been reported thus far. Here we present the full-length cDNA sequence and genomic organization of the episialin gene. Two splice variants are described here that differ from each other in their signal sequence and the region encoding the mature amino terminus, suggesting different processing of these two gene products.
[...] (Hilkens et al., 1984); monoclonal antibodies 139H2 and 140C1 were generated against primary breast cancer membranes (Hilkens et al., 1985). Monoclonal antibody SM3 was raised against the deglycosylated human milk fat globule mucin and was kindly donated by Dr. J. Taylor-Papadimitriou and Dr. S. Gendler (Imperial Cancer Research Fund, London).
Immunoprecipitation and Southern and Northern Blotting-Labeling of cells with [3H]glucosamine and immunoprecipitation of episialin from cell lysates have been described before (Hilkens et al., 1984). DNA was isolated, separated on agarose gels, and transferred to nitrocellulose according to standard procedures. Unless otherwise stated, RNA was isolated using the lithium chloride/urea technique. Poly(A)+ RNA was isolated by chromatography using oligo(dT)-cellulose, separated on formaldehyde gels, and transferred to nitrocellulose as described (Davis et al., 1986).
Analysis of Inserts-The cDNA and genomic inserts of interest were cloned into the EcoRI site of the plasmid vector pEMBL8 (Dente et al., 1983) according to standard procedures. Subclones of these inserts were generated by cloning of restriction fragments of the insert in pEMBL18 or pEMBL19. Either double- or single-stranded DNA derived from these subclones was sequenced using the dideoxy chain termination method (Sanger et al., 1977) with either the M13 sequencing primer or oligonucleotide primers directed against sequences within the insert. Using the strategy as indicated in Fig. 1, over 95% of the cDNA sequence was determined on both strands.

RESULTS
Cloning of Repeat Region-The monoclonal antibodies SM3, 139H2, 140C1, and 115F5 directed against episialin were used to screen a λgt11 expression library of the human breast carcinoma cell line T47D. A cDNA clone of 1200 bp, which yielded a product that was specifically reactive with all of these monoclonal antibodies, was obtained. Sequence determination from both ends of the cDNA insert revealed that it contained multiple perfectly repeated units of 60 bp, of which 84% were G and C residues. In every repeat unit, one SmaI restriction site was present. SmaI digestion established that the entire 1200-bp insert consisted of these tandem repeats only, since no fragments larger than 60 bp could be observed. The cDNA insert was used as a probe on Northern blots carrying RNA from various cell lines. A hybridization signal was detected only with RNA of cell lines that express episialin, as determined using monoclonal antibodies. These results indicate that the cDNA clone encodes part of the episialin gene product.
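The logic of the SmaI digestion can be illustrated with a short sketch. The 60-bp unit below is a made-up placeholder that merely contains one SmaI site (CCCGGG); it is not the episialin repeat sequence, and the function name is hypothetical. The sketch only shows why an insert built purely from such tandem units cannot yield a fragment longer than 60 bp.

```python
SMAI_SITE = "CCCGGG"  # SmaI recognition sequence; blunt cut between CCC and GGG

def smai_fragment_lengths(seq: str) -> list:
    """Return the lengths of the fragments produced by cutting a linear
    sequence at every SmaI site (CCC^GGG)."""
    cuts = []
    pos = seq.find(SMAI_SITE)
    while pos != -1:
        cuts.append(pos + 3)              # cut falls after the third C
        pos = seq.find(SMAI_SITE, pos + 1)
    bounds = [0] + cuts + [len(seq)]
    return [b - a for a, b in zip(bounds, bounds[1:])]

# Hypothetical 60-bp unit with exactly one SmaI site (NOT the real repeat).
UNIT = SMAI_SITE + "AGCTGCACCT" * 5 + "GCAT"
INSERT = UNIT * 20                        # 1200-bp insert of 20 tandem units

fragments = smai_fragment_lengths(INSERT)
print(len(INSERT), max(fragments), sorted(set(fragments)))
```

Any non-repetitive stretch without a SmaI site that was longer than 60 bp would show up as a larger fragment, which is the observation the digestion rules out.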
In each of the episialin-expressing cell lines, mRNAs of either one or two size classes were present, varying in length between 4 and 7 kb (Fig. 2B). Digestion of DNA of these cell lines by a restriction enzyme that does not have a recognition site in the repeats yielded fragments with a similar variation in length (e.g. AluI; Fig. 2C). A correlation can be observed between the apparent molecular weights of episialin immunoprecipitated from [3H]glucosamine-labeled cell lines (Fig. 2A) and both the lengths of the mRNAs and the sizes of the hybridizing DNA restriction fragments. These results suggest that at least part of the variation in the molecular weight of this mucin is caused by allelic variation in the length of its gene.
To analyze the polymorphism in more detail, we constructed a restriction map of the region surrounding the repeats in T47D DNA (Fig. 3). Restriction enzymes that do not have a recognition site in the 60-bp repeat do not cut in the region that hybridizes with the repeat probe, suggesting that all the repeats are present in an uninterrupted tandem array located on a single exon. The length of this repetitive region differs between the two alleles by about 2.5 kb. In contrast, the restriction enzyme patterns of the regions on either side of the repeat units are identical for both alleles, showing that the polymorphism is caused by differences in the number of repeats.
Cloning cDNA Flanking Repeat Region-The lengths of the episialin mRNAs in T47D cells detected by the repeat probe are about 4.1 and 6.7 kb. The lengths of the repeat regions in each of these mRNAs were estimated using the corresponding genomic restriction map. In the small allele, the repeat region covers nearly 2.3 kb, representing almost 40 repeat units. In the long allele, the length of the repeat region is approximately [...] stream of the repeat region. When the nonrepetitive part of one of these cDNAs was used as a probe on Northern blots, the same hybridization pattern was obtained as was observed for the repeat probe (data not shown). cDNA clones containing the region 3' of the repeats were isolated from an oligo(dT)-primed cDNA library, which was screened with the repetitive probe and genomic PstI fragments isolated from the genomic EcoRI clone (see below). Five cDNA inserts were analyzed containing identical regions of 1311 bp downstream of the repeat region. The combined length of the sequences on either side of the repeat region is about 1770 bp, which comes close to the estimated length of the nonrepetitive region (1.8 kb; see above).
Genomic Organization of Episialin Coding Region-The genomic EcoRI fragment, on which the repeat region of the small allele of the episialin gene in T47D cells is located, was cloned. The restriction map of this clone completely corresponds to the genomic map (Fig. 3) that was constructed using uncloned T47D DNA. The sequence of part of the cloned fragment was determined and compared with the cDNA sequence. The deduced organization of exon and intron sequences is schematically shown in Fig. 4A. The second exon contains the entire repeat region and is extremely long, whereas 3' of this exon, several exons are found that are separated from each other by small introns. The last exon (exon 7) is encoded by another genomic EcoRI fragment (data not shown).
cDNA Sequence-Two different splice variants (A and B) have been obtained (Fig. 4). The variants are composites since we did not obtain cDNA clones spanning the entire repeat region. The flanking clones reached into the 5' or 3' ends of the repeat region. The combined sequence data as well as the deduced amino acid sequence are shown in Fig. 5. The sequence of a large number of cDNA clones containing part of the repeat region has been determined. Most of the repeat units were represented by the consensus sequence depicted in Fig. 5. Only a few repeats showed nucleotide changes, some of which also affected the amino acid sequence. It is remarkable that on either side of the repeat region the repeat sequence gradually degenerates.
In both splice variants, a start codon surrounded by a strong Kozak consensus sequence (Kozak, 1986) is preceded by a leader sequence of 50 nucleotides. This start codon is followed by a potential signal peptide, which crosses the boundary between exon 1 and exon 2 and is affected by the alternative splicing (see below). The longest open reading frame found in the cDNA continues until 328 codons downstream of the last SmaI restriction site in the repeat region. A typical transmembrane sequence of at least 24 amino acids is found in the carboxyl-terminal part, followed by a cytoplasmic domain of 69 amino acids. A polyadenylation signal is present 283 nucleotides 3' of the translational stop codon, resulting in poly(A) addition either 16 or 20 nucleotides further downstream. The length of the open reading frame of this small episialin allele is 1264 amino acids, 800 amino acids of which are located in the repeat region.
The repeats, which even in the smallest allele detected thus far constitute more than half of the polypeptide backbone, consist of 25% serine and threonine residues, which are potential O-linked glycosylation sites. The nonrepetitive part of the protein backbone has a high content (27%) of serine and threonine residues as well. Apart from these residues, the repeats also contain a high number of proline residues (25%). Five potential N-linked glycosylation sites (Asn-X-Ser/Thr) are present between the repeats and the potential transmembrane sequence.
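The composition figures for the repeat can be checked with a few lines of code. The 20-residue unit used below is an assumption: it is the repeat sequence commonly quoted for this mucin in later literature, and it is not spelled out in the text above (the consensus is in Fig. 5). The function name is likewise hypothetical.

```python
def residue_fraction(peptide: str, residues: str) -> float:
    """Fraction of positions in `peptide` occupied by any residue in `residues`
    (one-letter amino acid codes)."""
    return sum(peptide.count(r) for r in residues) / len(peptide)

# Assumption: one 20-amino-acid repeat unit as commonly reported for this
# mucin in later literature; not taken from the text above.
REPEAT_UNIT = "PDTRPAPGSTAPPAHGVTSA"

ser_thr = residue_fraction(REPEAT_UNIT, "ST")  # potential O-glycosylation sites
pro = residue_fraction(REPEAT_UNIT, "P")

print(f"Ser+Thr: {ser_thr:.0%}, Pro: {pro:.0%}")
```

Under this assumption, five of the 20 positions are serine or threonine and five are proline, which agrees with the 25% figures given above.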
We did not detect any significant homology of the full-length cDNA molecule or its protein product with other known sequences present in GenBank Release 60.0.
Splice Variants-As indicated in Fig. 5, the two splice variants closely resemble each other. The only difference is the use of two alternative splice acceptor sites for exon 2 that are located 27 bp apart (Fig. 4B). Therefore, the length of the putative translation products differs by only 9 amino acids. However, the alternative splicing event affects the signal sequences of the gene products. According to the predictive method of Von Heijne (1986), the signal peptide of variant A is 22 amino acids long and is cleaved between the threonine and alanine residues present in the additional 9 amino acids (score 11.11). The cleavage site of variant B is located between the glycine and serine residues 23 amino acids downstream of the translational start codon (score 12.75), resulting in a different amino terminus of the mature glycoprotein.
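The step from a 27-bp difference between the acceptor sites to a 9-amino-acid difference in the products rests on the shift being a multiple of three, so that the downstream reading frame is unchanged. A minimal sketch of that arithmetic, with a hypothetical helper name:

```python
from typing import Optional

def added_codons(acceptor_shift_bp: int) -> Optional[int]:
    """Number of whole codons gained when the upstream acceptor site is used.

    Returns None when the shift is not a multiple of 3, because the
    downstream reading frame would then change."""
    if acceptor_shift_bp % 3 != 0:
        return None
    return acceptor_shift_bp // 3

print(added_codons(27))  # shift between the two exon 2 acceptor sites
```

A shift of 26 or 28 bp would instead return None here, meaning that the two variants would no longer share the repeat region and carboxyl-terminal sequence in the same frame.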
To determine the relative amounts of these splice variants in T47D cells, we have hybridized 28 cDNA clones, which were shown to contain both exon 1 and the common region in exon 2, with an oligonucleotide probe recognizing the additional 27 nucleotides present in variant A. Four out of these 28 cDNA clones hybridized with this oligonucleotide probe. This indicates that in T47D cells variant A containing these additional 27 nucleotides is less abundant than variant B.
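As a rough guide to how firmly 4 positive clones out of 28 supports this conclusion, one can attach a binomial confidence interval to the observed fraction. This calculation is an addition for illustration and is not part of the original analysis; it assumes that the 28 clones are an independent, unbiased sample of the transcripts, and it uses the Wilson score interval with z = 1.96.

```python
import math

def wilson_interval(successes: int, n: int, z: float = 1.96):
    """Wilson score interval for a binomial proportion (about 95% for z = 1.96)."""
    p = successes / n
    denom = 1 + z ** 2 / n
    center = (p + z ** 2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z ** 2 / (4 * n ** 2)) / denom
    return center - half, center + half

fraction_a = 4 / 28                     # clones hybridizing with the variant A probe
low, high = wilson_interval(4, 28)
print(f"variant A: {fraction_a:.1%} (interval {low:.1%} to {high:.1%})")
```

Under these assumptions the upper bound stays well below one half, in line with variant A being the minor form in T47D cells.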

DISCUSSION
We have obtained cDNA clones that, together with the genomic clone containing all the repeats, span the complete coding region of the epithelial mucin episialin. Clones containing the 60-bp repeats have been isolated by other groups using monoclonal antibody SM3 or DF3 (Siddiqui et al., 1988), which are directed against the same epithelial mucin. The nucleotide sequence of the repeats of these cDNA clones (Gendler et al., 1988; Siddiqui et al., 1988) is identical to the sequence we have determined, although the sequence of Siddiqui et al. is published in the reverse orientation. We have established the coding strand of our cDNA by hybridization of a Northern blot with oligonucleotides deduced from either of the two strands of the repeats (data not shown). [...] reported by Gendler et al. (1988). The sequence 3' of the repeats in our clone is completely different from the one reported by Gendler et al. (1988). When we compare the latter with our genomic sequence, alternative splicing can be excluded as an explanation for this difference on the basis of the absence of a splice donor site immediately 3' of the repeats. Restriction fragments derived from various parts of the cDNA described here recognize the same pattern on Northern blots, indicating that they all hybridize with the episialin mRNAs. Therefore, the cDNA sequence presented in this paper fully represents episialin RNA.
It has been found by others (Karlsson et al., 1983; Swallow et al., 1987; Hayes et al., 1988) that the variation in length of the episialin glycoproteins has a genetic origin, but the molecular basis for this phenomenon had not yet been established. When we compare the restriction maps of the long and short alleles of the episialin gene in T47D cells, the difference in length is located entirely between the AluI sites that are located about 120 bp up- and downstream of the repeat region. The length of these AluI fragments in other cell lines varies between 2.4 and 5.4 kb, showing that the polymorphism is due to different numbers of repeats. As can be estimated from the length of the AluI fragments that hybridize with the repeat probe, the number of repeats varies between about 40 and 90 among the different cell lines. Because of the large number of potential O-linked glycosylation sites in the repeats, it is likely that the repeats mainly serve as a carrier for these carbohydrate side chains. Recently, partial cDNA clones of other mucins have been reported. Both the porcine submaxillary gland apomucin (Timpte et al., 1988) and the human intestinal mucin (Gum et al., 1989) contain tandem repeats with a high content of serine and threonine residues, suggesting that repeat units as carriers for carbohydrate side chains are common structures in mucins.
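The estimate of the repeat number from the AluI fragment lengths can be sketched as follows. The calculation assumes that each fragment consists of the repeat array plus roughly 120 bp of flanking sequence on either side and that every repeat is exactly 60 bp; the function and constant names are hypothetical.

```python
REPEAT_BP = 60    # length of one repeat unit
FLANK_BP = 120    # approximate distance from each AluI site to the repeat array

def estimate_repeats(alui_fragment_bp: int) -> int:
    """Estimate the number of tandem repeats in an AluI fragment that
    hybridizes with the repeat probe."""
    return round((alui_fragment_bp - 2 * FLANK_BP) / REPEAT_BP)

for fragment_bp in (2400, 5400):      # smallest and largest fragments observed
    print(fragment_bp, estimate_repeats(fragment_bp))
```

With these inputs the sketch gives 36 and 86 repeats. Given that the fragment sizes are read from gels and the flank length is approximate, this is compatible with the rounded range of about 40 to 90 repeats stated above.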
Apart from a high number of O-linked glycosylation sites, the protein backbone also contains five potential N-linked glycosylation sites. Although it was generally assumed that mucins would contain only O-linked glycosidic side chains, the occurrence of N-linked glycosylation of episialin had already been established in the structural analysis and biosynthesis studies of episialin (Hilkens and Buys, 1988; Linsley et al., 1988; Abe and Kufe, 1989).
Based on the existing rules, the signal sequences of both variants A and B should be active (Von Heijne, 1986). Therefore, the alternative splicing event not only gives rise to an alternative signal sequence, but also generates another amino terminus of the mature glycoprotein. This region may be important for the routing of the protein through the cell, perhaps resulting in different glycosylation pathways. This could explain the different cellular localizations of episialin that are detected in carcinoma cells (Kufe et al., 1984; Hilkens et al., 1984) and the distinct subpopulations of episialin that are found in sequential precipitation studies (Hilkens et al., 1989).
A potential transmembrane sequence is observed in the carboxyl-terminal part of the protein backbone of both splice variants. The presence of a potential transmembrane region in episialin was not expected since episialin has been detected in the spent medium of episialin-producing cell lines and in the serum of breast cancer patients (Hayes et al., 1986; Hilkens et al., 1986). Whether the glycoprotein is released from the cell surface by an extracellular protease activity or is excreted in a membrane-bound form is not yet known.
The only 3 cysteine residues occurring in the translation product are present in the transmembrane region. These residues are probably used for the addition of fatty acids to the molecule as has been reported before for other transmembrane molecules (for review, see Magee et al., 1989). This acylation may stabilize the binding of the mucin in the membrane. Since episialin does not contain cysteine residues in the extracellular domain, it cannot form oligomers via disulfide bonds. This confirms our previous finding that reducing conditions have no major influence on the mobility of the glycoproteins on SDS-polyacrylamide gels (Hilkens et al., 1989). This lack of formation of disulfide bonds is in contrast with the oligomerization of the gel-forming mucins that are present in secretions from specialized epithelial cells in many exocrine tissues, such as the salivary gland (for review, see Hilkens, 1988). The absence of disulfide bonds and its tissue distribution distinguish episialin from these mucins.
The availability of the complete cDNA of episialin allows us to perform transfection studies to investigate the biological function of this mucin and to determine whether the different signal sequences can affect its subcellular distribution and glycosylation.