Molecular Cloning of Human Intestinal Mucin cDNAs

A human small intestine λgt11 cDNA library was screened using antisera prepared against the deglycosylated protein backbone of human colon cancer xenograft mucin. Three cDNAs were isolated from this screening, designated SMUC 40-42. These cDNAs were all found to contain tandem repeats of 69 nucleotides which encoded a threonine- and proline-rich protein consensus sequence of PTTTPITTTTTVTPTPTPTGTQT. RNA blots probed with one of these cDNAs, SMUC 41, exhibited large, polydisperse hybridization bands at ~7,600 bases. Band intensities were strongest when human small intestine, colon, and colon cancer poly(A)+ RNA was used. In vitro translation of poly(A)+ RNA from human small intestine, colon, and colon cancer cells produced a 162,000-dalton peptide that was immunoprecipitated with antibodies to deglycosylated mucin. SMUC 41 was also used to probe DNA blots, which indicated the presence of restriction fragment length polymorphisms in the intestinal mucin gene. These findings may be important in assessing the abnormal mucins found associated with several human diseases.

The nucleotide sequence(s) reported in this paper has been submitted to the GenBank™/EMBL Data Bank with accession number(s) J04638.
§ Associate Investigator of the Veterans Administration.
‖ Medical Investigator of the Veterans Administration. To whom reprint requests should be addressed.

colonic mucin have been recently characterized (5, 6), little is known about the structure and amino acid sequence of the protein core of this high molecular weight glycoconjugate.
Several human diseases have been observed to be associated with alterations in intestinal mucins. These include cystic fibrosis, familial polyposis coli, ulcerative colitis, and colon cancer. Patients with cystic fibrosis produce excessive amounts of mucin in their gastrointestinal, respiratory, and reproductive tracts, whereas patients with familial polyposis coli, ulcerative colitis, and colon cancer produce mucins that are abnormally glycosylated (2-4, 7, 8). Hence, a better understanding of the molecular genetics and biosynthesis of mucin may provide insight into the pathogenesis, diagnosis, and treatment of several important human diseases.
In order to examine further the structure, biosynthesis, and genetics of intestinal mucin, we sought to clone cDNAs that encode the mucin protein backbone. This was achieved in the present study using antibodies to deglycosylated colon cancer xenograft mucin and a small intestine λgt11 expression library. The resulting cDNAs indicate that this mucin contains threonine- and proline-rich regions consisting of tandem repeats of 23 amino acids each. Furthermore, these cDNAs enabled us to identify the mucin message produced in various cell lines and tissues and to determine that the intestinal mucin gene is genetically polymorphic.

Purification of Mucin and Production of Antibodies to Deglycosylated Mucin-Mucin was purified from LS174T human colon cancer cell tumors (grown in nude mice) using gel filtration and CsCl density gradient centrifugation. This mucin had an amino acid composition that was 29% threonine, 14% serine, and 15% proline, similar to that found previously for human intestinal mucin (1-3). Details of the LS174T cell mucin purification and characterization are published elsewhere (9). The purified mucin was deglycosylated by treatment with hydrogen fluoride under anhydrous conditions for 1 h at 0 °C (to give HFA) or 3 h at room temperature (to give HFB) (10). Compositional analysis indicated that almost all (~98%) of the sugar had been removed from HFB but that HFA still contained ~10% of its original content of GlcNAc and ~75% of its GalNAc content.
Antibodies were prepared in New Zealand White rabbits against HFA, HFB, or native mucin using three or four subcutaneous injections of 50-100 μg of antigen. Enzyme-linked immunosorbent assays indicated that all immunogens elicited antibodies (10).
Library Screening and Lysogen Preparation-A human jejunal cDNA library constructed in the λgt11 expression vector was obtained from Dr. Yvonne Edwards (Medical Research Council, Human Biochemical Genetics Unit, University College London, London, United Kingdom) (11). This library was plated in soft agar at a density of 25,000 plaques/150-mm plate as described (12). The plates were incubated at 37 °C until plaques began to appear and were then overlaid with isopropyl β-D-thiogalactopyranoside-saturated nitrocellulose membranes and incubated for an additional 3 h. The membranes were then removed and immunoscreened using anti-HFB serum at a 1:50 dilution and horseradish peroxidase-conjugated goat anti-rabbit IgG (Tago) using previously described methods (13). Positives were purified to clonality by successive rounds of rescreening, phage DNA was isolated, and inserts were recovered by EcoRI digestion. Lysogenation of Escherichia coli strain Y1089(r-) (Promega Biotec) and lysate preparation were performed as described (12).
The abbreviations used are: HFA and HFB, preparations of LS174T xenograft mucin deglycosylated with hydrogen fluoride as described in the text; MRP, a synthetic peptide with the mucin repeat sequence; BSA, bovine serum albumin; bp, base pairs; kb, kilobase pairs; SSC, standard saline citrate; pfu, plaque-forming unit.
DNA Sequencing-All sequencing was done using M13mp18 and M13mp19 vectors. Single-stranded templates were prepared, and dideoxynucleotide sequencing was performed using modified T7 DNA polymerase (United States Biochemical Corp.) (14). For each cDNA, both strands were sequenced in their entirety. DNA sequences were assembled and analyzed using DNA and Protein Sequence Analysis software purchased from International Biotechnologies, Inc.
DNA, RNA, and Protein Blot Analysis and in Vitro Translation-RNA purification and poly(A)+ RNA isolation, gel electrophoresis, transfer to nylon membranes, and hybridization probe analysis were conducted as described (15). Protein immunoblots were performed using a 1:50 dilution of antibody (15). High molecular weight DNA was prepared using proteinase K, RNase A, and phenol as described by Blin and Stafford (16). This material was digested with restriction enzymes, and the fragments were separated by electrophoresis in 1% agarose gels using a buffer containing 100 mM Tris, 100 mM boric acid, and 2 mM EDTA, pH 8.0 (17). The gels were soaked for 30 min in 1.5 M NaCl, 0.5 M NaOH, and then in 3 M NaOAc, pH 5.5, for the same period of time. Transfer to nylon membranes, hybridization, and washing then proceeded as described above for RNA blots. In vitro translations and immunoprecipitations were performed as described (15) except that (a) [35S]cysteine (Amersham Corp.) was used as the radioactive amino acid in conjunction with a cysteine-free translation mixture, (b) 0.45 μg of poly(A)+ RNA was used in a final reaction volume of 39 μl, (c) the carrier lysate for immunoprecipitation also contained 10 ng/ml of added HFB, and (d) 3 μl of control serum or 3 μl of anti-HFB serum was used.
Synthetic Peptide (MRP) and Antibody Preparation-A peptide with the sequence KYPTTTPISTTTMVTPTPTPTGTQT was prepared using an Applied Biosystems model 430A peptide synthesizer by Joel Boymel of the National Jewish Center for Immunology and Respiratory Medicine, Denver, CO. The final 23 residues of this peptide represent the sequence of the first repeat of SMUC 40; the initial K and Y residues were added to allow glutaraldehyde conjugation and radioiodination (for future studies), respectively. For antibody production, 1 mg of peptide was emulsified in complete Freund's adjuvant and injected intradermally at multiple sites into a female New Zealand rabbit (18). Three weeks later a second set of injections using 0.5 mg of peptide in incomplete adjuvant was administered, and the rabbit was bled 12 days later and serum prepared.
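The relationship between the synthetic peptide and the repeat consensus can be checked with a short sketch. It separates the two added amino-terminal residues from the 23-residue repeat and compares that repeat, position by position, with the consensus sequence reported in the abstract. The function name `percent_identity` and the variable names are hypothetical, introduced here only for illustration; the two sequences are copied from the text.

```python
# Sequences copied from the text; all names below are illustrative only.
MRP = "KYPTTTPISTTTMVTPTPTPTGTQT"      # synthetic peptide (K and Y added)
CONSENSUS = "PTTTPITTTTTVTPTPTPTGTQT"  # 23-residue consensus repeat


def percent_identity(a: str, b: str) -> float:
    """Percent of identical residues for two equal-length, ungapped sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    matches = sum(x == y for x, y in zip(a, b))
    return 100.0 * matches / len(a)


# Split off the two residues added for conjugation and radioiodination.
linker, repeat = MRP[:2], MRP[2:]

# 1-based positions where the first SMUC 40 repeat departs from the consensus,
# reported as (position, consensus residue, peptide residue).
mismatches = [
    (i + 1, c, r) for i, (c, r) in enumerate(zip(CONSENSUS, repeat)) if c != r
]

identity = round(percent_identity(repeat, CONSENSUS), 1)
print(linker, len(repeat), mismatches, identity)
```

Under this comparison the peptide differs from the consensus at two positions, a serine and a methionine in place of threonine, so antibodies raised against it would be expected to recognize a sequence close to, but not identical with, the consensus repeat.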

RESULTS
Isolation of Intestinal Mucin cDNAs-Because the protein backbone of intestinal mucin is so heavily laden with oligosaccharide chains it is difficult to characterize biochemically. The conditions required to remove the carbohydrate result in breakage of the protein backbone. Thus, it is impractical to obtain information pertaining to the primary structure of intestinal mucin by conventional peptide sequencing. In order to acquire this structural information, we therefore decided to clone and sequence intestinal mucin cDNAs.
Antibodies prepared against HFB were used to screen the intestinal cDNA library and three positives were obtained from a screening of 230,000 recombinant plaques. These clones, which were designated SMUC 40, SMUC 41, and SMUC 42, were purified and tested for antigenicity using anti-HFA, anti-HFB, and anti-native mucin as shown in Fig. 1. Only antisera against the completely deglycosylated HFB produced positive plaques in this experiment. Fig. 2 shows immunoblot analysis of the β-galactosidase fusion proteins produced by lysogens of these recombinants. Anti-HFB reacts strongly with the fusion proteins produced by SMUC 40-42.
FIG. 1. Reactivity of SMUC 40 plaques with antibodies prepared to deglycosylated and native mucin. SMUC 40 (200 pfu/100-mm plate) was plated on E. coli strain Y1090(r-) in soft agar and incubated at 37 °C for 2.5 h. The four plates were then overlaid with isopropyl β-D-thiogalactopyranoside-saturated nitrocellulose membranes, and incubation was continued for 3 h more. The membranes were then assayed for the presence of antibody-reactive plaques using 1:50 dilutions of antisera to HFA, HFB, and native (non-deglycosylated) mucin. The control was serum from a nonimmunized rabbit.
These experiments indicate that these clones produce recombinant fusion proteins that are recognized by antisera against deglycosylated mucin but not by antisera against native mucin. Thus, the fusion proteins apparently contain epitopes that do not function as immunogens when mucin is injected into rabbits unless the mucin is first deglycosylated, providing evidence that these cDNAs encode the normally covered mucin protein backbone.
Sequence Analysis of Mucin cDNAs-The recombinant phage DNA was digested with EcoRI, and each clone was found to contain a single, unique insert. Sequence analysis of the terminal regions of these clones indicated that they all contained repetitive sequences. Exonuclease III was then used to generate partially deleted clones for sequence analysis of the interior regions of these cDNAs (19). This made it possible to correlate the region of the cDNA sequenced from each template with the length of the deletion, information necessary to avoid confusion caused by not knowing which repeat unit was being sequenced (20). Details of the sequencing strategy used are given in Fig. 3.
Each of these clones was found to contain tandem repeats of 69 nucleotides (Fig. 4). In fact, only the 5'-terminal 71 nucleotides of SMUC 42 and the 3'-terminal 471 nucleotides of SMUC 41 can be clearly identified as not consisting of these repeat units. Thus, anti-HFB serum appears to be strongly immunoreactive with the protein encoded by the 69-bp repetitive element. The amino acid sequences deduced for each of these tandem repeats are shown in Fig. 5. The 23-amino acid consensus sequence of these repeat units contains 14 threonine and 5 proline residues, including a group of five consecutive threonines and a stretch containing three threonine-proline direct repeats. The 14 repetitive units contained in the three partial cDNA clones isolated in this study have 90% overall sequence identity with the consensus sequence shown in Fig. 5. Even more conserved is the 12-amino acid stretch enclosed in the box in Fig. 5, which exhibits 98% overall sequence identity with the consensus sequence. Only 11 serine residues are found dispersed among these 14 tandem repeats, and nine of them occur as substitutions for threonine in the consensus sequence. On the other hand, the carboxyl-terminal 157-amino acid region deduced from the 3'-terminal 471 nucleotides of clone SMUC 41 (which does not consist of the tandem repeats) contains 25 serine residues. Hence, it appears that the majority of serine residues in intestinal mucin are clustered in regions other than the tandem repeats. The 3'-terminal region of SMUC 41 also contains the only cysteine, present as a Cys-Cys dipeptide, and most of the aromatic amino acids. Two potential N-glycosylation recognition sites are encoded in the sequences presented here, one in the last repeat unit of SMUC 40 and one near the 3'-terminal of SMUC 41.
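The composition stated for the consensus repeat can be verified directly from the sequence given in the abstract. The sketch below counts threonine, proline, and serine residues, finds the longest run of consecutive threonines, looks for the three threonine-proline direct repeats, and confirms the nucleotide-to-residue arithmetic for the repeat unit and for the 3'-terminal region of SMUC 41. All variable names are illustrative and are not taken from the original work.

```python
from itertools import groupby

# Consensus repeat copied from the text; names below are illustrative only.
CONSENSUS = "PTTTPITTTTTVTPTPTPTGTQT"

# Residue counts for threonine (T), proline (P), and serine (S).
counts = {aa: CONSENSUS.count(aa) for aa in "TPS"}

# Longest uninterrupted run of threonine residues.
longest_thr_run = max(
    len(list(group)) for aa, group in groupby(CONSENSUS) if aa == "T"
)

# Three consecutive threonine-proline pairs appear as the substring TPTPTP.
has_three_tp_repeats = "TPTPTP" in CONSENSUS

# Three nucleotides per codon: repeat length in bp, and the number of
# residues encoded by the 471-nucleotide non-repeat region of SMUC 41.
repeat_length_bp = 3 * len(CONSENSUS)
c_terminal_residues = 471 // 3

print(counts, longest_thr_run, has_three_tp_repeats)
print(repeat_length_bp, c_terminal_residues)
```

The counts agree with the figures quoted above: 14 threonines and 5 prolines in 23 residues, a 69-bp repeat unit, and a 157-residue carboxyl-terminal region. The consensus itself contains no serine, which is consistent with serine appearing in the repeats only as occasional substitutions.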
Reactivity of Antibodies against the MRP with HFB-As shown in panel A of Fig. 6, antibodies against HFB reacted with both HFB and BSA conjugated with the MRP but not with partially deglycosylated mucin (HFA) or unconjugated BSA. The broad smear of antibody-reactive protein in the HFB sample is indicative of the cleavage of the mucin backbone that occurs during deglycosylation (10). MRP-conjugated BSA exhibits polydispersity, on the other hand, due to irregular conjugation of BSA with itself and the peptide. Antibodies against the MRP had a specificity similar to anti-HFB. Again, reactivity was apparent with HFB and MRP-conjugated BSA but not with HFA or unconjugated BSA (Fig. 6, panel B). Thus, antibodies prepared against a synthetic peptide made using the deduced sequence of a mucin repeat unit were reactive with HFB, providing additional evidence that these cDNAs are actually derived from mucin messages.
Most of the sequencing of the interior regions of SMUC 40 and 41 was performed using exonuclease III-deleted clones. In a few cases, restriction fragments obtained from TaqI and MspI digests of SMUC 40 and 42 were force cloned into AccI- and EcoRI-digested M13mp18 and used to generate templates. Sequencing done using this latter method is indicated using dashed arrows.
RNA Blot Analysis and in Vitro Translation of Mucin mRNAs-Poly(A)+ RNA from a number of human cell lines and tissues was subjected to RNA blot analysis using SMUC 41 cDNA as a probe (Fig. 7). The messages that hybridized to SMUC 41 were large and polydisperse, averaging 7600 bases in length. In addition, a distinct but faint band at 1850 bases was sometimes detectable (Fig. 10). The strongest hybridization signals observed in these experiments were expressed by colon, colon tumor, and small intestine RNA. HM-7 and H498, two high mucin-producing human colon cancer cell lines (22,23), also contained high levels of message. LM-12, a low mucin-producing variant of LS174T cells (22), exhibits only a faint hybridization signal as does RNA from LS-G and SW1116 cells. No detectable signal was obtained with either placenta or the thyroid tumor poly(A)+ RNA used here.
In vitro translation and immunoprecipitation with anti-HFB serum were used in an attempt to identify the intestinal mucin primary translation product (Fig. 8). When the in vitro translation reactions were programmed with poly(A)+ RNA from human small intestine, colon, H498 cells, or HM-7 cells, a single discernible protein of 162,000 daltons was specifically immunoprecipitated. This band was fainter but detectable when LM-12 poly(A)+ RNA was used to program the reactions and was absent when LS-G poly(A)+ RNA was used. A protein of 162,000 daltons would require an mRNA of approximately 5,000 bases or larger, depending on the length of the 5'- and 3'-untranslated regions. Hence, the molecular weight of the immunoprecipitated protein is in good agreement with the message size determined in Fig. 7.
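The message-size argument above rests on simple arithmetic, sketched here under stated assumptions. The average residue masses used (110 Da as a general rule of thumb, and about 100 Da as a plausible value for a threonine- and proline-rich protein) are assumptions made for this illustration and do not come from the original text; the function name `min_coding_bases` is likewise hypothetical.

```python
# Assumed average residue masses in daltons; neither value is from the text.
GENERIC_RESIDUE_DA = 110.0   # common rule of thumb for typical proteins
THR_PRO_RICH_DA = 100.0      # assumed lower value for a Thr/Pro-rich protein


def min_coding_bases(protein_da: float, avg_residue_da: float) -> int:
    """Estimate the coding bases needed for a protein of the given mass.

    Counts three bases per residue plus one stop codon; untranslated
    regions are not included.
    """
    residues = round(protein_da / avg_residue_da)
    return 3 * (residues + 1)


PROTEIN_DA = 162_000          # immunoprecipitated translation product
OBSERVED_MESSAGE_BASES = 7_600  # average size of the hybridizing message

generic_estimate = min_coding_bases(PROTEIN_DA, GENERIC_RESIDUE_DA)
thr_pro_estimate = min_coding_bases(PROTEIN_DA, THR_PRO_RICH_DA)

print(generic_estimate, thr_pro_estimate)
print(thr_pro_estimate <= OBSERVED_MESSAGE_BASES)
```

With these assumed values the coding region alone comes to roughly 4,400 to 4,900 bases, which is in line with the figure of approximately 5,000 bases once untranslated regions are allowed for, and fits within the observed message size of about 7,600 bases.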
Genomic DNA Blot Analysis-Genomic DNA was isolated from the lymphocytes of two human donors and five colon cancer cell lines, restriction endonuclease-digested, and subjected to electrophoresis and hybridization blot analysis using the SMUC 41 probe (Fig. 9). Six of these DNA samples were cleaved with EcoRI and all exhibited a single hybridization band that was larger than the 23.1-kb standard (Fig. 9A). As a control for restriction endonuclease cleavage, these same blots were examined using a probe for carcinoembryonic antigen, and the resulting band pattern was similar to previously published results (25). Hence, the large size of the EcoRI-cleaved hybridization band does not appear to be due to incomplete digestion of the DNA. HinfI digestion produced bands at 7.9, 1.2, and 0.62 kb in four of the five DNA samples tested (Fig. 9B). The other sample displayed these three bands plus an additional band at 4.5 kb. This demonstrates that a polymorphism exists in or around the gene that encodes SMUC 41. Further evidence for polymorphism in this gene is shown in Fig. 9C. Sau3A digestion of these DNA samples revealed a different set of hybridization bands for three of the four samples tested. Thus, both HinfI and Sau3A identify restriction fragment length polymorphisms in the intestinal mucin gene.
RNA and DNA Blot Analysis Using the 5' and 3' Portions of SMUC 41 as a Probe-Analysis of the 3'-terminal 471 nucleotides in SMUC 41 indicates that this region has no sequence similarity with the 69-nucleotide tandem repeats. Thus, it is possible that this stretch of nucleotides is not an integral part of the intestinal mucin gene, i.e. its presence as part of SMUC 41 may be an artifact of cloning. To test this possibility, RNA and genomic DNA blots were probed with segments of SMUC 41 representing the tandem repeats and the 3'-terminal region. SMUC 41 was digested with ApaI, which cleaves after base 370, five nucleotides downstream from the end of the last repeat. The 5'-terminal 370-base fragment and the 3'-terminal 466-base fragment were purified by gel electrophoresis (17) and used as hybridization probes (5'-370 and 3'-466, respectively). As shown in Fig. 10A, both of these probes recognized the same polydisperse 7600-base band when used to probe blots of small intestine and colon poly(A)+ RNA. These two probes recognized the same large band (23.1 kb) when used to probe EcoRI-digested DNA and both recognized the 7.9- and 4.5-kb bands in HinfI-digested DNA (Fig. 10B). Probe 3'-466, however, hybridized much more strongly to the 1.2- and 0.62-kb bands in HinfI-digested DNA than did probe 5'-370. Interestingly, both probes recognized the same set of bands when hybridized to Sau3A-digested DNA, although the relative intensities of the bands did vary (Fig. 10B). These results clearly indicate that the 3'-terminal 466 bases of SMUC 41 are an integral part of the intestinal mucin gene. Moreover, the fact that probe 3'-466 hybridizes to every tested restriction fragment that hybridizes to probe 5'-370 suggests that sequences similar to those in the 3'-terminal portion of SMUC 41 are repeated elsewhere in the intestinal mucin gene, presumably upstream of the 69-bp tandem repeats. This latter conclusion must be considered tentative at present, however, as exons, pseudogenes, related

DISCUSSION
In the present study we used antibodies prepared against the deglycosylated protein core of mucin to identify mucin cDNA clones in a human small intestine λgt11 library. This approach has proven successful in the past for the isolation of cDNAs encoding the mucins expressed by porcine submaxillary gland (20) and human mammary epithelium (26, 27). It is reasonable to expect, therefore, that this method may be generally applicable to the cloning of cDNAs for additional mucins or other heavily glycosylated proteins.
All three of the cDNA clones isolated in this study contain 69-bp tandem repeats. The deduced consensus amino acid sequence for these repeats is high in threonine and proline and low in serine content. In this respect, the repeat units are more similar to the acidic fraction of intestinal mucin isolated by Wesley et al. (3) than to the more abundant neutral fraction. The significance of this is not yet clear, however, as cDNAs representing the entire mucin protein backbone have not yet been isolated. Since the number of GalNAc residues in intestinal mucin approaches the number of threonine residues, it is very likely that a sizable percentage of the threonine present in the tandem repeats contains O-linked carbohydrate (1-3). Furthermore, studies of UDP-GalNAc:polypeptide GalNAc transferase have indicated a preference for one or more proline residues in the vicinity of an O-glycosylation site (28, 29), thus further supporting the conclusion that the deduced tandem repeats are O-glycosylated. Synthetic peptide acceptors modeled after the consensus sequence of the repeating units may be useful in defining the O-glycosylation sites more precisely.
The deduced amino acid sequences described here contain two potential N-glycosylation sites. This was not expected in view of previously published carbohydrate analyses of human intestinal mucin which reveal its major glycoprotein component to be devoid of mannose (2). Other mucins, however, have been shown to contain detectable quantities of N-linked carbohydrate. These include mouse submandibular mucin, which has both N- and O-glycosidically bound carbohydrate chains present (30, 31). Also, the nascent protein precursor of human mammary mucin has recently been shown to undergo N-glycosylation (32, 33), and the deduced amino acid sequence for porcine submaxillary gland mucin contains an N-glycosylation site (at amino acid position 418, Ref. 20). Thus, it is not unprecedented for mucin-type glycoproteins to contain some N-linked oligosaccharide chains in addition to their O-linked glycans. The significance of the N-linked oligosaccharides is currently unclear. Since N-glycosylation occurs cotranslationally, it is possible that the N-glycans stabilize the nascent apomucin prior to O-glycosylation or that they play a role in the intracellular targeting of this macromolecule. On the other hand, the N-linked glycans may be important in the functioning of mature mucin. Further studies are needed to define the role of the N-linked carbohydrate chains in mucin.
FIG. 10. RNA and DNA blots probed using 5' and 3' segments of SMUC 41. Blots prepared as described above were probed with 3'-466 as discussed in the text and autoradiography was conducted. The probe was then removed by twice incubating the blots at 65 °C in 10 mM sodium phosphate, pH 6.5, and 50% formamide for 1 h. Following a rinse in 2 × SSC and 0.1% sodium dodecyl sulfate, the blots were probed again using 5'-370. Panel A, blot was prepared using 0.5 μg of poly(A)+ RNA from human small intestine and colon. Panel B, 8 μg of DNA samples digested with the indicated restriction enzymes were analyzed.

As mentioned above, cDNAs for porcine submaxillary gland mucin and human mammary mucin have been recently isolated (20, 26, 27). Comparison of the nucleotide sequence of the clones isolated in this study to the sequences of these other types of mucin failed to reveal any significant homology. This is not surprising as these three mucins have substantially different amino acid compositions (1-3, 34, 35). A similarity exists, however, in the sense that all three of these mucins contain tandem repeats. Porcine submaxillary gland mucin contains at least eight tandem repeats of 243 bp each, and human mammary mucin contains an unknown number of repeats of 60 bp each (20,26,27). This suggests that selective evolutionary pressure exerted on several types of mucin genes has favored events such as gene duplications or unequal crossovers which have yielded tandem repeats.
Another similarity that exists, at least between mammary mucin and intestinal mucin, is that both are encoded by genes that are genetically polymorphic (Fig. 9, see Refs. 26, 36, 37). In the case of mammary mucin, a length polymorphism is observed; i.e. different alleles are thought to have a variable number of repeat units, which leads to differences in the size of the mucin protein expressed (37). At the present time, we do not have sufficient data to determine whether the intestinal mucin gene also exhibits length polymorphism or if the different restriction fragments observed are due to differences in the sequence (in either introns or exons). In any event, this polymorphism may contribute to the polydispersity observed with regard to intestinal mucin message size (Fig. 7). Furthermore, these polymorphisms could reflect significant differences in the mucin gene-coding sequence and/or in regions of the gene that affect the level of its expression. The isolation and sequencing of genomic DNA clones containing different restriction fragment length polymorphism-defined mucin alleles may reveal variations in intestinal mucin structure and expression that occur within the human population.