Novel Amino-terminal Propeptide Configuration in a Fibrillar Procollagen Undergoing Alternative Splicing*

We isolated overlapping cDNAs from embryonic libraries of the sea urchin Strongylocentrotus purpuratus coding for a fibrillar procollagen (2α chain) with a predicted molecular mass of about 320 kDa. The deduced primary structure of the echinoid chain consists of a 265-amino acid carboxyl-propeptide, a triple helical domain made of 337 uninterrupted Gly-X-Y repeats, and an unusually long amino-propeptide. Aside from a 10-cysteine globular region, a collagenous sequence, and a nonhelical segment, this protein domain includes a novel 4-cysteine motif repeated several times. Interestingly, preliminary evidence indicates that different combinations of the 4-cysteine repeats are encoded by alternatively spliced transcripts. Irrespective of this, the sea urchin 2α procollagen chain represents the longest fibrillar molecule identified to date by cDNA cloning experiments in both vertebrate and invertebrate organisms.


In higher vertebrates, five distinct collagen trimers (types I, II, III, V, and XI) participate in the formation of morphologically similar supermolecular aggregates, the quarter-staggered fibrils (for a recent review, see Ref. 1). The precursor procollagen subunits of fibrillar collagens exhibit the same overall structure consisting of a central triple helical domain flanked by carboxyl-terminal and amino-terminal propeptides. Unlike the first two domains, amino-terminal propeptides of various procollagen chains differ greatly in length and composition. Accordingly, three distinct amino-terminal propeptide architectures are recognized (2). The first consists of a globular region that harbors 10 similarly spaced cysteinyl residues, a collagenous sequence that may or may not be discontinuous, and a nonhelical segment that contains the N-proteinase¹ cleavage site (Structure I). All but the first subdomain are present in the second amino-terminal propeptide configuration (Structure II). In the third type of configuration (Structure III), the 10-cysteine subdomain is replaced by a long globular region divided by a 3-cysteine cluster into an upstream slightly basic segment and a downstream highly acidic sequence. Furthermore, one of the fibrillar procollagen chains, pro-α1(II), exists in both the first and second configurations because of alternative splicing of the sequence coding for the 10-cysteine globular region (3).

* This work was supported by Grant GM-41849 from the National Institutes of Health. This is article 106 from the Brookdale Center for Molecular Biology. The costs of publication of this article were defrayed in part by the payment of page charges. This article must therefore be hereby marked "advertisement" in accordance with 18 U.S.C. Section 1734 solely to indicate this fact.
The nucleotide sequence(s) reported in this paper has been submitted to the GenBank™/EMBL Data Bank with accession number(s) M92041.
‡ On leave of absence from the Institut de Biologie et Chimie des Proteines, Centre National de la Recherche Scientifique, Lyons, France.
On leave of absence from the International Institute of Genetics and Biophysics, Consiglio Nazionale delle Ricerche, Naples, Italy.
¹ The abbreviations used are: N-proteinase, the enzyme that cleaves the amino-terminal propeptide; C-proteinase, the enzyme that cleaves the carboxyl-terminal propeptide; kb, kilobase(s); bp, base pair(s); COLP2α, the gene coding for the S. purpuratus 2α procollagen.
In contrast to vertebrates and despite substantial morphological data, very little is known about the primary structure of fibrillar procollagen molecules in invertebrates. The sole exceptions are some partial sequences from cDNAs of the freshwater sponge Ephydatia mülleri and the Mediterranean sea urchin Paracentrotus lividus (4-6). Albeit incomplete, these data revealed a close evolutionary kinship between the vertebrate and invertebrate proteins. They also documented the functional contribution of several phylogenetically retained structures of the proteins to metazoan fibrillogenesis.
In addition to serving as supportive elements, collagens are intimately involved in a variety of physiological processes and cellular activities (7). Such a dual role is reflected in the complexity and diversification of the molecular circuitries that modulate collagen production during development and in the adult organism (8). The sea urchin represents an instructive and simple organism for studying collagen function and regulation during early animal embryogenesis. Sea urchin collagens have been implicated in different morphogenetic programs of the developing embryo, such as gastrulation and spiculogenesis (9-14). For example, expression of a nonfibrillar collagen gene (Spcoll) by the differentiating mesenchyme cells of the sea urchin Strongylocentrotus purpuratus has recently been shown to govern the point of differentiation at which these cells initiate biomineralization (14).
Because of our long-standing interest in collagen evolution, we have recently completed the determination of the sequences of two sea urchin fibrillar collagens, termed the 1α and 2α chains (5, 6). Here, we present the data pertaining to COLP2α, the gene coding for the 2α procollagen chain of S. purpuratus.

MATERIALS AND METHODS
Embryo Cultures and Nucleic Acids Purification and Analysis-Collection of S. purpuratus gametes, in vitro fertilization, and embryo cultures were performed according to the standard protocol (15). Genomic DNA was purified from sperm as previously described (16). Total RNA, prepared according to a published protocol (17), was eluted twice through an oligo(dT)-cellulose column (Boehringer Mannheim). For Northern blot analysis, 1 μg of poly(A)+ RNA was fractionated through a 1% agarose gel containing 2.2 M formaldehyde, transferred onto a nitrocellulose filter (Millipore), and hybridized to a 500-bp EcoRI/KpnI probe corresponding to the 5'-foremost segment of COLP2α (Fig. 1).
cDNA Cloning and Sequencing-Approximately 5 μg of late gastrula stage poly(A)+ RNA was utilized as a template to generate two embryonic cDNA libraries in the λgt10 and λgt11 vectors using oligo(dT) and random primers and following the recommendations

P2α  RDQPIRSCKDLFKCYPEAEDGNYWIDSNEG  3024
     ...............          .....
L2α  RDQPIRSCKDLFKCFQRPKMATTGSDSNEG
(20), and by employing synthetic oligonucleotides. Nucleotide sequence was determined with a modified protocol (4) of Sanger et al. (21) using the Sequenase enzyme (U. S. Biochemical Corp.). Sequence analysis was aided by the computer program MULTALIN. Oligonucleotides were synthesized by an Applied Biosystems model 380 synthesizer.

RESULTS
Cloning of 2α Procollagen cDNAs-A 2.5-kb cDNA (I2) was initially isolated from an S. purpuratus embryonic library by cross-hybridization with a probe (Uni 13) previously shown to code for the carboxyl-terminal propeptide domain of the P. lividus 2α procollagen (6) (Fig. 1). Comparison of the deduced amino acid sequences of I2 and Uni 13 showed a substantial degree of homology extending to the highly divergent carboxyl-telopeptide (Fig. 2). In addition, the two carboxyl-terminal propeptides display in identical positions a potential Asn-linked glycosylation site and the 7 cysteines which, in vertebrate chains, are involved in intracellular assembly of the procollagen trimer (Fig. 2). Based on these data, we concluded that the S. purpuratus clone represents the counterpart of the P. lividus 2α collagen gene product. Additional library screenings yielded eight overlapping clones covering a 9594-bp-long open reading frame (Fig. 1). As discussed more extensively below, some of the cDNAs provided evidence that COLP2α transcripts undergo alternative splicing.
Structure of 2α Procollagen-The conceptual amino acid translation of the sea urchin cDNAs revealed that this polypeptide is unique among vertebrate and invertebrate fibrillar procollagens, for its predicted molecular mass is ~320 kDa. Moreover, nearly 60% of the chain is accounted for by an unusually long amino-terminal propeptide domain characterized by a novel subdomain positioned between the 10-cysteine globular region and the collagenous sequence (Fig. 1).
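As a rough consistency check on these sizes, the arithmetic can be sketched as follows. The 9594-bp open reading frame and the 5769 bp of amino-terminal prepropeptide-coding sequence are the values reported in this section; counting one residue per 3 bp (without deciding whether a stop codon is included) and all variable names are assumptions made only for illustration.

```python
# Back-of-the-envelope check of the sizes quoted in the text.
ORF_BP = 9594                # open reading frame covered by the overlapping clones
N_PREPRO_BP = 5769           # sequence coding for the amino-terminal prepropeptide
PREDICTED_MASS_DA = 320_000  # "about 320 kDa"


def codons(bp: int) -> int:
    """Complete codons in an in-frame coding stretch (assumption: 3 bp per residue)."""
    return bp // 3


orf_residues = codons(ORF_BP)
n_prepro_residues = codons(N_PREPRO_BP)
n_fraction = n_prepro_residues / orf_residues
implied_residue_mass = PREDICTED_MASS_DA / orf_residues

print(f"ORF residues: {orf_residues}")
print(f"Amino-terminal prepropeptide residues: {n_prepro_residues}")
print(f"Fraction of the chain: {n_fraction:.1%}")
print(f"Implied mean residue mass: {implied_residue_mass:.0f} Da")
```

Under these assumptions the prepropeptide-coding sequence accounts for roughly 60% of the codons, in line with the proportion stated above, and the predicted mass corresponds to about 100 Da per residue.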
The first subdomain of the sea urchin amino-terminal propeptide, the 10-cysteine globular region, begins 38 residues after the start site of translation immediately following a characteristic signal peptide sequence (Fig. 3). A remarkable similarity can be observed when the cysteine cluster of the sea urchin chain is aligned to the same region of the human procollagen chains (Fig. 4) (3). First, the spacing between the cysteinyl residues is maintained across collagens, with the exception of the interval between cysteines 6 and 7. This particular spacing is also different in the pro-α1(III) collagen chain (Fig. 4) (22). Second, and like the human chains, the invertebrate subdomain retains several invariant residues that are also conserved in the analogous cysteine-rich motif of thrombospondin (3, 23). Third, the echinoid and human sequences are nearly identical around the two consecutive cysteines likely to be engaged in interchain bonding.
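The spacing comparison described above can be illustrated with a minimal sketch. The two sequences below are invented toy strings, not the sea urchin or human sequences of Fig. 4, and the function names are hypothetical; the code only shows how intervals between successive cysteines could be tabulated and compared.

```python
from __future__ import annotations


def cysteine_spacings(seq: str) -> list[int]:
    """Number of residues between each pair of successive cysteines."""
    positions = [i for i, aa in enumerate(seq) if aa == "C"]
    return [b - a - 1 for a, b in zip(positions, positions[1:])]


def differing_intervals(seq_a: str, seq_b: str) -> list[int]:
    """1-based indices of the cysteine-to-cysteine intervals whose lengths differ.

    Interval k is the stretch between cysteine k and cysteine k + 1.
    """
    spac_a, spac_b = cysteine_spacings(seq_a), cysteine_spacings(seq_b)
    if len(spac_a) != len(spac_b):
        raise ValueError("sequences contain different numbers of cysteines")
    return [k + 1 for k, (x, y) in enumerate(zip(spac_a, spac_b)) if x != y]


# Toy sequences (assumption: made up for illustration only).
toy_a = "CAACAAACAC"    # spacings 2, 3, 1
toy_b = "CGGCGGGCGGGC"  # spacings 2, 3, 3

print(cysteine_spacings(toy_a))
print(cysteine_spacings(toy_b))
print(differing_intervals(toy_a, toy_b))
```

Applied to real aligned sequences, a result containing only interval 6 would correspond to the observation that spacing is maintained except between cysteines 6 and 7.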
Following the 10th cysteine of the globular domain is the new subdomain, which is made of a 4-cysteine motif repeated 12 times (Fig. 3). With the exception of the first repeat, whose estimated pI is 8.3, the other 11 repeats are substantially acidic, with theoretical pI values ranging from 3.8 to 4.2. A putative cell attachment sequence (Arg-Gly-Asp) (24) and a potential glycosylation site are noted in repeats 7 and 5, respectively. The homology between the 12 repeats can be readily appreciated when portions of their sequences are aligned (Fig. 5). From this alignment, the following consensus sequence can be derived: X(39)GX2LWX11GXGX39CX6CX2L/FX(23)CX(4)CX3 (where numbers in parentheses signify an average number of residues). A computer-aided search failed to identify appreciable homology between this consensus sequence and known peptide motifs.

FIG. 3. Amino-terminal prepropeptide sequence. Amino acids are numbered on the right from the start site of translation; they extend to the region immediately before the beginning of the triple-helical domain (see Fig. 6). The boundaries of the four amino-terminal propeptide subdomains are designated by the open triangles. In the cysteine-rich globular region, cysteinyl residues are boxed (see also Fig. 4). In the repeated subdomain, the boundaries and identity of each repeat are indicated by the horizontal arrows and numbers, respectively (see also Fig. 5). In the collagenous sequence, Gly-X-Y triplets are continuously underlined and the 4-amino acid interruption is highlighted by the dotted lines. Arrows indicate putative cleavage sites; structural elements discussed in the text are boxed.

Only invariant residues were identified in this comparative analysis. Note that the ordering of the chain in the cross-alignment is arbitrary.
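To give a feel for the size of one repeat, the consensus can be written as a list of segments and summed. This is only a sketch: the consensus is read here as X(39) G X2 L W X11 G X G X39 C X6 C X2 L/F X(23) C X(4) C X3, which is an assumption about a partly illegible formula, the parenthesised spacers are averages, and the regular expression covers only the short C-X6-C-X2-(L/F) stretch of the cysteine core. The toy string is invented.

```python
import re

# (symbol, length) pairs; "X" stands for any residue.
# Assumption: this reading of the consensus may not match the original figure.
CONSENSUS = [
    ("X", 39), ("G", 1), ("X", 2), ("L", 1), ("W", 1), ("X", 11),
    ("G", 1), ("X", 1), ("G", 1), ("X", 39), ("C", 1), ("X", 6),
    ("C", 1), ("X", 2), ("[LF]", 1), ("X", 23), ("C", 1), ("X", 4),
    ("C", 1), ("X", 3),
]

# Approximate repeat length implied by the consensus (averages included).
repeat_length = sum(n for _, n in CONSENSUS)
print(f"Approximate residues per repeat: {repeat_length}")
print(f"Approximate residues in 12 repeats: {12 * repeat_length}")

# Pattern for the first part of the cysteine core: C-X6-C-X2-(L or F).
CORE = re.compile(r"C.{6}C.{2}[LF]")

toy = "AAAC" + "GGGGGG" + "C" + "TT" + "F" + "AAA"  # invented test string
match = CORE.search(toy)
print(match.start(), match.group())
```

With this reading, one repeat spans roughly 140 residues, so 12 repeats would account for most of the amino-terminal propeptide.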

P G L I G S V G Y H G I R G P N G L S G P A G O R G~D G R D G N S G N R~T P G P P G P P G P P G
Twenty-four Gly-X-Y repeated triplets constitute the collagenous sequence of the amino-terminal propeptide (Fig. 3). They contain a small interruption likely to render the upstream set of four triplets too short to participate in triple helical assembly. This subdomain is separated from the main triple helical domain by 21 residues containing a potential proteolytic signal (Ala-Gln) (25) (Fig. 3). Should the amino-terminal propeptide be cleaved, then a Lys-cross-linking site (26) would be located in the amino-telopeptide of the mature α-chain (Fig. 3).
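The way an interruption splits a collagenous sequence into separate runs of triplets can be sketched as below. The input is a toy sequence built to mimic the arrangement described here (four triplets, a 4-residue interruption, then the remaining triplets); the actual residues, the 4 + 20 split, and the function name are assumptions for illustration.

```python
from __future__ import annotations


def gly_xy_runs(seq: str) -> list[tuple[int, int]]:
    """Return (start_index, n_triplets) for each run of consecutive Gly-X-Y triplets.

    A run starts at the first glycine encountered and continues in steps of
    three residues for as long as every triplet begins with glycine. This
    greedy scan is a simplification and can pick the wrong frame on real data.
    """
    runs = []
    i, n = 0, len(seq)
    while i < n:
        if seq[i] == "G" and i + 3 <= n:
            start, count = i, 0
            while i + 3 <= n and seq[i] == "G":
                count += 1
                i += 3
            runs.append((start, count))
        else:
            i += 1
    return runs


# Toy collagenous sequence: 4 triplets, a 4-residue interruption, 20 triplets.
toy = "GPP" * 4 + "AAAA" + "GPP" * 20
runs = gly_xy_runs(toy)
print(runs)
print(sum(count for _, count in runs))
```

The scan reports two runs of 4 and 20 triplets, 24 in total, with the short upstream run corresponding to the set described as too short for triple helical assembly.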
The triple helical domain of the sea urchin 2α procollagen is made of 337 uninterrupted Gly-X-Y repeats. As previously shown for other invertebrate and lower vertebrate chains and unlike triple helical domains of higher vertebrates (4-6, 18), numerous Gly-X-Y triplets display a glycine residue either at position X or Y (Fig. 6). In addition, the 2α chain contains several putative cell attachment sequences but lacks the cross-linking sites normally found at both ends of the triple-helical domain (Fig. 6) (26).

G R P G S A G Y S G H R G A R G P Q G L T G P K G P Q G S A G P K G K S G P R G A R G E D G E D G N 1973
D G O N G R O G E I G L V G I S G R P G L G G K H G K S G N P G H K G W~R H G A P G A A G E R G 2023
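The two features noted for the triple helical domain, glycine at the X or Y position and Arg-Gly-Asp sequences, can be counted with a short sketch. The input is an invented in-frame toy helix, not the sequence of Fig. 6, and the function names are hypothetical.

```python
from __future__ import annotations

HELIX_TRIPLETS = 337                # Gly-X-Y repeats reported for the 2-alpha chain
HELIX_RESIDUES = 3 * HELIX_TRIPLETS  # residues in an uninterrupted helix of that size


def split_triplets(helix: str) -> list[str]:
    """Split an in-frame triple helical sequence into Gly-X-Y triplets."""
    if len(helix) % 3:
        raise ValueError("length is not a multiple of three")
    return [helix[i:i + 3] for i in range(0, len(helix), 3)]


def summarize_helix(helix: str) -> dict:
    """Count triplets, triplets with Gly at X or Y, and Arg-Gly-Asp start positions."""
    trips = split_triplets(helix)
    if any(t[0] != "G" for t in trips):
        raise ValueError("helix is interrupted or out of frame")
    gly_xy = sum(1 for t in trips if "G" in t[1:])
    rgd = [i for i in range(len(helix) - 2) if helix[i:i + 3] == "RGD"]
    return {"triplets": len(trips), "gly_at_x_or_y": gly_xy, "rgd_starts": rgd}


# Toy helix: GPP GGA GAR GDP GLG (assumption: invented for illustration).
toy_helix = "GPPGGAGARGDPGLG"
summary = summarize_helix(toy_helix)
print(HELIX_RESIDUES)
print(summary)
```

In the toy helix, two triplets carry a glycine at X or Y and one Arg-Gly-Asp sequence spans a triplet boundary, which is why the motif search runs over the whole string rather than triplet by triplet.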
The two major structural features of the carboxyl-terminal propeptide domain were already discussed in the previous section. To complete this description, a potential C-proteinase cleavage site (Arg-Asp) is predicted to reside 25 residues from the end of the triple helical domain (Fig. 2). If this assumption were correct, the mature 2α chain would contain a second Lys-mediated cross-linking site in the carboxyl-telopeptide (Fig. 2).
Evidence for Alternative Splicing of COLP2α-Sequence analysis of several amino-terminal propeptide-encoding cDNAs suggested that COLP2α transcripts undergo alternative splicing. Specifically, clone F6 spans the 5'-untranslated region to the end of repeat 11 and precisely lacks the sequence of repeats 6-8. Clone F9 begins at about the same position and ends in the second third of repeat 8. The sequence of repeats 2-5 is absent in this cDNA. Clone f4 harbors only sequence coding for the repeated subdomain, from the end of repeat 3 to the very beginning of repeat 10. Finally, clone f3 begins within repeat 9, includes repeats 10-12, and ends in the first third of the triple helical domain (Fig. 1). Collectively, the four overlapping clones cover 5769 bp of contiguous sequence coding for the 2α amino-terminal prepropeptide (Fig. 3).
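The clone descriptions above can be summarized as sets of repeats, which makes the missing blocks and the combined coverage easy to see. This is a simplified sketch: repeats that a clone covers only in part are counted as present, and the helper name is hypothetical.

```python
from __future__ import annotations

ALL_REPEATS = set(range(1, 13))  # the 12 repeats of the 4-cysteine motif

# Repeats represented in each clone, as described in the text.
# Assumption: partially covered repeats (e.g. repeat 8 in F9) count as present.
CLONES = {
    "F6": set(range(1, 12)) - {6, 7, 8},  # 5'-untranslated region to repeat 11, lacks 6-8
    "F9": {1, 6, 7, 8},                   # lacks repeats 2-5, ends within repeat 8
    "f4": set(range(3, 11)),              # end of repeat 3 to beginning of repeat 10
    "f3": {9, 10, 11, 12},                # begins within repeat 9, runs into the helix
}


def internal_gaps(present: set[int]) -> list[int]:
    """Repeats missing between the first and last repeat found in a clone."""
    return sorted(set(range(min(present), max(present) + 1)) - present)


gaps = {name: internal_gaps(reps) for name, reps in CLONES.items()}
covered = set().union(*CLONES.values())

for name, missing in gaps.items():
    print(name, "internal gaps:", missing)
print("All 12 repeats covered:", covered == ALL_REPEATS)
```

The internal gaps in F6 and F9 are the candidate spliced-out blocks, and each of them is represented in at least one other clone, which is what allows a contiguous sequence to be assembled.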
Independent evidence for COLP2α alternative splicing was obtained by Northern blot hybridization to a probe specific for the region immediately upstream of the repeated subdomain of the amino-terminal propeptide. This identified at least three major hybridizing bands that range in size from approximately 8 to 10 kb (Fig. 1). Furthermore, preliminary analysis of cDNA amplified by the polymerase chain reaction technique strongly suggests the existence of additional alternatively spliced transcripts coding for distinct combinations of the amino-terminal propeptide repeats (data not shown). We believe that some of the background seen in the Northern blot of Fig. 1 might be caused by the hybridization of these alternative and plausibly minor transcripts. Experiments in progress are elucidating the exact number, relative representation, and composition of all COLP2α transcripts, along with the maximum length of the repeated subdomain.

Amino acids are numbered on the right from the initiation site of translation; they extend from immediately after the amino-terminal prepropeptide (see Fig. 3) to immediately before the carboxyl-terminal propeptide (see Fig. 1). Gly-Gly and Arg-Gly-Asp sequences are underlined by continuous and dotted lines, respectively.

DISCUSSION
The complete primary structures of two invertebrate fibrillar procollagens, the 1α and 2α chains, have been deduced from sequences of overlapping cDNA clones. The sea urchin 1α procollagen comprises 1414 amino acids and is evolutionarily related to the vertebrate pro-α2(I) collagen (27). In contrast, the hypothetical structure of the 2α chain is unique among invertebrate and vertebrate fibrillar procollagen molecules. The primary differences are the length and composition of the amino-terminal propeptide domain, which exhibits a novel configuration, Structure IV. In addition to the three characteristic subdomains of Structure I, Structure IV contains a very large subdomain consisting of a novel peptide module repeated several times. Moreover, the mesenchyme cells of late gastrula sea urchin embryos appear to produce alternatively spliced COLP2α transcripts theoretically encoding isoforms with amino-terminal propeptides of different lengths. This is the second case of a fibrillar collagen gene whose amino-terminal propeptide-coding domain undergoes alternative splicing. In the vertebrate pro-α1(II) collagen, the exon coding for the 10-cysteine globular subdomain is in fact alternatively spliced in a developmentally regulated manner. Such a phenomenon was first recognized in human chondrocytes by Ryan and Sandell (3) and later confirmed in Xenopus and chicken embryos by other investigators (18, 28). The phylogenetic retention of the developmental pattern of alternative splicing has been interpreted as suggesting functionally distinct roles of the resulting pro-α1(II) products during vertebrate embryogenesis (18). It will be of interest to determine whether alternative splicing of COLP2α is also developmentally regulated and, more importantly, to understand the significance, if any, of producing amino-terminal propeptides of different sizes.
The contribution of collagen molecules to sea urchin development is well established. In vivo studies have documented the importance of properly formed collagen aggregates for the progression of the gastrulation process (9, 10). In vitro experiments have emphasized the role of the collagen substrate in promoting the biomineralization program of cultured cells (11-14). Our structural data are consistent with the proteolytic removal of the long amino-terminal propeptide of the 2α chain. It is, therefore, tempting to speculate that the 2α procollagen may yield two polypeptides nearly identical in size and serving distinct morphogenetic functions. Such a situation would be analogous to the postulated morphogenetic functions assigned to cleaved propeptides of vertebrate procollagens in cartilaginous matrices (29, 30). The availability of COLP2α clones provides a means to test this hypothesis, in that specific antibodies could be employed in interference experiments on micromeres differentiating in culture (14).
The three S. purpuratus collagen genes hitherto identified by cDNA cloning experiments are all expressed in the same cell lineages, but at different ontological times (14).² It could be argued that a common regulatory program governs critical changes in matrix composition during the differentiation of primary and secondary mesenchyme cells. Conversely, each of these genes is likely to contain distinct cis-acting elements responsible for timely onset of transcription. Experiments currently in progress are characterizing the regulatory sequences of the three sea urchin collagen genes. These studies promise to elucidate some of the mechanisms and factors controlling collagen gene expression in the developing animal embryo.

² H. R. Suzuki, M. D'Alessio, J. Y. Exposito, R. Gambino, F. Ramirez, and M. Solursh, manuscript in preparation.