Complete Primary Structure of a Sea Urchin Type IV Collagen α Chain and Analysis of the 5' End of Its Gene*

We isolated several overlapping cDNAs from Strongylocentrotus purpuratus coding for a nonfibrillar collagen chain structurally homologous to the vertebrate type IV collagen chains and arbitrarily termed 3α chain. The deduced amino acid sequence of the sea urchin polypeptide includes a 28-residue signal peptide, a 14-residue amino-terminal non-collagenous segment, a triple-helical domain of 1390 residues containing 23 imperfections, and a 226-residue carboxyl-terminal non-collagenous region. Comparison of the sea urchin amino- and carboxyl-terminal non-collagenous domains with those of the vertebrate type IV collagen chains indicated a high level of sequence identity to the α1(IV) and α5(IV) chains. This evolutionary relationship was further strengthened by the analysis of the genomic organization of the 5' portion of the sea urchin gene, which also provided the composition of some of the upstream sequences. In addition, this work demonstrated that our gene product is identical to that encoded by the partial cDNA clone recently isolated by others (Wessel, G. M., Etkin, M., and Benson, S. (1991) Dev. Biol. 148, 261-272), who demonstrated its involvement in the biomineralization process of cultured mesenchyme cells.


(Received for publication, July 31, 1992)

From the Brookdale Center for Molecular Biology, Mt. Sinai School of Medicine, New York, New York 10029
* This work was supported by National Institutes of Health Grant GM-41849. This is article 117 from the Brookdale Center for Molecular Biology at the Mt. Sinai School of Medicine in New York. The costs of publication of this article were defrayed in part by the payment of page charges. This article must therefore be hereby marked "advertisement" in accordance with 18 U.S.C. Section 1734 solely to indicate this fact.
Collagen molecules participate in the formation of a wide array of supramolecular aggregates that confer critical biomechanical and physiological properties to the tissues and organs of all metazoan organisms (1). Based on morphological criteria, these extracellular structures are broadly divided into fibrillar (types I-III, V, and XI collagens) and nonfibrillar (types IV, VI-X, and XII-XIV collagens) networks (1). Contrasting the homogeneity of the former group is the substantial diversity of architectural forms of the nonfibrillar aggregates, which include sheet-like structures, hexagonal lattices, beaded filaments, and anchoring fibrils (1). In addition to merely serving as supportive elements, collagen molecules are also recognized to play a dynamic role in several cellular activities, morphogenetic processes, and developmental programs (2). Such a dual function is reflected in the complexity and diversification of the regulatory mechanisms that modulate the tissue- and stage-specific patterns of collagen gene expression (3).
Our current understanding of collagen structure, function, and regulation relies for the most part on studies performed in higher vertebrate systems. This is despite the large body of descriptive information that exists for the collagens of simpler and experimentally more amenable organisms. A case in point is the sea urchin embryo, in which collagen biosynthesis has been implicated in the progression of gastrulation and in the differentiation of skeletogenic mesenchyme cell cultures (4-6).
Biochemical and immunological data have indicated that the sea urchin contains several genetically distinct collagens believed to represent the counterparts of vertebrate fibrillar and nonfibrillar molecules (6-11). This postulate was recently confirmed by the cloning and characterization of three collagen genes from two distinct sea urchin species, Paracentrotus lividus and Strongylocentrotus purpuratus (6, 12-17).
Using a nematode collagen probe, Saitta et al. (13) isolated from the former organism a genomic fragment subsequently shown to encode a collagen evolutionarily related to the fibrillar group of molecules (14). The primary structure of this gene product, arbitrarily termed 1α chain, was recently elucidated in its entirety in S. purpuratus and shown to be homologous to that of the vertebrate pro-α2(I) collagen gene (16). Likewise, the protein sequence of an unusually long fibrillar procollagen, named 2α chain, was deduced from overlapping cDNAs isolated from embryonic libraries of P. lividus and S. purpuratus (15, 17). In both sea urchin species, these fibrillar collagen genes are co-expressed in the mesenchyme cell lineage of late-gastrula stage embryos (14, 15).
The third echinoid collagen was originally identified in the S. purpuratus genome by Venkatesan et al. (12), who isolated a 212-bp exon coding for an interrupted collagenous sequence using a mouse type IV collagen probe. These investigators showed that the collagen-coding probe hybridizes to a 9-kb transcript first detected during blastula formation (12), whereas others demonstrated a more complex pattern of hybridization by the same exon probe (18). To be precise, they reported that the genomic fragment cross-hybridizes with a 7-kb RNA which is first detected in gastrulae, as well as with a second 9-kb species apparently co-expressed at blastula stage together with the homologous transcript of similar size (18). The nonfibrillar nature of this collagen gene, termed Spcoll, was recently confirmed by partial cDNA cloning experiments which, however, left unresolved the issue pertaining to the identity of this collagen chain (6). Here we demonstrate that this nonfibrillar molecule of S. purpuratus is a type IV collagen α chain.
Fig. 1. Clones are depicted below a schematic representation of the 3α collagen chain. Signal peptide (SP) is signified by the gray rectangle, whereas amino and carboxyl non-collagenous regions (NC), as well as non-collagenous interruptions of the triple-helix (black), are in white. Relative positions of cysteinyl residues (C) are indicated along with those of EcoRI (E) sites. On the right is a Northern blot hybridization to clone I28 of RNA from gastrulae (G) and plutei (P). Size of hybridizing bands, estimated using DNA size markers, is indicated on the right side of the autoradiogram.

MATERIALS AND METHODS
Embryo Cultures, RNA Isolation, and Analysis-Gamete collection, in vitro fertilization, and embryo cultures were carried out according to the standard protocol (19). Poly(A)+ RNA was isolated by oligo(dT)-cellulose chromatography of total RNA purified from gastrulae and plutei using the procedure described by Cathala et al. (20). For Northern blot hybridizations, approximately 1 μg of poly(A)+ RNA was analyzed as described previously (14, 15).
Isolation and Sequencing of cDNA and Genomic Clones-For cDNA isolation, 5 × 10⁴ recombinant phages from a late-gastrula stage S. purpuratus library (16, 17) were initially screened under cross-hybridizing conditions using P. lividus cDNA fragments encoding triple-helical sequences of the 1α and 2α procollagens (14, 15). Additional cDNA and genomic screenings were performed under stringent conditions of hybridization using appropriate S. purpuratus cDNA probes. Clones were sequenced as described previously (16, 17), and resulting data were analyzed with the aid of the computer program MULTALIN (21).
Determination of the Start Site of Transcription-For primer extension, 100 pg of a 30-nucleotide oligomer complementary to the 5' coding portion of the 3α mRNA was 32P-labeled by T4 kinase, incubated with 1 μg of late-gastrula stage poly(A)+ RNA, and extended by reverse transcriptase as described (22). For nuclease protection, a 578-bp genomic fragment, whose 3' end lies 168 bp upstream of the ATG codon, was first subcloned into the transcription vector pT7/T3-19 (Life Technologies Inc., Gaithersburg, MD). Uniformly 32P-labeled antisense RNA was then synthesized by transcribing 1 μg of linearized template using the T7 RNA polymerase. The product of the reaction was annealed to 1 μg of late-gastrula stage poly(A)+ RNA and subjected to RNase protection according to the protocol of Krieg and Melton (23). Products of the primer extension and nuclease protection reactions were analyzed by autoradiography after electrophoresis in a 5% polyacrylamide, 50% urea (w/v) gel (23).

RESULTS
Cloning of 3α Collagen cDNAs-While this work was already in progress, Wessel et al. (6) reported partial sequence information of the transcript encoded by the genomic clone initially isolated by Venkatesan et al. (12) using a murine type IV collagen probe. To be precise, this 2721-bp-long cDNA (Spcoll) codes for a series of repeated Gly-X-Y triplets interrupted 13 times by short non-collagenous sequences (6). Consistent with previous analyses (12, 18), Spcoll was found to hybridize to a 9-kb transcript which is first detected at the swimming blastula stage, where it accumulates specifically in the mesenchyme cells (6). Concomitantly with and independently of this study, we also identified a nonfibrillar collagen gene product during our cross-species cloning of the S. purpuratus 1α and 2α collagen genes (16, 17). As discussed more extensively below, this 1-kb cDNA, I28 (Fig. 1), codes for a triple-helical sequence containing four interruptions, a short amino-terminal non-collagenous region, and a potential signal peptide (Fig. 1). In addition, the 168 bp which lie upstream of the start site of translation of I28 contain several stop codons. Thus, the data indicated that the sequences of I28 correspond to the 5' end of a nonfibrillar collagen mRNA. Based on the available structural evidence, we predicted that this mRNA encodes a type IV-like α chain. For consistency with our nomenclature of the other sea urchin collagens (14-17), this product was named 3α chain and the corresponding gene COLP3α.
Upon Northern blot analysis, we found that the I28 cDNA hybridizes to a transcript of about 8-9 kb in gastrulae and to an additional 6-7-kb transcript in plutei (Fig. 1). Moreover, parallel in situ hybridizations revealed that the developmental pattern of COLP3α expression closely resembles that independently reported for Spcoll (6). Hence, we hypothesized that I28 corresponds either to the 5' portion of the Spcoll transcript or to the other 9-kb cross-hybridizing species identified by Nemer and Harlow (18) using the Spcoll genomic probe. To clarify this point, we decided to isolate the full-length sequence of COLP3α. Accordingly, an embryonic cDNA library was subjected to several rounds of screening utilizing 3' fragments of appropriate cDNA clones. This eventually led to the isolation of eight overlapping clones covering the entire COLP3α message in addition to 400 and 260 bp of 3'- and 5'-noncoding sequences, respectively (Fig. 1). The 5256-bp coding sequences of the COLP3α cDNAs were also found to include the 2.7 kb of Spcoll, thus ruling out the possibility that our clones represent the cross-hybridizing species previously detected by Nemer and Harlow (18).
Structure of the 3α Collagen-The sea urchin 3α collagen chain is 1752 amino acids long and displays the characteristic domains of a type IV collagen chain. These are as follows: a putative 28-residue signal peptide, a 14-residue amino-terminal non-collagenous segment, an interrupted triple-helical domain of 1484 residues, and a 226-residue carboxyl-terminal non-collagenous (NC1) domain (Fig. 2).
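The domain lengths quoted above can be checked against the reported chain length and the 5256-bp coding sequence with a few lines of arithmetic. The sketch below is illustrative only: the dictionary keys and function names are our own labels, and it assumes that the 5256-bp figure counts sense codons only.

```python
# Domain lengths (in residues) as quoted in the text; key names are illustrative.
DOMAINS = {
    "signal_peptide": 28,
    "amino_terminal_non_collagenous": 14,
    "triple_helical": 1484,
    "carboxyl_terminal_NC1": 226,
}
CODING_LENGTH_BP = 5256  # coding sequence length quoted for the cDNAs


def total_residues(domains):
    """Sum the residue counts of all domains."""
    return sum(domains.values())


def codons_in(coding_bp):
    """Number of codons in a coding sequence (assumes no stop codon is counted)."""
    if coding_bp % 3 != 0:
        raise ValueError("coding length is not a multiple of 3")
    return coding_bp // 3


chain_length = total_residues(DOMAINS)
codon_count = codons_in(CODING_LENGTH_BP)
print(chain_length, codon_count)
```

Both values come out to 1752, in agreement with the chain length stated for the 3α polypeptide.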
The type IV collagen network represents the central core around which numerous macromolecules self-assemble to originate the sheet-like structure of basement membranes (1). At least five genetically distinct type IV collagen chains have been identified in mammals (24-26); type IV-like collagen genes have also been cloned in Drosophila melanogaster, Ascaris suum, and Caenorhabditis elegans (27-30). We utilized the available amino acid sequences of these vertebrate and invertebrate chains to more firmly establish the identity of the S. purpuratus 3α collagen. To this end, we concentrated on the analysis of the type IV collagen domains which are known to be highly conserved across different chains and different species, notably the carboxyl-terminal NC1 domain and the amino-terminal 7 S region (24-30). In the initial steps leading to the assembly of the basement membrane network, the former domain of the type IV trimer participates in dimer association, whereas the latter is the site whereupon a tetrameric complex is formed and stabilized by disulfide bonds and lysine-derived cross-links (1).
Alignment of the NC1 sequences of the S. purpuratus chain with those of various vertebrate and invertebrate chains (24-31) confirmed the remarkable phylogenetic preservation of this domain configuration in the sea urchin type IV collagen as well (Fig. 3). This is particularly evident for the number of and relative spacing between the 12 cysteines implicated in the internal disulfide bonding of the folded NC1 domain (Fig. 3) (32). The comparison revealed the highest levels of sequence identity between the NC1 domain of the S. purpuratus chain and those of the human α1(IV) (71.2%) and α5(IV) (70.8%). The close identity to both mammalian chains is not surprising in view of the fact that these two collagens share 83% sequence identity in the NC1 domain (33).
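Percent identity values such as those quoted above are derived from pairwise comparison of aligned sequences. The text does not state which denominator was used, so the following is only a minimal sketch under one common convention: columns containing a gap in either sequence are skipped. The function name and the two short peptides are hypothetical and are not taken from the NC1 alignment.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percent identity over the ungapped columns of two aligned sequences."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    matches = 0
    for res_a, res_b in zip(aligned_a, aligned_b):
        if res_a == gap or res_b == gap:
            continue  # assumption: gap columns are excluded from the count
        compared += 1
        if res_a == res_b:
            matches += 1
    if compared == 0:
        raise ValueError("no ungapped columns to compare")
    return 100.0 * matches / compared


# Hypothetical toy alignment: 8 identical residues out of 10 compared columns.
toy_identity = percent_identity("GCNGTKGERG", "GCNGSKGDRG")
print(toy_identity)
```

A different denominator (for example, the length of the shorter sequence) would give slightly different figures, which is one reason identity values from separate studies are not always directly comparable.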
Similar considerations apply to the comparison of the 7 S regions (Fig. 4A), which also indicated that 3α has a slightly closer level of identity to the α5(IV) (62%) than to the α1(IV) chain (58%). Another noticeable feature of the 3α collagen is the presence of a glycosylation site (Asn-Gly-Thr) at the boundary between the collagenous and non-collagenous sequences, in addition to the highly conserved Asn-Gly-Thr triplet positioned immediately after the last cysteine of the 7 S region (Fig. 4A). Interestingly, in the sea urchin these putative glycosylation sites are within two identical peptide sequences (Cys-Asn-Gly-Thr-Lys-Gly-Glu-Arg-Gly) (Figs. 2 and 4A). Furthermore, the carboxyl-terminal end of the most upstream peptide overlaps with the triplet Arg-Gly-Asp, believed to mediate cell-matrix interactions by binding to integrin receptors on the plasma membranes (34).
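Putative N-linked glycosylation sites and Arg-Gly-Asp triplets of the kind described above can be located by a simple scan of the one-letter protein sequence. The sketch below uses the general sequon rule Asn-X-Ser/Thr with X other than Pro; the function names are our own, and the test peptide is the nine-residue sequence given in the text with one extra Asp appended, as implied by the stated overlap with Arg-Gly-Asp.

```python
import re

# N-linked glycosylation sequon: Asn, any residue except Pro, then Ser or Thr.
# The lookahead lets overlapping sequons be reported as well.
SEQUON = re.compile(r"(?=(N[^P][ST]))")


def find_sequons(protein):
    """Return (1-based position, triplet) for each putative sequon."""
    return [(m.start() + 1, m.group(1)) for m in SEQUON.finditer(protein)]


def find_rgd(protein):
    """Return 1-based positions of every Arg-Gly-Asp triplet."""
    return [i + 1 for i in range(len(protein) - 2) if protein[i:i + 3] == "RGD"]


# Cys-Asn-Gly-Thr-Lys-Gly-Glu-Arg-Gly from the text, plus the overlapping Asp.
peptide = "CNGTKGERGD"
sequons = find_sequons(peptide)
rgd_sites = find_rgd(peptide)
print(sequons, rgd_sites)
```

In this peptide the sequon starts at residue 2 and the Arg-Gly-Asp triplet at residue 8, so the two motifs share the final Arg-Gly of the repeated peptide.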
The triple-helical domain of all type IV collagens contains non-collagenous interruptions or imperfections which vary in length, composition, and location (1). In the S. purpuratus chain there are 23 such interruptions ranging in size from 1 to 20 amino acids (Fig. 2). As in all type IV collagens, one of the imperfections positioned in the amino-terminal third of the triple-helical domain (I9, Fig. 2) contains 2 cysteinyl residues. Two additional cysteines are located in imperfections I4 and I12 (Fig. 2). The former sequence also contains the third putative N-linked glycosylation site of the 3α collagen chain (Fig. 2). Finally, a second Arg-Gly-Asp triplet is noted at the carboxyl-terminal end of the triple-helix between imperfections I22 and I23 (Fig. 2).
Analysis of the 5' Portion of the COLP3α Gene-In the last set of experiments we analyzed the 5' portion of the COLP3α gene and thus determined the boundaries of the first exon. To this end, we isolated a phage clone by screening a genomic library with the cDNA clone L4 (Fig. 1). The resultant positive recombinant, GC18, was then analyzed by separate Southern blot hybridizations to the two EcoRI segments of L4 in order to identify the contiguous EcoRI genomic fragments. This placed the sequences of the 5' and 3' EcoRI probes of L4 in 3.5- and 5-kb EcoRI genomic fragments of GC18, respectively (data not shown).
To determine the 5' end of the 3α mRNA, a primer extension analysis was performed (Figs. 5 and 6). This identified a major extension product, 156 nucleotides in length (Fig. 5A), which placed the start site of transcription 260 bp from the ATG codon (Fig. 6), nearly coinciding with the last nucleotide of the cDNA clone 445 (Fig. 1). Sequencing of the 3' portion of the 3.5-kb subclone of GC18 revealed complete identity between cDNA and genomic sequences. The location of the start site of transcription was independently assessed by a nuclease protection experiment. To this end, we utilized a 578-nucleotide riboprobe that spans from the 3' EcoRI site of the 3.5-kb genomic subclone and thus includes the 5' end of clone 445. The size of the major resistant product, 96 nucleotides (Fig. 5B), placed the 5' end of COLP3α at the same location indicated by the primer extension reaction, 260 bp from the ATG codon (Fig. 6). Analysis of the upstream sequences revealed the presence of a putative TATA box (TATTAT), beginning at position -45, and a CCAAT motif on the opposite strand between nucleotides -103 and -99, relative to the start site of transcription (Fig. 6). Finally, it should be noted that numerous minor products were seen in the primer extension and the nuclease protection experiments (Fig. 5). However, only one of them was independently confirmed by both assays (Fig. 5), thus suggesting the presence of a second start site of transcription 22 nucleotides downstream from the major one (Fig. 6). Determination of the nucleotide sequences from the 5' EcoRI site of the 5-kb genomic subclone established the 3' end of exon 1.
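The positions of promoter elements relative to the start site of transcription, including a motif that lies on the opposite strand, can be found by scanning the upstream sequence for the motif and for its reverse complement. The sketch below is illustrative only: the function names are our own, the sequence is a synthetic filler constructed to carry the two elements at the positions quoted above (it is not the COLP3α promoter sequence), and it assumes the usual numbering in which -1 is the nucleotide immediately 5' of the start site.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the reverse complement of an uppercase DNA string."""
    return dna.translate(COMPLEMENT)[::-1]


def find_motif(upstream, motif):
    """Locate a motif on either strand of an upstream sequence.

    `upstream` must end at position -1. Each hit is (strand, first, last),
    with coordinates given on the top strand relative to the start site.
    """
    hits = []
    n = len(upstream)
    for strand, pattern in (("+", motif), ("-", reverse_complement(motif))):
        start = upstream.find(pattern)
        while start != -1:
            first = start - n  # string index 0 maps to -n, index n-1 to -1
            hits.append((strand, first, first + len(pattern) - 1))
            start = upstream.find(pattern, start + 1)
    return hits


# Synthetic 110-nt upstream region (hypothetical filler, not the real gene).
upstream = "G" * 7 + "ATTGG" + "C" * 53 + "TATTAT" + "C" * 39
tata_hits = find_motif(upstream, "TATTAT")
ccaat_hits = find_motif(upstream, "CCAAT")
print(tata_hits, ccaat_hits)
```

With this synthetic input the TATTAT element is reported on the top strand from -45 to -40 and the CCAAT motif on the opposite strand from -103 to -99. A palindromic motif would be reported once per strand, which may or may not be desirable depending on the analysis.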
This and the previous experiments documented that the sequences of the first coding unit of COLP3α include the entire 5'-noncoding segment, the signal peptide, and the first codon of the 7 S region (Fig. 6). Such an arrangement is identical to that of the mammalian α1(IV) gene and distinct from the multiple-exon configuration of the α2(IV) gene (Fig. 4B) (35-37). Although no information is presently available for the 5' end of the α5(IV) gene, these results reiterated the α1(IV)-like nature of the echinoid gene. The mammalian α1(IV) and α2(IV) collagen genes are characteristically arranged in a head-to-head configuration separated by a short TATA-less common promoter (35-37). In contrast, the C. elegans homologs of these genes are located on separate chromosomes (30). A similar situation may exist in the sea urchin, since preliminary evidence excludes the presence of a closely linked α2(IV)-like gene in the genomic clone we isolated.

DISCUSSION
We completed the primary structure of an embryonic collagen, 3α chain, from the sea urchin S. purpuratus and determined the promoter sequence of the corresponding gene, COLP3α. This gene product is identical to that encoded by the partial clone that Wessel et al. (6) recently isolated and showed to play a critical role in the in vitro differentiation of mesenchyme cells. We demonstrated that COLP3α encodes a type IV collagen α chain which we believe represents the α1(IV) chain of S. purpuratus. Although the structural data could not establish a firm relationship with either this chain or the closely related α5(IV) collagen, we rest this hypothesis on three lines of indirect evidence.
First, COLP3α is expressed at the time in which the most primitive basement membrane, the basal lamina lining the blastocoel of the developing embryo, is formed (6). Such a pattern of expression parallels that seen for the major type IV chains during vertebrate embryogenesis (2). Second, the 3α transcript is co-expressed with a gene product of similar size that displays an estimated 30% sequence divergence (18), a value that closely resembles the difference between the mammalian α1(IV) and α2(IV) collagens (27). Third, antibodies raised against the triple-helical domain of the 3α chain have been shown to recognize a collagenase-sensitive heterotrimer in the culture media of mesenchyme cells (6).
Hence, it is reasonable to argue that the two COLP3α hybridizing bands which appear at distinct stages of sea urchin development correspond to three distinct type IV collagen gene products. To be precise, the 9-kb band might correspond to the two transcripts from the major type IV genes (α1(IV) and α2(IV)), whereas the 7-kb band might represent a minor α1(IV)-like gene product. Such a hypothesis is consistent with early immunological indications of three type IV-like α chains in the sea urchin (4), as well as with the recently recognized heterogeneity of this group of nonfibrillar collagens in mammals (1). A corollary to our hypothesis is that the differential expression of the sea urchin type IV collagen genes may be responsible for distinct morphogenetic properties of basement membranes in the developing organism. Cloning of additional type IV-like transcripts and characterization of the function of these gene products will eventually test the validity of this hypothesis.
Another interesting point emerging from our studies is that the structural identity of individual collagens and the overall diversity of this family of molecules in the sea urchin replicate to some extent the vertebrate situation. It is also evident that, regardless of their identity and function, the sea urchin collagen genes display a pattern of expression specifically restricted to the cellular derivatives of the vegetal plate and the micromeres (6, 14, 15). In this respect, it will be of interest to characterize the molecular mechanisms that orchestrate collagen gene expression in this experimental model widely used for studying cell lineage specification during animal embryogenesis (38).