Characterization of a Proline-rich Cell Wall Protein Gene Family of Soybean

Further characterization of a proline-rich cell wall protein gene family from soybean (Glycine max (L.) Merr.) has been accomplished by the isolation and sequence analysis of two additional genes, SbPRP2 and SbPRP3, which encode mRNAs of 1050 and 650 nucleotides in length, respectively. Like the proline-rich protein gene SbPRP1, which was previously reported (Hong, J. C., Nagao, R. T., and Key, J. L. (1987) J. Biol. Chem. 262, 8367-8376), these two SbPRP genes encode proteins having a signal peptide sequence and repeats of Pro-Pro-Val-Tyr-Lys. The SbPRP2 gene encodes a protein of 26 kDa which contains a perfect alternating repeat of Pro-Pro-Val-Tyr-Lys and Pro-Pro-Val-Glu-Lys. The SbPRP3 gene encodes a 10-kDa protein which also contains Pro-Pro-Val-Tyr-Lys as a major amino acid repeat, but the overall amino acid sequence of this protein is more variable than that of SbPRP1 and SbPRP2. RNA blot analyses have demonstrated that there are marked differences in the pattern of expression of each SbPRP gene in various soybean tissues. In contrast, sequence analysis reveals that the SbPRP genes show a high degree of sequence conservation. Nucleotide sequence homology extends 90 to 100 base pairs upstream of the transcription initiation site and includes typical CAT and TATA sequences. Approximately 80 base pairs of the 3'-noncoding sequence around the polyadenylation signal are also highly conserved. Therefore, the DNA sequence upstream of the 5'-conserved region is presumed to contain cis-elements accounting for the developmental and tissue specificity of gene expression. While the pentameric repeat structures occur in all SbPRP genes, the encoded proteins are predicted to differ in several features, including basicity, substitutions of tyrosine and glutamic acid in the repeat, and the size of the mature protein.

Several plant cDNAs or genes that encode structural cell wall proteins have been characterized (see review by Cassab and Lin, 1989). Extensins, the hydroxyproline-rich glycoproteins (HRGPs) of the dicotyledonous cell wall, are the best characterized cell wall proteins in higher plants (Chen and Varner, 1985b). Extensins have been found in many plant species and are proposed to be one of the major protein components of the primary wall (Lamport and Catt, 1981). The various levels of HRGP in different organs and tissues (Cassab et al., 1985; Cassab, 1986), however, suggest that the distribution of extensin in the plant cell is rather tissue specific or that HRGPs may not be the only major protein in plant cell walls. There are several known instances of total cell wall proteins in which glycine is a major amino acid component, suggesting the presence of glycine-rich protein (Varner and Cassab, 1986). Recently, glycine-rich protein genes were isolated from petunia and bean (Condit and Meagher, 1986; Keller et al., 1988). Additionally, cDNAs and genomic clones encoding proline/hydroxyproline-rich protein (PRP) distinct from the HRGP have been isolated from both soybean and carrot (Chen and Varner, 1985a; Hong et al., 1987). The isolation and characterization of genes representing these different classes of cell wall proteins have revealed that there are major differences in the basic repeat motifs of these proteins: Ser-(Hyp)4 for extensin, (Gly-X)n for glycine-rich protein, and Pro-Pro-Val-X-Y for PRP. We have recently reported the isolation and characterization of a proline-rich protein gene of soybean, SbPRP1, which probably plays an important role in plant development (Hong et al., 1987). The SbPRP1 gene sequence predicted a novel proline-rich protein containing a highly repeated amino acid structure consisting essentially of Pro-Pro-Val-Tyr-Lys.
Northern blot analysis of RNA isolated from soybean hypocotyl tissue using a cDNA for this gene, pTU04, revealed three mRNA bands homologous to the cDNA probe: a 1050 nt mRNA from the apical and elongating region, and two mRNAs of 1220 nt and 650 nt from the mature hypocotyl (Hong et al., 1987). Further studies on the expression of this PRP gene family using gene-specific probes revealed a highly regulated pattern of gene expression during plant development; the major form of PRP mRNA changed at different stages of development and/or in different tissues or organs (Hong et al., 1989).
To study the molecular basis of this developmental regulation of the SbPRP gene family, we isolated and characterized two other PRP genes, SbPRP2 and SbPRP3, which encode the 1050 and 650 nt mRNAs, respectively. The results demonstrate that each member of this PRP gene family encodes a related proline-rich protein with differences occurring in
nt mRNA encoded from the SbPRP3 gene using the trichloroacetate buffer method of Murray (1984). The 5'-end of the transcript was determined using the 450-bp BglI*/NdeI fragment, which was 5'-end-labeled using T4 polynucleotide kinase (Bethesda Research Laboratories) after the BglI cut and further digested with NdeI. The 3'-end of the transcript was determined using the 656-bp BglI*/HindIII fragment, which was 3'-end-labeled using T4 DNA polymerase (Maniatis et al., 1982). Mungbean nuclease-protected DNA fragments were analyzed to determine the 5'- and 3'-termini of the mRNA as previously described (Hong et al., 1987). The 5'-end of the 1050 nt mRNA was predicted from sequence comparisons to two other pTU04-related genes; the 3'-end of the RNA was deduced from the DNA sequence of several cDNAs which encode the 1050 nt mRNA.

MATERIALS AND METHODS
cDNA Cloning of SbPRP2 mRNA-Ten μg of poly(A) RNA isolated from the apical region of the soybean hypocotyl (Hong et al., 1987) were used to construct a cDNA library in λgt10 by the procedure of Gubler and Hoffman (1983). The cDNA was treated with T4 DNA polymerase and the Klenow fragment of Escherichia coli DNA polymerase I (Maniatis et al., 1982) and methylated with EcoRI methylase prior to the addition of EcoRI linkers. Following digestion with excess EcoRI restriction enzyme and removal of excess linkers, the cDNA was size fractionated through a Sepharose CL-4B column to obtain cDNA molecules larger than 400 base pairs (Maniatis et al., 1982). The cDNA was then ligated into the unique EcoRI site in λgt10 (Huynh et al., 1984) Nagao et al. (1981) with purified, random primer-labeled cDNA inserts (Feinberg and Vogelstein, 1983). The λ clone, λSAx41, which contained the SbPRP3 gene in a 14-kb insert, was isolated from the Charon 35 library using 32P-labeled pTU04 cDNA insert as a probe (Hong et al., 1987) (Maniatis et al., 1982), and transferred to nitrocellulose as previously described (Hong et al., 1987). After transfer, the blots were hybridized as described (Hong et al., 1987), in the presence of 2 x 10^6 cpm/ml of 32P-random primer-labeled (Feinberg and Vogelstein, 1983)

RESULTS
Phage libraries constructed from soybean DNA were screened with pTU04 cDNA or pSAp4-5 cDNA as a probe (see "Materials and Methods"). Two genomic clones, λSAx41 and λJH403, which carry the 650 nt RNA gene (SbPRP3) and the 1050 nt RNA gene (SbPRP2), respectively, were isolated and further characterized. Both genomic clones contained an entire copy of the respective gene flanked by several kilobases of DNA.

Sequence Analysis of the SbPRP2 and SbPRP3 Genes-The restriction maps and the nucleotide sequences of the SbPRP2 and SbPRP3 genes are shown in Figs. 1, 2, and 3, respectively. Restriction enzyme analyses and DNA sequence analyses of the different clones indicate that these two genes are not clustered. SbPRP2 contains an open reading frame of 690 bp (230 amino acids) whose deduced amino acid sequence corresponds to a 25,978-Da protein. This coding sequence is followed by a 3'-untranslated region that contains a consensus polyadenylation signal 201 bp 3' to the translation termination codon. The SbPRP3 gene contains an open reading frame of 270 bp (90 amino acids) encoding a protein of 10,293 Da. A putative polyadenylation signal, AATAAA, is observed 236 bp 3' to the termination codon.
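The reported sizes can be cross-checked with simple arithmetic: an open reading frame of N bp corresponds to N/3 codons, and dividing the predicted mass by the residue count gives a mean residue mass. The following is a minimal sketch that uses only the figures quoted above; the function name `orf_summary` is illustrative and not part of the original analysis.

```python
def orf_summary(orf_bp: int, mass_da: float) -> tuple[int, float]:
    """Return (number of codons, mean residue mass in Da) for an ORF."""
    if orf_bp % 3 != 0:
        raise ValueError("ORF length must be a multiple of 3")
    residues = orf_bp // 3
    return residues, mass_da / residues


# ORF length (bp) and predicted protein mass (Da) as quoted in the text.
reported = {"SbPRP2": (690, 25978), "SbPRP3": (270, 10293)}

for gene, (bp, mass) in reported.items():
    residues, mean_mass = orf_summary(bp, mass)
    print(f"{gene}: {residues} residues, {mean_mass:.1f} Da per residue")
```

Both genes give the residue counts stated in the text (230 and 90), which indicates that the quoted open reading frame lengths exclude the termination codon.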
A summary of the transcription units and encoded proteins, deduced from the nucleotide sequences of the three SbPRP genes, is shown in Table I. From the deduced amino acid sequences, putative signal peptide sequences of 22 and 23 amino acids are predicted for the SbPRP2 and SbPRP3 genes, respectively. The patterns of the N-terminal amino acid sequences are in good agreement with sequences which are typical of membrane-spanning signal sequences of secretory proteins (Perlman and Halvorson, 1983) (Fig. 7). Following the signal sequence, the remainder of the coding region consists of multiple 15-bp repeats.
Previously the SbPRP1 gene was reported to contain 43 repeats of a sequence consisting primarily of Pro-Pro-Val-Tyr-Lys (CCX-CCX-GTX-TAX-AAX).
There are five amino acid deviations from this repeat pattern: two Ile (ATT) substitutions for Val (GTT) and one substitution each of Thr (ACT) for Val (GTT), Asn (AAC) for Lys (AAG/A), and Gly (GGA) for Glu (GAG). The major difference between the SbPRP1 and SbPRP2 proteins is the Glu substitution for Tyr in the alternating pentapeptide repeats. The amino acid composition of the predicted mature SbPRP2 protein is Pro 39.1%, Lys 18.3%, Val 16.3%, Tyr 11%, and Glu 9.1% (Table I). The SbPRP2 protein is therefore predicted to be less basic than the SbPRP1 protein. The SbPRP3 gene also encodes a proline-rich protein having a pentapeptide repetitive structure, which is shorter (90 amino acids) and more variable than that of the other two PRPs. Six repeats of Pro-Pro-Val-Tyr-Lys and two repeats of Pro-Pro-Tyr-Lys-Lys occur in SbPRP3. The analysis of codon usage reveals a biased pattern, with the preferred codons being CCA for Pro, GTT for Val, TAC for Tyr, and AAG for Lys.
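The link between the Glu-for-Tyr substitution and the lower predicted basicity can be illustrated with idealized repeat sequences. The sketch below is an assumption-laden toy: `prp1_like` and `prp2_like` are hypothetical perfect repeats, not the real protein sequences, and the charge estimate simply counts +1 for each Lys/Arg and -1 for each Asp/Glu.

```python
from collections import Counter


def composition(protein: str) -> dict[str, float]:
    """Percent composition of each residue in a one-letter protein string."""
    counts = Counter(protein)
    return {aa: 100.0 * n / len(protein) for aa, n in counts.items()}


def net_charge(protein: str) -> int:
    """Crude net charge: +1 per Lys/Arg, -1 per Asp/Glu (His ignored)."""
    positive = sum(protein.count(aa) for aa in "KR")
    negative = sum(protein.count(aa) for aa in "DE")
    return positive - negative


# Hypothetical idealized sequences, 20 residues each (not the real proteins).
prp1_like = "PPVYK" * 4        # Pro-Pro-Val-Tyr-Lys only
prp2_like = "PPVYKPPVEK" * 2   # alternating Tyr and Glu pentapeptides

for name, seq in (("SbPRP1-like", prp1_like), ("SbPRP2-like", prp2_like)):
    print(name, composition(seq), "net charge:", net_charge(seq))
```

In this idealization the alternating repeat gives Pro 40%, Lys 20%, Val 20%, Tyr 10%, and Glu 10%, close to the composition reported for the mature SbPRP2 protein, and its net positive charge is half that of the all-Tyr repeat of the same length.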
Both SbPRP genes have putative CAT and TATA motifs about 80 bp and 25 bp upstream of the cap sites, respectively. For SbPRP3, the 5'- and 3'-ends of the transcription unit were mapped using mungbean nuclease (data not shown) (Fig. 3). Since the SbPRP genes retain a high level of nucleotide sequence conservation and contain GTGTGTT immediately before the transcription start site, the cap site of the SbPRP2 gene was deduced by analogy (Fig. 2). In SbPRP3, the upstream region of DNA contains two repeats of a 13-bp sequence (CATGCTTGATTt/aC) and two inverted repeats with a sequence of ATTGxxxCACTAxACATGCxA, but the significance of these conserved sequences has not been determined.
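Degenerate motifs such as the 13-bp repeat can be located by translating them into a regular expression. The sketch below assumes that "t/a" is written as the IUPAC code W and "x" as N; the test sequence `toy` is hypothetical, not the SbPRP3 upstream region.

```python
import re

# Subset of IUPAC nucleotide codes needed for the motifs in the text.
IUPAC = {"A": "A", "C": "C", "G": "G", "T": "T", "W": "[AT]", "N": "[ACGT]"}


def motif_regex(motif: str) -> str:
    """Translate an IUPAC motif string into a regular expression."""
    return "".join(IUPAC[base] for base in motif.upper())


def find_motif(motif: str, dna: str) -> list[int]:
    """Return 0-based start positions of all (possibly overlapping) matches."""
    lookahead = re.compile(f"(?={motif_regex(motif)})")
    return [m.start() for m in lookahead.finditer(dna.upper())]


# Hypothetical sequence carrying both variants of the 13-bp repeat.
toy = "GGCATGCTTGATTTCAAAACATGCTTGATTACGG"
print(find_motif("CATGCTTGATTWC", toy))  # [2, 19]
```

The inverted repeat could be searched the same way with the motif `ATTGNNNCACTANACATGCNA`, applied to both the sequence and its reverse complement.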
Genomic Southern Analysis-Fig. 4 shows the Southern analysis of soybean genomic DNA digested with BglII and EcoRI and probed with 32P-labeled SbPRP gene-specific probes. The SbPRP2 probe hybridized to a single major band of 5.4 kb with BglII and 7.7 kb with EcoRI digestion, respectively. The SbPRP3 probe did not yield discrete bands on the same BglII-digested sample, indicating the lack of a BglII site in a 10-20 kb piece of genomic DNA. The EcoRI digestion yielded 5.6- and 3.6-kb bands which hybridized to the SbPRP3 probe with different intensities. Overall, the Southern blot analyses indicate that, like the SbPRP1 gene, the SbPRP2 and SbPRP3 genes are not members of large gene families.
Sequence Homology of the SbPRP Gene Family-A diagram of the organization of the SbPRP gene family of soybean is illustrated in Fig. 5. The SbPRP1 gene was included for comparison. These genes show striking similarity, with several regions of sequence identity observed at similar locations: the 5'-upstream regions around the CAT and TATA motifs, the 5'-untranslated sequences, the coding regions, including a signal peptide sequence and the mature protein, and the 3'-untranslated region around the polyadenylation signal.
Legend to Fig. 4 (genomic Southern analysis): Ten μg of soybean DNA were digested with the indicated restriction enzymes, electrophoresed on a 0.6% agarose gel, and transferred to nitrocellulose. Indicated molecular sizes in kilobases were determined by HindIII-digested DNA molecular mass markers.
Legend to Fig. 5 (organization of the SbPRP gene family): Regions of high sequence homology among the SbPRP genes are boxed; regions of no similarity are indicated with bold lines. The amino acid repeat structure for each gene is indicated. The region of DNA from which gene-specific probes were obtained is indicated by GSP*.
Legend to Fig. 6 (N-terminal amino acid sequences): A signal peptide sequence and the putative cleavage site are indicated. Boxed regions indicate the amino acid sequences which form the hydrophobic core of signal peptide sequences. The putative cleavage site was predicted according to Von Heijne (1984).
In the 5'-upstream region of the three SbPRP genes, nucleotide sequence homology extends into the putative CAT boxes, GTCAX1-2T, which are positioned at approximately the same distance from the cap site (-80 to -85). In this region, the TATA motif (TATAAAAA) occurs approximately 25 bp upstream of the cap sites. The nucleotide conservation in this 5'-conserved region ranges from 69% (between SbPRP2 and SbPRP3 or SbPRP1 and SbPRP3) to 84% (SbPRP1 and SbPRP2). In addition, immediately 5' to the transcription initiation site, a GTGTGTT sequence is present in all three SbPRP genes. When the regions of DNA encoding the putative signal sequences are compared, it is apparent that these PRPs are closely related. The sequence homologies between SbPRP2 and SbPRP1 or SbPRP3 in this region are 98 and 72%, respectively. In addition to the 5'-upstream conserved region, the 3'-ends of the SbPRP genes, spanning approximately 80 bp, have about 70% sequence identity centered around the polyadenylation signal. A region of DNA approximately 200 bp in length, located between the translation termination codon and the region of 3'-homology around the polyadenylation signal, did not possess noticeable sequence identity.
This region was used for the preparation of gene-specific probes.
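Pairwise conservation values such as those quoted above are typically computed as the fraction of identical positions in an alignment. The sketch below shows one common convention, which is an assumption here because the text does not state how gaps were treated: columns containing a gap in either sequence are skipped. The example sequences are hypothetical.

```python
def percent_identity(a: str, b: str) -> float:
    """Percent identical positions between two pre-aligned sequences.

    Columns where either sequence has a gap ('-') are skipped
    (an assumed convention, not taken from the original analysis).
    """
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    columns = [(x, y) for x, y in zip(a.upper(), b.upper())
               if x != "-" and y != "-"]
    if not columns:
        raise ValueError("no comparable columns")
    matches = sum(x == y for x, y in columns)
    return 100.0 * matches / len(columns)


# Hypothetical aligned fragments differing at one of 14 positions.
seq_a = "GTGTGTTATAAAAA"
seq_b = "GTGTGTTATATAAA"
print(f"{percent_identity(seq_a, seq_b):.1f}% identity")
```

Counting gap columns as mismatches instead would lower the reported value, so the convention should be stated alongside any percentage.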
The highly conserved pattern of amino acid sequences in the N-terminal region of the encoded proteins is shown in Fig. 6. The signal peptide sequences of the three SbPRPs are in good agreement with typical membrane-spanning signal sequences of various secreted eukaryotic proteins (Perlman and Halvorson, 1983; Von Heijne, 1983). Comparison of the amino acid sequences of the coding regions revealed their highly repeated nature, which is reflected in the hydropathy plots of the encoded proteins (Fig. 7).
Legend to Fig. 7 (hydropathy plots): Plots were constructed by the method of Kyte and Doolittle (1982) by progressively moving along the amino acid sequence and averaging the hydropathy index for nine amino acids. Points above the horizontal line correspond to hydrophobic regions; points below the horizontal line represent hydrophilic regions.
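The sliding-window averaging described for the hydropathy plots can be sketched as follows. This is a minimal illustration using the published Kyte-Doolittle index values and a nine-residue window; the input `toy` is a hypothetical idealized repeat, not an SbPRP sequence, and the position of the reference line in the original figure is not assumed.

```python
# Kyte-Doolittle hydropathy index (Kyte and Doolittle, 1982).
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def hydropathy_profile(protein: str, window: int = 9) -> list[float]:
    """Mean hydropathy index of every full window along the sequence."""
    values = [KD[aa] for aa in protein.upper()]
    return [sum(values[i:i + window]) / window
            for i in range(len(values) - window + 1)]


# Hypothetical idealized repeat: four Pro-Pro-Val-Tyr-Lys units.
toy = "PPVYK" * 4
profile = hydropathy_profile(toy)
print([round(v, 2) for v in profile])
```

For this idealized repeat every window average is negative and the profile is periodic with the five-residue repeat, which is the kind of regular pattern a repetitive coding region produces in such a plot.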

DISCUSSION
The characterization of two additional proline-rich cell wall protein genes, SbPRP2 and SbPRP3, which encode 1050 and 650 nt mRNAs, respectively, reveals a striking similarity among members of this gene family, detectable both at the nucleotide sequence level and in the highly repetitive structure of the coding region. Highly repetitive structural motifs are observed in the deduced amino acid sequences of the polypeptides encoded by all three SbPRP genes (Fig. 5). The presence of multiple 15-bp repeats (CCX-CCX-GTX-TAX-AAX) in all three PRP genes explains the cross-hybridization of the pTU04 cDNA to the 1050 and 650 nt mRNAs on Northern analysis, in addition to the 1220 nt mRNA (Hong et al., 1987). The SbPRP genes encode proline-rich proteins which are highly related, with major differences predicted in their biochemical nature, such as Tyr and Glu content or basicity, and in the physical size of the mature protein. Compared with the SbPRP1 protein, which contains the Pro-Pro-Val-Tyr-Lys pentapeptide as the major amino acid repeat, SbPRP2 contains a near perfect alternating repeat structure composed primarily of Pro-Pro-Val-Tyr-Lys and Pro-Pro-Val-Glu-Lys. Based on the pattern of repeat structure and amino acid composition, the SbPRP3 protein is more similar to SbPRP1 than to SbPRP2. The major difference is the size of the mature SbPRP3 protein, which is approximately 8 kDa versus 27 and 24 kDa for SbPRP1 and SbPRP2, respectively.
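The degenerate 15-bp unit cited above as the basis for cross-hybridization can be written as a pattern and counted in a nucleotide sequence. The sketch below treats X as any base; the sequence `toy` is hypothetical, assembled from the preferred codons reported in the Results plus one Glu-containing unit, and is not taken from any SbPRP gene.

```python
import re

# CCX-CCX-GTX-TAX-AAX with X = any nucleotide.
REPEAT = re.compile(r"CC[ACGT]CC[ACGT]GT[ACGT]TA[ACGT]AA[ACGT]")


def count_repeats(dna: str) -> int:
    """Count non-overlapping matches of the degenerate 15-bp unit."""
    return len(REPEAT.findall(dna.upper()))


# Hypothetical sequence: three Pro-Pro-Val-Tyr-Lys units, one
# Pro-Pro-Val-Glu-Lys unit (GAG does not fit TAX), one more Tyr unit.
toy = "CCACCAGTTTACAAG" * 3 + "CCTCCAGTTGAGAAG" + "CCACCTGTATATAAA"
print(count_repeats(toy))  # 4
```

Note that the pattern as written is purely positional: TAX also covers the stop codons TAA and TAG, and AAX also covers Asn codons, so a match indicates nucleotide similarity to the repeat rather than a guaranteed Pro-Pro-Val-Tyr-Lys translation.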
To address the functional importance of the proline-rich protein gene family, it is desirable to find a relationship to other known cell wall protein genes. The most highly studied cell wall proteins are the HRGP extensins of dicots (reviewed by Cassab and Varner, 1988). Cell wall HRGP extensins contain a characteristic pentapeptide repeat sequence, Ser-(Pro)4, in the primary translation product of the apoprotein, which is subsequently hydroxylated to give the Ser-(Hyp)4 found in the mature protein (Chrispeels, 1970). A comparison of amino acid sequences deduced from all known HRGP sequences revealed substantial variability in the length or in the amino acids flanking the highly conserved Ser-(Pro)4 repeats, e.g. Val-His or Val-Ala (Showalter et al., 1985), Thr-Pro-Val-Tyr-Lys (Smith et al., 1986), and Val-Tyr-Tyr-Tyr-Lys or Tyr-Tyr-Tyr-His (Corbin et al., 1987) (see review by Showalter and Varner, 1989). These flanking sequences are presumed to be important in functionally and structurally distinguishing each HRGP isomer. Cell wall HRGPs contain carbohydrates, approximately two thirds of the glycoprotein mass, which are composed largely of oligoarabinosyl residues attached through O-glycosidic linkages to most of the hydroxyproline residues and, to a much lesser extent, of galactose, which occurs in O-glycosidic linkage to some of the serine residues (Lamport, 1973). Glycosylation of extensin has been suggested to be important in maintaining its proper structure and function (Stafstrom and Staehelin, 1986; Sadava and Chrispeels, 1973; Cooper et al., 1983). The polysaccharides of mature HRGP appear to stabilize HRGP in forming extended helical rods of polyproline II conformation (left-handed helix, 3 residues/turn, a pitch of 3.12 Å) (Van Holst and Varner, 1984). The HRGP becomes insoluble in the cell wall, perhaps through the formation of isodityrosine bonds (Fry, 1986), which are formed by two adjacent tyrosine residues.
Although Ser-(Pro)4 does not occur in the SbPRP gene family, the relationship of SbPRPs to other cell wall HRGPs is noted at the nucleotide sequence level (Hong et al., 1987). The Pro-Pro-Val-Tyr-Lys repeat of SbPRPs has been observed as part of an extensin repeat in tomato (Smith et al., 1986). Recently, additional cDNAs encoding cell wall proteins have been reported which are unrelated to the HRGP and lack the Ser-(Pro)4 motif. These include cDNAs for nodulin-75 (Franssen et al., 1987) of soybean nodules and a 33-kDa PRP of carrot (Chen and Varner, 1985a).
A cell wall protein which has the same amino acid composition as the SbPRP1 protein was identified from cultured soybean cells (Averyhart-Fullard et al., 1988). The protein lacks histidine and serine and contains 20% hydroxyproline and 20% proline. The molecular mass of the protein appears to be 33 kDa on sodium dodecyl sulfate-polyacrylamide gel.
In a recent study, the steady-state levels of SbPRP mRNA accumulation showed that each transcript is differentially expressed, showing dramatic patterns of developmental and organ specificity (Hong et al., 1989). The major SbPRP transcript accumulating in the mature hypocotyl, root, and young seed coat was SbPRP1; SbPRP2 was the major transcript in the apical hypocotyl and young tissue-cultured cells, and SbPRP3 in most of the aerial parts of the soybean plant.
In contrast to the marked differences in mRNA accumulation, sequence analysis of the SbPRP genes revealed strikingly similar gene structure and a high degree of sequence conservation. The nucleotide sequence conservation extends 90 to 100 bp upstream of the cap site and, at the 3'-end, into the noncoding region around the polyadenylation signal. The CAT and TATA motifs, cap sites, and polyadenylation signals are positioned at approximately the same distances relative to the transcription unit. This high conservation suggests that these sequences are functionally significant for the regulation of expression of the SbPRP gene family. However, sequences involved in the dramatic developmental- and organ-specific expression of this gene family are likely located upstream of this highly conserved region.