Structure of a Gene Encoding the 1.7 S Storage Protein, Napin, from Brassica napus*

A rapeseed chromosomal region containing a gene (napA), which encodes the 1.7 S seed storage protein (napin), was isolated in several overlapping recombinant clones from a phage λ genomic library. Following restriction enzyme mapping of the genomic region, a subclone containing the napA coding region as well as some 1.1 and 1.4 kilobases of DNA from the 5' and 3' regions, respectively, was mapped and sequenced. The gene turned out to lack introns. Southern blotting analyses utilizing a napin cDNA clone as a probe revealed the presence of on the order of 10 napin genes in the rapeseed genome. The major polyadenylated transcript encoded by these genes was shown to be an 850-nucleotide species, the initiation site of which was mapped onto the napA gene. The major initiation site for transcription is located some 33 nucleotides downstream from a sequence perfectly conforming to the consensus sequence of a TATA box. Further analyses of the sequence revealed several features that may be of relevance for the expression of the napin genes.

* This work was supported by The Swedish Research Council for Natural Sciences, The Swedish Research Council for Forestry and Agriculture, and the Stiftelsen Brinkgirden. The costs of publication of this article were defrayed in part by the payment of page charges. This article must therefore be hereby marked "advertisement" in accordance with 18 U.S.C. Section 1734 solely to indicate this fact.
The nucleotide sequence(s) reported in this paper has been submitted to the GenBank™/EMBL Data Bank with accession number(s) J02798.

As an initial step toward an increased understanding of the regulation of napin genes, we have isolated and sequenced a member of what turns out to be a small gene family.

DISCUSSION
We have isolated and sequenced a gene encoding napin. The gene is a member of a small family of some 10 genes. Transcription of an as yet unknown number of these genes yields an 850-nucleotide-long mRNA, the cap site of which was mapped onto the napA sequence. We have compared our sequence with that of another napin gene, pGNA, as well as with previously sequenced cDNA clones (Crouch et al., 1983; Ericson et al., 1986). The napA sequence is completely identical to the pNAP1 cDNA clone that we have previously sequenced (Ericson et al., 1986). This makes us rather confident that we have sequenced an expressed copy of the napin gene family, although we have no formal proof that this is the case.
Comparison with the pGNA gene sequence revealed that, apart from single nucleotide changes, a quite frequently occurring divergence in the coding region is insertions of one or two triplets in pGNA relative to napA. These occur in four and two instances, respectively (data not shown). Apart from one previously reported triplet deletion in the pN1 cDNA clone (Crouch et al., 1983), these are the first examples of differences that affect the length of the primary sequence of the translated napin product. The number of nucleotide changes in the coding region is also higher when comparing napA with pGNA than with any of the previously sequenced cDNA clones (data not shown). It is interesting to speculate whether these observations may be related to the fact that B. napus is an amphidiploid of Brassica campestris and Brassica oleracea. It might be expected that the genes derived from one of the respective parental species would be more homologous to each other than when comparing across the parental border. We are presently attempting to assign the parentage of isolated napin genes by comparison with Southern blots of genomic DNA from the three species. Preliminary data indicate that the napA gene most likely is derived from B. oleracea.
FIG. 1. Genomic restriction fragments hybridizing with napin cDNA sequences. Genomic DNA was cut with restriction enzymes. The generated fragments were separated and blotted onto nitrocellulose filters as described under "Materials and Methods." Nick-translated pNAP1 cDNA was used as a probe in hybridization to these filters. The enzymes used were B, BamHI; E, EcoRI; H, HindIII; and P, PvuII. The size marker (M) used was an end-labeled BstEII digest of phage λ DNA. Sizes of the marker bands were (from top to bottom): 8454, 7242, 6369, 5687, 4822, 4324, 3675, 2323, 1929, 1371, 1264, and 702 base pairs.

FIG. 2. Northern blotting and hybridization of rapeseed mRNA to pNAP1 cDNA. mRNA was purified and separated on denaturing agarose gels as described under "Materials and Methods." After transfer to nitrocellulose filters the immobilized mRNA was hybridized to a nick-translated cDNA probe. R denotes the RNA lane; M, the marker lane. The marker used was a denatured HinfI digest of pBR322. The autoradiogram reveals the marker bands hybridizing to nick-translated pUC19. The sizes of the bands are 1631 and 517/506 nucleotides, respectively.

An 18-mer oligonucleotide, complementary to a napin sequence just downstream from the initiation codon, was synthesized. This synthetic oligonucleotide, 32P end-labeled and unlabeled in the respective cases, was annealed to either mRNA or M13 DNA covering this region on the minus strand. In separate reactions the primer was allowed to be elongated to the 5' end of the napin transcripts or to prime a standard set of sequencing reactions. The products were separated on a gradient sequencing gel. Lane R shows the terminated forms that were elongated on the mRNA; lanes A, C, G, and T, the respective sequencing reactions.
With regard to the primary translation product, comparisons of all the known sequences have made us aware of an interesting repeated structure in the removed parts of the napin polypeptide. All of the previously sequenced cDNA clones and the two genomic clones discussed here conform to this structure. It consists of a stretch of 7 or 8 amino acids, X-X---(-)X, where X and - denote hydrophobic and negatively charged amino acids, respectively. These sequences in napA are shown boxed in Fig. 6. The negatively charged amino acid in brackets is only present in the first copy of the repeat, which occurs in the amino-terminal part of the precursor sequence, before the small subunit. The second copy of the repeat occurs within the removed sequence which is present between the small and large subunits. These two repeats in fact carry almost all of the negative charges that are contained in the processed parts of the precursor (Ericson et al., 1986). It is possible that these repeats are involved in processes relevant for the translocation, intracellular transport, and/or deposition of napin into protein bodies. Alternatively, they could serve as signals in the proteolytic processing steps necessary for the generation of mature napin. However, confirmation of a possible role of these repeats in the above processes will have to await experiments directly aimed at these points.
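The repeat pattern described above (hydrophobic, acidic, hydrophobic, three or four acidic residues, hydrophobic) can be expressed as a regular expression. The following is a minimal sketch only: the text does not say which residues count as hydrophobic, so the `HYDROPHOBIC` set is an assumption, and the function name and test peptides are hypothetical and are not napin sequences.

```python
import re

# Assumption: the paper does not list the residues it treats as hydrophobic;
# this set is an illustrative choice, not the authors' definition.
HYDROPHOBIC = "AVLIMFWPY"
# Negatively charged residues: aspartate (D) and glutamate (E).
NEGATIVE = "DE"

# X - X - - - (-) X : the bracketed acidic residue is optional,
# so the acidic run after the second X is 3 or 4 residues long.
REPEAT = re.compile(
    f"[{HYDROPHOBIC}][{NEGATIVE}][{HYDROPHOBIC}][{NEGATIVE}]{{3,4}}[{HYDROPHOBIC}]"
)


def find_repeats(protein):
    """Return (0-based start, matched text) for each non-overlapping match."""
    return [(m.start(), m.group()) for m in REPEAT.finditer(protein.upper())]


# Hypothetical peptides used only to exercise the pattern.
seven_residue_hit = find_repeats("MKKLDVEEDFGG")
eight_residue_hit = find_repeats("GGIEAEDEELKK")
no_hit = find_repeats("LDVEEF")
```

Changing `HYDROPHOBIC` is the obvious knob if a different residue classification is preferred; the length constraint of 7 or 8 residues follows directly from the `{3,4}` quantifier.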
We have noted several interesting features in the sequence of napA (and pGNA) that may be of relevance to different aspects of gene regulation. It is tempting to speculate that the 5' hairpin region and the TACACAT repeat region may be directly involved in the transcriptional activation of the gene and that the 3' hairpin region may be involved in the termination of transcription. There is ample precedent in the literature for the former point, i.e. degenerate (or non-degenerate) repeats as well as alterations in DNA topology (possibly manifesting themselves in cruciform structures) have been implicated in gene regulation in several systems (Gidoni et al., 1985; Hall et al., 1982; Harland et al., 1983; Serfling et al., 1985). It appears more doubtful what role hairpin loops may play in the termination of RNA polymerase II transcripts (Birnstiel et al., 1985), although they may be involved in the termination of specific sets of genes (Hentschel and Birnstiel, 1981). In this context it is worth noting that the napA gene has several A/T-rich clusters downstream of the poly(A) addition site. As an alternative, these could fulfill a function as terminator signals.
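Perfect and degenerate copies of a short motif such as TACACAT can be located by sliding a window along the sequence and counting mismatches. The sketch below is illustrative only: the paper does not define how many mismatches still count as "degenerate", so the default of one mismatch is an assumption, and the function name and test sequence are hypothetical.

```python
def find_degenerate_repeats(dna, motif="TACACAT", max_mismatches=1):
    """Return (0-based start, window, mismatch count) for every window of
    len(motif) that differs from the motif at no more than max_mismatches
    positions. Only the given strand is scanned."""
    dna = dna.upper()
    hits = []
    for start in range(len(dna) - len(motif) + 1):
        window = dna[start:start + len(motif)]
        # Hamming distance between the window and the motif.
        mismatches = sum(a != b for a, b in zip(window, motif))
        if mismatches <= max_mismatches:
            hits.append((start, window, mismatches))
    return hits


# Hypothetical sequence containing one perfect and one single-mismatch copy.
hits = find_degenerate_repeats("GGTACACATGGTACGCATCC")
```

Setting `max_mismatches=0` restricts the search to perfect copies; scanning the reverse complement as well would be a natural extension that is not shown here.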

The determination and analysis of the nucleotide sequence of the napA gene have revealed features which we suggest may be related to gene regulation. Still, an increased understanding of gene regulation in the case of napin will undoubtedly have to await data regarding (a) co-regulated genes (e.g. cruciferin (Simon et al., 1985)), (b) a functional definition of the cis sequences by in vitro mutagenesis and transformation studies, (c) a definition of trans-acting factors either by the study of regulatory mutants or by studying DNA binding proteins, and (d) studies on how the abscisic acid response is mediated. The isolation and characterization of the napin gene described in this paper facilitate studies aimed at solving some of these questions.

FIG. 7. Alignment of the napA promoter region and the promoter region of the pGNA napin gene. The nucleotide sequences of the promoter regions of napA and the pGNA napin gene were aligned by use of the ALIGN program (Dayhoff et al., 1979) run with the UN matrix, a break penalty of 2 and 100 random runs. CAC trinucleotides are boxed and perfect or degenerate versions of the TACACAT repeats are indicated by arrows. The TATA box and initiation ATG are boxed for reference. The major transcription cap site is indicated by an arrow. Brackets at the 5' end encompass sequences with a tendency to form hairpin loops.
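The kind of scoring used for the promoter alignment can be illustrated with a small global alignment (Needleman-Wunsch) score calculation. This is a sketch under stated assumptions and not a reimplementation of the ALIGN program: it assumes a unitary matrix means 1 for identical bases and 0 otherwise, and it charges the penalty of 2 for every gapped position, which may differ from how ALIGN applies its break penalty. The randomized runs mentioned in the legend are not reproduced, and the function name is hypothetical.

```python
def global_alignment_score(a, b, match=1, mismatch=0, gap=2):
    """Best global alignment score of sequences a and b.

    Assumptions (not taken from the paper): identity scoring of
    match/mismatch, and a linear penalty of `gap` per gapped position.
    """
    # prev[j] holds the best score for a[:i-1] aligned with b[:j].
    prev = [-gap * j for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [-gap * i]  # a[:i] aligned against an empty prefix of b
        for j in range(1, len(b) + 1):
            diagonal = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            gap_in_b = prev[j] - gap       # a[i-1] paired with a gap
            gap_in_a = curr[j - 1] - gap   # b[j-1] paired with a gap
            curr.append(max(diagonal, gap_in_b, gap_in_a))
        prev = curr
    return prev[-1]


identical = global_alignment_score("ACGT", "ACGT")
one_deletion = global_alignment_score("ACGT", "AGT")
```

Only two rows of the dynamic programming table are kept, which is enough for a score; recovering the alignment itself would require storing the full table or traceback pointers.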


Acknowledgment-Dr. Steve R. Scofield is gratefully acknowledged for making his sequence of the pGNA napin gene available to us prior to publication.

... added. The sample was then incubated at 42 °C for 20 min and subsequently treated as a regular sequencing gel sample. Approximately 1 µl of the mixture (5000 cpm) was loaded onto the gel and run alongside a reference set of sequencing reactions.

Databases
The three major data bases (NBRF, EMBL and GenBank) were used in the sequence comparisons.

Southern and Northern blotting analyses
As an initial step towards defining the complexity of the rapeseed genome with regard to napin genes we decided to use pNAP1, a cDNA clone which encodes napin (Ericson et al., 1986), as a radioactive probe in Southern blotting analyses. 10 µg portions of total rapeseed DNA were in separate reactions digested to completion with four different restriction enzymes. Following separation of the generated DNA fragments on agarose gels, the fragments were denatured and transferred to nitrocellulose filters. Hybridization to the filters of nick-translated pNAP1 cDNA yielded the pattern shown in Figure 1. The different enzymes yielded between 8 and 13 hybridizing bands. Since it is not known to what extent the enzymes may cut within individual napin genes, there is no way of deducing an exact gene number. Nevertheless, considering the data as a whole it appears reasonable to assume that there are on the order of 10 genes for napin. How many of these hybridizing bands represent expressed napin genes is at present not clear.
Irrespective of the fact that several genes may be expressing napin, one well defined, major napin mRNA species was evident when rapeseed embryonal mRNA was subjected to Northern blotting with the cDNA probe (Figure 2). In addition to the major 850-nucleotide transcript, a diffuse population of RNA species is also evident. This ranges in size from approximately 900 to 1500 nucleotides, and as a whole constitutes quite a significant fraction of the total hybridizing material. We cannot at present determine whether these larger RNAs represent a vast population of differently polyadenylated species of napin transcripts or simply are contaminating hnRNA which has not yet been polyadenylated. In [...] (Yanisch-Perron et al., 1985), and further mapped by conventional [...], 1982). Figure 4 shows the map that was obtained and a comparison with the pNAP1 cDNA restriction map. [...] et al., 1986; Morelli et al., 1985). Thus, we considered it likely that all the [...] sequences involved in transcriptional regulation were contained in this subclone and consequently decided to sequence the whole insert of the subclone.

Sequencing of the napA gene
The entire sequence of the 3.3 kb fragment was determined in overlapping sequence reactions on both strands by a combination of "shotgun" sequencing and sequencing of individual, restriction enzyme-derived M13 subclones. Both the universal 17-mer sequencing primer and synthetic oligonucleotides (18-mers) complementary to sequences within the subclones were used to obtain the complete sequence. The sequencing strategy is represented in a schematic fashion below the restriction map in Figure 4. This represents a minimal estimate of the sequence data that were collected. Sequences that were well represented in the "shotgun" clones, the transcribed region in particular, were determined with a much higher frequency than is apparent from the figure. In addition, many individual reactions were performed more than once.

Mapping of the initiation site for transcription
The transcription cap site of napin mRNA was determined by mRNA-directed primer extension. A synthetic oligonucleotide, complementary to mRNA sequences close to the initiation ATG, was [32P] end-labelled, annealed to mRNA and subsequently elongated to the 5' end of napin mRNAs by the incorporation of unlabelled nucleotides mediated by AMV reverse transcriptase. Figure 5 shows the elongated and terminated primer alongside the sequence reactions obtained by letting the same oligonucleotide, unlabelled in this case, prime sequencing reactions on an M13 shotgun clone that covered this region on the minus strand. When mapped onto the sequence of the gene the major initiation site is at the A in position 1102. The minor bands correspond to positions 1098, 1112 and 1113. Thus, the major site of transcriptional initiation appears to be located 33 nucleotides downstream from a sequence which conforms to the consensus of a TATA box (see below).
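The positional relationships stated above reduce to simple arithmetic, sketched here for convenience. The coordinates are those given in the text; the variable and function names are hypothetical. The text does not say which nucleotide of the TATA box the 33-nucleotide distance is measured from, so `tata_reference` is only an approximate reference coordinate under that assumption.

```python
# Coordinates as reported in the text (numbering of the sequenced subclone).
MAJOR_SITE = 1102
MINOR_SITES = [1098, 1112, 1113]
TATA_DISTANCE = 33  # nucleotides between the TATA box and the major site


def offsets_from_major(sites, major=MAJOR_SITE):
    """Signed distance of each site from the major site
    (negative = upstream, positive = downstream)."""
    return {site: site - major for site in sites}


offsets = offsets_from_major(MINOR_SITES)

# Assumption: the distance is counted from a single reference nucleotide of
# the TATA box; which nucleotide that is, is not stated in the text.
tata_reference = MAJOR_SITE - TATA_DISTANCE
```

Under these numbers the minor start sites all lie within about a dozen nucleotides of the major site.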
General features of the sequence
Figure 6 shows the sequence of the 3295 nucleotides of the HindIII-BglII subclone insert. The translated sequence of the coding region is also shown above the nucleotide sequence [...] pNAP1 cDNA clone (Ericson et al., 1986) [...] denotes reactions primed by synthetic 18-mer primers within different subclones.