Isolation and DNA Sequence of a Gene Encoding α-Trichosanthin, a Type I Ribosome-inactivating Protein*

α-Trichosanthin (α-TCS) is a ribosome-inactivating protein that has recently been shown to inhibit the replication of human immunodeficiency virus. We have isolated a gene encoding α-TCS and have determined its DNA sequence. The data indicate that α-TCS is synthesized as a preproprotein consisting of 289 amino acids, the first 23 residues of which comprise a putative secretory signal peptide. The last 19 residues comprise a carboxyl extension that has not been reported to be associated with the mature protein and that may be processed in the endoplasmic reticulum or Golgi apparatus of cells producing α-TCS. The mature protein consists of 247 amino acids. The sequence predicted by translation of the DNA sequence agrees with and confirms the primary sequence determined recently on the protein. The molecular clone for α-TCS will facilitate directed mutational analyses that may provide information on how this peptide, and other ribosome-inactivating proteins, function. These studies may also lead to the development of therapeutic agents with altered activities and/or improved properties for in vivo use.

α-Trichosanthin (α-TCS) is a Type I ribosome-inactivating protein (RIP) that is purified from root tubers of Trichosanthes kirilowii Maxim. (1-3). It has been identified as the active component in the Chinese medicine, Tian Hua Fen (4). This medicine was described as early as the 3rd century and is still used in China to induce abortions, particularly those in the second trimester, and to treat choriocarcinoma, hydatidiform moles, and ectopic pregnancy (4-6). The findings that α-TCS is similar in sequence to ricin A-chain (7) and is itself a potent inhibitor of eukaryotic protein synthesis suggested, as a possible explanation for its activity, that it is preferentially cytotoxic to certain types of cells with a selective permeability to the protein.
In the examples noted above, trophoblasts and trophoblast-derived cells are the common link and the likely target of α-TCS.
α-TCS has also been shown to inhibit HIV-1 replication in both acutely infected T-lymphoblastoid cells and in chronically infected macrophages (8). In this instance, however, it is not clear that α-TCS functions solely to abrogate protein synthesis in infected cells. In acutely infected cells, cellular protein synthesis was unaffected at concentrations of α-TCS which elicited significant reductions in viral RNA and protein synthesis. Other potential mechanisms by which α-TCS functions in blocking HIV-1 replication, then, should be considered, including a direct action on viral RNA.

In the accompanying manuscript (9), a complete primary sequence and a molecular model for α-TCS are reported. This work provides the basis for a rational approach to the systemic study of the functional significance of specific amino acid residues and structural features. Essential to this research is the ability to make specific mutations and modifications to the protein and, for this, it is desirable to be able to manipulate a gene encoding α-TCS. The potential value of this protein as a therapeutic agent also raises the issue of being able to produce large quantities of a sequence-characterized and homogeneous protein. As a first step in these directions, we report here the isolation and sequence of a gene encoding α-TCS. The sequence data obtained also confirm the primary sequence recently determined for α-TCS even though the source materials for the protein and the cloning efforts were obtained from geographically distinct locations.

* This research was generously supported by Sandoz Ltd., Basel, Switzerland. The costs of publication of this article were defrayed in part by the payment of page charges. This article must therefore be hereby marked "advertisement" in accordance with 18 U.S.C. Section 1734 solely to indicate this fact.

RESULTS
Generation of an α-TCS-specific Probe. The complete primary sequence for α-TCS had been determined (9), and oligonucleotide probe/primers could be designed from regions showing minimum degeneracy in coding potential. The probe/primers designed and the protein sequences from which they were derived are shown in Fig. 1. To allow for the generation of sequences significantly longer than 20 nucleotides of manageable complexity, deoxyinosine residues were incorporated at positions where all four nucleotides were possible (13, 14). One oligonucleotide pool (MPQP-1) was derived from amino acid residues 90-101 and consisted of 128 isomers of a 35-mer containing 4 deoxyinosine residues. Two other pools (MPQP-2 and -3) were derived from amino acid residues 164-174 and consisted of 128 isomers each of a 32-mer containing 3 deoxyinosine residues. The latter two pools of synthetic sequences differed in a G or a T at the second position from the 5' end. The first set of oligonucleotides and the other two sets taken together were also designed to face one another as primers on a genomic DNA template and to be used for amplification of a specific DNA fragment in a polymerase chain reaction. This was the approach taken to isolate a sequence-specific probe for α-TCS.
It was assumed that the gene sequence encoding α-TCS would not contain any introns, as was found for ricin, a Type II RIP (15). On this assumption and from the positions of the oligonucleotide primers relative to the determined protein sequence, it was predicted that a DNA fragment of approximately 255 bp would be amplified in a polymerase chain reaction using T. kirilowii genomic DNA as template. The results of such a reaction are shown in Fig. 2. One amplified DNA fragment of the expected size was detected after agarose gel electrophoresis by ethidium bromide staining and by autoradiography when ³²P-labeled primer was used. No significant "background" products were noted even after 40 cycles of polymerase chain reaction.
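The predicted fragment size can be reproduced from the residue ranges quoted for the primer pools. The following is a minimal sketch, assuming (as the text does) an intron-free coding region and an amplicon that runs from the first codon of the MPQP-1 region to the last codon of the MPQP-2/-3 region; the function and constant names are illustrative only.

```python
# Residue ranges of the mature protein from which the primer pools were derived
# (values taken from the text; the names below are hypothetical).
MPQP1_RESIDUES = (90, 101)     # forward primer region
MPQP2_3_RESIDUES = (164, 174)  # reverse primer region


def coding_length(first_residue, last_residue):
    """Nucleotides needed to encode an inclusive range of residues."""
    return 3 * (last_residue - first_residue + 1)


def expected_amplicon_bp(forward_region, reverse_region):
    """Amplicon length when no intron interrupts the coding sequence.

    Assumes the product spans the first codon of the forward primer region
    through the last codon of the reverse primer region.
    """
    return coding_length(forward_region[0], reverse_region[1])


amplicon = expected_amplicon_bp(MPQP1_RESIDUES, MPQP2_3_RESIDUES)
print(amplicon)  # 255
```

Under the same arithmetic, the two primer regions span 36 and 33 nucleotides, one more than the reported 35-mer and 32-mer. This is consistent with each pool stopping one base short of the full coding span, although the text does not state which base was omitted.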
To confirm the identity of the amplified DNA fragment, approximately 100 ng of gel-purified fragment were subjected to a DNA sequencing reaction using the MPQP-1 oligonucleotide pool as primer. A sequence for 116 bases was determined and translated in all six potential reading frames (Fig. 3). One open reading frame showed a translation product which matched exactly the sequence of α-TCS from amino acid residue 128 through 163. The translated residues at the beginning and the end of the DNA sequence did not match the determined protein sequence, but these regions were at the limits of the interpretable DNA sequence data, and errors may have been made. Nevertheless, it was clear that the amplified fragment did correspond to an α-TCS-specific or -like sequence.

Allowing for the coding of a secretory signal peptide, a minimum total coding sequence of about 850 bp might be expected. On Southern blot analysis using ³²P-labeled amplified fragment as probe, positive-hybridizing bands of about 7, 4, and 1.4 kbp in size were detected in an EcoRI digestion of T. kirilowii DNA (data not shown). In addition, the relative ease with which these bands were detected also suggested that the complexity of the T. kirilowii genome was not unduly large and that positive-hybridizing clones might be detected in a library of reasonable size. With this information, it seemed reasonable to assume that a complete gene might be contained on a relatively small restriction fragment and, so, a λ phage vector capable of taking 0-10-kbp inserts, i.e. lambda ZAP® II, was chosen for constructing a gene library.

A library of EcoRI fragments in lambda ZAP® II was generated as described under "Experimental Procedures." Approximately 10⁶ recombinants were screened with 500 ng of the specifically amplified DNA fragment labeled with ³²P to a specific activity of about 1.3 × 10[exponent illegible] cpm/µg. One clone, pQ21D, was strongly positive; a second clone, pQ30E, yielded a fainter signal.
Both clones were plaque-purified and rescued as plasmids as described under "Experimental Procedures." Restriction analyses (data not shown) indicated that the cloned inserts were approximately 4 and 0.6 kbp, respectively, in size. The clone pQ21D, but not pQ30E, contained a SalI site predicted from the DNA sequence of the specifically amplified fragment. This information, together with the observation that pQ21D hybridized much more strongly than pQ30E, suggested a higher probability that pQ21D contained a DNA sequence specific to α-TCS.
Preliminary sequence analysis of pQ21D confirmed that it contained a sequence potentially coding for α-TCS beginning 409 bp from one end. The sequence information was then extended to cover the entire coding region for α-TCS as shown in Fig. 4. No attempt was made to sequence the entire noncoding flanking regions in the gene, although 339 bp preceding the first in-frame ATG codon were confirmed.
The DNA sequence determined for pQ21D, a translation showing the coding of α-TCS, and a comparison of the determined protein sequence to that translation are shown in Fig. 5. First of all, it is clear that pQ21D contains an authentic gene sequence for α-TCS. Aligned with the determined protein sequence, which also indicates the limits of coding for the mature sequence, there are only two amino acid differences, a Thr for a Ser at position 211 and a Met for a Thr at position 224, both relatively conservative changes. The translation information indicates that the gene sequence encodes a precursor protein. The open reading frame begins with an ATG at nucleotide position 340 and continues through nucleotide 1206. Nucleotides 340 through 408 likely encode a putative secretory signal peptide; nucleotides 1150 through 1206 encode a carboxyl-terminal extension of the mature protein. There are no potential N-linked glycosylation sites (Asn-X-Ser/Thr) and, as expected, there are no indications of introns contained within the coding sequence. The sequence upstream of the open reading frame is 72% A + T, like that in the gene for ricin (15). There are several sequences that resemble a TATA box found in eukaryotic genes 30-35 nucleotides upstream of transcription start sites (16, 17); it will not be possible to identify any of these as actual control sequences until a transcript for α-TCS can be isolated and mapped on the gene.

FIG. 5. (a), the coding strand sequence determined for pQ21D. The sequence is numbered above for convenience, and relevant restriction sites are shown. Potential control sequences are underlined. (b), the translation of the encoded precursor protein for α-TCS. (c), the reported primary sequence of α-TCS taken from Collins et al. (9). The protein sequences are numbered below. The mature protein sequence is numbered from 1 through 247; the putative secretory signal peptide is numbered in a negative fashion as it precedes the mature sequence; and the additional carboxyl-terminal sequence noted in the precursor is numbered in parentheses to represent a continuation of the mature sequence.

DISCUSSION
Although gene sequences for several RIPs have been cloned or synthesized (18-21), this report describes the first isolation and DNA sequence of a complete gene for a Type I RIP. RIPs constitute a rather large group of proteins. In cloning α-TCS, we chose an approach which should be generally applicable to the cloning of other RIPs. The basis of this approach is to rapidly evaluate the prevalence of a RIP sequence in a DNA (or RNA) preparation and to generate a highly specific probe by using degenerate primers in a polymerase chain reaction. For α-TCS, we had the benefit of having a complete protein sequence, allowing us to select the optimum regions for primer design. However, even with limited protein sequence information, specific probe sequences might be amplified. As noted previously and shown in Fig. 6, RIPs share extensive sequence homology.

FIG. 6. Available RIP primary sequences are aligned to show maximum homologies. All positionally matched and identical residues are shaded. The sequence of α-TCS is numbered for reference. TCS, α-TCS sequence taken from pQ21D; ricA, ricin A-chain sequence (19); abrA, abrin A-chain sequence (32); BPSI, barley protein synthesis inhibitor (33); MIRA, mirabilis sequence (18).
If, for example, only amino-terminal sequence information is available, degenerate primers for PCR might be designed from the known sequence and coupled with primers designed from a region showing extensive sequence homology among RIPs. In Fig. 6, one such region would correspond to α-TCS residues 160 through 167. A consensus sequence, EAARF(K/Q)YI, might be taken, and a degenerate primer might be designed. After polymerase chain reaction, amplified products of a predicted size, based on alignment of the new sequence with the other known sequences, would be isolated and characterized by DNA sequencing. As additional protein sequences are determined, it may even be possible to generate "universal" RIP primers to be used to screen for RIPs and facilitate their cloning when no protein sequence is available.
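As a rough illustration of how a consensus such as EAARF(K/Q)YI translates into a primer pool, the sketch below counts the isomers of a mixed-base oligonucleotide. It assumes the standard genetic code and assumes that the pool is synthesized with a base mixture at each position, so the count is the product of the per-position choices. This can exceed the number of true codon combinations: arginine, for example, needs (A/C)G(N), which gives 8 combinations for 6 codons. The `inosine_at_fourfold` option, which places a single deoxyinosine wherever all four bases would be needed, and every name in the block are illustrative assumptions.

```python
from itertools import product
from math import prod

BASES = "TCAG"
# Standard genetic code, codons enumerated in TCAG order (TTT, TTC, TTA, ...).
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

CODONS = {}
for bases, aa in zip(product(BASES, repeat=3), AMINO_ACIDS):
    CODONS.setdefault(aa, []).append("".join(bases))


def base_choices(residues):
    """Bases required at each codon position to cover every codon of every
    residue allowed at one site (e.g. "KQ" for Lys or Gln)."""
    codons = [c for aa in residues for c in CODONS[aa]]
    return [{c[i] for c in codons} for i in range(3)]


def pool_size(peptide, length=None, inosine_at_fourfold=False):
    """Number of isomers in a mixed-base primer pool covering `peptide`.

    `length` truncates the primer at its 3' end; `inosine_at_fourfold`
    replaces each four-way mixture with a single deoxyinosine.
    """
    choices = [s for residues in peptide for s in base_choices(residues)]
    if length is not None:
        choices = choices[:length]
    sizes = [1 if inosine_at_fourfold and len(s) == 4 else len(s) for s in choices]
    return prod(sizes)


CONSENSUS = ["E", "A", "A", "R", "F", "KQ", "Y", "I"]
plain = pool_size(CONSENSUS, length=23)
with_inosine = pool_size(CONSENSUS, length=23, inosine_at_fourfold=True)
print(plain, with_inosine)  # 4096 64
```

Counted this way, a 23-mer covering the consensus has 4096 isomers, and substituting deoxyinosine at the three four-way positions would reduce the pool to 64.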
With limited or no sequence information, however, the degeneracy of the primers to be used could be extreme. For the consensus sequence EAARF(K/Q)YI, the primer could contain as many as 4096 isomers of a 23-mer. Incorporation of deoxyinosine at selected positions might reduce the complexity considerably. For example, the potential number of isomers for the sequence from which we derived MPQP-2 plus -3 was 9216. The final primer pools used contained only 256 isomers. Codon usage tables for plants or preferably for RIPs might also be employed to further restrict primer degeneracy. At this time, we cannot make estimates or recommendations as to the limit of the complexity of primer pools that may be used successfully in polymerase chain reaction. Primer pools containing greater than 4000 isomers have been used successfully to amplify mammalian sequences (22), but colleagues were unsuccessful using similarly complex primers applied to another plant-derived DNA. A complication in the latter instance, however, was that the genome complexity was estimated to be at least 10-fold greater than that of T. kirilowii or mammalian DNA, which is about 10⁹. This would increase the chances of spurious or nonspecific priming, requiring the testing of several different primer combinations. Problems notwithstanding, this approach and method quickly led to our isolation of a RIP gene and may be useful when applied to the cloning of others.

It is typical to find multiple RIPs in a plant. In Ricinus communis, several related genes exist, coding for at least two toxins and one agglutinin (18, 19, 20, 23). Preliminary Southern blot analysis of EcoRI-digested T. kirilowii DNA showed three hybridizing bands of about 7, 4, and 1.4 kbp in size, respectively. Follow-up analyses on gels run for shorter periods showed an additional band of about 0.6 kbp. Of these, the 4- and 0.6-kbp bands correspond to the clones pQ21D and pQ30E, respectively.
The clone pQ30E has been sequenced in part, and the data show that it contains an RIP-like sequence. The insert is too small to encode a complete protein, however, and it is possible that it does not derive from a functional gene. We plan to isolate a full-length clone from an alternatively restricted library and determine the identity of this sequence. A second RIP, trichokirin, has been isolated from seeds of T. kirilowii, indicating that at least one additional, active gene exists (24). Thus, it is probable that RIPs in T. kirilowii, like those in R. communis and likely in other plants, comprise a multigene family.
The translation of pQ21D indicates that α-TCS is produced as a preproprotein. The mature protein is preceded by 23 codons which resemble a consensus secretory signal peptide (25). This is consistent with the premise that all RIPs are secreted proteins, as they would otherwise diminish or stop protein synthesis in the cells in which they are produced. Ribosomes taken from a plant producing a RIP are particularly resistant to that RIP, but the resistance is not absolute (26-28). It is reasonable to expect that these proteins, which are not only potent inhibitors of protein synthesis but are produced in relatively large amounts, would be secreted and compartmentalized. Another feature of α-TCS revealed by the gene translation may also relate to this. There are 19 codons following the mature protein sequence which, by analogy to preproricin, may be processed after translocation to the endoplasmic reticulum (29). Proricin contains 12 amino acids which link the carboxyl end of the A-chain sequence to the amino end of the B-chain sequence. This sequence may function in one aspect to facilitate the formation of the disulfide bond between the A- and the B-chain, but it may also function to maintain the protein in an inactive state until it is safely placed across the membrane of the endoplasmic reticulum. Nonreduced ricin is much reduced in its ability to block protein synthesis in an in vitro translation assay (30), and proricin, produced by injection of RNA transcripts into oocytes, is essentially inactive (31). It will be of interest to test our hypothesis by expressing both the mature and the proprotein forms of α-TCS in E. coli or in vitro and comparing their activities on eukaryotic ribosomes.
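The preproprotein layout described above can be cross-checked against the nucleotide coordinates reported for pQ21D in Results. The following is a minimal sketch, assuming the quoted coordinates are 1-based and inclusive and that the open reading frame as quoted (340 through 1206) does not include the stop codon; the names are illustrative.

```python
# Coordinates on the pQ21D coding strand, as quoted in the text
# (assumed 1-based, inclusive).
ORF = (340, 1206)
SIGNAL_PEPTIDE = (340, 408)
C_TERMINAL_EXTENSION = (1150, 1206)

# The mature protein occupies whatever lies between the two processed segments.
MATURE = (SIGNAL_PEPTIDE[1] + 1, C_TERMINAL_EXTENSION[0] - 1)


def codon_count(span):
    """Number of whole codons in an inclusive nucleotide span."""
    start, end = span
    length = end - start + 1
    if length % 3 != 0:
        raise ValueError(f"span {span} is not a whole number of codons")
    return length // 3


counts = {
    "precursor": codon_count(ORF),
    "signal": codon_count(SIGNAL_PEPTIDE),
    "mature": codon_count(MATURE),
    "extension": codon_count(C_TERMINAL_EXTENSION),
}
print(counts)
# {'precursor': 289, 'signal': 23, 'mature': 247, 'extension': 19}
```

The three segments sum to the 289 residues of the precursor, in agreement with the 23-residue signal peptide, 247-residue mature protein, and 19-residue carboxyl extension given in the text.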