Deep Venomics Reveals the Mechanism for Expanded Peptide Diversity in Cone Snail Venom

Cone snails produce highly complex venom comprising mostly small biologically active peptides known as conotoxins or conopeptides. Early estimates that suggested 50–200 venom peptides are produced per species have been recently increased at least 10-fold using advanced mass spectrometry. To uncover the mechanism(s) responsible for generating this impressive diversity, we used an integrated approach combining second-generation transcriptome sequencing with high sensitivity proteomics. From the venom gland transcriptome of Conus marmoreus, a total of 105 conopeptide precursor sequences from 13 gene superfamilies were identified. Over 60% of these precursors belonged to the three gene superfamilies O1, T, and M, consistent with their high levels of expression, which suggests these conotoxins play an important role in prey capture and/or defense. Seven gene superfamilies not previously identified in C. marmoreus, including five novel superfamilies, were also discovered. To confirm the expression of toxins identified at the transcript level, the injected venom of C. marmoreus was comprehensively analyzed by mass spectrometry, revealing 2710 and 3172 peptides using MALDI and ESI-MS, respectively, and 6254 peptides using an ESI-MS TripleTOF 5600 instrument. All conopeptides derived from transcriptomic sequences could be matched to masses obtained on the TripleTOF within 100 ppm accuracy, with 66 (63%) providing MS/MS coverage that unambiguously confirmed these matches. Comprehensive integration of transcriptomic and proteomic data revealed for the first time that the vast majority of the conopeptide diversity arises from a more limited set of genes through a process of variable peptide processing, which generates conopeptides with alternative cleavage sites, heterogeneous post-translational modifications, and highly variable N- and C-terminal truncations. 
Variable peptide processing is expected to contribute to the evolution of venoms, and explains how a limited set of ∼ 100 gene transcripts can generate thousands of conopeptides in a single species of cone snail.

Cone snails are slow-moving predatory marine gastropods that hunt a variety of prey including fish (1) using venom optimized through more than 33 million years of evolution (2). The success of this strategy relies on the deployment of potent toxins targeted to the nervous system and musculature of the prey using a specialized radula tooth (3). This hollow harpoon-like structure delivers venom deep into the prey's flesh, where it can enter the circulatory system and interact with nerves to induce rapid paralysis (4,5). It is not surprising that human envenomations resulting from certain cone snail stings are potentially lethal (e.g. the fish-hunting Conus geographus), given the conservation of neurological and neuromuscular receptors in vertebrates (6). What first appeared as an unfortunate coincidence is now emerging as a promising source of novel drugs to treat a wide range of human diseases (7). Indeed, cone snail venoms are now regarded as pharmacological treasures, and significant research efforts are being made to uncover the therapeutic potential of these molecules (8). One such molecule, the N-type calcium channel selective blocker ω-conotoxin MVIIA, is now an FDA-approved drug to treat unmanageable chronic pain (9), and an optimized version of the norepinephrine transporter inhibitor χ-conotoxin MrIA (Xen2174) is in Phase IIa trials for cancer and post-surgical pain (10). In addition, several other cone snail compounds are being investigated for the treatment of neuropathic pain, epilepsy, cardiac infarction, and neurological diseases (11).
The majority of molecules found in cone snail venoms are small, bioactive, and heavily post-translationally modified peptides collectively known as conopeptides (12). The disulfide-rich peptides (≥ 2 disulfide bonds) are called conotoxins and represent the majority of conopeptides. Traditional biochemical methods to isolate and sequence these potential bioactives are time consuming and often sample limited. Presently, it is estimated that < 2% of the total conopeptide diversity has been sequenced (13). Conopeptides are synthesized in the venom gland as precursor proteins from a single gene comprising a highly conserved signal peptide, a propeptide region, and a hypervariable toxin sequence (14), and are classified into gene superfamilies according to the sequence similarities of their signal peptide in the precursor. The use of signal peptide-specific primers to amplify isoforms from known gene superfamilies accelerated discovery. However, this relatively straightforward strategy can only be used to increase our knowledge of already identified gene superfamilies and is unable to discover new ones. Additionally, the characterization of conopeptide gene products requires other techniques, such as mass spectrometry, because of the numerous and highly diverse post-translational modifications (PTMs) observed in mature conopeptides, which cannot easily be predicted from precursor sequences. Over the past three decades, ~1400 conopeptide sequences have been isolated from 92 different cone snail species, with as few as 210 peptides being validated at the protein level. Therefore, while we appreciate the enormous diversity present in the venom of this genus and have extensive knowledge on conopeptides in general (11), there is no comprehensive study on the set of toxins produced in the venom gland even of a single species.
Cone snail venoms are highly complex mixtures, with early estimates ranging from 50 to 200 conopeptides per species. However, recent reports showed the presence of > 1000 different peptides in a single venom using optimized liquid chromatography (LC)-MS approaches (15,16). Surprisingly, venom gland transcriptomes of several species have revealed a much more limited number of conopeptide genes (< 100) (17)(18)(19). This large discrepancy between the number of genes and the number of masses detected in the venom is currently not well understood. Differential PTM processing can only partially explain the observed venom complexity, since most conopeptides have on average only two modified positions (excluding disulfide bond formation), which would generate up to 400 peptides from 100 genes. To better understand the mechanisms responsible for cone snail venom peptide diversity, we have integrated transcriptomic and proteomic approaches using bioinformatics in a strategy coined "deep venomics" (20), to fully explore the origin(s) of the thousands of conopeptides found in the venom of Conus marmoreus. This well-studied mollusc-hunting cone snail produces potent analgesic compounds, including χ-conotoxin MrIA (10,21) and μO-conotoxin MrVIB, along with 40 other identified conotoxins.
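The ceiling of 400 peptides quoted above follows from simple counting, which the short sketch below makes explicit. It assumes that each modifiable position is independently either modified or unmodified, so a peptide with two such positions has 2 × 2 = 4 possible forms. The function name is illustrative and does not come from the study.

```python
def max_ptm_variants(n_genes: int, sites_per_peptide: int) -> int:
    """Upper bound on distinct peptides from PTM variation alone.

    Assumes every modifiable site is independently either modified or
    left unmodified, giving 2**sites forms per gene product.
    """
    return n_genes * 2 ** sites_per_peptide


# 100 genes, two modifiable positions each (figures taken from the text).
print(max_ptm_variants(100, 2))  # 400
```

Even this upper bound falls well short of the thousands of masses observed, which is why PTM variation alone cannot account for the diversity.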
From the different second-generation sequencing platforms, the 454 pyrosequencing technology was selected as it generates relatively long reads (on average > 300 bp) that can cover the full length of conopeptide precursors (70–100 amino acids). This approach allows direct identification of conopeptide precursors, avoiding the errors inherent to the assembly of reads into contigs typically required for other second-generation technologies that generate shorter read lengths (22). To complement this approach, we performed a detailed proteomic investigation using three high sensitivity mass spectrometers and developed dedicated bioinformatic tools for data integration. Besides the identification of 72 novel conopeptide precursors and five novel gene superfamilies, this study revealed for the first time extensive and highly variable processing of the N- and C-termini and PTMs that dramatically increased venom peptide diversity. This variable peptide processing, together with intra-species variation, explains how a limited set of ~100 gene transcripts can generate thousands of conopeptides in the venom of a single species of cone snail.

EXPERIMENTAL PROCEDURES
RNA Extraction, cDNA Library, 454 Sequencing and Assembly-A single adult specimen of C. marmoreus collected from the Great Barrier Reef (Queensland, Australia) and measuring 6 cm was dissected on ice. The venom duct was removed and directly placed in a 1.5 ml tube with 1 ml of TRIZOL reagent (Invitrogen, Carlsbad, CA). The extraction of total RNA was carried out following the manufacturer's instructions. We obtained 44.8 μg of total RNA, which was further purified using Oligotex mRNA Mini Kit (Qiagen, Valencia, CA), yielding ~400 ng of mRNA. From this sample, 200 ng was submitted to the AGRF (Australian Genomic Research Facility) for cDNA library construction and sequencing. Preparation of the cDNA library consisted of several major steps, including fragmentation of RNA, synthesis of double-stranded cDNA, fragment end repair, preparation of AMPure beads, ligation of adaptors, removal of small fragments, quantitation, and quality assessment of the cDNA library. Sequencing was carried out on a Roche GS FLX Titanium sequencer. In addition to our sample, three other samples from a related project were run together on a full plate, using a unique barcode for each sample. After sorting, cleaning and trimming of the reads, sequence assembly (contigs) was carried out using Newbler 2.3 (Life Science, Frederick, CO).
Conopeptide Sequence Analysis-Raw reads and contigs were uploaded to a proprietary web-based searchable database. Conopeptide sequences were identified from the raw data using tBlastn with either signal sequences or mature sequences retrieved from ConoServer (23) as queries. As mentioned previously, such long sequence reads are likely to contain the full nucleic acid sequences of conopeptide precursors. The identified conopeptide sequences were then aligned using the Multalin program (24). At this stage, redundant sequences, incomplete precursor sequences and aberrant sequences (i.e. extended N termini due to frameshifts or degenerate positions) were removed. Alignments were then edited with Jalview and the sequence clustering tree was constructed using the "average distance using % identity" algorithm implemented in the Jalview program (25). Gene superfamilies, signal peptides, and cleavage sites were predicted using the ConoPrec tool implemented in ConoServer (26). The cutoff value for assigning a signal peptide to a gene superfamily was set at > 75% sequence identity, as extrapolated from a recent analysis of all precursors deposited in ConoServer (13).
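The superfamily assignment rule can be illustrated with a minimal sketch. This is not the ConoPrec implementation: for brevity it compares sequences position by position without gaps, whereas a real analysis would use a proper alignment, and it normalises by the length of the shorter sequence. The reference sequences and names (`toy_A`, `toy_B`) are invented for illustration and are not real signal peptides.

```python
def percent_identity(a: str, b: str) -> float:
    """Percent identity of an ungapped, left-aligned comparison,
    normalised by the shorter sequence (a deliberate simplification)."""
    n = min(len(a), len(b))
    if n == 0:
        return 0.0
    matches = sum(x == y for x, y in zip(a, b))
    return 100.0 * matches / n


def assign_superfamily(signal: str, references: dict, cutoff: float = 75.0):
    """Return (superfamily, identity) for the best-scoring reference,
    or (None, identity) when the best score does not exceed the cutoff."""
    best_name, best_id = None, 0.0
    for name, ref in references.items():
        pid = percent_identity(signal, ref)
        if pid > best_id:
            best_name, best_id = name, pid
    if best_id > cutoff:
        return best_name, best_id
    return None, best_id


# Hypothetical reference signal peptides (made up for this example).
references = {
    "toy_A": "MKLTCVLIVAVLFLTA",
    "toy_B": "MRCLPVFVILLLLIAS",
}

print(assign_superfamily("MKLTCVLIVAVLFLTG", references))  # ('toy_A', 93.75)
print(assign_superfamily("MAAAAAAAAAAAAAAA", references))  # (None, 18.75)
```

A precursor whose best match stays at or below the cutoff is left unclassified, which is how candidates for new gene superfamilies would surface under this rule.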
Venom Sample Preparation-Six adult (≥ 6 cm) specimens of C. marmoreus were collected from the Great Barrier Reef (Queensland, Australia) and held in aquaria for several months. Temperature was maintained between 24–28°C and a light cycle of 12:12 was applied. Milking of all snails was carried out once a fortnight. The procedure involved enticing the cone snails with live prey (gastropod mollusks) to initiate extension of the proboscis. Then, a 0.5 ml collecting tube comprising a fine slice of the prey's foot tissue stretched over the opening and sealed with parafilm was presented to the snail. On repeated contact of the proboscis with the piece of foot tissue, at times with agitation, a radula was eventually fired and venom ejected into the tube. After each collection, the pooled injected venom was stored immediately at −20°C until further use (total from 25 milkings was ~200 μl). This batch of venom was used for all subsequent MS experiments.
HPLC Fractionation for MALDI-100 μl supernatant of the pooled injected venom was fractionated using a Thermo C18 4.6 × 150 mm column fitted to a Shimadzu Prominence HPLC system with 0.043% trifluoroacetic acid/90% acetonitrile (aq) as elution buffer B and 0.05% trifluoroacetic acid (aq) as buffer A. A linear 1% B min⁻¹ gradient was delivered to the column at a flow rate of 1 ml min⁻¹ over 80 min. The eluent was monitored using a dual wavelength UV detector set to 214 and 280 nm and fractions collected from the 214 nm trace.
Reduction-Alkylation-The buffer used for reduction and alkylation was 30% acetonitrile/100 mM NH4HCO3 at pH 8. Tris(2-carboxyethyl)phosphine (TCEP) was used as the reducing reagent and maleimide was used as the alkylating reagent. All samples including the raw injected venom (10 μl supernatant) and the fractionated venom (2/3 of the fractions) were lyophilized and reconstituted in 50 μl of the above buffer prior to the reduction and alkylation procedure. The sample solution was incubated with 10 μl of 100 mM TCEP at 60°C for 1 h under nitrogen. Alkylation was carried out on the reduced raw injected venom by addition of 10 μl of 100 mM maleimide and the reaction mixture was incubated for 1 h before LC purification.
Matrix-assisted Laser Desorption Ionization-MS-Matrix-assisted laser desorption ionization (MALDI)-MS analyses were conducted using an AB SCIEX (Framingham, MA, USA) 4700 TOF-TOF Proteomics Analyzer. The fractionated venom samples (1/3 of each fraction) were reconstituted in 5 μl 50% acetonitrile/0.1% formic acid (aq) and 0.5 μl of the samples were deposited on a 192-well stainless steel plate through 1:1 dilution with matrix consisting of 10 mg ml⁻¹ α-cyano-4-hydroxycinnamic acid (CHCA) in 50% acetonitrile/0.1% formic acid (aq). For LC-MALDI analysis, ~10 μg of the injected venoms (native) were diluted in 22 μl 0.1% formic acid (aq). Of this solution, 20 μl was analyzed using a Vydac Everest® C18 (300 μm × 150 mm) capillary LC column on the Agilent nano 1100 series HPLC system. During fractionation, a CHCA solution (10 mg ml⁻¹ in 50% acetonitrile/50% ethanol) was added 1:1 to the effluent and samples were deposited on a 192-well stainless steel plate using a plate spotter. MALDI-TOF spectra were acquired in reflector positive operating mode with source voltage set to 20 kV and Grid1 voltage at 12 kV, mass range 1000–8000 Da, focus mass 3500 Da. The plate was calibrated using Calmix (4700 Proteomics analyzer calibration mixture) from Applied Biosystems (Foster City, CA).

LC-electrospray Ionization (ESI)-MS and LC-ESI-MS/MS-Liquid chromatography and electrospray mass spectrometry were performed on two advanced AB SCIEX instruments (Framingham, MA, USA). The AB Sciex QSTAR Pulsar is an electrospray quadrupole time-of-flight (QqTOF) MS equipped with a Turbo-Spray ionization source and coupled to an upstream Agilent 1100 series HPLC system. In contrast, the AB Sciex TripleTOF 5600 System is a hybrid quadrupole TOF MS equipped with a DuoSpray ionization source coupled to a Shimadzu 30 series HPLC system. For comparison, the same amount of raw injected venom (~8 μl supernatant) was directly subjected to LC-ESI-MS to obtain a complete mass list of underivatized peptides. Full scan mass spectrometric analysis and product ion MS/MS analysis using Information Dependent Acquisition (IDA) experiments were performed using the 5600 TF on the reduced and reduced/alkylated injected venom samples. The LC separation was achieved using a Thermo C18 4.6 × 150 mm column at a linear 1.3% B (90% acetonitrile/0.1% formic acid (aq)) min⁻¹ gradient with a flow rate of 0.3 ml min⁻¹ over 60 min. A cycle of one full scan of the mass range (MS) (300–2000 m/z) followed by multiple tandem mass spectra (MS/MS) was applied using a rolling collision energy relative to the m/z and charge state of the precursor ion up to a maximum of 80 eV. The full scan mass spectrometry had a duration of 84 min with a cycle time of 2.55 s (total of 1975 cycles). The maximum number of candidate ions monitored per cycle was 20 and the ion tolerance was 0.1 Da. The switch criteria were set to exclude former target ions for 8 s and to exclude isotopes within 4 Da.
Bioinformatic Tools-Raw data extracted from mass spectrometry instruments often contain replicates and deconvolution artifacts (e.g. assignment of two monoisotopic masses for the same molecule during the automatic reconstruction step) that need to be cleaned before further analysis. To this end, two tools ("Remove duplicate masses" and "Compare mass lists") were implemented and made publicly available on the ConoServer website. The first tool removes duplicates in a list of masses using a user-defined mass precision parameter, whereas the second tool identifies common masses between two mass lists. Correctly assigning a mass to a conotoxin predicted from a precursor protein is challenging because conopeptides are heavily post-translationally modified. To date, 14 different types of post-translational modifications (PTMs) have been identified in mature cone snail toxins (13). The problem of identifying a conopeptide from a gene sequence is compounded by differential post-translational processing. ConoMass was implemented in ConoServer to help in the identification of conotoxins by mass spectrometry (26). In this two-step process, monoisotopic and average masses resulting from variable PTM processing are computed for each peptide and then matched to masses observed experimentally without relative mass accuracy correction. These bioinformatic tools are implemented in PHP, Python, and MySQL and are available online at the ConoServer website (http://www.conoserver.org) (26).
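The behaviour described for the two mass-list tools can be sketched as follows. This is an illustration only and is not the ConoServer code: the function names, the greedy binning strategy, and the toy mass values are all assumptions made for this example.

```python
import bisect


def remove_duplicate_masses(masses, precision=0.2):
    """Greedy binning: after sorting, keep a mass only if it lies more
    than `precision` Da above the last mass that was kept."""
    kept = []
    for m in sorted(masses):
        if not kept or m - kept[-1] > precision:
            kept.append(m)
    return kept


def compare_mass_lists(list_a, list_b, precision=0.2):
    """Return the masses of list_a that have at least one partner in
    list_b within +/- `precision` Da."""
    b = sorted(list_b)
    common = []
    for m in list_a:
        # First element of b that is >= the lower bound of the window.
        i = bisect.bisect_left(b, m - precision)
        if i < len(b) and b[i] <= m + precision:
            common.append(m)
    return common


# Toy deconvoluted mass lists (Da); values are invented.
raw = [1500.00, 1500.05, 1500.90, 2200.40, 2200.45]
cleaned = remove_duplicate_masses(raw)
print(cleaned)  # [1500.0, 1500.9, 2200.4]

other = [1500.10, 2200.90, 3000.00]
print(compare_mass_lists(cleaned, other))  # [1500.0]
```

Note that binning by mass alone merges peptides that share a mass but elute at different retention times, so a list cleaned this way is a conservative count.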
Proteomic Data Analysis-LC-ESI-MS reconstruction was carried out using Analyst LCMS reconstruct BioTools (Framingham, MA, USA). The mass range was set between 1000–8000 Da. Molecules > 8000 Da were observed but excluded from further analysis. The mass tolerance was set to 0.2 Da and the S/N threshold was set to 10. The MS data matching was carried out using the ConoMass tools (see above) followed by critical manual inspection. The precision level was set to 0.1 Da for automatic matching search. Manual search accuracy was set to 100 ppm. Deconvoluted mass lists from different instruments were cross-calibrated, compared, cleaned and binned using two bioinformatic tools, namely "Compare mass lists" and "Remove duplicated masses," which are available on the ConoServer website. The precision level used for binning and comparing masses was set to 0.2 Da. The ProteinPilot™ 4.0 software (AB SCIEX, Framingham, MA, USA) was used for sequence identification by searching the LC-ESI-MS/MS mass lists obtained at a mass tolerance of 0.05 Da for precursor ions using the reduced and reduced/alkylated samples. These masses, and related fragmentation masses (0.1 Da tolerance), were matched against a protein database comprising all ConoServer conopeptides, NCBI cone snail related proteins and all read sequences obtained from this transcriptomic project (2,157,997 entries). Modifications used in the search include the following: amidation, deamidation, hydroxylation of proline and valine (27), oxidation of methionine, carboxylation of glutamic acid, cyclization of N-terminal glutamine (pyroglutamate), bromination of tryptophan (28), and sulfation of tyrosine (29). O-glycosylation was not included in our search as this modification has not been reported for C. marmoreus conopeptides (glycosylation occurs infrequently and mostly in fish-hunting species) and the typical fragment loss associated with glycosylation was not seen by MS in this venom. The threshold "Conf" value for accepting identified spectra was set to 99. Identified peptide sequences were inspected manually to confirm assignment.
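The matching of predicted peptides to observed masses under variable PTM processing can be sketched as below. This is a simplified illustration and not the ConoMass implementation: it treats each listed PTM as optional, enumerates every subset, and reports observed masses within an absolute tolerance (0.1 Da, the automatic-search precision stated above). The mass shifts are standard monoisotopic values; the base mass and observed masses are invented.

```python
from itertools import combinations

# Monoisotopic mass shifts (Da) for a few of the PTMs named in the text.
PTM_DELTA = {
    "amidation": -0.98402,       # C-terminal OH replaced by NH2
    "hydroxylation": 15.99491,   # addition of one oxygen
    "pyroglutamate": -17.02655,  # N-terminal Gln cyclization, loss of NH3
    "carboxylation": 43.98983,   # addition of CO2 to Glu
}


def variant_masses(base_mass, optional_ptms):
    """Yield (ptm_tuple, mass) for every subset of the optional PTMs."""
    for k in range(len(optional_ptms) + 1):
        for combo in combinations(optional_ptms, k):
            yield combo, base_mass + sum(PTM_DELTA[p] for p in combo)


def match(base_mass, optional_ptms, observed, tol_da=0.1):
    """List (ptm_tuple, theoretical_mass, observed_mass) for each hit."""
    hits = []
    for combo, mass in variant_masses(base_mass, optional_ptms):
        for obs in observed:
            if abs(obs - mass) <= tol_da:
                hits.append((combo, round(mass, 4), obs))
    return hits


# Hypothetical peptide of 2000.0000 Da with two optional PTMs.
observed = [1999.02, 2015.99, 2500.00]
for hit in match(2000.0, ["amidation", "hydroxylation"], observed):
    print(hit)
```

With these toy values, the amidated form and the hydroxylated form each match one observed mass, while the unmodified and doubly modified forms match none. Because one observed mass can fit several PTM combinations on longer peptides, such matches still need confirmation by MS/MS and manual inspection.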

RESULTS
Transcriptomic Data Analysis-A single run (¼-plate equivalent) on the Roche GS FLX Titanium sequencer generated 179,843 reads averaging 317 bp (min 18 bp) in length after removal of low-quality sequences. Of these, 114,159 reads were assembled into 839 contigs, and the rest remained as singletons. Although this study focused mainly on conopeptides, many protein and enzyme sequences were also identified among the contigs and will be described elsewhere. As outlined in the experimental procedures section, we searched for conopeptide sequences directly from the sequencing reads, as the average read length of > 300 bp allowed full conopeptide precursors to be found. The contigs were also searched for conopeptides, and no additional conopeptide sequences were found. Overall, 105 unique conopeptide sequences were retrieved from the venom duct transcriptome of C. marmoreus. The conopeptide precursors were named Mr001 to Mr105 and are shown in Fig. 1. From the 42 previously known conopeptide sequences from C. marmoreus, 30 were identified in our data (28.5% of total precursors recovered; Table I) Fig. S1). In addition to these superfamilies, we also found sequences belonging to superfamilies I1 and S that had not previously been reported for C. marmoreus. Finally, from the remaining 13 unclassified conopeptide precursors, five groups could clearly be identified based on their signal peptide sequence similarity and were named gene superfamilies B, H, N, E, and F. As detailed below, conopeptides belonging to gene superfamilies N and H show typical mature conotoxins, while gene superfamilies B, E, and F are each represented by only one sequence and also appear to be quite divergent.
Some conopeptide precursors were markedly more abundant than others. Indeed, the three most expressed conopeptide precursors contribute 28% of the total conopeptide reads, the next 20 contribute 46.5% of the reads, whereas the remaining precursors contribute only 25.5% of the reads (Fig. 3A). This finding parallels that of Conticello et al., where order-of-magnitude differences were observed in the expression levels of individual conopeptides in five Conus species, with a few transcripts typically dominating the sequenced clones in a given species (32). Not surprisingly, nearly all peptides with a corresponding number of reads above 300 were already either characterized from the venom or discovered from cDNA clone libraries, with the exception of two conopeptide precursors, Mr047 and Mr096 (Fig. 3B). This observation suggests that the toxins most expressed at the mRNA level tend also to be the more abundant in the venom and thus are usually biochemically characterized first. A linear regression (r² = 0.88) indicated that gene superfamilies with the largest number of precursors also had the highest number of total reads (Fig. 3C). Only gene superfamily I2 was an outlier to this regression, with a relatively high number of precursors (10) but low expression levels. Overall, gene superfamily M has the highest number of reads and the largest number of precursors. A large proportion of the reads assigned to gene superfamily M match to precursor Mr044, which encodes conopeptide Mr3.8 (two sequences, Mr3.8 and MrIA, have > 1000 reads). It is interesting to note that this conopeptide is the most highly expressed in the venom gland, yet its pharmacology remains unknown.
The Injected Venom of C. marmoreus-To study the venom most relevant to prey capture and defense (containing fully mature peptides), we adapted the milking method described by Hopkins et al. to collect the injected venom of a mollusk-hunting cone snail for the first time (Fig. 4A) (33). This method allowed several C. marmoreus specimens to be milked for a comprehensive proteomic study. C. marmoreus has a relatively short radula (~2.5 mm), making this species challenging to "milk." The injected venom of C. marmoreus has a milky appearance (Fig. 4B), in contrast to the translucent venom obtained from "hook-and-line" piscivorous species. The milky appearance is mainly due to the presence of secretory granules (Fig. 4C) that appear similar to those found in the venom duct of another molluscivorous cone snail, C. victoriae (34). The volume of the injected venom seems to vary according to the size of the animal, and generally 10–20 μl were collected per milking, with six different individuals pooled for our proteomic analysis.
Mass Spectrometry-We used ESI or MALDI sources in LC-ESI-MS (QSTAR Pulsar), MALDI-MS (4700 TOF-TOF Proteomics Analyzer) and LC-ESI-MS/MS (TripleTOF 5600 System) configurations to uncover the complexity of C. marmoreus injected venom. Using a precision of ± 0.2 Da for binning the mass list, a single 115 min LC-ESI-MS run on the QSTAR instrument revealed 3172 unique masses (from the 6867 raw data mass list) in the milked venom of C. marmoreus (Fig. 5B). An exhaustive MALDI analysis, including both a 33 min LC-MALDI run (192 spots) and manually spotted UV-absorbing fractions from an HPLC run, identified a comparable number of masses (2710). However, only 1219 (45%) masses were common between the QSTAR and the 4700 MALDI instruments, indicating significant detection bias. In comparison, 6254 unique masses (from the 15757 total masses detected) were identified using the TripleTOF 5600 from a single LC-ESI-MS run (TIC trace shown in Fig. 6), of which 2448 overlapped with the QSTAR (77%) and 1776 overlapped with the MALDI (65%). Overall, 1105 common masses could be identified from all three instruments with a precision of 0.2 Da (Fig. 5B) from a total of 7798 unique masses detected across the three instruments. Although this number is the largest reported for any venom, our stringent conditions for sorting the mass list from the raw data likely underestimate the total number of peptides present, since peptides with similar masses but distinct retention times would not be counted. In addition, with a threshold S/N conservatively set to 10, some minor components were also missed (Supplemental Fig. S2). Furthermore, only 32 possible Na-adducts, 37 possible K-adducts, and 26 possible Fe-adducts were identified in the MALDI mass list (3.5% of 2710 masses). In the 5600 TF mass list, 338 possible adduct products were found from 6254 masses (5.4%); however, > 50% of these masses had distinct retention times, indicating most were in fact different peptides and not salt adducts. Deconvolution artifacts were also considered, and isotopic mass envelopes (+1 to +8) with the same retention time were removed, along with possible loosely associated masses within 0.5 Da that had the same retention time. Finally, in-source fragments were also considered; however, the mild conditions used for TOF scan (ESI) were expected to produce few in-source fragments. For the MALDI experiments, only mild MS-RP acquisition on CHCA matrix was performed, preventing in-source fragmentation.

FIG. 1. Alignment of C. marmoreus conotoxin precursors retrieved from next generation sequencing data. Sequences have been clustered by gene superfamily, according to their signal peptide. Gaps have been introduced to optimize the alignment sequence identity. Color coding has been applied using the following scheme: cysteine residues are in yellow, negatively charged residues are in red, positively charged residues are in blue, polar uncharged residues are in green, methionine residues are in orange and hydrophobic residues are in white.

Cysteine residues are indicated in bold type, hydroxyproline are represented as O, C-terminal amidation as *, carboxyglutamates as y and D-amino acids are underlined.

FIG. 2. Sequence identity within and between gene superfamilies signal peptide sequences. The minimum percentage of sequence identities computed between signal peptide sequences of precursors belonging to the same gene superfamily are on a black background. The maximum percentage identities measured between signal peptide sequences of precursors belonging to different gene superfamilies are on a white or gray background. Comparisons between the new gene superfamilies and the previously known gene superfamilies are highlighted on a gray background. The percentage of sequence identities were computed for all pairs of sequences using a Smith and Waterman algorithm, and the percentage of identity was computed using the length of the smallest sequence. The gene superfamily M was detailed into three branches: m-1, m-2 and m-c (conomarphins).

FIG. 3. Levels of mRNA expression of individual conotoxins and conotoxin superfamilies in the venom duct of C. marmoreus. A, The total number of reads is plotted per precursor, demonstrating the efficacy of the sequencing effort. B, Dramatic variations in the level of expression were noted for individual conotoxins. Interestingly, most conotoxins with a number of reads above 300 were already discovered either from the venom or using PCR amplification strategies. C, The number of isoforms and the total number of reads per gene superfamily show an apparent correlation. The goodness of fit was R² = 0.88, revealing a significant correlation between the two parameters.
It is surprising that only 77% of the QSTAR masses overlapped with those of the 5600 TF within a 0.2 Da precision range, given that both instruments use the same ionization method. A difference in measurement accuracy between the two instruments likely accounts for this discrepancy. For example, the reconstructed mass of MrVIB (Mr051, MW 3403.58 Da) from the two instruments showed that the 5600 TF produces highly accurate data (within 0.01 Da of the theoretical mass), while the QSTAR was less reliable (mass difference of 0.26 Da). Relaxing the precision window to 0.5 Da improved the overlap to 87%, confirming that instrument accuracy was a major contributor to the incomplete overlap observed between the QSTAR and 5600 TF detected masses. The mass distribution of the injected venom of C. marmoreus inferred from each instrument is shown in Fig. 5A. As expected, small peptides dominated the venom, especially those in the range 1000–2000 Da, while similar numbers of peptides were detected for the ranges 2000–3000 Da and 3000–4000 Da. Proteins larger than 8 kDa were also detected; however, they represent relatively minor components of C. marmoreus injected venom and were not analyzed further in this study.
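The MrVIB comparison above can be checked numerically. The sketch below converts the reported mass differences into ppm and tests them against the two precision windows discussed. The theoretical mass and the differences are taken from the text; the sign of each deviation is not stated, so only absolute values are used, and the helper names are illustrative.

```python
def ppm_error(delta_da: float, theoretical: float) -> float:
    """Absolute mass error expressed in parts per million."""
    return abs(delta_da) / theoretical * 1e6


def within_window(delta_da: float, window_da: float) -> bool:
    """True when the absolute mass error fits inside the window."""
    return abs(delta_da) <= window_da


THEORETICAL = 3403.58  # MrVIB (Mr051), Da, from the text
QSTAR_DELTA = 0.26     # reported QSTAR mass difference, Da
TOF_DELTA = 0.01       # reported bound for the 5600 TF, Da

for name, delta in (("QSTAR", QSTAR_DELTA), ("5600 TF", TOF_DELTA)):
    print(
        name,
        f"{ppm_error(delta, THEORETICAL):.1f} ppm",
        "fits 0.2 Da:", within_window(delta, 0.2),
        "fits 0.5 Da:", within_window(delta, 0.5),
    )
```

A 0.26 Da error on a 3403.58 Da peptide is about 76 ppm, which falls outside the 0.2 Da window but inside the 0.5 Da one, consistent with the improved overlap reported at the wider window.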
Matching Transcriptomic and Proteomic Data Using Dedicated Bioinformatic Tools-Calculated masses from all 102 predicted mature sequences were compared with masses identified using the three instruments (Supplemental Table S1; precursors Mr069, Mr070, and Mr071, which only contained a proregion, were excluded from this analysis). The TripleTOF 5600 System detected all 102 mature sequences within 100 ppm. In contrast, the QSTAR data could be matched to 79 (77%) of the mature conopeptides, including 69 within 100 ppm, with 26 not detected, while MALDI data could be matched to 71 (67%) of the mature peptides, including 69 within 100 ppm, with 34 not detected. As expected, the precision match (smaller delta mass) was higher for short sequences (< 20–25 amino acids), in part because longer sequences have proportionally more possible PTMs. A single mass may correspond to several possible peptides, but detailed MS/MS data and knowledge of each gene superfamily PTM profile allowed discrimination of the different possible solutions. Below we describe the conopeptides identified,
Gene Superfamily A-Only two precursors from gene superfamily A were identified in our transcriptomic data. From the three previously known α-conotoxins, only Mr1.1 could be found in our transcriptome data (Mr001), and Mr1.2 and Mr1.3 were absent. The molecular targets of these small peptides are the various subtypes of nicotinic acetylcholine receptors, although recent findings indicate that the GABA(B) receptor is also a potential pharmacological target (35). Mr1.1 was recently found to be analgesic in an animal model of inflammatory pain (36). In our data set we found a novel α-conotoxin isoform, Mr002, which has high similarity to Bn1.2, a peptide isolated from the closely related C. bandanus. The proregions of Mr001 and Mr002 are different and contain the presequence cleavage sites LTVK and LNAR, respectively, which were confirmed by MS/MS sequencing. Mr001 and Mr002 have similar levels of expression, with 35 and 20 reads, respectively. MS/MS data of Mr1.1 (Mr001) indicated that the mature form has an amidated C terminus. This is the first time that Mr1.1 has been identified at the peptide level. In contrast, the mature Mr002 peptide had two hydroxyprolines and a serine instead of the C-terminal glycine found in its precursor.
Gene Superfamily I1-Three gene superfamily I1 precursors were detected in our transcriptome data, and all three showed relatively low levels of expression. Fourteen reads were found coding for Mr004, but only three for Mr005 and one for Mr003. The presequence cleavage site in these precursors is LR, producing mature peptides of 40–45 amino acids with four disulfide bonds that were confirmed by MS/MS. Most conopeptides from the gene superfamily I1 isolated to date produce general excitatory symptoms in mice, possibly through effects on sodium channels (37).
Gene Superfamily O2-Ten precursors belonging to the gene superfamily O2 were sequenced and further classified into three subgroups based on signal peptide sequence similarities (Fig. 1 and Supplemental Fig. S1). Five precursors in the first subgroup coded for mature peptides of 24–27 amino acids and three disulfide bonds (Mr006-Mr010). Only one peptide in this gene superfamily, produced by precursor Mr007, was already known from C. marmoreus (Mal51), and this precursor was represented by 10 times more reads than the other members of this subgroup (38). Each of these ten precursors contained the presequence cleavage site KR, generating mature peptides for Mr006, Mr007, and Mr008 with a predicted N-terminal pyroglutamate and amidated C terminus (except Mr010). Mal51 and the mature sequences of Mr009 and Mr010 were confirmed by MS/MS. Although MS/MS evidence for the predicted pyroglutamate and C-terminal amidation was found for the abundant Mal51, the unmodified mature peptide unexpectedly dominated in the venom.
The second subgroup contained three precursors (Mr011-Mr013), which are expressed at a low level (< 25 reads). The signal peptide of these precursors shared 90% sequence homology with known gene superfamily O2 precursors, but the propeptide and predicted mature peptide regions were different. The pre-cleavage sites (LIGR or LTGR) precede mature peptides of 34–35 amino acids, which display an eight-residue N-terminal tail and three disulfide bonds. A conserved lysine residue at position 48 (see Fig. 1 alignment) constitutes a second cleavage site, resulting in mature peptides of 26–27 amino acids in length and three disulfide bonds. Indeed, these shorter peptides were confirmed by MS/MS sequencing as being the dominant mature products. Interestingly, MS/MS data could be confidently matched to several isolated propeptide regions excised from precursors from this subgroup. The identified propeptide region sequences are DEENLLKPMIYFILIGR for Mr011 and DGENPLKALIDILTGR for Mr012.
Finally, two precursors coding for contryphans were found to cluster with the gene superfamily O2: Mr014 (contryphan-M) and Mr015. Contryphan-M was highly expressed with 82 reads, whereas Mr015 was expressed at ~20-fold lower frequency. The cleavage site KVLR for Mr015 produced a ten-residue mature peptide corresponding to a truncated contryphan-M, and this peptide was confirmed by MS/MS sequencing. In addition, the C-terminal amidation of both contryphan-M and Mr015 mature peptides was validated by MS/MS.
Gene Superfamily S-Only eight conopeptides from gene superfamily S are known in the entire ConoServer database. Two new precursors belonging to this gene superfamily were found in our C. marmoreus transcriptome and both were expressed at a low level (< 10 reads). Full-length Mr016 has only three cysteines, whereas other members of the superfamily S belong to cysteine framework VIII and have ten cysteines. Conopeptides with an odd number of cysteines are rare, but some were recently shown to form disulfide-bonded homodimers (39). However, the expected dimer (7041.55 Da) was not detected in the venom. The second precursor had a partially truncated signal peptide, but the predicted mature peptide possessed the canonical cysteine framework VIII. Both of the predicted mature peptides without PTMs were matched to peptide masses within 100 ppm using MS; however, MS/MS data could not confirm these sequences.
Gene Superfamily I2-Ten precursors were identified for the I2 gene superfamily, yet none had a level of expression higher than 50 reads. Previously identified Gla-MrII (Mr019) was found in our transcriptomic data, but Mr12.8 was absent (40). In contrast to other conopeptide precursors, this gene superfamily has its propeptide region located after the mature peptide region. In addition, several peptides in this gene superfamily were shown to contain γ-carboxylation and a recognition site for the carboxylase enzyme (41). The identification by MS of peptides from this gene superfamily is challenging because the predicted mature sequences are long and potentially heavily post-translationally modified. For example, Gla-MrII has five γ-carboxylations. From the ten precursors belonging to gene superfamily I2, three subgroups could be identified (Fig. 1). Three precursors, Mr018, Mr019, and Mr020, had Gla-MrII-like sequences and a γ-carboxylation motif. MS data could be associated with the mature peptides of all three precursors, including 4–5 γ-carboxylations (Supplemental Table S1). Gla-MrII and the mature Mr020 sequences were confirmed by MS/MS, but their γ-carboxylation was not detected.
A second I2 subgroup included precursors Mr021, Mr022, Mr023, Mr024, and Mr025 that were predicted to be slightly shorter than Gla-MrII but with a similar γ-carboxylation pattern. Despite having different propeptide regions, Mr022 and Mr025 share the same predicted mature sequence. Masses corresponding to four to five γ-carboxylations were identified in the MS data, but mature peptides could not be confirmed by MS/MS data. Finally, two precursors, Mr026 and Mr027, encoded short mature peptides containing three and four cysteines, respectively. A peptide fragment LCEHPEETCLLPQ corresponding to Mr026 and/or Mr027 was identified without PTMs by MS/MS.
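Because a single observed mass may correspond to several PTM combinations, it can help to see how the modifications discussed in this section shift a peptide's monoisotopic mass. The sketch below derives each shift from standard monoisotopic atomic masses; the dictionary keys, the function name, and the 4000 Da starting mass are hypothetical illustrations, not values from the study.

```python
# Standard monoisotopic atomic masses (Da).
H, C, N, O = 1.00782503, 12.0, 14.0030740, 15.9949146

# Mass change per modification, derived from the change in elemental composition.
PTM_DELTA = {
    "gamma_carboxylation": C + 2 * O,         # Glu -> Gla adds CO2
    "hydroxylation": O,                       # Pro -> hydroxyproline adds O
    "c_term_amidation": N + H - O,            # C-terminal -OH replaced by -NH2
    "pyroglutamate_from_gln": -(N + 3 * H),   # N-terminal Gln cyclizes, losing NH3
    "disulfide_bond": -2 * H,                 # two Cys thiols lose 2 H
}


def modified_mass(unmodified: float, ptms: dict) -> float:
    """Add the mass shift of each PTM (name -> count) to an unmodified peptide mass."""
    return unmodified + sum(PTM_DELTA[name] * count for name, count in ptms.items())


# Hypothetical 4000 Da peptide carrying zero to five gamma-carboxylations.
ladder = [modified_mass(4000.0, {"gamma_carboxylation": k}) for k in range(6)]
```

Each additional γ-carboxylation adds about 43.99 Da, so candidate masses for four versus five modifications are well separated at 100 ppm for peptides of this size.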
Gene Superfamily M-Twenty-three precursors belonging to the gene superfamily M were further classified into the m-1 and m-2 subgroups, which have distinct signal peptide sequences (42). Of the eight full-length precursors belonging to the m-2 branch, four have mature peptides that were reported previously: Mr3.3, MrIIIB, MrIIIG, and MrIIID (43)(44)(45). Among this group, the MrIIIG precursor has the highest expression level with 230 matching reads. The predicted mature regions are cleaved after a DSGR or DAVR motif to generate peptides ranging from 14 to 17 amino acids and stabilized by three disulfide bonds. Processing of both Mr028 and Mr029 precursors generates the same mature peptide Mr3.3. Good MS/MS coverage was obtained for this subgroup. The mature peptides of Mr030 (MrIIIB), Mr031 (MrIIIG), and Mr033 (MrIIID) each displayed a hydroxyproline in a conserved C(XO/P)CC motif. Additionally, MrIIID has a second hydroxyproline in the first loop, and both MrIIID and MrIIIG have an amidated C terminus. In contrast, Mr034 and Mr035 precursors generated mature peptides without PTMs, as identified by MS within 100 ppm accuracy. These peptides without PTMs could not be confirmed by MS/MS (Supplemental Table S1).
Twelve precursors that belong to the m-1 branch were identified (Mr039-Mr050), including the previously characterized MrIIIE, MrIIIF, Mr3.8 and Mr1e precursors (43)(44)(45). All precursors in this branch had a pre-sequence cleavage site LGQR or KR, yielding predicted mature peptides with 11 to 16 amino acids and three disulfide bonds, except Mr1e, which has only four cysteines. Mr044 (Mr3.8) was the most highly expressed precursor in the transcriptome of C. marmoreus with 1372 reads and was readily confirmed by MS/MS. The new precursor Mr047 is also highly expressed (415 reads) but the other new precursors identified generated only 1-73 reads. Interestingly, MS/MS data suggest that the mature sequences of Mr041 and Mr049 contain an odd number of cysteines. The predicted C-terminal amidation of Mr039 was confirmed by MS/MS, whereas the mature peptide corresponding to the excitatory Mr1e (45) was confirmed to contain no PTMs by MS/MS. Two precursors Mr036 (conomarphin) and Mr037 cluster with the gene superfamily M (m-c branch) (30). Both precursors contain the cleavage site LKKR, producing a mature linear peptide of 17 amino acids, and both were expressed at relatively high levels (161 and 92 reads for Mr036 and Mr037, respectively). Interestingly, a precursor encoding the same conomarphin was also cloned from the worm-hunter Conus imperialis (46). MS/MS data confirmed proline hydroxylation and identified truncated forms as previously described (47).
The precursor Mr038 has a gene superfamily M signal peptide, although the propeptide and the mature peptide regions display little homology with other gene superfamily M precursors. The predicted cleavage site (RK) and removal of the C-terminal glycine (amidation) are expected to yield an 18 amino acid mature peptide with two cysteines and a long N-terminal tail more similar to the contryphans than other known gene superfamily M conopeptides (Fig. 1). Only four reads were found to match this sequence and reliable MS/MS coverage could not be obtained.
Gene Superfamily O1-Twenty-three precursors carrying the gene superfamily O1 signal peptide sequence were identified; these clustered into three distinct subgroups (Fig. 1), each containing the cleavage site LEKR or LNKR. The first subgroup contained six precursors (Mr051-Mr056), including the highly expressed Mr053 (MrVIA) and Mr051 (MrVIB) (48,49). The new precursor Mr052 had a similar sequence to the MrVIB precursor, and the three precursors Mr054, Mr055, and Mr056 had an odd number of cysteines and extended C-terminal sequences. The mature peptide sequence of Mr052 was confirmed using MS/MS, but those of Mr054, Mr055, or
The second subgroup comprised four similar precursors, including the previously characterized MrIA (Mr090), CMrVIA (Mr091), and CMrX (Mr092) (51)(52)(53). Except for the new precursor Mr089, these precursors were highly expressed (Table II) and the presence of a hydroxyproline was confirmed by MS/MS.
New Gene Superfamily N-Three precursors, Mr093, Mr094, and Mr095, displaying the typical signal peptide/propeptide region/mature peptide region architecture of conopeptides, were identified as belonging to a new gene superfamily. Each of the three precursors had an LEKR cleavage site that delineates a mature peptide with eight cysteines. These cysteines are arranged along the sequence in a C-C-CC-C-C-C-C pattern corresponding to cysteine framework XV (Supplemental Table S2). Interestingly, the mature peptide of Mr093 (45 reads) was discovered as two main fragments in the MS/MS data (CSSGKTCGSVEOVLCCARSDCYCRLIQT and SYWVOICVCP), indicating the presence of an alternative cleavage site generating a major framework VI/VII peptide and a smaller disulfide-poor conopeptide.
New Gene Superfamily B-Only one precursor, Mr096, was identified in this new gene superfamily. Despite < 55% sequence identity to signal peptides from other gene superfamilies, its level of expression was high (323 reads). Interestingly, one sequence from C. litteratus (Q2HZ30) deposited in the UniProt-KB database and described as a "high frequency protein" also contains the same signal peptide. The predicted mature sequence of Mr096 displays a cysteine framework VIIII (Supplemental Table S2) but includes an unusual repeat motif (CRECK/R). Surprisingly, the predicted mature sequence of Q2HZ30 had no cysteine residues and no sequence homology to the mature Mr096. Although we could match the predicted mature sequence from Mr096 by MS within 100 ppm with no PTMs, MS/MS data were inconclusive.
New Gene Superfamily H-Superfamily H has a signal peptide that is divergent from previously known conopeptide gene superfamilies (< 50% sequence identity). As a consequence, the corresponding precursors were initially not recovered in the homology search of the raw reads. Instead, peptides belonging to this gene superfamily were first identified through MS/MS data matching, illustrating the complementarity of transcriptomic and proteomic data in conopeptide discovery. Of the seven precursors belonging to this gene superfamily, six had six cysteines arranged in a classical VI/VII cysteine framework (Supplemental Table S2), but Mr103 was predicted to generate a different mature peptide. Mr097 and Mr098 were the most highly expressed genes in this gene superfamily (130 and 127 reads, respectively), whereas Mr099, Mr100, and Mr103 were expressed at two- to fourfold lower levels, and Mr101 and Mr102 only generated one and four reads, respectively. Only the four short mature sequences from the precursors Mr097, Mr098, Mr099, and Mr100 yielded good MS/MS data coverage. These precursors contained an unconventional pre-cleavage site (RNWSR) and their mature toxins have a hydroxyproline located in the first inter-cysteine loop in all except Mr100.

New Gene Superfamilies E and F-The two precursors Mr104 and Mr105 had no significant homology to any known conopeptide sequences deposited in ConoServer. Mr104 had relatively high expression (86 reads), whereas Mr105 gave only two reads. No obvious cleavage site could be identified for the Mr105 precursor, but a KRNGR pre-cleavage site was predicted for Mr104. MS/MS identified a propeptide of Mr105 (ELYDVNDPDVR) in the venom; however, the Mr105 mature peptide was not identified. The predicted mature sequence of Mr104 was supported by MS/MS data, revealing a 26 amino acid peptide with two disulfide bonds and a bromo-tryptophan.
A New Mechanism Expanding Conopeptide Diversity-The high sensitivity of the TripleTOF 5600 System allowed us to characterize on average 20 different peptide variants (i.e. different precursor masses detected by mass spectrometry) for each gene precursor (Fig. 7A). Unexpectedly, most of this peptide diversity corresponded to truncated forms of either the mature peptide, the propeptide, or sequences comprising both the mature peptide and the propeptide. In addition to these truncations, further diversity was created by variable PTM processing. The largest number of MS/MS sequences identified was associated with the gene precursor of MrIA (Mr090), with 72 unique peptide masses detected in the venom for this highly expressed peptide. Based on the intensity of the mass precursor ion, MrIA and its deamidated form (21) dominated, with the next most intense mass precursor ions (~4% of deamidated MrIA) corresponding to the full MrIA gene precursor propeptide (Fig. 7B and Table III). Other mature MrIA-related peptides included N-terminal truncations and PTMs, including C-terminal amidation and sulfation of tyrosine, not previously reported for gene superfamily T peptides.
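To give a sense of how quickly terminal truncations multiply the number of distinct peptides from one sequence, the following sketch enumerates N- and C-terminal truncation variants. The function name, the limits on how many residues are removed, and the 12-residue example sequence are hypothetical; the sketch does not model which variants actually occur in venom.

```python
def terminal_truncations(seq, max_n=3, max_c=3, min_len=5):
    """Return the set of variants obtained by removing up to max_n residues from
    the N terminus and up to max_c residues from the C terminus."""
    variants = set()
    for n in range(max_n + 1):
        for c in range(max_c + 1):
            sub = seq[n:len(seq) - c]
            if len(sub) >= min_len:
                variants.add(sub)
    return variants


# Hypothetical 12-residue sequence with no repeated residues.
variants = terminal_truncations("ACDEFGHIKLMN")
n_variants = len(variants)
```

Allowing only up to three residues to be lost from each end already gives 16 forms of this sequence (including the full-length peptide), before any PTM heterogeneity is considered.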

DISCUSSION
Using a combination of second-generation sequencing and high-sensitivity mass spectrometry, we have unraveled the venom molecular diversity of Conus marmoreus and identified a new mechanism of variable peptide processing (VPP) that contributes to the remarkable diversity of conopeptides. Sequences for 105 unique conopeptide precursors were retrieved from the transcriptome and classified into 13 gene superfamilies. Conopeptides in gene superfamilies O1, T, and M dominated both in terms of their expression level and number of isoforms, suggesting an important role in prey capture and/or defense. Seven gene superfamilies not previously known from C. marmoreus, including five novel gene superfamilies, were also discovered. Our approach of integrating transcriptomic and MS/MS sequence data allowed identification of highly divergent gene superfamilies (e.g. superfamily H) that were missed in simple homology searches. VPP, in combination with intra-species variation within gene superfamilies, can explain how ~100 gene precursors generate thousands of unique venom peptides in a single species of cone snail.

Table IV displays statistics on the gene superfamilies identified from 12 species of Conidae, including data from the recently reported venom duct transcriptomes of C. consors and C. pulicarius (17)(18)(19). Extensively studied mollusk-hunting species, including C. marmoreus (this study), have a comparable distribution of transcripts across the different gene superfamilies, with gene superfamilies M, O1, and T dominating. In our study on C. marmoreus, this expression level translated to a corresponding distribution of mature peptides in the venom. Gene superfamilies M, O1, and T are also common in vermivorous species (see Table IV). However, for the more recently evolved piscivorous species C. consors and C. striatus (54), gene superfamilies M and O1 are highly expressed, along with gene superfamily A. Therefore, the requirement for gene superfamily T in molluscivorous and vermivorous species appears to have been lost in piscivorous species. C. californicus is thought to be phylogenetically distinct from other Conus species. Because only the gene superfamily O1 is shared as a large gene superfamily between C. californicus and other Conidae, gene superfamily O1 may have evolved early in the speciation of Conidae. Conopeptides with cysteine framework VI/VII, the most common in gene superfamily O1, fold into a highly stable cysteine knot motif (55) found in a wide range of bioactive peptides expressed across both the animal and plant kingdoms. These cysteine knot peptides have evolved in cone snails to selectively target voltage-gated (48,49), as well as selectively inhibiting the mammalian neuronal voltage-gated sodium channel Nav1.8 to produce intrathecal analgesia (56,57). In contrast, the biological activity for a number of other C. marmoreus conopeptides has only been demonstrated at mammalian targets. For instance, intracranial injections in mice identified that Mr1e was excitatory, CMrX was paralytic, and CMrVIA produced seizures (45,51). Further, contryphan blocks L-type calcium channels in mouse pancreatic B-cells (58), Mr1.1 inhibits rodent nicotinic acetylcholine receptors (36,58), and MrIA non-competitively inhibits the human noradrenaline transporter (53). This remarkable diversity of biological actions indicates that C. marmoreus uses a multiple-target strategy to broadly disrupt neuronal function of prey and/or predators.
Other peptides (including MrIA and propeptide truncations and PTM variants) were present at less than 5% of MrIA (or less than 1% for ~90% of the peptides, as represented by the dotted line).
This study has shown that the level of precursor transcription, as estimated by the number of reads for each transcript, reflects the levels of the corresponding conopeptides found in the crude venom. For example, transcript Mr044 was the most highly expressed transcript in C. marmoreus venom duct and its corresponding conopeptide Mr3.8 was also one of the most prominent ions detected in the injected venom (Fig. 6). In contrast, precursors expressed at low levels could rarely be confirmed by MS/MS analysis. While evolutionary pressures are expected to influence the level of expressed conopeptides (59,60), it remains to be determined whether conopeptides expressed at low levels are recently evolved or in the process of being deselected.
We observed a significant disparity between the number of conopeptide genes and the number of masses detected by mass spectrometry, confirming previous studies (15,16). Compared with the 105 conopeptide precursors identified in the venom gland transcriptome, 7798 unique masses were identified using the combined results from three MS platforms with stringent de-replication. To understand the mechanisms responsible for this ~75-fold disparity, reduced and alkylated venom was analyzed in detail by MS/MS. Using this approach, 1385 peptide fragments sequenced by MS/MS could be matched to > 60% of the 105 precursors, providing the most comprehensive study to date on animal venom complexity. Surprisingly, the majority of identified conopeptides were differentially processed N- and C-terminal variants. For each gene precursor, one or two conopeptides typically dominated quantitatively (~95%) and these invariably corresponded to conopeptides cleaved at a predicted R/K cleavage site. The remaining variants arise from enzyme processing at alternative R/K cleavage sites in the sequence, or they appear to arise from enzymes with low substrate specificity or an alternative substrate preference. Because these alternatively cleaved forms are always less abundant than the full-length mature peptide, their biological relevance is unclear. However, because conopeptides differing by only a few residues at their N or C termini can have altered biological activity (61)(62)(63), this VPP is expected to have evolutionary significance. Together with the hypermutations seen at the mRNA level, VPP is a new mechanism that contributes to "biological messiness" in venoms, a concept recently developed in the field of enzymology to explain the origins of evolutionary innovation (64). This study has also demonstrated that propeptide sequences can survive intact in cone snail venom. In C. marmoreus these were identified from gene superfamily I1, M, O1, O2, and T precursors, and again were subject to variable cleavage that expanded their diversity. While still attached to the mature peptide, the proregion is known to facilitate the ER export of hydrophobic mature conotoxins (65); however, no role has yet been assigned to the propeptide itself. It will be interesting to identify whether these mostly linear peptides have biological activity and to what extent they contribute to the envenomation process and conopeptide evolution.
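Processing at alternative R/K sites, as discussed above, can be illustrated by listing every position in a sequence that follows a basic residue together with the two fragments that cleavage there would release. This is a simplified sketch: the function name and the 16-residue sequence are hypothetical, and real processing enzymes depend on sequence context that is not modeled here.

```python
def cleavage_products(seq, basic="KR"):
    """Return (cut_position, n_fragment, c_fragment) for cleavage after each K or R.

    A basic residue at the very C terminus is skipped because cutting after it
    would not release a second fragment.
    """
    products = []
    for i, aa in enumerate(seq):
        if aa in basic and i < len(seq) - 1:
            products.append((i + 1, seq[:i + 1], seq[i + 1:]))
    return products


# Hypothetical precursor fragment for illustration only.
products = cleavage_products("ELVKRACCGPKCSWRG")
for position, n_fragment, c_fragment in products:
    print(position, n_fragment, c_fragment)
```

For this made-up sequence there are four candidate sites, each yielding a different N-terminal and C-terminal product from the same gene-encoded sequence.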

CONCLUSIONS
Our analysis of the more than 7500 conopeptides used by C. marmoreus for prey capture and defense represents the most exhaustive transcriptomic/proteomic study of cone snail venom to date. In addition to accelerating the rate of discovery of novel venom peptides (75 novel conopeptide precursors), the combined strategy using second-generation sequencing technologies and high-sensitivity mass spectrometry has allowed the identification of a novel mechanism of variable peptide processing (VPP). VPP produces diverse N- and C-terminal truncations that exponentially increase the number of peptides generated from a limited number of genes. On average 20 conopeptides (1-72) were generated from each precursor sequence. When applied to each of the 105 conopeptide precursors, an estimated 2000 conopeptides are predicted to be generated by a single C. marmoreus specimen. Significant intraspecific venom variability (16) likely explains the additional conopeptides observed in the pooled milked venom obtained from six C. marmoreus (7798 peptides detected using the three MS platforms). Thus, VPP in combination with intraspecific variability explains for the first time how cone snails can produce exquisitely complex venoms from relatively limited gene sets. VPP may represent a more general phenomenon accounting for highly diverse venoms (> 1000 peptides) observed in other animals, including spider venoms (66), contributing to the "biological messiness" in venoms and associated rapid and adaptive evolution of toxins for prey capture and defense. The next challenge in venomics will involve coupling this accelerated discovery strategy to high-throughput synthesis and bioassays (67,68) to accelerate molecular target identification and selectivity profiling of new conotoxins.
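The estimates quoted above follow from simple arithmetic on figures reported in the text, which can be checked in a few lines. The variable names are ours; the three input values are those stated in this article.

```python
n_precursors = 105                 # conopeptide precursors in the transcriptome
mean_variants_per_precursor = 20   # average peptide variants per precursor
unique_masses_pooled = 7798        # unique masses from the three MS platforms

# Peptides expected from a single specimen if every precursor yields the average.
predicted_per_specimen = n_precursors * mean_variants_per_precursor

# Ratio of detected masses to precursor genes.
fold_disparity = unique_masses_pooled / n_precursors
```

The product is 2100, consistent with the rounded estimate of about 2000 conopeptides per specimen, and the ratio is about 74, consistent with the roughly 75-fold disparity between genes and detected masses.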
This article contains supplemental Figs. S1 and S2 and Tables S1 and S2.