Dataset for the proteomic and transcriptomic analyses of perivitelline fluid proteins in Pomacea snail eggs

This article describes how the proteomic and transcriptomic data were produced during a study of the reproductive proteins of Pomacea maculata, an aquatic apple snail that lays colorful aerial eggs, and provides public access to the data. The data are related to a research article titled ‘An integrated proteomic and transcriptomic analysis of perivitelline fluid proteins in a freshwater gastropod laying aerial eggs’ (Mu et al., 2017) [1]. RNA was extracted from the albumen gland and other tissues and sequenced on an Illumina HiSeq 2000. The assembled transcriptome was translated into protein sequences, which were then used as the database for protein identification. Proteins from the perivitelline fluid of P. maculata were separated by SDS-PAGE and analyzed with an LTQ-Orbitrap Elite coupled to an Easy-nLC. The translated transcriptome data are provided in this article. Proteomic data (.raw file format) are available via ProteomeXchange with the identifier PXD006718.


Specifications
An integrated proteomic and transcriptomic analysis of perivitelline fluid proteins in a freshwater gastropod laying aerial eggs [1]

Value of the data
This dataset provides a comprehensive proteomic profile of the perivitelline fluid of the apple snail Pomacea maculata. The proteomic data, obtained by LTQ-Orbitrap mass spectrometry, can be used for protein identification, especially of reproductive proteins in gastropods.
This dataset also provides translated transcriptomic profiles of the albumen gland and other tissues of Pomacea maculata. The translated transcriptome can be used as the database to support protein identification in gastropods.
The data presented here can be used for studies of protein function and evolution in gastropods.

Data
Pomacea maculata is a freshwater snail native to South America that has invaded many regions of the world [2]. There is considerable interest in the reproductive biology of this species [3,4], but a lack of genomic resources has hindered such studies at the molecular level. We extracted RNA from the albumen gland and other tissues and sequenced it on an Illumina HiSeq 2000 to generate a database to support protein identification. Table 1 shows the number of contigs and unigenes in the assembled transcriptome, as well as the quality of the data. Table S2 contains 44,350 protein sequences that were translated from the transcriptome. These sequences were used for protein identification as described below. Proteins were extracted from the perivitelline fluid of newly laid eggs, fractionated using SDS-PAGE and analyzed with an LTQ-Orbitrap Elite coupled to an Easy-nLC. The data files (.raw) generated by mass spectrometry were deposited in ProteomeXchange. They were also converted into .mgf files using Proteome Discoverer 1.3.0.339 and searched against the protein database in Mascot 2.3.2.

Animal culture
Pomacea maculata adults, originally collected from a river in San Pedro, Argentina (33°39′35.97″ S, 59°41′52.86″ W), were transported to Hong Kong Baptist University and cultured at 25 ± 1°C in dechlorinated tap water. The snails were fed fish food, lettuce and carrot. Egg clutches deposited by the snails on the walls of the aquaria were used for protein extraction.

RNA extraction and transcriptome sequencing
To establish a database for protein identification and to detect tissue-specific genes, transcriptomes of the albumen gland (AG) and other tissues (OT; including foot, mantle and visceral mass) were sequenced. Total RNA of AG and OT was extracted using TRIzol reagent (Invitrogen, Carlsbad, USA) following the manufacturer's protocol with two minor modifications: a mixed solution of 0.8 M Na3C6H5O7 and 1.2 M NaCl was added before the isopropanol step, and a LiCl solution (final concentration 2 M) was added after resuspension of the RNA pellets in RNase-free water. The messenger RNA was collected, reverse-transcribed into cDNA, and sequenced on an Illumina HiSeq 2000 to produce 100 bp paired-end reads. Clean reads were assembled using Trinity (release 20130225) [5]. The assembly statistics of the AG and OT transcriptomes are shown in Table 1. Assembled sequences were annotated using BLASTx by searching against public databases (NCBI nr, Swissprot, COG and KEGG) with an E-value threshold of 1e-5 [6,7]. Amino acid sequences were translated from the assembled sequences and used as the database for protein identification (Table 2).
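The E-value cutoff used in the annotation step can be illustrated with a minimal Python sketch. It assumes the BLASTx results were saved in the standard 12-column tabular format (query ID in column 1, subject ID in column 2, E-value in column 11); the article does not state which output format was used, and the contig and subject names in the example are hypothetical.

```python
E_VALUE_THRESHOLD = 1e-5  # cutoff stated in the text


def best_hits(tabular_text, max_evalue=E_VALUE_THRESHOLD):
    """Return {query_id: (subject_id, evalue)} with the lowest-E-value hit per query.

    Assumes BLAST 12-column tabular output; hits above max_evalue are discarded.
    """
    best = {}
    for line in tabular_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        query, subject, evalue = fields[0], fields[1], float(fields[10])
        if evalue > max_evalue:
            continue  # hit does not pass the E-value threshold
        if query not in best or evalue < best[query][1]:
            best[query] = (subject, evalue)
    return best


# Hypothetical example rows (not taken from the dataset)
example = (
    "contig_1\tsp|P00001|X\t85.0\t120\t18\t0\t1\t360\t1\t120\t1e-40\t150\n"
    "contig_1\tsp|P00002|Y\t40.0\t100\t60\t0\t1\t300\t1\t100\t1e-3\t40\n"
    "contig_2\tsp|P00003|Z\t30.0\t50\t35\t0\t1\t150\t1\t50\t0.5\t25\n"
)
hits = best_hits(example)
print(hits)
```

In this example only contig_1 receives an annotation; contig_2 has no hit at or below the threshold and would remain unannotated.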

Egg mass collection, protein extraction and mass spectrometry
Egg masses were washed with MilliQ water and then air-dried. A sterile needle was used to crack the egg shells gently, and a pipette with a fine tip was used to collect the perivitelline fluid (PVF). PVF was stored in 8 M urea, homogenized, and centrifuged. The supernatant was collected and purified, and the protein concentration was determined using the RC-DC kit (Bio-Rad). Three biological replicates were collected from different egg masses.
The protein solutions were mixed with SDS-PAGE buffer (0.05% bromophenol blue, 50% glycerol, 10 mM dithiothreitol, 0.2 M Tris-HCl pH 6.8, and 10% SDS) at a ratio of 3:1 (v/v), heated at 105°C for 5 min, and separated by SDS-PAGE. Sample gels were stained with Coomassie Brilliant Blue and destained with 1% acetic acid and MilliQ water. Each biological replicate was divided into 10 fractions. For each fraction, the gel was cut into small pieces and further destained with a mixed solution of 50% methanol and 50 mM NH4HCO3, and then washed sequentially with MilliQ water, 100% acetonitrile (ACN), and 100 mM NH4HCO3. Disulfide bonds were then reduced with 10 mM dithiothreitol, and sulfhydryl groups were alkylated with 55 mM iodoacetamide. Each gel fraction was then digested using trypsin.
Each fraction from the three biological samples was reconstituted in 0.1% formic acid and analyzed twice with an LTQ-Orbitrap Elite coupled to an Easy-nLC as described previously [8]. In short, peptides from each fraction were separated on a C18 capillary column. Mass spectrometry scans over a range of 350-1600 m/z were conducted at a resolution of 60,000 in positive ion mode. The five most abundant multiply charged ions with a minimum signal threshold of 500.0 were selected for fragmentation using collision-induced dissociation (CID) and high-energy collision-induced dissociation (HCD). Both CID and HCD scanning strategies used an isolation width of 2.0 m/z. The CID fragmentation used an activation time of 10 ms and a normalized collision energy of 35%; the HCD fragmentation also used an activation time of 10 ms, but with a normalized collision energy of 45%.
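The precursor selection rule described above (five most abundant multiply charged ions, minimum signal 500.0, scan range 350-1600 m/z) can be sketched in Python as follows. This only illustrates the selection logic and is not the instrument's acquisition software, which applies further settings not covered here; the Peak type, the select_precursors function and the example survey scan are hypothetical.

```python
from typing import NamedTuple


class Peak(NamedTuple):
    mz: float         # mass-to-charge ratio
    charge: int       # assigned charge state
    intensity: float  # signal in arbitrary units


def select_precursors(peaks, top_n=5, min_signal=500.0, mz_range=(350.0, 1600.0)):
    """Pick up to top_n multiply charged peaks above min_signal, most intense first."""
    low, high = mz_range
    eligible = [
        p for p in peaks
        if p.charge >= 2                 # multiply charged ions only
        and p.intensity >= min_signal    # minimum signal threshold
        and low <= p.mz <= high          # inside the survey scan range
    ]
    return sorted(eligible, key=lambda p: p.intensity, reverse=True)[:top_n]


# Hypothetical survey scan (values invented for illustration)
survey_scan = [
    Peak(400.2, 2, 9000.0),
    Peak(512.3, 1, 20000.0),   # singly charged: excluded
    Peak(610.8, 3, 450.0),     # below the signal threshold: excluded
    Peak(700.4, 2, 7000.0),
    Peak(820.5, 2, 3000.0),
    Peak(955.1, 3, 12000.0),
    Peak(1200.6, 2, 800.0),
    Peak(1700.0, 2, 50000.0),  # outside the scan range: excluded
    Peak(1450.9, 4, 600.0),    # eligible, but ranked sixth
]
selected = select_precursors(survey_scan)
for peak in selected:
    print(peak)
```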

Protein identification
The raw MS/MS files (available via ProteomeXchange with identifier PXD006718) were converted into .mgf files using Proteome Discoverer 1.3.0.339 and searched using Mascot version 2.3.2 against the P. maculata database of 77,584 protein sequences, which contained both 'decoy' and 'target' sequences. The parameters were similar to those described in Mu et al. [9], except that the fixed modification was set as cysteine carbamidomethylation and the maximum number of missed trypsin cleavages was set to one. Peptides with an ion score ≥ 22 (corresponding to 95% confidence) were kept. Only peptides with more than nine amino acids were retained, and a false discovery rate threshold of 1% was adopted in the protein identification. Proteins that had at least three matched peptides and were detected in at least two replicates were kept.
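The peptide- and protein-level criteria above can be expressed as a short Python sketch. It assumes the search results have already been exported as one record per peptide-spectrum match, and it interprets "at least three matched peptides" as three distinct peptide sequences pooled across replicates; both points are assumptions, as are the PSM type, the filter_proteins function and the example records. The 1% false discovery rate step is not reproduced here.

```python
from collections import defaultdict
from typing import NamedTuple


class PSM(NamedTuple):
    protein: str      # protein accession
    peptide: str      # peptide sequence
    ion_score: float  # Mascot ion score
    replicate: int    # biological replicate number


def filter_proteins(psms, min_score=22.0, min_length=10,
                    min_peptides=3, min_replicates=2):
    """Return sorted accessions of proteins passing all thresholds.

    min_length=10 encodes "more than nine amino acids".
    """
    peptides = defaultdict(set)    # protein -> distinct qualifying peptides
    replicates = defaultdict(set)  # protein -> replicates with a qualifying peptide
    for psm in psms:
        if psm.ion_score >= min_score and len(psm.peptide) >= min_length:
            peptides[psm.protein].add(psm.peptide)
            replicates[psm.protein].add(psm.replicate)
    return sorted(
        protein for protein in peptides
        if len(peptides[protein]) >= min_peptides
        and len(replicates[protein]) >= min_replicates
    )


# Hypothetical records (invented for illustration)
psms = [
    PSM("prot_A", "LVNELTEFAK", 40.0, 1),
    PSM("prot_A", "GDSGGPLVCNGK", 35.0, 2),
    PSM("prot_A", "TTPPVLDSDGSFFLYSK", 28.0, 1),
    PSM("prot_B", "AEFVEVTKLVTDLTK", 45.0, 1),  # prot_B: one replicate only
    PSM("prot_B", "QTALVELLKHKPK", 33.0, 1),
    PSM("prot_B", "SLHTLFGDELCK", 30.0, 1),
    PSM("prot_C", "YICDNQDTISSK", 30.0, 1),
    PSM("prot_C", "DLGEEHFK", 50.0, 2),         # too short
    PSM("prot_C", "LKPDPNTLCDEFK", 25.0, 2),
    PSM("prot_C", "HLVDEPQNLIK", 15.0, 3),      # ion score too low
]
print(filter_proteins(psms))
```

Here only prot_A passes: prot_B is seen in a single replicate, and prot_C has just two qualifying peptides.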