In Silico Identification of New Secretory Peptide Genes in Drosophila melanogaster

Bioactive peptides play critical roles in regulating most biological processes in animals. The elucidation of the amino acid sequence of these regulatory peptides is crucial for our understanding of animal physiology. Most of the (neuro)peptides currently known were identified by purification and subsequent amino acid sequencing. With the entire genome sequence of some animals now available, it has become possible to predict novel putative peptides. In this way, BLAST (Basic Local Alignment Search Tool) analysis of the Drosophila melanogaster genome has allowed annotation of 36 secretory peptide genes so far. Peptide precursor genes are, however, poorly predicted by this algorithm, thus prompting the alternative approach described here. With the described searching program we scanned the Drosophila genome for predicted proteins with the structural hallmarks of neuropeptide precursors. As a result, 76 putative secretory peptide genes were predicted in addition to the 43 annotated ones. These putative (neuro)peptide genes contain conserved motifs reminiscent of known neuropeptides from other animal species. Peptides that display sequence similarities to the mammalian vasopressin, atrial natriuretic peptide, and prolactin precursors and the invertebrate peptides orcokinin, prothoracicotropic hormones, trypsin modulating oostatic factor, and Drosophila immune induced peptides (DIMs) among others were discovered. Our data hence provide further evidence that many neuropeptide genes were already present in the ancestor of Protostomia and Deuterostomia prior to their divergence. This bioinformatic study opens perspectives for the genome-wide analysis of peptide genes in other eukaryotic model organisms.

ally based on their biological activity. Although bioactive peptides are of considerable biological, medical, and industrial importance, a method for the systematic identification of all candidate bioactive peptides in an organism is still lacking. Computational methods have become especially important since the advent of genome projects. By means of Basic Local Alignment Search Tool (BLAST) analysis, the genome of an organism can be screened for peptide-coding genes based on sequence similarity to known peptide genes from other organisms. Using BLAST, 36 peptide genes have already been found in Drosophila melanogaster (1-4). Likewise in Anopheles gambiae, 35 peptide-encoding genes were discovered using the same sequence similarity-based mining approach (5). Although certain of these peptides have been studied in detail (6-11), more data on the entire peptidome of insects are needed for integrated functional analyses.
However, for in silico prediction of peptide precursor genes in large sequence datasets, the performance of the BLAST tool is limited because putative peptide sequences for which no orthologous biologically active peptide has been identified as yet (for instance, because of lack of suitable detection methods) will not be revealed.
BLAST programs are very suitable for scanning databases for conserved proteins. They are appropriate to find two kinds of sequence similarities in the following scenarios (12): 1) similarity is expected along the whole or most of the sequence and 2) local alignments. However, BLAST programs are far less efficient at finding similarity to short peptides when they are scanned against the whole genome sequence. Indeed in most cases, only a short conserved motif is responsible for the function of a particular peptide, and often only this short sequence motif, which can be 5 amino acids or less in length, is conserved. For instance, members of the invertebrate FMRFamide peptide family can share the carboxyl-terminal tetramer FMRFamide, the MRFamide tripeptide motif, or the RFamide motif. The detection of a conserved motif may yield valuable structural and functional insights, whereas the remainder of the peptide precursor sequence, i.e. outside the sequence of the actual bioactive peptide, may be essentially irrelevant.
If one or both of the sequences are long, for example in the case of two proteins, it is quite possible that they display very weak sequence similarities overall but that there are several different local alignments with a high score. However, it is difficult for BLAST to find a short alignment containing a motif with a high score within a long amino acid sequence. The local alignment with a high score may not include a specific motif and may only be noise generated by random mutation. At the same time, the truly significant alignment that includes a motif may be masked by long irrelevant sequences because many peptide genes are between 50 and 500 amino acids in length, and in general only a small part of the precursor consists of the actual active peptide(s). In this study we describe a new in silico searching program that uses additional hallmarks of biological peptides and their precursors that are not used by current predictive algorithms.

Rationale
First we examined the structural hallmarks of peptide precursor proteins. Regulatory peptides are synthesized as part of larger precursor proteins that are subsequently processed into smaller active substances. All peptide precursors contain a signal peptide that directs them into the secretory pathway of the cell. After cleavage of the signal peptide, further processing by endoproteases occurs predominantly at basic processing sites, typically mono- and dibasic amino acid residues (13). Many peptide precursors encode multiple bioactive peptides that are often highly related, for example the tachykinin precursor, the allatostatin precursors, and the neuropeptide F precursor in Drosophila and other insects (14-18). However, in Drosophila as in other insects, peptide genes encoding multiple, unrelated bioactive peptides or genes encoding just a single bioactive peptide also occur (19). Based on the common structural characteristics of known invertebrate peptide precursors we built a sensitive searching procedure to identify peptide genes in the Drosophila database.
Two types of programs were constructed. The first one was built to find those peptide precursors that encode multiple highly related peptides. The second one searches for precursors containing a single peptide or multiple unrelated peptides that share conserved motifs with known peptide precursor proteins from other animal species. In both cases, the putative peptide sequence was defined by the presence of characteristic proteolytic cleavage sites flanking the peptide sequences.
To avoid the shortcomings of BLAST programs in searching long sequences for short similarity (see the Introduction), we split the protein sequences from D. melanogaster into short subsequences and then applied BLAST to compare the subsequences within each protein sequence to fish for those precursor sequences that have at least two similar subsequences. Additionally the subsequences were also compared with subsequences derived from known peptide precursors in the Swiss-Prot database. These subsequences were obtained by splitting the known peptide precursors from other metazoan organisms obtained from the Swiss-Prot database. Because each Drosophila protein sequence was split into a number of subsequences and because all of these subsequences were subsequently compared with all known peptide precursor subsequences, a very large number of alignments with a high score were obtained. Because similarity does not imply homology, only the alignments with sequence motifs from actual bioactive peptides were considered significant, and the obtained subsequences were considered as possible peptides. A FASTA protein database containing all identified putative peptide precursors was constructed. This database was loaded on an in-house Mascot server and used for the identification of peptides in a peptidomic analysis of Drosophila hemolymph. Our results showed that this technique is very efficient to find novel peptide genes.

The Program
The aim of the program is to mine for putative peptide precursors according to the rules and the techniques described above. The program is implemented in SAS, a powerful integrated software to access, manage, analyze, and present data. External tools such as SignalP and BLAST need to be run independently. They communicate with the program by text files. The program includes a few subprograms listed below.
Protein.SAS-The first part of the program, named Protein.SAS, serves to pick up all proteins from a specific species, in this case D. melanogaster. The input of the subprogram consists of the Swiss-Prot protein database files and additional Drosophila genes at GenBank identified by Hild et al. (20). The relevant information for each of the Drosophila proteins, such as accession number, protein name, gene name, protein sequence, signal peptide information, length, and mass, is written into an SAS dataset. The first 70 amino acids of every protein sequence serve as output to a text file in FASTA format, which is used as the input of SignalP. SignalP (www.cbs.dtu.dk/services/SignalP) for eukaryotes is then run to predict the presence and location of a signal peptide in each protein sequence. Next the subprogram reads the output file by SignalP, and another SAS dataset is created that includes the predicted signal information of every Drosophila protein. The dataset is compared with the dataset of all Drosophila proteins, and the proteins are retained if they are either annotated to have signal peptides in the Swiss-Prot protein database files or predicted to have signal peptides by SignalP. The comparison result is a dataset of Drosophila proteins having amino-terminal signal peptides. From this dataset, only the proteins that are less than 500 amino acids in length are retained because previous analysis has shown that all known secretory peptide precursors are shorter than 500 amino acids (11). In total, 5096 proteins made up the final Drosophila protein dataset, which was analyzed further. The logic of Protein.SAS is illustrated in Fig. 1.
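The filtering logic of Protein.SAS can be sketched in a few lines of Python. This is only an illustration of the rules stated above, not the SAS implementation itself; the `Protein` record and its two boolean fields are hypothetical stand-ins for the Swiss-Prot annotation and the parsed SignalP output.

```python
from dataclasses import dataclass

MAX_PRECURSOR_LENGTH = 500  # proteins must be shorter than this (from the text)
SIGNALP_WINDOW = 70         # number of residues passed to SignalP (from the text)


@dataclass
class Protein:
    accession: str
    sequence: str
    annotated_signal: bool = False  # hypothetical: signal peptide annotated in Swiss-Prot
    predicted_signal: bool = False  # hypothetical: signal peptide predicted by SignalP


def signalp_input(proteins):
    """Return FASTA text holding the first 70 residues of each protein."""
    return "".join(
        f">{p.accession}\n{p.sequence[:SIGNALP_WINDOW]}\n" for p in proteins
    )


def candidate_precursors(proteins):
    """Keep proteins with an annotated or predicted signal peptide and < 500 residues."""
    return [
        p
        for p in proteins
        if (p.annotated_signal or p.predicted_signal)
        and len(p.sequence) < MAX_PRECURSOR_LENGTH
    ]
```

A protein passes if either source reports a signal peptide, which mirrors the "annotated or predicted" rule in the text.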
Peptide.SAS-The subprogram filters out from Swiss-Prot protein databases all the peptides or their precursors that are known in Metazoa today. The peptide precursors are identified by the keywords in each protein data file. If a protein is picked up, its relevant information, like the information collected in Drosophila proteins, is written into an SAS dataset. Fig. 2 describes the process.
Cleavage.SAS-The objective of the subprogram is to split protein sequences into subsequences after removal of the signal peptide sequence. The protein sequences are split in silico at cleavage sites typical for peptide precursors. Conventional amino acid motifs that are required for cleavage of neuropeptides from their protein precursors in insects have been described as GKR, GRR, GR, GRK, GKK, KR, RR, GK, RK, KK, and Arg (21). From our statistical analyses on all known peptide precursors in all organisms (data not shown), it is clear that the processing of peptide precursors does not occur at every conventional cleavage site in the precursor. Cleavage also depends on the amino acids that are in the proximity of the cleavage site. For example, proteolytic processing at GKK followed by Arg always occurs. However, if GKK is followed by Ala, Asn, Ser, or Lys, the processing may or may not occur. For other amino acids at this position, it has not been demonstrated whether processing occurs. As a second example, proteolytic processing at a single Arg residue only occurs when there is a basic amino acid residue in position -4, -6, or -8 with respect to the single Arg. The basic amino acid is usually an Arg, but Lys or His residues work as well (19).
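A simplified sketch of the in silico cleavage step is given below. It cuts at the multi-residue motifs listed above and at a single Arg only when a basic residue occupies position -4, -6, or -8. The other context rules (for example, the residue following GKK) are not modeled, and the function names and the left-to-right, longest-motif-first scanning order are assumptions made for illustration.

```python
# Multi-residue motifs from the text, longest first so that "GKR" wins over "GK".
MOTIFS = ["GKR", "GRR", "GRK", "GKK", "KR", "RR", "RK", "KK", "GR", "GK"]
BASIC = set("RKH")


def cleavage_spans(seq):
    """Return (start, end) index pairs of cleavage motifs, scanning left to right."""
    spans = []
    i = 0
    while i < len(seq):
        motif = next((m for m in MOTIFS if seq.startswith(m, i)), None)
        if motif:
            spans.append((i, i + len(motif)))
            i += len(motif)
        elif seq[i] == "R" and any(
            i - d >= 0 and seq[i - d] in BASIC for d in (4, 6, 8)
        ):
            # Single Arg: cut only with a basic residue at position -4, -6, or -8.
            spans.append((i, i + 1))
            i += 1
        else:
            i += 1
    return spans


def split_precursor(seq, signal_length=0):
    """Remove the signal peptide, cut at cleavage motifs, return non-empty fragments."""
    mature = seq[signal_length:]
    fragments, prev = [], 0
    for start, end in cleavage_spans(mature):
        if start > prev:
            fragments.append(mature[prev:start])
        prev = end
    if prev < len(mature):
        fragments.append(mature[prev:])
    return fragments
```

For instance, `split_precursor("AAAAKRSDFGKRLLLL")` yields the fragments `AAAA`, `SDF`, and `LLLL`; the motif residues themselves are discarded.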
BLAST Analysis-The output of Cleavage.SAS consists of two database files, "Drosophila subsequence" and "peptide." BLAST analysis is then conducted on these two databases. The score matrix "PAM30" is used, and the expectation value (e-value) as well as the parameter "word size" are set to 6 and 2, respectively, to find short but strong similarities. Fig. 3 explains the process.
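For readers who want to reproduce these settings with a current installation, the parameters map onto the `blastp` command of NCBI BLAST+ roughly as sketched below. The original analysis used the stand-alone BLAST of its time, so the use of BLAST+, the tabular output format, and all file names here are assumptions; the function only builds the argument list and does not run BLAST.

```python
def blastp_command(query_fasta, database, output_path):
    """Build a blastp argument list with the scoring settings given in the text."""
    return [
        "blastp",
        "-query", query_fasta,   # hypothetical FASTA file of subsequences
        "-db", database,         # hypothetical database built with makeblastdb
        "-matrix", "PAM30",      # score matrix from the text
        "-evalue", "6",          # expectation value from the text
        "-word_size", "2",       # word size from the text
        "-outfmt", "6",          # tabular output (assumption, eases parsing)
        "-out", output_path,
    ]
```

The list can be passed to `subprocess.run` once the two subsequence databases have been formatted.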
Extract.SAS, Shift.SAS, and Motif.SAS-These programs are used to screen the result output by BLAST and determine the biologically significant matches. The subprogram Extract.SAS extracts the Drosophila proteins that have at least two similar subsequences within the protein. The subprogram Shift.SAS reads the comparison result from the BLAST analysis and computes the shift value. The shift value is the minimal distance between the amino or carboxyl terminus of a subsequence and the matching amino acids in the subsequence. From the statistical analysis of the known peptide precursors, these shift values should be low. This means that the motif should be close to a cleavage site. The shift value is set to be no larger than 3 in the program. The subprogram Motif.SAS reads the comparison results between Drosophila subsequences and peptide subsequences as well as the comparison results among peptide subsequences and identifies the Drosophila subsequences that contain motifs.
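The shift value can be illustrated with a short sketch. It assumes that the aligned region of a subsequence is reported as 1-based inclusive start and end positions, as in standard BLAST output; the function names are hypothetical.

```python
MAX_SHIFT = 3  # upper limit used in the program (from the text)


def shift_value(subseq_length, match_start, match_end):
    """Minimal distance between the matched region and either terminus.

    match_start and match_end are 1-based inclusive positions within the subsequence.
    """
    if not (1 <= match_start <= match_end <= subseq_length):
        raise ValueError("match coordinates must lie within the subsequence")
    return min(match_start - 1, subseq_length - match_end)


def near_cleavage_site(subseq_length, match_start, match_end):
    """True when the match lies within MAX_SHIFT residues of a terminus."""
    return shift_value(subseq_length, match_start, match_end) <= MAX_SHIFT
```

A match covering positions 8 to 12 of a 13-residue subsequence has a shift value of 1 and is retained, whereas a match covering positions 6 to 14 of a 20-residue subsequence has a shift value of 5 and is rejected.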
TMpred and SOSUI-Finally the on-line software tools TMpred at www.ch.embnet.org/software/TMPRED_form.html (22) and SOSUI at sosui.proteome.bio.tuat.ac.jp/sosuimenu.html are used to determine whether a protein has a single transmembrane region at its amino terminus (Fig. 4). The minimum and maximum lengths of the hydrophobic part of the transmembrane region were set at 17 and 33 residues, respectively. For the TMpred program a score above 500 for both inside to outside and outside to inside helices was considered significant for the presence of the amino-terminal transmembrane region. A score of 250 was considered significant for the presence of an inside to outside helix of any second or third transmembrane region. A putative peptide precursor was retained if at least one program predicted a single transmembrane region at the amino terminus. When both programs predicted the absence of an amino-terminal transmembrane region, the protein was deleted from the list.

Mass Spectrometry
D. melanogaster were reared in 250-ml bottles containing 70 ml of water, 17 g of sucrose, 0.45 g of yeast, 0.9 g of agar, 0.5 ml of 8% Nipagin, and 0.36 ml of propionic acid. The bacteria species Micrococcus luteus and Escherichia coli were precultured in TSB medium. They were used as typical organisms for infection with Gram-positive and Gram-negative bacteria, respectively. Pellets taken when the cultures were in the log phase of growth were resuspended in a small amount of culture medium. Septic injury was performed by pricking third instar larvae of D. melanogaster with a fine insect needle dipped in a mixed solution of M. luteus and E. coli. Then larvae were incubated on tissue saturated with a 5% sucrose solution. After 12 h, the hemolymph was collected using microcapillaries under a binocular microscope. In the control experiment, third instar larvae were collected and incubated for 12 h on the 5% sucrose solution before hemolymph was taken.

Sample Preparation
For each experiment, 25 µl of hemolymph were collected in 100 µl of extraction solution of methanol/water/formic acid (90:9:1, v/v/v). The sample was then sonicated for 15 min. All cell debris and large proteins were removed by centrifugation. The supernatant was dried and resuspended in 100 µl of MilliQ water, 0.1% TFA. The lipids in the sample were removed by adding 100 µl of ethyl acetate to the sample and then vortexing it. After centrifugation, the ethyl acetate with the dissolved lipids was separated from the aqueous solution with the peptides. This solution was dried and resuspended in 15 µl of MilliQ water/acetonitrile/formic acid (94.9:5:0.1) and next filtered through a Millipore spin down filter. As control experiments, hemolymph of non-infected larvae was collected and treated the same way as the test samples.
Capillary LC-tandem MS experiments were conducted using an Ultimate HPLC pump, a column-switching device (Switchos), and a Famos autosampler (all LC Packings) coupled to a Q-TOF mass spectrometer (Micromass). Chromatography was performed using a guard column (µ-guard column MGU-30 C18, LC Packings) acting as a reverse phase support to trap the peptides. Ten microliters of the sample were loaded on the precolumn with an isocratic flow of 2% acetonitrile in MilliQ water with 0.1% formic acid at a flow rate of 10 µl/min. After 2 min, the column-switching valve was switched, placing the precolumn on line with the analytical capillary column, a Pepmap C18, 3-µm, 75-µm × 150-mm nanocolumn (LC Packings). Separation was conducted using a linear gradient from 95% solvent A, 5% solvent B to 80% solvent A, 20% solvent B in 90 min followed by a linear gradient from 80% solvent A, 20% solvent B to 50% solvent A, 50% solvent B in 60 min (solvent A: water, formic acid (99.9:0.1, v/v); solvent B: acetonitrile, formic acid (99.9:0.1, v/v)). The flow rate was set at 150 nl/min.
The LC system was connected in series to the electrospray interface of the Q-TOF device. The column eluent was directed through a stainless steel emitter (Proteon). Needle voltage was set at 1650 V, and cone voltage was set at 35 V. Nitrogen was used as nebulizing gas. Parent ions with two, three, or four charges of sufficient ion intensity (threshold was set at 15 counts/s) were automatically recognized by the charge state recognition software (MassLynx 3.5, Micromass) and selected for fragmentation as they eluted from the column. Argon was used as a collision gas; collision energy was set at 25-40 eV depending on the mass and charge state of the selected ion. The detection window in the survey scan was set from 400 to 1400 mass to charge (m/z). Fragmentation spectra were acquired from m/z 50 to 2000.

Construction of Two Databases-First we constructed two databases. The first database was generated as follows. Drosophila proteins that are less than 500 amino acids in length and that start with a signal peptide were assembled from Swiss-Prot and TrEMBL databases as well as from a collection of additional Drosophila genes identified by Hild et al. (20). The program SignalP for eukaryotes was used to predict the occurrence of a signal peptide for a protein sequence (23). As a result, 5096 Drosophila protein sequences were retained. Then all these protein sequences were split into short subsequences at the conventional cleavage sites, taking into account the nature of the amino acids in the proximity of each cleavage site. These subsequences formed the first database, which we named Drosophila subsequence.
The second database is a peptide database that comprises all known peptide precursor subsequences from all metazoan organisms. These annotated peptides or peptide precursors were filtered from the Swiss-Prot (release 42.11) and TrEMBL (release 25.11) databases as follows. A protein was retained when it was annotated as a (neuro)peptide precursor or when its name contained the word "neuropeptide." Proteins whose protein file contains keywords such as peptide, neuropeptide, hormone, or neurotransmitter were also retained. However, if these proteins have a subcellular location as membrane protein (as indicated in the protein file) or if they are characterized by keywords such as receptor, signal-anchor, transmembrane, binding protein, DNA binding, nuclear protein, nuclear transport, or enzyme or by words ending in "ase," they were excluded. In total, 2858 proteins met these criteria. These peptide precursors were subsequently split into short subsequences at the conventional cleavage sites, also taking into account the character of the amino acids in the proximity of each cleavage site. This collection of peptide precursor subsequences constituted the peptide database.
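The keyword rules for assembling the peptide database can be sketched as a simple predicate. The sketch assumes that the protein name and keywords are available as plain strings and that the exclusion rules apply to every candidate; the function name and argument layout are hypothetical.

```python
import re

# "peptide" also matches "neuropeptide" as a substring.
INCLUDE_TERMS = ("peptide", "hormone", "neurotransmitter")
EXCLUDE_TERMS = (
    "receptor", "signal anchor", "transmembrane", "binding protein",
    "dna binding", "nuclear protein", "nuclear transport", "enzyme",
)


def is_peptide_candidate(name, keywords, membrane_location=False):
    """Apply the inclusion and exclusion keyword rules described in the text."""
    # Lower-case and turn hyphens into spaces so "Signal-anchor" matches "signal anchor".
    text = " ".join([name, *keywords]).lower().replace("-", " ")
    if not any(term in text for term in INCLUDE_TERMS):
        return False
    if membrane_location or any(term in text for term in EXCLUDE_TERMS):
        return False
    # Exclude likely enzymes: any word ending in "ase".
    words = re.findall(r"[a-z]+", text)
    return not any(word.endswith("ase") for word in words)
```

The "ase" rule is deliberately crude, as in the text: it removes enzymes such as lipases but would also reject any entry containing an unrelated word with that ending.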
Setup of Data Mining Analysis-Stand-alone BLAST was used to compare the two above mentioned databases. Interpretation of the results generated by BLAST involved evaluation of the matches to determine whether they are significant. Therefore, genuine and biologically meaningful similarities need to be distinguished from the irrelevant and essentially random ones. If the alignment was similar to a motif, it was considered significant, and the subsequence was considered a putative peptide. To find the conserved motifs, all known peptide precursor subsequences were compared by BLAST.
Four types of analysis were performed. 1) The Drosophila subsequence database was compared with itself, and those protein sequences that have at least two similar subsequences within the same protein sequence were retained (first screening method). 2) The peptide precursor subsequences in the peptide database were compared with each other, and the obtained similar amino acid sequence tags were considered as possible motifs. 3) The Drosophila subsequence database was compared with the peptide database, and those Drosophila subsequences that displayed sequence similarities to such a conserved motif were retained (second screening method). 4) The retained proteins from 1 and 3 were then analyzed by a transmembrane region prediction method. Our analysis of the annotated Drosophila neuropeptide precursors indicates that almost all have a single transmembrane region, which is located at the amino terminus and which corresponds to the signal peptide. Therefore, in a final step we fine tuned the generated list of putative peptide precursors based on this hallmark. The list was curated by the deletion of (i) all soluble proteins (lacking membrane-spanning regions), (ii) proteins having more than one transmembrane region, and (iii) proteins having one transmembrane region that is not located in the amino-terminal region (the cutoff for the start of the transmembrane region was set at the 20th residue; transmembrane regions that started at or after this point were not considered to be at the amino-terminal side).
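The three curation rules of the final step reduce to a single condition, sketched below. The input format (a list of 1-based start positions of the predicted transmembrane regions for one protein) is an assumption about how the predictor output has been parsed.

```python
N_TERMINAL_CUTOFF = 20  # regions starting at or after residue 20 are not amino-terminal


def keep_after_tm_filter(tm_starts):
    """Keep a protein only if it has exactly one transmembrane region
    and that region starts before the 20th residue.

    tm_starts: 1-based start positions of the predicted transmembrane regions.
    """
    return len(tm_starts) == 1 and tm_starts[0] < N_TERMINAL_CUTOFF
```

An empty list corresponds to rule (i), more than one entry to rule (ii), and a single late-starting region to rule (iii).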
Screening Method 1-The first screening method was based on the principle that multiple peptides encoded by a single invertebrate peptide precursor gene often are highly related. Therefore, proteins were only selected if they had at least two similar subsequences and if the matching amino acid sequence was at or close to the amino or carboxyl terminus of at least one subsequence.
Therefore, the structural pattern of a putative peptide precursor is: . . . . Using this screening method we found 58 peptide precursors in Drosophila, 10 of which are well known peptide precursor genes that encode at least two related bioactive peptides, including drosulfakinin (dsk), FMRFamide, short neuropeptide F, tachykinin, capa or mt-cap2b, and diuretic hormone.
For example, the protein identified by Swiss-Prot/TrEMBL accession number Q9V808 is a putative peptide precursor (Fig. 5). By comparing the Drosophila subsequence database with itself, we obtained three similar subsequences in Q9V808. The number following the accession number identifies each subsequence by its position within the protein sequence. The results of this screening method are shown in Supplemental Table 1.
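The principle of screening method 1 can be illustrated with a simplified stand-in for the BLAST comparison: instead of scored local alignments, the sketch below looks for the longest identical stretch shared by two subsequences of the same protein, using Python's `difflib`. The minimum stretch length of 5 is a hypothetical threshold chosen for illustration; the shift limit of 3 is the one used by the program. The example sequences are two of the Q9V808 subsequences reported in Fig. 5.

```python
from difflib import SequenceMatcher
from itertools import combinations

MIN_SHARED = 5  # hypothetical minimum length of an identical shared stretch
MAX_SHIFT = 3   # shift limit from the text


def longest_shared_stretch(a, b):
    """Return the longest identical block shared by a and b (fields a, b, size)."""
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    return matcher.find_longest_match(0, len(a), 0, len(b))


def has_related_subsequences(subseqs):
    """True if any two subsequences share a stretch lying near a terminus of at least one."""
    for a, b in combinations(subseqs, 2):
        m = longest_shared_stretch(a, b)
        if m.size < MIN_SHARED:
            continue
        shift_a = min(m.a, len(a) - (m.a + m.size))
        shift_b = min(m.b, len(b) - (m.b + m.size))
        if min(shift_a, shift_b) <= MAX_SHIFT:
            return True
    return False


q9v808 = ["IPYEVKVDVPQPYIVE", "IPYEVKVPVPQPYEVI"]  # Q9V808_6 and Q9V808_9
print(has_related_subsequences(q9v808))
```

Both subsequences begin with the identical stretch IPYEVKV, so the protein would be retained by this simplified criterion.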
Screening Method 2-The fact that only nine of the 44 known neuropeptide precursors as well as one immune induced peptide in Drosophila were listed by the first screening method indicated that the catalogue of putative regulatory peptide precursors obtained by the first screening method is doubtless incomplete. Therefore, we carried out a second screening method that screens for Drosophila proteins having a signal peptide and of which at least one subsequence has at least 3/5 amino acids at or close to the amino or carboxyl terminus identical to a known peptide. In addition, the identical 3/5 amino acids should be similar to a motif. The retained proteins were then further filtered by the transmembrane prediction analysis as in the first method.
By means of the second method we found 70 Drosophila peptide precursor genes in total, 42 of which are known peptide precursors and 28 of which are novel. Each of these putative peptide precursor genes encodes multiple non-related peptides or only a single putative peptide. For example, protein Q8MS86 was identified as a putative peptide precursor (Fig. 6). The similar subsequence is Q8MS86_2 (WKILTAGSHFRWL). The similar known peptides are P11885_2 (YVMSHFRWNKF) from Rana catesbeiana and P06298_8 (NGNYRMHHFRWGSPPKD) from Xenopus laevis.
FIG. 5. Screening method 1. The protein identified by accession number Q9V808 was retained as a putative peptide precursor. By comparing the Drosophila subsequence database with itself, we obtained three similar subsequences in Q9V808 (underlined): Q9V808_6 (IPYEVKVDVPQPYIVE), Q9V808_8 (IPYEVKVPVDKPYEVKVPVPQPYEVI), and Q9V808_9 (IPYEVKVPVPQPYEVI). The number following the accession number identifies each subsequence by its position within the protein sequence.
The total output of this screening method is shown in Supplemental Table 2. The combined computational methods generated in total 75 novel putative peptide precursors in D. melanogaster in addition to the 43 known ones.
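The core criterion of screening method 2 can be sketched as follows. The sketch interprets "3/5 identical amino acids" as at least three identities within an ungapped window of five residues, which is an assumption, and it omits the additional requirement that the match resemble a conserved motif. The example uses the Q8MS86_2 and P11885_2 subsequences given above, with Q8MS86_2 read as the continuous sequence WKILTAGSHFRWL.

```python
WINDOW = 5         # window length (interpretation of "3/5")
MIN_IDENTICAL = 3  # identities required within the window
MAX_SHIFT = 3      # shift limit from the text


def window_hits(candidate, known):
    """List (candidate_pos, known_pos, identities) for ungapped windows with >= 3 identities."""
    hits = []
    for i in range(len(candidate) - WINDOW + 1):
        for j in range(len(known) - WINDOW + 1):
            identities = sum(
                x == y
                for x, y in zip(candidate[i:i + WINDOW], known[j:j + WINDOW])
            )
            if identities >= MIN_IDENTICAL:
                hits.append((i, j, identities))
    return hits


def passes_method2(candidate, known):
    """True if a qualifying window lies within MAX_SHIFT residues of a candidate terminus."""
    for i, _, _ in window_hits(candidate, known):
        shift = min(i, len(candidate) - (i + WINDOW))
        if shift <= MAX_SHIFT:
            return True
    return False


print(passes_method2("WKILTAGSHFRWL", "YVMSHFRWNKF"))
```

Here the window SHFRW is identical in both sequences and ends one residue before the carboxyl terminus of the Drosophila subsequence, so the criterion is met.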
Peptidomic Analysis-The nano-LC-tandem MS method allowed us to select and fragment the peptide ions as they elute from the column even when co-eluting with other peptides. Peptides were identified by subjecting their fragmentation spectra to a Mascot search on an in-house server. This bioinformatic tool (www.matrixscience.com) allows the identification of proteins and peptides by matching MS/MS data against any FASTA format protein or (translated) nucleic acid sequence database. In a typical MS/MS ion search, we combined all MS/MS data of every peptide selected for fragmentation during an LC-MS run in a comprehensive peak list. This type of file contains the centroided mass values and associated intensity values of all the parent ions selected and corresponding fragmentation peaks and can be submitted to Mascot for identification of the peptides. Settings used were as follows: carboxyl-terminal amidation, oxidation of methionine, and pyro-Glu (N-terminal Glu) were selected as variable modifications; enzyme was set to "none"; peptide mass error tolerance was set to 0.6 Da; and MS/MS mass error tolerance was set at 0.3 Da. In total more than 300 ions were automatically selected for fragmentation. Twenty peptides were identified, most of which are known to occur in the hemolymph such as the Attacins, defensins, and Drosophila immune induced peptides (DIMs) (24). The identified peptides and their Mascot scores are summarized in Table I. In an earlier study these peptides could already be identified in the hemolymph after immune challenge (24). In addition, however, we identified two new peptides, LDDSENNDQVVGLLDVADQGANHANDGAREA and a truncated form of this peptide, LLDVADQGANHANDGAREA (Fig. 7). These peptides originate from protein CG7738, which was identified as a putative peptide precursor by screening method 1 (Fig. 8).
For comparison, the same data file was used in a Mascot search on a larger database. The database used was the National Center for Biotechnology Information non-redundant (NCBInr) database, Drosophila was selected for taxonomy, and other settings were kept identical to those used for the restricted database. Results of both searches are summarized in Table I. Peptides identified with the NCBInr database are essentially the same as those identified with our restricted database. No additional peptides could be identified with the full NCBInr database.

DISCUSSION
Because of the availability of its complete genome sequence, Drosophila becomes a model insect for peptide research. We identified in total 118 putative secretory peptide precursor genes in D. melanogaster by applying the database searching programs presented here. 43 of them are annotated peptide precursors. All predicted peptide precursors met following criteria. (i) Each putative peptide precursor is less than 500 amino acids in length and has a signal peptide. (ii) Each precursor contains one or several putative peptides that are flanked by conventional cleavage sites. There are two possibilities: the precursor contains two or more peptides that share sequence similarities, or alternatively the precursor contains a single peptide that shares conserved motifs with known peptide precursor subsequences from other organisms. (iii) All predicted peptide precursors have one aminoterminal transmembrane region. Several of the genes mined by our method encode peptides that display significant sequence similarities to known vertebrate or invertebrate neuropeptides. These similarities could not be discovered by BLAST scanning of the whole Drosophila genome. For instance, a putative peptide encoded by CG3868, mined by the first method, displays sequence similarities with an antifreeze glycopeptide precursor identified in Antarctic fish (25). The salivary gland glue protein (CG18087) contains a putative peptide sequence that displays significant similarities to vertebrate neurophysins. Neurophysins are a group of small, soluble proteins secreted by the hypothalamus. They serve as binding proteins for oxytocin and vasopressin during their transport to the posterior pituitary. They are secreted with the hormones but have no known functions other than serving as a carrier. In vertebrates neurophysins originate from the vasopressin precursor. The salivary gland glue protein (CG18087) does not contain a vasopressin/oxytocin-like peptide.
Putative peptides from two genes, CG9358 and BK003312, display sequence similarities to conserved parts of the prolactin precursor. Prolactin and growth hormone are two distinct neuropeptide hormones that have been found in all vertebrate groups but not in cyclostomes (26), although prolactinergic neurons that were detected immunochemically occur in a protochordate (27). The growth hormone/prolactin superfamily is likely to have a prevertebrate origin, but a putative invertebrate member has not been found yet in contrast to other neuropeptide superfamilies that are highly conserved in vertebrates and invertebrates. Examples are tachykinins, gastrin, insulin, neuropeptide Y, corticotropin-releasing factor, and calcitonin gene-related peptide (28). Drosophila BK002187 encodes a peptide with sequence similarities to atrial natriuretic peptide. Natriuretic peptides are vertebrate hormones that play a pivotal role in cardiovascular and body fluid homeostasis in vertebrates (29). Although a novel natriuretic peptide has been found recently in the heart and brain of the hagfish, the most primitive vertebrate (30), no member of this family has been described yet in invertebrates. Finally a peptide encoded by the Drosophila LP04693 displays sequence similarities to ␥-MSH, a pituitary hormone derived from the pro-opiomelanocortin precursor, the function of which has remained elusive (31) (Table II).
When we consider similarities to invertebrate neuropeptides, the CG1565 protein contains a putative peptide that has an amino-terminal hexamer contained within orcokinin, a myotropic neuropeptide discovered in crustaceans (32,33) and very recently for the first time in insects (34,35) (Table III). A putative peptide sequence within the Trunk protein precursor displays striking similarities with prothoracicotropic hormone, a neuropeptide that has so far only been identified in lepidopteran species in which it stimulates ecdysone biosynthesis in the prothoracic glands (36). Interestingly the sexspecific gene male specific opa containing gene (MSOPA), as identified by Jin et al. (37), encodes a putative peptide that shows sequence similarities to a male accessory gland-specific 57-kDa peptide precursor. Next the putative Drosophila peptide encoded by CG8087 displays more than 60% sequence identities with a neuropeptide derived from a neurospecific peptide precursor in the terrestrial snail Helix lucorum (38). Finally some mined genes (CG16882, CG11131, CG7465, CG1221, Argos, and Trunk) have been predicted or shown to encode for ligands of membrane receptors, such as epidermal growth factor, Toll, or Torso receptors (39), a function in line with the peptidergic nature of their products.
Since the publication of the Drosophila genome sequence, several microarray studies have been performed, and we observed that some of the mined peptide precursor genes are up-regulated by ecdysone (CG7608, CG1807, and CG7350) (40,41). Ecdysone is an ecdysteroid involved in insect metamorphosis and reproduction. It is the precursor of 20-OH-ecdysone, the functional counterpart of vertebrate estrogen (42). In this way, our data are in accordance with the reported interactions of peptide and steroid hormone signaling cascades in vertebrates (43,44). Other mined genes are up-regulated after infection (45) and encode peptides that are secreted into the hemolymph, such as attacin, diptericin, drosocin, and various DIMs (24). Several additional putative peptide precursor genes identified with the currently established program display sequence similarities to known DIMs but have not yet been annotated as such. Three of them (CG32851, CG5791, and CG15065) form part of the Toll pathway (46), and one (CG18107) is rhythmically expressed in the head (47) (Table IV).
Our program also picked up the drosocrystallin gene (51) as well as other annotated cuticular proteins. In Tenebrio molitor, biologically active peptides display strong sequence similarities to parts of cuticle proteins, and therefore they might be processed from them (52). Given the fact that proteolytic processing does not always occur at every conventional cleavage site (53), our established catalogue of predicted peptide precursors is doubtless incomplete, and it will be a difficult challenge to consider the existence of these unconventional cleavage sites in the further refinement of our method.
Only two of the characterized peptide precursors were not mined by our method: the diuretic hormone precursor (CG8348), because it has four transmembrane regions, and the proctolin precursor "Q8MMJ7", because its sequence is too short (5 amino acids) to be retained by the program. Only a few cases could be false positives: CG5559 has been annotated to encode a conserved protein involved in synaptic vesicle fusion, CG6409 has been predicted to be a component of the endoplasmic reticulum, CG11577 has been predicted to permanently reside in the lumen based on its carboxyl-terminal sequence (54), and CG6357 encodes a putative cysteine protease.
The database of predicted and known peptide precursors in Drosophila as established in this study will serve several applications in experimental research. Mass spectrometric data will become much easier to read and interpret if the database against which they are scanned is much narrower than the Swiss-Prot database. In insects, as in all larger organisms, the hemolymph is an important part of the defense mechanism against infections (24,55). Many of the peptides identified so far in the hemolymph are involved in the response of the fly to an infection (24), and most of these peptides are released into the hemolymph after infection. In addition, all peptides (and proteins) present in the hemolymph are secreted by other tissues or by blood cells. Therefore, we chose to perform a peptidomic analysis on the hemolymph of fruit fly larvae after infection with bacterial material.
The peptidomic analysis of the Drosophila hemolymph resulted in the identification of 22 peptides; all of these peptides originate from peptide precursor genes known to be involved in the immune response of the fruit fly. In addition, we were able to identify two novel peptides originating from the CG7738 gene. The first peptide, LDDSENNDQVVGLLDVADQGANHANDGAREA, is 31 amino acids in length and is flanked at the amino-terminal side by the cleavage site of the signal peptide (Fig. 8). At the carboxyl terminus the peptide is flanked by an arginine residue that could act as a monobasic cleavage site.
The second peptide, LLDVADQGANHANDGAREA, is an amino-terminally truncated form of the first one.
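The relationship between the two CG7738 peptides can be checked directly from the sequences reported above. The following is a minimal Python sketch; the variable names are illustrative and are not part of the original analysis.

```python
# Sequences of the two CG7738-derived peptides as reported in the text.
long_peptide = "LDDSENNDQVVGLLDVADQGANHANDGAREA"
short_peptide = "LLDVADQGANHANDGAREA"

# The shorter peptide should share its carboxyl terminus with the longer one.
is_c_terminal_fragment = long_peptide.endswith(short_peptide)

# Number of residues missing from the amino terminus of the shorter peptide.
n_terminal_truncation = len(long_peptide) - len(short_peptide)

print(len(long_peptide), len(short_peptide),
      is_c_terminal_fragment, n_terminal_truncation)
```

The check confirms that the shorter peptide shares the carboxyl terminus of the 31-residue peptide and lacks its first 12 residues.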
For comparison, the same data file was used in a Mascot search both on the restricted peptide precursor database and on the NCBInr database (taxonomy, Drosophila). The peptides identified with the NCBInr database are essentially the same as those identified with our restricted database. No additional peptides could be identified with the full NCBInr database.
However, the threshold score for a significant identification increases with the size of the database used (larger number of possible peptides). The minimum score for a significance threshold of 0.05 was 27 for the restricted peptide precursor database and 56 for the full NCBInr (Drosophila) database. Therefore, a large number of peptides (13 of 22) that had a significant score with the restricted database fell below the significance threshold with the full NCBInr database. This includes the larger form of the identified novel peptide; the truncated peptide fell within the set threshold limits of both searches.
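The size of this effect can be estimated under one assumption that is not stated in the text: that the identity threshold follows the commonly described relation T = -10·log10(p/N), where p is the significance level and N is the number of candidate peptides within the precursor mass tolerance. Under that assumption, the reported thresholds of 27 and 56 imply the candidate counts computed below. This is a back-of-envelope sketch with an illustrative function name, not a reproduction of the search engine's scoring.

```python
def implied_candidates(threshold: float, p: float = 0.05) -> float:
    """Invert the ASSUMED relation T = -10*log10(p / N) to recover N."""
    return p * 10 ** (threshold / 10)

# Thresholds reported in the text for a significance level of 0.05.
restricted = implied_candidates(27)  # restricted peptide precursor database
full = implied_candidates(56)        # full NCBInr (Drosophila) database

# Ratio of candidate pool sizes implied by the 29-point threshold difference.
fold = full / restricted

print(round(restricted), round(full), round(fold))
```

If the assumed relation holds, the 29-point gap between the two thresholds corresponds to a candidate pool that is several hundred-fold larger in the full database, which is why a peptide scoring between 27 and 56 is significant only in the restricted search.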
Similar to most proteomic identification tools, Mascot is designed to identify a protein from the MS/MS spectra of several individual peptides cleaved from the same protein.
The protein score in a peptide summary is derived from the ion scores of the individual peptides. For a search that contains a small number of queries, the protein score is the sum of the unique ion scores. Many (neuro)peptide precursors give rise to only one or a very limited number of bioactive peptides, and because a peptidomic experiment focuses on the peptides themselves rather than on the complete precursor, only the individual scores of the peptides can be taken into account for an identification. In addition, because the processing of the peptide from the precursor is unknown, no cleavage enzyme can be selected for identification. All these features of naturally occurring peptides are detrimental to easy identification in an automated fashion. Our study demonstrates the improved success rate of identification of a secretory peptide if one is able to use a restricted database.
The lack of identification of more predicted peptides may be due to several reasons. First, their concentration in the hemolymph may be below the sensitivity of the instrumental setup we used. Second, not all peptides are extracted or ionized with the same efficiency. Consequently, instrument sensitivity varies between peptides depending on their amino acid composition. Third, in contrast to peptides obtained after tryptic digestion of a protein, naturally occurring peptides do not necessarily have a basic amino acid at their carboxyl terminus and will not necessarily yield complete series of y or b ions upon fragmentation. As a consequence, many natural peptides yield fragmentation spectra that are hard to interpret or are of insufficient quality for Mascot to allow identification with a sufficiently high score. Fourth, these peptides were predicted from the genomic database, which is no guarantee that they are present in the hemolymph. Fifth, if they are released into the hemolymph, their secretion and/or synthesis might be dependent on the physiological condition of the animal.
Further analysis of other tissues in different physiological conditions will, without doubt, result in the identification of more of the predicted peptides.
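The cost of searching without a cleavage enzyme, as discussed above, can be illustrated by counting candidate peptides. The sketch below uses the 31-residue CG7738 peptide from the text as input and compares every contiguous subsequence (no enzyme) with the fragments expected from a simplified trypsin rule (cleavage after K or R, not before P, no missed cleavages). The function names and the simplified rule are assumptions made for illustration only.

```python
import re

def no_enzyme_fragment_count(seq: str, min_len: int = 1) -> int:
    """Count all contiguous subsequences of at least min_len residues."""
    n = len(seq)
    return sum(n - k + 1 for k in range(min_len, n + 1))

def tryptic_fragments(seq: str) -> list:
    """Split after K or R unless followed by P (simplified, no missed cleavages)."""
    return [frag for frag in re.split(r"(?<=[KR])(?!P)", seq) if frag]

sequence = "LDDSENNDQVVGLLDVADQGANHANDGAREA"  # peptide reported in the text

n_unrestricted = no_enzyme_fragment_count(sequence)
tryptic = tryptic_fragments(sequence)

print(n_unrestricted, len(tryptic), tryptic)
```

Even for this short sequence, the unrestricted search has to consider hundreds of candidates where an enzyme-constrained search would consider two, which compounds the database-size effect on the significance threshold.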
Also in mammalian models, genome-wide analysis of peptides by mass spectrometry has increased recently (56). Construction of a peptide database, like the one presented here for Drosophila, will be of high value to support these studies. As the structural hallmarks of peptide precursor sequences are highly conserved across phyla, we foresee that the established search program can be adapted for the genome-wide analysis for peptide precursor genes in other animal model systems that have a sequenced genome (24,32,34,35,57-66).
* This work was supported by Fonds voor Wetenschappelijk Onderzoek Vlaanderen (FWO) Grant G0146.03. The costs of publication of this article were defrayed in part by the payment of page charges. This article must therefore be hereby marked "advertisement" in accordance with 18 U.S.C. Section 1734 solely to indicate this fact.
□ S The on-line version of this article (available at http://www.mcponline.org) contains supplemental material.
§ Both authors contributed equally to this work.
‖ Postdoctoral fellow of the FWO. To whom correspondence should be addressed. E-mail: geert.baggerman@bio.kuleuven.ac.be.
** Postdoctoral fellow of the FWO.