Proteomic analysis of Artemisia annua – towards elucidating the biosynthetic pathways of the antimalarial pro-drug artemisinin

MS-based proteomics was applied to the analysis of the medicinal plant Artemisia annua, exploiting a recently published contig sequence database (Graham et al. (2010) Science 327, 328–331) and other genomic and proteomic sequence databases for comparison. A. annua is the predominant natural source of artemisinin, the precursor for artemisinin-based combination therapies (ACTs), which are the WHO-recommended treatment for P. falciparum malaria. The comparison of various databases containing A. annua sequences (NCBInr/viridiplantae, UniProt/viridiplantae, UniProt/A. annua, an A. annua trichome Trinity contig database, the above contig database and another A. annua EST database) revealed significant differences in respect of their suitability for proteomic analysis, showing that an organism-specific database that has undergone extensive curation, leading to longer contig sequences, can greatly increase the number of true positive protein identifications, while reducing the number of false positives. Compared to previously published data an order-of-magnitude more proteins have been identified from trichome-enriched A. annua samples, including proteins which are known to be involved in the biosynthesis of artemisinin, as well as other highly abundant proteins, which suggest additional enzymatic processes occurring within the trichomes that are important for the biosynthesis of artemisinin. The newly gained information allows for the possibility of an enzymatic pathway, utilizing peroxidases, for the less well understood final stages of artemisinin’s biosynthesis, as an alternative to the known non-enzymatic in vitro conversion of dihydroartemisinic acid to artemisinin. Data are available via ProteomeXchange with identifier PXD000703.


Background
There is growing interest in applying proteomics to organisms other than just those which are biomedically relevant and important species such as human, mouse or rat. However, one of the main hurdles for successful application of proteomics to an organism of interest is still the availability of a well annotated and curated (genomic) database that can be used to search the (mainly MS-based) proteomic data for protein identification in that organism. Thus, the field of proteogenomics is becoming increasingly important because of its ability to support the annotation of genomic sequence data by exploiting the information that is obtained through proteomics for the identification and characterization of the actual products of gene expression [1,2].
Plant genomes can be highly complex and, in general, have been less well characterized than those from the animal kingdom, let alone those in the mammalian class, as mentioned above. For many plants, even those of high economic importance, the variability in the quality of available sequence databases can have a great effect on the power and depth of MS-based proteomic analysis. Consequently, it is desirable to understand and overcome the above limitations, leading to a more informative data set which can be constructed from the vast amount of data that is commonly obtained from large-scale MS-based proteomic analyses. Here, we have studied the organism Artemisia annua, which is a Chinese medicinal plant endemic to the northern parts of China. A. annua is crucial to world health programs as it is currently the sole source for biosynthetically produced artemisinin, the antimalarial pro-drug that has now been the last line of defence against malaria for several decades.
The sesquiterpene lactone, artemisinin, is the precursor for artemisinin-based combination therapies (ACTs), which are the WHO-recommended treatment for P. falciparum malaria [3]. Due to its unique mode of action, artemisinin has been found to be effective against the asexual (blood) stage of the malarial parasite's life cycle [4], which has acquired resistance to the older generation of antimalarial drugs. Between 2005 and 2013, the number of ACT treatments procured by the public and private sectors in endemic countries rose 36-fold, reaching a total of 392 million in 2013 [3].
Thus, the reliable supply of artemisinin as the precursor compound for the active ingredient of ACTs is of crucial importance in the fight against malaria. However, the current production of artemisinin is compromised by the fact that it is reliant on cultivation of A. annua. Thus, in areas where A. annua is competing against food for use of the land, the rise in food prices will cause farmers/extractors to have less of an incentive to grow A. annua. Furthermore, farmers/extractors need to decide whether or not to plant A. annua some 14 months before the drug can be produced. Finally, floods such as those frequently experienced in China and Vietnam make the supply of ACTs unpredictable [5]. Taking these and other factors into account it is highly desirable to develop a method of production of artemisinin that is not dependent on A. annua cultivation, can be scaled up at short notice and is inexpensive.
In order to achieve these goals, two different approaches (as well as a hybrid of both) are now described. One of these approaches is based on chemical synthesis of artemisinin from commercially available starting materials such as the monocyclic monoterpene, isopulegol. However, artemisinin production solely by such chemical means is complex and expensive, and therefore, has so far not supplanted the agricultural production of artemisinin as the favoured production method. An alternative to chemical synthesis is the employment of genetically engineered fast-growing organisms that produce artemisinin in high quantities. In this approach, the biosynthetic pathway to artemisinin needs to be expressed in an organism such as yeast by genetic modification of the host with the relevant genes from this pathway. There are several aspects of this bio-engineering approach that are important for its success, including the viability and fast growth of the engineered organism as well as the compartmentalization/secretion of the biosynthetic product in such a way that harvesting becomes an easy enterprise, to name but a few. For all this, a comprehensive knowledge of the biosynthetic pathway to artemisinin in A. annua is paramount, but unfortunately this goal has not yet been achieved (in particular, the final steps of the biosynthesis are still not completely defined). However, very significant advances have been achieved in spite of this, as demonstrated by two recent publications describing the bio-engineering of artemisinin-producing tobacco [6] and that of artemisinic acid-producing yeast [7]. While the former concludes that current yields in tobacco are still significantly below the percentage dry weight levels obtainable from A. annua (by a factor of more than 1000), the latter reports an incomplete biosynthesis to artemisinic acid, which then needs to be chemically converted to artemisinin. 
Although great progress has been made recently using the latter approach [8], it remains the case that expensive and complex chemical synthetic steps are still needed in the final stages of producing artemisinin.
Crop-breeding programs, on the other hand, have produced new varieties of A. annua, such as Artemis, with a consistently high yield of artemisinin (1 %); and ongoing projects are aiming to push this yield even higher. However, it is quite clear that both the crop-breeding and fermentation/chemical synthesis strategies would benefit from full knowledge of the biosynthetic route to artemisinin.
The biosynthesis of artemisinin is thought to be localized within the glandular trichomes of the plant which are found on the leaf surface [9]. Trichomes are leaf hairs which originate from the protrusion of specialized epidermal cells on various parts of the plant, including leaves and stems. Typically, trichomes are divided into 2 categories: non-glandular and glandular. Non-glandular trichomes are involved in processes such as water absorption and seed dispersal, whereas glandular trichomes are involved in processes connected with secondary metabolites including biosynthesis, storage and secretion [10,11]. There is currently much interest in taking advantage of the glandular trichomes' biosynthetic function in order to produce compounds, which have pharmaceutical use, such as artemisinin.
The glandular secretory trichome of A. annua is a 10-celled biseriate structure which consists of two basal cells, two stalk cells and three pairs of secretory cells. It is within this multi-cellular glandular trichome that biosynthesis of artemisinin occurs [12]. By comparing the amount of artemisinin extracted from glanded and glandless biotypes of A. annua, Duke et al. revealed that all extractable artemisinin was localized in the subcuticular space of the capitate glands [9]. After extraction, no artemisinin was found in the glandless biotype, thereby implying that the biosynthesis of artemisinin is localized in the glandular secretory trichomes [9].
The current study focuses on the analysis of the trichome proteome of A. annua, exploiting the recently published EST and unigene sequence databases published by Graham et al. [13]. This dataset is evaluated in comparison to four other databases (three protein sequence databases and one trichome-specific Trinity contig database [14]) with regard to their suitability for proteomic analysis. In addition, a comparison is made with the only other MS-based proteomic study of A. annua trichomes, which used a different EST data set assembly [10].
More importantly, the current study provides a significantly extended proteomic data set for A. annua, potentially leading to a better understanding of the trichome machinery and its role in the production of artemisinin.

Isolation of glandular trichomes
The protocol that was applied for the isolation of glandular trichomes is based on the process of glass bead abrasion which has been described previously [10,15,16]. In these earlier studies, it was reported that the majority of the cell material enriched by this method represented glandular trichomes albeit with a significant remainder of non-glandular trichomes [15,16]. Our simplified method, omitting the sucrose gradient fractionation, confirmed these findings as shown by environmental scanning electron microscopy (ESEM) analysis (see figure in Additional file 1), indicating that predominantly glandular trichome material had been obtained from the treated leaves. However, ESEM also revealed a significant number of glandular trichomes left on the leaf material after glass bead abrasion (Additional file 1). It is assumed therefore that there are only small differences between the 'trichome-depleted' and 'whole leaf' samples. Consequently, the data presented and discussed here are mainly based on the analysis of the 'trichome-enriched' material and its comparison solely with the 'trichome-depleted' sample material, as the technical replicate analysis of these samples showed a similarly low relative standard deviation of around 2.5 % for the number of identified proteins. By contrast, this number was around 8 % for the 'whole leaf' replicate samples (see table in Additional file 2).

LC-MS/MS analysis
The triplicate LC-MS/MS runs showed that for all triplicates, the majority of protein identifications (using the York Artemis contig database) were also obtained in the other two respective replicates, indicating an acceptable level of technical reproducibility. For the trichome-enriched sample analysis of the merged triplicate LC-MS/MS data a total of 671 contig 'protein families' entries (see Additional file 3) were significantly matched while for the trichome-depleted and whole leaf sample analysis this number was slightly higher at 774 and 749, respectively.
The greatest number of protein identifications was obtained by searching the York Artemis contig database and the A. annua trichome Trinity contig database. However, as there is no functional annotation provided in these databases, the data was also searched against the UniProtKB database, restricted to viridiplantae. These searches led to a total of 319 protein (family) identifications for the trichome-enriched samples (see Additional file 4) while for the trichome-depleted and whole leaf sample analysis this number was again slightly higher at 417 and 408, respectively.
Decoy database searches using the default option in Mascot (reversed protein sequences) showed that the false discovery rate (FDR) for peptide matches above identity threshold was between 1.5 % and 1.9 % for all Artemis contig decoy database searches while the FDR for the UniProtKB (taxonomy: viridiplantae) decoy database search of the trichome-enriched sample data was ~9 % (see Table 1). Interestingly, the FDR for the corresponding Trinity contig database search was 3.2 % (see Table 1).
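The FDR values quoted above follow from a simple ratio of decoy to target matches. The following minimal sketch assumes the common definition for a separate (non-concatenated) decoy search, i.e. the number of decoy peptide matches above the identity threshold divided by the number of target matches above the same threshold. The function name and the example counts are hypothetical and are not taken from Table 1.

```python
def decoy_fdr(decoy_matches: int, target_matches: int) -> float:
    """Estimate the FDR (in percent) from separate target and decoy searches.

    Assumes FDR = decoy matches / target matches, both counted above the
    same identity threshold.
    """
    if target_matches <= 0:
        raise ValueError("target_matches must be positive")
    return 100.0 * decoy_matches / target_matches


# Hypothetical counts for illustration only (not values from Table 1).
print(f"{decoy_fdr(30, 2000):.1f} %")
```

A database that yields many target matches but also many decoy matches is penalized by this ratio, which is why the FDR is reported alongside the number of protein family hits.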

The trichome-enriched proteome of Artemisia annua
In general, the vast majority of proteins identified previously from A. annua by Wu et al. were also found in our datasets [10]. Notably, in the trichome-enriched sample data, we found amongst others a similarly large number of ATPases/ATP synthases and oxidoreductases (e.g. four ferredoxins), as well as many proteins involved in translation and transcription and also in proteolysis and the proteasome. Furthermore, several kinases and phosphatases were also identified. Figure 1 displays a rough molecular functional classification (GO terms) of the UniProtKB (taxonomy: viridiplantae)-identified proteins from the trichome-enriched sample material after submission to Percolator and setting the 'expect cut-off' threshold to 0.05. Importantly, several known and putative enzymes on the biosynthetic pathway to artemisinin have been found when searching the UniProtKB database entries restricted to the taxonomy A. annua. These include: peroxidase 1 (UniProt # Q84UA9), artemisinic aldehyde delta-11 (13) reductase (DBR2; UniProt # C5H429), amorpha-4,11-diene synthase (ADS; UniProt # Q9AR04), 2-alkenal reductase (UniProt # C0LNV1), HMG-CoA reductase (HMGR; UniProt # Q9SWQ3), and putative heme-binding cytochrome P450 (UniProt # Q2EPZ0), as listed in Table 2. Interestingly, most of these enzymes were not identified by searching the entire clade viridiplantae of the UniProtKB database. However, most of them and several other proteins related to the biosynthetic pathway to artemisinin were found searching the data against the contig database and by using the BLASTx search utility for functional annotation.
Although this study was first and foremost designed to investigate the usefulness of the York Artemisia contig database and other databases for proteomic analyses, a preliminary comparison between the protein abundances of the trichome-enriched and trichome-depleted sample material was also thought to be useful for both (i) restricting the number of BLAST searches; and (ii) providing some means of focusing on trichome- and thus potentially pathway-specific enzymes for further studies. In order to compare the protein abundances between [...] Tables 3 and 4 present the 20 proteins that gave the highest and lowest values, respectively, while Tables 5 and 6 present the 10 proteins with the highest emPAI values from the trichome-enriched sample material that were not detected in the trichome-depleted sample material and vice versa. Figure 2 displays the functional classification of the proteins in Tables 3, 4 [...]

Discussion
The number of protein families identified in this study from trichome-enriched samples represents an increase by a factor of 7-8 when compared to the protein identification data achieved by the only other previously published larger proteomics study with A. annua by Wu et al. [10]. The increase is even higher (12-14-fold) if these results are compared to the EST search results of "non-redundant" protein hits in the earlier study [10]. This order-of-magnitude increase in proteome coverage is arguably due to the different technical approaches adopted (nanoUHPLC-ESI MS/MS vs. gel-based MALDI MS/MS) and also to the databases which have been employed. For instance, the study by Wu et al. was restricted to the separation of proteins by 2DE using a pH gradient of 4-7 [10]. For the databases used in the present study, the translated York contig database searched by Mascot comprises 85,508,608 residues, which equates to an average of 123 residues per translated contig sequence, while Wu et al. used an in-house EST database, resulting in 49,389,486 residues and 2,060,880 sequences, i.e. an average of ~24 (more than 5-fold less) residues per translated EST sequence. The greater fragmentation (and larger number of sequences) of the latter database negatively affects protein identification, which is partially reflected in the different individual ion scores that were necessary for identity or extensive homology (p < 0.05) in the two studies. For the present study, these had to be only >30, while in the study by Wu et al. this threshold was reported to be >41 [10].
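The average sequence lengths quoted above can be checked with simple arithmetic. The sketch below uses the residue and sequence counts given in the text, together with the 116,303 contigs reported for the York database. It assumes that each contig is translated in all six reading frames by the search engine; this is not stated explicitly in the text, but it reproduces the quoted average of 123 residues.

```python
def mean_residues(total_residues: int, n_sequences: int) -> float:
    """Average number of residues per translated sequence."""
    return total_residues / n_sequences


# Figures quoted in the text.
YORK_RESIDUES = 85_508_608
YORK_CONTIGS = 116_303      # RNA sequences in the York Artemis contig database
FRAMES = 6                  # assumption: six-frame translation of each contig
WU_RESIDUES = 49_389_486
WU_SEQUENCES = 2_060_880

york_mean = mean_residues(YORK_RESIDUES, YORK_CONTIGS * FRAMES)
wu_mean = mean_residues(WU_RESIDUES, WU_SEQUENCES)

# Rounded averages and their ratio.
print(round(york_mean), round(wu_mean), round(york_mean / wu_mean, 1))
```

Under this assumption the two averages round to 123 and 24 residues, respectively, with a ratio of slightly more than 5.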
The results obtained from the MS/MS data searches using different databases (cf. Table 1) demonstrate the importance of the availability and quality of organism-specific (genomic) sequence data for proteomic analysis. When the NCBInr database was searched with the taxonomy restriction of viridiplantae, i.e. similarly to the work of Wu et al. [10], 419 protein family matches were obtained with an FDR of 4.7 %. Searching a custom database restricted to the UniProtKB entries for A. annua (created on 19. January 2012; 118 sequences, 41,707 residues) resulted in 17 protein family hits for the trichome-enriched sample data with 11 peptide matches above identity threshold for the decoy database search, i.e. an FDR of 6.2 %. The highest FDR (~9 %) was obtained from the UniProtKB (taxonomy: viridiplantae) decoy database search.
Interestingly, the Trinity trichome contig database search of the MS/MS data of the trichome-enriched sample material (see Additional file 5) yielded a slightly higher number of protein family hits compared to the York Artemis contig database search (Table 1) but had far fewer peptides matched above the identity threshold and a substantially higher FDR (3.2 %) as well as 27 more peptides matched above the identity threshold in the decoy database search, potentially negating the slightly higher number of protein family hits. Thus, the York Artemis contig database appears to be the best sequence database for proteomic analyses.
Overall, the above analysis shows that using large common protein sequence databases, even if well curated and/or for proteomic analysis restricted to a specific organism or clade, can easily result in high false discovery rates for organisms that have been less well sequenced and characterized. It also shows that high quality (genomic) sequence information for these organisms provides a significant advantage if one wants to achieve greater proteome coverage and lower numbers of spurious protein identifications.
The protein abundance analysis between the trichome-enriched and -depleted samples using their emPAI values shows that peroxidases have far greater abundance within the trichome-enriched sample material. These data could be relevant for the elucidation of the final (oxidative) step in the biosynthesis of artemisinin (see Phase 3 in Fig. 3), which is thought to proceed most likely via the precursor of dihydroartemisinic acid and its derived tertiary allylic hydroperoxide. It has been found that all the reactions depicted in the final phase of the biosynthesis of artemisinin in Fig. 3 can proceed non-enzymatically in vitro, and it has been suggested that this might also be the case in vivo. However, the over-expression of peroxidases arguably indicates the involvement of enzymes in this final step in the biosynthesis of artemisinin. In addition, cyclophilins which usually catalyse the isomerisation of peptidic bonds from the trans to the cis form at proline residues were found to be in greater abundance in the trichome-enriched sample material.
In general, there seemed to be a relatively higher level of ribosomal proteins, ATP/glutamate synthases, and proteins with transporter and electron carrier activity in the trichome-depleted sample material (see Fig. 2). This is probably partly due to the dominance of the above-described proteins in the trichome-enriched sample material, which catalyse trichome-specific biosynthetic and metabolic processes. As can be seen in Fig. 2 for the top 10/20 most abundant proteins, there is a far greater number in the categories 'oxidoreductase/antioxidant activity' and 'other catalytic activity' for the trichome-enriched sample material than for the trichome-depleted sample material.
Finally, in both trichome-enriched and trichome-depleted samples there was a significant background of chloroplastic proteins associated with photosynthetic processes, which is consistent with a previous A. annua transcriptomics study, where a large number of transcripts matching photosynthetic homologues were identified [17]. There were comparatively more photosynthesis-related proteins in the trichome-depleted samples, which can be simply explained by the relatively lower number of chloroplast-containing cells in trichome-enriched samples.
A large number of the known enzymes on the biosynthetic pathway to artemisinin have been detected by combining the information from the two UniProtKB database searches (see Table 2 and Additional file 4) using the taxonomy viridiplantae and A. annua, respectively (see Fig. 3). The biosynthesis of artemisinin, as it is currently best understood, is summarized in three phases as shown in Fig. 3, with the enzymes that catalyze each step indicated above each arrow in black. Enzymes appearing in red below each arrow were identified in the A. annua taxonomy search (see also Table 2) and enzymes appearing in blue were identified in the viridiplantae search, and for the latter, if needed, by homology searching using BLASTp with 'Artemisia' as organism (E value < 10^−47; e.g. searching UniProt # Q84UU4 (α-humulene/(−)-(E)-β-caryophyllene synthase), UniProt # P93665 ((+)-δ-cadinene synthase), UniProt # Q2EPZ0 (cytochrome P450), UniProt # Q42799 (cytochrome P450) and UniProt # Q9ZPB7 (aldehyde dehydrogenase)).
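For orientation, the emPAI (exponentially modified protein abundance index) used for this comparison is conventionally defined as 10 raised to the ratio of observed to observable peptides, minus 1. The minimal sketch below illustrates this definition only. The function name and the peptide counts are hypothetical, and the way in which a search engine estimates the number of observable peptides is not reproduced here.

```python
def empai(n_observed: int, n_observable: int) -> float:
    """emPAI = 10 ** (observed peptides / observable peptides) - 1."""
    if n_observable <= 0:
        raise ValueError("n_observable must be positive")
    return 10 ** (n_observed / n_observable) - 1


# Hypothetical peptide counts for one protein in two samples.
enriched = empai(8, 10)
depleted = empai(2, 10)

# A protein with more of its observable peptides detected gets a higher index.
print(enriched > depleted)
```

Because the index grows exponentially with the fraction of observable peptides detected, it is better suited to ranking proteins within and between comparable runs than to absolute quantification.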
In phase 1 of the biosynthesis of artemisinin, HMGR [18] catalyzes the first transformation in the mevalonate pathway, which is irreversible under physiological conditions, and thereby constitutes a key committed step in the biosynthesis of sesquiterpenes (as well as triterpenes) in the cytosol of higher plants. ADS [19] at the end of phase 1/beginning of phase 2 again catalyzes a committed step which channels the metabolic flux towards the amorphane sesquiterpenes (artemisinin is one such secoamorphane) and away from triterpenes and alternative cyclic sesquiterpene skeletons that are common in A. annua (e.g. humulanes/caryophyllanes and cadinanes).
It has been known for several years now that the series of three sequential oxidations which converts amorpha-4,11-diene to artemisinic acid (via the intermediates artemisinic alcohol and artemisinic aldehyde) in phase 2 of the biosynthesis, is catalyzed by a single cytochrome P450, designated CYP71AV1 [20]. The cytochrome P450s identified from this study, as accession numbers Q42799 and Q2EPZ0, may represent this same enzyme. More recently, it has become clear that dihydroartemisinic acid [21], not artemisinic acid [22], is the true precursor to artemisinin at the start of phase 3 of the biosynthesis. It has been proposed that DBR2 converts artemisinic aldehyde to the alternative product, dihydroartemisinic aldehyde [23,24]. The 2-alkenal reductase (C0LNV1) identified in this study should catalyse this same reaction, and may be involved at this step or was simply identified due to its close homology to DBR2. An aldehyde dehydrogenase such as Q9ZPB7 is then required to convert dihydroartemisinic aldehyde to dihydroartemisinic acid [24,25].
There is still considerable uncertainty as to the identities of the intermediates involved in the third phase of the biosynthesis of artemisinin, as well as the enzymatic or non-enzymatic nature of these transformations. What is known is that dihydroartemisinic acid can be converted non-enzymatically to artemisinin via an initial oxygenation to the corresponding tertiary allylic hydroperoxide, followed by Hock cleavage, and a second oxidative reaction on the resultant enolic intermediate [26]. However, it is still not clear whether a similar series of spontaneous oxidations occurs in planta, or whether enzymes are present to catalyze some steps in this pathway (or whether an alternative series of oxidations occurs in vivo). In this regard, it is very interesting indeed to note the high trichome-specific expression of peroxidase 1, which must be a strong candidate as catalyst for the first (and possibly the second) oxidation reaction which is proposed in Fig. 3, if an enzyme were to be involved.

Conclusions
Using the example of A. annua, we have provided further evidence that the choice of sequence database is crucial for successful proteomic analysis. Compared to previously published proteomic data for A. annua, we have now shown for the example of a medicinal plant that the employment of an organism-specific database that has undergone extensive curation, leading to longer EST sequences, can greatly increase the number of true positive protein identifications, while reducing the number of false positives.
Most importantly, the presented results substantially increase the (trichome-specific) proteome data available for A. annua. An order-of-magnitude more proteins have been identified for trichome-enriched samples, including proteins which are known to be involved in the biosynthesis of artemisinin, as well as other highly abundant proteins, which suggest additional enzymatic processes within the trichomes that are important for the biosynthesis of artemisinin. In particular, the high trichome-specific expression of peroxidases suggests strong enzymatic oxidation activity in trichomes, potentially allowing for effective oxidative reactions in the final phase of the biosynthesis of artemisinin, which have so far been thought to be non-enzymatic in nature.

Solvents and solutions
All solvents used were of HPLC-grade and purchased from Sigma-Aldrich, Poole, UK, except water, which was acquired through Fisher Scientific, Loughborough, UK.

Plant material
Branches of the A. annua field cultivar Artemis (seed source: Mediplant, Switzerland) were harvested and pooled, and leaves were taken and frozen at −80°C within 30 min of harvest.

Isolation of glandular trichomes
A volume of 200 mL of isolation buffer was placed in a 500-mL Schott bottle with 200 μL of protease inhibitor (Calbiochem, Nottingham, UK) and left to stand on ice for 1 hour. After this time, 20 g of frozen A. annua leaves were placed into the buffer together with 20 g of glass beads (0.5 mm diameter) (Thistle Scientific, Glasgow, Scotland). The bottle was shaken for 5 min before passing the contents consecutively through 1-mm, 250-μm, 106-μm and 45-μm molecular sieves (Endecotts, London, UK).
The liquid was forced through the 106-μm and 45-μm sieves under pressure provided by nitrogen gas. All plant material was returned to the Schott bottle with fresh 200-mL portions of isolation buffer and fresh beads and the process repeated twice. The resulting filtrate was separated into 50-mL tubes and centrifuged for 20 min (2500 g, 4°C). The supernatant was disposed of and the pellet transferred to four 1.5-mL microcentrifuge tubes, which were centrifuged for a further 20 min (2500 g, 4°C). The supernatant was again discarded, leaving the pellets which constituted the enriched glandular trichome sample, each weighing approximately 0.2 g. Leaf material retained by the initial 1-mm sieve was also kept and used as the glandular trichome-depleted sample.

Environmental scanning electron microscopy
Plant material retained on the 1-mm steel sieve was collected, dried and analyzed by ESEM on a Quanta 600 F instrument (FEI, Hillsboro, OR, USA). For comparison, material obtained from the enriched glandular trichome isolate was also dried and analyzed by ESEM.

Protein extraction
Each glandular trichome-enriched pellet was crushed using a micro pestle. Aliquots of 2 g of frozen whole leaves (i.e. leaves which had not been subjected to the trichome isolation procedure) and 2 g of glandular trichome-depleted sample (prepared as above) were separately flash frozen in liquid nitrogen and ground with a pestle and mortar to obtain a fine powder. A volume of 1.5 mL of precipitation solution was then added to the trichome-enriched samples while 18 mL of precipitation solution was added to the whole leaf and the trichome-depleted samples, respectively. All samples were thoroughly vortexed and then stored at −20°C for one hour. The trichome-enriched samples were centrifuged for 10 min at 10000 g (4°C) while the whole leaf and trichome-depleted samples were centrifuged for 10 min at 4000 g (4°C). The supernatants from all samples were discarded and the pellets from the glandular trichome-enriched sample were dissolved in 1.5 mL of rinsing solution while the pellets from the whole leaf and glandular trichome-depleted samples were dissolved in 18 mL of rinsing solution. All samples were stored at −20°C for one hour and then centrifuged for 10 minutes (trichome-enriched samples at 10000 g, 4°C, and whole leaf and trichome-depleted samples at 4000 g, 4°C) before the supernatant was discarded. The previous steps including addition of rinsing solution, centrifugation and discarding of the supernatant were repeated twice for all samples and the resulting pellets were retained from each procedure. The trichome-enriched pellets were then dried under vacuum for 30 min. A volume of 200 μL of solubilization solution was added to each trichome-enriched pellet. Then the samples were vortexed and centrifuged for 10 minutes (10000 g at 25°C) and the supernatants kept.
The trichome-depleted and whole leaf pellets were left on the bench for 1 hour for the pellets to dry and 3 mL of solubilization solution was added before the samples were vortexed and centrifuged for 10 minutes (4000 g at 25°C) and the supernatants retained.
The amount of protein in each sample was determined using a Bradford assay.

Protein digestion
A volume of 210 μL of trichome-enriched material, 227 μL of trichome-depleted material and 139 μL of whole leaf protein extracts with approximately 150 μg of protein each were digested separately. An appropriate volume of a 100-mM dithiothreitol (Sigma-Aldrich) solution was added to each extract in order to obtain a final concentration of 10 mM of dithiothreitol. Each extract was then vortexed and stored at 45°C for 45 min. The extracts were subsequently left for 5 minutes at room temperature to cool before adding appropriate volumes of a 90-mM solution of iodoacetamide (Sigma-Aldrich) in order to obtain a final concentration of 30 mM of iodoacetamide. A 50-mM ammonium bicarbonate (Sigma-Aldrich) solution was also added to each extract in order to dilute the concentration of urea to approximately 2 M. All extracts were vortexed and left in the dark for a further 45 min. The pH of each extract was checked using pH strips to confirm that the pH remained within the range of 7.5-8. A sequence-grade trypsin (Promega, Southampton, UK) stock solution (200 ng/μL) was added to each extract to obtain a protein-to-trypsin ratio of 100:1 and all extracts were vortexed and left overnight at 37°C. The digestion of each extract was stopped by adding 10 μL of 0.1 % trifluoroacetic acid (TFA; Sigma-Aldrich).
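The "appropriate volumes" referred to above follow from standard dilution arithmetic. The sketch below is an illustration only: it assumes that the final concentration refers to the mixture after the stock solution has been added, and that the reagent is absent from the extract beforehand. The function names are hypothetical; the numerical inputs are those given in the text.

```python
def spike_volume(sample_ul: float, stock_mM: float, final_mM: float) -> float:
    """Volume of stock (in uL) to add so that the mixture reaches final_mM.

    Solves stock_mM * x / (sample_ul + x) = final_mM for x.
    """
    if not 0 < final_mM < stock_mM:
        raise ValueError("final_mM must lie between 0 and stock_mM")
    return sample_ul * final_mM / (stock_mM - final_mM)


def trypsin_volume(protein_ug: float, ratio: float, stock_ng_per_ul: float) -> float:
    """Volume of trypsin stock (in uL) for a given protein-to-trypsin ratio."""
    trypsin_ng = protein_ug * 1000 / ratio
    return trypsin_ng / stock_ng_per_ul


# Trichome-enriched extract: 210 uL, 100 mM DTT stock, 10 mM final.
print(round(spike_volume(210, 100, 10), 1))   # 23.3
# 150 ug protein, 100:1 protein-to-trypsin, 200 ng/uL trypsin stock.
print(trypsin_volume(150, 100, 200))          # 7.5
```

The same `spike_volume` function applies to the iodoacetamide step (90 mM stock, 30 mM final), provided the sample volume passed in already includes the dithiothreitol added beforehand.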

LC-MS/MS
Each digested sample was diluted by a factor of 3 using 0.1 % TFA. An aliquot of 1 μL of each sample was injected onto a UHPLC-MS/MS system, consisting of a Dionex 3000RSLC UHPLC system (Thermo Scientific, Hemel Hempstead, UK) and an LTQ Orbitrap XL mass spectrometer (Thermo Scientific) as described previously [27]. Samples were injected in triplicate.
MS/MS analysis was performed on the LTQ Orbitrap XL using an automatic gain control (AGC) target value of 500,000 over 500 ms for the orbitrap and a value of 10,000 over 200 ms for the ion trap. MS survey scans (m/z 400-2,000) were acquired on the orbitrap mass analyzer set at a resolution of 60,000. The five most intense ions per MS scan with a charge state of ≥2 were sequentially isolated in order of their signal intensity (highest intensity first, with a signal intensity threshold of 5,000 and an isolation window of m/z 3) and fragmented in the linear ion trap by collision-induced dissociation (CID) with a normalized collision energy of 35 %, an activation q value of 0.25 and an activation time of 30 ms. The fragment ions were recorded over the m/z range of 100-2,000. Dynamic exclusion was enabled to minimize redundant sequencing: MS peaks that occurred more than once within 30 s were excluded from selection for fragmentation for 60 s (with the exclusion list restricted to 500 entries).
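The precursor selection rule described above (charge state of at least 2, intensity threshold of 5,000, five most intense ions first) can be sketched as follows. This is a simplified illustration, not the instrument's acquisition logic: the `Precursor` type, the function name and the example peaks are hypothetical, the threshold is assumed to be inclusive, and dynamic exclusion is deliberately left out.

```python
from typing import NamedTuple


class Precursor(NamedTuple):
    mz: float
    intensity: float
    charge: int


def select_precursors(peaks, top_n=5, min_intensity=5000.0, min_charge=2):
    """Return up to top_n eligible precursors, most intense first."""
    eligible = [
        p for p in peaks
        if p.charge >= min_charge and p.intensity >= min_intensity
    ]
    return sorted(eligible, key=lambda p: p.intensity, reverse=True)[:top_n]


# Made-up survey scan used only to exercise the function.
survey_scan = [
    Precursor(445.12, 9.0e5, 1),  # singly charged: skipped
    Precursor(512.78, 4.2e4, 2),
    Precursor(633.31, 3.0e3, 2),  # below the intensity threshold: skipped
    Precursor(701.40, 8.8e4, 3),
    Precursor(850.93, 1.5e4, 2),
]
for precursor in select_precursors(survey_scan):
    print(precursor)
```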

Data analysis
The raw LC-MS/MS data of each technical replicate of each sample type (trichome-enriched, trichome-depleted and whole leaf sample) were converted into Mascot generic format files (.mgf files) using Mascot Distiller software (version 2.3.2; Matrix Science, London, UK). Searches were then performed against sequence databases using Mascot Daemon (Matrix Science), which combined the database search results from all three technical replicates of each sample type. Mascot (server version 2.4) searches were performed against the UniprotKB database (downloaded on 24 April 2012; 535,698 sequences, 190,107,059 residues), the NCBInr database (downloaded on 7 June 2012), the York A. annua (Artemis) contig database, the recently published trichome Trinity contig database and an in-house contaminants database. The York A. annua (Artemis) contig database used for this study was established as part of the transcriptome shotgun assembly project from the University of York [13]; it was downloaded from the NCBI website (http://www.ncbi.nlm.nih.gov/bioproject/39657; January 2012) and consists of 116,303 RNA sequences from the cultivar Artemis. The recently published trichome Trinity contig database was obtained from Soetaert et al. [14]. Contaminants database searches were performed in order to assess the level of sample contamination by proteins such as keratins and common protein standards frequently used in the laboratory. Searches were performed using the following parameters: peptide mass tolerance, 10 ppm; MS/MS tolerance, 0.8 Da; peptide charge, +2, +3, +4; missed cleavages, 2; fixed modification, carbamidomethyl (C); variable modification, oxidation (M); and enzyme, trypsin. Taxonomy of viridiplantae was specified when searching against the NCBInr and UniprotKB databases.
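Two of the search parameters listed above, the 10 ppm peptide mass tolerance and the allowance of up to 2 missed cleavages, can be illustrated with a short sketch. It is not Mascot's implementation: the function names and example values are hypothetical, and the digest assumes the conventional trypsin rule (cleavage C-terminal to K or R, but not before P).

```python
def ppm_error(observed_mass, theoretical_mass):
    """Signed mass error in parts per million."""
    return (observed_mass - theoretical_mass) / theoretical_mass * 1e6


def within_tolerance(observed_mass, theoretical_mass, tol_ppm=10.0):
    """True if the precursor mass error lies within +/- tol_ppm."""
    return abs(ppm_error(observed_mass, theoretical_mass)) <= tol_ppm


def tryptic_peptides(sequence, max_missed=2):
    """In-silico tryptic digest allowing up to max_missed missed cleavages.

    Assumed rule: cleave C-terminal to K or R unless the next residue is P.
    """
    ends = [
        i + 1
        for i, aa in enumerate(sequence[:-1])
        if aa in "KR" and sequence[i + 1] != "P"
    ]
    bounds = [0] + ends + [len(sequence)]
    peptides = []
    for start in range(len(bounds) - 1):
        last = min(start + 1 + max_missed, len(bounds) - 1)
        for stop in range(start + 1, last + 1):
            peptides.append(sequence[bounds[start]:bounds[stop]])
    return peptides


print(within_tolerance(1500.7600, 1500.7500))  # error is about 6.7 ppm
print(tryptic_peptides("AKRPGKC", max_missed=1))
```

In the toy sequence the R-P bond is not cleaved, so RPGK is treated as a single fully cleaved peptide rather than as a missed cleavage.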
The merged database search results from the trichome-enriched sample were also compared against the merged database search results from the trichome-depleted sample and the whole leaf sample using their Mascot-derived emPAI values. The proportional fold difference for each contig/protein was calculated by dividing the emPAI value of the trichome-enriched sample by the corresponding value of the trichome-depleted or whole leaf sample, respectively. For this, the Mascot search results were first submitted to the built-in Percolator software, filtered by applying an 'expect cut-off' of 0.05 and exported as .csv files using Report Builder within Mascot with a filter of at least 2 'significant sequences'.
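As a minimal illustration of the fold-difference calculation, the sketch below uses the published definition of emPAI (10 raised to the ratio of observed to observable peptides, minus 1) and divides the values of one sample by those of another. The function names, contig labels and emPAI values are invented for the example, and Mascot's own estimate of the number of observable peptides is not reproduced here.

```python
def empai(n_observed, n_observable):
    """emPAI = 10 ** (observed peptides / observable peptides) - 1."""
    return 10 ** (n_observed / n_observable) - 1


def fold_differences(enriched, reference):
    """Ratio of emPAI values for contigs reported in both samples."""
    return {
        contig: enriched[contig] / reference[contig]
        for contig in enriched
        if contig in reference and reference[contig] > 0
    }


# Invented emPAI values, keyed by contig label.
enriched = {"contig_A": 2.40, "contig_B": 0.30, "contig_C": 1.10}
depleted = {"contig_A": 0.60, "contig_B": 0.60}

for contig, ratio in fold_differences(enriched, depleted).items():
    print(contig, round(ratio, 2))
```

A ratio above 1 indicates a higher emPAI value in the trichome-enriched sample; contigs missing from the reference sample have no defined ratio and are therefore left out of this calculation.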
For each comparison, two different lists were created: proteins/contigs that were present in all triplicates of both samples, and proteins/contigs that were present in all triplicates of one sample but in none of the triplicates of the other. In the first list, proteins/contigs were ranked according to their proportional difference in emPAI value, while in the second list they were ranked according to their absolute emPAI value.
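The construction of the two lists can be sketched as follows. The data layout is an assumption made for illustration (each contig maps to its merged emPAI value and the number of replicates in which it was identified), and the function name, contig labels and values are hypothetical.

```python
def compare_samples(sample_a, sample_b, n_replicates=3):
    """Return (shared, only_a) lists for two samples.

    Each sample maps contig -> (emPAI, number of replicates with an
    identification). 'shared' holds (contig, emPAI ratio a/b) ranked by
    ratio; 'only_a' holds (contig, emPAI in a) ranked by emPAI.
    """
    shared, only_a = [], []
    for contig, (empai_a, reps_a) in sample_a.items():
        if reps_a < n_replicates:
            continue  # not present in every replicate of sample a
        empai_b, reps_b = sample_b.get(contig, (0.0, 0))
        if reps_b == n_replicates and empai_b > 0:
            shared.append((contig, empai_a / empai_b))
        elif reps_b == 0:
            only_a.append((contig, empai_a))
    shared.sort(key=lambda item: item[1], reverse=True)
    only_a.sort(key=lambda item: item[1], reverse=True)
    return shared, only_a


# Invented values: contig -> (emPAI, replicates with an identification).
enriched = {
    "contig_A": (2.4, 3), "contig_B": (0.3, 3), "contig_C": (1.1, 3),
    "contig_D": (0.9, 2), "contig_E": (0.5, 3),
}
depleted = {"contig_A": (0.6, 3), "contig_B": (0.6, 3), "contig_E": (0.2, 1)}

shared, only_enriched = compare_samples(enriched, depleted)
print("shared:", shared)
print("enriched only:", only_enriched)
```

Contigs detected in only some replicates of either sample (contig_D and contig_E here) fall into neither list, which mirrors the all-or-none criterion described above.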
Because the A. annua contig database lacks annotation, a BLASTx search with an E-value cut-off of 0.01 was performed on the highest-ranking contigs from the Mascot database search result comparisons detailed above in order to assign the contigs to known protein homologues. For this search, 'non-redundant protein sequences' was specified as the database and Artemisia as the organism. The functions of the proteins resulting from the BLASTx search were verified using the UniProt Knowledgebase. Identified proteins with associated GO (gene ontology) molecular function terms were classified into the following categories based on the UniProt database (http://www.uniprot.org): ion binding, small molecule binding, oxidoreductase/antioxidant activity, other catalytic activity, hydrolase activity, structural molecule activity, electron carrier activity, isomerase activity, transporter activity, and protein/peptide binding. These categories are umbrella terms for more specific GO molecular functions, and a number of proteins contributed to more than one GO molecular function class, depending on their (multiple) GO annotations.
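Because a protein with several GO annotations contributes to more than one category, the category counts can sum to more than the number of proteins. The sketch below shows one way such a tally could be made; the function name, protein labels and category assignments are invented for illustration and are not results from this study.

```python
from collections import Counter


def count_categories(annotations):
    """Count proteins per category; a protein may count in several categories."""
    counts = Counter()
    for categories in annotations.values():
        counts.update(set(categories))  # set() avoids counting a category twice
    return counts


# Invented example: protein label -> umbrella categories of its GO terms.
annotations = {
    "protein_1": ["ion binding", "oxidoreductase/antioxidant activity"],
    "protein_2": ["ion binding"],
    "protein_3": ["hydrolase activity", "small molecule binding"],
}

counts = count_categories(annotations)
for category, n in counts.most_common():
    print(f"{category}: {n}")
print("proteins:", len(annotations), "category assignments:", sum(counts.values()))
```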