Proteomic fingerprinting of Neotropical hard tick species (Acari: Ixodidae) using a self-curated mass spectra reference library

Matrix-assisted laser desorption/ionization (MALDI) time-of-flight mass spectrometry is an analytical method that detects macromolecules that can be used for proteomic fingerprinting and taxonomic identification in arthropods. The conventional MALDI approach uses fresh laboratory-reared arthropod specimens to build a reference mass spectra library with high-quality standards required to achieve reliable identification. However, this may not be possible to accomplish in some arthropod groups that are difficult to rear under laboratory conditions, or for which only alcohol preserved samples are available. Here, we generated MALDI mass spectra of highly abundant proteins from the legs of 18 Neotropical species of adult field-collected hard ticks, several of which had not been analyzed by mass spectrometry before. We then used their mass spectra as fingerprints to identify each tick species by applying machine learning and pattern recognition algorithms that combined unsupervised and supervised clustering approaches. Both Principal Component Analysis (PCA) and Linear Discriminant Analysis (LDA) classification algorithms were able to identify spectra from different tick species, with LDA achieving the best performance when applied to field-collected specimens that did have an existing entry in a reference library of arthropod protein spectra. These findings contribute to the growing literature that ascertains mass spectrometry as a rapid and effective method to complement other well-established techniques for taxonomic identification of disease vectors, which is the first step to predict and manage arthropod-borne pathogens.


Introduction
Hard ticks (Ixodidae) are hematophagous ectoparasites that feed on almost every species of terrestrial vertebrate on earth, including Homo sapiens sapiens [1,2]. Due to a complete dependency on blood as a food source, both sexes of adults and immature ticks are capable of transmitting disease pathogens to their hosts, causing significant morbidity and sometimes even death [3,4]. Research on hard ticks has increased recently in the Neotropics, where a growing number of outbreaks of tick-borne related illnesses have been documented [5][6][7][8]. Despite these efforts, comprehensive studies about the ecology, behavior and control of hard ticks relevant to public health remain elusive in Central America due to the shortcomings of traditional taxonomic methods for species identification [9]. Taxonomic identification of Neotropical Ixodidae has traditionally relied on adult morphological characters [10]; however, morphological keys for immature stages (i.e., larvae and nymphs) are lacking and experts are often unable to reliably identify immature ticks to species [10,11]. Moreover, morphological identification of ticks is unrealistic in epidemiological settings because assessing the role of ticks as disease vectors usually involves identifying hundreds of individuals for pathogen screening, an extremely time-consuming effort, which may be further impeded by the lack of qualified taxonomic specialists [12].
Matrix-assisted laser desorption/ionization (MALDI) time-of-flight mass spectrometry is an analytical technique that allows for sensitive and accurate detection of complex molecules such as proteins, peptides, lipids and nucleic acids [13][14][15]. The conventional MALDI approach has been used successfully for proteomic fingerprinting through pattern recognition for the identification of microorganisms such as pathogenic bacteria and fungi, which can be cultured in the laboratory and form discrete colonies with very consistent mass spectra, facilitating the development of reference libraries for identification of unknown samples [16,17]. In fact, a commercial program offered by the manufacturers of the MALDI technology is capable of determining statistical similarities between the spectra of unknown samples and a well-curated, proprietary reference library of bacteria and fungi to identify the species of the unknown specimen. This is analogous to the process of matching fingerprints, and offers a simplified comparison score that ranges from 0.0 to 3.0. Scores above or equal to 2.3 represent a confident match at the genus rank, and high probability at the species level, while values below 1.7 are considered non-reliable identifications [16][17][18]. Although more challenging than identifying bacteria and fungi due to the size and heterogeneity of the specimen, MALDI has also been used to discriminate among species of invertebrates, including mosquitoes (Culicidae-Anopheles), fleas (Pulicidae-Ctenocephalides), biting midges (Ceratopogonidae-Culicoides), sandflies (Psychodidae-Phlebotomus, Lutzomyia) and ticks (Ixodidae-Rhipicephalus) [19][20][21][22][23][24][25][26][27]. A key finding from these studies is that protein spectra obtained from body sections or whole specimens were similar among individuals of the same morphological species but differed noticeably across different species. Therefore, MALDI protein spectra can be used as a tool to delimit species boundaries in arthropods that are vectors of pathogens. Nevertheless, fresh laboratory-reared specimens are routinely needed to build a reference library that meets the high-quality standards required for classification. This represents an important limitation for some arthropod groups, or assemblages, that are difficult to rear under laboratory conditions. In addition, epidemiological studies often rely on field-collected specimens preserved in ethanol for long-term storage in reference collections.
PLOS Neglected Tropical Diseases | https://doi.org/10.1371/journal.pntd.0008849 | October 27, 2020
To overcome these limitations, previous studies have opted to adjust the minimum comparison-score threshold for identification, lowering the manufacturer's recommended score from 2.3 to 1.8 [22,28] or even 1.3 [23,29]. Hence, mass fingerprinting for the identification of field-collected specimens that do not exist in a reference spectra library (or for those from which reference spectra cannot be generated under ideal conditions) requires an alternative, objective approach [12]. Moreover, most existing applications of MALDI to identify arthropod disease vectors have focused on relatively species-poor vector assemblages from Europe. This technique has been tested less frequently in the New World tropics [20,21,23,25,28-37], where vector species richness is the greatest on Earth.
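The score bands described above can be made concrete with a short sketch. The Python function below is illustrative only: it is not the vendor's software, the function name and labels are hypothetical, and the band between 1.7 and 2.3 is labelled "intermediate" simply because the text above does not characterize it. The threshold values (2.3 and 1.7, plus the lowered cut-offs 1.8 and 1.3) are the ones cited in the text.

```python
def interpret_score(score, match_cutoff=2.3, unreliable_cutoff=1.7):
    """Map a 0.0-3.0 comparison score to a coarse, hypothetical label."""
    if not 0.0 <= score <= 3.0:
        raise ValueError("score must lie between 0.0 and 3.0")
    if score >= match_cutoff:
        return "match"
    if score < unreliable_cutoff:
        return "non-reliable"
    return "intermediate"  # band not characterized in the text


# The same score of 1.9 is judged differently as the cut-off is lowered.
for cutoff in (2.3, 1.8, 1.3):
    print(cutoff, interpret_score(1.9, match_cutoff=cutoff))
# 2.3 intermediate
# 1.8 match
# 1.3 match
```

In this toy case a score of 1.9 is accepted only once the cut-off is lowered, which illustrates why study-specific threshold changes make identifications difficult to compare.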
Here, we used MALDI as a scheme to identify Neotropical specimens of adult hard ticks derived from ethanol-preserved field collections. Specifically, we used machine learning and pattern recognition algorithms to classify protein spectra from the legs of field-collected specimens in order to identify a group of unknown samples with a self-curated reference library. MALDI is a promising tool for cataloging and quickly identifying large arthropod groups such as ticks [12]. Our results should contribute to the growing body of literature trying to address questions about feasibility, reliability and universality of the methodology for different environments and species that have not been evaluated before. Properly identifying disease vectors such as Ixodidae in highly diverse Neotropical countries, such as Panama, is a critical first step to predict and manage tick-borne zoonotic pathogens such as Rickettsia and arboviruses (i.e., arthropod-borne viruses).

Sample preparation
Ticks stored in ethanol for up to 5 years, and previously identified based on morphological characters, were taken from long-term storage in a -20˚C freezer (S1 Table). A total of 103 specimens from the following species were included in this study: Amblyomma mixtum ( Samples were prepared following previously published protocols with minor modifications [22,23]. Briefly, we removed either the left or the right anterior leg from each tick specimen using a scalpel. The leg was then put in a tube with 300 μL ultrapure water followed by the addition of 900 μL 100% ethanol. The tube was vortexed for 15 seconds and centrifuged using a Heraeus Biofuge Pico microcentrifuge (Thermo Fisher Scientific, Waltham, MA, USA) at 17,000 g for 2 minutes. After centrifugation, the supernatant was poured off from the sample tube, which was left to dry for 15 minutes. Subsequently, the leg was resuspended in 60 μL 70% formic acid and 60 μL 100% acetonitrile and homogenized in the microtube using a manual pestle. The sample was placed in a Branson 1510 ultra-sonicator (Bransonic, Danbury, CT, USA) for 60 minutes in ice water, and then vortexed for 15 seconds and centrifuged again at 17,000 g for 2 minutes.
For peptide detection with mass spectrometry, a saturated solution (10 mg/mL) of α-cyano-4-hydroxycinnamic acid (HCCA) matrix was prepared in 30:70 [v/v] acetonitrile: 0.1% trifluoroacetic acid (TFA) in water. An aliquot of 1 μL from the sample supernatant was premixed with an equal volume of HCCA matrix, and 1 μL of the mix was quickly pipetted onto a polished steel MALDI plate in its respective target spot. All samples were placed and measured on three individual target spots with spectra from three technical replicates collected per spot. After letting the plate dry, it was inserted into the MALDI mass spectrometer to record the protein spectra from the tick's leg.

MALDI mass spectrometry parameters
We used an UltrafleXtreme spectrometer (Bruker Daltonics, Bremen, Germany) to generate the protein mass spectra of each specimen. The equipment has a MALDI source, a time-of-flight (TOF) mass analyzer, and a 2 kHz Smartbeam-II neodymium-doped yttrium aluminum garnet (Nd:YAG) solid-state laser (λ = 355 nm) that we used in positive polarization mode. All spectra were automatically acquired in the range of 2,000 to 20,000 m/z in linear mode for the detection of the most abundant protein ions. Each spectrum represented the accumulation of 5,100 shots with 300 shots taken at a time, and the acquisition was done in random-walk mode with a laser power in the range of 50% to 100% (global laser attenuation at 30%).
The software FlexAnalysis (Bruker) was used to pre-process the mass spectra and evaluate their quality, based on the number of ion peaks and their intensity. Initially, all sample spectra were normalized by applying a general algorithm for baseline subtraction and smoothing provided by the software. Visual comparisons of the mass spectra from different tick species gave initial indications of dominant ion peaks that would suggest possible classification into discrete groups. Mass spectra that did not include at least one ion peak with an intensity of 1000 a.u. or more were considered low quality and filtered out. All samples were placed and measured on three individual target spots, with three technical replicates of the mass spectra collected per spot.
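The quality criterion stated above (at least one ion peak of 1000 a.u. or more) is simple enough to sketch in code. The following Python fragment is a hypothetical illustration, not the FlexAnalysis procedure: the data layout (a list of (m/z, intensity) pairs per spectrum), the function names and the example values are all assumptions made for the sketch.

```python
MIN_PEAK_INTENSITY = 1000.0  # a.u., the threshold stated in the text


def passes_quality_filter(peaks, min_intensity=MIN_PEAK_INTENSITY):
    """Return True if any (m/z, intensity) peak reaches the threshold."""
    return any(intensity >= min_intensity for _mz, intensity in peaks)


def filter_spectra(spectra, min_intensity=MIN_PEAK_INTENSITY):
    """Keep only the spectra that contain at least one strong peak."""
    return {
        name: peaks
        for name, peaks in spectra.items()
        if passes_quality_filter(peaks, min_intensity)
    }


# Made-up example values: one spectrum with a strong peak, one without.
spectra = {
    "spot_A1": [(3120.4, 850.0), (6540.2, 2310.0)],
    "spot_A2": [(3119.8, 420.0), (6541.0, 990.0)],
}
print(sorted(filter_spectra(spectra)))  # ['spot_A1']
```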

Data analysis, clustering algorithms and statistics
The methodology has been described in detail previously by our group for the identification of adult mosquitoes from their legs [27], based on similar data analysis for face recognition [38,39] and spectral classification using mass spectrometry [40,41]. In brief, 239 mass spectra generated across 103 samples for all 18 species of morphologically identified Neotropical hard ticks were classified with a custom-made algorithm developed by our group using MATLAB (MathWorks, Natick, MA, USA). The algorithm is based on Principal Component Analysis (PCA) and Linear Discriminant Analysis (LDA), which are linear transformation techniques from the field of Machine Learning that are commonly used for dimensionality reduction and classification. Dimensionality reduction can help decrease computational costs for classification, as well as avoid overfitting by minimizing the error in parameter estimation. Overfitting was also addressed by maximizing the number of specimens analyzed per species, while minimizing the number of technical replicates (i.e., only three spectra per specimen with good signal intensity were used for data analysis).
PCA is an "unsupervised" algorithm that generates vectors that correspond to the direction of maximal variance in the sample space. On the other hand, LDA is a "supervised" algorithm that considers class information to provide a basis that best discriminates the classes (i.e., tick species) [38]. For both PCA and LDA analyses, we calculated the Euclidean distance between the vector describing the test sample and the average vector describing each class to identify a test sample. The class with the minimum distance with respect to the test sample was assigned as the identified species for that test sample. The LDA was applied over the data set expressed in terms of the coefficients (i.e., principal components) obtained by the PCA. Thus, PCA reduced the dimensionality of the data, and the LDA provided the supervised classification.
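The assignment rule described above (compute the mean vector of each class, then assign a test sample to the class at minimum Euclidean distance) can be sketched as follows. This is a minimal Python illustration, not the MATLAB implementation used in this study: it assumes the feature vectors have already been projected onto the PCA or LDA basis, and the function names, species labels and numbers are hypothetical.

```python
import math


def class_means(training):
    """Average the feature vectors of each class.

    training: dict mapping a class label to a list of equal-length vectors.
    """
    means = {}
    for label, vectors in training.items():
        n = len(vectors)
        means[label] = [sum(column) / n for column in zip(*vectors)]
    return means


def identify(sample, means):
    """Return the label whose mean vector is closest to the sample."""
    return min(means, key=lambda label: math.dist(sample, means[label]))


# Toy two-component coefficients for two hypothetical species.
training = {
    "species_A": [[1.0, 0.2], [1.2, 0.0]],
    "species_B": [[-0.9, 1.1], [-1.1, 0.9]],
}
means = class_means(training)
print(identify([0.8, 0.3], means))   # species_A
print(identify([-0.7, 0.8], means))  # species_B
```

Because the rule depends only on distances to class means, its outcome is determined entirely by how well the chosen basis separates those means relative to the within-class scatter, which is the property LDA is designed to optimize.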
The performance of the clustering algorithms was tested using Monte Carlo simulations over 1000 iterations per species to optimize training and cross-validation prediction success rates. For each iteration, the data elements in each class were split randomly in approximately, but not less than, 20% of the elements for testing and the rest of the elements for training, for each species. We used all the peaks in the spectra for the PCA analysis, and the first 150 principal components from the PCA stage that explained 99.9% of the total variance were then projected for the LDA algorithm, which also generated a 150-components data set. The number of components was chosen after a performance analysis, again using a Monte Carlo approach, that provided the best identification rates. Global and class positive identification rates were calculated to establish the classification capacity of the algorithm. The positive identification rate corresponds to the percent ratio between positive identifications performed by the algorithm and the real positive cases in the data.
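The validation scheme described above can be outlined in a few lines. The Python sketch below is an assumption-laden illustration rather than the MATLAB code used here: it holds out at least 20% of each class per iteration (rounding up), classifies the held-out elements with a nearest-class-mean rule, and averages the global positive identification rate over the iterations. It omits the PCA and LDA projections, and all names and toy data are hypothetical.

```python
import math
import random


def split_class(items, test_percent=20, rng=random):
    """Randomly hold out at least test_percent of the items (rounded up)."""
    if len(items) < 2:
        raise ValueError("need at least two elements per class")
    n_test = (len(items) * test_percent + 99) // 100  # integer ceiling
    shuffled = list(items)
    rng.shuffle(shuffled)
    return shuffled[n_test:], shuffled[:n_test]  # (train, test)


def positive_identification_rate(true_labels, predicted_labels):
    """Percent of cases whose predicted label equals the true label."""
    hits = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return 100.0 * hits / len(true_labels)


def nearest_mean(train, sample):
    """Assign the sample to the class with the closest mean vector."""
    means = {
        label: [sum(col) / len(vecs) for col in zip(*vecs)]
        for label, vecs in train.items()
    }
    return min(means, key=lambda label: math.dist(sample, means[label]))


def monte_carlo_rate(data, classify, iterations=1000, seed=0):
    """Mean global identification rate over repeated random splits."""
    rng = random.Random(seed)
    rates = []
    for _ in range(iterations):
        train, held_out = {}, []
        for label, items in data.items():
            train[label], test = split_class(items, rng=rng)
            held_out.extend((label, sample) for sample in test)
        truth = [label for label, _ in held_out]
        predicted = [classify(train, sample) for _, sample in held_out]
        rates.append(positive_identification_rate(truth, predicted))
    return sum(rates) / len(rates)


# Toy, well-separated data for two hypothetical classes.
data = {
    "A": [[1.0, 0.0], [1.1, 0.1], [0.9, -0.1], [1.2, 0.0], [1.0, 0.2]],
    "B": [[-1.0, 0.0], [-1.1, 0.1], [-0.9, -0.1], [-1.2, 0.0], [-1.0, 0.2]],
}
print(monte_carlo_rate(data, nearest_mean, iterations=50))  # 100.0
```

A per-class rate follows from the same function by restricting the true and predicted labels to the cases whose true label is the class of interest.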
For visualization purposes in the plots generated with our algorithm in MATLAB, species that were morphologically identified within the Rhipicephalus and Ixodes genera were separately compared against Dermacentor and Haemaphysalis, for which there was only one species each. All species that were morphologically identified within the Amblyomma genus were separately compared among themselves or against the Ixodes genus.

Results
Optical micrographs from 18 species of Neotropical hard ticks showed evident differences among species in terms of adult morphological features (Fig 1), which was well aligned with the expected unique mass spectra generated from each sample and taxon (Fig 2, S1 Fig, S2 Fig and S3 Fig). The global automatic acquisition rate was 77% for all species (Table 1), confirming that, overall, the mass spectra of field-collected and ethanol-preserved specimens allowed automatic acquisition of spectra. In fact, automatic acquisition of spectra results in faster and more objective data acquisition than performing spectra collection manually. However, the automatic spectra collection, coupled to the fact that species had different starting numbers of specimens, meant that the number of spectra per species for data analysis was not the same and, in some cases, did not meet the expected number of spectra per specimen (Table 1). Still, this was not an obstacle for our data analysis clustering algorithm. The percentage of automatic spectra acquisition with the MALDI ranged from 50% for A. mixtum (cajennense), I. boliviensis and R. sanguineus to 100% for several of the species, including A. calcaratum, A. geayi, A. sabanerae, I. affinis, and R. microplus, covering a range from 6 to 56 spectra per species (Table 1). The time stored in ethanol or the location of sample origin did not seem to explain the variable percentages of automatic spectra collection (S1 Table). Spectra from freshly collected specimens stored dry at -20˚C, used to establish the methodology, exhibited the best signals, with better-defined spectral peaks and higher signal-to-noise ratio.
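The global acquisition rate can be cross-checked against the counts given in the Methods, under one assumption that the text does not state explicitly: that the expected number of spectra is three per specimen. The short Python sketch below uses the 239 spectra and 103 specimens reported in the data analysis section; the function name and the choice of denominator are hypothetical.

```python
def acquisition_rate(acquired, specimens, expected_per_specimen=3):
    """Percent of expected spectra that were actually acquired.

    The default of three expected spectra per specimen is an assumption.
    """
    return 100.0 * acquired / (specimens * expected_per_specimen)


rate = acquisition_rate(acquired=239, specimens=103)
print(round(rate))  # 77
```

Under that assumption, 239 of 309 expected spectra gives about 77%, which is consistent with the global rate reported in Table 1.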
In addition, the specimens within each species showed consistently similar protein profiles, regardless of their genus, sex, collection date and/or sampling location (S1 Fig, S2 Fig, S3 Fig). Mean protein spectra for tick species differed visually among taxa and the differences appeared to be related to their degree of phylogenetic relatedness (Fig 2). For example, species within the genera Ixodes, Rhipicephalus, and Amblyomma were more similar among themselves in terms of the number of ion peaks and their mass-to-charge (m/z) positions in the mass spectra than species from different genera. Nonetheless, some closely related species within the Amblyomma genus such as A. mixtum (cajennense), A. varium, and A. tapirellum also showed fairly distinct protein spectra (Fig 2), which motivated the application of clustering algorithms for their classification.
Distinct mass spectra profiles between morphologically identified ixodid species could be classified by an unsupervised PCA algorithm to identify specimens. The quantitative performance of the PCA algorithm was assessed per species (Table 2), and visually confirmed with the graphic clustering presented in 3D plots (Fig 3). The PCA global positive identification rate was 91.2%, with 14 out of 18 species having higher than 90% positive identification rate. The PCA graphs showed that most species separated in well-defined clusters, and the distance among clusters seemed to be related to the degree of phylogenetic relatedness, as evidenced by the clear separation of the specimens of Dermacentor and Rhipicephalus from those of Haemaphysalis and Ixodes (Fig 3A and 3B), or just between the specimens of Amblyomma (Fig 3C). When comparing species within the genus Amblyomma against those from Ixodes, again the spectra from specimens of each species clustered together with limited overlap between groups and those from different genera were clearly separated (Fig 3D).
Table 1 note: the marker indicates specific specimens that were stored fresh in silica gel upon collection (for more metadata about these samples, see also S1 Table). https://doi.org/10.1371/journal.pntd.0008849.t001

In addition, the LDA clustering analysis showed a global positive identification rate of 94.2% (Fig 4; Table 2), with 14 out of 18 species having higher than 97.8% positive identification rate. The range of positive identification rates went from 100% (best score possible) for A. mixtum (cajennense), A. nodosum, A. oblongoguttatum, A. ovale, A. varium, A. naponense and R. sanguineus to 45.6% for D. nitens. The 3D representation plots of the LDA clustering displayed that the separation between species was more pronounced than with PCA when comparing species from different genera, confirming the improved quantitative results of the performance of the LDA algorithm (Table 2).

Discussion
Our results show that MALDI mass spectra of highly abundant proteins in arthropod legs served as fingerprints to identify samples of 18 species of Neotropical hard ticks using machine learning and pattern recognition algorithms to create a self-curated reference library. We compared smoothed and baseline-corrected spectra generated from unknown field-collected tick samples against the mean spectra from a subset of the same field samples that had already been identified through traditional means. To systematize this process, we used PCA and LDA algorithms to classify mass spectra without prior establishment of a high-quality reference library, which typically requires laboratory-reared specimens that may not be possible to obtain for all species. Global positive identification rates of up to 94.2% were achieved with this methodology, offering a rapid, reliable and objective approach to identify hard tick species, which will likely improve as more specimens are evaluated and included in our database.
These outcomes agree with our previous work [27] in which we used a similar approach to classify field-collected samples of 11 morphologically-identified species of Anopheles mosquitoes. In that study, Neotropical Anopheles samples were stored dry in silica gel at -20˚C, which seemed to avoid sample degradation and maintain spectral quality. This contrasts with the present study, where most of our specimens were stored in ethanol at -20˚C for several years. Thus, our findings confirm that our novel analytical approach using MALDI and PCA/LDA clustering algorithms is robust for species classification regardless of the arthropod assemblage, sample storing conditions, and the lack of a high-quality reference library. In fact, the percentage of automatic spectra acquisition from the processed tick species was much higher ( Our results herein also show that both classification algorithms, PCA and LDA, were capable of clustering and recognizing spectra from up to 18 different tick species, including roughly 50% of ixodid taxa (e.g., both ecologically dominant and rare species) reported for Panama [27,42]. LDA outcomes were more discriminant and robust than PCA overall, but PCA also classified species from different genera with over 91% accuracy and consistency. LDA clustered the 18 species of ticks with global validation and cross-validation scores above 94%, both between and within genera. As expected, the clustering algorithm was more accurate for distantly related species (i.e., those in the genera Ixodes, Rhipicephalus and Haemaphysalis), with success rates higher than 97% in most of these cases, than for closely related species (i.e., within the genus Amblyomma). However, A. dissimile and D. nitens showed only moderate to low positive identification rates. Although this could be due to assemblage-specific signals (i.e., high protein variability of conspecifics within these taxa), sample degradation and contamination, or technical errors such as spotting errors, cannot be ruled out entirely. Future studies will have to corroborate the findings regarding these two species.
Although the number of samples analyzed for some ixodid species was relatively low, several of these taxa are considered cryptic species complexes [43] and have been implicated as vectors of human pathogens in Panama as well as more broadly, including A. mixtum (cajennense) and D. nitens, the likely vectors of Rickettsia rickettsii, known to cause Rocky Mountain spotted fever [44]. We also included samples of A. tapirellum, A. oblongoguttatum and H. juxtakochi, three species from which human pathogens have been previously isolated [45], such as: Coxiella-related bacteria, whose member C. burnetii can cause Q fever; Ehrlichia, which causes ehrlichiosis infection; and Rickettsia, which causes a variety of bacterial infections in humans and other animals. These results are important because our species identification platform can serve along with recently implemented metagenomic approaches as additional tools for health ministries in Panama and other countries, to monitor, predict and manage tickborne zoonotic pathogens [46].
Morphological taxonomic identification of ixodid ticks can be enhanced by molecular techniques such as DNA barcoding [8,47], but this procedure is laborious, expensive and requires a highly trained lab technician. Studies show that typical DNA barcoding costs can range from $2 to $5 per sample, with difficult-to-extract samples increasing the cost two-fold or more [48,49], while costs associated with MALDI species identification have been calculated to be less than $0.50 per sample, without considering the high equipment cost [50][51][52]. Furthermore, a comprehensive repository of DNA sequences (e.g., DNA barcodes) is needed in order to test species limits, yet only a handful of Neotropical tick species are represented in the GenBank [53] or BOLD [54] repositories, which could limit identification to the most common taxa only. In addition, DNA barcoding occasionally fails to delimit species boundaries due to ambiguous evolutionary relationships among closely related tick species [47].
Modern methodologies of whole genome analysis of arthropod vectors using Illumina or Nanopore next generation sequencing platforms can be applied not only to delimit taxonomic boundaries among tick species, but also to examine vector evolution (i.e., positive selection and ecological diversification), demographic phenomena (i.e., expansion and bottlenecks) and molecular epidemiology (i.e., pathogen infection and genetic diversity). The cost of these modern technologies is decreasing rapidly, and they could quickly become a valid alternative for taxonomic studies in developing and middle-income countries of Central America, including Panama. Indeed, portable Nanopore MinION methodology can be performed at the site of interest, with a laptop computer by someone with very basic entomological knowledge, and at a very affordable price on a per-sample basis [55]. Nevertheless, the bioinformatic skills and computing cluster capacity needed to process whole genome sequences of tick samples might represent an impractical burden for some institutions in developing nations, which may not have the machinery or expertise to analyze this kind of data. Moreover, using the Nanopore MinION next generation sequencing approach for the exclusive goal of achieving reliable taxonomic identification of tick species may represent a poorly utilized expenditure that ultimately strains the budget of resource-limited institutions.
While MALDI mass spectrometry suffers from many of the shortcomings listed for other technologies, our approach can be used to identify both field-collected vectors and the pathogens they harbor in a short period of time, with a minimal amount of tissue and without the need for expert taxonomists. Our strategy to analyze protein spectra also overcomes the drawbacks of working without a reference library to classify unknown samples. We posit that MALDI mass spectra of highly abundant proteins from arthropod tissues are a powerful tool for species identification that can be easily adapted to other biological systems. However, we also believe that this technology will be best used as a complement to the traditional barcoding technique or modern next generation sequencing methodologies, to accurately confirm species boundaries across entire arthropod communities, while considering problematic vector taxonomy and the availability of local financial resources. Developing an additional tool for rapid and accurate arthropod species identification offers further flexibility to the fluctuating budgets of the research community in Central/South America.
The long-term goal of our analytical approach with MALDI is to develop a tool that can enhance currently available open-source, web-based platforms, such as MALDI UP [56], MicrobeMS [57], or Mass-Up [58]; or become a new all-in-one platform where users can upload mass spectra datasets of known specimens to increase the number of species covered (e.g., bacteria, fungi, insects) and directly test spectra from unknown specimens for identification with our clustering algorithms. This crowd-sourced approach could be more cost effective, given that it is not necessary to generate a reference library of well-curated samples. Instead, field samples can be taxonomically assigned as they arrive at the laboratory using a correctly matched protein fingerprint, while unidentified samples can be identified with traditional methods and added as new entries into the growing self-curated reference database.

Conclusions
The present study used MALDI mass spectrometry as a tool to rapidly identify Neotropical specimens of adult hard ticks that had been preserved in ethanol for several years. Our algorithms were capable of identifying specimens from the 18 tick species evaluated, based on their protein spectra "fingerprint" with up to 94% cross-validation capability. This is the first report of the protein mass spectra from the leg for most of these Neotropical tick species. Large arthropod groups such as ticks are difficult to identify with currently available strategies from commercial vendors, forcing the user to lower the "quality" bar of a positive match to enhance the percentage of correct identification. Our MALDI/self-curated library approach, although still under development and serving as an auxiliary technique to traditional identification methods (and not necessarily replacing them), would reduce considerably the number of samples that would require morphological identification or DNA barcoding. This will reduce the time and cost needed to integrate these techniques in routine surveillance programs in Neotropical regions where tick diversity remains relatively uncharacterized.