Next-Generation Sequencing for Rodent Barcoding: Species Identification from Fresh, Degraded and Environmental Samples

Rodentia is the most diverse order among mammals, with more than 2,000 species currently described. Most of the time, species assignation is so difficult based on morphological data solely that identifying rodents at the specific level corresponds to a real challenge. In this study, we compared the applicability of 100 bp mini-barcodes from cytochrome b and cytochrome c oxidase 1 genes to enable rodent species identification. Based on GenBank sequence datasets of 115 rodent species, a 136 bp fragment of cytochrome b was selected as the most discriminatory mini-barcode, and rodent universal primers surrounding this fragment were designed. The efficacy of this new molecular tool was assessed on 946 samples including rodent tissues, feces, museum samples and feces/pellets from predators known to ingest rodents. Utilizing next-generation sequencing technologies able to sequence mixes of DNA, 1,140 amplicons were tagged, multiplexed and sequenced together in one single 454 GS-FLX run. Our method was initially validated on a reference sample set including 265 clearly identified rodent tissues, corresponding to 103 different species. Following validation, 85.6% of 555 rodent samples from Europe, Asia and Africa whose species identity was unknown were able to be identified using the BLASTN program and GenBank reference sequences. In addition, our method proved effective even on degraded rodent DNA samples: 91.8% and 75.9% of samples from feces and museum specimens respectively were correctly identified. Finally, we succeeded in determining the diet of 66.7% of the investigated carnivores from their feces and 81.8% of owls from their pellets. Non-rodent species were also identified, suggesting that our method is sensitive enough to investigate complete predator diets. This study demonstrates how this molecular identification method combined with high-throughput sequencing can open new realms of possibilities in achieving fast, accurate and inexpensive species identification.


Introduction
Because species are the basic unit of many fields in biology [1,2], accurate species identification is an absolute prerequisite for studies focusing on agronomy [3,4] human health [5], conservation biology [2,6], ecology and evolution [7]. Whether pests, disease reservoirs or endangered species are considered, fast and accurate species identification is required for an ever increasing number of animal and plant samples [8].
Classical approaches to identification have traditionally been based on morphological criteria and/or morphometric analyses, often requiring the input of taxonomic experts. Unfortunately, there are too few taxonomic specialists available for the many research disciplines [9]. In addition, the enormity of biodiversity is often underestimated, and is continually threatened due to ongoing global change, therefore a comprehensive inventory appears to be an ever more urgent requirement [10]. Complicating the issue further, precise species identification based solely on morphological criteria can be extremely complex. Larval and/or immature stages can be morphologically very different from the imago or adult phase [3,11], sexual dimorphism can be extreme [12], and cryptic and/or sister-species nearly identical [13]. Finally, identification based on morphology alone is often impossible either due to poorly preserved specimens, or to the difficulties associated with identifying non-invasive samples such as feces, bones in bird's pellets, shed skin, etc., but also to incomplete or degraded museum specimens. Similarly, illegally traded products from endangered species are often processed to such an extent that they are useless for forensic investigations based on morphological criteria [8] [14].
For these reasons, species identification via molecular methods, such as molecular barcoding using a short genetic marker [15], is proposed to overcome some of the weaknesses of the traditional morphology-based taxonomic system [16]. These newer methods will aid non-taxonomists by fulfilling the urgent requirement for rapid and accurate species identification tools [16]. In addition, providing that DNA can be adequately extracted and amplified, these methods have the advantage of using only a portion of the specimen or non-invasive sample for accurate species identification.
In theory, in order to accurately discriminate between closely related species, suitable molecular identification markers should exhibit low intra-species genetic variability, but high inter-species variability. Ideally, a single ''universal'' genetic marker should be used to facilitate the rapid identification of any living organism. To this end, the international project Barcoding of Life (www.barcoding. si.edu) aims to generate a complete species identification catalogue for all animal kingdom organisms based on the mitochondrial (mt) gene of the cytochrome c oxidase I (COI). Unfortunately, most of these COI sequences are until now inaccessible. However, sequences of another mt gene, cytochrome b (cytb), are more abundant and freely available in public databases even if their quality is not always optimal [17]. This perhaps explains why this marker is most often used for species identification in vertebrates [18] and particularly for mammals [19].
The current protocol for molecular barcoding is based on PCR amplification of an mt marker, followed by ''classical'' Sanger sequencing. This robust approach is effective when applied to a few samples, but appears inefficient and expensive when scaled up to thousands of samples. Additional difficulties such as heteroplasmy (several mt genomes co-existing within the same cell [20]) or Numts (copies of mt DNA that are integrated into the nuclear genome, [21][22][23]) further frustrate the task of species identification. Similarly, DNA mixtures extracted from non-invasive samples (e.g. predator and prey DNA mixes from feces or bird's pellets; reviewed in [24]) also create problems for species identification without labor-intensive cloning.
Significant advances in high throughput Next Generation Sequencing (NGS) technology have allowed us to develop a novel barcoding method for the fast and accurate identification of wild rodent species. Such an innovative approach was recently proven effective in the correct species assignation of 255 insect specimens, corresponding to 17 different species of Ephemeroptera and Trichoptera [25]. We opted to utilize the 454 GS-FLX (Roche TM ) high-throughput sequencing system due to the following benefits. Firstly, read lengths are considerably longer at approximately 400 bp, compared to other NGS technologies (e.g. Illumina/HiSeq 2000, ,100 bp; Life Technologies/SOLiD 3, ,50 bp [26]) allowing complete sequencing reads for PCR products between 100 and 300 bp. To correctly identify an animal species, more than 100 bp are usually required, regardless of whether COI or cytb is used [27,28], while 200-250 bp corresponds to the maximum upper limit of markers able to target fragmented and/or degraded DNA [29,30]. In addition, the high number of sequences produced (i.e. 1,200,000 sequences per run) combined with a suitable tagging method allows the concomitant identification of hundreds or thousands of samples in a single run [31]. As such a tagging method consists of appending an additional 50 to 60 bp to both ends of the targeted fragment (30 bp for Titanium adaptors, ,10 bp for the tag sequence, ,20 bp for the PCR primer), the total read length could reach 200 bp. Unlike the classical Sanger method, the 454 technology includes an emulsion PCR (emPCR) prior to the pyrosequencing step [32]. This step isolates each DNA strand before sequencing, mimicking sequencing via cloning. This method is thus extremely well-suited for the analysis of DNA mixtures, in which ambiguous heteroplasmy cases and misleading Numt amplifications could be resolved, and where prey and predator sequences could be easily unraveled when investigating feces or bird's pellets.
We chose to focus our study on rodent identification, as rodents represent 40% of all mammalian species [33]. Significant difficulties are currently associated with the correct identification of rodent species, due to the many cryptic species [13,34,35], and their ever increasing numbers, as new genera and species are continually described (e.g. Laonastes aenigmamus, [36]; Saxatilomys paulinae, [37]; Mayermys germani, [38]; Tonkinomys daovantieni, [39]). Furthermore, rodents are one of the preferred subject for epidemiology, agronomy and ecology investigations, not only due to their existence as major hosts and vectors of human parasites and pathogens (reviewed in [40]), but also as major agricultural pests. Rodent species identification is often difficult using morphological criteria alone [13,34,35,41], while accurate identification is absolutely essential in such studies. Despite these difficulties, a clear picture of rodent taxonomy is nonetheless emerging, resulting in a reliable baseline reference against which a relevant molecular barcoding method can be developed.
In this study, we analyzed 946 rodent samples representing the breadth of rodent diversity, which included 820 tissue samples preserved in ethanol, 49 rodent feces, 54 rodent skins from museums, 12 feces from carnivores likely to have ingested rodents, and 11 bird's pellets containing bones of micro-mammals. Firstly, we designed a small DNA-barcode able to discriminate the largest number of rodent species. We then tested its efficacy on 265 reference samples corresponding to 103 rodent species. Finally, its applicability was successfully evaluated using delicate samples such as non-invasive and museum samples, demonstrating that this new method could open new realms of identification possibilities in many fields of biology.

Mini-barcode selection
We compared the efficiency of the two most common mt gene used to discriminate and identify rodent species: firstly, the Consortium for the Barcode of Life (CBOL) standard animal barcode, cytochrome c oxidase I (COI); and secondly, cytochrome b (cytb), the marker most commonly used to investigate mammal biosystematics [19]. GenBank searches using the keyword 'cytb' yielded 15,121 sequences corresponding to 1,476 rodent species, whereas 'COI' retrieved only 2,857 sequences corresponding to 503 rodent species. Furthermore, only 42% of these COI sequences were identified at the genus or species level, therefore the remaining sequences, labeled as 'Rodentia sp.', were not included for assessment. Only sequences of species available for both markers were selected and only one sequence per species was conserved. Sequences with ambiguous or incorrect species assignation were discarded, as well as incomplete or poor quality sequences (i.e. sequences with undetermined nucleic acids at numerous positions). As a result, a total of 115 rodent species sequences were available for both mt markers (see Fasta S1 for COI and Fasta S2 for cytb). All Fasta alignments (Fasta S1, S2, S3, S4, S5, S6, S7, S8, S9, S10, and S11) were deposited in the Dryad data repository (http://datadryad.org/; doi: 10.5061/ dryad.1j6v6).
Following [27], the predicted efficacy of successive 100 bp-long fragments for both markers were then assessed via three parameters: i) resolution percentage from neighbor-joining (NJ) analyses (%Res), ii) the mean pairwise genetic distance using the Kimura two-parameter substitution model (%K2P) and iii) mean variable sites percentage (%Var).
Both markers encompass several 100 bp mini-barcodes, which give 100% of resolution for the GenBank rodent dataset (Table 1). It means that using these mini-barcodes, all the 115 species of the GenBank dataset could be discriminated based on the NJ tree. Interestingly, some of these mini-barcodes were more variable than the entire gene (Table 1). For example, the cytb fragment located between positions 901 and 1000 exhibited 78.0% variable sites, while the whole cytb gene, 60.1%. This also held true for the COI marker, with 48.0%Var for mini-barcode 1-100, versus 43.1%Var for the entire gene. In general, cytb barcodes were more variable than COI barcodes (19.9-41.4%K2P versus 20.2-25.5%K2P, and 46.0-78.0%Var versus 42.0-48.0%Var). Of all analyzed mini-barcodes, the cytb barcode located between positions 701 to 800 appeared the most promising. In addition, it was surrounded by two regions of low variability (601-700 and 801-900), facilitating the design of rodent universal primers. Furthermore, owing to the fact that there are more cytb rodent sequences available in GenBank than those for COI, permits the use of GenBank sequences as a reference database. Accordingly, the cytb mini-barcode located between positions 701 to 800 was selected for all subsequent rodent species identification.

Primer design
To design universal rodent primers we used the initial cytb dataset extracted from GenBank (15,121 sequences), but removed poor quality or Numt sequences, resulting in 9,071 sequences, which corresponded to approximately 1,063 rodent species (Fasta S3).
Primers were designed to target the most conserved parts of the gene surrounding the selected mini-barcode. Several nucleotide positions (often at the third nucleotide) were degenerated in order to allow hybridization to multiple rodent species DNA templates. Due to primer design constraints, the final mini-barcode corresponded to a fragment slightly longer than 100 bp (136 bp), but still small enough to target fragmented/degraded DNA, located between positions 666 (L15411) to 801 (H15546) of the cytb gene (Fasta S3).
The forward and reverse primers used are as follows: L15411F 59-GAY AAA RTY CCV TTY CAY CC-39 and H15546R 59-AAR TAY CAY TCD GGY TTR AT-39 respectively. To allow sample assignment, primers were modified into fusion tagged primers following [31]: a seven bp tag was added to each primer at its 59 extremity, as well as a 30 bp adaptor for the 454 Titanium sequencing reagent series, resulting in a final amplicon size of 250 bp.

PCR, 454 pyrosequencing and SESAME software analysis
The applicability of our molecular mini-barcode was investigated in a wide range of samples (Table S1). PCRs were performed on 820 ethanol-preserved tissues (265 reference samples and 555 non-identified samples), 49 rodent feces, 54 museum skins, 12 carnivore feces and 11 owl's pellets, all corresponding to at least 180 different rodent species. Forty-seven ethanol-preserved samples were randomly selected for duplicate analysis, while amplifications performed on non-invasive and museum samples were systematically duplicated. In order to estimate the 454 pyrosequencing error rate, two clonal mt fragments (plasmids containing PCR products) were amplified in 24 independent PCRs. Overall, 1,140 PCR attempts were realized: 1,093 yielded positive amplifications, 13 gave weak amplifications and 34 failed. Most of the PCR failures concerned samples with degraded DNA: museum skins (19 failures), owl's pellets (4) and feces from rodents (6) or carnivores (1).
Following 454 GS-FLX Titanium pyrosequencing, 197,650 reads were obtained. Sequences differing by at least one base pair substitution or by indels were called ''variants'' [31]. Among the resulting sequences, we distinguished ''artefactual variants'' (variants arising from PCR, emPCR or pyrosequencing errors) and ''true variants'' (variants that were retained following our validation procedure, see Methods), which will hereafter be called ''haplotypes''. A total of 114,293 reads corresponding to 16,439 distinct variants were subsequently assigned to 1,103 samples via the forward and reverse tag combination using the software SESAME [42]. There were 104 mean reads per sample, although more than 20 reads were obtained for 98.4% of samples, and more than 50 for 89.1%. Artefactual variants were sorted and discarded manually based on the alignments generated for each sample in the SESAME software.

Accuracy and quality assessment of the 454 reads
The 454 pyrosequencing technology is known to produce a significant proportion of sequencing errors, therefore it is necessary to be able to identify and discard these artefactual variants [31].
In order to accurately estimate the error rates, internal controls were included. They consist in 24 PCR products performed on clonal cytb fragment of two different rodent species, and whose genuine sequences were obtained by classical Sanger sequencing method. The 6,109 pyrosequencing reads were compared to the genuine sequences to calculate the percentage of erroneous reads, and to also validate our selection procedure and our ability to discriminate between artefactual variants and haplotypes.
The percentage of reads with at least one substitution (Sub), insertion (Ins) or deletion (Del) were estimated. Based on the 16 PCR amplicons obtained from a single clone (eight PCRs per clone), we determined that 6561% and 6467% (mean 6 S.D.) of the 454 sequences were perfectly identical to the Sanger sequences. Errors rates were similar for both clones, (and also between the different PCRs from the same clone): Sub = 462% and 663%; Ins = 2465% and 2766%; Del = 662% and 361% for the two clones respectively.
To estimate the percentage of recombinant chimeric reads (Chim) the two clones were pooled together and eight independent PCRs were performed, resulting in 5867% of 454 reads that were strictly identical to the Sanger sequences, with the remaining error rates: Sub = 462%; Ins = 2365%; Del = 763%. Chimeras were easily identified as spurious sequences derived from the mixed clonal template, with an estimated 864% chimeric reads.
As expected [43], the most common sequencing errors were insertions, likely due to homopolymer stretches, and which were detected in more than 20% of the reads generated for our internal controls. Nevertheless, even if large numbers of artefactual variants were generated, they were easily identified and removed from the analysis due to their frame-shifting effect on the coding sequence.
Clonal variants detected at the highest frequency during 454 pyrosequencing were selected as the true variants, which always corresponded to the genuine sequences obtained via Sanger sequencing. Consequently, we applied the same rationale to determine the haplotype for subsequent rodent samples. Confirmation was obtained when 46 tissue samples were processed in duplicate, with subsequent independent haplotype selection yielding exactly the same haplotype. Altogether these findings demonstrate that this protocol ensures accurate haplotype selection. The validation procedure was slightly different when analyzing non-invasive samples (see Methods), as extracted DNA likely corresponded to several different organisms.

DNA barcoding of reference samples
The ability of our mini-barcode to identify rodents at the species level was first tested on a reference sample comprising 265 rodent individuals corresponding to 103 species from Asia, Africa and Europe (Table S1 and Fasta S4). The exact species of each sample had previously been clearly established using molecular and/or morphological techniques. Representatives of several rodent genera known to be difficult to discriminate at a species level using morphological characteristics alone, such as Mastomys [34,44], Microtus [45] or Gerbillus [13], were included in the reference dataset, as well as recently diverged species such as members of the Rattus rattus complex [46].
In the maximum likelihood (ML) phylogenetic tree obtained from the analysis of the 136 bp barcode (Figure 1), representatives of the same species constituted monophyletic groups, which were supported by high bootstrap values (Bp.80%). Even closely related species were clearly distinguished (see for example the distinction between species of the genus Rattus, Myodes, Microtus and Gerbillus). A few species were however poorly identified as monophyletic (Bp,80) and were thus not distinguishable from their sister species. In all cases, these results either corresponded to groups whose taxonomy has not been officially confirmed, for example, the species status of Microtus obscurus in China, Acomys johannis in Mali, or Gerbillus representatives in Morocco remain controversial [47,48]; or to species which have recently diverged, such as Microtus arvalis versus M. obscurus.
The comparison of pairwise K2P genetic distances within and between species shows a gap centered on 10% (Figure 2). The mean intra-specific distance reached 2.8% (S.D. 4.3%) while the mean inter-specific distance was 32.7% (S.D. 8.1%). In accordance with the ML tree, some closely related species or sibling species displayed very low K2P distances. This was the case for Microtus arvalis and M. obscurus, which diverged at only one or two nucleotides within our mini-barcode (K2P = 0.9%). However M. obscurus is sometimes considered as an isolated lineage of M. arvalis rather than as a valid independent species [49]. Similarly, Rattus sakeratensis, R. lineage R3 and R. tanezumi exhibit small distances (1.8-5.4%) corresponding to between two to seven substitutions. These findings are in accordance with recent phylogenetic studies which estimate that they likely diverged less than one million years ago [46], questioning the distinct species status of Rattus R3 [41]. Indeed, if speciation events had occurred recently, not enough time has elapsed for mutations to become fixed within the species. Consequently, whichever small DNA marker is considered, it would be extremely difficult to discriminate between very closely related species.
Accidental amplification of Numts occurred in most species from the genera Arvicola, Apodemus, Gerbillus, Microtus, Myodes and Praomys similar to that reported in the literature (Microtus, [50], Arvicolinae, [51], Myodes gapperi, [52], Apodemus sylvaticus, [23]). In these cases, Numts did not hinder species identification because: i) they were amplified at a lower frequency than the multicopy mt marker and ii) they were easily identified as Numts owing to the large number of Apodemus and Microtus species sequences deposited in GenBank.
However, if Numts were only recently incorporated into the nuclear genome, frameshift mutations are unlikely to have become fixed. Recent Numts could be problematic if not documented in the literature, and could become difficult to distinguish from genuine cytb sequences [53]. Nonetheless, in such cases, Numts are generally species specific, and could therefore also potentially be used as species markers to confirm identification obtained with the mt mini-barcode [54].

DNA barcoding of unknown samples
To experimentally validate our mini-barcode, we aimed to identify the species of 555 samples, for which there was no previous identification (Table 1 and Fasta S5). PCR amplification failed for only three samples. Depositing these three DNAs on an agarose gel suggested that the DNA concentration for these samples was very high and probably inhibited the PCR reaction. Of these 555 samples, 85.6% were identified using the BLASTN program and GenBank rodent reference sequences (Table S1). The mean number of substitutions within species was estimated to 3.5% from our reference rodent dataset. Consequently, to assign a species to an input sample sequence, we fixed a threshold of 96% BLAST identity (with 99-100% of coverage).
The success of species identification was highly variable. It depended on the amount of knowledge available on local rodent fauna, which in turn differed according to the various geographic areas that were investigated. For example, the identification success rate was 100% for European samples (N = 104, Maximal Identity: Max Ident $99%). The taxonomy of the European rodent fauna is today well known and largely documented (e.g. [55]), numerous molecular studies have been performed on European species (e.g. [45,56,57]) and sequences have been deposited in GenBank. Similarly, rodents from the Indochinese region (N = 269) have recently been subjected to intensive phylogenetic studies [41,46,[58][59][60][61][62]  Consequently rodent samples from these areas displayed high identification success rates (Indochinese samples = 94.7%, with Max Ident $98%, rates for West Africa = 100% with a Max Ident $99%). In contrast, samples from the Sundaic region (N = 21) and from East Africa (N = 95), where rodent faunas have remained relatively unexplored, resulted in lower identification success rates (61.9% and 40.0% respectively, with Max Ident $96%). These observations underline the absolute necessity of a robust reference dataset including accurate taxonomic differentiation, for this barcoding approach to be applicable.

DNA barcoding of degraded rodent samples
DNA extracted from non-invasive samples is often low in quantity and of poor quality [68]. It is thus subject to contamination by exogenous DNA and can often be co-extracted with Taq polymerase inhibitors [68,69]. Problems encountered when working on museum specimens are similar to those faced when dealing with non-invasive samples, but are clearly exacerbated [70]. Post-mortem DNA degradation (e.g. depurination, deamination) is known to lead to artefactual substitutions during PCR [71,72]. Researchers working on degraded DNA often circumvented this problem by cloning and sequencing independent PCR amplicons [70], a procedure which is time-consuming and expensive. Consequently, our rodent species identification method appears to be well suited for degraded substrates since it targets a small DNA barcode (136 bp), and avoids a long cloning process. To prove this hypothesis, our rodent species identification method was thus evaluated on rodent feces or museum samples.
Rodent feces. Trapping rodents requires complicated logistics (trap maintenance and transport), and staff specially trained in rodent handling. These requirements are often difficult to meet during field trips. In addition, some rodent species are elusive or are resistant to trapping, despite local abundance (e.g. Arvicanthis ansorgei, Gerbillus henleyi, Rattus norvegicus [73]) or are endangered protected species (e.g. Laonastes aenigmamus [36]). In such situations, collecting non-invasive samples such as feces appears as an attractive alternative. Our approach was tested on 49 rodent feces collected in traps from West Africa (N = 11) and Asia (N = 38) (Table S1). In this fashion, it was possible to compare sequences obtained from both the feces and tissue from the same animal. In addition, amplifications were performed twice per faecal samples. Indeed, under an extreme scenario, the amplification could start from a single template molecule because of the low DNA quantity of the faecal extract. If this template was chemically modified by DNA decays, artefactual mutations due to nucleotidic misincorporations during the PCR reaction could lead to an inaccurate sequence.
Of the 49 fecal samples, duplicate PCRs failed on three samples, and for another provided a spurious avian haplotype, however 45 fecal samples had identical replicate results, and sequences were indistinguishable from those obtained from the animal's corresponding tissue. Anecdotally, some feces samples from black rats in Mali gave additional divergent haplotypes corresponding to birds (Table S1, and Fasta S6). Black rats are known to be an opportunistic species, reported to feed on small animals and carrion, neatly corresponding with our findings.
Rodent museum samples. Taxonomic investigations require repeated comparisons between modern and museum specimens (often the species holotype). Such comparisons could lead to inconclusive identification when the holotype is poorly preserved, or when intra-species variation is so large that the holotype appears morphologically distinct from the novel specimen requiring identification. For example, the wide range of intraspecies morphological variation makes such criteria unsuitable for accurate rat identification, resulting in over-described species with confusing taxonomy, hampered by an overabundance of synonyms. In this case, sequences obtained from museum holotype specimens have been used to refine the taxonomy of the tribe Rattini [41]. However, such holotype specimens are extremely precious. Therefore, for our preliminary tests, our identification method was applied using less valuable museum specimens.
We tested our approach on 54 DNA extractions kindly provided by the Museum National d'Histoire Naturelle (MNHN) of Paris. These DNAs had been extracted in 2006 from museum samples prepared between 1958 and 1991. Based on morphological characteristics, they were assigned to five genera (i.e. Hybomys, Hylomyscus, Lemniscomys, Lophuromys and Praomys), however, their actual species remained uncertain (Table S2). Previous attempts of  Table 2. Predator and prey haplotype occurrence in both carnivore feces and bird's pellets.   Table S2 and Fasta S7 for details). These three species corresponded to taxa currently housed in the museum, therefore the discrepancies could be due to museum labeling error. However, these DNA samples had never previously yielded positive PCR amplifications. As potential higher quality contaminants can be preferentially amplified when working with degraded templates, our results are likely due to contamination by exogenous rodent DNA during the extraction procedure. Sequences obtained for the remaining two positive samples were identified as human (Table S2 and Fasta S8) and were definitely due to contamination. In addition to the most abundant haplotype, several other haplotypes were also validated for a high proportion of the museum samples (83.3% of the sampling harbour between 2 to 11 haplotypes, with a mean average of three haplotypes per sample, Table S2 and Fasta S9). These 'additional' haplotypes represented 5.6% of the validated reads, of which 1.8% corresponded to African rodent genera not included in our study, and which are so completely morphologically different, it is unlikely they could have been confused with the museum specimens in question. In addition, 1.8% of reads corresponded to other non-African rodents, and also to shrews, bats, ungulates, primates, carnivores, birds and reptiles. The remaining 2.0% corresponded to human haplotypes/Numts. These haplotypes were found at very low frequencies in most of the museum samples investigated. Consequently, plentiful evidence exists to suggest contamination during the DNA extraction process.

Environmental barcoding: molecular diet analysis of micro-mammal predators
We assessed the suitability of our approach in determining the diet of rodent predators, as rodents are an important link in the food chain for many wild species [74]. Species identification of rodent remains in carnivore feces and bird's pellets is thus a central issue to ecological studies on food chains and webs, prey-predator relationships or competition between predators. A potentially innovative application could be the use of owl pellets in enabling a faunal census of those elusive or difficult to trap rodents [75,76].
We analyzed 12 carnivore feces (mustelids, felids and canids) and 11 owl's pellets, all collected in France. Based on external characteristics, feces were a priori identified as those of wild cats (Felis silvestris, n = 4), pine martens (Martes martes, n = 4) and red foxes (Vulpes vulpes, n = 4). Four samples were identified as barn owl pellets (Tyto alba) and seven remained undetermined. All of these species are known to prey on rodents. From 23 PCRs, 20 resulted in replicated positive amplifications. One positive result could not be replicated while two owl's pellets never yielded positive amplification ( Table 2). With respect to feces samples, 88.7% of the 1,205 validated reads were assigned to the defecator, 7.5% corresponded to ingested rodents (Fasta S6 for prey and Fasta S10 for predators). On the contrary, for bird's pellets, 95.4% of the 1,481 validated reads could be identified as rodents while only 0.3% as owls (the barn owl, Tyto alba and the long-eared owl, Asio otus). For both feces and pellets, we observed excellent result repeatability in the two replicates for each sample: identifications were obtained in 78.6% and 77.3% of both replicates for feces and pellets respectively. Molecular identification confirmed the preliminary carnivore species identification based on external sample characteristics, except for two samples: feces supposed to have originated from pine martens were revealed to be those of red foxes.
Based on 454 sequencing, wild cats were demonstrated to ingest water voles (Arvicola scherman) and common voles (Microtus arvalis). The pine marten's diet was composed of bank voles (Myodes glareolus) but also of non-rodent species such as the crowned shrew (Sorex coronatus), the common blackbird (Turdus merula) and the earthworm (Lumbricus terrestris). As blackbirds and earthworms were both found in one single feces sample, they could represent a case of secondary predation (i.e. an earthworm ingested by a blackbird, which was in turn ingested by a marten) [77]. Red fox feces included three rodent species (A. scherman, M. arvalis et M. glareolus) but also an avian species such as the tree pipit (Anthus trivalis) and earthworms (Lumbricus terrestris). Anecdotally, a single read of wild board (Sus scrofa) was detected in feces from a red fox and could correspond to scavenging. Proteobacteria (Stenotrophomonas maltophilia with Max Ident = 94%) was also identified in a red fox feces sample (Fasta S11).

Conclusions and Perspectives
Our study reports the successful design of a 136 bp cytb minibarcode which accurately assigns individual rodent species. The applicability of our mini-barcode for species identification relied on its ability to discriminate between intra-species and interspecies levels despite its short length. As shown by the ML tree including 265 reference samples (103 species), intra-species clades are easily distinguished (Bp.80%) and appear clearly distinct from closely related species. Mean inter-species genetic distance (K2P = 32.768.1%) was higher by one order of magnitude than the mean intra-species distance (2.864.3%). Exceptions to this rule mainly concerned groups whose taxonomy is still unclear, or experiencing recent or ongoing speciation (e.g. Microtus spp., Gerbillus spp., Rattus spp.).
Altogether, hundreds of ethanol-preserved samples representing more than 180 different species from Europe, Asia and Africa were amplified, with a very low failure rate (only three samples -0.4%). Assignation of rodent species using BLASTN and GenBank reference sequences with a stringent criterion (Max Ident $96%), was achieved for 85.6% of samples. However the assignation success rate depended on the availability of both local fauna taxonomic knowledge, and publicly accessible molecular data, which is illustrated by the contrasting results for samples originating from Europe, Indochina or West Africa compared to those issuing from the Sunda or East Africa). In addition, exact assignation depended upon the accuracy of the taxonomic identification associated with GenBank sequences.
The combination of a mini-barcode, 454 pyrosequencing technology and the tagging method developed in [31] allowed reliable, fast and inexpensive species identification for a large set of samples in one quarter of a run (1,140 samples multiplexed in this study). This protocol results in a mean coverage of 1046 per sample. However, coverage for good quality DNA samples could probably be reduced to 206 without altering the taxonomic assignation rate. Consequently, up to 5,000 samples could be identified using our method, considerably decreasing the costs of molecular identification per sample for large-scale studies (less than three euros for 1,140 samples; less than one euro for 5,000 samples). In addition, a novel inexpensive NGS method, Ion Torrent technology from Life Technologies, has recently produced reads well over 200 nucleotides long with error rates similar to those observed in 454 runs. In future, our procedures could be combined with this novel sequencer, considerably reducing the cost of individual identification.
This barcoding approach relies on the clonal sequencing of a short multicopy DNA fragment and thus appears suitable for studies based on low quantity degraded or fragmented DNA. Encouraging results were obtained with feces samples, such that species identification was achieved for 94% of the samples, and comparison with sequences obtained from fresh material collected from the same rodent demonstrated 100% identity. Results obtained on museum samples were also fruitful despite the lack of stringent ancient DNA extraction procedures. Genuine sequences were obtained even if contaminating sequences were also detected. These results reveal the high sensitivity of the method when dealing with scarce DNA.
In addition, novel degenerated primers combined with nonstringent PCR conditions and clonal sequencing, facilitated the investigation of mixed DNA samples. Other molecular identification methods such as DNA arrays or those based on speciesspecific primers require a certain anticipation of the result. Microand macro-arrays rely on the hybridization of short specific nucleotide probes to target organism DNA with subsequent detection of the hybridization signal [18]. Consequently, these methods require prior knowledge of which different species could potentially be encountered, and therefore cannot detect unexpected, unknown or newly described species [27]. Moreover, due to reduced genetic similarity, undiscovered haplotypes or geographic variants would fail to properly anneal to the array probes. Species-specific primers are usually designed for a limited species set. An unknown species could mimic the pattern expected for a well-known species and once again, would not be correctly identified [34]. Our method is able to overcome the vast majority of these problems.
Finally, identification performed on carnivore feces and owl's pellets highlighted the enormous potential of our approach for use in ecological studies. Identified rodent sequences were consistent with typical prey species ingested by the carnivores and owls in the sampled area. Based on these preliminary results, it is still difficult to confirm if all prey species present in the feces or pellets were actually detected. However, results obtained from museum specimens tend to indicate that even DNA present in tiny amounts (such as contaminations) are likely to be detected with our method, but has yet to be confirmed in animal diet investigations. In future assays, it is likely that increasing sequence coverage and numbers of PCR replicates will enhance the probability of detecting less frequent prey. Our experiments also suggest previously unsuspected applications of our method in the field of ecology. A significant number of sequences were assigned to animals other than rodents, such as mammals, birds, reptiles but also invertebrates. Consequently our method could be suited to establishing the diet of animals which feed on prey other than rodents, as well as determining which predator species had produced the feces or pellet sample. Our approach was proven to be successful in determining unique carnivore species via their scat (cats, foxes or martens). In two cases, species assignments previously attributed by assessing the external aspect of the feces were proved erroneous and were subsequently corrected (marten vs. fox). Determining the owl species was less effective, probably due to the cleaning method performed before DNA extraction and which favors the selection of bones in the pellet. Further studies testing our approach with controlled samples, such as feces and pellets originating from captive animals with a pre-determined diet, or feces and pellets with previously identified contents via standard morphological approaches, would be of excellent use in refining this method.

Mini-barcode selection
Available rodent sequences for both COI and cytb markers were extracted from GenBank, with only one sequence per species selected. Sequences were aligned by eye using BioEdit [78] and then partitioned into 100 bp mini-barcodes from the 59 extremity of the gene.
For each marker and mini barcode, the NJ resolution percentage (%Res), mean pairwise genetic distance with the K2P model of substitution (%K2P) and the variable sites mean percentage (%Var) were all independently. To calculate the %Res, NJ trees were constructed with the K2P model of substitution and uniform rate of variation among sites. 1,000 bootstrap replicates were performed. All positions containing missing data were eliminated (complete deletion option). %K2P were computed with the same options. All the evolutionary analyses were conducted in MEGA5 [79].

Sample selection
Four types of samples were selected in this study. Firstly, 265 high quality tissue samples preserved in ethanol were used as references. They had been collected in Asia, Africa and Europe from rodents that were unambiguously identified at the species level by specialists, based on either morphological characters or molecular data (www.ceropath.org; www.bdrss.ird.fr/ bdrsspub_form.php; [55]). Specimens were selected in order to maximize the number of species and various geographic locations. The total reference sample comprised 103 species, 38 genera and 8 families. In addition, this reference set included closely related species and cryptic species that were only recently described (e.g. species of the Rattus rattus complex, Microtus complex, Gerbillus complex, etc.).
Secondly, 555 samples preserved in ethanol but with uncertain taxonomic status were selected. Feces found in the traps were collected at the same time in order to compare results obtained using high quality DNA (from tissue) or poor quality DNA (from non-invasive samples). Tissue and fecal samples were obtained for 11 and 38 rodents from Mali and Thailand respectively.
Thirdly, 54 DNAs were extracted from museum specimens (skins) and kindly provided by the MNHN of Paris.
And finally, feces from predators that were likely to have ingested rodents were collected. Feces thought to have originated from four foxes (Vulpes vulpes), four martens (Martes martes) and four wild cats (Felis silvestris), as well as 11 owl pellets were analyzed to determine their rodent diets. Sample information is detailed in Table S1.

DNA extraction
DNA from the 820 tissue samples preserved in ethanol was extracted using the DNeasy Tissue Kit (Qiagen) following the manufacturer's recommendations. Non-invasive samples (feces and bird's pellets) were handled in a different area of the laboratory to prevent contamination from high quality DNA samples. DNA was extracted using the QIAamp DNA Stool Kit (Qiagen), following the protocol designed for the isolation of DNA from human stools.
DNA extraction from museum specimens was attempted between 2006 and 2009 in the MNHN laboratory using the CTAB protocol [80]. However, unfortunately, it was not achieved following the ancient DNA standards [81].

Primer design and PCR optimization
Alignments of 9,071 cytb GenBank sequences corresponding to 1,063 rodent species were performed with BioEdit 7.0.9 [78]. Rodent universal primers were designed in order to amplify the cytb fragment identified as the best mini-barcode for rodent species identification.
Following our recent tagging and multiplexing method for 454 pyrosequencing [31], primers were modified in-fusion tagged primers by adding a short 7 bp sequence (the tag) and 30 bp Titanium adaptors to the 59 ends of L15411F (59-CCATCTCATCCCTGCGTGTCTCCGACTCAGNNNNNN-NGAYAAARTYCCVTTYCAYCC-39) and H15546R (59-CCTATCCCCTGTGTGCCTTGGCAGTCTCAGNNNNNN-NAARTAYCAYTCDGGYTTRAT-39). These adaptors were required for the emPCR and 454 GS-FLX pyrosequencing using Lib-L Titanium Series reagents. Each tag differed from the others by at least three substitutions to avoid misassignment of samples (see Table S1). We designed 36 and 32 different tags for the forward and the reverse primers, respectively. This allowed the generation of 1,152 putative unique combinations of forward and reverse tags and thus the ability to tag 1,152 different amplicons. Samples were processed in 96-well plates and cytb amplicons were individually tagged according to these primer combinations, as described in [31].
PCR amplifications using DNA extracted from tissue or noninvasive/museum samples were performed in independent facilities and at different times. PCRs were carried out in a 10 mL reaction volume using 5 mL of 26 QIAGEN Multiplex Kit Buffer (Qiagen) and 0,5 mM of each primer. One mL of tissue sample DNA (i.e. approximately 30 ng) or 2 mL of feces, pellets or museum sample DNA, was added to each well. The PCR started by an initial denaturation step of 95uC for 15 min, followed by 40 cycles of denaturation at 94uC for 30 s, annealing at 45uC for 45 s and extension at 72uC for 30 s followed by a final extension step at 72uC for 10 min. PCR amplifications from feces, pellets or museum DNA were performed in duplicate. To ensure method reproducibility, 47 tissue samples were also analyzed in duplicate.
PCR, emPCR and pyrosequencing-induced substitution and indel errors were assessed by comparing to two internal controls (i.e. clones of known sequences). These controls corresponded to classical Sanger sequencing of purified cytb clonal sequences from Rattus argentiventer and Mus cervicolor samples (accession number HM217362 and JQ685755 respectively). Using a clonal sequence as an internal control ensured that any differences observed between pyrosequencing reads and the reference sequence were likely to have been generated during the 454 process (Numt coamplification, artefactual mutations due to DNA chemical degradations, or cytb amplification in a heteroplasmic individual could be discarded). To estimate precisely the error rate, both clones were independently amplified eight times. In addition, to assess chimera production rates, the two clones were pooled together and eight independent PCRs were performed on the mix. Consequently, chimeras were easily identified as spurious sequences derived from the double clone templates.
Amplicon pooling and 454 GS-FLX Titanium pyrosequencing PCR products (3 mL) were verified on 1.5% agarose gels and the positive reactions were pooled in equal proportions. An initial mix was generated for each PCR plate: 4 mL of efficiently amplified PCR products or 7 mL of less efficiently amplified products were pooled together. These mixes were once again verified on 1.5% agarose gels prior to generating a final mix to obtain a single ''super-pool''. To achieve this, 10 mL per pooled PCR from tissues samples, and 20 mL per pooled PCR from non-invasive or museum samples were mixed together.
The ''super-pool'' was then processed by Beckman Coulter Genomics (Danvers, Massachusetts). To eliminate putative nonspecific PCR products, the pool was run on a microfluidic electrophoresis Pippin Prep (Sage Science) and fragments of the expected 250 bp size were selected. Following emPCR, amplicons were sequenced on a 454 Genome Sequencer FLX (Roche) in one quarter of a Titanium picotiter plate.

SESAME software analysis
The SESAME software (SEquence Sorter & AMplicon Explorer) [42] ver. 1.1B was used to sort the sequences (i.e. individual assignment and removal of artefactual variants due to sequencing errors during PCR, emPCR and 454 sequencing). Utilizing the tag combinations, sequences were assigned to the sample from which the PCR amplicon was obtained. When PCR was performed on rodent tissue or fecal samples, generally one single high frequency variant was detected and was consequently considered to be the valid sequence. Other variants were also found at very low frequencies and were considered as artefactual reads generated during PCR, emPCR and pyrosequecing steps (see [43,82] for a details). This rationale was corroborated by results obtained with the internal controls. Occasionally, when PCR was performed on tissue samples, additional medium-high frequency variants were detected. In these cases, Numt amplification was suspected, and was periodically validated by comparing to GenBank Numt sequences.
When PCR was performed on fecal samples from predators or bird's pellets, divergent variants were detected, therefore the variant found at the highest frequency among each of these clusters was selected as the true variant. Similarly, several rodent species were often ingested by predators, and several haplotypes were expected to be generated via our molecular identification method. Furthermore, when PCR was performed on museum samples, the same rationale was applied: several divergent variants were found and were selected for subsequent identification steps. Indeed, contaminations by exogenous DNAs were expected, since an appropriate ancient DNA extraction standard procedure was not used.
Each selected variant was then compared to sequences available in GenBank, EMBL, DDBJ and PDB (nr database) using the NCBI BLASTN program [83]. When the maximal identity (Max Ident) reached 96% (with 99-100% of coverage), the best GenBank match species was selected as the species identification (see Results for justification of the threshold).
Concomitantly to the publication of this study, an improved version of the SESAME software called |SE|S|AM|E| BAR-CODE was released [84]. This new automated procedure for species identification building a reference library (e.g. GenBank) should considerably facilitate this task.

DNA barcoding of rodent reference samples
A reference sample comprising 265 rodent individuals from 103 rodent species were used to generate 265 mini-barcode sequences. To assess the reliability of the mini-barcode to discriminate between closely related rodent species, a Maximum Likelihood analysis was performed on this reference dataset using RAxML 7.0.4 [85]. As model choice is limited in RAxML, the general time-reversible (GTR) + C model (option -m GTRGAMMA) was selected for the cytb dataset [86,87]. Tree robustness was assessed using the rapid bootstrap procedure (option -f a) with 1,000 replications (option -# numberOfRuns) [88].

Supporting Information
Table S1 Information about the study samples used. ''Specimen identification'' corresponds to the prior species identification based on morphological data and/or molecular tools for the reference samples. Other specimens were identified based on external criteria. The suffix ''bis'' in Individual code indicates replicates. BLAST statistics are italicized when Max Identity of the mini-barcode with the sequences available in GenBank ,96% (the species corresponding to the query was not documented in GenBank). (XLS) Fasta S1 Alignment of COI sequences from 115 rodent species from GenBank.