Genetic features of Sri Lankan elephant, Elephas maximus maximus Linnaeus revealed by high throughput sequencing of mitogenome and ddRAD-seq

Elephas maximus maximus Linnaeus, the Sri Lankan subspecies is the largest and the darkest among Asian elephants. Patches of depigmented areas with no skin color on the ears, face, trunk, and belly morphologically differentiate it from the others. The elephant population in Sri Lanka is now limited to smaller areas and protected under Sri Lankan law. Despite its ecological and evolutionary importance, the relationship between Sri Lankan elephants and their phylogenetic position among Asian elephants remains controversial. While identifying genetic diversity is the key to any conservation and management strategies, limited data is currently available. To address such issues, we analyzed 24 elephants with known parental lineages with high throughput ddRAD-seq. The mitogenome suggested the coalescence time of the Sri Lankan elephant at ~0.2 million years, and sister to Myanmar elephants supporting the hypothesis of the movement of elephants in Eurasia. The ddRAD-seq approach identified 50,490 genome-wide SNPs among Sri Lankan elephants. The genetic diversity within Sri Lankan elephants assessed with identified SNPs suggests a geographical differentiation resulting in three main clusters; north-eastern, mid-latitude, and southern regions. Interestingly, though it was believed that elephants from the Sinharaja rainforest are of an isolated population, the ddRAD-based genetic analysis clustered it with the north-eastern elephants. The effect of habitat fragmentation on genetic diversity could be further assessed with more samples with specific SNPs identified in the current study.


Introduction
The Asian elephant (Elephas maximus Linnaeus, 1758) is an endangered species [1], and reported throughout South-East Asia, including Sumatra, Java, Borneo, and China, as well as suitable set of markers is an urgent need for both conservation and management of E. m. maximus specifically and elephants in general. To lay the foundation, here we describe the first de novo assembly and annotation of the mitogenome of the Sri Lankan elephant. We generated the first genome-wide SNP dataset for Asian elephants with ddRAD sequencing data from 24 diverse individuals of E. m. maximus. We also assessed the evolutionary relationships of Asian elephants using available sequencing data.

Ethical clearance
The study was conducted under the research permit number: WL/3/2/2017/1 issued by the Research Committee of the Department of Wildlife Conservation, Sri Lanka (DWC). We followed relevant guidelines and regulations for animal research approved by the committee. The DNA samples for sequencing were exported under CITES (Convention on International Trade in Endangered Species of Wild Fauna and Flora) permit (Security Stamp 1888968) obtained from the DWC.

Sample collection and DNA extraction
Twenty-four fresh whole blood samples were collected directly from both domestic and wild elephants with known origin and morphology (S1 Table, S1 Fig). Animals included both the sexes and tuskers (S1 Table). All the elephants are unrelated and born in widely distributed areas (Fig 1). Blood was collected from elephants after getting written informed consent from the Pinnawala Elephant Orphanage, following the ethical clearance guidelines. Elephant blood DNA was extracted using the Wizard Genomic DNA Purification kit (Promega, A1120) following the manufacturers' guidelines. DNA samples were quantified using the Nanodrop 2000 Spectrophotometer (Thermo Scientific) and assessed the quality with Nanodrop readings and running on a 1% agarose gel. The DNA samples with 230/280 and 260/280 values of about 1.80-2.00 and a clear single band on agarose gel were processed. The total DNA amount was adjusted to about 1-1.5 ug. For ddRAD sequencing, all 24 samples were sent to Admera Health Biopharma Services, New Jersey, USA.

Genomic library preparation
Isolated genomic DNA was quantified with Qubit 2.0 DNA HS Assay (ThermoFisher, Massachusetts, USA) and quality assessed by Tapestation genomic DNA Assay (Agilent Technologies, California, USA). The total DNA concentration was adjusted to about 1.00 ug and the quality values of DNA Integrity Number (DIN) above 9 were considered satisfactory. The sequencing libraries were prepared using the KAPA Hyper Prep kit (Roche, Basel, Switzerland) per the manufacturer's recommendations with Illumina1 8-nt dual-indices. The quantity of the final libraries was assessed by Qubit 2.0 (ThermoFisher, Massachusetts, USA), and quality was assessed by TapeStation D1000 ScreenTape (Agilent Technologies Inc., California, USA) where electopherograms and DIN values were taken as parameters. Libraries were sequenced on an Illumina HiSeq (Illumina, California, USA) with a read length configuration of 150 PE for 600 M PE reads per sample (300 M in each direction).

De novo assembly and annotation of Elephas maximus maximus mitochondrial genome
sufficient to completely assemble a mitochondrial genome, and we used BBMap reformat.sh [32] tool to extract~5GB amount of reads (45,499,065 read pairs) from the whole genome dataset. However, the de novo assembler wasn't successful in reconstructing the D-loop region of the circular genome. This region was still not resolved after using the multi-kmer mode recommended in Mitoz. The reason could be the inability of short reads to resolve and assemble such complex and repetitive regions [33]. Therefore, the whole-genome sequencing reads were mapped to the D-loop region of Elephas maximus mitochondrial genome (NC_005129) [34] using BWA MEM v0.7.17 [35], and the generated consensus sequence was merged into the assembly using the Geneious Prime 2020.1.2 (http://www.geneious.com/) software. After the manual curation of annotations, OrganellarGenomeDRAW v1.3.1 [36] was used to visualize the assembled genome. The circles are proportional to the ratio of elephants sampled. The number and location of samples used for the ddRAD analysis are given in S1 Table. https://doi.org/10.1371/journal.pone.0285572.g001

MtDNA divergence dating and nucleotide diversity analysis
We downloaded seven Asian elephant whole-genome datasets and two African elephant whole-genome datasets from NCBI SRA (National Center for Biotechnology Information-Sequence Read Archive). All whole-genome sequencing datasets (Table 1)  Optimizing ddRAD approach for Sri Lankan elephants ddRAD seq library preparation. Isolated genomic DNA was quantified with Qubit 2.0 DNA HS Assay (ThermoFisher, Massachusetts, USA) and quality assessed by Tapestation genomic DNA Assay (Agilent Technologies, California, USA The adapters were selected based on the restriction combinations. After the double digestion of the restriction enzymes, the  Table) in silico to digest the genome using the tool FRAGMATIC [48]. This tool estimates the number of cut sites for given fragment sizes. The restriction enzyme combinations included four base + eight base cutters, six base + four base cutters, and six base + six base cutters. The four base + eight base cutting enzyme pairs, and six base + four base enzyme pairs usually produce a higher number of fragments, which makes it difficult to control errors, and requires deep sequencing depth. Meanwhile, the six base + six base enzyme pairs generally produced 20000-40000 fragments, which makes it easy to obtain a sufficient number of fragments without sequencing to a higher sequencing depth. We estimated the number of fragments in the range of 200-400 bp for each enzyme combination using the available genomes, E. maximus [36], Loxodonta africana LoxAfr3 (Ensembl 101 release), Loxodonta africana Lox-Afr4 [39] and E. maximus preprint genome by the DNA Zoo team [49]. The SphI-EcoRI enzyme combination, which yielded relatively even and high fragment recovery across the genus was selected. We chose 300-450 bp fragment size, which has been widely used in previous studies [50,51]. ddRAD data analysis. We used Stacks [52], a well-established modular pipeline, to process and analyze ddRAD sequencing data. The sequencing service provider provided demultiplexed ddRAD data. Low quality reads were removed from these 24 RAD sequencing datasets, and RAD tags were rescued (parameters: -c -q -r-renz_1 sphI-renz_2 ecoRI) using Stacks' (v2.54) process_radtags tool. Then, we followed a reference-based approach to map pairedend ddRAD data to the chromosome-length Asian elephant draft assembly [49,53] using the bowtie2 short read aligner [54]. Stacks' gstacks program was used to assemble RAD loci from read-mapped BAM files and reads with mapping quality less than 20 were ignored when assembling loci (parameters:-min-mapq 20). The population program in Stacks was used to export results in Structure and Genpop formats while maximizing the number of polymorphic loci found in 80% of individuals (parameters:-structure-genepop -r 0.80-min-maf 0.05). This also provided summary statistics describing each RAD locus, such as haplotype diversity or haplotype richness at a RAD locus. A custom script designed (see code repository) calculated the length and the number of variant sites/SNPs of each RAD locus. If a variant is found in at least one sample, it counts towards a single variant site.
The Principal Component Analysis (PCA) is used to discern the underlying genetic diversity among the sampled individuals [55,56]. Here we used Genepop format haplotype output from Stack's population program for the PCA. ADEGENET [57] v2.1.4 R-package was used for PCA [58], and factoextra v1.0.7 [59] package was used for visualization.The population structure and admixture of Sri Lankan elephants was assessed using STRUCTURE v2.3.4 [60]. To speed up the execution, we used StrAuto v1.0 [61], a program used to parallelize the STRUCTURE execution. The STRUCTURE admixture model was run with a 50,000 burn-in period and 100,000 Markov chain Monte Carlo (MCMC) iterations for K = 1 to 10, with ten independent runs for each K. The Evanno method [62] implemented in STRUCTURE HAR-VESTER [63] was used to determine the potential K value. Across 10 independent runs, the mean and standard deviation of log likelihood of each K (LnP(K)) were calculated, and the Delta K [62] was derived based on the second order rate of change of mean log probability between successive K values. We used CLUMPP [64] to permute data from the 10 independent runs for each K to generate the final Q values for each individual. The results were graphically displayed using DISTRUCT [65,70], included in the CLUMPAK software [66], which also includes the CLUMPP. In addition to the STRUCTURE Bayesian hierarchical clustering, we used ADEGENT to perform K-means clustering and a discriminant analysis of principal components (DAPC) [67], a nonparametric approach employed without any predetermined genetic model, to investigate subdivision of genetic groups.

Mitochondrial genome assembly and divergence dating
The assembled mitochondrial genome of Sri Lankan elephant is 16,896 bp long with 38.7% of GC content. It encodes typical metazoan mtDNA genes, including 13 protein-coding genes (PCGs), 22 tRNA genes, two rRNA genes, and one non-coding control region (D-loop) (S3 Table). The gene order is identical to the E. maximus (NC_005129) [33] mitogenome.
The mitochondrial genome of the Sri Lankan elephant was analyzed in conjunction with previously published Asian elephants, African elephants, and fossil mitogenomes (Fig 2). The fossil calibrated Bayesian analysis of the basal divergence time for all elephant mitochondrial lineages was~6.25 million years (95% CI: 5-6.9 million years) and consistent with the previous estimates.
Elephas-Mammuthus clade has diverged approximately~5.9 million years (95% CI: 5-6.9 million years) ago. The estimated coalescence time of α and β clades in Asian elephants is~1.2 million years (95% CI: 0.75-1.5 million years). The coalescence time of the β clade is estimated at~0.4 million years, while the Sri Lanka elephant is estimated at~0.1 million years. Kadol the Sri Lankan elephant and Icky, the Myanmar-born elephant nested closely with elephants found in South India rather than the other India-born elephants. Moreover, Uno from North India has deeply diverged from the other Asian elephants.
DNA polymorphism analysis reveals a highly varying region within the ND1 mitochondrial gene of the Asian elephant species, and its nucleotide variability (pi) is higher than 0.015 (Fig 3). CYTB mitogenome also shows a higher nucleotide polymorphism than the rest of the genes in the mitochondrial genome.

ddRAD sequencing and SNP discovery
Choosing an enzyme combination that yields relatively even and high fragment recovery across the genus is a very import step in preparing ddRAD libraries. For the fragment size selection, a narrow range will provide more consistent library recovery and would require less sequencing effort. The selected enzyme combination for this analysis, SphI-EcoRI, within the fragment size of 200-400 bp yielded 733M reads in total for the 24 libraries. After filtering reads with low-quality bases, a total of over 724 million high-quality paired-end reads (98.67%) were retained across 24 animals. Of those, each sample had an alignment rate of >93% to the Asian elephant reference genome (Table 2) and, the coverage ranged from 42x to 231x. Initially, we identified 409,932 loci from all samples and kept 78,950 loci (19.26%) after the quality filtering steps. Among the remaining loci, 65,600 were polymorphic. Total 51,364 SNPs were identified from the 24 samples.

Population structure and cluster analysis
The PCA analysis provided the relationships within and among populations. We performed the PCA analysis on 23 ddRAD libraries (Table 2) excluding the sample B_MAD, which is of African origin (Fig 5). The first and second principal components accounted for 8.1% and 6.9% of the observed variation, respectively. Three distinctive clusters (Fig 4) were identified through this analysis. The first genetic cluster included the elephants that originated in North, Northwestern, North Central, and Eastern parts of Sri Lanka, while cluster 02 comprised the elephants born in Pinnawala. Cluster 03 included the individuals born in the Southern part of Sri Lanka (S1 Table). Table 3 consists of the STRUCTURE HARVESTER output in the STRUCTURE analysis. According to the Evanno method, most probable K is where the Delta K attains its peak. Therefore, we can infer the optimal value for K is 3 (Fig 5A). Furthermore, the manual for STRUCTURE recommends plotting the likelihood of K for each value of K, and the point at which the plot curvature plateaus, in this study K = 3 (Fig 5B), is a potential K value. In addition to the Bayesian clustering using STRUCTURE, we performed a discriminant analysis of principal components (DAPC). DAPC is a supervised multivariate statistical method that can provide an efficient description of genetic clusters with few synthetic variables. It produces synthetic variables maximizing differences between predefined sample groups while minimizing variation within groups. Assuming three clusters exist among the samples, individuals were assigned to clusters using the K-means algorithm (Table 4). These clusters were then explained with two discriminant functions and 20 remaining PCA axes. The DAPC analysis (Fig 7) confirms the population structure inferred from STRUCTURE analysis. The cluster assignments obtained using the K-means algorithm support the grouping of elephants identified from PCA (Fig 4).

Haplotype diversity and SNPs counts in loci
Stack's population program identified 67,432 assembled RAD loci with their calculated haplotype diversity and included 50,490 Genome-wide SNPs to analyze the population structure of Sri Lankan elephants. Higher haplotype diversity means its polymorphism is also higher ( Table 5). The number of SNPs also shows a similar variation with the haplotype diversity. The identified regions are between 250-450bp range and could be easily amplified with PCR.

Discussion
Modern, historical, and ancient mitogenomic data from all three elephant species Loxodonta africana, Elephas maximus and Mammuthus primigenius provided insights into the historical biogeography of the elephants [16,68]. The mitochondrial genome of the Sri Lankan tusker selected for the study was assembled de novo, and available references assisted in filling the gaps in the D-loop region. The phylogenetic analysis conducted to understand the evolution of Asian elephants included twelve mitochondrial genomes: two Loxodonta, one Mammathus, and nine Asian elephants, including the newly assembled Sri Lankan one. So far, the mt-D loop region and CYTB have been used to clarify the phylogeographic relationship and divergence of Asian elephants [14,16,69]. However, during our analysis we excluded the D-loop region because of its complexity. Our results support the deep bifurcation between African elephants, Asian elephants and Mammoths, previously been proposed based on mitochondrial genomes [39, 68,70]. The fossil calibrated Bayesian analysis also agrees with the previous estimates [14,71,72] of the basal divergence time for elephant mitochondrial lineages,~6.25 million years. Elephas-Mammuthus clade diverged approximately~5.9 million years (95% CI: 5-6.9 million years) ago. The α and β clades divergence from the Elephas is approximately 1.5 million years ago, which supports with the previous [73,74]. The β1 sub-clade is distributed primarily in Sri Lanka, and a few populations of α clade was recorded [14,16]. In the current study, the Sri Lankan elephant Kadol was nested with a Myanmar elephant named Icky. It was with an estimated coalescence time of~0.1 million years, possibly due to the historical trade of elephants as gifts between Sri Lanka and Myanmar. Moreover, other South Indian elephants namely Asha, Parvathy and Jayaprakash has placed sister to Kadol and Icky. This further suggests the existence of β1 subclade in Sri Lanka and the Southern part of India and even Myanmar.  Further, Moola and NC_005129 also clustered distantly with Icky, though all those samples are from Myanmar (β1 sub-clade). During warm periods, the Asian elephants of the β clade may have moved from Indian peninsular and Sri Lanka to the eastern and southeastern parts of Asia [13,75]. Vidya et al., in 2005 [20] recorded the predominance of α clade in the Southern part of the Sri Lanka. Kadol from mid-latitude points out the prevalence of β1 sub-clade in the central part of Sri Lanka. However, further studies are needed with a larger number of samples to explore the coexistence of α and β clades in Sri Lanka. This coexistence of both the α and β clades in Sri Lanka and the predominance of the α clade on the mainland explained by the direction of gene flow from the larger mainland population to the smaller Sri Lankan  population which occurred during more recent events of reconnection and was influenced by climate change and reflected genetic admixture of haplogroups in the Sri Lankan population [14,16]. Although the Borneo elephants belong to β1 sub-clade [5,12,76], in our analysis Chendra distantly clustered with the other β1 main sub-clade. This genetic difference of the Bornean elephant from other mainland Asian elephant subspecies makes it one of the highest priority populations for Asian elephant conservation [7,77]. The α clade elephants are primarily found in northern India, Nepal, across Myanmar, Thailand and Vietnam, and potentially dominant in Cambodia, while some found in Sri Lanka [5,13]. Accordingly, the North Indian elephant, Uno which clustered distantly from the other Asian elephants is probably from the α clade. The male elephants disperse from herds, mediating nuclear gene flow [78][79][80], and female elephants are matrilocal and remain with their natal herds [81,82]. As the mitochondrial genome is transmitted maternally, it is tied to the geographic range of the herd and sub structuring at mtDNA may be expected even in small populations [28]. Therefore, the data taking the full mitochondrial sequence (excluding D-loop) suggest the relatedness of the α and β clades. However, earlier phylogeographic and morphological analyses indicate that the Sri Lankan and Southern Indian elephants are not distinct enough to warrant classification as separate subspecies [13]. Including more genomic data from these regions would improve the phylogenetic analysis. Since previously utilized SSR regions have specificity issues, Asian elephant-specific makers are of critical need [30]. Interestingly, the CYTB and ND1 mitochondrial genes exhibit the highest nucleotide polymorphism with respect to other mitochondrial genes in our study. These two genes CYTB [13] and ND1 [6] have been also used to study the population genetic structure of Asian elephants. However, mt-D-loop region has been widely used to study the elephant population discrimination [5,13,14,16].
The above findings prompted us to investigate whether there is genetic diversity among elephants on the small island of Sri Lanka. Genetic diversity is the basis for evolutionary changes within a natural population. Maintaining or preserving high levels of genetic diversity is fundamental in conservation genetics [83]. Small isolated populations resulting from habitat fragmentation cause the loss of genetic diversity due to random genetic drifts and genetic bottlenecks [84]. It may limit the gene flow, causing significant consequences on the genetic diversity within isolated populations [84]. One of the best ways to achieve efficient conservation management of wild animals and to understand the phylogenetic relationships of the animals is to reveal the population structure and diversity of the animals. We employed ddRAD sequencing to examine the genetic diversity and population structure of Sri Lankan elephants from different geographical origins. Even though the total number of samples included in the analysis is low compared to the elephant population in Sri Lanka, it is a good representation. Good quality DNA is the key to the success of the ddRAD technology and obtaining blood from a large number of wild elephants is practically impossible. Therefore, we collected blood from the captive elephants having a pure wild origin from different geographical locations in Sri Lanka. The ddRAD-seq approach is an efficient and cost-effective methodology for SNP discovery. A total of 50,490 high-quality SNPs specific for Sri Lankan elephants; E m. maximus identified would be utilized for future ecological and evolutionary studies. These SNPs were evenly distributed throughout the genome, showing that ddRAD is an efficient next-generation sequencing method for genetic diversity studies. The SNPs identified can be further validated with more samples and used as an efficient base for the design of markers for the identification of Sri Lankan elephants.
Our results provide insights into elephant habitat fragmentation in Sri Lanka which may have led to different sub-populations. The genetic diversity within Sri Lankan elephants is higher, suggesting a significant differentiation at a geographical level in Sri Lanka resulting in three significant clusters; Northwest-Mahaweli, Northeast, and Southern. Habitat fragmentation may have overall positive effects on the genetic diversity of the Sri Lankan elephants. Although Fernando and colleagues [19] found the Sinharaja elephant morphologically different, genetically, it clusters with the north-eastern elephant cluster, suggesting Sinharaja population is contiguous with the dry zone population in North-east and east. The elephant population recorded in Sinharaja is limited to 3 elephants [19]. These elephants could be recent migrants to the Sinharaja rainforest, and morphological changes are adaptations to the thick rainforest. Including more samples from all identified populations will provide solid evidence for such a hypothesis. While collecting blood from wild elephants is very challenging, collecting dung is doable. Our findings indicate that the single population that has been distributed in the east-northeast and north central areas is not geographically or landscape segregated.
However, our recent work showed the challenges associated with utilizing existing SSR markers in elephant dung DNA work. As a result, using available genomic data with long reads followed by validation of a polymorphic set of SNPs would reduce the cost of such studies while producing high-quality data. The current analysis suggests possible genomic regions with a reasonable number of polymorphic SNPs. The size of the fragments is within the range of easy PCR amplification and Sanger sequencing and capillary electrophoresis techniques. Therefore, validation with more samples from Sri Lanka and other regions may lead to the identification of a unique set of SNPs for Sri Lanka elephants. The availability of such sensitive analytical methods will assist in overall conservation and management strategies when the Sri Lankan elephant population has fallen almost 65% since the 19th century.

Conclusions
The first de novo assembled mitogenome presented here may further improve with long-read sequencing approaches. Our data strongly suggests a genetically closer relationship between Sri Lankan elephants to Southern Indian elephants and Myanmar elephants being in β1 clades A total of 50,490 genome-wide SNPs were identified among 24 Sri Lankan elephants, and further analysis clustered them into three clusters suggesting considerable difference. A set of polymorphic SNPs were identified for further validation. The results suggest the importance of social organization and biogeographic barriers in shaping the distribution of genetic variation among Asian elephant populations in Sri Lanka. These variables could be further assessed by including more individuals in the analysis.