Comparative population genomic analysis uncovers novel genomic footprints and genes associated with small body size in Chinese pony

Body size is considered as one of the most fundamental properties of an organism. Due to intensive breeding and artificial selection throughout the domestication history, horses exhibit striking variations for heights at withers and body sizes. Debao pony (DBP), a famous Chinese horse, is known for its small body size and lives in Guangxi mountains of southern China. In this study, we employed comparative population genomics to study the genetic basis underlying the small body size of DBP breed based on the whole genome sequencing data. To detect genomic signatures of positive selection, we applied three methods based on population comparison, fixation index (FST), cross population composite likelihood ratio (XP-CLR) and nucleotide diversity (θπ), and further analyzed the results to find genomic regions under selection for body size-related traits. A number of protein-coding genes in windows with the top 1% values of FST (367 genes), XP-CLR (681 genes), and log2 (θπ ratio) (332 genes) were identified. The most significant signal of positive selection was mapped to the NELL1 gene, probably underlies the body size and development traits, and may also have been selected for short stature in the DBP population. In addition, some other loci on different chromosomes were identified to be potentially involved in the development of body size. Results of our study identified some positively selected genes across the horse genome, which are possibly involved in body size traits. These novel candidate genes may be useful targets for clarifying our understanding of the molecular basis of body size and as such they should be of great interest for future research into the genetic architecture of relevant traits in horse breeding program.


Background
For thousands of years, enormous variety of domestic breeds with different morphological, physiological and behavioral characters have been domesticated and raised in different parts of the world. They therefore offer a powerful source of biological models for the biology studies and have played significant role in developmental, evolutionary and biomedical research [1][2][3].
Due to intensive breeding and artificial selection throughout the domestication history, horses exhibit striking variation of height at withers and body size. Today, body size is one of the most important criteria for the evaluation and classification of different breeds, and also is an essential parameter for breeding programmers to improve marketability and performance [4]. According to this criterion, most horse breeds can be divided into three main categories including high stature/heavy horses (draft breeds), light horse (riding breeds) and small stature/low weight (pony breeds) [5,6].
The pony breed is defined as a group of horses with a common height less than 14.2 hands (147 cm) at the withers [7]. There are several pony breeds in the world, though varying in body size, skin color and geographic origins, all of them share some basic traits that make them different from other horse breeds. Debao pony (DBP), a famous Chinese horse breed, is known for its small body size and lives in Guangxi mountains of southern China. This breed is strictly protected by the Chinese government, therefore it has the largest population compared with other local pony breeds. DBPs exhibit peculiar morphoanatomical adaptation to facilitate work in the mountainous regions, for example their average adult height is around 94 cm and 98 cm for male and females, respectively [8,9].
Although previous studies reported a few candidate genes such as LCORL, HMGA2, ZFAT, NCAPG [6,10], TBX3 [9] and ANKRD1 [11] with major effects on height and body size variations in several horse breeds, these studies were mostly carried out using SNP genotyping involving Illumina EquineSNP50 Genotyping Beadchip or GeneSeek Equine SNP70 Beadchip, which certainly have some limitation due to factors such as low SNP density in many genomic regions, differential probe affinity as well as ascertainment biases that prevent them from detecting novel variation in entire genome [12].
In the present study, whole genome sequencing (WGS) data were used for comparative population genomics to identify the genetic basis underlying the small stature of DBP.

Sequencing and read alignment
Individual genomes of 17 DBPs were sequenced on an Illumina Hiseq 2000 platform with a read length of 125 bp. To facilitate comparisons with other horse breeds, the sequenced horse data were jointly analyzed with publicly available WGS data of 15 different breeds (n = 69) (Additional file 1: Table S1 and Additional file 2: Figure S1). The mean sequence depth was~13.5X per sample (Additional file 2: Figure S2). We detected a total of 18,384,176 SNPs in all individuals (Additional file 1: Table S2).
Phylogenetic analysis, runs of homozygosity and linkage disequilibrium decay We firstly performed different classical analyses including phylogenetic tree, principal component analysis (PCA), Bayesian model-based analysis and haplotype population structure based on chromosome painting from ChromoPainter and fineSTRUCTURE analyses to reveal genetic relationships among different horses (Figs. 1 and 2a). Based on the phylogenetic tree results (Fig. 1a), all DBPs were separated from Middle East, European and American horse breeds. The topological pattern found in the tree was also supported by both PCA (Fig. 1b) and Bayesian model-based analysis ( Fig. 1c and Additional file 2: Figure S3). The results of painting algorithm from ChromoPainter and fineSTRUCTURE analyses simplified their relationships into a co-ancestry matrix, which presents expected number of chunks from donors (column) to recipients (row), and visualized as a heat map plot (Fig. 2a).
In addition, runs of homozygosity (ROH) along the whole genome have been applied to quantify individual autozygosity to improve the understanding of inbreeding depression in populations. The mean number of ROHs longer than 100 and 1000 Kb per each population is plotted in Fig. 2b. Also, to investigate the effects caused by selection and genetic bottleneck on population history and demography, we determined the scale of linkage disequilibrium (LD) decay for each breed. The average level of LD, measured by r 2 , between adjacent single-nucleotide polymorphism (SNPs) across the complete genome was estimated (Fig. 2c).

Scans for signatures of selection
Based on the results obtained from phylogenetic analysis and to avoid bias from unequal sample sizes, we compared the whole genome of DBP individuals, as a ponysized (genetically homogeneous) breed with Thoroughbred (THB) horses, as a large well-defined (genetically homogeneous) breed, to identify signatures of positive selection related to body size traits. Here, we examined three different parameters for a greater statistical power to localize the source of selection signals. Population differentiation (F ST ) was calculated for each SNP between DBP and THB horses as described in Weir and Cockerham [13]. A sliding window analysis was employed with 50 kb window size and 25 kb step size. A total of 367 protein-coding genes were identified in the top 1% windows with high F ST values (Additional file 1: Table S3). To get a clearer insight into the genetic mechanisms related to these candidate genes, further downstream analyses were conducted. Among the candidate genes, five (NELL1, FGFR1, SNTG1, BMP2 and TBX15) are involved in body size related traits (Table 1). Gene set enrichment analysis (GSEA) identified some significantly enriched categories related with skeletal development, such as "broad hallux" (HP:0010055), "broad phalanges of the hand" (HP:0009768) and "broad toe" (HP: 0001837) (Additional file 1: Table S4). We then, using a log 2 ratio of θπ between DBP and THB samples, identified genomic regions that may have been under selection in DBPs. A sliding window analysis yielded 332 candidate genes (Additional file 1: Table S5). For these extreme values, some enriched gene ontology (GO) terms were found to be related with skeletal development categories, including "aplasia/hypoplasia of the tibia" (HP: 0005772) and "Short femur" (HP:0003097) (Additional file 1: Table S6). In addition, we found some sizerelated genes such as, NELL1, FGFR1 and CNN3 (Table 1). Finally, we adopted a cross-population composite likelihood ratio test (XP-CLR) to evaluate historical selections based on the comparison of allele frequency spectrum. A total of 681 candidate genes were identified in the top percentiles of approach (1% cutoff) (Additional file 1: Table S7). The results of functional enrichment analysis from all these genes showed some categories related with both muscle and skeletal development such as "small hand" (HP:0200055), "increased body weight" (HP:0004324), "large forehead" (HP: 0002003), "animal organ development" (GO:0048513), "anatomical structure development" (GO:0048856) and "tissue development" (GO:0009888) ( Table 1; Additional file 1: Table S8).
In addition, to confirm these results, when we compared the genome of DBP, used as a test population, with all other horse studied breeds (mixed-breed and purebred) used as a reference population, we also observed that NELL1 signals in the high values (windows with the top 1% values of both F ST and XP-CLR methods, Additional file 2: Figure S4).

Discussion
Before to the application of methods for detecting signatures of selection, we first assessed the phylogeny of the horse breeds to evaluate the phylogenetic position of DBP within the species. In agreement with previous studies, DBP was phylogenetically distant from other horse breeds while being relatively close to the Mongolian horse (MNG) [8,9]. The distance patterns between horse breeds was also identified by PCA. However, THB and DBP had the greatest distance from each other, Hanoverian (HNV) and Holsteiner (HLS) breeds were not clearly distinguishable in either the phylogenetic tree and PCA, indicating the possibility of shared genetic components between these two German Warmblood breeds [41]. Similar to the results from PCA, admixture analysis at K = 2 separated the DBP, Yakutian (YKT), MNG and THB horses from other populations, while at K = 3 the Standardbred (STB) horses separate from the remaining horse populations. Consistent with previous studies, we found that the DBP and THB populations are genetically homogeneous for all K values (K = 2 to K = 8) [4,9].
In addition, LD decay analysis revealed a markedly lower level of LD across all genomic distances in DBPs than other breeds. The high LD in commercial breeds, especially for THB horses, could be a consequence of artificial selection for specific abilities, e.g. racing performance, in the breeding programs [42] while DBPs clearly have an ancient origin following a long-term natural selection. Also, discovering the ROH per each breed showed the lower level of ROH in the DBP population compared to other horses. Here, we found high level of ROH for THB population, that is concordant with previous study in the six different horse breeds [43].

Potential independent of positive selection in the DBP population
Because of the small body size of DBP, that is notably less than average horse breeds, we focused specifically on the loci that may play more important roles in the rapid evolution of body size during the domestication process. Here, we used comparative genome analysis between DBP and THB breeds to identifying the genetic basis underlying the size variation among DBPs. In our broad spectrum analysis, several previously reported genes were found to be probably involved in body size related traits. Highly significant candidate genes related to these traits are listed in Table 1. Within the regions showing extremely high values (top 0.01), both F ST and XP-CLR methods showed BMP2 gene shared overlapped selection signatures as positively selected genes (PSGs) ( Fig. 3a and b). BMP2, a bone formation-related gene, was found as one of the candidates on ECA5. This protein belongs to the TGF-β superfamily, which has diverse biological activities related to bone physiology and metabolism [22,23]. Previous studies have found associations of the BMP-2 variants with bone and cardiac development, bone mineral density, as well as body size Fig. 2 a Finestructure. A heat map of a co-ancestry matrix generated by chromosome painting with fineSTRUCTURE. The color of each cell represents the expected number of 'chunks' imported from a donor genome (column) to a recipient genome (row). b Runs of homozygosity (ROH). c Linkage disequilibrium (LD) decay traits [24,25]. Also in human, BMP-2 appears to be the most important BMP affecting the adult skeleton [44].
Another possible candidate gene, FGFR1, was found in one of the selection regions on ECA27 (top 1% cutoff for Fst and log2 θπ ratio values) (Fig. 3a). FGFR1 is an important candidate gene that influences bone growth and skeletal development. Previous studies found that FGFR1 protein plays a critical role in formation of muscle and bone tissues [17][18][19][20]. Considering the important function of FGFR1 in skeletal development, this gene is an important candidate for body size variation in mammalians.
Results from the detection of selection signatures revealed consistently high signal values in F ST and XP-CLR analyses as well as log2 θπ ratio for NELL1 gene, which is overlapped among candidate PSGs (Figs. 3a, b and 4). NELL1, encodes a mammalian cell-signaling protein (protein kinase C-b1, PKC-b1) that has been shown to regulate skeletal ossification [45,46]. Overexpression of this gene in both human and mice induces craniosynostosis, the premature fusion of cranial sutures [14]. Previous studies have shown that absence of Nell1 leads to decreased cell differentiation and cell proliferation in several organs such as heart, bone and cartilage tissue [14,15]. Recently, an interstitial 11p14.1-p15.3 deletion involving the Nell-1 gene was also reported in associated with short stature in children [16]. Moreover, it was demonstrated that the NELL-1 has potential roles as a bone-forming growth factor in sheep [47]. Body size is recognized as one of the most fundamental properties of an organism, affecting nearly all biological aspects. In the last decades, new insights from the genetic and physiological studies have refined our understanding of genetic basis of body size, as the target of positive selection in human and domesticated animals. Human body size is a polygenic trait affected by variants of numerous genes and their interactions with environmental factors. For example, hundreds of genetic variants, in at least 180 loci, with small effects, have impact on final human adult height [48]. In contrast, several independent studies in domesticated animals have shown that changes in body size can be controlled by a few genes with large effects. For instance, it has been demonstrated that one specific haplotype defined by 20 SNPs spanning the recent selection sweep covering IGF-1 gene has a major effect on body size within all small dogs [49]. A similar study has shown that one SNP within the strong linkage region of BMP10 gene explained around 22% of the overall body weight variance in five chicken lines [2]. Also, one study on dairy and beef cattle revealed the variation in the average height can be controlled by only 10 genes in eight genomic regions [50].
Based on the standard additive model, Makvandi-Nejad et al. [6] identified four loci on the ECA3, 6, 9 and 11 that explained 83% of size variance in 48 horses, three each of eight large and eight small horse breeds. Using the same dataset of these 48 horses, a recent GWAS study involving both dominant and recessive mixedmodel approaches as well as a genome-wide scan for signatures of selection based on the F ST genetic differentiation and XP-CLR test, ANKRD1 gene was identified and validated as a novel candidate, explaining 7.98% of the genetic variance in body size of the American Miniature horse (AMH). Compared with the fixed status of all four loci identified by Makvandi-Nejad et al. [6], ANKRD1 gene could be applied in effective genotypeassisted selection for body size in AMH [11]. In other independent studies, the differential SNPs in LCORL gene on ECA3 [10,[51][52][53][54], ZFAT gene on ECA9 [51], TBX3 gene on ECA8 [9] and LASP1 gene on ECA11 [55] have also been shown to be strongly associated with body size traits in horses.
In this study, we have investigated the genetic basis underlying the body size variation in DBP. In our broad spectrum of analyses by three methods, we did not find any significant selection signal within or near genes which were previously identified as horse body sizerelated candidates. Instead, we observe that NELL1 gene likely played an important role in the evolution of the small stature of DBP, an ancient small pony that was evolved in the mountainous areas in southwestern China. In addition, some other loci on different chromosomes were also identified to be potentially involved in body development process. No evidence of directional selection for the detected genes in this study has been reported to date in other horse populations, suggesting that these genes have been probably selected independently for the short stature of DBPs.

Conclusions
In this study, using next-generation sequencing analysis, we identified some novel candidate genes under selection for body size traits in DBP population. This results suggest that the imaging a common evolutionary mechanism that influences patterns of genetic variation in all horse breeds is misconception as many of them were adapted for different environments and/or for various goals. These novel candidate genes may be useful target for clarifying our understanding of the molecular basis of body size and they should be of great interest for future research addressing the genetic architecture of relevant traits in horse breeding program.

Sample collection and sequencing
In this study, all 17 horse blood samples were collected from private farms in Debao county, Baise city, Guangxi province in south of China. The experimental animals were not anesthetized or euthanized in order to conduct this study. No horse individuals died in this study and all individuals stayed healthy after collecting blood samples. Genomic DNA was extracted using the phenolchloroform method. Pair-end sequence data for all DBPs were generated using the Illumina Hiseq 2000. Also, previously published genome sequence data from 15 other horse breeds (n = 69) were obtained from the Sequence Read Archive (ncbi.nlm.nih.gov) (Additional file 1: Table  S1 and Additional file 2: Figure S1).
The sample size for our experiment was calculated based on Ma et al., guidelines [56]. Ma et al., (2015) showed that a reasonable power to detect selection signatures is achieved with high marker density (> 1 SNP/ kb) as obtained from sequencing, while rather small sample sizes (~15 diploid individuals) appear to be sufficient. The sample size of 86 animals used in our experiment has a power of at least 80% to detect the genomic signature of selection using different approaches.

Alignments and variant identification
High-quality reads in the present study and published data were aligned against the horse reference genome, ENSEMBL (version 94) (ftp://ftp.ensembl.org/pub/release-94/fasta/equus_caballus/dna/), using Burrows-Wheeler Aligner (BWA) (https://github.com/lh3/bwa) [57]. Binary alignment map (BAM) files were imported into SAMtools [58] for sorting/merging and into Picard tools (https://broadinstitute.github.io/picard/, latest release 2.18.26) to mark duplicated reads. We then improved alignment accuracy by minimize such mismatching bases using RealignerTargetCreator, IndelRealigner, and BaseRecalibrator functions in the Genome Analysis Toolkit (GATK) 4.0.12.0 [59]. The variant calling of sequence data were handled using UnifiedGenotyper and VariantFiltration tools in GATK. The final sequencing variants were filtered to be supported by a minimum mapping quality of 25 and a minimum genotype quality of 40. Furthermore, all loci with > 2 alleles and within clusters (> 3 SNP in a 10-bp window) were removed from later analyses.

Population variation and population genetic analyses
The phylogenetic tree for the individual genome sequences was constructed from the SNP data by using the neighbor-joining method among the horses. The PCA was performed using the software GCTA [60], after pruning the SNPs for short-range LD by PLINK software [61]. In order to visualize population structure between different horse breeds, ADMIX-TURE [62] was used with an ancestor population size ranging from 2 to 8 and 10,000 iterations for each run, based on the pruned data. In addition, we used a haplotype-based approach to explore patterns of haplotype sharing using ChromoPainter and fineS-TRUCTURE [63]. The decay of LD measured as the squared correlation coefficient by pairwise physical distance for each breed. ROHs were also identified via the "Runs of Homozygosity program", implemented in PLINK software. For each population, we identified all ROHs of length longer than 100 and 1000 kb distances [61].

Identification of candidate genes under positive selection in the DBP
Three different statistic methods were used to identify regions of the genome that have the highest signals of selective sweeps in DBPs. To characterize genetic differentiation between DBP and THB horses, pairwise F ST values for each SNP were calculated using Weir and Cockerham method [13]. The θπ values were calculated for both DBP and THB populations. Sliding window analyses were performed upon a window size of 50 kb and a step size of 25 kb across of the genome. The average F ST and log2 (θπ DBP/θπ THB) values of SNPs in each window were calculated. We additionally performed XP-CLR test to detect regions in the genome where the change in allele frequency at the locus almost-occurred recently. The scores were estimated using the code provided by Hua Chen (Department of Genetics, Harvard Medical School) [64]. We used the following options: sliding windows of 0.1 cM, grid size 10 k, maximum number of SNPs within each window as 300, and correlation value for 2 SNPs weighted with a cutoff of 0.99. Here, DBPs was taken to be the object population and THB horses chosen as the reference population. The regions with the XP-CLR values in the top 0.01 of the empirical distribution were designated as signal of positive selection.

Gene set enrichment and pathway analysis
To search the possible pathways involved in the regions with top values (F ST , log2 θπ ratio and XP-CLR), candidate selective regions were annotated using the Variant Effect Predictor available at (http://asia.ensembl.org/ info/docs/tools/vep/index.html). Functional enrichment analysis was done by using the 'g:Profiler' enrichment analysis tool, to uncover their biological functions [65]. And the P-value of the gene enrichment was corrected by Benjamini-Hochberg FDR (false discovery rate).
Additional file 1: Table S1. Sample information for each horse (86 individuals) used in this study. Additional file 2: Figure S1. Horse breeds used in this study. The figure was designed by the first author. Figure S2. Genome sequence depth for each horse in this study. Figure S3. Cross validation error (CV) plot from ADMIXTURE. Figure S4