PLN Gene Analysis in Horses: Multiway Approach for the Investigation and Validation of Molecular Variation

,

Horses (Equus caballus) are commonly known as superior athletes with several adaptations in relation to physical effort. Although among mammalians there are animals with a broader maximal oxygen uptake (V O max ) relative to their body mass, and with a higher maximum speed (LINDSTEDT et al. 1991), equines possess unique adaptations across their entire bodies to maximise the effectiveness of a workload, which allows them to consume more oxygen per kilogram of body mass than most other large mammals (YOUNG 2003). Consequently, horses have one of the highest ratios of muscle mass to total body weight when compared to other mammals (KEARNS et al. 2002). In horses, evolution has affected the growth and development of the muscles, heart and spleen as an adaptation to long-lasting endurance, and is an effect of further specialisation resulting from the different horse utility types (BRYAN et al. 2017;POOLE 2004;GUNN 1989;RIVERO & HILL 2016). The skeletal muscle is the main organ that is adapted to effort during exercise. In athletic horses, the muscle mass can account for about half of the whole body weight, and reaches up to about 60% in Thoroughbred racehorses (GUNN 1979;WOOD 1996;KEARNS et al. 2002). Despite the significant muscle mass present in horses, it is not directly related to the athletic ability, because the energetic capacity of the muscles exceeds the capacity of the cardiovascular system to deliver oxygen, and the muscle fibre composition is also one of the key factors determining an animal's performance ability. The distribution and size of three types of fibre -fast-twitch glycolytic, fast-twitch oxidative glycolytic and slow-twitch oxidative -differ across various muscles, and the characteristics of horse breeds are strongly related to the physiological adaptation to effort (RIVERO 2007;STULL 1980). Moreover, according to the usage of aerobic or anaerobic metabolism depending on the fibre type in response to exercise, the ratio of an individual's fibre composition affects metabolic homeostasis (ZIERATH & HAWLEY 2004;BAYLOR & HOLLINGWORTH 2012). Considering all the adaptations to physical effort, the horse is an excellent animal model for studying an organism's response to exercise at many levels. Furthermore, the genetics of equine performance is one of the most intensively explored subjects in recent studies.
Since the release of the first reference horse genome assembly (WADE et al. 2009) with a size of approximately 2.7 x 10 9 bp and with roughly 20,000 proteincoding genes, a great effort has been made to improve the annotation and consequently develop a more precise determination of the genetic background of equine phenotypes. This wouldn't be possible without some technological achievements in the field of DNA sequencing (SANGER et al. 1977), while the use of sequentially, fluorescent labelled terminators to detect on gel electrophoresis (PROBER et al. 1987) was further improved by a sequencer with capillary systems. However, the real milestone was the development of a high-throughput technology. It has led to revelations regarding the complexity of the genome architecture, increased the number of bases sequenced within a period of time and decreased the cost per mega base pair, which has led to the NGS (Next Generation Sequencing) technology becoming a gold standard for genomic studies (GOODWIN et al. 2016).
Despite the major accuracy of transcript variants in equine protein-coding genes annotation, with a high degree of certainty resulting from the combination of manual curation, transcript data and proteomics evidence, it is still necessary to validate the results generated via an NGS approach. The completion of the reference genome has influenced the genomic results and is particularly important for the explanation of genotype-phenotype interactions. Although there is a significant number of supporting tools that can help with variant functional impact explanations, there are no established standards in the animal industry specifying the usability of genetic information. Therefore, this problem was recently raised by equine genetic researchers, leading to the establishment of the "Consensus Statement on the Translation and Application of Genomics in the Equine Industry", which requires rigorous standards for the discovery and validation of variants (PETERSEN & COLEMAN 2020).
Therefore, the main purpose of this study was the molecular characterisation of an exemplary gene -the phospholamban (PLN) gene -with the use of information obtained by whole genome sequencing (WGS) and complete transcriptome sequencing data, supported by highly accurate methods as Sanger sequencing and PCR-RFLP genotyping. The frequency of alleles on the example of the PLN gene was established across breeds that represented various features, including racing warmblood breeds such as the Arabian and Thoroughbred, as well as a Polish coldblood breed. Another group was the Hucul horse and the Polish Konik, which are primitive, non-racing horses.

Materials and Methods
The experimental protocol was approved by the Animal Care and Use Committee of the Institute of Pharmacology, Polish Academy of Sciences, in Kraków (No. 1173.

WGS data analysis -SNP detection
The whole-genome data obtained from 6 Arabian horses was used to identify genetic variants within the equine PLN gene. The data used for the analysis was part of a previous research project in which 62 Gb per sample was obtained after whole-genome sequencing (2 x 150 bp; NovaSeq 6000; Illumina). At first, the raw data was checked for its quality using FastQC. Then, the adapter content was removed, as well as reads of a low quality (Phred quality<20) and too short reads (minimal read length was set to 50 bp) with the use of Trimmomatic software. The mapping was done with the use of BWA software, by applying a bwa-mem algorithm to the EquCab3.0 genome. Duplicates were then marked using the MarkDuplicates function of Picard Tools, and the variant calling procedure was performed using Freebayes software. Filtering of the variants was done by applying the SnpSift and VCFtools software (minimum coverage per sample was set to 10; minimum quality was set to 30; minimum combined coverage was set to 60; and minimum combined quality was set to 180).

RNA-seq -data analysis -SNP detection
The PLN locus was screened to find single nucleotide polymorphisms within the coding sequence and UTR regions, based on an mRNA sequence obtained by the RNA-seq method in previous research including muscle tissue analysis (ROPKA-MOLIK et al. 2017). The data was obtained from 15 whole transcriptomes representing purebred Arabian horses.
The quality of the raw RNA-seq data was checked using FastQC. The adapter sequences were then removed, as well as low quality (Phred quality <20) and too short reads (minimal read length was set to 50) using Trimmomatic software. The mapping was performed using TopHat software based on the EquCab2.0 and 3.0 genome releases. The duplicates were then marked with the use of the MarkDuplicates function of Picard Tools, and the variant calling procedure was maintained with the use of Freebayes software. Filtering of the variants was done with the use of SnpSift and VCFtools software (minimum coverage per sample was set to 10; minimum quality was set to 30; minimum combined coverage was set to 60; and minimum combined quality was set to 180). Moreover, the obtained mapping files in a bam format were screened in IGV 2.3 software.

Genome localisation of analysed polymorphism within PLN gene
Due to the fact that the PLN SNP was identified based on RNA-seq data, which should include only coding sequences and UTR regions based on those annotated in the current database (EquCab3.0) as a downstream gene variant, we decided to perform a PLN transcript analysis. The previous version of the PLN gene transcript (EquCab2.0) was compared with the current annotation (EquCab3.0), and significant discrepancies were observed. According to the Equ-Cab2.0 reference, the PLN transcript contained both 5' and 3' UTR regions, and was 163 bp longer than the same transcript in the current genome version (159 bp).
The primers that amplified all mRNA sequences of the PLN gene annotated in the EquCab2.0 (322 bp) were designed using Primer3.0 (v.0.4.0) and based on the ENSECAT00000003713.1 sequence (F: CGCACATTTGGCTGTCAGCTTT and R: CTGTTATATGGTATTGTTTTCC). The PCR amplicon was obtained using the AmpliTaq Gold™ 360 Master Mix (Thermo Fisher Scientific) for 8 cDNA samples synthesised based on muscle tissue collected from the 8 horses included in the previous experiment and stored at -80 o C (ROPKA-MOLIK et al. 2017). cDNA was synthesised from 300 ng of the total RNA of the muscle tissue using a High-Capacity RNA-to-cDNA Kit, according to the protocol (Thermo Fisher Scientific). Next, the PCR products were sequenced in both forward and reverse strands by the Sanger method using the BigDye™ Terminator v3.1 Cycle Sequencing Kit, BigDye XTermina-tor™ Purification Kit, POP-7™ Polymer and 3500xL Genetic Analysers (Applied Biosystems, Thermo Fisher Scientific). The obtained sequences were then analysed using the Genetic Analyser Data Collection Software (Applied Biosystems).

SNP confirmation and genotyping
A total of 868 horses were genotyped in the study as follows: Arabian horses (n=424), Polish Koniks (n=177), Hucul horses (n=110), Polish coldbloods (n=61), Thoroughbreds (n=96). The hair follicle or the whole blood from all the horses came from the same collections of the reference horse database at the Laboratory of Molecular Genetics, Department of Animal Molecular Biology, as well as the Biological Material Bank of the National Research Institute of Animal Production and the Department of Horse Breeding, Institute of Animal Science, according to the University of Agriculture in Krakow requirements (Ethical Agreement No. 1173. The genomic DNA was isolated using Sherlock AX (A&A Biotechnology, Gdynia, Poland) according to the manufacturer's recommendations and was checked by the NanoDrop 2000 spectrophotometer (Thermo Fisher Scientific).
The SNP presence was confirmed by Sanger sequencing, as well as by the PCR-RFLP method. The studied fragment of the PLN gene was amplified using primers designed using Primer3.0 (v.0.4.0) and based on the ENSECAG00000036379 reference genome. (Table 1). The amplicons were purified by a EPPiC reagent, according to the protocol (A&A Biotechnology, Poland). The obtained products were sequenced by the Sanger method using the BigDye™ Terminator v3.1 Cycle Sequencing Kit, BigDye XTerminator™ Purification Kit, POP-7™ Polymer and 3500xL Genetic Analysers (Applied Biosystems, Thermo Fisher Scientific).
For the genotyping, the PCR-RFLP method was designed. The PCR amplicons obtained as described above were restriction digested using FatI enzymes (BioLabs). Next, the DNA fragments were separated on 4% agarose electrophoresis (Table 1). For the genotype distribution within each breed, the Hardy-Weinberg equilibrium was checked, while the distribution of genotypes between the breeds was calculated using the Chi-squared test.

Identified SNPs within the PLN locus
The analysis of the whole-genome sequencing data allowed 14 single nucleotide polymorphisms to be identified, from which 7 were assigned as upstream and the next 7 as downstream PLN gene variants (Table 2). All the detected SNPs were previously reported in the dbSNP database (NCBI). The screening of the mRNA sequence of the PLN gene using RNA-seq data based on the EquCab2.0 reference allowed for the identification of only one polymorphism localised within the 3' UTR region -rs1146603334. The same analysis based on the current reference genome (Equ-Cab3.0) did not show the presence of any variant. According to the EquCab3.0 genome, the rs1146603334 polymorphism was assigned as a downstream PLN gene variant.

Sequence analysis of the PLN gene transcript
The amplification of the mRNA sequence allowed 322 bp of PCR product to be obtained, which after the Sanger sequencing was completely overlapped by the PLN transcript annotated in the EquCab2.0 reference genome (Figure 1). The identified PLN splice variant contained a sequence assigned to the 3' and 5' UTR regions, that was missing in the current genome version. Furthermore, in 5 of the horse samples, an analysed (rs1146603334) polymorphism was detected, which additionally confirmed that the SNP must be localised in the 3' UTR region of the PLN gene.

The PLN g.67928993C>T (rs1146603334) frequency
With regard to the allele percentages, in the analysed horse groups dominated by the CT genotypethe highest rates of 72.1% and 51.6% were detected in the Polish coldblood and the Arabian breeds, respectively (Table 3). For all the other analysed populations, the frequency was around 40%. Also, the lowest CC genotype frequency was observed in the Polish coldblood and Polish Konik breeds (about 3%). Moreover, in these breeds, the highest number of TT animals (over 50%) was found. In other cases, the TT genotype frequency fluctuated at around 30%and was lowest in the Hucul horses at 17.2%. The Hucul horses and Polish Koniks had a similar percentage share of at least two genotypes. The Polish coldbloods seemed to be the most homogenous, with a CT rate over 70% (Figure 2), which was not consistent with the Hardy-Weinberg equilibrium (Table 2). In turn, for both the Thoroughbred and Arabian horses, a similar genotype distribution was observed. For the (rs1146603334) PLN polymorphism, no statistical differences were observed only between the Arabian and Thoroughbred breeds (Table 4).

Discussion
The rigorous standards applied to the validation results obtained via NGS prevented an acceleration of false-positive reports of causality. Within our study, we used data obtained by NGS for the whole genome and transcriptome of the Arabian horse, and validated it with the use of Sanger sequencing and the PCR-RFLP and RT-PCR method. The search for the genetic background of the elite performance in horses expressed in the acquisition of biomarkers led us to a further investigation of the PLN gene encoding phospholamban. That protein is a micro peptide, 52-amino acid integral membrane protein, that phosphorylated inhibits the SR Ca 2+ -ATPase (SERCA) transport of Ca 2+ into the reticulum, in the cardiac and skeletal muscles (FUJII et al. 1991;RODRIGUEZ & KRANIAS 2005). The general role of phospholamban is to decrease contractility and the rate of muscle relaxation, by decreasing the stroke volume and heart rate (BRITTSAN & KRANIAS 2000). Mutations in this gene in humans are a cause of inherited dilated cardiomyopathy, with refractory congestive heart failure (SCHMITT et al. 2003;EIJGENRAAM et al. 2020).
In animals, the PLN gene knockout of phospholamban results in hyperdynamic hearts, with little apparent negative consequence (LUO et al. 1994). The protein is a pentamer and is a major substrate for the cAMP-dependent protein kinase (PKA) in muscles. Such findings, and the critical role of calcium homeo-  The presence of phospholamban may be considered as highly important for organisms with high percentages of effort, like racing horses. The molecular mechanism of the skeletal muscle contraction, so that the cardiac muscle is strictly connected with calcium ions, is why the proper regulation of the Ca 2+ pumps may be the most important factor associated with the performance traits in horses, as well as the probability of the occurrence of some disorders. Hypocalcaemia and hypercalcaemia are among the most serious disorders in horses, causing bone diseases and thyroid disorders TORIBIO (2011). The previous report showing the significant up-regulation of the PLN gene during a training regimen, with a 3.40-fold up-regulation of the gene expression level between flat racing training stages, could be a result of increased tissue exercise activity, and its adaptation to effort indicates that this gene could be related to the adaptation to effort and should be considered in further research (ROPKA-MOLIK et al. 2017).
In most actual annotations, the PLN gene is located on chromosome 10 on the forward strand. The PLN locus overlaps with the CEP85L (ENSECAG00000012601) gene localised on the reverse strand on chromosome 10. The sequencing of the PLN gene mRNA revealed a significant difference in the horse genome annotation versions regarding the phospholamban transcript. The EquCab3.0 is a significant improvement over the previous EquCab2.0 version in terms of enhancing the contiguity, a reduction in the number of gaps and improving the characterisation of GC-rich regions (KALBFLEISCH et al. 2018). However, although the genome reference sequence has been improved, according to our results, the PLN transcript size was incorrectly assigned. Both 5' and 3' UTR sequences (not present in EguCab3.0) were identified in the PLN transcript expressed in the skeletal muscles of Arabian horses. Previously, reports also showed some inconsistencies between the EquCab2.0 and RNA-seq data received from different breeds . The discrepancies were related mainly to extending the mRNAs reading frame. Moreover, REBOLLEDO-MENDEZ et al. 2015 revealed over 720 thousand divergences between the Twilight high throughput genomic sequence and the EquCab2.0 reference assembly. According to the authors, most of the detected differences result from assembly errors but some of them (about 10 thousand) arise from inter-individual genome differences. Our results support the previous reports indicating the possible differences in the genomes between horse breeds, which are not always reflected in the reference genome. The annotated animal's genes often do not have as many transcriptional variants and accurate databases as the corresponding human genes. Moreover, a large percentage of the genes are often described as predicted on the basis of homology with other species. In addition, the genome for horses is not as wellknown as the genomes of other farm animals (cow and pigs), and currently, the 3rd version of the genome is still being improved. Such data demonstrates the need for a validation of the obtained result using highly accurate molecular and genetic methods.
The comparison of the genotyping results of the 3' UTR variant across different horse breeds resulted in various genotype distributions being detected depending on horse utility types. An interesting finding was the genotype distribution within the Polish coldblood, where the highest percentage of heterozygote horses were detected (72.1%) which was significantly different from the rest of the breeds. A large disproportion was also observed in the CC homozygotes, where the lowest percentage of 3.3% was detected in the Polish Konik and Polish coldblood populations, and the highest percentage of 36.4% was identified in the Hucul horses. The most surprising finding was the highly significant differences observed between two of the native breeds -the Polish Konik and the Hucul horse. These breeds differed in terms of both the homozygotes number (3.3 vs 36.4 for CC and 52.2 vs 17.2 for TT; Polish Konik and Hucul horse, respectively), while the number of heterozygote animals was similar. There were significant allele distribution differences between the racing breeds and the primitive ones, which seemed to be connected with various ancestry, lifestyle and nutrition factors. Moreover, no statistical difference between the genotype distribution was detected for both warmblood horse breeds (Arabian horses and Thoroughbred), where the lowest number of CC homozygotes and the highest number of heterozygote horses were identified.