Genome variation and conserved regulation identify genomic regions responsible for strain specific phenotypes in rat

The genomes of laboratory rat strains are characterised by a mosaic haplotype structure caused by their unique breeding history. These mosaic haplotypes have been recently mapped by extensive sequencing of key strains. Comparison of genomic variation between two closely related rat strains with different phenotypes has been proposed as an effective strategy for the discovery of candidate strain-specific regions involved in phenotypic differences. We developed a method to prioritise strain-specific haplotypes by integrating genomic variation and genomic regulatory data predicted to be involved in specific phenotypes. Specifically, we aimed to identify genomic regions associated with Metabolic Syndrome (MetS), a disorder of energy utilization and storage affecting several organ systems. We compared two Lyon rat strains, Lyon Hypertensive (LH) which is susceptible to MetS, and Lyon Low pressure (LL), which is susceptible to obesity as an intermediate MetS phenotype, with a third strain (Lyon Normotensive, LN) that is resistant to both MetS and obesity. Applying a novel metric, we ranked the identified strain-specific haplotypes using evolutionary conservation of the occupancy three liver-specific transcription factors (HNF4A, CEBPA, and FOXA1) in five rodents including rat. Consideration of regulatory information effectively identified regions with liver-associated genes and rat orthologues of human GWAS variants related to obesity and metabolic traits. We attempted to find possible causative variants and compared them with the candidate genes proposed by previous studies. In strain-specific regions with conserved regulation, we found a significant enrichment for published evidence to obesity—one of the metabolic symptoms shown by the Lyon strains—amongst the genes assigned to promoters with strain-specific variation. Our results show that the use of functional regulatory conservation is a potentially effective approach to select strain-specific genomic regions associated with phenotypic differences among Lyon rats and could be extended to other systems.


Background
Phenotypic diversity is ultimately driven by genetic differences. The connections between DNA sequence and observed phenotypes are often difficult to determine and may be confounded by non-genetic causes including environmental effects. Regardless, it is increasingly clear that differences in transcriptional regulation are an important factor explaining phenotypic diversity [1][2][3][4][5][6]. This is especially true between closely-related species [2,[7][8][9]. Accordingly, a number of efforts have been made to combine transcriptional regulatory data with genome variation to select candidate genomic regions involved in producing phenotypic characteristics of interest [10][11][12][13].
The rat is a key animal model for biomedical research [14][15][16]. More than 600 laboratory rat strains have been created over the last century in order to study specific traits including those which are more informative in rat than in other model species, such as behaviour and neurodegenerative diseases, cardiovascular diseases and metabolic disorders [17][18][19]. One focus over the last decade has been the identification of genes and other genomic loci associated with these strain-specific traits [13,20]. Despite the great number of quantitative trait loci (QTL) identified in rat models using a number of techniques [21], only a small number of causative genes have been determined for complex traits or diseases [13,22,23].
Most genomic variants in an individual are expected to be neutral, and therefore have no impact on reproduction or survival [24,25]. In the case of laboratory rats, the existing variation among strains (e.g. [26]) is the sum of the ancestral variation among individuals used in the process of strain development and the novel variation that originated and accumulated in the genome during the establishment and maintenance of the strains. Like humans [27] and laboratory mice [28], genetic variation among rat strains is not randomly distributed across the genome; instead it is organised in haplotype blocks [29][30][31], which are caused by meiotic crossover of the shared ancestral variation. Comparison of these haplotype blocks among rat strains with different phenotypes has proven to be a powerful strategy for genetic mapping of complex traits and diseases [29,31]. For example, Atanur and colleagues analysed the genomes of 27 rat strains, and found that haplotype blocks with variants that are unique to a single strain were positively selected in the initial phenotype-driven derivation of these strains, and thus variants associated with strain-specific phenotypes are predicted to be in these regions [30]. However, the genomic extent of such regions and the number of sequence variants found within them are nearly always too large for an effective determination of candidate loci influencing the phenotype of interest (see e.g. [32]).
Regulatory activity such as active promoters, enhancers and transcription factor binding sites (TFBS) can be effectively mapped genome-wide with current techniques such as chromatin immunoprecipitation followed by highthroughput sequencing (ChIP-seq) [33]. Previous studies have suggested that both the number and conservation level of transcription factor binding sites in a given region affect the level of gene expression [3,[34][35][36][37]. Since tissue characteristics are directed to a large extent by the activity of tissue-specific transcription factors, the location of these regulatory elements might be useful when selecting haplotype blocks associated with specific phenotypes or diseases.
In this study, we characterise the haplotype blocks holding strain-specific genome variation among three closely related strains of the Lyon rat. Although Lyon rats were initially established as a model of hypertension [38], several additional symptoms related to metabolic syndrome (MetS), such as obesity, dyslipidaemia and susceptibility to insulin resistance have been found in the Lyon Hypertensive (LH/Mav) strain [39][40][41]. Only obesity is observed in the Lyon Low pressure (LL/ Mav) and all MetS related phenotypes are absent in the Lyon Normotensive (LN/Mav) strain [39][40][41][42]. Since both liver and kidney are involved in MetS [43], we generated RNA-seq expression data from liver of LL rats and from kidney of all three strains and integrated these with relevant existing data including the level of regulatory conservation for three liver-specific transcription factors (CEBPA, FOXA1 and HNF4A [9]) between rat and five related mouse species and strains. We show that the level of functional regulatory conservation can help select strain-specific haplotype blocks putatively associated with phenotypic differences among Lyon rats.

Results
85% of strain-specific variation among Lyon rat strains is concentrated in less than 9% of the genome To define haplotype blocks, we partitioned the rat genome into 10 kb windows and calculated the number of strainspecific variants (SSVs) in each window relative to the reference rat genome assembly (see Methods). We observed a bimodal distribution in the number of SSVs and used this distribution to define the resulting haplotype blocks as having either a high density of SSVs (High Variability Region, HVR) or a low density of SSVs (Low Variability Region, LVR) (see Methods, Fig. 1 and Additional file 1: Figure S1).
The distribution of SSVs across the genome was similar for the two pairwise comparisons of Lyon rats susceptible to MetS and obesity (LH and LL) and the Lyon Normotensive (LN) that is resistant (i.e. LHvsLN (Fig. 2c). The fraction of the genome that we identify as highly strain-specific is similar to that obtained previously for these and other rat strains (see [30,31], Methods and Additional file 1: Figure S2B). HVRs characterise a substantial reduction in the portion of the genome that is most likely to be involved in MetS phenotypes and therefore form the primary focus of our subsequent analysis.

Evidence for the functionality of high variability regions in the Lyon rats associated with MetS
We then sought to determine if the HVRs preferentially contain features that could explain the phenotypic differences among Lyon rats by comparing them to other regions of the genome (see Methods and Fig. 3). Specifically, we tested whether there is a significant enrichment in HVRs of the following elements: i) annotated genes, ii) genes associated with metabolic-related traits, iii) genes differently expressed among Lyon rats, iv) occupancy in rat of three liver-specific transcription factors, and v) regions orthologous to human variants associated by GWAS to obesity and metabolic traits.
The number of Ensembl genes [44] that overlap at least one HVR was marginally greater than expected by chance and this overlap was significant for the LHvsLN comparison (p < 0.05), but not for LLvsLN (p > 0.05) (Fig. 3b). Additionally, but only in the case of LLvsLN, there was a significant enrichment of genes associated with Type I diabetes mellitus (KEGG PATHWAY database, gene count: 25, p < 10 −9 , see results for ' All HVRs' in Additional file 1: Figures S4 and S5 (DAVID web services v6.7 [45,46]). Gene enrichment in the HVRs was more significant when considering only the genes whose expression in either liver or kidney differed between LH and LN strains, and between LL and LN strains (see Methods and Additional file 1: Table S1). (Fig. 3c and Additional file 1: Table S2).
We next considered whether the HVRs were enriched for either the occupancy of three specific transcription factors (HNF4A, CEBPA, and FOXA1) or the 418 rat orthologues of Human GWAS variants associated with obesity and metabolic traits. In both cases we did not observe a significant enrichment (Fig. 3d-e, Additional file 1: Figure S6 and Table S3).
In summary, the observation that both annotated and differentially expressed genes are enriched in HVRs supports the hypothesis that HVRs harbour functional  Fig. 1 Procedure to identify genomic regions of interest based on the distribution of Strain-Specific Variants (SSV) across the genome. a SSVs were obtained for the pairwise comparisons between the two susceptible Lyon strains (LH or LL) relative to the control Lyon rat (LHvsLN and LLvsLN) and the reference rat genome (LHvsBN and LLvsBN). b Densities of SSVs were calculated in non-overlapping genome windows of 10kbp; we applied a smoothing algorithm to these densities (see Methods). c Distribution of the smoothed densities of SSVs obtained from LNvsBN and LLvsBN were used to calculate the threshold between HVRs and LVRs. Only those regions with at least three consecutive genome windows of the same type were considered. d HVRs and LVRs across chr19 for the hypertensive Lyon rat relative to the control Lyon rat (LHvsLN); regions showing poor mapping qualities (unmappable) were discarded from our analyses. Related Figure: Additional file 1: Figure S1 regions that could be responsible for phenotypic differences observed among the Lyon rats. However, given the overall genomic span of identified HVRs and the large number of SSVs both in coding and non-coding regions in the HVRs (see Table 1), these analyses on their own are inadequate to suggest either the causative genes or the causative variants influencing MetS or obesity across the whole genome.
Liver-specific regulation data can prioritise regions of strain-specific variation in Lyon rats associated with MetS We next integrated strain-specific genomic variation with available genomic regulatory data from tissue relevant to MetS in order to prioritise the HVRs using the occupancy and level of conservation of the three liver-specific transcription factors. We created subsets of HVRs characterised by occupancy of the factors and the level of conservation among mice and rats using a factor-specific Conservation Enrichment score (CE f , see Methods). Briefly, CE f is the fraction of transcription factor binding events in a 10 kb   window that are conserved between rat and mouse for each transcription factor. The score was determined independently for each of the three factors (f = CEBPA, FOXA1, or HNF4A). Thus, for each transcription factor and for both the LHvsLN and LLvsLN comparisons, we created seven subsets of HVRs: ' All HVRs' including those without any binding event; 'HVR w/TFBS' with at least one TFBS regardless of conservation; and five subsets containing HVRs with CE f greater than 0, 0.2, 0.4, 0.6 and 0.8, respectively. We then reassessed the evidence for functionally of these HVR subsets in a similar way to that done with the whole set of HVRs as above. Enrichment of Ensembl rat genes in HVR subsets corresponded with the occupancy and level of conservation of the three liver-specific transcription factors (Fig. 4a). With the exception of the subset of LLvsLN with all HVRs, all tests in the HVR subsets were statistically significant (p < 0.05). For LHvsLN and for the three factors, maximum significance possible (p < 10 −3 ) was obtained for the subset of HVRs with at least one TFBS (HVR w/TFBS), and for the subsets with CE f > 0.0 and CE f > 0.2. In the case of LLvsLN and for the three transcription factors, the maximum significance was obtained in the subset of HVRs with at least one TFBS, and HVR subset with CE f > 0.0 (i.e. HVR subsets with the darkest colour in Fig. 4a).
As above, we analysed the functional annotation enrichment in the HVR subsets using DAVID [45,46] with KEGG PATHWAY [47,48] and UP TISSUE [49] databases (see Methods). In all cases (both LHvsLN and LLvsLN for the three liver-specific transcription factors) the term 'liver' from the UP TISSUE database, had the greatest accumulated significance across HVR subsets (see Additional file 1: Figures S4A and S5A). These results indicate that HVRs selected according to information from liver-specific regulation data are enriched in genes associated with liver function. Importantly, but as expected, this association was not evident without using genomic regulation data to select HVRs (see Fig. 4b).
For the analyses using KEGG_PATHWAY database, we did not find a consistent increase in significance associated with an increase in CE f , although we did identify some functional terms that were statistically significant (see Additional file 1: Figures S4B and S5B).
Additionally, we looked for an enrichment of putative GWAS positive regions in HVR subsets by determining the orthologous location in rat of NHGRI-EBI GWAS Catalog SNPs associated with obesity and metabolic-related in humans [50] (specific terms listed in Additional file 1: Table  S3). The use of the regulatory information from liver-specific transcription factors identified significant enrichments of GWAS variants in relevant subsets of HVRs. For example, we found significant enrichments (i.e. p < 0.05) for LHvsLN in the subsets of HVRs w/ TFBS and CE f > 0.2 for the three liver-specific factors, in CE f > 0.0 for FOXA1 and HNF4A and in CE f > 0.4 for FOXA1) (Fig. 4c, Additional file 1: Table  S4). Regarding LLvsLN, we found significant enrichments for in HVRs w/TFBS, CE f > 0.0, CE f > 0.2, CE f > 0.4 for both FOXA1 and HNF4A (Fig. 4c, Additional file 1: Table S4).
Genes differentially expressed between LH and LN strains in either liver or kidney (Additional file 1: Table  S1) are significantly enriched for all subsets of HVRs (Additional file 1: Figure S7), regardless of which transcription factor is considered. The same is true for genes differentially expressed between the LL and LN strains. To compare the differences in enrichment between subsets, we computed the fraction of all Ensembl rat genes that are differently expressed for each HVR. In all cases, the fraction of differentially expressed genes was positively correlated with CE f (Fig. 4d). For example, for LHvsLN Conservation Enrichment (CE f ) is correlated with functional enrichment. a Genes overlapping HVR subsets. The colour gradient shows the significance obtained from the enrichment analyses done in each of the HVR subsets. Bar sizes indicates the total number of genes overlapping each subset of HVRs. b Liver-genes defined by UP_TISSUE annotation. The functional annotation analyses were calculated by DAVID 6.7. Colour gradient and bar size defined as above. c Rat orthologues of human GWAS regions associated with obesity and metabolic traits. We excluded HVR subsets with CE f > 0.6 and CE f > 0.8 due to no GWAS variants overlapping these HVR subsets. Colour gradient and bar size defined as above. d Ratio of genes differentially expressed in either liver or kidney, between LH and LN strains, and between LL and LN strains and data from liver, the fraction increased from 6% ('all HVRs' subset) to 15% (CE FOXA1 > 0.8), while for kidney, it increased from 15% ('all HVRs' subset,) to 30% (CE CEBPA > 0.8; CE FOXA1 > 0.8). A similar pattern was observed for LLvsLN ( Fig. 4d).
We hypothesised that there may be a correlation between selection pressures leading to regulatory conservation as measured by CE f and changes to the sequence of protein coding genes within the same sets of HVRs. The ratio of non-synonymous coding SSVs (NSC-SSVs) to synonymous coding SSVs (SC-SSVs) was therefore compared across HVR subsets (see Methods). Although we find relatively little difference in the ratio of the non-synonymous changes, especially for the case of the LHvsLN comparison, in the LLvsLN comparison, non-synonymous changes do appear to be depleted when HVRs have higher regulatory conservation (i.e. higher CE f ) (Fig. 4e). This may be the effect of simultaneous selection on both protein coding genes and regulatory networks for a subset of regions in the LL genome.
In summary, the use of CE f (i.e. the conservation level between rat and mouse in the occupancy of three liverspecific transcription factors) is effective for selecting candidate regions involved in phenotypic differences between Lyon rats. In most cases, we observed an increase in statistical significance as a function of CE f . Moreover, the consideration of regulatory information was required to identify HVRs significantly enriched for genes associated with metabolic related-trait genes and enriched for human GWAS variants related to obesity and metabolic traits.
Integrating results from the three liver-specific transcription factors to prioritise the strain-specific variation in Lyon rats associated with MetS Given the observed stability of combinatorially bound transcription factors [9] and connection of these regions to human disease [51], we assessed if the number of liverspecific transcription factors used to estimate the conservation level could more efficiently prioritise candidate HVRs. For this purpose, we used the HVR subsets with CE f > 0 (i.e. all HVRs with at least one conserved TFBS).
We observed that conservation of more than one type of factor in a given HVR was common: 41% (LHvsLN) and 43% (LLvsLN) of the HVR CE f > 0 regions had conserved peaks for all three of the liver-specific transcription factors (Fig. 5a). We then partitioned the HVRs with conserved peaks by the diversity of factors that where conserved in the given HVR. Specifically, 'HVR 1TF' includes HVRs with one or more conserved TFBS from at least one factor; while 'HVR 2TF' and 'HVR 3TF' refer to HVRs with conserved TFBS from at least two or all three factors (i.e. HRV 3TF is a strict subset of HVR 2TF, which is strict subset of HVR 1TF). We then assessed these HVRs subsets to determine if an increased diversity of conserved liver-specific transcription factors is an effective method to prioritise HVRs.
HVRs with an increased diversity of conserved peaks were generally significantly enriched (permutation tests, p < 0.05) for orthologous regions of human GWAS SNPs except for the case of the HVR 3TF subset with the LHvsLN SSVs ( Fig. 5d and Additional file 1: Table S5).
The significance of enrichments of genes differentially expressed in either liver or kidney was p < 10 −3 for the subsets with at least one conserved peak for one, two and three liver-specific transcription factors, respectively, for both LHvsLN and LLvsLN strain comparisons. Considering the ratios of differentially expressed genes, we observed that they kept relatively constant across HVRs subsets as the number of liver-specific transcription factors with conserved peaks increased (Fig. 5e).
These results suggest that knowledge of which TFBSs are conserved and whether a given region of the genome has conserved TFBSs from multiple factors may be effective in some situations at prioritising regions with strain specific variation involved with tissue specific functions. For example, we did not observe enrichments in HVRs associated with liver-expressed genes and the orthologous rat regions associated with human GWAS without using the conservation level as measured by CE f (see above and Fig. 4).

Analysing the genes obtained from the selected high variability regions
To gain insight into genes from the prioritised HVRs that may be important for MetS or obesity, we focused on the most conserved and regulatorily complex HVR subset, i.e. the set containing at least one conserved peak for all three liver-specific factors (the HVR 3TF subset, see above).
We performed two analyses based on possible functional mechanisms underpinning the phenotypic differences among Lyon rats. First, we characterised those genes with non-synonymous coding strain-specific variants (NSC-SSVs; see Methods) overlapping the selected HVRs. Such variation would result in changes to the amino acid sequence that may be responsible for functional changes in the resulting proteins. Second, we identified those genes located near putative promoters in rat obtained from Villar et al. [4] (see Methods) and with SSVs overlapping the selected HVRs. We assumed these SSVs might affect the expression of the proximal genes. For these analyses, we used only those genes expressed in liver as measured by RNA-seq data (Fragments per Kilobase Million (FPKM) > 1, see Methods).
We categorised the selected genes based on whether they were i) liver-specific genes (according to the UniProt tissue database, see Methods), ii) differentially expressed in liver and/or kidney when comparing the susceptible Lyon strains with the control Lyon strain, (see Methods); iii) overlapping human GWAS variants associated with obesity and metabolic traits overlapping the gene body in the case of genes with NSC-SSVs or overlapping the promoter in the cases of genes linked to promoters with SSVs (see Table 2 and Additional files 2 and 3).
We also analysed the genes associated by published evidence to three symptoms showed by the LH strain (insulin resistance, dyslipidaemias) and by the LH and LL strains (obesity) plus two symptoms not obviously present in these strains as control (heart disease and Alzheimers), using DisGeNET (v4.0 [52,53]) (see Methods). For this analysis, we used the corresponding human orthologues of the selected rat genes because the DisGeNET data is mainly for human (see Methods). A total of 7542 and 7520 rat genes had human orthologues and were expressed in liver in the LH and LL strains, respectively. DisGeNET identifies a small number of these genes as associated with the metabolic syndrome  Integrating results from the three liver-specific transcription factors. a Venn diagrams of HVRs with at least one conserved binding site between rats and mice (HVR subsets with CE f > 0) for the three liver-specific transcription factors. b Genes overlapping HVR subsets. The colour gradient shows the significance obtained from the enrichment analyses. Bar sizes indicates the total number of genes overlapping each subset of HVRs. c Liver-genes according to UP_TISSUE annotation. The functional annotation analyses were calculated by DAVID 6.7. Colour gradient and bar size defined as above. d Rat orthologues of human GWAS SNPs associated with obesity and metabolic traits. Colour gradient and bar size defined as above. e Ratio of genes differentially expressed in either liver or kidney between LH and LN, and between LL and LN calculated for each subset of HVRs is not influenced by the number of bound factors phenotypes and, as expected, these gene sets are highly similar for the LH and LL strains with approximately 140 (1.9%), 110 (1.5%) and 800 (10.6%) genes associated with insulin resistance, dyslipidaemia and obesity, respectively in each strain. There were 173 protein-coding genes expressed in liver of LH or LL strains with at least one NSC-SSV in the selected HVRs. Of these 173 genes, 144 had identified human orthologues and can thus be compared with the DisGeNET data. This set of genes was not significantly enriched for published associations to obesity, insulin resistance or dyslipidaemias, (see Table 2, Additional file 2).
A larger number of genes were one-to-one associated with putative active promoters including 3865 and 3864 genes that were expressed in livers of LH and LL strain rats and had human orthologues, respectively. Of the 3865 genes with active promoters from the LH strain, a total of 85 (2.2%), 66 (1.7%) and 425 (11%) were associated with insulin resistance, dyslipidaemias and obesity, respectively. The numbers for LL are similar: 86 (2.2%), 65 (1.7%), 417 (10.8%) for the associations to insulin resistance, dyslipidaemias and obesity, respectively. Only a fraction of these promoters had SSVs in the selected HVRs: 206/3865 (5.3%) for LH and 164/3864 (4.2%) for LL. Thirty-two of the genes assigned to promoters with SSVs overlapping the selected HVRs in the LH strain were associated with obesity (15.5%, Fisher's exact test: p < 0.05, see Table 2, Additional file 1: Table S6 and Additional file 3). There were no significant enrichments for insulin resistance or dyslipidaemias in the LH strain or any significant associations in the LL comparison (see Table 2 and Additional file 3).
We found no significant enrichments for the two symptoms used as control in either the comparison to genes with at least NSC-SSV in the selected HVRs (Fisher's exact tests: all p-values >0.08) or to genes one-toone assigned to promoters with SSVs overlapping the selected HVRs (Fisher's exact tests: all p-values >0.05).
Of the set of 32 genes responsible for the significant enrichment for obesity in the LHvsLN comparison (Additional file 1: Figure S8), the gene with most published evidence of association with obesity was the insulin receptor gene Insr (ENSRNOG00000029986) (Additional file 1: Table S6); Cat (ENSG00000121691) was the human gene of that list assigned to the promoter with the greatest number of SSVs (58 SSVs) overlapping the HVRs. To asses interactions among the identified set of 32 genes and gain further insight into how they might connect to the Lyon rat phenotypes, we conducted a network analysis and a functional enrichment analyses using GeneMANIA (plugging for Cytospace v3.4.1 [54]). We observed two major modules ( Fig. 6 and Additional file 1: Table S7) and significant enrichments for 20 GO terms (Additional file 1: Table S7). One module includes 10 genes interconnected with PER1 and CRY2 and which is responsible for the significant enrichment of GO terms related with circadian rhythm and regulatory binding regions in DNA ( Fig. 6 and Additional file 1: Table S7). The other module includes 11 genes interconnected with INSR, SOCS3, THRSP, TGFB1, BAMBI and SMAD7. Genes in this module are responsible Table 2 Number of rat genes and human orthologues expressed in liver of the susceptible Lyon rats (LH or LL) and associated with coding or promoter strain-specific variation overlapping HVRs with at least one conserved peak for three liver-specific transcription factors (i.e. the HVR 3TF subset)  for the significant enrichment of GO terms related to regulation of the transforming growth factor beta receptor (TGF-beta) signalling pathway and regulation of the transmembrane receptor protein serine/threonine kinase signalling pathway ( Fig. 6 and Additional file 1: Table S7).

Discussion
In this study, we have used the level of functional regulatory conservation between related species to prioritise genomic regions whose patterns of genome variation suggest that they are involved in phenotypic differences in a model of obesity and metabolic syndrome, the Lyon rat strains. As a first step, we characterised haplotype blocks by density of strain-specific variants for the two comparisons between the susceptible Lyon strains with respect to the resistant Lyon strain (i.e. LHvsLN and LLvsLN). In agreement with similar analyses [30,31], most of these variants were concentrated in a small part of the genomes, which we termed High Variability Regions (HVRs). Next, we classified the HVRs according to conserved occupancy between rat and mice for three liver-specific transcription factors. Functional enrichment of selected HVRs was evident for those genetic elements where a significant enrichment was found in the whole HVR sets. Importantly, our approach revealed associations between HVRs with liver-genes and with rat orthologues of human GWAS linked to obesity and metabolic traits.
We also searched genes associated with genomic variation linked to two selected sets of HVRs, one from each strain comparison; namely, those sets with haplotype blocks having at least one conserved peak among rat and mice for each of the three liver-specific transcription factors (i.e. 'HVR 3TF' subset). In these two subsets, we determined those genes with non-synonymous strain-specific variants and genes assigned to promoters with strain-specific variation overlapping the selected haplotype blocks. We reported a list of these selected genes where we included additional information coming from functional analyses and supporting the association of these genes with human GWAS for obesity and metabolic traits and with traits in the susceptible Lyon strains (insulin resistance, dyslipidaemias and obesity) (Additional files 2 and 3). We found a  Fig. 6 Network analysis based on protein-protein interactions for the human orthologues associated with the significant enrichment for obesity. Human orthologues of the 32 rat genes assigned to promoters with SSVs overlapping the selected HVRs in the LH strain and associated with obesity were analysed (black circles). Relationships among these genes were investigated using GeneMANIA (see Methods). The grey circles are genes predicted by GeneMANIA to be in the network. Of 20 significant GO Terms (q-value <0.05) total, two are presented in the figure due to their significance and the number of genes included in each module (green circles: 'Circadian rhythm', q-value <1e-8; red circles: 'Negative regulation of TGF-beta receptor signalling pathway', q-value <0.01) significant enrichment of liver-expressed genes associated with obesity that were assigned to promoters with strainspecific variation overlapping the selected haplotype block obtained from LHvsLN. Network analyses using these genes and based on protein-protein interactions identified modules implicated in circadian rhythms and in the TGFbeta and transmembrane receptor protein serine/threonine kinase signalling pathways.
Recent studies have described the crucial role of circadian rhythms in regulation of body weight and metabolic process in rodents and other mammals [55][56][57]. The expression of two genes with strain-specific variation, PER1 and CYR2, are part of the circadian pathway and are regulated in part by the binding of the CLOCK:ARNTL(BMAL1) heterodimer to their promoter regions. PER1 and CRY2 also feedback to inhibit the CLOCK:ARNTL heterodimer, which itself regulates the transcription of other genes involved in lipid metabolism [56,57]. Thus, the strain-specific changes in the promoter sequence of PER1 and CRY2 detected using our methodology are plausibility involved in the metabolic symptoms shown by LH strain. Further support for this interpretation is the observation that PER1 is downregulated in liver and kidney in LH and LL strains in comparison with LN strain, whereas CLOCK is upregulated in kidney in LL strain. A relationship also exists between obesity and the TGF-beta and transmembrane receptor protein serine/ threonine kinase signalling pathways [58]. Specifically, SMAD7 is an inhibitor of the TGF-beta signalling pathway [58] and is also downregulated in liver in the LH strain. BAMBI cooperates with SMAD7 in the inhibition of the TGF-beta signalling pathway [59], and it is also downregulated in kidney in LH strain. Taken as a whole, the network analyses of the 32 genes associated with obesity demonstrates that at least a fraction of them are plausibly implicated in the obesity phenotype shown by the LH strain.
Ma et al. characterised the blocks with a high density of variants that are unique in the Lyon strains in order to fine-map QTLs for MetS previously identified in these rat strains [31]. As result, the candidate QTL were narrowed by 78%. By focusing their analyses to coding variants in the QTL on rat chromosome 17, they reduced the number of candidate genes to 27. We found that three of these genes had non-synonymous strain-specific variation overlapping the most stringent HVR 3TF subset (ENSRNOG00000014834, ENSRNOG00000022378, ENSRNOG00000027453, see Additional file 1: Table S8), however none of these genes were assigned to promoters with strain-specific variation overlapping HVR 3TF regions. More recently, Wang et al., reported 17 candidate genes involved in the phenotypic differences between LH and LN Lyon rats [41]. We found that only one of Wang et al.'s genes held non-synonymous variation overlapping an HVR 3TF region (ENSRNOG00000016109, Additional file 1: Table S8). In addition, none of the genes identified by Wang et al., were assigned to promoters with strainspecific variation overlapping the HVR 3TF regions (Additional file 1: Table S8).
The gene C17h6orf52 (ENSRNOG00000039379), which encodes a protein similar to chromosome 6 open reading frame 52 (C6ORF52), is the only one reported by both studies in the previous paragraph. It is suggested to be the most likely eQTL driver gene involved in phenotypic differences between LH and LN strains [41]. C17h6orf52 is cisregulated by an eQTL hotspot on chromosome 17 and is predicted to affect 100 of 278 trans-eQTL genes [41]. While C17h6orf52 was not the single gene from Wang et al., linked to strain-specific variation overlapping the strict HVR 3TF subset, it is associated with the less restrictive HVRs 2TF subset obtained from LHvsLN comparison. Moreover, C17h6orf52 was the only gene from the Ma et al. and Wang et al. lists with both non-synonymous coding and promoter assigned SSVs (Fig. 7). This result would suggest that C17h6orf52 has been under positive selection during the phenotype-driven derivation of this strain and gives support to the predicted role of C17h6orf52 affecting susceptibility in LH rats for the Metabolic syndrome reported by Ma et al. and Wang et al. The identification of C17h6orf52 by our complementary method further supports its role in the phenotype and lends additional validation to our general approach.

Conclusions
Our results demonstrate both the potential and the limitations of using the level of functional regulatory conservation to prioritise genomic regions potentially associated with phenotypic differences among Lyon rats. This approach would be most easily extended to other systems with similar breeding histories including other rat strains and mice strains. Importantly, it is not needed to generate data from many individuals like QTL and eQTL approaches and allows the use information already available, as conservation in regulatory elements between rat and mice.

Determination of high variability regions, low variability regions and unmappable regions Genomic sequences and single nucleotide variants
We used existing whole genome alignments (ENA accession: ERP002160) and single-nucleotide variants (available from the Rat Genome Database) of the three Lyon strains (LH, LL and LN) that were generated by Atanur et al. [30] in comparison to the BN reference genome (RGSC-3.4 [60]).

Strain-Specific Variant (SSV)
We called a SSV for a given strain as a genomic position with an allele that is not present in the strain used as reference (Fig. 1a and Additional file 1: Figure S1A). Firstly, we obtained SSVs for Lyon strains compared to the BN reference genome RGSC-3.4 and the resulting sets of SSVs are referred to as LHvsBN, LLvsBN and LNvsBN for the SSVs specific to the LH, LL and LN strains, respectively. These comparisons were used to calculate the threshold for the different types of genomic regions (see below and Fig. 1c and Additional file 1: Figure S1C). Secondly, we obtained SSVs for the two pairwise comparisons of Lyon rats that are susceptible to MetS and obesity phenotypes relative to the strain that is resistant (LHvsLN and LLvsLN). In these cases, we called a SSV as a genomic position with at least one allele that is not present in both LN and the reference BN genome. By doing this, we discard from LH and LL genomes the genetic variation shared with LN strain, which we assume are not associated with MetS (Fig. 2, Additional file 1: Figures S2 and S3). Furthermore, in a similar way as done for LN strain by comparing it to LH and LL strains (LNvsLH, LNvsLL); these comparisons were used as controls of our approach, because we expected to not find any association between LN strain SSVs and MetS (see Additional file 1: Figures S2 and S3).
In a previous study with the Lyon rats, Ma et al. [31] considered a SSV as any position that differed between the two strains that were being compared regardless of whether the position was variable with respect to the reference BN genome (indicated as LH + LN, LN + LH in Additional file 1: Figure S2 and LL + LN and LN + LL in Additional file 1: Figure S3). In our case, we considered a SSV only if the allele both differed from the BN reference genome and was also in the strain that was used as query and not present in the strain used as control. For example, in Additional file 1: Figure S1, our approach considers only the G in the LNvsLN comparison, while Ma et al., would have included both the G and the C. This different criterion allowed us to remove from our LHvsLN and LLvsLN analyses those genomic regions specific to LN, and which are likely not associated with the phenotypic differences associated with MetS among Lyon rats (see Additional file 1: Figures S2  and S3). Other differences from methodology used by Ma et al., were i) we did not discard the roughly 5% of SSVs that were called heterozygous by Atanur et al. [30]; and ii) we did not use those genome regions with a low estimated accessibility (see below). smoothing-to calculate the number of SSVs in nonoverlapping 10 kb genome windows ( Fig. 1b and Additional file 1: Figure S1B). This method smoothes differences among windows that were caused by the genome compartmentalization. For a given window in the genome at position x, we calculated the smoothed density of SSVs as the following floating mean with weights:

Smooth density of SSVs
where SSV x + i is the number of Strain Specific Variants in the window with position x + i, and k is the number of neighbouring windows up and downstream used for smoothing. We use k = 3 in our analyses empirically (data not shown) because this value gives a clear distinction between two types of genomic regions (see below and Fig. 1c and Additional file 1: Figure S1C).

High Variability Regions and Low Variability Regions
The smoothed density of SSVs in genome windows between two rat strains shows a bimodal distribution ( Fig. 1c and Additional file 1: Figure S1C). The left peak in the bimodal distribution contains regions of the genome identical by descent, with low a density of SSVs (Low Variability Region, LVR). The right peak contains regions of the genome that are divergent between the two strains with a high density of SSVs. A distinct valley separates the two peaks, which we used as a threshold to differentiate HVRs and LVRs. We calculated this threshold for the three comparisons between the Lyon strain rats and the reference rat genome (RGSC-3.4). In all three cases the threshold obtained was three (Additional file 1: Figure S1C); that is, windows with a smoothed SSV density greater than three variants in 10 kb were classified as HVRs, and windows whose smoothed SSV density was less than or equal to three were classified as LVRs. Only those regions with at least three consecutive genome windows of the same type were considered for further analyses (Fig. 1d).
Unmappable regions We performed two analyses on the BAM files to estimate the parameters to characterise the non-accessible genome regions of the Lyon strains. Firstly, we obtained the distribution of mapping qualities (i.e. -10 log 10 Pr(mapping position is wrong), http://samtools.github.io/hts-specs/) by using QualityScoreDistribution.jar from Picard tools (v1.81 (1299), http://broadinstitute.github.io/ picard/) with the option VALIDATION_STRINGENCY = LENIENT (Additional file 1: Figure S9A). Secondly, we calculate genome coverage per base by using genomeCover-ageBed form Bedtools (v2.17.0 [61]) with default parameters (Additional file 1: Figure S9B). According to results obtained from the later analysis, we considered a region as unmappable when at least three consecutive windows with an average mapping quality less then or equal to 30 and/or with an average coverage greater then or equal to 100 (Figs. 1d, 2, Additional file 1: Figures S2B and S3B).

Animals
LH/MRrrcAek, LN/MRrrcAek, and LL/MRrrcAek rats were bred and maintained in an approved animal facility at the University of Iowa on a 12-h light-dark cycle and provided food and water ad libitum. Male offspring were used in this study. The rats were phenotyped and tissues collected as previously described [41]. Briefly, at 3 weeks of age the rats were weaned onto normal chow (Teklad 7913 -Harlan Teklad NIH-31 irradiated, 18% protein, 6% fat). At 15 weeks of age they were switched to a 4% NaCl diet (Teklad 7913 modified with 4% NaCl) until they were humanely euthanized with CO 2 at 18 weeks of age after an overnight fast. Tissues were collected and stored in RNAlater (Life Technologies, Grand Island, N.Y.) at −80°C for subsequent RNA extraction.
Gene expression RNA-seq data RNA was isolated from liver and kidney tissue using standard TRIzol methods [62]. We analysed the read quality using FASTQC software (v 0.10.1 http://www.bioinformatics.babraham.ac.uk/projects/ fastqc/). Reads were trimmed by using Trimmomatic (v0.32 [63]) if the Phred score of any base was below 25 (LEAD-ING:25 TRAILING:25). We used reads with at least 36 bases (MINLEN:36) and only those paired reads that remained after trimming.

Gene expression analyses
We estimated differential gene expression between LH and LN, and between LL and LN. TopHat (v2.0.13 [64]) was used to map reads to the rat reference genome (RGSC-3.4). Read alignments with more than two mismatches were discarded (-read-mismatches 2). We also used the option -no-novel-juncs to look for reads across junctions already annotated. The Cufflinks package (v2.2.1 [65]) was used to assemble transcripts separately for each replicate. We used cuffmerge to merge the transcript assemblies from replicates to be analysed, and finally we used cuffdiff to find differently expressed genes and with Benjamini-Hochberg corrected False Discovery Rates (FDR) of 0.05. Thus, we obtained the differently expressed genes either in liver or in kidney between LH and LN, and between LL and LN (see Additional file 1: Table S1). We also used the FPKM (Fragments per kilobase of exon per million of fragments mapped) values obtained from these analyses to get the list of genes that were expressed in the livers of the LH and LL strains. We considered a gene expressed in liver if its FPKM was greater than 1.0.

Conservation of occupancy of three liver-specific transcription factors between rat and mouse strains
We compared peaks generated from ChIP-seq datasets among the five mouse species and the rat. We used only the genomic regions present in the BLAST-Z alignment between mouse and rat available in Ensembl (v59 [44]) and using the NCBI37 mouse genome as references for comparing datasets from the different species. We considered as conserved peaks between rat and mouse, the overlapping peaks between rat and at least one mouse species/strain. Coordinates of conserved peaks were converted to the rat genome reference (RGSC-3.4). For each HVR and livertranscription factor, we calculated its Conservation Enrichment score (CE f ) as the number of conserved peaks in 10 kb for a given transcription factor.

Permutation tests
Permutation tests were used to find significant enrichments in HVRs. We clustered HVRs within an empirically defined distance of 1 Mb because HVRs have a non-uniform distribution across the genome (see Fig. 3a) and we assume nearby HVRs are regulatorily non-independent. Clusters were randomly permutated across the whole genome by using the command shuffle from BEDTools (v2.22.0 [61]) and accessed from pybedtools (v0.6.9 [67]); the relative coordinates of HVRs inside of clusters were maintained. We estimated the distribution of expected values by calculating either the total number or average of genetic elements overlapping the set of HVRs inside of the shuffled cluster for each permutation. We performed 10,000 permutations in each test. Significance of the enrichment of the genetic element in HVRs was obtained by calculating the two-tailed p-value according to this formula: where OV is the value obtained from the observed HVRs, and EV is the expected value calculated from each of the 10,000 sets of permuted HVRs (see Fig. 3a). The minimal p-value possible with 10,000 permutations is 1 × 10 −4 .

Functional analyses of HVRs Ensembl genes overlapping HVRs
We used the set of Ensembl genes from Ensembl (v69 [44]) for the BN reference genome RGSC-3.4. We used a permutation test (see above) to determine if genes overlapped HVRs more often than expected by chance. We calculated the number of genes overlapping at least one HVR in the observed permutated sets (Fig. 3b).

Differentially expressed genes overlapping HVRs
We tested if genes that were differentially expressed between LH and LN and between LL and LN overlapped HVRs more often than expected by chance (Fig. 3c). We used permutation tests for these analyses (see above). We used the list of genes differentially expressed that were obtained from RNA-seq data (Additional file 1: Table S1). We calculated the number of these genes that overlapped at least one HVR in the observed and permutated sets of HVRs ( Fig. 3c and Additional file 1: Table S2).

Gene-annotation enrichment analysis of HVRs
We analysed if there was a functional enrichment associated with metabolic or obesity phenotypes for the genes overlapping at least one HVR that were obtained from LHvsLN (' All HVRs' in Additional file 1: Figure S4) and LLvsLN (' All HVRs' in Additional file 1: Figure S5). We tested for this enrichment by using DAVID web services v6.7 (python client [45,46]) for KEGG PATHWAY [47,48] and UP TISSUE (Uniprot Consortium [49]) databases (release/download date: Sep 2009, https://david.ncifcrf.gov/ content.jsp?file=update.html). We used DAVID v6.7 (Sep 2009) for our analyses rather than DAVID v6.8 (October 2016), because most of the data used in our study (gene annotation, SNPs and occupancies of liver-transcription factors) are based on the RGSCv3.4 assembly, which is also that used by DAVID v6.7. DAVID v6.8 uses the Rnor 6.0 assembly and differences in the gene sets and/or gene nomenclature between these two rat assemblies create inconsistencies that affect the accuracy of our results (data not shown). In addition, although the KEGG PATHWAY resource was updated in DAVID 6.8, the UP TISSUE dataset, which we used to report the expected association between the term liver and the level of functional regulatory conservation (see Results section), was not updated in DAVID v6.8 (in both versions UP TISSUE is dated Sep 2009). We recognise that, in their recent paper, Wadi et al. [68] showed that the use of out-dated gene annotation prevents the identification of all significant terms in enrichment analyses. However, in our case, even when using DAVID v6.7, we found significant results and the expected correlation between gene enrichment and the level of functional regulatory conservation. Thus, the DAVID supporting database that we use are largely the same between v6.7 and v6.8, it is more important for us to be consistent on the assembly and gene set for our analysis.

Liver-specific transcription factor overlapping HVRs
We also tested if the number of peaks in rat overlapping HVRs ( Fig. 3e and Additional file 1: Figure S6) and the average CE f (Additional file 1: Figure S10) observed for each one of the three liver-specific transcription factors was significantly greater that that expected by chance. We used permutation tests (see above) for these analyses. For the observed values, we used either the total number of peaks overlapping the HVRs or the average CE f for a given transcription factor. For the expected values, we calculated the two latter values for each one of the 10,000 permuted sets of HVRs (see above).

Human GWAS variants associated with metabolic traits overlapping HVRs
We obtained from the NHGRI-EBI GWAS catalogue [50] the list of SNPs associated with obesity and metabolicrelated traits in humans (search terms used in Additional file 1: Table S3). SNPs coordinates were converted from the GRCh38 human assembly to the rat RGSC-3.4 assembly using mapping from GRCh38 to Rno6.0 and then from Rno6.0 to Rno5.0 and RGSC-3.4. All conversions used the Ensembl Perl API and the Ensembl assembly converter software (v87 [44]). As with other genetic elements analysed, we then used permutation tests to determine if there was a significant enrichment of rat orthologous positions for these GWAS variants overlapping HVRs. For the observed value, we used the total number of GWAS variants overlapping HVRs. The expected value was calculated as the number of GWAS variants overlapping each one of the 10,000 sets of permuted sets of HVRs.

Selection of HVRs according to CE f
For downstream analyses, we created seven subsets of HVRs according to the occupancy for the three liverspecific transcription factors and their CE f for each on each one of the three liver-specific transcription factors (CE CEBPA , CE FOXA1 and CE HNF4A ): all HVRs; HVRs with at least one peak (HVR w/TFBS); HVRs with CE f greater than 0 (i.e. HVRs with at least one conserved peak), and HVRs with CE f greater than 0.2, 0.4, 0.6, 0.8 respectively (Additional file 1: Tables S9 and S10 show sizes and number of SSVs of HVR subsets). We analysed each one of the subsets of HVRs in a similar way to that used for the full set of HVRs as described above. Then, we compared the results obtained in each analysis across the subsets of HVRs (Fig. 4). Specifically, we analysed the enrichment in HVRs for i) Ensembl genes, ii) gene annotation from DAVID (UP TISSUE and KEGG PATHWAY databases), iii) differentially expressed genes in liver and kidney (Additional file 1: Table S1) and iv) rat orthologues of human GWAS variants associated with obesity and metabolic-related traits (Additional file 1: Tables  S3·and S4). Additionally, we also tested if the proportion of non-synonymous coding SSVs (NSC-SSVs) and synonymous coding SSVs (SC-SSVs) in HVRs differed between the subsets of HVRs. For this, we estimated the effect of SSVs in HVRs by using the Ensembl Variant Effect Predictor (VEP) tool (standalone perl script v2.7 associated with Ensembl v69 [69]). We considered in the analyses those NSC-SSVs whose most severe effect was 'missense_variant' , 'stop_gained' or 'stop_lost'.

Selection of HVRs by the number of liver-specific transcription factors with conserved peaks
Three subsets of HVRs were created according to how many liver-specific transcription factors had conserved peaks: the 'HVR 1TF' subset included HVRs with conserved peaks for at least one liver-specific factor, the 'HVR 2TF' subset had HVRs with conserved peaks for at least two factors, and the 'HVR 3TF' subset had HVRs with conserved peaks for all three liver-specific factors ( Fig. 5a and Additional file 1: Table S11). We compared the functionality among these HVRs subsets to test the importance of the number of transcription factors used to define the conservation level (Fig. 5). We analysed each one of these three subsets of HVRs in a similar way as used for the full set of HVRs and for the HVRs subsets created with different conservation levels as described in the previous section. Specifically, we compared the enrichment in HVRs among 'HVR 1TF' , 'HVR 2TF' and 'HVR 3TF' subsets for i) Ensembl genes, ii) gene annotation from DAVID (liver term of UP_TISSUE database), iii) differentially expressed genes in liver and kidney and iv) rat orthologues of human GWAS variants associated with obesity and metabolic-related traits.

Analyses of SSVs of the selected subsets of HVRs
For these analyses, we selected the subset of HVRs that had at least one conserved peak between rat and mouse strains/species for all three of the liver-specific transcription factors (i.e. 'HVR 3TF' subset) as they show enrichment for most of the functional elements and because of the observed stability of combinatorially bound transcription factors [9]. We limited our analysis to the genes that were both expressed in liver of LH or LL (FPKM >1) and associated with coding or non-coding strain-specific variation.

Non-synonymous coding SSVs (NSC-SSVs) in the selected subsets of HVRs
We assessed the effect of the SSVs on the protein by using the VEP tool (standalone perl script v2.7 [69]). We considered in the analyses those SSVs classified as nonsynonymous variants and whose most severe effect was 'missense_variant' , 'stop_gained' or 'stop_lost'.

SSVs of the selected subsets of HVRs sited in promoters
Positions of putative promoters in Rat were obtained from Villar et al. [4]. These authors characterised promoters and enhancers by using modifications to histone 3 lysine 27 (H3K27ac) and histone 3 lysine 4 (H3K4me3). Active promoters are marked by H3K4me3 and H3K27ac, while active enhancers are regions marked by H3K27ac [4]. Coordinates were converted from the Rnor5.0 assembly to the RGSC-3.4 assembly using the Ensembl assembly converter software (v80 [44]). Genes were assigned to promoters if the gene's transcription start site (TSS) overlapped or was within 5 kb downstream of the promoter. Only one-to-one gene-promoter assignations were used for our analysis.

Association between metabolic diseases and genes
Genes associated with the three metabolic-related symptoms showed by LH and LL strains (i.e. insulin resistance, dyslipidaemias and obesity) were obtained from DisGeNET (v4.0 [52,53]). DisGeNET is a platform integrating information on associations between genes and human diseases from public data sources and literature. We analysed those genes expressed in liver and with either the selected NSC-SSV or assigned to selected promoters with SSVs. DisGeNET analysis used the human orthologous genes of the selected rat genes with homology determined by the Ensembl Perl API (v69 [44]). Only the human orthologous genes with rat homology annotated as 'one2one' or 'apparently one2one' were used. From DisGeNET, we searched for disease gene associations using relevant Unified Medical Language System Concept Unique Identifiers (UMLS® CUIs, insulin resistance: C0021655, dyslipidaemias: C0242339 and obesity: C0028754). We also included two additional diseases not shown by the susceptible Lyon strains as controls for our analyses (heart diseases: C0018799 and Alzheimers: C0002395). For each of the five diseases, we compared, using Fisher's exact test, the counts of rat genes expressed in liver with NSC-SSVs overlapping the selected subsets of HVRs and human orthologues associated with that disease with the total number of rat genes expressed in liver and human orthologues associated with the disease. A similar comparison was done for genes assigned to promoters with SSVs overlapping the selected HVRs. In this case, the total number of human orthologues of rat genes expressed in liver was limited to those that were one-to-one assigned to promoters.