Introduction

Disorders of the central nervous system (CNS) constitute a major global health burden. Accordingly, not only the role of genetic variation at candidate gene loci in CNS disorders, but also molecular function and evolution of CNS genes has been of interest in many studies. One important result from the comparative genomic analysis of CNS gene loci is their higher level of interspecies conservation.1, 2, 3 This indicates a higher potential for deleterious mutations and is consistent with gene function in an organ system as complex and indispensable as the CNS. On the other hand, coding and noncoding regions of certain types of CNS genes, in particular channel and developmental genes, had shown evidence for accelerated evolution in hominids as compared to murids.3, 4, 5, 6 This may be attributed to positively selected adaptations in genes with important role in the human CNS. Because the realized strength of positive selection on a constrained background is increased when recombination is present,7 more recombination near genes with important role in CNS function and development could thus be advantageous. If more recombination would be detectable at CNS genes, it would thus provide evidence for ongoing selection in the human lineage, based on the assumption that local recombination rates evolve under the influence of natural selection. Ongoing positive selection at CNS gene loci might be related to the high prevalence of genetically complex CNS disorders, potentially under an ancestral susceptibility allele model.8

Consistent with a functionally biased location of meiotic recombination events, recombination rates and linkage disequilibrium (LD) are known to vary greatly across the human genome.9, 10, 11, 12 The recently published HapMap dataset allowed for the first time the identification of functional categories of genes that are preferentially located within regions of weak LD (immune response and sensory perception) and strong LD (DNA and RNA metabolism and cell cycle).13, 14 In addition, several functional categories related to neurotransmission showed reduced LD.13 These LD patterns result from the joint influence of drift and different modes of selection on multiple sites under a complex population history. Nevertheless, the biased LD patterns gave rise to the hypothesis that it could be advantageous for certain types of genes to be located in regions of increased recombination.13 This makes sense theoretically, because LD can occur between multiple variable sites under selection. By allowing mutations to evolve independently from their haplotype background, the increase of recombination rates between such sites can cause selection to act more efficiently.15 On the basis of recent genome-wide estimates of fine-scale recombination rates,16, 17 it is now possible to take a closer look at the relationship between recombination rate and gene function. These recombination rate estimates are based on the polymorphism data that specifically reflect human evolutionary history over the past 200 000 years.14 One result of the analysis of fine-scale recombination rates is the widespread existence of recombination hotspots.16, 17 Recombination hotspots appear to evolve in relatively short periods, because little conservation of hotspots between human and chimpanzee was observed.18, 19

In the present study, we hypothesized that recombination hotspot predictions might be enriched at CNS gene loci. By allowing beneficial mutations to evolve independently from a highly constrained haplotype background, recombination hotspots could be particularly advantageous at CNS gene loci.

Materials and methods

Recombination hotspot predictions were retrieved from the UCSC Genome Database (version hg17),20 based on HapMap phase I genotype data.14 These hotspots are predicted by the LDhot method,17 that infers population recombination rates from patterns of LD. The method is sufficiently fast to be applicable to genome-wide data sets. In an evaluation on experimentally verified hotspots, it predicted correctly four out of eight hotspots with no false positives.21 Autosomal gene model annotations were retrieved from the Ensembl Core database (version 36).22

To test the enrichment of hotspots around genes from a certain category, we calculated the odd that a gene from this category displays a recombination hotspot in its flanking regions versus the odd that a gene not belonging to this category displays a recombination hotspot in its flanking regions. This odds ratio (OR) was separately calculated for upstream and downstream flanking regions (30 kb upstream or downstream, respectively) and then averaged to obtain a summary score for the respective category. We excluded genes that do not have at least five polymorphic HapMap Phase I SNPs in their 30 kp up- and downstream regions. To measure the significance of the association between a category and recombination hotspots, we employed a permutation-based strategy. We randomly permuted gene identifiers across gene model annotations in the human genome, while leaving the functional annotations that are assigned to gene identifiers unchanged and also leaving the position of hotspots unchanged. P-values for all categories were initially determined by 1000 random permutations. For categories where maximally 1% of random permutations showed a more extreme hotspot association than the actually observed score (ie higher average OR), the number of permutations was increased by a factor of 10. For the most significant categories, 100 000 random permutations were performed. Around genes from the most significant categories (P≤10−5), the actually observed enrichment of hotspots was stronger than seen in any of the 100 000 random permutations. This permutation strategy accounts for the location of genes and hotspots in the human genome. However, in combination with unknown patterns of chromosomal clustering of functionally related genes, it could be biased by the overproportional sampling of regions that flank two neighboring genes from a same category. Thus, categories that show a higher degree of chromosomal clustering may attain anticonservative P-values, if they have overproportionally many hotspots in chromosomal clusters, whereas such categories may attain conservative P-values, if they have a relative depletion of hotspots in chromosomal clusters.

To be robust against the influence of unknown patterns of chromosomal clustering on the reported results, we additionally employed a shifting window strategy to test the association between hotspots and gene functions. This shifting window strategy moves a nonoverlapping 50 kb window across the genome. Genomic windows were required to contain at least five polymorphic SNPs and overlap at least one gene to be included. For each category, we retrieved all windows that overlap with at least one gene from the respective category. Then we calculated for each category, the relative frequency of windows that overlap at least one hotspot prediction as test statistic. Again we used random permutation of gene identifiers across gene model annotations to determine the significance of the test statistic. Because hotspot predictions are known to be rare within genes,16 this shifting window strategy may have a bias against categories that are enriched for large-sized genes since they have more intragenic windows.

CNS genes termed ‘neuronal genes’ were defined as human genes that were returned by the EntrezGene server for the query ‘(neuronal* or glial* or neural* or neurite or axon) and not olfactory’ (date of query: 3 November 2006). EntrezGene provides manually curated free text annotation of genes. Gene expression data were retrieved from GNF2 dataset,23 as represented in the UCSC Genome Annotation database20 (version hg 17). Cross-references to Ensembl genes were obtained from the UCSC database. We defined CNS expression as the expression in the tissues olfactory bulb, temporal cortex, parietal cortex, prefrontal cortex, occipital cortex, cingulate cortex, cerebellum, cerebellar pedunculus, amygdala, hypothalamus, thalamus, subthalamic nuclei, caudate nucleus, globus pallidus, pons, medulla and spinal cord. We excluded fetal and neoplastic tissues from the analysis. Multiple measurements of a gene in a tissue were averaged and genes were taken as expressed in a given tissue, if the signal exceeded 200 arbitrary units, as suggested in the original publication.23

Ensembl genes22 were cross-referenced via the International Protein Index (version April 2006)24 to the Gene Ontology (GO) Annotation database (version April 2006).25 Obeying the ‘true path rule’, GO annotations were transitively assigned for all parent categories. GO categories26 were analyzed, if annotated to 20 or more human Ensembl genes. To account for multiple testing, Q-values were estimated based on the permutation-based P-value distribution over the 1266 GO categories27 showing that the most significant P-values of ≤10−5 correspond to a Q-value of ≤10−4, whereas a false discovery rate of 5% is attained when calling GO categories with P≤0.002 significant. To be included in the analysis, genes were required to have at least one GO annotation.26

Results

To test the association of recombination hotspots predictions14, 17 with genes from a certain category, we employed a permutation based analysis strategy (see Materials and Methods for details). We initially focused the analysis on hotspots that are predicted in the 30 kb up- and downstream flanking regions of genes, because it was shown that hotspots most often exist in these genomic windows.16 To this end, we first calculated as test statistic the ratio between the odd of a hotspot prediction in up- or downstream regions a gene annotated by the respective category and the odd of a hotspot prediction around a gene not annotated by the respective category. In addition, we employed a shifting window strategy by moving a 50 kb window across the genome. For each category, we retrieved all windows that overlap at least one gene from the respective category and calculated as test statistic the relative frequency of windows that overlap at least one hotspot prediction.

Recombination hotspots are enriched around CNS genes

Notably, we found hotspots enriched around genes with particularly important role in the CNS when retrieving such genes by three different strategies (Table 1). As one strategy, we followed the suggestion to identify CNS genes by keyword search in text-based gene annotations.5 We retrieved ‘neuronal genes’ as those that are returned from the EntrezGene server by searching for the keywords ‘neuronal’, ‘glial’, ‘neural’, ‘neurit’ or ‘axon’. On the basis of this definition, we identified 1026 neuronal genes that conform to our inclusion criteria, that is having at least one GO annotation and at least five HapMap SNPs in flanking regions. Consistent with the hypothesis of an enrichment of hotspots around CNS genes, we found flanking regions of neuronal genes enriched for recombination hotspots (upstream OR: 1.33, downstream OR: 1.22, average OR: 1.28; P<10E-05). We further observed a somewhat larger effect size for the subset of 145 neuronal genes that are additionally GO annotated by ‘cell adhesion’ (upstream OR: 1.56, downstream OR: 1.25, average OR: 1.4; P=0.0057). This may be interesting, because noncoding accelerated hominid evolution was found associated with ‘cell adhesion’ genes,5 what had mainly relied on a subset of ‘cell adhesion’ genes that belonged to a larger group of ‘neuronal’ genes as defined by keyword search.

Table 1 Enrichment of recombination hotspots around CNS genes that are retrieved by different definitions

We next used the human GNF2 dataset23 to identify CNS genes. Here we defined CNS genes as those with maximal expression in a CNS tissue. In total, we retrieved 14 600 human autosomal Ensembl genes with available expression and SNP annotation for their flanking regions and found 1493 of them having their maximal expression in a CNS tissue. On the basis of their maximal expression in a CNS tissue, one may predict that these genes play important roles in nervous system function. This definition of CNS genes also showed an enrichment of hotspot predictions (OR: 1.22, P≤10−5). A subset of 216 of the 1493 maximal CNS expressed genes was also contained in the above set of 1026 genes that were retrieved by keyword search. This intersected set of 216 genes produced an even stronger association with recombination hotspots (OR: 1.44, P≤10−5).

As a third definition of CNS genes, we applied independent expert knowledge by using a manually defined list of candidate genes for affective disorder,28 235 of them fulfilling our inclusion criteria. These CNS candidate genes displayed a significant association with hotspots with similar effect size (OR: 1.21, P=0.03). Thus, hotspots appear enriched around genes with important roles in the human CNS when such genes are defined by text-based annotations, expression data or independent expert knowledge.

Recombination hotspots are enriched around genes with CNS-related GO Annotations

To investigate the relationship between hotspot location and gene function more systematically, we tested all GO categories for the enrichment of recombination hotspots. Altogether, we compared 1266 GO categories based on 15 354 autosomal genes that have at least one GO annotation. The existence of functional categories that are enriched for hotspots and categories that are depleted of hotspots is indicated by the distribution of P-values over all GO categories, showing an excess of P-values close to 0 or close to 1 (Figure 1).

Figure 1
figure 1

Frequency histogram of OR-based P-values that are estimated for the 1266 tested GO categories. The excess of P-values that are close to 0 and P-values that are close to 1 indicates the existence of GO categories that are enriched for hotspots and the existence of GO categories that are depleted of hotspots. We observe 30 GO categories (Table 1) with P-values ≤0.0005 as compared to one GO category that would be expected by chance alone under the expectation of a uniform distribution.

Consistent with our hypothesis, GO categories most significantly enriched for recombination hotspots comprise several categories annotated to channel activity genes (Table 2 and Supplementary Table 1). Among a total of 359 genes annotated with ‘alpha-type channel activity’, we found 104 with at least one hotspot in their upstream region and 119 with at least one hotspot in their downstream region (OR: 1.49, P≤10−5). A similarly significant enrichment of hotspots (P≤10−5) was found around genes annotated with ‘(voltage-gated) ion channel activity’, ‘(metal) ion transport’ and ‘(channel or pore class) transporter activity’. By maintaining delicate ion balances at cellular membranes, the genes annotated by these categories are directly responsible for the ability of nervous system cells to generate signals. Interestingly, a significant association with hotspots (P≤10−5) also exists for the more general GO categories ‘membrane’ and (integral/intrinsic to) plasma membrane. Thus, the enrichment of hotspots around channels appears to be the more extreme reflection of a general pattern that applies to genes encoding membrane proteins.

Table 2 List of 30 Gene Ontology (GO) categories that are most significantly associated with predictions of recombination hotspots, see legend of Table 1 for further description of columns

Because more general GO categories are annotated to a higher number of genes, they have more power to attain a significant association. Nevertheless, owing to the above association of ‘ion channel activity’ with recombination hotspots, the respective subcategories deserve attention too (Supplementary Table 1). Among these, the most significant GO categories comprise ‘voltage-gated potassium channel complex/activity’. Owing to the assembly of functional potassium channels from multiple independent monomers, these categories are still annotated to a relatively high number of genes. An even stronger effect size but weaker significance can be observed for the less densely populated GO categories ‘calcium channel activity’, ‘sodium channel activity’ and ‘chloride channel activity’. Thus, a contribution to the association of channels with recombination hotspots comes from channels with selective permeability for all major ions that determine the cellular membrane potential.

The second major group of GO categories that shows association with recombination hotspots is constituted by annotations to developmental and particularly neurodevelopmental genes (Table 2 and Supplementary Table 1). Among the 1578 genes that are annotated with a role in ‘development’, 443 show at least one hotspot in their upstream region and 458 show at least one hotspot in their downstream region (OR: 1.36, P≤10−5). Moreover, significant associations (P≤10−4) were found for the GO categories ‘cell differentiation’ and ‘organ/system development’. Thus, in addition to channels which are important in neuronal signal processing, developmental genes are associated with recombination hotspots. Among more specific developmental categories, the strongest signals were found for ‘nervous system development’ and ‘muscle development’. Thus, hotspots are particularly enriched around genes that play a role in the development of excitable tissues. Further support for an enrichment of hotspots around neurodevelopmental genes is provided by the results obtained for the GO category ‘cell adhesion’ (OR: 1.30 P=1.1 × 10−4). This GO category is annotated to many genes with a role in neurodevelopment and communication between neurons such as neurologins, contactins and cadherins.

Hotspot predictions are enriched around genes with distinct coding sequence evolution

On the basis of detailed comparative genomic analysis of genes from different GO categories that has been published,3 we now looked for evidence of distinct sequence evolution of the top 30 GO categories that are associated with recombination hotspot predictions. All the 14 GO categories that are annotated to channels displayed accelerated protein sequence evolution in hominids as compared to murids.3 The same observation holds true for the three GO categories ‘development’ (GO:0007275) ‘nervous system development’ (GO:0007399) and ‘cell adhesion’ (GO:0007155). The more general and heterogeneous GO categories ‘(integral to) plasma membrane’ and ‘extracellular region/space’ were not reported to display accelerated protein sequence evolution in hominids.3 However, in the human–chimp lineage, the former two GO categories ‘(integral to) plasma membrane’ showed evidence for slower coding sequence evolution, whereas the latter two GO categories ‘extracellular region/space’ showed evidence for faster coding sequence evolution.3 Thus the comparative genomic analysis of all top hotspot-associated GO categories (Table 2) had either provided evidence for increased positive selection, increased negative selection or both.

One may also apply less stringent significance thresholds to look for a relation between coding sequence evolution and hotspot enrichment among functional gene categories. In total, 477 GO categories were analyzed both in our analysis and the published comparative analysis of human, chimp, rat and mouse protein sequence evolution.3 Among these, 226 categories displayed slower coding sequence evolution in the human–chimp lineage (defined as p_low smaller than 0.05 in Supplementary Table S26 of International_Chimp_Genome_Consortium3), 46 of them were enriched for hotspot prediction (defined as 3′–5′ P-value smaller than 0.05 in Supplementary Table 1). This falls close to the random expectation of 47 categories, based on the simplifying assumption of independence. On the other hand, 95 categories showed faster coding sequence evolution in the human–chimp lineage (defined as p_high smaller than 0.05 in Supplementary Table S26 of International_Chimp_Genome_Consortium3) and we see 26 of them enriched for hotspot predictions. This number is somewhat larger than the 17 categories that would be randomly expected under the assumption of independence. Of particular interest are now those categories with accelerated protein sequence evolution in hominids as compared to murids. Among the 98 categories that were reported to display accelerated hominid protein sequence evolution as compared to murids (defined as P_hominid_Ka>murid_Ka smaller than 0.05 in Supplementary Table S30 of International_Chimp_Genome_Consortium3), we find 34 enriched for hotspot predictions. This is nearly twice the number of 18 categories that would be expected by chance under independence.

To test the relationship between hotspots and prior evidence for accelerated noncoding sequence evolution more directly, we went back to look at the above genes that are contained in the sets of neuronal genes, maximal CNS expressed genes and affective disorder candidates. Among the 2383 genes altogether, we found 165 to be located closest to a human accelerated noncoding region.5 These 165 genes show significant enrichment of hotspots as compared to the remaining 2218 CNS genes (upstream: OR: 1.46, P=0.012; downstream: OR:1.53, P=0.0053). Finally, we directly analyzed the published list of regions with evidence for accelerated evolution.5 For each region with accelerated evolution in either human or mouse, we calculated the interval that is centered on the respective region and extends for 25 kb into both directions. We then excluded intervals around mouse-accelerated regions that overlap intervals around human-accelerated regions. In further support of our hypothesis, we found a significant enrichment of hotspots within intervals around human-accelerated regions (OR: 1.3, P=0.00015), comparing the 992 intervals centered on human-accelerated region with the 4207 intervals centered mouse-accelerated regions.

Discussion

We report the preferential location of recombination hotspot predictions around human genes with important role in the CNS, including channels, developmental and neuronal cell adhesion genes. Strikingly, such CNS genes were also found to display the combination of strong sequence conservation1, 2, 3, 29, 30 and accelerated sequence evolution in hominids.3, 4, 5, 6 Accelerated sequence evolution in hominids was interpreted as more prevalent positive selection or alternatively relaxed selective constraint in hominids.3 Our results support more prevalent positive selection as a cause of accelerated evolution, based on the assumption that recombination hotspots can evolve to increase the efficacy of positive selection on a constrained background.7 Thus, the enrichment of hotspots at CNS gene loci might indicate widespread continued positive selection in the human lineage.

Both the influence of negative and positive selection intensity on local recombination modifiers is consistent with theoretical models.31, 32 However, an exclusive influence of negative selection would not explain the strongest enrichment of hotspots at loci with accelerated hominid evolution. Acknowledging the large body of previous work on recombination promoting selection,33, 34 we therefore explain our observation by an influence of both positive and negative selection on local recombination rate. In this context it is most important that earlier results had indicated that the realized strength of selection is increased when recombination is present.7 Therefore, hotspots may often have evolved at CNS genes to ease positive selection of beneficial mutations that require particularly strong sequence conservation in their neighboring sites. In that case, our results indirectly support the claim of widespread positive selection on genes with important function in the human nervous system.4, 5 If reasoning into the opposite direction, our results support an influence of selection intensity on the location of recombination modifiers. An important part of the respective selective forces could act on noncoding, presumably regulatory, sequences.35 Accordingly, similar types of CNS genes that are associated with hotspot predictions also had shown genomic characteristics of genes with more tightly regulated expression.30

If selection intensity would exert an important influence on the distribution of meiotic recombination sites across the human genome, the maintenance of large-scale recombination rates36 could be the reflection of persistent negative and recurrent positive selection that acts on regulatory sequences located in a wider genomic range around their targets genes. An influence of selection intensity on hotspot activity appears reasonable, because recombination hotspots seem to evolve as sequence-related traits, as shown by their relationship to the CCTCCCT motif16 and the earlier observation of recombigenic alleles within hotspots.37 In any case, the enrichment of recombination hotspots around CNS genes indicates that LD mapping studies of complex CNS disorders are particularly challenging. It also could make it more difficult to find a signature of positive selection at genes with important roles in the human CNS. This may particularly apply with multiple hotspots emerging and vanishing at those types of genes,38, 39 what may contribute to the limits of the outlier approach.40 In this context, it might be worth mentioning that our hypothesis does not require that single recombination promoting alleles underlie the population genetic signatures that lead to hotspot predictions.

Recombination rates are further known to correlate with GC-content,41 but GC-content explains only a small fraction of fine scale recombination rate variation.17 Therefore, we consider it unlikely that GC-content exerts an important confounding influence on the functional bias in hotspot location. Moreover, further analysis shows that GC-content is rather below average than above average around important CNS genes29 (and own unpublished data). Instead, the hypothesis that more intense selection leads to increased recombination (at CNS genes and elsewhere) raises the question, whether the correlation between GC-content and recombination could be confounded by the parallel correlation between GC-content and expression levels (and consequently selection intensity42). It is also possible to speculate about many additional confounding factors, such as for instance, LTR repeats,16 which could interfere with the functional bias in hotspot location. However, even this speculative case does not necessarily provide a neutral explanation of the reported functional bias, because a biased location of recombination modifying LTR repeat could equally be the consequence of increased selection intensity at CNS gene loci. Moreover, any such speculative confounding factor would have to explain that the same types of CNS genes not only show an association with recombination hotspot predictions, but also display accelerated evolution in comparative genomic data.

An explanation for widespread ongoing selection at CNS gene loci might be given by the high prevalence of genetically complex CNS disorders with relatively early onset, for instance, epilepsy, migraine, schizophrenia and affective disorders. Because these CNS disorders are probably evolutionarily disadvantageous traits, selection may act against the underlying susceptibility alleles and in favor of protective alleles. Many respective susceptibility alleles could be a consequence of genetic changes that lead to human-specific CNS traits. Thus, successive genetic adaptations that produced human-specific traits might have changed the genetic background in a way that turned certain ancestral alleles into susceptibility alleles. That genetic background can determine disease allele status is well known from model organism genetics and monogenic disease. Therefore, one may hypothesize under an ancestral susceptibility allele model8 that positive selection might act in favor of derived protective alleles. Ancestral alleles might be preferentially turned into susceptibility alleles, if encoding the ‘conserved core’ of the archetypic vertebrate brain.30, 43, 44 Such CNS susceptibility genes may differ from other genes in the strength of constraint on sites that neighbor a positively selected allele, as indicated by their stronger sequence conservation.1, 2, 3, 29, 30 Therefore, more recombination could be advantageous at these loci. If this explanation is correct, widespread positive selection would be the other side of a high locus heterogeneity of complex CNS disorders that is based on susceptibility alleles that are ancestral and common. The soon availability of genome-wide association data for various complex CNS disorders will allow for a systematical testing of this hypothesis.

In conclusion, we provide strong evidence for the preferential location of recombination hotspots at genes with important molecular role in the human CNS. We propose that the combined influence of intense negative and positive selection could have promoted the enrichment of hotspots at these loci. This would imply widespread positive selection on human CNS genes, which could be related to the high prevalence of nervous system disorders.