Introduction

The CFTR (cystic fibrosis transmembrane conductance regulator) gene has been extensively characterized for its pathologic variability. However, the study of its overall random variability, within which disease-causing mutations should be framed, has started only recently.1

For many of the several hundred CFTR variants reported, it is not known whether or not they are CF-causing, and this may create difficulties for genetic counselling. Besides the obvious criteria that identify CF-causing mutations with certainty (eg frameshift and termination mutations), a purely statistical approach to identifying variants that cannot be fully penetrant CF-causing mutations has been proposed by Bombieri et al.2 It is based on the consideration that every CFTR variant with a frequency certainly higher than the cumulative frequency of the not unambiguously identified CF-causing alleles cannot be a fully penetrant CF-causing allele. It was applied to a random sample of 191 Europeans (=382 genes), a population in which the cumulative frequency of the not unambiguously identified CF-causing alleles is 0.004 (the difference between 0.02, the total frequency of the CF-causing alleles, and 0.016, the total frequency of the well-identified CF-causing alleles: WHO Report3). In that study, 10 alleles were classified as certainly non-CF-causing.2
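The frequency criterion described above can be illustrated with a one-sided exact binomial test. This is only a sketch under stated assumptions, not the procedure of Bombieri et al: the function names, the use of a binomial tail probability to stand in for "certainly higher", the significance level, and the example allele counts are all hypothetical. Only the 0.004 threshold and the 382-gene sample size come from the text.

```python
from math import exp, lgamma, log

def binomial_upper_tail(k: int, n: int, p: float) -> float:
    """Return P(X >= k) for X ~ Binomial(n, p), with 0 < p < 1."""
    if k <= 0:
        return 1.0
    total = 0.0
    for i in range(k, n + 1):
        # Work in log space so large binomial coefficients do not overflow.
        log_term = (lgamma(n + 1) - lgamma(i + 1) - lgamma(n - i + 1)
                    + i * log(p) + (n - i) * log(1.0 - p))
        total += exp(log_term)
    return min(1.0, total)

def exceeds_threshold(count: int, n_genes: int,
                      threshold: float = 0.004, alpha: float = 0.01) -> bool:
    """True if the observed count is unlikely under frequency <= threshold.

    alpha is an assumed significance level, not a value from the paper.
    """
    return binomial_upper_tail(count, n_genes, threshold) < alpha

# Hypothetical counts of a variant among 382 genes.
for observed in (2, 8):
    print(observed, exceeds_threshold(observed, 382))
```

Under these assumptions, a variant whose count clears the test would be classified as not being a fully penetrant CF-causing allele, whereas a variant seen only a couple of times would remain unclassified.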

The present paper reports eight further alleles that are certainly not fully penetrant CF-causing alleles, identified through this purely statistical approach. However, the most interesting finding concerns the very different patterns of variability found in the CFTR genes carrying the M470 or the V470 allele.

Materials and methods

The sample

A large part of the present sample was the same as previously described.1 It consists of 1337 healthy, unrelated individuals (selected on the basis of the birth place of the four grandparents) from six geographical areas: Northern Italy, Verona (n=300); Central Italy, Rome (n=300); Southern France, Montpellier (n=300); Northern France, Brest (n=278); Czech Republic, Prague (n=118); and Spain, Barcelona (n=41). All individuals gave their informed consent. Since not all individuals have been studied for all the 27 exons of the CFTR gene, an average sample size has been computed. It amounts to about 1700 haploid genomes. For a detailed list of the sample size studied for each exon see Modiano et al.1

Mutation analysis

Genomic DNA was extracted from blood samples, amplified in vitro by PCR and analysed by DGGE2 or DHPLC.4 Every mutant discovered by these methods was sequenced with the ABI PRISM 377 or 310 Sequence Analyser. Some variants were studied, on a fraction of the total sample, with a restriction enzyme-specific method: the cSNSs numbered, as in Table A1, 1, 12, 20, 24–26, 28, 29, 37, 45, 48, 56, 59, and 60; and the intronic variants 3041–71g/c, 1001+11c/t, and 2752–15c/g (methods available on request from cristina.bombieri@medgen.univr.it).

Maximum likelihood (ML)

Estimates of haplotype frequencies, of linkage disequilibria and of their statistical significance were calculated by ARLEQUIN, ver. 2.000.5

Degree of heterozygosity (H)

The degree of heterozygosity (2pq for diallelic sites) has been calculated for each variable site both within the M and the V CFTR genes utilizing the allele frequencies for each variable site within the M (or the V) CFTR genes (see Table 1).
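As a small illustration of the heterozygosity calculation described above, the sketch below computes H as one minus the sum of squared allele frequencies, which reduces to 2pq for a diallelic site. The function name and the example frequencies are hypothetical and are not taken from Table 1.

```python
def heterozygosity(allele_freqs):
    """Expected heterozygosity H = 1 - sum(p_i^2); equals 2pq for two alleles."""
    if abs(sum(allele_freqs) - 1.0) > 1e-9:
        raise ValueError("allele frequencies must sum to 1")
    return 1.0 - sum(p * p for p in allele_freqs)

# Hypothetical frequencies of one variable site, computed separately
# within the M470 genes and within the V470 genes.
h_within_m = heterozygosity([0.90, 0.10])
h_within_v = heterozygosity([0.99, 0.01])
print(h_within_m, h_within_v)
```

Computing H separately within each subsample, as in this example, is what allows the variability of the M and V CFTR genes to be compared site by site.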

Table 1 Frequencies of the CFTR variants within the M or the V alleles

Web resources

Information about the CFTR gene sequence and mutations is available at the Cystic Fibrosis Genetic Analysis Consortium Web Site: http://genet.sickkids.on.ca/cftr

Results

A total of 4443 coding and 2367 noncoding bp (2184 bp intronic plus 183 bp of the UTR regions) had been studied by DGGE (denaturing gradient gel electrophoresis) or DHPLC (denaturing high performance liquid chromatography). Table A1 is an update of that already published in Modiano et al1 and reports the absolute and relative frequencies of the 61 cSNSs (single-nucleotide substitutions in a coding sequence) found in a Czech sample, larger than that already published, together with the updated European frequencies.

A detailed analysis of the cSNS variability has been presented elsewhere.1 Among the 61 cSNSs (45 nonsynonymous, and 16 synonymous) observed in the entire length of the gene, three (ref. nos. 16, 32, 60) were frankly polymorphic (q>0.05) and eight only slightly polymorphic (0.005<q<0.05); all the other cSNSs showed very low frequencies (34 of them were singletons).

Table A2 reports the frequencies of the 19 non-cSNS variant sites detected in this study: 16 intronic (12 SNSs, three STRs, and one nucleotide insertion) and three exonic (one SNS in the 5′UTR and two trinucleotide deletions in the coding sequence).

The density of polymorphic SNSs in the coding and in the noncoding regions turned out to be compatible (12/4443=1/370 and 4/2367=1/592 bp, respectively; P≈0.4); on the contrary, the density of rare SNSs appeared to be three-fold higher in the coding region (49/4443=1/91 bp in the exons and 9/2367=1/263 bp in the introns; P≈0.002).
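The density comparison above can be sketched as a 2x2 test on the counts of variable and invariant positions. The text does not state which test was used; the Pearson chi-square without continuity correction shown here is an assumption, chosen because it needs only the standard library. With the counts quoted above it gives P values of the same order as those reported.

```python
from math import erfc, sqrt

def chi_square_2x2(hits_a: int, size_a: int, hits_b: int, size_b: int):
    """Pearson chi-square (1 df, no continuity correction) for two proportions.

    Returns (chi2, p_value). The p-value uses the identity
    P(chi2_1df > x) = erfc(sqrt(x / 2)).
    """
    a, b = hits_a, size_a - hits_a      # region A: variable, invariant bp
    c, d = hits_b, size_b - hits_b      # region B: variable, invariant bp
    n = a + b + c + d
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    return chi2, erfc(sqrt(chi2 / 2.0))

CODING_BP, NONCODING_BP = 4443, 2367

# Polymorphic SNSs: 12 coding versus 4 noncoding.
print(chi_square_2x2(12, CODING_BP, 4, NONCODING_BP))
# Rare SNSs: 49 coding versus 9 noncoding.
print(chi_square_2x2(49, CODING_BP, 9, NONCODING_BP))
```

With counts as small as 4, an exact test might be preferred in practice; the chi-square form is used here only to keep the example short.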

On the basis of their frequencies, it has been possible to classify as not fully penetrant CF-causing alleles six cSNSs (ref. nos. 6, 20, 26, 29, 37, and 59, black arrows in Table A1), besides the four already classified in the previous investigation2 (ref. nos. 16, 32, 54, and 60, white arrows in Table A1), and two noncoding variants (ref. letters E and O in Table A2).

The availability of a large number of mutants collected from a random sample of individuals made it possible to compare the indirectly estimated relative mutation rates of the 12 possible types of substitution (Figure 1). The expected numbers of cSNSs have been computed assuming that μ is the same for all of them and that all mutational events had the same probability of being detected; therefore, since the four nonsynonymous cSNPs (ref. nos. 6, 16, 26 and 29) may not have been neutral, they have been excluded, and the total number of cSNSs was 57 instead of 61. The T↔A rate was much lower than expected (obs.=1; exp.=11.1; P≈0.002; ≈0.02 with the Bonferroni correction). The combined rate of the complementary C → T and G → A SNSs is about threefold higher than expected (obs.=23; exp.=7.9; P≈10⁻⁷), confirming already known notions.6, 7, 8 It is commonly accepted that the strong excess of these two SNSs is due to a particularly high probability that the C nucleotide (both in the sense and in the nonsense DNA strand), when it is followed by G, mutates to T (see, for example, Cooper and Krawczak9 and Cooper et al10). This is strongly supported by the present data. In fact, since 873 is the number of C in the 4443 coding bp of the CFTR gene, 873 is the number of CpN dinucleotides of this gene. Among them only 57 (6.5%) are CpG, whereas six out of the eight (75%) C → T mutations of the present study were in a CpG dinucleotide (P≈0). Similarly, the total number of NpG dinucleotides is 972 and only 57 are CpG (5.9%), whereas five out of the 15 (33.3%) G → A mutations of the present study were in a CpG dinucleotide (P≈10⁻⁴).

Figure 1

Indirect relative mutation rate estimates of the 12 types of cSNSs. The expected number for each of 12 cSNS, say X → Y, has been obtained by multiplying the proportion of X among the 4443 coding CFTR bp (T=1236; C=873; A=1362 and G=972) by 57 (the number of cSNSs) and dividing this figure by 3 (each nucleotide can mutate to the other 3). Complementary cSNSs have been combined because the two DNA strands exhibited compatible mutational behaviours (expressed by the ratio obs/exp).
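The expected counts described in this legend can be recomputed directly from the base composition. The sketch below follows the stated recipe (proportion of the starting base, times 57, divided by 3) and sums each substitution with its complementary one; the function names are illustrative. It reproduces the expected values quoted in the text for T↔A (11.1) and for C → T plus G → A (7.9).

```python
# Base composition of the 4443 coding bp, as given in the figure legend.
BASE_COUNTS = {"T": 1236, "C": 873, "A": 1362, "G": 972}
N_CSNS = 57  # cSNSs retained after excluding the four possibly non-neutral ones
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}

def expected_count(from_base: str) -> float:
    """Expected number of X -> Y cSNSs for one specific target base Y."""
    total_bp = sum(BASE_COUNTS.values())
    return BASE_COUNTS[from_base] / total_bp * N_CSNS / 3

def expected_combined(from_base: str) -> float:
    """Expected count of X -> Y pooled with its complementary substitution."""
    return expected_count(from_base) + expected_count(COMPLEMENT[from_base])

print(round(expected_combined("T"), 1))  # T -> A pooled with A -> T
print(round(expected_combined("C"), 1))  # C -> T pooled with G -> A
```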

An almost complete linkage disequilibrium (LD) has been observed between the M470V site and the two other highly polymorphic cSNS sites of the gene (ref. nos. 32 and 60). These LDs (D′=0.91 and 0.90, respectively) are shown in Figure 2; they are not due to recombination-free blocks.11, 12, 13 The two sites 32 and 60, in fact, are not in strong disequilibrium with each other in the CFTR genes carrying the M470 allele, thus suggesting that they are very ancient. This situation is strongly reminiscent of that of Rh, the genetic system in which the LD phenomenon was first discovered.14, 15 In fact, this system too consists of one polymorphic locus, D/d, and of two additional loci, C/c and E/e, which are highly polymorphic within the chromosomes carrying the D allele and barely polymorphic within the chromosomes carrying the d allele.

Figure 2

Haploid assortments for the three highly polymorphic cSNSs of the CFTR gene, showing the lack of haplotype variability associated with the V470 allele compared with M470. V and M indicate the M470V alleles. Three-letter words indicate haplotypes; V or M in the first position: M470V (site 16); t or g in the second position: 2694 t/g (site 32); a or g in the third position: 4521 a/g (site 60). The areas indicate haplotype frequencies. D and D′=absolute and relative linkage disequilibrium values, respectively. Reference numbers are as in Table A1. Accession numbers for these three highly polymorphic cSNSs in the dbSNP public database (http://www.ncbi.nlm.nih.gov/SNP/) are: rs213950 for M470V, rs1042077 for 2694t/g, and rs2800136 for 4521g/a.
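For readers unfamiliar with the D and D′ values in this legend, the sketch below shows the standard two-site definitions: D is the difference between the observed haplotype frequency and the product of the allele frequencies, and D′ is |D| divided by its maximum attainable value. The function name and the haplotype frequencies are hypothetical; the paper's own estimates were obtained by maximum likelihood with ARLEQUIN from unphased data, which this example does not attempt to reproduce.

```python
def linkage_disequilibrium(f11: float, f12: float, f21: float, f22: float):
    """Return (D, D') from the four haplotype frequencies of two diallelic sites.

    fij is the frequency of the haplotype carrying allele i at the first
    site and allele j at the second site.
    """
    if abs(f11 + f12 + f21 + f22 - 1.0) > 1e-9:
        raise ValueError("haplotype frequencies must sum to 1")
    p1 = f11 + f12            # frequency of allele 1 at the first site
    q1 = f11 + f21            # frequency of allele 1 at the second site
    d = f11 - p1 * q1
    if d >= 0:
        d_max = min(p1 * (1 - q1), (1 - p1) * q1)
    else:
        d_max = min(p1 * q1, (1 - p1) * (1 - q1))
    d_prime = abs(d) / d_max if d_max > 0 else 0.0
    return d, d_prime

# Hypothetical haplotype frequencies for two sites.
print(linkage_disequilibrium(0.55, 0.05, 0.10, 0.30))
```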

The large number of CFTR genes studied allowed us to subdivide the total sample into two subsamples consisting of genes carrying the M470 or the V470 allele, respectively. Table 1 compares the degree of variability of the CFTR genes in these two subsamples. The CFTR genes carrying the M allele are clearly much more variable than those carrying the V allele for most of the markers suitable for such a comparison (ie those for which a variant allele was found in at least one MM or VV homozygote, respectively), both for the ‘slow’ (mutation rate in the order of 10⁻⁸ to 10⁻⁷; n=39) and for the ‘fast’ (mutation rate 10⁻⁴ to 10⁻²; n=3; ie STRs) evolution markers.16 Thus, the estimate of the overall variability of the CFTR gene is the weighted mean of two very different patterns of variability: that of the CFTR genes carrying the M allele and that of the CFTR genes carrying the V allele, plus, obviously, that due to the M470V site itself. These findings show the existence of an ‘extended haplotype homozygosity’ (EHH) region,17 namely of an almost ‘allele-restricted’ monomorphic region concerning only the CFTR genes with the V allele. This strong preferential concentration of variability within the M CFTR genes turned out to be correlated with the distance from the 470 site, being stronger in the DNA sequence within about ±50 kb of it (Table 2).

Table 2 The intensity of the M-restricted variability depends on the distance from the M470V site

Discussion

Sabeti et al17 suggested that an EHH implies recent positive selection, and verified this hypothesis in two genes (G6PD and TNFSF5) known to have been recently subjected to positive selection. Thus, the present finding of an EHH region encompassing the M470V site strongly suggests that the CFTR gene recently underwent selection. This suggestion is in accordance with previous findings of extended homogeneous haplotypes associated with specific CF mutations.18, 19

Some features of the CFTR gene suggest a possible scenario for the selection process:

1) M470 is the ancestral allele. It is in fact the allele found in all the other species studied so far;20, 21, 22

2) M470 is almost fixed among sub-Saharan Africans: the combined V frequency in the three sub-Saharan African populations we have studied (Mossì, Burkina Faso, n=146 individuals; Ewondo, Ghana, n=10; Pygmies, CAR, n=10; unpublished data) was 0.02±0.01. This high prevalence of M470 presumably applies to all sub-Saharan populations;

3) the V470 allele is very frequent outside Africa, where it is even more common than the M allele (eg for the Europeans1 and for the Asians23);

4) the bulk of CFTR gene variability is restricted to the haplotypes carrying the M470 allele (Table 1).

The time elapsed since the radiation of H. sapiens from Africa to the rest of the Old World (only two to four thousand generations24) has been far too short to account, in terms of genetic drift alone,25 for such a tremendous increase of the V allele frequency. Therefore, a selective process seems more likely. As for the time of onset of the selection process, the great extension of the region encompassing the M470V site with an almost complete LD suggests that it is recent (see also Slatkin and Bertorelle26). We wish to suggest the involvement of a selectively advantageous mutation X that would have caused, in relatively few generations, the increase of the V allele frequency outside Africa.

The V allele frequency could have increased through one of the following three mechanisms. The first relates to the V mutation itself; the other two involve the hitchhiking phenomenon:27

1) X is V. The V allele is very common outside Africa because only there has it been advantageous. There are indications that the V allele might produce a less functional protein. The V470 allele has, in fact, been reported to have a 1.7 times lower intrinsic chloride channel activity,28 as confirmed by different studies.23, 29, 30 It might have conferred a selective advantage in particular environments, as suggested for CF heterozygotes and tuberculosis,31 or lung infections,32 or diarrhea caused by enterotoxic bacteria;33, 34

2) X is not V, and was already present in Africa, in CFTR genes with the V allele, before the migration of H. sapiens, but it did not confer any selective advantage. Its frequency increased dramatically in Europe following human exposure to different environmental conditions that made it advantageous;

3) X is not V, and arose in Europe, in one CFTR gene with the V allele, before the migration of H. sapiens towards Asia.

The first two possibilities require the additional hypothesis that the African V was carried by only one haplotype, while the third possibility is independent of the number of different African haplotypes carrying the V allele.

A choice among these three possibilities would require, at least, the ascertainment of the variability, if any, of the African haplotypes carrying the V allele.