The FRAXA and FRAXE allele repeat size of boys from the Avon Longitudinal Study of Parents and Children (ALSPAC)

The FRAXA and FRAXE alleles of the FMR1 and FMR2 genes located on the X chromosome contain varying numbers of trinucleotide repeats. Large numbers of repeats at FRAXA (full mutations) manifest as Fragile X syndrome, associated with mental impairment that affects males more severely. In this paper, we present the dataset of frequencies of FRAXA and FRAXE repeat size extracted from DNA samples collected from boys enrolled in the Avon Longitudinal Study of Parents and Children (ALSPAC). DNA was extracted from several types of sample: cord blood, and venepuncture blood taken in ALSPAC clinics at 43 months, 61 months, seven years or nine years. The DNA was amplified at FRAXA and FRAXE using fluorescent PCR in the Wessex Regional Genetics Laboratory, Salisbury District Hospital. The mean repeat size for FRAXA is 28.92 (S.D. 5.44), the median 30 and the range 8 to 68. There were particularly high numbers of boys with repeat sizes of 20 (10.67%) and 23 (7.35%). The mean repeat size for FRAXE is 17.41 (S.D. 3.94), with a median of 16 and a range of 0 to 61. There is a relatively high degree of variation in repeat size, particularly at FRAXA, and we suggest that the extensive data available from the ALSPAC study open up areas of research into understanding phenotypes associated with relatively unexplored repeat sizes. This could be particularly interesting for the lower repeat sizes occurring with high frequency at FRAXA in this population. As the data can be linked to exposures and phenotypes, they will provide a resource for researchers worldwide.


Introduction
The mutation associated with the Fragile X syndrome is within a CGG trinucleotide repeat array in the FMR1 gene, identified in 1991 (Pieretti et al., 1991). The gene is located in the Xq27.3 region of the long arm of the X chromosome (Lubs, 1969;Sutherland, 1979;Sutherland, 1977). The array contains a variable number of repeats, and each repeat size is regarded in genetic terms as an FMR1 allele (often referred to as a FRAXA allele) (reviewed in Pembrey et al., 2001). Similarly, FRAXE syndrome is caused by the expansion of a GCC repeat in the FMR2 gene (Knight et al., 1993). FRAXA allele classes are categorised by the number of repeats and defined as full mutation (>200), premutation (61-200), intermediate (41-60), common (11-40) and minimal (<11) (Murray et al., 1996). FRAXE allele classes are similarly defined, but are slightly different for intermediate  and common (11-31) classes.
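The FRAXA class boundaries listed above can be expressed as a small lookup. The following is a minimal sketch, assuming integer repeat counts and the Murray et al. (1996) boundaries as quoted in the text; the function name `classify_fraxa` is hypothetical and is not part of the ALSPAC dataset or its tooling.

```python
def classify_fraxa(repeats: int) -> str:
    """Map a FRAXA repeat count to an allele class.

    Boundaries follow the text: full mutation (>200), premutation (61-200),
    intermediate (41-60), common (11-40), minimal (<11).
    The function name and return labels are illustrative assumptions.
    """
    if repeats < 0:
        raise ValueError("repeat count cannot be negative")
    if repeats > 200:
        return "full mutation"
    if repeats >= 61:
        return "premutation"
    if repeats >= 41:
        return "intermediate"
    if repeats >= 11:
        return "common"
    return "minimal"


# The two ends of the FRAXA range reported in this dataset (8 and 68 repeats).
print(classify_fraxa(8), classify_fraxa(68))
```

Keeping the boundaries in one function makes it straightforward to swap in a different classification scheme when comparing against studies that use other cut-offs.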
Full mutations of FRAXA (> 200 CGG repeats with silencing of the FMR1 gene) are more common than those of FRAXE (Knight et al., 1996); in the literature, the population prevalence of full FRAXA mutations in males was estimated to be between 1 in 4000 and 1 in 5000, though acknowledging biases due to informed consent (Chandrasekara et al., 2017;Youings et al., 2000). For females the estimation is lower, at around 1 in 8000 (Crawford et al., 1999). FRAXE full mutation prevalence in males has been estimated at 1 in 23,400 (Youings et al., 2000).
Individuals with a full mutation inherit it from a female who carries a full mutation or premutation (Pembrey et al., 2001). Around a third of female gene carriers are intellectually impaired, usually less severely than males (Pembrey et al., 2001). Most males with Fragile X syndrome (full mutation) have a severe to moderate degree of mental impairment (Pembrey et al., 2001). The full mutation type for FRAXE syndrome has been found to result in much milder impairment than the FRAXA phenotype (Crawford et al., 1999).
In females, there is a well-established phenotypic effect of FRAXA premutation sized alleles of an increased risk of premature ovarian failure (Chandrasekara et al., 2017;Macpherson et al., 2003;Youings et al., 2000). There is also some evidence of an increased incidence of late-onset neurological disorder characterised by tremor and ataxia (FXTAS) in premutation males, with around 40% at risk, and, to a lesser extent, females, at ~8% (Chandrasekara et al., 2017;Hagerman et al., 2001;Hagerman et al., 2004). Several studies have suggested that intermediate alleles at both FRAXE and FRAXA may have a role in cognitive impairment in boys, more reliably for FRAXA (Murray et al., 1996;Youings et al., 2000). However, this finding has not been replicated in some other multinational studies (Youings et al., 2000). Specifically, a study investigating FRAXA intermediate alleles in ALSPAC data even found a significant deficiency of special educational needs (SEN) cases in boys with FRAXA intermediate expansions (Ennis et al., 2006).
An issue within the literature is that the definitions of the boundaries of classes of repeat numbers are sometimes different across studies (e.g. changes over time due to knowledge base or differing clinical definitions), thus making it harder to compare frequencies (see Cornish et al., 2008;Pembrey et al., 2001;Youings et al., 2000). The general variation of frequencies of repeat sizes is often overlooked, as are phenotypic associations with relatively low repeat sizes (particularly differences within allele classes). Additionally, some studies have found a correlation between the length of repeats and degree of phenotype, even within allele classes (Cornish et al., 2008).
Here we present a dataset of repeat size frequencies at FRAXA and FRAXE of boys from the ALSPAC study, which could provide an important resource for researchers internationally.

Materials and methods
Ethical approval and consent
Ethical approval for the study was obtained from the ALSPAC Ethics and Law Committee and the Local Research Ethics Committees (Birmingham, 2018). Informed consent for the use of data via questionnaires and clinics was obtained from participants following the recommendations of the ALSPAC Ethics and Law Committee at the time. Questionnaires were completed by parents in their own homes and the return of a questionnaire to the study offices was interpreted as giving tacit consent to the study. Full details of the approvals obtained are available from the ethics pages of the study website (http://www.bristol.ac.uk/alspac/researchers/research-ethics/). Study members have the right to withdraw their consent for elements of the study or from the study entirely at any time. Consent for biological samples was collected in accordance with the Human Tissue Act (2004).

The ALSPAC sample
The ALSPAC study was designed to include all pregnant mothers with an expected date of delivery between 1st April 1991 and 31st December 1992 living in a specific area in the South West Regional Health Authority of England (Boyd et al., 2013;Fraser et al., 2013). The number of enrolled pregnant mothers was 14541, which resulted in 13988 infants who survived to one year. When the oldest children were approximately seven years of age, an attempt was made to bolster the initial sample with eligible cases who had failed to join the study originally. Including these additional children results in 14,901 infants who survived to one year. Data collection largely consisted of self-completion questionnaires (principally based on psychosocial factors, physical environments and health) filled in by mothers, partners, children and teachers. The study methodology (Golding et al., 2001), enrolment and response rates are given in detail on the study website (http://bristol.ac.uk/alspac/index.html). Please note that the study website contains details of all the data that is available through a fully searchable data dictionary and variable search tool (http://www.bristol.ac.uk/alspac/researchers/our-data/). The data presented here on maternal age, parity, parental education levels and socioeconomic status were obtained during pregnancy from maternal self-completion questionnaires.

Amendments from Version 1
We have added information into the paper about the pattern of data across ethnicities. We have added further information about differences in allele classification criteria in the strengths and limitations section, including giving the percentages referring to the classification of 41 to 54 in the intermediate allele class. Given the differences in allele classifications, we have provided supplementary material of a table giving all the individual genotype information (frequency of each count of repeats at FRAXA and FRAXE). This will allow researchers to classify or split the data in whichever way they prefer.
Any further responses from the reviewers can be found at the end of the article.

Sample collection
The DNA of children in the study was gathered by the ALSPAC study team; samples were collected from 5275 males in total (Ennis et al., 2006). Permission for DNA extraction was obtained as written consent from the study mothers during pregnancy (for cord blood), as well as in ALSPAC clinics for venepuncture blood taken at 43 months, 61 months, seven years and nine years. Some samples were extracted from buccal wash when the child did not consent to venepuncture. The ALSPAC study team double coded all samples for anonymity.

DNA amplification and analysis
Cord blood samples were collected in heparin at birth and stored at -70°C for five to eight years before DNA extraction (Jones et al., 2000). Blood samples from children at 43 or 61 months were stored for one month to two years before extraction (Jones et al., 2000). Samples from children at seven and nine years were stored at -20°C for up to three weeks before DNA extraction (Jones et al., 2000). The samples were provided to the laboratory as 250 ng aliquots in 96-well plates, with eight wells on each plate left empty for laboratory controls, consisting of DNA with known CGG repeat number and water controls (Ennis et al., 2006). When both a cord blood and a clinic sample were available for the same boy, the clinic sample was used to maximise genotyping efficiency, as heparin can inhibit PCR, and to minimise maternal contamination issues (Ennis et al., 2006). The DNA was amplified using fluorescent PCR (involving fluorescently labelled oligonucleotide primers) across the CGG FRAXA repeat and GCC FRAXE repeat. This took place in the Wessex Regional Genetics Laboratory.
Two multiplex PCR reactions were carried out, the details of which are given elsewhere (Jones et al., 2000;Murray et al., 1996;Youings et al., 2000). For both multiplex reactions, products were denatured and separated on a polyacrylamide urea gel and then run on an ABI 373A Stretch machine for eight hours at 40 W, 2500 V and 35 mA (Murray et al., 1996). The gel data were analysed on 672 GENESCAN software (ABI/Perkin Elmer) and then imported into GENOTYPER software (ABI/Perkin Elmer) to assign alleles (Murray et al., 1996).

Data processing
Genotype data were returned to the ALSPAC Study (University of Bristol). The anonymised dataset is in SPSS format and analysis was carried out using a combination of SPSS (version 11) statistical software and SAS (SAS Institute Inc., Cary, N.C.). The dataset includes: the ALSPAC pregnancy identifier (anonymised), whether the pregnancy was singleton/twins/triplets, the category of FRAXA/E repeat size, and the number of repeats at FRAXA/E.

Dataset
In all, the number of repeats (RPTs) was assessed at FRAXA for 5087 children and at FRAXE for 5070. Of these, 192 were additional cases recruited at seven years old. The proportion of the 7676 boys in the study population for whom RPT data were obtained was 66.3% for FRAXA and 66.1% for FRAXE. Those who were missed included: those for whom permission had not been given and thus DNA was not available (2401); those where the assay had not worked (164), including those with high levels of premutations and the full mutation; and additionally, where there was not enough DNA, or samples had a heterozygous genotype indicating the possibility of two X chromosomes or maternal contamination (27) (Ennis et al., 2006). As PCR is not possible for amplification of large premutations/full mutations, these individuals are not included in this dataset. Table 1 shows the percentage of boys for whom there is FRAXA repeat data within each category of sociodemographic attributes. Among the parents of boys in the ALSPAC study, FRAXA repeat information was slightly more likely to be available when the education levels achieved by the mother and the father were higher, the father's occupation was of a higher social class, and the mothers were older. However, there was no difference in availability by the number of pregnancies the mother had had where the baby reached viability (parity). The patterns in regard to FRAXE were similar.
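The coverage percentages quoted above follow directly from the counts in the text. As a short worked check, using only figures stated in this section (the variable names are illustrative):

```python
# Counts taken from the text of this section.
STUDY_BOYS = 7676                           # boys in the study population
ASSESSED = {"FRAXA": 5087, "FRAXE": 5070}   # boys with repeat data per locus

# Percentage of the study population with repeat data, to one decimal place.
coverage = {locus: round(100 * n / STUDY_BOYS, 1) for locus, n in ASSESSED.items()}
print(coverage)  # {'FRAXA': 66.3, 'FRAXE': 66.1}
```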
The error rates for FRAXA and FRAXE genotypes (i.e. a repeat number difference >1 between duplicates) were 0.52% and 0.65% respectively, calculated by comparing duplicate samples. Duplicates that were not matched and samples which failed to amplify were categorised as failed (Ennis et al., 2006).
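The duplicate-based error rate described above can be sketched as follows. This is an illustration only: it assumes each duplicate is recorded as a pair of repeat counts, and both the function name `discordance_rate` and the toy pairs are hypothetical rather than taken from the laboratory data.

```python
def discordance_rate(pairs, tolerance=1):
    """Percentage of duplicate pairs whose repeat counts differ by more than `tolerance`.

    `pairs` is a sequence of (first_call, second_call) repeat counts.
    A difference of exactly `tolerance` repeats is not counted as an error,
    matching the ">1" criterion stated in the text.
    """
    if not pairs:
        raise ValueError("at least one duplicate pair is required")
    discordant = sum(1 for first, second in pairs if abs(first - second) > tolerance)
    return 100 * discordant / len(pairs)


# Made-up toy pairs, for illustration only: one of four differs by more than 1.
toy_pairs = [(30, 30), (29, 30), (20, 23), (31, 31)]
print(discordance_rate(toy_pairs))  # 25.0
```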

FRAXA
The mean number of repeats (RPTs) at FRAXA is 28.92 (S.D. 5.44), with a median of 30 and a range of 8 to 68. Using the arbitrary definitions in the literature (Murray et al., 1996), only three of the boys would be categorised as having a premutation (defined as 61-200 RPTs) and 168 as intermediate (41-60 RPTs) (Table 2). In all, approximately 11% have an RPT >32. The frequency distribution also has a notably high number of individuals with repeat numbers of 20 (10.67%) and 23 (7.35%), disrupting the normal distribution around the mean (Figure 1).
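Summary statistics of this kind can be recomputed from a table of repeat-count frequencies. The sketch below is a hedged illustration: the `summarise` helper is hypothetical, the toy frequencies are made up and are not the ALSPAC counts, and the use of the sample (n - 1) standard deviation is an assumption, since the text does not state which form was used.

```python
import math


def summarise(freq):
    """Mean, SD and median from a {repeat_count: number_of_boys} table."""
    n = sum(freq.values())
    mean = sum(r * c for r, c in freq.items()) / n
    # Assumption: sample standard deviation (n - 1 denominator).
    variance = sum(c * (r - mean) ** 2 for r, c in freq.items()) / (n - 1)

    def value_at(index):
        # Walk the cumulative counts to find the value at a 0-based rank.
        seen = 0
        for r in sorted(freq):
            seen += freq[r]
            if index < seen:
                return r
        raise IndexError(index)

    if n % 2:
        median = float(value_at(n // 2))
    else:
        median = (value_at(n // 2 - 1) + value_at(n // 2)) / 2
    return {"n": n, "mean": mean, "sd": math.sqrt(variance), "median": median}


# Made-up toy table, for illustration only.
toy = {20: 2, 29: 1, 30: 4, 31: 1}
stats = summarise(toy)
share_at_20 = 100 * toy[20] / stats["n"]  # percentage of boys with 20 repeats
print(stats["mean"], stats["median"], share_at_20)  # 27.5 30.0 25.0
```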

Figure 1. Distribution of FRAXA repeats.
Graph of the distribution of number of repeats at FRAXA, for all boys for whom valid number of repeats (RPT) data were obtained.

FRAXE
The mean number of repeats at FRAXE is 17.41 (S.D. 3.94), with a median of 16 and a range of 0 to 61. Using the definitions in the literature (Murray et al., 1996), one of the boys would be classed as having a premutation (Table 3). There is less variation in the FRAXE RPTs compared to the FRAXA RPTs (Figure 2), but a suggestion of a bimodal distribution. The pattern of data is unchanged across ethnicities for both FRAXE and FRAXA (5% non-white).

Strengths and limitations of the dataset
One of the strengths of these data is that they include a very large number of individuals and we present the general variation across RPT sizes, which is sometimes overlooked in the literature. Additionally, given the extensive information gathered about children in the ALSPAC study, there is potential to look at the RPT size variation to ascertain if there are any correlations with other variables. This could open up interesting avenues of research into the phenotypes associated with different RPT sizes, particularly for the allele categories rarely studied in the literature so far (such as 'minimal' and 'common') and for the relatively unexpected peak seen in the FRAXA data around 20/23 RPTs (Figure 1). Very few studies have looked at phenotypes associated with very low numbers of repeats at FRAXA or FRAXE. A few smaller studies did find an association of alleles of less than 26 repeats at FRAXA with increased risk of developing fertility problems, behavioural problems and/or having children with developmental disability or psychiatric illness; this has not been replicated by larger studies and as such remains controversial (Kraan et al., 2018).
There was some bias in that the boys without RPT data were more likely to have had mothers who were younger, less well educated and of lower social class (Table 1). This needs to be taken into account in further studies, including subsequent correlations with variables from the ALSPAC data. Another limitation is that ascertaining repeat sizes of large premutation or full mutation alleles was not possible and these data are not represented in the dataset. Additionally, as discussed in the introduction, the boundaries of classes of repeat numbers vary across studies, and particularly for FRAXA there is another commonly used distinction (intermediate as 41-54 repeats and premutation as 55-200 repeats). However, if we set the boundaries this way, the intermediate percentage decreases only slightly, to 3.24%, and the premutation percentage increases to 0.12%. Although differing boundaries limit comparison across studies, we have included a full table of genotype data in Supplementary material (DOI: 10.6084/m9.figshare.11695440) such that researchers can classify whichever way they prefer.
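The effect of moving the intermediate/premutation boundary can be checked arithmetically from figures given in the text. The sketch below assumes that the reported percentages use the 5087 boys with FRAXA data as the denominator (an assumption, as the tables are not reproduced here) and asks how many boys with 55-60 repeats would reproduce the reported 3.24% and 0.12%. The names are illustrative.

```python
# Figures stated in the text for the Murray et al. (1996) boundaries.
N_FRAXA = 5087            # boys with FRAXA repeat data (assumed denominator)
INTERMEDIATE_41_60 = 168  # boys with 41-60 repeats
PREMUTATION_61_200 = 3    # boys with 61-200 repeats


def pct(count, total=N_FRAXA):
    """Percentage of the total, rounded to two decimal places."""
    return round(100 * count / total, 2)


def reclassify(moved):
    """Percentages if `moved` boys with 55-60 repeats shift to the premutation class."""
    return pct(INTERMEDIATE_41_60 - moved), pct(PREMUTATION_61_200 + moved)


# Which values of `moved` reproduce the reported 3.24% and 0.12%?
consistent = [k for k in range(INTERMEDIATE_41_60 + 1) if reclassify(k) == (3.24, 0.12)]
print(consistent)  # [3]
```

Under the stated assumption, the reported percentages are consistent with three boys having between 55 and 60 repeats; the supplementary genotype table can be used to confirm this directly.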

Data availability
Underlying data
ALSPAC data access is through a system of managed open access. The steps below highlight how to apply for access to the data included in this paper and all other ALSPAC data.
1. Please read the ALSPAC access policy (http://www.bristol.ac.uk/media-library/sites/alspac/documents/researchers/data-access/ALSPAC_Access_Policy.pdf), which describes the process of accessing the data and biological samples in detail and outlines the costs associated with doing so.
2. You may also find it useful to browse our fully searchable research proposals database (https://proposals.epi.bristol.ac.uk/), which lists all research projects that have been approved since April 2011.
3. Please submit your research proposal (https://proposals.epi.bristol.ac.uk/) for consideration by the ALSPAC Executive Committee using the online process. You will receive a response within 10 working days to advise you whether your proposal has been approved.
If you have any questions about accessing data, please email: alspac-data@bristol.ac.uk (data) or bbl-info@bristol.ac.uk (samples).
The ALSPAC data management plan (http://www.bristol.ac.uk/media-library/sites/alspac/documents/researchers/data-access/alspac-data-management-plan.pdf) describes in detail the policy regarding data sharing, which is through a system of managed open access.

Extended data
The supplementary material for this paper is a table of individual genotype data, giving the frequency of each count of repeats at FRAXA and FRAXE (DOI: 10.6084/m9.figshare.11695440).

It would be good for the authors to provide information about the break-down of the repeat size distribution by ethnic groups, given the UK population's ethnic heterogeneity, unless all the samples are of single origin Caucasian ethnicity. This point is pertinent given that the high frequency of 20- and 23-repeat FRAXA alleles in the ALSPAC cohort has not been previously reported elsewhere.
The authors have suggested that future studies could look at phenotypes associated with very low numbers of repeats at FRAXA or FRAXE, and certainly the 20- and 23-repeat FRAXA alleles would be interesting study candidates. Besides the repeat size, repeat structure (e.g. repeat interruptions) might also be worth a look with regard to transmission stability and cognitive phenotype variability/expressivity.
Another point that needs addressing is the fate of the 164 samples that were excluded from further analysis due to standard PCR amplification failures. The authors acknowledged that some of these samples likely include high-level pre-mutation and full mutation samples. It is a pity that there were no follow-up analyses of these samples. Judging from the slab gel instruments used, the PCR genotyping was likely performed in the 1990s and Southern blot re-analysis then may not have been an option if DNA quantity was limited. However, the authors ought to now consider re-analysing these few samples using the newer and more powerful TP-PCR methods capable of sizing all premutations and detecting all full mutations. This would render a more complete picture of the FRAXA and FRAXE repeat size spectrum and allele distribution and frequencies in the study cohort. The paper's title, and the statement "a dataset of repeat size frequencies at FRAXA and FRAXE of boys from the ALSPAC study", would then also be more justified.
Given the different repeat size classifications adopted in different studies, it would be useful if the individual genotype information be made available in the supplementary information. This will enable readers to re-plot and re-classify allele sizes according to their preferred standard.

Are sufficient details of methods and materials provided to allow replication by others? Yes
Are the datasets clearly presented in a useable and accessible format? Partly

Competing Interests: No competing interests were disclosed.
Reviewer Expertise: Molecular genetics of repeat expansion disorders.
We confirm that we have read this submission and believe that we have an appropriate level of expertise to confirm that it is of an acceptable scientific standard, however we have significant reservations, as outlined above.
correlate these low repeat numbers with the cognitive data that seems to be available at multiple time points in the ALSPAC study. The authors should do this if possible.
It is difficult to compare their prevalence results to previous studies because the standard definition of the premutation as 55 to 200 repeats is not used. Instead the premutation is defined as 61-200 and the intermediate is 41-60. Most other studies will use 41 to 54 repeats for the intermediate or gray zone. It would be helpful for them to define these categories in the introduction and use the standard definitions.
Under ALSPAC sample first paragraph "enrollment" is misspelled.
Overall this study demonstrates a high level of low CGG repeats in the FRAXA analysis but they have the opportunity to correlate this with the clinical data including cognitive testing and they need to do this.

Is the rationale for creating the dataset(s) clearly described? Yes
Are the protocols appropriate and is the work technically sound? Partly

Are sufficient details of methods and materials provided to allow replication by others? Yes
Are the datasets clearly presented in a useable and accessible format? Partly

Competing Interests: No competing interests were disclosed.
Reviewer Expertise: fragile X research
I confirm that I have read this submission and believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard, however I have significant reservations, as outlined above.
Author Response 22 Jan 2020
Rosie Clark, University of Bristol, Oakfield House, Oakfield Grove, Bristol, UK
We thank the reviewer for their constructive comments. Regarding their first point about numbers of individuals with a full mutation, these methods are beyond our abilities and facilities currently, but data from the ALSPAC biobank is available to researchers for further use as outlined in our access policy.
In response to the reviewer's second point about correlating repeat numbers with cognitive data, we agree that these points would be worth looking at but the aim of this Data Note paper is to highlight the data that are available so that other researchers can investigate different research questions.
Enrolment is the English spelling so we have not corrected this.
Other comments have been addressed as amendments to the text and supplementary material.