Propensity for somatic expansion increases over the course of life in Huntington disease

Recent work on Huntington disease (HD) suggests that somatic instability of CAG repeat tracts, which can expand into the hundreds in neurons, explains clinical outcomes better than the length of the inherited allele. Here, we measured somatic expansion in blood samples collected from the same 50 HD mutation carriers over a twenty-year period, along with post-mortem tissue from 15 adults and 7 fetal mutation carriers, to examine somatic expansions at different stages of life. Post-mortem brains, as previously reported, had the greatest expansions, but fetal cortex had virtually none. Somatic instability in blood increased with age, despite blood cells being short-lived compared to neurons, and was driven mostly by CAG repeat length, then by age at sampling and by interaction between these two variables. Expansion rates were higher in symptomatic subjects. These data lend support to a previously proposed computational model of somatic instability-driven disease.


Introduction
Most mutations are stably transmitted from parent to offspring. This reliable genetic principle does not hold, however, for dynamic mutation disorders such as Fragile X syndrome or Huntington disease (HD). In these diseases, a sequence such as a CAG repeat tract can expand during transmission, likely through mechanisms involving replication or transcription (Khristich and Mirkin, 2020). In general, the longer the repeat, the earlier the patient develops overt symptoms and the more aggressive the disease is likely to be (Koshy and Zoghbi, 1997). Thus, in HD, modest expansions of 40 repeats in the huntingtin gene (HTT) are associated with the appearance of motor, cognitive, and psychiatric disturbances in mid- or late adulthood, whereas large expansions of over 80 repeats cause childhood onset with additional features such as epilepsy and a more rapidly fatal course (Bates et al., 2015b; Sun et al., 2017). Yet on an individual subject basis, we cannot predict the disease course just from the size of the repeat tract: two individuals with the same length repeat expansion in HTT may experience disease onset decades apart (Andrew et al., 1993). The inherited pathological CAG repeat size accounts for about 42-71% of the variance in age at onset in HD (Squitieri et al., 2006), though the confidence limits narrow for tracts longer than 50 CAGs (Andrew et al., 1993; Bates et al., 2015b; Langbehn et al., 2004; Rubinsztein et al., 1997; Wexler et al., 2004).
Part of the reason for such variability could be that HD is still thought of primarily as a movement disorder, so age at onset in HD is typically defined as the point at which motor symptoms become unequivocal. But abnormalities in the brain are present from early development (Barnat et al., 2020), and mutation carriers may experience cognitive deficits, psychiatric disturbances, or even subtle motor impairments years before diagnosis (Bates et al., 2015b). It is challenging, and somewhat misleading, to pinpoint age at onset in a disease that evolves insidiously like HD.
A more interesting explanation takes into account the fact that CAG repeats do not just expand in the germline. They are also somatically unstable, such that different CAG expansions can be identified within the same tissue sample, in various organs and brain regions. Somatic mosaicism occurs both in mouse models of HD (Kennedy and Shelbourne, 2000; Larson et al., 2015; Lee et al., 2010) and in humans with HD (Kennedy et al., 2003; Swami et al., 2009; Telenius et al., 1994). The greatest increases in CAG tract length have been observed in the brain regions most affected in HD, the cerebral cortex and the striatum, whose neurons can harbor repeat tract expansions in the hundreds (Gonitel et al., 2008; Kennedy et al., 2003; Møllersen et al., 2010; Mouro Pinto et al., 2020; Shelbourne et al., 2007). Repeat expansions likely result from the formation of unusual DNA structures that predispose the tract to errors in mismatch repair (Khristich and Mirkin, 2020; Tabrizi et al., 2020). In fact, variants in several different DNA repair genes are associated with somatic instability in both animal models of HD (Dragileva et al., 2009; Pinto et al., 2013; Tomé et al., 2013) and HD patients (Flower et al., 2019; Lee et al., 2019).
Mounting evidence suggests that somatic repeat lengths better explain age at onset than the germline repeat, as their propensity to expand relates to both the baseline allele length and age (Lee et al., 2019). Interestingly, these data lend support to a mathematical model put forth over a decade ago (Kaplan et al., 2007). In brief, Kaplan et al. proposed that the onset and progression of triplet repeat diseases, including HD, are determined by the rate of somatic expansion in disease-relevant cells. Symptoms manifest when a critical proportion of cells (say, 20%) pass a pathogenic threshold, which would differ for different cell types. Their modeling suggests the threshold in striatal neurons for HD would be ~115 repeats. They further posited that at birth, nearly all cells would carry just the inherited number of repeats, but that over time the mutant alleles would further expand at a rate that increased linearly with the number of repeats. The rate of expansion would thus determine how rapidly the pathological state is reached, and should therefore influence disease onset and progression.
The Kaplan model is quite compelling, but testing its predictions requires longitudinal data on the evolution of somatic instability over time within patients. Given that the HTT mutation was discovered less than thirty years ago, such a study is only now becoming feasible. Even so, there are limits to how much of the model can be tested in humans. We cannot, for example, sample neurons over the life span to see how many come to exceed 115 repeats, or tally the proportion of neurons that reach a pathogenic threshold before phenoconversion. Nevertheless, we have been able to measure somatic repeat expansions in blood samples from HD carriers and patients over a twenty-year period and examine cortical tissue from mutation-carrying fetuses and deceased adults. By characterizing the degree of somatic expansion at these different stages, we were able to analyze associations between changes in the somatic expansion, age, and inherited CAG repeat length.

Determination of somatic expansion index in HD carriers
We collected biological samples from 72 HD mutation carriers across the life span: 7 fetuses, 50 adults, and 15 post-mortem brains (see Tables 1, 2 and 3). For all samples, we calculated an expansion index (EI) based on a specific PCR followed by fragment sizing to identify the peaks corresponding to different numbers of CAG repeats, or (CAG)n (Lee et al., 2010; Mouro Pinto et al., 2020). The expanded allele has a characteristic PCR profile with one particularly prominent peak, which provides the CAG repeat size given for diagnosis (Figure 1-figure supplement 1A; see 'Materials and methods'). This 'reference peak' is flanked by additional peaks that reveal the various repeat lengths in a given tissue, which we refer to as mosaicism or somatic instability. The fluorescence intensity of each peak reflects the proportion of cells bearing each somatic expansion, but it is worth noting that PCR is biased toward alleles containing smaller repeats. Because peaks to the left of the reference peak can be generated by polymerase slippage during PCR, we used only those to the right of the main peak to calculate the EI (Figure 1-figure supplement 1B). We normalized the heights of the somatic expansion peaks to the height of the reference peak, excluding any that were less than 3% of the main peak height. We multiplied each peak's normalized height by its position to account for the increased repeat length, then summed these weighted peak heights for each sample. An EI of 0 indicates no expansion beyond the inherited allele, and an index >0 indicates mosaicism of the CAG repeat expansion in the tissue.
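The EI calculation described above can be sketched in a few lines of Python. This is a minimal illustration rather than the analysis code used in the study: the function name, the input format (a mapping from repeat number to peak height), and the example peak heights are all hypothetical, and we assume that a peak's 'position' means the number of repeats gained relative to the reference peak.

```python
def expansion_index(peak_heights, reference_repeats, min_fraction=0.03):
    """Illustrative expansion index (EI) from a PCR fragment-sizing profile.

    peak_heights: dict mapping CAG repeat number -> peak height (hypothetical format).
    reference_repeats: repeat number of the main (reference) peak.
    min_fraction: peaks below this fraction of the reference height are excluded.
    """
    reference_height = peak_heights[reference_repeats]
    index = 0.0
    for repeats, height in peak_heights.items():
        gained = repeats - reference_repeats
        if gained <= 0:
            # Peaks left of the reference may be PCR slippage; the reference adds 0.
            continue
        normalized = height / reference_height
        if normalized < min_fraction:
            continue  # below the 3% cutoff
        # Assumption: "position" = number of repeats beyond the reference peak.
        index += normalized * gained
    return index


# Hypothetical profile: reference peak at 42 repeats.
example_profile = {41: 300.0, 42: 1000.0, 43: 200.0, 44: 50.0, 45: 20.0}
ei = expansion_index(example_profile, reference_repeats=42)
print(round(ei, 3))  # 0.3 (0.2 * 1 + 0.05 * 2; the 45-repeat peak is below 3%)
```

In this toy profile the peak at 41 repeats is ignored because it lies to the left of the reference peak, and a profile with no qualifying peaks to the right yields an EI of 0.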
The CAG expansion is usually followed by a CAACAG cassette that can be duplicated or, in some cases, deleted. In the reference sequence NG_009378.1, there are 21 CAG repeats, followed by the cassette CAACAGCCGCCA, seven CCG, and two CCT. When the cassette is changed to CAGCAGCCGCCA by loss of the CAA interruption (Wright et al., 2019), the tract becomes less stable and more prone to expansion (Khristich and Mirkin, 2020; Rolfsmeier et al., 2000; Xu et al., 2020). We did not detect this variant in our samples. We excluded one patient from the original cohort who had an additional CAA interruption in the CAG expansion.

The somatic expansion index increases over the life span in both blood and brain samples
Thanks to the long period of prospective clinical follow-up of HD patients at the Pitié-Salpêtrière Hospital, we were able to analyze blood samples that were collected during clinical visits at different ages for 50 HD patients (31 women, 19 men; mean reference (CAG)n 44.6 ± 3.5 [range 39-54]) (Table 2). With up to three samples (n = 50 for t1 and t2, n = 12 for t3), taken on average 12 and 7 years apart, respectively, we could analyze the progression of somatic instability over quite a long period of time. The EI increased over time (Figure 1A), with the aggregate EI increasing from t1 (0.620 ± 0.655) to t2 (0.881 ± 0.929) to t3 (0.967 ± 0.841) (Table 2). Regression and Pearson's correlation showed a significant linear relationship between EI and reference (CAG)n in the blood at t1 (r = 0.816, slope = 0.155, p=5.0e-13), t2 (r = 0.880, slope = 0.237, p<2.2e-16), and t3 (r = 0.901, slope = 0.203, p=6.3e-5) (Figure 1B, left). It is interesting to note that in our cohort, the lowest index value associated with a symptomatic subject was 0.137; this patient had a reference repeat of 39 CAGs and showed overt motor signs at the age of 49 (Table 2).
We were able to evaluate cortices from a separate group of 15 deceased patients (Figure 1B, right). As expected from previous studies (Shelbourne et al., 2007; Telenius et al., 1994), these tissues had the highest EI in our cohort (3.361 ± 2.390, range: 1.288 to 9.094) (Table 3), which correlated with the CAG repeat length (r = 0.615, slope = 0.492, p=0.015). We also had a post-mortem brain from a juvenile-onset case with a reference CAG repeat size of 128. The extreme mosaicism in this tissue, however, made it difficult to determine a main CAG peak or calculate an EI using the PCR profile, so we did not include it in our analyses (Figure 1-figure supplement 1C).
Because severe neuronal loss could skew the detection of expansions (Mouro Pinto et al., 2020), we were particularly interested in examining brain tissue from early development. We analyzed fetal cortical samples from seven HD gene carriers at 13 weeks' gestation (CAG: 40-46, Table 1; Figure 1B,C; Barnat et al., 2020). Although the adult HD cortex has been consistently found to bear the greatest somatic expansions, the fetal cortex showed almost no mosaicism: the somatic EIs were very small, ranging from 0.043 to 0.060 (0.050 ± 0.006), though they still correlated with CAG repeat length (p=0.023) (Table 1). These indices were extremely close to those from trophoblast tissues that were analyzed for prenatal diagnosis between 11 and 12 weeks' gestation (Figure 1C, left). Yet blood samples taken from their premanifest carrier parents at the same time (n = 6, CAG: 42.8 ± 2.5, 40-45; Table 1; these adults were not part of the longitudinal cohort) showed somatic expansions, with a mean EI of 0.256 ± 0.11 (range: 0.107 to 0.369; Figure 1C, left).
To better visualize these differences between parental blood and fetal tissue, we graphed somatic mosaicism in fetal cortices, trophoblasts, and premanifest parents for four different reference CAG lengths and estimated the percent of mutant alleles harboring each somatic expansion length (Figure 1C, right). There is clearly more variability in the parental blood (dark orange bars) than in the fetal brain tissue (green bars). Similarly, comparison of somatic mosaicism in three of the fetal brains, the blood samples (across three timepoints) from three patients in our longitudinal cohort, and three adult post-mortem cortices (Figure 2) clearly shows that mosaicism increased over time in blood cells but was even more marked in the adult brain, with more additional CAGs for a given reference CAG repeat size.

Determination of somatic expansion rate
We next asked whether the propensity to expand grows over time, and whether an 'expansion rate' (ER) that estimates the average annual expansion growth for each patient would correlate with the available clinical outcomes. To this end, we first ruled out the possibility of a sex effect by verifying that there was no sex difference in the age at onset (AO) (female: n = 31, 41.9 ± 8 [...]). We then calculated an ER for each of the 50 subjects using the slope of the regression line for the EI on ages at visits (0.024 ± 0.033 units per year [range −0.0003 to 0.1367], Table 2). Because calculating a rate entails having a baseline, we chose to extrapolate a plausible, if theoretical, EI at AO (EI-AO). To do so we used the slope and the intercept (estimated EI at birth) for each patient to estimate EI-AO (see 'Materials and methods'). A Pearson's correlation coefficient of r = 0.861 (p=2.1e-15) showed a strong association between the reference CAG repeat size and the somatic ER.

EI and ER correlations with age at onset, age at death, and disease manifest status
To determine whether EI or ER could explain the variation in AO not explained by the reference repeat, we first needed to calculate how (CAG)n correlates with AO in our sample. In our longitudinal cohort of 50 subjects, (CAG)n accounted for 47.6% of variance in AO, which is at the low end of the published ranges (~42-71%) (Squitieri et al., 2006; Figure 3A, left). This is likely due to our small sample relative to many such studies, which can include hundreds to thousands of patients. CAG repeat length accounted for 68% of variance in age at death (AD) (Figure 3A, right). Nevertheless, we proceeded to analyze the relationships between EI-AO, ER, AO, and AD. EI-AO had an inverse correlation with AO (r = −0.437, p=1.7e-03) and AD (r = −0.666, p=9.3e-03); it accounted for 20.7% of the variance in AO and 49.7% of the variance in AD from the longitudinal group (n = 14 patients who died during the study) (Figure 3B).
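As an illustration of the ER and EI-AO calculations described under 'Determination of somatic expansion rate', the following Python sketch fits an ordinary least-squares line to one subject's EI values against age at sampling, takes the slope as the ER, and extrapolates the fitted line to the age at onset. The function names and the example values are hypothetical, and the sketch assumes that EI-AO is simply the fitted line evaluated at AO; the exact procedure is given in 'Materials and methods'.

```python
def fit_line(x, y):
    """Ordinary least-squares fit of y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    if sxx == 0:
        raise ValueError("need samples taken at two or more distinct ages")
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def expansion_rate_and_ei_at_onset(ages, eis, age_at_onset):
    """Return (ER, EI-AO) for one subject.

    ER is the slope of EI on age (EI units per year); the intercept is the
    extrapolated EI at birth. EI-AO is assumed here to be the fitted line
    evaluated at the age at onset.
    """
    slope, intercept = fit_line(ages, eis)
    return slope, intercept + slope * age_at_onset


# Hypothetical subject sampled at three visits, with onset at age 45.
er, ei_ao = expansion_rate_and_ei_at_onset(
    ages=[30.0, 42.0, 49.0], eis=[0.65, 0.89, 1.03], age_at_onset=45.0
)
print(round(er, 3), round(ei_ao, 3))  # 0.02 0.95
```

With only two samples per subject, as for most of the cohort, the fitted line passes exactly through both points, so the ER reduces to the change in EI divided by the time between visits.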
ER had an inverse correlation with AO (r = −0.541, p=5.9e-05) and AD (r = −0.261, p=3.7e-01) (Figure 3-figure supplement 1); it accounted for 33% of the variance in AO and did not account for the variance in AD from the longitudinal group (Figure 3C). Notably, ER explained a larger proportion of AO variance than EI-AO. EI-AO accounted for more of the variance in AD than did ER, but it is difficult to draw conclusions based on the small sample of patients for the AD data. We then took an alternative approach to understanding variation in AO. We classified individuals into three groups (expected, earlier-than-expected, or later-than-expected AO), as defined by the model errors in the linear regression of AO on reference CAG repeat size (Figure 3A, left; see 'Materials and methods'). Neither EI-AO nor ER accounted for the differences in AO among these groups, despite a trend for lower ER in the later-than-expected group (Kruskal-Wallis test, ER: p=0.181, EI-AO: p=0.810, Figure 4A). Given the difficulties inherent in pinpointing AO, we asked whether we could see an influence of residual EI or residual ER on the more general classification of premanifest vs manifest. Here we found significant differences between groups in both residual EI and residual ER (Wilcoxon test, p=3.5e-05 and p=0.023, respectively; Figure 4B).
We next asked whether we could find correlations based on the post-mortem cortices (Table 3). A previous study on post-mortem HD brains showed that, after accounting for reference CAG repeat size, greater somatic expansions in the cortex correlated significantly with earlier AO (Swami et al., 2009). Since we did not have information on AO for the 14 subjects in the post-mortem group, we asked whether EI (from the post-mortem samples) or ER (from the blood sample group) correlated with residual AD. We calculated the residual AD after accounting for the effect of the reference CAG repeat length and compared it to the ER derived from the blood measures (p=0.028, adjusted R² = 28.6%; Figure 4-figure supplement 1A) or to the EI derived from the post-mortem cohort (p=0.578, adjusted R² < 0; Figure 4-figure supplement 1B). With the caveat that we do not know the cause of death in all cases (which could be due to causes other than HD), EI from brain samples did not correlate with the residual AD, but ER from blood samples correlated weakly with residual AD. A larger sample would likely reveal stronger correlations.

Within-subject variation in somatic mosaicism depends on (CAG)n, age, and the interaction of these two variables
We next sought to understand the relative contributions of the reference repeat size and age to the tendency toward somatic expansions. To account for the repeated measurements for each patient, a linear mixed-effects model (LMM) was fitted to the EI data on a log scale. Based on the fixed effects of the derived model, we found significant effects of age (coefficient = 0.028, SE = 0.001, p=3.4e-34) and number of CAG repeats (coefficient = 0.276, SE = 0.012, p=1.6e-30). In addition, the significant interaction between age and CAG repeat length suggests that, as CAG repeat length increases, the expansions become greater each year (coefficient = 0.002, SE = 3.8e-4, p=2.2e-7) (Table 4). As both age and CAG were mean-centered in the model, the exponentiated intercept indicates a predicted EI of 0.547 (intercept = −0.603, SE = 0.048, p=1.0e-16) for a hypothetical patient carrying the mean characteristics of the cohort (i.e., average age in the cohort of 44.6 years and mean CAG repeat size of 44.7). Sex was again included as a covariate and did not show any significant effect on EI (coefficient = 0.028, SE = 0.078, p=0.719). Finally, the contribution of each fixed-effects term explaining the EI, given by t values (estimate divided by SE) in descending order of importance, was as follows: t(CAG) = 24.0, t(age) = 22.8, and t(age × CAG) = 5.7. Based on the fixed-effects estimates extracted from the LMM, we plotted trajectories for the EI as a function of age (one trajectory for each CAG repeat length). The predicted values of each EI are shown on the original scale after back-transformation from the logarithmic scale over the same age intervals from the patient cohort for each (CAG)n (Figure 5). This model provides a glimpse of how instability evolves with (CAG)n, age, and the interaction between these two factors.
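To make the back-transformation concrete, the following Python sketch evaluates the fixed-effects part of the model using the rounded coefficients reported above. It is an illustration only, under several assumptions: the log is taken to be the natural logarithm (consistent with exp(−0.603) ≈ 0.547), the non-significant sex term and all random effects are omitted, and the function and constant names are hypothetical.

```python
import math

# Rounded fixed-effects estimates as reported in the text (log scale).
INTERCEPT = -0.603
BETA_AGE = 0.028
BETA_CAG = 0.276
BETA_AGE_X_CAG = 0.002

# Centering values reported for the cohort.
MEAN_AGE = 44.6
MEAN_CAG = 44.7


def predicted_ei(age, cag):
    """Predicted EI on the original scale from the fixed effects only.

    Assumes a natural-log model with mean-centered age and CAG; the sex
    term and subject-level random effects are ignored in this sketch.
    """
    centered_age = age - MEAN_AGE
    centered_cag = cag - MEAN_CAG
    log_ei = (
        INTERCEPT
        + BETA_AGE * centered_age
        + BETA_CAG * centered_cag
        + BETA_AGE_X_CAG * centered_age * centered_cag
    )
    return math.exp(log_ei)


# At the cohort means only the intercept remains.
print(round(predicted_ei(MEAN_AGE, MEAN_CAG), 3))  # 0.547
```

Because the interaction coefficient is positive, the predicted yearly increase in EI under this sketch is larger for longer reference repeats, which is the pattern shown by the fitted trajectories.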

Discussion
Our longitudinal study provides data that support the Kaplan model in several ways (Kaplan et al., 2007). First, the model predicts that at the beginning of life, all disease-relevant cells begin with the inherited repeat and negligible somatic instability. This turns out to be the case: we found almost no somatic mosaicism at the fetal stage. One might have expected that the high number of mitoses at this stage of brain development would make neural precursors sensitive to double-strand breaks and replication errors (Leija-Salazar et al., 2018;Schwer et al., 2016), but somatic expansions occur through different mechanisms than germline expansions (Khristich and Mirkin, 2020;Tabrizi et al., 2020). Although we did not have samples from embryos at later stages, there is such data for other diseases caused by repeat expansions. For instance, in Friedreich ataxia, which is caused by an expanded GAA repeat in the first intron on both alleles of the FXN gene, levels of instability found in tissues from an 18-week-old fetus were very low compared to adult-derived tissues (De Biase et al., 2007). In myotonic dystrophy, caused by a non-coding CTG repeat expansion in the DMPK gene, repeat instability was not observed at 13 weeks in fetal tissues, but a difference between tissues became detectable after 16 weeks (Martorell et al., 1997). All these studies suggest that, early in life, somatic instability is minimal.
Second, Kaplan et al. posited that somatic expansions should progress with age, even prior to disease onset. This also turns out to be correct: the presymptomatic carrier parents of fetal mutation carriers already showed somatic instability in the blood at the time of the pregnancy. It is remarkable, in fact, that increases in ER were evident despite our limited sample size and despite the fact that we had to derive this calculation from blood cells, which are not involved in HD pathogenesis and completely change over every six months or so. Unfortunately, for this very reason, the EI from the blood is not sufficient to predict AO, which is influenced by not only repeat length and somatic instability but other factors such as variants in DNA repair factors (see below). Third, the model predicts the rate of allele expansion should increase with time and be a function of the repeat length at that time. This is indeed what we found: not just greater somatic expansions with age and reference repeat length (as represented by EI), but a greater propensity to expand with age (as represented by ER).
One prediction of the Kaplan model we could not test is that there should be different thresholds of somatic expansions that must be reached for different brain regions to become pathological. It is hard to imagine how this particular prediction of the model could be tested, other than by performing extensive neuropathological studies on a great many mice at many different disease stages. In terms of correlation between somatic instability and disease progression, we did find group-level differences in EI and ER between the premanifest and manifest state. We could not establish a correlation at the individual subject level, however, likely because of the limited sample size as well as the difficulty of pinpointing phenoconversion in a disease that continues to unfold over many years. Stronger evidence on this point came from a large study of nearly 750 HD mutation carriers, which showed that larger somatic expansions are associated with worse clinical outcomes (earlier AO, higher motor and progression scores) in HD.
The most interesting questions that remain to be answered have to do with what drives somatic instability. The brain regions that have the greatest repeat expansions in HD, the striatum and cortex, are hypermetabolic from early in the disease course (Tereshchenko et al., 2020), and neurons show greater somatic instability than glial cells in models and post-mortem brains (Gonitel et al., 2008; Shelbourne et al., 2007). Metabolic stress may also lead to mitochondrial dysfunction and energy deficit in HD (Mochel et al., 2012; Roze et al., 2008; Tabrizi et al., 1999). An excess of excitatory glutamatergic inputs and NMDA receptor activation creates energy demands that are not sustainable in a context of diminished energy capacity, and may lead to cell death (Milnerwood et al., 2010; Mochel and Haller, 2011).
In fact, an excitotoxicity model of neurodegeneration was proposed for HD many years before the discovery of the genetic basis of the disease (Coyle and Schwarcz, 1976; Mcgeer and Mcgeer, 1976). The medium spiny neurons of the caudate and putamen, which are the most vulnerable in HD, receive their main input from cortical glutamatergic neurons; they are thus particularly susceptible to excitation and, in fact, HD can be mimicked by administering glutamate analogues to the striatum (Coyle and Schwarcz, 1976; Estrada Sánchez et al., 2008; Mcgeer and Mcgeer, 1976). In this context it is worth noting that variants in the GluR6 kainate receptor locus were found to account for 13% of the variation in AO that was not explained by CAG repeat number (Rubinsztein et al., 1997). Along similar lines, a recent study showed that absence of the aryl hydrocarbon receptor (AhR), which protects mice from excitotoxicity, greatly reduced behavioral deficits in the R6/1 transgenic model of HD (Angeles-López et al., 2021).
Hypermetabolism would also contribute to oxidative stress, which can cause DNA damage (Iyer and Pluciennik, 2021; Leija-Salazar et al., 2018). Large-scale studies have linked somatic CAG expansions in patients' blood to the presence of variants in DNA repair genes, not just in HD (Lee et al., 2019) but in other polyglutamine diseases as well (Bettencourt et al., 2016). In HD, somatic instability is influenced by polymorphisms in MSH3, MLH1, MLH3, and FAN1, which are all involved in DNA repair. Counterintuitively, loss of function of some [...] (Pinto et al., 2013). The reason for this may be that transcriptionally active genes elicit mismatch repair activity to guard genomic integrity, but long repeat tracts are difficult to repair accurately (Iyer and Pluciennik, 2021). A different mechanism is at work for FAN1, which actually stabilizes the CAG repeat in HD (Goold et al., 2019); loss of FAN1 function increases repeat instability (Loupe et al., 2020). Interestingly, there is evidence that double-strand break repair is dysregulated in HD: ATM (ataxia-telangiectasia mutated) is upregulated in brain tissue from HD mice and patients, and its heterozygous loss of function is protective in both mouse and Drosophila models of HD (Lu et al., 2014). It could be that the decline in DNA repair capacity or efficiency that comes with age (Gorbunova et al., 2007) contributes to the increasing somatic instability in blood cells, which, as we noted above, seem too short-lived to accumulate expansions as they do. An extended longitudinal study of the effect of DNA repair gene variants on somatic instability would be of great interest.

Figure 1. [...] Table 5). Different colors indicate different reference (CAG)n at diagnosis (main peak on PCR profile as shown in Figure 1-figure supplement 1). For each value, the disease status is indicated with an empty triangle (premanifest) or filled triangle (manifest; score >5 on the Unified HD Rating Scale total motor score [UHDRS-TMS]). Note that the measurement [...] (Table 2), who went from 54 to 55 (CAG)n at the second sampling, which was taken into account for the EI. Right panel: A closer look at the values clustered at the bottom of the axis in A ((CAG)n 39-46). The EI increases with progression to manifest state even for individuals with relatively small reference repeats. (B) Scatter plot and regression lines show the linear relation between EI and (CAG)n as observed in adult blood from the longitudinal cohort (left panel, n = 50 patients with at least two samples each; of these, 12 had a third sample; yellow: first sample, orange: second sample, and red triangles: third sample) and cortical tissue (right panel, n = 7 fetal brains, green triangles, and n = 15 adult brains, blue triangles). Pearson's correlation coefficients and estimated regression slopes with p-values, indicated in the upper portion of each graph, reveal a positive linear relation between EI and reference CAG repeat length. (C) EI values from seven fetal samples according to the reference CAG repeat, ranging from 40 to 46 (at 13 weeks' gestation). The instability indices of the cortical samples (green) overlap with those of the trophoblast samples (yellow); indices from carrier parents' blood are in orange (ages 25 to 34 years). Two of the fetuses had the same (CAG)n of 46 and thus overlap on the graph. Left panel: Comparison of brain tissue instability from HD carrier fetuses at 13 weeks (green), to the corresponding trophoblasts sampled for prenatal testing (yellow) and the premanifest carrier parents' blood (orange). Right panel: The percentage of mutant alleles bearing the different somatic expansions ascertained from the peak heights. Four graphs were plotted for the reference CAGs (40, 42, 45, and 46) determined on the fetal tissues. The parental blood samples show significant somatic expansions, whereas the trophoblast and the developing cortex show very little.

Delta age between samplings: t2 − t1, 12 ± 4.9 (n = 50, range 0-24); t3 − t2, 6.2 ± 5.2 (n = 12, range 0-18).

Figure 2. Somatic mosaicism increases in blood and post-mortem cortex over time. Comparison of mosaicism in cortical tissue from Huntington disease (HD) carrier fetuses at 13 weeks (green), blood samples over time (gold, orange, and red for t1, t2, and t3, respectively) and adult post-mortem cortices (blue). We ascertained the '% mutant alleles' (as in Figure 1C) from the peak heights from PCR profiles obtained on GeneMapper. Three reference CAG lengths (41, 43, or 45 CAG) were chosen from our cohort to illustrate the evolution of instability, and each graph represents one individual patient (from top to bottom): for the fetal samples, patients 2, 4, and 5 (Table 1); for the blood samples, the repeated measures from patients 4, 19, and 34 (Table 2); for the post-mortem samples, patients 2, 5, and 10 (Table 3).
Given that somatic instability influences disease progression, targeting the repeat instability is a very appealing disease-modifying strategy (Khristich and Mirkin, 2020). One possibility is to introduce DNA-stabilizing interruptions into the repeat tract via gene editing. Another is to modulate DNA repair activity in HD to retard somatic expansions (Dragileva et al., 2009; Pinto et al., 2013), but this might also run the risk of increasing overall genomic instability. A recent approach using a small molecule that specifically binds CAG slip-out structures was able to contract the expansions and reduce protein aggregates in the striatum of R6/2 mice (Nakamori et al., 2020). Further efforts to stabilize or contract somatic expansions are warranted, particularly if expansions within brain tissue can be reduced. Last but not least, there is much more work to be done to understand the mechanisms that trigger somatic expansions, whether they relate to excitotoxicity, and how they lead to neurodegeneration.

Figure 3. [...] reference CAG repeat length explains roughly half the variability in AO (adjusted R² = 47.6%, p=1.8e-8). HD individuals were classified as having onset earlier than expected (green), as expected (grey), or later than expected (red) according to their distance from the linear predictions given by the CAG repeat (see 'Materials and methods'). Right panel: The reference CAG repeat length explains 68% of the variability in AD (p=1.7e-04). (B) EI-AO explains 20.7% of the variability in AO (p=6.0e-04, left panel) and 49.7% of the variability in AD (p=2.9e-03, right panel). (C) ER explains 33% of the variability in AO (p=9.3e-06, left panel) and does not explain the variability in AD (adjusted R² < 0, p=0.348, right panel).

Sample collection

Longitudinal study
We recruited HD patients through the Department of Genetics of the Pitié-Salpêtrière University Hospital (Paris, France). The inclusion criterion was a pathological CAG repeat expansion of more than 38 repeats in the HTT gene. Age at disease onset was defined as the presence of a clinically significant movement disorder consistent with HD. We obtained blood samples with written informed consent according to French legislation (approval from local ethics committees on 19/12/1990 and 10/11/1992, followed by the Ethics committee Ile de France II on 30/9/2004 and 18/2/2010). All tested subjects were offered long-term follow-up and signed an informed consent form prior to clinical examination and interview. We determined AO by taking the earlier of the self-reported age at onset and the age at which motor signs were noted on examination by a neurologist.

Post-mortem cortical samples
Brain samples were collected as part of a program of 'Brain Donation for Research' (National Neuro-CEB Brain Bank, GIE Neuro-CEB BB-0033-00011). Brains were dissected in the neuropathological department of the Pitié-Salpêtrière University Hospital (Paris, France) to isolate samples from the frontal cortex.

Fetal samples
Approximately 20% of HD mutation carriers request prenatal diagnosis. After analysis of the fetal DNA, obtained by chorionic villus sampling, the parents can request termination of the pregnancy if the fetus carries the mutation; the termination is performed by manual vacuum aspiration under general anesthesia. Typically, the termination occurs at gestational week 13. We used standard obstetric protocols in accordance with the French guidelines for clinical practice. Prenatal visits and psychological support were provided for all participating couples, as standard practice, and no additional visits were planned due to participation in this study. The women signed an informed consent form during a prenatal visit, agreeing to the collection of fetal tissue following the eventual termination of the pregnancy. The study complied with all relevant ethical regulations, with approval from the French Agency of Biomedicine (n° PFS17-001; 24/01/2017). The brain tissue analyzed was from the developing cortex.

DNA extraction
Post-mortem brains and fetal tissues were rapidly frozen and stored at -80 °C until DNA extraction. DNA was extracted from brain tissues using the QIAamp Fast DNA Tissue Kit (Qiagen S.A., Courtaboeuf Cedex, France), according to the manufacturer's instructions. For blood samples, DNA was extracted using the Maxwell RSC Blood DNA kit (Promega, France EURL), according to the manufacturer's instructions. Finally, we measured DNA yields using a NanoDrop 8000 spectrophotometer (ThermoScientific, Illkirch Cedex, France).

... A., Courtaboeuf Cedex, France). After an initial denaturation for 10 min at 96 °C, samples were subjected to 35 cycles of 45 s of denaturation at 96 °C and 2 min 30 s of annealing-extension at 70 °C, followed by a final extension for 7 min at 72 °C. Each amplification product was mixed with Hi-Di Formamide and Genescan-400HD Rox size standard (Applied Biosystems, Foster City, CA). Fragments were separated on an Applied Biosystems 3730XL DNA Analyzer. We scored alleles with GeneMapper software v5.0 (Applied Biosystems). We used two sets of primers (see sequence below): HD-F2 with HD-WR2-hex to determine the CAG repeat length and instability, and HD-F2 with HD-WCAAM4-R-fam to determine the presence of an additional CAA interruption. We excluded any patients with a CAA interruption from this study (n = 1). To visualize the fragments, the primers used for the PCR contain a fluorescent tag, so that the fluorescence intensity is proportional to the number of amplified fragments.

Figure 5. Evolution of somatic expansions in HD patient blood is a function of age, CAG repeat length, and the interaction between age and repeat length. A linear mixed model was fitted to the longitudinal data using all blood samples collected. The fitted lines show that the predicted somatic expansion increases with age for all reference repeat sizes, most notably at greater reference repeat lengths. The sex of the patients is indicated by solid and dashed lines (male and female, respectively). Each curve covers the same age interval observed in our cohort for a given repeat length.

Measuring somatic CAG repeat expansions and calculating the somatic expansion index
We used the GeneMapper software v5.0 (Applied Biosystems) to analyze the somatic CAG repeat expansions. For any individual, the majority of PCR products peak around a main signal representing the reference CAG repeat size. Signals to the left of this peak include PCR 'stutter' inherent in the assay, but PCR products to the right represent somatically expanded CAG repeats only; these latter peaks were included. From the GeneMapper 'sample plot view,' we exported a data table for each sample containing the following information: sample name, called CAG allele, peak size in base pair (bp), peak height, area under the peak, and data point/scan number of the highest point of the peak. Based on the main expanded CAG peak size, we used an internal standard to assign, on a per plate basis, a main CAG length to each sample. We used peak heights to quantify mosaicism from GeneMapper traces. To calculate the proportion of expanded products for each sample, we normalized the heights of the expanded peaks to the height of the main CAG peak, multiplied by the position of the peak. We applied a relative threshold of 0.03 of the main peak, excluding peaks falling below this threshold from analysis. We selected this threshold based on the additional peaks in fetal tissues that were low in intensity but clearly distinguishable from background by the software. Finally, we summed all peak values to generate an expansion index.
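As an illustration only, the expansion index calculation described above can be sketched in Python. This sketch makes several assumptions that are not stated in the text: the "position of the peak" is taken to be the number of CAG repeats separating an expanded peak from the main peak, peaks exactly at the 0.03 threshold are retained, and the function name, input layout, and peak heights are hypothetical.

```python
def expansion_index(peaks, main_repeat, rel_threshold=0.03):
    """Sum of thresholded, position-weighted relative peak heights.

    peaks: dict mapping CAG repeat count -> peak height (hypothetical layout).
    main_repeat: repeat count of the main (reference) peak.
    """
    main_height = peaks[main_repeat]
    index = 0.0
    for repeat, height in peaks.items():
        offset = repeat - main_repeat  # assumed meaning of "position of the peak"
        if offset <= 0:
            continue  # skip the main peak and left-hand peaks (PCR stutter)
        relative_height = height / main_height
        if relative_height < rel_threshold:
            continue  # below the relative threshold of the main peak
        index += relative_height * offset
    return index


# Hypothetical trace: main peak at 43 CAGs, three expanded peaks to its right.
peaks = {42: 400, 43: 1000, 44: 500, 45: 200, 46: 20}
ei = expansion_index(peaks, main_repeat=43)
print(round(ei, 3))
```

In this toy trace the peak at 46 repeats has a relative height of 0.02 and is dropped, so only the peaks at 44 and 45 repeats contribute to the index.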

Statistical analyses
We conducted all statistical analyses using R version 3.6.1 (R Development Core Team, 2019; https://www.R-project.org/), and we generated plots with the ggplot2 R package (Wickham, 2009) (ggplot2_3.3.0). We generated correlation plots using the corrplot R package (corrplot_0.84). The level of statistical significance was set at p<0.05 for all tests.

Descriptive statistics
Descriptive statistics were reported for the HD patients with demographics and disease characteristics (sex, age, somatic expansion index, and Unified HD Rating Scale total motor score [UHDRS-TMS]) determined at each visit that included blood sample collection. We defined AO as the onset of motor signs, as defined by the patient and their family, or first neurological exam at which they were considered symptomatic, whichever was earlier. Patients with a UHDRS-TMS greater than 5, which indicates motor signs of HD, were considered to have 'manifest' HD. We summarized the data as n (number of available values), mean ± SD, and range (minimum and maximum) for quantitative variables and frequency counts and percentages for categorical variables.

Relationship between somatic CAG expansions and germline CAG repeat length
For samples collected from post-mortem brains (carrier fetuses and adult brain) or blood (two to three samples per patient), we studied the relationship between the CAG somatic expansions and the CAG repeat length by linear regression. We then determined the strength of association using Pearson's correlation coefficient (r), together with the slope and p-value of the regression line.
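For readers who want to see the quantities involved, the following is a minimal Python sketch of a simple linear regression with Pearson's r. It is not the authors' R code; the function name and the data values are hypothetical, and the p-value of the slope (which would come from a t distribution with n - 2 degrees of freedom) is deliberately not computed here.

```python
import math


def pearson_and_line(x, y):
    """Return Pearson's r, slope, and intercept of the least-squares line y ~ x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    r = sxy / math.sqrt(sxx * syy)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return r, slope, intercept


# Hypothetical data: reference CAG repeat length and expansion index per sample.
cag_length = [40, 42, 44, 46, 48]
exp_index = [0.5, 0.9, 1.2, 1.8, 2.1]
r, slope, intercept = pearson_and_line(cag_length, exp_index)
print(f"r = {r:.3f}, slope = {slope:.3f}, intercept = {intercept:.3f}")
```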

Regression analysis of disease characteristics with CAG repeat and somatic expansion measures
Expansion index (EI)
Prior to regression analysis, we transformed AO and AD values by the natural logarithm to better meet the linear model assumptions of normality and homoscedasticity (constant variance) of the residuals. Because we were able to collect blood samples at two or three time points for each patient in the longitudinal part of the study, we calculated corresponding EIs for each time point.

Expansion rate (ER)
From these EIs, we derived a rate of change in expansion over time (expansion rate or ER) in addition to the single time point measures. To investigate whether somatic instability itself evolves, i.e., whether the tendency to expand increases with age, we fitted a linear regression of EI on age for each individual and extracted both the slope and the intercept coefficients. We used the slope as the expansion rate (ER), while the intercept (EI-intercept) indicated a theoretical baseline value of the expansion index at age 0, i.e., at birth.

Expansion index at age at onset (EI-AO)
Even though the EI-intercept is too distant in time from the visits to be a realistic estimate of CAG instability at birth, we used the slope and the intercept for each patient to extrapolate a plausible (albeit theoretical) expansion index at AO (EI-AO).
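The per-patient derivation of ER, EI-intercept, and EI-AO can be sketched as follows. This is a hedged Python illustration, assuming an ordinary least-squares line of EI against age at sampling for each patient; the function names and the visit data are hypothetical. With only two visits the fitted line passes exactly through both points.

```python
def fit_line(ages, eis):
    """Least-squares slope and intercept of EI regressed on age at sampling."""
    n = len(ages)
    mean_age = sum(ages) / n
    mean_ei = sum(eis) / n
    sxx = sum((a - mean_age) ** 2 for a in ages)
    sxy = sum((a - mean_age) * (e - mean_ei) for a, e in zip(ages, eis))
    slope = sxy / sxx
    intercept = mean_ei - slope * mean_age
    return slope, intercept


def expansion_measures(ages, eis, age_at_onset):
    """Return (ER, EI-intercept, EI-AO) for one patient."""
    er, ei_intercept = fit_line(ages, eis)
    ei_ao = ei_intercept + er * age_at_onset  # extrapolation to age at onset
    return er, ei_intercept, ei_ao


# Hypothetical patient: three visits, motor onset at age 40.
visit_ages = [35.0, 45.0, 55.0]
visit_eis = [1.0, 1.5, 2.0]
er, ei_intercept, ei_ao = expansion_measures(visit_ages, visit_eis, age_at_onset=40.0)
print(er, ei_intercept, ei_ao)
```

In this made-up example the extrapolated intercept is negative, which illustrates why the EI-intercept is treated as a theoretical value rather than a realistic estimate of instability at birth.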
In a first analysis, we performed linear regressions to model the values of log-AO and log-AD, respectively, as a function of CAG repeat length, EI-AO, and ER. We used the p-value of the slope and adjusted R squared (R² adj) values to determine all associations. Sex differences in AO and AD were also assessed using Wilcoxon rank-sum tests. Finally, we generated a correlation matrix plot summarizing all pairwise correlations between the variables from the longitudinal cohort.
Since the CAG repeat length is a well-established predictor of AO, we carried out the following analyses to understand whether combining information from the CAG repeat length and evolution of the somatic CAG instability could better characterize the disease onset.
Determination of earlier-than-expected, as expected, or later-than-expected age at onset
Since AO, EI, and ER are all CAG length-dependent to a great extent, we sought a way to dissociate their contributions. To this end, we divided HD patients into three groups according to whether their motor symptom onset occurred earlier or later than the AO predicted by CAG repeat number [(CAG)n]. Following Swami et al. (2009), we calculated the residuals from the linear regression with log-AO as the dependent variable and (CAG)n as the independent variable to evaluate the differences between the observed and predicted AO. We standardized residuals to have mean zero and unit variance and defined onset groups as 'earlier' for residual values less than −0.5, 'later' for residual values greater than 0.5, and 'as expected' otherwise. We then performed a Kruskal-Wallis test to compare the ER and EI-AO values among these groups.
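The grouping rule can be illustrated with a short Python sketch. It assumes an ordinary least-squares fit of log-AO on (CAG)n and standardization by the sample standard deviation (n - 1 denominator); the text does not specify which variance estimator was used, and the function name and data are hypothetical.

```python
import math
import statistics


def classify_onset(cag, ao, cutoff=0.5):
    """Label each patient by the standardized residual of log(AO) ~ CAG."""
    log_ao = [math.log(a) for a in ao]
    n = len(cag)
    mean_x = sum(cag) / n
    mean_y = sum(log_ao) / n
    sxx = sum((x - mean_x) ** 2 for x in cag)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(cag, log_ao))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    residuals = [y - (intercept + slope * x) for x, y in zip(cag, log_ao)]
    # OLS residuals with an intercept already have mean zero; scale to unit variance.
    sd = statistics.stdev(residuals)  # assumption: sample SD (n - 1)
    groups = []
    for res in residuals:
        z = res / sd
        if z < -cutoff:
            groups.append("earlier")
        elif z > cutoff:
            groups.append("later")
        else:
            groups.append("as expected")
    return groups


# Hypothetical cohort: two CAG lengths, three ages at onset each.
cag = [40, 40, 40, 50, 50, 50]
ao = [50, 60, 72, 25, 30, 36]
groups = classify_onset(cag, ao)
print(groups)
```

A negative residual means onset occurred at a younger age than the CAG length alone predicts, which is why the 'earlier' group sits below −0.5.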

Relationship between somatic expansion and residual age at death
As a complementary analysis, similar to the previous AO study, we used data from the 14 deceased patients in the longitudinal cohort and data measured in the 14 post-mortem brains (Tables 2 and 3). Based on the residual AD (i.e., AD after subtracting the effect of (CAG)n using linear regression), we performed an association study with ER (blood samples) and EI at AD (post-mortem samples) using linear regressions. Associations were reported with the p-value of the slope and adjusted R squared (R² adj) values.

Influence of disease status on EI and ER in blood samples
The cohort had a sufficient number of subjects in the premanifest and manifest stages at the first visit to study the influence of disease status on the residual EI and ER after using linear regression to subtract the contribution of (CAG)n. Since we could correlate EI with disease status only at the first visit (too many patients had phenoconverted by the second visit), and because the contributions of premanifest/manifest status, CAG repeat length, and age could not be clearly distinguished, this was an exploratory study prior to modeling using the complete expansion data with age and CAG repeat length. Comparisons of EI and ER with disease status were performed using Wilcoxon rank-sum tests.
Distinguishing the determinants of somatic instability in blood samples: linear mixed-effects model
To investigate the longitudinal association of CAG repeat length and age with the somatic expansion in blood, we employed a linear mixed-effects model (LMM) including the variables age, CAG, and age × CAG interaction term as fixed effects, the subject identifier as a random effect to account for the within-subject correlation among visits ('random intercept only model'), and sex as a cofactor for adjustment. Prior to modeling, the somatic expansion values were transformed by natural logarithm to better meet the model assumptions of linearity, normality, and constant variance of the residuals. The LMM was fitted using restricted maximum-likelihood estimation (REML) with the function lmer in the lme4 R package (Bates et al., 2015a) (lme4_1.1-21). For the retained model, we reported the coefficient estimates of fixed effects with standard errors and standardized regression coefficients (t values), and the standard deviation of random effects. The t values were obtained by dividing each coefficient estimate by its standard error and were used as a measure of the relative strength of association of each term with somatic expansion in blood. Significance of fixed effects (p-values adjusted for sex) was obtained with the lmerTest R package (lmerTest_3.1-1) using Satterthwaite's approximation for degrees of freedom. As age and CAG repeat length were mean-centered for modeling, the estimate for the model intercept can be interpreted as the level of somatic expansion for a virtual subject with average characteristics for all patients in the study (mean age and mean CAG repeat length). Curves for the age-trajectories of somatic expansion in blood (one trajectory per CAG value, Figure 5) were plotted from the fixed effects component of the model.
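Written out, a random-intercept model of this kind could take the following form. This is a reconstruction from the description above, not an equation given by the authors: the symbols (i for subject, j for visit, the β coefficients, u_i, ε_ij) are introduced here for illustration, and forming the interaction from the mean-centered variables is an assumption.

```latex
\log(\mathrm{EI}_{ij}) = \beta_0
  + \beta_1\,(\mathrm{age}_{ij} - \overline{\mathrm{age}})
  + \beta_2\,(\mathrm{CAG}_{i} - \overline{\mathrm{CAG}})
  + \beta_3\,(\mathrm{age}_{ij} - \overline{\mathrm{age}})(\mathrm{CAG}_{i} - \overline{\mathrm{CAG}})
  + \beta_4\,\mathrm{sex}_{i}
  + u_i + \varepsilon_{ij},
\qquad u_i \sim \mathcal{N}(0, \sigma_u^2), \quad \varepsilon_{ij} \sim \mathcal{N}(0, \sigma^2)
```

Under this form, β0 is the expected log expansion value at mean age and mean CAG repeat length for the reference sex category, u_i is the subject-specific random intercept, and the t value of each fixed effect is its estimate divided by its standard error. In lme4 syntax such a model would typically be specified along the lines of `log(EI) ~ age_c * CAG_c + sex + (1 | subject)`, where the variable names are hypothetical.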