Data‐driven regions of interest for longitudinal change in three variants of frontotemporal lobar degeneration

Abstract
Introduction: Longitudinal imaging of neurodegenerative disorders is a potentially powerful biomarker for use in clinical trials. In Alzheimer's disease, studies have demonstrated that empirically derived regions of interest (ROIs) can provide more reliable measurement of disease progression compared with anatomically defined ROIs.
Methods: We set out to derive ROIs with optimal effect size for quantifying longitudinal change in a hypothetical clinical trial by comparing atrophy rates in 44 patients with behavioral variant of frontotemporal dementia (bvFTD), 30 with the semantic variant primary progressive aphasia (svPPA), and 26 with the nonfluent variant PPA (nfvPPA) to atrophy in 97 cognitively healthy controls.
Results: The regions identified for each variant were generally what would be expected from prior studies of frontotemporal lobar degeneration (FTLD). Sample size estimates for detecting a 40% reduction in annual rate of ROI atrophy varied substantially across groups, being 103 per arm in bvFTD, 31 in nfvPPA, and 10 in svPPA, but in all groups were less than those estimated for a priori ROIs and clinical measures. The variability in location of peak regions of atrophy across individuals was highest in bvFTD and lowest in svPPA, likely relating to the differences in effect size.
Conclusions: These findings suggest that, while cross-validated maps of change can improve sensitivity to change in FTLD compared with a priori regions, the reliability of these maps differs considerably across syndromes. Future studies can utilize these maps to design clinical trials, and should try to identify factors accounting for the variability in patterns of atrophy across individuals, particularly those with bvFTD.


| INTRODUCTION
Frontotemporal lobar degeneration (FTLD) is a neurodegenerative disorder that has a profound effect on the lives of patients and their families; one that can be considered more detrimental than the effects of more typical degenerative disease such as Alzheimer's disease (AD) because it is associated with an earlier age of onset (Papageorgiou, Kontaxis, Bonakis, Kalfakis, & Vassilopoulos, 2009) and more rapid rate of decline (Roberson et al., 2005). Neuroanatomically, it manifests distinctly from AD in that it primarily involves the frontal and anterior temporal cortex rather than medial temporal and temporoparietal regions. There are no approved treatments for FTLD but efforts to develop them are underway (Boxer & Boeve, 2007;Boxer, Gold, et al., 2013;Boxer, Knopman, et al., 2013).
Brain imaging is a powerful tool in neurodegenerative disease. MRI and PET, the most commonly used techniques, can be used to support diagnosis, and measures derived from brain images correlate with the type and severity of symptoms in each patient (Tartaglia, Rosen, & Miller, 2011). These observations have led to studies examining the utility of longitudinal brain imaging as an outcome measure for clinical drug trials, which have demonstrated that MRI can track change in neurodegenerative disorders more reliably than clinical measures such as cognitive testing (Knopman et al., 2009; Weiner et al., 2013).
One limitation of brain imaging is that each image produces hundreds or thousands of data points per patient corresponding to spatial locations in the brain, posing a significant hurdle for defining imaging-based biomarkers (Friston, Holmes, Poline, Price, & Frith, 1996). One of the most common approaches to reducing the large-scale data in imaging studies is to limit measures of change to aggregated estimates over regions of interest (ROIs), which tend to be chosen based on prior knowledge about the regions that are most severely affected in each disease. In AD, the ROIs chosen often include the hippocampus, entorhinal cortex, and temporoparietal regions (Dickerson et al., 2011).
In FTLD, the frontal and/or temporal lobes have been used (Gordon et al., 2010;Krueger et al., 2010). However, the regions most severely affected in each disease tend to be those affected earliest (Jack et al., 1997;Seeley et al., 2008). When a disorder moves beyond the earliest stages, it is possible that regions affected early begin to slow their rate of change while other regions, previously only mildly affected, begin to accelerate their decline (Brambati et al., 2009;Rohrer et al., 2012;Schuff et al., 2012). Thus, ROIs chosen based on regions that are most strongly associated with the disease may not be optimal for determining treatment effects. Recent studies have shown that empirically derived ROIs representing the most reliable voxels associated with an effect of interest can be used to improve diagnosis of dementia (Avants, Cook, Ungar, Gee, & Grossman, 2010;McMillan et al., 2014) and to improve statistical power for longitudinal analysis (Chen et al., 2010;Hua et al., 2009) compared with ROIs chosen based on their general association with the disease. We recently created an empirically based ROI of annualized atrophy in a group of FTLD patients and demonstrated the potential for larger effect sizes than a priori ROIs (Pankov et al., 2016).
Frontotemporal lobar degeneration includes a spectrum of disorders with varying molecular, clinical and imaging characteristics (Bang, Spina, & Miller, 2015;Tartaglia et al., 2011). The three canonical clinical presentations include: (1) the behavioral variant of frontotemporal dementia (bvFTD), characterized by progressive impairment in socioemotional function; (2) the semantic variant of primary progressive aphasia (svPPA; also known as semantic dementia), characterized by progressive loss of knowledge about words and objects, and (3) the nonfluent variant of PPA (nfvPPA), characterized by progressive impairment of articulation and speech (Gorno-Tempini et al., 2011;Rascovsky et al., 2011). Each variant is associated with distinct distributions of cortical atrophy varying particularly in the degree of temporal and frontal lobe involvement. BvFTD alone can show highly variable patterns of atrophy (Whitwell et al., 2011). Therefore, it is likely that the most sensitive ROIs for FTLD will be derived empirically from and specific to each variant. In our previous analysis (Pankov et al., 2016), we examined annual volume loss in a mixed group of bvFTD and svPPA cases. The number of subjects in that study was too small to examine syndrome-specific patterns of change. In this study, we set out to identify the most reliable regions of change separately in bvFTD, svPPA, and nfvPPA and estimate sample sizes for theoretical clinical trials that might involve each of these groups individually.

| Subjects
Subjects in this retrospective study included any subject studied at the UCSF Memory and Aging Center (MAC) who had undergone MRI twice over a period ranging between 6 months and 2 years with a diagnosis of bvFTD (n = 44), svPPA (n = 30), or nfvPPA (n = 26). All data were annualized prior to analysis. In addition, we assembled a group of healthy comparison subjects (HC) with longitudinal imaging and with the same age range and sex distribution as the FTLD group (HC, n = 97, mean age 64.77 ± 6/95, mean education level 17.65 ± 6.95). Patients included in this study were recruited between 2008 and 2015 through ongoing studies (AG019724, AG032306, AG023501) at the MAC. Diagnosis for these studies was based on a multidisciplinary evaluation incorporating neurological, neuropsychological, and nursing assessment (Rosen et al., 2002). Structural brain imaging was not used to make syndromic diagnosis, but only to exclude other causes of brain damage, such as strokes or tumors. Disease duration was estimated based on the year of initial symptoms provided by the patient or their informant. HC data were obtained from a cohort of subjects recruited at the MAC via advertisements and community events. HCs underwent the same evaluation as patients and were required to have no clinically significant cognitive or behavioral complaints, performance within one standard deviation of normal on all cognitive tasks, and to have brought a knowledgeable informant to verify the absence of clinically significant cognitive or behavioral problems. HCs were excluded if they had a history of significant mood disorders, clinically significant alcohol or drug use, significant vascular disease, visual problems that would impair test performance, other neurologic conditions, or self-reported deficits in cognition.
All subjects were required to have had two T1-weighted MRI scans acquired with the same scanner and pulse sequence and of a quality suitable for processing. Images were inspected for quality, including ensuring whole-brain coverage and checking for excessive motion artifact. Assessment of CNS amyloid burden, usually with PET amyloid imaging using Pittsburgh compound B, was available in 63 of the patients.
Because the goal of the analysis was to examine the change maps in groups with specific clinical diagnoses, all patients with available MRI data were included, regardless of amyloid status. A sensitivity analysis was conducted on the subset of bvFTD patients who were known to be amyloid negative to examine whether maps of change differed substantially from the maps created from the group as a whole. Amyloid status was generally not available in the controls. All research was performed in accordance with the Code of Ethics of the World Medical Association. All subjects provided informed consent, and the clinical and imaging protocols were approved by the UCSF Committee on Human Research.

| Clinical assessment
Patients were diagnosed using published criteria (McKhann et al., 1984; Neary et al., 1998) after a comprehensive evaluation at the UCSF MAC including neurological history and examination, nursing assessment, laboratory evaluation, and a previously described neuropsychological assessment (Kramer et al., 2003). The neuropsychological assessment battery includes the Mini Mental State Examination (MMSE) (Folstein, Folstein, & McHugh, 1975) and tests tapping into functions relevant to FTLD, including memory, language, and frontal/executive functions. These include list-learning (California Verbal Learning Test [CVLT]; Delis, Kramer, Kaplan, & Ober, 2000), confrontational naming (15 items from the Boston Naming Test [BNT]; Kaplan, Goodglass, & Weintraub, 1983), set-shifting (modified version of the Trails B task; Kramer et al., 2003), lexical fluency (words beginning with the letter "D"; Birn et al., 2010), and semantic fluency (animals; Delis, Kaplan, & Kramer, 2001). Functional state was quantified using the Clinical Dementia Rating (CDR; Morris, 1997), which was used here to generate a continuous variable based on the sum of the individual ratings for functional domains, typically referred to as the sum-of-boxes (CDR-SB). Although an FTLD-specific version of the CDR has been developed (Knopman et al., 2008), many of these patients were assessed before our center began using it, so this analysis was done using only the traditional CDR domains.

| Image processing
Longitudinal changes in regional brain volume were estimated using the Pairwise Longitudinal Registration Toolbox implemented in SPM12 (Ashburner & Ridgway, 2012), which addresses concerns regarding asymmetric bias in pair-wise longitudinal registration (Thomas, 2010;Yushkevich et al., 2010). The process begins with intrasubject registration using iterative and interleaved rigid-body alignment, diffeomorphic warping, and correction for differential intensity inhomogeneity to generate a within-subject template representing an average of the subject's two scans with respect to position, shape, and intensity nonuniformity. Two Jacobian determinant maps are then computed; one that encodes the relative difference in volume between the first scan and the within-subject average, and another that describes the relative volume between the second scan and the average. Computing the difference between these two Jacobian determinants provides a map of relative change in volume between scan one and scan two at each spatial location. The change maps were divided by the interscan interval (in units of years) to become maps of annual rate of relative volume change. Each subject's average image was bias-corrected and the brain was partitioned into gray matter, white matter, and cerebrospinal fluid (CSF), using SPM12's unified segmentation procedure. The contraction/expansion maps were then multiplied with the gray matter probabilistic tissue segmented maps on a voxel-by-voxel basis, in within-subject average space, to restrict analyses to cortical and subcortical gray matter.
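The arithmetic that turns the two Jacobian determinant maps into a gray-matter-weighted annual rate map can be illustrated with a minimal sketch. This is not the SPM12 implementation: the function name `annualized_gm_change`, the flat-list representation of a map, and the sign convention (scan two minus scan one, so that negative values indicate contraction) are assumptions made for illustration only.

```python
def annualized_gm_change(jac_scan1, jac_scan2, interval_years, gm_prob):
    """Return a gray-matter-weighted map of annual relative volume change.

    Hypothetical helper: each map is a flat list of per-voxel values in
    within-subject average space.
    """
    if interval_years <= 0:
        raise ValueError("interscan interval must be positive")
    if not (len(jac_scan1) == len(jac_scan2) == len(gm_prob)):
        raise ValueError("all maps must have the same number of voxels")
    rate_map = []
    for j1, j2, gm in zip(jac_scan1, jac_scan2, gm_prob):
        change = j2 - j1                  # assumed sign: negative = contraction
        annual = change / interval_years  # annualize by the interscan interval
        rate_map.append(annual * gm)      # weight by gray matter probability
    return rate_map
```

For example, a voxel whose Jacobian determinants are 1.02 and 0.98 over a 2-year interval, with a gray matter probability of 0.5, would receive a value of about -0.01 per year under these assumptions.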
Image segmentation can be affected by several factors that may relate to disease, including histological abnormalities that could cause changes in tissue contrast, as well as subject movement, which would decrease signal-to-noise ratios. To ensure that the analysis would not be excessively influenced by differences in the quality of gray matter segmentations across groups, we reviewed the distributions of values for the whole-brain gray matter probability maps across groups. The shapes of these distributions were similar across groups.
To allow statistical analysis across subjects, all images were transformed to a standardized space. Mappings from the gray matter and white matter segments of the within-subject averages (all patients and control subjects) to an iteratively evolving study-specific population mean of these tissues were estimated using the DARTEL (diffeomorphic anatomical registration through an exponentiated Lie algebra) toolbox (Ashburner, 2007). DARTEL minimizes the geodesic distance from each patient to the population mean. Thus, between-population asymmetries in registration, which could also lead to erroneous population effects, were addressed. An affine mapping between the population mean and MNI space (defined by SPM12's Prior Tissue Probability Map) was also estimated and combined with each subject-to-population mean mapping for warping average images and volume expansion/contraction rate maps to MNI space. The rate change maps were then warped to population-in-MNI space using the above-mentioned mapping composition, and resampled to 1.5 mm³ without "volume-preserving" modulation. No spatial smoothing was applied.
Subsequent analysis was done using only the gray matter maps of each patient.

| Overview
Our data-driven ROI generation procedure follows (in spirit) from prior approaches where optimal effect sizes were estimated from a training set and tested on an independent test set (Chen et al., 2010; Hua et al., 2009). However, we used a cross-validation-type scheme rather than a simple training-test approach in order to maximally use the data available in generating a "best" consensus ROI; we thereby avoided overfitting in our estimates of effect size and sample size. For each randomly partitioned cross-validation training set, we first generated a Student's t statistic (allowing unequal group variances) at each voxel in standardized space. The map of t statistics quantifies the difference in contraction between each FTLD patient group and HCs across the brain. A 3D ROI was extracted by thresholding the map of t statistics such that the threshold used maximized the effect size in the same training set. The effect size for tissue contraction over 1 year was then estimated on the independent test set partition of the data. After repeating the process multiple times, the effect size was estimated as the mean of the estimates across the independent test sets. A consensus-weighted ROI was then generated from the cross-validation procedure by weighting each voxel based on its reliability in distinguishing contraction between patient and HC groups across the random partitions.
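A t statistic that allows unequal group variances is commonly computed as Welch's t. The sketch below shows one way a voxel-wise map of this statistic could be built from annualized change maps. The function names, the flat-list map layout, and the sign convention (controls minus patients, so that faster contraction in patients gives a positive t) are illustrative assumptions, not details taken from the analysis software.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(patients, controls):
    """Welch's t for one voxel; positive when patients contract faster."""
    se = sqrt(variance(patients) / len(patients)
              + variance(controls) / len(controls))
    # Rates are assumed negative for contraction, hence controls minus patients.
    return (mean(controls) - mean(patients)) / se


def t_map(patient_maps, control_maps):
    """Voxel-wise t-map; each input is a list of subjects (flat voxel lists)."""
    n_vox = len(patient_maps[0])
    return [
        welch_t([s[v] for s in patient_maps], [s[v] for s in control_maps])
        for v in range(n_vox)
    ]
```

The sketch assumes nonzero variance at every voxel; a real pipeline would need to mask voxels where the standard error is zero.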
We specifically chose to examine only contracting voxels because expanding voxels would often represent residual CSF spaces that were not completely removed by segmentation and masking. If we included expanding voxels, we would be making the assumption that future studies would encounter similar patterns of expansion in residual/unmasked CSF voxels. Thus, the generalizability of the resultant map would be dependent on similarity between our segmentation and masking procedures and the segmentation outcomes of future studies.
Given that this segmentation accuracy would depend on many factors, we felt that limiting the ROI to only voxels that would be expected to contract would be more conservative and generalizable.

| Procedure
Data-driven ROIs were generated separately for each clinical variant of FTLD by comparing change maps in each patient group to change in the entire control group. The cross-validation algorithm proceeded as follows:
1. For each patient group, the combined set of control and patient data was randomly divided into training and test sets, with 16% of the data being assigned to the test set. Each split was stratified such that FTLD cases made up more than 1/3 but less than 2/3 of the total test set. For example, in the case of bvFTD, with 97 controls and 44 patients, the size of the test set would be (97 + 44) × 0.16 ≈ 23 images, of which 1/3 (8) to 2/3 (15) would have to be bvFTD.
2. A series of ROIs was then generated in each training set by thresholding the t-maps over a set of levels ranging from 3.5 to the maximum observed t statistic in increments of 0.01 units.
3. The effect size for the mean difference in rate of change between each FTLD variant and controls was then calculated for each ROI of the training set using Cohen's d. A plot was then generated of effect size versus t statistic cutoff, representing the relationship between the cutoff and the corresponding effect size for each resulting ROI (see Figure 1).
4. The ROI associated with the t statistic cutoff corresponding to the maximum effect size was selected.
5. The ROI from step 4 was then used to calculate the effect size in the test set to obtain an unbiased effect size estimate for the particular partition.
Steps 1-5 were then repeated 1,024 times, reassigning patients into the training and test sets each time. At the end of the process, we have a set of "optimal" ROIs (across training/test set partitions).
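Steps 1-5 can be sketched for a single random partition as follows. This is a simplified illustration rather than the code used in the study: all function names are hypothetical, maps are flat lists of annualized change per voxel (negative for contraction), the number of held-out patients is assumed to be drawn uniformly from the allowed range, and the raw maximum of the effect size curve is used in step 4 instead of a smoothed maximum.

```python
import math
import random
from statistics import mean, variance


def roi_mean(subject, roi):
    """Mean annualized change over the ROI voxels for one subject."""
    return mean(subject[v] for v in roi)


def cohens_d(patients, controls, roi):
    """Cohen's d (pooled SD) of ROI-mean change; positive = faster patient loss."""
    p = [roi_mean(s, roi) for s in patients]
    c = [roi_mean(s, roi) for s in controls]
    pooled_var = (((len(p) - 1) * variance(p) + (len(c) - 1) * variance(c))
                  / (len(p) + len(c) - 2))
    return (mean(c) - mean(p)) / math.sqrt(pooled_var)


def welch_t_map(patients, controls):
    """Voxel-wise t statistic allowing unequal group variances."""
    t = []
    for v in range(len(patients[0])):
        p = [s[v] for s in patients]
        c = [s[v] for s in controls]
        se = math.sqrt(variance(p) / len(p) + variance(c) / len(c))
        t.append((mean(c) - mean(p)) / se)
    return t


def stratified_split(patients, controls, rng, test_frac=0.16):
    """Step 1: hold out ~16% of subjects; patients form 1/3 to 2/3 of the test set."""
    n_test = round((len(patients) + len(controls)) * test_frac)
    # Assumption: held-out patient count is uniform over the allowed range.
    k = rng.randint(math.ceil(n_test / 3), (2 * n_test) // 3)
    p, c = patients[:], controls[:]
    rng.shuffle(p)
    rng.shuffle(c)
    return (p[k:], c[n_test - k:]), (p[:k], c[:n_test - k])


def run_partition(patients, controls, rng, t_min=3.5, step=0.01):
    """Steps 1-5 for one partition; returns (roi, test_effect_size) or None."""
    (train_p, train_c), (test_p, test_c) = stratified_split(patients, controls, rng)
    t = welch_t_map(train_p, train_c)
    if max(t) < t_min:
        return None  # no voxel survives the lowest threshold
    best_d, best_roi = None, None
    for i in range(int((max(t) - t_min) / step) + 1):
        thr = t_min + i * step
        roi = [v for v, tv in enumerate(t) if tv >= thr]  # step 2
        if not roi:
            break
        d = cohens_d(train_p, train_c, roi)               # step 3
        if best_d is None or d > best_d:                  # step 4 (raw maximum)
            best_d, best_roi = d, roi
    return best_roi, cohens_d(test_p, test_c, best_roi)   # step 5
```

Calling `run_partition` repeatedly with a shared `random.Random` instance and averaging the returned test-set effect sizes mirrors the repetition over partitions described above.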
The effect size is then estimated as the mean effect size over all partitions. To then estimate a consensus ROI from the ensemble of cross-validated measurements, we weighted the contribution of each voxel to the data-driven ROI as the proportion of cross-validation partitions (weighted by the effect size for that cross-validation sample) in which the voxel contributes to the consensus ROI. Thus, the resulting map has a stronger representation from voxels consistently contributing to the overall effect size across cross-validation samples and weaker representation from voxels whose contribution was more variable.
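One possible reading of the consensus weighting is sketched below: each voxel's weight is the effect-size-weighted proportion of partitions in which it appeared in the selected ROI. The normalization by the sum of effect sizes, the requirement that this sum be positive, and the function names are assumptions for illustration; the text does not specify these details.

```python
def consensus_weights(rois, effect_sizes, n_voxels):
    """Effect-size-weighted proportion of partitions that include each voxel.

    rois: list of ROIs, one per partition, each a list of voxel indices.
    effect_sizes: test-set effect size for each partition (assumed positive).
    """
    total = sum(effect_sizes)
    if total <= 0:
        raise ValueError("sum of effect sizes must be positive")
    weights = [0.0] * n_voxels
    for roi, d in zip(rois, effect_sizes):
        for v in roi:
            weights[v] += d / total
    return weights


def weighted_roi_mean(subject_map, weights):
    """Consensus-weighted mean change for one subject's flat voxel map."""
    return sum(w * x for w, x in zip(weights, subject_map)) / sum(weights)
```

Under this reading, a voxel selected in every partition receives a weight of 1, and a voxel never selected receives 0.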
FIGURE 1 Plots of effect size versus t score threshold cutoff for each clinical variant, used to identify the t score threshold giving the map with maximum effect size
It should be noted that at high t-thresholds the maximal empirical effect size estimate becomes highly variable over neighboring thresholds because only a small number of voxels form an ROI at high thresholds. To mitigate this effect and generate a stable estimate of maximum effect size, we smoothed the effect size curve plotted against threshold. However, even lowess regression did not sufficiently downweight the influence of high thresholds. We therefore implemented a heuristic method to identify the maximum effect size. Specifically, a lowess regression was performed after iteratively excluding a top set of voxels (from 0% to 10% of the highest voxels in increments corresponding to those associated with the t-thresholds). At each iteration, the lowess-smoothed maximum was calculated, and the overall maximum was taken as the median of all the smoothed maximums. This approach identified the location of the maximum in reasonable agreement with the choice that one would make visually as being the maximum of the relatively smooth (and therefore reliable) part of the curve (see Figure 1).
In order to estimate the potential impact of using an optimized data-driven ROI of change for future clinical trials, we calculated the necessary sample size in a hypothetical clinical trial seeking to detect a 20% or a 40% reduction in the change over 1 year in volume loss in each FTLD group (α = 0.05, power = 0.8). We compared the sample size from the effect size estimated using the data-driven ROIs (i.e., via the mean effect size over the test set estimates) to the sample sizes obtained by measuring change within a priori ROIs based on cerebral anatomy. For this purpose, we used frontal, temporal, combined frontal and temporal, and whole gray matter masks as ROIs relevant to FTLD. These ROIs were obtained from the AAL brain atlas supplied with the WFU-PickAtlas software package (Maldjian, Laurienti, Kraft, & Burdette, 2003).
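A calculation of this kind can be approximated with the standard normal-approximation formula for comparing two means with equal allocation. The sketch below is an assumption about the general form of such a calculation, not a reproduction of the software routine used in the study: it assumes a two-sided test, a 1:1 treated/placebo ratio, and a standard deviation taken from the patient group only, and the function name `n_per_arm` is hypothetical.

```python
from math import ceil
from statistics import NormalDist


def n_per_arm(mean_change, sd_change, reduction, alpha=0.05, power=0.8):
    """Subjects per arm needed to detect a fractional reduction in mean change.

    mean_change, sd_change: annual change in the untreated patient group.
    reduction: treatment effect as a fraction of mean_change (e.g., 0.4).
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided test assumed
    z_power = NormalDist().inv_cdf(power)
    delta = reduction * mean_change                # absolute difference to detect
    return ceil(2 * ((z_alpha + z_power) * sd_change / delta) ** 2)
```

Because the required sample size scales with the inverse square of the detectable difference, halving the targeted reduction from 40% to 20% roughly quadruples the number of subjects per arm.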

| Change in clinical variables and sample size estimates
Changes in clinical variables were analyzed using linear mixed effects models with cognitive score as the dependent variable and elapsed time in years as the predictor. In order to compare the sample size estimates generated for imaging-based measures of change to those generated using clinical measures, we calculated sample size estimates using annualized changes in score for the MMSE, selected measures of language and executive functioning, and for the CDR, which has been identified as an attractive measure for tracking change in FTLD (Knopman et al., 2008). We calculated the necessary sample size in a hypothetical clinical trial seeking to detect a 20% or a 40% reduction in the change over 1 year in clinical measures in each FTLD group (α = 0.05, power = 0.8). These analyses were carried out using Stata (version 14, www.stata.com).
The differences in mean interscan interval across groups were not statistically significant (p = .11), nor were differences in education level (p = .43) or disease duration (p = .45). In terms of cognitive and
TABLE 1 Baseline and 1-year clinical data in patient groups: bvFTD (n = 44), nfvPPA (n = 26), svPPA (n = 30)

| Change maps in amyloid-negative and nongene carrier bvFTD subjects
a Sample size for placebo-controlled trial with 1:1 treated/placebo ratio, standard deviation based on patient group only (see Section 2).
b The imaging measure with the highest effect size for each diagnostic group is highlighted (bold) to facilitate comparison.
Similarly, previous reports have demonstrated that patterns of atrophy in autosomal dominant forms are different than in sporadic FTD, with more widespread cortical involvement, including the parietal lobes. All of the bvFTD cases had genetic testing performed through research using previously described methods (Naasan et al., 2016).

| Variability in locations of peak change across individuals
The variability in effect size across clinical syndromes was striking. One possible explanation is that mean rates of change were slower for bvFTD than for other groups; however, this would be inconsistent with prior studies indicating that rates of decline in clinical measures and brain volume in bvFTD are similar to rates of decline in other variants (Krueger et al., 2010; Rascovsky et al., 2001; Roberson et al., 2005). Given that the algorithm is designed to quantify the reliability of change in each voxel across individuals, another possibility is that the consistency of patterns of change across individuals differs between the groups. To examine this, we plotted the locations of peak voxels (i.e., those with the highest rate of change) for all individuals, and displayed these locations in MNI space for each diagnostic group (Figure 5). As would be predicted from the effect size estimates, peak regions of change were highly clustered across individuals in the svPPA group, showed greater spatial variation in nfvPPA, and had perhaps the most heterogeneous spatial distribution in bvFTD.
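The peak-location comparison above was made visually. As a purely illustrative complement, the sketch below locates each subject's peak voxel and summarizes spatial spread as the mean distance of peaks from their centroid. The dispersion summary is not part of the reported analysis, and the function names and the dictionary layout (MNI coordinate mapped to annualized change, negative for contraction) are assumptions.

```python
from math import dist
from statistics import mean


def peak_location(rate_map):
    """Coordinate of the fastest-contracting voxel (most negative rate).

    rate_map: dict mapping (x, y, z) MNI coordinates in mm to annualized change.
    """
    return min(rate_map, key=rate_map.get)


def peak_dispersion(rate_maps):
    """Mean Euclidean distance (mm) of individual peaks from their centroid."""
    peaks = [peak_location(m) for m in rate_maps]
    centroid = tuple(mean(axis) for axis in zip(*peaks))
    return mean(dist(p, centroid) for p in peaks)
```

Under this summary, a group whose peaks all fall at the same location has a dispersion of 0, and larger values indicate more heterogeneous peak locations.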

| DISCUSSION
The aim of this analysis was to create ROIs that would generate maximal effect sizes for measuring change in cortical volume in three major variants of FTLD. As would be expected, the maps varied considerably across the three major variants. In bvFTD, they included the medial and lateral portions of the frontal lobes, the insula, the striatum, and the temporoparietal regions bilaterally. In svPPA, the most reliable  (Jack et al., 2013).
Given that atrophy in bvFTD occurs earliest in the insula and ventromedial frontal regions (Kril & Halliday, 2004;Seeley et al., 2008), these regions may reach a point where additional volume loss does not occur, while at the same time regions that are not involved early in bvFTD, such as the parietal lobes, may just be entering the phase of rapid decline when patients typically present for evaluation. The same phenomenon may explain the relative sparing of the temporal poles in the change maps for svPPA, which has been observed in prior studies and attributed to floor effects (Brambati et al., 2009;Rohrer et al., 2008). These findings highlight the value of empirically defined ROIs in tracking change as opposed to using ROIs defined according to prior knowledge about the regions that are most severely affected in each disease. These ROIs are affected by regional patterns of acceleration and deceleration that are likely stage specific, and thus would need to be recreated for use in patient groups substantially earlier or later in the disease course than those studied here.
Perhaps more striking than the regions identified were the differences in sample size estimates across syndromes. Our data indicate that the sample sizes that would be required to detect changes in the rate of atrophy in bvFTD are larger than in nfvPPA and even more so when compared with svPPA. The fact that estimates obtained using the statistically driven approach were only slightly better than those obtained with whole gray matter supports the idea that the variability in regions of change in bvFTD makes it difficult to find focal, reliable regions for bvFTD as a whole. In contrast, in nfvPPA and particularly svPPA, the stronger overlap in regions of peak atrophy between individuals means that very reliable change can be measured in a relatively circumscribed region, such that techniques designed to find these regions, like the one used in this analysis, yield significant benefits for clinical trials.
The reason for the low level of predictability in regions of change across individuals with bvFTD is not readily apparent. Based on our analysis, the presence of amyloid-positive cases or mutation carriers was not a likely explanation because the maps generated using only known amyloid-negative and known gene-negative cases were similar to those obtained in bvFTD as a whole, including the presence of atrophy in the parietal lobes. Of course, we may still have included some cases due to mutations not yet discovered. Variability in the causative proteinopathy across individuals may be another explanation.
Although svPPA is almost uniformly associated with TAR DNA-binding protein type C (TDP-C) pathology, bvFTD can be associated with a variety of proteinopathies including various forms of TDP as well as various forms of tau pathology including progressive supranuclear palsy, corticobasal degeneration, Pick's disease, and other variants (Bang et al., 2015). Differences between proteinopathies in patterns of imaging abnormalities have been established cross-sectionally (Whitwell et al., 2011). Patterns of decline across different proteinopathies can also be examined as cohorts of autopsied cases with longitudinal imaging data grow, and as techniques for identifying specific proteinopathies in vivo improve. In addition, current theories suggest that proteins causing neurodegenerative disease spread within neuroanatomically defined networks (Seeley, Crawford, Zhou, Miller, & Greicius, 2009). It is possible that the particular network involved in a disorder, and/or variability in strengths of connectivity within and between networks across individuals, may also mediate patterns of spread. Verification that any of these, or other factors, can predict individual patterns of change would have obvious benefit for future clinical trials. It is also possible that other imaging methods, such as diffusion tensor imaging, may provide more reliable methods of tracking change over time (Mahoney et al., 2015).
One potential benefit of using imaging as a marker of longitudinal decline is that increased precision could result in improved effect sizes when compared with clinical measures of change. This was generally confirmed in our analysis. For instance, we found that a placebo-controlled trial would require 592 subjects per arm using the CDR-SB to detect a 20% reduction in rate of change in bvFTD (Table 2). This estimate is roughly consistent with a prior study that estimated a sample size of 582 (Gordon et al., 2010) to detect a 25% effect of a drug. In contrast, our analysis indicates that a study measuring rates of atrophy using a statistically derived ROI in T1-weighted images would require 409 subjects per arm to detect the same effect. That said, other groups have published methods for identifying optimal clinical measures for tracking change using methods that are similar in principle to the approach used here for brain voxels (Ard, Raghavan, & Edland, 2015). These have yet to be examined in FTLD.
While it is currently unlikely that volumetric change would be acceptable as a primary endpoint in clinical trials, this might become possible if reliable links between volumetric changes and clinical changes can be established. In addition, imaging could be used as evidence for a disease modifying effect of a proposed treatment, or in early clinical development (e.g., phase 2 studies) to establish proof of concept to support advancement of a potential treatment to a phase 3 trial.
Our results confirm that data-driven ROIs of change identify ex- to prior studies that used training and test sets (Chen et al., 2010;Hua et al., 2009) but instead of a single training-test partition, we use cross-validation through repeated resampling of the data. The estimate of the effect size we generated from this procedure should be a conservative estimate of the effect size achievable with the optimized ROI we created because it was generated based on multiple partitions of the data that always used a smaller sample than the total (the training partition) to generate a ROI that was then tested in the test partition for each cross-validation run. It will be important to test the effect size of the final consensus ROI in independent datasets from different cohorts, ideally collected at other centers. In addition, our analysis compared the rates of change in cerebral cor-