Reassessing associations between white matter and behaviour with multimodal microstructural imaging



Abstract
Several studies have established specific relationships between White Matter (WM) and behaviour. However, these studies have typically focussed on fractional anisotropy (FA), a neuroimaging metric that is sensitive to multiple tissue properties, making it difficult to identify what biological aspects of WM may drive such relationships. Here, we carry out a pre-registered assessment of WM-behaviour relationships in 50 healthy individuals across multiple behavioural and anatomical domains, complementing FA with myelin-sensitive quantitative MR modalities (MT, R1, R2*).

Introduction
The past decade has shown that White Matter (WM), and in particular the myelinated structures that dominate it, have more varied functions than previously thought, from trophic support of axons (Fünfschilling et al., 2012; Nave, 2010) to active regulation of physiological and behavioural processes (Kaller et al., 2017; Lazari et al., 2018; Steadman et al., 2019). These basic biology findings suggest that WM may play a role in brain physiology and behaviour, and that WM could be targeted for therapeutic gain in neuropsychiatric disorders (Gibson et al., 2018; Vanes et al., 2020).
In humans, much evidence on the role of WM has come from a large body of studies linking behaviour to diffusion-tensor-based metrics such as fractional anisotropy (FA), a metric derived from diffusion-weighted imaging that is sensitive to features of WM microstructure (Boekel et al., 2015; Johansen-Berg, 2010; Lazari & Lipp, 2021; Roberts et al., 2013). While these studies have provided seminal evidence for a link between WM and human behaviour, questions remain about the generalizability and interpretation of these effects.
FA-behaviour relationships are particularly difficult to interpret on a biological level. Diffusion signals are sensitive to a broad range of tissue properties, including myelination levels, fiber orientation, axon diameter, and astrocyte and vascular morphology (Farquharson et al., 2013; Sampaio-Baptista & Johansen-Berg, 2017; Stolp et al., 2018). Therefore, a given FA-behaviour correlation could arise from a diversity of microstructural patterns (Zatorre et al., 2012). Moreover, while other tensor-based metrics can be derived from diffusion-weighted imaging, it is unclear whether they differ from FA in their biological sensitivity (Lazari & Lipp, 2021).
In recent years, an increasing number of techniques (Fig. 1) have been successfully applied to the study of WM, and of WM myelination in particular (Heath et al., 2018). As WM is dominated by myelinating oligodendrocytes, many of these techniques have focused on detecting direct signals from myelin or from iron, which is enriched in the cell body of oligodendrocytes. Magnetisation Transfer-based techniques, for example, quantify the fraction of macromolecule-bound water protons, and have been shown to relate strongly to myelination in a number of validation studies (Deloire-Grassin et al., 2000; Dousset et al., 1992, 1995). R2* mapping, on the other hand, quantifies local field distortion caused by iron, and has been confirmed as an iron marker by several validation studies (Langkammer et al., 2010; Sun et al., 2015). R1 has gained attention recently as a quantitative metric for myelination, and although its effectiveness as a WM myelin marker has not been directly tested, it has been shown to detect spatial distributions of myelin in grey matter (Lutti et al., 2014; Stüber et al., 2014). In addition to the development of new MR techniques, new statistical tools, such as joint inference permutation testing (Winkler et al., 2014, 2016), facilitate the integration of Magnetic Resonance Imaging (MRI) techniques to clarify the biological interpretation of MRI-measured effects in white matter.
Applying these approaches to the study of WM microstructure could help clarify the mechanisms behind WM-behaviour relationships. In particular, using MRI modalities that are sensitive to different biophysical tissue properties could disentangle whether myelination, oligodendrocytes, fiber orientation, or a combination of them, are key in driving reported FA-behaviour correlations. In turn, if all WM-behaviour relationships are driven by a common biological mechanism, then establishing recurrent multimodal patterns that correlate with behaviour could uncover it, with powerful implications for future studies looking at WM-behaviour relationships and biomarker development.
To tackle these open questions regarding WM-behaviour relationships, we set out to:
1. Perform confirmatory, pre-registered testing of FA-behaviour relationships.
2. Perform pre-registered testing of relationships between behaviour and microstructural imaging across neuroimaging modalities.
3. Identify multimodal microstructural signatures which may provide insights into the underlying biology of WM-behaviour relationships.

Participants
Fig. 2 summarises the study design. 50 healthy participants (25 female; aged 18–38 years, mean 26.2 years, median 26 years) underwent a single session of behavioural testing and MRI on the same day. As there is limited literature on the sample sizes needed to robustly detect cross-sectional correlations, our target sample size was based on previous work which had informed our hypotheses [n = 20 for DSST (Metzler-Baddeley et al., 2012), n = 21 for AFT, as the average sample size in the studies reviewed by (Gooijers & Swinnen, 2014), and n = 26 for TOJ (Husain et al., 2011)]. Studies reporting positive results may underestimate the necessary sample sizes (Button et al., 2013), so we doubled the sample size reported from the literature, thus bringing our sample size in line with a report recommending sample sizes between n = 20 and n = 40 for studies on FA (De Santis et al., 2014). All participants were self-assessed right-handed and their handedness was further assessed through the Edinburgh Handedness Inventory (Oldfield, 1971) (score range 60–100, mean 87.2, median 90). All participants were screened for MRI safety, received monetary compensation for their participation, and gave their informed consent to participate in this study. All study procedures followed the Declaration of Helsinki, and were reviewed and approved by the local ethics committee at the University of Oxford.

Preregistration
Details of the task data collection and analysis plans were preregistered on the Open Science Framework website (full preregistration available here: https://osf.io/ar7zs/). In brief, the pre-registration covered hypotheses and aims of the project, including which behavioural measures, MR metrics and regions of interest to use, while analytical details were decided separately after data collection.
We report here relevant text from the pre-registration: "Overall aim: testing whether previously reported correlations between behavioural measures and fractional anisotropy (FA) measures in long-range projections obtained using diffusion-weighted magnetic resonance imaging (dw-MRI) are related to indices of myelin content obtained using novel quantitative magnetic resonance imaging (qMRI) protocols. To this end, we aim to replicate a sample of previous studies, and extend these FA/behaviour analyses to myelin qMRI/behaviour analyses".
Specific brain/behaviour predictions were made for each task, listed in the analysis section below.

Behavioural tasks
A set of behavioural tasks was selected to build on prior studies reporting relationships between behaviour and WM microstructure (https://git.fmrib.ox.ac.uk/alazari/reassessing-associationsmultimodal/).
Fig. 1 – Each neuroimaging modality is sensitive, but not specific, to different features of the biological tissue. This study aimed to use multiple MR modalities that are sensitive to myelin, but measure different biophysical properties of white matter.
The presence of FA-behaviour relationships has been particularly clear for the corpus callosum and for the cingulum. The cingulum has often been implicated in cognitive control (Bathelt et al., 2019), and cingulum FA has been found to strongly correlate with performance on neuropsychological tasks (Metzler-Baddeley et al., 2012). The corpus callosum, on the other hand, allows the nodes of the motor network in each hemisphere to communicate with one another, and both positive and negative relationships have been widely reported between callosal FA and various types of bimanual performance [(Johansen-Berg et al., 2007; Muetzel et al., 2008; Sullivan et al., 2001) and (Gooijers & Swinnen, 2014) for a comprehensive review of callosal-bimanual behaviour relationships].
FA-behaviour relationships have also been thoroughly explored in behavioural paradigms beyond the motor system. As mentioned above, bimanual motor performance has been the subject of much literature, and so has bilateral sensory processing. In the visual domain, topographic organisation and visuospatial capacity have both been shown to relate to callosal microstructure (Saenz & Fine, 2010; Todorow et al., 2014). In the auditory domain, relationships have been established between perceptual acuity and WM microstructure, although mostly in pathology (Husain et al., 2011; Lin et al., 2008; Wang et al., 2019). While there have been no previous studies on WM relationships with somatosensory acuity, it would be logical to expect a similar relationship between somatosensory perceptual acuity and microstructure of WM in relevant tracts.
Specifically, we assessed three task domains: (1) testing for a relationship between callosal FA and bimanual motor performance using the Alternating Finger Tapping task aimed to directly replicate a series of previous studies [reviewed by (Gooijers & Swinnen, 2014)]; (2) testing for a relationship between cingulum FA and performance on the Digit Symbol Substitution Test aimed to build on a previously reported relationship (Metzler-Baddeley et al., 2012); (3) testing for a relationship between FA in somatosensory tracts and somatosensory perceptual acuity using the Temporal Order Judgement task aimed to extend previous findings in the visual and auditory domains to the somatosensory system.

Digit symbol substitution test (DSST)
A paper-based Digit Symbol Substitution Test (DSST) was conducted as per https://healthabc.nia.nih.gov/sites/default/files/dsst_0.pdf. After training on substituting 10 digits for symbols, participants were asked to sequentially fill in the remaining 90 symbol-digit boxes in 90 sec.

Analysis of the DSST
The score was calculated as the total number of symbols filled in correctly by the end of the task. Two participants were identified as outliers (>3 SD away from the mean) and thus excluded from further analyses.

Alternating Finger Tapping (AFT) task
The finger tapping task aimed to test the participants' bimanual coordination. The task was based on (Muetzel et al., 2008) and (Pelletier et al., 1993) and ran as follows: three blocks were repeated four times (the first one for training purposes): during the first block, participants were asked to tap their right index finger on a buttonbox (Current Designs, Inc., Philadelphia, PA) 30 times, as fast as they could (right monomanual condition); during the second block, participants were asked to tap their left index finger (left monomanual condition); during the third block, participants were asked to alternate between right and left index finger button presses (bimanual condition). For each block, after the 30 button presses were finished, the total elapsed time was fed back on the computer screen. The experimenter inspected the participant's movement by eye to ensure they were correctly switching between fingers and that they were moving the finger rather than the hand. Participant posture and hand position were carefully kept constant throughout all blocks. One participant did not carry out the AFT due to a hardware problem.

Analysis of the AFT task
Alternating Finger Condition (AFC) duration was extracted, i.e., average total time needed for 30 taps on the alternating finger condition (Muetzel et al., 2008). Two participants were identified as outliers (>3 SD away from the mean) and thus excluded from further analyses. Total time needed for 30 taps on the monomanual conditions was used as a covariate in group-level analyses (Pelletier et al., 1993), together with age and gender.
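The >3 SD exclusion rule used for the DSST and AFT scores can be sketched as follows. This is a minimal illustration rather than the analysis code used in the study: the function name `within_n_sd`, the use of the sample standard deviation (`ddof=1`) and the example values are all assumptions.

```python
import numpy as np

def within_n_sd(scores, n_sd=3.0):
    """Boolean mask: True where a score lies within n_sd SDs of the group mean.

    Assumption: the sample standard deviation (ddof=1) is used; the paper
    does not state which estimator was applied.
    """
    scores = np.asarray(scores, dtype=float)
    z = (scores - scores.mean()) / scores.std(ddof=1)
    return np.abs(z) <= n_sd

# Hypothetical AFC durations (seconds): 49 typical values and one extreme value.
durations = np.append(np.linspace(8.0, 12.0, 49), 30.0)
keep = within_n_sd(durations)
clean = durations[keep]  # scores retained for further analysis
```

Note that with very small samples a single extreme value inflates the standard deviation enough that it may never exceed 3 SD, so this rule is only meaningful for samples of a reasonable size.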

Temporal order judgement (TOJ) task
The Temporal Order Judgement (TOJ) task aimed to test participants' capacity to discriminate between two closely timed tactile stimuli delivered to the fingertips. The task was based on a previous investigation of the functional activity associated with such behaviour (Kolasinski et al., 2016) and ran as follows. A PC running a PsychoPy script delivered, via a USB 6501 card (National Instruments) and an amplifier (Tactamp, Dancer Design), two asynchronous pulses to two vibrotactile stimulators (also known as tactors, Dancer Design) positioned within holes in a foam pad. The participant was asked to keep their hands relaxed on the foam pad, with their index fingers gently lying on the tactors. A piece of cardboard was used to block visual input from the tactors; similarly, headphones playing low levels of pink noise were used to block the auditory input from the tactors. Participants performed a two-alternative forced choice (2AFC) task and were asked to press on one of two foot pedals, depending on the side of the pulse that they thought had come first. Participants were asked to respond within 2 sec. If they did not respond within this time then no response was recorded and a new trial was started. They were also instructed that if it was hard to judge which pulse came first, they should just make their best guess. Intervals between pulses ranged from 0 to 300 msec. The task featured a practice session with 10 trials and a full session with 280 trials, for a total duration of roughly 12 min.

Analysis of the TOJ task
After trials with no response were discarded, the number of correct pedal responses was plotted as a function of inter-stimulation interval and a logistic regression was fitted to the data. At this stage, six participants were excluded as the logistic regression failed to fit the data correctly. The slope of the curve and the Just Noticeable Difference (JND) were used as key metrics of performance on the task (Kolasinski et al., 2016; Shore et al., 2005).
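A psychometric fit of this kind might be sketched as below. This is an illustration under several assumptions that are not stated in the text: the interval is coded as a signed value (negative when the left pulse leads), the fitted quantity is the proportion of "right first" responses, the JND is defined as half the distance between the 25% and 75% points of the curve, and the slope is taken at the curve midpoint. The function names and the synthetic data are hypothetical.

```python
import numpy as np
from scipy.optimize import curve_fit

def logistic(soa_ms, pss, scale):
    """Two-parameter logistic psychometric function (pss = midpoint)."""
    return 1.0 / (1.0 + np.exp(-(soa_ms - pss) / scale))

def fit_toj(soa_ms, prop_right_first):
    """Fit the logistic curve and derive JND and midpoint slope.

    Assumed definitions: JND = (x75 - x25) / 2 = scale * ln(3);
    slope at the midpoint = 1 / (4 * scale).
    """
    (pss, scale), _ = curve_fit(logistic, soa_ms, prop_right_first, p0=[0.0, 50.0])
    jnd = scale * np.log(3.0)
    slope = 1.0 / (4.0 * scale)
    return pss, jnd, slope

# Synthetic, noise-free example data (signed intervals in msec).
soa = np.array([-300, -200, -100, -50, -20, 20, 50, 100, 200, 300], dtype=float)
proportions = logistic(soa, 10.0, 40.0)
pss_hat, jnd_hat, slope_hat = fit_toj(soa, proportions)
```

A smaller JND (or a steeper slope) indicates finer temporal discrimination. A participant whose responses are close to chance at all intervals would yield a poorly constrained fit, which is one way a fit can fail.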

MRI data collection
Magnetic Resonance Imaging (MRI) data were collected with a 3.0-T Prisma Magnetom Siemens scanner, software version VE11C (Siemens Medical Systems, Erlangen, Germany). Participants were asked to keep their head still and to wear earplugs during scanning in order to reduce the impact of MRI-related noise. The sequences were collected as follows: T1-weighted structural imaging (T1w), resting-state fMRI (rs-fMRI), Multi-Parameter Mapping (MPM) and Diffusion-Weighted Imaging (DWI). MRI scan pre-processing, analysis and statistical comparisons were performed using FMRIB Software Library (FSL, v6.0), except for the MPM quantitative map estimation step which was carried out using the hMRI toolbox implemented in Matlab-based SPM, as described in (Tabelow et al., 2019). The T1w sequence had a TR of 1900 msec, TE of 3.96 msec, a 1 mm isotropic resolution and a large Field of View (FOV, 256 mm³) to allow for the nose to be included in the image and thus facilitate neuronavigation later on in the paradigm. The sequence used GRAPPA with an acceleration factor of 2.
The MPM protocol [as per (Weiskopf et al., 2013)] included three multi-echo 3D FLASH (fast low-angle shot) scans with varying acquisition parameters, one RF transmit field map (B1+ map) and one static magnetic (B0) field map scan, for a total acquisition time of roughly 22 min. To correct for inter-scan motion, position-specific receive coil sensitivity field maps, matched in FOV to the MPM scans, were calculated and corrected for (Papp et al., 2016). The three types of FLASH scans were designed to be predominantly T1-, PD-, or MT-weighted by changing the flip angle and the presence of a pre-pulse: 8 echoes were predominantly Proton Density-weighted (TR = 25 msec; flip angle = 6 degrees; TE = 2.3–18.4 msec), 8 echoes were predominantly T1-weighted (TR = 25 msec; flip angle = 21 degrees; TE = 2.3–18.4 msec) and 6 echoes were predominantly Magnetisation Transfer-weighted (MTw, TR = 25 msec; flip angle = 6 degrees; TE = 2.3–13.8 msec). For MTw scans, excitation was preceded by an off-resonance Gaussian MT pulse of 4 msec duration, flip angle of 220 degrees, and 2 kHz frequency offset from water resonance. All FLASH scans had 1 mm isotropic resolution and a field of view (FOV) of 256 × 224 × 176 mm. The B1 map was acquired through an EPI-based sequence featuring spin and stimulated echoes (SE and STE) with 11 nominal flip angles, FOV of 192 × 192 × 256 mm and TR of 500 msec. The TE was 37.06 msec, and the mixing time was 33.8 msec. The B0 map was acquired to correct the B1+ map for distortions due to off-resonance effects. The B0 map sequence had a TR of 1020.0 msec, first TE of 10 msec, second TE of 12.46 msec, field of view (FOV) of 192 × 192 × 256 mm and read-out bandwidth of 260 Hz/pixel.

MRI preprocessing
A custom pipeline based on existing FSL tools (Smith et al., 2004) was developed for our diffusion sequence. The topup tool was run on average images of AP b0 volumes and PA b0 volumes. The resulting susceptibility-induced off-resonance field was used as an input for the eddy tool (Andersson & Sotiropoulos, 2016), which was run with options optimised for multiband diffusion data to correct for eddy currents and subject movement. To generate Fractional Anisotropy (FA) maps, a diffusion tensor model was fit to each voxel through DTIFIT.
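For readers less familiar with the metric, FA is a scalar computed from the three eigenvalues of the fitted diffusion tensor. The sketch below applies the standard FA definition to an array of eigenvalues; it is not DTIFIT's implementation, and the function name and example eigenvalues are illustrative assumptions.

```python
import numpy as np

def fractional_anisotropy(evals):
    """Standard FA from tensor eigenvalues (last axis holds the 3 eigenvalues).

    FA = sqrt(3/2) * sqrt(sum((l_i - mean)^2) / sum(l_i^2)), in the range [0, 1].
    Voxels whose eigenvalues are all zero are assigned FA = 0.
    """
    evals = np.asarray(evals, dtype=float)
    mean_diffusivity = evals.mean(axis=-1, keepdims=True)
    numerator = np.sum((evals - mean_diffusivity) ** 2, axis=-1)
    denominator = np.sum(evals ** 2, axis=-1)
    safe_denominator = np.where(denominator > 0, denominator, 1.0)
    return np.where(denominator > 0, np.sqrt(1.5 * numerator / safe_denominator), 0.0)

# Illustrative eigenvalues: isotropic, fully anisotropic, and an intermediate case.
example = np.array([[1.0, 1.0, 1.0],
                    [1.0, 0.0, 0.0],
                    [1.7, 0.3, 0.3]])
fa_values = fractional_anisotropy(example)
```

Because FA depends only on the relative spread of the eigenvalues, very different microstructural configurations can produce the same value, which is one reason FA-behaviour correlations are hard to interpret biologically.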
Magnetisation Transfer saturation (MT), R1 and R2* quantitative maps were estimated through the hMRI toolbox (Tabelow et al., 2019), with default settings including ESTATICS modelling (Weiskopf et al., 2014). In order to register MPM volumes to FA volumes, we used the following steps. Boundary-Based Registration was used to calculate a DWI-to-T1w registration using preprocessed b0 images (with high tissue boundary contrast). A customised pipeline was used to apply the fslreorient2std tool to the MPM maps and register them to T1w space. At this stage, one participant was excluded as the MPM-derived maps were heavily corrupted due to movement artefacts; one participant was excluded due to lower quality signal in the MPM scan, which resulted in poor registration with other modalities. Once registration matrices for MPM-T1w and DWI-T1w were calculated, they were inverted, concatenated and applied as needed to bring MPM volumes into DWI space with minimal interpolation. Registrations were assessed manually and one participant was excluded due to poor registration across all analyses.

MRI analysis
To bring all volumes into a common space, native FA volumes were skeletonised with Tract-Based Spatial Statistics [TBSS (Smith et al., 2006)], and the skeletonisation transforms were subsequently applied to MPM-to-DWI registered volumes. Group-level analyses were then conducted in skeleton space for all data.
All behavioural performance measures were normalised (through z-scoring, or rank-based inverse-normal transformation if not normally distributed) and correlations between MRI metrics and behaviour were assessed for each behavioural measure separately.
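The normalisation step can be illustrated with the sketch below. The paper does not state how normality was assessed or which rank offset was used, so the Shapiro-Wilk test, the 0.05 cut-off and the Blom offset (c = 3/8) are assumptions made for the example, as are the function names.

```python
import numpy as np
from scipy.stats import norm, rankdata, shapiro, zscore

def rank_int(values, c=3.0 / 8.0):
    """Rank-based inverse-normal transformation (Blom offset assumed)."""
    values = np.asarray(values, dtype=float)
    ranks = rankdata(values)  # ties receive their average rank
    return norm.ppf((ranks - c) / (len(values) - 2.0 * c + 1.0))

def normalise(values, alpha=0.05):
    """z-score if the data look normal, otherwise apply the rank-based transform.

    Assumption: normality is judged with a Shapiro-Wilk test at level alpha.
    """
    values = np.asarray(values, dtype=float)
    _, p_value = shapiro(values)
    if p_value >= alpha:
        return zscore(values, ddof=1)
    return rank_int(values)

# Strongly skewed, hypothetical behavioural scores.
skewed_scores = np.exp(np.linspace(0.0, 5.0, 30))
normalised = normalise(skewed_scores)
```

The rank-based transform preserves the ordering of participants while forcing the marginal distribution towards normality, at the cost of discarding information about the spacing between scores.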
Relevant text from the preregistered analysis plan is as follows:
Cingulum and DSST: "We aim to replicate a reported relationship between […] number of substituted digits in the Digit Substitution test and cingulum FA (Metzler-Baddeley et al., 2012) […], and to extend the protocol to investigate qMRI […]/behaviour relationships."
Callosum and AFT: "We aim to replicate a reported relationship between callosal FA and AFC duration in the finger tapping task (Muetzel et al., 2008; Sullivan et al., 2001). We further aim to test for a relationship between myelin metrics in the corpus callosum and AFC duration"
Sensorimotor tracts and TOJ: "Performance on the temporal order judgement task is not associated with integrity of a single specific white matter tract, but rather with a set of tracts involving multiple sensorimotor areas. Accordingly, we plan to run exploratory analyses across the whole brain, testing for associations between JND/slope values and FA/qMRI."
Covariates of age, sex, and performance on control tasks (unimanual finger tapping speed for the AFT, and visuomotor speed for DSST) were included. For each behavioural assay, voxelwise analyses were restricted to voxels within a predefined anatomical mask chosen from standard atlases included in FSL and based on the a priori hypotheses: a cingulum mask for DSST, a callosal mask for AFT and a mask of cortico-cortical and ascending sensorimotor tracts for TOJ. The masks were derived from the JHU ICBM-DTI-81 Atlas, the JHU White-Matter Tractography Atlas and the Human Sensorimotor Tracts Atlas, respectively.
Within these masks, analyses were conducted with voxelwise maps of FA, MT, R1 and R2*. Voxelwise inference across these MRI modalities, testing for correlations between each MRI modality and behavioural measures, was performed using the Permutation Analysis of Linear Models (PALM) tool (Winkler et al., 2014). Cluster-wise inference was conducted to control familywise error over the image. A cluster-forming threshold of t > 1.7 (equivalent to p < .05, based on the degrees of freedom) was used in all instances, at the 5% familywise error level.
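The correspondence between a t threshold and a one-sided p-value depends on the degrees of freedom, which are not reported here. The sketch below assumes a hypothetical value of 40 degrees of freedom purely for illustration, to show how such a threshold could be checked.

```python
from scipy.stats import t

def cluster_forming_threshold(alpha, dof):
    """t value whose one-sided tail probability equals alpha for the given dof."""
    return t.isf(alpha, dof)

# Hypothetical degrees of freedom (participants minus model regressors);
# the value actually used in the study is not stated in the text.
assumed_dof = 40
threshold = cluster_forming_threshold(0.05, assumed_dof)
tail_at_1p7 = t.sf(1.7, assumed_dof)  # one-sided p-value implied by t = 1.7
```

Under this assumption the exact threshold is slightly below 1.7, so t > 1.7 corresponds to a one-sided p-value just under .05.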

Unimodal tests of FA
For unimodal hypotheses on FA, we reported the univariate results for correlations between FA and behaviour.

Multimodal tests
For multimodal hypotheses, voxelwise inference using Non-Parametric Combination (NPC), as implemented in PALM (Winkler et al., 2016), was used to produce two types of inferences. (1) Correcting over modalities allowed us to ask whether any individual modality correlates with behaviour; (2) Combining over modalities allowed us to ask whether any combination of modalities correlates with behaviour.
For approach (1), we conducted cluster-wise inference on each modality separately, with familywise error controlled over the image and the K modalities. For each voxel, we reported the minimum image/modality-corrected cluster p-value across modalities.
For approach (2), combining evidence of effects over K modalities, we used Fisher's p-value combining method at each voxel:

$$T = -2 \sum_{k=1}^{K} \ln p_k,$$

where $p_k$ is the p-value for modality $k$. With this approach, evidence can be assessed for either directional or non-directional effects: combining one-sided p-values (based on prior expected directions of effects) will test for directional effects; combining two-sided p-values will provide sensitivity to non-directional effects (i.e., combination of either direction) as well. Here, a directional Fisher test, testing for positive effects across all modalities, was used to test for putative myelin signatures.
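As a rough illustration of Fisher's combining method at a single voxel, the sketch below combines four hypothetical one-sided p-values (one per modality). It evaluates the combined statistic against a chi-squared distribution with 2K degrees of freedom, which is only valid when the p-values are independent; PALM instead assesses the combined statistic by permutation, so this is a simplification rather than the procedure used in the study.

```python
import numpy as np
from scipy.stats import chi2, combine_pvalues

def fisher_statistic(p_values):
    """Fisher's combined statistic: -2 times the sum of log p-values."""
    p_values = np.asarray(p_values, dtype=float)
    return -2.0 * np.sum(np.log(p_values))

# Hypothetical one-sided p-values for FA, MT, R1 and R2* at one voxel.
p_one_sided = [0.04, 0.10, 0.20, 0.08]

stat = fisher_statistic(p_one_sided)
# Parametric reference distribution: chi-squared with 2K degrees of freedom
# (assumes independent modalities, unlike the permutation approach in PALM).
p_combined = chi2.sf(stat, 2 * len(p_one_sided))

# Cross-check against SciPy's built-in implementation.
stat_scipy, p_scipy = combine_pvalues(p_one_sided, method="fisher")
```

In this example no single p-value is very small, yet the combined p-value is smaller than any of them, which illustrates how consistent weak trends across modalities can add up to a detectable joint effect.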

Simulation-based post-hoc power calculations for combined multimodal tests
A comprehensive power analysis for cluster-wise inference that accounts for the spatially-varying dependence among imaging modalities is beyond the scope of this work. However, so as to provide a rough indication of power for future studies of multimodal microstructural imaging, we conducted univariate simulation-based power calculations for the combined multimodal (Fisher) tests. Pearson correlations for each modality-behaviour pair were recorded at the location of the peak voxel in the Fisher test inference map. In each simulation, a Gaussian random vector of behavioural and imaging values was generated with the specified correlation induced between the behaviour and each imaging value. We then tested whether the null hypothesis for each simulation would be rejected under a Fisher test with alpha set at .001. Power was then calculated as the percentage of tests rejecting the null hypothesis across all simulations. For each WM-behaviour correlation, power was calculated for sample sizes ranging from 10 to 300 subjects. While this approach may be optimistic because it uses a peak voxel to measure effect sizes, it is probably conservative in another respect, since it represents power at a single voxel and does not reflect the sensitivity gained through cluster inference.
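A simulation of this kind might look like the sketch below. It is not the code used in the study, and it rests on assumptions the text leaves open: imaging modalities are simulated as mutually uncorrelated, one-sided p-values are obtained from the t distribution of the Pearson correlation, and the Fisher statistic is referred to a chi-squared distribution with 2K degrees of freedom. The function name, effect sizes and number of simulations are hypothetical.

```python
import numpy as np
from scipy.stats import chi2, t as t_dist

def fisher_power(correlations, n_subjects, n_sims=2000, alpha=0.001, seed=0):
    """Estimate power (as a fraction) of a directed Fisher test by simulation.

    Assumptions: modalities are uncorrelated with each other, and the
    Fisher statistic follows chi-squared with 2K dof under the null.
    """
    rng = np.random.default_rng(seed)
    r = np.asarray(correlations, dtype=float)
    k = r.size
    # Covariance: variable 0 is behaviour, variables 1..K are imaging values.
    cov = np.eye(k + 1)
    cov[0, 1:] = r
    cov[1:, 0] = r
    rejections = 0
    for _ in range(n_sims):
        data = rng.multivariate_normal(np.zeros(k + 1), cov, size=n_subjects)
        sample_r = np.corrcoef(data, rowvar=False)[0, 1:]
        t_stat = sample_r * np.sqrt((n_subjects - 2) / (1.0 - sample_r ** 2))
        p_one_sided = t_dist.sf(t_stat, n_subjects - 2)  # positive direction
        fisher = -2.0 * np.sum(np.log(p_one_sided))
        rejections += chi2.sf(fisher, 2 * k) < alpha
    return rejections / n_sims

# Hypothetical effect sizes for four modalities, evaluated at two sample sizes.
power_small = fisher_power([0.3, 0.3, 0.3, 0.3], n_subjects=20, n_sims=200)
power_large = fisher_power([0.3, 0.3, 0.3, 0.3], n_subjects=200, n_sims=200)
```

Repeating the call over a grid of sample sizes gives a power curve, from which the smallest sample reaching a target power can be read off.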

Transparency and openness promotion (TOP) statement
We report how we determined our sample size, all data exclusions (if any), all inclusion/exclusion criteria, whether inclusion/exclusion criteria were established prior to data analysis, all manipulations, and all measures in the study. The conditions of our ethics approval do not permit sharing of the data supporting this study with any individual outside the author team under any circumstances.

Results
We first used unimodal analyses to test for correlations between DWI-derived FA and behaviour, based on previously reported literature (Fig. 3). No relationships were found between behaviour and FA within tracts of interest for either TOJ or DSST (TOJ: peak p_corr = .08; DSST: peak p_corr = .49). For AFT, a significant correlation was found between callosal FA and AFT performance (peak p_corr = .016).
We then performed multimodal tests, testing whether any individual modality (FA, MT, R1 or R2*) strongly correlated with behaviour (Fig. 4), by considering p-values across both voxels and modalities for each WM-behaviour relationship. No relationships were found between behaviour and multimodal MRI metrics within tracts of interest for either TOJ or AFT (TOJ: peak p_corr = .339; AFT: peak p_corr = .09). For DSST, a significant correlation was found between parahippocampal cingulum and DSST (peak p_corr = .038), driven entirely by R1 (only modality with any voxel of p_corr < .05, Fig. 4).
While single-modality tests allow strong correlations with a particular modality to be identified, they cannot identify combined trends across modalities, which can be particularly informative of the underlying biology. For instance, a positive trend across all modalities considered here (which are known to positively correlate with myelin content of the tissue) would indicate that tissue myelination may be related to behavioural performance. Likewise, trends in discordant directions could also be informative, as they could unveil multimodal signatures related to other biological tissue properties such as vasculature and fiber orientation.
Fisher tests were used to detect combined multimodal trends between behavioural measures and MRI metrics (FA, MT, R1 and R2*). With the usual (directed, positive) Fisher test (Fig. 5, 2nd column), no relationships were found between behaviour and multimodal MRI metrics within tracts of interest (TOJ: peak p_corr = .532; AFT: peak p_corr = .184; DSST: peak p_corr = .2). With a non-directed Fisher test (results not shown), once again no relationships were found between behaviour and multimodal MRI metrics within tracts of interest (TOJ: peak p_corr = .82; AFT: peak p_corr = .11; DSST: peak p_corr = .29). Taken together, these two tests argue against the presence of consistent multimodal microstructural signatures related to myelination or to other biological tissue properties.
The lack of a common microstructural signature is also apparent when considering the top 5th percentile t-statistics (Fig. 5, 3rd column) and the t-statistics maps for each task (Figs. S1, S2 and S3), where peaks are not consistent across modalities. This further confirms the negative Fisher tests, as there is no common trend across modalities within each group of WM-behaviour tests.
To aid future studies wishing to explore WM-behaviour correlations, and myelin-behaviour correlations in particular, we ran post-hoc simulation-based power analyses to identify the sample sizes needed to detect a combined multimodal effect through a Fisher test (Fig. 5, 4th column). Based on the observed effect sizes, we find that sample sizes needed to detect a myelin-behaviour correlation across the 4 modalities in a directed Fisher test vary from 190 to 200 participants for DSST, to 40–50 for AFT, to 60–70 for TOJ.
For completeness, we also report analyses of this dataset using conventional univariate approaches, considering each modality separately (Figs. S1, S2 and S3) and not correcting across modalities. We find that if each modality-behaviour correlation was run as a separate analysis, each behaviour would show a correlation with at least one modality. Strikingly, different behaviours correlate most strongly with different modalities [DSST with R1 (Fig. S1); AFT with FA and MT (Fig. S2); TOJ with R2* (Fig. S3)], thus strengthening the evidence against a common microstructural signature across behaviours.

Discussion
Our first aim was to assess the robustness of relationships between white matter FA and behaviour across a range of behavioural tasks. We find a unimodal correlation between corpus callosum FA and bimanual coordination, in accordance with previous literature (Bathelt et al., 2019; Johansen-Berg et al., 2007; Metzler-Baddeley et al., 2012; Muetzel et al., 2008; Sullivan et al., 2001). This confirms that individuals with lower callosal FA perform better in tasks requiring bimanual coordination. It also suggests that the extensive early literature on bimanual coordination and the corpus callosum (Gooijers & Swinnen, 2014) can be replicated, even with larger sample sizes and recent preprocessing pipelines. However, a robust relationship between FA and behaviour was identified in only one out of three tasks considered here. This can be due to several reasons. One possible explanation is that effect sizes inferred from previous studies might be overinflated due to publication bias (Turner et al., 2008) and under-powered analyses (Button et al., 2013). However, it is worth noting that, of the three tasks considered here, only the FA-AFT experiment, which did successfully identify an FA-behaviour relationship, was a direct replication of a previous testing protocol. The other two tasks were designed as conceptual replications or extensions, but did not precisely replicate experimental conditions and analysis steps. For instance, our analyses employed Tract-Based Spatial Statistics (Smith et al., 2006), as well as recently developed preprocessing tools (Andersson & Sotiropoulos, 2016), both of which differed from some of the studies we based our hypotheses on (Metzler-Baddeley et al., 2012). While our aim was not to perfectly replicate analyses from previous papers, it is possible that differences in preprocessing may be driving discrepancies between our FA results and the results from previous studies. In summary, the relationships between FA and behaviour that have been established may be robust and replicable, but the experimental and analytic conditions under which they occur need clarification.
Fig. 3 – FA and behaviour. Unimodal relationships between FA and behaviour were tested across anatomical masks (shown in green) that were selected for each task. Results highlight that the Alternating Finger Tapping task (AFT), but not the Temporal Order Judgement task (TOJ) or the Digit Symbol Substitution Test (DSST), has a significant relationship with FA (red cluster shows voxels with corrected p-values below .05). Within that cluster, mean FA is extracted for each subject and plotted against performance in the scatterplot (with line of best fit and 95% confidence bands); this is for visual assessment of the correlation, rather than for statistical inference.
A second aim of the present study was to probe whether multimodal MR can provide useful insights on WM-behaviour relationships. We find that this is the case for at least one of the WM-behaviour relationships we tested: R1 correlates with DSST performance, such that individuals with higher R1 perform better in the DSST task requiring cognitive control. Higher R1 could reflect greater myelin, oligodendrocytes, vasculature or other iron-rich tissue components. In this case, multimodal analysis allowed identification of a WM-behaviour relationship that would not have been detected by an analysis focused on FA in isolation. This confirms that there is value in multimodal imaging, as some modalities may be more sensitive to the presence of a relationship than others.
A third aim was to test whether there are common multimodal microstructural patterns in WM-behaviour relationships, which may provide insights into the underlying biology. We fail to find robust evidence for multimodal effects and cross-modality signatures. Rather, we find that effect sizes and directionality of effect in the relationship between each modality and each behaviour are highly heterogeneous. This means that MR modalities in each tract not only show heterogeneity in how they relate to the same behaviour, but there is also variation as a function of which tract–behaviour correlation is being considered.
A key insight from the study is therefore that the relationship between WM and behaviour is highly varied. Given that each modality has a specific pattern of sensitivity to the underlying biology (Fig. 1), the results suggest that different aspects of WM biology may be driving different WM-behaviour correlations. There are two prominent sources of biological heterogeneity in white matter, which are likely relevant to the results in this study.
One driver of heterogeneity may be at the level of myelination. We selected metrics that were all sensitive to the amount of myelin in an imaging voxel (Fig. 1), predicting that if myelination were responsible for WM-behaviour relationships, a common multimodal pattern across all relationships would be identified. Such patterns were not found, arguing against myelination as a common driver. However, such reasoning might be overly simple-minded. Histological studies have increasingly highlighted the heterogeneity of features in the myelinated axon, which can vary independently of each other (Almeida & Lyons, 2017). For instance, we know that Nodes of Ranvier, myelin sheath thickness, myelin sheath length, and number of myelin sheaths can all independently affect an axon's physiological properties, which one would expect, in turn, to shape behaviour (Kaller et al., 2017). Varying these features might have differing effects on the overall amount of myelin in a given voxel, meaning that the imaging metrics used might not be equally sensitive to all relevant features of the myelinated axon.
A second important driver of heterogeneity is non-myelin features of WM. As exemplified in Fig. 1, while all sequences we used are sensitive to myelin, some are also sensitive to fiber orientation and neuronal volume (FA), and some are sensitive to iron and vasculature (R1 and R2*). Therefore, one possible interpretation of the data is that the relationship between AFT performance and the corpus callosum is highly influenced by fiber orientation, whereas the relationship between DSST performance and the cingulum is shaped by vasculature. Previous studies highlighted that both fiber orientation (Chang et al., 2017; Wedeen et al., 2005) and vasculature (Licht et al., 2011; Rhyu et al., 2010; Thomas et al., 2016) are important for brain function, and our data thus draw further attention to the fact that these factors may be influential in WM-behaviour relationships.
These two factors combined may explain why there is no single aspect of WM that drives behaviour. Rather, our findings confirm that heterogeneity at the cellular level is reflected in variation in the relationship between neuroimaging markers and behaviour. Importantly, this emphasizes that there is no single modality or single combination of modalities which is optimal to study WM-behaviour relationships. In this respect, our study poses practical limits to the possibility of developing a one-size-fits-all approach to the investigation of white matter-behaviour relationships, due to their inherent diversity.
While this heterogeneity means it is not straightforward to predict which MR modality is best suited for each type of WM investigation, it also suggests that multimodal studies of WM should tailor their MR sequence protocols and analysis pipelines to privilege markers and statistical approaches that can test and compare biologically-grounded models. For example, with an appropriate acquisition sequence and a joint multimodal statistical framework, one might be able to test whether a given WM-behaviour correlation is driven by myelination, vasculature (Thomas et al., 2016), or connectivity (Sui et al., 2014). Such approaches are most likely to generate further insights into WM-behaviour relationships in the future.

Fig. 5 – Lack of evidence for combined multimodal signatures. A Fisher test was used to search for multimodal microstructural signatures relating WM to behaviour, but no significant effects were found (2nd column). Effect sizes are reported for each modality-behaviour correlation, as measured by the top 5% t-statistic within peak Fisher clusters. This analysis was carried out to provide a clear visualisation of peak effect size for each pair of MR modality and behaviour, rather than for statistical inference (3rd column). For each WM-behaviour correlation, we used a simulation-based approach to calculate sample sizes needed to reach 80% power (red line), given the observed effect sizes found in our pre-registered tests. Sample sizes needed to detect a combined multimodal effect vary from 190–200 participants for DSST, to 40–50 for AFT, to 60–70 for TOJ (4th column).
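The idea behind a simulation-based sample size estimate, as mentioned in the Fig. 5 caption, can be illustrated with a minimal sketch. This is not the pipeline used in the study, which operated on voxelwise combined tests; it assumes a single bivariate-normal brain-behaviour correlation, a Fisher z-test for significance, and hypothetical function names (`simulated_power`, `smallest_n_for_power`) and effect sizes chosen purely for illustration.

```python
import math
import random
from statistics import NormalDist


def pearson_r(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def simulated_power(true_r, n, alpha=0.05, n_sims=2000, seed=0):
    """Fraction of simulated samples of size n in which a two-sided
    Fisher z-test detects a correlation whose true value is true_r."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    noise_sd = math.sqrt(1 - true_r ** 2)
    hits = 0
    for _ in range(n_sims):
        # Draw (x, y) pairs with population correlation true_r.
        x = [rng.gauss(0, 1) for _ in range(n)]
        y = [true_r * xi + noise_sd * rng.gauss(0, 1) for xi in x]
        z = math.atanh(pearson_r(x, y)) * math.sqrt(n - 3)
        if abs(z) > z_crit:
            hits += 1
    return hits / n_sims


def smallest_n_for_power(true_r, target=0.8, candidates=range(10, 301, 10)):
    """First candidate sample size whose simulated power reaches target,
    or None if no candidate does."""
    for n in candidates:
        if simulated_power(true_r, n) >= target:
            return n
    return None
```

Calling `smallest_n_for_power(0.4)` scans sample sizes in steps of 10 and returns the first one at which roughly 80% of simulated studies detect the assumed effect. Smaller assumed effect sizes push the required sample size up quickly, which is the pattern the DSST, AFT and TOJ estimates in Fig. 5 reflect.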
One key limitation of the study is that the results cannot disentangle to what extent differences between WM tracts contribute to the observed diversity of WM-behaviour relationships. One could argue, for example, that our results demonstrate that FA is more important for WM-behaviour relationships involving the corpus callosum, whereas R1 is more important for understanding the cingulum, while MT/R2* are more important in investigations of the corticospinal tract. Because each of the behaviours we selected relates to a different WM tract, it is impossible to disentangle whether different kinds of behaviours are most strongly driven by different microstructural patterns, or whether there is neuroanatomical heterogeneity in the importance of different microstructural features of each tract. Although both are likely to matter, further studies relating individual tracts to multiple behaviours are required.
An additional limitation of the study lies in the extent to which it was pre-registered. While our preregistration covered hypotheses and aims, including behavioural measures, MR metrics and regions of interest, it is now increasingly being acknowledged that many analytical choices in neuroimaging can have a large influence on the final results (Nichols et al., 2017; Pervaiz et al., 2020), and are thus crucial for confirmatory analyses. Therefore, we recommend that future studies include sample size and details of their preprocessing and statistical modelling in their preregistrations when appropriate.
The results also hold useful lessons for statistical aspects of future multimodal studies of WM. WM-behaviour correlations often have small effect sizes, and in our results we find that these effects are sometimes not detected when multiple hypotheses are tested concurrently. Testing for effects across modalities increases the false discovery rate proportionally to the number of modalities tested, and thus needs to be adequately corrected for in order to reach appropriate interpretations (Winkler et al., 2016). However, while multiple comparison correction has long been the gold standard statistical advice, multimodal brain imaging studies often do not report whether, and if so, how, correction for multiple comparisons was carried out (Bezukladova et al., 2020; Winston et al., 2020). Surprisingly, even gold standard guidelines in the field like COBIDAS do not report best practices for statistical reporting in multimodal imaging (Nichols et al., 2017), and many packages that support multi-modality statistical testing do not allow joint statistical tests, thus leaving room for needless analytic flexibility. Our results suggest there is a need for increased transparency in reporting of multimodal statistics, which statistical guidelines on multimodal imaging might facilitate in the future. In this respect, our results also add weight to previous calls to pre-register the modalities to be used in a given analysis (Picciotto, 2018), and to report all tested modalities in publications.
This aspect of statistics in multimodal studies also needs to be taken into account when assessing the power of a given analysis. When modalities are analysed separately, multimodal studies require multiple statistical tests across modalities. Therefore, for the same effect size, a study analysing multiple modalities may need more subjects to achieve the same power, and it is important to take this into account in power analyses. We thus recommend using larger sample sizes for multimodal compared to unimodal studies. Alternatively, one can use non-parametric multivariate tests (Winkler et al., 2014, 2016) and/or dimensionality reduction techniques (Groves et al., 2011; Sui et al., 2014) in scenarios where multimodal data are available but the data set size is only powered for unimodal tests. While there is little literature on multimodal power analyses for cross-sectional studies using microstructural imaging, our results indicate that sample sizes of 40–200 may be required to detect joint multimodal effects through non-parametric multivariate tests.
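To make the notion of a joint non-parametric test concrete, the following is a simplified sketch of a Fisher-combined permutation test across modalities. It is an illustration under stated assumptions rather than the implementation used in the study: it works on one summary value per subject and modality instead of voxelwise maps, it obtains the per-modality p-values from a Fisher z approximation, and the function names (`correlation_p_value`, `fisher_statistic`, `joint_permutation_test`) are hypothetical. The key design choice is that behaviour is shuffled once per permutation and that same shuffle is applied to every modality, so dependence between modalities is preserved under the null.

```python
import math
import random
from statistics import NormalDist

_STD_NORMAL = NormalDist()


def correlation_p_value(x, y):
    """Two-sided p-value for a Pearson correlation (Fisher z approximation)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    r = sxy / math.sqrt(sxx * syy)
    r = max(-0.999999, min(0.999999, r))  # keep atanh finite
    z = abs(math.atanh(r)) * math.sqrt(n - 3)
    p = 2 * (1 - _STD_NORMAL.cdf(z))
    return max(p, 1e-300)  # avoid log(0) in the combining function


def fisher_statistic(p_values):
    """Fisher's combining function: -2 * sum(log(p_k))."""
    return -2 * sum(math.log(p) for p in p_values)


def joint_permutation_test(behaviour, modalities, n_perm=1000, seed=0):
    """Return (observed Fisher statistic, permutation p-value).

    modalities is a list of per-subject value lists, one list per modality.
    """
    rng = random.Random(seed)
    observed = fisher_statistic(
        [correlation_p_value(behaviour, m) for m in modalities]
    )
    shuffled = list(behaviour)
    exceed = 0
    for _ in range(n_perm):
        # One shuffle per permutation, shared by all modalities.
        rng.shuffle(shuffled)
        stat = fisher_statistic(
            [correlation_p_value(shuffled, m) for m in modalities]
        )
        if stat >= observed:
            exceed += 1
    return observed, (exceed + 1) / (n_perm + 1)
```

Because a single combined statistic is tested, no further correction across modalities is needed for the joint hypothesis, which is why such tests can be attractive when the sample is only powered for unimodal analyses.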
In conclusion, these results highlight a broad heterogeneity in white matter's relationship with behaviour. They also underscore the added value of multimodal imaging approaches, as different neuroimaging modalities might be best suited to detect different WM-behaviour relationships. However, this added value needs to be weighed carefully against the need for more power and/or dimensionality reduction approaches in multimodal studies. Finally, the results effectively limit the possibility of developing a one-size-fits-all approach to study white matter, and suggest that different aspects of WM biology may be driving different WM-behaviour correlations.

Fig. 2 – Study design and summary of MRI and behavioural data acquired.

Fig. 4 – Multimodal microstructural imaging and behaviour. Multimodal relationships between behaviour and individual MRI metrics (FA, MT, R1 and R2*) across Digit Symbol Substitution Test (DSST), Alternating Finger Tapping task (AFT) and Temporal Order Judgement task (TOJ). Only the DSST has a significant relationship with cingulum WM, driven by R1, when considering FWER-corrected p-values (red cluster shows voxels with corrected p-values below .05). Within that cluster, mean R1 is extracted for each subject and plotted against performance in the scatterplot (with line of best fit and 95% confidence bands); the scatterplot is for visual assessment of the correlation rather than for statistical inference.