Multi-centre classification of functional neurological disorders based on resting-state functional connectivity

Highlights
• Using machine learning on multi-centre data, FND patients were successfully classified with an accuracy of 72%.
• The angular and supramarginal gyri, the cingulate and insular cortex, and the hippocampus were the most discriminant regions.
• To provide diagnostic utility, future studies must include patients with similar symptoms but different diagnoses.


Introduction
Functional neurological disorders (FND) describe the presence of neurological symptoms not caused by a classical neurological disease (American Psychiatric Association, 2013) but related to brain dysfunctions (Drane et al., 2020). Patients can experience a wide range of neurological symptoms, most frequently motor (e.g., weakness or abnormal movements), sensory (e.g., numbness), or attacks of clouded consciousness which are sometimes accompanied by convulsions (World Health Organization, 1993). Nowadays, the diagnosis of FND is made on the basis of positive clinical signs (Daum et al., 2015; Stone and Carson, 2015), and less emphasis is put on an exclusion process (i.e., not identifying an underlying explanatory neurological disease). Indeed, even if there is no gold standard against which to compare the validity of these signs, several recent studies have shown excellent specificity for several bedside clinical signs (Daum et al., 2015; Espay et al., 2018a; Syed et al., 2011). However, due to the heterogeneity of FND symptoms and a broad spectrum of potential differential diagnoses, specialists often request multiple time- and cost-consuming additional tests to rule out an underlying organic lesion or comorbid condition (Espay et al., 2009), even if they were convinced of the diagnosis based on their initial clinical evaluation (Espay et al., 2018a). This highlights the need to identify an adjunctive positive biomarker to support clinicians in their daily clinical routine. Such a marker could allow rapid confirmation of the clinical diagnosis, rather than engaging in a long and exhaustive process of excluding all evoked differential diagnoses.
In the search for new biomarkers in neuropsychiatric disorders, resting-state (RS) functional magnetic resonance imaging (fMRI) has gained growing attention as a promising and easily applicable tool (Greicius, 2008). Resting-state fMRI allows studying blood oxygen level dependent (BOLD) signal fluctuations in the brain under resting condition and therefore does not depend on the patient's active participation. Furthermore, inter-regional correlations of temporal fluctuations are thought to reflect functional connectivity (FC) between spatially distinct brain regions. Therefore, RS fMRI can reveal important information about underlying neuropathophysiological changes in functional networks of patients (Sokolov et al., 2019; Takamura and Hanakawa, 2017; van den Heuvel and Hulshoff Pol, 2010). Even though task-based fMRI studies are predominant in FND, RS studies in FND were able to confirm findings from task-based studies and identified consistent results. Amongst the existing RS studies, (1) increased limbic connectivity to motor control areas (Maurer et al., 2016; van der Kruijs et al., 2012), (2) aberrant connectivity from the right temporoparietal junction (TPJ) to sensorimotor regions (Diez et al., 2019; Hassa et al., 2017; Maurer et al., 2016; Mueller et al., 2022; Wegrzyk et al., 2018), as well as (3) altered connectivity from memory-related temporal structures (Longarzo et al., 2020; Monsa et al., 2018; Szaflarski et al., 2018) were identified.
In parallel, the application of machine learning algorithms offers a complementary tool for RS fMRI data analysis. Moreover, machine learning approaches have been shown to be robust and sensitive to disease-specific alterations in functional and structural medical images (Erickson et al., 2017). As such, their value has been demonstrated in several neurological diseases and heterogeneous psychiatric disorders by successfully distinguishing patients from healthy controls based on RS FC (for review see (Nielsen et al., 2020)).
In the field of FND, our previous study (Wegrzyk et al., 2018) showed promising results with regards to accurately distinguishing FND from healthy controls (HC). We applied a multivariate classification approach based on whole-brain RS FC aiming at discriminating motor FND patients from healthy controls in a predictive setting. Similarly, in another study the seizure-subtype of FND (psychogenic non-epileptic or functional seizures) was successfully classified against healthy controls, based on RS FC (Ding et al., 2013) and T1-weighted structural MRI data (Vasta et al., 2018). Even though real-life use of such a biomarker will need control groups with similar symptoms to FND and not only healthy controls, these studies provided a strong rationale to continue the validation of such classification algorithms. Indeed, most bedside positive signs for FND are specific and reliable, but neuroimaging classification based on machine learning might provide a future clinical tool in the form of an additional rule-in test against other neurological and psychiatric diseases and disorders.
The translation of neuroimaging data from bench to bedside has always been challenging due to the clinical heterogeneity (Espay et al., 2018a;Galli et al., 2020) and within-group differences of neuropsychiatric disorders (i.e., FND patients), and consequently its limited generalizability within and between patient populations (Stone et al., 2011). Importantly, overcoming the problem of low generalizability requires large samples, which includes patients with different symptom types and symptom severities, and preferably from different centres. Furthermore, establishing RS FC as an adjunctive positive biomarker for FND requires its applicability within and across different centres, i.e., different symptom types and symptom severity, consequently increasing the sample size and the heterogeneity of the dataset, which might benefit the classification performance. The next step towards a clinical application therefore includes the validation of multivariate classification approaches in different datasets (i.e., with regard to FND subtypes or scanners), and to assess their performance when using multi-centre data.
To bridge this gap, we set out to further evaluate the classification performance of our previously published classification approach (Wegrzyk et al., 2018) through three different validation steps (Dyrba et al., 2013; Nunes et al., 2020; Rozycki et al., 2018). First, our aim was to replicate the previous results by applying the method in additional datasets collected at other centres (intra-centre cross-validation step) and to test its robustness when used in a multi-centre setting by pooling the data of these centres together (pooled cross-validation step). Our second aim was to assess the generalizability of the method by using data from each single centre once as test set after training on the data from the other centres (inter-centre cross-validation step). Successfully distinguishing FND patients from HC in a multi-centre setting could pave the way towards a clinical application by including neurological and psychiatric controls with similar symptoms (but other diagnoses) in future studies.

Participants
Data were collected retrospectively from four different European University Neurology Departments: i) Geneva (Switzerland, previously published in (Wegrzyk et al., 2018)), ii) Bern (Switzerland), iii) Prague (Czech Republic, previously published in (Mueller et al., 2022)) and iv) Groningen (The Netherlands, previously reported in (Marapin et al., 2020, 2021)). Board-certified neurologists confirmed the diagnosis of FND according to DSM-5 (World Health Organization, 1993) and using positive signs (Stone and Carson, 2015). We included FND patients with motor and sensory symptoms (F44.4 and F44.6), with functional seizures (F44.5), and mixed symptom type (F44.7). For movement disorders (F44.4), clinically definite and documented diagnoses according to (Gupta and Lang, 2009) were included. Exclusion criteria were a current neurological disease or disorder (other than FND), alcohol or drug abuse, pregnancy or breast-feeding, and contraindication for MRI scanning. The studies were approved by local ethics committees at each of the centres, i.e., the ethics committee of the University Hospitals of Geneva (CER 14-088), the Competent Ethics Committee of the Canton Bern (SN_2018-00433), the Ethics Committee of the General University Hospital in Prague (approval number 26/15) and the Medical Ethical Committee of the Amsterdam University Medical Center, location AMC, the Netherlands (identification number MEC10/079). All subjects provided written informed consent.
The dataset included 220 MRI scans from patients suffering from FND and age- and sex-matched HC. Data from 21 subjects were excluded due to excessive motion artefacts (see section 2.3), and 10 subjects were excluded due to insufficient quality of the functional data (slice artefacts in frontal and/or parietal regions). To keep the dataset well balanced (Dyrba et al., 2013; Nielsen et al., 2020), the age- and sex-matched counterpart of each excluded subject was discarded as well (n = 17). We confirmed matched ages within and between the centres using a type III ANOVA with factors group and centre. The remaining 172 MRI scans included data from 86 patients and their 86 age- and sex-matched healthy controls (Table 2). Compared to the previous work, two healthy controls were excluded from the original dataset of centre I in order to have an equal number of subjects in both groups. Similarly, compared to the dataset in (Marapin et al., 2021), two subjects were excluded due to motion artefacts, along with their corresponding age- and sex-matched counterparts.

Data acquisition
Mood disorders are known comorbidities in FND patients (Carson and Lehn, 2016). Therefore, anxiety and depression scores, as well as psychotropic medication (i.e., benzodiazepines, neuroleptics, antidepressants, antiepileptics, and opioids) are commonly assessed in studies on FND patients. Accordingly, centres I, II, and III collected behavioural data of patients and controls on anxiety and depression using the Spielberger State-Trait Anxiety Inventory (STAI, Spielberger et al., 1983) and the Beck's Depression Inventory (BDI, Beck, 1961). Centre IV collected behavioural data on anxiety and depression in patients using the Beck's Anxiety Inventory (BAI, Beck et al., 1988) and the Beck's Depression Inventory (BDI, Beck, 1961). Symptom severity was evaluated using the Clinical Global Impression (CGI) score (0 = no symptoms to 5 = very severe symptoms) in centre I; using the CGI score (0 = no symptoms to 7 = very severe symptoms) in centres II and IV; and using the Simplified Version of the Psychogenic Movement Disorder Rating Scale (S-FMDRS, Nielsen et al., 2017) in centre III. CGI scores with different scales were converted into the same scale. S-FMDRS scores were converted into CGI scores (see Supplementary Material, Appendix 1). Differences in symptom severity between centres (CGI score) were analysed using one-way ANOVA.
Functional and structural MRI data were all acquired on 3-Tesla units using different MRI manufacturers, machines and protocols. Acquisition parameters for the fMRI data of each centre are summarized in Table 1. In one centre (centre IV), fMRI data were based on fast field single echo planar imaging (FEEPI), whereas in the others, they were based on whole-brain single-shot multi-slice BOLD echo-planar imaging (EPI). Structural scans were obtained using a T1-weighted Magnetization Prepared Rapid Gradient-Echo (MPRAGE) image in centres I, II, and III; and using a T1-weighted turbo field echo (TFE) image in centre IV.

MR pre-processing
Data were pre-processed and analysed using MATLAB (R2017b, MathWorks Inc., Natick, USA). Each centre was pre-processed individually. An adapted version of the previous pre-processing pipeline from (Wegrzyk et al., 2018) based on the Statistical Parametric Mapping version 12 (SPM12) tools (https://www.fil.ion.ucl.ac.uk/spm/software/spm12/) was used, including: functional realignment and co-registration of the mean functional image to the structural image, and segmentation of the structural image into grey matter, white matter, and cerebrospinal fluid. The functional images were additionally checked for excessive head motion using the framewise displacement (FD) method of Power and colleagues (Power et al., 2014). Mean FD and the number of volumes above a threshold of 0.5 mm were calculated per subject. A type III ANOVA was used to evaluate differences in motion artefacts for the factors group and centre. Then, for each subject an individual structural brain atlas based on the AAL atlas (Tzourio-Mazoyer et al., 2002) was built using a customized version of the IBASPM toolbox (Aléman-Gomez et al., 2006). From the AAL atlas, we used 88 regions (whole atlas without the cerebellum and pallidum (due to signal dropout), same as in (Wegrzyk et al., 2018)). The individual structural atlas was mapped onto the native resolution of the functional data. Furthermore, region-averaged time-series were extracted, and motion parameters as well as the average signal from the white matter and the cerebrospinal fluid were regressed out (Richiardi et al., 2011; Wegrzyk et al., 2018). The region-averaged time-series were Winsorized to the 95th percentile to reduce the effect of outliers and linearly detrended.
For optimization purposes of the first validation step (see section 2.5), the region-averaged time-courses were either bandpass filtered (0.01-0.08 Hz) or wavelet subband filtered (Richiardi et al., 2011) (see Supplementary Material, Appendix 2 for further details and explanations on the pre-processing pipelines).
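The motion check described above can be illustrated with a minimal sketch. It assumes six realignment parameters per volume (three translations in mm followed by three rotations in radians) and converts rotations to millimetres on a sphere of 50 mm radius, as is common for the FD measure; the function names `framewise_displacement` and `summarize_fd` are hypothetical and are not taken from the original pipeline, which was implemented in MATLAB.

```python
def framewise_displacement(motion, radius_mm=50.0):
    """Return one FD value per pair of consecutive volumes.

    motion: list of [tx, ty, tz, rx, ry, rz] per volume
            (translations in mm, rotations in radians; assumed layout).
    radius_mm: sphere radius used to express rotations in mm (assumption).
    """
    fd = []
    for prev, curr in zip(motion, motion[1:]):
        # Sum of absolute translation changes between consecutive volumes
        trans = sum(abs(c - p) for p, c in zip(prev[:3], curr[:3]))
        # Rotational changes converted to arc length on the sphere
        rot = sum(abs(c - p) * radius_mm for p, c in zip(prev[3:], curr[3:]))
        fd.append(trans + rot)
    return fd


def summarize_fd(fd, threshold_mm=0.5):
    """Return mean FD and the number of volumes with FD above the threshold."""
    mean_fd = sum(fd) / len(fd)
    n_above = sum(1 for value in fd if value > threshold_mm)
    return mean_fd, n_above
```

Per subject, the two summary values returned by `summarize_fd` correspond to the mean FD and the count of supra-threshold volumes mentioned in the text.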

Resting-State functional connectivity modelling
Pairwise Pearson correlation coefficients between each pair of atlas regions were calculated for each subject to obtain a correlation matrix (number of regions × number of regions) (Smith et al., 2011). The correlation coefficients were Fisher-Z transformed to make the distribution of the connectivity values approximately Gaussian. The Fisher-Z transformed connectivity matrices of each centre were then connection-wise Z-scored to normalize the data with regard to centre, which acts as a site harmonization. To evaluate the effectiveness of the normalization, we analysed within- and between-centre and group effects of functional connectivity differences between each pair of regions using n-way ANOVA before and after normalization. For each subject, the upper triangular part (without the diagonal) of the correlation matrix was extracted and lexicographically organized into a feature vector, which was used as input for the classifier. The feature vector of each subject therefore contained (88 × 87)/2 = 3828 features. The exact procedure can be found in (Richiardi et al., 2011).
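The feature construction can be sketched as follows. This is a minimal illustration using only the Python standard library, not the MATLAB code used in the study; the function names are hypothetical, the clipping of correlations just below ±1 (to keep the Fisher transform finite) is an added assumption, and the within-centre Z-scoring uses the sample standard deviation, which is also an assumption.

```python
import math


def pearson(x, y):
    """Pearson correlation between two equally long time courses."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def n_connections(n_regions):
    """Number of unique region pairs (upper triangle without diagonal)."""
    return n_regions * (n_regions - 1) // 2


def connectivity_features(timeseries, clip=0.999999):
    """Fisher-Z transformed correlations in lexicographic pair order.

    timeseries: list of region-averaged time courses (regions x timepoints).
    clip: bound applied to r before atanh (assumption, avoids infinities).
    """
    n = len(timeseries)
    features = []
    for i in range(n):
        for j in range(i + 1, n):
            r = pearson(timeseries[i], timeseries[j])
            r = max(-clip, min(clip, r))
            features.append(math.atanh(r))  # Fisher-Z transform
    return features


def zscore_within_centre(feature_rows):
    """Connection-wise Z-scoring over all subjects of ONE centre."""
    n_subj, n_feat = len(feature_rows), len(feature_rows[0])
    out = [[0.0] * n_feat for _ in range(n_subj)]
    for k in range(n_feat):
        col = [row[k] for row in feature_rows]
        mean = sum(col) / n_subj
        sd = math.sqrt(sum((v - mean) ** 2 for v in col) / (n_subj - 1))
        for s in range(n_subj):
            out[s][k] = (col[s] - mean) / sd if sd > 0 else 0.0
    return out
```

With 88 regions, `n_connections(88)` gives the 3828 features stated above. Applying `zscore_within_centre` separately to each centre before pooling mirrors the centre normalization described in the text.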

Classification
To perform a binary classification, a Support Vector Machine (SVM) classifier with a linear kernel function and L2 regularization was used, which learned a discriminative function that separated the two groups as accurately as possible. The SVM implementation for MATLAB of the LIBSVM package (Chang and Lin, 2011) (software available at https://www.csie.ntu.edu.tw/~cjlin/libsvm/) was used, where the C parameter was set to 1. The classification process included two main steps: 1) training and testing of the model and 2) evaluation of the model. In order to estimate the performance of our model, we chose three cross-validation approaches, adapted from and implemented similarly to (Dyrba et al., 2013) and (Nunes et al., 2020):
(1) Intra-centre cross-validation: Each dataset was evaluated individually by separating training and test set using an n-fold leave-one-out (LOO) cross-validation approach, where n represents the number of subjects. For each iteration, n-1 subjects were used as training data and the remaining subject was used as test data. This was repeated until each subject within a centre was used once to test the classification performance. During this intra-centre cross-validation, we therefore replicated the results in centre I, and validated the applicability of the method in three other datasets originating from three separate centres (centres II-IV).
(2) Pooled cross-validation: All the data of the four centres were pooled and separated into a training set and a testing set by using the n-fold LOO cross-validation approach again. The classifier was trained on n-1 subjects, including subjects of all four centres, and tested on the remaining subject. This was repeated until each subject from each centre was used once to test the classification performance. During this pooled cross-validation, we evaluated the classifier's performance when working with data that arise from different scanners, which introduces a scanner-specific variability.
(3) Inter-centre cross-validation: The data from s-1 centres (i.e., scanners), where s is the number of centres, were used as a training set, and the data from the remaining centre were used as a testing set, such that each centre served once as the test set. During this inter-centre cross-validation, we investigated whether the learned linear SVM model can be applied to data from an unknown scanner and therefore evaluated its generalization power.
This setting poses great challenges due to the many sources of uncontrolled variance across scanners and datasets (Abraham et al., 2017; Noble et al., 2017). We thus further examined the classification performance when gradually transferring subjects from the test set to the training set. In doing so, the test set is no longer fully naïve to the potential centre-specific bias introduced in the inter-centre cross-validation setting. This procedure, however, can help to understand the impact of scanner-specific bias on the classification performance. We iteratively transferred data from two subjects (one HC and one FND) from the test set to the training set to examine the learning curve. In each iteration, two more subjects were transferred from the test set to the training set until a maximum number of 28 subjects (i.e., 14 HC, 14 FND) was transferred. Namely, 28 subjects represent the maximum number of subjects that can be transferred in order to have at least two remaining subjects in the test set.
In each setting, the classification performance was calculated as the average performance across all folds. Fig. 1 gives an overview of the three different validation steps (for a detailed description, see Supplementary Material, Appendix 2).
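The three validation schemes differ only in how subjects are split into training and test sets, which can be sketched as index generators. This is an illustrative sketch under stated assumptions, not the original implementation: the function names are hypothetical, group labels are assumed to be coded 0 = HC and 1 = FND, and `transfer_pairs` simply moves the first available subjects of each group, since the text does not specify how transferred subjects were selected.

```python
def leave_one_out_splits(n_subjects):
    """Intra-centre or pooled setting: each subject is the test set once."""
    for test_idx in range(n_subjects):
        train = [i for i in range(n_subjects) if i != test_idx]
        yield train, [test_idx]


def leave_one_centre_out_splits(centres):
    """Inter-centre setting: each centre is held out once as the test set.

    centres: one centre label per subject.
    """
    for held_out in sorted(set(centres)):
        train = [i for i, c in enumerate(centres) if c != held_out]
        test = [i for i, c in enumerate(centres) if c == held_out]
        yield held_out, train, test


def transfer_pairs(train, test, labels, n_pairs):
    """Move n_pairs (one HC and one FND each) from the test to the training set.

    labels: 0 = HC, 1 = FND (assumed coding).
    The first n_pairs subjects of each group are moved (assumption).
    """
    hc = [i for i in test if labels[i] == 0][:n_pairs]
    fnd = [i for i in test if labels[i] == 1][:n_pairs]
    moved = set(hc + fnd)
    new_train = train + sorted(moved)
    new_test = [i for i in test if i not in moved]
    return new_train, new_test
```

Calling `transfer_pairs` with `n_pairs` increasing from 1 to 14 would reproduce the schedule of transferring 2 to 28 subjects described above; a classifier would be retrained and evaluated at each step to trace the learning curve.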

Evaluation
To evaluate the classifier's performance, accuracy, sensitivity, specificity, as well as the area under the receiver operating characteristic curve (AUC) were computed. The accuracy provides information about the overall performance of the classifier with respect to both groups and was defined as accuracy = (TP + TN)/n, where TP is the number of true positives (patients correctly classified as patients), TN is the number of true negatives (controls correctly classified as controls), and n is the total number of subjects. The sensitivity is the true positive rate and the specificity the true negative rate, i.e., sensitivity = TP/(TP + FN) and specificity = TN/(TN + FP), where FN and FP refer to the number of false negatives and false positives, respectively. The AUC assesses the probability of correctly classifying a random pair of patient and control. It reflects test accuracies as follows: AUC = 1 refers to perfect accuracy, AUC between 0.7 and 0.9 refers to moderate accuracy, AUC between 0.5 and 0.7 refers to low accuracy, and AUC = 0.5 is uninformative. To assess the significance of the classification, we performed permutation testing, i.e., the classification was repeated 1000 times with the group labels (patient/control) randomly permuted to obtain a null distribution of the classification performance.
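The definitions above translate directly into a few lines of code. The following sketch assumes labels coded 1 = patient and 0 = control and a continuous decision score per subject for the AUC; the function names are hypothetical. The permutation p-value uses the common convention of adding one to numerator and denominator, which is an assumption and not necessarily the exact formula used in the study.

```python
def classification_metrics(y_true, y_pred):
    """Accuracy, sensitivity and specificity from true and predicted labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return {
        "accuracy": (tp + tn) / len(y_true),
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
    }


def auc_from_scores(y_true, scores):
    """Probability that a random patient scores higher than a random control.

    Ties count as half a correct ordering.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def permutation_p_value(observed, null_values):
    """Share of permuted-label results at least as good as the observed one.

    The +1 correction is an assumed convention.
    """
    n_extreme = sum(1 for v in null_values if v >= observed)
    return (n_extreme + 1) / (len(null_values) + 1)
```

In a permutation test as described in the text, `null_values` would hold the accuracies of the 1000 classification runs with shuffled group labels, and `observed` the accuracy obtained with the true labels.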

Most discriminative connections
To shed light on which brain areas may be linked to the pathophysiology of FND and common across all four centres, we focussed the post-hoc analyses on the validation steps which pooled all the data from the four centres (step 2: pooled cross-validation). In order to explore the connections that were most discriminative to distinguish patients and controls, we analysed the highest weights assigned by the classifier to the different functional connections (i.e., correlation coefficients).
Within these most discriminative connections, we then further identified those regions that appeared with the highest frequency. From this set of regions, we analysed the connectivity differences between patients and controls by exploring whether these regions were hypo- or hyper-connected in patients versus controls. For this purpose, we calculated the mean connectivity between the corresponding pairs of regions for each group (healthy controls and FND patients).
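This post-hoc analysis can be sketched as a mapping from feature indices back to region pairs, followed by simple counting. The sketch below is illustrative only: the function names are hypothetical, ranking connections by the absolute value of the SVM weight is an assumption (the text only refers to the highest weights), and labels are assumed to be coded 1 = FND and 0 = HC.

```python
from collections import Counter


def index_to_pair(n_regions):
    """Region pairs in the same lexicographic order as the feature vector."""
    return [(i, j) for i in range(n_regions) for j in range(i + 1, n_regions)]


def top_connections(weights, n_regions, k):
    """Return the k connections with the largest absolute weight (assumption)."""
    pairs = index_to_pair(n_regions)
    order = sorted(range(len(weights)), key=lambda f: abs(weights[f]), reverse=True)
    return [(pairs[f], weights[f]) for f in order[:k]]


def region_frequency(connections):
    """Count how often each region occurs in the selected connections."""
    counts = Counter()
    for (i, j), _weight in connections:
        counts[i] += 1
        counts[j] += 1
    return counts.most_common()


def mean_connectivity_difference(features, labels, feature_index):
    """Mean connectivity of patients minus controls for one connection.

    Positive values indicate hyper-connectivity in patients (labels: 1 = FND).
    """
    patients = [row[feature_index] for row, y in zip(features, labels) if y == 1]
    controls = [row[feature_index] for row, y in zip(features, labels) if y == 0]
    return sum(patients) / len(patients) - sum(controls) / len(controls)
```

The sign of `mean_connectivity_difference` distinguishes hyper-connected from hypo-connected pairs, in line with the group-wise mean connectivity comparison described above.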

Impact of anxiety, depression, medication, and clinical score on classification performance
In order to verify that our results were not driven by potential confounding factors such as anxiety (STAI), depression (BDI), psychotropic medication (yes/no), and clinical scores/symptom severity (CGI), we used a logistic regression analysis (using the glm function in R, which automatically removes missing data from the regression analysis). Specifically, we tested whether the aforementioned factors could predict if a subject was classified correctly or not (yes/no). We tested each factor individually and in combination.

Demographic and clinical data
Data from 86 FND patients and 86 age- and sex-matched healthy controls, arising from four different centres, were included in this study. All patients and 71 HC completed the Beck's Depression Inventory (BDI, (Beck, 1961)); 71 patients and 71 HC completed the State-Trait Anxiety Inventory (STAI-S, (Spielberger et al., 1983)). Two patients of centre II were not rated using CGI. Demographic and clinical data are presented in Table 2. There was no significant difference in age between centres and groups. One-way ANOVA on symptom severity (CGI scores) identified a significant effect of factor centre. Post-hoc Tukey's honestly significant difference (Tukey's HSD) test showed that the differences in symptom severity between centre I and IV (p = 0.02), between centre II and IV (p = 0.001) and between centre III and IV (p = 0.011) were statistically significant, meaning centre IV had more severe cases than the three other centres.
FND symptom type was similar between centre I to III with a majority of abnormal movement (F44.4) diagnosis (see Table 2 for details) as well as functional seizures (F44.5) or mixed (F44.7) whereas centre IV had exclusively abnormal movements (F44.4) cases.

Framewise displacement
FD measures showed a significant main effect of centre (F(3,164) = 5.5210, p = 0.001). Post-hoc multiple comparison of means showed that the difference between centre I and centre III (p < 0.0001) and centre IV (p = 0.0006), as well as between centre II and centre III (p = 0.0002) and IV (p = 0.008) were statistically significant (Supplementary Material, Figure S1), meaning that centres III and IV had more motion artefacts as compared to centre I and II.

Replication and robustness of classification approach
(1) Replication: Applying the method from (Wegrzyk et al., 2018) [...] 66.7-70.8), and their diagnostic abilities, as indicated by the AUC, were moderate to good in all three centres (see Table 3 for details).
(3) Pooled cross-validation: When data from the four centres were pooled, a significant classification accuracy of 71.5% (sensitivity: 67.4%, specificity: 75.6%, AUC: 0.79, p = 0.003, see Table 3 for details) was found. The list of most discriminative features with their SVM weights, the confusion matrix, and the receiver operating characteristic (ROC) curve of the pooled cross-validation are presented in Fig. 2.
A visual representation of accuracy, sensitivity, and specificity across all centres, and the ROC curves of the intra-centre and inter-centre cross-validation can be found in Supplementary Material, Figure S2/S3.

Most discriminative connections
In the pooled cross-validation, regions such as the hippocampus, the bilateral angular gyrus, the cingulate cortex, bilateral frontal regions and the bilateral supramarginal gyrus were most frequently found within the most discriminative connections. When exploring the connectivity differences between patients and controls in the regions yielding the most discriminative connections, we identified increased connectivity in patients between:
(a) the hippocampus and temporal regions (e.g., right superior temporal gyrus and middle temporal pole), the cingulate cortex, and the bilateral precuneus
(b) the bilateral angular gyrus and sensorimotor regions (e.g., postcentral gyrus), the bilateral fusiform gyrus, and the left superior occipital gyrus
(c) the right cingulate cortex and right frontal regions (e.g., orbitofrontal gyrus) and the right thalamus
Similarly, we identified decreased connectivity in patients between:
(a) the right hippocampus and right frontal regions (e.g., inferior orbitofrontal gyrus), subcortical regions (e.g., bilateral parahippocampal gyrus and bilateral amygdala) and subcortical structures (left putamen)
(b) the anterior cingulate cortex and the right caudate
(c) the right and left amygdala
(d) the left supramarginal gyrus and frontal regions (e.g., orbitofrontal and middle frontal gyrus)
For visualization purposes, regions yielding the most discriminative connections for the pooled cross-validation are presented in Fig. 3 (the corresponding figure for each single centre can be found in Supplementary Material, Figure S5). A figure displaying hyper- and hypo-connectivity between the regions yielding the most discriminative connections can be found in Supplementary Material, Figure S4. Data were visualized using BrainNet Viewer (Xia et al., 2013). Mean functional connectivity in controls and patients between pairs of regions showing most discriminative functional connectivity of the pooled cross-validation can be found in the Supplementary Material, Table S1.

Logistic regression of anxiety, depression, medication, and clinical scores
Whether a subject was classified correctly or not (yes/no) could not be predicted by anxiety, depression, medication and clinical scores, neither in the intra-centre nor in the pooled cross-validation setting (Supplementary Material, Table S2). These potential confounding factors thus did not drive the classification performances.

Generalizability to multi-centre data
(1) Inter-centre cross-validation: When data from each single centre were used once to test the classifier and data from the remaining three centres were used to train the classifier, we found classification accuracies ranging from 37.5 to 50% (sensitivity: 37.5-56.5%, specificity: 33.3-54.2%), which did not exceed chance level. Correspondingly, the AUC was below chance (see Table 4 for details).
(2) Centre normalization of functional connectivity data: After normalization (see section 2.3), n-way ANOVA on the different connections with factors group and centre, corrected for multiple comparisons using false discovery rate (FDR), showed only a significant effect of factor group in 287 connections. Neither a centre effect nor an interaction effect was found. After normalization, functional connectivity thus only differed between groups (FND and HC), but no centre effect remained.
(3) Adapting the inter-centre cross-validation: By gradually transferring two subjects (1 HC and 1 FND) from the test set to the training set, we observed an improvement of the overall classification performance to the level of the intra-centre and pooled cross-validation. However, after the transfer of approximately 16-20 subjects, the model started overfitting the results. The different learning curves of accuracy, sensitivity, and specificity of the four centres are presented in the Supplementary Material, Figure S4.

Fig. 3. Regions yielding the most discriminative connections of the pooled classification based on the AAL atlas. Size of the nodes corresponds to nodal degree, i.e., occurrence within the most discriminative connections. Colour of the nodes corresponds to different lobes of the AAL. Thickness of edges corresponds to SVM weights. Thicker edges therefore indicate higher SVM weights, i.e., higher discrimination power. The mean functional connectivity values corresponding to this figure can be found in Supplementary Material, Table S1. The figures corresponding to each single centre can be found in Supplementary Material, Figure S5.

Table 4
Classification performance of the inter-centre cross-validation step on the four different centres.

Classification
In line with our first aim, these results show that classification of RS fMRI brain images with a machine learning algorithm (Wegrzyk et al., 2018) could be successfully replicated in three separate samples stemming from different recruiting centres. This means that, overall, this method can successfully distinguish FND patients from healthy controls with accuracies at or above 70% (centre I: 73.9%/II: 72.9%/III: 70.8%/IV: 70.0%). Importantly, these results confirm that the method provides an accurate and robust classification of FND patients and healthy controls across different MRI scanners (the four centres had different manufacturers and acquisition parameters) when the models are trained at each site. It also shows robustness against clinical heterogeneity, because the FND populations of the four centres were not identical in terms of symptom type and severity. Namely, centre IV included only functional movement disorders (F44.4), whereas centres I to III included mixed (F44.7) cohorts. Patients included in centre IV rated their symptoms as more severe compared to the FND patients included in the other centres.
To strengthen this first validation step, we examined whether the classification approach is also robust when merging the data from all four centres. Therefore, we ran the exact same analysis in a second validation step by pooling all the data together, which yielded a similarly high classification accuracy of 71.5%. Similar results have been found among diverse neurological and psychiatric conditions (for review: Nielsen et al., 2020; Orrù et al., 2012). This strongly suggests that machine learning is an appropriate and robust tool to detect differences in functional connectivity between FND patients and HC. Furthermore, despite the clinical heterogeneity and potential inter-centre confounding factors (e.g., inter-scanner variability), the classifier yielded high classification accuracies. Using a post-hoc logistic regression analysis, we could additionally show that neither anxiety, depression, psychotropic medication intake, nor clinical scores had an impact on classification performance. These results indicate that our model probably discriminated between patients and controls based on features specific to the underlying FND pathology (i.e., aberrant functional connectivity) and not the clinical comorbidities, nor the symptom severity of FND patients. The underlying changes in functional connectivity, independent of symptom type and severity, might represent an FND-specific trait, rather than a state. To further verify what these FND-specific traits are, however, it is of utmost importance to compare the classification performance against other patient groups with similar symptoms but different diagnoses (e.g., other neurological disorders and/or psychiatric controls). Moreover, it must be considered that other predisposing factors might potentially drive the classification performance. Namely, the aetiology of FND is multifactorial.
For instance, genetic risk factors or preceding traumatic life events are thought to affect the pathophysiological mechanisms of FND (Hallett et al., 2022). In particular, traumatic life experiences and childhood adversities are known risk factors with average odds ratios between 2 and 4 (Ludwig et al., 2018). Moreover, functional and structural alterations have been detected in FND patients in the context of trauma exposure, particularly in regions pointed out by the pooled analysis such as the cingulate cortex, insula, and the hippocampus (Aybek et al., 2014; Diez et al., 2020; Maurer et al., 2016; Perez et al., 2017). To the best of our knowledge, this is the first study using multi-centre data of FND patients including different symptom types and symptom severity for a multivariate classification approach. Moreover, machine learning algorithms seem to be robust against different symptom types and severity scores, as reflected in our results.
In line with our second aim, we evaluated the generalizability of this classification approach by examining whether data from a naïve centre can be correctly classified when applying a model that has been trained on data from the three other centres. Even though we normalized with respect to centre, this third validation step showed that individual classification accuracies did not exceed chance level. Compared to the pooled validation, this step introduced the scanner bias of the left-out centre only during testing, whereas in the pooled cross-validation setting the scanner bias was already included in the training set. This suggests that the variance introduced by inter-scanner variability is too high to be overcome using inter-centre cross-validation and might be substantially different from the variance introduced by other confounding factors such as comorbidities or symptom severity. With our post-hoc adaptation of the inter-centre cross-validation setting, in which we gradually transferred subjects from the test set to the training set in order to introduce centre-specific scanner bias already during training, we observed a gradual increase in overall classification performance. This observation strengthens our assumption that inter-scanner variability plays a critical role and cannot be overcome in our inter-centre cross-validation setting. Indeed, inter-scanner variability is a well-known bias in multi-centre RS fMRI data (Noble et al., 2017; Zhao et al., 2018) that has yet to be overcome. Specifically for multi-centre fMRI graph data, not only functional but also structural imaging data have been shown to influence graph representation, as fMRI data are parcellated according to the structural MRI data (Castrillon et al., 2015). Regressing out the site did not substantially aid the classification either (Castrillon et al., 2015).
Alternatively, our sample size might be too small to capture sufficient variation within each site (whether subject-driven or related to technical factors) to generalize to completely unseen sites. Another study on multi-site resting-state connectivity classification for autism spectrum disorder showed that, given sufficient subjects in the training set (between 280 and 500 depending on inclusion criteria), inter-site performance could reach intra-site performance, but that this was not the case at smaller sample sizes (Abraham et al., 2017). The assumption that our sample size may be too small is supported by the fact that no significant centre effect remained after normalizing the data.
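As an illustration of what normalizing with respect to centre can look like, the following minimal sketch z-scores a single connectivity feature separately within each centre. The normalization actually applied in this study is not restated in this section, so the z-scoring scheme, the function name `zscore_within_centre`, and the toy values are assumptions made for illustration only.

```python
from statistics import mean, pstdev


def zscore_within_centre(values, centres):
    """Z-score one feature separately within each centre (assumed scheme).

    values  : one feature value per subject
    centres : the centre label of each subject, in the same order
    """
    normalized = [0.0] * len(values)
    for centre in set(centres):
        idx = [i for i, c in enumerate(centres) if c == centre]
        centre_values = [values[i] for i in idx]
        mu = mean(centre_values)
        sd = pstdev(centre_values)  # population standard deviation
        for i in idx:
            # A constant feature within a centre carries no information.
            normalized[i] = (values[i] - mu) / sd if sd > 0 else 0.0
    return normalized


# Toy example: two hypothetical centres with a clear offset between them.
feature = [1.0, 3.0, 10.0, 14.0]
centre_labels = ["I", "I", "II", "II"]
print(zscore_within_centre(feature, centre_labels))
```

After this step, each centre has zero mean and unit variance for the feature, so a mean offset between centres is removed. Differences in how scanners affect the relationship between features would not be removed by such a per-feature rescaling.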
In summary, a multi-centre scenario increases the sample size (i.e., in our second validation step) and consequently the heterogeneity of the sample, which might benefit the classification performance. On the other hand, it introduces systematic inter-scanner variability ("site bias"), which is unrelated to the underlying disorder of interest and thus might reduce the discriminative power (Abdulkadir et al., 2011). Consequently, there are only a few studies investigating the applicability of multi-centre classification based on RS FC. In line with our findings, equivalently good classification performances were achieved in pooled multi-centre classification settings using an SVM classifier based on RS FC, e.g., for autism spectrum disorder (N = 240 subjects, accuracy = 79%; Chen et al., 2016), for mild cognitive impairment (N = 367 subjects, accuracy = 72%; Teipel et al., 2017), as well as for major depressive disorder (N = 358 subjects, accuracy = 73%; Nakano et al., 2020). The latter study also investigated the robustness of the classification against site bias using a leave-one-site-out cross-validation (LOSO-CV; equivalent to our inter-centre cross-validation). Comparable to our results, their LOSO-CV did not succeed in classifying major depressive disorder in a fully unknown dataset.
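The difference between the pooled and the inter-centre (LOSO-CV) setting lies only in how subjects are split into training and test sets. The following minimal sketch generates leave-one-site-out splits and mimics the post-hoc adaptation in which held-out subjects are gradually moved into the training set. The function names and the toy site labels are hypothetical, and which subjects were transferred first in the actual analysis is not specified here; the sketch simply takes the first ones.

```python
from collections import defaultdict


def leave_one_site_out(sites):
    """Yield (held_out_site, train_idx, test_idx), one split per site."""
    by_site = defaultdict(list)
    for idx, site in enumerate(sites):
        by_site[site].append(idx)
    for held_out in sorted(by_site):
        test_idx = list(by_site[held_out])
        train_idx = [i for site in sorted(by_site) if site != held_out
                     for i in by_site[site]]
        yield held_out, train_idx, test_idx


def transfer_to_training(train_idx, test_idx, n_transfer):
    """Move the first n_transfer held-out subjects into the training set.

    Assumption: the order of transfer is arbitrary in this sketch.
    """
    n_transfer = max(0, min(n_transfer, len(test_idx)))
    return train_idx + test_idx[:n_transfer], test_idx[n_transfer:]


# Toy example: six subjects scanned at four hypothetical centres.
sites = ["I", "I", "II", "III", "III", "IV"]
for held_out, train_idx, test_idx in leave_one_site_out(sites):
    # In strict LOSO-CV no subject of the held-out centre is used for training.
    assert all(sites[i] != held_out for i in train_idx)
    print(held_out, train_idx, test_idx)
```

In the strict setting, the classifier never sees data from the held-out scanner. Each call to `transfer_to_training` with a larger `n_transfer` exposes the model to more of that scanner's characteristics during training.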
Inter-scanner variability clearly limited the classification performance and generalizability when data from a specific scanner were used only for testing and not for training. Combining data from different modalities has been found to be one way to overcome the limitations of multi-centre RS fMRI (Zhuang et al., 2019). For instance, high classification accuracies were achieved in pooled as well as LOSO-CV settings when combining T1-weighted (structural/anatomical) images with RS functional connectivity from patients with frontotemporal dementia and healthy controls (Donnelly-Kehoe et al., 2019). Accordingly, the successful classification of functional seizures based on structural imaging data (Vasta et al., 2018) suggests employing multi-modal data of FND patients in future classification approaches when working towards a clinical application. Furthermore, previous studies attempted to identify and characterize inter-scanner variability and how it influences fMRI data (Dansereau et al., 2017; Friedman et al., 2006). Accordingly, classification was found to be improved by site harmonization methods (Nakano et al., 2020; Yamashita et al., 2019; Yu et al., 2018). Site harmonization approaches, however, still face methodological challenges: recent studies raised concerns that site harmonization methods might interfere with analytical methods (Chen et al., 2022), depend on the choice of atlas (Yu et al., 2018), or can be substantially affected by the fMRI acquisition parameters used (Mori et al., 2018; Yamashita et al., 2019). Apart from site harmonization approaches, promising results have also been found when applying unsupervised machine learning algorithms such as deep learning. Although computationally more complex, they appeared to be robust against site differences (Dewey et al., 2019; Zeng et al., 2018). Lastly, feature selection could be implemented to reduce the high dimensionality of our feature vectors (Guyon et al., 2003).
However, the aim of this project was to examine the generalizability of the previously applied method on different movement disorders/FND centres, rather than developing the best possible machine learning approach suitable for a multi-centre setting. Nevertheless, this could be the goal of future additional validation studies.

Connectivity patterns
By visualizing the most discriminative weights of individual connections, we could evaluate their individual contribution to the overall classification. Our study identified as most discriminative those regions that are commonly reported in the literature, such as the cingulate cortex (Baek et al., 2017; Blakemore et al., 2016; Marapin et al., 2020), right temporal regions (i.e., the temporoparietal junction, TPJ) (Aybek et al., 2014; Espay et al., 2018b; Maurer et al., 2016), the amygdala (Morris et al., 2017; Voon et al., 2011), the insula (Espay et al., 2018b; Stone et al., 2007; Voon et al., 2011), the inferior frontal gyrus (IFG; Espay et al., 2018b), or the dorsolateral prefrontal cortex (dlPFC; Aybek et al., 2014; Voon et al., 2016). However, feature weights need to be interpreted with caution, as a machine learning algorithm values the utility of a feature for classification rather than its clinical relevance (Nielsen et al., 2020; Nunes et al., 2020). Therefore, one should not directly infer the underlying mechanisms of a disorder from the weights, but rather examine the weights for their potential pathophysiological validity. In this respect, our results provided connectivity patterns that are particularly interesting to interpret further: connections 1) from the angular and supramarginal gyri to sensorimotor regions and 2) from the cingulate and insular cortex to hippocampal regions. The angular and supramarginal gyri are located within or bordering the TPJ, a key structure in FND. Abnormal interaction between the TPJ and sensorimotor regions has been repeatedly found in FND patients and is thought to be associated with their impaired sensory prediction signal (i.e., the sense of agency) (Perez et al., 2012; Voon et al., 2010).
Similarly, RS fMRI studies in FND identified decreased connectivity from the TPJ to sensorimotor regions (Maurer et al., 2016), to the precuneus (Mueller et al., 2022), and between the TPJ, motor regions, cingulate cortex, and insula (Diez et al., 2019), as well as decreased connectivity between the right inferior parietal cortex and the dlPFC and anterior cingulate cortex, supporting the theory of impaired sensorimotor integration and impaired sense of agency. On the other hand, the cingulate and insular cortex and hippocampal regions belong, amongst others, to the limbic network and are considered to be part of the emotion-cognition integrative system (Pessoa, 2008). Altered connectivity in limbic regions in FND has been associated with abnormal frontal lobe emotional control and emotion-motion interactions (Aybek et al., 2014; Monsa et al., 2018). In particular, aberrant hippocampus activity was found in response to aversive stimuli in task-based fMRI using emotional stimuli (Aybek et al., 2014; Blakemore et al., 2016; Szaflarski et al., 2018). Moreover, increased FC was found between the cingulate cortex, precuneus, and the ventromedial prefrontal cortex during a motor task (Cojan et al., 2009). Similarly, RS fMRI studies on FND identified increased connectivity from parahippocampal structures to the right superior temporal gyrus (Longarzo et al., 2020) and to the middle and inferior temporal gyrus (Szaflarski et al., 2018), increased connectivity between the hippocampus and default mode network (DMN) related regions (Monsa et al., 2018), as well as increased FC from the amygdala to the dlPFC. Alterations in RS FC in these regions thus support previous findings from task-based fMRI indicating impaired emotion regulation in FND (Aybek et al., 2014; Espay et al., 2018b).
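To make the step from classifier weights to named connections concrete, the sketch below ranks the weights of a linear classifier by absolute value and maps each one back to its pair of regions. It assumes that each feature is one connection from the upper triangle of a symmetric region-by-region connectivity matrix, enumerated row by row; this ordering, the function name `top_connections`, and the toy weights are illustrative assumptions and not taken from the analysis pipeline.

```python
from itertools import combinations


def top_connections(weights, region_names, k=5):
    """Return the k connections with the largest absolute weight.

    Assumes weights[i] belongs to the i-th region pair of the upper
    triangle of the connectivity matrix, enumerated row by row.
    """
    pairs = list(combinations(range(len(region_names)), 2))
    if len(weights) != len(pairs):
        raise ValueError("expected one weight per unique region pair")
    ranked = sorted(range(len(weights)),
                    key=lambda i: abs(weights[i]), reverse=True)
    return [(region_names[pairs[i][0]], region_names[pairs[i][1]], weights[i])
            for i in ranked[:k]]


# An atlas with 90 regions gives 90 * 89 / 2 = 4005 unique connections.
assert len(list(combinations(range(90), 2))) == 4005

# Toy example with four hypothetical regions and made-up weights.
regions = ["A", "B", "C", "D"]
toy_weights = [0.1, -0.9, 0.2, 0.0, 0.5, -0.3]
for region_a, region_b, weight in top_connections(toy_weights, regions, k=2):
    print(region_a, region_b, weight)
```

The ranking reflects how useful a connection was for separating the groups in the training data. As noted above, a large weight is not in itself evidence of clinical relevance.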

Towards a clinical application
Excellent sensitivity and specificity (between 80 and 100%) have been found for bedside clinical signs (Daum et al., 2015; Espay et al., 2018a; Syed et al., 2011). However, these maneuvers may still face several limitations, including the lack of a gold standard against which to compare them, unblinded assessments in most studies, and other methodological issues such as a poor description of how the diagnosis of FND was made. Additional diagnostic procedures might therefore support the clinical diagnostic process. With regard to a multivariate classification approach applied within a clinical setting, an accuracy of 70% might not represent a final solution. Moreover, classifying patients against healthy controls does not reflect the clinical need and limits the generalizability of these results to clinical application at this stage. For daily clinical routine, one should rather aim at distinguishing a functional symptom from identical or similar neurological and psychiatric symptoms, and not from a healthy control. Such a machine learning approach could, for example, assist the screening of patients with ambiguous neurological symptoms in the emergency department, or provide additional information in difficult cases. Therefore, rather than replacing a clinical diagnosis, it might provide additional diagnostic support in the form of additional rule-in tests. A patient with a functional disorder could more easily be identified as such, in addition to the bedside clinical signs, and could be directly referred to a specialist before undergoing multiple medical tests and examinations (Espay et al., 2009).
In addition, the medico-legal context highlights the importance of identifying an adjunctive positive biomarker to help distinguish FND from intentionally produced neurological symptoms, as observed in malingering or factitious disorders, in which patients fabricate, feign, or lie about their symptoms (Colombari et al., 2021). Therefore, to test the power against differential diagnoses, it is of utmost importance, as a next step, to classify FND patients against similar psychiatric patients, trauma patients, or neurological patients with the same or similar symptoms (e.g., dystonia, essential tremor, Parkinson's disease, or multiple sclerosis). In summary, machine learning algorithms could thus further support differential diagnoses and optimize treatment, prevention, and patient management. However, diagnostic utility is only provided if these results can be replicated in other patients with the same or similar symptoms but different diagnoses.

Limitations and future directions
This study has several limitations. Even though data from four different centres were used, the sample size is small compared to other multi-centre classification studies using multi-centre databases, such as the Alzheimer's Disease Neuroimaging Initiative (ADNI) (Jack et al., 2008) or the Autism Brain Imaging Data Exchange (ABIDE) project (Di Martino et al., 2014). To date, a large, multi-centre database sharing imaging data of FND patients unfortunately does not exist. Small sample sizes have been associated with higher reported accuracies when overfitting is not properly controlled for (Vabalas et al., 2019). We guarded against overfitting by perfectly matching our groups within and between centres and by applying a leave-one-out cross-validation approach, which is a powerful tool against overfitting and recommended in small samples (Vabalas et al., 2019). Accordingly, our results of the intra-centre and pooled cross-validation are comparatively high, with significant accuracies and highly balanced sensitivities and specificities. Nevertheless, a multi-centre database would bring the advantage of adjusting scanner protocols for each centre and scanner type and would thus provide comparably high data quality and low inter-scanner variability. To this end, multi-centre imaging studies must be planned carefully with regard to scanner hardware and software, implementation of an appropriate quality assurance program to properly validate and monitor data, and application of proper site standardization methods (for recommendations see Glover et al., 2012).
A second limitation is the use of only one atlas with 90 cortical and subcortical regions. As the purpose of this project was to validate the previously published method across different centres, no changes were made to the pre-processing pipeline. Despite involving a higher computational load, a more fine-grained parcellation (e.g., the Glasser atlas; Glasser et al., 2016) or a voxel-wise approach could detect different information (Eickhoff et al., 2018) and may aid the future development of an adjunctive imaging-based biomarker. On the other hand, an approach with a higher spatial resolution also bears the risk of overfitting or of missing important information due to the comparably high number of probably uninformative features (Erickson et al., 2017).
A third limitation is that centres III and IV were found to have higher head motion than centres I and II, which might negatively affect functional connectivity (Van Dijk et al., 2012). The significant results obtained in the intra-centre and pooled validation, however, indicate that even patients known to move a lot (centres III and IV had more motor subtypes of FND, F44.4) can be correctly classified. In future studies, subjects should be strictly advised to lie still, and their head should be fixed using foam cushions. Ideally, prospective motion correction techniques including motion-tracking cameras or a pilot tone approach (Ludwig et al., 2021) could be used to further improve data quality in this respect.
A last limitation is that clinical data were not uniformly collected and relied on different scales (CGI, S-FMDRS), which meant that scales needed to be adjusted. Including symptom severity in our post-hoc logistic regression analysis is therefore not optimal, as the transformation we applied from S-FMDRS to CGI is intuitive but not validated. Similarly, as anxiety and/or depression scores were collected using different questionnaires (STAI, BDI, or BAI), the regression analysis showing no influence of mood on the classification performance should be interpreted with caution until future studies confirm this with prospectively collected uniform clinical data. Together with the uneven distribution of symptom types, this means that we cannot reliably account for these clinical variables. From a technical point of view, a future project should aim at balancing the different symptom types, so that a data-driven machine learning approach would also learn to recognize those patients who are normally underrepresented in a clinical setting. To overcome the problem of different symptom type distributions, patients could also be stratified according to their symptom types, and/or clinical data (e.g., CGI) could be included in the model (Patel et al., 2015). To achieve this in a multi-centre setting, the same clinical data and psychiatric comorbidities would need to be collected using the same clinical scores and identical questionnaires in each centre. Additionally, data on traumatic life events or childhood adversities should be collected in order to assess their potential influence on functional brain aberrancies.

Conclusion
In summary, multi-centre RS FC has shown its potential to distinguish FND patients from HC. These findings set the ground for future research on adjunctive biomarkers for FND, as the method will need to be improved in its generalizability with respect to inter-scanner variability and the heterogeneity of symptoms, comorbidities, and symptom severity. To provide diagnostic utility, future studies must investigate the classification power when classifying FND patients against classical neurological diseases and/or psychiatric disorders, as this would more closely represent the daily clinical routine and could be used as a decision support method for the clinical diagnosis. Importantly, the aim is not to replace the clinical diagnosis but to provide additional rule-in criteria for it.

Funding/support
This work was supported by the Swiss National Science Foundation (SNF Grant PP00P3_176985 for SA) and the Leenaards Nested Project grant (Grant 3642 for SA). TS was supported by the Czech Ministry of Health Projects AZV NU20-04-0332 and AZV NV19-04-00233, and by the General University Hospital in Prague MH CZ-DRO-VFN64165. MAJT is funded by ZonMWTOP, European Fund for Regional Development, Provincie Fryslân, Stichting Wetenschapsfonds Dystonie Vereniging, and an unrestricted educational grant from Merz.

Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.