Robust intra-individual estimation of structural connectivity by Principal Component Analysis

Fiber tractography based on diﬀusion-weighted MRI provides a non-invasive characterization of the structural connectivity of the human brain at the macroscopic level. Quantiﬁcation of structural connectivity strength is challenging and mainly reduced to “streamline counting ” methods. These are however highly dependent on the topology of the connectome and the particular speciﬁcations for seeding and ﬁltering, which limits their intra-subject reproducibility across repeated measurements and, in consequence, also conﬁnes their validity. Here we propose a novel method for increasing the intra-subject reproducibility of quantitative estimates of structural connectivity strength. To this end, the connectome is described by a large matrix in positional-orientational space and reduced by Principal Component Analysis to obtain the main connectivity “modes ”. It was found that the proposed method is quite robust to structural variability of the data.


Introduction
Diffusion magnetic resonance imaging (dMRI) is a non-invasive method for investigating the microstructure of the human brain in vivo ( Basser et al., 1994 ). Based on dMRI, fiber tracking algorithms allow for the macroscopic characterization of the structural connectivity between different brain regions ( Basser et al., 2000 ).
A variety of different fiber tracking approaches exist to infer structural connectivity such as deterministic ( Mori et al., 1999;Tournier et al., 2010a ), probabilistic ( Behrens et al., 2007;Berman et al., 2008;Parker et al., 2003 ), and global algorithms ( Kreher et al., 2008;Mangin et al., 2013;Reisert et al., 2011 ). Tracts in terms of fiber bundles reflecting the connectivity between specific brain regions are commonly determined by dividing the gray matter (GM) into regions of interest (ROIs) that are used as seeds or a priori constraints for the selection of connecting streamlines. Architectonic and template-based atlases provide the most simple way to obtain GM parcellations ( Caspers et al., 2013 ) that can be registered with individual subjects to ensure that ROIs are matched with respect to data size, geometry and location. Generalizations to parcellation-based ideas represent the cortex continuously on the two-sphere . However, the anatomical variability of cortical structures, undefined positions of fiber terminals and the jor tracts (with a high number of streamlines) can be reliably estimated ( Schumacher et al., 2018 ).
In this work we propose a novel method that overcomes several limitations of the common ROI-based "streamline counting " procedures approaches described above. We will see that it can substantially increase the intra-subject reproducibility of quantitative estimates of structural connectivity strength. The proposed idea shares similarities with clustering approaches, which are applied in the context of automatic tract segmentation ( Siless et al., 2011;Brun et al., 2004;O'Donnell et al., 2006;O'Donnell and Westin, 2007;Zhang et al., 2018;Zhang and Laidlaw, 2005 ). Fiber clustering links streamlines to tracts (or fiber bundles) based on similar trajectories and anatomical positions but without any reference to a priori specified brain regions. These methods focus mostly on anatomy of fiber tracts, but not on built connections. Here we concentrate on ROI-based methods, since they allow to investigate connectivity between cortical regions and also enable graphical analysis of the brain networks. In our approach the connectome is described by a huge density matrix ( ∼ 10 7 × 10 4 ) in position-orientation space, which is the basis for the commonly ROI-based approaches. We reduce it by Principal Component Analysis (PCA) to obtain the main "modes " of connectivity 1 . The intuition behind this idea is that the found major singular vectors can be interpreted as the superposition of the Tract Orientation Distributions (TODs) ( Dhollander et al., 2014 ), of the major tracts, no matter how the configuration of underlying fiber reconstruction looks like. In the context of fiber clustering the similarity may also interpreted as similarities between vectors representing cluster memberships. The approach of embedding fibers in an 'ambient' space, which is part of the proposed idea, was also presented in Durrleman et al. (2011) , where the embedding is used for the metrification of the space of fiber bundles. Also in Reisert et al. (2013) an embedding in a TOD-like space is used to derive quantitative estimates for fiber densities.
In the remainder we will specify the details of the proposed method and will demonstrate its performance on simulated phantom data and in vivo data of 26 subjects who underwent two repeated dMRI measurements. Results reveal that our PCA approach is more robust to breaks within continuous streamlines than commonly applied "streamline counting ". Analyses of intra-class correlation coefficients (ICCs) further show a stronger robustness of the proposed method to the structural variability of the data in terms of a substantially increased intra-subject reproducibility of estimates of structural connectivity strength across separate dMRI measurements.

Fiber density matrix and fiber features
Suppose a given set of streamlines coming from some whole brain tractography algorithm, then its corresponding fiber density matrix is defined as where is indicator function, ∈ is a streamline and is a voxel visited by , is the direction (tangent) of the fiber in this voxel. The obtained density matrix represents the full information of the set of streamlines , i.e. under rather mild assumptions (the tangent of is continuous) we are able to reconstruct all streamlines from . Note that defines a similarity metric in fiber space (here denotes the transpose matrix). Correspondingly, defines a similarity metric in position/orientation space. We will see below that our approach relies on a spectral decomposition of . Note the similarity to spectral clustering (see e.g. O'Donnell et al., 2006;Zhang et al., 2018 ), where the decomposition of the matrix is used for the 1 The term "modes " is in the context of eigen analysis associated with the different eigenvectors, or also "eigenmodes ". clustering of fibers into bundles. Indeed, the spectrum of and are identical, so the approaches are deeply interlinked, but with completely different purpose. The matrix is indeed the common similarity metric, when streamline counting approaches are used. Our approach is based on the idea that the major singular values of the matrix induce a more stable connectivity measure than the ordinary streamline counting idea.
In general, to obtain the "Feature Connectivity Matrix " ( ), we find reduced density matrix using right-singular vectors of density matrix and calculate the correlation between and for each pair of ROIs and ( Fig. 1 ). Now, let and be two ROIs and ( ) the indicator function, which is one, if the voxel is contained in , otherwise zero. With this, and the definition of we approximate the "ordinary " connectivity . First, we rewrite the connectivity as an inner-product with respect to a certain kernel matrix as follows: The kernel matrix is just the product = and defines the induced similarity metric. Now, we approximate this similarity by is the matrix obtained when restricting to the largest eigenvectors of . We call the new connectivity measure "Feature Connectivity Matrix ", as we will see later, it can be computed as an inner product in a certain features space. To evaluate this, we write where is the projector onto the subspace spanned by the largest eigenvectors of . If the singular value decomposition of is where and are the left-singular and right-singular vectors of matrix . Then the projector is With this we can compute by first computing where the outer sum runs over all voxels in ROI , such that , constitutes a ROI specific descriptor of the connectivity relationships. Note that the descriptor still depends on the directionality of the fibers within the ROI. Now, the inner-product of the descriptors give the connectivity So, can be computed by a simple inner-product in a certain lowdimensional feature space. The intuition behind the idea, that 's lead to more robust connectivity, is based on the observation that the major singular vectors , ( , ) , when interpreted as tract orientation distributions (TOD), are superpositions of the TODs of the major tracts, no matter how the actual topology of the underlying tractography is  ( Fig. 2 ). This is actually similar to spectral clustering, where the same observation can be made, and is used to cluster streamlines.
Note that in the above derivations the directional index could have already been traced out at a quite early stage. Even the matrix could have been defined by , = ( ℎ ℎ ) completely disregarding the orientation. Disregarding would actually lead to a much smaller problem, and hence, smaller running times. However, we found that there are higher order similarities, which prohibit this, and which are beneficial as we will see later. That is, one can easily generalize the introduced connectivity measure by We found in experiments, that, indeed coefficients > 1 can be beneficial in terms of robustness and quality of reproducibility, however an important property is also lost: if A, B and C are disjunct ROIs, one expects to have: ∪ , = , + , . It holds for = 1 , but not for > 1 . Nevertheless, we used power coefficient = 2 in all experiments with simulated data and = 1 for in in vivo data. Higher values of allow the reduction of "background noise " in obtained connectivity matrices. Since the connectivity matrices for simulated data are quite sparse, higher values of can improve obtained results.

Normalization
Since values in change between 0 and 1, and c -values in indicate the number of fibers, we normalized both and in some experiments by the row/columns sums, i.e.
where -is one of connectivity matrices.

Implementation
To compute the directional index has to be discretized by . We found, that 32 or 64 directions were enough to obtain adequate results. To compute the projection operators , we computed the ( , ) , ′ by using MATLAB's eigs function. The function eigs requires the application of the matrix , which involves the repeated application of the matrix , normalized over the fiber densities along the position/orientation direction and normalized over the fiber length along the fiber direction.

Simulated data
Phantom data was simulated using Phantomas software ( Caruyer et al., 2014 ) using known ground-truth geometry of testing set from the 2nd reconstruction HARDI challenge ( Fig. 3 a). The phantom consists from 20 fiber bundles and includes 40 seed regions. Images were corrupted by Rician noise with SNR 10, 30, 50. Voxel size 1 mm 3 isotropic; 74 gradient directions with b -factor = 2000 s/mm 2 were used to simulate the data.

In vivo data
To investigate intra-subject reliability 26 healthy subjects were scanned twice on a Siemens TIM TRIO, 6/8 partial Fourier, TR = 10,900 ms, TE = 107 ms; b -factor = 1000 s/mm 2 with 61 gradient direction and an isotropic resolution 1 mm 3 . The data was reconstructed with adaptive combine ( Walsh et al., 2000 ) such that the noise distribution is close to Rician. Additionally, T1-weighted images were segmented using SPM into white matter, gray matter and CSF. A threshold of 0.5 was used to estimate the area of reconstruction for white matter.  (10) . It can be seen, that results differ significantly for global (GT, mfGT) and streamline algorithms (iFOD), since there were more false connections in iFOD2 reconstructions ( Fig. 4 ).

Parameters evaluation
First, we investigated the influence of the number of Principal Components and the power coefficient (formula (10) ) on the obtained connectivity matrices using simulated and in vivo data. For simulated data we compared obtained connectivity matrices with ground truth connectivity matrix using chi-squared similarity measure ( Fig. 3 b): 1 and 2 are the connection values from compared matrices, we set = 0 . 1 in our experiments. It should be noted, that matrices should be normalized before comparison.
Obtained similarity values for different and are shown in Fig. 3 . We found lower similarity values for streamline algorithm iFOD2. It can be explained by higher number of false fibers in results and also "curly " fibers, since the proposed method accounts for directional distribution of fiber segments. Figs. 3 and 4 demonstrate the influence of the described parameters on obtained connectivity matrices. It can be seen, that the power coefficient just reduces the "noise " in background: low c -values get even lower and high c -values just remain high. The increase of emphasizes trajectories that are built of coherent fibers, whereas connections built by less homogeneous bundles get weaker. Fig. 4 c shows an example: fibers with different directions inside a bundle (connect ȡ ng 39th and 40th ROIs) get weaker with increasing of . We also investigated the global behavior of the obtained c -values for different and settings (Fig. 5) . The c -values were divided into three groups by their median values. It can be seen, that in all settings high c -values remain high, but low c -values get lower with the increasing of . With increasing of power coefficient groups of c -values get more separable.
The spherical-shaped phantom has a rather simplified structure in comparison to the real human brain. Thus, we also evaluated the influence of the above parameters on results obtained from in vivo data. We investigated the connectivity between 90 areas defined by the AAL atlas ( Tzourio-Mazoyer et al., 2002 ). The AAL atlas was registered to the subject's space, normalization and deformation were performed using SPM. The first 45 ROIs belong to the right hemisphere, another 45 ROIs belong to the left hemisphere.
For analysis we considered the connection density as the proportion of existing connections in the connectivity matrix relative to all possible connections that can be formed. For binarization we used a threshold of 0.05 for all obtained ; The connection density can be calculated as follows ( Connectivity, 2016 ): where -is the number of nonzero elements in connectivity matrix, -is the number of nodes. Fig. 6 (top row) shows the obtained connectivity density for different feature space dimensions and different values for the power coefficient . It can be seen, that the density decreases with increasing , since c -values are getting lower (see Fig. 7 ). The same was observed for simulated data.
The reproducibility of the connectivity estimates is also an important aspect. Therefore, the intraclass correlation coefficient (ICC) for each cvalue was computed for both and . If 1 and 2 are the c -values for scan 1 and scan 2 of a certain subject, then the ICC is calculated as follows: where <> denotes the expectation value over the whole group of subjects, and = < ( 1 + 2 )∕2 > . In order to exclude the influence of connectivity density on the obtained results, mean and median values were calculated only for nonzero ICC values. The obtained median ICC values (see Fig. 6 ) were more or less on the same level for all features space dimensions. Fig. 7 shows obtained average . We found, that with an increase of the feature space dimension a lower number of c -values remain high. Which seems quite reasonable, since with an increase of , the set of characteristics used to calculate the correlation between ROIs is also increased. Accordingly, an increase in leads to the identification of areas connected by the most homogeneous fiber bundles. The coefficient , in turn, only affects the degree of weighting of c -values. An increase in leads to an even greater decrease for low c -values. Thus, the patterns found in experiments with simulated data were also revealed for data in vivo .    By varying the features space dimension ( ) we can determine the amount of the specificity of the connectivity information obtained. A high number of components only recovers highly coherent bundles, while a lower number let survive also more chaotically organized tracts. In case of simulated data we found that there was no significant improvement for values higher than 50. However, due to the simple structure of the phantom, there are only 20 fiber bundles and 40 seed regions, it seems more reasonable to use larger for in vivo data. The power coefficient in formula 11 influences the "sharpness " of c -values distribution. Values higher than 1 allow reduce the number of low c -values, but they lead to a lost of the linearity in formula 11 . In the remainder of the work we will show experimental results for = 80 and = 2 for simulated data and = 1 for in in vivo data.

Simulated data results
We compared the chi-similarity metrics for both and (see Fig. 8 ). The obtained similarity values are shown in Table 1 . It can  be seen, that there was no significant difference in similarity values for both methods. But seem more robust to noise -whereas there was a decrease in similarity values with decrease of SNR for (up to 8% for iFOD2), similarity values for remain the same. Further, we investigated the robustness of the proposed method against artificial breakage of fibers to show the robustness of the proposed method against the topology of the underlying streamline configuration. All streamlines were cut at a fixed position relative to their length. Here we show results for fibers, broken at 0.7 of their length. It was found, that there was no dependency between obtained results and the position of the breaks. We also investigated the robustness of the method for fibers broken at random points. But, in this case, significant changes in fiber length distribution affected the obtained results. Fig. 9 shows end points maps overlaid with FA maps for mfGT results before (a) and after (b) breakage. It was found, that whereas there were no true connections on s for broken fibers, s remained the same. Example of and for broken mfGT results are shown in Fig. 9 c-d. Similarity values for broken FT results are shown in parenthesis in Table 1 .
The 2 -metric takes the absolute values of the connectivities into account. To get an agreement measure, which is more independent of the absolute values (the ground truth is actually a binary matrix), the connectivities can be thresholded to obtain also binary connectivity matrices. To compare with the ground truth we used the f1-score 2 . Fig. 10 shows the f1-score for (dashed line) and (solid line) with different thresholds. It can be seen, that scores differ significantly for broken FT results.

In vivo data
Similar to Reisert (2015) logarithmic scaling was used to obtain comparable contrast of connectivity matrices: ( , ) = ( + ( , )) , where t -is the 30% quantile of -values over all regions. Fig. 11 shows the average connectivity matrices in logarithmic scale. For connectivity matrices, shown on this figure we did not used any threshold. Fig. 12 (top row) shows individual ICC values for each c -value for s and s for mfGT. To investigate the robustness of connectivity matrices against artificial breaks of fibers, broken FT results were created for each FT algorithm -all fibers were broken at fixed position relative to their length. We obtained connectivity between different ROIs after that. We show here the results for fibers, broken at 0.7 of their length. Obtained ICC values for each individual c -value for broken mfGT results are shown in Fig. 12 (bottom row). ICC values reduced for both and in most cases, but ICCs are still higher for than . Table 2 shows mean and median ICCs for all FT algorithms, ICC values for broken FT results are shown in parenthesis. While ICCs for for original and broken FT results are close, there is a significant difference in ICCs for s.

Discussion
The proposed method performs topologically robust connectivity estimation using tract density features. The idea is to describe the connectome by a huge density matrix in position-orientation space. We used 2 If is the ground truth and is the predicted binary connectivity matrix, we computed the precision = ∑ ∕ ∑ , the recall = ∑ ∕ ∑ and finally the f1-score by 2 + .    11. In vivo data results. Average for GT, mfGT and iFOD2 (upper row) in logarithmic scale; Average for GT,mfGT and iFOD2 (lower row) in logarithmic scale. There are more "strong " connected, but also "weak " connected areas on s.  Principal Component Analysis to obtain the major components of this matrix, which correspond to the main underlying tract orientation distributions. First, we investigated the dependency of the results on the number of principal components and the value of the power coefficient in formula (11) for simulated and in vivo data. For simulated data we investigated the similarity between the estimated and ground-truth connectivity matrices. It was found, that similarity values increase in case of global FT algorithms up to = 50 and then remain stable (see Fig. 3 ). The situation was different for iFOD2, since there were a lot of false bundles in the reconstruction. The similarity was low for all investigated values of . It was found, that the power coefficient influences the weighting of the c -values: during increase of only connections built by fibers with similar trajectories remain high, connections built by less homogeneous bundles become weaker ( Fig. 4 ). This can be explained by the fact, that by increasing of we increase the number of features or "characteristics " that we use to calculate the correlation. Similar tendencies were observed for in vivo data.
Since the reproducibility of the connectivity results is an actual challenge today (e.g Buchanan et al., 2014;Pr čkovska et al., 2016 ) 26 subjects were scanned twice to evaluate robustness of the proposed method with the original "fiber counting " method. It was found, that the ICC val-ues were higher in most of the cases for (see Fig. 12 and Table 1 ). We also calculated the connectivity density of the obtained matrices depending on . We found that the connectivity density decreases with an increase of , but ICC values remain more or less on the same level for a = 1.
To show the robustness against topological changes we artifically broke fibers. It was found that although s were ruined after the breaking procedure, there were no changes in s. This was also confirmed by the similarity values indicated in parenthesis in Table 1 . There were no significant changes in mean and median ICC values for all FT algorithms (see Table 2 ). An improvement in ICC values for s for all FT algorithms in comparison with s was up to 20% (for mfGT results).
In general, a connectivity estimation method, not dependent on the topology of the connectome, can be potentially used to generate scannerindependent large-scale normative data for clinical applications. In this work we proposed a method to increase the robustness and intra-subject reproducibility of structural connectivity measures, which was demonstrated in experiments with simulated and in vivo data.

Data and code availability statement
The data and code used in the study will be available on request. The data and code sharing adopted by the authors comply with the requirements of the Medical Center, University of Freiburg, and comply with ethics of the Medical Center, University of Freiburg.