Evaluation of tractogram filtering methods using human-like connectome phantoms

Tractography algorithms are prone to reconstructing spurious connections. The set of streamlines generated with tractography can be post-processed to retain the streamlines that are most biologically plausible. Several microstructure-informed filtering algorithms are available for this purpose; however, the comparative performance of these methods has not been extensively evaluated. In this study, we aim to evaluate streamline filtering and post-processing algorithms using simulated connectome phantoms. We first establish a framework for generating connectome phantoms featuring brain-like white matter fiber architectures. We then use our phantoms to systematically evaluate the performance of a range of streamline filtering algorithms, including SIFT, COMMIT, and LiFE. We find that all filtering methods successfully improve connectome accuracy, although filter performance depends on the complexity of the underlying white matter fiber architecture. Filtering algorithms can markedly improve tractography accuracy for simple tubular fiber bundles (F-measure deterministic – unfiltered: 0.49 and best filter: 0.72; F-measure probabilistic – unfiltered: 0.37 and best filter: 0.81), but for more complex brain-like fiber architectures, the improvement is modest (F-measure deterministic – unfiltered: 0.53 and best filter: 0.54; F-measure probabilistic – unfiltered: 0.46 and best filter: 0.50). Overall, filtering algorithms have the potential to improve the accuracy of connectome mapping pipelines, particularly for weighted connectomes and pipelines using probabilistic tractography methods. Our results highlight the need for further advances in tractography and streamline filtering to improve the accuracy of connectome mapping.


Introduction
White-matter (WM) fiber bundles comprise myelinated axons and encompass approximately 50% of the human brain volume (Bullock et al., 2022). They can be reconstructed in vivo using diffusion-weighted magnetic resonance imaging (dMRI) and tractography (Basser et al., 2000; Mori et al., 1999). Recently, tractography has been used to map the human connectome (Hagmann et al., 2008, 2007; Sporns et al., 2005). Tractography enables quantification of the WM (structural) connectivity strength between each pair of brain regions of a connectome. Conventionally, connectivity strength is quantified with the tractography streamline count, otherwise known as the number of streamlines or fiber count. The streamline count for a fiber bundle is the number of reconstructed streamlines that interconnect the two ends of the bundle. However, the potential limitations and biases of the streamline count as a quantitative measure of structural connectivity are increasingly recognized (Calamante, 2019; Jones et al., 2013; Yeh et al., 2021; Zhang et al., 2022). To improve the estimation of structural connectivity strength and overcome some of these limitations and biases, streamline/tractogram post-processing algorithms were recently developed (Daducci et al., 2015; Pestilli et al., 2014; Schiavi et al., 2020; Smith et al., 2015, 2013). The core idea of these algorithms is to estimate a weight for each streamline based on its contribution to the underlying dMRI signal or to remove biologically implausible streamlines.
These methods can be considered streamline/tractogram filtering algorithms, where streamlines with lower weight (closer to zero) can be removed. Hence, the filtered streamline counts provide an estimate of structural connectivity that is more consistent with the underlying white matter microstructure, compared to raw streamline counts.
Despite the development of numerous streamline/tractogram filters and their growing application in human connectome mapping, the relative performance of different filters remains unknown and the improvement in connectivity estimation accuracy achieved through filtering requires further investigation. This evaluation should ideally involve generation of a ground-truth phantom that is representative of human brain morphology and for which WM connectivity is known. Specification of a ground-truth connectome for the human brain is challenging, given the diversity in the orientation, geometry, size, and density of fiber bundles and inter-individual differences. In the absence of a ground truth, tractography has been validated using dMRI phantom-based methods (Drobnjak et al., 2021; Hubbard and Parker, 2014; Maier-Hein et al., 2017). The purpose of these phantoms is to define underlying microstructural properties that can be used to quantitatively and qualitatively evaluate fiber bundles reconstructed non-invasively using tractography. dMRI phantoms provide a ground truth to determine whether the reconstruction of a fiber bundle with tractography is a true positive (TP) or a false positive (FP). A connectome phantom consists of dMRI data along with an explicit description of its structural connectivity. To date, three connectome phantoms have been developed (Caruyer et al., 2014; Rafael-Patino et al., 2021; Sarwar et al., 2019); these studies simulated spherical phantoms with known structural connectivity. A major drawback of these phantoms is that the simulated fiber bundles feature tubular geometry, which is a simplification of the complex morphological structure of WM fiber bundles comprising the human brain. It is well known that tractography performance can degrade for complex fiber configurations (Daducci et al., 2016; Jeurissen et al., 2019; Maier-Hein et al., 2017; Schilling et al., 2021). Hence, there is a need to validate the accuracy of the connectomes estimated from tractography reconstructions with and without filtering for brain-like fiber bundles.
Connectome validation studies suggest a trade-off between the sensitivity and specificity of tractography (Maier-Hein et al., 2017; Schilling et al., 2019b; Thomas et al., 2014). Tractography methods that are sensitive can result in the reconstruction of many spurious connections, while genuine connections can be overlooked with connectome mapping pipelines providing greater specificity (Sarwar et al., 2021; Sotiropoulos and Zalesky, 2019). However, most connectome validation studies to date have not evaluated state-of-the-art microstructure-informed streamline filters, which can potentially improve sensitivity and specificity. Furthermore, evaluation studies have been undertaken using semi-realistic connectome phantoms that simplify the complexity of the brain's connectivity architecture. Thus, there is a need to develop more realistic, brain-like connectome phantoms to evaluate tractography performance and streamline filtering algorithms.
In this work, we first establish a framework to simulate realistic brain-like connectome phantoms. Our phantoms recapitulate key properties of the human connectome, including fiber bundle morphological characteristics, cortical geometry, and brain network complexity. We use our connectome phantoms to evaluate the performance of several state-of-the-art microstructure-informed tractography filtering algorithms, including spherical-deconvolution informed filtering of tractograms (SIFT) (Smith et al., 2013), SIFT2 (Smith et al., 2015), linear fascicle evaluation (LiFE) (Pestilli et al., 2014), convex optimization modeling for microstructure informed tractography (COMMIT) (Daducci et al., 2015), and COMMIT2 (Schiavi et al., 2020). Many other techniques for improving the accuracy of tractography can be found in the literature (Legarreta et al., 2021; Wasserthal et al., 2018; Zöllei et al., 2019), but this study focuses on evaluating the performance of the aforementioned microstructure-informed tractography filtering algorithms. Filter performance is evaluated for both binary and weighted connectomes using measures of reconstruction sensitivity, specificity, and correlation with the ground-truth connectome. We find that microstructure-informed filtering can reliably improve the accuracy of connectome reconstruction in many circumstances, particularly for weighted connectomes. However, filtering achieves a relatively modest improvement in connectome accuracy, and it cannot alleviate major errors in fiber reconstruction. While the development of more accurate tractography algorithms is needed, we suggest that including a streamline filtering algorithm in current connectome mapping pipelines can yield improvements in connectome reconstruction accuracy.

Materials and methods
Fig. 1 provides a schematic of our framework to simulate dMRI data for realistic human brain-like connectome phantoms. We used scanner-acquired dMRI on which tractography was performed to generate an initial set of streamlines. A cortical parcellation atlas was used to extract the streamlines that connect pairs of regions. The extracted streamlines were processed to adjust the strength of the ground-truth connections. Finally, the dMRI signal was simulated for the adjusted streamlines, resulting in a dMRI phantom with a known structural connectivity. The details of these steps are reported in Section 2.1. These simulated connectome phantoms were then used to evaluate several microstructure-informed filtering algorithms. To assess the robustness of our findings, we also simulated two different types of spherical connectome phantoms (Sections 2.2-2.3) to further evaluate the filtering algorithms: the diffusion-simulated connectivity (DiSCo) phantom (Rafael-Patino et al., 2021) and complexity-controlled spherical phantoms (Sarwar et al., 2019).

Dataset
We obtained minimally pre-processed dMRI data from the Human Connectome Project (HCP) (Van Essen et al., 2013) for 10 randomly chosen individuals (Participant ID: 130114, 130518, 130720, 135124, 135629, 136631, 137532, 138130, 138332, 139435). Individuals were aged between 22 and 35 years (6 males). MRI acquisition protocols and minimal pre-processing pipelines are described in detail elsewhere (Glasser et al., 2013; Sotiropoulos et al., 2013). Briefly, 90 gradient directions were acquired at each of three b-values (1000, 2000, and 3000 s/mm²) using pulse gradient spin-echo planar imaging that was pre-processed to correct eddy-current, head motion, and gradient-nonlinearity distortions. The corrected volumes were transformed to standard space and gradient vectors were rotated to account for the transformation.
The FSL automated segmentation tool (Zhang et al., 2001) was used to segment each individual's T1-weighted images into three tissue types: WM, gray-matter (GM), and cerebrospinal fluid (CSF). These segmented masks enabled the simulation of tissue-specific signals and consideration of volume fractions for the voxels comprising multiple tissue types (Section 2.1.3).

Mapping and readjusting structural connectivity
Whole brain tractography was performed on the selected pre-processed dMRI data (10 individuals). This involved estimating local fiber orientations for each WM voxel using multi-shell constrained spherical deconvolution (CSD) (Jeurissen et al., 2014), followed by deterministic tractography (Mori et al., 1999) to generate streamlines, consistent with recent recommendations (Sarwar et al., 2019). Streamlines were uniformly seeded from a segmented WM mask (Section 2.1.1). The WM mask was dilated by one voxel to fill potential gaps between GM and WM boundaries. Tractography was performed using the MRtrix3 software package (Tournier et al., 2019; www.mrtrix.org), with default parameters for step-size (0.1 × voxel size), angle threshold (60°), and fiber orientation distribution (FOD) threshold (0.1) to generate 2 million streamlines for each individual.
A structural connectivity matrix was mapped by assigning streamline endpoints to regions of an established cortical parcellation atlas (Yeh et al., 2019). We used the Desikan-Killiany atlas in this study, which comprises 68 cortical regions (Desikan et al., 2006). Streamlines were discarded if one or both of their endpoints did not reside within a 2 mm radius of a region of the cortical parcellation. The structural connectivity between a pair of regions was defined as the number of streamlines interconnecting the two regions.
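The endpoint-assignment step described above can be illustrated with a minimal sketch. It is not the pipeline used in this study: the function names, the representation of each region as a list of sample coordinates (`region_points`), and the brute-force nearest-point search are all assumptions made for illustration.

```python
import math
from collections import Counter


def nearest_region(point, region_points, radius_mm=2.0):
    """Return the label of the closest region sample within radius_mm, else None.

    region_points is a hypothetical structure: {label: [(x, y, z), ...]}.
    """
    best_label, best_dist = None, radius_mm
    for label, coords in region_points.items():
        for c in coords:
            d = math.dist(point, c)
            if d <= best_dist:
                best_label, best_dist = label, d
    return best_label


def map_connectome(streamlines, region_points, radius_mm=2.0):
    """Count streamlines per unordered region pair.

    A streamline is discarded if either endpoint is not within radius_mm
    of any region, mirroring the rule stated in the text.
    """
    counts = Counter()
    for sl in streamlines:
        a = nearest_region(sl[0], region_points, radius_mm)
        b = nearest_region(sl[-1], region_points, radius_mm)
        if a is None or b is None:
            continue
        counts[tuple(sorted((a, b)))] += 1
    return counts
```

In practice a spatial index or a voxelized label image would replace the exhaustive search, which is kept here only for readability.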
The reconstructed connectivity matrix and streamlines generated for individuals were additionally processed to: 1) discard connections with few reconstructed streamlines, and 2) improve the WM coverage of the reconstructed connections. To this end, thresholding is often used to eliminate connections comprising the fewest streamlines, under the assumption that these connections are spurious (Buchanan et al., 2020; Roberts et al., 2017; Yeh et al., 2021). We applied a streamline count threshold that removed the weakest 3% of connections from the reconstructed connectomes. Threshold selection was based on the findings from our previous study (Sarwar et al., 2019), in which this threshold resulted in comparatively more accurate connectomes. Tractography is susceptible to numerous biases and it can underestimate connectivity strength for long-range connections (Schilling et al., 2019a; Yeh et al., 2021). To potentially alleviate some of these biases, the strength of weak connections was adjusted. This involved replicating the streamlines of weak connections to improve their strength, as detailed in Supplementary Data (Section S1). The processed streamlines were then used for signal simulation, and the connectivity matrices were also modified to reflect this adjustment.
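The thresholding step can be sketched as follows. This is an illustrative reading of the 3% streamline count threshold, assuming that the fraction is taken over existing (non-zero) connections and that ties are broken arbitrarily; the function name and the dictionary representation of the connectome are hypothetical.

```python
def threshold_weakest(connectome, fraction=0.03):
    """Remove the weakest `fraction` of existing connections.

    connectome: {(region_a, region_b): streamline_count}.
    Assumption: the fraction is computed over non-zero connections only.
    """
    existing = sorted((w, pair) for pair, w in connectome.items() if w > 0)
    n_remove = int(round(len(existing) * fraction))
    removed = {pair for _, pair in existing[:n_remove]}
    return {
        pair: w
        for pair, w in connectome.items()
        if w > 0 and pair not in removed
    }
```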

Signal generation
A three-compartment model was used to simulate tissue-specific (WM, GM, and CSF) dMRI signal for each voxel. The WM compartment was computed for the voxels that were traversed by reconstructed streamlines, whereas the GM and CSF compartments were computed using the respective extracted masks (Section 2.1.1). Hence, the dMRI signal for the ith voxel was defined as in Eq. (1), where $f_{WM}$, $f_{GM}$, and $f_{CSF}$ are the volume fractions of the WM, GM, and CSF compartments, respectively, TE is the echo time, and T2 is the relaxation time (Ferizi et al., 2015; Neher et al., 2014). The volume fractions sum to unity for each voxel. A range of volume fractions was evaluated in a separate study (Sarwar et al., 2019), where the relative performance of tractography was not affected by this parameter.
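As a rough illustration of how such a three-compartment combination can be computed, the sketch below assumes a common form in which each compartment signal is weighted by its volume fraction and attenuated by T2 relaxation, exp(-TE/T2). This form is an assumption based on the symbols defined above, not a transcription of the equation used in this study, and the function and argument names are hypothetical.

```python
import math

COMPARTMENTS = ("WM", "GM", "CSF")


def voxel_signal(fractions, signals, t2, te):
    """Volume-fraction-weighted sum of compartment signals with T2 decay.

    fractions, signals, t2: dicts keyed by 'WM', 'GM', 'CSF'.
    te and t2 must share the same time unit.
    Assumed form: sum_c f_c * S_c * exp(-TE / T2_c).
    """
    if abs(sum(fractions[c] for c in COMPARTMENTS) - 1.0) > 1e-9:
        raise ValueError("volume fractions must sum to one")
    return sum(
        fractions[c] * signals[c] * math.exp(-te / t2[c])
        for c in COMPARTMENTS
    )
```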
The WM signal, $S_i^{WM}$, was defined by a combination of intra- and extra-axonal compartments characterized by the "stick" (Behrens et al., 2003) and "zeppelin" (Alexander, 2008) models, respectively. The WM signal was dependent on the number of streamlines traversing each voxel. Streamlines were grouped into putative fiber bundles based on the pair of cortical regions that they interconnect. We use $s_i^k$ to denote the total number of streamlines that traverse voxel i and belong to fiber bundle k. As such, the total number of streamlines traversing voxel i is given by $s_i = \sum_{k=1}^{C} s_i^k$, where C is the total number of fiber bundles. The connection strengths of the fiber bundles were normalized using the strongest connection T in the reconstructed connectivity matrix. Using this nomenclature, the WM signal for the ith voxel was defined as in Eq. (2), where T is a normalizing constant given by the maximum connection strength in the reconstructed connectivity matrix. In the stick and zeppelin models (Eqs. (3) and (4)), b is the b-value, $v_{z(x)}$ is the principal fiber direction at x of fiber $p \in k_i$ ($k_i$ represents the set of streamlines corresponding to connection k in the ith voxel), g is the gradient direction, and α and β are diffusivity parameters. If $d_\parallel$ and $d_\perp$ are the apparent diffusivities parallel and perpendicular to the principal direction, then $d_\parallel = \alpha + \beta$ and $d_\perp = \beta$.
The integration in Eqs. (3) and (4) was evaluated by sampling at most 3 tangent vectors within each voxel along the length of each fiber trajectory.
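For readers unfamiliar with these compartment models, the sketch below evaluates the standard stick and zeppelin signal attenuations and approximates the along-fiber integral by averaging over a small number of tangent vectors. The zeppelin uses the parameterization given above (parallel diffusivity α + β, perpendicular diffusivity β). The stick diffusivity `d`, the even subsampling of tangents, and all function names are assumptions for illustration; how the two compartments are combined and weighted in Eq. (2) is not reproduced here.

```python
import math


def dot(u, v):
    """Dot product of two 3-vectors."""
    return sum(a * b for a, b in zip(u, v))


def stick(b, d, g, v):
    """Stick model: diffusion only along the unit fiber direction v."""
    return math.exp(-b * d * dot(g, v) ** 2)


def zeppelin(b, alpha, beta, g, v):
    """Zeppelin model: axially symmetric tensor alpha * v v^T + beta * I."""
    return math.exp(-b * (beta + alpha * dot(g, v) ** 2))


def mean_over_tangents(model, tangents, max_samples=3):
    """Average model(v) over at most max_samples tangents within a voxel.

    Assumption: tangents are subsampled evenly along the trajectory.
    """
    if not tangents:
        raise ValueError("at least one tangent vector is required")
    step = max(1, math.ceil(len(tangents) / max_samples))
    samples = tangents[::step][:max_samples]
    return sum(model(v) for v in samples) / len(samples)
```

For example, `mean_over_tangents(lambda v: stick(3000, 1.7e-3, g, v), tangents)` returns the averaged intra-axonal attenuation for gradient direction `g`, with b in s/mm² and diffusivity in mm²/s (illustrative values only).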
Signals for the GM ($S_i^{GM}$) and CSF ($S_i^{CSF}$) compartments for the ith voxel were defined by the isotropic tensor model (FA = 0) with diffusivities $d_{GM}$ and $d_{CSF}$, respectively. Finally, Rician noise with noise level σ was added to the simulated signal (Gudbjartsson and Patz, 1995). The signal-to-noise ratio (SNR) was computed as $\mathrm{SNR} = S_0 / \sigma$, where $S_0$ is the non-gradient-weighted signal (b = 0) averaged across voxels traversed by fibers. The proposed methodology was used to simulate single-shell dMRI data. The dMRI acquisition and signal parameters (Neher et al., 2014; Perrone et al., 2016; Sarwar et al., 2020, 2019) used in this paper are reported in Table 1.
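The noise step can be sketched with the usual construction of Rician noise, in which independent Gaussian noise is added to the real and imaginary channels and the magnitude is taken. The function names are hypothetical, and deriving σ from a target SNR via σ = S0/SNR simply inverts the definition given above.

```python
import math
import random


def sigma_for_snr(s0_mean, snr):
    """Noise level implied by SNR = S0 / sigma."""
    return s0_mean / snr


def add_rician_noise(signal, sigma, rng=random):
    """Magnitude of a signal corrupted by complex Gaussian noise.

    rng may be the random module or a random.Random instance
    (the latter gives reproducible draws).
    """
    real = signal + rng.gauss(0.0, sigma)
    imag = rng.gauss(0.0, sigma)
    return math.hypot(real, imag)
```

Passing `rng=random.Random(seed)` makes a simulated phantom reproducible across runs.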

DiSCo phantom
DiSCo is a publicly available digital dMRI phantom composed of 12,196 tubular fibers connecting 16 distinct pairs of regions (Rafael-Patino et al., 2021). A predefined connectivity matrix (16 × 16) exists for this phantom. In order to maximize similarity with the human-like connectome phantoms in this work, we did not use the provided simulated image data directly; instead, we combined the provided underlying ground-truth tractogram of the DiSCo phantom with the simulation model reported in Section 2.1.3. Note that the developers of the DiSCo phantom used a different signal generation model, and thus direct comparison with our phantom was not possible. The DiSCo phantom was simulated using the same parameters reported in Table 1, except the image dimensions of this phantom were 40 × 40 × 40 (equivalent to the original DiSCo phantom).

Complexity-controlled spherical phantoms
Spherical phantoms with tubular fibers connecting pairs of regions defined on their circumference were simulated following the procedure described elsewhere (Sarwar et al., 2019). These phantoms comprised 60 regions and were simulated using the method described in Section 2.1.3. Phantoms with various levels of fiber complexity were generated, where the complexity was defined as the extent of voxels containing multiple fibers (more than one) in a phantom. It should be noted that no initial set of streamlines existed in this case (unlike the previous phantoms; Sections 2.1-2.2), so a constant connection strength of 1 was assigned to each fiber bundle (Eq. (2)). We generated phantoms with connection densities ranging from 2% to 20% in increments of 2% that mapped to complexities of 10% to 50% (increments of 10%). For each connection density, we generated 10 connectome phantoms that were then used to evaluate the filtering algorithms under investigation (Section 2.5).
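One way to compute such a complexity measure is sketched below. The text defines complexity as the extent of voxels containing more than one fiber; the choice of denominator (voxels containing at least one fiber) is an assumption, as are the function name and the input structure.

```python
def fiber_complexity(voxel_fibers):
    """Fraction of fiber-containing voxels traversed by more than one fiber.

    voxel_fibers: {voxel_index: set of fiber ids traversing that voxel}.
    Assumption: empty voxels are excluded from the denominator.
    """
    occupied = [ids for ids in voxel_fibers.values() if ids]
    if not occupied:
        return 0.0
    multi = sum(1 for ids in occupied if len(ids) > 1)
    return multi / len(occupied)
```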
It should be noted that all the phantoms were generated using the same simulation model (Section 2.1.3). This ensured that any changes in the outcomes were a consequence of the characteristics of the phantom or fiber bundles, not due to variation in the simulation model.

Streamline tractography
We evaluated the filtering algorithms on whole-brain tractograms generated using two exemplar algorithms from the MRtrix3 software with default parameters: a CSD-based deterministic algorithm and a CSD-based probabilistic algorithm. To test the generality of results, our analyses were replicated using the probabilistic "Parallel Transport Tractography" algorithm as implemented in the "Trekker" software with default parameters (Aydogan and Shi, 2021); for clarity of presentation, relevant results are deferred to Supplementary Material.

Microstructure-informed tractogram filtering
We evaluated the performance of several microstructure-informed tractography filtering algorithms: SIFT (Smith et al., 2013), SIFT2 (Smith et al., 2015), LiFE (Pestilli et al., 2014), COMMIT (Daducci et al., 2015), and COMMIT2 (Schiavi et al., 2020). The common principle governing the operation of these filters is to selectively eliminate or down-weight streamlines to improve the correspondence between the diffusion signal estimated from the generated tractogram and the acquired dMRI. COMMIT2 extends this strategy by injecting priors about brain anatomy and its organization into the estimation process. All the methods except SIFT assign a "weight" to individual streamlines representing their contribution towards the diffusion signal. Hence, the sum of the weights of all the streamlines comprising a connection represents its connectivity strength (Smith et al., 2020). SIFT, on the other hand, selects a subset of streamlines that best fit the fiber density estimated from the diffusion data. These shortlisted streamlines are then used to map connectomes. The raw tractogram (before filtering) was also used to map connectomes to quantify the improvement achieved with filtering, compared to no filtering. Established implementations of SIFT/SIFT2 (https://mrtrix.org), LiFE (https://github.com/brain-life/encode), and COMMIT/COMMIT2 (https://github.com/daducci/COMMIT) were used. Default parameters were used for these algorithms except for COMMIT2, where the regularization parameter was optimized to yield the most accurate reconstruction. This involved performing COMMIT2 filtering for a range of values (5 × 10⁻⁵ to 5 × 10⁻²) of the regularization parameter. The value that resulted in the highest F-measure of the connectomes was selected in the study (this analysis is reported in the Supplementary Data, Figures S2-S5).
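The mapping from per-streamline weights to connection strengths can be sketched as follows. The sketch assumes that each streamline has already been assigned to a pair of regions and that the filter output is available as one weight per streamline; the function name and the `eps` cut-off for treating a weight as zero are hypothetical.

```python
from collections import defaultdict


def weighted_connectome(assignments, weights=None, eps=0.0):
    """Sum streamline weights per connection.

    assignments: one (region_a, region_b) pair per streamline.
    weights: one weight per streamline; None reproduces the raw
             streamline count (every streamline weighted 1).
    Streamlines with weight <= eps are treated as filtered out.
    """
    if weights is None:
        weights = [1.0] * len(assignments)
    if len(weights) != len(assignments):
        raise ValueError("one weight per streamline is required")
    strength = defaultdict(float)
    for pair, w in zip(assignments, weights):
        if w > eps:
            strength[tuple(sorted(pair))] += w
    return dict(strength)
```

For a filter that selects a subset of streamlines rather than weighting them, the same function applies with weights of 1 for retained streamlines and 0 for removed ones.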

Evaluation methodology
Connections that were present in both the ground-truth and reconstructed connectomes were classified as TPs, while connections that were only present in reconstructed connectomes were considered FPs. False Negatives (FNs) were the ground-truth connections that were absent in the reconstructed connectomes.
The performance of the filtering methods was evaluated for binary and weighted connectomes. The weighted analysis involved mapping the connectome with the filtered tractogram, whereas in binary analysis, the strength of the reconstructed connectome was not considered. Due to symmetry, the metrics were computed using only the upper triangular elements of the connectivity matrices.
The overall performance was evaluated using the F-measure because it is not affected by the disproportionate number of True Negatives (TNs; connections absent from both the ground-truth and reconstructed connectomes). The reconstructed connectomes were binarized, where connections with non-zero strength were set to 1 and absent connections were set to 0.
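The binary evaluation can be sketched using the standard definitions of the true positive rate (TPR), false positive rate (FPR), and F-measure, computed over the upper triangle of the connectivity matrices. Representing matrices as nested lists and returning zero when a denominator vanishes are choices made for this illustration.

```python
def binary_metrics(gt, rc):
    """TPR, FPR and F-measure from upper-triangular elements.

    gt, rc: square connectivity matrices (nested lists); any non-zero
    strength counts as a present connection.
    """
    n = len(gt)
    tp = fp = fn = tn = 0
    for i in range(n):
        for j in range(i + 1, n):
            in_gt, in_rc = gt[i][j] > 0, rc[i][j] > 0
            if in_gt and in_rc:
                tp += 1
            elif in_rc:
                fp += 1
            elif in_gt:
                fn += 1
            else:
                tn += 1
    tpr = tp / (tp + fn) if tp + fn else 0.0
    fpr = fp / (fp + tn) if fp + tn else 0.0
    f = 2 * tp / (2 * tp + fp + fn) if tp + fp + fn else 0.0
    return {"TP": tp, "FP": fp, "FN": fn, "TN": tn,
            "TPR": tpr, "FPR": fpr, "F": f}
```

Because TN does not appear in the F-measure, a sparse ground truth with many absent connections does not inflate the score.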

Weighted connectomes
The TPR, FPR, and F-measure treat all connections equally (binary connectomes) and therefore do not account for variation in connectivity strength. Ideally, the connectivity strength of the TPs should also be accurately estimated. To evaluate the estimation of connectivity strength (weighted connectomes) by tractography algorithms, we developed a new metric, namely the weighted F-measure, that incorporates the strength of TPs, FPs, and FNs.
Weighted connectome analysis required the reconstructed connectome to be normalized to ensure a fair comparison with the ground-truth connectivity matrix. This was achieved by multiplying the reconstructed connectome (RC) by $GT_w / RC_w$, where $GT_w$ and $RC_w$ are the sums of all the connections in the simulated ground-truth (GT) connectome and RC, respectively. The weighted F-measure was then computed for the normalized RC from the summed strengths of the TPs, FPs, and FNs, where $FN_w = \sum_k GT_k$ over all FNs. Along with the weighted F-measure, the Pearson correlation between GT and RC was also computed in the weighted analysis, as in the DiSCo challenge (http://hardi.epfl.ch/static/events/2021_challenge/).
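A sketch of the weighted analysis is given below. The normalization by GT_w / RC_w and the definition of FN_w follow the text. The remaining choices are assumptions made for illustration: TP_w and FP_w are taken as the summed normalized RC strengths over TPs and FPs, the three terms are combined in the same way as in the binary F-measure, and the Pearson correlation is computed over upper-triangular elements.

```python
import math


def upper(m):
    """Upper-triangular elements of a square matrix as a flat list."""
    n = len(m)
    return [m[i][j] for i in range(n) for j in range(i + 1, n)]


def weighted_f_measure(gt, rc):
    """Weighted F-measure after scaling RC by GT_w / RC_w."""
    g, r = upper(gt), upper(rc)
    rc_w = sum(r)
    if rc_w == 0:
        return 0.0
    scale = sum(g) / rc_w
    r = [x * scale for x in r]
    tp_w = sum(x for y, x in zip(g, r) if y > 0 and x > 0)  # assumption
    fp_w = sum(x for y, x in zip(g, r) if y == 0 and x > 0)  # assumption
    fn_w = sum(y for y, x in zip(g, r) if y > 0 and x == 0)  # from the text
    denom = 2 * tp_w + fp_w + fn_w
    return 2 * tp_w / denom if denom else 0.0


def pearson(gt, rc):
    """Pearson correlation between GT and RC connection strengths."""
    g, r = upper(gt), upper(rc)
    mg, mr = sum(g) / len(g), sum(r) / len(r)
    cov = sum((a - mg) * (b - mr) for a, b in zip(g, r))
    var_g = sum((a - mg) ** 2 for a in g)
    var_r = sum((b - mr) ** 2 for b in r)
    return cov / math.sqrt(var_g * var_r)
```

Note that the Pearson correlation is unaffected by the normalization, since it is invariant to a positive rescaling of RC.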

Human brain-like connectome phantoms
The dMRI simulation framework in Section 2.1 was used to simulate 10 connectome phantoms that were then used to evaluate the accuracy of several streamline filters. Fig. 2 shows simulated dMRI images for the connectome phantoms generated for two representative individuals. While the images are simulated, they recapitulate the characteristics of real dMRI images. It is worth noting that the WM signal for b = 0 varies for different voxels because intensity is proportional to the strength of streamlines crossing the voxels (Eq. (2)).
Fig. 3 shows the ground-truth and reconstructed connectomes for raw (before filtering) and filtered tractograms. The reconstructed matrices are group averages computed across the 10 phantoms. The unfiltered connectomes are denser than their filtered counterparts. We also observe that probabilistic tractography generates more connections than the ground truth. Comparatively, LiFE reconstructed the sparsest connectomes, followed by COMMIT and COMMIT2. Fig. 3 shows that many of the intra-hemispheric connections are retained by all the filtering algorithms, whereas a substantial number of inter-hemispheric connections were filtered out. It appears that the filtering algorithms have difficulty retaining the TP streamlines connecting inter-hemispheric regions.
Fig. 4 shows the quantitative performance evaluation results for both binary and weighted connectomes. We find that while filtering algorithms can reduce the number of FPs, this was achieved at the expense of filtering TPs. This phenomenon was observed for connectomes generated using both probabilistic and deterministic tractography. For deterministic connectomes, there was no major improvement in the F-measure (≈0.53), i.e., connectomes of the raw (unfiltered) and filtered tractograms demonstrated similar accuracy in terms of the F-measure (min: 0.562, max: 0.536). For probabilistic tractography, SIFT, COMMIT, and COMMIT2 had similar (≈0.5) but greater F-measures than raw (unfiltered: 0.46), SIFT2 (0.46), and LiFE (0.40). Overall, SIFT, COMMIT, and COMMIT2 eliminated most FPs from the raw tractograms for both deterministic and probabilistic tractography (>30%).
Weighted analysis (in terms of weighted F-measure and correlation) was also performed to evaluate the performance of the filtering algorithms considering the variation in connectivity strengths. Binary analysis treats all the connections equally, whereas weighted analysis gives more weight to stronger connections than weaker connections. The reconstructed connectomes (raw and filtered) were first normalized with respect to the ground truth (Section 2.6). The weighted F-measure and Pearson correlation were computed for the reconstructed connectomes (Fig. 4(d,e)). For deterministic tractography, all filtering algorithms slightly outperformed the unfiltered tractogram in terms of the weighted F-measure (an improvement of ≈4%), except LiFE, whose F-measure was equal to that of the unfiltered tractogram. For probabilistic tractography, SIFT outperformed the other filtering algorithms in terms of weighted F-measure (0.8), followed by COMMIT and COMMIT2 (0.77). Again, no improvement in the weighted F-measure was observed for LiFE.
In terms of Pearson correlation, SIFT2 outperformed its counterparts (deterministic ≈0.61, probabilistic ≈0.63; 15% greater than the unfiltered tractogram), which indicates that the strengths of the connections filtered by SIFT2 are more similar to the ground truth than those of the other filtering algorithms. It should be noted that all the filtering algorithms except LiFE outperformed the unfiltered tractogram. The inter-subject variability in results is also reported in Supplementary Figure S2, where the highest variation in results was observed for LiFE filtering on tractograms generated by probabilistic tractography. It should be noted that the optimal results of COMMIT2 are reported in Fig. 4, whereas its performance for varying regularization parameters can be found in Supplementary Data (Figure S3).

DiSCo phantom
The DiSCo phantom (Section 2.2) was simulated to assess the performance of the filtering algorithms under varying fiber configurations while keeping the signal simulation model constant. The qualitative and quantitative results of the DiSCo phantom analysis are reported in Supplementary Data (Figures S4-S5). We found that unfiltered probabilistic tractography produced more FPs (FPR ≈0.9) compared to deterministic tractography (FPR ≈0.56). All the filtering algorithms successfully filtered out FPs, with COMMIT2 removing >70% of FPs from tractograms generated by deterministic and probabilistic tractography. High TPR and low FPR for COMMIT2 resulted in the highest values of F-measure (deterministic: 0.72, probabilistic: 0.814), outperforming the other algorithms (F-measure min: 0.37, max: 0.54). This is consistent with the findings reported by Schiavi et al. (2020), where COMMIT2 outperformed the other filtering algorithms. Moreover, COMMIT2 also preserved the connectivity strength in the filtered tractogram, resulting in the highest weighted F-measure. This apparent superior performance was, however, not reflected in the Pearson correlation measure. The optimal value of the regularization parameter was selected for COMMIT2; its performance for the simulated DiSCo phantom under varying regularization parameters can be found in Supplementary Data (Figure S6).
Fig. 3. Group averaged connectomes for 10 simulated phantoms generated using deterministic and probabilistic tractography followed by state-of-the-art filtering algorithms (SIFT, SIFT2, LiFE, COMMIT, and COMMIT2). Raw tractogram denotes the case in which the tractogram was not filtered and raw streamline counts were used to map connectivity matrices.
A difference between the performance of filtering algorithms for the human brain-like (Fig. 4) and DiSCo (Supplementary Figure S5) phantoms could be observed, which may be due to the highly simplified geometry and fiber complexity of the latter phantom. For phantoms that are more complex than DiSCo, the performance of COMMIT2 and the other filters can diminish.

Complexity-controlled spherical phantoms
The filtering algorithms were also evaluated for spherical connectomes under various complexities (Section 2.3). Fig. 5 shows the average performance for this analysis, where 10 connectome phantoms were simulated for each complexity value. In terms of the F-measure, COMMIT2 (optimal selection of regularization parameter; Supplementary Figures S7-S8) provided the best performance compared to SIFT, SIFT2, and COMMIT for phantom complexities from 10% to 40%. Unlike previous results (Sections 3.1 and 3.2), LiFE had the highest F-measure from 10% to 50% (with a value similar to that of COMMIT2). Firstly, LiFE had the lowest TPR and FPR here (Fig. 5(a,b)), which affected its overall F-measure (Fig. 5d). Secondly, all the connections in these phantoms had the same strength, which may have also substantially affected the performance of the LiFE algorithm. No major improvement was observed for the weighted F-measure, which was expected as the variation in the connection strength of the ground truth was not modeled (as mentioned previously; Section 2.3). Similarly, the correlations with the ground truth were similar across the filtering algorithms, except for LiFE, which had the lowest correlation with the ground truth.

Discussion
In this study, we evaluated the performance of several tractography filtering algorithms using simulated phantoms. Existing connectome phantoms do not capture both the geometrical and spatial properties of human WM fiber bundles and the underlying structural connectivity between pairs of regions. To address this limitation, we established a framework to simulate human brain-like diffusion MRI data by utilizing a scanner-acquired dMRI dataset, on which tractography and post-processing were performed before the simulation process (Section 2.1). The performance of five filtering algorithms (SIFT, SIFT2, LiFE, COMMIT, and COMMIT2) was evaluated. The filtering algorithms were applied to the tractograms generated by CSD-based deterministic and probabilistic tractography for human brain-like and spherical connectome phantoms. We also utilized the available spherical phantoms for evaluating the filtering algorithms (Sections 2.2 and 2.3).

Human brain-like connectome phantoms
For binary analysis on brain-like phantoms, we found that all the filtering algorithms were successful in removing a fraction of FPs (Fig. 4). However, this reduction in FPR was achieved at the expense of TPs, i.e., many TPs were also filtered out by these algorithms. In terms of the binary F-measure, no filtering algorithm noticeably outperformed the raw tractograms for deterministic tractography data (≈0.53). For the probabilistic tractograms, we found that SIFT, COMMIT, and COMMIT2 had comparable performance to one another in terms of the binary F-measure (≈0.5); around 8% more than SIFT2, LiFE, and raw tractograms (no filtering).
In terms of the weighted F-measure, SIFT (≈0.81) slightly outperformed the other methods, with a 4% increase in performance relative to the raw tractograms (≈0.78) generated using deterministic tractography. This finding was also observed for probabilistic tractography, where SIFT showed an 11% increase in performance relative to the raw tractograms. The correlation outcomes for deterministic and probabilistic tractography showed a consistent trend: SIFT2 was the top-performing filtering algorithm, with correlation coefficients of 0.64 and 0.61 for deterministic and probabilistic tractography, respectively. SIFT followed closely, with correlation coefficients of 0.61 (deterministic) and 0.59 (probabilistic). In contrast, COMMIT, COMMIT2, and the unfiltered tractogram exhibited similar correlation coefficients of 0.53 (deterministic) and 0.54 (probabilistic). LiFE was the only algorithm that yielded lower correlation coefficients than the unfiltered tractogram, with values of 0.48 (deterministic) and 0.43 (probabilistic). LiFE also estimated the least dense connectomes of all the algorithms (Figs. 3 and 4).
In this study, phantoms were generated for 10 different subjects to account for inter-subject variability during the evaluation of filtering algorithms. Hence, the ground-truth connections varied across the phantoms. Across these phantoms, we did not find a common connection whose streamlines were consistently down-weighted to zero by all filtering algorithms. For individual subjects, we extracted the subset of connections that were down-weighted to zero or removed by the filtering algorithms (except SIFT2, which did not demonstrate this phenomenon). These connections generally comprised few streamlines in the raw tractogram (<15). No common geometry (e.g., bending, U-shaped) was associated with these connections (Supplementary Figure S9).
We also examined whether the spatial distance between connections influenced the performance of the filtering algorithms. We extracted the streamlines of the connections for four cases: 1) TPs that were retained, 2) FNs (TPs erroneously removed by the filtering algorithms), 3) FPs that were successfully removed, and 4) FPs that were ignored by the filtering algorithms. Supplementary Figures S10-S11 illustrate the lengths of the streamlines extracted for these connections, which were analyzed for any discernible patterns related to the filtering algorithms. The results showed that TPs were predominantly short-range connections, and that the filtering algorithms effectively removed many short-range FP connections (1-25 mm). However, they were less successful in filtering medium- to long-range FP connections (26-160 mm). The filtering algorithms mistakenly removed comparatively few TPs, and no specific pattern was detected in the lengths of these connections.
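As a rough illustration of the length-based grouping used above, the following Python sketch computes the length of a streamline from its 3D points and assigns it to a short-range or medium- to long-range class. The function names and the 25 mm cut-off parameter are assumptions chosen to mirror the ranges quoted in the text (1-25 mm and 26-160 mm); the study's actual procedure may differ.

```python
import math


def streamline_length(points):
    """Sum of Euclidean segment lengths along a streamline (in mm)."""
    return sum(math.dist(p, q) for p, q in zip(points, points[1:]))


def length_class(length_mm, short_max=25.0):
    """Label a streamline by length; the cut-off is an assumed parameter."""
    return "short" if length_mm <= short_max else "medium_to_long"


# Made-up streamlines, coordinates in mm.
short_streamline = [(0.0, 0.0, 0.0), (3.0, 4.0, 0.0), (3.0, 4.0, 12.0)]
long_streamline = [(0.0, 0.0, 0.0), (30.0, 40.0, 0.0)]

short_len = streamline_length(short_streamline)
long_len = streamline_length(long_streamline)
print(short_len, length_class(short_len))
print(long_len, length_class(long_len))
```

Applying such a grouping separately to retained TPs, FNs, removed FPs, and retained FPs would give the kind of per-class length distributions discussed above.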
Additionally, we calculated the relative difference in connection strength between each filtered tractogram and the raw tractogram to identify connections that were significantly affected by a filtering algorithm (Supplementary Figures S12-S13). We observed a reduction in strength for the majority of connections across all filtering algorithms, with intra-hemispheric connections affected more than inter-hemispheric connections (Supplementary Figure S13). Furthermore, a few connections were strengthened by SIFT2, COMMIT, and COMMIT2 (Supplementary Figure S12(b,d,e)), i.e., the filtered connections had greater strength than in the raw tractogram. Koch and colleagues (Koch et al., 2022) also observed this phenomenon, in which WM regions are underestimated by raw tractography. Their study also highlighted that the performance of filtering algorithms is influenced by the underlying WM profiles. Specifically, areas with multiple fiber populations, complex fiber architecture (such as crossing and kissing configurations), and partial volume contamination exhibited a decrease in the density of the filtered tractogram. Regions with multiple fiber populations are more susceptible to the over-representation of WM by tractography, while more accurate WM reconstructions were observed for regions with single fiber populations.
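One plausible way to compute such a relative difference is sketched below in Python. It assumes the measure is (filtered - raw) / raw, evaluated only for connections present in the raw connectome. The exact definition used in the study is not restated here, and the function name and toy matrices are hypothetical.

```python
def relative_difference(raw, filtered):
    """Relative change in strength for each connection present in `raw`.

    Assumed definition: (filtered - raw) / raw. Negative values mean the
    connection was weakened; -1.0 means it was removed entirely.
    """
    n = len(raw)
    changes = {}
    for i in range(n):
        for j in range(i + 1, n):
            if raw[i][j] > 0:
                changes[(i, j)] = (filtered[i][j] - raw[i][j]) / raw[i][j]
    return changes


# Made-up symmetric connectomes (streamline counts or weights).
raw = [[0, 100, 20],
       [100, 0, 10],
       [20, 10, 0]]
filtered = [[0, 50, 30],
            [50, 0, 0],
            [30, 0, 0]]

changes = relative_difference(raw, filtered)
strengthened = [pair for pair, change in changes.items() if change > 0]
print(changes)
print(strengthened)
```

Positive entries correspond to connections that were strengthened by filtering, as reported above for a few connections under SIFT2, COMMIT, and COMMIT2.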

DiSCo phantom
COMMIT2 was previously shown to be a major advance in removing FPs while preserving TPs, thereby improving connectome specificity (Schiavi et al., 2020). However, this behavior was not observed for the human brain-like connectome phantoms. Hence, it was important to verify that the results were not affected by the simulation model used in this study. The quantitative analysis of COMMIT2 (Schiavi et al., 2020) was performed on spherical phantoms, so we repeated the analysis on spherical connectome phantoms generated using the proposed simulation model (Section 2.1.3). For this purpose, we simulated the DiSCo phantom using the publicly available ground-truth streamlines. Both deterministic and probabilistic tractography were performed on the phantom, and the evaluation outcomes were similar for both tractography algorithms. We found that COMMIT2 outperformed the other filtering algorithms on the binary F-measure by removing a substantial number of FPs while preserving TPs (Section 3.2), consistent with the original findings of the COMMIT2 study (Schiavi et al., 2020). The weighted F-measure for COMMIT2 was also superior to that of all other methods (>0.8), suggesting that it can preserve the strength of TPs. In terms of correlation, COMMIT2 showed no improvement, i.e., the raw and filtered connectomes had a similar correlation with the ground truth. This finding was somewhat inconsistent with the results for the original DiSCo phantom (Supplementary Figures S14-S16), where a minor improvement in correlation over the raw tractogram was observed for all filtering algorithms. This difference between two spatially and geometrically identical phantoms suggests that the dMRI simulation model may affect local fiber orientation estimation and streamline reconstruction in tractography, which in turn affects connectome estimation.

Complexity-controlled spherical phantoms
To further validate the performance of the filtering algorithms, we simulated complexity-controlled spherical phantoms (Sarwar et al., 2019) at varying complexities (10%-50%), achieved by varying the connection density of the ground-truth phantoms (Section 2.3). The binary F-measure analysis revealed that COMMIT2 performed better than the other methods (except LiFE) from 10% to 40% phantom complexity, but all the algorithms performed similarly at a complexity of 50%. In our previous study (Sarwar et al., 2019), we quantified the complexity (extent of voxels with multiple fibers) of the human brain as ≈46-52%, which is replicated in the proposed human brain-like connectome phantoms. This suggests that the performance of all the filtering algorithms becomes comparable for spherical phantoms with complexity equivalent to that of the human brain. The LiFE filtering method is an exception in the binary F-measure analysis specific to these phantoms. LiFE filtered out many TPs along with the FPs (Fig. 5(a,b)), and this trade-off resulted in a comparatively high F-measure (Fig. 5d). This suggests that LiFE may perform poorly for phantoms that do not model the variation in fiber bundle densities.
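The notion of complexity used here (the extent of voxels with multiple fibers) can be sketched in Python as the percentage of WM voxels that contain more than one fiber population. The function name and the per-voxel fiber counts below are hypothetical; how fiber populations are counted per voxel in the study is not specified in this section.

```python
def complexity_percent(fibers_per_voxel):
    """Percentage of WM voxels containing more than one fiber population.

    Assumption: a voxel belongs to WM if it holds at least one fiber
    population; voxels with a count of zero are ignored.
    """
    wm_counts = [count for count in fibers_per_voxel if count > 0]
    if not wm_counts:
        return 0.0
    multi_fiber = sum(1 for count in wm_counts if count > 1)
    return 100.0 * multi_fiber / len(wm_counts)


# Made-up fiber-population counts for eight voxels (two lie outside WM).
counts = [0, 1, 1, 2, 3, 1, 0, 2]
complexity = complexity_percent(counts)
print(complexity)
```

With these toy counts, three of the six WM voxels hold multiple fiber populations, which falls in the range reported above for the human brain.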

Outcomes of the comparative analysis
Overall, we found that microstructure-informed filtering algorithms are successful in removing some of the FPs from the tractograms. For the brain-like connectome phantoms, many TPs are also removed during the filtering process, so filtering did not substantially improve the binary F-measure. It is worth noting that, in the absence of filtering, deterministic tractography generated more accurate connectomes than probabilistic tractography in terms of the binary F-measure (Fig. 4).
Connectomes reconstructed by probabilistic tractography contain more FPs than those reconstructed by deterministic tractography, and their accuracy was improved by the filtering algorithms. Hence, filtered tractograms should be preferred over raw tractograms when using probabilistic tractography. This finding is consistent with our previous study (Sarwar et al., 2019), which suggested using a streamline threshold to improve the accuracy of connectomes mapped using probabilistic tractography.
We further validated the performance of the filtering algorithms for an alternative probabilistic algorithm, Parallel Transport Tractography (Trekker) (Aydogan and Shi, 2021). Overall, the relative performance of the various filtering algorithms and our conclusions were unchanged when using this alternative probabilistic tractography algorithm. The filtering algorithms reduced the number of FPs in connectomes mapped with Trekker. As with the iFOD2 algorithm, LiFE filtering performed worse than the other filtering algorithms and the raw tractogram on weighted evaluation metrics (Fig. 4 and Supplementary Figure S18). The unfiltered tractogram generated by Trekker had a TPR of 0.97 ± 0.03, suggesting that it can potentially extract nearly all the ground-truth connections, but this comes at the cost of a high FPR (0.73 ± 0.18). The TPR of the unfiltered Trekker algorithm is substantially higher than that of the unfiltered iFOD2 algorithm (TPR: 0.83 ± 0.21 and FPR: 0.39 ± 0.18). The performance of the filtering algorithms is influenced by the characteristics of the underlying WM profiles: a reduction in the density of the filtered tractogram is observed for areas with multiple fiber populations, complex fiber architecture (crossing and kissing configurations), and partial volume contamination (Koch et al., 2022). LiFE demonstrated the lowest TPR and FPR for spherical phantoms with 50% complexity (Fig. 5). Notably, 52% of the voxels within the WM volume of the human brain, extracted from dMRI images, consist of multiple fibers (Sarwar et al., 2019). Therefore, it is not unexpected that the LiFE algorithm performed poorly in terms of weighted metrics when applied to tractograms generated by probabilistic tractography algorithms, as these algorithms typically have a higher TPR than deterministic tractography. Trekker has the potential to extract streamlines with complex underlying WM profiles, as evidenced by its high TPR. In contrast, the LiFE filtering algorithm tends to underestimate complex WM bundles.
For DiSCo phantoms, we found that COMMIT2 with optimized regularization substantially outperformed all other filtering algorithms. This was also observed for the spherical phantoms with 10% complexity (excluding LiFE, as previously discussed). However, with increasing complexity (>10%), the performance of COMMIT2 tends to deteriorate. It should be noted that filtering algorithms aid in identifying FPs in a connectome regardless of the complexity of the phantom.
Identifying and analyzing the weaknesses of filtering algorithms in identifying FP streamlines is challenging due to the biases in orientation estimation models and tractography. Researchers continue to develop tractography algorithms with improved accuracy and lower error rates. This study is the first to compare the performance of filtering algorithms across multiple phantoms, whereas Koch and colleagues (Koch et al., 2022) studied their performance on healthy subjects. The variability of the results across tractography algorithms also makes it difficult to determine the causes of performance deterioration. For example, SIFT outperformed the other filtering algorithms for probabilistic tractography, whereas no single best-performing filtering algorithm was identified for deterministic tractography in terms of binary and weighted F-measures. Therefore, additional studies are required to better understand the performance of the filtering algorithms. Nevertheless, in this study, we found that filtering algorithms were more effective at eliminating short-range FP connections than long-range FPs. Furthermore, the filtering algorithms had a greater impact on the connection strength of intra-hemispheric connections than on that of inter-hemispheric connections.
Although filtering methods can improve the accuracy of connectome mapping pipelines, further improvement is possible. Tractography is an ill-posed problem, as it cannot distinguish between different fiber configurations (such as bending, fanning, crossing, and kissing fibers) at the local voxel level. Furthermore, we also demonstrated that the performance of the filtering methods deteriorates with increasing fiber complexity. Hence, advances are needed to enhance the performance of both tractography and filtering algorithms, which could in turn improve the accuracy of the reconstructed connectome.

Limitations
Several important limitations require consideration. The analysis was performed on 10 connectome phantoms (brain-like and complexity-controlled spherical phantoms). Moreover, only a single-shell dMRI dataset was simulated. It is well known that the dMRI simulation strategy can impact the performance of tractography (Barrio-Arranz et al., 2015; Ferizi et al., 2014). As future work, this analysis could be extended by simulating hundreds of brain-like phantoms under different simulation settings, incorporating other artifacts (e.g., motion artifacts, Gibbs ringing, and thermal noise) in the simulation pipeline under a multi-shell configuration. It should be noted that while tractography can be inaccurate, we primarily used it to obtain WM bundles for the brain-like connectome phantoms. We considered the reconstructed, post-processed bundles as valid connections for the ground-truth phantom. The goal was to develop ground-truth connections with human brain-like spatial and geometrical properties, which was achieved using tractography. The study did not comprehensively evaluate the filtering algorithms under variation in their parameter settings. The default settings of the filtering algorithms were used in the analysis, except for COMMIT2 (as previously discussed). Also, the impact of filtering algorithms on CSD-based deterministic and probabilistic tractography cannot be generalized to other tractography algorithms. Hence, changes in the settings of the tractography and filtering algorithms may produce results different from those reported in this study.

Fig. 1. Framework for simulating dMRI data for human brain-like connectome phantoms. Tractography is performed on scanner-acquired dMRI and streamlines connecting pairs of cortical regions are extracted. The strength of extracted connections is adjusted to generate ground-truth structural connectivity. The streamlines corresponding to the adjusted connections along with the segmented images of the three tissues (white-matter, gray-matter, and cerebrospinal fluid) are used to simulate dMRI data using an established diffusion model.

Fig. 2. Simulated dMRI images for two representative connectome phantoms (upper and lower rows). The slices correspond to (a) the absence of gradient directions and (b,c) the presence of two different gradient directions.

Table 1
Single-shell dMRI acquisition and signal parameters used to simulate the connectome phantoms.