Assessing Streamline Plausibility Through Randomized Iterative Spherical-Deconvolution Informed Tractogram Filtering

Tractography has become an indispensable part of brain connectivity studies. However, it is currently facing problems with reliability. In particular, a substantial amount of nerve fiber reconstructions (streamlines) in tractograms produced by state-of-the-art tractography methods are anatomically implausible. To address this problem, tractogram filtering methods have been developed to remove faulty connections in a postprocessing step. This study takes a closer look at one such method, \textit{Spherical-deconvolution Informed Filtering of Tractograms} (SIFT), which uses a global optimization approach to improve the agreement between the remaining streamlines after filtering and the underlying diffusion magnetic resonance imaging data. SIFT is not suitable for judging the plausibility of individual streamlines, since its results depend on the size and composition of the surrounding tractogram. To tackle this problem, we propose applying SIFT to randomly selected tractogram subsets in order to retrieve multiple assessments for each streamline. This approach makes it possible to identify streamlines with very consistent filtering results, which were used as pseudo ground truths for training classifiers. The trained classifier is able to distinguish the obtained groups of plausible and implausible streamlines with an accuracy above 80%. The software code used in the paper and pretrained weights of the classifier are distributed freely via the GitHub repository https://github.com/djoerch/randomised_filtering.


Introduction
Tractography uses data acquired with diffusion-weighted magnetic resonance imaging (DW-MRI) to trace nerve fiber tracts in the brain, producing a model called a tractogram. A tractogram consists of a set of streamlines, each of which represents an assumed nerve fiber through an ordered set of 3-dimensional coordinates [1]. The main goal of tractography can be described as creating a set of streamlines that constitutes a maximally accurate digital representation of the structural connectome [2], which refers to the actual set of nerve fibers in the brain [3]. Applications for tractography include neurosurgery planning [4], the study of neurological diseases [5], and the scientific study of the brain to understand its function and its links to human behavior [6].
By generating many streamlines (usually millions), current tractography methods have been shown to be able to recover all relevant bundles [7], at least if they are run in an appropriate way [8,9]. However, they are also prone to generating erroneous and implausible streamlines. It has been estimated that an average of four false-positive streamlines are generated for each valid streamline in the tractogram [10]. This can hardly be fully avoided because tractography methods with high sensitivity tend to show low specificity and vice versa [11]. In order to take steps towards a reliable model of the connectome, removing implausible connections is crucial.
A whole field of research, commonly called tractogram filtering, has been established as a way to deal with the excess of false-positive streamlines. Tractogram filtering approaches remove streamlines from tractograms by evaluating their plausibility, for example, by considering the geometrical properties of streamlines (e.g., [12]), anatomical constraints (e.g., [13]), through clustering approaches (e.g., [14]), or by correspondence to the underlying DW-MRI data (e.g., [15]). A review of methods is presented in [16].
Identifying implausible streamlines can be posed as a binary classification problem: Each streamline is assigned a "positive" (P) or "negative" (N) label if it appears to be plausible or implausible, respectively. Streamlines with a "negative" label are subsequently removed from the tractogram. The terms "true positive" (TP) and "false positive" (FP) frequently appear in tractography literature as well, even though they express notions that are slightly different from the meaning used in statistics [17]: here, a "true positive" streamline is regarded as a fully correct reconstruction of an actual, plausible nerve fiber bundle, while a "false positive" streamline represents a faulty/noisy reconstruction that does not accurately represent any existing structure in the brain. Therefore, in this work, we use the terms "plausible" and "implausible" interchangeably with "positive" and "negative" or "true positive" and "false positive", even though the concepts are not entirely equal in the classical sense.
One popular tractogram filtering method is Spherical-deconvolution Informed Filtering of Tractograms (SIFT) [15]. This approach belongs to the family of methods that assess tractogram quality by comparing the acquired DW-MRI data to the data expected from the tractogram [16]. In particular, SIFT removes streamlines to increase the consistency of the tractogram with respect to the acquired data based on a global optimization approach. Other examples of such methods include LiFE [18], COMMIT [19], SIFT2 [20], and COMMIT2 [21,22,23]. SIFT does not operate directly on raw data, but instead uses the fiber orientation distribution (FOD) [24]. A tractogram can be filtered by investigating how removing an individual streamline affects the streamline density in each voxel relative to the corresponding local FOD along the streamline. Streamlines that increase the mismatch between streamline density and local FODs are prioritized for filtering. Given the complexity of the computation and the number of streamlines, trying to find a globally optimal solution is infeasible. Therefore, a gradient-descent approach is used to reduce a cost function.
Several criteria for termination of the optimization process are provided by SIFT. By default, streamlines are removed until the cost function gradient of the candidate streamlines becomes sufficiently small. This option is referred to as filtering "to convergence". In the absence of other termination criteria specifying a maximum number of streamlines to be removed, filtering to convergence naturally leads to the most reliable set of remaining streamlines, but also decreases the streamline density the most. In fact, it has been argued that the remaining number of streamlines is not always sufficient for quantitative analyses, and for that reason, the authors of SIFT recommend applying the method to tractograms with a high number of streamlines [25].
Due to the design of SIFT's cost function, streamlines will be removed if the track density along their path is too high to match (parts of) the FODs. Such a mismatch can have different reasons, including at least the following cases: 1. The FOD representation in a voxel does not accurately describe the underlying anatomy. This can happen, for example, when the raw data cannot model and resolve different fiber populations. 2. The streamlines may be (partly) faulty/noisy. This is the target case for filtering, as the streamline is a false positive and must be removed. 3. The streamlines are plausible, but the streamline density in the corresponding voxel fractions is simply exaggerated. It is common for tractography methods to create multiple similar streamlines. We will refer to these streamlines as redundants.
A streamline being rejected by SIFT is therefore not a sufficient indicator of it being implausible. Since SIFT removes streamlines from all three categories, it is, as such, not fully suitable for assessing the correspondence of an individual streamline to the DW-MRI data. The focus of this paper is therefore to distinguish cases two and three.
Machine learning has been used for training classifiers to speed up the processing of expensive tractogram filtering methods [26,27]. Although it is appealing to train a binary classifier to distinguish false-positive from true-positive streamlines based on the raw output of SIFT, this is unfortunately not possible for the above-mentioned reasons. Indeed, redundant and true-positive streamlines share similar features, but SIFT rejects the former and accepts the latter. Thus, for SIFT to be used in classifiers of streamlines, it is necessary to have a method to distinguish between false positives and redundants.
We have found that specific streamlines may be classified differently by SIFT depending on the composition of the tractogram. In this paper, we take advantage of this property with the goal of disentangling false positives from redundant streamlines. More specifically, we apply SIFT on randomized subsets of tractograms to identify both of those streamline groups. Since SIFT seems to yield inconsistent results for certain streamlines when found in different subsets, we spotlight those streamlines and explore how they might be appropriately labeled. We refer to the proposed approach as randomized SIFT (rSIFT).

Datasets
We carried out the experiments on pre-processed data [28] of six different subjects from the young adult data set of the Human Connectome Project [29]. The whole-brain tractograms derived from this data were provided by the authors of [30] and computed with the iFOD2 [31] algorithm. Ten million streamlines were created for each subject, restricted to be between 40mm and 250mm in length, with a step size of 0.625mm. The tracking was further constrained by anatomical priors based on the segmentation of different tissue types in the brain [32]. In order to make computation more feasible, we applied an additional post-processing step in which the streamlines were compressed to smaller sets of coordinates with the method in [33], using a tolerance error of 0.35mm.
The ten million streamlines in each HCP tractogram covered the entire white matter volume. An exemplary depiction of the distribution of streamline length, the number of sampling points, and their correlation for one subject can be seen in Fig. 1. This figure also shows that the streamline length and the number of sampling points are still highly correlated after compression. The tractograms were generated with MRtrix3 [34], with the exception of the compression step, for which the Python library Dipy [35] was used.
We further studied the phantom data from the DiSCo Challenge [36], which was used as ground truth to create datasets with known true-positive, redundant, and false-positive streamlines. The ground truth of this dataset consists of 12 196 streamlines. In our first experiment on this data, we used half of the streamlines as true positives, while the other half was distorted to produce realistic implausible streamlines by applying random rotations in 3D. The rotations were performed by randomly selecting Euler angles in the range of 45-315 degrees, so that the resulting streamlines would not lie too close to the ground truth. We filled the tractogram with false positives to a total of 89 570 streamlines in order to achieve a streamline density similar to the 10 million streamlines used in the HCP case. In a second experiment, half of the ground-truth streamlines were used as true positives, while the second half was systematically copied to produce redundant streamlines. Notice that all streamlines are plausible in this experiment. The second half of the data was split evenly into five further groups: the first fifth of the streamlines was copied once, such that each of them would appear twice in the tractogram. The other groups were copied twice, four times, nine times, and 48 times (filler data), respectively. This resulted in a tractogram with around 89 000 streamlines, including groups of streamlines appearing with very different frequencies.
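The distortion step can be sketched as below: a streamline is rotated by random per-axis Euler angles in the 45-315 degree range. Rotating about the streamline's own centroid is an assumption made for this illustration only; the exact pivot used in the pipeline is not specified above.

```python
import numpy as np

def random_rotation_matrix(rng, low_deg=45.0, high_deg=315.0):
    """Compose a 3-D rotation from per-axis Euler angles drawn
    uniformly from [low_deg, high_deg] degrees."""
    ax, ay, az = np.deg2rad(rng.uniform(low_deg, high_deg, size=3))
    rx = np.array([[1, 0, 0],
                   [0, np.cos(ax), -np.sin(ax)],
                   [0, np.sin(ax),  np.cos(ax)]])
    ry = np.array([[ np.cos(ay), 0, np.sin(ay)],
                   [0, 1, 0],
                   [-np.sin(ay), 0, np.cos(ay)]])
    rz = np.array([[np.cos(az), -np.sin(az), 0],
                   [np.sin(az),  np.cos(az), 0],
                   [0, 0, 1]])
    return rz @ ry @ rx

def distort_streamline(points, rng):
    """Rotate an (N, 3) streamline about its centroid (illustrative
    choice of pivot), yielding a misplaced but realistic copy."""
    points = np.asarray(points, dtype=float)
    center = points.mean(axis=0)
    return (points - center) @ random_rotation_matrix(rng).T + center
```

Since the transformation is rigid, the distorted copy keeps the original length profile and point count, so simple geometric summary statistics cannot separate it from a plausible streamline.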

Randomized SIFT (rSIFT)
Fig. 2 summarizes the method, and Algorithm 1 shows the pseudocode of rSIFT. The following subsections detail the different components of the method. Instead of running SIFT once on the complete tractogram, we applied it several times to random subsets of streamlines. The number of times a streamline was filtered out or kept over the different runs of SIFT was used to define its acceptance rate (AR) per subset size. We refer to each evaluation of a streamline through SIFT as a vote. That means a streamline that was kept in a tractogram after filtering receives a positive vote, and a streamline that was removed receives a negative vote.
In order to examine the influence of not only the composition but also the subset size on the results of the SIFT algorithm, we employ different subset sizes in the rSIFT procedure. The streamline subset sizes were selected to cover a wide range while still producing meaningful results within reasonable computation time.
In pretests, out of the ten million streamlines per HCP tractogram, only around 2-2.5% remained after filtering. The fact that the algorithm terminated through convergence indicates that the remaining set of streamlines was too small to guarantee a stable model and valid removal of streamlines. Any results for streamline sets smaller than that might not be meaningful. Thus, the smallest subset size was chosen to be 2.5% of the original tractogram size, i.e., 250 000 streamlines for the HCP data. We applied a similar procedure to determine a minimum subset size for the DiSCo data, which turned out to be around 16 000 in the tractogram with ground truth and false-positive data and around 13 000 in the tractogram with ground truth and redundant data.
For the first DiSCo data experiment (with implausible streamlines), we used SS = {89 570, 44 785, 22 392, 16 000}, and for the second experiment (with redundant streamlines), SS = {88 996, 44 498, 22 249, 13 000}.

[Algorithm 1: Pseudocode outlining the procedure of rSIFT. For each chosen subset size n, k random subsets of the tractogram are extracted and filtered with SIFT to receive the indices of the accepted and rejected streamlines, subset_P_i and subset_N_i. These are used to update the number of votes.]
Subsequently, rSIFT was run on all of the pre-defined subset sizes n ∈ SS, with a number of repetitions k defined for each n. k was adapted to n such that n × k remained constant. In particular, k = τM/n, with M being the total number of streamlines in the tractogram and τ being a parameter that we set to five in the experiments. This way, we aim to obtain enough votes for each streamline to compute robust statistics (e.g., k = 5 and k = 20 for n = 1 × 10^7 and n = 2.5 × 10^6, respectively, in the experiments with HCP data).
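The sampling-and-voting loop, with k = τM/n repetitions per subset size, can be sketched as follows. Here `run_sift` is a hypothetical stand-in for the external SIFT call (in practice MRtrix3's implementation, not reproduced here): it takes an array of streamline indices and returns the indices kept after filtering.

```python
import numpy as np

def rsift_votes(n_total, subset_sizes, run_sift, tau=5, seed=0):
    """Tally positive/negative votes per streamline over randomized
    SIFT runs.  `run_sift(subset)` is a placeholder for filtering a
    subset of streamline indices and returning the survivors."""
    rng = np.random.default_rng(seed)
    pos = np.zeros(n_total, dtype=int)
    neg = np.zeros(n_total, dtype=int)
    for n in subset_sizes:
        k = int(round(tau * n_total / n))  # k adapted so n * k stays constant
        for _ in range(k):
            subset = rng.choice(n_total, size=n, replace=False)
            kept = set(int(i) for i in run_sift(subset))
            for idx in subset:
                if int(idx) in kept:
                    pos[idx] += 1  # streamline was kept: positive vote
                else:
                    neg[idx] += 1  # streamline was removed: negative vote
            # streamlines outside `subset` receive no vote in this run
    return pos, neg
```

With τ = 5 and M = 10^7, this reproduces the schedule k = 5 for n = 10^7 and k = 20 for n = 2.5 × 10^6.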
Notice that, since each streamline subset is drawn from the complete tractogram, the number of occurrences across all subsets varies between streamlines. That means that the number of received votes is expected to differ between streamlines and could, for some, even be zero.
After completing the filtering procedure, the total numbers of positive and negative votes of a streamline s for subset size n, denoted as P_n(s) and N_n(s), respectively, were used to compute its acceptance rate AR_n as AR_n(s) = 100% × P_n(s) / (P_n(s) + N_n(s)). In addition to this, AR(s) (without n) refers to the acceptance rate of a streamline compiling the votes received over all subset sizes. For the analysis of the method, we analyzed the distribution of AR_n over different choices of n.
As mentioned, finding an anatomically reliable ground truth is an ongoing challenge in tractogram filtering. Having this in mind, our focus here is limited to extending the notion of streamline acceptance that is possible with SIFT. Streamlines with AR = 100% and AR = 0% are likely to be plausible and implausible, respectively, since all runs of SIFT are consistent for these streamlines regardless of the tractogram configuration they are in. That means that these streamlines can be used to generate a pseudo ground truth of a plausible and an implausible class of streamlines in the sense of SIFT. The plausibility of the remaining streamlines with less consistent results is harder to assess solely based on their AR. Thus, we group them under the label inconclusive. Notice that inconclusive streamlines may consist of a mix of plausible and implausible streamlines that SIFT cannot consistently detect or that SIFT has not evaluated in all subset sizes. In summary, we separate the streamlines in a tractogram into three classes: 1. plausible, where AR = 100%; 2. implausible, where AR = 0%; 3. inconclusive, where 0% < AR < 100%.
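Compiling votes into acceptance rates and the three classes can be written compactly as below (a minimal sketch; the extra `unvoted` case covers streamlines that never appeared in any subset):

```python
import numpy as np

def acceptance_rate(pos, neg):
    """AR(s) = 100% * P(s) / (P(s) + N(s)); NaN if a streamline
    received no votes at all."""
    pos, neg = np.asarray(pos), np.asarray(neg)
    total = pos + neg
    ar = np.full(total.shape, np.nan)
    voted = total > 0
    ar[voted] = 100.0 * pos[voted] / total[voted]
    return ar

def rsift_label(ar):
    """Map an acceptance rate (in percent) to an rSIFT class."""
    if np.isnan(ar):
        return "unvoted"
    if ar == 100.0:
        return "plausible"
    if ar == 0.0:
        return "implausible"
    return "inconclusive"
```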
We use the distinction of plausible and implausible streamlines as pseudo ground truth for training the classifiers described in the next section.

Neural network-based streamline classifier
We designed rSIFT with the goal of providing useful information about the plausibility of individual streamlines. In order to gain insight into the properties of the streamline grouping based on plausible, implausible, and inconclusive labels, we analyzed the performance of a neural network classifier trained on these labels in different scenarios. In doing so, we simultaneously show the feasibility of training a neural network model that mimics the characteristics of rSIFT, and thus provide a computationally efficient alternative to the originally proposed method.
In our experiments, we investigated the composition of the three rSIFT labels by means of their separation based on neural network classifiers. First, we trained models for the pair-wise separation of labels, i.e., plausible (positive) vs. implausible (negative), plausible vs. inconclusive, and implausible vs. inconclusive streamlines. Second, we trained a multi-class model to distinguish between all three rSIFT labels simultaneously.
The training of all classifiers was performed using a 5-fold cross-validation (CV) approach on the pooled set of streamlines from two HCP subjects. We report the average performance of all CV models on the respective validation set. For testing, we chose the best-performing model in terms of balanced sensitivity and specificity in order to minimize bias. This model was used to evaluate streamlines from four unseen HCP subjects, which we refer to as test data.
In order to make the classifier applicable to data from multiple subjects and to provide a basic image registration, streamline coordinates were normalized to the range of -1 to 1 in each dimension using the minimum and maximum streamline coordinates per subject. Since the input streamlines comprised a varying number of points, they were resampled to the same number of points in order to be used as input for the network. To minimize the impact of this resampling on as many streamlines as possible, the median number of streamline points across training subjects, which was determined to be 22 points, was chosen as the resampling target. Linear interpolation was applied to approximate the original streamline geometry.
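These preprocessing steps can be sketched as follows. Linear interpolation along cumulative arc length is used here as one concrete realization of the resampling; the exact parameterization is an implementation detail and an assumption of this sketch.

```python
import numpy as np

def normalize_coords(points, lo, hi):
    """Map coordinates into [-1, 1] per dimension using the
    subject-wide minimum (lo) and maximum (hi) coordinates."""
    points = np.asarray(points, dtype=float)
    return 2.0 * (points - lo) / (hi - lo) - 1.0

def resample_streamline(points, n_out=22):
    """Linearly resample an (N, 3) streamline to n_out points,
    parameterized by cumulative arc length."""
    points = np.asarray(points, dtype=float)
    seg = np.linalg.norm(np.diff(points, axis=0), axis=1)
    t = np.concatenate([[0.0], np.cumsum(seg)])      # arc-length positions
    t_new = np.linspace(0.0, t[-1], n_out)           # evenly spaced targets
    return np.column_stack([np.interp(t_new, t, points[:, d]) for d in range(3)])
```

Resampling preserves the streamline endpoints exactly, so streamline length is unaffected up to the interpolation error of intermediate points.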
Each classifier was composed of two 1-D convolution layers (kernel sizes 5 and 3), with ReLU activation and max-pooling (pool size 2) applied after each of them. The convolutional layers were followed by a dense layer with a dropout chance of 0.5 and connected to either one or multiple output neurons (depending on the classifier type being binary or multi-class). For the binary type, the last layer used a sigmoid activation function, while the multi-class classifier used softmax. Therefore, both classifier architectures were identical except for the number of output neurons (one vs. multiple) and the corresponding activation functions. An illustration of the structure of the classifiers is shown in Fig. 3. The input to the network consisted of a one-dimensional array of the coordinates of one streamline, reordered in an interleaved manner such that the x-, y-, and z-coordinates of the same point follow each other consecutively (i.e., x1, y1, z1, x2, y2, z2, ...).
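The interleaved input layout corresponds to a plain row-major flattening of the (points × 3) coordinate array, e.g.:

```python
import numpy as np

def interleave_coords(streamline):
    """Flatten an (N, 3) array of points into the 1-D input vector
    (x1, y1, z1, x2, y2, z2, ...); NumPy's row-major order does
    exactly this."""
    return np.asarray(streamline, dtype=np.float32).reshape(-1)
```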
Since the data labels were determined based on a filtering method that rejects the vast majority of streamlines, a large disproportion in the number of training samples available for each class was expected. Thus, a balanced data generator was used to oversample data of plausible streamlines for network training and testing. The epoch size was defined by the larger of the classes such that each training sample was seen at least once per epoch. The smaller set of samples was shuffled and re-used whenever it was exhausted during an epoch. Each classifier was trained towards maximizing accuracy using the Adam optimizer [37] with default settings in TensorFlow 2.8.0 [37]. The training was done in batches of 50 samples each (except for the categorical classifier with three classes, which used a batch size of 60 such that each batch could contain an equal number of samples from each class). Five training epochs were determined to be sufficient since accuracy and loss saturated quickly and showed almost no improvement after two epochs.
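The balanced-generator logic can be sketched as below: a hypothetical index-level version in which one epoch walks over the majority class once while the minority indices are reshuffled and re-used whenever exhausted (a trailing remainder smaller than half a batch is dropped in this sketch).

```python
import numpy as np

def balanced_batches(majority_idx, minority_idx, batch_size, rng):
    """Yield (majority, minority) index batches of equal size.  The
    epoch is defined by the majority class; minority indices are
    shuffled and re-used (oversampled) whenever exhausted."""
    majority = list(majority_idx)
    rng.shuffle(majority)
    pool = list(minority_idx)
    rng.shuffle(pool)
    j = 0
    half = batch_size // 2
    for start in range(0, len(majority) - half + 1, half):
        maj = majority[start:start + half]
        mins = []
        while len(mins) < half:
            if j == len(pool):          # minority set exhausted: reshuffle
                rng.shuffle(pool)
                j = 0
            mins.append(pool[j])
            j += 1
        yield maj, mins
```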

Randomized SIFT on HCP data
Figs. 4 and 6 show the distribution of streamlines for the different subset sizes and AR ranges for the six HCP subjects. As shown, most of the streamlines received either AR_n = 0% or AR_n = 100% after rSIFT for all n. Notice that the fraction of streamlines with 0% < AR_n < 100% increased when the subsets became smaller. Moreover, using smaller subset sizes led to a substantial increase in streamlines with AR_n = 100% even after repeated evaluation: the number of streamlines with exclusively positive votes rose from an average of 2.63% to 14.86% when comparing the largest and the smallest subset sizes. Conversely, the streamlines with AR_n = 0% were reduced from 96.64% to 53.67%. As shown in the last column of Fig. 4, the numbers of plausible and implausible streamlines for the whole dataset are around 1.7% and 53.7%, which means that approximately 44.6% of the streamlines can be considered inconclusive. Interestingly, SIFT yielded inconclusive results for on average 0.7% of the streamlines when the complete tractogram was used (i.e., subset size of ten million) in four of the HCP subjects.
As mentioned, the rSIFT procedure does not guarantee the same number of votes for each streamline. Thus, we assessed the distribution of ARs for streamlines with exactly five votes, as shown in Fig. 5. Although the values are slightly different in this case, they follow a very similar trend. Due to the entirely random choice of streamlines for each round of the experiment, we found that the number of streamlines that were not included in any of the subsets of a given subset size was negligible. They were not included in Fig. 4. Every streamline received on average 35 votes over all experiments, with an average of five votes per subset size.
We were further interested in assessing the influence of the streamline length on the filtering. As shown in Fig. 7, SIFT strongly influenced the distribution of streamline lengths. In fact, streamlines deemed plausible were among the shortest and barely exceeded a length of 125mm. The same pattern is shown for rSIFT in Fig. 8. Notice that the new class of inconclusive streamlines covers the whole range of lengths.

[Caption: Each cell shows the mean ratio of streamlines across HCP subjects, with standard deviations in parentheses. In the last column, percentages are computed over all experiment instances and subset sizes, thus denoting AR instead of AR_n. Note that the AR intervals do not include the particular lower bound and include the upper bound, except for 80-100%, where 100% is also excluded. Dashed lines show the respective ARs considering all subset sizes at once for AR = 0% (in grey) and AR = 100% (in orange). As shown, streamlines tend to get higher AR in smaller subset sizes.]

Randomized SIFT on the DiSCo data
Fig. 9 summarizes the ARs of rSIFT on the DiSCo data with false positives, considering all subset sizes. Although SIFT filters out false positives more often, it also filters out true positives. As shown, selecting a single AR threshold for distinguishing between true and false positives is challenging, since it involves a compromise between filtering out as few plausible streamlines as possible at the cost of accepting more implausible ones, or filtering out more plausible ones to minimize the false positives. For example, using a threshold of 20% will lead to a rejection of 64.8% of false positives at the cost of losing 11.4% of true positives in this experiment. In turn, a threshold of 80% will reject 87.7% of false positives but, at the same time, almost half of the true positives (48.5%). As in the case of the HCP data, independently analyzing the streamlines on which SIFT is not consistent might be beneficial.

Fig. 10 shows the results of the experiment with DiSCo data with redundant streamlines. Redundant streamlines are filtered out more often when they appear more frequently in the tractogram. For example, the amount of streamlines with AR = 0% increases from 13.5% to 27.5% when their multiplicity grows from 2 to 49. However, this increment in the amount of rejected streamlines is relatively slow. For example, for AR = 0%, the amount of streamlines only increases by 3.4% (13.5% vs. 16.9%) when increasing the number of redundant copies from 2 to 10. Redundancy has a larger effect at the other end: the number of streamlines with AR = 100% decreases rapidly with the number of redundant copies. Comparing Figs. 9 and 10, it can be seen that SIFT rejects true positives more often when there are no false positives. As a comparison, around 53% of streamlines (6 483 of 12 196) remain in the tractogram when SIFT is used only once on the DiSCo ground truth streamline set.

Classification performance
An overview of the performance of the three binary classifiers on HCP data can be found in Fig. 11.
As shown, the binary classifier for plausible and implausible streamlines was around 80% accurate on the validation folds, with similar accuracy for both classes. This shows that the procedure for balancing the classes was appropriate for this task. In turn, the performance of the binary classifiers trained to distinguish inconclusive streamlines from plausible or implausible ones was much lower and very similar (accuracy of 67.80% and 66.09%, respectively). As mentioned, we also evaluated the best-performing CV model for distinguishing negative and positive streamlines on the four unseen subjects. As shown in Fig. 11, the results of this test were on par with the five-fold CV experiment, with a mean accuracy of 80.2%.
Figs. 12 and 13 show the performance of the multi-class classifier in terms of accuracy-based metrics and the confusion matrix, respectively. This network was trained to recognize the three classes 'P', 'N', and 'I'. While it showed some success on the sets of plausible and implausible streamlines (true positive rate (TPR) and true negative rate (TNR) around 70% in Fig. 12; column maximum for 'P' and 'N' on the diagonal of the confusion matrix in Fig. 13), its performance remained at chance level for the inconclusive class (true inconclusive rate (TIR) less than 1/3 for three classes in Fig. 12; similar values in all entries of column 'I' in the confusion matrix in Fig. 13). Therefore, it seems that this classifier is unable to tell the inconclusive streamlines apart from plausible or implausible ones in our pseudo ground truth.
For further analyses, we chose the binary model between plausible and implausible streamlines from the cross-validation fold that yielded the most balanced sensitivity/specificity values (81.26% and 83.15%). Raw classifier scores for test samples from the positive class showed a mean of around 0.71 and a median of around 0.76, while the negative test samples received a mean score of around 0.28 and a median of around 0.21. These results suggest that the classifier was relatively confident about the results. However, the mismatch between mean and median hints at the presence of outlier streamlines. Accordingly, classification scores had both a mean and a median of around 0.48 when the classifier was applied to the inconclusive set, which was not part of the training data.

[Figure 13: Confusion matrix for the multi-class task involving samples from all three sets from rSIFT. "N" denotes "negative" (implausible), "P" "positive" (plausible), and "I" inconclusive. The cells show the mean percentage of streamlines belonging to this case in cross-validation, with the standard deviation in parentheses. The columns show the original labels, and the rows show the label given by the classifier. Each column sums up to 100%.]
We further split this set into two groups using a threshold of 0.5 on the classification score to obtain sets of leaning-positive and leaning-negative streamlines. The mean scores of these two groups were 0.72 and 0.21, respectively, i.e., slightly more off-center for the negative class.
As mentioned before, SIFT tends to filter out long streamlines. Figs. 14 and 15 show the performance of the binary classifier (P vs. N) with respect to the length of the streamlines. It is apparent that longer streamlines of the positive class were misclassified more frequently (cf. Fig. 14), and the same applies to the shorter streamlines of the negative class (cf. Fig. 15). Applying the P vs. N classifier to the inconclusive set, we found that, also here, shorter streamlines tended to be classified as plausible and longer ones as implausible (cf. Fig. 16).

[Figure 17: Scatter plot relating classifier scores of streamlines and their acceptance rate from rSIFT. The data points (i.e., streamlines) in the plot are sampled in a balanced manner such that all classes (i.e., plausible, implausible, and inconclusive) appear with the same frequency, where blue represents the positive, orange the negative, and gray the inconclusive class. As seen, scores for the positive and negative classes accumulate around the ends of the spectrum. For the inconclusive samples, there is a weak correlation of 0.28 (depicted with the trend line) between scores and AR from rSIFT.]

Comparison between classifier scores and rSIFT
The correlation coefficient for the relation between classifier scores and the ARs of inconclusive streamlines in rSIFT was moderately positive, with a value of 0.28. Fig. 17 depicts the scores given by the classifier to each class and how they compare to the vote distribution from rSIFT.
In order to visualize clustering patterns of the three different sets, we used t-distributed stochastic neighbor embedding (t-SNE) [38], which enables high-dimensional input vectors to be projected onto a 2-dimensional plane. This way, hidden structures in the data as well as the closeness of data points in high-dimensional space can be made visible. We applied t-SNE to the streamlines' feature vectors taken from the fourth (pre-dense) network layer to examine whether the streamline groups were clustered together. For the sets of streamlines, we used the pseudo ground truth and inconclusive streamlines with 0% ≤ AR < 20% or 80% < AR ≤ 100%, respectively. These inconclusive streamlines are expected to be closer to the pseudo ground truth data.
The t-SNE plots were computed with a learning rate and perplexity of 100 each and 800 iterations. The comparison of t-SNE results with and without inconclusive streamlines using balanced sampling can be seen in Fig. 18.
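A minimal, self-contained sketch of this kind of analysis with scikit-learn is shown below. The feature vectors here are synthetic stand-ins for the pre-dense-layer activations, and a smaller perplexity is used to match the toy sample size (our plots used perplexity and learning rate 100 with 800 iterations).

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
# Hypothetical stand-ins for pre-dense-layer feature vectors of the
# three streamline groups (16-D features, 40 samples per group).
features = np.vstack([
    rng.normal(0.0, 1.0, size=(40, 16)),   # "plausible"
    rng.normal(3.0, 1.0, size=(40, 16)),   # "implausible"
    rng.normal(1.5, 1.0, size=(40, 16)),   # "inconclusive"
])
# Project the features onto a 2-D plane for scatter plotting.
embedding = TSNE(n_components=2, perplexity=20, learning_rate=100.0,
                 init="random", random_state=0).fit_transform(features)
```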
As shown, the positive and negative streamlines formed multiple small clusters which were positioned close to each other (cf. Fig. 18a). When additionally presented with samples from the inconclusive class, small differences were noticeable depending on whether samples with low or high AR were used. Especially the latter seemed to be closer to the clusters of the plausible streamlines (cf. blue and black dots in Fig. 18c).

Discussion
Tractogram filtering based on methods like SIFT offers the opportunity to improve the match of a given tractogram with the measured diffusion MRI data. However, the ability of SIFT to assign binary labels of streamline plausibility is inherently limited. In this paper, we propose a randomized iterative adaptation, called rSIFT, as a way to address this limitation of SIFT. We analyzed rSIFT in experiments on human and phantom data and further employed neural network-based streamline classifiers to characterize the properties of the output of our method. Additionally, the trained classifiers underline the potential for tractogram filtering to be performed by neural networks at a reduced computational cost. In the following subsections, we discuss different aspects of our findings in further detail.

SIFT
Our experiments highlight how the filtering performed by SIFT depends not only on the characteristics of a streamline but also on the size and composition of the streamline set used as the input. As shown in the presented results, the same streamline can be accepted and rejected in different SIFT runs depending on the streamlines contained in the analyzed tractogram. This ambiguity, reflected in the inconclusive class from rSIFT, is more prominent with smaller tractograms. Through its randomized and iterative approach, rSIFT is designed to exploit this dependency of the SIFT output on the composition of the input tractogram.
This dependency on the input tractogram is rooted in the global optimization approach for improving the consistency between tractogram and diffusion data. It implies that the focus of SIFT lies in fitting all streamlines simultaneously to the diffusion data instead of assessing an individual streamline alone. As shown in the experiments, filtering a tractogram containing fewer streamlines with SIFT directly leads to an increase in accepted streamlines (2.6% for the whole tractogram vs. 14.9% for the smallest subset size), a decrease in rejected ones (96.6% vs. 56.8%, respectively), and an increase in streamlines for which SIFT gives mixed results (0.7% vs. 28.4%). One possible explanation is that smaller tractograms might contain fewer redundant streamlines, making it more likely for streamlines to be accepted.
As a consequence, we argue that SIFT in its original form, with only a single run, is not well suited for assessing the plausibility of individual streamlines with respect to the diffusion data. In fact, for some of the HCP tractograms, the results differed even when the set of streamlines did not: in the experiment instance where the complete tractogram was repeatedly filtered (i.e., subset size 10 million), around 0.7% of the streamlines received mixed votes even though the overall tractogram remained the same, merely shuffled in order due to our random sampling of streamlines.
A general limitation of SIFT is that the method is indifferent to why streamlines do not match the data. As mentioned in the introduction, anatomical implausibility is just one possible reason; another possible explanation is that the streamline in question contributes to a white matter fiber bundle that has been reconstructed with exaggerated streamline density compared to other bundles, relative to the measured data. This is supported by our findings on the DiSCo data: as shown in Fig. 10, redundant streamlines are rejected more often when they appear with a higher frequency in the dataset. At the same time, SIFT rejects true positives more often when there are no false positives present in the tractogram. Together, this shows that SIFT rejects not only implausible but also plausible and redundant streamlines. However, for purposes such as finding training samples for a machine learning classifier, or for combining several filtering methodologies in an ensemble fashion as showcased in [27], there is a need for methods that distinguish between plausible and implausible streamlines more reliably than a single run of SIFT.

Randomized SIFT
The proposed method, rSIFT, is a randomized, iterative adaptation of SIFT with the aim of separating plausible (including redundant) streamlines from implausible ones. Through the repeated application of SIFT to random tractogram subsets, we defined distilled sets of streamlines that are consistently categorized as either fitting the diffusion data well or not, and are thus deemed plausible or implausible in the sense of SIFT. Despite the large number of votes per streamline (35 on average), the repeated SIFT assessments were consistent for around half of the streamlines in the HCP tractograms. Based on this, these streamlines might also possess distinct characteristics that a machine learning classifier may be able to learn. Thus, they are suitable for creating a pseudo ground truth of plausible and implausible streamlines with consistent characteristics within each of the two label groups, as we did in this study.
On the other hand, around half of the streamlines were inconclusive, with ARs between 0% and 100%. Some of these streamlines may be outliers that received mostly consistent votes. Apart from those, we suspected that the group of inconclusive streamlines also contains a large number of redundant streamlines. By choosing random subsets of the tractogram and thus omitting potentially similar streamlines that would "rival" a candidate redundant streamline, our filtering procedure allows such streamlines to receive more positive votes than in the context of the full tractogram. Indeed, our experiments show that this group of inconclusive streamlines likely contains a mixture of both redundant, i.e., actually plausible, and implausible streamlines. Naturally, alternative choices of AR thresholds for the separation of plausible, inconclusive, and implausible streamlines are possible. In fact, our results show that streamlines in the inconclusive group with higher and lower AR (e.g., higher than 80% and lower than 20%) share similar characteristics with plausible and implausible streamlines, respectively. This is a promising avenue for future research to better understand the composition of the group of inconclusive streamlines.
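The vote aggregation and AR-based grouping at the core of rSIFT can be illustrated with a short sketch. Function and variable names here are hypothetical for illustration, not taken from the released code:

```python
def acceptance_rates(votes):
    """Compute per-streamline acceptance rates from repeated SIFT runs.

    `votes` maps a streamline id to a list of booleans, one per SIFT run
    in which the streamline was sampled (True = accepted by that run).
    """
    return {sid: sum(v) / len(v) for sid, v in votes.items() if v}


def group_by_ar(ar, low=0.0, high=1.0):
    """Split streamlines into implausible / inconclusive / plausible.

    With the defaults, only AR = 0% is implausible and only AR = 100%
    plausible, everything in between inconclusive; the thresholds can be
    relaxed (e.g., to 0.2 / 0.8) as discussed in the text.
    """
    groups = {"implausible": [], "inconclusive": [], "plausible": []}
    for sid, r in ar.items():
        if r <= low:
            groups["implausible"].append(sid)
        elif r >= high:
            groups["plausible"].append(sid)
        else:
            groups["inconclusive"].append(sid)
    return groups


# Toy example: three streamlines with their votes from five SIFT runs.
votes = {0: [True] * 5, 1: [False] * 5, 2: [True, False, True, True, False]}
ar = acceptance_rates(votes)
print(group_by_ar(ar))  # {'implausible': [1], 'inconclusive': [2], 'plausible': [0]}
```

In the actual pipeline, each streamline accumulates votes only from the random subsets it was sampled into, so the number of votes varies per streamline (35 on average in our HCP experiments).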
In general, it needs to be noted that the labeling of a streamline as "plausible" through SIFT represents more of an absence of the label "implausible". One could even argue that those plausible streamlines only remained in the tractogram because other streamlines proved to be even less plausible before filtering was terminated. However, we consider this unlikely since, in our experience, SIFT filters streamlines quite strictly compared to other tractogram filtering methods, leading to the removal of a significant number of streamlines. For example, it is notable that SIFT kept only 53% of the streamlines in the original DiSCo data. This value was even lower for AR=100% after rSIFT (35.0% and 33.2% for the false-positive and redundancy experiments). This means that SIFT may, in general, be too restrictive, and a threshold on AR can be set to a lower value (e.g., 80%) to distinguish between plausible and implausible streamlines. Of course, such a threshold involves a trade-off between rejected false positives and removed true positives.
It is important to emphasize that, as discussed in Smith et al. [39], SIFT results tend to correlate with streamline length. Since rSIFT builds on SIFT, this characteristic is inherited: both rSIFT and the trained classifiers show this tendency to keep shorter streamlines. It would be interesting to assess whether this correlation is a bias of SIFT or whether longer streamlines are indeed more likely to be implausible.
Finally, notice that while rSIFT is based on SIFT, they have different goals and applications. While SIFT assesses the consistency of the whole tractogram with respect to the measured data, rSIFT augments this goal with an assessment of the stability of these results and thus becomes more specific to individual streamlines.

rSIFT neural network classifier
We employed different neural network classifiers to investigate the inconclusive streamlines in rSIFT as well as to showcase the potential of such machine learning-based approaches for the purpose of reducing the computational burden of our method.
The classifier trained to tell apart the plausible and implausible samples showed good accuracy. This indicates that some distinct structural information separating the two sets of streamlines was recognized, even in data samples unseen during training. Since rSIFT is time-consuming, it is relevant to assess whether the obtained classification scores could be used as an alternative to the SIFT-based method. The distributions of classification scores show that it is unlikely for a plausible streamline to receive a classifier score lower than the average (0.28) or median (0.21) scores for implausible streamlines. Considering the difference in the size of the groups of plausible and implausible streamlines, we would argue that streamlines with scores lower than approximately 0.2 could be safely removed. Such a strategy could significantly reduce the number of false-positive streamlines in a tractogram and would require little computational effort once the classification neural network has been trained.
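Such a score-based filtering step is straightforward once classifier scores are available. The sketch below is a hypothetical illustration of the thresholding strategy, with made-up scores in place of real network outputs:

```python
import numpy as np

def filter_by_score(scores, threshold=0.2):
    """Keep streamlines whose classifier score exceeds the threshold.

    `scores` are the sigmoid outputs of the binary plausibility
    classifier; the 0.2 cutoff follows the discussion in the text
    (just below the median score of implausible streamlines).
    Returns the indices of streamlines to keep.
    """
    scores = np.asarray(scores)
    return np.flatnonzero(scores > threshold)

# Made-up scores for six streamlines; in practice these would be the
# network's predictions for a whole tractogram.
scores = [0.05, 0.15, 0.35, 0.9, 0.18, 0.6]
keep = filter_by_score(scores)
print(keep.tolist())  # [2, 3, 5]
```

The kept indices can then be used to write out a filtered tractogram, avoiding any further SIFT runs at inference time.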
Since the performance of the binary classifiers distinguishing plausible from implausible streamlines carried over to unseen data (of plausible/implausible samples), they may also be suited to identify such information in the streamlines assigned to the inconclusive group. However, employing a similar network architecture, the three-class network was unable to recognize the inconclusive samples (cf. TPR and TNR in Fig. 12). The same tendency was found for the binary approaches distinguishing inconclusive from plausible or implausible streamlines, respectively (in both cases, accuracy, sensitivity, and specificity decreased significantly compared to the N vs. P case, as seen in Fig. 11). This suggests that there is no inherent structural difference between inconclusive and likely plausible or implausible samples obtained through rSIFT. The findings of the t-SNE experiment further show that inconclusive streamlines are clustered together with the likely plausible and implausible samples. Additionally, their acceptance rate seems to be related to whether they lie closer to one group or the other. Taking this into account, as well as the performance of the classifiers, it can be inferred that the inconclusive samples are a mixture of plausible (including redundant) and implausible streamlines.
Notice that the binary classifier (plausible vs. implausible) yielded good results in both cross-validation and classification of unseen data despite the fact that its only input is the streamline coordinates. In contrast, SIFT considers the FOD data. This observation is consistent with the results in [27], which showed that the coordinates of the streamlines are the most important features for predicting the outcome of different tractogram filtering methods, followed by diffusion data.
Further, the presented classification results were achieved with a relatively simple neural network architecture. In preliminary experiments, we explored variations of the classifier architecture (e.g., adding more layers or batch normalization) without a large improvement in performance. However, building more sophisticated classifiers achieving even better accuracy is a direction of research that may be explored further in the future. For datasets that suffer from severe class imbalance (such as our training data), specific care may be put into recognizing the false-positive samples in order to help decrease their number in comparison to the true positives.

Future work
There are many avenues for extending the current study. While we focused our experiments on SIFT, the same methodology can be applied to similar tractogram filtering methods such as LiFE [18], COMMIT [19], SIFT2 [20], or COMMIT2 [21,22,23]. It would be particularly interesting to investigate whether the presented findings generalize to these methods as well. Further performance evaluation with other tractography methods and with images acquired in clinical settings is also relevant.
Regarding the classifier, one possibility is to add different input features, as done, e.g., in [40,27], which describe the streamline's structure in a more sophisticated manner, or to add diffusion data [41,27].
As discussed previously, rSIFT can be used to assess streamline plausibility or to generate training data for machine learning-based classifiers, as we did in this paper. An additional interesting application would be to combine filtering and tractography in order to improve the quality of the tractogram during its generation.

Conclusion
In this paper, we proposed rSIFT, a randomized and iterative adaptation of SIFT that allows assessing the plausibility of individual streamlines with improved specificity. rSIFT was used to generate pseudo ground truths for the training of machine learning-based classifiers. These classifiers were used to study the characteristics of different types of streamlines (plausible, implausible, and inconclusive) and to speed up the computations. We showed how the AR from rSIFT or the classification scores can be used to distinguish plausible from implausible streamlines. Streamlines with inconclusive results from rSIFT are likely to be a mixture of redundant and implausible streamlines.

Figure 1:
Figure 1: Distribution of the number of points and length of streamlines in an exemplary tractogram of ten million streamlines. Top: histogram of the number of points per streamline. Middle: histogram of streamline length in mm. Bottom: correlation plot between streamline length and the mean number of sampling points with standard deviation (light blue).

Figure 2:
Figure 2: Pipeline of rSIFT. Top: SIFT is run on different random subsets of the original tractogram. The numbers of positive and negative votes are used to estimate the acceptance rate (AR) of every streamline. Bottom: streamlines with AR=100% and 0% are used to compute a pseudo ground truth of streamlines. The remaining streamlines are inconclusive.

Figure 3:
Figure 3: Architecture of the classifiers. Left: binary classifier. Right: multi-class classifier. The only difference is in the last layer. The binary classifier uses one output neuron with sigmoid activation, while the multi-class classifier uses multiple output neurons with softmax activation.

Figure 4:
Figure 4: Distribution of streamlines per acceptance rate AR_n and different subset sizes n. Each cell shows the mean ratio of streamlines across HCP subjects, with standard deviations in parentheses. In the last column, percentages are computed over all experiment instances and subset sizes, thus denoting AR instead of AR_n. Note that the AR intervals exclude the particular lower bound and include the upper bound, except for 80-100%, where 100% is also excluded.

Figure 5:
Figure 5: Distribution of streamlines per vote combination and subset size. Only combinations of exactly 5 votes are considered. "P: 0, N: 5" refers to "0 positive, 5 negative votes", with the other rows labeled analogously. The cell values show the mean ratio of streamlines across HCP subjects, with standard deviations in parentheses.

Figure 6:
Figure 6: Evolution of acceptance rates (AR_n) from rSIFT with the subset size n. Each curve shows the percentage of streamlines with the respective AR, averaged over all six HCP subjects. Standard deviations are shown in transparent (zoom in for details). Dashed lines show the respective ARs considering all subset sizes at once for AR=0% (in grey) and AR=100% (in orange). As shown, streamlines tend to get higher AR with smaller subset sizes.

Figure 7:
Figure 7: Distribution of streamlines accepted and rejected by SIFT with respect to their length for one exemplary subject with ten million streamlines.

Figure 8:
Figure 8: Distribution of streamlines labeled plausible (accepted), implausible (rejected), and inconclusive by rSIFT with respect to their length for one exemplary subject with ten million streamlines.

Figure 9:
Figure 9: Distribution of acceptance rate (AR) for true positive (TP) and false positive (FP) streamlines in the DiSCo dataset after running rSIFT with τ = 5 and four different subset sizes. The range 80-100% excludes AR = 100%. Note that the values in the columns TP and FP each sum to 100%.

Figure 10:
Figure 10: Distribution of acceptance rate (AR) for true positive (TP) and redundant (R) streamlines in the DiSCo dataset after running rSIFT with τ = 5 and four different subset sizes. The number in parentheses indicates the number of replications per streamline, i.e., how often they were found in the tractogram. The range 80-100% excludes AR = 100%. Each column sums up to 100%.

Figure 11:
Figure 11: Classifier performance for binary classification. "N" denotes "negative" (implausible), "P" "positive" (plausible), and "I" inconclusive. The metrics are specified through the mean values as determined by 5-fold cross-validation (CV) in training or tests with subjects unseen during training, with standard deviations in parentheses.

Figure 14:
Figure 14: Histogram of streamline lengths of positive (P) streamlines, grouped by the streamline labels obtained from the binary positive vs. negative (plausible vs. implausible) classifier. The streamlines were taken from one test subject unseen in training. Longer streamlines are more frequently misclassified (i.e., labeled differently by the classifier than by rSIFT). Negative classifications outnumber positive classifications above a streamline length of around 75 mm.

Figure 15:
Figure 15: Histogram of streamline lengths of negative (N) streamlines, grouped by the streamline labels obtained from the binary positive vs. negative (plausible vs. implausible) classifier. The streamlines were taken from one test subject unseen in training. Shorter streamlines are more frequently misclassified (i.e., labeled differently by the classifier than by rSIFT). Positive classifications outnumber negative classifications below a streamline length of around 60 mm.

Figure 16:
Figure 16: Histogram of streamline lengths of inconclusive (I) streamlines, grouped by the streamline labels obtained from the binary positive vs. negative (plausible vs. implausible) classifier. The streamlines were taken from one test subject unseen in training. Shorter streamlines are more frequently classified as positive, and longer streamlines more frequently as negative, with an equal amount of positive and negative predictions around a streamline length of around 65 mm.

Figure 18:
Figure 18: t-SNE results. Orange dots represent streamlines from the implausible pseudo ground truth, blue dots represent streamlines from the plausible pseudo ground truth, and black dots show inconclusive streamlines. The upper row does not include inconclusive samples. In the middle row, inconclusive samples with an acceptance rate AR ≤ 20% in rSIFT were used. The bottom row shows the results with inconclusive streamlines with AR ≥ 80%.