Combined tract segmentation and orientation mapping for bundle-specific tractography

While the major white matter tracts are of great interest to numerous studies in neuroscience and medicine, their manual dissection in larger cohorts from diffusion MRI tractograms is time-consuming, requires expert knowledge and is hard to reproduce. In previous work we presented tract orientation mapping (TOM) as a novel concept for bundle-specific tractography. It is based on a learned mapping from the original fiber orientation distribution function (fODF) peaks to tract specific peaks, called tract orientation maps. Each tract orientation map represents the voxel-wise principal orientation of one tract.Here, we present an extension of this approach that combines TOM with accurate segmentations of the tract outline and its start and end region. We also introduce a custom probabilistic tracking algorithm that samples from a Gaussian distribution with fixed standard deviation centered on each peak thus enabling more complete trackings on the tract orientation maps than deterministic tracking. These extensions enable the automatic creation of bundle-specific tractograms with previously unseen accuracy. We show for 72 different bundles on high quality, low quality and phantom data that our approach runs faster and produces more accurate bundle-specific tractograms than 7 state of the art benchmark methods while avoiding cumbersome processing steps like whole brain tractography, non-linear registration, clustering or manual dissection. Moreover, we show on 17 datasets that our approach generalizes well to datasets acquired with different scanners and settings as well as with pathologies. The code of our method is openly available at www.github.com/MIC-DKFZ/TractSeg.


Introduction
The white matter of the human brain is made up of a large number of individual fiber tracts.Those tracts overlap, resulting in multiple fiber orientation distribution function (fODF) peaks per voxel and larger bottleneck situations with tracts per voxel outnumbering the peaks per voxel.In consequence, tractography is highly susceptible to false positives [17,16].The only safe solution around false positives so far is the explicit dissection of anatomically well-known tracts.While manual dissection protocols [28] can be considered the current gold standard, a variety of approaches has already been developed for automating the process: Region-of-interestbased approaches filter streamlines based on their spatial relation to cortical or other anatomically defined regions, which are typically transferred to subject space via atlas registration and segmentation techniques [34,38].Clustering-based approaches group and select streamlines by measuring intra-and inter-subject streamline similarities, referring to existing reference tracts in atlas space [9,22,23].Concept-wise, many previous approaches have opted for performing a rather blind whole brain tractography and then investing the effort in streamline space, clearing the tractograms from spurious streamlines and grouping the remaining ones.
In Wasserthal et al. [35] we presented a novel concept called tract orientation mapping (TOM) that approaches the problem before doing tractography by learning tractspecific peak images (tract orientation maps, also abbreviated TOM).Each TOM represents one tract, and each voxel contains one orientation vector representing the local tract orientation, i.e. the local mean streamline orientation of the tract (see Fig. 1).These tract orientation maps can then be used as a prior -similar to Rheault et al. [25], who employed registered atlas information as a tract-specific prior -or directly as orientation field for tractography.In Wasserthal et al. [36] we presented a novel method called TractSeg for fast and accurate tract segmentation.Based on these preliminary works we introduce an comprehensive approach to bundle-specific tractography: On low resolution data, TOM tends to oversegment the individual tracts.In contrast to the complex task of voxelwise peak regression with TOM, the simpler binary segmentation with TractSeg yields more accurate tract delineations.Therefore we use the segmentation results from TractSeg to filter the TOM tractograms.After filtering with the TractSeg segmentations the tractograms show good spatial extent and orientation.However, a lot of fibers are still ending prematurely.Filtering the streamlines by a gray matter segmentation is not sufficient, as tracts tend to touch gray matter regions but are not supposed to end there.To obtain proper segmentations of the regions where each tract starts and ends, we trained another convolutional neural network using the same approach as TractSeg [36].Now all fibers not ending in the start/end regions can easily be removed.Using the steps described so far, highly accurate bundle-specific tractograms can be obtained in most situations.However, in some cases a simple deterministic tracking of the TOM peaks yields sub-optimal results, for example due to low image resolution.Therefore we propose a probabilistic approach to TOM tractography which maximizes the sensitivity of the proposed bundle specific tractography pipeline, even on low resolution data or strongly bent tracts.An overview of the entire pipeline can be seen in Fig. 2.
For training and the first part of the evaluation we use the dataset provided by Wasserthal et al. [36] containing 105 subjects from the Human Connectome Project (HCP).For the second part of our evaluation we use 17 differently acquired datasets to evaluate how good our approach generalizes to other datasets.We compare our method to seven other sate-of-the-art methods for generating bundlespecific tractograms.We show that our approach is easy to set up, fast to run and does not require affine or elastic registration, parcellation or clustering.

Materials and Methods
All three methods (tract segmentation, start/end region segmentation and tract orientation mapping) are based on the same fully convolutional neural network architecture (U-Net [27]) that receives as input the fiber orientation distribution function (fODF) peaks.What differs is the training target the network has to learn.For tract segmentation, the network is performing voxel-wise binary classification to discern tract and non-tract voxels.For tract start/end region segmentation it is also doing binary classification but now the number of classes has doubled because for each tract one start and one end region is learned.For tract orientation mapping the network regresses a single 3D peak vector, i.e. three float values, per voxel and bundle.In this way, the models used for the three methods only differ in the number of output channels, the final activation function and the loss function.

Preprocessing
While we successfully tested raw image values as input for our method, this would have restricted the method to the MRI acquisition used during training, not allowing for any variation in the acquisition, such as a change of bvalue or the number of gradient directions, without a complete retraining of the model.Moreover, for high angular resolution datasets, it would have resulted in an input image with an accordingly large number of channels, resulting in unfeasible high memory demand and slow file input/output during training.A more condensed representation of the data was chosen to mitigate this problem: The network expects to receive the three principal fiber directions per voxel as input, thus requiring nine different input channels (three per principal direction).In this study, the principal directions were estimated from the diffusion data using the multi-shell multi-tissue constrained spherical deconvolution (CSD) and peak extraction available in MRtrix [13,31] with a maximum number of three peaks per voxel.If a voxel contained only one fiber direction e.g.voxels in the corpus callosum, then the second and third peak are set to zero.The HCP images have a spatial resolution of 145 × 174 × 145 voxels.We cropped them to 144 × 144 × 144 without removing any brain tissue to make them fit to our network input size.

Model 2.2.1. Architecture
The proposed 2D encoder-decoder architecture was inspired by the U-Net architecture previously proposed by Ronneberger et al. [27].To enable better flow of the error gradients during backpropagation we added deep supervision [12].This reduced the training time and slightly improved the results.
The input for the proposed network is a 2D image at 144 × 144 voxels and 9 channels corresponding to the 3 peaks per voxel (each 3D peak is represented by three float values).The output is a multi-channel image with spatial dimensions of 144 × 144 voxels, where each channel contains the voxel-wise results for one tract.For tract segmentation this leads to 72 channels and for start/end region segmentation to 72 • 2 = 144 channels.Following the same approach 72 • 3 = 216 channels would have been needed for tract orientation mapping.However, given such a high number of classes the training did not converge anymore.To deal with this issue we chose to only train for 18 tracts (=54 channels) at the same time.So four models had to be trained to cover all 72 tracts.For the segmentation tasks the networks output a probability between 0 and 1.These probabilities are converted to binary segmentations by thresholding at 0.5.For the tract orientation mapping the networks output one peak per voxel and tract.Peaks shorter than 0.3 are discarded.To avoid a downsized output in comparison to the input we padded with half the filter size (rounded down).Given a filter size of 3 the padding was set to 1.This is also referred to as SAME padding [5].

Handling of 3D data
While in principle the U-Net architecture allows extensions to image segmentation with 3D convolutions [40], we here propose a 2D architecture.Using a 3D U-Net, we did not achieve the same performance as when using a 2D U-Net.To still leverage the additional information provided by the third dimension, we randomly sampled 2D slices in three different orientations during training: axial, coronal and sagittal.This meant that our model learned to work with all three of these orientations.During inference three predictions per voxel per tract were generated, one for each orientation, resulting in an image with dimensions of 144 × 144 × 144 × nr classes × 3 (after running our model 144 • 3 times).We use the mean to merge those three prediction to one final prediction.Running the model three times (once for each orientation) is only done for tract segmentation and start/end region segmentation.For tract orientation mapping it did not improve the results.Therefore we only run the model once for all slices along the y-axis (as shown in Wasserthal et al. [36], the y-axis gives the best results when only using one axis).

Bundle-specific threshold
Small tracts like the commissure anterior (CA) tend to be incomplete on low quality data.By increasing the sensitivity it is often possible to generate a complete representation of the tract.Increasing the sensitivity is possible by lowering the threshold applied on the model output.For the segmentation tasks we lower the threshold used for converting the probability maps to binary maps (default: 0.5, now: 0.4 for Fornix (FX) and corticospinal tract (CST), 0.3 for CA).For TOM we lower the threshold used to remove short peaks (default: 0.3, now: 0.1).This bundle-specific thresholding was applied to all dataset with lower resolution than the HCP data.For some subjects from the non-HCP datasets we had to lower the threshold for creating binary masks even further (0.03) to receive a complete segmentation of the CA.

Uncertainty
For some use cases it might be of interest knowing how certain the model is about its prediction.Gal and Ghahramani [7] have shown that Bayesian inference can be approximated by using Monte Carlo dropout.This gives an estimation of voxel-wise model uncertainty.In our case we ran the model 30 times with dropout activated.Then we plotted the standard deviation of those 30 runs for each voxel, giving an estimation of the model uncertainty.

Loss
For the segmentation models we trained our network using the binary cross-entropy loss.Sigmoid activation functions were used in the last layer.For a given target y, an output of the network ŷ and N number of classes, the loss is calculated as follows: For the tract orientation mapping model the network was trained using weighted cosine similarity combined with mean squared error of the l 2 -norm as loss.Linear activation functions were used in the last layer.The loss is defined as follows with N being the number of classes, y the training target, ŷ the network output and w a weighting factor which was used to handle the class imbalance between background and bundles and reinforce the training signal of the tracts.We set w = 1 for all background voxels and w = 10 = w max for all foreground voxels.At epoch 0 we use w max = 10 and linearly reduced it to w max = 3 at epoch 300.

Hyperparameters
Leaky rectified linear units (ReLU) were used as nonlinearity [19].A learning rate of 0.001 was used and Adamax [15] was chosen as an optimizer.The batch size was 47.All hyperparameters were optimized on a validation dataset independent of the final test dataset.The network weights of the epoch with the highest Dice score during validation were used for testing.

Data augmentation
To improve the generalizability of our model, we applied heavy data augmentation to the peak images during training 1 .The following transformations were applied to each training sample.The intensity of each transformation was varied randomly by sampling from a uniform distribution U. • Gaussian noise with mean and variance (µ, σ) ∼ (0, U[0, 0.05]) The training samples were normalized to zero mean and unit variance before passing them to the network.When training our network on peaks generated by the MRtrix multi-shell multi-tissue CSD method, we found that it did not work well on peaks generated by the standard MRtrix CSD method.In order to ensure our model worked well with all types of MRtrix peaks, we generated three peak images: (1) multi-shell multi-tissue CSD using all gradient directions, (2) standard CSD using only b = 1000s/mm 2 gradient directions, (3) standard CSD using only 12 gradient directions at b = 1000s/mm 2 .
During training, we randomly sampled from these three peak images, thus ensuring that our network worked well with all of them.We trained for 250 epochs with each epoch corresponding to 193 batches.This means that over the course of the entire training, the network has seen 2,267,750 slices which have been randomly sampled from axial, coronal and sagittal orientations, randomly sampled from three different peak types and randomly permutated by the data augmentation transformations.The results presented in section 3 were obtained using an implementation of the proposed method in Pytorch 2 .

Super resolution
Our models were trained with images of size 144 × 144 corresponding to the 1.25mm resolution of the HCP data.As mentioned in section 2.3 we were using resampling as data augmentation.This means images were downsampled to a resolution of 2.5mm to simulate lower resolution images.Then they were upsampled back to 1.25mm to fit the 144 × 144 input size of the model.So the resolution kept the same but the images got blurred by the down-and upsampling.This down/upsampling was only done for the input images (peaks) not for the labels (training target).This way the models were able to learn a higher resolution output than was actually provided as input.This is commonly referred to as super resolution [1].When our approach receives a low resolution image as input, it is first upsampled to resolution 1.25mm and then fed to the model which returns a output also in 1.25mm resolution.This higher resolution especially helps on very thin bundles like the commissure anterior (CA).

Data
For training our models the dataset published by Wasserthal et al. [36] was used.It contains reference delineations of 72 major white matter tracts (see supplementary materials for a list of all tracts) in 105 subjects from the Human Connectome Project.The reference delineations are provided in form of streamlines.In this pa-2 pytorch.orgper we refer to this dataset as reference data or reference tracts.

Preprocessing of reference data for different tasks
To be able to use the dataset for our three tasks (tract segmentation, start/end region segmentation and tract orientation mapping) some preprocessing was necessary: For the tract segmentation we convert the reference streamlines to binary masks by setting each voxel to T rue where at least one streamline runs through.For the start/end region segmentation we create binary masks from the streamlines start and end points.However, streamlines have no defined direction.So for example for the corticospinal tract some streamlines start at the cortex whereas other streamlines start at the brain stem.Therefore the start point of one streamline might be in the same region as the end point of the next streamline.The resulting binary mask is the union of the start and end region of a tract.Splitting the union into two binary masks, one for the start and one for the end region is not easy as for some tracts like the uncinate fasciculus those region can be very close together.We took the following approach to split the regions: First we used a clustering algorithm (DBSCAN [6]) to create two clusters from the combined region.The clustering was only done on a subset of the data points to avoid long runtimes.When the start and end region were close together the clustering sometimes misassigned points.Therefore we used the results from the clustering to train a random forest.This lead to a correct separation of the two region for all subjects and ensured fast runtime when running for all data points.From the points in those two regions binary masks were created.Finally we did binary closing and a small amount of binary dilation using scipy [14] to create a consistent region from the single points.For the tract orientation mapping the main streamline orientation in each voxel had to be determined for each tract.Using the mean of all streamlines running through a voxel lead to rather noisy results.Therefore we used Mean Shift clustering to group the orientations of all streamlines in one voxel.Then the mean of the orientations in the biggest cluster was taken as final orientation for that voxel.This substantially reduced the noise.

Clinical quality dataset
The reference dataset is provided in high HCP data quality (HCP Quality).However, in clinical routine, faster MRI protocols are used which result in lower quality data.To test how the proposed method performs on clinical quality data, we downsampled the HCP data to 2.5 mm isotropic resolution and removed all but 32 weighted volumes at b = 1000s/mm 2 .We call this dataset Clinical Quality.The reference tracts from the HCP Quality dataset were reused as our reference tracts here.This provides high quality reference tracts for the low quality data, thus allowing proper evaluation.

Phantom dataset
The Clinical Quality dataset has lower resolution and less directions than the HCP Quality dataset but it was still acquired by the same scanner.To evaluate how the proposed method generalizes to images from other scanners and other acquisition settings we would need a dataset with reference tract delineations from another scanner.Unfortunately such a dataset is not available and using the same approach as was used for the Wasserthal et al. [36] dataset is not feasible: For lower quality datasets it becomes very difficult and ambiguous for an expert to accurately determine where tracts run.The expert delineations would rather be approximations not suitable for detailed evaluation.One solution, however, is to simulate low quality data from a different scanner.Thereby we have perfect ground truth and still low image quality.We used the toolkit FiberFox [20] to create such software phantoms.We selected 21 subjects (not used for training) from the reference data and for each simulated the diffusion weighted image of a brain containing only the 72 reference tracts.The simulated images have an isotropic resolution of 2.5mm, 32 gradient directions at b = 1000mm/s and several artefacts which were randomly chosen from the following list: head motion, ghosts, spikes, eddy currents, ringing, distortions, signal drift and complex Gaussian noise.We call this dataset Phantom.As the Clinical quality dataset and the Phantom dataset only have one bvalue shell, we can not use multi-shell CSD as we did for the HCP Quality data.Instead MRtrix standard CSD was used to generate the peaks of the fODF.

Flavors of TOM
There are three different ways how tract orientation maps can be used to create bundle-specific tractograms: • Directly track on the tract orientation maps • For each voxel select the peak from the original input peaks which is closest to the orientation predicted by TOM.Then track on these peaks.This has the advantage of staying closest to the original signal, but if the original peaks are quite noisy the chosen peak will also be noisy.This is a problem especially on low quality data.
• Use the tract orientation map as a prior by taking the weighted mean between the predicted orientation from the TOM and the original orientation normally used for tracking.
Fig. 4 shows exemplary results for the different tracking options on one subject from a low resolution dataset.In all four cases the tract masks as well as the start/end region masks were used to filter the tractograms.Tracking on the original signal is insufficient: Deterministic tracking lacks sensitivity whereas probabilistic trackings lacks specificity (many false positives).Tracking on the tract orientation maps gives the best of both: high sensitivity (tract is complete) and high specificity (little false positives).Tracking on the best original peaks also shows good results but is missing small parts of the lateral projections of the CST.Therefore for our experiments we chose the first option: Directly track on the tract orientation maps.This gave the best results, especially on low quality data where the original peaks can be quite noisy.

Probabilistic tracking on peaks
The output of the tract orientation mapping is one tract orientation map for each tract.A tract orientation map contains one 3D vector (one peak) at each voxel telling the main orientation of the respective tract at that voxel.Creating streamline from these maps could easily be done by using deterministic tracking (e.g.FACT [18].This works well on high resolution data.However, on low resolution data just following the main orientation in each voxel often leads to small branchings being missed as they can not be represented on the low resolution.Probabilistic tracking enables more sensitive tracking.By not just following the main orientation in each voxel but sampling from the orientation distribution smaller branching can be reconstructed which otherwise would be missed.In our case, however, there is only one orientation per voxel given by the tract orientation map, but no orientation distribution.To be able to sample from orientations around the main orientation, we use a Gaussian distribution centered on the main orientation with a fixed standard deviation.This way we can run probabilistic tracking on the tract orientation maps.Our tracking algorithm is identical to the FACT algorithm with the only difference that at each step the next orientation to take is sampled from the given Gaussian distribution.All streamlines have to start and end in the regions segmented by the start/end region segmentation model and are not allowed to leave the mask generated by the tract segmentation model, otherwise they are discarded.When running on low resolution data the start/end region masks as well as the tract masks were dilated by one voxel to make up for small imperfections in those masks, which can occur when resolution is getting lower.During tracking the following parameters were used: a step size of 0.7 voxels, a fixed standard deviation for the Gaussian distribution of 0.2 and a minimum streamline length of 50mm.Seeds were randomly placed inside of the tract mask until a maximum of 2000 streamlines per tract were created.

Reference methods
We compared our proposed method to 3 methods for automatic tract delineation (comparing segmentation performance in terms of DICE score and orientation quality in terms of voxel-wise angular error): TractQuerier, Re-coBundles and streamline atlas.Moreover we compared to 3 methods for tract segmentation (comparing only segmentation performance): TRACULA, atlas registration and multiple mask registration.We also compared to 2 methods which give a voxel-wise orientation for each tract (comparing only orientation quality): Peak atlas and the best original peak.These methods include clusteringbased as well as ROI-based approaches.We give an outline of how they work (1.) and how we applied them (2.).

TractQuerier
1. TractQuerier [34] extracts tracts based on the regions the streamlines have to start at, end at and (not) run through.2. We compared our method to the output from TractQuerier using the same queries as used in Wasserthal et al. [36] without any further post-processing.

RecoBundles
1. Given streamlines of a reference tract in a reference subject, RecoBundles [9] can be used to find the corresponding streamlines in a new subject.2. We randomly picked 5 reference subjects from the training dataset.Due to the long runtime for RecoBundles, a higher number of reference subjects was not feasible.Then we ran Re-coBundles 5 times for the new subject (once for each reference subject) using the default RecoBundles parameters.This resulted in 5 extractions of each tract in the new subject.To get a final segmentation, we took the mean of those 5 extractions.

Streamline atlas registration (SLAtlas)
1. Given streamlines of a reference tract in a reference subject, registration can be used to align them with a new subject.2. The same 5 reference subjects as those selected for RecoBundles were used.To delineate the tracts in a new subject, we registered each of the 5 reference subjects to the new subject.Affine registration of the whole brain tractograms was already done by RecoBundles so we reused these transformations.Finally for each tract we merged the streamlines from the 5 registered reference subjects.

TRACULA
1. TRACULA [38] uses probabilistic tracking and an atlas of the underlying anatomy to segment tracts.2. Running TRACULA for each subject produced a probability map for each tract.We selected all voxels with a probability of greater than 0 to create a binary segmentation.The original paper created a binary segmentation by masking out all values below 20% of the maximum.However, this leaves only the very core of every tract, thus creating many false negatives.Segmenting all values with probability greater than 0 gives better results compared to the reference tracts.TRACULA does not support all of the 72 tracts that we use; it only supports 18.Out of the 18, only 8 tracts match our 72 tracts exactly.Quantitative comparison was therefore only performed on these 8 tracts.A GPU implementation is available for bedpost, which is part of the TRACULA pipeline which we used [10].

Atlas registration (Atlas)
1. Several subjects can be averaged to an atlas which can then be registered to new subjects to segment struc-tures.2. We split our dataset into training and testing data, using the same 5-fold cross-validation as used for the evaluation of our proposed method.The training data was used to create a tract atlas.Firstly, we registered all subjects to a random subject using symmetric diffeomorphic registration implemented in DIPY [3,8].Registration was performed based on the FA maps of each image.After registration, the FA maps of all images were averaged.Then, in a second iteration all images were registered to this mean FA image.This two-stage approach limits the bias introduced by the initial subject choice in the first iteration.The tract atlas thus contained the tract masks for all 72 reference tracts.For each tract, we took the mean over all subjects, which produced a probability map.We thresholded the probability map at 0.5 to create a final binary atlas.During test time, the atlas was registered to the subjects of interest, yielding a binary mask for each tract in subject space.

Multiple mask registration (Multi-Mask)
1. Using an atlas can blur some of the details as it is based on group averages.The blurring can be reduced to some extent by registering the masks of single training subjects to a test subject instead of an averaged atlas.2. The same 5 reference subjects as those selected for Re-coBundles were used.To segment the tracts in a new subject, we registered each of the 5 reference subjects to the new subject (symmetric diffeomorphic registration of the FA maps) and averaged the tract masks (from the reference tracts) of all 5 reference subjects.Finally, we thresholded this average at 0.5 to produce a binary mask for each tract in the space of the new subject.This differs from the Atlas registration method in that the reference subjects are directly registered to subject space and are merged (1 registration) instead of first being registered to atlas space, then merging and being registered to subject space (2 registrations needed).Moreover, Atlas Registration uses 63 subjects while Multi-Mask only uses 5.

Peak atlas
This method is identical to Atlas with the only difference that instead of using binary masks we use peak images.The transformation calculated from registering the FA images is applied to each of the 9 channels of the peak image independently.

Best original peak (BestOrig)
1. Given the peak map of a reference tract, in each voxel we can choose the peak from the original signal that is closest to the peak from the reference tract, resulting in a new peak map. 2. For each subject in the test set we use the reference peaks to extract the best peak from the original signal.As we are using the ground truth in this method, it is not a fair method to directly compare to but it gives a good estimation of how good the original peaks are.

Experiments and results
For evaluation 5-fold cross-validation was used, i.e. 63 training subjects, 21 validation subjects (best epoch selection) and 21 test subjects per fold.The Wilcoxon signedrank test [37] was used to test for statistical significance.For multiple testing, we applied the Bonferroni correction.

Segmentation performance
For evaluating segmentation performance we used the Dice score [29] as our metric.We calculated the Dice for each subject between each of the 72 reference tracts and the respective prediction of either our proposed method or one of the reference methods (e.g.RecoBundles).Then we averaged the Dice results for all 72 tracts to get one final Dice score per subject per method.Over all three datasets (HCP Quality, Clinical Quality and Phantom) our proposed method significantly (p < 0.01) outperformed the reference methods by a large margin: on the HCP Quality dataset on average by 19 Dice points and on the low quality datasets on average by 23 Dice points (Clinical Quality) and 31 Dice points (Phantom) (Fig. 5).In general, the proposed method was less affected by the quality loss in the Clinical Quality and Phantom data than the reference methods.Please note that TRACULA only provides segmentations for a subset of the 72 tracts.The reported scores for TRACULA are restricted to this subset.

Orientation performance
For evaluating orientation performance we use the voxel-wise angular error as metric.We calculate the angular error between the reference orientation and the orientation of the proposed method.We do this for every voxel where the reference peak and the peak of the proposed method have a length greater than zero.Then we average the errors to get one final angular error per subject per method.To calculate the voxel-wise main streamline orientation for the methods which output streamlines (RecoBundles, TractQuerier and SLAtlas) we used the same technique as used for calculating the main streamline orientation for the reference data (see section 2.4.1): the streamline orientations in each voxel were first clustered and then the mean of the biggest cluster was chosen.On the HCP Quality data RecoBundles and TractQuerier show slightly better orientation errors than our proposed method.Those methods have difficulties finding the borders of tracts (poor segmentation performance) but they are good at finding the correct streamlines belonging to the core of the tract and therefore show low orientation errors, as long as the underlying whole brain tractogram is of high quality.As soon as the image quality gets lower (Clinical Quality and Phantom dataset), the whole brain tracking also suffers and therefore the angular error of these methods rises significantly.Our proposed method on the other hand is not dependent on the whole brain tracking and quite robust to lower image quality as it was trained with extensive data augmentation.As a results the angular error only rises by 1 degree when using our proposed method on the Clinical Quality data compared to the HCP Quality data (Fig 6).SLAtlas and Peak Atlas show high angular errors for all three datasets.Fig. 7 shows the angular error for each tract independently on the Clinical Quality dataset.

Qualitative evaluation
For the qualitative evaluation, one subject (623844) was selected from the test set.We chose a subject whose Dice scores were closest to the mean Dice scores for the entire datasets to make the subject representative for the entire dataset.Since the scope of this manuscript does not allow us to show results for all 72 tracts, we selected three tracts that represent different degrees of reconstruction difficulty according to Maier-Hein et al. [17]: the inferior occipito-frontal fascicle (IFO), corticospinal tract (CST) and commissure anterior (CA).The IFO is a tract which is fairly easy to reconstruct, which is reflected by its consistently good scores for all methods.The CST is more difficult to reconstruct.Its beginning at the brain stem is easy to reconstruct but as the fibers get closer to  the cortex, they start to fan out.Finding these lateral projections is more difficult.Finally, the CA is a tract that is difficult to reconstruct.Due to its very thin body, it is hard to find fibers that run the entire way from the right to the left temporal lobe.The CA is one of the tracts with the lowest performance out of all of the methods.We show results for all reference methods that produce streamline output (RecoBundles, TractQuier and SLAtlas).For each tract one 3D view is shown as well as one 2D slice allowing more in detail evaluation.On the 2D slice the mask of the bundle is shown (red) as well as the mask of the reference bundle (green).
As can be seen in Fig. 8, the Proposed method yielded accurate and spatially coherent reconstructions on all three tracts.RecoBundles oversegmented the CST to neighbouring gyri and selected many streamlines ending prematurely instead of reaching the correct start and end regions of the tract.TractQuerier did not properly segment any of the example tracts.As it defines tracts mainly by their endpoints, it leaves much room for wrong turns between the start and end points.TractQuerier extracts a lot of false positives, especially when using probabilis-Figure 7: Angular errors for all 72 tracts on the Clinical Quality dataset for our proposed method and all reference methods sorted by error.The full name of each tract can be found in the supplementary materials tic tracking.The CA cannot be properly reconstructed with TractQuerier as the default Freesurfer parcellation is not precise enough for the small parts of the CA.SLAtlas produces reconstructions looking convincing on first sight but when looking at them on the 2D view we can see that it involves severe oversegmentation (e.g.segmenting gray matter and non-brain area for the CST) and slightly shifted tracts (e.g.CA).This is most probably owed to the affine registration, which can not fully resolve the intersubject variability.
On Fig. 9 the results for the same subject (623844) from the Phantom dataset can be seen.The different methods in principle show the same shortcomings as for the HCP Quality dataset, but now more severely.

Generalization to other datasets and pathologies
To test the capability of the proposed method to generalize beyond HCP, which it was trained on, we applied it to 17 differently acquired datasets (see Table 1).These 17 datasets represent a wide variety of data: Different scanners, different spatial resolutions, different b-values, different number of gradients, healthy and diseased, normal and abnormal brain anatomy.
An expert manually dissected the three tracts already shown in the above qualitative evaluation (CST, CA and IFO) from one randomly chosen subject from each dataset shown in table 1. Visual comparisons were performed between manual dissections and the results of our proposed method as well as the previously introduced reference methods RecoBundles, TractQuerier and SLAtlas.Methods depending on reference data (all three methods) were provided with our HCP reference data.All subjects (expect for subjects from HCP datasets which did already receive basic preprocessing) were denoised (using MRtrix [33]), corrected for eddy currents and motion artifacts (using FSL eddy [2]) and rigidly registered to MNI space.This is not required for our proposed method to work.It only requires that the left/right, front/back and up/down orientation of the images are the same as for the HCP data (i.e.images are not mirrored).Rigid registration to MNI space is an easy way to ensure this.We also observed a small performance increase when images are rigidly registered to MNI space instead of only adapting the orientation.Our proposed method showed anatomically plausible results for all subjects and most of the tracts.Only the CA was not completely reconstructed in around 20% of the subjects.We observed partly incomplete manual reference dissections in these areas as well, indicating that the size of this very thin tract is reaching the resolution limit of the underlying imaging acquisition.Figs. 10, 11 and 12 show exemplary results for three subjects with pathologies.For those subjects we show the corticospinal tract (CST), the optic radiation (OR) and the thalamopostcentral tract (TPOSTC) as those tracts are heavily affected by the respective pathology.The other 19 subjects can be found in the supplementary materials.

Project
Pathology   Fig. 10 shows the results for an alzheimer patient with abnormally large ventricles from the OASIS dataset.Even though our proposed method has only seen healthy subjects with normally sized ventricles during training it managed to properly reconstruct the CST and the OR which are heavily distorted by the enlarged ventricles.Re-coBundles also managed to find the distorted tracts.However, it failed to find the lateral projections of the CST and oversegmented the Meyer's loop of the OR.TractQuerier showed severe oversegmentation of both the CST and OR.SLAtlas did not manage to adapt to the enlarged ventricles.It placed the tracts inside of the ventricles as the affine registration is not able to resolve these distortions.MS lesions demyelination takes place, leading to a loss in diffusion-weighted signal.However, the axons themselves are still intact.Therefore fibers are still running through the lesions but they are harder to reconstruct as the signal is weakened by the demyelination.Our proposed method manages to properly reconstruct streamlines running through these lesions.RecoBundles and TractQuerier were also able to reconstruct streamlines running through the lesions as they use tracking based on constrained spherical deconvolution which shows good results in reconstructing orientation information inside of the lesions (using a simple tensor model would not be sufficient to reconstruct the orientation information inside of the lesions).However, RecoBundles and TractQuerier are failing in properly reconstructing the entirety of the tracts: RecoBundles fails to reconstruct the Meyer's loop of the OR and TractQuerier show severe oversegmentation of both tracts.We do not show a reference tract delineation for this subject as the lesions would not be visible anymore then.Fig. 12 shows the results for a patient with mild brain volume loss and schizoaffective disorder.Around the postcentral gyrus the volume loss is more severe.We show results for the Thalamo-postcentral tract (T POSTC), containing fibers running from the thalamus to the postcentral gyrus.Despite the reduced brain volume our proposed method managed to correctly reconstruct the fibers in the postcentral gyrus.RecoBundles is missing major parts of the tract and SLAtlas is not able to adapt to the reduced brain volume leading to streamlines running outside of the postcentral gyrus.
We also tested our method on subjects with brain tumors.However, given the vast distortions a tumor can produce, it is unclear where exactly certain tracts run.Experts can only assess if a tract could be a plausible reconstruction not containing any obvious errors (e.g.running through the tumor).The tract reconstructions in tumor patients produced by our method were rated as plausible by an expert.However, given the difficulty of evaluating tracts in tumor cases we do not show any results.

Runtime
Runtime experiments were performed using a server with 16 2GHz Intel Xeon cores and an NVIDIA Titan X for the GPU-based approaches.We evaluated the runtime of all methods producing streamline output (Proposed, RecoBundles, TractQuerier and SLAtlas).The runtime does not include the fitting of the constrained spherical deconvolution model as this is identical for all methods.For the HCP Quality experiments whole brain tractograms with 10 million streamlines where used.For the Clinical Quality experiments whole brain tractograms with 500,000 streamlines were used.This reduced the runtime significantly.For our proposed approach on HCP Quality and Clinical Quality 2000 streamlines where generated for each tract.Fig 13 shows the results for each method when reconstructing all 72 tracts in a previously unseen subject.Our method was 137x faster than the reference methods for HCP Quality and 50x faster for Clinical Quality.

Uncertainty example
Fig. 14 shows the model uncertainty for the commissure anterior for three subjects: One subject from the HCP Quality dataset, one subject from the Clinical Quality dataset and one subject from a non-HCP dataset (BrainGluShi [4]).For the HCP Quality dataset the uncertainty was quite low.For the Clinical Quality dataset  the uncertainty was increased as it has lower image quality.For the non-HCP dataset the uncertainty was even higher as the subject comes from a dataset the model was not trained on.These results indicate that Monte Carlo dropout gives reasonable uncertainty estimations.

Overview
Our proposed approach is a novel method for bundlespecific tractography.It was evaluated on 72 tracts in a cohort of 105 HCP subjects in original high quality and also on reduced quality, more clinical-like datasets.Moreover we evaluated the approach on synthetic software phantoms.Seven methods were used as a benchmark.Our experiments demonstrated that our approach achieves yet unprecedented accuracy and runtime while being less affected by the reduction in resolution in the clinical quality data.It also generalizes well to unseen datasets and pathologies.

Reference data
The tract delineations from the reference dataset used for training and evaluation do not represent a real ground truth.They are approximations based on diffusion weighted images, which has several limitations.However, given the high quality of the HCP data and the manual inspection by an expert, the employed dataset represents one of the best existing in vivo approximations of known white matter anatomy in a cohort of that size.Moreover, by using synthetic software phantoms we were able to evaluate our method on a dataset where the real ground truth is available.
On the Phantom data Dice scores were lower than on the Clinical Quality data.This had two main reasons: First the phantoms were simulated containing major artifacts whereas Clinical Quality contains only little artifacts as it based one the high quality HCP data.Second the domain shift between the training data and the Phantom data is significantly greater than the shift between the training data and the Clinical Quality data: On the one hand different acquisition settings were used during phantom simulation and on the other hand the simulated brains only contain the 72 reference tracts.Those cover the majority of the brain, but several tracts (like for example all u-fibers) were not included in the phantom.Therefore the training data and the phantoms are less similar and the resulting scores are reduced.Despite the reduced Dice scores in comparison to the Clinical Quality data, our approach still shows complete reconstructions on the Phantom dataset with only minor errors (Fig. 9).

Reference methods
Selecting appropriate reference methods for a fair comparison was not easy as all methods have slightly different approaches and requirements.The comparison with our selected reference methods is also subject to some limitations: TractQuerier was part of the pipeline used for the creation of the reference dataset which was then used to evaluate TractQuerier against, thus inducing a potential positive bias for the method.For RecoBundles we were only able to use 5 reference subjects due to the long runtime of RecoBundles.Using all 63 subjects from the training set as reference subjects would have been computationally infeasible for 72 tracts and tractograms with 10 million streamlines.Moreover, as suggested by our Atlas and Multi-Mask experiments, averaging more subjects, does not necessarily increase accuracy as small details become blurred.Using 5 reference subjects therefore provides a good estimation of the performance of RecoBundles.We used the default RecoBundles settings.Optimizing those might improve the results to some degree.TRACULA uses its own tract definitions, which may differ from our definitions to some degree, introducing a negative bias.However, for very well defined tracts like the CST, this bias should be minor and a meaningful comparison is still possible.SLAtlas is showing high angular errors because it is only based on affine registration making the registered tract not align properly on the new subject.Peak Atlas is based on elastic registration which leads to better alignment of tracts.However, it is also showing high angular errors as elastic registration is still not able to completely resolve the complex inter-subject variability that exists between human brains.Moreover, the elastic deformation was calculated based on the FA images and then applied independently to each channel of the peak image.This way peaks were only shifted but not rotated.Implementing a registration algorithm that can also rotate the peaks would have exceeded the scope of this work.This might be the main factor for the poor performance of this method.
As we have shown, our comparison to the reference methods has some limitations.However, those limitations do not apply to all reference methods and those limitations alone can not explain the large accuracy gap between our method and all reference methods, indicating the great potential of the proposed method

Generalization
Our method is based on supervised learning, bearing the inherent limitation of depending on the availability and quality of training data.This is similar to most of the reference methods which also require reference tracts or atlases.Using scanners and acquisition sequences different from the training data introduces a domain shift and therefore reduces the performance.By using heavy data augmentation during training this domain shift can be reduced.Our experiments on the phantom data have shown that our method generalizes well to unseen acquisition sequences.We have also shown on a wide range of unseen datasets from different scanners with and without pathologies that the proposed method produces anatomically plausible results in most cases.

Code availability
The proposed method is openly available as an easy-to-use python package with pretrained weights: https://github.com/MIC-DKFZ/TractSeg/

Figure 1 :
Figure 1: Exemplary depiction of a slice through two of the reference tracts, the original fODF peak image and the corresponding reference TOMs (CST right: corticospinal tract; CC: corpus callosum).

Figure 2 :
Figure 2: Pipeline overview: Constrained spherical deconvolution (CSD) is applied to obtain the three principal fODF directions per voxel which is the input to three U-Nets.The three U-Nets are used to create a tract orientation map, a tract mask and a start/end region mask for each tract.Then probabilistic tracking is run on the tract orientation maps.All fibers leaving the tract mask and not ending in the start/end masks are discarded.The result is one tractogram for each tract.

Figure 3 :
Figure 3: Proposed U-Net architecture.Blue boxes represent multichannel feature maps.White boxes show copied feature maps.The gray number on top of each box gives the number of channels, the x-y-size is given at lower left corner of each box.Network operations are represented by differently colored arrows.The main difference to the original U-Net architecture are the extra convolutions layers in the upsampling path allowing better gradient flow (deep supervision).

Figure 4 :
Figure 4: Right corticospinal tract (CST) in one subject from the BrainGluSchi [4] dataset (2mm isotropic resolution, 30x b = 800mm/s 2 ) reconstructed by the different tracking variants.Probabilistic tracking on TOMs shows the best results in terms of sensitivity and specificity.

Figure 5 :
Figure 5: Segmentation results on the HCP Quality, Clinical Quality and Phantom dataset with a gray dot per subject (mean over all tracts) and a colored dot for the mean over all subjects.Proposed: Our method; Multi-Mask: Multiple mask registration; Atlas: Atlas registration; SLAtlas: Streamline atlas.*: TRACULA does not provide segmentations for all tracts, only for a subset.The score is the mean over the tracts in the subset.

Figure 6 :
Figure 6: Orientation performance results on the HCP Quality, Clinical Quality and Phantom dataset with a gray dot per subject (mean over all tracts) and a colored dot for the mean over all subjects.Proposed: Our method; BestOrig: Best original peak; SLAtlas: Streamline atlas.

b = 1000mm/s 2 3TFigure 8 :
Figure 8: Qualitative comparison of results on HCP Quality test set: reconstruction of right corticospinal tract (CST), commissure anterior (CA) and right inferior occipito-frontal fascicle (IFO) on subject 623844.Green shows the reference tract and red shows the tract mask of the respective method.

Figure 9 :
Figure 9: Qualitative comparison of results on Phantom test set: reconstruction of right corticospinal tract (CST), commissure anterior (CA) and right inferior occipito-frontal fascicle (IFO) on subject 623844.Green shows the reference tract and red shows the tract mask of the respective method.

Figure 10 :
Figure 10: Qualitative comparison of results on one Alzheimer patient with enlarged ventricles from the OASIS dataset: reconstruction of right corticospinal tract (CST), commissure anterior (CA) and right inferior occipito-frontal fascicle (IFO).Green shows the manual dissection and red shows the tract mask of the respective method.

Fig. 11
Fig. 11 shows the results for an multiple sclerosis (MS) patient with severe lesions in the pathways of the CST and OR (marked with arrows in the figure).Inside of

Figure 11 :
Figure 11: Qualitative comparison of results on one multiple sclerosis patient with several lesions inside of the tracts: reconstruction of right corticospinal tract (CST) and right optic radiation (OR).Red arrows show lesions close to the tracts.

Figure 12 :
Figure 12: Qualitative comparison of results on one subject with brain volume loss and schizoaffective disorder: reconstruction of left Thalamopostcentral tract (T POSTC).Green shows the manual dissection and red shows the tract mask of the respective method.

Figure 13 :
Figure13: Runtime of our proposed method and reference methods to generate bundle-specific trackings for all 72 tracts in one subject.On HCP Quality 10 million streamlines are used for the whole brain tracking, on lClinical Quality only 500,000 streamlines.

Figure 14 :
Figure 14: Model uncertainty for the commissure anterior (CA) of three subjects from three different datasets: HCP Quality, Clinical Quality and a non-HCP dataset (BrainGluShi [4]).The segmentation is shown in red and the uncertainty is shown as gray scale heat map on the 2D plot.Lighter values indicate higher uncertainty.For lower quality datasets (Clinical Quality) and datasets the model was not trained on (non-HCP data) the model uncertainty is increased: Also the inner parts of the CA start to be more uncertain, whereas for the HCP Quality data the model was only uncertain in the border regions.

Figure 15 :
Figure 15: Qualitative results of our proposed method on 10 healthy subjects from 9 different datasets (see table 1) while being trained on the HCP reference dataset: reconstruction of right corticospinal tract (CST), commissure anterior (CA) and right inferior occipito-frontal fascicle (IFO).*: For CA lowered threshold to 0.03 to be more sensitive.

Figure 16 :
Figure 16: Qualitative results of our proposed method on 9 subjects with pathologies from 7 different datasets (see table 1) while being trained on the HCP reference dataset: reconstruction of right corticospinal tract (CST), commissure anterior (CA) and right inferior occipito-frontal fascicle (IFO).*: For CA lowered threshold to 0.03 to be more sensitive.

Table 1 :
Acquisition parameters of additional test datasets.