Deep Structure Learning using Feature Extraction in Trained Projection Space

Over the last decade of machine learning, convolutional neural networks have produced the most striking successes in feature extraction from rich sensory and high-dimensional data. While learning data representations via convolutions is well studied and efficiently implemented in various deep learning libraries, one often faces limited memory capacity and an insufficient number of training samples, especially for high-dimensional and large-scale tasks. To overcome these limitations, we introduce a network architecture using a self-adjusting, data-dependent version of the Radon-transform (linear data projection), also known as x-ray projection, to enable feature extraction via convolutions in lower-dimensional space. The resulting framework, named PiNet, can be trained end-to-end and shows promising performance on volumetric segmentation tasks. We test the proposed model on public datasets and show that our approach achieves comparable results using only a fraction of the parameters. An investigation of memory usage and processing time confirms PiNet's superior efficiency compared to other segmentation models.


Introduction
Deep convolutional networks have experienced tremendous success in various scientific applications in the last few years. An important contribution to the development of network architectures was made in 2015 by Simonyan and Zisserman [1]. The VGG Net redefined the state of the art in image classification, utilizing deep multi-scale feature learning via convolutional blocks with small 3 × 3 filters. With this approach to feature extraction from rich sensory data, the Visual Geometry Group of the University of Oxford secured the first and second places in the localisation and classification tracks of the ImageNet Challenge 2014 [2, 3], respectively. In addition to VGG Net's superior capability of capturing the image distribution, its generalization properties suggested extending this architecture beyond classification tasks. Long et al. [4] proposed the idea of extending VGG Nets to fully convolutional networks to obtain semantic segmentations of 2D images. Building on the fully convolutional approach, Ronneberger et al. [5] introduced the U-net architecture, which has defined the benchmark in image segmentation to this day. The U-net provides a powerful segmentation and restoration tool, in particular for biomedical applications, due to its ability to learn highly accurate annotation masks from only a few training samples.
Feature extraction utilizing the VGG Net architecture in the afore-mentioned applications mainly focuses on 2D convolutions. Representation learning via convolutions in three-dimensional data space leads to memory issues and a significant increase in computation time, and suffers from the limited amount of available annotated training data. For segmentation, one idea to overcome the drawbacks of 3D convolutions is to apply parameter-efficient 2D segmentation methods to slice images [6]. However, these slice images do not contain the information of the full volumetric data, which makes any task considerably more challenging.
Therefore, Perslev et al. [7] suggested sampling input volumes on 2D isotropic grids along multiple view axes. Different slice images are obtained for each axis, which can be used to train a single 2D U-net, implicitly enabling data augmentation. This yields different segmentation volumes for each view, which are then combined by a learned fusion model. Perslev et al. achieve remarkable performance on the 2018 Medical Segmentation Decathlon [8], where they prove that this resource-saving approach can compete with much more complex methodologies on such generalization challenges. Nevertheless, with an underlying 2D segmentation model of ≈ 62 × 10^6 parameters and many redundant computations (a slice-wise approach for each of the multiple view axes), there is still room for increasing memory efficiency.
Another attempt at memory reduction and data augmentation in the setting of high-dimensional segmentation was made by Angermann et al. in [9], where the Maximum Intensity Projection (MIP), in combination with a learnable reconstruction operator, is used to transfer a 3D segmentation problem to the task of 2D segmentation of projection images. The choice of the MIP was motivated by the targeted application of blood vessel segmentation and is not applicable to the annotation of arbitrary non-convex 3D structures.
Contributions: This paper presents a methodology for feature extraction from high-dimensional large-scale data by applying segmentation algorithms in a learned projection space. By developing a data-driven Radon-transform operator, the segmentation task is transferred to a bundle of regression problems in lower-dimensional space, which can be solved with a vanilla U-net. This approach especially addresses two challenges of high-dimensional feature learning:
• Memory efficiency: The need for large memory resources during training is reduced because representation learning happens via convolutions in a space of reduced dimension.
• Implicit data augmentation: Since we deploy the Radon-transform from multiple directions, we obtain several projection images from each input volume, which results in implicit data augmentation for the subsequent convolutional model in the lower-dimensional space.
Furthermore, this paper proposes the following novelty:
• Data-driven Radon-transform: For each projection direction, the Radon-transform operator [10] is combined with input-dependent region weightings, which are adjusted automatically during the training process. This ensures capturing as much information as possible about the volumetric object of interest while reducing artefacts caused by its surroundings.
We test PiNet on selected datasets with only few training samples from the 2018 Medical Segmentation Decathlon [8].
To be more precise, we apply exactly the same architecture (without any task-specific adaptations) to two datasets, both with very limited training data, and show that our approach succeeds on these tasks using only a fraction of the parameters and memory compared to other challenge submissions that also focused on storage efficiency [7]. The numerical experiments are intended to demonstrate the universality and economy of projection-based feature learning, and to propose the idea of switching from slice-wise to projection-based approaches when targeting high-dimensional image computations. Automated fine-tuning for particular applications is left for future research.

Method
In this section, we describe the proposed PiNet in three spatial dimensions and consider the task of volumetric segmentation. Note that our approach can also be adapted to dimension reduction in spaces of arbitrary dimension. Recall that a network for multilabel volumetric classification is a function N : R^{d1 × d2 × d3} → [0, 1]^{d1 × d2 × d3 × c} that maps an input scan to the probabilities that each voxel belongs to one of the c considered classes.
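For concreteness, the following is a minimal shape illustration of this mapping; all sizes and array contents are assumptions for illustration only.

```python
import numpy as np

# Assumed example sizes: a scan x of shape (d1, d2, d3) is mapped by a
# network N to per-voxel class probabilities of shape (d1, d2, d3, c).
d1, d2, d3, c = 320, 320, 130, 2
x = np.random.rand(d1, d2, d3)         # input volume
y = np.random.rand(d1, d2, d3, c)      # stands in for N(x), values in [0, 1]
assert y.shape == x.shape + (c,)
```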

Definition 1 (PiNet).
For input x, the proposed PiNet (Fig. 1) takes the form of Equation (1): a composition of random orientation operators, data-driven Radon-transforms with an auxiliary orthogonal position mask, a 2D segmentation network, learned reconstruction operators and a fusion model. The different operators of Equation (1) are discussed in the following paragraphs.

Random orientations X_i: The operators X_i, i = 1, ..., v, denote independent random rotations of the volumes in three-dimensional space, i.e. the volume is rotated in the three main planes using angles chosen uniformly at random. More precisely, we change the orientation of the input volumes several times to enable data augmentation for the subsequent models and to gain multi-view information for the Radon-transform operator, where the data is rotated only around the z-axis (Figure 1). When training starts, these random orientation angles are fixed and saved as metadata of the model. Therefore, the orientations during inference are the same as during training, which preserves the optimality of the subsequent data-driven projections.
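A minimal sketch of such a fixed random-orientation operator, assuming SciPy is available; the function name and the choice of interpolation order are illustrative assumptions, not the authors' implementation.

```python
import numpy as np
from scipy.ndimage import rotate

def make_orientation_op(seed):
    """Draw one random rotation in the three main planes and fix it."""
    rng = np.random.default_rng(seed)
    # Angles are drawn once and then stored as model metadata, so that
    # training and inference use identical orientations.
    angles = rng.uniform(0.0, 360.0, size=3)

    def apply(volume):
        # Rotate successively in the (0,1), (0,2) and (1,2) planes.
        for axes, angle in zip([(0, 1), (0, 2), (1, 2)], angles):
            volume = rotate(volume, angle, axes=axes, reshape=False, order=1)
        return volume

    return apply, angles  # angles are persisted with the model
```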
Data-driven Radon-transforms P_i: P_i, i = 1, ..., p, is a Radon-transform operator (projection along parallel lines) applied to the reoriented input x̃ ∈ R^{d1 × d2 × d3} with region weights q ∈ [0, 1]^{d1 × d2}, which reduces the data by one dimension. Basically, for M ∈ N, the volume is rotated around the third coordinate axis by equidistant angles in Θ, and for each direction the interpolated slices are summed over the first spatial axis. The quite complex notation of Θ in Equation (2) indicates that, for M = 2, Nyquist-Shannon sampling implies that the volumetric dataset can be uniquely recovered from its Radon projections. When summing over all slices without weighting, we collect structural information not only of the target object we want to segment but also of surrounding objects. This causes a loss of essential information, such as abnormalities (e.g. cancer) within the object or clear boundaries, which makes segmentation in projection space much more challenging. For example, segmentation of the liver is more difficult if another object, such as the vertebra, contributes higher intensity to the projection image than the target object (Figure 2). This issue can be resolved by automatically removing projection contributions from those objects which surround the segmentation target. Since the position of the target object varies between input scans, we make use of an auxiliary binary segmentation from the orthogonal direction, which determines the coarse position of the target in an automatic manner (Figure 3). A minimal sketch of this weighted projection step is given after the figure captions below.

Auxiliary orthogonal position mask Ψ ∘ P⊥: P⊥_i, i = 1, ..., p, is a standard Radon-transform operator which generates projection images from the orthogonal direction α_i + 90°, α_i ∈ Θ in Equation (2). Ψ : R^{d1 × d2} → [0, 1]^{d1 × d2} is a simple U-net for binary segmentation of 2D projection images. For each channel, the network returns values near 1 if the integrated line of the projection image contains a voxel of one of the target objects, and values near 0 otherwise. The purpose of this mask from the orthogonal direction is to provide the subsequent data-driven Radon-transform operator with a coarse position of the targets, which makes the automated region emphasis in P input-dependent.

2D network Φ: Φ : R^{d2 × d3} → [0, 1]^{d2 × d3 × c} is a U-net for segmentation of 2D projection images following the architecture in [5]. This U-net operator is responsible for the feature extraction needed to succeed on the targeted segmentation task.

Figure 2: Both images show the application of the Radon-transform to liver CT scans taken from the Medical Segmentation Decathlon [8]. The left projection images are generated by integration over the first spatial axis without any region weighting along the projection direction (plain Radon-transform). The right images display the output of the data-driven Radon-transform operator, which pays more attention to regions along the projection axis that contain the liver. As a result, the liver shape is better recognizable in the right projection images (red marked areas), which increases the accuracy of the subsequent segmentation model.

Figure 3: Left: Vanilla Radon-transform integrating equally over both ellipsoids, where each slice has an equal contribution. Right: Multiplication with an automatically generated binary mask from the orthogonal direction ensures that the data-driven Radon-transform P pays more attention to those regions which contain the ellipsoid of interest.
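The following NumPy sketch illustrates the weighted projection step for a single direction, assuming a cubic volume for simplicity; data_driven_projection and unet_psi are illustrative names, and the exact axis bookkeeping is an assumption rather than the authors' implementation.

```python
import numpy as np
from scipy.ndimage import rotate

def data_driven_projection(volume, alpha, unet_psi):
    """Weighted projection for one direction alpha (in degrees).

    volume: cubic array of shape (d, d, d); unet_psi: callable standing
    in for the auxiliary 2D U-net, mapping a (d, d) projection image to
    region weights q with values in [0, 1].
    """
    # Rotate around the third (z) axis so direction alpha aligns with axis 0.
    rot = rotate(volume, alpha, axes=(0, 1), reshape=False, order=1)

    # Coarse target position from the orthogonal direction alpha + 90:
    # a plain, unweighted projection segmented by the small U-net.
    orth = rotate(volume, alpha + 90.0, axes=(0, 1), reshape=False, order=1)
    q = unet_psi(orth.sum(axis=0))   # (d, d): position along the lines vs. z

    # Region-weighted integration over the first spatial axis: lines through
    # the target keep their contribution, surroundings are damped.
    return (rot * q[:, None, :]).sum(axis=0)   # projection image, shape (d, d)
```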
Since we consider Radon-transform projection inputs x_R ∈ R^{d2 × d3}, it is also necessary to apply the plain Radon-transform (without region weighting) to the corresponding targets. Therefore, the targets y_R ∈ R^{d2 × d3 × c} are no longer binary, which changes the segmentation into a regression task for each output channel. This suggests a modification of the output layer of the U-net architecture in [5]. Due to the high sparsity of the target objects, well-known regression losses such as the mean-squared-(logarithmic)-error (MSE/MSLE), the mean-absolute-error (MAE) or the Huber loss fail on this task. We adopt the idea of discretizing the targets, as proposed in depth-map prediction works [11, 12]. To be more precise, we divide every target y_R ∈ R^{d2 × d3 × c} by its maximum, discretize the interval [0, 1] into b equidistant bins for the scaled versions and obtain the new targets y_d ∈ {0, 1}^{d2 × d3 × c × b}. The regression problem is thereby transformed back into a segmentation problem for each of the b depth levels.
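A short sketch of this depth binning as described above; the function name and dtype choices are illustrative.

```python
import numpy as np

def discretize_targets(y_r, b=5):
    """Map continuous projection targets (d2, d3, c) to binary depth
    targets (d2, d3, c, b), following the binning described above."""
    y = y_r / max(y_r.max(), 1e-8)          # scale to [0, 1]
    # Bin index 0..b-1 per pixel and class; values at exactly 1.0 fall
    # into the last bin.
    idx = np.minimum((y * b).astype(int), b - 1)
    return np.eye(b, dtype=np.uint8)[idx]   # one-hot over the last axis
```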
For binary segmentation, deploying a convolution with a softmax activation function as the output layer enables optimization via standard classification loss functions (cross-entropy loss [13], Dice loss [14], etc.) separately on each depth channel while maintaining the dependency between bins.
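A hedged Keras sketch of such an output head and a per-bin soft Dice loss; the layer configuration and smoothing constant are assumptions, not the paper's exact code.

```python
import tensorflow as tf

def depth_head(features, b=5):
    # features: (batch, d2, d3, f) -> per-pixel distribution over b depth bins
    return tf.keras.layers.Conv2D(b, 1, activation="softmax")(features)

def dice_loss(y_true, y_pred, eps=1.0):
    # Soft Dice computed separately on every depth bin, then averaged.
    inter = tf.reduce_sum(y_true * y_pred, axis=(1, 2))
    denom = tf.reduce_sum(y_true + y_pred, axis=(1, 2))
    return 1.0 - tf.reduce_mean((2.0 * inter + eps) / (denom + eps))
```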

Fusion model F:
As proposed in [7], we make use of a fusion model F : [0, 1]^{d1 × d2 × d3 × c × v} → [0, 1]^{d1 × d2 × d3 × c} to combine the different orientations into one final output ŷ, where a weight matrix W ∈ R^{v × c} is adjusted during framework training. Applying a threshold to the final output ŷ yields the desired segmentation mask.
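A minimal sketch of this fusion step as a per-class weighted combination of the v orientation volumes; the softmax normalization of W is our assumption to keep the output in [0, 1], not a detail stated in the text.

```python
import numpy as np

def fuse(volumes, W):
    """volumes: (v, d1, d2, d3, c); W: (v, c) -> fused (d1, d2, d3, c)."""
    W = np.exp(W) / np.exp(W).sum(axis=0)           # normalize per class
    return np.einsum("vxyzc,vc->xyzc", volumes, W)  # weighted sum over views
```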

Experiments & Results
We apply PiNet to two different segmentation tasks (without any task-specific modifications) taken from the 2018 Medical Segmentation Decathlon [8]. Task 1 and task 2 consist of left cardiac atrium segmentation of monomodal MRI scans and spleen segmentation of CT scans, respectively. All in all, the whole PiNet framework consists of ≈ 12.6 × 10^6 adjustable parameters, which is only a fifth of the parameter count of MPUnet [7] (≈ 62 × 10^6). Regarding Equation (1), only ≈ 4 × 10^2 parameters are used by the trainable operators X, B, F; the rest belongs to the 2D networks Φ and Ψ. We choose v = 1 for the number of orientations, b = 5 in the hierarchical output module and M = 24 in Equation (2) (i.e. 41 projection directions), since higher values only increase the computational costs but not the accuracy (see Table 2). Optimization is performed using the Dice loss function [14] with the Adadelta algorithm [15] (learning rate 0.125) and stopped after 100 epochs, with minibatches of 10 projections per step. Everything is implemented in Python using the neural networks API Keras [16]. The parameters of Φ and Ψ are initialized with the default Keras initialization; the start weights of X, B, F are initialized empirically. A hedged sketch of this training configuration is given below.
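The sketch below mirrors the reported optimizer and schedule; build_pinet, train_inputs and train_targets are placeholders, and dice_loss refers to the per-bin Dice sketch in the Method section.

```python
import tensorflow as tf

model = build_pinet()   # hypothetical builder for the full framework
model.compile(
    optimizer=tf.keras.optimizers.Adadelta(learning_rate=0.125),
    loss=dice_loss,     # soft Dice, as sketched for the output layer
)
model.fit(train_inputs, train_targets, batch_size=10, epochs=100)
```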
We present results for PiNet via 3-fold cross-validation on the available training pairs. Table 2 shows the performance of different modifications of the PiNet architecture; Table 1 displays for each task the corresponding number of available training pairs and compares the Dice scores and Hausdorff distances (H-dist) [17] in voxels between MPUnet [7], 3D U-net [6, 14] and our methodology. The last column indicates how much GPU memory is at least necessary to handle training with very small minibatches. For MPUnet, we use the GitHub implementation (https://github.com/perslev/MultiPlanarUNet) with the predefined parameter settings, since Perslev et al. [7] claim that the fixed hyperparameter set ensures high generalizability and that no further fine-tuning has to be conducted. For the 3D U-net approach with convolutions in three dimensions, we make use of a pre-built implementation in the same GitHub repository. Furthermore, we had to decrease the input dimension from 320 to 80 for this model to ensure that training remained feasible on our GPU.

Table 1 (excerpt, PiNet row): Dice 0.891 ± 0.062, H-dist 12.87 ± 8.30, minimum GPU memory > 3.5 GB.

As we observe in Table 1, PiNet, although it needs only a fifth of the parameters and significantly less GPU memory, can compete with MPUnet, which constitutes the benchmark for memory-efficient 3D segmentation in this work. PiNet achieves slightly smaller Dice scores for left cardiac atrium MRI segmentation, but a significantly higher Dice score and a decrease in average Hausdorff distance are reached for spleen CT annotation compared to MPUnet and 3D U-net. Exploring the deviation values of all three methodologies, we conclude that there exist a few spleen CT scans in the test data on which MPUnet and 3D U-net fail heavily. This could be caused by strong variations of spleen position and size in the very small training set (41 × 2/3 pairs per fold). Therefore, although PiNet is not able to significantly improve segmentation accuracy for very detailed structures like fine heart arteries (see Figure 4), our approach exceeds other existing 3D segmentation methods in terms of capturing objects with strongly varying position and size in small datasets, due to deploying multiple projection directions. The necessity of turning the Radon-transform into a trainable operator which optimizes input-dependent region weightings is also confirmed by the test results in Table 2.
We were able to train a fast and efficient volumetric segmentation model which needs GPU resources of only 3.5 gigabytes during optimization and generates satisfying annotations on very small training sets without any task-specific fine-tuning. This encourages us to further improve our PiNet implementation and, as a next step, to extend it to a multilabel classification tool, so that it can easily be applied to arbitrary volumetric segmentation tasks.

Conclusion
To sum up, this work proposes a network architecture employing a data-driven Radon-transform operator to transfer the segmentation task to lower-dimensional space. The segmentation tasks on the projection images can then be solved with well-known model architectures (e.g. U-net), and the results are lifted back to the original dimension by deploying self-adjusting reconstruction operators. The application to two datasets of the 2018 Medical Segmentation Decathlon with small training data availability supports the universality and plausibility of our memory-efficient PiNet approach, which eliminates the need for further data augmentation in tasks with few training pairs. Further steps include extending our approach to multilabel segmentation and implementing a fully automated self-tuning model framework, which can easily be applied to arbitrary high-dimensional segmentation tasks without the need for large memory resources.