Framelet Pooling Aided Deep Learning Network: A Method for Processing High-Dimensional Medical Data

Machine learning-based analysis of medical images often faces several hurdles, such as the lack of training data, the curse of dimensionality, and generalization issues. One of the main difficulties is the computational cost of dealing with the large matrices that represent medical images. The purpose of this paper is to introduce a framelet-pooling aided deep learning method for mitigating the computational burden caused by high dimensionality. By transforming high-dimensional data into low-dimensional components via filter banks while preserving detailed information, the proposed method aims to significantly reduce the complexity of the neural network and the computational cost of the learning process. Various experiments show that our method is comparable to the standard unreduced learning method, while reducing computational burdens by decomposing large-scale learning tasks into several small-scale learning tasks.


Introduction
Recently, medical imaging has been experiencing a paradigm shift due to remarkable and rapid advances in deep learning techniques. Deep learning has expanded our capabilities through sophisticated disentangled representation learning from training data, and appears to show superior performance in various medical imaging problems, including undersampled magnetic resonance imaging (MRI), sparse-view computed tomography (CT), artifact reduction, organ segmentation, and automated disease detection. In particular, U-net (Ronneberger et al. 2015), a kind of convolutional neural network, shows a remarkable capability of learning image representations. However, there are some hurdles to overcome, one of which comes from the high dimensionality, i.e. the high resolution or large size, of medical images. This paper addresses a way to resolve this issue through a so-called framelet pooling aided deep learning network.
Machine learning performance is closely related to the number, the quality, and the pixel dimensionality of the sampled data. For ease of explanation, let us consider the simple problem of learning an unknown function f : [0,1]^d → [0,1] from given samples (x, y), where x is an input gray-scale image lying in [0,1]^d and y = f(x) is the corresponding output in the interval [0,1]. One can then ask how many training samples are needed to approximate f within a given tolerance ε > 0. It is well known that for a Lipschitz continuous function f, we need to sample O(ε^{-d}) points (Mallat et al. 2016). In addition, the author in (Barron 1994) observed that the estimation error of the function f by a one-hidden-layer neural network is given by O(c_f / m) + O((m d / n_data) log n_data), where n_data is the number of training data, m is the number of neurons in the hidden layer, and c_f is a constant depending on the regularity of f. This means that in the case of d = 512² (i.e. considering 512×512 images) and m = d, we roughly need a huge number of training samples, n_data = O(10^12), to achieve an error of O(10^{-1}). This high number of required training data makes the problem intractable, especially when the data lies in a high-dimensional space. Such a phenomenon is referred to as the curse of dimensionality in the approximation sense. Even though the effect of dimensionality on deep networks is relatively weaker than on shallow ones in the approximation sense (Bruna et al. 2013, Pascanu et al. 2013, Mhaskar et al. 2016), deep learning requires a huge computational scale for the training process. Thus, deep networks with high-dimensional data also experience the curse of dimensionality in terms of computational burden. In the literature, framelets are known to be effective in capturing the key information of images.
This is due to the multiscale structure of framelet systems, and the presence of both low-pass and high-pass filters in the filter banks, which are desirable for sparsely approximating images without loss of information (Bin et al. 2017). In this work, we propose a framelet-based deep learning method to reduce the computational burden of dealing with high-dimensional data in the learning process. This method, called framelet pooling, is based on the decomposition of a d-dimensional input-output pair (x, y) into several d/2^{2k}-dimensional pairs {(W_{k,α}x, W_{k,α}y) : α = 1, ..., r}, where each W_{k,α} is a d/2^{2k} × d matrix corresponding to a k-th level framelet packet transform (Mallat 2009). Instead of learning the pair of high-dimensional original data (x, y), the proposed method learns the much lower-dimensional pairs (W_{k,α}x, W_{k,α}y) in parallel fashion, so that we can achieve computational efficiency in dealing with large images.
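The decomposition into lower-dimensional pairs can be sketched with the Haar filter bank, the simplest of the orthonormal systems used in the experiments below. This is a minimal numpy illustration of the operators W_{k,α}, not the paper's implementation:

```python
import numpy as np

def haar_decompose(x):
    """One level of a 2-D Haar filter-bank decomposition.

    Splits an image x (even side lengths) into four subbands
    (LL, LH, HL, HH), each with 1/4 of the original pixels,
    mirroring the operators W_{0,alpha} in the text.
    """
    a, b = x[0::2, 0::2], x[0::2, 1::2]
    c, d = x[1::2, 0::2], x[1::2, 1::2]
    ll = (a + b + c + d) / 2.0   # low-pass: local averages
    lh = (a - b + c - d) / 2.0   # horizontal detail
    hl = (a + b - c - d) / 2.0   # vertical detail
    hh = (a - b - c + d) / 2.0   # diagonal detail
    return ll, lh, hl, hh

def haar_reconstruct(ll, lh, hl, hh):
    """Inverse of haar_decompose: the filter bank is a tight frame,
    so the original image is recovered without information loss."""
    x = np.empty((2 * ll.shape[0], 2 * ll.shape[1]))
    x[0::2, 0::2] = (ll + lh + hl + hh) / 2.0
    x[0::2, 1::2] = (ll - lh + hl - hh) / 2.0
    x[1::2, 0::2] = (ll + lh - hl - hh) / 2.0
    x[1::2, 1::2] = (ll - lh - hl + hh) / 2.0
    return x
```

Because the four analysis filters form an orthonormal system, the decomposition preserves the total energy of the image, which is the "no information loss" property exploited by the method.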
As applications of our proposed method, we consider the undersampled MRI problem (Hyun et al. 2018) and the sparse-view CT problem (Jin et al. 2017), where huge memory problems may arise in recovering high-resolution images. Experiments on undersampled MRI and sparse-view CT show that our framelet pooling aided reduced method provides performance very similar to the standard unreduced method, while greatly reducing the computation time by reducing the dimension of the inputs and the number of learning parameters in the neural networks.

Method
Both the undersampled MRI and the sparse-view CT problem aim to find a reconstruction function f, which maps undersampled data P♯ (violating the Nyquist criterion) to a clinically meaningful tomographic image y. Here, the undersampled data P♯ can be expressed as a subsampling of the fully-sampled data P (satisfying the Nyquist criterion), P♯ = SP, where S is a subsampling operator. Standard MRI and CT use the fully-sampled data P to provide tomographic images, where the reconstruction functions f in MRI and CT are the inverse Fourier transform and the inverse Radon transform, respectively. However, when we use the undersampled data P♯, these standard methods do not work, as the Nyquist criterion is no longer satisfied. (See Fig. 1 and Fig. 2.) For the sake of clarity, we briefly state the mathematical framework of undersampled MRI and sparse-view CT in the following subsections.

Undersampled MRI
Let y(z) be the distribution of nuclear spin density at the position z = (z₁, z₂). The measured k-space data P is governed by the Fourier relation P(ξ) = (Fy)(ξ), where ξ = (ξ₁, ξ₂) (Nishimura 2010). Therefore, with the fully-sampled data P, the reconstruction image y can be obtained by taking the inverse Fourier transform of the measured data P:

y = F⁻¹P.    (3)

Note that the direct inversion method (3) can also be applied to the undersampled data P♯:

y♯ = F⁻¹S*P♯.    (4)

Figure 1. The k-space data in a standard MRI system is measured along two encoding directions: the frequency-encoding direction ξ₁ and the phase-encoding direction ξ₂. The total time for MR imaging is roughly proportional to the number of phase encodings, since each phase encoding requires a repeated excitation of the nuclear spins. Thus, the undersampled MRI reconstruction problem deals with how to reduce the number of phase-encoding lines by undersampling P to P♯ while simultaneously keeping the image quality. However, the reconstruction image y♯ = F⁻¹S*P♯ obtained from the direct inversion method contains aliasing artifacts in the reconstruction domain, according to the Poisson summation formula. Therefore, we aim to develop a function f that recovers the high-quality image y from the artifacted image y♯.
Here, S* is the adjoint operator of S in the ℓ² space. However, the image y♯ obtained from (4) contains aliasing artifacts, as P♯ violates the Nyquist criterion (see Fig. 1).
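A toy numpy sketch makes the zero-filled reconstruction (4) and the resulting aliasing concrete. The disk phantom, image size, and factor-4 sampling pattern here are illustrative assumptions, not the paper's data:

```python
import numpy as np

# Toy illustration of equation (4): zero-filled reconstruction
# y_sharp = F^{-1} S* P_sharp from uniformly undersampled k-space.
n = 64
z1, z2 = np.meshgrid(np.arange(n), np.arange(n), indexing="ij")
y = ((z1 - n / 2) ** 2 + (z2 - n / 2) ** 2 < (n / 4) ** 2).astype(float)  # disk phantom

P = np.fft.fft2(y)                          # fully-sampled k-space, P = F y
mask = np.zeros(n, dtype=bool)
mask[::4] = True                            # keep every 4th phase-encoding line
P_sharp = np.where(mask[:, None], P, 0.0)   # S* S P : zero-fill unmeasured lines

# Direct inversion of the undersampled data. By the Poisson summation
# formula, y_sharp is the average of 4 copies of y shifted by n/4 pixels
# along the phase-encoding direction -- the aliasing artifact.
y_sharp = np.real(np.fft.ifft2(P_sharp))
```

Running this shows the reconstructed disk overlapped with three shifted ghost copies, which is exactly the artifact the de-artifacting map f is trained to remove.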

Sparse-view CT
In CT, the tomographic image y(z) can be regarded as the distribution of linear attenuation coefficients at the position z = (z₁, z₂). For CT data acquisition, X-ray beams are transmitted in various directions θ := (cos φ, sin φ), 0 ≤ φ ≤ 2π. Under the assumption of monochromatic X-ray generation, the projection data P in the direction θ is dictated by the following Radon transform:

P(θ, s) = (Ry)(θ, s) = ∫_{L_{θ,s}} y(z) dz,    (5)

where L_{θ,s} is the projection line L_{θ,s} := {z ∈ ℝ² : θ·z = s} (Seo et al. 2013). With the fully-sampled data P satisfying the Nyquist criterion, y can be reconstructed by the inverse Radon transform:

y = R⁻¹P.    (6)

For the undersampled data P♯, which is measured with a low sampling frequency along the projection-view direction, we can apply the direct inversion formula (6) after filling the unmeasured parts of the undersampled data with zeros. However, the reconstruction image y♯ then contains streaking artifacts, which result from the violation of the Nyquist criterion. Fig. 2 shows schematic and visual descriptions of the sparse-view CT problem.

Figure 2. A sparse-view sinogram is acquired from a sparse measurement of the CT scanner with respect to the projection view. The sparse-view sinogram can be considered a subsampled sinogram SP of the full-view sinogram P. The direct application of R⁻¹ to the sparse-view sinogram with zero-filling S* generates streaking artifacts in the reconstruction image y♯. The main objective of sparse-view CT is to find a function f that recovers the high-quality image y from the artifacted image y♯.

Main result: Undersampled reconstruction using framelet and deep learning
The objective of the undersampled reconstruction problem is to develop a de-artifacting map f, which converts y♯ ∈ ℝ^{d²} (artifacted image) to y ∈ ℝ^{d²} (artifact-removed image), with d² being the pixel dimension of the reconstructed image. In particular, deep learning techniques such as U-net infer f by minimizing the training data fidelity

f = argmin_{f ∈ DL_net} Σ_{i=1}^{N} L(f(x^(i)), y^(i)),    (8)

using a set of training data {(x^(i), y^(i))}_{i=1}^{N}. Here, N is the number of training data, x^(i) denotes the artifacted image (written instead of (y♯)^(i)), DL_net is the set of all learnable functions from a user-defined deep learning network architecture, and L is a user-defined loss function that evaluates the discrepancy between the deep learning output f(x^(i)) and the label y^(i). However, as the pixel dimension of the input increases, the total computational complexity of the training process increases substantially. To address this curse-of-dimensionality issue, we propose the framelet pooling aided deep learning method, which learns the de-artifacting map f indirectly.
Let Ψ = {ψ_1, ..., ψ_r} be a set of framelet generators with associated filter banks {q_α}_{α ∈ {0,1,...,r}}. Then the corresponding affine system X(Ψ) = {ψ_{α,n,k} = 2^n ψ_α(2^n · − k) : 1 ≤ α ≤ r, n ∈ ℤ, k ∈ ℤ²} forms a tight frame for L²(ℝ²), and the filter banks {q_α}_{α ∈ {0,1,...,r}} form a tight frame on ℓ²(ℤ²). This means that these filter banks project high-dimensional data into a low-dimensional space without any information loss. In other words, an invertible decomposition procedure, called the framelet decomposition, can be defined from these filter banks. Now, we define the first-level framelet decomposition operator W^(1) by

W^(1) = [W_{0,0}^T, W_{0,1}^T, ..., W_{0,r}^T]^T,

where W_{0,α} is the d²/2² × d² matrix given by

W_{0,α} x = ↓(q_α ∗₁ x).

Here, ↓ stands for the two-dimensional down-sampling operator and ∗₁ is the convolution operator with stride 1. Likewise, we can define the second-level framelet decomposition W^(2) by

W^(2) = [(W_{1,0}W_{0,0})^T, ..., (W_{1,0}W_{0,r})^T, ..., (W_{1,r}W_{0,r})^T]^T,

where W_{1,α} is the d²/2⁴ × d²/2² matrix defined analogously. We can continue this process to define the k-th level framelet decomposition operator W^(k). Fig. 3 illustrates two examples of framelet decompositions using the Daubechies wavelet (db4) (Daubechies 1988) and the piecewise linear B-spline framelet (Shen et al. 2013). Now, we are ready to explain our proposed deep learning network. Let W^(k_1) and W^(k_2) be framelet decomposition operators. The proposed framelet pooling deep learning network aims to infer the relation between W^(k_1)x and W^(k_2)y in the following least-squares minimization sense:

f = argmin_{f ∈ DL_net} Σ_{i=1}^{N} L(f(W^(k_1)x^(i)), W^(k_2)y^(i)).    (14)

Here, each (W^(k_1)x^(i))_{α_1} and (W^(k_2)y^(i))_{α_2} is an image with d²/2^{2k_1} and d²/2^{2k_2} pixel dimensions, respectively. For example, if the second-level Daubechies 4-tap wavelet decomposition W^(2) is taken for both W^(k_1) and W^(k_2) in equation (14), the proposed deep learning method tries to find the function f satisfying f(W^(2)x) = W^(2)y in the sense of (14), as shown in Fig. 4.
Compared to the direct deep learning scheme (8), the framelet-pooling aided deep learning method (14) is expected to mitigate the total computational complexity and time caused by high-dimensional data in the learning process. In this paper, we test only the case in which the training inputs and labels are decomposed using the same framelet decomposition W^(k). However, our method is not restricted to this specific case.

Figure 4. An example of the proposed method with the Daubechies 4-tap wavelet for the sparse-view CT problem. The dataset {(x^(i), y^(i))}_{i=1}^{N} is decomposed by the second-level Daubechies 4-tap wavelet transform W^(2). Our proposed network tries to infer the relation between {(W^(2)x^(i), W^(2)y^(i))}_{i=1}^{N}.
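The dimension bookkeeping behind the objective (14) can be made concrete with a small numpy sketch. Haar filters are used here as a stand-in for the framelet systems in the paper; a level-2 packet transform turns one 256×256 training pair into 16 pairs of 64×64 subbands, which the network then learns channel by channel:

```python
import numpy as np

def haar_level(x):
    # one filter-bank level: image -> list of 4 subbands, each half the side length
    a, b = x[0::2, 0::2], x[0::2, 1::2]
    c, d = x[1::2, 0::2], x[1::2, 1::2]
    return [(a + b + c + d) / 2, (a - b + c - d) / 2,
            (a + b - c - d) / 2, (a - b - c + d) / 2]

def framelet_packet(x, k):
    """k-level framelet *packet* transform W^(k): every subband
    (low-pass and high-pass alike) is decomposed again at each level."""
    bands = [x]
    for _ in range(k):
        bands = [s for band in bands for s in haar_level(band)]
    return np.stack(bands)              # shape: (4**k, n / 2**k, n / 2**k)

# Dimension bookkeeping for objective (14) with k_1 = k_2 = 2:
rng = np.random.default_rng(1)
x = rng.random((256, 256))              # artifacted input x^(i)
y = x.copy()                            # stand-in label y^(i)
Wx, Wy = framelet_packet(x, 2), framelet_packet(y, 2)
# 16 subband pairs of size 64x64 replace one 256x256 pair; the network
# is trained channel-by-channel, e.g. with an l2 loss
# sum over alpha of || f(Wx)[alpha] - Wy[alpha] ||^2.
print(Wx.shape)                         # (16, 64, 64)
```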

Experimental Set-up for Undersampled MRI
Let {y_MR^(i) ∈ ℝ^{256×256}}_{i=1}^{N} denote the set of MR images reconstructed with Nyquist sampling. Using {y_MR^(i)}, we compute the training inputs

x^(i) = F⁻¹S*SF y_MR^(i),    (15)

where F is the 2D discrete Fourier transform, F⁻¹ is the 2D discrete inverse Fourier transform, and S is a user-chosen subsampling operator. In our experiments, we use MR images y_MR^(i) obtained from a T2-weighted turbo spin-echo pulse sequence with 4408 ms repetition time, 100 ms echo time, and 10.8 ms echo spacing (Loizou et al. 2011). The Fourier transform and its inverse are computed via fft2 and ifft2 in the Python package numpy.fft. Finally, for the sampling strategy, we choose uniform subsampling with factor 4 plus 12 additional low-frequency lines among the total of 256 lines (Hyun et al. 2018).
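The sampling pattern described above can be sketched as follows. The exact placement of the 12 extra low-frequency lines is an assumption for illustration (a symmetric band around the k-space center); the paper's mask may differ in detail:

```python
import numpy as np

# Hypothetical sketch of the described sampling pattern: uniform factor-4
# subsampling of the 256 phase-encoding lines plus 12 additional
# low-frequency lines around the k-space center.
n_lines = 256
mask = np.zeros(n_lines, dtype=bool)
mask[::4] = True                        # uniform subsampling, factor 4
center = n_lines // 2
mask[center - 6:center + 6] = True      # 12 additional low-frequency lines

def undersample(y_mr, line_mask):
    """x = F^{-1} S* S F y : zero-fill the unmeasured phase-encoding lines."""
    k_space = np.fft.fftshift(np.fft.fft2(y_mr))   # center the low frequencies
    k_space[~line_mask, :] = 0.0                   # S* S : keep only sampled lines
    return np.fft.ifft2(np.fft.ifftshift(k_space))
```

Applying `undersample` to each fully-sampled image y_MR^(i) with this mask produces the aliased training inputs x^(i).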
In order to test our proposed method, we decompose the dataset using the k-level framelet decomposition W^(k) with various filter banks. We obtain

{(W^(k)x_MR^(i), W^(k)y_MR^(i))}_{i=1}^{N},    (16)

where both W^(k)x_MR^(i) and W^(k)y_MR^(i) contain r^k pairs of 256/2^k × 256/2^k images. Here, k is the decomposition level and r is the number of filters q_α.

Experimental Set-up for Sparse-view CT
Let {y_CT^(i) ∈ ℝ^{512×512}}_{i=1}^{N} be a set of CT images reconstructed with Nyquist sampling. The corresponding deep learning training inputs are computed as

x^(i) = R⁻¹S*SR y_CT^(i),    (17)

where R is the discrete Radon transform, R⁻¹ is the filtered back-projection algorithm, and S is a user-defined sampling operator. In our implementation, we use the projection algorithm radon and the filtered back-projection algorithm iradon in the Python package skimage.transform to compute R and its inverse R⁻¹, respectively. Uniform subsampling with factor 6 with respect to the projection view is used for S in (17).
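The view-subsampling operator S in (17) can be sketched as below. The paper computes R and R⁻¹ with skimage.transform.radon/iradon; here we only mimic the sampling pattern on a sinogram-shaped array, whose size (512 detector bins × 360 views) is an assumption for illustration:

```python
import numpy as np

# Illustrative sketch of the view-subsampling operator S in (17).
n_det, n_views = 512, 360
theta_full = np.linspace(0.0, 180.0, n_views, endpoint=False)  # full projection views
theta_sparse = theta_full[::6]                                 # uniform factor-6 subsampling

def zero_fill_views(sinogram, factor=6):
    """S* S P : keep every `factor`-th projection view (column) and zero
    the rest, so the full-view inverse R^{-1} (e.g. iradon) still applies."""
    filled = np.zeros_like(sinogram)
    filled[:, ::factor] = sinogram[:, ::factor]
    return filled
```

Feeding the zero-filled sinogram to the full-view filtered back-projection produces the streaking-artifacted inputs x^(i) described above.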
Applying the same process used to generate the dataset (16) in the undersampled MRI experiments, we obtain the following decomposed dataset for the sparse-view CT problem:

{(W^(k)x_CT^(i), W^(k)y_CT^(i))}_{i=1}^{N},    (18)

where W^(k) is a k-level framelet decomposition.

Network Configuration
To test our proposed method, we adapt the U-net architecture (Ronneberger et al. 2015), as shown in Fig. 5, where the first half of the network is the contracting path and the second half is the expansive path. At the first layer of the U-net in Fig. 5, the input W^(k)x is convolved with the set of convolution filters C^(1) to generate a set of feature maps h^(1), given by

h^(1) = ReLU(C^(1) ∗₁ W^(k)x),

where ReLU is the rectified linear unit, ReLU(x) = max{x, 0}, and ∗₁ stands for convolution with stride 1. We repeat this process to get h^(2) = ReLU(C^(2) ∗₁ h^(1)) and apply max pooling to get h^(3). Through this contracting path, we obtain low-dimensional feature maps by applying either convolution or max pooling. In the expansive path, we use 2×2 average unpooling instead of max pooling to restore the size of the output. To restore details in the image, the upsampled output is concatenated with the corresponding feature from the contracting path. At the last layer, a 1×1 convolution is used to combine the features into one integrated feature (Ronneberger et al. 2015). The U-net in the top row of Fig. 5 is denoted by U^(0)-NET. The U-net in the middle row, denoted by U^(1)-NET, is the reduced network obtained by eliminating two 3×3 convolution layers and one pooling/unpooling layer from the first and last parts of U^(0)-NET. Similarly, U^(2)-NET is the reduced network obtained by eliminating 3×3 convolution layers and a pooling/unpooling layer from the first and last parts of U^(1)-NET.

Figure 5. Image restoration performance of U^(0)-NET (unreduced), U^(1)-NET (one pooling), and U^(2)-NET (double pooling) for undersampled MRI with 1500 training data, using various filter banks with r filters. We set all networks to have feature depth 16, which means that the channel counts along the network paths are multiples of 16. Although the numbers of learning parameters and the training speeds are very different, the final results are very similar.
Thus, this process can be viewed as replacing operations with unknown, trainable parameters by framelet operations with known, fixed parameters. In our experiments, U^(0)-NET is used to learn f in the sense of direct learning (8). The reduced U^(k)-NET (k = 1, 2) is trained with the k-level framelet-decomposed dataset in the sense of (14).
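The first contracting-path steps can be sketched in plain numpy. The input size and filter count below are illustrative assumptions, and cross-correlation stands in for the convolution layer, as is conventional in deep learning frameworks:

```python
import numpy as np

def conv2d_same(x, kernels):
    """Stride-1 'same' cross-correlation of a single-channel image with a
    bank of square kernels; a plain-numpy stand-in for C^(1) *_1 (.)."""
    k = kernels.shape[-1]
    pad = k // 2
    xp = np.pad(x, pad)
    out = np.zeros((kernels.shape[0],) + x.shape)
    for c, w in enumerate(kernels):
        for i in range(x.shape[0]):
            for j in range(x.shape[1]):
                out[c, i, j] = np.sum(xp[i:i + k, j:j + k] * w)
    return out

def relu(x):
    return np.maximum(x, 0.0)           # ReLU(x) = max{x, 0}

def max_pool2(h):
    """2x2 max pooling over each feature map, halving both sides."""
    c, m, n = h.shape
    return h.reshape(c, m // 2, 2, n // 2, 2).max(axis=(2, 4))

# First contracting-path steps on a (framelet-pooled) input W^(k) x:
rng = np.random.default_rng(0)
x = rng.random((64, 64))                # one 64x64 subband as network input
C1 = rng.normal(0.0, 0.01, (16, 3, 3))  # 16 filters, N(0, 0.01^2) init as in the paper
h1 = relu(conv2d_same(x, C1))           # h^(1) = ReLU(C^(1) *_1 W^(k) x)
h3 = max_pool2(h1)                      # -> 16 feature maps of size 32x32
```

Each framelet pooling level removes exactly one such conv/pool stage from the front (and its mirror from the back) of the U-net, which is where the parameter and time savings come from.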

Experimental Result
All training processes are performed on a computer system with two Intel(R) Xeon(R) E5-2630 v4 CPUs (2.20 GHz), 128 GB DDR4 RAM, and four NVIDIA GTX-1080 Ti GPUs. We initialize all weights from a zero-centered normal distribution with standard deviation 0.01. We use the ℓ² loss for the loss function L. The loss function is minimized using the Adam optimizer with batch normalization for fast convergence (Kingma et al. 2015, Ioffe et al. 2015). For training stability, a small learning rate of 10⁻⁶ is used. To ensure the convergence of the loss function, each network is trained until the training loss appears to have converged sufficiently. For the undersampled MRI application, evaluations and comparisons are given in Table 1 and 2. For the sparse-view CT application, evaluations and comparisons are given in Table 3 and Table 4. Table 1 and Table 3 show comparisons of the average computational time per epoch among U^(0)-NET, U^(1)-NET, and U^(2)-NET. The average computational time is computed by dividing the total computational time by the total number of epochs. Table 2 and Table 4 contain test error evaluations and comparisons using three different metrics: mean squared error (MSE), peak signal-to-noise ratio (PSNR), and structural similarity (SSIM) (Wang et al. 2004).
These experimental results support the claim that the proposed method efficiently reduces the total computational time and provides results competitive with the direct learning algorithm using high-dimensional images. Namely, our reduced method provides performance very similar to the standard unreduced method (U^(0)-NET), while greatly reducing the computation time by reducing the input dimension. We also test our proposed method with three different framelets and compare their performance, as shown in Tables 2 and 4 for the quantitative evaluation and Tables 1 and 3 for the computational time. The experimental results show that the Haar and Db4 wavelets reduce the computational time more efficiently than the PL framelet, but the PL framelet exhibits better performance than the Haar and Db4 wavelets. Compared to Haar and Db4, which consist of 4 filters, the PL framelet has 9 filters (i.e. the number of filters equals the size of the filters), which can increase the computational time. However, it should be noted that Haar and Db4 are orthonormal bases, while the PL framelet is a redundant tight frame system. This means that, thanks to the redundancy, the error generated by the nonlinear deep learning process is likely to lie in the nontrivial null space of the reconstruction operator, which can make the PL framelet yield better results than the orthonormal bases (Haar and Db4) (Bin et al. 2017). Lastly, we would like to mention that the computational time increases in the case of U^(2)-NET with the PL framelet in the undersampled MRI problem, compared to the original network U^(0)-NET. We observe that the reduction of computational time depends on the feature depth of the network. In order to reduce the total computational complexity of the experiments as much as possible, our networks are set to have feature depth 16, as described in Fig. 5. However, when the feature depth increases, U^(2)-NET with the PL framelet also exhibits the ability to reduce computational time.

Conclusion and Discussion
In this paper, we proposed the framelet pooling aided deep learning network to reduce computational burdens in the training process. The proposed method decomposes large-scale learning tasks into several small-scale learning tasks through the framelet packet transform, so that we can handle large-scale medical imaging in a limited computing environment. Experimental results on undersampled MRI and sparse-view CT reconstruction problems show that our framelet pooling method is at least comparable to the standard deep learning based method, but is able to significantly reduce the total computational time of the training process. Hence, we expect that our method is not limited to two-dimensional medical imaging problems. It seems possible that the framelet pooling method can be extended to deep learning problems with large-scale three-dimensional medical imaging, which inevitably suffer from high computational complexity due to the high dimensionality of the dataset.
In the experiments, we can see that the choice of filter banks indeed affects the performance of the proposed method. The use of a tight frame can increase the reconstruction accuracy thanks to the rich representation of the redundant system, but the reduction in computational time can be marginal due to the increasing number of convolutions. In contrast, the orthogonal wavelet representation provides a large reduction in computational time by using only 4 filters, but generates less accurate results. Hence, future work will focus on the construction of framelet transforms that are both adaptive to a given task (Cai et al. 2014) and computationally efficient. It would also be interesting to provide a theoretical analysis of the approximation properties of our deep learning network.