Article

Efficient Knowledge Distillation for Brain Tumor Segmentation

Yuan Qi, Wenyin Zhang, Xing Wang, Xinya You, Shunbo Hu and Ji Chen

1 School of Information Science and Engineering, Linyi University, Linyi 276000, China
2 College of Engineering, University of Sydney, Sydney 2006, Australia
* Author to whom correspondence should be addressed.
Appl. Sci. 2022, 12(23), 11980; https://doi.org/10.3390/app122311980
Submission received: 9 October 2022 / Revised: 15 November 2022 / Accepted: 16 November 2022 / Published: 23 November 2022

Abstract

Deep learning has allowed great progress to be made in obtaining more accurate prediction results for brain tumor segmentation. Mainstream research approaches improve segmentation accuracy by modifying deep-learning model architectures, while ignoring the computational and storage efficiency of segmentation. In this paper, we proposed an improved knowledge distillation method, coordinate distillation (CD), which integrates channel and spatial information and performs brain tumor segmentation by training the student network under the guidance of the teacher network, without changing the original network architecture. Experimental results showed that the method was effective and that it enhanced the segmentation accuracy of brain tumors without sacrificing segmentation efficiency.

1. Introduction

In the field of medical image processing and analysis [1], medical image segmentation is a critical and complex step. Deep learning is used to automatically extract the important regions of medical images to assist doctors in clinical treatment and pathology research. Brain tumor segmentation, as a type of medical image segmentation, is also a focus of researchers’ attention. Tumors originating from the neuroepithelium are collectively called gliomas and are the most common primary intracranial tumors. Typically, gliomas are classified into four grades [2], where the higher the grade, the worse the prognosis. Much research on brain tumor segmentation has been proposed; however, owing to the uncertainty of a brain tumor’s shape, location and size, tumors are difficult to locate accurately during segmentation, making brain tumor segmentation a severe challenge [3,4,5].
Meanwhile, the large majority of approaches improve model accuracy by modifying the model architecture, inevitably adding computationally expensive components and expanding the required storage space. How to resolve this dilemma and find a better balance between efficiency and accuracy has long been discussed. Hinton et al. [6] introduced knowledge distillation (KD) to the deep-learning domain, and it has gained much attention for its simplicity and efficiency. The core idea of KD is to use soft labels to learn the class distribution: a teacher network is first trained to obtain its outputs, and then, during training, the student network learns to match the outputs of the teacher network, so that the student’s outputs ultimately resemble the teacher’s. Many methods have since been proposed to improve the learning accuracy of student networks under KD. For example, Yang et al. [7] found that, compared to the ground truth class alone, the secondary classes could effectively convey inter-class similarity and thus transfer knowledge better. Xie et al. [8] framed the problem as a data issue: they first trained the teacher network to generate pseudolabeled datasets and then combined them with the original datasets to form noisy datasets for learning. Phuong et al. [9] stated that KD performed poorly when the gap between student and teacher capacity was large, and they proposed a new ensemble distribution KD for training. Romero et al. [10] proposed learning knowledge through intermediate feature layers. Jiao et al. [11] proposed a new transformer distillation, performed in both the pretraining and task-specific learning phases. Sanh et al. [12] simply applied KD to BERT, making it 60% faster while retaining 97% of its accuracy. Wang et al. [13] distilled the final transformer layer from the teacher and introduced teaching assistants to aid distillation. Unfortunately, these approaches did not consider the validity of distillation in brain tumor segmentation scenarios. Only a few researchers have applied knowledge distillation to brain tumor segmentation, and most of them focused on the problems of small datasets and multimodality [14,15,16] rather than on the efficiency of brain tumor segmentation.
In this paper, considering brain tumor segmentation as a structured dense prediction task, we proposed a new knowledge distillation method. Different from channel attention, which considers only the importance of different channels, the proposed coordinate distillation (CD) also considers spatial information. It encodes channels along two spatial directions, which is particularly important for capturing spatial structure in segmentation tasks and helps the model segment brain tumors accurately. Our extensive experiments on the public BraTS2018 dataset demonstrated the remarkable performance of the method.
In general, our contributions are summarized as follows.
(1) By studying knowledge distillation for training networks, a simple and effective CD module is presented in this paper, which models channel correlations and long-range dependencies to improve the segmentation accuracy of the student network.
(2) An effective knowledge distillation architecture is provided for brain tumor segmentation. Experiments on the public dataset BraTS2018 showed the feasibility of the method.

2. Related Work

2.1. Brain Tumor Segmentation

In recent years, research on brain tumor segmentation has developed continuously. Ronneberger et al. [17] proposed UNet, whose defining features are its U-shaped structure and skip connections. The skip connections combine low-resolution and high-resolution information: low-resolution information helps identify object categories, while high-resolution features provide accurate positions for segmentation. This network is well suited to medical image segmentation, and with 39.4 M parameters it is also suitable to serve as a teacher network. After UNet was proposed, a series of upgraded networks appeared, most of them variations on UNet. For example, Isensee et al. [18] reported an improved UNet for brain tumor segmentation that avoided overfitting by using the Dice loss and data augmentation. In [19], a comprehensive scheme of long and short connections was proposed; the model has only 9.2 M parameters but lower precision, making it suitable as a student network. Zhang et al. proposed DeepResUNet [20], pointing out that combining low-level and high-level semantic information via skip connections is important, but that training such a deep neural network with limited medical training samples is difficult. Residual units were proposed to simplify the deep network, reducing the model parameters to 32.6 M; however, the same problem of reduced accuracy remained, so it was suitable to act as a student network in our experiments. Oktay et al. [21] indicated that the maps extracted by the encoder contain much redundant information, and proposed adding soft attention at the end of the skip connection to suppress the activation of irrelevant regions and thus segment medical images better; however, this increased the computational complexity, raising the parameter count to 39.7 M, so this network is also suitable to serve as a teacher network. Other methods, such as [22,23,24], also enhanced UNet to segment brain tumors better. In addition, some researchers have recognized the importance of capturing spatially continuous information, confirming that 3D brain tumor segmentation is more effective than 2D segmentation [25,26,27].

2.2. Knowledge Distillation

Knowledge distillation is a method that improves the performance of lightweight models by transferring knowledge from powerful, but heavy, networks without losing efficiency. Hinton et al. formally defined distillation and proposed corresponding training methods [6]. After that, many distillation strategies were developed, roughly classified into three categories [28]. The first is logits-based knowledge, which extracts knowledge from the output layer of the teacher network. Zhang et al. [29] broke this predefined “strong and weak relationship” and proposed deep mutual learning (DML), which allows a group of student networks to learn from each other during training, rather than following a static, predefined one-way transfer path from teacher to student. In [30], the authors observed that even when the teacher network segmented very well, the distilled student network might not, presumably owing to a capacity mismatch that prevented the student from imitating the teacher, to the detriment of its main task loss. They proposed an early-stopped teacher regularization, in which distillation is stopped early, near convergence. The second is feature-based knowledge, which extracts knowledge from the intermediate hidden layers of the teacher network and, compared with the first category, can learn deeper knowledge. Liu et al. [31] combined pairwise distillation techniques and a GAN to extract knowledge. Qin et al. [32] proposed computing the inter-class contrast between different tissue regions under the guidance of the labeled segmentation template. Zagoruyko et al. [33] guided the student’s attention maps to imitate those of the teacher. Finally, relation-based knowledge learns the relationships between layers, arguing that learning these relationships is akin to learning how to solve a problem. Yim et al. [34] fitted not the teacher’s outputs but the relationships between the layers of the teacher model, defined by the inner product between layers. Tung et al. [35] proposed a similarity-preserving loss to motivate the student to mimic the teacher’s relational expressions within the data. Our proposed approach distills knowledge from both the output and the intermediate hidden layers, because it is easier to learn the answers directly than to learn the ideas behind the solution.

3. Methods

The proposed integral distillation structure is shown in Figure 1. Let an MRI brain tumor image be denoted by $X \in \mathbb{R}^{h \times w}$, which is used as the input image; a prediction image of the same size, $Q \in \mathbb{R}^{h \times w}$, is finally output, where $h$ represents the height of the image and $w$ its width.
The overall distillation structure is realized by the two green rectangular modules in the figure: CD and KD. KD drives the student’s final-layer output to imitate that of the teacher. CD drives the student’s intermediate feature maps $E_1^s, E_2^s$ to imitate the teacher’s intermediate feature maps $E_1^t, E_2^t$. Finally, the student’s segmentation task loss $L_{seg}$ is added to ensure basic performance on the input domain. Through this distillation structure, the student network can learn the segmentation experience of the teacher network and thus better carry out its segmentation task.
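As an illustration, the following minimal PyTorch sketch shows one way the intermediate feature maps of teacher and student could be tapped with forward hooks, leaving both architectures unchanged; the layer names `enc1` and `enc2` are hypothetical placeholders for whichever encoder stages are paired in practice.

```python
import torch
import torch.nn as nn


def attach_feature_hooks(model: nn.Module, layer_names):
    """Register forward hooks that stash the outputs of the named
    submodules in a dict, so intermediate feature maps can be read
    after every forward pass without modifying the architecture."""
    feats = {}

    def make_hook(name):
        def hook(module, inputs, output):
            feats[name] = output
        return hook

    for name, module in model.named_modules():
        if name in layer_names:
            module.register_forward_hook(make_hook(name))
    return feats


# Hypothetical usage: tap two stages of matching size in both networks.
# teacher_feats = attach_feature_hooks(teacher, {"enc1", "enc2"})
# student_feats = attach_feature_hooks(student, {"enc1", "enc2"})
```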

3.1. Knowledge Distillation

Using the basic knowledge distillation method KD [6], we treated the segmentation map as a collection of pixelwise classification problems; by computing the difference between the last-layer outputs of the two networks, the student $s$ could learn from the teacher $t$. This training was driven by the following loss function:

$$L_{KD} = \frac{1}{N} \sum_{i=1}^{N} KL\left(P_i^s \,\|\, P_i^t\right) \quad (1)$$

where $N = w \times h$ is the number of pixels in the segmentation map and $KL(\cdot)$ is the Kullback–Leibler divergence [36]. $P_i^s$ represents the class probability of the $i$th pixel in the prediction map of the student network, and $P_i^t$ represents that of the teacher network.
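A minimal PyTorch sketch of Equation (1) is given below, assuming logits of shape (B, C, H, W); following common KD practice, the teacher distribution is treated as the fixed target of the KL term, which is an assumption about the intended direction of the divergence.

```python
import torch.nn.functional as F


def kd_loss(student_logits, teacher_logits):
    """Equation (1): per-pixel KL divergence between student and teacher
    class distributions, averaged over all N = h * w pixels."""
    log_p_s = F.log_softmax(student_logits, dim=1)
    p_t = F.softmax(teacher_logits, dim=1)
    # kl_div(input, target) computes target * (log target - input) elementwise;
    # summing over the class dim gives the KL per pixel, then we average.
    kl = F.kl_div(log_p_s, p_t, reduction="none")
    return kl.sum(dim=1).mean()
```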

3.2. Coordinate Distillation

Inspired by [37], we recommended adding location information to channel attention transfer (AT) [33], prompting the attention map to capture long-range dependencies with accurate location information, which is especially important for capturing spatial structure in segmentation tasks. Following the work on channel attention transfer, we assumed that the absolute value of a neuron’s activation indicates its importance. Specifically, given the input $E$, we summed the squared activations over the channel dimension along the horizontal ($w$) and vertical ($h$) coordinates, so the outputs can be described as follows:

$$\varphi_h(E) = \sum_{i=1}^{c} \left| E_i^h \right|^2 \quad (2)$$

$$\varphi_w(E) = \sum_{i=1}^{c} \left| E_i^w \right|^2 \quad (3)$$

where $c$ is the number of channels. Next, we multiplied the outputs $\varphi_w$ and $\varphi_h$ to obtain the final coordinate attention map $y$:

$$y = \sigma(\varphi_w(E)) \times \sigma(\varphi_h(E)) \quad (4)$$

where $\sigma(\cdot)$ is the Sigmoid operation. This training was driven by the following loss:

$$L_{CD} = \sum_{(i,j) \in M} \left\| \frac{y_i^s}{\left\| y_i^s \right\|_1} - \frac{y_j^t}{\left\| y_j^t \right\|_1} \right\|_1 \quad (5)$$

where $M$ is the set of index pairs of all possible positions with the same embedding size, $(i, j)$ ranges over the pairs in $M$, and $\| \cdot \|_1$ is the $\ell_1$ norm.
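The following sketch implements one plausible reading of Equations (2)–(5) in PyTorch, in which $|E_i^h|^2$ is taken as the squared norm of the $h$th row of channel $i$, so each directional descriptor reduces over the channel dimension and one spatial axis; this interpretation, the teacher detachment and the small epsilon added for numerical stability are assumptions.

```python
import torch


def coord_attention_map(feat):
    """Equations (2)-(4): channel-summed squared activations are reduced
    along each spatial direction, squashed by a sigmoid, and combined
    into a 2D coordinate attention map. feat: (B, C, H, W)."""
    energy = feat.pow(2).sum(dim=1)          # (B, H, W): sum |E_i|^2 over channels
    phi_h = energy.sum(dim=2)                # (B, H): reduce along the w axis
    phi_w = energy.sum(dim=1)                # (B, W): reduce along the h axis
    y = torch.sigmoid(phi_h).unsqueeze(2) * torch.sigmoid(phi_w).unsqueeze(1)
    return y.flatten(1)                      # (B, H*W)


def cd_loss(student_feats, teacher_feats):
    """Equation (5): L1 distance between l1-normalized coordinate attention
    maps of paired student/teacher features of matching spatial size."""
    loss = 0.0
    for f_s, f_t in zip(student_feats, teacher_feats):
        y_s = coord_attention_map(f_s)
        y_t = coord_attention_map(f_t).detach()   # no gradients to the teacher
        y_s = y_s / (y_s.norm(p=1, dim=1, keepdim=True) + 1e-8)
        y_t = y_t / (y_t.norm(p=1, dim=1, keepdim=True) + 1e-8)
        loss = loss + (y_s - y_t).abs().sum(dim=1).mean()
    return loss
```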

3.3. Network Optimization

The total loss function is shown below:

$$L_{total} = \alpha L_{seg} + \beta_1 L_{KD} + \beta_2 L_{CD} \quad (6)$$

where $L_{seg}$ is the cross-entropy loss. The hyperparameter $\alpha$ was set to 0.9, $\beta_1$ to 0.1 and $\beta_2$ to 1.8. Our experiments demonstrated that these settings were the most conducive to the effectiveness of our method.
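Combining the pieces, a sketch of Equation (6) with the reported weights might look as follows; `kd_loss` and `cd_loss` refer to the sketches above, and the teacher outputs are detached so that only the student is updated.

```python
import torch.nn.functional as F


def total_loss(student_logits, teacher_logits, student_feats, teacher_feats,
               target, alpha=0.9, beta1=0.1, beta2=1.8):
    """Equation (6) with the hyperparameter settings reported in the paper."""
    l_seg = F.cross_entropy(student_logits, target)           # supervised term
    l_kd = kd_loss(student_logits, teacher_logits.detach())   # Equation (1)
    l_cd = cd_loss(student_feats, teacher_feats)              # Equation (5)
    return alpha * l_seg + beta1 * l_kd + beta2 * l_cd
```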

4. Experiments

4.1. Datasets

The experiments were conducted on the publicly available brain tumor segmentation dataset BraTS2018 [38]. The multimodal brain tumor segmentation challenge has the longest history of all MICCAI competitions and attracts the largest number of participants almost every year, making it a good platform for understanding the cutting edge of segmentation methods. The dataset has a training set of 285 cases with T1, T2, T1CE and FLAIR modalities, and the task is to segment the whole tumor (WT), the tumor core (TC) and the enhancing tumor (ET). The size of each MR image is $240 \times 240 \times 155$.

4.2. Implementation Details

To verify the effectiveness of our distillation method, we employed networks with good performance, such as UNet and AttUNet, as teacher networks, and networks with weaker performance, such as DeepResUNet and UNet++, as student networks. We implemented all the segmentation networks in our experiments in PyTorch and ran them on a single NVIDIA GeForce RTX 2060 GPU (6 GB). The Adam optimizer was used to train the networks with a weight decay of 0.0001, a batch size of 8 and an initial learning rate of 0.0003. On the BraTS dataset, we first normalized each sequence of the brain MR images by z score, cropped the redundant background around the brain, and then sliced the volumes into 2D data to fit the 2D networks. To mitigate the class imbalance problem, slices that did not contain lesions were discarded, and, because the data are multimodal, the slices of the four modalities were combined into training cases with four channels. All 285 BraTS2018 cases were used for training. As BraTS publishes only a training set and provides no labeled test set, our test set consisted of the cases added to the BraTS2019 training set beyond those in BraTS2018, comprising 49 HGG cases and 1 LGG case, 50 in total. During training, we used $160 \times 160$ images as training samples and trained all the networks to convergence over 20 epochs.
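A simplified sketch of the described preprocessing for a single case is shown below; the modality keys, the array layout and the omission of the background-cropping step are assumptions made for brevity.

```python
import numpy as np


def preprocess_case(volumes, labels):
    """Sketch of the described pipeline for one BraTS case: z-score each
    modality, slice to 2D, discard lesion-free slices and stack the four
    modalities as channels. `volumes` maps a modality name to an (H, W, D)
    array; `labels` is the (H, W, D) ground-truth volume. Cropping the
    background down to 160x160 is omitted here."""
    normed = {m: (v - v.mean()) / (v.std() + 1e-8) for m, v in volumes.items()}
    samples = []
    for z in range(labels.shape[2]):
        if labels[..., z].sum() == 0:        # skip slices without lesions
            continue
        x = np.stack([normed[m][..., z] for m in ("t1", "t1ce", "t2", "flair")])
        samples.append((x.astype(np.float32), labels[..., z].astype(np.int64)))
    return samples
```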

4.3. Evaluation Metrics

The Dice score and the Hausdorff distance are generally employed as measures of brain tumor segmentation. The Dice score measures the similarity between two samples; its value ranges from 0 to 1, where 1 is best and 0 is worst. Dice is defined as follows:

$$Dice = \frac{2 \left| Q \cap G \right|}{\left| Q \right| + \left| G \right|} \quad (7)$$

where $Q$ represents the prediction mask and $G$ represents the ground truth.
The Hausdorff distance is a metric that evaluates the maximum segmentation boundary distance between $Q$ and $G$; it is defined as follows:

$$HD = \max \left\{ \max_{q \in Q} \min_{g \in G} d(q, g),\; \max_{g \in G} \min_{q \in Q} d(q, g) \right\} \quad (8)$$
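Both metrics can be computed for binary masks along the following lines; this sketch uses SciPy's directed Hausdorff distance over the foreground point sets rather than extracted boundaries, which is a simplification.

```python
import numpy as np
from scipy.spatial.distance import directed_hausdorff


def dice_score(pred, gt):
    """Equation (7): Dice overlap between two binary masks."""
    inter = np.logical_and(pred, gt).sum()
    return 2.0 * inter / (pred.sum() + gt.sum() + 1e-8)


def hausdorff_distance(pred, gt):
    """Equation (8): symmetric Hausdorff distance between the foreground
    point sets of the two masks."""
    p = np.argwhere(pred)
    g = np.argwhere(gt)
    return max(directed_hausdorff(p, g)[0], directed_hausdorff(g, p)[0])
```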

5. Ablation Study

5.1. Effectiveness of Distillation Methods

The proposed distillation methods, CD and KD, are collectively referred to as CK, and a large number of experiments were carried out on the BraTS2018 dataset to verify the effectiveness of CK. Figure 2 presents some visual examples, in which our method corrected some of the student’s mistakes and brought the predicted mask closer to the ground truth.
We used UNet and AttUNet as our teachers and DeepResUNet and UNet++ as our students. Table 1 lists the performance of our distillation method CK for different combinations of teacher and student networks. The experimental results showed that the knowledge distillation architecture proposed in this paper played a positive role in improving the segmentation accuracy of the student network. After learning from UNet, DeepResUNet gained up to 1.25, 0.72 and 2.84 in the ET, WT and TC Dice scores, respectively, and improved by 0.08, 0.07 and 0.16 in the corresponding Hausdorff distances. Meanwhile, after learning from the teacher AttUNet, the ET Dice score of the student DeepResUNet rose from 78.44 to 80.22, and that of the student UNet++ rose from 78.47 to 80.48; DeepResUNet was thus the most outstanding student. To demonstrate the effectiveness of our method more intuitively, we compared the obtained results with currently popular networks, using AttUNet as the teacher network and UNet++ as the student network. Kong et al. [24] proposed HybridResUNet, which uses ResNet blocks for downsampling and normal double convolution blocks for upsampling; since the ResNet blocks helped improve performance over UNet, we also included [24] in the comparison. As Table 2 shows, UNet++ trained with our method achieved the best results while keeping a low parameter count.

5.2. Comparison with AT

We compared our approach with AT. We performed this part of the experiment using UNet and DeepResUNet, and extracted feature maps at the same locations in the network to ensure a fair comparison between the different distillation methods. Their outputs are visualized in Figure 3, where it can be seen intuitively that CD corrected some of AT’s segmentation errors.
Table 3 shows the comparison results. In contrast to AT, our distillation method CD improved every metric. It is worth noting that the Dice score of CD in TC improved from 85.66 to 87.34, and its Hausdorff distance in TC improved from 1.64 to 1.55.

5.3. Ablation Experiments

As the final part of the experiments, to better verify the effectiveness of CD, we performed ablation experiments on the BraTS2018 dataset.
Table 4 lists the performance of the student with and without KD and CD. The results showed that each method had a positive effect on the student and, especially when the two methods were combined, they enabled the student network to obtain better segmentation results. In addition, Table 5 demonstrates the sensitivity of our method to the hyperparameters, obtained by varying the weights $\alpha$, $\beta_1$ and $\beta_2$ of Equation (6). Within a certain range, each combination of hyperparameters was trained for 20 epochs, and we finally chose 0.9, 0.1 and 1.8 for all our experiments.

6. Discussion

In this work, we proposed CD, one of the few knowledge distillation methods designed for brain tumor segmentation. One of the reasons this method works is that CD embeds location information into channel attention, and, hence, student networks can better handle the problem of uncertain tumor location. In [32], the authors proposed a knowledge distillation method to address the boundary ambiguity problem of medical image segmentation; it uses the ground truth to extract per-class regional information from the feature map and computes a regional contrast value by measuring the similarities between regions of different classes. However, information beyond the ground truth is also important. Compared with that distillation method, our method captures global information better and is easier to use and understand.
Table 1 contains some interesting experimental results: the segmentation metrics of every student network improved after using our method, and almost every student network outperformed its teacher, a phenomenon rarely seen, even in [32]. Our knowledge distillation architecture is a combination of KD and CD, and further knowledge distillation modules could be added to it. The current segmentation is based on 2D brain tumor images, but we believe that, with a slight extension, the method could be applied to 3D brain tumor images as well; 3D images carry rich spatial information, from which the student network could obtain even more using our method.

7. Conclusions

In this article, the CD distillation method was proposed for brain tumor segmentation, and a series of experiments was conducted on the BraTS2018 dataset. In future work, we will further explore the most appropriate combinations of teacher and student networks. In addition, we will apply the proposed knowledge distillation method to related problems such as dataset distillation, to address the problem of small datasets in medical images.

Author Contributions

Conceptualization, Y.Q., W.Z. and S.H.; methodology, Y.Q., W.Z. and X.W.; validation, Y.Q. and X.Y.; formal analysis, J.C. and S.H.; investigation, Y.Q. and X.Y.; writing—original draft preparation, Y.Q. and X.Y.; writing—review and editing, Y.Q., W.Z., J.C. and X.W. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Natural Science Foundation of China (NSFC) under grant 62006107 and by the Natural Science Foundation of Shandong Province (Nos. ZR2020MF058 and ZR2020MF029).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Not applicable.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Litjens, G.; Kooi, T.; Bejnordi, B.E.; Setio, A.A.A.; Ciompi, F.; Ghafoorian, M.; Van Der Laak, J.A.; Van Ginneken, B.; Sánchez, C.I. A survey on deep learning in medical image analysis. Med. Image Anal. 2017, 42, 60–88. [Google Scholar]
  2. Louis, D.N.; Perry, A.; Reifenberger, G.; Von Deimling, A.; Figarella-Branger, D.; Cavenee, W.K.; Ohgaki, H.; Wiestler, O.D.; Kleihues, P.; Ellison, D.W. The 2016 World Health Organization classification of tumors of the central nervous system: A summary. Acta Neuropathol. 2016, 131, 803–820. [Google Scholar]
  3. Liu, Z.; Tong, L.; Chen, L.; Jiang, Z.; Zhou, F.; Zhang, Q.; Zhang, X.; Jin, Y.; Zhou, H. Deep learning based brain tumor segmentation: A survey. Complex Intell. Syst. 2022, 1–26. [Google Scholar] [CrossRef]
  4. Xie, Y.; Zhang, J.; Xia, Y.; Shen, C. A mutual bootstrapping model for automated skin lesion segmentation and classification. IEEE Trans. Med. Imaging 2020, 39, 2482–2493. [Google Scholar] [PubMed] [Green Version]
  5. Xie, Y.; Zhang, J.; Lu, H.; Shen, C.; Xia, Y. SESV: Accurate medical image segmentation by predicting and correcting errors. IEEE Trans. Med. Imaging 2020, 40, 286–296. [Google Scholar]
  6. Hinton, G.; Vinyals, O.; Dean, J. Distilling the knowledge in a neural network. arXiv 2015, arXiv:1503.02531. [Google Scholar]
  7. Yang, C.; Xie, L.; Qiao, S.; Yuille, A.L. Training deep neural networks in generations: A more tolerant teacher educates better students. Proc. Aaai Conf. Artif. Intell. 2019, 33, 5628–5635. [Google Scholar] [CrossRef] [Green Version]
  8. Xie, Q.; Luong, M.T.; Hovy, E.; Le, Q.V. Self-training with noisy student improves imagenet classification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 10687–10698. [Google Scholar]
  9. Phuong, M.; Lampert, C.H. Distillation-based training for multi-exit architectures. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27–28 October 2019; pp. 1355–1364. [Google Scholar]
  10. Romero, A.; Ballas, N.; Kahou, S.E.; Chassang, A.; Gatta, C.; Bengio, Y. Fitnets: Hints for thin deep nets. arXiv 2014, arXiv:1412.6550. [Google Scholar]
  11. Jiao, X.; Yin, Y.; Shang, L.; Jiang, X.; Chen, X.; Li, L.; Wang, F.; Liu, Q. Tinybert: Distilling bert for natural language understanding. arXiv 2019, arXiv:1909.10351. [Google Scholar]
  12. Sanh, V.; Debut, L.; Chaumond, J.; Wolf, T. DistilBERT, a distilled version of BERT: Smaller, faster, cheaper and lighter. arXiv 2019, arXiv:1910.01108. [Google Scholar]
  13. Wang, W.; Wei, F.; Dong, L.; Bao, H.; Yang, N.; Zhou, M. Minilm: Deep self-attention distillation for task-agnostic compression of pre-trained transformers. Adv. Neural Inf. Process. Syst. 2020, 33, 5776–5788. [Google Scholar]
  14. Lachinov, D.; Shipunova, E.; Turlapov, V. Knowledge distillation for brain tumor segmentation. In International MICCAI Brainlesion Workshop; Springer: Cham, Switzerland, 2019; pp. 324–332. [Google Scholar]
  15. Hu, M.; Maillard, M.; Zhang, Y.; Ciceri, T.; La Barbera, G.; Bloch, I.; Gori, P. Knowledge distillation from multi-modal to mono-modal segmentation networks. In International Conference on Medical Image Computing and Computer-Assisted Intervention; Springer: Cham, Switzerland, 2020; pp. 772–781. [Google Scholar]
  16. Nalwade, A.; Kisa, J. Experimenting with Knowledge Distillation techniques for performing Brain Tumor Segmentation. arXiv 2021, arXiv:2105.11486. [Google Scholar]
  17. Ronneberger, O.; Fischer, P.; Brox, T. U-net: Convolutional networks for biomedical image segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention; Springer: Cham, Switzerland, 2015; pp. 234–241. [Google Scholar]
  18. Isensee, F.; Kickingereder, P.; Wick, W.; Bendszus, M.; Maier-Hein, K.H. Brain tumor segmentation and radiomics survival prediction: Contribution to the brats 2017 challenge. In International MICCAI Brainlesion Workshop; Springer: Cham, Switzerland, 2017; pp. 287–297. [Google Scholar]
  19. Zhou, Z.; Siddiquee, M.M.R.; Tajbakhsh, N.; Liang, J. Unet++: Redesigning skip connections to exploit multiscale features in image segmentation. IEEE Trans. Med. Imaging 2019, 39, 1856–1867. [Google Scholar]
  20. Zhang, Z.; Liu, Q.; Wang, Y. Road extraction by deep residual u-net. IEEE Geosci. Remote. Sens. Lett. 2018, 15, 749–753. [Google Scholar]
  21. Oktay, O.; Schlemper, J.; Folgoc, L.L.; Lee, M.; Heinrich, M.; Misawa, K.; Mori, K.; McDonagh, S.; Hammerla, N.Y.; Kainz, B.; et al. Attention u-net: Learning where to look for the pancreas. arXiv 2018, arXiv:1804.03999. [Google Scholar]
  22. Chen, Y.; Cao, Z.; Cao, C.; Yang, J.; Zhang, J. A modified U-Net for brain Mr image segmentation. In International Conference on Cloud Computing and Security; Springer: Cham, Switzerland, 2018; pp. 233–242. [Google Scholar]
  23. Deng, Y.; Sun, Y.; Zhu, Y.; Zhu, M.; Han, W.; Yuan, K. A strategy of MR brain tissue images’ suggestive annotation based on modified U-net. arXiv 2018, arXiv:1807.07510. [Google Scholar]
  24. Kong, X.; Sun, G.; Wu, Q.; Liu, J.; Lin, F. Hybrid pyramid u-net model for brain tumor segmentation. In International Conference on Intelligent Information Processing; Springer: Cham, Switzerland, 2018; pp. 346–355. [Google Scholar]
  25. Milletari, F.; Navab, N.; Ahmadi, S.A. V-net: Fully convolutional neural networks for volumetric medical image segmentation. In Proceedings of the 2016 Fourth International Conference on 3D Vision (3DV), Stanford, CA, USA, 25–28 October 2016; pp. 565–571. [Google Scholar]
  26. Zhang, J.; Xie, Y.; Wang, Y.; Xia, Y. Inter-slice context residual learning for 3D medical image segmentation. IEEE Trans. Med. Imaging 2020, 40, 661–672. [Google Scholar]
  27. Çiçek, Ö.; Abdulkadir, A.; Lienkamp, S.S.; Brox, T.; Ronneberger, O. 3D U-Net: Learning dense volumetric segmentation from sparse annotation. In International Conference on Medical Image Computing and Computer-Assisted Intervention; Springer: Cham, Switzerland, 2016; pp. 424–432. [Google Scholar]
  28. Gou, J.; Yu, B.; Maybank, S.J.; Tao, D. Knowledge distillation: A survey. Int. J. Comput. Vis. 2021, 129, 1789–1819. [Google Scholar]
  29. Zhang, Y.; Xiang, T.; Hospedales, T.M.; Lu, H. Deep mutual learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 4320–4328. [Google Scholar]
  30. Cho, J.H.; Hariharan, B. On the efficacy of knowledge distillation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27–28 October 2019; pp. 4794–4802. [Google Scholar]
  31. Liu, Y.; Chen, K.; Liu, C.; Qin, Z.; Luo, Z.; Wang, J. Structured knowledge distillation for semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 16–17 June 2019; pp. 2604–2613. [Google Scholar]
  32. Qin, D.; Bu, J.J.; Liu, Z.; Shen, X.; Zhou, S.; Gu, J.; Wang, Z.; Wu, L.; Dai, H. Efficient medical image segmentation based on knowledge distillation. IEEE Trans. Med. Imaging 2021, 40, 3820–3831. [Google Scholar]
  33. Zagoruyko, S.; Komodakis, N. Paying more attention to attention: Improving the performance of convolutional neural networks via attention transfer. arXiv 2016, arXiv:1612.03928. [Google Scholar]
  34. Yim, J.; Joo, D.; Bae, J.; Kim, J. A gift from knowledge distillation: Fast optimization, network minimization and transfer learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 4133–4141. [Google Scholar]
  35. Tung, F.; Mori, G. Similarity-preserving knowledge distillation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27–28 October 2019; pp. 1365–1374. [Google Scholar]
  36. Kullback, S.; Leibler, R.A. On information and sufficiency. Ann. Math. Stat. 1951, 22, 79–86. [Google Scholar] [CrossRef]
  37. Hou, Q.; Zhou, D.; Feng, J. Coordinate attention for efficient mobile network design. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Virtual, 19–25 June 2021; pp. 13713–13722. [Google Scholar]
  38. Menze, B.H.; Jakab, A.; Bauer, S.; Kalpathy-Cramer, J.; Farahani, K.; Kirby, J.; Burren, Y.; Porz, N.; Slotboom, J.; Wiest, R.; et al. The multimodal brain tumor image segmentation benchmark (BRATS). IEEE Trans. Med. Imaging 2014, 34, 1993–2024. [Google Scholar] [CrossRef] [PubMed]
Figure 1. The proposed knowledge distillation architecture. The orange box represents the teacher network, and the blue box represents the student network; both receive the same image as input. The green boxes between the two networks are, from left to right, the coordinate distillation (CD) module and the knowledge distillation (KD) module.
Figure 2. The segmentation results obtained by applying our distillation method in four cases (a–d) of the BraTS validation dataset. ET—yellow; TC—yellow+red; WT—yellow+red+green. The teacher was UNet and the student was DeepResUNet.
Figure 3. The segmentation results obtained by applying our distillation method in four cases (a–d) of the BraTS validation dataset. ET—yellow; TC—yellow+red; WT—yellow+red+green. The teacher was UNet and the student was DeepResUNet.
Table 1. The results of our experiments on different combinations of networks on BraTS2018. The best values are the highest Dice scores and the lowest Hausdorff distances in their columns. When a segmentation metric of the teacher was lower than that of the student, knowledge distillation was, in theory, not applicable to that case.
Method | Dice ET (%) | Dice WT (%) | Dice TC (%) | Hausdorff ET | Hausdorff WT | Hausdorff TC | #Params (M)
Teachers
T1: UNet | 78.83 | 85.69 | 85.81 | 2.77 | 2.57 | 1.66 | 39.4
T2: AttUNet | 78.79 | 85.31 | 85.82 | 2.83 | 2.65 | 1.64 | 39.7
Students and their performances by our approach
S1: DeepResUNet | 78.44 | 85.19 | 85.36 | 2.79 | 2.60 | 1.68 | 32.6
S1+T1 (ours) | 79.69 | 85.91 | 88.20 | 2.71 | 2.53 | 1.52 | 32.6
S1+T2 (ours) | 80.22 | 86.49 | 87.14 | - | - | 1.61 | 32.6
S2: UNet++ | 78.47 | 85.16 | 85.60 | 2.78 | 2.60 | 1.61 | 9.2
S2+T1 (ours) | 79.13 | 85.62 | 87.64 | 2.73 | 2.55 | - | 9.2
S2+T2 (ours) | 80.48 | 86.46 | 87.57 | - | - | - | 9.2
Table 2. Comparison with current popular methods.
Models | GFLOPs | #Params | Dice ET (%) | Dice WT (%) | Dice TC (%) | Hausdorff ET | Hausdorff WT | Hausdorff TC
UNet | 18.03 | 39.4 M | 78.83 | 85.69 | 85.81 | 2.77 | 2.57 | 1.66
DeepResUNet | 23.27 | 32.6 M | 78.44 | 85.19 | 85.36 | 2.79 | 2.60 | 1.68
AttUNet | 18.47 | 39.7 M | 78.79 | 85.31 | 85.82 | 2.83 | 2.65 | 1.64
UNet++ | 13.52 | 9.2 M | 78.47 | 85.16 | 85.60 | 2.78 | 2.60 | 1.61
HybridResUNet | 22.03 | 1.6 M | 76.81 | 84.08 | 86.88 | 2.83 | 2.64 | 1.62
UNet++ (ours) | 13.52 | 9.2 M | 80.48 | 86.46 | 87.57 | - | - | -
Table 3. Comparison with AT on the BraTS2018 dataset.
Method | Dice ET (%) | Dice WT (%) | Dice TC (%) | Hausdorff ET | Hausdorff WT | Hausdorff TC
T: UNet | 78.83 | 85.69 | 85.81 | 2.77 | 2.57 | 1.66
S: DeepResUNet | 78.44 | 85.19 | 85.36 | 2.79 | 2.60 | 1.68
S+AT | 78.01 | 84.95 | 85.66 | 2.86 | 2.66 | 1.64
S+CD (ours) | 78.83 | 85.44 | 87.34 | 2.78 | 2.60 | 1.55
Table 4. Ablation experiments of our method on the BraTS2018 dataset. The teacher was UNet and the student was DeepResUNet.
Models | KD | CD | Dice ET (%) | Dice WT (%) | Dice TC (%) | Hausdorff ET | Hausdorff WT | Hausdorff TC
T: UNet | × | × | 78.83 | 85.69 | 85.81 | 2.77 | 2.57 | 1.66
S: DeepResUNet | × | × | 78.44 | 85.19 | 85.36 | 2.79 | 2.60 | 1.68
S+ours | ✓ | × | 79.03 | 85.73 | 87.68 | 2.76 | 2.55 | 1.56
S+ours | × | ✓ | 78.83 | 85.44 | 87.34 | 2.78 | 2.60 | 1.55
S+ours | ✓ | ✓ | 79.69 | 85.91 | 88.20 | 2.71 | 2.53 | 1.52
Table 5. The effects of the component weights, represented by the hyperparameters α, β1 and β2 in Equation (6). The best values are the highest Dice scores and the lowest Hausdorff distances in their columns. As shown in the first two rows, when these weights were set to 0, the training process was similar to that of the original segmentation network.
Method | α | β1 | β2 | Dice ET (%) | Dice WT (%) | Dice TC (%) | Hausdorff ET | Hausdorff WT | Hausdorff TC
T: UNet | 0 | 0 | 0 | 78.83 | 85.69 | 85.81 | 2.77 | 2.57 | 1.66
S: DeepResUNet | 0 | 0 | 0 | 78.44 | 85.19 | 85.36 | 2.79 | 2.60 | 1.68
S+ours | 0.9 | 0.1 | 0.9 | 79.54 | 85.95 | 88.07 | 2.72 | 2.52 | 1.53
S+ours | 0.7 | 0.1 | 0.2 | 79.26 | 85.75 | 87.55 | 2.75 | 2.54 | 1.57
S+ours | 0.9 | 0.15 | 0 | 79.50 | 85.90 | 88.03 | 2.73 | 2.54 | 1.55
S+ours | 0.6 | 0.1 | 0.3 | 79.39 | 85.88 | 87.75 | 2.71 | 2.54 | 1.56
S+ours | 1 | 0.1 | 0.9 | 79.14 | 85.68 | 87.73 | 2.74 | 2.53 | 1.56
S+ours | 0.9 | 0.1 | 1.8 | 79.69 | 85.91 | 88.20 | 2.71 | 2.53 | 1.52
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

