Abstract

The segmentation of brain tumors in medical images is a crucial step in clinical treatment. Manual segmentation is time consuming and labor intensive, and existing automatic segmentation methods suffer from issues such as numerous parameters and low precision. To resolve these issues, this study proposes a learnable group convolution-based segmentation method that replaces the convolutions in the feature extraction stage with learnable group convolutions, thereby reducing the number of convolutional network parameters and enhancing communication between convolution groups. To improve the utilization of the feature maps, we added a skip connection structure between the learnable group convolution modules, which increased segmentation precision. We used deep supervision to combine output images in the network output stage to reduce overfitting and enhance the recognition capabilities of the network. We tested the proposed model on the open BraTS 2018 dataset. The experimental results revealed that the proposed model is superior to 3D U-Net and DMFNet and has better segmentation results for tumor cores than No New-Net and NVDLMED, the winning methods of the BraTS 2018 challenge. The segmentation precision of the proposed method with regard to whole tumors, enhancing tumors, and tumor cores was 90.25%, 80.36%, and 86.20%, respectively. Furthermore, the proposed method uses fewer parameters and a less complex model.

1. Introduction

Early diagnosis is crucial for the surgical treatment of brain tumors and has been aided by recent advances in medical imaging technology. Magnetic resonance imaging (MRI) can display brain tissue information in great detail and is widely used for the diagnosis of brain tumors. Four MRI modalities are commonly used: T1-weighted, T2-weighted, postcontrast T1-weighted, and fluid-attenuated inversion recovery (FLAIR), each of which reflects different aspects of brain tissue. T1-weighted scans highlight tumor contours, T2-weighted scans show distinct tumor regions, and FLAIR scans can distinguish edema from cerebrospinal fluid.

The accurate segmentation of brain tumors in medical images is a critical step before treatment. Manual segmentation is time consuming and labor intensive, and as a result, efficient and accurate automatic segmentation methods have become a popular research topic in recent years. Brain tumor segmentation methods can generally be divided into three categories: manual segmentation, semiautomatic segmentation, and fully automatic segmentation. The semiautomatic and fully automatic methods can be further divided into two categories: unsupervised segmentation and supervised segmentation [1]. Depending on the segmentation principle, unsupervised segmentation includes threshold-based segmentation [2–4], region-based segmentation [5–9], graphic-element classification-based segmentation [10–13], and model-based segmentation [14, 15].

The disadvantages of unsupervised methods are that they require the number of segmentation regions to be determined in advance and that the MRI images must first undergo intensity nonuniformity correction and skull stripping. Supervised methods are based on graphic-element classification and include conventional machine learning and convolutional neural networks (CNNs). Segmentation methods using conventional machine learning include support vector machines [16–22], conditional random fields (CRFs) [23, 24], and random forests (RFs) [25, 26]. In conventional machine learning methods, the features must be selected manually, a process in which boundary and tumor-region details are easily overlooked.

CNN-based methods include CNN models, fully convolutional neural network (FCNN) models, and U-Net models. CNN models include the CNN structure with small kernels proposed by Pereira et al. [27] and the cascade CNN model proposed by Havaei et al. [28]. FCNN-based models include the residual module-containing FCNN model structure presented by Chen et al. [29] and the model structure integrating FCNNs and CRFs proposed by Zhao et al. [30]. Cicek et al. [31–33] examined 3D convolution operations, upgraded U-Net from 2D to 3D, and proposed a 3D U-Net for the segmentation of 3D medical images. Models based on U-Net include the 3D U-Net structure used by Sherman [34], in which residual structures were added between convolutions in the same layer. Nuechterlein and Mehta [35] developed 3D-ESPNet, which applies the pointwise convolution of semantic segmentation to medical image processing to reduce the number of network parameters; however, the resulting segmentation precision is lower. Kao et al. [36] employed an ensemble comprising seven 3D U-Nets with different parameters and training strategies for brain tumor image segmentation; the higher number of models resulted in longer training times. In the BraTS 2018 challenge, Isensee et al. [37] made minor structural modifications to a 3D U-Net to obtain No New-Net; with additional training data and a simple postprocessing technique, this approach won second place in the challenge. Myronenko [38] proposed an encoder-decoder architecture network, NVDLMED, which adds another decoder pathway to recover the input images and imposes additional constraints; this approach won first place in the BraTS 2018 challenge.

CNN-based methods all involve large amounts of computation and highly complex models, and room for improvement in segmentation precision remains. To reduce the number of segmentation network parameters, Chen et al. [39] proposed the dilated multifiber network (DMFNet), which replaces regular convolutions with group convolutions, greatly reducing the number of parameters while maintaining the precision of the segmentation network. Group convolutions were first proposed for AlexNet [40] and were later successfully applied in ResNeXt [41]; they are currently popular in network design. However, in standard group convolutions, each group processes information independently, and there is no communication between groups, which limits their feature representation capabilities. Zhang et al. [42] presented a dynamic group convolution, which can learn the number of convolution groups from the training data and improves the information flow between groups, thereby achieving better performance than regular group convolutions.

Skip connections can accelerate network convergence and increase the precision of the segmentation network. Deep convolutional neural networks exhibit better performance than shallow networks but may suffer from the vanishing gradient problem. Residual connections were thus introduced in ResNet [43] to address this degradation issue. DenseNet [44] presented densely connected layers with more shortcut connections and used a cascade strategy to combine the feature maps of the first few layers. Both residual connections and dense connections reuse information from previous convolutional layers and are added to networks in the form of skip connections. DenseNet achieves better performance, but as the number of input channels increases, the network consumes more memory.

The idea underlying deep supervision is to directly supervise the hidden layers rather than only the output layer. In GoogLeNet [45], supervising two hidden layers of the 22-layer network achieved better results. Dou et al. [46] applied deep supervision to segment 3D liver CT scans: after the features of the lower and middle layers were deconvoluted in the convolutional network, they were combined with the output layer, reducing training and validation errors and granting the network better convergence. Chen et al. [47] utilized three classifiers to classify features in intermediate layers. The outputs of the classifiers serve as moderators during training, and the network combines multilevel contextual information for deep supervision, thereby enhancing the recognition capabilities of the network.

To address the problems of excessive parameters and low segmentation precision in conventional CNNs, we modified DMFNet, a group convolution network with a small number of parameters. To enhance the communication between groups in DMFNet, we replaced the regular group convolution with learnable group convolution. To make full use of the feature maps, we added a skip connection structure between the learnable group convolution modules and introduced deep supervision to merge output images in the network output stage, thereby enhancing the segmentation precision of the network. We refer to this lightweight brain tumor image segmentation algorithm with learnable grouping as DLSDNet.

2. Lightweight Brain Tumor Image Segmentation Network with Learnable Grouping

We modified DMFNet for our segmentation network, replacing the regular group convolution in the feature extraction stage of DMFNet with learnable group convolution. We also added a skip connection and introduced deep supervision to the network output stage to merge the network outputs.

2.1. DMFNet

DMFNet is a lightweight 3D convolutional neural network with a structure similar to that of U-Net. It divides a complex neural network into lightweight networks, or fiber sets, replaces regular convolution with group convolution, and uses a multiplexer to exchange information.

The multiplexer consists of two convolutions that promote information flow between the fibers [39]. To expand the receptive field and capture the multiscale 3D spatial correlations of brain tumors, dilated group convolutions are added to the fiber units in the encoder stage. The feature extraction stage involves multiple dilated multifiber units (DMFunits). In the output stage, a regular group convolution multifiber unit (MFunit) is used, as shown in Figure 1.
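To make this structure concrete, the following is a minimal PyTorch sketch of a dilated multifiber unit. The class names, channel count, group count, reduction ratio, and dilation rates are illustrative assumptions rather than the exact DMFNet configuration:

```python
import torch
import torch.nn as nn

class Multiplexer(nn.Module):
    """Two pointwise (1x1x1) convolutions that exchange information across fibers."""
    def __init__(self, channels, reduction=4):
        super().__init__()
        self.mix = nn.Sequential(
            nn.Conv3d(channels, channels // reduction, kernel_size=1, bias=False),
            nn.Conv3d(channels // reduction, channels, kernel_size=1, bias=False),
        )

    def forward(self, x):
        return self.mix(x)

class DilatedMFUnit(nn.Module):
    """Grouped 3x3x3 convolutions at several dilation rates, fused residually."""
    def __init__(self, channels, groups=8, dilations=(1, 2, 3)):
        super().__init__()
        self.multiplexer = Multiplexer(channels)
        self.branches = nn.ModuleList([
            nn.Conv3d(channels, channels, kernel_size=3, padding=d,
                      dilation=d, groups=groups, bias=False)
            for d in dilations
        ])
        self.bn = nn.BatchNorm3d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        y = self.multiplexer(x)                         # inter-fiber communication
        y = sum(branch(y) for branch in self.branches)  # multiscale receptive fields
        return self.relu(self.bn(y) + x)                # residual connection

x = torch.randn(1, 32, 16, 16, 16)
print(DilatedMFUnit(32)(x).shape)  # torch.Size([1, 32, 16, 16, 16])
```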

2.2. Learnable Grouping

To further enhance communication for group convolution and the feature extraction capabilities of convolutions, we replaced the group convolution in DMFNet with learnable group convolution (LGConv).

We suppose a convolutional feature map is $F \in \mathbb{R}^{N \times C \times H \times W}$, where $N$, $C$, $H$, and $W$, respectively, denote the number of samples, the number of channels, and the height and width of the channels in the mini-batches. If a convolution with $C'$ kernels is applied to $F$, the output feature map is $F' \in \mathbb{R}^{N \times C' \times H \times W}$. Learnable group convolution (LGConv) can be defined as follows:

$$F' = (U \odot W) \ast F, \qquad\qquad (1)$$

where $U \in \{0, 1\}^{C' \times C}$ is a binary relation matrix, $F$ represents the hidden units of the input feature map, $W$ denotes the convolution weight, and $\odot$ indicates the element-wise product.

LGConv is an expansion of group convolution: it uses the binary relation matrix $U$ to learn grouping principles. Many convolutions can be regarded as special cases of learnable grouping.

Let $U = \mathbf{1}$, a constant matrix whose elements are all 1; this gives $U \odot W = W$ and represents a regular convolution, as shown in Figure 2(a). Let $U = I$, where $I$ is an identity matrix; $U$ then becomes a matrix whose diagonal elements are 1 and whose nondiagonal elements are 0, as shown in Figure 2(b), indicating that each channel is independent, so LGConv is a depth-wise separable convolution [48]. If $U$ is a binary block diagonal matrix, as shown in Figure 2(c), then $U$ divides the channels into groups; when the diagonal blocks are constant matrices of 1s, LGConv represents a regular group convolution in which adjacent channels are grouped together. If $U$ is an arbitrary binary matrix, as shown in Figure 2(d), the result is an unstructured convolution. Thus, appropriately constructing the binary relation matrix $U$ can produce various convolutional operations.
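These special cases can be verified with a small sketch in which the kernel is masked by different relation matrices. A 2D convolution is used here for brevity (the network itself uses 3D convolutions), and `lg_conv2d` is a hypothetical helper name:

```python
import torch
import torch.nn.functional as F

def lg_conv2d(x, weight, U):
    """Convolution whose kernel is masked by a binary relation matrix U.

    x:      (N, C, H, W) input feature map
    weight: (C_out, C, k, k) full convolution kernel W
    U:      (C_out, C) binary relation matrix; U[o, i] = 1 keeps the
            connection from input channel i to output channel o.
    """
    masked = weight * U[:, :, None, None]  # element-wise product U ⊙ W
    return F.conv2d(x, masked, padding=weight.shape[-1] // 2)

C = 8
x = torch.randn(1, C, 16, 16)
W = torch.randn(C, C, 3, 3)

U_regular = torch.ones(C, C)                         # Figure 2(a): regular convolution
U_depthwise = torch.eye(C)                           # Figure 2(b): depth-wise convolution
U_group = torch.block_diag(*[torch.ones(2, 2)] * 4)  # Figure 2(c): four adjacent groups

for U in (U_regular, U_depthwise, U_group):
    print(lg_conv2d(x, W, U).shape)  # torch.Size([1, 8, 16, 16])
```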

To reduce the complexity of $U$, we decompose it into $K$ submatrices:

$$U = U_1 \otimes U_2 \otimes \cdots \otimes U_K, \qquad\qquad (2)$$

where $\otimes$ indicates the Kronecker product. The shape of submatrix $U_k$ is $C'_k \times C_k$, where $C' = \prod_{k=1}^{K} C'_k$ and $C = \prod_{k=1}^{K} C_k$. Through this series of Kronecker products, the large matrix $U$ is decomposed into a set of small submatrices [49].

To construct each submatrix $U_k$, let $C_k = C'_k = 2$; this is a common setting in ResNet and ResNeXt. To further reduce the parameters in the convolutional operations, we use a single binary variable to express $U_k$:

$$U_k = \tilde{g}_k \mathbf{1} + (1 - \tilde{g}_k) I, \qquad\qquad (3)$$

where $\mathbf{1}$ denotes a constant matrix whose elements are all 1, $I$ represents an identity matrix, $g_k$ is the $k$th component of a learnable gate vector $g \in \mathbb{R}^{K}$ of continuous values, and $\tilde{g}_k$ is the binary gate output from $g_k$ by a sign function:

$$\tilde{g}_k = \operatorname{sign}(g_k) = \begin{cases} 1, & g_k \geq 0, \\ 0, & g_k < 0. \end{cases} \qquad\qquad (4)$$

We can combine equation (3) with equation (2) as follows:

$$U = \bigotimes_{k=1}^{K} \left[ \tilde{g}_k \mathbf{1} + (1 - \tilde{g}_k) I \right]. \qquad\qquad (5)$$

With this structural relationship, the parameters in need of optimization become the gates $g_1, \ldots, g_K$, so the number of parameters in $U$ is reduced from $C \times C'$ to $K$ (with $C = C' = 2^K$); here, $I$ is an identity matrix whose diagonal elements are all 1. When $\tilde{g}_1 = 1$ and $\tilde{g}_2 = 0$, equation (5) becomes $U = \mathbf{1} \otimes I$, a $4 \times 4$ matrix with two groups, as shown in Figure 3(a). When $\tilde{g}_1 = 1$ and $\tilde{g}_2 = \tilde{g}_3 = 0$, equation (5) becomes $U = \mathbf{1} \otimes I \otimes I$, an $8 \times 8$ matrix with four groups, as shown in Figure 3(b). This shows that LGConv can group nonadjacent channels. In this way, only three continuous parameters $g_1$, $g_2$, and $g_3$ are needed to generate $U_1$, $U_2$, and $U_3$ and thereby learn the original large $8 \times 8$ matrix $U$, in which 64 parameters would otherwise require training.
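As an illustration, the sketch below builds $U$ from $K$ continuous gates exactly as in equation (5), using PyTorch's `torch.kron`. Only the forward binarization is shown; during training, a straight-through estimator or a similar relaxation would be needed to backpropagate through the sign function:

```python
import torch

def build_relation_matrix(gates):
    """Build U from K continuous gates as in equation (5).

    Each 2x2 submatrix is all ones when its binarized gate is 1 and the
    identity when it is 0, so K gates parameterize a 2^K x 2^K matrix.
    """
    U = torch.ones(1, 1)
    for g in gates:
        g_bin = (g >= 0).float()                              # sign(g): 1 if g >= 0, else 0
        sub = g_bin * torch.ones(2, 2) + (1 - g_bin) * torch.eye(2)
        U = torch.kron(U, sub)                                # Kronecker product of submatrices
    return U

# Three continuous gates generate an 8x8 relation matrix (64 entries).
gates = torch.tensor([0.7, -0.3, -1.2])                       # binarizes to (1, 0, 0)
U = build_relation_matrix(gates)
print(U.shape)       # torch.Size([8, 8])
print(int(U.sum()))  # 16 ones: four groups of two nonadjacent channels
```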

This study replaced the regular group convolutions in the network with learnable group convolutions to enhance the flexibility of the network and increase the precision of the segmentation network. The network units following this replacement are shown in Figure 4.

2.3. Skip Connection Unit

This study proposed a novel skip connection unit to extract early feature maps and enhance feature reuse. Features are transferred between key layers so that the influence of early features extends to deeper levels, thereby enhancing the global integration of the information flow. The neural network mapping function of the novel skip connection can be expressed as follows:

$$x_\ell = H_\ell\left( \left[ x_{s_1}, x_{s_2}, \ldots, x_{\ell - 1} \right] \right), \qquad\qquad (6)$$

where $H_\ell(\cdot)$ is the nonlinear transform after each level, $\ell$ indicates the level, and the output of layer $\ell$ is expressed as $x_\ell$. $[\cdot]$ refers to the cascade of the feature maps generated by the selected layers $s_1, s_2, \ldots, \ell - 1$.

The feature maps of the upper layers are first downsampled, and cascading is then used to merge them with the features of the posterior layers. Each input includes features selected from the first layer of the current block and the last layer of the previous block. The structural schematic with the newly added skip connection is displayed in Figure 5.
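A minimal sketch of such a skip connection unit is given below. The max-pooling downsampling follows the description in Section 3.3; the channel counts and the 1 × 1 × 1 fusion convolution are illustrative assumptions:

```python
import torch
import torch.nn as nn

class SkipMerge(nn.Module):
    """Merge an earlier, higher-resolution feature map into the current stage.

    The earlier map is max-pooled down to the current spatial scale and then
    concatenated with the current features, extending the reach of early features.
    """
    def __init__(self, early_channels, current_channels, out_channels):
        super().__init__()
        self.pool = nn.MaxPool3d(kernel_size=2, stride=2)  # match the lower scale
        self.fuse = nn.Conv3d(early_channels + current_channels,
                              out_channels, kernel_size=1, bias=False)

    def forward(self, early, current):
        early_down = self.pool(early)                      # downsample early features
        merged = torch.cat([early_down, current], dim=1)   # cascade the feature maps
        return self.fuse(merged)

early = torch.randn(1, 16, 32, 32, 32)    # e.g., first layer of the previous block
current = torch.randn(1, 32, 16, 16, 16)  # current stage features
print(SkipMerge(16, 32, 32)(early, current).shape)  # torch.Size([1, 32, 16, 16, 16])
```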

2.4. Deep Supervision

Deeper networks encode higher-level features. In the training of deep neural networks, deep supervision helps to reduce overfitting, extract more meaningful features, promote network convergence, and alleviate the problem of vanishing gradients [46, 50]. By adopting deep supervision in every stage of the decoder, the outputs of the intermediate stages can be used for supervision. Via upsampling, the output of each decoder is adjusted to the same dimensions as the final output segmentation map. The outputs of these intermediate stages are merged into the final output segmentation map, and softmax is then used to derive the probability map. Losses are calculated from the ground truths and the softmax outputs. In this way, the intermediate stages and the final output implicitly share the loss and gradient backpropagation, and the outputs of the intermediate stages also gradually approach the ground truths. Figure 5 presents the structure of the network following the inclusion of deep supervision. We refer to this network model (which includes LGConv, a skip connection, and deep supervision) as DLSDNet.
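The following sketch illustrates this output-merging scheme. The stage channel counts and the head design are assumptions for illustration, not the exact DLSDNet configuration:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DeepSupervisionHead(nn.Module):
    """Project each decoder stage to class maps, upsample to the final size, and merge."""
    def __init__(self, stage_channels, num_classes):
        super().__init__()
        self.heads = nn.ModuleList(
            [nn.Conv3d(c, num_classes, kernel_size=1) for c in stage_channels])

    def forward(self, stage_features, out_size):
        logits = 0
        for head, feat in zip(self.heads, stage_features):
            # Trilinear upsampling gives every intermediate output the final dimensions.
            logits = logits + F.interpolate(head(feat), size=out_size,
                                            mode='trilinear', align_corners=False)
        return torch.softmax(logits, dim=1)  # probability map over the classes

feats = [torch.randn(1, 64, 16, 16, 16),   # deepest decoder stage
         torch.randn(1, 32, 32, 32, 32),
         torch.randn(1, 16, 64, 64, 64)]
probs = DeepSupervisionHead([64, 32, 16], num_classes=4)(feats, (128, 128, 128))
print(probs.shape)  # torch.Size([1, 4, 128, 128, 128])
```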

3. Experiments and Result Analysis

3.1. Dataset and Evaluation Indices

In our experiments, we employed the BraTS 2018 dataset [51, 52], which contains multimodal MRI scans from multiple institutions and serves as the official dataset of the BraTS challenge. This dataset comprises four MRI sequences: T1-weighted, T2-weighted, postcontrast T1-weighted, and FLAIR. The dimensions of the data are 240 × 240 × 155. The dataset contains a training set and a validation set: the training set provides 285 cases with ground truth for training, while the validation set contains 66 cases with no ground truth. The objective of the BraTS 2018 challenge was to segment the images into background, necrotic and nonenhancing tumor, edema, and enhancing tumor. Researchers had to submit their validation results to an online evaluation platform to validate the effectiveness of their algorithms.

Segmentation accuracy was gauged using the Dice similarity coefficient and the Hausdorff distance. The former indicates the degree of similarity between the experimental segmentation results and the ground truth, with a higher value indicating greater segmentation precision. The latter calculates the maximum distance between the contours of the segmentation results and the ground truths to indicate the segmentation quality of the tumor boundaries, with a smaller value representing better boundary segmentation.
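For reference, the Dice similarity coefficient for a pair of binary masks can be computed as below (the official scores are computed by the online evaluation platform):

```python
import numpy as np

def dice_score(pred, truth):
    """Dice similarity coefficient between two binary segmentation masks."""
    pred, truth = pred.astype(bool), truth.astype(bool)
    intersection = np.logical_and(pred, truth).sum()
    denom = pred.sum() + truth.sum()
    return 2.0 * intersection / denom if denom > 0 else 1.0

pred = np.zeros((8, 8, 8), dtype=np.uint8); pred[2:6, 2:6, 2:6] = 1
truth = np.zeros((8, 8, 8), dtype=np.uint8); truth[3:6, 2:6, 2:6] = 1
print(round(dice_score(pred, truth), 4))  # 0.8571
```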

The number of model parameters (Params) reflects the computer memory consumed by the model, and the amount of computation, expressed in floating-point operations (FLOPs), reflects the running time of the model. For a 3D convolutional layer, the computation is calculated as follows:

$$\text{FLOPs} = K_h \times K_w \times K_d \times C_{\text{in}} \times C_{\text{out}} \times H \times W \times D. \qquad\qquad (7)$$

In the formula given above, $K_h$, $K_w$, and $K_d$ denote the height, width, and depth of the convolution kernel; $C_{\text{in}}$ and $C_{\text{out}}$ are the numbers of input and output channels; and $H$, $W$, and $D$ denote the height, width, and depth of the input data.
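The following small helper applies this formula and shows the reduction obtained by group convolution. Note that it counts multiply-accumulate operations; conventions that count multiplications and additions separately differ by a factor of two:

```python
def conv3d_flops(kh, kw, kd, c_in, c_out, h, w, d, groups=1):
    """Multiply-accumulate count of one 3D convolution (stride 1, 'same' padding)."""
    return kh * kw * kd * (c_in // groups) * c_out * h * w * d

# A regular 3x3x3 convolution versus the same layer split into 8 groups.
regular = conv3d_flops(3, 3, 3, 32, 32, 128, 128, 128)
grouped = conv3d_flops(3, 3, 3, 32, 32, 128, 128, 128, groups=8)
print(f"{regular:.3e} {grouped:.3e} ratio={regular / grouped:.0f}")  # ratio=8
```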

For the loss function, we adopted the generalized Dice loss (GDL), which was proposed to cope with data imbalance. Plain Dice loss is disadvantageous for the detection of small targets, so GDL combines the Dice terms of multiple classes and uses per-class weights to balance their contributions to the segmentation result:

$$\text{GDL} = 1 - 2 \, \frac{\sum_{l=1}^{L} w_l \sum_{n=1}^{N} r_{ln} p_{ln}}{\sum_{l=1}^{L} w_l \sum_{n=1}^{N} \left( r_{ln} + p_{ln} \right)}. \qquad\qquad (8)$$

In the formula given above, $r_{ln}$ denotes the ground truth of voxel $n$ in class $l$, $p_{ln}$ is the corresponding predicted value, $N$ is the total number of voxels, $L$ is the total number of classes, and $w_l$ is the weight of each class:

$$w_l = \frac{1}{\left( \sum_{n=1}^{N} r_{ln} \right)^{2}}. \qquad\qquad (9)$$
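A minimal PyTorch sketch of GDL under these definitions is shown below; the small epsilon terms are an added assumption to guard against empty classes:

```python
import torch

def generalized_dice_loss(probs, target_onehot, eps=1e-5):
    """Generalized Dice loss with per-class weights w_l = 1 / (sum_n r_ln)^2.

    probs:         (N, L, D, H, W) softmax outputs p_ln
    target_onehot: (N, L, D, H, W) one-hot ground truth r_ln
    """
    dims = (0,) + tuple(range(2, probs.dim()))   # sum over batch and voxels
    r_sum = target_onehot.sum(dim=dims)
    w = 1.0 / (r_sum * r_sum + eps)              # class weights, equation (9)
    intersect = (probs * target_onehot).sum(dim=dims)
    union = (probs + target_onehot).sum(dim=dims)
    return 1.0 - 2.0 * (w * intersect).sum() / ((w * union).sum() + eps)

probs = torch.softmax(torch.randn(2, 4, 16, 16, 16), dim=1)
labels = torch.randint(0, 4, (2, 16, 16, 16))
target = torch.nn.functional.one_hot(labels, num_classes=4).permute(0, 4, 1, 2, 3).float()
print(generalized_dice_loss(probs, target))
```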

3.2. Experiment Environment and Preprocessing

We implemented the proposed model on the deep-learning platform PyTorch and trained it for 500 epochs on two NVIDIA GeForce 2080Ti GPUs. While training the model, we used the Adam optimizer with a self-adjusting learning rate; the initial learning rate was 0.0001. The L2 norm was used to regularize the model, and the weight decay rate was $10^{-5}$.

Because of differences in imaging equipment and protocols, artifacts were present within the MRI images [53], so we used the N4 bias field correction algorithm to correct the bias in the T1-weighted, T2-weighted, and postcontrast T1-weighted modes.
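A typical way to apply this correction is through SimpleITK's N4 filter, sketched below with hypothetical file names; the Otsu foreground mask is a common choice and not necessarily the exact setting used here:

```python
import SimpleITK as sitk

def n4_correct(path_in, path_out):
    """Apply N4 bias field correction to one MRI volume."""
    image = sitk.ReadImage(path_in, sitk.sitkFloat32)
    mask = sitk.OtsuThreshold(image, 0, 1, 200)          # rough foreground mask
    corrector = sitk.N4BiasFieldCorrectionImageFilter()
    corrected = corrector.Execute(image, mask)
    sitk.WriteImage(corrected, path_out)

# Hypothetical file names; BraTS volumes are distributed as NIfTI files.
n4_correct("subject_t1.nii.gz", "subject_t1_n4.nii.gz")
```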

Due to GPU memory limitations, we randomly cropped the original images to 128 × 128 × 128. We then expanded the data using the following techniques: random mirroring of the axial, coronal, and sagittal views with a probability of 0.5, random rotation of the images at angles within [−10°, 10°], random intensity shifts within [−0.1, 0.1], and random intensity scaling within [0.9, 1.1].
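Part of this augmentation pipeline is sketched below; random rotation and random cropping are omitted for brevity, and the parameter ranges follow the list above:

```python
import numpy as np

def augment(volume, label):
    """Random augmentation of one (C, 128, 128, 128) crop and its label map."""
    # Random mirroring of the axial, coronal, and sagittal views (p = 0.5 each).
    for axis in (1, 2, 3):
        if np.random.rand() < 0.5:
            volume = np.flip(volume, axis=axis)
            label = np.flip(label, axis=axis - 1)
    # Random intensity shift in [-0.1, 0.1] and scaling in [0.9, 1.1] per channel.
    shift = np.random.uniform(-0.1, 0.1, size=(volume.shape[0], 1, 1, 1))
    scale = np.random.uniform(0.9, 1.1, size=(volume.shape[0], 1, 1, 1))
    return volume * scale + shift, label

vol = np.random.randn(4, 128, 128, 128).astype(np.float32)
lab = np.random.randint(0, 4, (128, 128, 128))
aug_vol, aug_lab = augment(vol, lab)
print(aug_vol.shape, aug_lab.shape)  # (4, 128, 128, 128) (128, 128, 128)
```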

3.3. Implementation of Experiment

Our model uses residual modules as building blocks, and the overall structure is similar to that of an encoder-decoder. The inputs are four channels of data corresponding to the four MRI modalities. During the feature encoding stage, learnable-group residual units are used, and the modified skip connection enhances the multiscale representation capabilities. During the decoding stage, high-resolution features cascaded from the encoder are used to supplement lost information. Upsampling is performed using trilinear interpolation. After each convolution block, Batch Normalization and Rectified Linear Unit (ReLU) activation are applied.

At the encoder end of the network, the skip connection was modified to create information links between stages. Before encoding, max pooling is used to downsample the high-level features in the encoder to match the scale of the lower levels; in other words, max pooling is applied to the output of the previous level. Previous features can thus be accessed directly as inputs of each stage to enhance feature reuse.

In the decoding stage, the outputs of each stage guide the final segmentation result. The output of each decoder is adjusted to the same dimensions as the final output segmentation map, producing three different outputs that are combined and then subjected to the softmax operation to derive the final segmentation map.

3.4. Experiment Results and Analysis

To verify the effectiveness of the proposed network, we trained and validated the proposed network and the original DMFNet using the same training set and validation set. Table 1 presents the experimental results of the original DMFNet model, a network using learnable group convolution (DMFNet + LG), and the proposed DLSDNet (DMFNet + LG + Skip + DS). Performance in terms of brain tumor image segmentation was compared using the Dice score and the Hausdorff distance. Wt, Et, and Tc denote whole tumors, enhancing tumors, and tumor cores, respectively.

A comparison of the first and second rows in Table 1 shows that using learnable grouping improves the Dice score by 1.2% for Tc, indicating that learnable grouping facilitates the detection of small targets. A comparison of the second and third rows in Table 1 shows that the addition of the skip connection further improves the Dice score by 0.19% for Et, 0.23% for Wt, and 0.98% for Tc, indicating that the skip connection enables thorough utilization of multiscale information. Adding deep supervision facilitated the extraction of more features, as well as better supervision of the training process. Relative to DMFNet, DLSDNet improves the Hausdorff distance by 0.54 mm for Wt and 0.7 mm for Tc.

Table 2 compares the proposed network model with typical brain tumor image segmentation networks. As shown in Table 2, the proposed network model exhibits better segmentation performance than 3D U-Net [31], improving the Dice score by 4.4% for Et, 1.72% for Wt, and 14.43% for Tc and improving the Hausdorff distance by 3.26 mm for Et, 12.63 mm for Wt, and 5.88 mm for Tc. Compared with another lightweight network model, 3D-ESPNet [35], DLSDNet improved the Dice score by 6.66% for Et, 1.95% for Wt, and 4.8% for Tc. Compared with the methods that won second and first place in the BraTS 2018 challenge (No New-Net and NVDLMED) [37, 38], DLSDNet achieved the best Dice score and Hausdorff distance for Tc, at 86.20% and 5.74 mm, respectively. The proposed DLSDNet requires fewer network parameters, is less complex, and occupies fewer resources owing to fewer FLOPs, calculated as in [39].

The visualized segmentation results are shown in Figure 6. As can be seen, the original DMFNet can already roughly segment the contours of the tumor region; however, for some details, such as smaller tumor core targets, its segmentation performance is poorer. Figure 6(b) displays the results of the network model with learnable grouping, whose segmentation of tumor cores was superior to that of the original DMFNet. Figure 6(c) shows the segmentation results of the network model with the skip connection and deep supervision: the segmentation of whole tumors, enhancing tumors, and tumor cores improved further, and the results were even closer to the ground truth. As can be seen, the proposed DLSDNet model is better at segmenting small targets.

4. Conclusions

Brain tumors vary significantly in intensity and are irregular in shape. This study modified DMFNet to use fewer parameters and introduced LGConv in the feature extraction stage so that the network can flexibly choose the number of groups based on the characteristics of the dataset and the network. This facilitates adaptation to more complex data features and gives the method wider applicability. We added a skip connection between the LGConv blocks to enable thorough utilization of multiscale information and to enhance feature reuse. We added deep supervision to the network output stage to merge the outputs of different stages and to reconstruct outputs with the same dimensions for the extraction of more distinctive features and the enhancement of segmentation accuracy. Experiments on the BraTS 2018 dataset revealed that the proposed model is superior to networks with conventional U-Net structures and has greater precision than other lightweight brain tumor image segmentation methods. The segmentation precision of the proposed network with regard to whole tumors, enhancing tumors, and tumor cores was 90.25%, 80.36%, and 86.20%, respectively. Compared with the method that won first place in the BraTS 2018 challenge (NVDLMED), Params and FLOPs are reduced by factors of 7 and 40, respectively. The proposed network thus has significantly greater precision in the segmentation of enhancing tumors and tumor cores than the original DMFNet and is also strongly competitive with the methods that won first and second place in the BraTS 2018 challenge. Furthermore, the proposed method uses fewer parameters and a less complex model.

Data Availability

The data used to support the findings of this study can be obtained from the corresponding author on request.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This work was supported by the National Natural Science Foundation of China under Grant No. 61471263 and the Natural Science Foundation of Tianjin, China, under Grant 16JCZDJC31100.