ICA-Unet: An improved U-net network for brown adipose tissue segmentation

Brown adipose tissue (BAT) is a kind of adipose tissue engaged in thermoregulatory thermogenesis, metaboloregulatory thermogenesis, and secretion. Current studies have revealed that BAT activity is negatively correlated with adult body weight, and BAT is considered a target tissue for the treatment of obesity and other metabolic diseases. Additionally, BAT activity differs across ages and genders. Clinically, BAT segmentation based on PET/CT data is a reliable method for brown fat research. However, most current BAT segmentation methods rely on the experience of doctors. In this paper, an improved U-net network, ICA-Unet, is proposed to achieve automatic and precise segmentation of BAT. First, the traditional 2D convolution layers in the encoder are replaced with depth-wise over-parameterized convolution (Do-Conv) layers. Second, a channel attention block is introduced between the double-layer convolutions. Finally, an image information entropy (IIE) block is added in the skip connections to strengthen the edge features. The performance of this method is evaluated on a dataset of PET/CT images from 368 patients. The results demonstrate strong agreement between the automatic segmentation of BAT and manual annotation by experts: the average Dice similarity coefficient (DSC) is 0.9057, and the average Hausdorff distance is 7.2810. The experimental results suggest that the method proposed in this paper achieves efficient and accurate automatic BAT segmentation and satisfies the clinical requirements of BAT.


Introduction
Brown adipose tissue (BAT) is a special type of adipose tissue. From one perspective, it is similar to white adipose tissue (WAT) and serves as an energy storage tissue in the human body. From another perspective, BAT undergoes extremely active metabolic activity owing to the considerable number of mitochondria in BAT cells. Studies have revealed three main physiological purposes of BAT, namely, thermoregulatory thermogenesis, metaboloregulatory thermogenesis, and secretion. 1,2 Therefore, BAT has essential physiological significance for human body temperature regulation, resistance to cold, prevention of obesity, regulation of energy balance, and resistance to infection. [3][4][5] BAT activity shows significant differences across ages and genders: its activity in infants and women is higher than that in adults and men. The latest research has reported that BAT activity is negatively correlated with body weight, and BAT may be adopted as a target tissue for the treatment of obesity. 6 Therefore, BAT-related research is associated with a wide range of clinical applications. With the vigorous development of modern medicine, detection methods for BAT are constantly improving. Positron emission tomography/computed tomography (PET/CT) has become the mainstream detection technology for BAT, and 18F-fluoro-2-deoxyglucose (18F-FDG) is broadly used as the PET/CT tracer. 7 The mechanism is that, owing to the different metabolic states of different tissues of the human body, glucose is metabolized vigorously and accumulates in BAT. 8,9 These characteristics are reflected in PET images for detection and analysis, as shown in Fig. 1.
Regarding the segmentation of BAT, existing studies are primarily based on the experience of radiologists and nuclear medicine physicians. Researchers have demonstrated that thresholding and clustering are well suited to the segmentation of BAT, since BAT occurs in specific anatomical locations and PET images have high contrast. [10][11][12][13] Generally, simple thresholding is used for segmenting BAT. First, the individual's standard uptake value (SUV) is calculated based on the PET data. 14,15 Second, the data are manually divided with medical imaging software: BAT is considered present if the diameter of the tissue area is greater than 5 mm, the CT density is restricted to −190 to −30 HU, 16 and the SUV is more than 2 g/ml or 3 g/ml in the corresponding 18F-FDG PET image. 17,18 Finally, BAT is distinguished from lymph nodes, blood vessels, bones, thyroid, and other tissues according to anatomical knowledge. 19 However, this type of method has the following shortcomings: (1) the traditional BAT segmentation procedure requires extensive imaging and clinical knowledge, and the conclusions are mostly the subjective judgments of clinicians; (2) the traditional BAT segmentation method depends on standard threshold divisions, yet standard thresholds are difficult to apply in complex differentiation problems because of the specificity of the tissues and organs of different patients. Therefore, it is urgent to develop an automatic BAT segmentation method to tackle the above problems.
Recent success in deep learning, especially the use of deep convolutional neural networks, 20 has accelerated the development of automatic image segmentation. Among deep learning algorithms, the U-net network structure 21 is widely used for medical image segmentation. [22][23][24][25][26] The U-net network consists of an encoder and a decoder, as well as skip connections between them. The image is first convolutionally down-sampled by the encoder several times to obtain feature maps of different scales, so that features of different scales are learned. Then, the bottom feature map is up-sampled in the decoder, and the corresponding encoder feature maps of each scale are connected via the skip connections, so that feature maps of different scales are merged. Finally, the network combines low-resolution and high-resolution information: the low-resolution information provides the location and category of the segmentation target, while the high-resolution information is required for edge segmentation. With the combination of both, the medical image segmentation task can be completed well by U-net.
H. Wang et al.

Materials and Methods

Accordingly, a BAT segmentation method based on an improved U-net, ICA-Unet, is designed in this paper. The U-net network is improved with a channel attention block, a depth-wise over-parameterized convolution (Do-Conv) layer, and an image information entropy (IIE) block, which are described in detail in Sec. 2.2. The images need to be preprocessed before the dataset is segmented. First, the SUV is calculated from the original PET data according to formula (1) (hereinafter, the SUV calculated from the original PET data is referred to as the PET data). During the calculation of the SUV, the calculation rule based on the DICOM tags is adopted. 27 Let X_PET be the three-dimensional matrix of PET data and Y_SUV the three-dimensional matrix of SUV data. The other variables are the acquisition time (T_A), patient weight (W_P), radiopharmaceutical start time (T_RS), radionuclide total dose (D_RT), radionuclide half-life (L_RH), rescale intercept (I_R), and rescale slope (S_R). The calculation formula is:

Y_SUV = (X_PET × S_R + I_R) × W_P / (D_RT × 2^(−(T_A − T_RS) / L_RH)).    (1)
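As a concrete illustration, the per-voxel form of this decay-corrected, body-weight SUV calculation can be sketched in pure Python. The function name and argument layout are our own; a real pipeline would read these quantities from the DICOM tags listed above:

```python
def suv_from_pet(pixel_value, rescale_slope, rescale_intercept,
                 weight_g, total_dose_bq, half_life_s, decay_time_s):
    """Decay-corrected body-weight SUV for a single PET voxel.

    pixel_value       -- raw stored PET value (DICOM pixel data)
    rescale_slope     -- S_R (DICOM RescaleSlope)
    rescale_intercept -- I_R (DICOM RescaleIntercept)
    weight_g          -- patient weight W_P in grams
    total_dose_bq     -- injected dose D_RT in Bq
    half_life_s       -- radionuclide half-life L_RH in seconds
    decay_time_s      -- T_A - T_RS, seconds from injection to acquisition
    """
    # Rescale the stored value to an activity concentration (Bq/ml).
    activity = pixel_value * rescale_slope + rescale_intercept
    # Decay-correct the injected dose to the acquisition time.
    decayed_dose = total_dose_bq * 2.0 ** (-decay_time_s / half_life_s)
    return activity * weight_g / decayed_dose  # g/ml
```

Applying this element-wise to X_PET yields Y_SUV; one half-life of decay halves the denominator and thus doubles the SUV for the same raw pixel value.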

Second, the original CT data are optimized with median filtering, which is widely used for noise processing in medical images, to reduce the noise in the CT data. Finally, the PET data are up-sampled by nearest-neighbor interpolation and enlarged to the same size as the CT data, and the related spatial parameters of the PET data are matched with the corresponding CT parameters to manage the problem of inconsistent sizes and image parameters between the PET and CT data.
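The nearest-neighbor up-sampling step can be illustrated with a minimal pure-Python sketch for an integer zoom factor. Real pipelines typically use a library resampler (e.g., scipy.ndimage.zoom); this toy version, with our own function name and list-of-lists image layout, only shows the principle:

```python
def upsample_nearest(img, factor):
    """Nearest-neighbour up-sampling of a 2D image by an integer factor.

    Each output pixel (i, j) copies the value of the nearest source pixel
    (i // factor, j // factor) -- no interpolation between values.
    """
    return [[img[i // factor][j // factor]
             for j in range(len(img[0]) * factor)]
            for i in range(len(img) * factor)]
```

For example, a 2 × 2 PET block zoomed by a factor of 2 becomes a 4 × 4 block in which every source pixel is replicated into a 2 × 2 patch.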
Since BAT is broadly distributed in the neck region of the human body, it is necessary to obtain the neck ROI before BAT segmentation to avoid interference from other non-BAT tissues with high SUV values. Additionally, the PET/CT image cannot be guaranteed to occupy a fixed proportion of the image during acquisition. Thus, Faster RCNN 28,29 is employed to obtain the ROI of the neck and overcome this complication.
The ROI is obtained through the network. On this basis, the corresponding upper and lower boundary slice interval (x, x + h) along the axial direction of the PET/CT data is acquired. This interval is used to obtain several PET/CT images and construct a bimodal dataset.
After the data processing described above, about 4500 sets of bimodal PET/CT images are obtained, and annotations for each PET/CT image are created. The annotated data are then randomly divided into 70% training data and 30% test data.

Network architecture
The architecture of the proposed ICA-Unet network is illustrated in Fig. 2. The network is based on the classic U-net structure, 21 with the introduction of image information entropy, 30 channel attention, 31 and Do-Conv layers. 32 Specifically, our study follows the classic structure of the U-net. In the encoder module, the PET/CT bimodal image is first input into the network as two channels. Second, the 2D convolution module in the U-net structure is replaced with the Do-Conv module. After a layer of 3 × 3 convolution with a stride of 1 and zero padding, the rectified linear unit (ReLU) activation and batch normalization (BN) are calculated. Then, the channel weights are calculated by the channel attention module and multiplied with the feature map. Finally, it passes through a block of Do-Conv, ReLU, and BN. Additionally, successive 2 × 2 max pooling with a stride of 2 is performed after the double-layer convolution to enlarge the receptive fields.
Symmetrically with the encoder, the feature maps in the decoder are up-sampled four times with de-convolutions to restore spatial details. Specifically, a 2 × 2 de-convolution with a stride of 2 is performed, followed by the same convolution operations as in the encoder module. Furthermore, in the skip connections, the IIE module is introduced to calculate the IIE of the down-sampled feature map, so as to strengthen the edge information. This feature map is then fused with the feature map of the same level obtained from the decoder module, so that the global context information is complemented by spatial details. Finally, the segmentation result is output after a 1 × 1 single-layer convolution.

Image information entropy
In 1948, C. E. Shannon, the father of information theory, published the paper "A Mathematical Theory of Communication", pointing out that any information contains redundancy, whose size is related to the probability or degree of confusion of each symbol in the information. 33 Borrowing the concept of entropy from thermodynamics, Shannon called the average amount of information after eliminating redundancy the information entropy.
In information theory, the information entropy of a random variable X over the set (X, q(X)) is defined as the expectation of its information content I(X), i.e., H(X) = −Σ_x q(x) log q(x), where H(X) describes the degree of confusion and uncertainty of the elements in X.
A pixel is the basic unit of a digital image; in a computer, image data are essentially a matrix of pixels. The difference between images is essentially that pixels of different gray levels are distributed over different spatial regions with different probabilities. For an image with k-level grayscale, k is set to 255 (gray levels 0 to 255), and the probability of the i-th gray level (0 ≤ i ≤ k) is denoted p_i. The accumulation of the information entropy of the different gray levels is defined as the image information entropy of the entire image:

H = −Σ_{i=0}^{k} p_i log(p_i),

where p_i indicates the probability of gray level i over the entire image, with p_i log(p_i) = 0 when p_i = 0. p_i is calculated from the grayscale histogram, that is, as the quotient of the number of pixels of gray level i and the total number of pixels in the image.
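The IIE computation above amounts to Shannon entropy over the grayscale histogram; a minimal pure-Python sketch (the helper name and list-of-lists image layout are ours, and base-2 logarithms are assumed):

```python
import math

def image_entropy(img, levels=256):
    """Shannon entropy of an image's grayscale histogram.

    p_i is the fraction of pixels at gray level i; terms with p_i == 0
    are skipped, matching the convention p_i * log(p_i) = 0 in the text.
    """
    pixels = [v for row in img for v in row]
    n = len(pixels)
    hist = [0] * levels          # count of pixels at each gray level
    for v in pixels:
        hist[v] += 1
    return -sum((c / n) * math.log2(c / n) for c in hist if c)
```

A constant image has entropy 0, while an image whose pixels are evenly split between two gray levels has entropy 1 bit, which is the sense in which IIE measures the "confusion" of the pixel distribution.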

In the U-net network, the encoder involves four repetitions of double convolutional layers followed by pooling layers, so the output of each pooling layer contains multiple feature maps of different levels, scales, and aspects. By calculating the IIE on these feature maps, the object edges and the rapidly changing pixel information in the image can be captured through the retention of the detailed texture structure of the original image. Meanwhile, the edge features of the object are enhanced, making the generated image features more expressive. Furthermore, the enhancement of edge information contributes to better contour integrity and coherence in the final segmentation results of the network.

Channel attention
In 2017, SENet 31,34 won the image classification task of the ImageNet competition. It applies an attention mechanism in the channel dimension to significantly improve network performance.
The channel attention mechanism consists of three operations: Squeeze, Excitation, and Scale. The Squeeze operation compresses the two-dimensional features of each channel into a real number through global pooling, equivalent to a global receptive field; assuming there are C channels in total, a 1 × 1 × C feature is eventually obtained. The purpose of the Excitation operation is to generate a weight for each channel. Specifically, the dimension of the feature is first reduced to 1/r of the original through a fully connected layer; then the ReLU is calculated, and the original dimension C is restored through another fully connected layer; finally, the sigmoid function is applied for normalization. The Scale operation multiplies the normalized weight coefficients with the feature map of each channel.
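The Squeeze, Excitation, and Scale steps can be sketched in pure Python on a toy feature map. The weight matrices here are illustrative stand-ins rather than trained parameters, and biases are omitted; a real implementation would use framework layers:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def se_block(feature_maps, w1, w2, r=2):
    """Squeeze-and-Excitation over a list of C feature maps (each H x W).

    w1: C x (C//r) weights of the dimension-reducing FC layer
    w2: (C//r) x C weights of the dimension-restoring FC layer
    """
    C = len(feature_maps)
    # Squeeze: global average pooling -> one scalar per channel (1 x 1 x C).
    z = [sum(sum(row) for row in fm) / (len(fm) * len(fm[0]))
         for fm in feature_maps]
    # Excitation: FC (C -> C/r), ReLU, FC (C/r -> C), then sigmoid.
    hidden = [max(0.0, sum(z[c] * w1[c][k] for c in range(C)))
              for k in range(C // r)]
    s = [sigmoid(sum(hidden[k] * w2[k][c] for k in range(C // r)))
         for c in range(C)]
    # Scale: reweight each channel's feature map by its coefficient.
    return [[[v * s[c] for v in row] for row in feature_maps[c]]
            for c in range(C)]
```

With zero restoring weights every channel coefficient is sigmoid(0) = 0.5, i.e., all channels are weighted equally; trained weights would instead emphasize the more informative of the PET and CT channels.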
In this paper, bimodal PET/CT data are introduced. The channel attention module can assist the network in judging the importance of different channels; in other words, the network can better judge the importance of the information in the CT and PET data after convolution. Therefore, the introduction of the channel attention module is conducive to learning crucial information in the bimodal data and better extracting image features.

Depth-wise over-parameterized convolution
Li et al. proposed depth-wise over-parameterized convolution (Do-Conv), 32 which can replace traditional convolution to accelerate network convergence and improve network performance. Do-Conv is a combination of traditional convolution and depth-wise convolution. Assume that the input feature map P has C_in channels, the size of the convolution kernel is M × N, and the output feature map has C_out channels. The convolution kernel W can be expressed as W ∈ R^{C_out × (M×N) × C_in}. With ⊛ representing the traditional convolution operation, the output is O = W ⊛ P. Different from traditional convolution, in depth-wise convolution a channel of the output feature depends only on one specific channel of the input feature rather than on all input channels. Assuming there are D_mul convolution kernels of size M × N and the input feature has C_in channels, the depth-wise kernel D can be expressed as D ∈ R^{(M×N) × D_mul × C_in}, and the number of output channels is D_mul × C_in. With ∘ indicating the depth-wise convolution operation, O = D ∘ P.
Do-Conv first performs depth-wise convolution on the input feature and then applies traditional convolution, which can be written as O = W ⊛ (D ∘ P). Since the network adds the IIE and channel attention modules, the convergence speed of the network is affected. Therefore, Do-Conv replaces the traditional 2D convolution in the encoder and decoder, so as to accelerate network convergence and improve network performance.
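The defining property of depth-wise convolution, that output channel c is produced from input channel c alone, can be illustrated with a minimal 1D sketch. The function name and valid-padding choice are ours, and this shows only the D ∘ P step, not the full Do-Conv composition with W:

```python
def depthwise_conv1d(signal, kernels):
    """Depth-wise 1D convolution (valid padding).

    Output channel c is computed from input channel c alone, using that
    channel's own kernel -- unlike a standard convolution, which mixes
    all input channels into every output channel.
    """
    out = []
    for chan, k in zip(signal, kernels):
        m = len(k)
        out.append([sum(chan[i + j] * k[j] for j in range(m))
                    for i in range(len(chan) - m + 1)])
    return out
```

Because each channel is filtered independently, changing one input channel can never affect another output channel, which is exactly the coupling structure that the kernel D imposes before W mixes channels.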

Implementation
The proposed method was implemented in the Python language with the PyTorch package 35 on a workstation with a single graphics processing unit (NVIDIA GeForce GTX TITAN V). The loss function of the network was composed of Sigmoid and BCELoss. Assuming there are N batches and each batch predicts n labels, the loss function can be defined as

L = −(1/n) Σ_{i=1}^{n} [ y_i log σ(x_i) + (1 − y_i) log(1 − σ(x_i)) ],

where σ(x_i) indicates the Sigmoid function, which maps x to the interval (0, 1): σ(x) = 1 / (1 + e^{−x}). The network was trained by the RMSProp optimizer 36 with rho of 0.9 and epsilon of 0.0001. The initial learning rate was set to 0.00003. The training contained 100 epochs, and the batch size was set to 4. In training and testing, 512 × 512 PET and CT images were combined into a double-channel 2 × 512 × 512 matrix and input into the network; after inference, the segmentation of each image was produced. Moreover, the proposed ICA-Unet was run three times and the average value was taken as the final result to alleviate the impact of random initialization in training. As shown in Fig. 3, the network was fully trained and the loss converged.
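This loss, binary cross-entropy on sigmoid-mapped logits averaged over pixels, can be sketched in pure Python (PyTorch's nn.BCELoss after a Sigmoid layer computes the same quantity in batched, vectorized form; the clipping constant below is our own numerical guard against log(0)):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def bce_with_logits(logits, targets):
    """Binary cross-entropy over sigmoid-mapped logits, averaged over pixels."""
    eps = 1e-12  # keep log() arguments strictly inside (0, 1)
    total = 0.0
    for x, y in zip(logits, targets):
        p = min(max(sigmoid(x), eps), 1.0 - eps)
        total += -(y * math.log(p) + (1.0 - y) * math.log(1.0 - p))
    return total / len(logits)
```

An uncommitted prediction (logit 0, so σ = 0.5) costs log 2 per pixel regardless of the label, while a confidently correct prediction drives the loss toward 0.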

Evaluation metrics
With expert manual annotations as ground truth, the segmentation performance of ICA-Unet was quantitatively evaluated with the following six metrics 37 : (1) mIoU, (2) Sensitivity (SEN), (3) Specificity (SPE), (4) Dice Similarity Coefficient (DSC), (5) Accuracy (ACC), and (6) Hausdorff Distance (HD) 38 :

IoU = TP / (TP + FP + FN), SEN = TP / (TP + FN), SPE = TN / (TN + FP),
DSC = 2TP / (2TP + FP + FN), ACC = (TP + TN) / (TP + TN + FP + FN),
HD(A, B) = max(d_HD(A, B), d_HD(B, A)), with d_HD(A, B) = max_{x∈A} min_{y∈B} d(x, y),

where TP and FP denote the numbers of true positives and false positives, respectively; TN and FN refer to the numbers of true negatives and false negatives, respectively; HD indicates the maximum distance between the two pixel sets; d_HD(A, B) designates the directed Hausdorff distance between the ground truth and the predicted value; and d(x, y) stands for the Euclidean distance between two pixels.
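Under the definitions above, the overlap metrics and the Hausdorff distance can be sketched for binary masks as follows (the function names, flattened-mask layout, and coordinate-set representation are ours):

```python
import math

def confusion_metrics(pred, truth):
    """Pixel-wise overlap metrics from flattened binary masks (0/1)."""
    tp = sum(p and t for p, t in zip(pred, truth))
    fp = sum(p and not t for p, t in zip(pred, truth))
    fn = sum(not p and t for p, t in zip(pred, truth))
    tn = sum(not p and not t for p, t in zip(pred, truth))
    return {
        "IoU": tp / (tp + fp + fn),
        "SEN": tp / (tp + fn),               # sensitivity (recall)
        "SPE": tn / (tn + fp),               # specificity
        "DSC": 2 * tp / (2 * tp + fp + fn),  # Dice coefficient
        "ACC": (tp + tn) / (tp + fp + fn + tn),
    }

def hausdorff(A, B):
    """Symmetric Hausdorff distance between two sets of pixel coordinates."""
    def directed(P, Q):
        return max(min(math.dist(p, q) for q in Q) for p in P)
    return max(directed(A, B), directed(B, A))
```

Note that DSC and IoU are monotonically related (DSC = 2·IoU / (1 + IoU)), which is why the two metrics rank methods identically even though their values differ.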

Comparison to state-of-the-art methods
The segmentation results on the PET/CT dataset built above were obtained by our method with the ICA-Unet network. These results and those of the other state-of-the-art methods (CT threshold, PET threshold, CT and PET threshold intersection, K-means, U-net, 21 U-net++, 39 and SegNet 40) are presented in Fig. 4 and Table 1.
The visualization results imply that, in these examples, the automatic segmentation results obtained by our method are more accurate and consistent with the ground truth; in particular, the segmentation boundary is clearer, more complete, and more coherent. It can be observed from Fig. 4 that the segmentation results of the CT threshold method (Fig. 4(b)), in which areas with a CT density greater than −190 HU and less than −30 HU are regarded as BAT, do contain the BAT area. However, there are significant segmentation errors, and a considerable amount of WAT is mistakenly treated as BAT, so the results of the CT threshold method are unreasonable. The segmentation results of the PET threshold method (Fig. 4(c)), in which areas with SUV values greater than 2 g/ml or 3 g/ml are regarded as BAT, exclude a large amount of WAT and contain a more accurate BAT area. However, the results are both under-segmented and over-segmented, as well as discontinuous and noisy, and the edges are not smooth, reflecting the poor segmentation of the PET threshold method. The segmentation results of the CT and PET threshold intersection method (Fig. 4(d)) demonstrate that the accuracy of the segmented area is significantly improved compared with the PET threshold and CT threshold methods. Nevertheless, the results are still under-segmented and the region is discontinuous, yielding only average segmentation results. The segmentation results of the K-means method (Fig. 4(e)) contain the BAT area, but there is significant over-segmentation, and a considerable amount of surrounding tissue is mistakenly treated as BAT, so the results of the K-means method are unreasonable. The segmentation results of U-net (Fig. 4(f)) contain most of the BAT area, but the area is hollow, discontinuous, and noisy, and the edges are not smooth, contrary to the ground truth.
Compared with the results of U-net, the accuracy of the segmentation results of U-net++ (Fig. 4(g)) is enhanced; nonetheless, there is still a certain amount of under-segmentation. The segmentation results of SegNet (Fig. 4(h)) are similar to those of U-net++, but with more under-segmentation and over-segmentation. Compared with the previous methods, the method proposed in this paper (Fig. 4(i)) obtains segmentation results similar to the ground truth, with more delicate and smoother boundaries, and it addresses the problems of under-segmentation and discontinuity seen in the other methods. Table 1 suggests that the proposed ICA-Unet has the highest Dice coefficient (DSC) and mIoU compared with the traditional threshold methods, as well as the highest DSC and mIoU and the lowest Hausdorff distance compared with mainstream medical image segmentation networks.
The data in Table 1 reveal that the proposed method has significant advantages in performance and accuracy and avoids the problems of under-segmentation, discontinuity, and individual threshold differences caused by static threshold segmentation methods. Objectively, ICA-Unet can realize automatic segmentation of BAT.

Evaluation of network architectures
In this study, six combinations of the U-net, IIE, Do-Conv, and CAT modules were evaluated to assess the effectiveness of our network (ICA-Unet) architecture, with accuracy measured by mIoU and by the loss convergence speed. The quantification results are provided in Table 2 and Fig. 5, from which the following conclusions can be drawn: (1) our architecture (U-net + IIE + Do-Conv + CAT) has the highest accuracy compared with the other architectures; (2) the loss converges relatively fast when our architecture is trained.
As suggested by Table 2 and Fig. 5, further analysis was conducted on the IIE module. In summary, the segmentation accuracy of the network improved with the introduction of the IIE module: U-net + IIE exhibited an improvement of 0.07 in mIoU compared to U-net. However, the loss convergence speed of the network slowed down significantly (convergence is reached at epoch = 32). The analysis of Do-Conv implied that the loss convergence speed can be significantly improved by replacing the 2D convolutional layers in the U-net with Do-Conv layers. Moreover, this effectively manages the decrease in network convergence rate caused by the introduction of the IIE module and the channel attention (CAT) module.
The analysis of the CAT module demonstrated that its introduction can effectively improve the segmentation accuracy of the network, and it dramatically improves segmentation accuracy when used together with IIE. At the same time, the loss convergence speed of the network decreases (loss convergence is reached at epoch = 50).
In summary, the IIE and CAT modules can effectively strengthen the segmentation accuracy of the network, and the Do-Conv layer can greatly accelerate the loss convergence of the network. Therefore, the three are combined in the U-net network architecture. The experimental results verified that the best segmentation result can be obtained while retaining a reasonable loss convergence speed.

Conclusions
In this study, an improved U-net network (ICA-Unet) for BAT segmentation has been proposed based on the classic U-net architecture and the image information entropy, channel attention, and Do-Conv modules. The network learns the PET/CT channel weights through the channel attention modules and enhances the edge features of the feature maps through the IIE modules, while the decrease in loss convergence speed caused by introducing these modules is mitigated by the Do-Conv layers. The proposed network architecture and method achieve an mIoU score of 0.832. Compared with other methods, ICA-Unet has significant advantages and avoids the problems of under-segmentation, discontinuity, and threshold differences caused by static threshold segmentation methods, realizing automatic BAT segmentation. The proposed method can assist radiologists and nuclear medicine physicians in efficiently segmenting BAT and significantly facilitate clinicians and researchers in conducting related research on BAT.

Conflicts of Interest
The authors declare that there are no conflicts of interest relevant to this paper.