TAGU-Net: Transformer Convolution Hybrid-Based U-Net With Attention Gate for Atypical Meningioma Segmentation

Meningioma is derived from the cap cells that reside on the arachnoid membrane. Atypical meningioma, classified as Grade II by the World Health Organization, is one of the grades of meningioma. Early surgical resection has been found to significantly reduce tumor recurrence and mortality. Accurate segmentation of magnetic resonance images of brain tumors is therefore crucial for diagnosing and treating atypical meningiomas. However, traditional automatic segmentation frameworks rely heavily on convolution. Convolution-based segmentation networks are limited by the size of the convolution kernels, a restricted receptive field, and a lack of spatial aggregation ability. To overcome these limitations, this paper presents a novel hybrid architecture named TAGU-Net, which combines Transformer and convolution based on U-Net with an attention gate. TAGU-Net extracts features from feature maps of different resolutions and scales using a convolutional neural network and a Transformer. Relying on the global self-attention mechanism of the Transformer, this approach effectively captures the image's long-distance dependency and global characteristics in the encoder stage. Additionally, the inductive bias of the convolutional neural network is incorporated to enhance local modeling information and improve the model's overall modeling ability. In the decoder phase, an attention gate is introduced to adaptively learn the skip-connection information and up-sampling information in the network. This information is weighted and fused to highlight important features and suppress irrelevant ones. To train the model better and avoid vanishing gradients, deep supervision is used in the training process: supplementary losses are added at some stages to supervise training and achieve the best atypical meningioma segmentation.
The proposed method is evaluated on both a private atypical meningioma dataset and the publicly available BraTs2018 dataset. TAGU-Net achieves Dice scores of 97.67% and 97.62% and Jaccard indices of 96.35% and 95.35% on the private atypical meningioma dataset and the BraTs2018 dataset, respectively, a state-of-the-art segmentation result that surpasses existing methods. According to these results, the TAGU-Net model significantly improves atypical meningioma segmentation and can effectively assist doctors in processing MRI images.

resection can be curative for nearly 80% of benign tumors, but intracranial meningioma remains a dangerous disease [7]. High-grade meningiomas exhibit an increased risk of recurrence after treatment, behave aggressively, and increase morbidity while decreasing survival [8], [9], [10]. Numerous studies have shown that grade II and III meningiomas are recurrent and aggressive [11], and that grade III meningiomas are considered the most aggressive, i.e., malignant. Therefore, diagnosing and segmenting grade II meningiomas, i.e., AM, is of great clinical significance, especially as the tumor grows slowly and compresses vital organs before progressing to malignancy. Early detection of AM holds significant value in the treatment of meningiomas and ultimately enhances patient survival rates.
Segmentation methods based on traditional machine learning are unpopular because of their complexity, cumbersome operation, and low accuracy. Currently, mainstream deep learning methods still rely on pure convolution architectures, while pure Transformer architectures and hybrid convolutional neural network (CNN)-Transformer architectures have their own defects. Traditional CNN segmentation networks are limited by the size of the convolution kernel, leading to a limited receptive field and insufficient spatial aggregation ability [12]. While dilated convolution can increase the receptive field of a CNN, it is not sufficient to overcome these problems [13]. Lacking prior knowledge such as the CNN inductive bias (i.e., locality and translation equivariance), a pure Transformer architecture requires a large amount of data to learn enough information, which is extremely difficult and particularly challenging on medical image datasets with few samples [14]. In hybrid CNN-Transformer architectures, the Transformer typically operates on the feature map extracted by the CNN [15]. Obviously, this approach leads to a significant loss of valuable information.
In this study, we propose a novel hybrid architecture that combines Transformer and convolution, based on U-Net with an attention gate, to achieve automatic segmentation of atypical meningiomas. Two types of encoders are designed: ConvEncoder and FormerEncoder. Different from conventional hybrid Transformer-CNN architectures, the proposed FormerEncoder does not model the feature maps extracted by the CNN; instead, in the encoder stage, the two types of encoders extract features from feature maps of different resolutions at different scales. ConvEncoder and FormerEncoder extract different information from these feature maps, and the information obtained by the same encoder also differs across scales and resolutions: the shallow features obtained at high resolution contain texture, contour, and position information, while the deep features obtained at low resolution contain rich semantic information. TAGU-Net fuses the long-distance dependency and global features of images captured by FormerEncoder with the local features extracted by ConvEncoder to produce a more effective feature representation. Moreover, FormerEncoder is a flexible and efficient encoder: its Former Encoder Block can be replaced based on the characteristics of different tasks or datasets, for example with Swin-Transformer [16], PVT [17], or T2T-ViT [18]. In the decoder stage, we only use ConvDecoder to avoid high model complexity. At the same time, we introduce the attention gate mechanism to adaptively learn the skip-connection information and the up-sampling information in the network, carry out a weighted fusion of the two, highlight important features and suppress irrelevant ones, realizing feature reuse in the decoder stage.
At the same time, in the training phase, we use deep supervision, introducing an auxiliary loss function at some stages to supervise training. The main contributions of this paper can be summarized as follows:
• A Transformer-convolution hybrid architecture based on U-Net with an attention gate is proposed for MRI segmentation and learning of atypical meningiomas. The results demonstrate that this framework surpasses state-of-the-art models in terms of performance.
• Genetic algorithm-based adaptive histogram equalization is used to preprocess the original MRI images to enhance image details, thereby achieving more precise segmentation.
• The FormerEncoder module is designed to capture global features at different scales and model the long-distance dependence of the image, and it is flexible and replaceable based on different data characteristics. In addition, the convolution features generated by ConvEncoder are fused to achieve a complementary structure.
• The Attention Gate module is introduced to adaptively learn feature information from different structures in the decoder branch, highlighting important features and suppressing irrelevant features.
The rest of this paper is organized as follows: The related work on atypical meningioma segmentation and segmentation networks is presented in Section II. Section III introduces the dataset used and provides a detailed description of the model framework and algorithm proposed in this paper. Subsequently, the experimental results are presented and analyzed in Section IV, including the performance comparison with other methods. Finally, Section V presents the main conclusions of this work.

II. RELATED WORK
Clinically, methods for meningioma diagnosis and recognition are divided into invasive and non-invasive approaches. Non-invasive medical imaging techniques such as computed tomography (CT) and magnetic resonance imaging (MRI) are favored in the diagnostic stage as brain tumor recognition tools, outperforming invasive methods such as tissue biopsy [19]. Among non-invasive medical imaging techniques, MRI is considered the most common technique for diagnosing meningiomas because it can provide detailed images of human tissues and organs non-invasively.
Brain tumor segmentation is an essential step before applying any treatment, and the current standard method of brain tumor segmentation is manual, based on expert experience: experts must manually segment the MRI to delineate the target image. A surge in the number of patients can lower the quality of physicians' work and lead to manual segmentation errors. With the development of computer technology, computer-aided diagnosis (CAD) systems have developed quickly and been applied to segment tumors. A large number of studies have achieved great success in the fields of breast cancer [20], [21], brain tumors [22], [23], and other fields.

A. MACHINE LEARNING
The field of meningioma segmentation has developed rapidly. The main methods are divided into segmentation methods based on traditional machine learning and segmentation methods based on deep learning. There are two kinds of segmentation methods based on conventional machine learning: one is based on unsupervised clustering, and the other transforms the segmentation problem into a pixel classification problem. Almahfud et al. used a combination of K-means and Fuzzy C-means (FCM) clustering to detect brain tumors [24]. Benson et al. implemented an improved version of fuzzy C-means clustering and the watershed algorithm; an effective way of selecting the initial centroid based on histogram calculation was proposed to improve the accuracy of clustering, and a set-based tag detection method was proposed to avoid over-segmentation [25]. Saha and Hossain proposed a way to automatically classify MRI brain images using K-means clustering, the non-subsampled contourlet transform (NSCT), and a support vector machine (SVM); because NSCT has significant characteristics such as multiscale, multidirectional, and shift invariance, K-means clustering and NSCT are used to segment MRI brain images, improving the efficiency and accuracy of segmentation [26]. Amin et al. applied a random forest (RF) classifier to a fused eigenvector to classify three sub-tumor regions, using a mixture of Gabor wavelet features (GWF), histograms of oriented gradients (HOG), local binary patterns (LBP), and segmentation-based fractal texture analysis (SFTA) features [27]. Al-Dmour and Al-Ani proposed an efficient and fully automatic brain tissue segmentation algorithm based on clustering fusion, which divides the target based on superpixels, three clustering algorithms, and a neural network [28]. Kaya et al. used principal component analysis (PCA) for multivariate data reduction and five standard PCA algorithms for target segmentation [29].

B. DEEP LEARNING
Among segmentation methods based on deep learning, owing to the excellent performance of convolutional neural networks in image processing and computer vision, especially after the birth of AlexNet [30], CNNs have undergone explosive development, and the CNN architecture has become the leading choice in medical image segmentation. Kamnitsas et al. proposed a dual-channel, 11-layer deep three-dimensional convolutional neural network and designed an efficient and effective dense training scheme that automatically adapts to the inherent class imbalance in the data, using the dual-channel architecture to combine local and more extensive context information [31]. Havaei et al. proposed a fully automatic brain tumor segmentation method based on deep neural networks (DNN); by using a dual-channel CNN architecture and a cascade architecture, the system can more accurately model local label dependency by using local features together with more global context features [32]. Díaz-Pernas et al. proposed a deep convolutional neural network with a multiscale approach; inspired by the inherent multiscale operation of the human visual system (HVS), the input images are processed at three spatial scales along different processing paths, and the multiscale processing strategy can effectively extract discriminative texture features for different types of tumors [33]. Haq et al. proposed an ensemble and hybrid method based on a deep convolutional neural network and a machine learning classifier: after learning the feature map from the brain MRI image space to the tumor marker region through a CNN, a faster region-based CNN with a region proposal network (RPN) was developed for tumor region localization, and finally the deep CNN and machine learning classifier were connected to achieve target segmentation [34]. Ding et al. proposed the Stack Multi-Connection Simple Reduction Net (SMCSRNet) based on the U-Net framework, which reduces the number of model parameters and adds bridging between stacked cascaded networks to mitigate information loss [35]. Maji et al. proposed an Attention Res-UNet with a guided decoder (ARU-GD), which designs the loss function by guiding the decoder and introduces the attention gate to focus on the activation of relevant information [36].

C. SEGMENTATION NETWORK
The fully convolutional networks (FCNs) [37] proposed by Long et al. achieved the state of the art (SOTA) in image and semantic segmentation using only convolution. Ronneberger et al. proposed U-Net, a medical image segmentation network with a symmetrical encoder-decoder structure [12]; U-Net has performed excellently on medical images with small data scale, and most subsequent segmentation networks continue to use the U-Net structure and improve on it. Ibtehaz et al. rethought U-Net: inspired by Inception [38], they replaced the traditional convolutional layer with a multi-resolution idea and introduced residual connections [39]. Instead of simply connecting the feature maps from the encoder to the decoder, the features pass through a chain of convolutional layers with residual connections and are then combined with the decoder features to enhance the feature representation [40]. Influenced by the Transformer [14], [41], a large number of studies have explored the feasibility of the Transformer in medical image segmentation. Hatamizadeh et al. used a Transformer as an encoder to learn the sequence representation of the input and effectively capture global multi-scale information, combining it with a CNN decoder through skip connections at different resolutions [42]. Wang et al. proposed a network based on a Transformer encoder-decoder structure [43]: a 3D CNN is used to extract the spatial feature map, which is fed to the Transformer for global feature modeling, while the decoder uses the features embedded by the Transformer and performs progressive up-sampling to predict the detailed segmentation map.

A. PRIVATE DATASET
This study used a private atypical meningioma patient dataset from Weihai Municipal Hospital. For this dataset, researchers retrospectively collected pre-operative MRI scans of 203 subjects from 2010 to 2019. All subjects met the following criteria: (1) First tumor resection performed at Weihai Municipal Hospital; (2) A precise grade in the postoperative pathological diagnosis; (3) Preoperative high-quality cranial T2-weighted imaging (T2) and contrast-enhanced T1-weighted imaging (T1C) MRI; (4) Complete preoperative clinical data and information; (5) No history of surgery, gamma knife, or other treatments; (6) Complete MRI sequences (T2/T1C) with imaging free of artifacts.
Multimodal MRI delivers a great deal of information for segmentation and extraction of meningiomas; in particular, meningioma scans provide hundreds of 2D brain slices with high soft-tissue contrast. The common MRI sequences are T1, T2, T1C, and fluid-attenuated inversion recovery (FLAIR). Each MRI sequence produces images with different tissue contrast, which plays a different role in distinguishing tumors [44], [45]. The T1 modality is usually used to examine healthy tissue, the T2 modality is more suitable for detecting the boundary of edematous regions, the T1C modality highlights the tumor boundary, and the FLAIR modality is favorable for detecting edematous regions in cerebrospinal fluid.
In this study, we utilized the T1C and T2 MRI sequences simultaneously. Because different MRI sequences arise from different signal information and, in a broad sense, belong to different modalities, using T1C and T2 simultaneously constitutes multimodal information fusion. Conventional joint multimodal fusion usually constructs a multi-branch structure, with each modality in its own stream and feature fusion performed after feature extraction in the separate branches. However, this approach makes it challenging to capture the relationships between different modalities and difficult to exploit their complementary information. Moreover, although the modalities carry different signals, they depict the same underlying anatomy. To address this issue, we adopt a novel approach that directly inputs the two modalities together to effectively extract the relationship between the different modal information.
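This direct two-modality input can be sketched as follows: the co-registered T1C and T2 slices are simply stacked along a channel axis so a single encoder sees both signals jointly (a minimal NumPy illustration; the function name is ours, not the paper's):

```python
import numpy as np

def fuse_modalities(t1c: np.ndarray, t2: np.ndarray) -> np.ndarray:
    """Stack two co-registered MRI sequences along a channel axis so one
    encoder can learn cross-modality relationships directly (early fusion)."""
    assert t1c.shape == t2.shape, "sequences must be co-registered and same size"
    return np.stack([t1c, t2], axis=-1)  # (H, W) + (H, W) -> (H, W, 2)

# Example: two 512 x 512 slices fused into one 2-channel input
t1c = np.zeros((512, 512), dtype=np.float32)
t2 = np.ones((512, 512), dtype=np.float32)
x = fuse_modalities(t1c, t2)
print(x.shape)  # (512, 512, 2)
```

This contrasts with the multi-branch designs described above, where each modality is processed in a separate stream and fused only after feature extraction.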
The resolution of most images in this dataset is 512 × 512, while a few images have resolutions of 432 × 512 and 496 × 512. To unify the resolution and facilitate feeding images into the model, we set all image resolutions to 512 × 512. Fig.1 shows the MRI images of different modalities and their corresponding segmentation results.

B. PUBLIC DATASET
The BraTs dataset serves as a public benchmark for brain tumor segmentation; for our study, we utilized the BraTs 2018 [44], [46] training dataset acquired from the official website to evaluate the proposed method. This dataset comprises two types of gliomas: high-grade glioma (HGG), typically classified as WHO grade III or IV, and low-grade glioma (LGG), typically classified as WHO grade I or II. Given that atypical meningiomas are low grade (WHO grade II), we focused our verification solely on the LGG patients, totaling 75 patients in the BraTs 2018 dataset. Each patient's MRI includes the corresponding T1, T1C, T2, and FLAIR sequences, which led to a collection of 4845 slice images, each containing information from the four sequences. The size of each slice image is 160 × 160 × 4.

C. PREPROCESSING OF IMAGE DATA
Because medical images are very susceptible to noise, the quality of the acquired images is often low, with noticeable noise and low contrast. However, image quality has a significant impact on subsequent diagnosis and segmentation: in low-quality images, the region of interest (ROI) may not be observable, resulting in abnormal diagnosis or segmentation. Therefore, it is necessary to denoise the image and enhance its contrast. This preprocessing aims to remedy the defects of MRI and generate the most precise and representative MRI possible in order to achieve the most accurate segmentation. In this paper, we performed a series of preprocessing operations: Gaussian filter denoising was used to remove general noise, and then genetic algorithm-based adaptive histogram equalization (GAAHE) [47] was used to enhance the contrast of MRI.

1) GAUSSIAN FILTER
The Gaussian filter is a linear smoothing filter used to smooth the image and remove noise. When computing the Gaussian-smoothed result, the origin is the center point, and the other points are weighted according to their positions on the normal distribution curve to obtain a weighted average. The template used in this paper is 5 × 5. The Gaussian filter kernel is commonly defined as

G(x, y) = (1 / (2πσ²)) · exp(−(x² + y²) / (2σ²))

where x and y index positions within the kernel and σ² is the variance of the Gaussian filter.
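The 5 × 5 template can be built directly from this definition; the following sketch (function name and parameter values are illustrative) constructs a normalized Gaussian kernel whose center point carries the largest weight:

```python
import numpy as np

def gaussian_kernel(size: int = 5, sigma: float = 1.0) -> np.ndarray:
    """Build a normalized size x size Gaussian template; the center pixel
    gets the largest weight and weights fall off with distance, following
    G(x, y) = exp(-(x^2 + y^2) / (2 sigma^2)) up to normalization."""
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1]
    g = np.exp(-(x**2 + y**2) / (2.0 * sigma**2))
    return g / g.sum()  # normalize so the weighted average preserves brightness

k = gaussian_kernel(5, 1.0)
print(k.shape)  # (5, 5)
print(k[2, 2] == k.max())  # True: center point has the largest weight
```

Convolving an image with this template produces the weighted average described above; in practice a library routine such as `scipy.ndimage.gaussian_filter` performs the same smoothing.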

2) GENETIC ALGORITHM BASED ADAPTIVE HISTOGRAM EQUALIZATION
Adaptive histogram equalization (AHE) is commonly used to enhance contrast in medical images, but artifacts and noise amplification often occur in practice; it is evident in Fig.2 that the artifacts and noise of MRI images after AHE are severe. Contrast-limited adaptive histogram equalization (CLAHE) [48] is an improved version of AHE, which suppresses AHE's noise amplification by limiting the contrast. This paper uses genetic algorithm-based adaptive histogram equalization, which is also an improvement on AHE: a new subdivision method splits the histogram at an exposure threshold and an optimal threshold to maintain brightness and reduce information loss, with the threshold parameters optimized using a genetic algorithm. The probability density function (PDF) of each sub-histogram is then modified to improve image quality. Fig.2 shows the comparison between the preprocessed image and the original image.
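As a point of reference for the equalization step that GAAHE shares with AHE, the following NumPy sketch shows plain global histogram equalization (map intensities through the normalized cumulative distribution); the GA-driven histogram subdivision and PDF modification of GAAHE are not reproduced here:

```python
import numpy as np

def equalize_histogram(img: np.ndarray, levels: int = 256) -> np.ndarray:
    """Global histogram equalization: map intensities through the
    normalized CDF so the output histogram is approximately uniform.
    GAAHE additionally splits the histogram at GA-optimized thresholds;
    this sketch shows only the shared equalization step."""
    hist = np.bincount(img.ravel(), minlength=levels)
    cdf = hist.cumsum().astype(np.float64)
    cdf = (cdf - cdf.min()) / (cdf.max() - cdf.min())  # normalize to [0, 1]
    lut = np.round(cdf * (levels - 1)).astype(np.uint8)
    return lut[img]

img = np.tile(np.arange(64, dtype=np.uint8), (64, 1))  # low-contrast ramp
out = equalize_histogram(img)
print(out.min(), out.max())  # 0 255: the narrow ramp is stretched to full range
```

Applying the mapping per tile rather than globally turns this into AHE, which is where the artifacts discussed above originate.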

D. PROPOSED FRAMEWORK
This paper attempts to solve the segmentation problem of atypical meningioma MRI. The proposed framework is to design a hybrid model of transformer and convolution to adapt to the input of multimodal MRI sequences to achieve accurate segmentation. As shown in Fig.3, the framework is divided into three stages: preprocessing image data, TAGU-Net segmentation model, training, and model evaluation.

1) TAGU-NET NETWORK ARCHITECTURE
The proposed TAGU-Net is an improvement on the classical U-Net segmentation architecture, which consists of two parts: the encoder branch and the decoder branch. U-Net achieves precise localization mainly through contracting and expanding paths. The encoder branch of the U-Net network is primarily composed of convolution and down-sampling operations and is responsible for feature extraction. The decoder branch restores the feature map to the original resolution. The two branches are connected mainly through skip connections, which complete information fusion by splicing low-level position information with deep semantic information.
In this paper, considering that U-Net belongs to a fully convolutional network and is limited by the local spatial information of convolution, we propose a novel hybrid architecture consisting of Transformer and convolution based on U-Net with attention gate, combining Transformer's global characteristics and long-distance dependence of the image, in which the attention gate acts on the skip connection and up-sampling. Fig.4 shows the overall architecture of the proposed TAGU-Net model.
The model's input is an MRI image, and the output is an AM mask image. In this model, we unified the size of the input MRI images to a resolution of 512 × 512. In the encoder branch, all input images go through ConvEncoder and FormerEncoder, respectively. After feature fusion, they are gradually down-sampled; after the final encoder, the feature map has been reduced to 1/16 of the original MRI image size. The structures of ConvEncoder and FormerEncoder are introduced below.
After passing through the encoder branch, the feature map enters the decoder branch. The decoder branch is mainly composed of ConvDecoder, the attention gate, and deep supervision, where ConvDecoder has the same structure as ConvEncoder. The attention gate is primarily used to adaptively learn the skip-connection information and up-sampling information and perform a weighted fusion of the two, highlighting important features and suppressing irrelevant ones. The purpose of deep supervision is to train the network better by adding auxiliary losses to supervise training.

2) ConvEncoder
ConvEncoder is used to extract the inductive bias and local feature information of the image. A traditional convolutional encoder is fully convolutional, but the ConvEncoder in this paper is not: to learn feature-map weights better, we add the channel-attention SE module [49]. After passing through the SE module, ConvEncoder learns the correlation between channels and increases the weight of important channels in the subsequent feature fusion. ConvEncoder is a stack of convolution layers, and the hyperparameter depth determines the number of layers. To keep the extra computation and parameter count from growing significantly with depth, we add the SE module only after the first convolution layer. Meanwhile, because the input MRI image has size 512 × 512 × 1, the SE module has little effect at that point, so it is not added in the first ConvEncoder. The activation function σ(x) in all convolution layers is the SiLU function:

σ(x) = x · sigmoid(x) = x / (1 + e^(−x))

In addition, for each ConvEncoder we add a residual connection with a bottleneck structure to avoid the vanishing-gradient problem and network degradation:

X_1 = SE(σ(BN(F(X; W_c))))
X_i = σ(BN(F(X_{i−1}; W_c))), i = 2, …, N
Y = X_N + σ(BN(F(X; W_s)))

where X ∈ R^(H×W×C) is the input image, H and W are the resolution of the image, C is the number of channels, F(∗) denotes a same-padding convolution operation, W_c is the convolution weight, BN(∗) denotes batch normalization, σ(∗) denotes the activation function, SE(∗) denotes the SE module, W_s is the residual-connection convolution weight, and N is the depth of ConvEncoder. Fig.5 shows the structure diagram of ConvEncoder.
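The SiLU activation and the SE-style channel reweighting used in ConvEncoder can be illustrated as follows (a NumPy sketch with a channels-last layout and made-up weight shapes; not the paper's implementation, which operates inside a convolutional stack):

```python
import numpy as np

def silu(x: np.ndarray) -> np.ndarray:
    """SiLU activation: x * sigmoid(x) = x / (1 + exp(-x))."""
    return x * (1.0 / (1.0 + np.exp(-x)))

def se_block(feat: np.ndarray, w1: np.ndarray, w2: np.ndarray) -> np.ndarray:
    """Squeeze-and-Excitation: global-average-pool each channel, pass the
    channel descriptor through two FC layers, then rescale the channels so
    important channels get larger weight in subsequent fusion."""
    s = feat.mean(axis=(0, 1))              # squeeze: (H, W, C) -> (C,)
    z = silu(s @ w1)                        # excitation, reduced dimension
    gate = 1.0 / (1.0 + np.exp(-(z @ w2)))  # per-channel weight in (0, 1)
    return feat * gate                      # reweight channels

rng = np.random.default_rng(0)
feat = rng.standard_normal((8, 8, 16))  # toy feature map
w1 = rng.standard_normal((16, 4))       # illustrative reduction ratio of 4
w2 = rng.standard_normal((4, 16))
out = se_block(feat, w1, w2)
print(out.shape)  # (8, 8, 16): same shape, channel-reweighted
```

The residual path of ConvEncoder then adds a convolved copy of the block input to the output of this stack, as in the equations above.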

3) FormerEncoder
Recently, the Transformer has gradually become the primary tool in natural language processing (NLP). At the same time, the Transformer also excels in computer vision (CV) and has gradually become a basic component of many CV models; it has also received much attention and research in medical image processing. The Transformer's primary approach in CV is to split the input image into patches using different strategies, embed the patches in a high-dimensional space, and use the self-attention mechanism to model long-distance dependencies. The Transformer is free of the limitations of convolution. However, in hybrid Transformer-CNN architectures, the Transformer is usually used to model the feature map after the CNN has extracted features. Such a method loses most of the image information: the Transformer models only feature maps containing rich semantic information, and the representation of shallow features is missing. In this paper, we design a Transformer-based encoder called FormerEncoder. FormerEncoder works together with ConvEncoder to perform feature extraction on feature maps of different resolutions at different scales; combining the different representations of convolution and Transformer, and of deep and shallow features, helps the model perform better segmentation. Meanwhile, FormerEncoder can be flexibly replaced according to the task and data characteristics, for example with Swin-Transformer [16], PVT [17], or T2T-ViT [18]. In this article, for simplicity, we designed it based only on the basic ViT. FormerEncoder follows the classic ViT [14] architecture and comprises three parts: the Patch Embedding Block, the Former Encoder Block (FEB), and the Upper Sampling Layer (USL).
In FormerEncoder, an image X ∈ R^(H×W×C) enters the Patch Embedding Block and is cut into several non-overlapping patches x_p ∈ R^(N×(P²·C)), which are then embedded in a high-dimensional space, where P is the resolution of each patch and N = HW/P² is the number of patches generated. In FormerEncoder, the image is thus converted from ordered spatial information into unordered sequence information, at which point spatial position information becomes essential.
To retain the spatial position information of the image, after all patches are embedded we add a learnable position embedding E_pos ∈ R^(N×D) at the end of the Patch Embedding Block, where D is the latent vector size set in the Patch Embedding Block; the embedded patches and the position embedding are fused by addition. The Patch Embedding Block is defined as

z_0 = [x_p^1 E; x_p^2 E; … ; x_p^N E] + E_pos, E ∈ R^((P²·C)×D)

FormerEncoder mainly consists of a stack of FEBs. Each FEB is divided into three parts: Multi-headed Self-Attention (MSA), a Feed-Forward Network (FFN), and LayerNorm (LN). MSA captures the long-distance dependency and global feature information of the image through self-attention, LN performs normalization, and FFN finally performs dimension transformation and mapping. With L the number of stacked FEB layers, d = D/H the self-attention embedding dimension in the FEB, and H the number of heads in MSA, the FEB is defined as

z′_ℓ = MSA(LN(z_{ℓ−1})) + z_{ℓ−1}, ℓ = 1, …, L
z_ℓ = FFN(LN(z′_ℓ)) + z′_ℓ, ℓ = 1, …, L
MSA(z_{ℓ−1}) = [Attention_1(z_{ℓ−1}); Attention_2(z_{ℓ−1}); … ; Attention_H(z_{ℓ−1})] W_O
Attention_h(z) = softmax(Q_h K_h^T / √d) V_h

The Upper Sampling Layer is the last component of FormerEncoder. Because the resolution of the image is halved after the Patch Embedding Block and FEBs, while the resolution does not decrease after ConvEncoder, the resolution of the feature map produced by FormerEncoder must be restored before it can be fused with the feature map from ConvEncoder. Fig.6 shows the structure diagram of the FormerEncoder.
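The MSA step of the FEB can be sketched in NumPy as follows (head splitting shown without the final output projection W_O, and all weight names and sizes are ours, for illustration only):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def msa(z, wq, wk, wv, heads):
    """Multi-headed self-attention over a patch sequence z of shape (N, D):
    each head attends with dimension d = D // heads, and head outputs are
    concatenated, matching MSA(z) = [Attention_1(z); ...; Attention_H(z)]."""
    n, dim = z.shape
    d = dim // heads
    outs = []
    for h in range(heads):
        sl = slice(h * d, (h + 1) * d)
        q, k, v = z @ wq[:, sl], z @ wk[:, sl], z @ wv[:, sl]
        attn = softmax(q @ k.T / np.sqrt(d))  # (N, N): every patch attends
        outs.append(attn @ v)                 # to every other patch
    return np.concatenate(outs, axis=-1)      # (N, D)

rng = np.random.default_rng(0)
z = rng.standard_normal((4, 8))  # N = 4 patches, D = 8
wq, wk, wv = (rng.standard_normal((8, 8)) for _ in range(3))
print(msa(z, wq, wk, wv, heads=2).shape)  # (4, 8)
```

Because the (N, N) attention map couples every patch with every other patch, this is the mechanism by which the FEB captures the long-distance dependencies that a convolution kernel cannot.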

4) ATTENTION GATE
The traditional U-Net structure uses only simple concatenation to fuse skip-connection and up-sampling information; a more elaborate option is to apply a nonlinear transformation before concatenation. However, these methods do not consider the correlation between the skip-connection feature information and the up-sampling feature information.
In this paper, we introduce an attention gate at this connection, which considers both the skip-connection feature information and the up-sampling feature information. With this addition, the model can adaptively learn both kinds of feature information and weight the two, highlighting important features while suppressing irrelevant ones. As seen in Fig.7, the inputs of the attention gate are the skip-connection feature information generated by the encoder feature map and the up-sampling feature information generated by the decoder of the upper layer. The two inputs are processed in parallel, and finally a concatenated fused feature map is obtained:

q = ReLU(F(h) + F(x_ℓ))
α = sigmoid(F(q))
x̂_ℓ = [α ⊙ h; x_ℓ], ℓ = 1, …, H

where h is the skip-connection feature information generated by the encoder feature map, x_ℓ is the up-sampling feature information generated by the upper decoder, H is the decoder depth, F(∗) denotes a convolution operation, and α is the resulting attention coefficient.
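A minimal sketch of this gating arithmetic, with the 1 × 1 convolutions reduced to per-channel matrix multiplies (all names and shapes here are illustrative, not the paper's):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def attention_gate(h, x_up, w_h, w_x, w_psi):
    """Additive attention gate: project skip features h and upsampled
    features x_up into a joint space, derive a per-pixel coefficient
    alpha in (0, 1), and use it to pass salient skip features while
    suppressing irrelevant ones before concatenation."""
    q = np.maximum(h @ w_h + x_up @ w_x, 0.0)  # ReLU of joint projection
    alpha = sigmoid(q @ w_psi)                 # (H, W, 1) attention map
    gated = h * alpha                          # reweight the skip connection
    return np.concatenate([gated, x_up], axis=-1)  # fuse for the decoder

rng = np.random.default_rng(0)
h = rng.standard_normal((16, 16, 8))     # skip-connection features
x_up = rng.standard_normal((16, 16, 8))  # upsampled decoder features
w_h, w_x = rng.standard_normal((8, 4)), rng.standard_normal((8, 4))
w_psi = rng.standard_normal((4, 1))
print(attention_gate(h, x_up, w_h, w_x, w_psi).shape)  # (16, 16, 16)
```

Because alpha depends on both inputs, regions where skip and decoder features disagree can be suppressed, which plain concatenation cannot do.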

5) DEEP SUPERVISION
Deep supervision [50] is one of the methods commonly employed to overcome vanishing gradients and slow convergence in neural networks. Its main idea is to add auxiliary classifiers to some hidden layers of the model as branch structures that supervise the training of the backbone network. Most importantly, deep supervision provides a way to judge the quality of hidden-layer feature maps during training. In this study, we also use deep supervision to accelerate the convergence of the proposed network and supervise training. As seen in Fig.4, we added three groups of branch structures in the decoder branch. These three groups perform deep supervision on feature maps of different resolutions, adding auxiliary losses that are computed on the feature maps restored by the three branches during training, namely UpperLoss, MidLoss, and LowerLoss, which are combined with the main network loss MainLoss using different weights.
where α, β, γ , and δ is the weight coefficient corresponding to the loss, which determines the impact of the predicted loss on the whole loss at different scales, MainLoss will be given more weight.
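As a concrete illustration, the weighted combination can be written as below. The coefficient values here are assumptions; the text only states that MainLoss receives the largest weight.

```python
# Illustrative deep-supervision loss weighting. The coefficient values
# are assumptions; the paper only states that MainLoss is weighted most.
def total_loss(upper_loss: float, mid_loss: float, lower_loss: float,
               main_loss: float,
               alpha: float = 0.2, beta: float = 0.2,
               gamma: float = 0.2, delta: float = 0.4) -> float:
    return (alpha * upper_loss + beta * mid_loss
            + gamma * lower_loss + delta * main_loss)

# With equal per-branch losses, the result is simply the sum of the weights.
combined = total_loss(1.0, 1.0, 1.0, 1.0)  # ≈ 1.0
```

In practice each auxiliary branch upsamples its feature map to the label resolution before its loss is computed, so all four terms are comparable in scale.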

6) LOSS FUNCTION
The most commonly used loss function in medical image segmentation is pixel-wise cross entropy (CE). Image segmentation is the classification of each pixel, and CE checks each pixel separately, computing the cross entropy between the predicted value and the ground truth one pixel at a time:

CE = −(1/N) Σ_{i=1}^{N} [y_i log p_i + (1 − y_i) log(1 − p_i)],

where y_i is the real category of the i-th input pixel, p_i is the predicted probability of category 1, and N is the number of image pixels. Weighted cross entropy (WCE) improves CE by placing a weight before the loss of each corresponding class to alleviate class imbalance:

WCE = −(1/N) Σ_{i=1}^{N} [β y_i log p_i + (1 − y_i) log(1 − p_i)],

where β is the class weight. Dice loss and IOU loss [51] are area-based loss functions that aim to minimize the mismatch, or equivalently maximize the overlapping area, between the ground truth G and the predicted segmentation P:

DiceLoss = 1 − 2|P ∩ G| / (|P| + |G|),
IOULoss = 1 − |P ∩ G| / |P ∪ G|.

In medical image segmentation there are often only one or two targets in an image, and the proportion of the target is much smaller than that of the background. Since image segmentation is essentially a classification problem, this causes class imbalance and a severe imbalance between positive and negative sample scales. Focal loss [52] adds a penalty term to address this problem. Its basic idea is that, when categories are highly unbalanced, the network tends to predict only negative samples; as a result, the prediction probability of negative samples will be very high, and the returned gradient is also large. Adding the factor (1 − p_i)^γ reduces the loss of samples with high prediction probability and increases the relative loss of samples with low prediction probability, thereby strengthening attention to positive samples:

FocalLoss = −(1/N) Σ_{i=1}^{N} [(1 − p_i)^γ y_i log p_i + p_i^γ (1 − y_i) log(1 − p_i)].

In this paper, we design a mixed loss function, the weighted sum of Focal loss, Dice loss, and IOU loss, whose goal is to reduce the pixel-wise cross entropy while maximizing the region-level match:

Loss = w_1 · FocalLoss + w_2 · DiceLoss + w_3 · IOULoss.

At the same time, because Focal loss is used, the class-imbalance problem is alleviated to some extent. Here w_1, w_2, and w_3 are the weight coefficients of the respective losses, with DiceLoss given the largest weight.
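A minimal PyTorch sketch of such a mixed loss follows. The smoothing constant and the default weights are assumptions; the paper only specifies that DiceLoss receives the largest weight.

```python
import torch

def dice_loss(p: torch.Tensor, y: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    # 1 - 2|P∩G| / (|P| + |G|), with eps for numerical stability (assumption)
    inter = (p * y).sum()
    return 1 - (2 * inter + eps) / (p.sum() + y.sum() + eps)

def iou_loss(p: torch.Tensor, y: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    # 1 - |P∩G| / |P∪G|
    inter = (p * y).sum()
    union = p.sum() + y.sum() - inter
    return 1 - (inter + eps) / (union + eps)

def focal_loss(p: torch.Tensor, y: torch.Tensor, gamma: float = 2.0,
               eps: float = 1e-6) -> torch.Tensor:
    # (1-p)^gamma down-weights easy positives, p^gamma easy negatives
    p = p.clamp(eps, 1 - eps)
    pos = -y * (1 - p) ** gamma * torch.log(p)
    neg = -(1 - y) * p ** gamma * torch.log(1 - p)
    return (pos + neg).mean()

def hybrid_loss(p: torch.Tensor, y: torch.Tensor,
                w1: float = 1.0, w2: float = 1.0, w3: float = 1.0) -> torch.Tensor:
    # w1..w3 defaults are placeholders; the paper weights DiceLoss most
    return w1 * focal_loss(p, y) + w2 * dice_loss(p, y) + w3 * iou_loss(p, y)

p = torch.tensor([0.999, 0.001])  # near-perfect prediction
y = torch.tensor([1.0, 0.0])      # ground truth
loss = hybrid_loss(p, y)          # close to zero
```

A near-perfect prediction drives all three terms toward zero, while a confidently wrong one is penalized by all of them at once.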

IV. EXPERIMENTAL RESULTS AND ANALYSIS
In this section, we use several evaluation metrics to assess the performance of our TAGU-Net model and the effectiveness of the experimental results. The model is trained and tested on the dataset introduced in the second section. We first compare the TAGU-Net model with commonly used SOTA segmentation models and analyze the performance of each model alongside ours. We then present a group of ablation experiments to analyze the contribution of each module designed for TAGU-Net, confirming the superiority of the proposed methods in actual performance and their effectiveness in AM segmentation.

A. IMPLEMENTATION DETAILS
The proposed TAGU-Net model was implemented with Python 3.8.13 and PyTorch 1.12.1, and all experiments were conducted on an NVIDIA RTX 2080 Ti GPU. To maximize the advantage of the proposed method and ensure a fair comparison, all experiments use the same settings and training strategies. The training configuration and optimizer are as follows: the Adam optimizer is used with an initial learning rate of 0.0001, β1 = 0.9, β2 = 0.999, and weight_decay = 1e−5; the learning rate is adjusted by cosine annealing with T_max = 50. Model parameters are updated in batches of 4, and the maximum number of training epochs is set to 200. All image pixel values are normalized from [0, 255] to [0, 1], and the image size is uniformly resized to 512×512. As a training strategy to prevent over-fitting, we use K-fold cross-validation with K set to 5. Detailed hyperparameters are listed in Table 1.
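The optimizer and scheduler settings above can be sketched as follows. The single convolution layer is a stand-in for the network, not TAGU-Net itself.

```python
import torch
import torch.nn as nn

# Stand-in model (placeholder, not TAGU-Net) with the paper's stated
# optimizer and cosine-annealing schedule.
model = nn.Conv2d(1, 1, kernel_size=3, padding=1)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4,
                             betas=(0.9, 0.999), weight_decay=1e-5)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=50)

# One illustrative step: batch size 4, 512x512 inputs scaled to [0, 1].
x = torch.rand(4, 1, 512, 512)
loss = model(x).mean()
loss.backward()
optimizer.step()
scheduler.step()  # learning rate decays along the cosine curve
```

After each epoch-level `scheduler.step()`, the learning rate follows a half-cosine from 1e-4 toward its minimum over T_max = 50 steps.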

B. EVALUATION METRICS
To evaluate the performance of the proposed model, we adopt the following metrics commonly used for segmentation tasks. The Dice score (Dice) and the Jaccard index (Jac) are the two most essential segmentation indicators, defined as

Dice = 2TP / (2TP + FP + FN),
Jac = TP / (TP + FP + FN).

The four essential items in the formulas are true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN). The Hausdorff distance (HD) is a measure of the similarity between two sets of points: it defines the distance between two groups of points and is also commonly used as a segmentation metric. HD is sensitive to the segmentation boundary and is mainly used to measure the accuracy of boundary segmentation. In the experiments, we use the 95% HD, the 95th percentile of HD, which is more stable against small outliers than HD itself.
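The two overlap metrics can be computed directly from confusion-matrix counts, as in this short sketch (the helper names are illustrative):

```python
def dice_score(tp: int, fp: int, fn: int) -> float:
    # Dice = 2TP / (2TP + FP + FN)
    return 2 * tp / (2 * tp + fp + fn)

def jaccard_index(tp: int, fp: int, fn: int) -> float:
    # Jac = TP / (TP + FP + FN)
    return tp / (tp + fp + fn)

# The two are monotonically related: Jac = Dice / (2 - Dice).
d = dice_score(8, 1, 1)     # 16/18
j = jaccard_index(8, 1, 1)  # 8/10
```

Note that TN does not appear in either formula; it enters only through specificity and accuracy, which is why Dice and Jac are preferred when the background dominates the image.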

C. PRIVATE DATASET EXPERIMENTAL RESULTS
Through private dataset experiments, we compared the performance of the proposed model with some SOTA models, and the results are shown in Table 2. It can be seen from the experimental results that the proposed TAGU-Net can obtain the highest Dice and Jac, which indicates that the TAGU-Net has higher performance than these SOTA models, and the prediction mask generated by TAGU-Net is highly consistent with the ground truth mask.
FCN fuses feature maps with different sampling coefficients through a skip structure and full convolution, and restores the resolution by unpooling and transposed convolution, reaching the pixel-level segmentation SOTA of its time; U-Net achieves better performance with a symmetric encoder-decoder structure and skip connections between encoding and decoding features; U-Net++ redesigns the skip connections of U-Net so that the decoder can aggregate features of different scales, achieving the effect of dense connection; U-Net 3+ proposes full-scale skip connections, which combine low-level details from feature maps of different scales with high-level semantics to maximize the use of full-scale feature maps and improve segmentation accuracy; AttU-Net introduces the attention mechanism into U-Net and designs an attention gate in the skip connection: its soft-attention method gradually strengthens the weight of the local ROI, inhibits activation in unrelated regions, and reduces the redundant part of the skip connection. This method is similar to, but different from, the attention gate proposed in this paper, which aims to obtain the concatenated feature map of the skip-connection and decoder features through the attention mechanism; the comparative attention gate experiments are given in Table 4. ChannelUNet uses spatial channel-wise convolution, performing convolution along the channel direction of the feature map to extract the mapping relationship of spatial information between pixels, which is conducive to learning the pixel-wise mapping relationships in feature maps; R2U-Net applies recurrent and residual networks to U-Net, designing a recurrent residual layer to better extract features; SegNet is an FCN-based segmentation network with an encoder-decoder structure that proposes a Pooling Indices method in the pooling operation to preserve the source locations of pooled points; U2Net proposes a two-level nested U-shaped structure that replaces the plain convolution blocks of U-Net with RSU (ReSidual U-blocks), which mix feature maps of different scales and different receptive fields through this U-shaped structure and can capture more global information; TransUNet is an early attempt at combining U-Net with the Transformer: it uses the Transformer encoder to enhance the feature representation, while the rest still follows the U-Net architecture; Swin-Unet is a pure Transformer-based U-shaped architecture, in which contextual features extracted by the Swin Transformer are upsampled by a decoder with a patch expanding layer, and the spatial resolution of the feature maps is restored through skip connections and multi-scale feature fusion with the encoder for segmentation prediction; DeepLabv3+ uses dilated convolution to enlarge the receptive field and obtains multi-scale object information through spatial pyramid pooling; furthermore, it uses a fully-connected conditional random field to improve the model's ability to capture structural information and achieve fine segmentation.
To make a reliable comparison, we compared the results of these studies with our work. The proposed TAGU-Net reaches the highest level in the important metrics, Dice and Jac, surpassing the other SOTA models. Fig.8 shows the performance of each model on the Dice score and the Jac index. For AM segmentation, the Dice of TAGU-Net is 97.67% and the Jac is 96.35%. Apart from TAGU-Net, the best performer is U2Net, with a Dice of 95.56% and a Jac of 92.03%. In comparison, TAGU-Net is 2.11% higher in Dice and 3.36% higher in Jac in absolute terms, and 2.21% and 4.69% higher in relative terms. On the 95HD metric, DeepLabv3+ achieves 0.456 while TAGU-Net achieves 0.550, lagging behind DeepLabv3+ by a narrow margin but still performing far better than the other models. In addition to the Dice and Jaccard metrics, Table 2 compares the sensitivity, specificity, accuracy, and precision of our model with those of state-of-the-art models. Sensitivity refers to the ability of the method to detect tumor pixels in MRI, while specificity reports the ability to identify MRI pixels without tumors. The proposed TAGU-Net model achieves a sensitivity of 97.76% for atypical meningiomas, indicating its ability to accurately detect tumor-associated pixels in MRI. Similarly, it achieves a specificity of 99.96%, demonstrating a strong ability to distinguish tumor and non-tumor pixels. Finally, accuracy describes how well the model classifies each pixel (tumor/non-tumor). Compared with state-of-the-art models, the proposed TAGU-Net exhibits the highest pixel-wise recognition ability, achieving the highest values in various metrics, and is generally superior to the other models in AM segmentation.
At the same time, we use the above models to generate predicted mask images and visually compare them with the ground truth. As shown in Fig.9, the first column on the left is the input MRI image, followed by the ground truth mask and the masks generated by each model from left to right; the penultimate column is the mask produced by the proposed model. It is evident from Fig.9 that the mask generated by TAGU-Net is the closest to the ground truth, whereas the other models produce masks with various defects reflecting their respective characteristics.
In addition, we calculated the Dice and Jac distributions of the outputs of each model and display them as boxplots. Fig.10 and Fig.11 show the Dice and Jac boxplots of the proposed method and the other SOTA models, respectively. As seen in both figures, the boxplot of TAGU-Net lies at the far right. Excluding outliers, the median, maximum, and minimum values of both Dice and Jac for the proposed method are higher than those of the other methods, indicating that the performance of the proposed TAGU-Net is considerably higher than that of the other models.

D. ABLATION EXPERIMENT RESULTS
To evaluate the effectiveness of our proposed method, we carried out a series of ablation experiments, mainly examining the impact of the FormerEncoder, the Attention Gate, and deep supervision on TAGU-Net. For simplicity, we report only Dice and Jac; the experimental results are shown in Table 3.
From the ablation results, the FormerEncoder has the most significant impact, improving performance by 1.89% in Dice and 3.41% in Jac, because it provides the model with a global modeling capability and its inputs carry information from different scales; when fused, this multi-scale information gives the model richer semantic information. The role of the Attention Gate cannot be overlooked either. As Table 3 shows, the results obtained without the Attention Gate are generally lower, which confirms that the Attention Gate adaptively learns the skip-connection and up-sampling feature information in the decoder branch, effectively enhancing important information while suppressing irrelevant information. The impact of deep supervision is less clear-cut in Table 3: in most cases it benefits the model's performance, but occasionally it has no effect. Based on the experimental results and the mechanism of deep supervision, we speculate that in most cases deep supervision adds loss signals during training that prevent the vanishing gradient, allowing the model to be better optimized; in a small number of cases, when the model has already converged or its fitting capacity is limited, deep supervision brings no further gain. The experiments also show that the best results are achieved by using the FormerEncoder, Attention Gate, and deep supervision simultaneously, improving performance by 3.07% in Dice and 5.12% in Jac over the backbone. Overall, each module is indispensable for achieving the best performance.
In our experiments, we also compare loss functions to verify the effectiveness of the proposed hybrid loss. As shown in Table 5, compared with WCELoss, our hybrid loss improves Dice by 1.61% and Jac by 2.91%, achieving the best results and verifying that it facilitates model optimization. In Table 4, we compare the attention gate proposed by Oktay et al. with the attention gate of this paper; the results show that the proposed attention gate performs better. Fig.12 compares the heatmaps generated by Grad-CAM [61] for the different methods. The region of interest of the proposed method clearly tends to coincide with the ground truth, and it is worth emphasizing that the heatmap generated by TAGU-Net pays more attention to the meningioma boundary region, enabling more accurate segmentation.

E. BraTs 2018
The proposed architecture was compared with other state-of-the-art semantic segmentation models on the BraTs 2018 dataset, as shown in Table 6. The metrics of the TAGU-Net model, namely Dice and Jac, demonstrate that our proposed model surpasses all other state-of-the-art models in LGG segmentation, with a Dice score of 97.62% and a Jaccard index of 95.35%. Consistent with the experimental results presented in Table 2, TAGU-Net performs equally well on the public benchmark dataset BraTs 2018, generally outperforming the other models in low-grade glioma segmentation. The proposed framework is therefore capable of accurately distinguishing tumor tissue from other brain tissues (normal and pathological) while precisely following tumor boundaries.

V. CONCLUSION
In the task of atypical meningioma segmentation, the shape and size of the tumor are irregular and its boundary is indistinct, especially in MRI images with considerable noise; accurately segmenting atypical meningioma is therefore significant, arduous, and challenging. In this study, we used GAAHE to improve the quality of the MRI images and carried out experimental verification under the proposed TAGU-Net framework. TAGU-Net is a hybrid architecture of convolution and Transformer: it combines a ConvEncoder and a FormerEncoder in the encoder branch and introduces the Attention Gate in the decoder branch. The ConvEncoder and FormerEncoder extract different information from feature maps at different resolutions and scales, effectively mitigating the drawback of the limited receptive field of convolution while aggregating information from the two encoders at various scales. At the same time, the FormerEncoder captures global features well through its unique properties, modeling the long-distance dependencies of the image while retaining fine details; moreover, it is flexible and replaceable according to different tasks and data characteristics. Furthermore, the Attention Gate adaptively learns the skip-connection and up-sampling information at the decoder stage, highlighting essential features and suppressing irrelevant features when fusing the two. In addition, through deep supervision we construct three auxiliary losses at different scales alongside one main loss to help the model learn and train better. With these modules, TAGU-Net can effectively extract features from MRI images, fuse features of different scales, and achieve accurate segmentation of atypical meningiomas. We conducted rigorous experiments on both a private atypical meningioma dataset and the publicly available BraTs 2018 benchmark dataset.
Our proposed methodology achieves state-of-the-art atypical meningioma segmentation. Compared with other models, it exhibits superior segmentation results, with higher accuracy and precision: Dice and Jaccard coefficients of 97.67% and 96.35%, respectively, on the private dataset, and 97.62% and 95.35%, respectively, on the BraTs 2018 dataset.