GAM-SpCaNet: Gradient awareness minimization-based spinal convolution attention network for brain tumor classification

Brain tumors are among the most common diseases of the central nervous system and carry high morbidity and mortality. Because brain tumors span a wide range of types and pathological subtypes, and the same type can be divided into different subgrades, their imaging manifestations are complex, making clinical diagnosis and treatment difficult. In this paper, we construct SpCaNet (Spinal Convolution Attention Network) to effectively exploit the pathological features of brain tumors; it consists of a Positional Attention (PA) convolution block, a relative self-attention transformer block, and an intermittent fully connected (IFC) layer. Our method is lighter and more efficient in recognizing brain tumors: compared with the SOTA model, the number of parameters is reduced by more than three times. In addition, we propose the gradient awareness minimization (GAM) algorithm to address the insufficient generalization ability of the traditional stochastic gradient descent (SGD) method and use it to train the SpCaNet model. Compared with SGD, GAM achieves better classification performance. According to the experimental results, our method achieves the highest accuracy of 99.28 % and performs well in classifying brain tumors.


Introduction
Brain tumors are abnormal tissues in the brain consisting of cancer cells that can proliferate continuously. They compress nerve tissue and cause great suffering to the patient, such as headaches, weakness, numbness, nausea, vomiting, or seizures. A survey shows that the average cost of treating a brain tumor patient in the United States is $1.9 million, which places a huge financial burden on families and society. The World Health Organization divides brain tumors into four grades; the higher the tumor grade, the worse the prognosis and the lower the survival rate (Addeh and Iri, 2021). Early diagnosis can therefore detect potential tumors and prevent them from developing further and deteriorating into cancer. Most brain tumors are diagnosed with non-invasive methods such as CT and MRI. However, the manual evaluation of brain tumor images is complicated, and current non-invasive diagnosis requires rich clinical experience, which easily leads to misdiagnosis (Afshar et al., 2018).
Deep learning techniques provide great advantages for medical image analysis and can better diagnose brain tumors. Zhou et al. (Zhou, 2018) trained recurrent neural networks on images of different types of brain tumors and achieved 92.13 % accuracy with DenseNet-LSTM. The key to their study is to use the entire sequence of 3D images directly as training samples to model the 2D slices, omitting the time-consuming process of labeling each frame of the sequence individually. Chang et al. (Chang, 2017) conducted a retrospective study of MR imaging data and molecular data of 259 patients with gliomas. They proposed a 2D/3D hybrid CNN to classify IDH1 mutation and 1p19q co-deletion. Unlike previous research, principal component analysis was employed to determine the most predictive imaging features of each molecular state. The prediction accuracy rates are 94 % and 92 %, respectively. Yang et al. (Yang et al., 2018) compared CNNs trained from scratch with pre-trained CNNs. The results show that a pre-trained CNN (GoogLeNet) can achieve 94.5 % accuracy. Jiang et al. (Linqi and Jingyang, 2022) proposed SE-ResNeXt to simplify the classification process of gliomas. Three optimization methods are used in their study. First, a multi-step learning strategy is used to adjust the learning rate dynamically. Second, a label smoothing strategy is adopted to soften the one-hot labels, reducing the network's reliance on the true label probability distribution and improving its prediction ability. Finally, a transfer learning method based on the CE-MRI dataset simplifies the transfer learning process. The accuracy and specificity reach 98.99 % and 98.33 % on the BraTS2019 dataset, respectively. Gull et al. (Gull and Khan, 2021) proposed a fully convolutional neural network (FCNN) and transfer learning techniques for brain tumor detection. The framework is divided into five stages: preprocessing, skull stripping, tumor segmentation, post-processing, and binary classification. In addition, a global thresholding technique is employed to eliminate enhanced small non-tumor regions, and a focal loss function is used to address class imbalance. The average classification accuracies are 96.49 %, 97.31 %, and 98.79 %, respectively. Rao et al. (Rao, 2022) proposed a kernel support vector machine (KSVM) and the social ski driver (SSD) algorithm for tumor classification. They used NMF-based preprocessing for image smoothing and quality enhancement and divided the image into non-overlapping regions with a binomial thresholding segmentation method. During preprocessing for classification, they combined GLCM with SGLDM for feature extraction, and the best subset of features is selected by a meta-heuristic HHO algorithm.
However, few studies have addressed the following three issues. First, the position information of the feature map is lost during feature extraction, which results in insufficient feature extraction. As the number of convolutional layers increases, the receptive field of the feature map mapped to the original image becomes larger and the perception of position information becomes poorer, so a certain amount of position information is lost and underutilized. Amirul et al. (Islam and Bruce, 2001) investigated how position information is exposed to neural network learning. Their experiments show that position information is implicit in the extracted feature maps and can be exploited to a large extent. However, recent research rarely takes position information into account in the diagnosis of brain tumors.
Second, sharpness-based learning methods suffer from sensitivity to rescaling of model parameters, which weakens the correlation between sharpness and the generalization gap. Recently, many scholars have studied the generalization of deep neural networks to address the shortcomings of pure optimization. They have attempted to elucidate the relationship between the geometry of the loss surface and generalization performance, where minimizing the sharpness of the loss surface and the derived generalization bound has proven effective (Sun et al., 2021; Chaudhari, 2019; M. H, 2016; Hochreiter and Schmidhuber, 1997). However, even sharpness-based learning methods, including SAM (Foret et al., 2020), and some sharpness measures are sensitive to rescaling of model parameters. Dinh et al. (Dinh et al., 2017) pointed out that parameter rescaling that leaves the loss function unchanged still changes the sharpness value, so this property may weaken the correlation between sharpness and generalization error. To compensate for the scale-dependent sharpness problem, scholars have conducted many studies recently (Tsuzuku et al., 2020; Yi et al., 2019; Liang et al., 2019; Karakida and Amari, 1906). However, these previous works are limited to proposing generalization measures that do not suffer from scale-dependent problems.
Third, redundant parameters and overfitting problems are caused by the fully connected (FC) layer. Because of its fully connected structure, the FC layer generally contains the most parameters. Traditionally, as the last layer of the model, the FC layer acts as a classifier. However, the size of the FC weight matrix rises dramatically as the network scales up, which easily causes overfitting. Although current research (Byerly and Dear, 2021; Kowsari et al., 2018; Bengio et al., Aug 2013; Ciregan et al., 2012) has focused on this problem, the issues of efficiency and poor recognition accuracy have not been well addressed.
To solve the above problems, we propose a new computer-aided diagnosis method for diagnosing suspected brain tumors. The main contributions are as follows:
1) Reinforced attention (RA) is proposed to preserve long-range spatial dependencies and precise position information to enhance the attention on objects.
2) To solve the problem of scale dependence and improve the generalization ability, we design a GAM optimization algorithm.
3) To prevent the loss of essential features and solve the problems of parameter redundancy and overfitting, we propose an IFC layer.
The rest of the paper is organized as follows: Section 2 describes the dataset and preprocessing, Section 3 describes the methodology, Section 4 presents the experimental results and discussion, and Section 5 concludes and outlines future work.

Source of dataset
To make the experimental process easier to implement and the experimental results more comparable, we used the BraTS2019 dataset, which has a total of 3040 images containing MRI of brain tumor patients and MRI of benign cases. Fig. 1 shows sample images from the dataset; the first row shows malignant cases and the second row benign cases. We randomly select 80 % of the images of each class as the training set and use the remaining 20 % as the test set. We keep the same training/test division ratio when performing 5-fold cross-validation.

Data augmentation
Because the images in this study come from different sources and vary widely in size, we performed several data augmentation operations on the BraTS2019 dataset, including resizing, random rotation, random cropping, and random horizontal flipping. The specific operations are shown in Fig. 2 with two example images. First, considering the different sizes of the sample images, all images are resized to 230 × 230, and then we randomly rotate them by 15 degrees. Finally, we randomly crop the images to 224 × 224 and apply random horizontal flips. Data augmentation effectively improves the robustness of the network model (Perez, 2017).
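A minimal torchvision sketch of this pipeline is given below; only the image sizes and the 15-degree rotation come from the text, while the transform order within the library, the interpretation of the rotation as a ±15-degree range, and the flip probability are assumptions.

```python
import torchvision.transforms as T

# Assumed augmentation pipeline matching the operations described above.
train_transform = T.Compose([
    T.Resize((230, 230)),           # unify image sizes
    T.RandomRotation(degrees=15),   # random rotation within +/- 15 degrees (assumed range)
    T.RandomCrop((224, 224)),       # random crop to the network input size
    T.RandomHorizontalFlip(p=0.5),  # random horizontal flip (probability assumed)
    T.ToTensor(),
])
```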

SpCaNet
Although transformers have larger capacity, they may generalize worse than convolutional neural networks (CNNs) (Wu, 2021). We devise a tandem stacking approach to integrate the inductive biases of convolution into the transformer by (a) imposing local perceptual fields on the attention layer and (b) adding attention and feed-forward neural network layers with implicit or explicit convolution operations. Both depth-wise convolution and self-attention compute a weighted sum of values over a predefined receptive field. Convolution relies on a fixed kernel to collect information from the local receptive field, as shown in Eq. (1), where $x_i, z_i \in \mathbb{R}^D$ are the input and output at position $i$ respectively, $w$ is the weight matrix, and $f_i$ is the local neighborhood of position $i$. Depth-wise convolution is translation invariant: the convolution weight $w_{i-j}$ depends only on the relative offset between $i$ and $j$ rather than on the specific values of $i$ and $j$. This translation invariance improves the generalization capability.
In contrast, the receptive field of self-attention is not a local neighborhood; its weights are computed from pairwise similarity and then normalized by the softmax function, as shown in Eq. (2), where $G$ denotes the global spatial domain, $A_{ij}$ is the attention weight, and $x_i, x_j$ are two patches of an image. The input of self-attention is adaptively weighted, which makes it easier for self-attention to capture relationships between different elements. Besides, self-attention provides a global receptive field, which can obtain more contextual information than the local receptive field of a CNN. As shown in Eq. (3) and Eq. (4), to merge the adaptive weighting of self-attention with convolution, a global static convolution kernel is added to the adaptive attention matrix around the softmax normalization, which combines the global receptive field with translation invariance.
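Since the typeset equations did not survive extraction, the following LaTeX block gives a plausible reconstruction of the forms behind Eqs. (1)–(4), following the standard depth-wise convolution, softmax self-attention, and pre-/post-softmax relative attention formulations; the exact notation and which of the last two lines corresponds to Eq. (3) versus Eq. (4) are assumptions.

```latex
% Plausible reconstruction of Eqs. (1)-(4) from the surrounding definitions.
\begin{align*}
z_i &= \sum_{j \in f_i} w_{i-j} \odot x_j
      && \text{depth-wise convolution (Eq. 1)} \\
z_i &= \sum_{j \in G} \underbrace{\frac{\exp(x_i^{\top} x_j)}{\sum_{k \in G}\exp(x_i^{\top} x_k)}}_{A_{ij}}\, x_j
      && \text{self-attention (Eq. 2)} \\
z_i &= \sum_{j \in G} \frac{\exp(x_i^{\top} x_j + w_{i-j})}{\sum_{k \in G}\exp(x_i^{\top} x_k + w_{i-k})}\, x_j
      && \text{pre-softmax relative attention} \\
z_i &= \sum_{j \in G} \left(\frac{\exp(x_i^{\top} x_j)}{\sum_{k \in G}\exp(x_i^{\top} x_k)} + w_{i-j}\right) x_j
      && \text{post-softmax relative attention}
\end{align*}
```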
The global context has quadratic complexity with respect to the spatial size (Guo et al., 2022). Therefore, if the relative attention of Eq. (3) or Eq. (4) is applied directly to the original input, the computation slows down sharply because of the large number of pixels. We therefore adopt down-sampling to reduce the spatial size until the feature map reaches a manageable level and then apply global relative attention. Fig. 3 shows the overall structure of SpCaNet. The stem convolution consists of two 3 × 3 convolutions designed to reduce dimensionality and make global attention feasible as the overall size increases. Compared with models using a local attention mechanism, SpCaNet always uses full attention to guarantee the model's capacity.
The relative transformer, shown in Fig. 3(e), accounts for most of the computation and parameters. For all general convolution and PA convolution blocks, the kernel size is set to 3. For all transformer blocks, the attention head size is set to 32. The expansion ratio of the inverted bottleneck is 4. SpCaNet stacks convolutional and attention layers vertically. In the last layer, we adopt IFC to reduce computation and integrate features through progressive input and feature splicing. SpCaNet brings more global information to brain tumor images, is more sensitive to the lesion area, and has the advantage of low computational overhead.

Reinforced attention
Positional information is crucial for generating spatially selective attention maps. To address the underutilization of positional information, several approaches have been proposed. SENet (Hu et al., 2017) simply squeezes each 2-dimensional feature map and then effectively builds interdependencies between channels. CBAM (Woo et al., 2018) introduces spatial information through large-kernel convolution. GENet (Hu et al., 2017), GALA (Linsley et al., 2019), AA (Bello et al., 2019), and TA (Misra et al., 2021) extend this idea by designing spatial attention and attention blocks.
However, SENet (Hu et al., 2017) only considers inter-channel information. CBAM and later methods mainly use convolution to capture attention information, which is insufficient for modeling long-term dependencies. To address these problems, non-local/self-attention networks such as GCNet (Cao et al., 2019), SCNet (Liu et al., 2020), CCNet (Huang et al., 2019), and NLNet (Wang et al., 2018) focus on spatial and channel attention and capture different spatial information through the non-local mechanism. However, these methods are computationally expensive. Unlike the non-local/self-attention approach, we propose a novel RA method to effectively capture positional information and inter-channel relationships and enhance the feature representation. Fig. 4 illustrates the detailed steps of RA. Since global pooling compresses the global spatial information into channel descriptors, it makes positional information difficult to preserve. To obtain attention over the image width and height and encode precise positional information, the RA mechanism first splits the input feature map along the two spatial directions, width and height, and then performs global average pooling to obtain feature maps in the width and height directions, respectively.
As shown in Eq. (5) and Eq. (6), $y^h_c(h)$ and $y^w_c(w)$ are the outputs of the $c$-th channel at height $h$ and width $w$, obtained by encoding each channel along the horizontal and vertical coordinates with pooling kernels of size $H \times 1$ and $1 \times W$.
Then, the pooled feature maps of the global receptive field along the width and height are concatenated and fed into a shared convolution module with a 1 × 1 kernel that reduces the dimension to $C/r$. The batch-normalized feature map is then fed into the ReLU activation function to obtain a feature map of shape $1 \times (W + H) \times C/r$, as shown in Eq. (7), where $T_1$ is the 1 × 1 convolution and $y^h, y^w$ are the feature maps in the horizontal and vertical directions.
Then the feature map $a \in \mathbb{R}^{C/r \times (H+W)}$ is convolved with 1 × 1 kernels according to the original height and width to obtain feature maps with the same number of channels as the original input, and the attention weights $k_h$ and $k_w$ in the height and width directions are obtained after the sigmoid activation function, as shown in Eq. (8) and Eq. (9), where $a^h$ and $a^w$ are two independent tensors obtained by splitting $a$ along the spatial dimension, $\sigma$ is the sigmoid function, and $T_h$ and $T_w$ are two 1 × 1 convolutions that transform the feature maps $a^h$ and $a^w$ back to the original number of channels.
After the above calculation, the attention weights of the input feature map in the height direction and in the width direction are obtained. Finally, the original feature map is multiplied by these weights to obtain the final feature map carrying attention weights in both the width and height directions, as shown in Eq. (10).
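A PyTorch sketch of this RA computation (directional average pooling, shared 1 × 1 convolution with reduction ratio $r$, batch normalization and ReLU, split, per-direction 1 × 1 convolutions, sigmoid, and re-weighting) is given below; the class and layer names, the default reduction ratio, and the minimum hidden width are assumptions.

```python
import torch
import torch.nn as nn

class ReinforcedAttention(nn.Module):
    """Sketch of the RA block: directional pooling, shared transform, re-weighting."""
    def __init__(self, channels, reduction=32):
        super().__init__()
        mid = max(8, channels // reduction)
        self.pool_h = nn.AdaptiveAvgPool2d((None, 1))   # pool along width  -> (B, C, H, 1)
        self.pool_w = nn.AdaptiveAvgPool2d((1, None))   # pool along height -> (B, C, 1, W)
        self.shared = nn.Sequential(                    # shared 1x1 conv T1, BN, ReLU
            nn.Conv2d(channels, mid, kernel_size=1),
            nn.BatchNorm2d(mid),
            nn.ReLU(inplace=True),
        )
        self.conv_h = nn.Conv2d(mid, channels, kernel_size=1)  # T_h
        self.conv_w = nn.Conv2d(mid, channels, kernel_size=1)  # T_w

    def forward(self, x):
        b, c, h, w = x.shape
        y_h = self.pool_h(x)                            # (B, C, H, 1), Eq. (5)
        y_w = self.pool_w(x).permute(0, 1, 3, 2)        # (B, C, W, 1), Eq. (6)
        a = self.shared(torch.cat([y_h, y_w], dim=2))   # (B, C/r, H+W, 1), Eq. (7)
        a_h, a_w = torch.split(a, [h, w], dim=2)
        k_h = torch.sigmoid(self.conv_h(a_h))                      # Eq. (8)
        k_w = torch.sigmoid(self.conv_w(a_w.permute(0, 1, 3, 2)))  # Eq. (9)
        return x * k_h * k_w                            # Eq. (10): re-weight the input
```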

PA convolution block
To solve the mismatch problem when combining convolution and transformer, we propose the PA convolution block. Its overall architecture is shown in Fig. 5; it uses depth-wise convolution with inverted residuals. The expansion-compression scheme is the same as that of the transformer's feed-forward network module.
The PA convolution block first performs a 1 × 1 convolution for dimensionality up-scaling and then performs depth-wise separable convolution. RA is added in the short-connection branch on the right. First, the feature map passed from the Swish activation function is subjected to one-dimensional average pooling along the x and y directions to obtain two directional feature maps with a global receptive field. The resulting feature maps are concatenated, fed into a shared 1 × 1 convolution, batch normalized, and finally passed to the sigmoid activation function. Then, channel-wise multiplication is performed between these attention maps and the Swish-activated feature map (Ramachandran and Le, 2017). Finally, a 1 × 1 convolution reduces the dimension of the feature map. After batch normalization and drop-connect operations, element-wise addition is performed between the short connection on the left and the backbone of the PA convolution block to obtain the output.
The depth-wise separable convolution used above is a technique for reducing parameters, corresponding to the depth-wise convolution and point-wise convolution in Fig. 5; its structure is shown in Fig. 6. Depth-wise separable convolution splits an ordinary 3 × 3 convolution into two convolutions. The first applies a 3 × 3 kernel to each input channel separately, with one kernel convolving one channel; this operation is called depth-wise convolution. The second applies a 1 × 1 kernel across all channels to generate a new feature map by weighting and combining the previous feature maps along the depth direction; this is called point-wise convolution. Depth-wise separable convolution performs the same transformation as an ordinary 3 × 3 convolution, with the advantage of fewer parameters. However, its running efficiency still needs to be improved. Therefore, we propose the Fused Inverted Residual, shown in Fig. 7. We fuse the first 3 × 3 convolution and the second 1 × 1 convolution of the upper inverted residual block into one 3 × 3 convolution to obtain the lower side of the Fused Inverted Residual, which alleviates the slowdown caused by depth-wise convolution and accelerates the PA convolution block.
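A minimal PyTorch sketch contrasting a depth-wise separable convolution with a fused variant of the kind described above is shown below; the channel counts, the absence of normalization/activation layers, and the exact way the two stages are merged are simplifications, not the paper's implementation.

```python
import torch.nn as nn

def depthwise_separable(in_ch, out_ch):
    # 3x3 depth-wise convolution (one kernel per channel, groups=in_ch)
    # followed by a 1x1 point-wise convolution that mixes channels.
    return nn.Sequential(
        nn.Conv2d(in_ch, in_ch, kernel_size=3, padding=1, groups=in_ch),
        nn.Conv2d(in_ch, out_ch, kernel_size=1),
    )

def fused_inverted_residual(in_ch, expanded_ch, out_ch):
    # The expansion and depth-wise stages are fused into a single dense 3x3
    # convolution, trading some parameters for faster execution (sketch of the
    # fused inverted residual idea).
    return nn.Sequential(
        nn.Conv2d(in_ch, expanded_ch, kernel_size=3, padding=1),
        nn.Conv2d(expanded_ch, out_ch, kernel_size=1),
    )
```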
To introduce weight sparsity, reduce overfitting, and improve performance, we adopt the DropConnect operation in the PA convolution block instead of Dropout. As shown in Fig. 8, instead of randomly zeroing the outputs of hidden-layer nodes, DropConnect zeroes each input weight connected to a node with probability $1 - p$. In Fig. 8, $v$ is the input layer and $r$ is the output layer, both $n \times 1$ column vectors; $W$ is the weight matrix; $a(x)$ is an activation function satisfying $a(0) = 0$; $m$ is a column vector of 0s and 1s, and the product of $m$ and $a(Wv)$ is element-wise. The right side is similar, where $M$ is the binary matrix used to encode the connection information.
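A small sketch of the two forward passes described above follows (Dropout masks activations, DropConnect masks individual weights); the keep probability and the use of a plain linear layer are illustrative only.

```python
import torch

def dropout_forward(W, v, activation, p=0.5):
    # Dropout: r = m * a(W v), zeroing each output activation with probability 1 - p.
    m = (torch.rand(W.shape[0]) < p).float()
    return m * activation(W @ v)

def dropconnect_forward(W, v, activation, p=0.5):
    # DropConnect: r = a((M * W) v), zeroing each individual weight with
    # probability 1 - p before the activation is applied.
    M = (torch.rand_like(W) < p).float()
    return activation((M * W) @ v)
```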

Relative self-attention transformer block
The transformer is a self-attention mechanism for learning the relationships between sequence elements. The relative self-attention transformer block, built on attention heads, is proposed to effectively utilize the relative positions and distances between sequence elements; it forms the output of the self-attention sublayer. The self-attention sublayer adopts $h$ attention heads, whose outputs are concatenated and passed through a parametric linear transformation to obtain the output of the sublayer. In Eq. (11), each attention head operates on the $n$ elements of the input patch sequence $x = (x_1, \ldots, x_n)$ with $x_i \in \mathbb{R}^{d_x}$ and computes a new sequence. To propagate the relative information of the input patches to the sublayer outputs, we modify Eq. (11) into Eq. (12), where $\alpha_{ij}$ is the weight coefficient calculated by Eq. (13) using the softmax function. In Eq. (14), the relative information between the input patches $x_i$ and $x_j$ is represented by the vectors $a^V_{ij}, a^K_{ij} \in \mathbb{R}^{d_a}$; $e_{ij}$ is computed by a compatibility function of the two input elements, where $W^Q$, $W^K$, $W^V$ are parameter matrices used for each layer and attention head. In the relative self-attention transformer, the relative position between patches replaces the absolute position. Fig. 9 shows example patches with relative positions and distances between elements. We learn a representation for relative positions within a distance of $k$.
When relative positions are considered, the representations of different position pairs differ, which makes it impossible to compute all $e_{ij}$ for all positions with a single matrix multiplication. Therefore, we decompose Eq. (14) into two terms, as in Eq. (15). The first term is identical to Eq. (14) and can be computed as above. The second term involves the relative-position representations and can be computed with $n$ parallel matrix multiplications using tensor reshaping; each matrix multiplication computes the contributions to $e_{ij}$ for all heads and batches corresponding to a particular sequence position. By introducing relative position information into the computation, the relative self-attention transformer block breaks the permutation-invariance of self-attention and improves the relationship construction between patches.
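Because the typeset equations are missing, the block below gives a plausible reconstruction of Eqs. (11)–(15) following the standard relative-position self-attention formulation (Shaw et al.-style); the scaling factor and the exact symbol choices are assumptions, and the original notation may differ.

```latex
% Plausible reconstruction of Eqs. (11)-(15) for one attention head.
\begin{align*}
z_i &= \sum_{j=1}^{n} \alpha_{ij}\,(x_j W^V)
      && \text{(11) absolute form} \\
z_i &= \sum_{j=1}^{n} \alpha_{ij}\,\bigl(x_j W^V + a^V_{ij}\bigr)
      && \text{(12) with relative values} \\
\alpha_{ij} &= \frac{\exp e_{ij}}{\sum_{k=1}^{n} \exp e_{ik}}
      && \text{(13) softmax weights} \\
e_{ij} &= \frac{(x_i W^Q)\bigl(x_j W^K + a^K_{ij}\bigr)^{\top}}{\sqrt{d_z}}
      && \text{(14) compatibility} \\
e_{ij} &= \frac{(x_i W^Q)(x_j W^K)^{\top} + (x_i W^Q)\bigl(a^K_{ij}\bigr)^{\top}}{\sqrt{d_z}}
      && \text{(15) decomposed form}
\end{align*}
```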

Intermittent fully connected layer
The size of the FC layer's hidden layer is critical. A large hidden layer with more parameters usually improves prediction accuracy but dramatically increases the number of weights, while a small hidden layer does not propagate all input features well, resulting in suboptimal results. To balance both shortcomings and address the parameter redundancy and overfitting caused by a fully connected layer, we propose the IFC layer.
The architecture of IFC is shown in Fig. 10; it consists of an input layer, a middle layer, and an output layer. First, the feature map obtained from the transformer block is split into segments 1–k, k–2k, and 2k–M, where k is a hyperparameter to be set and M is the size of the entire input. We adopt a step-by-step, repeated input mode in which the input layer receives the split segments one after another. The middle layer of IFC consists of several hidden layers, each composed of multiple neurons, and the output layer contains as many neurons as there are classes. The number of middle neurons is usually kept small to reduce the number of multiplications; since a small middle layer may underfit, each layer repeatedly receives the output of the previous layer so that certain features of the middle layer are preserved. The progressive and repeated input allows the network to achieve the desired results with fewer parameters, improving performance with faster responses.
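One way the progressive, repeated-input behavior of IFC could be realized is sketched below in PyTorch; the split points, hidden width, number of stages, and the concatenation of the previous hidden output with each new chunk are assumptions based on the description above, not the paper's exact design.

```python
import torch
import torch.nn as nn

class IntermittentFC(nn.Module):
    """Sketch of an IFC head: input features are fed in chunk by chunk, and each
    hidden layer also re-receives the previous hidden output, so a small hidden
    width can still cover all input features."""
    def __init__(self, in_features, k, hidden=64, num_classes=2):
        super().__init__()
        self.k = k
        sizes = [k, k, in_features - 2 * k]              # chunks 1-k, k-2k, 2k-M
        self.fc1 = nn.Linear(sizes[0], hidden)
        self.fc2 = nn.Linear(sizes[1] + hidden, hidden)  # new chunk + repeated hidden input
        self.fc3 = nn.Linear(sizes[2] + hidden, hidden)
        self.out = nn.Linear(hidden, num_classes)
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):                                # x: (B, in_features)
        c1, c2, c3 = torch.split(x, [self.k, self.k, x.shape[1] - 2 * self.k], dim=1)
        h = self.act(self.fc1(c1))
        h = self.act(self.fc2(torch.cat([c2, h], dim=1)))
        h = self.act(self.fc3(torch.cat([c3, h], dim=1)))
        return self.out(h)
```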

Gradient awareness minimization
To solve the scale-dependent problem and improve the generalization performance, we introduce the concept of scale-invariant adaptive sharpness and propose a novel learning method named GAM.
In GAM, gradient-aware sharpness is adopted to minimize the corresponding generalization bound, which avoids the scale-dependent problem faced by SAM (Foret et al., 2020). Based on the scale-invariant adaptive sharpness, the generalization bound is given in Eq. (17), where $N_w^{-1}$ is the normalization operator on $\mathbb{R}^k$, $h$ is a strictly increasing function, $n = |S|$, and $\rho = \sqrt{k}\,\sigma\bigl(1 + \sqrt{\log n / k}\bigr)$. The right-hand side of Eq. (17) can be described by gradient-aware sharpness, as in Eq. (18). Since $h\bigl(\|w\|_2^2 / (n^2\rho^2)\bigr)$ is strictly increasing in $\|w\|_2^2$, it can be viewed as a standard $L_2$ regularization term (Foret et al., 2020). Therefore, the gradient awareness minimization problem can be defined as in Eq. (19), where $\tilde{\varepsilon} = N_w^{-1}\varepsilon$. The two-step process of GAM is then given by Eq. (20) and Eq. (21).

Table 2 shows the principle and procedure of the GAM algorithm. The input of the algorithm is the MRI images of the training set, and the output is the trained weights of the model. First, we set the hyperparameter $p$ to 2. For the input, we first define the loss function and then set the radius of the maximization region $\rho$, the weight decay coefficient $\lambda$, and the learning rate $\alpha$. GAM solves the minimax problem by iteratively applying a two-step procedure for $t = 0, 1, 2, \ldots$ according to Eq. (21); in particular, when $p = 2$, the formula for $\varepsilon$ can be obtained as illustrated in Table 2. Within a rigid region of fixed radius $\rho$, GAM estimates the point $w_t + \varepsilon_t$ at which the loss is approximately maximized by gradient ascent around $w_t$, and then performs gradient descent at $w_t$ using the gradient evaluated at the maximum point $w_t + \varepsilon_t$. In the training phase, the optimal $\varepsilon$ is computed iteratively for each batch, and the weights $w$ are updated according to $\varepsilon$ until the algorithm converges.

Experimental settings
Our experiments were performed on an NVIDIA Quadro RTX 8000 with CUDA 10.2. The GPU memory of the server is 48 GB, and the memory type is GDDR6. All experiments were based on Python 3.9 with the PyTorch (Paszke et al., 2019) and scikit-learn (Pedregosa et al., 2011) frameworks.

Performance measures
K-fold cross-validation is commonly employed to test the accuracy of a classification model. We divide the BraTS2019 dataset into five parts and, in each round of the experiment, use four parts as the training set and one as the test set, as shown in Fig. 11. The advantage of this procedure is that the 8:2 ratio is maintained, which better guarantees the size of the test set. Each trial yields an accuracy value, and the average of the five values is taken as the estimate of accuracy.
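A minimal scikit-learn sketch of this 5-fold protocol is given below; the use of StratifiedKFold to keep the class ratio, the random seed, and the stand-in data and classifier are assumptions made only to keep the snippet self-contained.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.linear_model import LogisticRegression  # stand-in for the actual network

# Toy stand-in data: in the real experiment X would be the BraTS2019 images
# (or their extracted features) and y the malignant/benign labels.
rng = np.random.default_rng(0)
X = rng.normal(size=(3040, 16))
y = rng.integers(0, 2, size=3040)

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
fold_accuracies = []
for train_idx, test_idx in skf.split(X, y):
    clf = LogisticRegression(max_iter=1000)                      # placeholder classifier
    clf.fit(X[train_idx], y[train_idx])                          # train on four folds (80 %)
    fold_accuracies.append(clf.score(X[test_idx], y[test_idx]))  # held-out fold (20 %)

print("estimated accuracy:", np.mean(fold_accuracies))
```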
Five evaluation metrics have been employed to evaluate our method: precision, recall, specificity, F1-score, and accuracy. They are defined in the standard way as

$\text{Precision} = \frac{TP}{TP + FP}$, $\text{Recall} = \frac{TP}{TP + FN}$, $\text{Specificity} = \frac{TN}{TN + FP}$, $\text{F1} = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$, $\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$,

where TP, TN, FP, and FN denote true positives, true negatives, false positives, and false negatives, respectively.

Settings of hyperparameters
The setting of hyperparameters such as the batch size and learning rate is particularly important and is generally determined by experience.
The batch size is the number of samples used in one training step. It affects the degree and speed of model optimization and directly affects GPU memory usage. Batch sizes that are powers of 2 (16, 32, 64, 128, ...) usually exploit the GPU better than multiples of 10 or 100. In our study, we first select a larger batch size that fills up the GPU and observe the loss convergence; if the loss does not converge or converges poorly, the batch size is reduced. We finally select a batch size of 64.
The initial learning rate plays a decisive role in the convergence of a deep network. If it is too low, the network loss declines very slowly; if it is too large, the parameter updates become very large, which leads to convergence to a local optimum or causes the loss to start increasing.
The learning rate is adjusted throughout training. In the beginning, the parameters are essentially random, so a relatively large learning rate is chosen so that the loss decreases quickly. After training for a period of time, the parameter updates should have a smaller range, so the learning rate is gradually decayed.
Among the many decay schemes, we adopt StepLR, a step-wise exponential decay method whose specific steps are shown in Table 3. The learning rate is initialized to $a_0$ and decayed at fixed epoch intervals.

Table 2. Schematic diagram of the GAM algorithm.

Algorithm: Gradient Awareness Minimization (p = 2)
Input: training dataset $S := \bigcup_{i=1}^{n}\{(x_i, y_i)\}$, loss function $l$, batch size $b$, radius of the maximization region $\rho$, weight decay coefficient $\lambda$, scheduled learning rate $\alpha$, initial weight $w_0$.
Output: a model with trained weight $w$.
Initialize weight $w := w_0$.
While the model has not converged, do:
    Sample a mini-batch $B$ of size $b$ from $S$;
    compute $\varepsilon_t$ and update $w_t$ by the two-step procedure of Eq. (21);
end while
return $w$.
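A minimal PyTorch training-step sketch of this two-step procedure (ascend to an approximate worst-case point $w_t + \varepsilon_t$ inside a weight-scaled region of radius $\rho$, then descend at $w_t$ using the gradient taken at the perturbed point) is given below; the element-wise $|w|$-based form of the normalization operator, the default $\rho$, and the function name are assumptions rather than the paper's exact implementation.

```python
import torch

def gam_step(model, loss_fn, inputs, targets, optimizer, rho=0.5):
    """One GAM-style iteration (p = 2): gradient ascent to w + eps, then descent at w."""
    # Step 1: gradient at the current weights w_t.
    optimizer.zero_grad()
    loss = loss_fn(model(inputs), targets)
    loss.backward()

    eps_list, norm_sq = [], 0.0
    with torch.no_grad():
        for p in model.parameters():
            if p.grad is None:
                eps_list.append(None)
                continue
            e = (p.abs() ** 2) * p.grad                  # assumed weight-aware scaling (N_w^-1)
            eps_list.append(e)
            norm_sq += float((p.abs() * p.grad).pow(2).sum())
        scale = rho / (norm_sq ** 0.5 + 1e-12)
        for p, e in zip(model.parameters(), eps_list):
            if e is not None:
                p.add_(e, alpha=scale)                   # ascend to w_t + eps_t

    # Step 2: gradient at w_t + eps_t, then descend at the restored w_t.
    optimizer.zero_grad()
    loss_fn(model(inputs), targets).backward()
    with torch.no_grad():
        for p, e in zip(model.parameters(), eps_list):
            if e is not None:
                p.sub_(e, alpha=scale)                   # restore w_t
    optimizer.step()                                     # descent with the perturbed-point gradient
    return loss.item()
```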

Exploration of the best combination by optimization algorithm
We first investigate how the stacking order of convolution and attention blocks influences performance. Convolutions perform down-sampling, and global relative attention is applied only once the feature maps are small enough to be processed. There are two ways to perform down-sampling: the first is to divide the image into patches, as in the ViT model (Dosovitskiy et al., 2020), and stack relative self-attention blocks; the second is a multi-stage design with progressive pooling.
Our network can be described in four stages. The first stage, called C, consists of classical convolutions and PA convolution blocks to achieve dimensionality reduction. The last three stages consist of convolutional blocks or transformer blocks, resulting in five combinations: CPTT, CTTT, CPPT, CPPP, and TTTT, where P represents the PA convolution block and T represents the transformer block. Table 4 shows the detailed metrics obtained for the different combinations, and the indicators are illustrated in Fig. 12. The accuracy, precision, recall, and F1-score of CPTT are 99.34 %, 99.9 %, 99.68 %, and 99.84 %, respectively. Compared with the other combinations CTTT, CPPT, CPPP, and TTTT, CPTT performs best on all indicators.

Exploration of the best numbers of blocks and channels by optimization analysis
In this section, we further analyze the impact of GAM on SpCaNet by permuting and combining different numbers of blocks and channels. As shown in Table 5, N represents the number of each module (classical convolution, PA convolution, and transformer), and C represents the number of feature-map channels passed into each module. 'Block vs channel' 1–5 represent different combinations of classical convolution, PA convolution, and transformer. For the number of channels from L1 to L4, we double it incrementally while ensuring that the stem L0 has a width smaller than or equal to that of L1. For simplicity, only the numbers of blocks in L2 and L3 are scaled when increasing the network depth.
In Fig. 13, the size of the dot markers represents the number of parameters; the accuracy increases only slightly from 'Block vs channel 1' to 'Block vs channel 5'. The accuracy of 'Block vs channel 1' is 99.28 %, which is similar to the performance of the other schemes while using only 18.2 M parameters. The parameters of the other combinations are 33.6 M, 56.1 M, 118 M, and 205 M, respectively, which are nearly twice those of 'Block vs channel 1' or more. Considering the balance between accuracy and the number of parameters, we use 'Block vs channel 1' in the following experiments.

Table 3
The pseudocode of the StepLR learning rate schedule.
The procedure of StepLR

Input:
the total number of epochs $N$, the initial learning rate $a_0$, the current epoch $n$.
Output: the current learning rate $a$.
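A minimal PyTorch sketch of this schedule using torch.optim.lr_scheduler.StepLR is shown below; the stand-in model, the initial learning rate, the step size, the decay factor, and the number of epochs are illustrative values, since only the idea of decaying the rate at fixed epoch intervals comes from the text.

```python
import torch

model = torch.nn.Linear(10, 2)                                    # stand-in model
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)           # a_0 = 0.1 (illustrative)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=30, gamma=0.1)

for epoch in range(100):                                          # N = 100 (illustrative)
    # ... one training epoch over the mini-batches ...
    optimizer.step()                                              # placeholder for the epoch's updates
    scheduler.step()                                              # decay the learning rate every 30 epochs
```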
To demonstrate the performance of the proposed RA and IFC, we performed a series of ablation experiments; the corresponding results are listed in Table 6. These experiments show that, at comparable computational cost, positional information embedding is more conducive to classifying brain tumor images, and RA achieves the greatest accuracy improvement of 2.0 %. RA inherits the advantages of squeeze-and-excitation attention from channel attention methods, modeling relationships between channels while capturing long-distance dependencies with precise positional information. The brain tumor classification experiments demonstrate the effectiveness of RA. Comparing "+RA + IFC" with the other settings shows that the proposed IFC further improves performance, while its parameter count is reduced from 25.97 M (the FC layer in the baseline) to 18.21 M. The experiments show that IFC achieves higher classification accuracy on brain tumor images while reducing the computational cost.
The results of each method are shown in Table 7, where FLOPs denotes the number of floating-point operations and Params denotes the size of the model parameters. Observing the accuracy and other indicators of each model, SpCaNet achieves the best performance. The design of the PA convolution block and IFC makes SpCaNet more lightweight: compared with the FLOPs of the other models, which range from 11443.2 M to 59669.8 M, the FLOPs of SpCaNet-1 and SpCaNet-2 are 3336.8 M and 6846.4 M, respectively, significantly reducing floating-point operations.
As shown in Fig. 14, we use a bullet chart to summarize the accuracy and parameters of each model, in which the thin black line represents the accuracy and the thick gray line represents the number of parameters. For example, the accuracy of SpCaNet-1 is 99.18 % with 18.2 M parameters. Compared with BotNet (53.4 M), the parameter count of SpCaNet-1 is reduced by more than three times; compared with the largest model in the comparison, it is reduced by about eighteen times. From the comparison of the models, it can be inferred that SpCaNet achieves the best performance in feature extraction.

Evaluation of GAM-SpCaNet
To verify the effectiveness of the proposed GAM, we compare it with the SGD optimizer. The accuracies and the differences between the accuracies on the test and training sets are shown in Table 8. For example, DeiT_B has an accuracy of 95.23 % at 50 epochs under SGD, and the difference between the training and test sets is 2.6 ± 0.47 %. In the comparison with SGD, GAM shows a clear performance benefit: in general, the accuracy of each model, such as DeiT_B, ViT-B/16, ViT-L/16, and BotNet, is improved by 1 % at 200 epochs. Furthermore, with increasing training epochs, GAM continues to improve accuracy without overfitting.
In contrast, the standard training method without GAM often overfits when trained for many epochs. In Fig. 15, the curve shows the difference between the test and training sets, and the bar graph represents the accuracy. GAM dramatically reduces the accuracy difference between the training and test sets and reduces the risk of overfitting to a certain extent, especially for DeiT_B and ViT_L/16. GAM achieves excellent results through its scale-invariant, gradient-aware sharpness, which improves the training path in the weight space by adjusting the maximization region relative to the weight scale. The experimental results confirm that GAM helps to improve model performance and generalization ability.
In the evaluation of GAM-SpCaNet, 5-fold cross-validation is performed; the results in Table 9 report the model's specificity, precision, recall, and F1-score. The performance measures for the malignant class are 100 %, 100 %, 99.66 %, and 99.83 %, respectively. According to Table 10, the overall average accuracy of the system is 99.28 ± 0.34 %. Fig. 16 shows the Grad-CAM visualization of GAM-SpCaNet, for which the parameters of the last transformer layer of SpCaNet are combined with Grad-CAM; GAM-SpCaNet accurately focuses on the lesion area. The confusion matrix and ROC curve are shown in Fig. 17, from which it can be seen that our model performs excellently.

Comparison with the state-of-art methods
To verify the effectiveness of GAM-SpCaNet, we conducted a comparative experiment with four SOTA methods, including SE-ResNeXt-MLT (Linqi and Jingyang, 2022) and FCNN-CRFs (Gull and Khan, 2021). The experimental results show that our method has excellent overall discriminative ability and performs well on all indicators, which indicates that it has a stronger diagnostic ability for patients with suspected brain tumors. The difference between GAM and SGD is 0.446, which is near the medium level; this significant difference indicates that GAM provides a certain improvement over SGD.

Conclusion
Accurate brain tumor detection remains challenging because of the irregular shape and variable appearance of brain tumors. Existing work has limitations in identifying the substructures of tumor regions and in classifying malignant and benign images. GAM and SpCaNet are proposed to diagnose brain tumors. Experimental results confirm that our method is superior to eight SOTA CNN and transformer models and exceeds four SOTA brain-tumor diagnosis models.
Our GAM-SpCaNet achieves the best performance because (i) SpCaNet leverages the power of both the convolutional neural network (CNN) and the transformer: based on the translation equivariance of the PA convolution block and the global receptive field of relative self-attention, SpCaNet can better extract pathological features; (ii) the proposed GAM optimizes the model training process, which significantly improves the generalization of the model; and (iii) the lightweight PA convolution block and IFC help reduce training time and improve training efficiency.
In future work, we will further optimize SpCaNet variants to reduce hardware resource requirements. Meanwhile, we will improve the interpretability of the model to better show the detailed process of brain tumor recognition.

Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.