SAMS-Net: Fusion of attention mechanism and multi-scale features network for tumor infiltrating lymphocytes segmentation

Abstract: Automatic segmentation of tumor-infiltrating lymphocytes (TILs) from pathological images is essential for the prognosis and treatment of cancer. Deep learning technology has achieved great success in segmentation tasks, but accurate segmentation of TILs remains challenging because of blurred cell edges and cell adhesion. To alleviate these problems, a squeeze-and-attention and multi-scale feature fusion network (SAMS-Net) based on an encoder-decoder structure is proposed for the segmentation of TILs. Specifically, SAMS-Net utilizes a squeeze-and-attention module with a residual structure to fuse local and global context features and boost the spatial relevance of TILs images. In addition, a multi-scale feature fusion module is designed to capture TILs with large size differences by combining context information. A residual structure module integrates feature maps from different resolutions to strengthen the spatial resolution and offset the loss of spatial details. SAMS-Net was evaluated on a public TILs dataset and achieved a Dice similarity coefficient (DSC) of 87.2% and an Intersection over Union (IoU) of 77.5%, improvements of 2.5% and 3.8% over UNet. These results demonstrate the great potential of SAMS-Net for TILs analysis, which can further provide important evidence for the prognosis and treatment of cancer.


Introduction
TILs are a type of immune cell that exists in tumor tissue and is of great significance for the diagnosis and prognosis of cancer [1]. As the gold standard for cancer diagnosis, pathological images contain a great deal of information [2]. TILs can be observed in pathological images, and their role as the main immune cells in the tumor microenvironment is particularly important [3,4]. Many studies have now shown that the number and spatial characteristics of TILs in pathological images can be used as predictors of breast cancer prognosis [5,6]. Examples of pathological images of TILs are shown in Figure 1. Pathological image analysis relies on professional doctors, which is time-consuming and laborious; meanwhile, the specificity of pathological images also affects the reliability of doctors' diagnoses [7]. Deep learning technology has attracted extensive attention in the medical field because of its autonomy and intelligence [8]. It has gradually been applied to many fields, such as medical image classification [9,10], detection [11,12] and segmentation [13,14]. Using deep learning methods to segment TILs in pathological images and to quantify their number and characteristics has become one of the hotspots of current research. However, due to the specificity of pathological images and cells, TILs segmentation faces three challenges: 1) cell adhesion and overlap: during sampling, many cells tend to cluster together because of cell movement; 2) the coexistence of multiple cell types: a pathological image contains many kinds of cells, making it difficult to segment one kind accurately; 3) the large difference between foreground and background: compared with the background area, cells occupy a small area and are not easy to capture during segmentation.
Considering the above challenges, we take advantage of deep learning technology to design a segmentation network, called SAMS-Net. The proposed network makes three contributions:
1) A squeeze-and-attention with residual structure module (SAR) fuses local and global context features, which makes up for the spatial information lost in the ordinary convolution process.
2) A multi-scale feature fusion module (MSFF) is integrated into the network to capture TILs of smaller size and to combine context features to enrich the decoding-stage features.
3) A convolution module with residual structure (RS) merges feature maps from different scales to strengthen the fusion of high-level and low-level semantic information.

TILs segmentation
Early cell segmentation methods, such as the threshold segmentation method [15] and the watershed algorithm [16], mostly use local features while ignoring global features, so their segmentation accuracy needs improvement. Cell segmentation algorithms based on deep learning have been proposed and widely used in medical image segmentation, such as fully convolutional networks (FCN) [17], UNet [18] and the DeepLab networks [19]. Experiments have shown that these networks achieve high performance compared with traditional segmentation algorithms.
Automated cell segmentation methods have been studied extensively in the literature [20][21][22][23][24]. The literature [20] introduced a combined loss function and adopted 4 × 4 max-pooling layers instead of the widely used 2 × 2 to reinforce the learning of cell boundary areas, thereby improving network performance. The study [21] applied a weakly supervised multi-task learning algorithm for cell segmentation and detection, which effectively solved the problem of difficult segmentation. In addition, Zhang et al. [22] put forward a dense dual-task network (DDTNet) that uses a pyramid network as the backbone; a boundary-sensing module and a feature fusion strategy are designed to realize automatic detection and segmentation of TILs at the same time. The results show that it is not only superior to other advanced methods on detection and segmentation indexes but can also automatically annotate unlabeled TILs. Study [23] found a new approach for the prognosis and treatment of hepatocellular carcinoma by utilizing Mask R-CNN to segment lymphocytes and extract spatial features of images. Based on the concept of the autoencoder, Budginaite et al. [24] devised a multiple-image input layer architecture for automatic segmentation of TILs, where convolutional texture blocks not only improve the performance of the model but also reduce its complexity. However, the cell segmentation methods above are single network models that do not consider the characteristics of pathological images and cells. Improving the network model by exploiting these characteristics can further increase the segmentation performance.

Attention mechanism
The attention mechanism is a method of measuring the importance of different features [25]. Originally used in machine translation, it has gradually been applied to semantic segmentation because of its ability to filter high-value features. Attention mechanisms can be divided into soft attention and hard attention. Since hard attention is difficult to train, soft attention modules are often used to extract key features [26].
Related research has shown that the spatial correlation between features can be captured by integrating a learning mechanism into the network. Study [27] presented the squeeze-and-excitation (SE) module, which introduces channel learning to emphasize useful features and suppress useless ones. The residual attention network [28] exploited stacked attention modules to generate attention-aware features, and residual learning coupled with the attention modules makes network expansion easier. Furthermore, Yin et al. [29] employed a selective attention regularization module based on a traditional classification network to improve the interpretability and reliability of the model. This type of attention module uses only channel attention to enhance the main features while ignoring spatial features, and is not suitable for segmentation tasks. With the success of the transformer architecture in many natural language processing tasks, Gao et al. [30] proposed UTNet, which integrates self-attention into the UNet framework to enhance network performance. In addition, the literature [31] argued that semantic segmentation comprises two aspects: pixel-wise prediction and pixel grouping. Thus, the squeeze-and-attention (SA) module was designed to generate an attention mask for pixel grouping and improve the segmentation effect.

Multi-scale module
Ordinary segmentation networks apply single convolution and pooling operations to extract features, which leads to under-segmentation due to a lack of relevant information between images. To address this problem, a number of studies have proposed multi-scale feature fusion methods that mine context information to improve network segmentation. The feature pyramid network [32] extracts semantic feature maps at different scales through a top-down architecture with lateral connections. The atrous spatial pyramid pooling (ASPP) module capitalizes on dilated convolutions with different expansion rates to obtain multi-scale context information. UNet++ [33] introduced nested and dense skip connections to aggregate semantic features from different scales. Moreover, UNet3+ [34] exploited full-scale skip connections to make full use of multi-scale features, combining low-level details and high-level semantics in full-scale feature maps to improve segmentation accuracy. In addition, atrous convolution and deformable convolution obtain multi-scale semantic information by changing the size and position of the convolution kernel.

Methodology

In this section, we elaborate on the proposed TILs segmentation network. First, the pathological images of TILs were labeled with the labelme software and then segmented by the SAMS-Net algorithm. The framework of SAMS-Net is shown in Figure 2. Specifically, the encoding structure of the model consists of an SA module and a residual structure; together this structure is named the SAR module, and the blocks are connected by down-sampling operations. SAR modules enhance the spatial features of pathological images while extracting their features. At the second and third layers, multi-scale feature fusion (MSFF) modules are added to fuse the low-level and high-level features. In the decoding stage, RS modules are designed based on the residual network to enhance the feature recovery capability of the model.

Residual structure
As network depth increases, the gradient vanishing problem follows. A common solution is to add residual learning. The residual learning structure was first proposed by He et al. [35]; it mainly uses skip connections to realize an identity mapping from upper-layer features to the lower-layer network:

y = F(x) + x

where x denotes the network input of the current layer and F(x) stands for the residual learning part. This paper applies the residual idea to design the residual block; because of the short connection, the convergence of the network is accelerated. The residual idea is used in both the encoding and decoding stages. In the encoding stage, the residual structure enhances feature extraction, while in the decoding stage it fuses features from different scales to enhance feature recovery. As shown in Figure 3, two 3 × 3 convolutions extract features in the decoding module, and a 1 × 1 convolution forms the residual connection, so that the network can integrate high-level and low-level features.
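As an illustrative sketch (not the authors' released code), the decoding residual block described above can be written in PyTorch as follows; the channel sizes and the use of batch normalization are assumptions on our part:

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Decoding-stage residual block: two 3x3 convolutions on the trunk
    plus a 1x1 convolution shortcut, summed as y = F(x) + W(x)."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.trunk = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 3, padding=1),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, 3, padding=1),
            nn.BatchNorm2d(out_ch),
        )
        # 1x1 convolution realizes the identity mapping when channels differ
        self.shortcut = nn.Conv2d(in_ch, out_ch, 1)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(self.trunk(x) + self.shortcut(x))
```

The 1 × 1 shortcut lets the block keep the short connection even when the input and output channel counts differ, which is the usual situation in a decoder.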


Squeeze-and-attention with residual structure module

The SA module and the residual structure are used to extract image features simultaneously. In the encoding module, two 3 × 3 convolutions run in parallel with the SA module and the residual structure. Each SA module includes two parts: compression and attention extraction. The compression part uses global average pooling to obtain feature vectors. The attention extraction part realizes multi-scale feature aggregation through two attention convolution channels and up-sampling operations, and generates a global soft attention mask at the same time. In addition, for an input image with feature maps x, a 1 × 1 convolution operation is used to match the output feature maps. Finally, the attention mask obtained from the SA module and the feature map generated by the trunk convolution are added to capture the key features; the role of the SA module is to enhance the attention on pixel grouping. The encoding module is shown in Figure 4, where the output feature map is obtained by adding three terms:

y = C(x) + U(A(P(x))) + R(x)

where x and y are the input and output feature maps, R(·) is the residual function realized by the 1 × 1 convolution, C(·) stands for the two 3 × 3 convolutions, A(·) denotes the attention convolutions, U(·) represents the up-sampling operation used to expand the output feature maps, and P(·) represents the average pooling layer that implements the compression operation of the SA module.
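A minimal PyTorch sketch of an SAR encoder block consistent with this description follows; the pooling factor, kernel sizes and activation choices are assumptions, not details taken from the paper:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SARBlock(nn.Module):
    """Squeeze-and-attention + residual (SAR) encoder block sketch:
    trunk convolutions C(x), an SA branch (average-pool squeeze, two
    attention convolutions, up-sample) and a 1x1 residual path R(x),
    summed as y = C(x) + U(A(P(x))) + R(x)."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.trunk = nn.Sequential(            # C(.): two 3x3 convolutions
            nn.Conv2d(in_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True),
        )
        self.attn = nn.Sequential(             # A(.): two attention convolutions
            nn.Conv2d(in_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, 3, padding=1),
        )
        self.res = nn.Conv2d(in_ch, out_ch, 1)  # R(.): 1x1 conv to match channels

    def forward(self, x):
        squeezed = F.avg_pool2d(x, 2)                 # P(.): compression
        mask = torch.sigmoid(self.attn(squeezed))     # soft attention mask
        mask = F.interpolate(mask, size=x.shape[2:])  # U(.): up-sample back
        return self.trunk(x) + mask + self.res(x)
```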

Multi-scale feature fusion module
The receptive field is the region of the input image that a convolutional neural network (CNN) can see. Its size increases as the number of network layers deepens [36]. Many studies show that features at different scales differ greatly: a small receptive field carries more detailed information, while a large receptive field carries stronger semantic information. The receptive field is calculated as

RF_l = RF_{l-1} + (k_l - 1) × s_1 × s_2 × ... × s_{l-1}

where l represents the current layer index, k_l stands for the convolution kernel size of that layer, and s_i denotes the stride of layer i. When l = 0, RF_0 is the receptive field of the input layer and RF_0 = 1. Using features of different scales in segmentation tasks yields richer semantic information, which helps improve the segmentation effect. The feature fusion method of early network models is the skip connection between corresponding layers, which only employs single-scale features rather than multi-scale features. Experimental verification showed that the receptive fields of the second and third layers of SAMS-Net are suitable for capturing TILs in pathological images. Therefore, this study uses the second and third layers of the encoding part as the multi-scale feature fusion layers. To effectively combine shallow detail information with deep semantic information, feature maps of different scales are connected to each layer of the decoding module through up-sampling or pooling operations. The specific implementation is shown in Figure 5.
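The recurrence above can be evaluated directly. The following small helper (our illustration, not from the paper) walks a list of (kernel size, stride) pairs and accumulates the receptive field:

```python
def receptive_field(layers):
    """Compute the receptive field of a stack of layers given as
    (kernel_size, stride) pairs, using
    RF_l = RF_{l-1} + (k_l - 1) * prod(s_1 .. s_{l-1})."""
    rf, jump = 1, 1  # RF_0 = 1; jump is the cumulative stride product
    for k, s in layers:
        rf += (k - 1) * jump
        jump *= s
    return rf

# Two 3x3 convs (stride 1) followed by a 2x2 max-pool (stride 2):
# RF grows 1 -> 3 -> 5 -> 6.
print(receptive_field([(3, 1), (3, 1), (2, 2)]))  # -> 6
```

Note how a 3 × 3 convolution placed after the stride-2 pooling would add (3 − 1) × 2 = 4 pixels instead of 2, which is why deeper layers see much larger regions.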
One decoding layer is taken as an example to represent the implementation process of the multi-scale feature fusion module. When the image passes through the encoding module, the features from the second and third encoding layers are fused with the features of the current layer through maximum pooling operations of different sizes, together with the features from the decoding part after up-sampling, to obtain rich joint context information.
Assume that x_e^l and x_d^l are the input feature maps of the encoding part and the output feature maps of the decoding part, respectively, where l indicates the current network layer, and H(·) represents the nonlinear transformation of layer l, which can be realized by a series of operations such as ReLU, batch normalization and pooling. The MSFF module can be written as

y_d^l = H([x_e^2, x_e^3, x_d^l, x_e^l])

where [·] is the concatenation operation, x_e^2 and x_e^3 stand for the feature maps of the second and third layers in the encoding stage, x_d^l is the feature map of the current layer in the decoding stage, and x_e^l is the feature map of the current layer in the encoding stage.
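The fusion step can be sketched as follows. This is our own reading of the formula: encoder features from layers 2 and 3 are resized to the current spatial resolution (max pooling when shrinking, interpolation when growing) and concatenated with the current decoder and encoder maps; the subsequent transformation H(·) (convolution, batch norm, ReLU) is omitted for brevity:

```python
import torch
import torch.nn.functional as F

def msff(e2, e3, d_cur, e_cur):
    """Concatenate [x_e^2, x_e^3, x_d^l, x_e^l] at the spatial size of
    the current encoder map, per the MSFF formula (H(.) omitted)."""
    size = e_cur.shape[2:]
    # shrink with max pooling, grow with interpolation
    e2 = F.adaptive_max_pool2d(e2, size) if e2.shape[2] > size[0] \
        else F.interpolate(e2, size=size)
    e3 = F.adaptive_max_pool2d(e3, size) if e3.shape[2] > size[0] \
        else F.interpolate(e3, size=size)
    d_cur = F.interpolate(d_cur, size=size)  # decoder map is up-sampled
    return torch.cat([e2, e3, d_cur, e_cur], dim=1)
```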

Experimental data
The experiment uses the HER2-positive breast cancer tumor-infiltrating lymphocyte dataset from the literature [37], which was annotated by a professional pathologist; the image size is 100 × 100 pixels. A dataset this small carries a risk of overfitting, so data augmentation methods such as cropping, mirror transformation and flipping are used to prevent it. The dataset was divided into training, validation and test sets at a ratio of 8:1:1. This research uses ten-fold cross-validation to evaluate the generalization performance of the model.
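A minimal sketch of such a joint image/mask augmentation is shown below; the crop size, flip probabilities and use of raw tensors are assumptions, since the paper does not specify them:

```python
import random
import torch

def augment(image, mask, crop=96):
    """Apply the same random crop, horizontal mirror and vertical flip
    to an image tensor (C, H, W) and its mask (1, H, W)."""
    _, h, w = image.shape
    top = random.randint(0, h - crop)
    left = random.randint(0, w - crop)
    image = image[:, top:top + crop, left:left + crop]
    mask = mask[:, top:top + crop, left:left + crop]
    if random.random() < 0.5:   # mirror transformation (horizontal)
        image, mask = image.flip(-1), mask.flip(-1)
    if random.random() < 0.5:   # vertical flip
        image, mask = image.flip(-2), mask.flip(-2)
    return image, mask
```

Applying identical geometric transforms to the image and its mask is essential for segmentation; augmenting only the image would corrupt the labels.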

Implementation
The SAMS-Net algorithm is implemented with the PyTorch 1.8.1 deep learning framework and trained on a platform with an Intel(R) Core(TM) i5-1135G7 CPU and an NVIDIA Tesla V100 32 GB GPU. The initial learning rate is set to 0.0025. Adaptive moment estimation (Adam) is used as the optimizer, Dice loss is employed as the loss function, and an L2 regularization operation is used to prevent overfitting.
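The training objective can be sketched as below. The soft Dice loss follows the standard definition; the `weight_decay` value for the L2 penalty is an assumption (the paper does not report it), and `model` stands for any `nn.Module`:

```python
import torch

def dice_loss(pred, target, eps=1e-6):
    """Soft Dice loss: 1 - 2|P∩G| / (|P| + |G|).
    pred holds probabilities in [0, 1], target holds {0, 1} labels."""
    inter = (pred * target).sum()
    return 1 - (2 * inter + eps) / (pred.sum() + target.sum() + eps)

# Adam's weight_decay applies the L2 regularization mentioned above;
# lr matches the paper's 0.0025, weight_decay=1e-4 is our assumption:
# optimizer = torch.optim.Adam(model.parameters(), lr=0.0025,
#                              weight_decay=1e-4)
```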

Evaluation index
To verify the effectiveness of the proposed algorithm, we use IoU, DSC, positive predictive value (PPV), F1 score, pixel accuracy (PA), recall and Hausdorff distance (Hd) to evaluate its performance. IoU measures the overlap between the prediction and the ground truth, and DSC measures their similarity; the closer the value is to 1, the better the segmentation. Conversely, the Hausdorff distance is a distance defined between any two sets in a metric space; the closer the value is to 0, the better the segmentation. The calculation formulas are:

IoU = |P ∩ G| / |P ∪ G|  (6)
DSC = 2|P ∩ G| / (|P| + |G|)  (7)
PPV = TP / (TP + FP)  (8)
Recall = TP / (TP + FN)  (9)
F1 = 2 × PPV × Recall / (PPV + Recall)  (10)
PA = (TP + TN) / (TP + TN + FP + FN)  (11)
Hd = max{h(P, G), h(G, P)}  (12)

In Eqs (6), (7) and (12), P represents the TILs area predicted in the segmentation result and G represents the TILs area in the ground truth image. In Eqs (8)-(11), TP, FP, TN and FN denote the numbers of true positives, false positives, true negatives and false negatives, respectively.
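For reference, the two region-overlap metrics in Eqs (6) and (7) reduce to a few NumPy operations on binary masks (our illustration):

```python
import numpy as np

def iou(pred, gt):
    """Eq (6): IoU = |P ∩ G| / |P ∪ G| for binary masks."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    return (pred & gt).sum() / (pred | gt).sum()

def dsc(pred, gt):
    """Eq (7): DSC = 2|P ∩ G| / (|P| + |G|) for binary masks."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    return 2 * (pred & gt).sum() / (pred.sum() + gt.sum())
```

For example, a prediction covering two pixels, one of which matches a one-pixel ground truth, gives IoU = 1/2 and DSC = 2/3, illustrating that DSC rewards overlap more generously than IoU.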

Results and discussion
In order to use multi-scale features more effectively, the fusion strategy between different layers of the algorithm was studied experimentally. The results show that integrating multi-scale features from different layers improves TILs segmentation accuracy to a certain extent. However, the second and third layers of SAMS-Net retain the semantic information of TILs to the maximum extent, improve the overall segmentation effect and perform best on the TILs segmentation task. The experimental results are shown in Table 1, whose rows correspond to fusing features from the first, second, third and fourth layers of the encoding part, respectively. It can be seen from the table that jointly using the second- and third-layer feature vectors gives the best effect for the SAMS-Net algorithm. Here ↑ means that a larger value is better, ↓ means that a smaller value is better, and the best results are highlighted in bold.

Table 1. Comparison results of fusion between different layers.
In order to verify the effectiveness of the proposed algorithm, SAMS-Net is compared with other classical segmentation algorithms (such as the FCN, DeepLabV3+ and UNet networks) on the same experimental platform. The experimental results are shown in Table 2. It can be seen that SAMS-Net performs best on the TILs segmentation task, and its IoU, DSC and other indicators are optimal among the eight segmentation algorithms.
Where ↑ means that the larger the value, the better the effect, ↓ means that the smaller the value, the better the effect.The best results are highlighted in bold.
Compared with UNet, IoU increases by 3.8% and DSC by 2.5%; compared with FCN, DeepLabV3+, SegNet, R2UNet and UNet++, IoU increases by 3.0%, 7.4%, 4.3%, 3.1% and 1.9%, respectively, and DSC improves by 2.1%, 4.9%, 2.8%, 2.1% and 1.4%, respectively, which proves the effectiveness of SAMS-Net for segmentation. The analysis shows that the FCN and SegNet networks suffer from long training times due to their large numbers of parameters, and their failure to consider global information easily loses image details, so the segmentation is not fine enough. To reduce the number of model parameters, the ENet algorithm performs down-sampling early, which causes a serious loss of spatial information and poor segmentation ability. The DeepLabV3+ algorithm adds a variety of modules to reduce model parameters and enhance feature extraction, which leads to redundant feature information and prevents the network from learning key information, lowering the segmentation effect. Although the UNet, UNet++ and R2UNet networks consider the relationship between pixels, they fail to fully relate the context information to obtain richer features and thus lose part of the edge information, resulting in slightly lower segmentation ability.
Thanks to the residual attention module and the multi-scale feature fusion module designed in SAMS-Net, the network not only pays attention to the key information in the image but also considers the context, so it achieves better segmentation results. To analyze the segmentation effect further, this study conducts a visual comparison of SAMS-Net and the other algorithms; the results are shown in Figure 6.
According to the segmentation results, the SegNet, UNet and UNet++ algorithms mistakenly classify normal cells as TILs. FCN and DeepLabV3+ show edge adhesion between cells during segmentation, and ENet produces unclear edges and burrs. Compared with the other segmentation networks, SAMS-Net improves the overall segmentation effect and effectively avoids under-segmentation and over-segmentation. However, although SAMS-Net improves the segmentation of TILs, some regions still show unclear edges and segmentation errors, which may be caused by the small dataset and the imbalance between foreground and background pixels. Adding more training samples to enhance the feature learning ability of the network could further improve the segmentation effect.

Ablation experiment
To measure the generalization performance of the algorithm and explore the influence of different modules, the improved modules were separated and ablation experiments were used to validate the contribution of each module to SAMS-Net. The verification results are shown in Table 3. Compared with the basic network, each module of SAMS-Net contributes to the segmentation task of this paper, and the combination of multiple modules achieves the best effect. Here ↑ means that a larger value is better, ↓ means that a smaller value is better, and the best results are highlighted in bold.
In order to verify the effectiveness of the data augmentation operation and the L2 regularization [41] method, the baseline algorithm is compared with the algorithm after adding them; the comparison results are shown in Figure 7, where Base is the algorithm without data augmentation or L2 regularization, Aug stands for the data augmentation operation, and L2 stands for the L2 regularization method. Compared with the Base network, the IoU of the algorithm increases by 4.4% and the DSC by 3.0% after adding data augmentation and L2 regularization. The results show that these two operations help improve the segmentation effect.

Conclusions
Related research shows that TILs can predict cancer chemotherapy response and survival outcomes [42] and can provide a basis for the precise treatment of cancer. This paper proposes a segmentation network based on the squeeze-and-attention mechanism and multi-scale feature fusion to segment TILs in breast cancer pathological images. SAMS-Net has three modules: the SAR module, the MSFF module and the RS module. Different from the traditional attention mechanism, the SAR module effectively takes the interdependence between spatial channels into consideration, which enhances dense prediction at the pixel level. The MSFF module effectively combines low-level and high-level semantic features in feature maps of different scales while enhancing context features. The RS module enhances gradient back-propagation to speed up training.
Lacking the spatial information of the image and overlooking the pixel imbalance of the segmentation target are common problems of traditional segmentation networks, which make them unsuitable for cell segmentation. Building on the traditional network, this paper takes into account the segmentation effect of different receptive fields on cell areas and proposes an MSFF module combining multiple receptive fields to address the difficulty of capturing small cell regions during segmentation. SAMS-Net uses the attention mechanism combined with the residual structure to extract richer semantic information. Extensive experiments have shown that SAMS-Net segments better than state-of-the-art methods and can further provide important evidence for the prognosis and treatment of cancer. In addition, this study can also be applied to the diagnosis of various diseases imaged by optical coherence tomography, such as age-related macular degeneration and Stargardt's disease [43][44][45]. However, the use of multiple modules to improve the segmentation effect increases the number of parameters and computations of the model. In the future, the network model needs to be further improved to reduce them.

Figure 2 .
Figure 2. SAMS-Net overall framework diagram. The left side is the encoding structure, with maximum pooling between blocks; the right side is the decoding structure, with up-sampling and 1 × 1 convolution between blocks. The encoding and decoding structures are connected by multi-scale feature fusion modules.


Figure 7 .
Figure 7. Comparison of test results after adding the data augmentation and L2 regularization operations.

Table 2 .
Model performance comparison results.

Table 3 .
Performance comparison results of each module.