Self-attention CNN for retinal layer segmentation in OCT

The structure of the retinal layers provides valuable diagnostic information for many ophthalmic diseases. Optical coherence tomography (OCT) obtains cross-sectional images of the retina that reveal the retinal layers. U-Net-based approaches are prominent among retinal layer segmentation methods; they capture local characteristics well but are weak at modeling the long-range dependencies that carry contextual information. Furthermore, the morphology of diseased retinal layers is more complex, which makes retinal layer segmentation more challenging. We propose a U-shaped network that combines an encoder-decoder architecture with self-attention mechanisms. To suit the characteristics of retinal OCT cross-sectional images, a self-attention module operating in the vertical direction is added to the bottom of the U-shaped network, and an attention mechanism is added to the skip connections and up-sampling to enhance essential features. In this method, the Transformer's self-attention mechanism provides a global receptive field and thus the context information that convolutions lack, while the convolutional neural network efficiently extracts local features and compensates for the local details that the Transformer ignores. Experimental results show that our method segments the retinal layers more accurately than other methods, with average Dice scores of 0.871 and 0.820 on two public retinal OCT image datasets. By incorporating the Transformer's self-attention mechanism into a U-shaped network, the proposed method improves layer segmentation of retinal OCT images, which is helpful for ophthalmic disease diagnosis.


Introduction
The structure of the retinal layers provides crucial diagnostic information for ophthalmologists. In most typical retinal diseases, such as age-related macular degeneration (AMD) [1], the central macula, which is originally smooth and slightly concave, is elevated, and the normal retinal layer structure is disrupted. In addition, retinal diseases such as detachment of the retinal pigment epithelium, retinal pigmentary changes, and diabetic macular edema can cause varying degrees of deformation in the retinal layers [2]. Ophthalmologists diagnose the condition of patients by assessing the deformations in the retinal layers.
Optical Coherence Tomography (OCT) is a non-invasive imaging modality. It uses low-coherence light to analyze the internal structure of biological tissues, acquiring high-resolution cross-sectional images of tissues with sufficient depth of penetration [3][4][5]. OCT is now widely used to observe the layered structure of the retina and the pathological fluids within it. By observing the biological properties of the retinal layers, such as layer thickness, analyzing the morphology of each layer, and comparing this layer information with normal layers, it is possible to diagnose retinal diseases such as diabetic macular edema and AMD. Because manual annotation of retinal layer boundaries relies on the subjective judgment of annotators and is time-consuming and labor-intensive, automatic segmentation of OCT retinal images is gaining attention from researchers.
In recent years, the role of deep learning in image segmentation has become increasingly important, leading to the emergence of many retinal layer segmentation methods based on convolutional neural networks (CNNs). CNNs have dominated various biomedical image segmentation tasks, with U-Net being one of the most used models [6]. U-Net features an encoder-decoder structure and skip connections, enabling it to preserve details. Consequently, many retinal layer segmentation models are built upon the foundation of U-Net. However, due to the inherent limitations of convolutional operations, these models struggle to capture long-range information effectively in tasks involving distant relationships.
Due to the global receptive field of the self-attention mechanism within the Transformer, many studies have begun to integrate Transformers with CNNs to analyze medical images. For instance, Chen et al. proposed TransUnet, a model based on U-Net, where the encoder part employs a Transformer mechanism. It takes feature maps from the CNN output as input, passes them through the Transformer, and then feeds the encoded tokenized image blocks to the decoder for up-sampling, merging them with the high-resolution feature maps from the CNN. This process allows the model to acquire advanced learning information from the CNN and global contextual information from the Transformer [7]. Cao et al. introduced Swin-Unet, a pure Transformer U-shaped structure resembling U-Net, which uses SwinTransformer blocks for global and local feature learning with identical skip connections between the encoder and decoder [8]. Gao et al. proposed UTNet, featuring a novel self-attention decoder in the model. It extracts locally enhanced features using convolutional layers and captures long-range information through a self-attention mechanism, achieving accurate segmentation while reducing computational complexity [9].
Various segmentation techniques for OCT retinal layers continue to emerge with the development of medical image processing and analysis methods. Classical approaches use adaptive mathematical models to analyze the anatomical structure of the retina and leverage clinical prior knowledge to detect typical layered structures [10]. For instance, Monemian et al. used a model based on the Laplace distribution to calculate the probability of neighboring pixels becoming boundary pixels, completing the segmentation of retinal layers [11]. Sun et al. proposed a level-set method based on Bayes' theorem, combining anatomical prior information and adaptive details to generate boundary probability maps iteratively, enhancing the sub-pixel accuracy of the boundaries [12]. Chiu et al. introduced a fully automated layering method based on graph theory and dynamic programming, achieving precise segmentation of the boundaries of eight retinal layers [13]. However, due to the inherent limitations of each mathematical model, these methods exhibit shortcomings in terms of robustness and computational complexity.
Most recent methods are based on deep neural networks. For example, Roy et al. introduced a fully convolutional neural network named Relaynet, which employs an encoder-decoder structure and incorporates skip connections with unpooling. This model can segment seven retinal layers, fluid, and background [14]. Wang et al. extract boundaries and retinal layers simultaneously through two U-shaped networks and subsequently fuse the two results to enhance the correctness of layer segmentation [15]. He et al. proposed a retinal layering method based on a fully convolutional regression network. This method takes the original B-scan image and the normalized spatial position of each pixel as input, outputs the pixel segmentation result of the retina as well as a structured surface, and combines the two results to achieve layering [16]. Kumar et al. presented a multi-layer, multi-scale encoder-decoder architecture. By stacking two different encoder-decoder networks multiple times, they iteratively performed feature extraction and denoising, ultimately obtaining the layer segmentation results of the retina [17].
These newly developed methods have shown significant improvements in both robustness and accuracy. However, they ultimately extract information within a limited receptive field and still face challenges in long-range modeling. Therefore, some researchers have introduced various attention mechanisms into CNNs to expand the receptive field and enhance the network's ability to learn. For instance, Moradi et al. proposed a semantic segmentation model based on the Residual-Attention-UNET model, designed to segment ten layers of the retina. The model forms a U-shaped structure using residual blocks and incorporates an attention gate in the up-sampling and skip-connection processes. This enhances strong-feature correlation while suppressing weaker-feature correlations, leading to improved segmentation accuracy [18]. A-SCAN was used as training data for retinal layering, achieving the segmentation of nine retinal boundaries [20]. These methods combine attention mechanisms with CNNs, resulting in higher precision. In terms of global feature extraction, the self-attention mechanism of the Transformer holds an advantage due to its extensive receptive field. However, this large receptive field is not always necessary at every stage of feature extraction, especially considering the presence of noise in low-level feature maps, which hinders global feature extraction.
Therefore, we propose a retinal OCT image layer segmentation model based on self-attention mechanisms and CNNs. The model combines an encoder-decoder structure with self-attention. As the network's maximum receptive field occurs at the bottom of the encoder, a Transformer block is added to the bottom of the network, receiving deep feature maps from the encoder. Before the feature maps are passed to the Transformer, a transformation is applied so that the Transformer computes attention only in the vertical direction. This approach reduces the model's computational complexity and improves training speed. The main contributions of the proposed method are as follows:
1. A new retinal layer segmentation framework is proposed based on the encoder-decoder structure. In the shallow layers, a convolutional neural network is employed to extract fine, low-level features, while in the deep layers, a self-attention mechanism is used to capture global semantic information. Combining these two components enhances the model's performance, resulting in better segmentation results than existing methods.
2. A one-dimensional Transformer is added to the bottom of the encoder-decoder structure, calculating self-attention only in the vertical direction of the feature map. This reduces the computational complexity of the model while enhancing its performance, and it enables the model to generalize better and perform well even on small datasets.
3. An attention mechanism is introduced in the up-sampling and skip connection process, combining the up-sampled feature map with the one from the same encoder level after channel attention. Different weights are assigned to these features through linear and non-linear transformations and are then applied to the original input, amplifying or suppressing the importance of features. Adding this module can reduce the model's parameters and enhance its performance.

Methodology
The retinal layer segmentation involves assigning each pixel on a cross-sectional image of the retina to different classes using a network model, thereby accomplishing the segmentation.
Inspired by Attention-UNet [21] and TransUnet, we propose a new network for retinal OCT image layer segmentation. The framework consists of an encoder and a decoder, as shown in Fig. 1.

A one-dimensional Transformer block is added after the encoder's last convolution block to process its output feature map, which provides a larger receptive field and more spatial location information. In addition, channel and spatial attention mechanisms are added to the skip connection and up-sampling processes. The channel attention mechanism focuses on the correlation between channels and improves the representation of features on each channel, while the spatial attention mechanism allows the model to focus on critical local areas in the image, improving the spatial localization of features. Combining these two attention mechanisms enhances the model's ability to extract local features. The following sections detail the encoder improved by the one-dimensional Transformer, the decoder improved by the Attention Gate, and the combined loss function.

Encoder improved by one-dimension transformer
The left part of the framework is the encoding branch, which consists of several encoder blocks and a Transformer block.
To enable the network to learn more comprehensive features, each encoder block consists of two convolutional layers, each followed by batch normalization and a ReLU activation. The first convolutional layer is designed to capture relatively low-level features, while the second captures higher-level features. By continuously stacking layers, the network learns increasingly complex and abstract features. The convolutional layers use a 3 × 3 kernel, a stride of 1, and zero-padding on the feature map so that the output size matches the input size.
For the same receptive field, combining two 3 × 3 convolutional kernels, as opposed to using one larger kernel, not only introduces fewer parameters but also facilitates the extraction of richer and more complex features. This aids the learning process of the network.
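The parameter saving from stacking two 3 × 3 kernels can be checked with a short calculation. The sketch below is illustrative only: the channel count of 64 and the helper names `conv_weights` and `stacked_receptive_field` are assumptions, not values or functions from the paper, and bias terms are ignored.

```python
def conv_weights(kernel, in_ch, out_ch):
    """Number of weights in one k x k convolution (bias terms ignored)."""
    return kernel * kernel * in_ch * out_ch


def stacked_receptive_field(kernels):
    """Receptive field of stride-1 convolutions applied one after another."""
    field = 1
    for k in kernels:
        field += k - 1
    return field


channels = 64  # hypothetical channel count, chosen only for illustration

two_3x3 = 2 * conv_weights(3, channels, channels)
one_5x5 = conv_weights(5, channels, channels)

# Both designs see the same 5 x 5 neighbourhood of the input.
print(stacked_receptive_field([3, 3]), stacked_receptive_field([5]))
# The stacked design needs fewer weights and adds one extra non-linearity.
print(two_3x3, one_5x5)
```

With equal input and output channels, the two stacked layers use 18 weights per channel pair against 25 for the single 5 × 5 layer.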
The ReLU layer introduces non-linearity to enhance the model's expressive ability. The max-pooling layer between convolutional blocks alleviates the feature invariance of the convolutional layers and reduces the redundant information introduced by the convolutional layers.
The output of the last convolution block is serialized before being fed into the Transformer block. First, the feature map from the convolution blocks, a two-dimensional feature map of size $(B, C, \frac{H}{16}, \frac{W}{16})$, is reshaped to size $(B \times \frac{W}{16}, \frac{H}{16}, C)$ before being passed into the Transformer. Self-attention is then calculated only over the reshaped feature map, i.e., along the vertical direction of the convolution block output. The Transformer block consists of multiple Transformer layers, each of which consists of a multi-head self-attention (MSA) block and a multilayer perceptron (MLP) block, as shown in Fig. 2. The output of layer $i$ is represented as follows:

$$Z'_i = \mathrm{MSA}(\mathrm{LN}(Z_{i-1})) + Z_{i-1} \quad (1)$$

$$Z_i = \mathrm{MLP}(\mathrm{LN}(Z'_i)) + Z'_i \quad (2)$$

where $\mathrm{LN}(\cdot)$ represents layer normalization and $Z_i$ represents the encoded map. To facilitate subsequent convolution in the decoder, the result after the Transformer module is reshaped from $(B \times \frac{W}{16}, \frac{H}{16}, C)$ back to $(B, C, \frac{H}{16}, \frac{W}{16})$.
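The rearrangement that restricts self-attention to the vertical direction can be sketched with plain nested lists. This is a minimal illustration rather than the authors' implementation: the function names are hypothetical, the ordering of the merged batch-and-column axis is an assumption, and `attention_pairs` only counts query-key pairs to show why column-wise attention is cheaper than attention over all positions.

```python
def to_vertical_tokens(x):
    """Rearrange a nested list shaped [B][C][H][W] into [B*W][H][C].

    Each image column becomes its own sequence of H tokens with C features,
    so attention applied per sequence only mixes positions along the height.
    """
    B, C, H, W = len(x), len(x[0]), len(x[0][0]), len(x[0][0][0])
    return [[[x[b][c][h][w] for c in range(C)] for h in range(H)]
            for b in range(B) for w in range(W)]


def from_vertical_tokens(seq, B, W):
    """Inverse rearrangement: [B*W][H][C] back to [B][C][H][W]."""
    H, C = len(seq[0]), len(seq[0][0])
    return [[[[seq[b * W + w][h][c] for w in range(W)] for h in range(H)]
             for c in range(C)] for b in range(B)]


def attention_pairs(H, W):
    """Query-key pairs per image: attention over all positions vs. per column."""
    full = (H * W) ** 2
    vertical = W * H ** 2
    return full, vertical


# Toy feature map with B=1, C=2, H=3, W=2; each value encodes (c, h, w).
x = [[[[100 * c + 10 * h + w for w in range(2)] for h in range(3)]
      for c in range(2)]]
seq = to_vertical_tokens(x)
print(len(seq), len(seq[0]), len(seq[0][0]))   # sequences, tokens, features
print(from_vertical_tokens(seq, B=1, W=2) == x)
print(attention_pairs(14, 14))                 # e.g. a 224 x 224 input at 1/16 scale
```

For a feature map of height H and width W, attention over all positions compares (HW)² pairs, while the column-wise variant compares W·H² pairs, a factor of W fewer.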

Decoder improved by Attention Gate
The decoder branch of the network mainly includes an up-sampling block, an improved Attention Gate (AG) block, and a convolution block. We improved the AG block of [21], as shown in Fig. 3, by incorporating channel attention to assign weights to each feature channel. These weights amplify or suppress the importance of different features, reducing the parameter count and enhancing the model's performance. Additionally, this module adopts the Exponential Linear Unit (ELU) activation function to alleviate the gradient explosion problem and improve model accuracy.
In Fig. 3, y represents the result of up-sampling the previous layer's feature maps, and x represents the feature maps from the same encoder level, which are first fed to the channel attention to enhance features. Next, the two feature maps are each convolved by a 1 × 1 convolution to obtain outputs with the same size and number of channels, which are summed to highlight important features. The result is then processed by an ELU. A weight α is generated by a 1 × 1 convolution followed by a sigmoid function and is multiplied with the original input x to obtain the gated result x̂.
The final decoder block is a convolutional layer with a convolutional kernel size of 1 × 1 and a SoftMax layer for the output classification result.
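The gating steps described for Fig. 3 can be illustrated for a single pixel, where a 1 × 1 convolution reduces to a matrix-vector product over the channel vector. This is a rough sketch under stated assumptions: the weights `w_x`, `w_y`, and `psi` are hypothetical, the channel attention applied to x beforehand is omitted, and the function names are not from the paper.

```python
import math


def elu(v, alpha=1.0):
    """Exponential Linear Unit."""
    return v if v > 0 else alpha * (math.exp(v) - 1.0)


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def matvec(weight, vec):
    """A 1 x 1 convolution at one pixel: weight rows times the channel vector."""
    return [sum(w * v for w, v in zip(row, vec)) for row in weight]


def attention_gate_pixel(x, y, w_x, w_y, psi):
    """Gate one pixel of the skip feature x using the up-sampled feature y."""
    summed = [a + b for a, b in zip(matvec(w_x, x), matvec(w_y, y))]
    activated = [elu(s) for s in summed]
    alpha = sigmoid(sum(p * a for p, a in zip(psi, activated)))
    return alpha, [alpha * xi for xi in x]


# Hypothetical two-channel example.
x = [0.5, -1.0]                       # encoder feature at this pixel
y = [1.0, 2.0]                        # up-sampled decoder feature at this pixel
w_x = [[0.2, 0.1], [0.0, 0.3]]
w_y = [[0.1, 0.0], [0.4, 0.2]]
psi = [0.5, -0.5]

alpha, gated = attention_gate_pixel(x, y, w_x, w_y, psi)
print(alpha, gated)
```

Because α lies strictly between 0 and 1, the gate can only attenuate the skip features; it passes them almost unchanged where α is close to 1 and suppresses them where α is close to 0.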

Loss functions
In this paper, the network is trained using a combination of the multi-class cross-entropy loss and the Dice loss. The loss function is as follows:

$$L = \lambda_1 L_{CE} + \lambda_2 L_{Dice} \quad (3)$$

where $\lambda_1$ and $\lambda_2$ are the weights of the two loss functions, and the sum of $\lambda_1$ and $\lambda_2$ is 1.
The Dice score is used to assess the similarity between two samples and takes values in the range [0, 1]. It is expressed by the following formula:

$$Dice = \frac{2|X \cap Y|}{|X| + |Y|} \quad (4)$$

where $X$ represents the probability map of the ground truth labels, $Y$ represents the probability map obtained from the model's predictions, and $|X \cap Y|$ represents the overlapping regions between the two maps. It is calculated by element-wise multiplication and summation of the pixels in both maps. $|X|$ and $|Y|$ represent the sum of the pixels in each respective map.
The Dice loss function is expressed as:

$$L_{Dice} = 1 - Dice \quad (5)$$

The multi-class cross-entropy loss function measures the similarity between the actual and predicted probability maps. A smaller loss value indicates a smaller discrepancy and helps prevent gradient vanishing. The formula for this loss function is expressed as follows:

$$L_{CE} = -\sum_{i=1}^{M} p(x_i) \log q(x_i) \quad (6)$$

where $M$ represents the number of categories in the classification, $p(x_i)$ represents the true distribution of the sample for category $i$ (1 if the sample belongs to that category and 0 otherwise), and $q(x_i)$ represents the probability that the sample is predicted to be of category $i$.
For image segmentation, the Dice loss assesses the images globally. In contrast, the multi-class cross-entropy loss assesses the images pixel by pixel, and the two complement each other to some extent. To highlight the advantages of combining these two loss functions, in our experiments we also combined the Dice loss function with two other loss functions that are commonly used in medical image segmentation, MultiLabelSoftMarginLoss [14] and Focal Loss [22], and compared their results.
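The combined loss described in this section can be sketched in a few lines of plain Python. This is a minimal illustration under assumptions not stated in the paper: the weight `lam1` is applied to the cross-entropy term, the Dice loss is averaged over classes, the cross-entropy is averaged over pixels, small `eps` constants are added for numerical stability, and all function names are hypothetical.

```python
import math


def dice_score(truth, pred, eps=1e-7):
    """Soft Dice between two flattened maps with values in [0, 1]."""
    overlap = sum(t * p for t, p in zip(truth, pred))   # |X ∩ Y|
    return (2.0 * overlap + eps) / (sum(truth) + sum(pred) + eps)


def cross_entropy(truth, pred, eps=1e-12):
    """Mean over pixels of -sum_i p(x_i) log q(x_i); inputs are [pixel][class]."""
    total = 0.0
    for p_row, q_row in zip(truth, pred):
        total -= sum(p * math.log(q + eps) for p, q in zip(p_row, q_row))
    return total / len(truth)


def combined_loss(truth, pred, lam1=0.5):
    """lam1 * cross-entropy + (1 - lam1) * Dice loss averaged over classes."""
    n_classes = len(truth[0])
    dice_losses = []
    for c in range(n_classes):
        t = [row[c] for row in truth]
        q = [row[c] for row in pred]
        dice_losses.append(1.0 - dice_score(t, q))
    dice_loss = sum(dice_losses) / n_classes
    return lam1 * cross_entropy(truth, pred) + (1.0 - lam1) * dice_loss


# Three pixels, two classes: one-hot ground truth and two candidate predictions.
truth = [[1, 0], [0, 1], [1, 0]]
good = [[0.9, 0.1], [0.2, 0.8], [0.7, 0.3]]
bad = [[0.4, 0.6], [0.6, 0.4], [0.5, 0.5]]
print(combined_loss(truth, good), combined_loss(truth, bad))
```

The prediction that agrees more closely with the ground truth receives the lower loss under both terms, which is the behaviour the weighted sum relies on.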

Datasets and preprocessing
The proposed model was evaluated on two datasets, DUKE DME [23] and the optic disc retina dataset from Shanghai Jiao Tong University [24]. The subjects include highly myopic patients, those with peripapillary atrophy, and cataract patients. For each subject, two B-scan images are randomly selected for the dataset and manually annotated by an expert with nine retinal layers and the optic disc. There are 11 labels in the mask image: background, RNFL, GCL, IPL, INL, OPL, ONL, IS/OS, RPE, Choroid, and Disc. The data are divided into independent training, validation, and test sets with a ratio of 6:2:2.
In our experiments, both datasets were augmented by horizontal flipping, and the images were resized by nearest-neighbor interpolation. The first dataset is fed to the network at a size of 224 × 224 and the second at a size of 512 × 496 to ensure a better segmentation result with as little loss of image detail as possible.

Experimental settings
The neural network framework used in our experiments was PyTorch. The optimizer was Adam with an initial learning rate of 0.001, a linear warm-up of 10 epochs, and a cosine annealing learning rate schedule. The cosine annealing period was 200 epochs for the first dataset and 50 epochs for the second. The experiments ran on a graphics workstation with an i5-11400F CPU, 16 GB of RAM, and an NVIDIA TITAN X (Pascal) GPU with 16 GB of video memory.
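The learning rate schedule described above, a linear warm-up followed by cosine annealing, can be written as a small function of the epoch index. This is a sketch under assumptions not given in the text: the warm-up rises linearly to the base rate, the cosine phase starts after the warm-up and decays to `min_lr = 0`, and the function name `learning_rate` is hypothetical.

```python
import math


def learning_rate(epoch, base_lr=1e-3, warmup_epochs=10,
                  cosine_epochs=200, min_lr=0.0):
    """Learning rate for a zero-based epoch index."""
    if epoch < warmup_epochs:
        # Linear warm-up: reaches base_lr at the last warm-up epoch.
        return base_lr * (epoch + 1) / warmup_epochs
    # Cosine annealing from base_lr down to min_lr over cosine_epochs.
    t = min(epoch - warmup_epochs, cosine_epochs)
    cosine = 0.5 * (1.0 + math.cos(math.pi * t / cosine_epochs))
    return min_lr + (base_lr - min_lr) * cosine


# Print a few points of the 200-epoch schedule used for the first dataset.
for epoch in (0, 9, 10, 110, 210):
    print(epoch, learning_rate(epoch))
```

For the second dataset, the same function would be called with `cosine_epochs=50`.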

Evaluation indicators
This article uses the Dice score and Pixel Accuracy (PA) [25] as evaluation metrics. As explained earlier, the Dice score is employed to assess the similarity between the segmentation results and the ground truth images. PA evaluates the percentage of accurately classified pixels in the image, reflecting the overall segmentation accuracy. The formula of PA is as follows:

$$PA = \frac{\sum_{i=0}^{n} P_{ii}}{\sum_{i=0}^{n} \sum_{j=0}^{n} P_{ij}} \quad (7)$$

where $n$ represents the total number of categories, $P_{ii}$ denotes the number of pixels of true class $i$ predicted as class $i$, and $P_{ij}$ represents the number of pixels of true class $i$ predicted as class $j$.
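Pixel Accuracy can be computed directly from a confusion matrix whose entry in row i and column j counts the pixels of true class i predicted as class j. The sketch below is illustrative; the helper names and the toy labels are assumptions made for the example.

```python
def confusion_matrix(truth, pred, n_classes):
    """Count pixels of true class i predicted as class j."""
    matrix = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(truth, pred):
        matrix[t][p] += 1
    return matrix


def pixel_accuracy(matrix):
    """Diagonal sum (correct pixels) divided by the total pixel count."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total


# Flattened toy label maps with three classes.
truth = [0, 0, 1, 1, 2, 2, 2, 0]
pred = [0, 1, 1, 1, 2, 2, 0, 0]
matrix = confusion_matrix(truth, pred, n_classes=3)
print(matrix)
print(pixel_accuracy(matrix))
```

Because PA pools all pixels, large classes such as the background dominate it, which is one reason to report it alongside per-layer Dice scores.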

Experiments on DUKE DME
The proposed method was trained and tested on the first dataset, using the Dice score as the evaluation metric, and the results are shown in Table 1 and Fig. 4. Fig. 4 shows that the proposed method gives more accurate layer segmentation than the other methods, without mixing between layers, and localizes the fluid region more accurately, without misidentifying the fluid as another layer. As Table 1 shows, the proposed method has the best Dice for most layers and the best average, while its Dice for the INL layer is similar to Relaynet's, and its Dice for the ONL-ISM layer is slightly worse than that of the other three methods, probably due to the effect of fluid accumulation. The inclusion of the self-attention mechanism and the spatial and channel attention mechanisms has improved the model's accuracy. Overall, our proposed method outperforms the other methods.
The best performance is indicated by "*", the second best performance is indicated by "**".
Here, it is worth noting that during our research, we came across a method based on Residual Attention-UNET from [18].This method achieves a Mean Dice coefficient of 91.5% and MIOU of 93% in the layer segmentation on their private dataset.However, since this method does not provide Dice coefficients for each layer, it has not been included in the table for comparison.

Experiments on optic disc retina dataset
The proposed method was trained and tested on the optic disc retina dataset and compared with other methods using the Dice score and PA as metrics, as shown in Table 2, Table 3, and Fig. 5. Fig. 5 shows that the results of the other methods are inaccurate for some layers: pixels between two layers are misclassified, or pixels of one layer are enclosed by another layer. Our method avoids this problem. Our method achieves better segmentation possibly because the self-attention mechanism enhances the model's ability to extract global features, which include positional information in the image, allowing the model to rely not only on pixel values for classification but also on pixel position.
As can be seen from Table 2 and Table 3, our method achieves the best metrics on the RNFL, IPL, INL, OPL, ONL, and IS/OS layers, and its average metrics over the layers are also the best, tied with MGU-Net. In addition, U-Net achieves the best metrics on the Disc but performs poorly on the retinal layers, probably because it is better suited to general objects than to layered structures. Our method achieves the best segmentation results for the retinal layers, unaffected by the presence of the optic disc, even though its results for the Disc are slightly worse, probably because the model has some bias in selecting features at the boundary between the optic disc and the retinal layers.

Experiments with the number of Transformer layers
On a small-scale medical dataset, an excessive number of layers may cause over-fitting due to the lack of training data, while too few layers do not fully exploit the Transformer.
In our experiments, we varied the number of Transformer layers in the Transformer module to find the best setting. Following common Transformer configurations, five layer counts (2, 3, 4, 5, and 6) were chosen for comparison. Fig. 6 shows that both datasets achieve the best results with 3 Transformer layers.

Experiments with combined loss functions
In the experiment, a combined loss function that integrates the multi-class cross-entropy loss and the Dice loss was used, and the results were compared for different weights. Additionally, we compared this combined loss function with other loss functions commonly used in medical image segmentation, namely MultiLabelSoftMarginLoss and Focal Loss. According to the results shown in Fig. 7, the combined loss function improved the Dice values by 0.011 and 0.012 compared to these two loss functions, respectively, on the DUKE DME dataset. Similarly, on the optic disc retina dataset, the Dice scores increased by 0.009 and 0.042, respectively. These results indicate that the combined loss function significantly enhances the model's performance.

Ablation experiments
We conducted ablation experiments to validate the Transformer's self-attention mechanism at the bottom of the network and the attention mechanism in the decoder.
In the first experiment, the Transformer's self-attention mechanism was combined with the regular CNN, and the self-attention mechanism itself was also improved. The results in Fig. 8 show that the model with the enhanced one-dimensional Transformer module outperformed the one with the original module, with Dice scores increasing by 0.01 and 0.1 on the two datasets, respectively. The improved capability of the model to extract global features contributed to this gain in segmentation accuracy. Considering the distinctive attributes of retinal data, the one-dimensional Transformer applies self-attention only along the vertical direction of the feature map. Specifying the orientation of the self-attention mechanism makes the use of image information more effective: it yields more samples for the Transformer from a dataset of the same size and reduces computational complexity, while enhancing the model's segmentation accuracy.
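The reduction in computation can be illustrated with simple arithmetic. The sketch below counts pairwise attention scores for full two-dimensional self-attention and for self-attention restricted to each column of the feature map. It assumes a 496 × 768 input downsampled by a factor of 16 to a 31 × 48 feature map; the feature-map size and the function names are illustrative assumptions, not details confirmed by the experiments.

```python
def full_attention_pairs(height, width):
    """Full 2D self-attention: one sequence of H*W tokens,
    so (H*W)^2 pairwise attention scores."""
    tokens = height * width
    return tokens * tokens

def vertical_attention_pairs(height, width):
    """Vertical-only self-attention: W independent sequences of
    H tokens each, so W * H^2 pairwise attention scores."""
    return width * height * height

# Assumed bottleneck feature map: 496 x 768 input, downsampled 16 times.
h, w = 496 // 16, 768 // 16
full = full_attention_pairs(h, w)
vertical = vertical_attention_pairs(h, w)
reduction = full // vertical
print(full, vertical, reduction)
```

Under these assumptions the column-wise variant computes W times fewer attention scores, and each image contributes W sequences to the Transformer instead of one.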

In addition, an attention mechanism module was added to the decoder of our method to enhance the features. As can be seen in Fig. 8, after enhancing the features in the channel and spatial dimensions, the Dice scores of the model improved by 0.009 and 0.005 on the two datasets, respectively. This indicates that the attention mechanism module strengthens the important features, which then play a greater role in subsequent operations and improve the model's performance.
Finally, Fig. 8 shows that introducing the enhanced one-dimensional Transformer together with the attention mechanism module in the decoder yields clear improvements in the model's performance. Specifically, the Dice scores increase by 0.041 and 0.145 on the respective datasets, providing empirical evidence for the efficacy of the incorporated modules.
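For reference, an average Dice score of the kind compared in these experiments can be computed as in the following sketch. It assumes integer label maps flattened to lists and an unweighted mean over classes; the function names and the treatment of classes absent from both maps (scored as 1.0) are illustrative choices, not details taken from the experiments.

```python
def dice_per_class(pred, truth, cls):
    """Dice coefficient 2|P ∩ T| / (|P| + |T|) for a single class."""
    pred_c = [p == cls for p in pred]
    truth_c = [t == cls for t in truth]
    intersection = sum(1 for p, t in zip(pred_c, truth_c) if p and t)
    denom = sum(pred_c) + sum(truth_c)
    # Assumption: a class absent from both maps counts as a perfect match.
    return 1.0 if denom == 0 else 2.0 * intersection / denom

def mean_dice(pred, truth, num_classes):
    """Unweighted mean of the per-class Dice coefficients."""
    scores = [dice_per_class(pred, truth, c) for c in range(num_classes)]
    return sum(scores) / num_classes

# Toy example: six pixels, three classes, one pixel mislabeled.
pred = [0, 0, 1, 1, 2, 2]
truth = [0, 0, 1, 2, 2, 2]
score = mean_dice(pred, truth, num_classes=3)
```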

Discussion and conclusion
In this study, a new network architecture combining a Transformer and a CNN is proposed for OCT retinal layer segmentation. The CNN acquires local features, while the Transformer's global receptive field extracts global features. According to the characteristics of the retinal data, we apply the Transformer's self-attention mechanism only along the vertical direction of the image and convert the two-dimensional feature map into one-dimensional sequences before it is processed by the Transformer, which both increases the number of samples for the Transformer module and reduces the computation cost, improving performance. Compared with other methods, the proposed method improves the performance in segmenting the retinal layers. In summary, the proposed method performs retinal layer segmentation well and can assist ophthalmologists in diagnosing retinal diseases.
Disclosures.
The authors declare that they have no conflict of interest.

Tan et al. introduced a model that combines a CNN and a lightweight Transformer. The model processes image inputs through two main frameworks, one based on the Transformer and the other on Cross-Convolution, to extract global and local features. They also designed and introduced a boundary regression loss function and feature polarization to improve boundary accuracy and maximize the feature distance between different layers, reducing mutual interference during segmentation [19]. Cao et al. proposed an enhanced Transformer-based single-step regression method, incorporating convolution to improve the multi-head self-attention of the Transformer.

Fig. 1. The framework of the proposed method.


where LN(·) represents layer normalization and $z_L$ represents the encoded map. To facilitate subsequent convolution in the decoder, the result after the Transformer module is reshaped from $(B \times \frac{W}{16}, \frac{H}{16}, C)$ to $(B, C, \frac{H}{16}, \frac{W}{16})$.
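The reshaping around the Transformer module can be sketched as follows. The example treats every column of a (B, C, H, W) feature map as one sequence of H tokens with C channels, giving B × W sequences, and then restores the original layout for the decoder. Here H and W stand for the already downsampled height and width; the nested-list representation and the function names are assumptions made for illustration, and in a tensor library the same step would typically be a permute followed by a reshape.

```python
def to_column_sequences(x):
    """(B, C, H, W) nested lists -> (B*W, H, C):
    one sequence of H tokens per image column."""
    B, C, H, W = len(x), len(x[0]), len(x[0][0]), len(x[0][0][0])
    return [[[x[b][c][h][w] for c in range(C)] for h in range(H)]
            for b in range(B) for w in range(W)]

def to_feature_map(seqs, batch):
    """(B*W, H, C) -> (B, C, H, W), the inverse of to_column_sequences."""
    W = len(seqs) // batch
    H, C = len(seqs[0]), len(seqs[0][0])
    return [[[[seqs[b * W + w][h][c] for w in range(W)] for h in range(H)]
             for c in range(C)] for b in range(batch)]

# Toy feature map with unique values: B=2, C=3, H=4, W=5.
B, C, H, W = 2, 3, 4, 5
x = [[[[((b * C + c) * H + h) * W + w for w in range(W)] for h in range(H)]
      for c in range(C)] for b in range(B)]

seqs = to_column_sequences(x)            # shape (B*W, H, C) = (10, 4, 3)
restored = to_feature_map(seqs, batch=B)  # back to (B, C, H, W)
```

Because each column becomes its own sequence, self-attention applied to `seqs` only relates positions along the vertical direction.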
The DUKE DME dataset, publicly released by Chiu et al. at the Duke University Eye Center, contains 110 hand-annotated B-scan images (11 B-scans per patient, 496 × 768 pixels) from 10 patients with diabetic macular edema. The images are labeled by experts for retinal fluid and seven retinal layers, which are represented by 10 labels in the mask image: RNFL, GCL-IPL, INL, OPL, ONL-ISM, ISE, OS-RPE, fluid, and the upper and lower background. In our experiments, the dataset is divided into independent training, validation, and test sets with a ratio of 6:2:2. The optic disc retina dataset, collected by Li et al. at Shanghai Jiao Tong University, consists of peripapillary OCT images from 61 subjects (12 B-scans per subject, 1024 × 992 pixels).
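A 6:2:2 split of the DUKE DME data could be produced as in the sketch below. It assumes the split is made at the patient level with a fixed random seed, so that all B-scans of one patient fall into the same subset; the text does not state whether the split is per patient or per image, so this choice, the seed, and the function name are assumptions.

```python
import random

def split_patients(patient_ids, ratios=(6, 2, 2), seed=0):
    """Shuffle patient IDs and split them into train/val/test by ratio."""
    ids = list(patient_ids)
    random.Random(seed).shuffle(ids)  # reproducible shuffle (assumed seed)
    total = sum(ratios)
    n_train = len(ids) * ratios[0] // total
    n_val = len(ids) * ratios[1] // total
    train = ids[:n_train]
    val = ids[n_train:n_train + n_val]
    test = ids[n_train + n_val:]      # remainder goes to the test set
    return train, val, test

# 10 patients with 11 B-scans each, as in the DUKE DME dataset.
scans_per_patient = 11
train_ids, val_ids, test_ids = split_patients(range(1, 11))
n_images = [len(s) * scans_per_patient for s in (train_ids, val_ids, test_ids)]
```

With 10 patients this gives 6, 2, and 2 patients, that is 66, 22, and 22 images, in the training, validation, and test sets.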

Fig. 4. Segmentation results of OCT retina with diabetic macular edema. (a) Original image. (b) Ground truth. (c) Results of U-net. (d) Results of Attention-Unet. (e) Results of TransUnet. (f) Results of our method.


Fig. 6. Experiments with the number of Transformer layers.


Fig. 8. Ablation experiments. The values in the figure are the corresponding average Dice scores.

Funding.
This work was supported by the National Natural Science Foundation of China (62175156, 81827807, 61675134), the Science and Technology Innovation Project of the Shanghai Science and Technology Commission (19441905800), and the Shanghai Institute of Technology Collaborative Innovation Fund Project (XTCX2022-4).


Table 1. Dice scores on the DUKE DME dataset. a
a The best performance is indicated by "*", the second best performance is indicated by "**".

Table 2. Dice scores on the optic disc retina dataset. a
a The best performance is indicated by "*", the second best performance is indicated by "**".

Table 3. PA on the optic disc retina dataset. a
a The best performance is indicated by "*", the second best performance is indicated by "**".