TCU-Net: Transformer Embedded in Convolutional U-Shaped Network for Retinal Vessel Segmentation

Optical coherence tomography angiography (OCTA) provides a detailed visualization of the vascular system to aid in the detection and diagnosis of ophthalmic disease. However, accurately extracting microvascular details from OCTA images remains a challenging task due to the limitations of pure convolutional networks. We propose a novel end-to-end transformer-based network architecture called TCU-Net for OCTA retinal vessel segmentation tasks. To address the loss of vascular features of convolutional operations, an efficient cross-fusion transformer module is introduced to replace the original skip connection of U-Net. The transformer module interacts with the encoder’s multiscale vascular features to enrich vascular information and achieve linear computational complexity. Additionally, we design an efficient channel-wise cross attention module to fuse the multiscale features and fine-grained details from the decoding stages, resolving the semantic bias between them and enhancing effective vascular information. This model has been evaluated on the dedicated Retinal OCTA Segmentation (ROSE) dataset. The accuracy values of TCU-Net tested on the ROSE-1 dataset with SVC, DVC, and SVC+DVC are 0.9230, 0.9912, and 0.9042, respectively, and the corresponding AUC values are 0.9512, 0.9823, and 0.9170. For the ROSE-2 dataset, the accuracy and AUC are 0.9454 and 0.8623, respectively. The experiments demonstrate that TCU-Net outperforms state-of-the-art approaches regarding vessel segmentation performance and robustness.


Introduction
A large number of clinical studies have shown that diseases such as diabetic retinopathy (DR) [1], cataracts [2], dry eye syndrome (DES) [3], and glaucomatous lesions [4] are associated with structural and morphological alterations of retinal vessels. As part of ophthalmic diagnostic criteria, optical coherence tomography angiography (OCTA) enables the identification and measurement of blood flow to obtain high-resolution images of the blood vessels in the retina, choroid, and conjunctival areas [5]. Compared with traditional fluorescein fundus angiography and indocyanine green angiography, OCTA offers non-invasive, rapid, and three-dimensional imaging, making it a very promising vascular imaging technique in ophthalmology [6]. As shown in Figure 1a, color fundus images obtained by conventional retinal imaging techniques have difficulty capturing fine vessels and capillaries. In contrast, OCTA [7] can generate images of the retinal vascular plexus at different depths, as shown in Figure 1b-d. High-quality OCTA images present microvascular information in different depth layers, which can be readily applied to clinical research. To precisely identify and diagnose variations in retinal blood vessels, medical personnel need to extract the retinal vessels from the fundus image to observe the length, curvature, width, and other morphological properties of the retinal vascular trees. However, manual segmentation of retinal vessels is tedious and time-consuming [8]. Consequently, automatic segmentation algorithms that improve efficiency and reliability have attracted much attention in clinical practice. In the past few decades, many efforts have been made to segment retinal vessels. For instance, Gao et al. 
[9] proposed an automated method for the diagnosis of diabetic retinopathy that could help physicians diagnose patients more quickly and accurately. The approach relies on annotating a large number of images, which requires a lot of time and human resources, and reannotating images for different cases. Jin et al. [10] presented a new dataset of fundus images based on vascular segmentation, which can provide researchers with rich experimental data. The size of the dataset is not extremely large and includes only one disease (diabetic retinopathy), which may affect the generalization ability of the algorithm. Song et al. [11] presented a machine-learning-based clinical decision model that uses a set of rules developed by physician experts and combines traditional feature extraction methods with automatic feature learning by convolutional neural networks (CNNs) to improve the diagnostic accuracy of pathological ptosis. However, the study lacks comparative experiments to assess the advantages and disadvantages of the model with other methods. The state-of-the-art methods for retinal vessel segmentation come from the fully convolutional networks (FCNs), such as U-Net and its variants [12], which are based on the encoder-decoder architecture. U-Nets can capture contextual semantic information by using a cascade of convolutional layers and combining high-resolution feature maps with skip connections to achieve precise localization. The impact of skip connections is improved by Attention U-Net [13], which introduces an attention module to weight encoder features and fuse them with corresponding decoder features. This enhances the retention and reinforcement of critical vessel features in the decoder. However, the interactions between information at different scales are ignored by the skip connections, which only enhance the vessel representation by adding over the channels to the corresponding decoder features. 
It has been indicated by studies [14] that not all skip connections effectively connect the encoder and decoder. Additionally, it was found that the original U-Net performs worse than a U-Net without skip connections on some datasets.
Many studies have focused on retinal vessel segmentation in OCTA images due to the superiority of OCTA images in visualizing the retinal plexuses. OCTA images are characterized by rich retinal vessels, complex branching structures, and a low signal-to-noise ratio, making it difficult to distinguish small capillaries, arterioles, and venous regions in the image, which leads to poor segmentation. In addition, variation in vessel size, shadow artifacts, and retinal abnormalities further complicate segmentation. To address these challenges, Ma et al. [7] proposed a split-based coarse-to-fine OCTA image segmentation network (OCTA-Net) that comprises a coarse segmentation stage and a fine segmentation stage. The coarse segmentation network generates preliminary confidence maps for pixel-level and centerline-level vessels, while the fine stage serves as a fusion network to obtain the final refined segmentation result. Although dividing OCTA image segmentation into two stages mitigates the problem of discontinuity in vessel segmentation, the training process is laborious and impractical. Pissas et al. [15] presented an effective recurrent CNN for vessel segmentation in OCT-A, which uses fully convolutional networks (FCNs) to segment the entire image in each forward pass and iteratively refines the quality of vessel generation through weight-sharing coupled with perceptual losses. Despite achieving good performance, CNN-based approaches generally exhibit limitations in capturing long-range (global) dependencies due to the intrinsic locality of convolution operations. This causes the convolutional network to focus only on local features of the retinal vessel image, making it prone to breaking or missing the widely distributed small blood vessels.
Existing studies have shown that transformer architectures based on the self-attention mechanism can make up for the information loss in convolution operations and effectively establish long-range dependencies. Self-attention is the key computational primitive of the transformer. It implements pairwise entity interactions with a context aggregation mechanism, giving the transformer the ability to handle long-range dependencies. Preliminary studies with different forms of self-attention have shown its practicality in various medical image segmentation tasks [16,17]. Despite their exceptional representational power, transformer architectures are challenging to train and deploy. First, the complexity of the vanilla transformer module is quadratic in the input image size. Second, without the inductive biases of ConvNets, transformers do not perform well on small-scale datasets. These challenges make it difficult to process a small number of high-resolution medical images, leaving considerable room for further improvement.
In summary, we have identified several limitations of existing OCTA retinal vessel segmentation methods: (1) The continuity of retinal vessels amplifies the defects of convolution calculations, and the convolutional network's weak global capturing ability makes it susceptible to breaking or missing segmented vessels. (2) The skip connections in U-Net simply propagate vessel information from the encoder to the decoder on features of the same scale, resulting in limited interaction between features at different scales, which fails to prevent information loss and blurring. (3) Although the pure transformer network structures can achieve global context interaction through the self-attention mechanism, the high computational complexity of self-attention remains a challenge, especially for processing larger images with transformer-based structures.
To address these issues, this paper introduces a transformer embedded in a convolutional U-shaped network, TCU-Net, which combines an advanced convolutional network with the self-attention mechanism for OCTA retinal image segmentation. Specifically, an efficient cross-fusion transformer (ECT) is proposed to replace the original skip connections. The ECT module leverages the advantages of both convolution and self-attention: it avoids large-scale pre-training by exploiting the image inductive bias of convolution, and it retains the capability of the transformer to capture long-range relationships with linear computational complexity.
Moreover, features with different scales are input by the encoder into an efficient multihead cross-attention mechanism to achieve interaction between different scales and compensate for the loss of vessel information. Finally, the efficient channel-wise cross attention (ECCA) module is introduced to fuse the transformer module's multiscale features and decoder features to solve the semantic inconsistency between them and enhance effective vessel features. The main contributions of this work include the following:

• We proposed a novel end-to-end OCTA retinal vessel segmentation method that embeds convolution calculations into a transformer for global feature extraction.
• An efficient cross-fusion transformer module was designed to replace the original skip connections, thus achieving interaction between multiscale features and compensating for the loss of vessel information. The multihead cross-attention mechanism of the ECT module reduces the computational complexity compared to the original multihead self-attention mechanism.
• To reduce the semantic difference between the output of the ECT module and the decoder features, we introduce a channel cross-attention module to fuse and enhance effective vessel information.
• Experimental evaluation on two OCTA retinal vessel segmentation datasets, ROSE-1 and ROSE-2, demonstrates the effectiveness of the proposed TCU-Net.

Related Studies
The retinal vessel segmentation methods considered herein can be divided into CNN-based methods and transformer-based methods. For the latter, we focus on applications to medical image datasets. In this section, we introduce corresponding algorithms for each category.

Based on Convolution Neural Networks
In recent years, deep learning models have been widely used for retinal images since they do not need any handcrafted features and outperform existing unsupervised methods. Such models, especially U-Net [18], are still the most popular segmentation frameworks applied to fundus images. Accurate segmentation remains difficult because small vessels at vessel ends and vessel edges appear blurred in retinal vascular images, and the vessel area is not clearly distinguished from the background. To solve this issue, Xiao et al. [19] introduced the residual structure and combined it with U-Net to achieve a powerful feature extraction capability and obtain high-accuracy retinal vessel segmentation. However, ResU-Net [20] utilizes more convolutional layers and parameters, which might lead to overfitting. Guo et al. [21] introduced a spatial attention module to make the network focus on vascular features and suppress unnecessary features, thus improving the expression ability of the network. However, the attention module of SA-Unet only focuses on local information and is therefore not sensitive enough to long-range dependencies. Zhang et al. [22] proposed pyramid U-Net, in which pyramid-scale aggregation is employed in both the encoder and decoder to aggregate features at higher and lower levels for accurate retinal vessel segmentation. In this way, contextual information sharing and aggregation from coarse to fine can be achieved, thus improving the segmentation of capillary regions.
With the widespread use of OCTA techniques in ophthalmic diseases, researchers have gradually switched their targets from color fundus images to OCTA retinal vessel segmentation. Li et al. [23] proposed a new image magnification network (IMN) with an upsampling encoder followed by a downsampling decoder. This design captures more image details and reduces the omission of thin and small structures. Xu et al. [24] introduced an OCTA-based cascaded neural network to automatically segment and distinguish small blood vessels before and after the capillary plexus, followed by a graph neural network (GNN) to improve the connectivity of the initial segmentation. Wu et al. [25] proposed a progressive attention-enhanced network (PAENet) for 3D-to-2D retinal vessel segmentation. It consists of a 3D feature learning path and a 2D segmentation path. To obtain more detailed information, a feature fusion module (FFM) is designed to inject 3D information into the 2D feature path and then model the semantic relationship between spatial and channel dimensions to achieve feature interaction. The above CNN-based segmentation networks achieved great performance on the retinal vasculature, but the local and limited receptive field of the convolutional network is still one of their shortcomings. Moreover, the existing U-Net-based retinal vascular segmentation networks only fuse features at the same level of the encoder and decoder, ignoring the correlations between features of different layers. Therefore, the method proposed here lets the encoder features of different scales of U-Net interact to compensate for the loss of vascular information.

Based on Transformer Architecture
One of the first transformer-based architectures proposed for medical image segmentation is the TransUnet [26] architecture, which regards a hybrid CNN-transformer architecture as an encoder and outputs the final segmentation mask in the decoder. Zhang et al. introduced TransFuse [27] to effectively integrate the transformer and CNN features through the BiFusion module utilizing self-attention and a multimodal fusion mechanism. It was evaluated for polyp segmentation, skin segmentation, and hip segmentation and has been shown to be effective. In other work, TransAttUNet [28] is the first network to apply transformer layers between the encoders and decoders in a U-shaped architecture. The robust self-aware attention module and multiscale skip connection have been embedded between the encoder and decoder of U-Net, which not only enhances the flexibility of U-Net but also increases the expression ability of global spatial attention and transformer self-attention.
Extensive experiments with TransAttUNet on five benchmark medical image segmentation datasets have shown its effectiveness. The above transformer-based models implement global context modeling and exhibit a strong ability to capture key features in images. Nevertheless, the computational complexity of the original self-attention is high and requires a longer training time and a larger amount of computational resources. To address this, Tan et al. [29] proposed a novel transformer network (OCT2Former) for OCTA retinal vessel segmentation, using a dynamic token aggregation transformer to reduce the huge computational overhead of the original transformer and designing an assisted convolution branch to speed up the convergence of the transformer. In addition, Guo et al. proposed the UTNet [30] model, in which transformer layers are present in both the encoder and decoder. It effectively combines the attention mechanism with convolution operations and reduces the quadratic complexity of the self-attention mechanism to linear complexity. In order to accelerate the convergence of the segmentation network, we reduce the computational complexity of the model by using the latter scheme: the features are embedded into the self-attention mechanism after their size has been reduced through convolutional computation.

Figure 2 provides an overview of the TCU-Net network. The U-Net architecture comprises a downsampling encoder and an upsampling decoder, and the skip connections combine encoder and decoder features at symmetric positions along the channel dimension, thus preserving the original input feature map in the deep transformation. Inspired by methods such as UTNet [30] and UCTransNet [14], we aim to improve the performance of U-Net by designing an efficient cross-fusion transformer to replace the original skip connections. The ECT module is situated on the original skip connection structure. 
The output of the ECT module is not directly added to the channel with the corresponding layers of the decoder. Instead, it is fused with the output features and upsampled features layer-wise by the ECCA module. This process guides the decoder stage and enhances vascular information.

ECT: Efficient Cross-Fusion Transformer for Encoder Feature Transformation
To solve the high computational complexity issue when fusing the multiscale features of encoders, the proposed efficient cross-fusion transformer (ECT) module integrates convolution into the self-attention mechanism to avoid the large-scale pre-training of the transformer. This builds on the theory proposed by Wang et al. [31] that self-attention is essentially low rank for long sequences, with most of the information concentrated in the largest singular values. A more efficient attention mechanism based on this theory was proposed by UTNet [30], which successfully reduced the computational complexity of self-attention. In addition, UCTransNet [14] identified that some skip connections may not be effective due to the incompatible feature sets between the encoder and decoder stages. To address this issue, they introduced the CTrans (channel transformer) model as an alternative to U-Net skip connections. The CTrans model effectively bridges the semantic gap and achieves the accurate automatic segmentation of medical images. Inspired by these works, the ECT module, shown in Figure 3, can effectively fuse features at different scales as well as reduce the computational complexity of the self-attention mechanism.
Following previous studies [32], the attention function is computed for a set of queries simultaneously, packed into a matrix $Q$. The keys and values are also packed into matrices $K$ and $V$. We use 4 heads and consider an input feature map $X \in \mathbb{R}^{C \times H \times W}$, where $H$ and $W$ are the spatial height and width, and $C$ is the number of channels. The computation process is described as follows:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^{T}}{\sqrt{d}}\right)V, \tag{1}$$

where $Q, K, V \in \mathbb{R}^{d \times H \times W}$ and $d$ is the embedding dimension of each head. Accordingly, $Q$, $K$, and $V$ are flattened and transposed into sequences of size $\mathbb{R}^{n \times d}$, with $n = HW$. Consequently, the dot-product attention has a complexity of $O(n^2 d)$. Typically, self-attention layers are slower than recurrent layers when the sequence length $n$ is longer than the representation dimensionality $d$, which limits the flexible applicability of self-attention. Therefore, the main idea of the efficient cross-fusion self-attention we employ is to project the features into a lower-dimensional embedding.
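To make the quadratic cost concrete, the following is a minimal NumPy sketch of standard scaled dot-product attention on flattened feature maps. It is an illustration only, not the authors' implementation; the function names and the toy sizes (`H`, `W`, `d`) are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the maximum along the axis for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def dot_product_attention(q, k, v):
    """Scaled dot-product attention for q, k, v of shape (n, d)."""
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)       # (n, n): this matrix is the quadratic term
    weights = softmax(scores, axis=-1)  # each row sums to 1
    return weights @ v, weights

# Toy sizes (assumptions for illustration, not values from the paper).
rng = np.random.default_rng(0)
H, W, d = 8, 8, 16
n = H * W                               # sequence length after flattening
q, k, v = (rng.standard_normal((n, d)) for _ in range(3))
out, weights = dot_product_attention(q, k, v)
print(out.shape, weights.shape)
```

The score matrix has `n * n` entries, so doubling both `H` and `W` multiplies its size by 16, which is why the sequence length is reduced before attention is applied.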
In the efficient cross-fusion transformer (ECT) module, each encoder output $X_i$, $i = 1, 2, 3, 4$, needs to be regularized to $X_i \in \mathbb{R}^{\frac{HW}{i^2} \times C_i}$ before entering the attention mechanism. As shown in Figure 3, we use three $1 \times 1$ convolutions to obtain $Q_i$, $K_i$, and $V_i$ ($i = 1, 2, 3, 4$) and concatenate the four layers of $K$ and $V$ as the ultimate key and value, $K_\Sigma = \mathrm{Concat}(K_1, K_2, K_3, K_4)$ and $V_\Sigma = \mathrm{Concat}(V_1, V_2, V_3, V_4)$. On each of these projected versions of queries, keys, and values, we then perform three projections into a low-dimensional embedding in each head: $Q_i \in \mathbb{R}^{k \times d_i}$, $K_\Sigma \in \mathbb{R}^{k \times d_\Sigma}$, and $V_\Sigma \in \mathbb{R}^{k \times d_\Sigma}$, $i = 1, 2, 3, 4$, where $d$ is the embedding dimension in each head, $k = hw \le \frac{HW}{i^2}$, and $h$ and $w$ are the reduced height and width of each feature map after bilinear interpolation.

The proposed module takes six inputs: four queries and the two aggregated tensors $K_\Sigma$ and $V_\Sigma$ as the key and value, as shown in Figure 4. We compute the matrix of outputs through an efficient cross-attention (ECA) mechanism as:

$$\mathrm{ECA}_i = \mathrm{softmax}\left(\frac{Q_i^{T} K_\Sigma}{\sqrt{d_\Sigma}}\right) V_\Sigma^{T}, \tag{2}$$

where $d_\Sigma = \mathrm{Concat}(d_1, d_2, d_3, d_4)$ is aggregated through the dimensions in the four skip connection layers. That is, we compute the dot products of the transpose of the query with all keys, divide each by $\sqrt{d_\Sigma}$, and apply a softmax function to obtain the weights on the values. In practice, we use 4 heads and employ $k = \frac{HW}{16^2}$ as the limited length. Due to the reduced size of each feature map, the total computational complexity is approximately $O(k^2 d)$, which is much smaller than $O(n^2 d)$.

To distinguish our model from conventional vision transformer models, we apply a convolutional layer to each output of the multihead self-attention, followed by batch normalization and a ReLU activation function, to achieve information complementarity. Then, applying a convolution and a residual structure, the output is obtained as follows:

V′
The operation in Equation (3) is repeated four times to build the outputs of the transformer. Finally, we use an upsampling followed by a $1 \times 1$ convolution to reconstruct the four outputs $E_1$, $E_2$, $E_3$, and $E_4$ and splice them with the decoder features $D_1$, $D_2$, $D_3$, and $D_4$, respectively.
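The cross-attention step described in this section can be sketched in NumPy as follows. This is a simplified, single-head illustration under stated assumptions, not the authors' code: the learned $1 \times 1$ convolution projections are omitted (the reduced tokens stand in for $Q_i$, $K_i$, and $V_i$), average pooling replaces the bilinear interpolation, and the channel counts, feature map sizes, and function names are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def reduce_spatial(x, h, w):
    """Average-pool a (C, H, W) map to (C, h, w); H and W must be multiples of h and w.
    Stand-in (assumption) for the bilinear interpolation mentioned in the text."""
    c, H, W = x.shape
    return x.reshape(c, h, H // h, w, W // w).mean(axis=(2, 4))

def efficient_cross_attention(q_i, k_sum, v_sum):
    """q_i: (k, d_i); k_sum, v_sum: (k, d_sum). Returns an array of shape (d_i, k)."""
    d_sum = k_sum.shape[1]
    weights = softmax(q_i.T @ k_sum / np.sqrt(d_sum), axis=-1)  # (d_i, d_sum)
    return weights @ v_sum.T                                     # (d_i, k)

rng = np.random.default_rng(0)
h = w = 4                       # reduced spatial size, so k = h * w = 16
channels = [8, 16, 32, 64]      # hypothetical encoder channel counts
sizes = [32, 16, 8, 4]          # hypothetical encoder feature map sizes

tokens = []
for c, s in zip(channels, sizes):
    feat = rng.standard_normal((c, s, s))
    # Reduce every scale to the same k tokens, then lay out as (k, c).
    tokens.append(reduce_spatial(feat, h, w).reshape(c, h * w).T)

k_sum = np.concatenate(tokens, axis=1)  # (k, d_sum), d_sum = 8 + 16 + 32 + 64
v_sum = k_sum.copy()                    # no learned projection in this sketch
outputs = [efficient_cross_attention(q, k_sum, v_sum) for q in tokens]
print([o.shape for o in outputs])
```

In this toy setup every scale is reduced to the same length `k = 16` before attention, so the size of the attention computation no longer depends on the original spatial resolution of each encoder stage.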

ECCA: Efficient Channel Cross-Attention
To resolve the semantic inconsistency between the efficient transformer and the U-Net decoder, we apply a channel cross-attention module [14] that exploits the inter-channel relationship of features. To compute the channel cross-attention efficiently, we first squeeze the spatial dimension of the input features $E_i \in \mathbb{R}^{C_i \times H \times W}$ and $D_i \in \mathbb{R}^{C_i \times H \times W}$ ($i = 1, 2, 3, 4$). For aggregating spatial information, average-pooling and max-pooling have been commonly adopted so far. Previous studies argued that max-pooling gathers distinctive object features to infer finer channel-wise attention, while average-pooling effectively learns the extent of the target object [33]. We empirically confirmed that exploiting both of them in a parallel or sequential manner obtains the best result (see Section 4.1). In this computation, $\sigma$ denotes the sigmoid function, $M_i(E_i) \in \mathbb{R}^{C_i \times 1 \times 1}$, and $M_i(D_i) \in \mathbb{R}^{C_i \times 1 \times 1}$. Note that $L_1 \in \mathbb{R}^{C_i \times C_i}$ and $L_2 \in \mathbb{R}^{C_i \times C_i}$ are the weights of two linear layers. Through these computations, we generate two different pieces of spatial context information and merge the features using element-wise summation. Finally, the channel attention map is built by a single linear layer and sigmoid function.
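As a rough illustration of the channel cross-attention idea, the sketch below squeezes each input with the sum of global average pooling and global max pooling, passes the two squeezed vectors through the linear weights `l1` and `l2`, and gates the transformer features with a sigmoid. The exact arrangement of these operations is an assumption made for illustration, since the text does not spell it out; all names and sizes are hypothetical.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def squeeze(x):
    """(C, H, W) -> (C,): global average pooling plus global max pooling."""
    return x.mean(axis=(1, 2)) + x.max(axis=(1, 2))

def channel_cross_attention(e, d, l1, l2):
    """Gate transformer features e with a channel mask computed from e and d.
    Assumed form: mask = sigmoid(l1 @ squeeze(e) + l2 @ squeeze(d))."""
    mask = sigmoid(l1 @ squeeze(e) + l2 @ squeeze(d))  # (C,)
    return mask[:, None, None] * e, mask

rng = np.random.default_rng(0)
C, H, W = 8, 16, 16                        # toy sizes (assumptions)
e = rng.standard_normal((C, H, W))         # stands in for E_i
d = rng.standard_normal((C, H, W))         # stands in for D_i
l1 = rng.standard_normal((C, C)) * 0.1     # stands in for L_1
l2 = rng.standard_normal((C, C)) * 0.1     # stands in for L_2
e_hat, mask = channel_cross_attention(e, d, l1, l2)
print(e_hat.shape, mask.round(3))
```

Each channel of `e` is rescaled by a factor between 0 and 1, so channels that disagree with the decoder context can be suppressed before the features are fused.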

Datasets and Metrics
To evaluate the effectiveness and superiority of TCU-Net, we have conducted extensive experiments on the Retinal OCTA SEgmentation (ROSE) dataset [7], which is the first public OCTA dataset for the vessel segmentation task. ROSE consists of two subsets (ROSE-1 and ROSE-2) obtained by two different devices. Specifically, ROSE-1 contains 117 OCTA images with a resolution of 304 × 304 pixels, while ROSE-2 contains 112 OCTA images with 512 × 512 pixels. ROSE-1 can be divided into three kinds of OCTA images with both centerline-level annotation and pixel-level annotation, i.e., SVC, DVC, and SVC+DVC. In ROSE-2, only SVC images with centerline-level annotation are provided. We considered the consensus of centerline-level annotation and pixel-level annotation as the ground truth in the SVC of ROSE-1. Given a predicted segmentation result and its corresponding ground truth, true positives (TPs) are the correctly segmented vessel pixels, and vessel pixels wrongly classified as non-vessel pixels are false negatives (FNs). Similarly, true negatives (TNs) are the correctly segmented non-vessel pixels, and non-vessel pixels incorrectly detected as vessel pixels are false positives (FPs). The evaluation metrics are calculated as follows:
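For reference, the confusion-matrix-based metrics reported in the result tables (ACC, G-mean, Kappa, Dice, FDR) can be computed as in the sketch below. The formulas are the commonly used definitions and are assumed here rather than taken from the paper's own equations; AUC is omitted because it requires the continuous prediction scores rather than the four counts. The function name and the example counts are hypothetical.

```python
import math

def segmentation_metrics(tp, fp, tn, fn):
    """Common confusion-matrix metrics for binary vessel segmentation."""
    total = tp + fp + tn + fn
    acc = (tp + tn) / total
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    g_mean = math.sqrt(sensitivity * specificity)
    # Expected agreement by chance, used by Cohen's kappa.
    p_e = ((tp + fn) * (tp + fp) + (tn + fp) * (tn + fn)) / total ** 2
    kappa = (acc - p_e) / (1 - p_e)
    dice = 2 * tp / (2 * tp + fp + fn)
    fdr = fp / (fp + tp)
    return {"acc": acc, "g_mean": g_mean, "kappa": kappa, "dice": dice, "fdr": fdr}

# Made-up pixel counts for illustration only.
m = segmentation_metrics(tp=80, fp=20, tn=880, fn=20)
for name, value in m.items():
    print(f"{name}: {value:.4f}")
```

With heavily imbalanced classes, as in vessel segmentation, ACC can stay high even when many vessel pixels are missed, which is why G-mean, Kappa, Dice, and FDR are reported alongside it.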

Implementation Details
We implemented the proposed method with PyTorch on an NVIDIA TITAN GPU and empirically set the number of epochs to 50 for ROSE-1 and 300 for ROSE-2. A stochastic search strategy was used to find the hyperparameters, and the best combination for the model was identified after repeated training runs. We used the Adam optimizer with a learning rate of 0.0006, a batch size of two, and a weight decay of 0.0001. Each kind in ROSE-1 is composed of 30 training images and 9 testing images, while 90 images in ROSE-2 are used for training, and the remaining 22 images are used for testing. During training only, random rotation by an angle between −10° and 10° is applied for data augmentation. The poly learning rate policy with a poly power of 0.9 is adopted for better performance and stable training. It is worth noting that we train TCU-Net in an end-to-end manner with binary cross-entropy loss. To simplify the training process, we utilized the ground truth instead of centerline-level annotation and pixel-level annotation for ROSE-1.
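The poly learning rate policy mentioned above is commonly written as `lr = base_lr * (1 - step / max_steps) ** power`. The sketch below evaluates that formula with the reported base learning rate and power; stepping once per epoch and the function name are assumptions, since the text does not say whether the rate is updated per epoch or per iteration.

```python
def poly_lr(base_lr, step, max_steps, power=0.9):
    """Polynomial ("poly") learning rate decay."""
    return base_lr * (1 - step / max_steps) ** power

base_lr = 0.0006   # learning rate reported in the text
epochs = 50        # ROSE-1 setting reported in the text

# One value per epoch (assumption: the schedule is stepped per epoch).
schedule = [poly_lr(base_lr, epoch, epochs) for epoch in range(epochs)]
for epoch in (0, 10, 25, 49):
    print(epoch, f"{schedule[epoch]:.6f}")
```

The rate starts at the base value and decays smoothly to zero at `max_steps`, slightly slower than a linear ramp because the power is below 1.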

Performance Comparison and Analysis
To comprehensively demonstrate the superiority of the proposed method, we have compared it with many other state-of-the-art segmentation methods: seven CNN-based deep learning approaches, including U-Net [18], ResU-Net [20], CE-Net [40], CS-Net [41], and OCTA-Net [7], and two transformer-based deep learning networks, TransFuse [27] and TransUnet [26]. We report the objective metrics of these methods in Tables 1-4 and subjective results in Figure 5. The network's vascular segmentation ability can be observed by comparing the ground truth with the predicted mask.

Subjective comparisons. Figure 5 compares the resulting images of three advanced vascular segmentation methods, including two networks based on transformers for medical image segmentation. It can be observed that the two transformer networks have several vascular breakpoints in their prediction maps. Meanwhile, OCTA-Net [7] outperforms the other two networks except for our proposed method, but it achieves weak performance in capturing thin vessels due to convolutional limitations. In contrast, the proposed method (TCU-Net) identifies more complete vessels without separate training of coarse and fine vessels and performs a more sensitive and accurate segmentation of capillaries. The SVC and DVC vessel results in ROSE-1 demonstrate that TCU-Net is quite coherent in terms of overall vessels with minimal truncation points, and the results are better than the other three networks' segmentation results, especially on the fine capillaries. Similar results are observed on the ROSE-2 dataset. In the following, we analyze the proposed method's objective metrics.

Table 3. Quantitative results on the ROSE-1 (SVC+DVC) dataset compared to previous SOTA. Columns: AUC (%), ACC (%), G-Mean (%), Kappa (%), Dice (%), FDR (%).
The proposed method outperforms CNN-based SOTA methods by a large margin, and all evaluated metrics show the best performance. Compared to other transformer-based methods, TCU-Net also shows a superior learning ability on the majority of vessels. The performance of the proposed method is consistent with the segmentation results, demonstrating strong connectivity and integrity in both coarse and fine vessels.
Results of the DVC dataset in ROSE-1. For the DVC images, the ground truth contains only the intermediate fine vessels. As shown in Table 2, our method also achieves the best performance in fine vessel segmentation. All objective metrics are higher than those of the latest methods; in particular, the mean AUC reaches 98.23%, an improvement of 1.41%, with a reduction of about 17.73% in FDR compared to OCTA-Net. This result shows that TCU-Net is more sensitive to capillaries than other methods.
Results of the SVC+DVC dataset in ROSE-1. Each image of this dataset contains both SVC and DVC vascular maps. We repeated the experiments for U-Net and its variants several times. The results, shown in Table 3, indicate that TCU-Net achieves state-of-the-art performance. Specifically, compared to CS-Net, the proposed network improved the AUC, ACC, and Kappa by 0.21%, 0.3%, and 0.68%, respectively, and reduced the FDR by 2.22%. Tables 3 and 5 show that the proposed method not only outperforms the two transformer frameworks but also effectively reduces the computational complexity of the original transformer model and the number of parameters of the model.

Results of the ROSE-2 dataset. The difference between ROSE-2 and ROSE-1 is that ROSE-2 has a higher resolution of 512 × 512 pixels. Due to the larger image size, training on this dataset converges more slowly than on ROSE-1. Therefore, TCU-Net needs 300 epochs of training on this dataset to obtain the best value. As shown in Table 4, the proposed method achieves the best results on the AUC, ACC, G-mean, and Kappa. This result demonstrates that, with the introduction of a self-attention mechanism, TCU-Net is equally suited to high-resolution fundus image segmentation.

Ablation Studies
In this section, we conduct ablation studies to assess the effectiveness of the proposed method. Experiments are conducted to evaluate the effectiveness of the proposed branched design under different attention combination schemes. All experiments use the ROSE dataset, and the results are reported below.
Ablation for the proposed modules. To perform a thorough evaluation of the ECT module and the ECCA module, we added each component to U-Net; the results of applying each component to the original scheme are shown in Tables 6-9 for the SVC, DVC, SVC+DVC, and ROSE-2 datasets, respectively. The performance on all datasets is improved by both the ECT module and the ECCA module. Specifically, the efficient cross-fusion transformer module successfully fuses multiscale features, leading to significant performance improvements and preventing information loss from the encoder. Furthermore, the ECCA module enhances performance by establishing an effective connection to the decoder features, thereby reducing ambiguity. Note that both types of attention are crucial, and the 'Base+ECT+ECCA' approach achieves the best values on all metrics, driving the performance of retinal vessel segmentation.
Ablation for the projection and size reduction in efficient self-attention. Figures 6 and 7 show the comparison of the Dice scores when the dimensions H and W of the feature map are reduced to 1/16, 1/8, and 1/4 of the original size. Interpolation downsampling is slightly better than max pooling, and the best results are obtained by reducing the size to 1/4 and 1/16 for the ROSE-1 and ROSE-2 datasets, respectively. In addition, we compare the model size and the number of floating-point operations. As shown in Table 5, the proposed model has a substantial reduction in the number of parameters compared to the other transformer models, along with a significant performance improvement. This indicates that the proposed model shows superiority in vessel segmentation.

Table 7. Ablation studies on ROSE-1 (DVC) dataset.

Conclusions
In this paper, we present a novel strategy to combine a transformer and U-Net for retinal vessel segmentation. Transformers are known as architectures with strong innate self-attention mechanisms. To enhance the effective vascular information, we propose an ECCA module to fuse the ECT module features with the decoder features. The proposed approach has a lower memory occupation and computational complexity than other transformer-based models [26,27], without pre-training. Nevertheless, it is crucial to emphasize that the clinical application of TCU-Net should be carefully evaluated by medical professionals due to potential variations in real images, such as illumination, shooting angles, and lesion areas. The proposed TCU-Net architecture achieves state-of-the-art performance for ROSE-1 and ROSE-2 on SVC and DVC datasets. Future research could further explore and improve this approach to address potential biases in clinical practice and facilitate the model's widespread use in clinical applications.