HDB-Net: hierarchical dual-branch network for retinal layer segmentation in diseased OCT images

Optical coherence tomography (OCT) retinal layer segmentation is a critical step in modern ophthalmic practice and supports the diagnosis and treatment of diseases such as diabetic macular edema (DME) and multiple sclerosis (MS). Because of low OCT image quality, highly similar retinal interlayer morphology, and the uncertain presence, shape, and size of lesions, existing algorithms do not perform well. In this work, we design HDB-Net, a network for retinal layer segmentation in diseased OCT images that addresses these problems by combining global and detailed features. First, the proposed network uses a Swin Transformer and Res50 as a parallel backbone, combined with the pyramid structure in UperNet, to extract global context and aggregate multi-scale information from images. Second, a feature aggregation module (FAM) is designed to extract global context information from the Swin Transformer and local feature information from ResNet by introducing a mixed attention mechanism. Finally, the boundary awareness and feature enhancement module (BA-FEM) is used to extract retinal layer boundary information and topological order from the low-resolution features of the shallow layers. Our approach has been validated on two public datasets, achieving Dice scores of 87.61% and 92.44%, respectively, both outperforming other state-of-the-art methods.


Introduction
Optical coherence tomography (OCT) is a non-contact imaging method capable of detecting photons backscattered from tissue with high sensitivity and micron-scale resolution [1]. The non-destructive acquisition of OCT images is the gold standard in the diagnosis of various chronic eye and optic nerve diseases [2]. Many diseases may change the morphology of retinal tissue. Segmenting and measuring the relevant retinal layers allows effective identification of lesion types, tracking of disease progression, and lesion localization, as well as surgical area selection and path planning for subsequent treatment that may be needed [3].
As shown in Fig. 1, the segmentation and measurement of retinal layers in diseased OCT images remains a major challenge for the following reasons. First, speckle noise and motion artifacts reduce the imaging quality of OCT images, and the small imaging depth makes the refractive index difference insignificant [4]. These lead to low contrast between adjacent retinal tissue layers and increase the difficulty of defining layer boundaries. Second, diseases such as diabetic macular edema (DME) and multiple sclerosis (MS) change the morphology of retinal tissue [5,6], and the size, shape, and location of cysts, subretinal fluid (SRF), or edema caused by lesions also vary pathologically and between individuals [7]. Finally, the interlayer morphology of the retina is highly similar [8].
With the rapid development of convolutional neural networks (CNNs), their applications have gradually extended into semantic segmentation and medical image processing [9], with the fully convolutional network (FCN) [10] and U-Net [11] as typical representatives. Owing to its simple structure suited to high-resolution OCT images, combined with data augmentation training strategies that require few annotated images, U-Net and its variants have been widely used in retinal layer segmentation [12,13].
However, despite the advantages of CNN-based networks in spatial localization, it is difficult for them to directly model global semantic interactions and context information because of the limitations of convolutional operations [14]. In addition, the feature downsampling used in CNNs to reduce computation easily causes the loss of small-scale features [15,16]. A common way to address this is to introduce an attention mechanism or a multi-scale feature fusion strategy. In segmentation tasks, the attention mechanism can be regarded as a dynamic selection process that adaptively assigns different weights to different channels or spatial regions according to the importance of the input, influencing the final segmentation result through these weights [17]. For multi-scale feature fusion, PSPNet [18] and its enhanced version UperNet [19] aggregate context information through a pyramid pooling module. All of the above methods aggregate global information from local features obtained by a CNN rather than directly encoding the global context, a limitation inherent to convolution operations.
The emergence of the Transformer opened up new research directions for this situation. The Transformer [20] was originally a sequence prediction model for natural language processing. It uses only attention mechanisms to carry out machine translation, focusing on global dependencies in the data, and has achieved good results. A series of subsequent works, such as ViT [21] and DETR [22], have shown that the Transformer also has great potential in the image domain. With enough pre-training data, ViT outperforms CNNs and overcomes the Transformer's lack of inductive bias, achieving better transfer performance in segmentation tasks. On this basis, the Swin Transformer [23] was built and showed great potential in several dense prediction tasks. Cao et al. [24] and Lin et al. [25] developed the Swin-UNet and DS-TransUNet U-shaped encoder-decoder frameworks, respectively, for semantic segmentation of medical images based on the Swin Transformer. However, these networks ignore the fact that Transformers require large amounts of pre-training data, and they do not design fine boundary segmentation and feature aggregation modules for the diseased retinal layer. As a result, their full potential cannot be realized.
Our network design draws on methods for processing small-scale objects in remote sensing (RS) images [26]. We propose HDB-Net, a new OCT retinal layer segmentation network, by introducing the Transformer and a hybrid attention mechanism, which excel at global information processing, and by collecting the detail information generated during feature extraction in the shallow network. The framework combines the advantages of the Swin Transformer and CNN for retinal layer and lesion segmentation. A specially designed feature aggregation module (FAM) fuses the global and local features obtained from the dual encoder to strengthen the network's performance for SRF segmentation. In addition, a new boundary awareness and feature enhancement module (BA-FEM) is added in the skip connections to enhance the extraction of retinal layer edges in the shallow network, improving the network's ability to accurately segment and measure the thickness of each retinal layer. The decoder leverages the pyramid pooling module of UperNet, which provides multi-scale feature fusion, to aggregate global information from different feature maps.
The main contributions of this paper are as follows: 1) We propose HDB-Net, a network for segmenting diseased OCT retinal layers, which is composed of a two-branch encoder extracting global and local features and employs FAM and BA-FEM to handle lesions whose presence, location, and size are uncertain.
2) We design FAM to extract global context information from the Swin Transformer while preserving local feature details from ResNet. The hybrid attention mechanism is introduced to compensate for the inherent limitations of the Swin Transformer's window mechanism, improving network segmentation performance for lesions.
3) We integrate BA-FEM into the skip connections to extract retinal boundary details and topological order from low-resolution features, addressing prevalent challenges such as imprecise retinal boundary demarcation and incorrect physiological ordering of the retinal layers from innermost to outermost.

4) We propose a new standard for retinal thickness measurement and error evaluation and, on this basis, complete the relevant experiments. The criteria and associated experimental data may contribute to the automated diagnosis of MS in the future.

Feature extraction
Early algorithms for OCT retinal layer segmentation [27,28] primarily adhered to conventional image processing techniques. However, these segmentation approaches frequently do not consider regions with SRF. Consequently, their efficacy was compromised when processing severely deformed retinal images from pathological conditions, limiting their widespread adoption.
With the wide application of CNNs and encoder-decoder architectures in medical segmentation, similar network structures have also been applied to retinal layer segmentation. Liu et al. [4] developed an EfficientNet-B5 architecture for detecting retinal thickening and intraretinal fluid (IRF) in OCT images; with reference to retinal thickness and SRF presence, DME can then be detected from two-dimensional color fundus photographs. Wang et al. [12] designed a dual-task framework for retinal layer segmentation that leveraged low-level outputs for boundary discernment and high-level outputs for layer-wise segmentation. Xing et al. [13] proposed a new network based on the FCN framework, including attention gates and a spatial pyramid pooling module, which can better extract multi-scale objects in OCT images.
To make up for the shortcomings of CNN convolution operations in global modeling, we introduce the Transformer, with its stronger generalization and context extraction ability, to obtain a global receptive field. Dosovitskiy et al. applied the Transformer to image classification, and the ViT network [21] showed that a pure Transformer-based network can achieve excellent performance in image recognition. DeiT [29] proposed a new token-based knowledge distillation strategy that enables ViT to perform well on the smaller ImageNet-1K dataset. However, for dense prediction tasks, ViT still fails to achieve good performance due to its huge training cost. Therefore, Liu et al. proposed the Swin Transformer [23]. This design incorporates a shifted window strategy that constrains multi-head self-attention to non-overlapping windows while still permitting inter-window communication, and has linear computational complexity. However, compared with natural image segmentation, medical imaging has fewer training samples, and in this case the Transformer's performance often lags behind traditional CNNs because of its weaker inductive bias [30].
With the wide application of the Transformer and Swin Transformer in medical segmentation, many dual-encoder networks combining them with CNNs have been developed. Li et al. [31] combined CNN and Transformer to design X-Net, a dual-encoder network for medical image segmentation, and added a variational auto-encoder branch in the decoder to reduce the impact of insufficient medical image data. Hu et al. [32] proposed ERTN, an efficient R-Transformer dual-encoder network for accurate segmentation of gliomas from 3D brain magnetic resonance images. Lewis et al. [33] proposed PSNet, a dual encoder-decoder segmentation network with multiple deep learning modules, to address the polyp segmentation problem in colorectal images. Hong et al. [34] proposed a dual-encoder network based on Transformer-CNN for multi-organ segmentation. Lesions may appear in OCT retinal images, and their shapes and sizes are not fixed. Therefore, we introduce a hybrid attention mechanism on top of the above dual-encoder designs to achieve the fusion of multi-scale features.

Feature aggregation
Aggregating features according to different convolutional blocks or context embeddings has been shown to be an effective way to enhance the representation of semantic segmentation features. Gao et al. [35] proposed a feature aggregation bilateral symmetric network with separate semantic-information and spatial-detail branches, with a dedicated module to realize the feature aggregation. Wu et al. [36] proposed a scale- and context-sensitive network (SCS-Net) for retinal vessel segmentation; an adaptive feature fusion module guides the efficient fusion of features from adjacent layers to capture more semantic information. Song [37] proposed a two-stage segmentation network (TSFM-Net) with feature aggregation and a multi-level attention mechanism to address the problems of unbalanced sample size and multi-target interference in heart images. Ye et al. [38] proposed a lateral ventricle segmentation CNN based on adaptive multi-scale feature aggregation and boundary perception (MB-Net), which effectively handles differing target regions in magnetic resonance images.
However, while the encoder-decoder frameworks described above can capture multi-scale context information efficiently, skip connections only transmit single-scale features to the decoder, which can result in limited feature representation. Besides, the transmission of features is carried out essentially without information filtering. Therefore, Huang et al. [39] proposed the AlignSeg network to address the misalignment caused by successive downsampling operations and the indiscriminate fusion of context information. Su et al. [40] proposed a new feature enhancement network that uses feature propagation enhancement and feature aggregation enhancement modules to achieve more efficient feature fusion and multi-scale feature propagation. Compared with other medical images, OCT retinal images show highly similar retinal interlayer morphology, and the presence, shape, and size of lesions are uncertain. For this purpose, we propose HDB-Net, a dual-encoder network with FAM and BA-FEM modules. The specially designed FAM greatly improves the segmentation accuracy of SRF in DME, and the modified BA-FEM improves the accuracy of retinal inter-layer thickness measurement.

Methods
In this section, we first introduce the overall structure of the proposed HDB-Net and describe the parallel Swin Transformer and Res50 dual encoders involved. Then the two important modules in HDB-Net, FAM and BA-FEM, are introduced respectively.

Network structure
Figure 2 illustrates the HDB-Net architecture. Leveraging the success of hierarchical networks in medical segmentation, we employ an encoder-decoder design. In the encoder, we use a four-stage parallel dual-encoder architecture of a Swin Transformer and Res50. The Swin Transformer branch on the left builds hierarchical feature maps by merging image patches (grey grids) in deeper stages. Because of the characteristics of the Transformer, there is no difference in distance between the grey grids during feature extraction, so the overall global information of the image is easily obtained. In addition, the computational complexity is linear in the input image size because self-attention is computed only within each local window (red image block). The ResNet branch on the right builds feature maps layer by layer through convolution and pooling operations; the features extracted by this branch retain the original spatial information of the image. The features extracted by the two branches are screened and fused by a specially designed FAM to obtain retinal layer image features that integrate the respective advantages of the Transformer and CNN. Then BA-FEM collects the retinal layer boundary information and topological order filtered out during shallow feature extraction, combines them with the image features output by FAM, and passes the result to the decoder. The decoder uses UperNet with an added attention mechanism to aggregate the feature information of the four stages for the final retinal layer segmentation.
Note that the boundary awareness and feature enhancement module (BA-FEM) can be further divided into two submodules: boundary awareness (BA) and feature enhancement (FE). The BA module has a fixed structure and is only used in the first two stages of the network; its function is to extract, from the shallow network, the retinal layer boundary information lost in the downsampling operations. The FE module is only applied in the first three stages (stage i = 0, 1, 2), and the depth of its hierarchical structure decreases as the corresponding stage i increases. The FE module can be regarded as a U-Net with adaptive depth, which extracts and aggregates features of different scales within a stage using a lightweight design. The deepest stage (stage i = 3) still uses the original PPM layer in UperNet. The PPM, or pyramid pooling module, fuses features at four different scales by using pooling kernels of different sizes, as sketched below.
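To make the decoder's multi-scale aggregation concrete, the following is a minimal PyTorch sketch of a PSPNet/UperNet-style pyramid pooling module. The bin sizes (1, 2, 3, 6) and the branch channel width are assumptions borrowed from the original PSPNet design; the text above only states that four pooling scales are fused.

```python
# A minimal sketch of a pyramid pooling module (PPM) as used in PSPNet/UperNet.
# The bin sizes (1, 2, 3, 6) and branch_ch are assumptions, not values from the paper.
import torch
import torch.nn as nn
import torch.nn.functional as F

class PPM(nn.Module):
    def __init__(self, in_ch, branch_ch=128, bins=(1, 2, 3, 6)):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Sequential(
                nn.AdaptiveAvgPool2d(b),          # pool the map to a b x b grid
                nn.Conv2d(in_ch, branch_ch, 1),   # reduce channels per branch
                nn.BatchNorm2d(branch_ch),
                nn.ReLU(inplace=True),
            ) for b in bins
        ])
        self.fuse = nn.Sequential(                # merge input + all pooled branches
            nn.Conv2d(in_ch + branch_ch * len(bins), in_ch, 3, padding=1),
            nn.BatchNorm2d(in_ch),
            nn.ReLU(inplace=True),
        )

    def forward(self, x):
        h, w = x.shape[-2:]
        feats = [x]
        for branch in self.branches:
            y = branch(x)
            # upsample each pooled branch back to the input resolution
            feats.append(F.interpolate(y, size=(h, w), mode="bilinear",
                                       align_corners=False))
        return self.fuse(torch.cat(feats, dim=1))
```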
For an input OCT retinal layer image X ∈ R^(H×W×1) (grayscale, so the channel number is 1), the Swin Transformer splits it into non-overlapping patches, corresponding to the tokens used by Transformers in NLP. With patch size S, the patch tensor is X_P ∈ R^((H/S)×(W/S)×S²). A linear embedding flattens these patches and projects them to the dimension C_1 accepted by the Swin Transformer, giving X_L ∈ R^((H/S)×(W/S)×C_1). X_L is the input to the four stages. The first three stages are composed of different numbers of Swin Transformer blocks and patch merging layers, while the last stage has only Swin Transformer blocks. Each block consists of two parts in series: window-based multi-head self-attention (W-MSA) and shifted-window MSA (SW-MSA). This process is similar to the convolution operation in a CNN in that self-attention is computed between the pixels inside each shifted window, and it does not change the size of the input and output. The patch merging layer is similar to the downsampling operation in a CNN and is used to reduce resolution and increase dimension: each patch merging layer halves the resolution and doubles the dimension. Therefore, the outputs of the four stages are S_1 ∈ R^((H/S)×(W/S)×C_1), S_2 ∈ R^((H/2S)×(W/2S)×2C_1), S_3 ∈ R^((H/4S)×(W/4S)×4C_1), and S_4 ∈ R^((H/8S)×(W/8S)×8C_1). In the network design, S = 4 and C_1 = 96.
The retinal layer image X ∈ R^(H×W×1) input to Res50 first passes through a convolution layer that halves its resolution and increases its dimension to C_2, giving X_c ∈ R^((H/2)×(W/2)×C_2). X_c then goes through four residual stages. In each stage, a max pooling layer halves the resolution and doubles the dimension, after which the feature passes through varying numbers of residual blocks (Bottlenecks). In a Bottleneck, the feature dimension is expanded to four times at the output and restored at the input of the next Bottleneck. The output feature maps of the four stages, R_1 to R_4, therefore match the resolutions of the corresponding Swin Transformer stages. Accordingly, the FAM is designed to unify the number of channels and fuse the features S_i and R_i extracted by the two branches to obtain features F_i. Next, the feature information is integrated and passed to BA-FEM and the pyramid pooling module (PPM) to obtain features B_i and P. Finally, these features are passed to the decoder through skip connections. The decoder uses UperNet as the overall architecture. For the transmitted features, bilinear interpolation upsampling is applied three times to expand the resolution, and the feature information passed from the corresponding encoder stage through skip connections is added to recover spatial information. After that, bilinear interpolation is used to unify the resolution of the feature maps from the four decoder stages to X_f ∈ R^((H/4)×(W/4)×512), and a convolutional layer merges the four X_f and reduces their channel count to 512. Finally, the feature X_f is upsampled using deconvolution and bilinear interpolation to obtain the final segmentation mask.
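As a quick sanity check of the stage sizes implied above, the following snippet prints the Swin-branch feature-map shapes for a 512 × 512 input with S = 4 and C_1 = 96.

```python
# Per-stage Swin-branch feature-map sizes implied by the text, for an H x W input
# with patch size S = 4 and embedding dimension C1 = 96.
H, W, S, C1 = 512, 512, 4, 96

for i in range(4):                      # stages 1..4
    scale = S * (2 ** i)                # patch merging halves resolution per stage
    shape = (H // scale, W // scale, C1 * (2 ** i))
    print(f"S{i + 1}: {shape}")
# S1: (128, 128, 96)   S2: (64, 64, 192)   S3: (32, 32, 384)   S4: (16, 16, 768)
```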

Feature aggregation module
CNNs are the mainstream networks in medical image segmentation; they efficiently extract image details while preserving the semantic correlation of adjacent pixels, enabling the recognition and segmentation of different objects [41]. The Swin Transformer has a global receptive field that can efficiently extract contextual information and global features from images. The two networks therefore have complementary strengths. After obtaining the output features S_i and R_i from the Swin Transformer and Res50 dual encoder, the remaining problem is how to merge them effectively into multi-scale feature maps for subsequent decoding and segmentation. The direct approach is simply to add them element-wise or concatenate the multi-scale features and then perform a convolution. However, this cannot highlight the respective advantages of the two networks, nor can it capture long-range dependencies and global context information between features at different scales. Therefore, we propose a new FAM, which introduces channel and spatial attention mechanisms to efficiently integrate the feature information extracted by the two networks.
The overall structure of FAM is shown in Fig. 3 (where c_1 = c). S_i and R_i are the outputs of stage i (i = 1, 2, 3, 4) of the Swin Transformer and Res50, respectively. For the Swin Transformer, its global receptive field means that S_i has rich context information in the spatial dimension but lacks a measure of significance between different channels. Therefore, global max pooling and average pooling are applied to S_i over the spatial dimension to compress the global spatial information, yielding two matrices of dimension 1 × 1 × c. The channel characteristics are then learned through multi-layer perceptrons (MLPs) to form the importance of each channel. The two matrices are added and passed through a sigmoid function to obtain the weight matrix c_i representing the importance of each channel. Finally, the input feature map S_i is multiplied by the channel weight matrix c_i to obtain the stage-i feature Sc_i processed by the channel attention module. At this point, Sc_i carries both the spatial self-attention from the Transformer structure and the channel attention obtained above. The whole process can be expressed as

c_i = σ(f(maxpool_c(S_i)) + f(avgpool_c(S_i))), Sc_i = c_i ⊗ S_i,

where σ represents the sigmoid function, ⊗ denotes element-wise multiplication, f(x) represents the MLP layers in which the number of channels is divided by four to reduce computational complexity, and maxpool_c and avgpool_c represent the global max pooling and average pooling that produce the channel descriptors.

For Res50, the height and width of S_i and R_i are the same at each stage but the number of channels c differs, so the Res50 output is first reduced to the same dimension by a convolution, giving R′_i. The CNN-based Res50 uses convolution operations to extract features. Compared with the Transformer, which focuses on global information and self-attention, the CNN better preserves the semantic information and detailed features of adjacent pixels, and its built-in positional encoding of pixels is an inductive bias that significantly reduces training cost [30]. Global max pooling and average pooling are applied to R′_i over the channel dimension, compressing the number of channels to 1 and yielding two matrices of dimension h × w × 1. They are concatenated along the channel dimension, spatial features are extracted with a 3 × 3 convolution, and a sigmoid function yields the weight matrix s_i representing the importance of different spatial positions across all channels. Finally, the dimension-reduced feature map R′_i is multiplied by the spatial weight matrix s_i to obtain the Res50 stage-i feature Ss_i processed by the spatial attention module. This process can be expressed as

s_i = σ(g(concat(maxpool_s(R′_i), avgpool_s(R′_i)))), Ss_i = s_i ⊗ R′_i,

where σ represents the sigmoid operation, g(x) represents the 3 × 3 convolution compression, and maxpool_s and avgpool_s represent the global max pooling and average pooling over the channel dimension that produce the spatial descriptors.
In the above process, c_i and s_i capture the features of the Swin Transformer and Res50 in the channel and spatial dimensions, respectively. To locate significant and representative areas across the whole feature map, we take the channel dependency c_i extracted from the global features of the Swin Transformer and apply it to the local features from Res50 to obtain the interaction feature Sm_i, establishing an effective interaction between the dual encoders and multi-scale features. Finally, Sc_i, Ss_i, and Sm_i are concatenated along the channel dimension and compressed to dimension h × w × c to obtain the FAM output feature F_i. The overall process is

Sm_i = c_i ⊗ R′_i, F_i = C(concat(Sc_i, Ss_i, Sm_i)),

where C(x) represents concatenation followed by 3 × 3 convolution compression.
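The following is a minimal PyTorch sketch of a FAM built from the description above: channel attention on the Swin feature, spatial attention on the channel-reduced ResNet feature, an interaction term that applies the channel weights to the ResNet feature, and concatenation followed by a 3 × 3 convolution. The exact layer sizes (e.g., the reduction ratio inside the MLP) are assumptions.

```python
import torch
import torch.nn as nn

class FAM(nn.Module):
    """Sketch of the feature aggregation module described in the text.

    s: Swin feature (B, c, h, w); r: ResNet feature (B, c_r, h, w).
    Layer sizes are assumptions; only the overall structure follows the paper.
    """
    def __init__(self, c, c_r):
        super().__init__()
        self.reduce = nn.Conv2d(c_r, c, 1)            # align ResNet channels to c
        self.mlp = nn.Sequential(                     # shared MLP, channels / 4
            nn.Conv2d(c, c // 4, 1), nn.ReLU(inplace=True), nn.Conv2d(c // 4, c, 1))
        self.spatial = nn.Conv2d(2, 1, 3, padding=1)  # g(x): 3x3 conv on pooled maps
        self.fuse = nn.Conv2d(3 * c, c, 3, padding=1) # C(x): concat + 3x3 conv

    def forward(self, s, r):
        r = self.reduce(r)
        # channel attention on the Swin feature (global max + avg over space)
        ci = torch.sigmoid(self.mlp(s.amax(dim=(2, 3), keepdim=True))
                           + self.mlp(s.mean(dim=(2, 3), keepdim=True)))
        sc = ci * s
        # spatial attention on the ResNet feature (max + avg over channels)
        pooled = torch.cat([r.amax(dim=1, keepdim=True),
                            r.mean(dim=1, keepdim=True)], dim=1)
        si = torch.sigmoid(self.spatial(pooled))
        ss = si * r
        # interaction: channel weights from the Swin branch applied to ResNet
        sm = ci * r
        return self.fuse(torch.cat([sc, ss, sm], dim=1))
```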

Boundary awareness and feature enhancement module
In OCT retinal layer segmentation, the accuracy of segmenting each retinal layer and the SRF edge largely determines the final result. For MS diagnosis in particular, accurate retinal layer segmentation is an important prerequisite for measuring retinal layer thickness. In general, low-level features in hierarchical networks contain more spatial detail and are used to restore target boundaries [42], and boundary information can be obtained by subtracting a downsampled feature from the original [9]. Inspired by this, we propose BA-FEM, shown in Fig. 2. We design a boundary awareness (BA) module to efficiently extract retinal layer boundary information in the shallow layers. Ensuring the correct topological order requires the assistance of global context information; however, because of their small receptive field, shallow layers can only capture relationships between pixels in a small area, and the extracted features lack global context. Therefore, we propose a feature enhancement (FE) module with a U-Net-like hierarchy, which uses downsampling to enlarge the receptive field and extracts and aggregates multi-scale features within the same stage. The two modules are integrated into the BA-FEM, which serves as part of the skip connection. This allows features from the encoder to be channeled to the decoder while effectively extracting edge details and maintaining the topological order of the retinal layers from the low-resolution features of the backbone.
The internal structure of BA-FEM is shown in Fig. 4. F_i and F_{i+1} (i = 0, 1) are the FAM-aggregated outputs of the current stage and the next stage of the shallow network, respectively. First, the feature F_{i+1} from stage i + 1 is upsampled by bilinear interpolation to the same size as F_i, giving F′_{i+1}. F′_{i+1} then passes through a spatial transformer network (STN) to obtain the most prominent feature F_{s,i+1} in the downsampled feature map and to align it with the current-stage feature F_i, using spatial invariance to ensure the accuracy and directionality of the boundary information extracted in the subsequent subtraction. The STN comprises three steps: a localization net, a grid generator, and a sampler, realizing parameter prediction, coordinate mapping, and pixel sampling, respectively [43]. In this module, the STN mainly provides deeper feature extraction and spatial invariance. F_{s,i+1} is then subtracted from the current-stage feature F_i to obtain the BA output F_{b,i+1}, which is the retinal layer boundary information extracted from the shallow layer. This process can be expressed as

F_{b,i+1} = F_i − S(B(F_{i+1})),

where S stands for the STN and B stands for bilinear interpolation upsampling.

The FE module is a hierarchical feature enhancement module that adaptively adjusts its depth according to the current network depth. Compared with modules of similar function in other networks [9,44], its ability to adapt its depth to the current stage keeps the overall depth of each shallow path the same; that is, it maximizes the receptive field of the shallow network while reducing the computational cost as much as possible. Its depth-adaptive design also lets it adapt quickly to networks of different types or depths, giving great flexibility. The FE module has a U-Net-like downsampling and upsampling structure connected by skip connections. The initial depth is three layers and decreases as the external stage index i increases: when i = 0, j = 2; when i = 1, j = 1; and when i = 2, j = 0. The input and output of each layer are denoted U_j and U′_j, respectively. The output U′_j of each layer is obtained by concatenating the current input U_j with the upsampled feature from the next layer and compressing, so that the feature size remains consistent throughout. The outputs of the BA and FE modules are then concatenated to produce the overall BA-FEM output, which is forwarded to the decoder. The overall process is

U′_j = C(concat(U_j, B(U′_{j+1}))), B_i = C(concat(F_{b,i+1}, F_FE)),

where C(x) stands for concatenation followed by 3 × 3 convolution compression and F_FE denotes the FE module output.
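The following is a minimal PyTorch sketch of the BA branch only, computing F_{b,i+1} = F_i − S(B(F_{i+1})). The layout of the STN localization network (a small convolutional head predicting a 2 × 3 affine matrix) is an assumption; the text only specifies that the STN performs parameter prediction, grid generation, and sampling.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BoundaryAwareness(nn.Module):
    """Sketch of the BA branch: F_b = F_i - STN(upsample(F_{i+1})).

    The localization net below (conv + pooling + linear head predicting a 2x3
    affine matrix) is an assumption, not the paper's exact design.
    """
    def __init__(self, c_cur, c_next):
        super().__init__()
        self.proj = nn.Conv2d(c_next, c_cur, 1)        # match channel counts
        self.loc = nn.Sequential(                      # localization net
            nn.Conv2d(c_cur, 16, 3, padding=1), nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(16, 6))
        # initialize to the identity transform so training starts stable
        self.loc[-1].weight.data.zero_()
        self.loc[-1].bias.data.copy_(torch.tensor([1., 0., 0., 0., 1., 0.]))

    def forward(self, f_cur, f_next):
        up = F.interpolate(self.proj(f_next), size=f_cur.shape[-2:],
                           mode="bilinear", align_corners=False)
        theta = self.loc(up).view(-1, 2, 3)            # predicted affine params
        grid = F.affine_grid(theta, up.size(), align_corners=False)
        aligned = F.grid_sample(up, grid, align_corners=False)
        return f_cur - aligned                         # boundary response F_b
```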

Loss function
The most commonly used loss function in semantic segmentation is the pixel-level cross-entropy loss L_CE, which examines each pixel individually and compares the predicted class probability vector of each pixel with the label vector of the annotated image. However, when the dataset is extremely unbalanced, the model easily falls into a local optimum in which the prediction is biased toward the background. To address this, we employ the weighted cross-entropy loss L_WCE, assigning a reduced weight to background pixels (W_C = 0.1 in this study). This ensures that the network prioritizes the retinal layers during training.
In addition, the region-based Dice loss can further address the challenge of unbalanced classes. The Dice loss L_Dice is derived from the most commonly used evaluation index in medical image segmentation; it measures the overlapping area between the predicted and actual values, reflecting the consistency of the segmentation with the actual size and position. In summary, our model is optimized with a joint loss L that combines the weighted cross-entropy and Dice losses with a weighting factor β = 3, a value chosen empirically from the best performance in our experiments.
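A sketch of the joint loss, assuming the simple combination L = L_WCE + β·L_Dice with β = 3 and a background weight of 0.1; since the exact combination formula is not given above, this form is an assumption.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class JointLoss(nn.Module):
    """Weighted cross-entropy + Dice loss.

    Assumes L = L_WCE + beta * L_Dice; the paper gives beta = 3 and a background
    weight of 0.1, but the exact combination formula is an assumption here.
    """
    def __init__(self, num_classes, beta=3.0, bg_weight=0.1, eps=1e-6):
        super().__init__()
        w = torch.ones(num_classes)
        w[0] = bg_weight                      # class 0 assumed to be background
        self.register_buffer("class_weights", w)
        self.beta, self.num_classes, self.eps = beta, num_classes, eps

    def forward(self, logits, target):
        # logits: (B, C, H, W); target: (B, H, W) with integer class labels
        wce = F.cross_entropy(logits, target, weight=self.class_weights)
        probs = logits.softmax(dim=1)
        onehot = F.one_hot(target, self.num_classes).permute(0, 3, 1, 2).float()
        inter = (probs * onehot).sum(dim=(0, 2, 3))
        union = probs.sum(dim=(0, 2, 3)) + onehot.sum(dim=(0, 2, 3))
        dice = (2 * inter + self.eps) / (union + self.eps)   # per-class Dice
        return wce + self.beta * (1.0 - dice.mean())
```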

Datasets
The proposed HDB-Net method was examined on two public datasets: Duke DME [45] and HCMS [46].

Duke DME:
The dataset consists of 110 annotated B-scan images from 10 patients with diabetic macular edema (DME; 11 B-scans per patient), with an original size of 512 × 740 pixels. All images were annotated by clinical experts with eight retinal layers and the SRF region, and the size of the annotated area is about 536 × 496 pixels. To facilitate subsequent preprocessing, the images were uniformly resized to 512 × 512 pixels. In the labeled images, each pixel is classified into one of the following 10 categories: vitreous (background 1), RNFL, GCIP, INL, OPL, ONL, IS, OS + RPE, choroid (background 2), and SRF. Figure 5(a) shows an example of annotation. In the experiment, the 110 images were split into a training set, a validation set, and a test set on a patient basis in a ratio of 6:2:2, containing both normal and pathological images.

HCMS:
The HCMS dataset consists of 35 volumes of right-eye SD-OCT images from 35 participants, captured with the Heidelberg Spectralis OCT; 14 participants were in the healthy cohort and 21 were in the multiple sclerosis cohort. Each volume consists of 49 B-scan images, giving 1715 B-scans in total, each of size 496 × 1024 pixels. To facilitate preprocessing, each B-scan was standardized to 512 × 512 pixels. Each image was labeled with nine retinal layers by a clinical expert, and each pixel is classified into one of the following 9 categories: background, RNFL, GCL + IPL, INL, OPL, ONL, IS, OS, and RPE. Figure 5(b) illustrates an example of annotation. In the experiment, the 1715 images were split into a training set, a validation set, and a test set on a patient basis in a ratio of 6:2:2, containing both normal and pathological images.

Implementation
Training Setting: Our network is built on the PyTorch framework. The AdamW optimizer [47] with default parameters is used to train the model. The initial learning rate is 0.00018, and a hybrid "Linear" + "Step" schedule is used to achieve optimal segmentation accuracy: the linear strategy is used for the first 10 epochs and the step strategy for the last 40 epochs. The batch size is 8, the maximum achievable within the limitations of the GPU used, to minimize training time without affecting segmentation performance. To improve training speed and prevent overfitting, the maximum number of training epochs for both datasets is set to 50; this value is referenced from BAU-Net [12] and verified repeatedly by plotting loss and Dice curves. In addition, to obtain faster convergence, the learning rate of the decode head is ten times that of the backbone during training. All experiments were run on an NVIDIA GeForce RTX 3090 Ti GPU. A sketch of this setup is given below.
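The following sketch reproduces this recipe in PyTorch. The submodule names (backbone, decode_head) and the step-decay factor and interval are assumptions; only the optimizer choice, initial learning rate, linear-then-step schedule, 10× head learning rate, batch size, and epoch count follow the text.

```python
import torch
from torch.optim.lr_scheduler import LinearLR, StepLR, SequentialLR

# Hypothetical model with .backbone and .decode_head submodules; these names
# and the step-decay settings are assumptions, not the paper's exact values.
def build_optimizer_and_scheduler(model, base_lr=1.8e-4):
    params = [
        {"params": model.backbone.parameters(), "lr": base_lr},
        {"params": model.decode_head.parameters(), "lr": base_lr * 10},  # head lr = 10x
    ]
    optimizer = torch.optim.AdamW(params, lr=base_lr, weight_decay=0.01)
    warmup = LinearLR(optimizer, start_factor=0.1, end_factor=1.0, total_iters=10)
    decay = StepLR(optimizer, step_size=15, gamma=0.5)   # assumed decay settings
    scheduler = SequentialLR(optimizer, [warmup, decay], milestones=[10])
    return optimizer, scheduler

# Epoch loop (scheduler stepped once per epoch, 50 epochs, batch size 8):
# for epoch in range(50):
#     train_one_epoch(model, loader, optimizer)
#     scheduler.step()
```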
Evaluation Index: We use the mean Dice (MDice) and mean intersection over union (MIoU) scores to evaluate the overlap between the model's predicted retinal layers and the ground-truth values. The mean absolute difference (MAD) is used to evaluate the accuracy of the model's predicted boundaries.
The Dice coefficient, derived from the binary classification problem, measures the overlap of two samples. It is a set similarity measure ranging from 0 to 1, where 1 represents complete overlap:

Dice = 2|A ∩ B| / (|A| + |B|),

where |·| and ∩ denote set size and intersection, respectively, and A and B are the predicted and ground-truth sets. The IoU score is the ratio of the intersection to the union of the ground-truth and predicted sets. It is computed from the confusion matrix, which contains four terms: true positive (TP), false positive (FP), true negative (TN), and false negative (FN):

IoU = TP / (TP + FP + FN).

The MAD is the mean absolute difference between the retina's predicted boundary and the ground-truth boundary for each column. Note that MAD is measured in pixels; the physical distance per pixel of the Duke DME and HCMS images is 3.87 µm and 3.90 µm, respectively, so MAD must be multiplied by the corresponding per-pixel distance to obtain the physical distance between the predicted and ground-truth boundaries.
Retinal Layer Thickness (RLT) is the average over columns of the vertical distance between the upper and lower boundaries of each retinal layer, and mError is the mean absolute difference between the predicted retinal thickness and the ground-truth value. A sketch of these measurements is given below.
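A sketch of how MAD and RLT could be computed from predicted and ground-truth label maps. The convention of taking the first and last rows assigned to a layer in each column as its upper and lower boundaries is an assumption; the µm-per-pixel factors come from the text.

```python
import numpy as np

def column_boundaries(mask, layer_id):
    """Top and bottom row of `layer_id` in each column (NaN if absent)."""
    h, w = mask.shape
    top = np.full(w, np.nan)
    bottom = np.full(w, np.nan)
    for col in range(w):
        rows = np.flatnonzero(mask[:, col] == layer_id)
        if rows.size:
            top[col], bottom[col] = rows[0], rows[-1]
    return top, bottom

def mad_um(pred, gt, layer_id, um_per_px=3.87):
    """Mean absolute difference of the layer's top boundary, in micrometres."""
    pred_top, _ = column_boundaries(pred, layer_id)
    gt_top, _ = column_boundaries(gt, layer_id)
    valid = ~np.isnan(pred_top) & ~np.isnan(gt_top)
    return np.abs(pred_top[valid] - gt_top[valid]).mean() * um_per_px

def layer_thickness_um(mask, layer_id, um_per_px=3.87):
    """Average vertical extent of a layer across columns, in micrometres (RLT)."""
    top, bottom = column_boundaries(mask, layer_id)
    valid = ~np.isnan(top)
    return ((bottom[valid] - top[valid] + 1) * um_per_px).mean()

# mError for one layer would then be
# abs(layer_thickness_um(pred, k) - layer_thickness_um(gt, k)).
```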

Ablation study
To evaluate the dual-encoder structure of HDB-Net and the performance of the two modules FAM and BA-FEM, Swin-UperNet was set as the baseline network and ablation experiments were performed on the Duke DME dataset. We also study the effect of the loss function and pre-training on the performance of the proposed network. For the network settings, the dual encoder adopts the Swin Transformer with the Tiny configuration and a standard ResNet-50. Stages = 4, C_1 = 96, C_2 = 64, patch size = 4, window size = 7. The number of layers in each stage is [2, 2, 6, 2], and the number of attention heads per stage is [3, 6, 12, 24].
Figure 6 shows the overall segmentation results of the ablation study. Note that Swin©Res50 indicates that the Swin Transformer and Res50 features are merged by concatenation. In the figure, white and yellow boxes mark two segmentation areas worth focusing on. The white box is the junction of the INL and OPL layers above the SRF, which is prone to pixel misclassification: the image shows misclassified dark green (INL) and bright red (OPL) patches in areas that should belong to dark blue (SRF). Across Fig. 6, the best performers in this regard are (e) and (j), representing the network with only the FAM module added and HDB-Net, respectively. This shows that the FAM module has excellent segmentation ability for the SRF area and proves that the mixed attention mechanism is very good at handling target features with variable size, shape, and location. The yellow box is the junction of the OPL and ONL layers. Because there is an opaque white dot in the original image, the boundary between the OPL and ONL layers in this area is affected. The best performers here are (f) and (j), representing the network with only the BA-FEM module added and HDB-Net, respectively. This demonstrates the value of the retinal layer edge information extracted from shallow features by the BA-FEM module for improving the fine segmentation of each retinal layer.
Effects of the dual-encoder structure: Four structures were tested: Swin-UperNet, Res50-UperNet, direct addition of Swin and Res50 (Swin + Res50), and compression after concatenation of Swin and Res50 (Swin©Res50); the results are shown in Table 1. Looking at the MDice scores, among the two single-encoder networks, Res50-UperNet achieved the more accurate result at 84.52%, while the baseline Swin-UperNet reached 84.13%. Note that for a dataset such as Duke DME with a small amount of data (110 images), the Transformer-based Swin Transformer is still inferior overall to the CNN-based Res50. For the two dual-encoder networks, directly adding Swin and Res50 gives an MDice of 84.21%, even lower than the 84.52% of the single-encoder Res50-UperNet. This indicates that an inappropriate feature aggregation method in a dual-encoder structure not only fails to improve performance but can negatively affect the final segmentation accuracy. After concatenating and compressing the features of the Swin and Res50 networks, the MDice reached 85.54%, the highest among the four structures, an increase of 1.41% and 1.02% over the standalone Swin and Res50, respectively. The MIoU also reached 79.33%, again the highest of the four structures. This demonstrates that an appropriately combined dual encoder can aggregate more information, which benefits hierarchical cascade semantic prediction. In addition, Swin-UperNet had the best performance on the SRF caused by DME (71.77%), showing that the Swin Transformer is indeed better at handling SRF-like objects of variable size, shape, and location; precise segmentation of this type of object often requires global context information, which is the strength of the Swin Transformer.
FAM and BA-FEM: Table 2 and Table 3 show the effect of the FAM and BA-FEM modules on OCT retinal layer segmentation on the Duke DME and HCMS datasets, respectively. The influence of adding the same module is basically the same on both datasets. Compared with direct aggregation of the dual-encoder features, using the FAM module improves the MDice by 1.12% on Duke DME and 1.98% on HCMS. This indicates that the FAM, which incorporates channel and spatial attention mechanisms, can selectively extract and aggregate features from both networks, leveraging the respective advantages of the Transformer and CNN and thereby enhancing the final segmentation performance. It is worth noting that FAM plays a crucial role in the segmentation of SRF caused by DME: after the introduction of FAM, the segmentation accuracy of SRF improved from 70.14% to 73.93%, the best result. This fully indicates the good segmentation performance of the designed FAM for objects with uncertain size, position, and shape, such as SRF, and its ability to exploit the global context information extracted by the Transformer. After BA-FEM was used to aggregate the dual-encoder features, the MDice on Duke DME increased by 1.22%, from 85.54% to 86.76%; the MDice on HCMS also increased by 0.67%. In particular, the segmentation accuracy of each retinal layer improved generally, while SRF improved little. This indicates that enhancing boundary feature extraction from the shallow layers with the BA module and expanding the network's receptive field with the FE module are more conducive to extracting layer edge information. Adding the BA and FE submodules separately shows that each alone sometimes segments better than the combined BA-FEM; this indicates that both contribute to retinal layer segmentation and that they complement each other to improve the robustness of the BA-FEM module. Finally, we tried adding the UFE module from BAU-Net [12], which is a non-lightweight feature enhancement module. On both datasets, the segmentation performance of UFE and our FE module is basically the same, showing that our lightweight modification of the FE module is successful.
Loss function and pre-trained model: The experimental results with the pre-trained model and different loss functions are shown in Table 4. Without the pre-trained model, the MDice is 87.18% and the MIoU is 80.39%, which are 0.45% and 1.8% lower than with the ADE20K pre-trained model. This highlights the advantage of leveraging pre-trained models, especially on a large dataset like ADE20K, in improving segmentation accuracy. In terms of the loss function, using only the weighted cross-entropy loss L_WCE or only the Dice loss L_Dice gives MDice values of 87.06% and 87.04%, which are 0.55% and 0.57% lower than our joint loss, and MIoU values 0.51% and 1.77% lower, respectively. This suggests that for OCT retinal layer segmentation, with its imbalanced foreground and background, neither L_WCE nor L_Dice alone achieves the best results; the joint loss L designed in this paper can be used to supervise the model for the best training outcome. In addition, the segmentation result without the pre-trained model (87.18%) is better than those using only L_WCE (87.06%) or L_Dice (87.04%); in this task, a suitable loss function contributes more to segmentation accuracy than a suitable pre-trained model.

Comparative study
To verify the effectiveness of the method, we compared HDB-Net with a variety of CNN-based, Transformer-based, and dedicated retinal layer segmentation algorithms; the results are shown in Fig. 7. The three CNN-based algorithms are U-Net [11], UperNet [19], and nnU-Net [49]. The four Transformer-based algorithms are Swin-UNet [24], DS-TransUNet [25], ST-UNet [26], and nnFormer [50]. The five algorithms specifically designed for retinal layer segmentation are ReLayNet [48], MGU-Net [51], Y-Net [52], TCCT [53], and LightReSeg [54]. All of these methods use ResNet-50 and Transformer Tiny pre-trained on ADE20K as the backbone, and their hyperparameters are consistent with those used in our experiments. In the experiments, Dice/IoU, MAD, and Retinal Layer Thickness and Error (RLTE) were used on the Duke DME and HCMS datasets to evaluate the overall segmentation accuracy, boundary segmentation accuracy, and retinal layer thickness measurement accuracy of the models, respectively. The value in brackets in RLTE is the difference between the measured and true retinal layer thickness, from which measurement accuracy can be evaluated. The experimental data are presented in Table 5 to Table 10. Bold in a table indicates the best value in a column, and * indicates the best value within the same horizontal block of the table. The BAU-Net [12] results shown in italics in Table 6 and Table 9 are cited directly from that paper, because the relevant code is not open source.

Results on the Duke DME Dataset: Table 5 to Table 7 present the comparative results of the various methods on the Duke DME dataset. Among all 13 segmentation algorithms, our proposed HDB-Net achieves the best performance on mDice/mIoU, MAD, and RLTE. Specifically, in Table 5, which evaluates overall segmentation accuracy, HDB-Net achieves the best results in all 8 individual retinal layers. The overall mDice is 87.61%, which is 2.62%, 1.43%, and 0.9% higher than the CNN-based nnU-Net (84.99%), the Transformer-based DS-TransUNet (86.18%), and the retinal-layer-specific TCCT (86.71%), respectively; these three algorithms are already the best performers in their respective categories. In Table 6, which evaluates boundary segmentation accuracy, HDB-Net obtains the best results in 4 of the 8 individual retinal boundaries, and its overall mMAD of 3.01 with a standard deviation of 0.73 is the smallest of all models, showing the best accuracy and stability of retinal boundary segmentation. Table 7 shows the measurement of each retinal layer's thickness by the different models. HDB-Net achieves the best results in the RNFL, INL, and OS-RPE layers, and its overall measurement error of 0.8 is the smallest among all models. Overall, the segmentation algorithms that incorporate a Transformer perform better on mDice/mIoU, MAD, and RLTE than those based solely on CNNs; this can also be seen in the left part of Fig. 7 and shows that the Transformer's global context information is of great help to the overall positioning of the retinal layer. In addition, the yellow box marks the junction of the OPL and ONL layers. Because there is an opaque white dot in the original image, the segmentation results of the different algorithms differ greatly here, and it can be regarded as the key area for comparing the performance of the different algorithms. Among them, (a), (c), and (f) all have isolated wrongly colored pixel blocks in this region, indicating that U-Net, ReLayNet, and ST-UNet are lacking in ensuring the retinal topological order through global information. Among the remaining algorithms, the shape of the red region in (i) is the most consistent with the ground truth, with the least interference when segmenting the outline of the white dot. This not only reflects the improvement in layer edge segmentation brought by the feature details extracted by BA-FEM in HDB-Net, but also shows that HDB-Net has the best robustness to interference among all algorithms.
Results on the HCMS Dataset: Table 8 to Table 10 present the comparative results of the various methods on the HCMS dataset. Our proposed HDB-Net still has the best mDice/mIoU, MAD, and RLTE performance of all 13 networks, at 92.44%/87.61%, 1.96 ± 0.43, and 0.15, respectively. In Table 9 and Table 10, the mMAD and mError of HDB-Net are much smaller than those of the other methods, 0.17 ± 0.07 and 0.09 lower than the second-best network TCCT, respectively, reflecting the excellent layer boundary segmentation performance of BA-FEM. HDB-Net also performed overwhelmingly well in the individual evaluations (best in 6 of the 8 retinal layers and boundaries). The observation from Duke DME that algorithms combined with a Transformer segment better than purely CNN-based algorithms also holds on HCMS; regardless of dataset size, the segmentation algorithms combined with a Transformer have better overall performance. On the HCMS dataset, the overall differences in segmentation accuracy between models are small, probably because the dataset is relatively large and the overall accuracy is high. In addition, in the right part of Fig. 7, all the prediction results of (a)-(e) and (g)-(h) contain elliptical segmentation errors marked by red circles; only (f) and (i) do not show this phenomenon. Such small, isolated, incorrect segmentations are usually due to a lack of global context information, resulting in a misjudgment of the retinal layer. This reflects the successful integration of the Transformer's global context information by the dual-encoder networks DS-TransUNet and HDB-Net. Compared with (f), (i) has a finer and more accurate layer boundary, which reflects the accurate distinction of layer boundary details by BA-FEM in HDB-Net. In general, HDB-Net performs excellently on the Duke DME and HCMS datasets for all three tasks: overall retinal layer segmentation, retinal layer boundary segmentation, and retinal thickness measurement.

Efficiency Analysis: Table 11 shows the time, mDice, and parameters of the compared methods under the same operating environment. Compared with CNN-based methods, Transformer-based methods tend to have more parameters, longer computation times, and better mDice scores, suggesting that more complex network designs tend to achieve better segmentation performance. TCCT and LightReSeg are special cases because they are designed to be lightweight; the cost is that their segmentation performance is slightly inferior to our method. The main application scenarios of our network are preoperative disease diagnosis and lesion segmentation, tasks that do not place strong demands on real-time operation or easy deployment, while segmentation accuracy is more critical. Therefore, the overall idea of sacrificing some computation speed and parameters to obtain the best segmentation performance is reasonable.

Disclosures. The authors declare no conflicts of interest.

Fig. 1 .
Fig. 1. The challenges of retinal layer segmentation in OCT images. (a) The white noise points in the red marked area show the image quality degradation caused by speckle noise. (b) The blue area is the subretinal fluid caused by the lesion; the colored dividing lines are the retinal layers, which have a strict physiological order from inner to outer. (c) The green area is a retinal layer with highly similar interlayer morphology.

Fig. 2 .
Fig. 2. Overall framework of HDB-Net. The left and right branches of the encoder represent the hierarchical feature maps of the Swin Transformer and ResNet, respectively. The orange FAM blocks represent the feature aggregation module applied to the feature map of each stage, used to efficiently fuse the feature information extracted by the two networks. BA-FEM in the red dotted box is the boundary awareness and feature enhancement module, which exists only in the shallow stages to extract edge information and retinal layer topological order from low-resolution features. PPM is the pyramid pooling module, which exists only at the deepest stage of the network.

Fig. 4 .
Fig. 4. Boundary awareness and feature enhancement module (BA-FEM) structure. The light blue BA portion represents the boundary awareness (BA) module, which exists only in the first two stages of the network, while the light grey feature enhancement (FE) module exists in the first three stages. STN is a spatial transformer network, which uses its spatial invariance to align the upsampled features with those of the current stage to ensure accurate extraction of retinal boundaries.

Fig. 5 .
Fig. 5. Two examples of raw images (top) and annotated images (bottom) from (a) Duke DME and (b) HCMS.

Fig. 7 .
In the figure, (a)-(d) are the segmentation results based on CNN, (e)-(h) are the segmentation results combined with Transformer, and (i) are the results of HDB-Net. It can be clearly seen that isolated wrong-color patches of irregular size and shape similar to SRF often appear in (a)-(d), representing wrong positioning of the retinal layer. In (e)-(h), the problem is much alleviated.
Conclusion
In this paper, we introduce HDB-Net, a hierarchical dual-encoder network for retinal layer segmentation in diseased OCT images. The design effectively leverages the Swin Transformer's strength in global information extraction and the multi-scale feature fusion capability of the UperNet pyramid structure, while capitalizing on the strength of CNN convolutional operations in medical segmentation, which do not require extensive training data. A specialized FAM is designed to alleviate the network's poor segmentation of targets with variable size, shape, and location (such as SRF caused by DME), and the BA-FEM improves fine layer-boundary segmentation and layer thickness measurement. In addition, a new experiment measuring the thickness of each retinal layer is added. In conclusion, the proposed network demonstrated superior performance for retinal layer segmentation on the public Duke DME and HCMS datasets, achieving the best segmentation and measurement outcomes among the 13 methods evaluated, marking a significant enhancement. Our future work is to extend this network to three-dimensional OCT segmentation and to introduce self-supervised learning to fully leverage the Transformer's potential in medical segmentation.

Funding. National Natural Science Foundation of China (52175008, 92048301); State Key Laboratory of Robotics and Systems (HIT) (SKLRS2022KF17); National Natural Science Foundation of China Regional Program (61961015).

Table 1 . Effect of four structures on segmentation accuracy on the Duke DME dataset

Table 9. Mean and standard deviation (µm) of MAD scores for different network models on the HCMS dataset. The numbers in parentheses are the standard deviations of the corresponding scores. Bold indicates the best value in a column, and * indicates the best value within the same horizontal block of the table.