Joint Deep Matching Model of OCT Retinal Layer Segmentation

Optical Coherence Tomography (OCT) is very important in medicine and provide useful diagnostic information. Measuring retinal layer thicknesses plays a vital role in pathophysiologic factors of many ocular conditions. Among the existing retinal layer segmentation approaches, learning or deep learning-based methods belong to the state-of-art. However, most of these techniques rely on manual-marked layers and the performances are limited due to the image quality. In order to overcome this limitation, we build a framework based on gray value curve matching, which uses depth learning to match the curve for semi-automatic segmentation of retinal layers from OCT. The depth convolution network learns the column correspondence in the OCT image unsupervised. The whole OCT image participates in the depth convolution neural network operation, compares the gray value of each column, and matches the gray value sequence of the transformation column and the next column. Using this algorithm, when a boundary point is manually specified, we can accurately segment the boundary between retinal layers. Our experimental results obtained from a 54-subjects database of both normal healthy eyes and affected eyes demonstrate the superior performances of our approach.


Introduction
Optical Coherence Tomography (OCT) is a non-invasive imaging technique that enables high-resolution cross-sectional imaging of ocular tissues such as the retina [Adhi and Duker (2013) ;Geitzenauer, Hitzenberger and Schmidt-Erfurth (2011);Huang, Swanson, Lin et al. (1991)] and has become a key imaging technique for diagnosing retinal diseases. It visualizes the internal structure of the retina and can qualitatively and quantitatively assess morphological changes in underlying disease [Waldstein, Wright, Warburton et al. (2016)]. Retinal OCT has been widely used to image the normal retina and its various layers, as well as to detect ophthalmic diseases such as Age-Related Macular Degeneration (AMD) [Keane, Liakopoulos, Jivrajka et al. (2009) ;Malamos, Ahlers, Mylonas et al. (2011);MeindertNiemeijer, Lee, Abràmoff et al. (2012);Sui, Zheng, Wei et al. (2017)], glaucoma [Puliafito, Hee, Lin et al. (1995)] and diabetic retinopathy. In particular, retinal thickness or Central Macular Thickness (CMT) measured by OCT is closely related to pathological changes and treatment outcomes of various ocular diseases [Fleckenstein, Schmitz-Valckenberg, Adrion et al. (2010);Wood, Binns, Margrain et al. (2011)]. Currently used in clinical practice, automated thickness quantification methods use a hierarchical segmentation algorithm to accurately identify internal and external retinal boundaries and then estimate the thickness of the retina as the space between the surfaces. In ophthalmic clinical practice, manually segmenting the OCT image to extract the retinal layer boundary position of the retinal layer and subsequent structural features still dominate, though there are still many challenges. The first is that it is time-consuming, subjective and error-prone, thus efficient and automatic segmentation techniques are needed. Manual depiction of layers in OCT images is time-consuming and therefore not practical for large numbers of B-scan images obtained in intensive scanning protocols that are often used in clinical settings, as well as for large-scale population studies. Many methods have been proposed to segment the retina and choroid layer. Many papers review methods for OCT image analysis, especially the layer segmentation method [Baghaie, Yu and D'Souza (2015); DeBuc (2011)], which includes various methods such as intensity variation analysis and adaptive threshold analysis [Ishikawa, Stein, Wollstein et al. (2005)], pixel-based Markov boundary model [Koozekanani, Boyer and Roberts (2001)], graph-based method [Chiu, Li, Nicholas et al. (2010)], multi-surface map cutting method [Garvin, Abramoff, Wu et al. (2009)], active contour segmentation model [Mishra, Wong, Bizheva et al. (2009)], pixel classification, edge detection, and machine learning method image registration [Che, Zheng, Sui et al. (2019)]. These are usually unique methods applied to images to extract the boundaries of interest. However, the currently existing methods have some shortcomings. First, there are many OCT retinal layers, which cannot be segmented by one process. It is often that various methods segment different layers. Secondly, they are based on the whole image segmentation. However, the image has more irrelevant information, and the interference to the segmentation process is significant, which leads to the low precision of the segmentation. Duan et al. [Duan, Zheng, Ding et al. (2018)] adopts the latest method of group matching, and matches the two columns of gray value curves in the OCT image by Dynamic Time Warping (DTW) method, which is corresponding to each point on the gray value curve, and then solved by joint optimization. The optimal value is the best matching, so the segmentation of any layer in the entire image can be calculated. However, there are some problems with this method. The process of joint optimization is to calculate all the results obtained in the first step, resulting in long operation time and a large calculation amount. Although these methods show significant advantages, they are not well summarized, and image changes may mean that the rule set also needs to be changed. Recent approaches have attempted to solve these problems by machine learning to segment OCT images. These latest researches include support vector machines, neural networks  [Karri, Chakraborthi and Chatterjee (2016)]. Although earlier methods were based on more standard image analysis programs [Alonso-Caneiro, Read and Collins (2013); Chiu, Li, Nicholas et al. (2010)], many more new segmentation methods have used more sophisticated deep learning algorithms. For layer segmentation, many use machine earning methods to detect retinal layer boundaries. [Lang, Carass, Hauser et al. (2013)] Using the random forest, Canny and random forest and graphical search method, the boundary between the macular region and the retinal OCT image is segmented by nine boundaries. Each pixel in the OCT image extracts 27 features to form a feature vector [Vermeer, Van der Schoot, Lemij et al. (2011)] Features extracted include intensity, adjacent strength of various lengths, and gradients. The method of pixel classification is adopted, and multi-scale gradient and grayscale features are used as feature vectors, and clustering is performed by the SVM algorithm. The number of clusters classified is based on the number of layers. Due to errors in the segmentation process, a smooth set is finally solved using the level set. Venhuizen et al. [Venhuizen, van Ginneken, Liefers et al. (2017)] proposed a CNN-based retinal image segmentation method that is robust to abnormal retinal images. Roy et al. [Roy, Conjeti, Karri et al. (2017)] proposed a new deep learning network structure-ReLayNet, which is used for the hierarchical segmentation of OCT images in normal people and DME patients. ReLayNet draws on the U-net idea and divides it into two steps: down sampling and up sampling. The cross-entropy and dice overlap loss function is used in the training process to optimize. In order to overcome the limitations of the current OCT retinal layer segmentation, we build up a method based on grouping curve convolution to generate retinal segmentation. This method is unique in that it can accurately segment the retinal layer and does not require different methods for the segmentation of the retinal layer. Especially we give the OCT image to be segmented and extract the contour formed by the intensity of the OCT image for each A-scan, and we enhance the relationship between each pixel of any column in the gray value curve for 1D convolution, that is, directly connecting two gray values from two columns to overlap, input into the U-net network, perform 1D convolution and pooling in the form of a pyramid, and output the result as a pixel between two columns of gray values. Correlate the situation and be able to output the best match between pixels. The main contributions of this paper are as follows.
(1) We can automatically extract the features of the A-scan column without any artificial involvement and completely separate the boundaries of the OCT retinal layer.
(2) We extract the gray value curve by A-scan and divide it by the 1D convolution method of the gray value curve, which strengthens the connection between any two-pixel points of the gray value curve. This strategy avoids the most popular at present-the disadvantages of the method of examining the retinal layer boundary features.
(3) We propose an alternative method for segmentation from the OCT retinal layer, which involves curve convolution to find the offset. This method can segment any layered structure in the OCT image without any special technical redesign or experimental retraining.

Group-wise curve alignment
The OCT image is a combination of cross-sectional tomography (B-scan) in which an axial depth scan (A-scan) constitutes a lateral combination containing information about the spatial size and position of the structure within the object of interest. The principle of our OCT retinal segmentation model is based on A-scan. The OCT retinal fundus image is shown in Fig. 1. The gray scale values of the corresponding tissues (the same column A-scan) in the retinal fundus image are basically the same. In addition, the law of gray scale variation between the same two layers at different locations is similar. When the gray level of the image changes from dark to light, the slope of the corresponding gray value curve will change greatly.

Figure 1: OCT retinal fundus image
The change in the gray value of each column of A-scan of the same OCT fundus retina image is consistent with the corresponding tissue structure from top to bottom, and the trend of all gray value curve changes from top to bottom is similar. It can be seen from Fig. 2. that the change of the gray value curve of the same tissue structure in the OCT fundus image has similar shape features so that the feature points can be obtained by the two-two convolution of the gray value curve, and two gray value curves are obtained. Matching in medical applications, we select a layered point in the undivided OCT image and pass the matching through the loop to obtain the matching of the same organizational structure, including the organizational structure between the layers. We are able to segment all the tissue structures of the OCT retina directly.  Fig. 1, the ordinate represents the gradation value, and the abscissa represents the − pixel. Right: the 50th column of the gradation value of Fig. 1, the ordinate represents the gradation value, and the abscissa represents the − pixel

Overview
In view of the shortcomings of the traditional methods, we propose a new method. The original segmentation method is to extract the features of the two curves separately, and the two curves are independent of each other. We intend to use the 1D convolution method to express the characteristics of the two curves and strengthen the relationship between any two pixels. It is possible to directly establish a convolution structure on the interaction space between the two columns of gray value curves. That is, the two columns of gray value curves meet before their own advanced representation matures, while still retaining the space for the individual development of each column of gray value curve abstraction. We extract the gray value contours of two adjacent columns in the OCT image, which are denoted as L1 and L2. The two gray value curves are used as the input of the convolution neural network. By registering L1 to L2, the offset between L1 and L2 is obtained, and the corresponding output is the correspondence of the pixel points of the two columns of gray value curves. By looping the two pairs and matching, the correspondence between the columns listed in the entire OCT image and the segmentation result is obtained. In this paper, we use unsupervised depth CNN. The proposed framework is shown in Fig. 3. The convolutional neural network, the template structure and the space transformer are composed of frames, which are similarities for calculating the contours of two columns of gray values. The following sections describe these aspects and the architecture involved.

Figure 3:
We propose a group matching the segmentation model. In the OCT image to be segmented, the contours of the L1 and L2 gray-level calculation areas are extracted, and the two gray value curves are overlapped and input into the convolutional neural network and the offsets of the two columns of curves are calculated, and two changes are obtained by spatial variation. The correspondence between the column pixel values is iterated sequentially, and the offset of the entire image is output to obtain the segmentation result of the entire image, and the boundary point is selected to complete the segmentation

1D convolution model
CNN is traditionally used for image processing. The image is 2D in nature, so the convolution filter used is also 2D (usually 3×3, 5×5, 7×7 pixels or similar pixels). However, there are other situations besides images that can be processed with different sizes. 1D CNN is commonly used in natural language processing. For 1D convolution, a window of size 3 contains only three feature vectors. In image processing, we arrange the pixels into a 2D mesh, where we arrange the gray value curves into a 1D sequence. The convolution itself works like its 2D copy, but only 1D filters (such as 3, 5, 7) can be used. For 2D convolution, the 3×3 convolution kernel contains 9 eigenvectors. For 1D convolution, a window of size 3 contains three feature vectors. The 2D convolution is a convolution in the X and Y directions, while the 1D convolution is convolution in only one direction in y.
Our method is to convolve the gray value curve of a column in the image, so the 1D convolution is used. We design a sliding window with a size of 3×1 convolution and convolution from the first pixel to the last pixel in one direction.

Convolution model
As shown in Fig. 4, our convolution structure model is similar to u-net ; Ronneberger, Fischer and Brox (2015)]. The coding path on the left and the decoding path on the right form the structure of the network. The encoding path includes two down samplings, and the decoding path includes two-up samplings. The encoding path follows the typical architecture of the convolution network, with the gray values of the two columns of A-scan being used as inputs to the encoder. The continuous layer of the encoder processes a rough representation of the input, similar to the image pyramid used in registration.

Figure 4:
A convolutional network architecture based on a gray value curve is proposed. The input is the contour of the gray value curve. The numbers above indicate the number of channels, and the numbers below indicate the length of the vector. The network will connect the feature map extracted during the coding phase to the new feature map during the decoding phase. The last layer uses a 1×1 fully connected layer A repetitive application of a 3×1 convolution kernel constitutes an encoder, each convolution operation followed by a rectangular linear unit (ReLU) [Nair and Hinton (2010)] and a 2×1 average pool. The step size is 2 for down sampling. The rectified linear activation (ReLU) unit speeds up training by further standardizing these values. Each of the two convolutional layers has an average pool layer that retains most of the information. In each down sampling step, the number of channels is doubled. Each step in the extended path involves up sampling the feature map and then performing a 2×1 convolution to halve the number of feature channels, skipping the features that the connection will learn during the coding phase. Directly propagate to the layer that generates the registration to generate the registration directly, and at the last level, 64 eigenvectors are mapped to the desired class using a 1×1 convolution. The output is an offset matrix of two columns of A-scan.

Spatial transformation
We learn the optimal value by minimizing the difference between L′(x) and Τ. In order to use the standard gradient method, we constructed a differentialoperation based on the space transformer network [Jaderberg, Simonyan and Zisserman (2015)] to calculate L′(x) . We use a spatial transformation function to transform L 1 and ϕ(x) into L′(x) , so that the model can calculate the similarity between the two columns of gray values. The spatial transformation operation formula of single linear interpolation is: where L′(x) is calculated from L 1 by ϕ(x) warping, x is the pixel position, and N�x + ϕ(x)� is the 2-pixel neighborhood around position x + ϕ(x). Since the image values are defined only at integer positions, we perform a linear interpolation on two adjacent pixels. In order to make the loss back propagation during the optimization process, we calculate the gradient of the spatial transformation of position x by taking the partial derivative of Eq. (1).

Loss function
We train the network by finding the minimum of the loss function. The registration loss between L1 and L2 formulated in Eq. (2) is evaluated. MSE is a function of the average error of a batch of data. It is characterized by smooth continuous, steerable and easy to use gradient descent algorithm, which is a commonly used loss function. The mean square error is the average of the square of the distance between the predicted value y i ′ and the true value of the sample . The formula is as follows: where y i is the true value of the i − th data in a batch, and y i ′ is the predictedvalue given by the neural network, and m is the number of samples. The squared error has a characteristic. Since the MSE squares the error e ( y − y i ′ = e), when e is greater than 1, the error is increased; when e is less than 1, the error is reduced. If there is an outlier in our data, the value of e will be very high and will be much larger than |e|. MSE will give more punishment for the case with larger errors and smaller punishments for the case with the smaller error. From a training point of view, the model will be more inclined to punish larger points and give them greater weight. As the error of the MSE curve decreases, the gradient also decreases, which facilitates the convergence of the function. Even if the learning factor is fixed, the function can obtain the minimum value faster.

Image data
In our study, our experiments were performed using eyes from 54 subjects, each of which contained a center-centered size measuring 908×408 pixels. All OCT scans obtained with Spectral is HRA+OCT equipment (Heidelberg Engineering GmbH, Germany). Two professional doctors manually mark the surface of the retina in each image.

Experimental environment and assessment
We used the average of the position of the dividing line manually marked by the two experts as the reference standard for the evaluation. Experimentally segmented layers include all layers of the normal retina, as well as the presentation of layered pathological structures on unhealthy retinal images. According to the method in Tian et al. [Tian, Varga, Tatrai et al. (2016)], we measure the accuracy of the segmentation by calculating the pixel difference between the estimated value and the reference standard (the shortest distance between the point on the boundary generated by the algorithm and the manually specified boundary). For a segmented surface, the estimated value v est and the corresponding ground-truth value v ref are vectors, and the signed error is defined as: The unsigned error is the absolute value of e s . For comparison, we also used the graph-based multi-surface segmentation method [Dufour, Ceklic, Abdillahi et al. (2012)] the deep learning method [Pekala, Joshi, Liu et al. (2019)] to evaluate.

Results
Experiments have shown that our framework is able to segment the retinal layer accurately as shown in Fig. 5. The data set used a slice centered on the fovea, which is capable of segmenting a segment that is more consistent with the ground value. So our framework is also feasible for the fovea. In order to highlight the effect of the method, we use the difference between the predicted value and the ground truth calculated using the mean error and the mean absolute error, as well as the standard deviation of the two errors.
Obs.1 and Obs. 2 refer to the results marked by two doctors respectively; Avg. Obs refers to the average of the results of the two doctors; BC refers to the segmentation results based on the graph theory method; DL refers to the segmentation results based on the deep learning method; Algo. Indicates the segmentation result of our method. All values in our table are in pixels. Tab. 1 shows the average error of the positioning error between our method and ground truth. We can see that the deviation we calculated is less than the deviation calculated between doctors. For example, the average error on the entire OCT retinal image manually labeled by two doctors is 2.06 pixels, and between our algorithm and ground truth. The error is 1.89 pixels. In general, our method can be improved by 0.17 pixels than the result of manual segmentation. However, we need to be aware that when making rigorous comparisons, we should be cautious these data sets have an impact on image resolution, noise and so on.

Comparison with previous studies
This paper introduces an OCT retinal layer segmentation tool, which can describe any retinal layer boundary segmentation tool after the doctors mark the boundary points. The method used is based on the curve matching depth learning model, and the depth is utilized from each OCT image. A-scan sequence information Compared to the traditional OCT image segmentation method, we have established an end-to-end network model that automatically extracts the contour of the curve and calculates the offset between the curves. Compared with deep learning, we use A-scan in the OCT image to match and use the curve registration method to segment the OCT image. The segmentation results show the accuracy and stability of the framework in the analysis of retinal layer structure, avoiding the shortcomings of the most popular retinal boundary segmentation methods.

Future work
In this field, we have the following progress. Next, we plan to make a separate convolution of the gray value curve. Firstly, the two columns of gray value curves are separately obtained by CNN to obtain their vector representations, so that two identical and fixed-length vectors are obtained. The vector represents the feature information abstracted by the gray value curve after convolution, and then these two vectors perform offset calculations to find a match between the two gray value curves. This method has the flexibility of a single column of gray value curve convolution. The two columns of gray value curves are completely independent in the modeling process, without any interaction behavior, which delays the interaction between the two columns of gray value curves until the abstract vector representation is finally generated. As input to the next model). In doing so, the gray value curve will lose a lot of important pixel details in the process of abstract modeling, and at the same time, the chance of interactive calculation between gray value curves is lost prematurely. The matching task for the two columns of gray value features is Critical.  Biomedical Optics Express,vol. 8,no. 9,