Comparative study of deep learning models for optical coherence tomography angiography

Abstract: Optical coherence tomography angiography (OCTA) is a promising imaging modality for microvasculature studies. Meanwhile, deep learning has developed rapidly in image-to-image translation tasks. Some studies have proposed applying deep learning models to OCTA reconstruction and have obtained preliminary results. However, current studies are mostly limited to a few specific deep neural networks. In this paper, we conducted a comparative study to investigate OCTA reconstruction using deep learning models. Four representative network architectures, namely single-path models, U-shaped models, generative adversarial network (GAN)-based models and multi-path models, were investigated on a dataset of OCTA images acquired from rat brains. Three potential solutions were also investigated to study the feasibility of improving performance. The results showed that U-shaped models and multi-path models are two suitable architectures for OCTA reconstruction. Furthermore, merging phase information appears to be the most promising direction for improvement in further research.

same location within a short time sequence. Generally, OCTA algorithms can be classified into three categories, i.e., intensity-signal-based algorithms, phase-signal-based algorithms, and complex-signal-based algorithms. Phase-based algorithms, such as Doppler variance [9] and phase variance [6], use only the phase part of the OCT signal to map the microvasculature. However, most phase-based algorithms are sensitive to phase noise, so motion artifacts must be eliminated. Unlike phase-based algorithms, intensity-based algorithms, such as speckle variance, correlation mapping [10], power intensity differential [11] and split-spectrum amplitude decorrelation angiography [12], use different statistics (variance, correlation, squared difference, and decorrelation, respectively) of the OCT signal intensity to compute the blood flow information. Hence, the influence of phase noise and motion artifacts is reduced. However, these algorithms cannot be applied to scenarios where the flow induces Doppler variation only in the phase part of the OCT signal. In contrast, complex-signal-based algorithms [13,14] use both the intensity and phase information in reconstruction. Some representative algorithms, such as split-spectrum amplitude and phase-gradient angiography (SSAPGA) [15], have significantly improved OCTA imaging quality. In summary, the three traditional categories of OCTA algorithms calculate the flow intensity by measuring, in different ways, the OCT signal changes across temporally consecutive cross-sectional images (B-scans) taken at the same location. However, due to the limitations of analytical methods, traditional OCTA algorithms can only utilize a fraction of the information in the OCT signal variation.
In recent years, deep learning (DL) has achieved phenomenal success. As a representative category of DL methods, the convolutional neural network (CNN) [16-20] has greatly advanced various computer vision tasks and has become a popular option in visual and perception-based tasks. In the field of ophthalmology, various CNN models have been developed for disease classification, object segmentation and image enhancement. To exploit the capability of DL to mine the underlying connections between data, some DL-based solutions have been proposed as alternatives to traditional analytic OCTA algorithms. Lee et al. [21] attempted to use a U-shaped auto-encoder network to generate a retinal flow network from clinical data, but the result was not satisfactory and the structural details of small vessels were difficult to distinguish from OCT noise. Meanwhile, Liu et al. [22] also proposed a DL-based pipeline for OCTA reconstruction and obtained promising results in in vivo studies. Although the results demonstrated the superiority of the DL-based pipeline over traditional OCTA algorithms, they adopted only one single-path network, modified from DnCNN (a network originally designed for image denoising [18]), as the DL model to verify their pipeline. In fact, with the rapid development of the technology, various new CNN structures have been proposed for image-to-image translation tasks (e.g., image denoising, super-resolution, and image synthesis). Most of these models can be extended to OCTA reconstruction. Therefore, in order to fully utilize the current research achievements of DL and explore the future direction of DL-based OCTA methods, it is necessary to scrutinize the current network architectures for image translation tasks and investigate their suitability for OCTA reconstruction.
In our previous work [22], we demonstrated the DL-based pipeline for OCTA reconstruction. Building on this foundation, more comprehensive investigations have been conducted, and four representative network architectures were investigated, i.e., single-path model, U-shaped model, generative adversarial network (GAN)-based model and multi-path model. The effectiveness of each architecture was investigated using an in vivo animal dataset. The OCTA reconstruction results were quantitatively evaluated to reflect the performance of comparative studies. Three potential solutions (i.e., loss function optimization, data augmentation and merging phase information) were also investigated to study the feasibility of improving the performance.

Conventional OCTA reconstruction
A typical OCT signal at lateral location $x_l$, axial location $z_l$ and time $t$ in a B-scan can be written as:

$$S(x_l, z_l, t) = A(x_l, z_l, t)\,\exp\left[i\,\Phi(x_l, z_l, t)\right] \tag{1}$$

where S denotes the complex OCT signal, and A and Φ are the signal amplitude and phase component, respectively. As mentioned above, OCTA algorithms fall into three main categories. Among them, complex-signal-based OCTA algorithms are widely used in the field because they use both the amplitude and the phase information of the OCT signal. The SSAPGA algorithm, as a representative complex-signal-based OCTA algorithm, maps the microvasculature with a simple correction of phase artifacts via the phase gradient method, demonstrating superior performance over other conventional OCTA algorithms. The SSAPGA method can be described as Eqs. (2) and (3).
In the equations, $R_l$, $M$ and $\rho$ are the number of repetitions at the same B-scan location, the number of narrow split-spectrum bands and the weight parameter that controls the contribution from phase gradient contrast, respectively.

Deep learning-based OCTA pipeline
In this work, the DL-based OCTA pipeline proposed in [22] was employed in the comparative study. As shown in Fig. 1, the pipeline deals with OCTA reconstruction as an end-to-end image translation task and consists of three phases, i.e., training data preparation, model learning and OCTA predicting.
In the training data preparation, a learning set of instance-label pairs $\{(x_n, y_n)\}_{n=1}^{N}$, consisting of ground truth images $y \in \mathbb{R}^{H \times W}$ and their corresponding multi-channel OCT structural images $x \in \mathbb{R}^{C \times H \times W}$, was constituted for DL-based OCTA reconstruction, where n is the index of the image pair, and C, H and W are the numbers of input channels, rows and columns of the image, respectively. Each cross-sectional image pair (x, y) in the learning set corresponds to a slow-axis location. At each slow-axis location, 48 consecutive B-scans were first registered by a rigid registration algorithm [23] and then processed with the SSAPGA algorithm to generate a label angiogram y with a high signal-to-noise ratio (SNR); meanwhile, registered consecutive B-scans (OCT structural images) were randomly selected as the multi-channel OCT structural image x.
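The selection of the multi-channel input x described above can be sketched as follows. This is a minimal illustration in plain Python, not the authors' code: the function name `make_input_stack` is hypothetical, the B-scans are toy 1 x 1 placeholders, and keeping the selected frames in acquisition order is an assumption (the text only states that they are randomly selected).

```python
import random


def make_input_stack(registered_bscans, num_channels=4, rng=None):
    """Return (indices, stack): num_channels distinct B-scans drawn at random.

    Assumption: the chosen frames are kept in acquisition order.
    """
    rng = rng or random.Random()
    if num_channels > len(registered_bscans):
        raise ValueError("not enough B-scans at this slow-axis location")
    indices = sorted(rng.sample(range(len(registered_bscans)), num_channels))
    return indices, [registered_bscans[i] for i in indices]


# Toy stand-in for the 48 registered B-scans at one slow-axis location.
bscans = [[[float(k)]] for k in range(48)]
indices, x = make_input_stack(bscans, num_channels=4, rng=random.Random(0))
```

The same helper would cover the 2-input and 3-input protocols by changing `num_channels`.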
In the model learning phase, the learning set was first split into a training set $\{(x_p, y_p)\}_{p=1}^{P}$ and a validation set $\{(x_q, y_q)\}_{q=1}^{Q}$. Then, four representative DL models were selected and their performance was investigated in order to find the most promising DL architectures for OCTA reconstruction. The training of each model consisted of forward propagation, loss function calculation and back-propagation steps. That is, the input x was sent to the respective neural network to output the predicted image $z \in \mathbb{R}^{H \times W}$; then the loss between the predicted image z and the ground truth image y was calculated; finally, the back-propagation procedure passed the loss value back through the network to compute the gradients and update the layer weights. Meanwhile, the validation set was used to monitor the model training process with the peak signal-to-noise ratio (PSNR) as the quantitative metric. The PSNR can be defined as:

$$\mathrm{PSNR} = 10\log_{10}\left(\frac{\mathrm{MAX}_y^{2}}{\mathrm{MSE}}\right) \tag{4}$$

where $\mathrm{MAX}_y$ is the maximum value of the ground truth image and MSE is the mean square error between the ground truth y and the predicted image z. The MSE can be formulated as:

$$\mathrm{MSE} = \frac{1}{HW}\sum_{i=1}^{H}\sum_{j=1}^{W}\left[y(i,j) - z(i,j)\right]^{2} \tag{5}$$

In the OCTA predicting phase, the consecutive B-scans at each slow-axis location are extracted and registered; then the angiogram is reconstructed from the B-scans using the model trained in the learning phase.
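As a concrete illustration of the validation metric, the sketch below computes the MSE and PSNR between a ground truth image y and a prediction z using the standard definitions (PSNR = 10 log10(MAX_y^2 / MSE), with MSE averaged over all H x W pixels). Images are plain nested lists here; the function names are illustrative only.

```python
import math


def mse(y, z):
    """Mean square error over all H x W pixels."""
    h, w = len(y), len(y[0])
    return sum((y[i][j] - z[i][j]) ** 2 for i in range(h) for j in range(w)) / (h * w)


def psnr(y, z):
    """Peak signal-to-noise ratio in dB, with MAX taken from the ground truth y."""
    err = mse(y, z)
    if err == 0:
        return math.inf  # identical images
    max_y = max(max(row) for row in y)
    return 10.0 * math.log10(max_y ** 2 / err)


y = [[0.0, 255.0], [255.0, 0.0]]
z = [[0.0, 250.0], [255.0, 5.0]]
value = psnr(y, z)  # MSE is 12.5 for this toy pair
```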

Image translation and typical network architectures
Image translation, also known as image-to-image translation, is an important field in computer vision. The goal is to learn a mapping from an image in the source domain to a corresponding image in the target domain. The field includes various types of problems such as super-resolution [19,24-26], noise reduction [18,27] and image synthesis [28-32]. Encoder-decoder network-based models [18,19,24,25] and GAN-based models [28,29,31,33] are the two mainstream types of implementation in image-to-image translation. For encoder-decoder network-based models, the goal is to train a single CNN model with a particular structure to convert an input image into the target image. According to the architecture of the CNN, encoder-decoder network-based models can be further divided into single-path models, U-shaped models, and multi-path models. On the other hand, Pix2Pix GAN [28] and CycleGAN [29], which work on datasets consisting of paired images and unpaired images, respectively, represent two fundamental frameworks for GAN-based models. Because the cross-sectional image pairs in our study are spatially aligned, Pix2Pix GAN is more suitable for OCTA reconstruction than CycleGAN.
In this work, we investigated four representative CNN models: DnCNN for single-path models, U-Net [17] for U-shaped models, the residual dense network (RDN) [19] for multi-path models and Pix2Pix GAN for GAN-based models. Preliminary experiments on one tenth of the full dataset were performed for parameter tuning. Important hyperparameters, such as the learning rate and batch size, were tuned individually. Early stopping was employed in the training phase so that each DL model retained its best-performing parameters.
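Early stopping on a validation metric can be sketched as below. This is a generic illustration rather than the authors' exact criterion: the `patience` value and the choice of validation PSNR as the monitored quantity are assumptions (the pipeline description mentions PSNR for monitoring, but no patience is reported).

```python
def early_stopping_epoch(val_psnr_per_epoch, patience=5):
    """Return (best_epoch, best_psnr), stopping once `patience` epochs pass without improvement."""
    best_epoch, best_psnr = -1, float("-inf")
    for epoch, value in enumerate(val_psnr_per_epoch):
        if value > best_psnr:
            best_epoch, best_psnr = epoch, value  # checkpoint would be saved here
        elif epoch - best_epoch >= patience:
            break  # no improvement for `patience` epochs
    return best_epoch, best_psnr


history = [20.0, 22.0, 23.0, 22.5, 22.8, 22.9, 22.7, 25.0]
stopped = early_stopping_epoch(history, patience=3)
full_run = early_stopping_epoch(history, patience=10)
```

With a short patience the late improvement at the last epoch is never reached, which shows why the patience has to be chosen with the noise of the validation curve in mind.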

Single-path model
As a representative DL architecture, single-path models offer a simple yet effective way to implement image translation. The network structures of single-path models are line-shaped and have no skip connections, as in SRCNN [24] and VDSR [25] for super-resolution and DnCNN for noise reduction. In this work, we employed a modified DnCNN as the representative single-path model and investigated it in the OCTA pipeline. The network structure is shown in Fig. 2. The network included 20 convolutional layers. The first layer consisted of 64 filters of size 4 × 3 × 3 to handle the four input OCT structural images and used rectified linear units (ReLU) [34] as the activation function. Each of layers 2 to 19, which were used to extract features gradually, included 64 filters of size 64 × 3 × 3, batch normalization (BN) [35] and ReLU. With a single filter of size 64 × 3 × 3, the last layer yielded the predicted angiogram reconstructed from the four structural images. The network parameters of this end-to-end system were denoted as Θ. For a given training set $\{(x_p, y_p) \mid p = 1, 2, \ldots, P\}$, the model was trained by minimizing the mean squared error (MSE) between the ground truth images $y_p$ and the reconstructed images $z = F(x_p; \Theta)$. The loss function is characterized by:

$$L(\Theta) = \frac{1}{P}\sum_{p=1}^{P}\left\| F(x_p;\Theta) - y_p \right\|_2^{2} \tag{6}$$

As for the training details, we initialized the weights by the method in [36] and used the Adam optimization algorithm [37] to minimize the loss function. The learning rate, batch size, and number of epochs were set to $1 \times 10^{-4}$, 32 and 50, respectively.
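The layer sizes stated above determine the number of convolution weights and the receptive field of the modified DnCNN. The following sketch does that arithmetic in plain Python; it counts convolution weights only (biases and BN parameters are left out, since the text does not say whether biases are used), and the function names are illustrative.

```python
def dncnn_layer_shapes(in_channels=4, features=64, depth=20, kernel=3):
    """Filter bank shape (out_channels, in_channels, k, k) of every conv layer."""
    shapes = [(features, in_channels, kernel, kernel)]              # layer 1
    shapes += [(features, features, kernel, kernel)] * (depth - 2)  # layers 2 to 19
    shapes.append((1, features, kernel, kernel))                    # layer 20
    return shapes


def count_weights(shapes):
    """Total number of convolution weights (no biases, no BN parameters)."""
    return sum(o * i * kh * kw for o, i, kh, kw in shapes)


def receptive_field(depth=20, kernel=3):
    """Receptive field of a stack of stride-1 convolutions."""
    return 1 + depth * (kernel - 1)


shapes = dncnn_layer_shapes()
total_weights = count_weights(shapes)   # 2304 + 18 * 36864 + 576
field = receptive_field()               # each 3 x 3 layer adds 2 pixels
```

Under these assumptions the network has about 0.67 million convolution weights and each output pixel sees a 41 × 41 neighbourhood of the input B-scans.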

U-shaped model
As one of the most important DL architectures, U-shaped models refer to a category of networks with symmetrical structures [17,38,39]. Among these models, U-Net, which was first proposed by Ronneberger et al., has achieved promising performance in image segmentation tasks. Since then, various U-shaped models have been proposed to further improve segmentation performance and extend the scope of application. Owing to their high training efficiency and multi-level feature concatenation, U-shaped models are also widely used for image reconstruction in medical imaging, such as X-ray computed tomography (CT) and photoacoustic tomography (PAT). Hence, U-Net was selected as the representative U-shaped model in the OCTA pipeline to study whether it can capture the variation among the structural images to reconstruct the angiograms.
The network structure is shown in Fig. 3 and is divided into two parts, i.e., a contracting encoder and an expanding decoder. In the encoder part, the down-sampling procedure was performed hierarchically through five stages. Each down-sampling stage contained two 3 × 3 convolutions, each followed by BN and Leaky ReLU [36], and a 2 × 2 max-pooling with a stride of 2. In the first stage, 32 feature maps were extracted from the 4-channel input. Then, the number of feature maps was doubled in each stage. In the decoder part, the features extracted by the encoder part were up-sampled sequentially. Each up-sampling stage contained one 2 × 2 deconvolution with a stride of 2 and two 3 × 3 convolutions followed by BN and Leaky ReLU. After each deconvolution, the feature maps from the symmetric layer in the encoder path were stacked as extra channels through skip connections. In this process, the number of feature maps was halved in each stage of the decoder. In the last layer, a 3 × 3 convolution was used to reconstruct the angiogram. To train the network, we initialized the weights using the method in [36] and adopted the MSE loss function with the Adam optimizer. The learning rate, batch size, and number of epochs were set to $1 \times 10^{-4}$, 32 and 50, respectively.
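The channel bookkeeping implied by this description can be sketched as follows. Two points are assumptions made only for illustration: that the fifth encoder stage is the deepest level of the U, and that each deconvolution halves the channel count before the skip concatenation. The function names are hypothetical.

```python
def encoder_widths(first=32, stages=5):
    """Feature maps per encoder stage: 32 in the first stage, doubled afterwards."""
    return [first * 2 ** s for s in range(stages)]


def decoder_plan(widths):
    """Per up-sampling stage: (channels after deconv, channels after concat, output channels)."""
    plan = []
    for deep, skip in zip(reversed(widths[1:]), reversed(widths[:-1])):
        up = deep // 2                       # assumed: deconvolution halves the channels
        plan.append((up, up + skip, skip))   # skip connection stacks encoder features
    return plan


widths = encoder_widths()
plan = decoder_plan(widths)
```

The middle number of each tuple is the input width of the first 3 × 3 convolution after the concatenation, which is where the encoder features re-enter the decoder.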

Multi-path model
As an effective class of DL architectures, multi-path models refer to very deep line-shaped asymmetric networks with skip connections. ResNet [16] and DenseNet [40] are two representative multi-path networks, which use the residual connection and the dense connection, respectively, as the key element of the network design. Figure 4 illustrates the principles of the residual connection and the dense connection, where $v_l$ and $T_l(\cdot)$ represent the output and the non-linear transformation of the l-th convolutional layer, respectively. The residual connection bypasses the intermediate convolution layers with an identity mapping:

$$v_l = T_l(v_{l-1}) + v_{l-1} \tag{7}$$

This identity mapping helps the back-propagation of the gradient flow and therefore makes it possible to train a very deep network. On the other hand, the dense connection links any layer to all subsequent layers. That is, for each convolutional layer, the input is a concatenation of the outputs of all previous layers:

$$v_l = T_l\left(\left[v_0, v_1, \ldots, v_{l-1}\right]\right) \tag{8}$$

This concatenation of feature maps reduces feature redundancy and further improves efficiency and performance. Owing to these characteristics, the residual connection and the dense connection are widely used in image translation tasks [41,42]. As a network combining both the residual connection and the dense connection, the RDN has achieved significant performance enhancement in super-resolution tasks. Hence, a modified RDN was selected as the representative multi-path model in this study.
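The difference between the two connection types can be shown with a toy example in which feature maps are flat lists of numbers and the layer transformation is an arbitrary function. This only illustrates the data flow (identity added to the layer output versus concatenation of all earlier outputs); the transforms used are placeholders, not convolutions.

```python
def residual_step(v_prev, transform):
    """Residual connection: layer output plus the identity-mapped input."""
    return [t + v for t, v in zip(transform(v_prev), v_prev)]


def dense_step(previous_outputs, transform):
    """Dense connection: the layer sees the concatenation of all earlier outputs."""
    concatenated = [value for output in previous_outputs for value in output]
    return transform(concatenated)


def double(v):
    # Placeholder transform that keeps the feature size unchanged.
    return [2 * a for a in v]


def total(v):
    # Placeholder transform that reduces its input to one feature.
    return [sum(v)]


res_out = residual_step([1, 2], double)        # transform output plus input
dense_out = dense_step([[1, 2], [3]], total)   # transform sees [1, 2, 3]
```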
The network structure of the RDN is shown in Fig. 5, where the residual dense block (RDB) served as the basic module. There were four parts in the RDN structure. The first two layers, which belong to the shallow feature extraction net, were used to extract shallow features from input to the global residual path and RDBs. Then, the hierarchical features were obtained through 20 successive RDBs, and the feature maps from all the RDBs were concatenated to form the dense hierarchical features. Each RDB consisted of seven convolutional layers with block residual connection and dense connection to get high-quality local features. Afterward, two convolution layers were used to fuse the dense hierarchical features for global residual learning. Finally, the last convolution layer reconstructed the angiogram.
To train the network, we used the method in [36] to initialize the weights and adopted the MSE loss function with the Adam optimizer. The learning rate, batch size, and number of epochs were set to $1 \times 10^{-4}$, 32 and 50, respectively.

GAN-based model
As an important family of DL networks, GANs [43] have been vigorously studied in recent years for a wide variety of problems. Typically, in a GAN system, a generator network and a discriminator network are coupled and trained simultaneously. The generator is trained to learn a mapping from a random noise vector to a realistic ("fake") image that the discriminator cannot distinguish from a "real" image. Meanwhile, the discriminator is trained to discriminate between "real" and "fake" images. Among various GAN structures [28,29,44], Pix2Pix GAN, which is based on the idea of the conditional GAN (cGAN) [43], is an outstanding architecture for image translation tasks with datasets of paired images. Therefore, Pix2Pix GAN was chosen for our investigations. The structure and training procedure of the Pix2Pix GAN are shown in Fig. 6. We employed the same U-Net model as in Fig. 3 as the generator and a PatchGAN [28] as the discriminator. There were five 4 × 4 convolutional layers in the PatchGAN. Except for the last layer, all the convolution layers had a stride of 2 and used Leaky ReLU as the activation function. In layers 2 to 4, instance normalization [28] was added between the convolution and the activation. The number of feature maps was doubled in each layer. In such a GAN system, the goal is to learn an effective generator mapping to predict the angiograms. Here, the training set takes the form $\{(x_p, y_p) \mid p = 1, 2, \ldots, P\}$, and the mappings of the generator and discriminator are denoted as G and D, respectively. Then, the loss function can be expressed as the combination of the adversarial loss [43] and a global $L_2$ loss:

$$G^{*} = \arg\min_{G}\max_{D}\; \mathbb{E}_{x,y}\left[\log D(x,y)\right] + \mathbb{E}_{x}\left[\log\left(1 - D(x, G(x))\right)\right] + \lambda\,\mathbb{E}_{x,y}\left[\left\| y - G(x) \right\|_2^{2}\right] \tag{9}$$

where $\mathbb{E}$ is the statistical expectation and λ is the weight term that balances the adversarial loss and the global $L_2$ loss. The adversarial loss encourages the network to produce sharp results, and the global $L_2$ loss ensures the correctness of low-frequency information in the predicted image z = G(x).
In this study, the generator and discriminator were trained through an alternating iterative scheme to optimize the overall objective of Eq. (9). The method in [43] was used to initialize both the generator and the discriminator. Adam optimizers with learning rates of $1 \times 10^{-4}$ and $2 \times 10^{-4}$ were used for the generator and the discriminator, respectively. Meanwhile, λ was set to 1000 to weight the global $L_2$ loss. Moreover, the batch size was set to 1 to suit the architectural characteristics of the Pix2Pix GAN.
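To make the balance between the adversarial term and the λ-weighted global L2 term concrete, the sketch below evaluates both losses for a single sample in plain Python. It is a simplified illustration under stated assumptions: the discriminator output is reduced to one probability per image (a PatchGAN produces a map of them), the generator uses the commonly applied non-saturating form of the adversarial term, and the L2 term is averaged over pixels. None of these details are confirmed by the text.

```python
import math


def discriminator_loss(d_real, d_fake, eps=1e-12):
    """Negative log-likelihood of labelling the real pair real and the fake pair fake."""
    return -(math.log(d_real + eps) + math.log(1.0 - d_fake + eps))


def generator_loss(d_fake, y, z, lam=1000.0, eps=1e-12):
    """Adversarial term (non-saturating form, an assumption) plus lam times the global L2 term."""
    adversarial = -math.log(d_fake + eps)
    l2 = sum((a - b) ** 2 for a, b in zip(y, z)) / len(y)
    return adversarial + lam * l2


# Flattened toy "images": ground truth y and generator output z.
y = [0.0, 1.0]
z = [0.1, 0.9]
g_loss = generator_loss(d_fake=1.0, y=y, z=z, lam=1000.0)
d_loss = discriminator_loss(d_real=0.5, d_fake=0.5)
```

In an alternating scheme, one optimizer step on `d_loss` (generator fixed) would be followed by one step on `g_loss` (discriminator fixed). With λ = 1000, even a small pixel error dominates the generator objective in this toy case.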

Spectral-domain OCT system
A customized spectral-domain OCT system [22] was used to acquire the OCT B-scan images for OCTA. The system used a wideband superluminescent diode with a central wavelength of 845 nm and a full width at half maximum bandwidth of 30 nm as the light source, and adopted a fast line-scan CCD attached to a high-speed spectrometer with a 28 kHz line scan rate as the detector. The axial resolution and lateral resolution of the system were 10 µm and 12 µm, respectively.

OCTA dataset and experimental protocol
In this study, an OCTA dataset of brain tissue from four Sprague Dawley rats was acquired. First, a 4 mm × 4 mm bone window was prepared through a craniotomy for each rat. Then, six data volumes were obtained from the four rats with the OCT system. For each volumetric scan, the field of view (FOV) and imaging depth were 2.5 mm × 2.5 mm and 1 mm, respectively. Two of the rats were scanned twice using different FOVs, and the interval between the two scans was no less than one day. Each data volume comprised 300 sampled slow-axis locations. At each slow-axis location, 48 consecutive B-scans with a pixel size of 1024 × 300 were captured. Following the above-mentioned pipeline, 1800 cross-sectional image pairs were calculated from the six volumes to form the OCTA dataset, and each image pair includes four randomly selected structural OCT images and one label angiogram.
The training dataset and test dataset were carefully designed to avoid using data acquired from the same animal in both datasets. A total of 1500 image pairs (from five volumes acquired from three rats) were used as the training set, and the remaining 300 image pairs (from one volume of the remaining rat) were used as the test set. Based on the same dataset, comparative studies were carried out for the quantitative analysis and evaluation of the four representative network models. All the networks were implemented in PyTorch (https://pytorch.org/) on NVIDIA GPUs. DnCNN, U-Net, RDN, and Pix2Pix GAN were selected as the representative networks for the studies. The detailed network structures and parameter settings are elaborated in Section 2.3. Meanwhile, SSAPGA was also implemented as the reference method for comparison. In addition, considering that 2-input and 3-input imaging protocols are alternatively used for OCTA in practical applications to increase the imaging speed, 2-channel and 3-channel models were also trained and are discussed.
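An animal-wise split of this kind can be sketched as below. The volume-to-rat assignment and the labels A to D are hypothetical placeholders chosen only to match the stated counts (six volumes, four rats, two rats scanned twice); the actual assignment is not given in the text.

```python
PAIRS_PER_VOLUME = 300  # slow-axis locations per volume, from the text

# Hypothetical assignment: rats A and B scanned twice, C and D once.
volumes = [
    {"volume": 1, "rat": "A"}, {"volume": 2, "rat": "A"},
    {"volume": 3, "rat": "B"}, {"volume": 4, "rat": "B"},
    {"volume": 5, "rat": "C"}, {"volume": 6, "rat": "D"},
]


def split_by_animal(volumes, test_rats):
    """Put every volume of a test animal in the test set, so no animal appears in both sets."""
    train = [v for v in volumes if v["rat"] not in test_rats]
    test = [v for v in volumes if v["rat"] in test_rats]
    return train, test


train, test = split_by_animal(volumes, test_rats={"D"})
n_train_pairs = len(train) * PAIRS_PER_VOLUME
n_test_pairs = len(test) * PAIRS_PER_VOLUME
```

Splitting at the animal level rather than at the image-pair level keeps neighbouring B-scan locations of the same tissue from leaking between training and test data.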

Evaluation metrics
For all the networks, PSNR, structural similarity (SSIM) [45] and the Pearson correlation coefficient (R) were used as quantitative evaluation metrics. All three metrics were calculated between the ground truth image y and the predicted image z. PSNR measures the image distortion and noise level between two images. The larger the PSNR, the better the quality of the predicted image z. The calculation of PSNR can be found in Eqs. (4) and (5). SSIM evaluates the structural consistency between the target image and the reference image by fusing the information of luminance, contrast, and structure. The value of SSIM typically lies between 0 and 1, and can be written as follows:

$$\mathrm{SSIM}(y, z) = \frac{\left(2\mu_y\mu_z + C_1\right)\left(2\delta_{zy} + C_2\right)}{\left(\mu_y^{2} + \mu_z^{2} + C_1\right)\left(\delta_y^{2} + \delta_z^{2} + C_2\right)} \tag{10}$$

where $\mu_y$ and $\delta_y$ are the mean value and the standard deviation of the ground truth y, respectively; $\mu_z$ and $\delta_z$ are the mean value and the standard deviation of the predicted image z, respectively; and $\delta_{zy}$ represents the cross-covariance between y and z. The $C_1$ and $C_2$ terms are constants used to avoid numerical instability of the numerator and denominator. R reflects the degree of linear correlation between two images; it ranges from −1 to 1, with values close to 1 indicating a strong positive linear correlation. The formula for R is:

$$R = \frac{\delta_{yz}}{\delta_y\,\delta_z} \tag{11}$$

where $\delta_{yz}$ is the covariance between y and z. For both SSIM and R, a larger value represents a better result.
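The two statistics-based metrics can be sketched in plain Python as follows, using global image statistics (means, variances and covariance over the whole image). Note the assumptions: practical SSIM implementations usually average the index over local windows, and the constants C1 = (0.01 L)^2 and C2 = (0.03 L)^2 with dynamic range L are the common defaults from the SSIM literature, not values reported in this paper.

```python
import math


def _mean(values):
    return sum(values) / len(values)


def _cov(a, b):
    """Population covariance; _cov(a, a) is the variance."""
    ma, mb = _mean(a), _mean(b)
    return sum((p - ma) * (q - mb) for p, q in zip(a, b)) / len(a)


def global_ssim(y, z, dynamic_range=1.0):
    """SSIM computed from global statistics of two flattened images."""
    c1 = (0.01 * dynamic_range) ** 2  # assumed default constant
    c2 = (0.03 * dynamic_range) ** 2  # assumed default constant
    mu_y, mu_z = _mean(y), _mean(z)
    num = (2 * mu_y * mu_z + c1) * (2 * _cov(y, z) + c2)
    den = (mu_y ** 2 + mu_z ** 2 + c1) * (_cov(y, y) + _cov(z, z) + c2)
    return num / den


def pearson_r(y, z):
    """Pearson correlation coefficient: covariance over the product of standard deviations."""
    return _cov(y, z) / math.sqrt(_cov(y, y) * _cov(z, z))


y = [0.1, 0.4, 0.6, 0.9]
same = global_ssim(y, y)
r_pos = pearson_r(y, [2 * v + 1 for v in y])
r_neg = pearson_r(y, [-v for v in y])
```

The last line shows that R reaches −1 for a perfectly inverted image, which is why a value near 1, not merely a large magnitude, indicates a good reconstruction.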

Result and discussion
Representative angiograms reconstructed from four consecutive B-scans via SSAPGA and the four DL models are shown in Fig. 7. The DL-based methods predicted enhanced blood flow signals with less speckle noise than the traditional 4-input SSAPGA algorithm. Compared with the ground truth image, the predicted angiograms tended to overemphasize local smoothing because of the relatively limited input information (4 inputs vs. 48 inputs); nevertheless, some DL methods, such as RDN, still successfully predicted the blood flow signal while preserving a certain degree of high-frequency detail. Table 1 summarizes the quantitative performance of the five OCTA algorithms with different input protocols on the cross-sectional angiograms. Each evaluation index was obtained by averaging the corresponding results over the test dataset. As shown in the table, all the DL-based models achieve better PSNR, SSIM, and R than the traditional SSAPGA, indicating that the deep neural networks have effectively used their modeling capability to mine more intrinsic connections from OCT signals for OCTA reconstruction. The overall performance of the four DL-based models can be ranked in the following order: RDN > U-Net > Pix2Pix GAN > DnCNN. One major explanation is that skip connections, which were adopted in RDN, Pix2Pix GAN and U-Net, were absent from DnCNN. As mentioned in Section 2.3, skip connections can alleviate the vanishing-gradient problem in the training of very deep neural networks and are beneficial to hierarchical feature fusion. This is important for OCTA reconstruction because it helps DL-based models learn multi-level representations under different receptive fields or scales and thereby capture tiny changes of the OCT signals in angiogram reconstruction.
For U-Net and Pix2Pix GAN, skip connections are used to concatenate the low-level features of the encoding path with the symmetrical high-level features of the decoding path. For RDN, the residual connection and the dense connection, two variants of the skip connection, are used together to facilitate the full use of hierarchical features along the feedforward path of the network. All these connection designs in U-Net, Pix2Pix GAN and RDN help the models extract more informative features from the training images with less loss of mid- and high-frequency information when yielding the reconstructed OCTA images. However, compared with U-Net and Pix2Pix GAN, the RDN effectively combines the residual connection and the dense connection, which should further allow the network depth to be increased without sacrificing trainability. Hence, owing to both its efficient feature fusion and its strong expression capability, RDN achieved the best performance in this study. Volumetric OCT data can be rendered as 2D slices along different dimensions, such as the cross-sectional visualization mentioned before, or as enface representations. In angiogram applications, the scanned results are conventionally converted to an enface maximum intensity projection (MIP) to facilitate vasculature visualization. Therefore, the performance of the different methods was also compared in the MIP enface view in this study. Table 2 shows the evaluation indices obtained by the five OCTA algorithms with different input protocols on the MIP enface angiograms. Every DL-based algorithm outperforms SSAPGA. Across all the reconstruction results, U-Net and RDN achieved the top two overall performances. On the other hand, although the Pix2Pix GAN outperformed the U-Net under the 4-input protocol, it failed to remain effective across all the protocols.
This may be due to the inherent characteristics of GANs, which make them difficult to train. Adopting special training techniques that stabilize GANs would be one potential solution to the problem; however, this was not done in this study. To keep the comparison relatively fair, we adopted a uniform training strategy for all the DL-based models and found that it was hard to keep the balance between the generator and the discriminator in Pix2Pix GAN to achieve superior performance. Consequently, under the normal training strategy, the performance of Pix2Pix GAN did not meet expectations. Figure 8 shows the 4-input reconstructed results of the MIP enface angiogram using different algorithms. The DL-based models demonstrated obvious advantages over 4-input SSAPGA in noise suppression and microvessel reconstruction. Meanwhile, the local smoothing problem of DnCNN in Fig. 8(c) is somewhat alleviated in the reconstructed results of the Pix2Pix GAN, U-Net, and RDN. Comparing Figs. 8(d) to 8(f) further, the RDN-based angiogram contains more high-frequency details than the angiograms obtained by Pix2Pix GAN and U-Net. However, it also suffered from some wavy artifacts, which are less noticeable in Figs. 8(d) and 8(e). The wavy artifacts were essentially caused by strong sample motion during the acquisition of the 48 consecutive B-scans at each slow-axis location. Hence, this problem influenced the performance of all the DL-based models. The higher sensitivity of the RDN to sample motion should be attributed to the extremely high expression capability of the network, which needs more balanced training data to reduce the wavy artifacts. On the other hand, U-Net, owing to its moderate model size and high efficiency in computation and training, has an intrinsic advantage in learning a relatively generalized model from limited training data.
As a result, U-Net and the U-Net-based Pix2Pix GAN suffered less from the wavy artifacts, which is also reflected in the higher SSIM values in Table 2. Based on the discussion above, we believe that U-shaped models and multi-path models are two suitable categories of architectures for OCTA reconstruction tasks. U-shaped models are more suitable for tasks with small datasets and limited computational power. On the other hand, multi-path models should reach their full potential when sufficient training data and computational power are available.
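The enface MIP used for the comparison in Table 2 and Fig. 8 reduces each reconstructed volume to a 2D image by taking the maximum along depth. A minimal sketch in plain Python is given below; the axis order `volume[slow][depth][fast]` is an assumption made for illustration.

```python
def enface_mip(volume):
    """Maximum intensity projection along depth.

    volume[s][d][f] holds the angiogram value at slow-axis index s, depth d and
    fast-axis index f (assumed layout); the result is image[s][f].
    """
    return [[max(column) for column in zip(*bscan)] for bscan in volume]


# Two toy cross-sectional angiograms, each with 2 depth rows and 2 fast-axis columns.
volume = [
    [[1, 5],
     [3, 2]],
    [[0, 0],
     [7, 1]],
]
mip = enface_mip(volume)
```

Each reconstructed cross-sectional angiogram contributes one row of the enface image, so the evaluation metrics on the MIP are computed on a single 2D image per volume.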
Finally, an additional experiment was conducted to investigate the robustness of the DL-based OCTA reconstruction to noise. We assumed the noise to be additive white Gaussian noise (AWGN), which is the most common noise in OCT images. Noisy B-scans, generated by adding AWGN of a given noise level to the test set, were fed to the trained 4-input RDN model for demonstration. In the experiment, four noise levels, i.e., σ ∈ {4, 6, 8, 10}, were used to generate the noisy inputs for OCTA reconstruction. The MIP enface angiograms of the results are shown in Fig. 9. The trained DL model demonstrated a certain robustness to AWGN. With noise levels of σ = 4 and σ = 6, the reconstructed MIP enface angiograms successfully preserved the primary vascular composition without generating obvious false blood flow signals. Although there was some degradation of the reconstruction quality, the results remain acceptable at noise levels as high as σ = 8 and σ = 10. In fact, noise with σ in the range [0, 2] already hampers OCT [46] and severely limits OCTA performance owing to its intrinsic high-frequency sensitivity. Table 3 summarizes the corresponding quantitative evaluation. The quantitative results confirm the visual impression of Fig. 9 and indicate that well-trained DL-based OCTA models are robust to AWGN.
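Generating the noisy test inputs can be sketched as follows. The intensity scale on which σ is expressed is not stated in the text, so the constant toy image value and the optional clipping range here are assumptions; the function name `add_awgn` is illustrative.

```python
import random


def add_awgn(image, sigma, rng=None, clip=None):
    """Add zero-mean white Gaussian noise of standard deviation sigma to every pixel."""
    rng = rng or random.Random()
    noisy = [[pixel + rng.gauss(0.0, sigma) for pixel in row] for row in image]
    if clip is not None:
        low, high = clip
        noisy = [[min(max(p, low), high) for p in row] for row in noisy]
    return noisy


# Toy constant B-scan; one noisy copy per noise level used in the experiment.
clean = [[100.0] * 100 for _ in range(100)]
noisy_by_level = {s: add_awgn(clean, s, rng=random.Random(s)) for s in (4, 6, 8, 10)}
```

Each of the input B-scans would receive independent noise before being stacked and passed to the trained model.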

Improvement exploration
Some potential solutions were explored in pilot studies to further improve the DL-based OCTA reconstruction. Considering the training efficiency of the network, U-Net was chosen as the basic model for the investigation. Three schemes were investigated with the basic model, i.e., $L_1$ loss replacement, data augmentation, and phase information merging. For the first scheme, the $L_1$ loss was used to replace the original MSE loss in the U-Net. With reference to Eq. (6), the $L_1$ loss can be defined as:

$$L_1(\Theta) = \frac{1}{P}\sum_{p=1}^{P}\left\| F(x_p;\Theta) - y_p \right\|_1 \tag{12}$$

As for the second scheme, random data augmentation was applied while feeding the training data into the network. That is, for each cross-sectional image pair, a 180° clockwise rotation and a horizontal flip were randomly applied to the original image pair during the reading process.
As for the third scheme, the cross-sectional image pairs were reorganized by adding three OCT phase images into the image pairs. The OCT phase images, which correspond to the four OCT structural images [15], were calculated through the phase gradient angiography method.
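The first two schemes can be sketched in plain Python as below. The L1 loss is written as the mean absolute error over pixels, and the augmentation applies the same 180° rotation and horizontal flip to every input channel and to the label so that the pair stays spatially aligned. The probability of 0.5 for each operation is an assumption; the text only says the operations were applied randomly.

```python
import random


def l1_loss(y, z):
    """Mean absolute error between two images given as nested lists."""
    h, w = len(y), len(y[0])
    return sum(abs(y[i][j] - z[i][j]) for i in range(h) for j in range(w)) / (h * w)


def rotate180(img):
    return [row[::-1] for row in img[::-1]]


def hflip(img):
    return [row[::-1] for row in img]


def augment_pair(x_channels, y, rng, p=0.5):
    """Apply identical random transforms to all input channels and the label."""
    if rng.random() < p:
        x_channels = [rotate180(c) for c in x_channels]
        y = rotate180(y)
    if rng.random() < p:
        x_channels = [hflip(c) for c in x_channels]
        y = hflip(y)
    return x_channels, y


img = [[1, 2], [3, 4]]
x_aug, y_aug = augment_pair([img, img], img, random.Random(0))
loss = l1_loss([[0, 1], [2, 3]], [[1, 1], [2, 1]])
```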
All three pilot studies adopted the same training strategy and parameter settings as the basic U-Net model in Section 2.3. Only the 4-input image protocol was tested in these experiments. The results for the MIP enface angiograms are shown in Table 4, from which several observations can be made. First, the $L_1$ loss replacement improved the reconstruction quality of OCTA. This may be attributed to the advantage of the $L_1$ loss in preserving low-frequency information. However, the improvement is relatively limited. Hence, designing a more suitable loss function for OCTA tasks is a direction that remains to be explored in further studies. Secondly, the effectiveness of data augmentation was demonstrated. One possible explanation is that data augmentation helps mitigate the overfitting of the U-Net to a certain extent, resulting in better generalization of the trained model. Therefore, it is meaningful to build a larger dataset with proper data augmentation techniques for OCTA tasks. Finally, as in the traditional analytic OCTA algorithms, merging phase information also has a positive impact on the performance of the DL models. Figure 10 shows the reconstructed results of the MIP enface angiogram through phase information merging. Compared with the basic U-Net model, the microvessels indicated by the red and brown arrows are reconstructed more precisely. Meanwhile, the vessel areas marked by the green arrow and the dotted circle show better edge connectivity and fewer wavy artifacts. The corresponding quantitative metrics of the two ROIs are also provided in Table 5. All the metrics of the phase merging scheme show an obvious improvement over the corresponding metrics of the basic U-Net model.
Hence, given the great modeling capability of the DL models, merging phase information into the training process to comprehensively use OCT signals for accurate OCTA reconstruction is a new impetus for DL-based OCTA tasks.

Conclusion
In this study, we conducted a comparative study of DL-based OCTA reconstruction algorithms. Four representative models from the field of image translation were investigated using the DL-based OCTA pipeline. From the results, we found that U-shaped models and multi-path models are two suitable architectures for OCTA reconstruction. As an extension of the study, preliminary explorations of three potential solutions for further improving DL-based OCTA tasks were conducted. The results showed that merging phase information appears to be the most promising direction for improvement in further research on DL-based OCTA reconstruction algorithms.

Funding
National Natural Science Foundation of China (61875123, 81421004); National Key Instrumentation Development Project of China (2013YQ030651); Natural Science Foundation of Hebei Province (H2019201378).