Neural-network based high-speed volumetric dynamic optical coherence tomography

We demonstrate deep-learning neural network (NN)-based dynamic optical coherence tomography (DOCT), which generates high-quality logarithmic-intensity-variance (LIV) DOCT images from only four OCT frames. The NN model is trained for tumor spheroid samples using a customized loss function: the weighted mean absolute error. This loss function enables highly accurate LIV image generation. The fidelity of the generated LIV images to the ground truth LIV images generated using 32 OCT frames is examined via subjective image observation and statistical analysis of image-based metrics. Fast volumetric DOCT imaging with an acquisition time of 6.55 s/volume is demonstrated using this NN-based method.

DOCT is a method that combines time-sequential OCT acquisition at the same location with analysis of the temporal fluctuations of the OCT signal. Several signal processing methods have been proposed to perform the temporal fluctuation analysis, including signal fluctuation magnitude analysis [5,6], fluctuation speed analysis [5,6], time-frequency analyses [9,16,17], and eigen-decomposition variance analysis [18].
Although DOCT has been used successfully to assess several dynamic processes of various tissues, the process requires the acquisition of large numbers (i.e., from tens to thousands) of repeated OCT frames within a long time window at a single location in the sample, which leads to a long total acquisition time. Specifically, a frequency-constrained robust principal component analysis demonstrated by Mclean et al. required 1,350 frames per location [17], while a time-frequency-analysis based DOCT demonstrated by Leung et al. required 1,000 frames per location [16]. Although our developed DOCT methods, the logarithmic intensity variance (LIV) and OCT correlation decay speed (OCDS) methods, require relatively small numbers of frames, e.g., 16 or 32 frames, the volumetric measurement time is still long and ranges from approximately 30 s up to 1 min [6]. The long acquisition time hampers high-throughput imaging, which is important for certain applications, e.g., large-sample-number drug screening.
Fig. 1. NN architecture used for LIV generation. The NN consists of three parts: an encoder (red dashed-line box), a decoder (blue dashed-line box), and a skip connection (brown arrows). The input to the NN is a set of four cross-sectional OCT images and the output is an LIV image, which is a DOCT image that is defined as the time variance of the dB-scaled OCT images.
Recently, the use of deep-learning neural networks (NNs) [19,20] to generate high-quality OCT angiography (OCTA) images from small numbers of OCT frames has been demonstrated [21].
Because both OCTA and DOCT involve the analysis of time-sequential OCT images, we hypothesized that NNs can also generate high-quality DOCT images from small numbers of OCT frames, and that such an approach will enable high-speed volumetric DOCT imaging.
In this study, we demonstrate deep-learning NN-based DOCT generation, in which an NN is trained to generate a high-quality LIV image from only four OCT frames. The LIV is one type of DOCT and is defined as the time variance of the logarithmic OCT signals. The NN-generated LIV image was compared subjectively and objectively with conventional LIV images. The comparison showed high consistency between the NN-generated LIV and conventional LIV images. It is also demonstrated that a high-quality LIV volume of a tumor spheroid can be obtained from a volumetric OCT dataset that was acquired within only 6.55 s, while a conventional LIV measurement of the same volume size requires an acquisition time of 52.4 s.

Neural network model
The architecture of the NN that generates the high-quality LIV image is shown in Fig. 1. This architecture is a modified version of the U-Net architecture [22]. The main structure in this network is its convolutional layer, which consists of a convolution operation and a rectified-linear-unit (ReLU) activation function. The input has four channels that correspond to the four time-sequential cross-sectional logarithmic OCT images, and the output is an LIV image. Convolutional layers are used to extract the image features and to change the number of channels. The black numbers over the convolutional layers (yellow and black arrows) shown in Fig. 1 indicate the numbers of convolution kernels.
The NN comprises three parts: the encoder (red dashed-line box), the decoder (blue dashed-line box), and the skip connections (the brown arrows between the encoder and decoder). The input (i.e., the set of four OCT images) first passes through the encoder. The encoder is realized using 12 convolutional layers with a 3×3 kernel (yellow arrows) and four max-pooling operations with a 2×2 kernel (green arrows). Each max-pooling operation reduces the two-dimensional image size by a factor of four, i.e., by a factor of two for each dimension. After the encoder, the image size is reduced by a factor of 16 for each dimension, and the number of image channels is increased from 4 to 1,024.
The decoder consists of combinations of transpose convolutional layers (gray arrows) and convolutional layers (yellow arrows and one black arrow). Each transpose convolutional layer uses a 2×2 kernel and enlarges the image by a factor of two for each dimension. The output from the convolutional layer is concatenated with an intermediate output from the encoder through a skip connection (brown arrow). The concatenation operation via the skip connection allows feature sharing between the encoder and the decoder. The concatenated image then passes through two convolutional layers, where each layer uses a 3×3 kernel. After four sets composed of a transpose convolutional layer and two convolutional layers, and four skip connection operations, the image size becomes the same as that of the original input image, and the number of channels is reduced from 1,024 to 32. Finally, a set composed of a convolutional layer with a 3×3 kernel and a subsequent second convolutional layer with a 1×1 kernel (black arrow) reduces the number of channels from 32 to 1.
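As a rough check of the size bookkeeping described above, the following sketch traces the spatial size of the feature maps through four 2×2 max-pooling steps and four 2×2 transpose convolutions. It assumes that every 3×3 convolution preserves the spatial size ("same" padding), which the text does not state explicitly, and the function name `trace_spatial_sizes` is hypothetical. The 64 × 64 example corresponds to the training patch size used later in this paper.

```python
def trace_spatial_sizes(height, width, n_levels=4):
    """Trace feature-map sizes through the encoder and the decoder.

    Assumes size-preserving ("same"-padded) 3x3 convolutions; this padding
    choice is an assumption of the sketch, not a detail given in the text.
    """
    factor = 2 ** n_levels
    if height % factor or width % factor:
        raise ValueError("height and width must be divisible by 2**n_levels")

    encoder = [(height, width)]
    for _ in range(n_levels):
        h, w = encoder[-1]
        encoder.append((h // 2, w // 2))  # 2x2 max pooling halves each dimension

    decoder = []
    h, w = encoder[-1]
    for _ in range(n_levels):
        h, w = h * 2, w * 2  # 2x2 transpose convolution doubles each dimension
        decoder.append((h, w))
    return encoder, decoder


encoder_sizes, decoder_sizes = trace_spatial_sizes(64, 64)
print(encoder_sizes)  # sizes at the five encoder levels
print(decoder_sizes)  # sizes after each of the four up-sampling steps
```

Under this assumption, each decoder size matches the encoder level it is concatenated with, which is why the input height and width need to be divisible by 16.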

Target image and input dataset
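The NN model is trained to generate a ground truth LIV image from a set of four time-sequential dB-scaled OCT images. The ground truth LIV image is defined as the time variance of a dB-scaled OCT image that is computed from 32 time-sequential OCT frames [5] as follows:
$$\mathrm{LIV}(x, z) = \frac{1}{N} \sum_{i=0}^{N-1} \left[ I_{\mathrm{dB}}(x, z, t_i) - \left\langle I_{\mathrm{dB}}(x, z, t_i) \right\rangle_t \right]^2, \quad (1)$$
where $I_{\mathrm{dB}}(x, z, t_i)$ represents a dB-scaled OCT intensity image at the spatial position $(x, z)$, and $x$ and $z$ are the lateral and depth positions, respectively. $t_i$ is the time of acquisition of the $i$-th frame, $N$ is the number of frames, and $\langle\ \rangle_t$ represents the average over time.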
In our typical implementation, the number of frames is 32 and the time separation between consecutive frames, denoted by $t_{i+1} - t_i$, is 204.8 ms, which means that the entire time sequence is acquired in 6.35 s. The implementation of this time-sequential data acquisition process is described later in more detail in Section 3.2.
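The LIV definition and the timing above can be illustrated with a short NumPy sketch. It assumes that the variance is normalized by the number of frames (population variance) and uses random numbers as a stand-in for real dB-scaled OCT frames; the function name `compute_liv` is hypothetical.

```python
import numpy as np


def compute_liv(frames_db):
    """Time variance of dB-scaled OCT frames.

    frames_db: array of shape (n_frames, depth, width) with dB-scaled
    OCT intensity. Returns an array of shape (depth, width) in dB^2.
    Normalization by n_frames is an assumption of this sketch.
    """
    frames_db = np.asarray(frames_db, dtype=np.float64)
    mean_over_time = frames_db.mean(axis=0, keepdims=True)
    return np.mean((frames_db - mean_over_time) ** 2, axis=0)


# Synthetic stand-in for a 32-frame sequence (random values, not OCT data).
rng = np.random.default_rng(0)
frames = rng.normal(loc=20.0, scale=3.0, size=(32, 8, 8))
liv = compute_liv(frames)

# Time window: 31 intervals of 204.8 ms separate the first and last of 32 frames.
time_window_s = (32 - 1) * 204.8e-3
print(liv.shape, round(time_window_s, 2))
```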
The four input frames are also dB-scaled OCT images and are acquired with a time separation of 1,638.4 ms, i.e., with an eight-fold longer time separation when compared with that of the ground truth LIV image. The total time separation from the first frame to the last is 4.92 s. The four dB-scaled OCT frames are then concatenated into a four-channel dataset to be fed into the NN.
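One way to picture the construction of the four-channel input is to take every eighth frame of a 32-frame sequence, as in the sketch below. The starting frame index and the channels-last layout are assumptions of this sketch; the text only fixes the eight-fold frame separation. The function name `select_input_frames` is hypothetical.

```python
import numpy as np


def select_input_frames(sequence_db, n_inputs=4, stride=8, start=0):
    """Pick every `stride`-th frame and stack the picks as channels.

    sequence_db: array of shape (n_frames, depth, width).
    Returns (inputs, indices), where inputs has shape
    (depth, width, n_inputs) and indices lists the selected frames.
    `start` and the channels-last layout are assumptions of this sketch.
    """
    indices = start + stride * np.arange(n_inputs)
    if indices[-1] >= sequence_db.shape[0]:
        raise ValueError("sequence too short for the requested selection")
    inputs = np.moveaxis(sequence_db[indices], 0, -1)
    return inputs, indices


# Small synthetic sequence: 32 frames of 48 x 64 pixels.
sequence = np.zeros((32, 48, 64), dtype=np.float32)
inputs, indices = select_input_frames(sequence)

separation_ms = 8 * 204.8                              # eight-fold frame interval
window_s = (len(indices) - 1) * separation_ms / 1e3    # first-to-last separation
print(inputs.shape, indices.tolist(), separation_ms, round(window_s, 2))
```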

Training flow
For the training of the NN model, we use time-sequential OCT frames to generate both the ground truth and the input set, as depicted schematically in Fig. 2. In this particular study, the original OCT time-sequence consists of 32 frames and the ground truth is obtained from all 32 frames using Eq. (1).

OCT device
A custom-built Jones-matrix OCT (JM-OCT) device is used for data acquisition. This system is identical to that used in our previous LIV-imaging studies of both in vitro samples [5][6][7][8] and ex vivo samples [3,4]. Because full details of this system can be found elsewhere [23,24], we only describe the system specifications briefly here. The system comprises a swept-source OCT with a microelectromechanical systems (MEMS)-based scanning light source (AXP50124-8, Axsun Technologies, MA). The center wavelength of the probe beam is 1.3 μm and the beam scanning speed is 50 kHz. The full-width-at-half-maximum axial resolution is 14 μm in tissue and 19 μm in air. The 1/e²-width lateral resolution is 19 μm.
Although the JM-OCT is a polarization-sensitive OCT and provides four OCT images corresponding to the four polarization channels, we use only a polarization-insensitive OCT intensity image that represents the intensity average of the four OCT images. In other words, our NN-based method does not use polarization-sensitive information and is thus compatible with conventional polarization-insensitive OCT devices.

Volumetric scan protocol for 32-frame sequence
For the volumetric images, we acquire 32 frames at each of 128 B-scan locations. To obtain a time separation between frames of 204.8 ms and keep the total volumetric acquisition time short (e.g., less than 1 min), we used the repeating raster scan method described in Ref. [6].
As shown in Fig. 3(a), the en-face field is split into eight blocks along the slow scan direction. Each block is then scanned 32 times using the volumetric raster scan protocol. In this case, each cross-section (i.e., each frame) consists of 512 A-lines, 32 frames are acquired at each B-scan location, and one block consists of 16 B-scan locations. After this repeating raster scan is performed sequentially for the eight blocks, a volumetric time-sequential OCT dataset is obtained. To form the final volumetric dataset, 128 B-scan locations were scanned. The inter-frame time interval was 204.8 ms and the time separation between the first and last frames at each location was 6.35 s. The total acquisition time for the volume image was 52.4 s. The datasets required to train the NN model were acquired using this protocol.
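The block-wise scan order and its timing can be checked with the following sketch, which enumerates the acquisition start time of every frame. The B-scan period of 12.8 ms is not stated in the text; it is inferred here from 204.8 ms divided by 16 B-scan locations per block and is assumed to include any scanner fly-back time. The function name `block_scan_timestamps` is hypothetical.

```python
def block_scan_timestamps(n_blocks=8, locations_per_block=16, n_repeats=32,
                          b_scan_period_s=12.8e-3):
    """Acquisition start times for a block-wise repeating raster scan.

    Returns (times, total_time_s), where times maps each B-scan location
    index to the list of start times of its repeated frames.
    b_scan_period_s is an inferred value (204.8 ms / 16), not a stated one.
    """
    times = {}
    frame_counter = 0
    for block in range(n_blocks):
        for _ in range(n_repeats):               # repeat the raster within a block
            for k in range(locations_per_block):
                location = block * locations_per_block + k
                times.setdefault(location, []).append(frame_counter * b_scan_period_s)
                frame_counter += 1
    return times, frame_counter * b_scan_period_s


times, total_time_s = block_scan_timestamps()
first_location = times[0]
interval_ms = (first_location[1] - first_location[0]) * 1e3
window_s = first_location[-1] - first_location[0]
print(len(times), round(interval_ms, 1), round(window_s, 2), round(total_time_s, 1))
```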

Samples for NN training and evaluation
To train and evaluate the NN, 60 breast cancer spheroids (MCF-7 cell line) and 60 colon cancer spheroids (HT-29 cell line) were used. These spheroids were cultivated using 96-well plates, and each spheroid was seeded with 1,000 cells. To increase the variety of these spheroids, the spheroids were treated with anti-cancer drugs, including paclitaxel (PTX; for MCF-7) and SN-38 (for HT-29) with concentrations of 0, 0.1, 1, and 10 μM, where 0 μM means that the spheroid was not treated using the drug. The treatment times were 1, 3, or 6 days, which are denoted by Day-1, -3, and -6, respectively, throughout the manuscript. Five spheroids were prepared for each combination of drug concentration and treatment time. Note that the spheroids and their OCT data are identical to those used in our previous drug-response investigation study. More details of the samples and their cultivation protocols can therefore be found in our previous publication [7].
Each of the spheroids was scanned with a lateral field of view (FOV) of 1 mm × 1 mm. The five spheroids for each condition were split into groups of three, one, and one for training, validation, and evaluation, respectively. As a result, in the NN training process, we used 96 spheroids (i.e., 72 for training and 24 for validation). The remaining 24 spheroids were reserved for the evaluation study, which is described in detail in Section 4.

Data sets for training
To generate the training and validation datasets, the 72 (training) and 24 (validation) spheroids were scanned using the 32-frame-sequence volumetric scan protocol that was described in Section 3.2. From each volume, the ground truth LIV and a corresponding input four-frame sequence were generated in the manner described in Section 2.2.
From each volumetric dataset, we selected 20 B-scan locations, and the cross-section at each selected location contained a sample region of at least 1,000 pixels. Then, from each cross-section, we extracted 40 patch pairs with 64 × 64 pixels, such that 800 patch pairs were extracted from each spheroid, where the term "patch pair" means a pair composed of the input and the ground truth. Therefore, for the training and validation processes, the sizes of both the input image and the ground truth LIV image are 64 × 64 pixels. The extraction location for each patch is selected at random, but all the patches contain the spheroid region at least partially. The patch locations can also be partially overlapped. The final training and validation datasets consist of 57,600 and 19,200 patch pairs, respectively.
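The random patch extraction described above could be sketched as follows. This is a minimal illustration, not the code used in the study: the function name `extract_patch_pairs`, the rejection-sampling loop, and the synthetic mask are all assumptions.

```python
import numpy as np


def extract_patch_pairs(input_frames, target_liv, mask, n_patches=40,
                        patch_size=64, rng=None, max_tries=10000):
    """Randomly crop co-located patches from the input stack and the target.

    input_frames: (depth, width, 4) dB-scaled OCT frames.
    target_liv:   (depth, width) ground-truth LIV image.
    mask:         (depth, width) boolean spheroid mask.
    A crop is kept only if it overlaps the mask; crops may overlap each other.
    """
    rng = np.random.default_rng() if rng is None else rng
    depth, width = target_liv.shape
    pairs = []
    for _ in range(max_tries):
        if len(pairs) == n_patches:
            break
        z0 = int(rng.integers(0, depth - patch_size + 1))
        x0 = int(rng.integers(0, width - patch_size + 1))
        window = (slice(z0, z0 + patch_size), slice(x0, x0 + patch_size))
        if mask[window].any():  # patch contains at least some spheroid pixels
            pairs.append((input_frames[window], target_liv[window]))
    return pairs


# Synthetic cross-section with a square "spheroid" region (not real data).
frames = np.zeros((384, 512, 4), dtype=np.float32)
liv = np.zeros((384, 512), dtype=np.float32)
mask = np.zeros((384, 512), dtype=bool)
mask[150:250, 200:300] = True
pairs = extract_patch_pairs(frames, liv, mask, rng=np.random.default_rng(0))

# Dataset sizes: 20 cross-sections x 40 patch pairs per spheroid.
pairs_per_spheroid = 20 * 40
print(len(pairs), pairs_per_spheroid, 72 * pairs_per_spheroid, 24 * pairs_per_spheroid)
```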

Detailed implementation of neural network and its training
The NN was implemented in Python 3.7 using the open source machine learning platform TensorFlow 2.6 on a PC equipped with a graphics processing unit (GPU; NVIDIA GeForce RTX 3090 with 10,496 Compute Unified Device Architecture (CUDA) cores and 24 GB of memory). The NN model was trained based on mini batches, where the total of 57,600 patch pairs in the training dataset was divided into 800 mini batches. Each mini batch consisted of 72 image patch pairs, and all patch pairs were taken from 72 different spheroids.
The ground truth LIV image is not evenly distributed in terms of its value (as will be discussed in Section 6.2), and thus we used the weighted mean absolute error (wMAE) as a loss function. In this case, we customized the weight ($W$) to increase the weights of the high-LIV regions as follows:
$$W(x, z, b; T) = \begin{cases} 2 & \text{for } \mathrm{LIV}(x, z; b) \geq T \\ 1 & \text{otherwise} \end{cases} \quad (2)$$
where LIV is the ground truth LIV image defined using Eq. (1), $(x, z)$ is a spatial position within the batch, $b$ is the batch index, and $T$ is a predefined threshold for the LIV. As shown in the equation above, we set higher weights for the pixels whose ground truth LIV values are at or above the threshold value ($T$). In this particular study, we set $T$ empirically at 9 dB². The wMAE was then computed from the NN output and the ground truth using this weight as follows:
$$\mathrm{wMAE} = \mathrm{mean}\left[ \left| W \bullet \left( \mathrm{LIV} - \mathrm{LIV}' \right) \right| \right], \quad (3)$$
where LIV and LIV′ represent the ground truth LIV and the NN output, respectively, $\bullet$ represents the element-wise product, and the mean is taken over all pixels. The rationale for this loss function and the optimal threshold selection procedure will be discussed later in Section 6.2.
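The weighting scheme can be illustrated with a small NumPy sketch. This is not the TensorFlow implementation used for training; the function name `weighted_mae` is hypothetical, and averaging over all pixels is an assumed normalization.

```python
import numpy as np


def weighted_mae(liv_true, liv_pred, threshold=9.0, high_weight=2.0):
    """Weighted mean absolute error with larger weights on high-LIV pixels.

    Pixels whose ground-truth LIV is >= threshold (in dB^2) get weight
    `high_weight`; all other pixels get weight 1. The mean is taken over
    all pixels (this normalization is an assumption of the sketch).
    """
    liv_true = np.asarray(liv_true, dtype=np.float64)
    liv_pred = np.asarray(liv_pred, dtype=np.float64)
    weights = np.where(liv_true >= threshold, high_weight, 1.0)
    return float(np.mean(weights * np.abs(liv_true - liv_pred)))


# Toy 2 x 2 example: two pixels are at or above the 9 dB^2 threshold.
truth = np.array([[2.0, 10.0], [9.0, 4.0]])
pred = np.array([[3.0, 8.0], [9.5, 4.0]])
loss = weighted_mae(truth, pred)
plain_mae = float(np.mean(np.abs(truth - pred)))
print(loss, plain_mae)
```

In this toy case the errors on the two high-LIV pixels count twice, so the weighted loss is larger than the plain MAE.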
The NN model parameters were updated using the Adam optimizer [25] with the wMAE loss. To enable the NN model to learn the detailed image pattern, we used a decaying learning rate strategy [26], in which the learning rate is defined as $10^{-4} + 5 \times 10^{-4}/\mathrm{epoch}$. To prevent over-fitting, the training process was stopped when the validation loss did not decrease for seven consecutive epochs, and the parameters of the eighth epoch counted back from the last epoch were stored as the trained NN model.
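The learning-rate schedule and the stopping rule can be sketched in plain Python as follows. The sketch assumes that epochs are counted from 1 and that "did not decrease" means no improvement over the best validation loss so far; both are interpretations, and the function names are hypothetical.

```python
def learning_rate(epoch):
    """Learning rate 1e-4 + 5e-4 / epoch; epochs counted from 1 (assumed)."""
    if epoch < 1:
        raise ValueError("epoch must be >= 1")
    return 1e-4 + 5e-4 / epoch


def stopping_epoch(val_losses, patience=7):
    """Return (stop_epoch, best_epoch), both 1-based.

    Training stops once the validation loss has failed to improve on its
    best value for `patience` consecutive epochs. stop_epoch is None if
    the criterion is never met.
    """
    best_loss, best_epoch = float("inf"), None
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            return epoch, best_epoch
    return None, best_epoch


# Toy validation curve: improves for three epochs, then stalls for seven.
val_losses = [1.0, 0.8, 0.7] + [0.75] * 7
stop, best = stopping_epoch(val_losses)
print(stop, best, learning_rate(1), learning_rate(5))
```

In this toy curve, training stops at epoch 10 and the stored model is from epoch 3, i.e., the eighth epoch counted back from the last one.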

Image types
We used three image types in the performance evaluation of our NN-based method. The first image type was a standard LIV image computed from the 32 dB-scaled OCT frames, and this LIV image was identical to the ground truth image. This type is designated C32LIV, where C stands for "conventional." The second image type was the output of the NN model, i.e., the image generated by using our proposed method. This image type was computed from four dB-scaled OCT frames, and is designated NN4LIV. The last image type is the time variance of the four dB-scaled OCT images, which were identical to those used for the NN4LIV image. Because the method of computation used for this type of image is identical to that used for the conventional LIV image [Eq. (1)] and the number of images used is four, this image type is designated C4LIV. This image is used as a reference for comparison with C32LIV and NN4LIV.
To enable subjective observation of the image, pseudo-color images were created for all image types, where the image brightness is the OCT intensity and the color (hue) of the image is one of C32LIV, NN4LIV, or C4LIV. The details of the color image formation process can be found in Section 2.3 of Ref. [5].

Samples for evaluation study
We used 12 breast cancer spheroids and 12 colon cancer spheroids to evaluate the trained NN model. To assess the repeatability, each spheroid was scanned twice consecutively using the 32-frame-sequence volumetric scan protocol. These two consecutive measurement sessions are named S1 and S2. The separation between the starting times for S1 and S2 was approximately 2 min. The three image types, i.e., C32LIV, NN4LIV, and C4LIV, were computed for each volumetric dataset.
Note that each OCT cross-section of these spheroids consists of 384 depth pixels × 512 transverse pixels. Unlike the NN training case, the full-size image is fed into the trained NN model at once, thus allowing the full-size cross-sectional LIV image to be obtained via a single inference operation.

Image evaluation metrics
In this study, we used several image-based evaluation metrics that had also been used to quantify spheroid viability in our previous spheroid-based drug response studies [6,7]. These evaluation metrics include the mean LIV and the viable cell ratio (VCR) within the spheroid region. To compute these metrics, we first segmented the spheroid region using a semi-automatic OCT intensity threshold-based segmentation method [6]. After excluding the B-scans that did not contain the spheroid, 1,312 cross-sectional segmentation masks for 24 volumes were obtained. The segmentation was performed using the S1 dataset, and the same segmentation mask was also used to process the S2 dataset. To apply the mask that was computed from the S1 data to the S2 datasets, small mutual axial shifts between the S1 and S2 datasets were computed and were then corrected.
The VCR is defined as the ratio of the number of pixels with high LIV values to the total number of pixels within a spheroid, where the high-LIV pixels were defined using a predefined LIV threshold. This threshold was defined empirically as 3 dB² [5]. The VCR was computed for each cross-sectional LIV image and was also computed for the entire three-dimensional volume of the spheroid. Here, we call them the "VCR of each B-scan" and the "VCR of each volume," respectively.
In addition, the mean LIVs were calculated for small regions of interest (ROIs) and for the complete spheroid. The mean LIVs of the entire spheroid region were computed both for the complete spheroid volume and for each B-scan. Here, we call them the "mean LIV of each volume" and the "mean LIV of each B-scan," respectively.
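The two metrics can be written compactly with NumPy, as in the following sketch. It assumes that "high LIV" means strictly above the 3 dB² threshold (the text does not say whether the comparison is strict), and the function names are hypothetical. The same functions apply to a single B-scan (2-D arrays) or to a whole volume (3-D arrays).

```python
import numpy as np


def mean_liv(liv, mask):
    """Mean LIV (dB^2) over the pixels selected by a boolean mask."""
    values = np.asarray(liv)[np.asarray(mask, dtype=bool)]
    if values.size == 0:
        raise ValueError("mask selects no pixels")
    return float(values.mean())


def viable_cell_ratio(liv, mask, threshold=3.0):
    """Fraction of masked pixels whose LIV exceeds `threshold` (dB^2).

    The strict inequality is an assumption of this sketch.
    """
    values = np.asarray(liv)[np.asarray(mask, dtype=bool)]
    if values.size == 0:
        raise ValueError("mask selects no pixels")
    return float(np.count_nonzero(values > threshold) / values.size)


# Toy LIV image with a four-pixel "spheroid" mask.
liv = np.array([[1.0, 4.0, 6.0],
                [2.0, 5.0, 0.5]])
mask = np.array([[True, True, True],
                 [True, False, False]])
print(mean_liv(liv, mask), viable_cell_ratio(liv, mask))
```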
For the mean LIV of the ROIs, we selected two small ROIs manually, where one was located at the spheroid core and the other was located at the spheroid periphery; these ROIs are similar to those used in Ref. [5]. The spheroid core typically shows low LIV because of necrosis, whereas the spheroid periphery shows high LIV because of its high viability. To perform this ROI-based analysis, we only used spheroids that showed clear core and periphery appearances in the LIV images. Five spheroids were selected from the total of 24 spheroids, and all five were MCF-7 spheroids. We then manually selected 10 cross-sectional images from each selected volume. All the selected cross-sections showed clear core-to-periphery contrast in the LIV images. Finally, one ROI was selected for each of the core and the periphery for each cross-section, which meant that we selected a total of 10 core ROIs and 10 periphery ROIs for each spheroid. The size of each ROI is 20 (depth) × 60 (transverse) pixels, which corresponds to 114 μm (depth) × 117 μm (transverse). The ROI regions selected using dataset S1 were also applied to the S2 dataset after the shift correction procedure. All ROI selections were based on the C32LIV image and the same ROIs were also used for the NN4LIV and C4LIV images. Finally, the mean LIV was computed for each ROI at each cross-section, and the results were then used to perform data visualization (Fig. 5). Additionally, the mean values of all core ROIs and all periphery ROIs were computed for each spheroid, and the results were used to perform the intraclass correlation analysis that will be described in the next section (Section 4.4). Note that we did not use the image metrics (means of the LIV and VCR) from each B-scan to perform the correlation analysis because these metrics, when acquired from the same spheroid, are mutually correlated in principle.

Statistical analysis
The agreements in terms of the evaluation metrics among the different image types (i.e., NN4LIV and C4LIV versus C32LIV) of S1 were evaluated using the intraclass correlation coefficient (ICC) based on a single rating, absolute agreement, two-way mixed-effects model implemented using the Pingouin package (ver. 0.5.3) [27] in Python 3.7. The repeatability was also evaluated by computing the same ICC between datasets S1 and S2 for each metric and for each image type (i.e., for C32LIV, NN4LIV, and C4LIV). The criteria for the ICC interpretation are given as follows. ICC values of less than 0.5, between 0.5 and 0.75, between 0.75 and 0.9, and more than 0.9 represent poor, moderate, good, and excellent agreement (or repeatability), respectively [28].
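For readers who want to see what a single-rating, absolute-agreement, two-way ICC computes, the following NumPy sketch implements the standard ANOVA-based point estimate (the ICC(A,1) form of McGraw and Wong). It is an illustration only and does not reproduce the Pingouin call or the confidence intervals used in this study; the function names and the handling of values exactly on a category boundary are assumptions.

```python
import numpy as np


def icc_absolute_agreement(ratings):
    """Single-rating, absolute-agreement, two-way ICC point estimate.

    ratings: array of shape (n_targets, k_raters); here a "rater" would be
    an image type (e.g., NN4LIV and C32LIV) or a session (S1 and S2).
    """
    y = np.asarray(ratings, dtype=np.float64)
    n, k = y.shape
    grand = y.mean()
    ss_rows = k * np.sum((y.mean(axis=1) - grand) ** 2)   # between targets
    ss_cols = n * np.sum((y.mean(axis=0) - grand) ** 2)   # between raters
    ss_err = np.sum((y - grand) ** 2) - ss_rows - ss_cols
    msr = ss_rows / (n - 1)
    msc = ss_cols / (k - 1)
    mse = ss_err / ((n - 1) * (k - 1))
    return float((msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n))


def interpret_icc(icc):
    """Map an ICC value to the categories listed above (boundaries assumed)."""
    if icc < 0.5:
        return "poor"
    if icc < 0.75:
        return "moderate"
    if icc <= 0.9:
        return "good"
    return "excellent"


# Toy data: the second "rater" reads every target exactly 1 unit higher.
toy = np.array([[1.0, 2.0], [2.0, 3.0], [3.0, 4.0], [4.0, 5.0]])
icc = icc_absolute_agreement(toy)
print(round(icc, 4), interpret_icc(icc))
```

The toy data are perfectly correlated but offset, so the absolute-agreement ICC falls below 1; this is the property that penalizes the systematically lower values of a biased estimator.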

Demonstration of fast-volumetric measurement
Thus far, we have described methods based on the datasets obtained when using the 32-frame-sequence volumetric scan protocol (as described in Section 3.2), which has a volumetric acquisition time of 52.4 s/volume. In the studies described above, at each B-scan location, the NN4LIV image was generated using four frames extracted from a 32-frame sequence, and the volumetric data acquisition of the NN4LIV image took 52.4 s. In practical use of the NN4LIV image, we do not need to acquire 32 frames at a single location, and four frames are sufficient. As a result, the volumetric acquisition process can be faster.
To demonstrate the NN4LIV image obtained with this fast volumetric acquisition, we designed another fast volumetric scanning protocol that gives four frames at each B-scan location, and these four frames have the same inter-frame separation time as the four frames extracted from the frame sequence obtained by the 32-frame-sequence protocol. This fast acquisition protocol is a simple repeating raster scan and is depicted in Fig. 3(b). Each raster scan consists of 128 B-scan locations, each frame consists of 512 A-lines, and the raster scan is repeated four times. Therefore, the inter-frame time separation of these four frames is 1,638.4 ms, and the time window (i.e., the time separation from the first frame to the last frame at each B-scan location) is 4.92 s. These times are identical to those of the extracted four frames of the 32-frame-sequence volumetric scan protocol. However, the total volumetric acquisition time is only 6.55 s/volume, which is 12.5% of the corresponding time for the 32-frame-sequence volumetric scan protocol.
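The timing figures of the fast protocol follow from simple arithmetic, sketched below. The B-scan period is not given in the text; it is inferred here from the 204.8 ms interval over 16 B-scan locations in the 32-frame-sequence protocol, and the names used are hypothetical.

```python
# Inferred B-scan period: 16 B-scans fit into one 204.8 ms frame interval.
B_SCAN_PERIOD_S = 204.8e-3 / 16


def raster_protocol_times(n_locations, n_repeats, b_scan_period_s=B_SCAN_PERIOD_S):
    """Timing of a simple repeating raster scan over all B-scan locations.

    Returns (inter_frame_s, time_window_s, volume_time_s).
    """
    inter_frame_s = n_locations * b_scan_period_s        # one full raster
    time_window_s = (n_repeats - 1) * inter_frame_s      # first to last frame
    volume_time_s = n_repeats * inter_frame_s            # all rasters
    return inter_frame_s, time_window_s, volume_time_s


inter_frame_s, time_window_s, fast_volume_s = raster_protocol_times(128, 4)
slow_volume_s = 128 * 32 * B_SCAN_PERIOD_S  # 32-frame-sequence protocol
ratio = fast_volume_s / slow_volume_s
print(round(inter_frame_s * 1e3, 1), round(time_window_s, 2),
      round(fast_volume_s, 2), round(slow_volume_s, 1), ratio)
```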

Image observations
Figure 4 shows the cross-sectional and en-face C32LIV, NN4LIV, and C4LIV images of the breast cancer (MCF-7) spheroids (upper half of the figure) and the colon cancer (HT-29) spheroids (lower half) from treatment Day-1. The "*" beside an image in Fig. 4 indicates a sample that was used in the selection of the core and periphery ROIs. All images were acquired from the first measurement session (S1). The LIV images of all the other spheroids and the second measurement session (S2) can be found in the figures in Supplement 1 (Figs. S1 to S10). For both spheroid types, we can see that the NN4LIV and C32LIV images show similar spatial patterns and values (i.e., similar colors in the pseudo-color images). The C4LIV images show more low-LIV signal pixels (i.e., red pixels), which make the necrotic core and the vital periphery less distinguishable. These results suggest that the appearance of each NN4LIV image is reproduced well and is consistent with the appearance of the corresponding C32LIV image, whereas the C4LIV images are not as consistent with the C32LIV images.

Agreements of NN4LIV and C4LIV versus C32LIV
The image metrics, i.e., the mean LIVs of (a) the core ROI, (b) the periphery ROI, (c) each B-scan, and (d) each volume, and the VCRs of (e) each B-scan and (f) each volume, are plotted versus the corresponding metrics of the ground truth C32LIV images in Fig. 5. The blue and red spots represent the NN4LIV and C4LIV data, respectively, and the black lines represent perfect agreement with the C32LIV data (the "ideal line"). For all six metrics, the NN4LIV metrics are close to the ideal line, i.e., the NN4LIV metrics are close to the corresponding ground truth C32LIV metrics. In contrast, the C4LIV results evidently show lower metric values than the ground truth data.
The agreements of the NN4LIV and C4LIV data with the C32LIV were evaluated quantitatively by computing the ICCs of four of the six metrics. Intraclass correlations of the mean LIVs and VCRs of each B-scan were not examined here because the B-scans (i.e., the cross-sectional LIV images) of the same spheroids cannot be independent of each other. The ICC results and their 95% confidence intervals are summarized in Table 1. The NN4LIV metrics show "excellent" agreement with the C32LIV results for all four metrics, i.e., ICC > 0.9. In contrast, the C4LIV metrics show only "good" (0.75 ≤ ICC < 0.9 for mean LIVs) or "poor" (ICC < 0.5 for VCR) agreements. In addition, all the C4LIV metrics show very wide 95% confidence intervals and their lower bounds are very low (as indicated by red numbers). Therefore, C4LIV cannot provide a reliable alternative to C32LIV.

Repeatability of NN4LIV, C4LIV, and C32LIV metrics
The repeatabilities of the image metrics of the C32LIV, NN4LIV, and C4LIV images were evaluated by computing the ICCs of the image metrics between the two measurement sessions (S1 and S2). The resulting ICCs and their 95% confidence intervals are summarized in Table 2.
All the C32LIV, NN4LIV, and C4LIV metrics showed "excellent" repeatability, i.e., ICC > 0.9. Note here that the lower bound of the 95% confidence interval of the mean LIV for C4LIV at the periphery ROI (indicated by red numbers) is only "good" (0.84), but this value is still acceptable. These results indicate that all three types of LIV show high repeatability. It should also be noted that the agreements (i.e., the ICCs) between the NN4LIV and C32LIV metrics shown in Table 1 are close to the repeatability results (ICCs) for C32LIV shown in Table 2. This suggests that the disagreements between the NN4LIV and C32LIV metrics are within the fluctuation range of the C32LIV type itself.

Demonstration of fast-volumetric measurement
In all the previous results, we used datasets obtained using the 32-frame-sequence volumetric scan protocol and as a result, the volumetric data acquisition process took 52.4 s, even for the NN4LIV images. Here, we demonstrate LIV imaging with a volumetric acquisition time of only 6.55 s. Four untreated breast cancer spheroids that had been cultivated for 11 days were scanned using the fast-volumetric-measurement protocol that was described in Section 4.5. The NN4LIV and C4LIV images were computed from the datasets obtained via this protocol. The same trained NN model that was used in the previous sections was used for NN4LIV generation. In addition, the same spheroids were scanned using the 32-frame-sequence volumetric scan protocol and C32LIV images were then generated for reference.
The cross-sectional and en-face C32LIV, NN4LIV, and C4LIV images of the four spheroids are shown in Fig. 6. Based on subjective observation, the C32LIV and NN4LIV images show consistent image appearances, whereas the C4LIV images show more low-LIV (red) pixels.
It should also be noted that this demonstration was performed with spheroids after 11 days of cultivation, while the NN model training was performed using spheroids with cultivation times of only 1, 3, and 6 days. The reasonable generation of the NN4LIV images may suggest generalization of the trained NN model to some degree.
Table 1. ICCs and their 95% confidence intervals (written in [ ]) for the NN4LIV and C4LIV metrics versus those of C32LIV (the ground truth). Higher (closer to 1.0) ICC values indicate better agreement. All NN4LIV metrics show excellent agreement with the C32LIV metrics. In contrast, the C4LIV metrics show only poor to good agreement levels, and the lower bounds of their 95% confidence intervals are very low.
Columns: NN4LIV vs C32LIV; C4LIV vs C32LIV.

NN4LIV enables fast volumetric LIV imaging
It was found that the NN4LIV results resemble the C32LIV (the ground truth) results closely, both qualitatively and quantitatively, as shown in Sections 5.1.1 and 5.1.2. In addition, the NN4LIV and C32LIV methods are highly repeatable, as shown in Section 5.1.3. Furthermore, as demonstrated in Section 5.2, a volumetric NN4LIV tomographic image can be obtained in 6.55 s, while the time required to acquire a volumetric C32LIV image is 52.4 s. Because of its high resemblance to C32LIV, its excellent repeatability, and its compatibility with fast volumetric data acquisition, the NN4LIV method can be substituted for C32LIV and thus enable high-speed volumetric LIV imaging. In contrast, the C4LIV method was found not to bear such a close resemblance to the C32LIV method, either qualitatively or quantitatively (see Sections 5.1.1 and 5.1.2). Therefore, the C4LIV method may not be able to be substituted for C32LIV, although it is compatible with the fast volumetric acquisition protocol.

Rationality of the loss function selection
C32LIV image patches used for NN model training that partially include the non-spheroid region (i.e., the cultivation medium).
The long-tailed distributions of these histograms indicate that the counts for the high-LIV pixels are far lower than those for the low-LIV pixels. Note that the small left peak in the target-C32LIV-patch histogram corresponds to the non-spheroid region (i.e., the cultivation medium). In general, such a highly skewed distribution for the target image (i.e., the ground truth) will hamper effective training of the NN model. Two well-known strategies can overcome the negative effects of the highly skewed distribution of the ground truth. The first strategy is re-sampling, and the second is cost-sensitive reweighting [29]. In this study, we used the latter strategy. In other words, we increased the weights of the high-LIV regions when we computed the losses. Specifically, we used the wMAE [Eq. (3)] as the loss function, as discussed in Section 3.4.2, where the wMAE loss is computed by allocating weights of 2 to the pixels with LIV values that are higher than a predefined threshold $T$. Despite the number of high-LIV pixels being relatively small, the allocation of higher weights to the high-LIV pixels may help the NN to focus more strongly on these high-LIV pixels.
To compare the wMAE with other standard loss functions, including the nonweighted mean absolute error (MAE) and the mean squared error (MSE), we trained NN models using these two loss functions. The training protocol and the datasets used here are the same as those described in Section 3. In addition, we trained three NN models with the wMAE loss function with predefined thresholds of 8, 9, or 10 dB², where the 9 dB² threshold is the threshold used in the main study in this paper. As a result, five NN models were trained in total. NN4LIV images were generated by all five trained NN models from the S1 datasets of 24 evaluation spheroid datasets. C32LIV images were also computed from the same evaluation datasets for reference. The agreement of the image metrics with those of the C32LIV images was quantified using the ICC in a similar manner to that described in Sections 4.3 and 4.4. The ICCs and their 95% confidence intervals in the cases of MAE, MSE, and wMAE (9 dB² threshold) are summarized in Table 3. Although the MAE results show excellent or good agreement levels, the lower bounds of the 95% confidence intervals are very low (red numbers) for the mean LIVs of both the periphery ROI and of each volume. The MSE also shows excellent agreement levels, but the lower bounds of the 95% confidence intervals are low for the mean LIV of the core ROI and the VCR of each volume (red numbers). The wMAE with the 9 dB² threshold, which has been used in the main study, shows excellent agreement and, unlike both the MAE and the MSE, the lower bounds of its 95% confidence intervals are also very high for all metrics. We can thus conclude that wMAE is the best loss function among the three types of loss functions under test here.

Table 3. ICCs and their 95% confidence intervals (written in [ ]) between the image metrics obtained with the C32LIV images (ground truth) and those obtained with the NN4LIV images generated using the NN models trained using the MAE, the MSE, and the wMAE with the 9 dB² threshold as loss functions. Consideration of the ICCs and the 95% confidence intervals shows that wMAE provides the best resemblance between the NN4LIV images and the C32LIV images.

The NN4LIV image metrics with the different wMAE thresholds were also compared with those of the C32LIV images, with the results summarized in Table 4. As shown in the table, the 9 dB² wMAE provides excellent agreement for all metrics, and the lower bounds of all the 95% confidence intervals with this threshold are also very high. However, the lower (8 dB²) and higher (10 dB²) thresholds provided low lower bounds for the 95% confidence intervals for some metrics (red numbers). As a result, we selected the 9 dB² threshold for the main study. Note that the threshold value was selected based on the specific dataset used in this study. The optimal threshold may vary among different types of samples, and it may be necessary to re-optimize the wMAE weight to achieve the best performance for other sample types.

Table 4. ICCs and their 95% confidence intervals (written in [ ]) for the image metrics of NN4LIVs generated with wMAE loss functions using the thresholds of 8, 9, and 10 dB² versus the image metrics of C32LIV. The 9 dB² threshold gives excellent ICCs for all image metrics and provides the best lower bounds for the 95% confidence intervals among the three thresholds.
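For readers unfamiliar with the ICC, the sketch below computes one common form, ICC(2,1) (two-way random effects, absolute agreement, single measurement), from paired metric values. The specific ICC form and the confidence-interval method used in this paper are defined in Sections 4.3 and 4.4, which lie outside this passage, so the choice of ICC(2,1) here is an assumption. The function name `icc_2_1` and the example numbers are hypothetical, and confidence intervals are not computed.

```python
import numpy as np


def icc_2_1(ratings):
    """ICC(2,1) point estimate for an (n_subjects, k_methods) array.

    Each row is one sample (e.g., one spheroid volume) and each column
    is one method (e.g., a C32LIV metric and an NN4LIV metric).
    """
    y = np.asarray(ratings, dtype=float)
    n, k = y.shape
    grand = y.mean()
    row_means = y.mean(axis=1)
    col_means = y.mean(axis=0)

    # Two-way ANOVA sums of squares (no replication).
    ss_rows = k * np.sum((row_means - grand) ** 2)
    ss_cols = n * np.sum((col_means - grand) ** 2)
    ss_err = np.sum((y - grand) ** 2) - ss_rows - ss_cols

    msr = ss_rows / (n - 1)             # between-subject mean square
    msc = ss_cols / (k - 1)             # between-method mean square
    mse = ss_err / ((n - 1) * (k - 1))  # residual mean square
    return float((msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n))


# Hypothetical metric values: identical methods versus a constant offset.
gt = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
icc_same = icc_2_1(np.column_stack([gt, gt]))          # perfect agreement
icc_biased = icc_2_1(np.column_stack([gt, gt - 2.0]))  # systematic offset
```

Because ICC(2,1) measures absolute agreement, a systematic offset between two methods lowers the value even when the two sets of metrics are perfectly correlated.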

Limitations of current NN-based LIV generation method
The current NN-based DOCT generation method has some limitations and open issues. One limitation is that the current method is specific to a limited range of sample types, i.e., two types of tumor spheroid. In addition, the current method can only generate one type of DOCT contrast, i.e., LIV. The generalization of the proposed method to other sample types and other DOCT contrasts can be addressed in future work.
Another limitation is that it is currently necessary to select the optimal parameters of the wMAE loss function manually via empirical and experimental comparisons, as exemplified in Section 6.2. A more universally optimized wMAE loss function and/or a more generalized loss function for a wider variety of samples may be realized through further analysis of the distributions of several datasets. This will also be addressed in future work.

Conclusion
We have demonstrated an NN-based DOCT method that generates LIV images from a small number (typically four) of OCT frames while maintaining similar image quality and similarly high fidelity to conventional LIV images, which are computed from far larger numbers (typically 32) of frames. Additionally, while the conventional method requires a volumetric acquisition time of 52.4 s/volume, the proposed NN-based method requires only 6.55 s/volume. Qualitative image comparisons and quantitative image-metric analyses of the NN-based LIV and the conventional LIV confirmed the strong resemblance between these two types of LIV image. Although the available DOCT contrast types and the measurable sample types are limited at present, the NN-based DOCT method enables high-speed volumetric DOCT imaging with an acquisition time that is eight times shorter than that of the conventional LIV method.
(1) [part (a) of the figure]. The input time sequence is then constructed by extracting the 8th, 16th, 24th, and 32nd frames from the original dataset [part (b)]. In the training process, the loss is computed based on the output from the NN model and the ground truth LIV, and is then back-propagated to the NN to update the parameters [part (c)]. A detailed description of the implementation of the training process, including the definition of the loss function and selection of the hyperparameters, is presented later in Section 3.4.2.

Fig. 2. Schematic diagram of the data pair generation process for NN training and the training flow. The original data are a time sequence of 32 OCT images (frames) with an inter-frame time separation of 204.8 ms. (a) The ground truth LIV image is computed as the pixel-wise time variance among the 32 OCT images. (b) Four frames with an inter-frame time separation of 1,638.4 ms are then extracted from the 32-frame OCT sequence, and these frames are paired with the ground truth LIV image. (c) In the training loop, the NN accepts the four-frame sequence as an input, and the network parameters are then updated to generate the ground truth LIV image. The loss is computed by a customized loss function (the weighted mean absolute error; see Section 3.4.2) from the network output and the ground truth, and this loss is then back-propagated to the NN model to update the parameters (as indicated by the orange arrow).
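The data pair generation in parts (a) and (b) of Fig. 2 can be sketched in a few lines of NumPy. This is an illustrative sketch rather than the authors' code. The function name `make_training_pair` is hypothetical, and several details are assumptions: the frames are taken to be strictly positive linear-intensity images that are converted to dB before the variance is computed, the variance is the population variance, and the dB-scaled frames are passed to the NN as input.

```python
import numpy as np


def make_training_pair(frames_linear, picked=(8, 16, 24, 32)):
    """Build one (input, ground truth) pair from a 32-frame OCT sequence.

    frames_linear: array of shape (32, depth, width) holding strictly
    positive linear OCT intensity (assumed input format).
    picked: 1-based frame numbers extracted as the NN input.
    """
    frames_db = 10.0 * np.log10(np.asarray(frames_linear, dtype=float))
    # Ground truth LIV: pixel-wise variance over time of the dB-scaled
    # intensity (population variance, ddof=0, assumed), in dB^2.
    liv = np.var(frames_db, axis=0)
    # Convert 1-based frame numbers to 0-based indices.
    idx = np.asarray(picked) - 1
    nn_input = frames_db[idx]
    return nn_input, liv


# Synthetic example with random positive intensities.
rng = np.random.default_rng(0)
frames = rng.uniform(1.0, 10.0, size=(32, 4, 5))
nn_input, liv = make_training_pair(frames)
```

Taking every eighth frame from a sequence with a 204.8 ms frame interval gives the 1,638.4 ms separation quoted in the caption.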

Fig. 3. Schematic diagram of the two scanning protocols used for volumetric data acquisition. (a) A 32-frame-sequence volumetric scan protocol is used to acquire a dataset from which both the ground truth LIV images and the NN-based LIV images are generated. In this protocol, the entire lateral field of view is split into eight blocks, and each block is scanned repeatedly using a raster scan protocol with 16 B-scan locations for a total of 32 times. Therefore, for the eight blocks, a total of 128 B-scan locations are scanned. The inter-frame time separation at each B-scan location is 204.8 ms and the total data acquisition time is 52.4 s/volume. (b) A fast volumetric scanning protocol is used to demonstrate fast volumetric LIV imaging based on the NN-based LIV method. The complete field is scanned repeatedly using a raster scan protocol with 128 B-scan locations for a total of four times. The inter-frame time separation at each B-scan location is 1,638.4 ms and the total data acquisition time is 6.55 s/volume.
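The acquisition times quoted in the caption follow from the scan parameters by simple arithmetic, as the short check below shows. The per-B-scan time of 12.8 ms is derived here from the 204.8 ms inter-frame separation and the 16 B-scan locations per block; it is not stated explicitly in the caption. The check also assumes that there is no additional dead time between blocks or repeats.

```python
INTER_FRAME_MS = 204.8    # inter-frame separation, 32-frame protocol
LOCATIONS_PER_BLOCK = 16  # B-scan locations per block
BLOCKS = 8                # blocks per volume
FRAMES_SLOW = 32          # repeats per location, 32-frame protocol
FRAMES_FAST = 4           # repeats per location, fast protocol

# Time per B-scan implied by the numbers above (derived, not stated).
bscan_ms = INTER_FRAME_MS / LOCATIONS_PER_BLOCK  # 12.8 ms

# (a) Each block is scanned 32 times before moving to the next block.
slow_s = BLOCKS * FRAMES_SLOW * LOCATIONS_PER_BLOCK * bscan_ms / 1000.0

# (b) All 128 locations are scanned once per repeat, four repeats in total.
total_locations = BLOCKS * LOCATIONS_PER_BLOCK    # 128
fast_inter_frame_ms = total_locations * bscan_ms  # 1638.4 ms
fast_s = FRAMES_FAST * fast_inter_frame_ms / 1000.0

speedup = slow_s / fast_s
print(f"{slow_s:.2f} s, {fast_s:.2f} s, ratio {speedup:.1f}")
```

The ratio of the two acquisition times is the ratio of the frame counts (32 to 4), i.e., a factor of eight.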

Fig. 4. Comparison of the C32LIV, NN4LIV, and C4LIV images obtained from the same 32-frame-sequence OCT images. In the images, "*" indicates the samples that were used for selection of the core and periphery ROIs, which are used later for the statistical analysis. The upper and lower halves of the figure show pseudo-color images of the breast cancer (MCF-7) spheroids and the colon cancer (HT-29) spheroids, respectively. These breast and colon cancer spheroids were treated using PTX and SN-38, respectively. The left and right halves of the figure show the cross-sectional and en-face images, which were both obtained at the approximate centers of the spheroids. The scale bar represents 200 µm. The NN4LIV images reproduce the spatial patterns and the values of the ground truth C32LIV images well, whereas the C4LIV images show more low-LIV pixels (the red pixels at the periphery) than the C32LIV images.

Fig. 5. Agreement evaluation of the image metrics obtained from the NN4LIV and C4LIV images against those obtained from the C32LIV images. The mean LIVs of (a) the core ROI, (b) the periphery ROI, (c) each B-scan, and (d) each volume, and the VCRs of (e) each B-scan and (f) each volume of the NN4LIV (blue) and C4LIV (red) images are plotted versus those of the C32LIV (ground truth) images. Black lines indicate perfect agreement. The metrics of the NN4LIV images show high agreement with those of the C32LIV images, while the metrics of the C4LIV images clearly show lower values than those of the C32LIV images. The inset in (a) shows an example of the core ROI and the periphery ROI (white and black boxes, respectively) that were used to compute the metrics.

Figure 7 shows the histograms for (a) the spheroid region of a breast cancer (MCF-7) C32LIV volume, (b) the spheroid region of a colon cancer (HT-29) C32LIV volume, and (c) the target

Fig. 7. Histograms of C32LIV values (a) in the spheroid region of a breast cancer (MCF-7) volume, (b) in the spheroid region of a colon cancer (HT-29) volume, and (c) of all patches used in the NN-model training process. Note that (a) and (b) are the histograms of the spheroid regions only, while the pixels of histogram (c) partially include the non-spheroid region (i.e., the cultivation medium). The left peak in (c) corresponds to the non-spheroid region. All the histograms are highly asymmetric and skewed, and the uneven LIV distribution of the training dataset in (c) hampers effective training of the NN model.

Table 2. ICCs and their 95% confidence intervals (written in [ ]) for the LIVs from two consecutive measurement sessions (S1 and S2). All images show excellent or good repeatability for all image metrics.