Automated segmentation of dermal fillers in OCT images of mice using convolutional neural networks

Abstract: We present a system for automatic determination of the intradermal volume of hydrogels based on optical coherence tomography (OCT) and deep learning. Volumetric image data was acquired using a custom-built OCT prototype that employs an akinetic swept laser at ∼1310 nm with a bandwidth of 87 nm, providing an axial resolution of ∼6.5 µm in tissue. Three-dimensional data sets of a 10 mm × 10 mm skin patch comprising the intradermal filler and the surrounding tissue were acquired. A convolutional neural network using a u-net-like architecture was trained on slices of 100 OCT volume data sets in which the dermal filler volume was manually annotated. Using six-fold cross-validation, a mean accuracy of 0.9938 and a Jaccard similarity coefficient of 0.879 were achieved.


Introduction
Soft tissue augmentation describes the process of injecting dermal filler material into the skin. Most commercially available products consist of hyaluronic acid, a natural polymer which can be stabilized by chemical crosslinking [1]. While the most common application follows from the desire to reduce classic signs of aging, other fields of application include treatment of scars in accident victims and reducing the social stigma of HIV patients suffering from lipoatrophy. Dermal fillers provide only temporary enhancement, as they will be gradually absorbed by the surrounding tissue. A major aim in development of new filler formulations is therefore the evaluation of the residence time in tissue.
Optical coherence tomography (OCT) is a contactless imaging modality that can capture high-resolution three-dimensional images of material specimens or tissue samples [2]. Analogous to ultrasound, images are acquired by measuring the magnitude and echo time delay of back-reflected light.

OCT system
A fiber coupler (…, Germany) splits the light from the laser into two paths for the sample and reference arms, with the sample arm receiving 75 % of the power. In the sample arm, a fiber collimator with a focal length of 10 mm (60FC-4-M10-03, Schäfter + Kirchhoff GmbH, Hamburg, Germany) produces a free-space beam that is then scanned across the sample using two galvanometer mirrors (6210HSM40, Cambridge Technology, Novanta Europe GmbH, Planegg, Germany) and focused on the sample by an f = 36 mm scan lens (Thorlabs GmbH). The lateral resolution was measured as ∼22 µm. The reference arm comprises a fiber collimator, a glass block compensating the dispersion introduced by the scan lens, a variable neutral density filter and a mirror. Polarization controllers are integrated into both the sample and reference arms. Two circulators redirect the light returning from the sample and reference arms to a second fiber coupler with a 50:50 splitting ratio (TW1300R5A2, Thorlabs GmbH), where the light back-scattered and back-reflected from sample and reference are superimposed. A dual balanced photodetector (BPD-1, Insight Photonic Solutions Inc.) converts the interference signal into an analog electric signal, which is then digitized using an ATS-9360 digitizer card (Alazar Technologies Inc., Pointe-Claire, Quebec, Canada). An FPGA (field programmable gate array, PCIe-7841R, National Instruments GmbH, Salzburg-Bergheim, Austria) is used to control the galvo mirrors. The synchronization between laser source, digitizer card and galvo mirrors is described in detail in [23].
The custom-built OCT setup achieves a sensitivity of >110 dB at a radiant power of 33 mW on the sample, a power well below the limit specified by the International Electrotechnical Commission [24]. The OCT system was employed for the repeated determination of the volume of intradermal fillers and for the evaluation of their degradation rates. The exemplary image data presented below originates from a comparative study of different filler formulations in BALB/c mice. The study was approved by the ethics committee of the Medical University of Vienna, Austria, and was performed in full compliance with Austrian legislation.

Image acquisition
OCT in vivo acquisition was performed over a square skin patch of 10 mm × 10 mm. Each volumetric data set comprises 1536 OCT images recorded at 512 different positions, with each OCT image consisting of 512 A-scans. A-scans were calculated from spectra that contain approximately 1630 useful data points. To reduce speckle noise, the three cross-sectional images taken at each distinct position were averaged. Thereafter, the OCT images were resized in the axial direction to yield isotropic voxel spacing. The final data sets had a size of 512 × 512 × 304 voxels.
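As an illustration, the averaging and axial resampling steps might look as follows; the array layout, function name and parameters are assumptions for this sketch, with `scipy.ndimage.zoom` performing the resampling (in the study: 1536 raw frames reduced to 512 averaged slices, resized to 304 axial voxels):

```python
import numpy as np
from scipy.ndimage import zoom

def preprocess_volume(raw, n_repeats=3, target_axial=304):
    """Average repeated B-scans and resample axially toward isotropic voxels.

    raw: array of shape (n_positions * n_repeats, depth, width) --
    a hypothetical layout; the actual acquisition format may differ.
    """
    n_pos = raw.shape[0] // n_repeats
    # Average the repeated frames at each position to reduce speckle noise.
    averaged = raw.reshape(n_pos, n_repeats, *raw.shape[1:]).mean(axis=1)
    # Resample the axial (depth) dimension to the target voxel count.
    factor = target_axial / averaged.shape[1]
    return zoom(averaged, (1, factor, 1), order=1)

# Small toy example: 12 frames -> 4 averaged slices, depth 40 -> 20.
vol = preprocess_volume(np.zeros((12, 40, 16), dtype=np.float32), target_axial=20)
```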

Convolutional neural network architecture
A convolutional neural network was trained to automatically segment the dermal filler structure. Its architecture is based on u-net, published by Ronneberger et al. in 2015 [16], and resembles an auto-encoder with additional skip connections (see Fig. 2). The network consists of a contracting path and an expanding path, which together form a "U" shape. The contracting path contains four downsampling steps, each made up of two 3×3 convolutions with batch normalization and rectified linear units (ReLU), followed by a 2×2 maximum pooling layer that cuts the resolution in half. With every downsampling step, the number of feature channels is doubled. The expanding path mirrors the contracting path, with the maximum pooling layers replaced by upsampling layers followed by 2×2 convolutions, so that with every upsampling step the number of feature channels is cut in half. Before again applying two 3×3 convolutions, the current feature maps are concatenated with those from the corresponding downsampling step, thus providing additional high-resolution information through these skip connections. The combination of the main path providing high-level context and the four skip connections retaining local features yields excellent segmentation results.
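The flow of spatial resolution and feature channels through the "U" can be traced with a short bookkeeping sketch. A 512 × 512 input and the reduced channel configuration used in this work are assumed; this only tabulates shapes, it is not the network itself:

```python
def unet_shapes(size=512, channels=(32, 64, 128, 256, 512)):
    """Trace feature-map size and channel count through the 'U'.

    Padded 3x3 convolutions keep the spatial size; each 2x2 max pool
    halves the resolution while the channel count doubles; the
    expanding path mirrors the contracting path step by step.
    """
    down = []
    for ch in channels[:-1]:          # four downsampling steps
        down.append((size, ch))       # two 3x3 convs at this resolution
        size //= 2                    # 2x2 max pooling halves resolution
    bottom = (size, channels[-1])     # deepest level of the "U"
    up = [(s, c) for s, c in reversed(down)]  # expanding path mirrors it
    return down, bottom, up

down, bottom, up = unet_shapes()
```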
A few modifications were implemented as compared to the originally published method [16]: Firstly, a batch normalization layer was introduced before every rectified linear unit (ReLU) activation layer. Batch normalization was first presented by Ioffe and Szegedy [25] at approximately the same time as u-net and is now widely used in convolutional neural networks. The aim of this technique is to keep the distributions of the inputs of each layer fixed. Without batch normalization, the means and variances of the layers' inputs would change with every training iteration, requiring the weights of the layers to be continuously adapted. Batch normalization improves training stability and allows faster learning.
Secondly, the number of model parameters [26] was reduced: for the application presented here, reducing the number of feature channels from 64 → 128 → 256 → 512 → 1024 to 32 → 64 → 128 → 256 → 512 had no influence on segmentation accuracy, but, owing to the smaller model size, led to a reduction in training and prediction times.
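The effect of halving the channel counts can be estimated with a back-of-the-envelope parameter count. The sketch below counts only the two 3×3 convolutions per level of the contracting path (biases included), so it is an approximation rather than the exact model size; since convolution parameters scale roughly quadratically with the channel count, halving all channels cuts the count by about a factor of four:

```python
def conv_params(cin, cout, k=3):
    """Weights plus biases of a single k x k convolution layer."""
    return k * k * cin * cout + cout

def contracting_path_params(channels, cin=1):
    """Approximate parameter count of the two 3x3 convolutions per
    level of the contracting path (the expanding path roughly mirrors it)."""
    total = 0
    for cout in channels:
        total += conv_params(cin, cout) + conv_params(cout, cout)
        cin = cout
    return total

full = contracting_path_params((64, 128, 256, 512, 1024))
reduced = contracting_path_params((32, 64, 128, 256, 512))
ratio = full / reduced  # roughly 4x fewer parameters
```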
Thirdly, image padding was used for all convolutions. This approach introduces additional pixels around the edges of the image so that the convolutions can use neighboring pixels even for pixels at the image border. Ronneberger et al. [16] used unpadded convolutions because their large input images had to be split into smaller tiles anyway, so overlapping the tiles to avoid padding was a natural choice. The images in the current analysis are small enough to be processed in full at once, thus making padded convolutions necessary for the prediction to cover the whole image area. Fourthly, since the network was only trained for two classes (selected structure and background), the sigmoid function was used instead of soft-max in the last layer of the network.

Training objective
In the analyzed data set, the area to be segmented covered only approximately 11 % of the total image area. Class imbalance like this may result in a network that only learns the more common class and ignores the minority class, which in the present case would mean a prediction of only background pixels. To avoid this effect, Ronneberger introduced pixel weighting into his cross-entropy loss function, with the weights depending on both the class frequency and the distance to the nearest structure border. By using a different loss function that depends only on the sizes of the ground-truth and predicted structures, but not on the total image size, we were able to forego the creation of weighting maps. Our loss function is a differentiable version of the Jaccard distance. Differentiability is required for calculating the gradient, which is used to adjust the weights of the neural network during training. The Jaccard similarity coefficient of two sets A and B is defined as intersection over union:

J(A, B) = |A ∩ B| / |A ∪ B|

The Jaccard distance is obtained by subtracting the Jaccard coefficient from 1:

d_J(A, B) = 1 − J(A, B)

In order to be applied in a loss function, it needs to be made differentiable:

L = 1 − ( Σ_{x∈Ω} l(x) p(x) ) / ( Σ_{x∈Ω} l(x) + Σ_{x∈Ω} p(x) − Σ_{x∈Ω} l(x) p(x) + ε )

Here, x ∈ Ω with Ω ⊂ ℤ² denotes the pixel position, l : Ω → {0, 1} is the true label of each pixel and p : Ω → [0, 1] is the prediction of the network. In order to avoid division by zero, a small number ε > 0 is added to the denominator.
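A minimal NumPy version of this loss (forward pass only; in practice an automatic-differentiation framework would supply the gradient) could look like:

```python
import numpy as np

def soft_jaccard_distance(label, pred, eps=1e-7):
    """Differentiable Jaccard distance between a binary ground-truth
    mask l(x) in {0, 1} and a network prediction p(x) in [0, 1].

    The sums replace the set operations: the intersection becomes
    sum(l * p) and the union becomes sum(l) + sum(p) - sum(l * p).
    """
    label = label.astype(np.float64).ravel()
    pred = pred.astype(np.float64).ravel()
    intersection = np.sum(label * pred)
    union = np.sum(label) + np.sum(pred) - intersection
    # eps in the denominator avoids division by zero for empty masks.
    return 1.0 - intersection / (union + eps)
```

For a perfect prediction the distance approaches 0; for completely disjoint masks it equals 1.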
We used stochastic gradient descent with a very small learning rate of 10⁻⁵, a high Nesterov momentum of 0.999 and a batch size of eight.

Training data and image augmentation
To acquire training data for the neural network, 100 volumetric OCT data sets recorded from 67 different dermal filler depots in 24 mice were manually annotated. Because of the rather long acquisition time of approximately 14 seconds, motion artifacts caused by the breathing of the mice appeared in the images. To minimize these artifacts, all slices of the volume image were registered using enhanced correlation coefficient maximization [29] implemented in OpenCV [30]. The registration algorithm was restricted to rigid transformations only, i.e. translation and rotation, in order to preserve the size of the filler. Figure 3 gives an example of a severe case of motion artifacts together with the corrected image.
Dermal fillers were manually annotated in every 8th frame and interpolated in between using morphological closing, i.e. dilation followed by erosion. The neural network was trained on 2D slices of the volumetric data set in two orthogonal directions. Only images with a certain minimum filler size (5000 pixels, i.e. ∼14 % of the maximum filler size) were used for training and 2D validation, because manual annotation was not pixel perfect and slices that cut the filler depot close to the edge of the three-dimensional structure tended to have annotations of lower quality (particularly in the direction orthogonal to the original annotation planes). In most images, the dermal filler stretches over roughly half of the horizontal image plane in both lateral directions, resulting in on average nearly 2 × 250 annotated slices per volume image and 45509 annotated slices in total.
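The interpolation between annotated frames can be sketched with `scipy.ndimage.binary_closing`; the annotation layout and the structuring-element size are assumptions for this sketch:

```python
import numpy as np
from scipy.ndimage import binary_closing

def interpolate_annotations(sparse, step=8):
    """Fill the gaps between slices annotated every `step` frames by
    morphological closing (dilation followed by erosion) along the
    slice axis of a binary volume of shape (slices, height, width)."""
    # Structuring element reaching half an annotation gap in each
    # direction, so dilations from neighboring annotated slices meet.
    structure = np.ones((2 * (step // 2) + 1, 1, 1), dtype=bool)
    return binary_closing(sparse, structure=structure)
```

Pixels covered by both neighboring annotations survive the erosion, which interpolates the shape between the annotated frames.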
Data augmentation, a process that generates new samples from existing training data, was performed in order to provide more variability than the training set originally encompassed. Using extensive image augmentation, Ronneberger [16] was able to train his model from only 30 annotated transmission electron microscopy images of Drosophila neuronal structures. While our training data set was not as limited in size, many of the images from intradermal fillers are very similar, as they were recorded adjacent to each other. The main difference in adjacent images is in the speckle noise pattern. Image augmentation allowed us to train the network with a larger variance of shapes. We used Augmentor [31] to implement three kinds of transformations: mirroring, random distortion using a 10 × 6 displacement grid to increase shape variability, and random translation to avoid training the network to only recognize fillers near the center of the image area. Figure 4 presents several examples of image augmentation. We applied six-fold cross-validation and split the 45509 annotated 2D images into six subsets of 6714 to 8036 images. Six networks were trained on alternating five of the six subsets of the data, using the remaining sixth for validation. Great care was taken to put all images of the same mouse into the same subset, even when recorded at different time points, so that network performance would always be validated only with images from different mice than those used for training. The results are presented as the averages over the six trained models.
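For illustration, a simplified NumPy stand-in for two of the three transformations (mirroring and random translation; the elastic grid distortion provided by Augmentor is omitted) could be:

```python
import numpy as np

def augment(image, mask, rng):
    """Simplified stand-in for the Augmentor pipeline: random mirroring
    and random translation applied identically to image and mask.
    np.roll wraps around the image border for brevity; a real pipeline
    would pad instead of wrapping."""
    if rng.random() < 0.5:                      # mirror horizontally
        image, mask = image[:, ::-1], mask[:, ::-1]
    dy, dx = rng.integers(-20, 21, size=2)      # random shift in pixels
    image = np.roll(image, (dy, dx), axis=(0, 1))
    mask = np.roll(mask, (dy, dx), axis=(0, 1))
    return image, mask
```

Applying the identical geometric transform to image and mask keeps the annotation aligned with the augmented image.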

Results
We manually annotated 100 volumetric data sets for training and validation and applied six-fold cross-validation. The network was trained on 2D slices extracted from the OCT volume data sets in both fast and slow scan direction, excluding slices that showed no or only a very small filler area. Training was carried out for 62500 iterations with a batch size of eight, i.e. 500000 images were shown to the network during training. Because of the strong regularization introduced by image augmentation, we observed hardly any overfitting. Training a model from scratch took approximately six hours on an Nvidia Geforce GTX 1080 Ti graphics card. Prediction of the intradermal filler on a single cross-sectional image required approximately 15 ms, resulting in a total processing time of approximately 8 s for one three-dimensional data set.

Table 1. Six-fold cross-validated results. Data is presented for 2D slices with fillers of a minimum size as used for training, for 3D (one direction) full volumes with the network applied in one direction only, and for 3D (two directions) full volumes with the network applied in two directions independently and the predicted probabilities averaged in order to reduce false positives. The last column (Human) presents inter-operator variability between two human annotators. Numbers in brackets are the standard deviations calculated over 100 volumes (neural networks) or 10 volumes (human), respectively.

We calculated six different measures from the segmentations predicted by the neural networks: The accuracy, which is widely used for the assessment of segmentation tasks, is the proportion of correctly predicted voxels. For our data set, the large share of background pixels makes correct prediction of the majority of the pixels easy, so accuracy will usually give numbers close to one, with only small differences between good and bad segmentations.
Two alternatives that are more sensitive to the quality of the segmentation are the Dice coefficient (also known as the F1 score) and the Jaccard similarity coefficient (also known as intersection over union). Both are independent of the size of the correctly predicted background. Furthermore, sensitivity, i.e. the proportion of correctly predicted structure voxels, and specificity, i.e. the proportion of correctly predicted background voxels, are given. Finally, since the main focus was on the quantification of the volume of the dermal filler, the ratio of predicted to manually annotated volume was calculated.
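The six measures can be computed from binary masks in a few lines; this is a generic sketch, not the evaluation code used in the study:

```python
import numpy as np

def segmentation_metrics(truth, pred):
    """The six measures reported in Table 1, from binary masks."""
    truth, pred = truth.astype(bool).ravel(), pred.astype(bool).ravel()
    tp = np.sum(truth & pred)      # correctly predicted structure voxels
    tn = np.sum(~truth & ~pred)    # correctly predicted background voxels
    fp = np.sum(~truth & pred)
    fn = np.sum(truth & ~pred)
    return {
        "accuracy": (tp + tn) / truth.size,
        "dice": 2 * tp / (2 * tp + fp + fn),
        "jaccard": tp / (tp + fp + fn),
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "volume_ratio": pred.sum() / truth.sum(),
    }
```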
The results of this analysis are summarized in Table 1 for three different kinds of evaluations. Firstly, we evaluated the networks on the same kind of images as used for training, i.e. 2D slices with a filler size above the threshold of 5000 pixels. Secondly, we present results for full 3D data sets including filler sizes below 5000 pixels, with the network applied on the different slices without any post-processing. As application of the network on slices in fast and slow OCT scan direction produced very similar metrics, the values given here are the averages over all slices in both directions. Thirdly, we applied the same networks in both directions independently and averaged the two predicted probabilities, which further improved the segmentation quality.
Considering the results for the 2D slices, all measures were calculated once for each of the six folds of cross-validation and then averaged to avoid outliers skewing the results. With respect to the results for the 3D volumes, all measures were calculated for each of the 100 volumes individually instead of once per fold. Averaging over the 100 volumes instead of the six folds enabled us to also calculate the standard deviations as a measure of statistical spread.
In order to compare the neural network results with human performance, a training data subset of 10 volumes was independently annotated by two equally experienced operators. The result of this inter-operator comparison is also given in Table 1.
Prediction with the neural network in one direction only revealed a cross-validated accuracy of 0.9927 and a Jaccard similarity coefficient of 0.860. By averaging two orthogonal predictions of the same networks, we were able to improve the accuracy to 0.9938 and the Jaccard similarity coefficient to 0.879. The deviation between predicted and manually annotated filler volumes was not significant. For comparison, the inter-operator comparison revealed an accuracy of 0.9959, a Jaccard similarity coefficient of 0.926 and a volume difference that was likewise not significant.
To visualize network performance, comparisons between manual and automatic dermal filler segmentations are given in Figures 5-7. Figure 5 shows five typical examples. In the first one, image (a), manual annotation was performed in the orthogonal direction and thus resulted in an uneven surface line; in this case, the automatic result is superior to the manual one. For images (b) and (c), the neural network again provides satisfactory results. Example (d) shows a slightly too large filler area on the bottom left of the automatic segmentation, where the posterior border of the depot provided very little contrast compared to the artifacts in the OCT image. In image (e), the neural network is slightly distracted by a pattern within the dermal filler depot at one position of the posterior boundary, but still provides very accurate results for the rest of the boundary. Figure 6 gives three examples where the segmentation using the neural network exhibits larger errors: In images (a) and (b), the neural network is not able to correctly delineate the full posterior boundary and thus fails to capture some of the filler area. In image (c), the neural network successfully recognizes the filler, but also shows small false positive regions at the bottom of the image. For both kinds of mistakes, our technique of applying the neural network in two orthogonal directions and averaging the predictions is very likely to eliminate these inaccuracies.
For completeness, Fig. 7 also gives two examples of images with a very small filler area. Even though images of this filler size were not used for training as they more commonly exhibited an unsatisfactory manual annotation quality, the neural network is still able to correctly recognize the boundaries of these small fillers (Fig. 7(a3) and 7(b3)).

Discussion and conclusion
We built a custom OCT system able to visualize intra-dermally injected soft-tissue fillers. We then trained a convolutional neural network to segment the filler depots in OCT images recorded by our system.
Because OCT uses near-infrared light for detection, the images have a limited depth range of 1-2 mm and the signal-to-noise ratio changes with depth. The penetration depth is limited by two effects, both of which depend on the wavelength used: while scattering of light in biological tissue decreases with increasing wavelength, water absorption roughly increases with wavelength in the near-infrared range [32]. The 1300 nm range, in which our OCT system operates, provides a comparatively high penetration depth [33]. Compared to other imaging technologies like ultrasound or X-ray, OCT images provide lower contrast and contain speckle noise that cannot easily be avoided. On the other hand, since our system uses an akinetic swept light source, it lacks the typical sensitivity roll-off of spectrometer-based OCT systems.
A custom-written segmentation algorithm for OCT images needs to deal explicitly with the limitations of OCT images. In contrast, a machine learning approach has the advantage that the artificial neural network automatically learns to handle the characteristics of the images in the training data set. The main requirement for a machine learning approach is thus a sufficiently large training data set that adequately represents the whole range of possible images. To meet this requirement, we manually annotated the dermal filler depots in 100 OCT volume data sets and applied image augmentation using random deformation, translation and mirroring. As training is performed by the computer without human intervention, the network can learn a complicated set of rules to adapt to the wide range of imaging features appearing in in vivo images. An equally elaborate custom-written algorithm would be very difficult to realize and, if feasible at all, would take much longer to implement than a machine learning approach producing results of comparable quality.
Our convolutional neural network uses a modified u-net architecture and works with 2D slices of the recorded volumetric data sets. A three-dimensional extension of u-net has been demonstrated to work with confocal microscopy images of Xenopus kidney [34], but to keep the model size viable, the network had to work on downsampled versions of the original images and process the volumes in multiple small tiles instead of the full image at once. In contrast, two-dimensional neural networks can work with images of significantly higher resolution, but run the risk of misclassification due to the loss of context in the third dimension. As our segmentation task does not have a preferred direction, we were able to partially mitigate this disadvantage with the following technique: We trained our convolutional neural network on 2D images in two orthogonal directions and also performed predictions in both directions independently. The final probability was then calculated as the average over the two orthogonal predictions. Using this technique, we were able to reduce both false positives and false negatives and improve segmentation quality.
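The dual-direction prediction scheme can be sketched as follows, with `predict_slice` standing in for the trained 2D network (a placeholder, not the actual model):

```python
import numpy as np

def predict_volume(volume, predict_slice):
    """Average the probabilities of slice-wise predictions along the two
    orthogonal in-plane directions, then threshold at 0.5.

    volume: 3D array of shape (fast, slow, depth).
    predict_slice: maps a 2D slice to a probability map of equal shape.
    """
    # Predict on slices along the fast scan direction ...
    p_fast = np.stack([predict_slice(volume[i])
                       for i in range(volume.shape[0])], axis=0)
    # ... and independently along the slow scan direction.
    p_slow = np.stack([predict_slice(volume[:, j])
                       for j in range(volume.shape[1])], axis=1)
    prob = 0.5 * (p_fast + p_slow)  # average the two probability volumes
    return prob > 0.5               # final binary segmentation
```

A voxel is kept only if the averaged probability exceeds 0.5, which suppresses false positives that appear in just one direction.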
In the results given in Table 1, sensitivity is the proportion of correctly predicted structure voxels [true pos./(true pos. + false neg.)] and specificity the proportion of correctly predicted background voxels [true neg./(true neg. + false pos.)]. While the numbers given for sensitivity are significantly lower than those for specificity, the only reason for this dissimilarity is the substantial difference between the number of filler and background pixels. In fact, the number of false positives is nearly equal to the number of false negatives.
Our data set originates from a pre-clinical study performed on dermal fillers in mice. Application in humans should also be possible provided that the fillers are injected at a depth within the penetration limit of the OCT probe beam. As the network architecture itself is not specifically optimized for dermal fillers, use of the method on other dermal structures such as cysts could be viable.
Manual annotation is a time-consuming process. On average, it took 20 minutes to manually segment the filler in one three-dimensional OCT data set. Application of the trained network for the same task takes approximately 16 seconds per volume. For a study involving 100 depots monitored over 10 time points to determine the absorption rate of the filler material, manual segmentation of the resulting 1000 volumes would be a full-time endeavor taking a few months. In addition, manual segmentation is too time-consuming to be applied clinically. In contrast, neural network processing is fast enough that it could be integrated directly into the image acquisition workflow and provide results in nearly real time.
Our approach of combining OCT with automated segmentation using a convolutional neural network to repeatedly quantify the residual volume of intradermal filler depots and thus determine the absorption rate of the applied filler material has significant potential for future studies and clinical applications in this field.

Funding. Christian Doppler Research Association; Austrian Federal Ministry for Digital and Economic Affairs; National Foundation for Research, Technology and Development; Croma-Pharma GmbH, 2100 Leobendorf, Austria, as industrial partner of the Christian Doppler Laboratory for Ocular and Dermal Effects of Thiomers.