Regularising disparity estimation via multi task learning with structured light reconstruction

3D reconstruction is a useful tool for surgical planning and guidance. However, the lack of available medical data stunts research and development in this field, as supervised deep learning methods for accurate disparity estimation rely heavily on large datasets containing ground truth information. Alternative approaches to supervision have been explored, such as self-supervision, which can reduce or remove entirely the need for ground truth. However, no proposed alternatives have demonstrated performance capabilities close to what would be expected from a supervised setup. This work aims to alleviate this issue. In this paper, we investigate the learning of structured light projections to enhance the development of direct disparity estimation networks. Firstly, we show for the first time that it is possible to accurately learn the projection of structured light on a scene, implicitly learning disparity. Secondly, we explore the use of a multi task learning (MTL) framework for the joint training of structured light and disparity. We present results which show that MTL with structured light improves disparity training without increasing the number of model parameters. Our MTL setup outperformed the single task learning (STL) network in every validation test. Notably, in the medical generalisation test, the STL error was 1.4 times worse than that of the best MTL performance. The benefit of using MTL is emphasised when the training data is limited. A dataset containing stereoscopic images, disparity maps and structured light projections on medical phantoms and ex vivo tissue was created for evaluation together with virtual scenes. This dataset will be made publicly available in the future.


Introduction
Recently, it has been shown that when large datasets are available, deep learning approaches define the state-of-the-art in 3D scene reconstruction (Zhao et al. 2020; Laga et al. 2020). This is fundamentally due to a neural network's ability to learn more complex representations of image data than can be handcrafted. However, the coupling of data volume and performance is an issue, particularly for domains that have limited data availability such as surgery (Hashimoto et al. 2018; Willemink et al. 2020). Capturing large amounts of depth information for surgery, especially Minimally Invasive Surgery (MIS), is laborious because of hardware constraints and the difficulty of handling tissue, primarily tissue deformation, which can disturb information capture. This problem is reflected in the SCARED dataset (Allan et al. 2021), which is the largest annotated depth dataset for surgical scenes, but only contains 45 unique and complete depth images. Training on small datasets complicates the development of networks (Qi and Luo 2020; Brigato and Iocchi 2021), due to the risk of overfitting (Ying 2019). To obviate ground truth, stereo self-supervision approaches (Godard et al. 2017) have been developed, which learn disparity by training to warp the left image to become the right and vice versa. However, results are still considerably worse than what is achieved by supervised networks (Uhrig et al. 2017). Accurate 3D reconstruction can provide surgeons with a tool for surgical planning and guidance (Hersh et al. 2021). Therefore, strategies for overcoming data limitation are desperately needed. Otherwise, the development of accurate neural networks for surgery is unachievable.
This project was supported by UK Research and Innovation (UKRI) Centre for Doctoral Training in AI for Healthcare (EP/S023283/1), the Royal Society [URF\R\2 01014] and the NIHR Imperial Biomedical Research Centre.
Structured light is currently the most dense and accurate approach for creating ground truth information for depth datasets. Example datasets that were created using structured light include NYU (Silberman and Fergus 2011), Middlebury Stereo (Scharstein and Szeliski 2003a; Scharstein et al. 2014) and SCARED. Structured light is the projection of patterns into a scene, which, when captured by an imaging camera, allows for depth recovery through analysis of the pattern distortions (Salvi et al. 2004). Commonly, these pattern projection images are used exclusively for ground truth depth generation and are discarded once depth has been captured. However, errors occur in the conversion process due to difficulties relating to the environment, surface properties and the hardware (Rachakonda et al. 2019; Jensen et al. 2017; Gupta et al. 2011; Scharstein and Szeliski 2003b). These errors arise primarily at the pixel classification stage and can be caused, for example, by reflections or hardware malfunction. This means that the information within the structured light images and the information within the generated depth maps are not the same.
In this paper, we propose a unique approach to depth estimation. Firstly, we show that it is possible to teach a neural network to artificially project structured light patterns into a scene, which amounts to indirectly learning to perform disparity estimation. To the best of our knowledge, this is the first use of a neural network for learning the projection of structured light patterns into a scene. Secondly, we show that the dual training of direct disparity regression and structured light projection, via multi task learning, enhances network training and improves disparity learning without increasing the number of parameters or requiring the collection of additional ground truth data.

Active Sensors
Structured light can be used to produce unique pixel codes for each stereo image, which enables the use of simple matching techniques for the stereo disparity calculation. This approach can provide dense and accurate depth maps for a given stereo image. However, structured light requires a controlled environment, multiple hardware components and sometimes temporal variation. Certain structured light patterns, such as the binary/gray code used in this paper, require sequential projection of different patterns over time. Control of the environment during structured light projection is important for good pixel classification, and robust classifiers are normally required to prevent incorrect classification (Salvi et al. 2004). These limitations have large ramifications for setups designed for real time performance in dynamic environments. Our proposed method for structured light reconstruction removes the need for the additional hardware and the temporal requirement.

Depth estimation
The state-of-the-art for 3D reconstruction is defined by deep learning approaches that directly regress depth values. The general consensus within the deep learning community is that a neural network will discover a better way to solve the stereo matching problem when given free rein in an end-to-end setup where there is limited human supervision. Examples of these networks include RAFTStereo (Lipson et al. 2021) and PSMNet (Chang and Chen 2018a) for stereo, and DORN (Fu et al. 2018) for monocular depth estimation. The prevalent issue with these networks is the requirement of large training datasets. In the papers mentioned, the networks are pre-trained on scene-flow, which contains approximately 60k images, and then fine tuned on popular sets like KITTI. Although generalisability has been shown to be largely achievable, the less similar the test data is to the training data, the worse the performance. This poses a large hurdle when the 3D reconstruction task is specific and niche. In this paper we show that recycling structured light images, which are commonly acquired to produce depth ground truth, can improve the performance of disparity estimation techniques, especially when data is limited.

Multi Task Learning
Multi task learning (MTL) has already been proposed for depth/disparity estimation purposes. In previous works, the training of depth/disparity has, for example, been combined with the training of semantic segmentation or instance segmentation (Sener and Koltun 2018; Kendall et al. 2018). These works have shown that it is possible to achieve improved depth/disparity performance in comparison to training as a single task. However, as has been identified in previous literature (Standley et al. 2020), MTL does not guarantee improved performance, and balancing the combination of tasks and the weights of the loss terms is difficult. In this work we show that it is possible to use structured light reconstruction in an MTL framework to improve disparity learning, where the ground truth requires no annotation.

Method
In this section we introduce our stereo disparity estimation techniques. Firstly, we propose the first deep learning model to learn the projection of structured light patterns onto a scene. The goal of this model is to show that it is possible to accurately learn structured light patterns that respect the stereoscopic perspectives, and to verify that structured light information can be used for 3D reconstruction purposes and should not be discarded after ground truth information has been generated, which is what is most commonly done. Secondly, we propose a novel MTL method for disparity estimation, showing that MTL can enhance disparity learning by dual training on structured light.

Dataset generation
Virtual dataset: Due to the lack of any publicly available datasets containing structured light patterns, we had to create our own dataset. A simulated environment was developed using Blender (Community 2018) to enable automatic generation of virtual scenes containing cultural objects, as shown in Fig.1. The VisionBlender add-on (Cartucho et al. 2020) was used to generate the ground truth information.
The virtual environment is simple, consisting of a cubic room of volume $1\,\mathrm{m}^3$ and a central podium of size $10\,\mathrm{cm}^3$, which is used to hold different objects. All the surfaces in the virtual scene, including the surfaces of the objects, are grey and have no texture. The projector (Schell 2021) was rigidly attached to the camera object with a rotation of $\theta_y = 1.5^\circ$ and a translation of $T = [0.02\,\mathrm{m}, 0.00\,\mathrm{m}, 0.02\,\mathrm{m}]$. 800 cultural artefact objects were used as the diversity factor in this dataset. The objects contain complex 3D surfaces and a smooth texture, making them challenging to reconstruct using existing 3D reconstruction approaches. All the objects were downloaded from the "Scan The World" project which is hosted by MyMiniFactory (Beck 2019). The dimensions were resized using a uniform distribution $U(0.03\,\mathrm{m}, 0.1\,\mathrm{m})$. For each object, 10 stereo image-pairs were captured from different viewpoints. The camera always looks towards the geometrical centre of the object. The distance between the camera and that geometrical centre is sampled from $U(0.03\,\mathrm{m}, 0.1\,\mathrm{m})$. Respecting these two constraints, the poses of the 10 viewpoints are chosen at random.
For each viewpoint, the stereo images and the associated ground truth data were generated. The ground truth includes the depth/disparity maps and stereo images without and with 8 projected binary patterns. For each projection, the number of vertical lines equals $2^n$, $\forall n \in [1..8]$, where $n$ is the projection number. The resolution of the captured images is $256 \times 256$. This dataset contains approximately 8000 images.
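The binary pattern stack described above can be illustrated with a minimal sketch. It assumes that "vertical lines" means alternating black and white bars of equal width, and the function name `binary_stripe_patterns` is hypothetical; it is not taken from the rendering pipeline used to build the dataset.

```python
import numpy as np

def binary_stripe_patterns(width=256, height=256, num_patterns=8):
    """Return an array of shape (num_patterns, height, width) with values in {0, 1}.

    Pattern n (1-based) is assumed to contain 2**n vertical bars of
    alternating colour, all of equal width.
    """
    cols = np.arange(width)
    rows = []
    for n in range(1, num_patterns + 1):
        # Index of the bar each column falls into, then alternate 0/1 per bar.
        bar_index = (cols * 2**n) // width
        rows.append(bar_index % 2)
    row_codes = np.stack(rows).astype(np.uint8)           # (num_patterns, width)
    # Every image row carries the same code, so repeat along the height.
    return np.repeat(row_codes[:, None, :], height, axis=1)

patterns = binary_stripe_patterns()
# The stack of bits at one column across all patterns is that column's code word.
codes = patterns[:, 0, :].T                                # (width, num_patterns)
num_unique_codes = len({tuple(int(b) for b in c) for c in codes})
```

Under these assumptions, the 8 patterns give each of the 256 columns a distinct 8-bit code, which is the property that later allows matching along a horizontal epipolar line.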
Medical dataset: A medical dataset was created using the da Vinci Research Kit (dVRK) (Kazanzides et al. 2014) to extend the evaluation of the model onto real images. The dVRK was chosen as it is the research model of the surgical system that is used in clinical practice and is commonly used for research in this field. A structured light projector was attached to a da Vinci camera arm, and the stereo laparoscope of the machine was used to capture scenes with projected patterns on medical phantoms and ex-vivo organs from sheep, cows and chickens. Gray code with 9 patterns was used for this dataset. The decoding was performed using the three phase algorithm (Xu et al. 2022). A turntable was used to hold the objects, which were rotated to capture multiple perspectives. Fig. 2 shows the setup and an example image which would be fed as the input to the neural network (an image without projection patterns).
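For readers unfamiliar with Gray code, the sketch below shows the standard reflected binary Gray code for 9 bit planes. It is only an illustration of the coding scheme: the exact pattern layout projected in the medical dataset and the three phase decoding of Xu et al. are not reproduced here, and the helper names are hypothetical.

```python
def to_gray(i):
    """Convert a non-negative integer to its reflected binary Gray code."""
    return i ^ (i >> 1)

def from_gray(g):
    """Invert to_gray by XOR-ing together all right shifts of the code."""
    i = 0
    while g:
        i ^= g
        g >>= 1
    return i

NUM_PATTERNS = 9  # number of Gray code patterns stated for the medical dataset
codes = [to_gray(i) for i in range(2 ** NUM_PATTERNS)]
# One bit plane per projected pattern, most significant bit first.
bit_planes = [
    [(c >> (NUM_PATTERNS - 1 - b)) & 1 for c in codes]
    for b in range(NUM_PATTERNS)
]
```

The useful property is that neighbouring code words differ in exactly one bit, so a misclassified pixel at a stripe boundary shifts the decoded position by at most one step.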

Structured light reconstruction
Given a stereo image pair, a neural network has been designed and trained to predict how structured light patterns should project on the scene surfaces of each image, separately. Specifically, the network projects binary and gray code structured light patterns, which consist of vertical bars of white and black colour. To predict these patterns, the UNet (Ronneberger et al. 2015) network is trained as a per pixel binary classifier; 1 (white) or 0 (black). These predicted patterns can then be used to generate disparity or depth maps by performing 2D cross correlation over the epipolar lines. This reveals how the network has understood the 3D scene.
The input to the UNet is a pair of rectified stereo images (which do not contain any projection patterns), concatenated along the channel axis, $I = [I_l; I_r]$, $I_l, I_r \in \mathbb{R}^{h \times w \times c}$. The network is tasked to predict structured light projections in each input stereo image, where every pixel requires a unique code along a horizontal epipolar line. $P$ denotes the ground truth patterns and $\hat{P}$ the neural network output, which is defined as $\hat{P} = [\hat{P}_l; \hat{P}_r]$, $\hat{P}_l, \hat{P}_r \in \mathbb{R}^{h \times w \times t}$, $0 \le \hat{p} \le 1$, $p \in P$, $\hat{p} \in \hat{P}$. The parameter $t$ denotes the number of projected patterns, with $t = 8$ for the virtual dataset and $t = 9$ for the medical dataset.
Losses: Two losses were chosen for the structured light learning. Since the network is trained to produce a binary mask, the binary cross entropy (bce) loss defined in Eq.(1) is used as the primary loss. To emphasise the pattern edges, an L2 loss is used on the horizontal derivatives as in Eq.(2). These derivatives are calculated by convolving over the images using a Prewitt operator mask, generating $D = P * [-1\ 0\ 1]$ and $\hat{D} = \hat{P} * [-1\ 0\ 1]$, which are the derivatives of the ground truth and predicted output, respectively, with elements $d \in D$, $\hat{d} \in \hat{D}$. These two individual losses are summed together in Eq.(3) with a weighting gain of $\lambda_1 = 1/80$.
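A minimal NumPy sketch of how such a combined loss could be computed is given below. Several details are assumptions rather than the exact form of Eq.(1)-(3): both terms are averaged over all elements, the derivative uses a 1D $[-1\ 0\ 1]$ mask along the width with zeros at the borders, and $\lambda_1$ is taken to scale the edge term. All function names are hypothetical.

```python
import numpy as np

LAMBDA_1 = 1.0 / 80.0  # weighting gain stated in the text

def bce_loss(p_true, p_pred, eps=1e-7):
    """Mean binary cross entropy between patterns of shape (h, w, t)."""
    p_pred = np.clip(p_pred, eps, 1.0 - eps)   # avoid log(0)
    per_element = -(p_true * np.log(p_pred) + (1.0 - p_true) * np.log(1.0 - p_pred))
    return float(np.mean(per_element))

def horizontal_derivative(p):
    """Apply a [-1, 0, 1] mask along the width axis (axis 1); borders stay zero.

    A true convolution flips the mask, which only changes the sign and does
    not affect the squared difference used below.
    """
    d = np.zeros(p.shape, dtype=np.float64)
    d[:, 1:-1, :] = p[:, 2:, :] - p[:, :-2, :]
    return d

def edge_loss(p_true, p_pred):
    """Mean squared difference between the horizontal derivatives."""
    diff = horizontal_derivative(p_true) - horizontal_derivative(p_pred)
    return float(np.mean(diff ** 2))

def structured_light_loss(p_true, p_pred):
    # Assumption: lambda_1 down-weights the edge term relative to the bce term.
    return bce_loss(p_true, p_pred) + LAMBDA_1 * edge_loss(p_true, p_pred)
```

In a training loop the same computation would be written with differentiable tensor operations; the NumPy version is only meant to make the two terms concrete.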
Extracting disparity from structured light: To generate the disparity map using the proposed structured light projection network, cross correlation is performed along the epipolar lines of a pair of rectified stereo images, fully connected along the pattern dimension. A patch $W_k$, where $k$ is the reference index, of size $17 \times 17 \times t$ is taken in $\hat{P}_l$, and cross correlation is performed over a range of pixels in $\hat{P}_r$ along the epipolar lines. More specifically, for every pixel in the left image, the cross correlation is estimated for pixels in the right image which lie along the epipolar line at a distance within the maximum considered disparity, which is set for each dataset individually. In our work, the maximum disparity has been set to $u = 25\%$ of the width of the image. The cross correlation on the right image starts from the same position as the examined patch on the left image and strides with a step $s$. The patch size was chosen after tuning.
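The matching procedure can be sketched as a brute-force search, shown below. This is an illustration under stated assumptions, not the implementation used for the reported results: pattern values are remapped from $[0, 1]$ to $[-1, 1]$ before correlation, image borders are zero padded, the match for a left pixel at column $x$ is searched at columns $x - d$ in the right image, and the function name is hypothetical.

```python
import numpy as np

def disparity_from_patterns(p_left, p_right, patch=17, max_disp_frac=0.25, stride=1):
    """Estimate an (h, w) integer disparity map from (h, w, t) pattern stacks."""
    h, w, _ = p_left.shape
    r = patch // 2
    max_disp = int(max_disp_frac * w)
    # Remap to [-1, 1] so agreeing bits raise the score and disagreeing bits
    # lower it; zero padding then contributes nothing to the correlation.
    left = np.pad(2.0 * p_left - 1.0, ((r, r), (r, r), (0, 0)))
    right = np.pad(2.0 * p_right - 1.0, ((r, r), (r, r), (0, 0)))
    disparity = np.zeros((h, w), dtype=np.int64)
    for y in range(h):
        for x in range(w):
            # Reference patch centred on (y, x), spanning all t patterns.
            ref = left[y:y + patch, x:x + patch, :]
            best_score, best_d = -np.inf, 0
            # Start at the same position (d = 0) and step along the epipolar line.
            for d in range(0, min(max_disp, x) + 1, stride):
                cand = right[y:y + patch, x - d:x - d + patch, :]
                score = float(np.sum(ref * cand))
                if score > best_score:
                    best_score, best_d = score, d
            disparity[y, x] = best_d
    return disparity
```

The triple loop is slow for full-size images; it is written this way to mirror the description, and a practical version would vectorise the search over candidate disparities.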

Direct disparity estimation with multi task learning
For this task, we explore the benefits of training disparity and structured light jointly. Specifically, we investigate the relative effect of introducing the MTL framework, that is, whether or not the MTL framework improves the disparity estimation performance obtained using STL. The PSMNet (Chang and Chen 2018b) architecture and training procedure were chosen as it is one of the more recognised architectures for disparity estimation and has recorded competitive performances in benchmark challenges. However, any other architecture could have been used. In this work, the PSMNet architecture was modified in such a way that no added complexity or increased number of parameters is introduced for the disparity estimation path. Compartmentalising the design allows the structured light section to be detached when training is complete. The PSMNet contains a stacked hourglass module to regress disparity, and the modification that creates the multi task framework was made at this point. More specifically, the stacked hourglass module was duplicated so that there is a bifurcation after the cost volume. This results in two parallel paths, one for each task, as shown in Fig. 3. A disparity range of 96 was chosen. The training is performed on RGB images with dimensions $256 \times 256$. In the evaluation stage, the outputs are resized using bilinear interpolation to match the original dimensions, which are $256 \times 256$ for the virtual images, $720 \times 576$ for the medical images and $1280 \times 1024$ for the SCARED images.
The loss function is composed of the structured light cross entropy loss in Eq.(1) and an L2 disparity loss. In our work, three different weighting strategies have been used to combine these loss terms. This is because multi task learning is complex, and which strategy benefits which task is generally unknown before experimentation is performed. Firstly, a simple constant scaling of the structured light loss was applied, as in Eq.(6). Secondly, the training strategy from (Liu et al. 2019) was implemented as in Eq.(7)-(8), which modifies the weights every epoch. Parameters $\lambda_3$ and $\lambda_4$ are the task weightings, a product of the ratio of the previous epoch's total losses. Here, $\zeta_1 = 2$ and $\zeta_2 = 2$. The parameter $\zeta_2$ is used to balance the task weighting distribution, and $\zeta_1$ is used as a gain on the softmax. The values were chosen after extensive experimentation. Thirdly, the uncertainty weighting strategy from (Kendall et al. 2018) was implemented as in Eq.(9). The weight $\lambda_5 = 0.5$ has been introduced to prioritise the disparity learning. The $\sigma$ parameters denote the observed noise; in practice, these are learnable parameters used to weight each loss.
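The three weighting strategies can be sketched in plain Python as follows. The forms below follow the commonly published versions of dynamic weight averaging (Liu et al. 2019) and homoscedastic uncertainty weighting (Kendall et al. 2018); they may differ in detail from Eq.(6)-(9). In particular, reading $\zeta_1$ as the softmax scale and $\zeta_2$ as the softmax temperature, and letting $\lambda_5$ scale the structured light term, are assumptions of this sketch, and all function names are hypothetical.

```python
import math

def constant_weighting(loss_disp, loss_sl, gain):
    """Strategy 1: constant scaling of the structured light loss."""
    return loss_disp + gain * loss_sl

def dwa_weights(prev_losses, prev_prev_losses, scale=2.0, temperature=2.0):
    """Strategy 2: per-task weights from the ratio of the last two epoch losses.

    scale plays the role assumed for zeta_1 and temperature that of zeta_2.
    """
    ratios = [a / b for a, b in zip(prev_losses, prev_prev_losses)]
    exps = [math.exp(r / temperature) for r in ratios]
    total = sum(exps)
    # Scaled softmax: the weights always sum to `scale`.
    return [scale * e / total for e in exps]

def uncertainty_weighting(loss_disp, loss_sl, log_var_disp, log_var_sl, sl_gain=0.5):
    """Strategy 3: weight each loss by a noise term, with log(sigma^2) inputs.

    In training, log_var_disp and log_var_sl would be learnable parameters.
    sl_gain is where lambda_5 is assumed to enter.
    """
    term_disp = 0.5 * math.exp(-log_var_disp) * loss_disp + 0.5 * log_var_disp
    term_sl = 0.5 * math.exp(-log_var_sl) * loss_sl + 0.5 * log_var_sl
    return term_disp + sl_gain * term_sl
```

With dynamic weight averaging, a task whose loss fell more slowly over the last two epochs receives the larger weight in the next epoch, which is what rebalances the two tasks during training.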

Evaluation of structured light reconstruction
In this section, the reconstruction capability of the proposed artificially projected structured light model is assessed, using the disparity maps generated with cross correlation, as described in Sec. 3.2. The aim of this validation is not to compare the performance of the proposed structured light reconstruction model to state-of-the-art disparity estimation models. Rather, we want to explore the benefits of using this alternative approach for disparity estimation and evaluate whether or not the results from the proposed method meet the expected levels of accuracy produced by the conventional approach of directly regressing disparity. More specifically, the hypothesis that this validation aims to verify is that using the structured light in the training process, through Multi-Task Learning (MTL), offers improved performance compared to Single-Task Learning (STL), without requiring extra parameters for the disparity estimation path or extra data collection, thereby alleviating issues of data limitation.
A comparison network was trained to directly estimate disparity using the same UNet architecture. The network was trained to predict disparity using a scaled Sigmoid, so that the output is mapped to the width of the image. An L2 loss on the disparities was used as the loss function. This network was used to control the experiments and remove the influence of the network architecture on the comparison of the results. There are no major architectural differences between the network proposed for structured light reconstruction and this comparison network, with the exception of the final layer, which has been modified to accommodate the output dimension requirements. Comparison to state-of-the-art disparity estimation models is out of the scope of this validation, as our focus is to prove the viability of reconstructing structured light.
The 3D reconstruction results are presented in Table 1. The first and second columns contain the mean absolute error (MAE) for each model. Both training and evaluation are performed on grayscale images of dimension $256 \times 256$. The network was trained on a single NVIDIA GeForce RTX 3080 10GB graphics card. Adam (Kingma and Ba 2015) was chosen as the optimizer, with a fixed learning rate of 0.0001. The deep learning model was coded using the PyTorch framework (Paszke et al. 2019).
Virtual dataset validation: The results on the virtual dataset, in the first two rows of Table 1, show that it is possible to achieve good reconstruction using the proposed approach. Virtual Seg denotes the metrics for the virtual object and the podium, segmented using masks provided by VisionBlender. Comparison between both approaches shows that comparable performance can be expected. For the virtual data, the training data provided to the disparity network is of higher quality than that provided for structured light, as the disparity is extracted directly from the simulation environment. We hypothesise that this is the primary cause of the performance difference between the two implementations. The training/test split was 80/20, at the object level.
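The MAE metric, including a masked variant such as the Seg rows restricted to the object and podium, can be computed along these lines. This is a generic sketch with a hypothetical function name; the evaluation code behind Table 1 is not shown in the paper.

```python
import numpy as np

def disparity_mae(pred, target, mask=None):
    """Mean absolute disparity error, optionally restricted to a boolean mask."""
    err = np.abs(np.asarray(pred, dtype=np.float64) - np.asarray(target, dtype=np.float64))
    if mask is not None:
        # Keep only the pixels selected by the segmentation mask.
        err = err[np.asarray(mask, dtype=bool)]
    return float(err.mean())
```

Passing no mask corresponds to scoring the full scene, while passing a segmentation mask scores only the selected pixels.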
Medical dataset validation: The networks trained on the virtual scenes are fine tuned on the medical dataset, shown in Fig. 4, to reveal the benefit of using structured light when the size of the dataset is limited. The training data for the structured light was 9 times larger than for the direct disparity network because 9 patterns were collected for every depth map. The results are shown in the bottom row of Table 1. The accuracy recorded from the proposed method was twice as high as that of the direct regression approach. This verifies that having the larger volume of data for structured light training, when the datasets are small, improves the accuracy of the estimated disparity because of the increased complexity of the task and the increased number of training samples. The training/test split was 80/20, at the keyframe level.
Uncertainty: As the network performs classification, it is easy to acquire the confidence in the predictions (Cao et al. 2018); the closer the Sigmoid output is to 0.5, the less confident the network is. An empirical assessment of the correlation between the disparity error and network confidence was made, and it was concluded that confidence and accuracy are correlated. An example of this is shown in Fig. 5. The ellipses highlight the same areas in each image. These results show a correlation between areas of high uncertainty and areas of low accuracy. Fig. 6 displays a histogram plot of the uncertainty and error in the above example. Each histogram bin represents the number of occurrences for each combination of uncertainty and disparity error. A positive correlation is seen between the disparity error and the uncertainty. This is numerical evidence of the correlation observed in Fig. 5. Analysing confidence is critical for healthcare applications and the adoption of deep learning methods in clinical practice. This is an advantage of using structured light reconstruction over disparity estimation, as it is naturally a classification task (where the classification is the final output).

Evaluation of the benefit of using multi task learning for the enhancement of disparity learning
In this section, we explore the benefit of using joint training for the purpose of improving the performance of the disparity estimation compared to single task learning. Five algorithms are explored here. The standard PSMNet for single task learning (disparity) is used as the benchmark, denoted by stl. Then the modified, multi task learning PSMNet is trained 4 times using 4 different strategies. Firstly, $L_{const}$ in Eq.(6) is implemented twice, with $\lambda_1 = 0.5$ and $\lambda_1 = 10$, chosen empirically. For simplicity we denote the constant gain algorithms by $cg_a := (\lambda_1 = 10)$ and $cg_b := (\lambda_1 = 0.5)$. Secondly, $L_{epr}$ in Eq.(8) is implemented, denoted by epr. Finally, $L_{unc}$ in Eq.(9) is implemented, denoted by unc.
Virtual dataset validation: Firstly, all algorithms are assessed using the virtual dataset. The results are shown in Table 2. Again, Full is the MAE for the entire scene and Seg is the MAE for the podium and object only. This experiment is performed twice, firstly on the entire training set and secondly on 1/16th of the training set, to see the impact of learning on a smaller dataset. 120 epochs were used for training; we determined this number after multiple experiments, balancing computation time and convergence. For the entire dataset experiment, all networks perform quite similarly, but with two of the multi task learning algorithms achieving slightly better results. However, on the 1/16th experiment the relative performance of stl drops, which highlights the regularizing benefits of using the proposed MTL framework. The extra complexity and the extra data have prevented overfitting to the fewer training samples. We observe that $cg_b$ also drops in ranking, highlighting the difficulty of balancing the weighting distribution.
Generalisability evaluation: We explore the generalisability properties of the networks trained on the entire virtual dataset by evaluating them on the SCARED dataset and the dense Middlebury 2014 Stereo training dataset. The virtual dataset that was used for training is limited, and therefore benchmark results were not expected to be achieved. This test was constructed only to compare the performance of STL and MTL, investigating the impact of the MTL framework with structured light learning. The SCARED dataset contains two test sets (TS8 and TS9), which collectively contain 45 unique images that were warped to create a much larger volume of test data. To avoid the influence of error created during the artificial warping, only the 45 unmodified and complete images are used for evaluation. Table 3 shows the results of this experiment. The main point of interest here is the position of the stl algorithm. In all datasets, the stl algorithm performs worst. What can be understood from these results is that even though the performance of stl was comparable to the multi task learning performances on the virtual dataset, it has overfitted to the virtual dataset distribution, whereas the MTL framework has provided a regularizing effect during training, which has resulted in greater generalisability.
Medical dataset validation: Commonly in medical imaging, training data with ground truth is limited. To tackle this limitation, the standard strategy is to pre-train a neural network on large, general datasets and then fine tune on data specific to the task. To replicate this scenario, we use the models developed on the entire virtual dataset and fine tune on the medical dataset that we created. Each network was fine tuned for 40 epochs. The results shown in Fig. 7 highlight the training stability of the compared models. The blue dotted line represents the stl performance. Overfitting can be inferred from the behaviour across the 40 epochs, where the validation accuracy diverges. This is also reflected in the performance of the MTL models epr, $cg_a$ and $cg_b$. However, for unc the training is stable and it also achieves the best disparity MAE performance, which again demonstrates the benefit of the multi task learning framework when the correct training strategy is implemented.
Figure 7. The blue dotted line is the performance of the stl model, which is used as a benchmark. What can be seen is that the only stably training algorithm is unc, in red, which also produces the best accuracy. All other models begin to overfit after 20 epochs.

Conclusion
In this paper, a novel approach to solving disparity estimation has been proposed that uniquely uses structured light information. We have proposed the first neural network to artificially project structured light patterns onto stereo images. This has allowed 3D reconstruction to be achieved using simple post processing similarity metrics. The proposed 3D reconstruction approach requires no explicit depth information during training. The performance evaluation results show that the proposed model accurately respects the surface geometry and achieves similar performance when compared to a direct regression network. As this proposed approach uses classification, it is also possible to estimate confidence in the disparity predictions, which is critical for tasks with high risk. This research was then extended by designing a novel MTL framework to jointly predict structured light and disparity. Our validation verifies that introducing this MTL framework improves the generalisability and capability of learning from small datasets for disparity estimation, without increasing the number of parameters for the disparity estimation and while using data which is already available and does not require extra annotation. Specifically, the MTL model unc produced results that were consistently better than stl, which demonstrates that when the correct multi task learning strategy is implemented, this is a better approach for developing a direct disparity estimation network. Our future work will focus on expanding our database to allow further validation of our work.