Monocular Depth Estimation of Old Photos via Collaboration of Monocular and Stereo Networks

Old photos that were captured about a century ago have archaeological and historical significance. Many of the old photos have been successfully digitized, but most of them suffer from severe and complicated distortion. Thus, prior studies have focused on image restoration tasks such as denoising, inpainting, and colorization. In this paper, we pay attention to the depth estimation of old photos, enabling a more enjoyable appreciation of them and helping better understand past human life, activities, and environments. Because most old photos are available as single-view images, monocular depth estimation techniques can be considered a solution. However, most high-performance techniques are based on supervised learning, which requires ground-truth depth maps. Because this kind of supervised learning is not feasible for old photos, in this paper, we present a learning framework that finetunes a pretrained monocular depth estimation network for each old photo. Specifically, the pretrained monocular depth estimation network predicts stereo depth maps for stereo image rendering. Then, the pretrained stereo network predicts depth estimates from the rendered stereo image pair. By extracting reliable depth estimates and using them for supervision of the monocular network, the monocular network can be gradually learned to produce a high-quality depth map of the given old photo. From the qualitative and quantitative performance evaluations on old photos, we demonstrate the effectiveness of the proposed method.


I. INTRODUCTION
Since the invention of the camera, photographs have been used as a means of visual communication and expression. Nowadays, many people share their photos with others through social media. However, this was not the case a hundred years ago, when cheap and convenient cameras were not available to consumers. Fortunately, there are still many valuable photos taken with low-performance cameras in that period, which play an important role in understanding past human life, activities, and environments. These photos are monochromatic and have low signal-to-noise ratios (SNRs). In addition, since they have been digitized recently, they suffer from multiple and complicated distortions due to aging and physical damage. Hence, understanding 3D geometric features, i.e., depth estimation, is challenging in these degraded old photos.

The associate editor coordinating the review of this manuscript and approving it for publication was Guillermo Botella Juan.
Photographic restoration for old photos can alleviate the difficulty of understanding the 3D geometric features in challenging old photos. Although there is mixed degradation in old photos, including scratches and blotches, film noise, and the lack of color, most prior studies address each degradation separately. In particular, the inpainting task has received the most interest; it requires two steps: identification of scratches and blotches, and recovery of these damaged areas using textures from the vicinity [3], [4] or external images [5], [6]. One recent work [2] addresses the mixed degradation by introducing latent space mapping with synthetic paired data, which enables an automatic repair of old photos. Although photographic restoration of old photos has been extensively studied, geometric reconstruction of old photos has received less attention.

FIGURE 1. Examples of old photos [1] and their restoration results (bottom) obtained using [2].

VOLUME 11, 2023. This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
In this paper, we pay attention to the estimation of geometric information from old photos. In particular, we attempt to estimate a depth map corresponding to each old photo to enable a more realistic rendering of the previous moment. Since video frames or images from different viewpoints are hardly available for old photos, estimation of the depth map from a single photo, referred to as monocular depth estimation, is required to achieve our objective. Monocular depth estimation has been extensively studied in the last decades, and several state-of-the-art techniques [7], [8], [9], [10], [11] show very promising results on specific scenes, e.g., road scenes. However, these techniques are based on supervised learning, which requires ground-truth depth maps for training. Because such ground-truth depth maps are not available for old photos, an alternative approach is required.
Our approach is to train a monocular depth estimation network with the help of a stereo depth estimation network. Given the monocular depth estimation network pretrained on modern photos, we first obtain a pair of left-view and right-view depth maps of an old photo and then use them to render a stereo image pair. The rendered stereo image pair and the stereo depth estimation network are used to obtain another pair of left-view and right-view depth maps. We then extract only reliable depth estimates by the left-right consistency check and finetune the monocular depth estimation network on each old photo. Our contributions are three-fold:
• We present a novel learning framework that enables finetuning of the monocular depth estimation network without requiring any ground-truth supervision signals.
• We demonstrate the superiority of our method over the state-of-the-art unsupervised monocular depth estimation methods on old photos.
• We introduce a pairwise comparison-based quality evaluation method for the depth images estimated from old photos.
The rest of this paper is organized as follows: Section II reviews previous research related to our work; Section III describes the proposed method; Section IV presents the implementation details and experiment results; Section V concludes the paper.

II. RELATED WORK
A. OLD PHOTO RESTORATION
Old photos that were captured about a century ago suffer from multiple and complicated distortions, which naturally require multiple steps of image restoration to enhance their visibility. However, most previous studies on old photos focus on image inpainting [3], [4], [12] because damaged pixels, such as scratches and blotches, need special treatment. One exceptional recent study [2] restores multiple distortions simultaneously by using an image translation framework. Specifically, synthetically distorted image pairs are used to help map old photos to the latent space, and a global branch with a nonlocal block [13] is used with residual blocks for enabling multiple degradation restoration during latent space transformation. Fig. 1 shows several examples of the old photos used in our study [1] and their restoration results [2]. The quality of old photos can be significantly improved by [2], and thus we use these pre-processed old photos for our monocular depth estimation task.

B. STEREO DEPTH ESTIMATION
Classical stereo matching algorithms have been extensively studied to find correspondences between stereo images by calculating matching costs [14], [15], [16], [17]. With the development of learning-based methods, convolutional neural network (CNN)-based stereo depth estimation methods have demonstrated superior performance on challenging real-world scenes [18], [19], [20], [21], [22]. Because both stereo matching and optical flow estimation aim at finding pixel correspondences, some approaches have incorporated optical flow information to estimate more robust stereo depth maps [23], [24], [25]. Owing to the availability of the binocular depth cue, stereo depth estimation is generally more robust and accurate than monocular depth estimation, at the expense of increased computational cost and the need for stereo images. Notably, stereo images are not available for old scenes. Nevertheless, we generate stereo images of old scenes using a warping module and exploit them as source images for pseudo-ground-truth depth maps.

C. MONOCULAR DEPTH ESTIMATION
Traditional monocular depth estimation techniques rely on monocular depth cues such as texture, shade, focus, and occlusion for estimating a depth map from a single image. However, monocular depth cue-based methods suffer from inherent ambiguities when applied to unconstrained scenes. Due to the remarkable progress of deep learning, learning-based monocular depth estimation methods are currently dominant [26]. If corresponding pairs of images and depth maps are available, a CNN can be trained to predict the depth map from the input image. For example, the pioneering work of Eigen et al. [27] presented a coarse-to-fine framework that obtains a coarse global prediction and a fine local prediction sequentially. Many of the follow-up studies attempted to improve the performance by changing network architectures and loss functions [28], [29], [30]. For the input old photo shown in Fig. 2 [...] approach can be applied to monocular video sequences [32], [33]. However, these scenarios are not applicable to old photos available as single images.

D. KNOWLEDGE DISTILLATION
Knowledge distillation, which has advantages in increasing performance and reducing model complexity, has been actively studied recently. The student-teacher structure, proposed by Hinton et al. [34], is based on a strategy that a more accurate and deeper network can serve as a teacher model to guide the less complex student model to estimate better results. In the field of depth estimation, stereo networks that take left-view and right-view images as input usually produce more accurate depth maps than monocular networks. Accordingly, stereo and monocular networks have been employed as teacher and student networks, respectively, for the monocular depth estimation task. Some studies employed knowledge distillation by making monocular networks follow the predictions of stereo networks [35], [36], [37]. In addition to the depth maps, the intermediate feature maps extracted from the stereo networks have also been used as supervision signals for monocular networks [38], [39], [40]. Several other studies attempted to use only reliable predictions of the stereo network for knowledge distillation [41], [42]. In line with these recent works, we present the student-teacher strategy dedicated to the depth estimation of old photos.

III. PROPOSED METHOD
Due to the lack of ground-truth depth maps or multi-view images, depth maps for old photos cannot be accurately estimated by existing monocular depth estimation methods. Therefore, inspired by knowledge distillation, we propose to train the monocular network with the pseudo-ground-truth depth maps generated by the stereo network. The generated pseudo-ground-truth depth maps are refined through the left-right consistency check and used to finetune the monocular network. Fig. 3 shows the overall framework of the proposed method, where MonoNet and StereoNet represent the monocular and the stereo depth estimation networks, respectively. Throughout the paper, 'depth' specifically means 'disparity.' All the old photos are used after being preprocessed by the old photo restoration technique [2].

A. PRETRAINING OF MONOCULAR AND STEREO NETWORKS
Our framework is not restricted to specific architectures of StereoNet and MonoNet; in our experiments, we used DispNetC [43] for StereoNet and VGG-16 [44] for MonoNet. The overall pretraining process of StereoNet and MonoNet is the same as in [35], but the output layers are modified to obtain both left-view and right-view depth maps. Specifically, StereoNet is trained with supervision on the modern photo dataset [43], for which ground-truth depth maps are available. After the training of StereoNet, MonoNet is trained in an unsupervised manner by using the predictions of StereoNet as pseudo-ground-truth. The training datasets and [...]

[...] Fig. 2(a). Due to a large domain gap between the dataset used for training [47] and the input old photo, the depth maps are obtained with significant errors.

Let $I^l$ denote the input old photo, which is regarded as the left view, and let $D^l_m$ and $D^r_m$ denote the left-view and right-view depth maps predicted by the pretrained MonoNet. Given $I^l$ and $D^r_m$, we can synthesize the right-view old photo $\tilde{I}^r$ as follows:

$$\tilde{I}^r = W(I^l, D^r_m),$$

where $W$ represents the backward warping function, and we used the differentiable warping layer [45] in our implementation. Similarly, given $\tilde{I}^r$ and $D^l_m$, we can synthesize the left-view old photo $\tilde{I}^l$ as follows:

$$\tilde{I}^l = W(\tilde{I}^r, D^l_m).$$

Figs. 4(c) and (d) show examples of the synthesized stereo images, $\tilde{I}^l$ and $\tilde{I}^r$. Because of a large domain gap between modern and old photos, the depth maps are obtained with significant errors, and consequently, the stereo images contain artifacts. The parameter updates of MonoNet are essential for handling old photos, but the lack of ground-truth depth maps makes MonoNet training challenging.
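To make the backward warping step concrete, the following is a minimal NumPy sketch of horizontal warping with linear interpolation. It is not the differentiable warping layer of [45]: the function name `backward_warp`, the `sign` argument, the sign convention (the right view samples the left image at x + d, the left view samples the right image at x - d), and the border clamping are all assumptions made for illustration.

```python
import numpy as np

def backward_warp(src, disp, sign):
    """Sample `src` at x + sign * disp(x, y), interpolating linearly along x.

    src:  (H, W) source image
    disp: (H, W) disparity of the target view, in pixels
    sign: +1 or -1; the direction convention is an assumption of this sketch
    """
    h, w = src.shape
    # Horizontal sampling position for every target pixel.
    xs = np.arange(w, dtype=np.float64)[None, :] + sign * disp
    xs = np.clip(xs, 0.0, w - 1.0)          # clamp at the borders (assumption)
    x0 = np.floor(xs).astype(int)
    x1 = np.minimum(x0 + 1, w - 1)
    frac = xs - x0
    rows = np.arange(h)[:, None]
    # Blend the two neighbouring source pixels on the same row.
    return (1.0 - frac) * src[rows, x0] + frac * src[rows, x1]
```

Under these assumptions, a right view would be rendered as `backward_warp(I_l, D_r_m, +1)` and the left view re-synthesized as `backward_warp(I_r_tilde, D_l_m, -1)`, where the array names are hypothetical.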
Our solution is to exploit StereoNet, which has been shown to be less sensitive to the domain gap between training and test datasets [35]. From $\tilde{I}^l$ and $\tilde{I}^r$, StereoNet can produce left-view and right-view depth maps, denoted as $D^l_s$ and $D^r_s$, respectively. By checking the left-right consistency [45], we can easily obtain reliability maps of the left-view and right-view, denoted as $R^l$ and $R^r$, respectively. The threshold for the consistency check is set to a 0-pixel distance. Figs. 4(e) and (f) show the depth maps obtained by applying $\tilde{I}^l$ and $\tilde{I}^r$ to the pretrained StereoNet. Because of the low-quality synthesized stereo images, the depth maps are obtained with large errors. However, we can still obtain some reliable estimates that are valuable for finetuning MonoNet. Specifically, we use the pixels with a value of one in the reliability maps, as shown in [...]. For each old photo $I^l$, we finetune MonoNet for T iterations and obtain the depth map from the finetuned MonoNet.
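The left-right consistency check can be sketched as follows. This is only an illustration under stated assumptions: the function name `reliability_maps` is hypothetical, matching positions are rounded to the nearest pixel, a left pixel x is assumed to correspond to the right pixel x - d, and pixels whose match falls outside the image are marked unreliable. How the 0-pixel threshold is applied to sub-pixel disparities is not specified in the text, so the threshold is left as a parameter.

```python
import numpy as np

def reliability_maps(disp_l, disp_r, threshold=0.0):
    """Return binary maps (R_l, R_r): 1 where the two views agree, else 0."""
    h, w = disp_l.shape
    cols = np.arange(w)[None, :]
    rows = np.arange(h)[:, None]

    # Left pixel x is assumed to match right pixel x - d_l(x).
    xr = np.rint(cols - disp_l).astype(int)
    inside_l = (xr >= 0) & (xr < w)
    diff_l = np.abs(disp_l - disp_r[rows, np.clip(xr, 0, w - 1)])
    rel_l = (inside_l & (diff_l <= threshold)).astype(np.uint8)

    # Right pixel x is assumed to match left pixel x + d_r(x).
    xl = np.rint(cols + disp_r).astype(int)
    inside_r = (xl >= 0) & (xl < w)
    diff_r = np.abs(disp_r - disp_l[rows, np.clip(xl, 0, w - 1)])
    rel_r = (inside_r & (diff_r <= threshold)).astype(np.uint8)
    return rel_l, rel_r
```

A stricter threshold keeps fewer but more trustworthy pixels, which is the trade-off the reliability maps are meant to control.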

C. LOSS FUNCTIONS
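The training loss function $L_{total}$ is defined as

$$L_{total} = L_{dl} + L_{dr} + \mu_1 L_{re} + \mu_2 L_{sm} + \mu_3 L_{lrc}, \tag{5}$$

which is a combination of four main terms: the distillation loss, the reconstruction loss, the depth smoothness loss, and the left-right consistency loss. The weighting factors were empirically chosen as $\mu_1 = 0.5$, $\mu_2 = 0.1$, and $\mu_3 = 0.01$. The distillation loss from StereoNet to MonoNet for the left-view, denoted as $L_{dl}$, is defined as follows:

$$L_{dl} = \frac{1}{M} \sum_{p \in P} \left\| D^l_m(p) - D^l_s(p) \right\|,$$

where $\|\cdot\|$ measures the L1 norm, $M$ is the number of pixels, and $P$ is the set of positions of reliable depth estimates, i.e., $P = \{p \mid R^l(p) = 1\}$ for pixel coordinate $p$. Similarly, the distillation loss from StereoNet to MonoNet for the right-view, denoted as $L_{dr}$, is defined as

$$L_{dr} = \frac{1}{M} \sum_{q \in Q} \left\| D^r_m(q) - D^r_s(q) \right\|,$$

where $Q = \{q \mid R^r(q) = 1\}$ for pixel coordinate $q$. By using $L_{dl}$ and $L_{dr}$ as the training objective for the finetuning of MonoNet, we can make MonoNet produce more reliable depth estimates.

Several loss terms widely used for depth estimation are also found to be helpful for our finetuning of MonoNet. First, the reconstruction loss $L_{re}$ [45] between $I^l$ and $\tilde{I}^l$ is defined as follows:

$$L_{re} = \frac{1}{M} \sum_{p} \left[ \alpha \, \frac{1 - \mathrm{SSIM}\big(I^l(p), \tilde{I}^l(p)\big)}{2} + (1 - \alpha) \left\| I^l(p) - \tilde{I}^l(p) \right\| \right],$$

where SSIM measures the structural similarity [48] and $\alpha$ is a weighting factor. Second, the depth smoothness loss $L_{sm}$ [45] is given as

$$L_{sm} = \frac{1}{M} \sum_{p} \left( \left\| \nabla_x D^l_m(p) \right\| e^{-\left\| \nabla_x I^l(p) \right\|} + \left\| \nabla_y D^l_m(p) \right\| e^{-\left\| \nabla_y I^l(p) \right\|} \right),$$

where $\nabla_x$ and $\nabla_y$ measure the gradient along the x-axis and y-axis, respectively. Third, the left-right consistency loss $L_{lrc}$ [45] is given as

$$L_{lrc} = \frac{1}{M} \sum_{p} \left( \left\| D^l_m(p) - W(D^r_m, D^l_m)(p) \right\| + \left\| D^r_m(p) - W(D^l_m, D^r_m)(p) \right\| \right).$$

Note that $L_{re}$ and $L_{sm}$ are measured only for the left-view, and $L_{lrc}$ is measured for both the left-view and the right-view. Because MonoNet and StereoNet output multi-scale predictions, each loss term is measured at multiple scales as

$$L_* = \sum_{n=0}^{N-1} w_n L^n_*, \quad * \in \{dl, dr, re, sm, lrc\}.$$

Here, $L^n_*$ measures each loss for the $n$-th scale image, and we chose $N = 4$ with the weighting factors $w_0 = 1.0$, $w_1 = 0.5$, $w_2 = 0.1$, and $w_3 = 0.01$.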
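The masked distillation term and the weighted combinations described above can be illustrated with a short NumPy sketch. The function names are hypothetical, $M$ is taken to be the total pixel count of the map, and the way the five terms are summed in `total_loss` (distillation terms unweighted, with $\mu_1$, $\mu_2$, $\mu_3$ applied to the reconstruction, smoothness, and consistency terms in that order) is an assumption rather than a confirmed detail of the method.

```python
import numpy as np

def distillation_loss(d_mono, d_stereo, reliable):
    """L1 difference over reliable pixels, divided by the pixel count M.

    Taking M as the total number of pixels is an assumption of this sketch.
    """
    mask = (reliable == 1)
    return float(np.sum(np.abs(d_mono - d_stereo) * mask) / d_mono.size)

def multiscale(per_scale_losses, weights=(1.0, 0.5, 0.1, 0.01)):
    """Weighted sum of one loss term measured at N = 4 scales."""
    return sum(w * l for w, l in zip(weights, per_scale_losses))

def total_loss(l_dl, l_dr, l_re, l_sm, l_lrc, mu=(0.5, 0.1, 0.01)):
    """Assumed combination: distillation plus three weighted auxiliary terms."""
    return l_dl + l_dr + mu[0] * l_re + mu[1] * l_sm + mu[2] * l_lrc
```

Because unreliable pixels contribute zero to the distillation term, MonoNet is pulled toward StereoNet only where the consistency check passed.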

IV. EXPERIMENTS
A. DATASETS
Due to the lack of ground-truth depth maps of old photos, we used the SceneFlow and KITTI datasets for the pretraining of MonoNet and StereoNet. Then, for the finetuning of MonoNet, we constructed an old photo dataset by collecting photos from the repository [1].

1) SceneFlow
SceneFlow [43] is a synthetic dataset containing more than 39,000 stereo pairs for training and 4,000 for testing. Following [35], we obtained the occlusion masks by applying the left-right consistency check and excluded the occluded pixels during the training.

2) KITTI
KITTI [47] is a collection of images captured from vehicles in several outdoor scenes. We used the raw data, which include rectified stereo sequences, calibration information, and 3D LIDAR point clouds. The ground-truth depth maps were obtained by mapping the LIDAR points to the image coordinates. In particular, the Eigen split [27] of the KITTI dataset, which contains 22,600, 888, and 687 image pairs for training, validation, and testing, respectively, was used in our experiments.

3) OLD PHOTO DATASET
The old photo dataset is a collection of old photos crawled from the online library [1]. Five collections (i.e., Carpenter, Abdul Hamid II, Lawrence & Houseworth, Grabill, and Travel view of Japan and Korea) were selected out of the 70 collections in the Print & Photographs online catalog. From the selected collections, we extracted 100 photos covering various types of scenes to define our old photo dataset. Fig. 5 shows old photo examples. Our subjective quality evaluation was conducted on only 20 images to reduce the evaluation time, but the results for all images are available at the project page (see footnote 1). These old photos were taken around the 1900s with various cameras and resolutions. The old photos were boundary-cropped and pre-processed [2] to be used as input for the depth estimation networks.

B. IMPLEMENTATION DETAILS
1) StereoNet
StereoNet was pretrained using the SceneFlow and KITTI datasets. We adopted DispNetC [43] for the StereoNet architecture but modified the output layers to obtain depth maps for both the left-view and the right-view, which enables the proposed finetuning of MonoNet for each old photo. StereoNet was first trained for 50 epochs on the SceneFlow dataset with a batch size of 4 and a patch size of 768 × 384. The initial learning rate was $10^{-4}$ and was halved at 20, 35, and 45 epochs. StereoNet was then finetuned on the KITTI dataset with a patch size of 832 × 256. The learning rate was initialized as $2 \times 10^{-5}$ and decayed until it reached $2.5 \times 10^{-6}$. For the subsequent unsupervised finetuning, StereoNet was trained for 10 epochs on the Eigen split of the KITTI dataset.

1 https://github.com/rmawngh/Old-Photo-3D

2) MonoNet
MonoNet was pretrained using the KITTI dataset. We adopted VGG-16 [44] for the MonoNet architecture but modified the output layers to obtain both left-view and right-view depth maps. MonoNet was pretrained for 50 epochs with a batch size of 4. The input images and depth maps from StereoNet were resized to 512 × 256 to fit the input size of MonoNet. The learning rate was initialized as $10^{-4}$ and halved at 20, 35, and 45 epochs. The encoder of MonoNet was initialized with the ImageNet-pretrained model.
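The step schedule used during pretraining (an initial rate of 1e-4, halved at epochs 20, 35, and 45) can be written as a small helper. The function name `learning_rate` is hypothetical, and the sketch assumes 0-indexed epochs with each drop taking effect at the start of the milestone epoch, which the text does not specify.

```python
def learning_rate(epoch, base_lr=1e-4, milestones=(20, 35, 45), factor=0.5):
    """Step decay: multiply base_lr by `factor` once per milestone reached."""
    drops = sum(1 for m in milestones if epoch >= m)
    return base_lr * factor ** drops
```

Over a 50-epoch run this yields four plateaus, ending at one eighth of the initial rate.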

3) OUR PROPOSED METHOD
Our proposed method adopted zero-shot learning and thus finetuned MonoNet for 10 epochs with a learning rate of $10^{-6}$ for each given old photo. Note that the parameters of StereoNet were fixed during the finetuning of MonoNet. For the rendering of stereo images, the warping module from Monodepth [45] was used. The input size for MonoNet and StereoNet was set to 512 × 256 for training.

C. PERFORMANCE EVALUATION
Because ground-truth depth maps do not exist for old photos, we conducted subjective quality evaluation. In addition, several objective quality metrics that do not require ground-truth were used to quantitatively compare the performance. Full 32-bit floating-point precision was used for all trained and tested models.

1) OBJECTIVE QUALITY EVALUATION
Objective quality evaluation typically requires ground-truth data for comparison. Consequently, conventional measures, such as RMSE and the percentage of bad pixels, are not applicable to old photos due to the lack of ground truth. Other no-reference quality measures developed for color images, such as NIQE [53], PI [54], and NIMA [55], are also not suitable for the quality evaluation of depth images. We thus first evaluated the performance of depth estimation methods using RR-DQM [49], a depth quality assessment metric that requires only a pair of color and depth images. In particular, RR-DQM measures local image distortions caused by image rendering with the estimated depth map. We compared the proposed method with state-of-the-art monocular depth estimation methods, including CADepthNet [9], HRDepth [10], and MonoDepth2 [11]. Table 1 shows the RR-DQM results for the 20 images that are used for the subjective evaluation. The smaller the depth distortion, the lower the RR-DQM score. The results demonstrate that the proposed method outperforms the state-of-the-art methods.

TABLE 1. Performance comparisons with different quality metrics. ↓ and ↑ represent the lower the better and the higher the better, respectively.
We also evaluated the performance using depth image-based rendering (DIBR) quality assessment metrics [50], [51], [52]. These metrics measure the distortions in DIBR-synthesized images. Among DIBR quality assessment metrics, we used no-reference methods, including NIQSV [50], NIQSV+ [51], and MNSS [52], to compare the proposed method with other state-of-the-art methods. NIQSV [50] uses edges detected by morphological operations for quality measurement, and NIQSV+ [51] further applies stretching detection and black hole detection. MNSS [52] is a metric derived from multi-scale natural scene statistics. The results shown in Table 1 demonstrate that the proposed method outperforms the state-of-the-art methods.

2) SUBJECTIVE QUALITY EVALUATION
Subjective quality evaluation was performed in consideration of our objective of enabling a more realistic rendering of old photographs. Specifically, three-dimensional (3-D) renderings of old photos were obtained using the estimated depth maps, and the subjects were asked to assess the quality of the 3-D renderings. Since there was no evaluation tool for 3-D rendering comparison, we built one using the Open3D API [56]. As shown in Fig. 6, our evaluation tool consists of two 3-D renderings, a single old photo, and a control panel. We conducted a pairwise comparison by asking the subjects to indicate their preference between two randomly shuffled 3-D renderings. The subjects were advised to use the keyboard to navigate around the 3-D renderings, and one minute was given for the decision on each image pair.

TABLE 2. Performance comparisons on the KITTI Eigen test split using different loss combinations, where $L_d$ is a distillation loss, $L_{re}$ is a reconstruction loss, $L_{sm}$ is a smoothness loss, and $L_{lrc}$ is a left-right consistency loss.
We compared the proposed method with MonoNet [35], the baseline model to which our method is applied. Twenty subjects participated in the subjective quality evaluation. The results show that, out of 20 comparisons per subject, MonoNet [35] and the proposed method were preferred 6.35 and 13.65 times on average, respectively, demonstrating the effectiveness of the proposed method. Fig. 7 shows several qualitative comparisons with other methods. The proposed method outperforms the other state-of-the-art monocular depth estimation methods [9], [10], [11]. Furthermore, Fig. 8 shows the estimated depth maps for the left and right viewpoints. Compared to MonoNet [35], our proposed method provides enhanced left and right viewpoint depth maps. More results can be found on our project website.
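The average preference counts reported above can be derived from raw pairwise votes as in the sketch below. The vote matrix layout and the function name `mean_preference_counts` are assumptions for illustration; the evaluation tool's actual output format is not described in the text.

```python
import numpy as np

def mean_preference_counts(votes):
    """Average number of image pairs on which each method was preferred.

    votes[s, i] is 1 if subject s preferred the second method on image
    pair i and 0 if the subject preferred the first (assumed layout).
    """
    votes = np.asarray(votes)
    wins_second = votes.sum(axis=1)              # per-subject count
    wins_first = votes.shape[1] - wins_second
    return float(wins_first.mean()), float(wins_second.mean())
```

The two averages always sum to the number of image pairs, which matches the reported 6.35 + 13.65 = 20.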

D. ABLATION STUDY
Since the total loss in (5) consists of different loss terms, we investigated the effectiveness of each loss term as ablation studies. For quantitative performance evaluation, we used the ground-truth depth maps from the KITTI dataset [47] for this experiment since old photos do not have ground-truth depth maps. Specifically, each image of the KITTI Eigen split [27] was tested using different models trained with different combinations of the loss terms.
The proposed method trains MonoNet on reliable estimates from StereoNet using the distillation loss ($L_d$ in Table 2), and the other three loss terms, i.e., the reconstruction loss $L_{re}$, the smoothness loss $L_{sm}$, and the left-right consistency loss $L_{lrc}$, are auxiliary loss terms for regularization. We thus tested different models by adding these loss terms to the distillation loss. The average performance scores for the left-view depth maps of the KITTI Eigen split [27] shown in Table 2 demonstrate that the model trained with all loss terms produced the best performance in most quality metrics and that the auxiliary terms contributed to the performance improvements. The weighting factors for the loss terms were empirically chosen as $\mu_1 = 0.5$, $\mu_2 = 0.1$, and $\mu_3 = 0.01$.
For the performance analysis on our target old photos, we provide multiple resultant images obtained using the models trained with different loss configurations on our project page.

V. CONCLUSION
Although old photos have archaeological and historical significance, depth estimation of old photos has attracted little attention. Because most old photos are available only as single-view images and their ground-truth depth maps are not available, we developed a learning framework that is based on the collaboration of monocular and stereo depth estimation networks. Specifically, the monocular network was used to produce input for the stereo network, and the stereo network was used to yield reliable depth predictions to be used for supervision of the monocular network training. We could train the monocular network to produce a high-quality depth map of the given old photo by the proposed