Monocular Human Depth Estimation Via Pose Estimation

We propose a novel monocular depth estimator, which improves the prediction accuracy on human regions by utilizing pose information. The proposed algorithm consists of two networks — PoseNet and DepthNet — to estimate keypoint heatmaps and a depth map, respectively. We incorporate the pose information from PoseNet to improve the depth estimation performance of DepthNet. Specifically, we develop the feature blending block, which fuses the features from PoseNet and DepthNet and feeds them into the next layer of DepthNet, to make the networks learn to predict the depths of human regions more accurately. Furthermore, we develop a novel joint training scheme using partially labeled datasets, which balances multiple loss functions effectively by adjusting their weights. Experimental results demonstrate that the proposed algorithm improves depth estimation performance significantly, especially around human regions. For example, the proposed algorithm improves the depth estimation performance of ResNet-50 on human regions by 2.8% and 7.0% in terms of $\delta _{1}$ and RMSE, respectively, on the proposed HD + P dataset.


I. INTRODUCTION
Depth estimation, which predicts the distance of the scene point corresponding to each pixel in a 2D image, is one of the fundamental tasks in computer vision for inferring the 3D geometry of a scene. The estimated depth map can be used to reconstruct 3D volumetric data, such as a point cloud or a triangle mesh [1]. Traditionally, 3D information has been recovered by estimating depth maps from multiple images based on geometric constraints. In particular, stereo matching-based depth estimation, which computes the disparity between the images observed from two viewpoints by simulating human eyes, has been developed most actively [2]-[4]. Although the geometry-based algorithms can estimate an accurate depth map, they require calibrated image pairs or sequences. Furthermore, it is generally difficult to estimate accurate disparities in textureless regions. In contrast, monocular depth estimation uses only a single image to estimate a depth map and does not require additional equipment for capturing and calibrating multiple images.
Because only a single camera is available in most scenarios, monocular depth estimation has been employed in many computer vision applications, such as 2D-to-3D image and video conversion [5], augmented reality [6], autonomous driving [7], surveillance [8], and 3D model generation [9]. Due to its practical importance, monocular depth estimation has recently been researched actively, and various techniques have been developed to improve its accuracy [10]- [19].
Since different 3D scenes can be projected onto the same 2D image, estimating the depth map from a single image is ill-posed. Early approaches inferred depths by modeling the relation between regions based on prior assumptions on depth maps [11], [12], [21]. However, as they are based on manually defined features, they may provide unreliable results, especially in regions with small objects or ambiguous colors. Recently, with the advances in deep learning, many monocular depth estimation algorithms using convolutional neural networks (CNNs) have been developed to yield more accurate results. For example, multi-scale architecture [13], robustness to global depth scales [13], ordinal regression of depths [15], Fourier domain analysis [16], whole strip masking [17], relative depth estimation [18], and reliable boundary processing [19] have been developed or exploited for accurate monocular depth estimation.
Although much research has been carried out to improve the accuracy of monocular depth estimation, little effort has been made to consider the characteristics of specific objects in a scene. In particular, humans are the main objects of interest in many practical applications, such as gaming and surveillance. However, the depth estimation of humans in a scene is more difficult than that of other objects due to flexible body parts, diverse clothing, and different poses. Figure 1 shows examples in which a conventional algorithm yields less reliable depths for human regions. Thus, it is important to develop a monocular depth estimator that predicts the depths of human regions, as well as other regions, reliably and accurately by exploiting the characteristics of humans.
In this paper, we propose a novel monocular depth estimator, called HDEP (human depth estimation via pose), to improve the prediction accuracy on human regions by utilizing pose information. The proposed algorithm is composed of two networks: PoseNet, which predicts heatmaps of human skeletal keypoints, and DepthNet, which estimates a depth map for an entire image. To improve the depth estimation performance on human regions, DepthNet uses the pose information from PoseNet. To this end, we develop the feature blending block to fuse the features extracted from DepthNet with those from PoseNet. Finally, we develop a novel joint training scheme using partially labeled datasets, which balances multiple loss functions for depths and poses by adjusting their weights. In addition, we construct a publicly available dataset, called the HD + P (human depths with poses) dataset, composed of RGB images, depth maps, and human keypoint heatmaps. Experimental results show that the proposed HDEP algorithm improves depth estimation performance significantly, especially around human regions. This paper has the following contributions:
• We propose the HDEP algorithm, composed of PoseNet and DepthNet, to predict depth maps for images containing humans. We then develop an effective joint training scheme for the HDEP algorithm using partially labeled data.
• We construct the HD + P dataset, in which an image is annotated with a depth map and a set of human keypoint heatmaps. We will make it publicly available.
• We experimentally demonstrate that the proposed HDEP algorithm provides excellent monocular depth estimation performance qualitatively and quantitatively on multiple datasets, especially on human regions.
The rest of this paper is organized as follows. Section II reviews related work on monocular depth estimation and pose estimation. Section III describes the proposed HDEP algorithm. Section IV discusses experimental results and analyzes the impacts of the key components in HDEP on the results. Finally, Section V concludes the paper.

II. RELATED WORK

A. MONOCULAR DEPTH ESTIMATION
Depth information provides useful clues about closely related information in a scene, such as surface normals and semantic segmentation labels. Based on this observation, several algorithms [29], [34]-[36] have been developed to solve the depth estimation task jointly with other related tasks. However, if two tasks are closely related, the features extracted for each task are similar, so an incorrect prediction in one task may adversely affect the other task. In this work, we aim to improve the depth estimation performance by employing pose features additionally. Note that pose estimation is weakly related to depth estimation in that the former localizes only the keypoints of a human, whereas the latter predicts the depth value of every pixel in an image.

B. HUMAN POSE ESTIMATION
Human pose estimation is the task of locating human joints, also known as keypoints, in an image. It can be divided into 2D and 3D pose estimation according to the dimensions of the estimated coordinates of joints. In 2D pose estimation, multiscale features have been adopted to deal with humans of various sizes in images [37]-[40]. The simplest approach to 3D pose estimation is to lift estimated 2D poses to 3D poses [41]-[43]. Alternatively, the likelihood of a joint in each voxel can be directly estimated using a 3D voxel model [44], or three marginal 2D heatmaps in the xy-, yz-, and zx-planes can be predicted [45].
Datasets for 3D pose estimation have been constructed in indoor studios with markers on human joints to collect 3D ground-truth coordinates [46]-[48]. However, these datasets provide the depth information of only human regions or only human keypoints in simple and controlled environments. We cannot use these datasets, because we aim at estimating the depth of every pixel in images that contain humans against cluttered, real backgrounds.

C. HUMAN DEPTH ESTIMATION
Several algorithms have been developed for human depth estimation. Geometric details of human objects, such as cloth wrinkles and folds, have been considered in [22], [49]. Specifically, Tang et al. [22] estimated the shape of a human from an RGB image by combining segmentation results with 3D pose information. Tan et al. [49] proposed a self-supervised human depth estimation framework from monocular videos. In addition, Lin and Lee [23] developed a multi-person 3D pose estimation algorithm in the camera coordinate space, which predicts the overall distance of each person from the camera. Figure 2 shows how the proposed algorithm is different from these conventional algorithms. Specifically, our objective is to estimate the pixelwise depths of both human and other regions, instead of extracting the 3D shape of humans only [22], [49] or the distance of each person from the camera [23].

III. PROPOSED HDEP ALGORITHM
Figure 3 shows an overview of the proposed HDEP algorithm, in which DepthNet estimates a depth map $\hat{D}$ and PoseNet outputs a set of heatmaps $\hat{H}$ for human skeletal keypoints. Features extracted by the PoseNet encoder are fed into the DepthNet encoder through the feature blending block. The pose information is estimated to improve the depth estimation performance; pose estimation in itself is not an objective of this work. To train the networks, we define the overall loss $\ell_{all}(\hat{D}, \hat{H}, D, H)$ between the estimates $\hat{D}$ and $\hat{H}$ and their ground truths $D$ and $H$, which is a weighted sum of the eight loss functions in Table 1 for different aspects of depth and pose information.

A. NETWORK ARCHITECTURE
In Figure 3, both DepthNet and PoseNet have the same encoder-decoder architecture, where the encoders extract deep features for depths and human poses, respectively. For each encoder, we employ ResNet-50 [50], which has been commonly used in recent depth estimators [14], [19]. However, note that any conventional monocular depth estimator can be used as DepthNet, as will be discussed in Section IV-C. The decoders take the aforementioned features as input and yield a single depth map and multiple keypoint heatmaps. Since the PoseNet decoder estimates heatmaps for human skeletal keypoints, it has more output channels than DepthNet, which yields a single-channel depth map. More specifically, PoseNet outputs 16 channels as specified in the MPII dataset [51], and each channel represents a heatmap for one keypoint. Note that, since both DepthNet and PoseNet have the same architecture and a pretrained PoseNet is unavailable in general, we train both networks as will be described in Section IV-B.
Recently, multi-task learning has been adopted to estimate monocular depths jointly with surface normals, edge contours, and semantic segmentation labels [29], [34], [52]- [54]. Such strategies enable a network to learn various cues to understand the geometry of a scene by exploiting commonalities and differences among tasks. In contrast, pose estimation and depth estimation are less related; pose estimation aims to locate skeletal keypoints only, while depth estimation aims to predict a dense depth map of an entire scene. Thus, if depth and pose are trained jointly in a fully-shared network, they may fail to share common features effectively and the training of one network may make the training of the other more difficult or even divergent.
We transfer PoseNet features unidirectionally to DepthNet to exploit human skeletal information and improve the depth estimation performance on the corresponding human regions. To this end, we develop the feature blending block, which enables DepthNet to exploit the pose information from PoseNet. As illustrated in Figure 3, both DepthNet and PoseNet reduce the spatial resolutions of feature maps in five stages. From the second to fourth decimation stages, the feature blending block adds the decimated features of PoseNet to those of DepthNet and then feeds the sum into the next layer of DepthNet. Specifically, let $F_{D}^{out}$ and $F_{P}^{out}$ denote the decimated output features of DepthNet and PoseNet, respectively. Then, the input feature $F_{D}^{in}$ of the next layer of DepthNet is given by
$F_{D}^{in} = \kappa_{D} F_{D}^{out} + \kappa_{P} F_{P}^{out}$ (1)
where $\kappa_{P}$ and $\kappa_{D}$ are the weights for $F_{P}^{out}$ and $F_{D}^{out}$ to control the relative contributions of the two features. Note that, instead of the addition in (1), various blending strategies can be adopted. We will discuss the effects of different blending strategies on the performance in Section IV-D3.
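As a concrete sketch, the additive blending in (1) is just an element-wise weighted sum of same-shaped feature maps. The function name, toy shapes, and weight values below are illustrative, not taken from the paper's implementation:

```python
import numpy as np

def blend_features(f_depth, f_pose, kappa_d=1.0, kappa_p=0.1):
    """Additive feature blending as in (1): scaled PoseNet features are
    added to DepthNet features before the next DepthNet layer."""
    assert f_depth.shape == f_pose.shape
    return kappa_d * f_depth + kappa_p * f_pose

# Toy example: stage-2 feature maps of shape (channels, height, width).
f_d = np.ones((64, 38, 28))
f_p = np.full((64, 38, 28), 2.0)
f_in = blend_features(f_d, f_p, kappa_d=1.0, kappa_p=0.1)  # 1.0*1 + 0.1*2
```

Concatenation followed by a 1 × 1 convolution would be an alternative blending strategy, at the cost of extra parameters; the addition in (1) keeps the channel count unchanged.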

B. LOSS FUNCTIONS
To train DepthNet and PoseNet jointly, we employ the eight losses in Table 1, which affect the training of the networks in different aspects. Each loss is computed using a depth map, a set of heatmaps, or both.
The first five losses $\ell_{d}$, $\ell_{n}$, $\ell_{m}$, $\ell_{x}$, and $\ell_{y}$ are computed using a depth map only. First, the depth loss $\ell_{d}$ is the $L_1$-norm between the ground-truth depth $D$ and its estimate $\hat{D}$. Second, the normal loss $\ell_{n}$ is obtained from the cosine similarities between the normal vectors $n_{ij}$ and $\hat{n}_{ij}$ of pixel $(i, j)$, which are approximated from $D$ and $\hat{D}$ using depth gradients as done in [19]. Third, the mean-removed loss $\ell_{m}$ measures the $L_1$-norm between the mean-removed $D$ and $\hat{D}$ to compare relative depths with respect to the average depth of a scene [18]. Fourth, the gradient losses $\ell_{x}$ and $\ell_{y}$ are the $L_1$-norms between the gradients of $D$ and $\hat{D}$, where $\nabla_x$ and $\nabla_y$ are the horizontal and vertical gradient operators, respectively. The gradient losses make the networks focus on high-frequency contents and thus reconstruct depth boundaries more sharply.
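Four of these depth-only losses can be sketched in a few lines. This is a minimal NumPy illustration with hypothetical names (the normal loss $\ell_{n}$ is omitted, and forward differences stand in for the gradient operators):

```python
import numpy as np

def depth_losses(d_pred, d_gt):
    """Sketch of the L1 depth loss, mean-removed loss, and
    horizontal/vertical gradient losses on a single depth map."""
    l_d = np.mean(np.abs(d_pred - d_gt))
    l_m = np.mean(np.abs((d_pred - d_pred.mean()) - (d_gt - d_gt.mean())))
    gx = lambda d: d[:, 1:] - d[:, :-1]   # horizontal forward difference
    gy = lambda d: d[1:, :] - d[:-1, :]   # vertical forward difference
    l_x = np.mean(np.abs(gx(d_pred) - gx(d_gt)))
    l_y = np.mean(np.abs(gy(d_pred) - gy(d_gt)))
    return l_d, l_m, l_x, l_y

d_gt = np.array([[1.0, 2.0], [3.0, 4.0]])
d_pred = d_gt + 0.5          # constant offset: only l_d is nonzero
l_d, l_m, l_x, l_y = depth_losses(d_pred, d_gt)
```

The toy example shows why the losses are complementary: a constant depth offset is penalized only by $\ell_{d}$, since the mean-removed and gradient losses are invariant to it.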
Next, the heatmap loss $\ell_{h}$ is defined as the average $L_1$-norm between the ground-truth heatmaps $H_j \in H$ and their estimates $\hat{H}_j$, where $\mathcal{J}$ is the set of human skeletal keypoints. We detect 16 kinds of keypoints as specified in [51], and thus $|\mathcal{J}| = 16$. The heatmap loss enables PoseNet to extract useful features for locating keypoints and identifying human poses.
Finally, we design two losses $\ell_{g}$ and $\ell_{p}$ to consider the accuracy of both depth and pose estimation jointly. They compute depth prediction errors around keypoints. More specifically, we define $\ell_{g}$ as the average $L_1$-norm between $\hat{D}$ and $D$ near the ground-truth keypoints, given by
$\ell_{g} = \frac{1}{|\mathcal{J}|} \sum_{j \in \mathcal{J}} \| T_{\tau}(H_j) \circ (\hat{D} - D) \|_1$ (2)
where $T_{\tau}(\cdot)$ denotes the binarization with threshold $\tau$ and the operator $\circ$ denotes the element-wise multiplication. $\ell_{p}$ is defined similarly to $\ell_{g}$ in (2). However, because $\ell_{g}$ cannot be computed on a dataset without keypoint labels, we define $\ell_{p}$ using the predicted keypoints. More specifically, the predicted heatmaps $\hat{H}_j$ are used instead of the ground-truth heatmaps $H_j$. In this work, we fix $\tau$ to 0.01.
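A minimal sketch of the masked joint loss in (2); the function name is illustrative, and the exact normalization may differ from the paper's implementation:

```python
import numpy as np

def keypoint_depth_loss(d_pred, d_gt, heatmaps, tau=0.01):
    """L1 depth error accumulated only near keypoints: each heatmap is
    binarized at tau to select pixels, then errors are averaged over
    the keypoint set."""
    total = 0.0
    for h in heatmaps:                    # one heatmap per keypoint
        mask = (h > tau).astype(float)    # binarization T_tau(.)
        total += np.sum(mask * np.abs(d_pred - d_gt))
    return total / len(heatmaps)

d_gt = np.zeros((4, 4))
d_pred = np.full((4, 4), 1.0)
h = np.zeros((4, 4)); h[1, 1] = 0.9       # one active keypoint pixel
loss = keypoint_depth_loss(d_pred, d_gt, [h])
```

Using predicted heatmaps in place of the ground-truth ones turns this into a sketch of $\ell_{p}$, which is why $\ell_{p}$ remains computable on datasets without keypoint labels.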

C. TRAINING BASED ON LOSS REBALANCING
The overall loss $\ell_{all}$ is defined as a weighted sum of the eight loss functions in Table 1,
$\ell_{all} = \sum_{k} w_k \ell_k$ (3)
where $w_k$ is the weight for $\ell_k$. As different losses are on different scales, each loss contributes to the overall loss differently. Also, the losses interact with one another and tend to fluctuate during training. Hence, it is necessary to balance their contributions to the overall loss. To address this balancing issue, we develop a loss rebalancing scheme, inspired by [33], which adjusts the loss weights $w_k$ adaptively in a periodic manner based on the rate at which each loss decreases during training.

Loss rebalancing consists of two steps: weight initialization and weight rebalancing. First, in the initialization, we equalize all weights so that each loss contributes equally to the overall loss. Then, in the rebalancing, the weight $w_k^t$ for the $k$th loss at the $t$th period is updated from $w_k^{t-1}$ by monitoring how fast the $k$th loss is reduced. In this work, one fourth of an epoch is defined as a period. The ratio $P_k^t$ of the $k$th loss in the overall loss is defined as
$P_k^t = \bar{\ell}_k^t / \bar{\ell}_{all}^t$ (4)
where $\bar{\ell}_k^t$ and $\bar{\ell}_{all}^t$ are the averages of $\ell_k$ and $\ell_{all}$, respectively, over the $t$th period. Then, the weight is updated as
$w_k^t = w_k^{t-1} (1 - \lambda \Delta P_k^t)$ (5)
where $\Delta P_k^t = P_k^t - P_k^{t-1}$. The hyper-parameter $\lambda$ controls the rebalancing strategy. If $\lambda = 0$, the loss weights are unchanged during training. At $\lambda < 0$, the weight for a slowly decreasing loss gets larger to make all losses balanced. This is because, if a certain loss is more difficult to reduce and thus decreases more slowly than the other losses, we have $\Delta P_k^t = P_k^t - P_k^{t-1} > 0$ and $w_k^t > w_k^{t-1}$. On the contrary, at $\lambda > 0$, the weight for a difficult loss gets smaller. To summarize, the training focuses on easy (or difficult) losses when $\lambda$ is positive (or negative).

Both depth and keypoint coordinates are needed to train the networks using the eight losses. However, most datasets have either depth or keypoint labels only.
While the DIH dataset [55] contains both depth and keypoint labels, its images lack diversity in backgrounds and human poses. Thus, we use other datasets that contain either depth or keypoint labels for training. However, the overall loss $\ell_{all}$ in (3) and the weight $w_k^t$ in (5) cannot be computed for some $k$ using such partially labeled datasets, because the corresponding losses are undefined. For example, $\ell_{d}$ is undefined if depth labels are unavailable, while $\ell_{h}$ is undefined if keypoint labels are unavailable. Moreover, the joint loss $\ell_{g}$ demands both depth and keypoint labels.
To address the aforementioned challenge in the joint training of the networks using partially labeled datasets, we develop an effective training strategy. First, when depth or keypoint labels are unavailable, some losses cannot be computed. We set such undefined losses to 0 when computing the overall loss $\ell_{all}$ in (3) during training, i.e., in the backpropagation process. Second, we augment the loss rebalancing scheme to consider partially labeled data. As done for the overall loss, an undefined loss $\ell_k$ is set to 0. Then, the weight in (5) cannot be computed. To address this problem, we replace the undefined loss $\ell_k$ with its average value over valid samples. Specifically, let $\ell_{k,i}$ denote the $k$th loss for the $i$th training sample, and let $\mathbb{I}(\ell_{k,i})$ be the indicator function that equals 1 if $\ell_{k,i}$ is defined and 0 otherwise. Then, we define the pseudo loss $\tilde{\ell}_k$ as
$\tilde{\ell}_k = \sum_{i} \mathbb{I}(\ell_{k,i}) \, \ell_{k,i} \, / \, \sum_{i} \mathbb{I}(\ell_{k,i})$ (6)
Next, we replace the undefined loss $\ell_k$ with the pseudo loss.

Figure 4 compares weighted loss trends over epochs during training. In the equal weighting in Figure 4(a), the overall loss is dominated by large losses, such as $\ell_{d}$ and $\ell_{m}$, whereas the contributions of small losses, such as $\ell_{g}$ and $\ell_{n}$, are negligible. In Figure 4(b), the proposed rebalancing scheme effectively equalizes the contributions of all losses to the overall loss by adjusting the weights adaptively. This is possible even with partially labeled data.
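The two mechanisms above, pseudo losses for undefined entries and the periodic weight update, can be sketched as follows. The helper names are illustrative, and the multiplicative update form follows the description of (5) in our notation:

```python
def pseudo_losses(batch_losses):
    """Replace undefined losses (None) with the average over valid
    samples, mirroring the pseudo-loss idea in (6)."""
    valid = [l for l in batch_losses if l is not None]
    avg = sum(valid) / len(valid)
    return [avg if l is None else l for l in batch_losses]

def rebalance(w_prev, p_prev, l_avg, lam):
    """One rebalancing period: loss ratios P_k as in (4), then the
    update w_k <- w_k * (1 - lam * dP_k) as in (5). With lam < 0, a
    slowly decreasing loss (growing ratio) is upweighted; with
    lam = 0, weights stay unchanged."""
    l_all = sum(wk * lk for wk, lk in zip(w_prev, l_avg))
    p = [lk / l_all for lk in l_avg]
    w = [wk * (1.0 - lam * (pk - pp))
         for wk, pk, pp in zip(w_prev, p, p_prev)]
    return w, p

filled = pseudo_losses([1.0, None, 3.0])                # None -> mean of valid
w0, p0 = rebalance([1.0, 1.0], [0.5, 0.5], [2.0, 2.0], 0.0)   # lam = 0
w1, _ = rebalance([1.0, 1.0], [0.4, 0.6], [2.0, 2.0], -1.0)   # lam < 0
```

In the last call, the first loss's ratio grew from 0.4 to 0.5 (it decreased slowly), so its weight increases, while the second loss's weight decreases, matching the behavior described for negative $\lambda$.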

IV. EXPERIMENTAL RESULTS

A. DATASETS
We use four datasets for training: NYUv2 [24], DIH [55], MPII [51], and HD + P. Figure 5 shows examples of RGB images and their depth and/or keypoint labels in each dataset, and Figure 6 shows examples of heatmap representation of different joints. Table 2 summarizes available labels in each dataset.
NYUv2 [24]: It consists of RGB images and the corresponding depth maps. Most images in the NYUv2 dataset do not contain humans, as shown in Figure 5(a). We sample 17K images for training and 654 images for testing.
MPII [51]: It contains 25K images with 40K human subjects and 16 keypoint coordinates for each human subject. We generate the keypoint heatmaps by applying a 2D Gaussian filter to the keypoint coordinates, as in Figure 5(b).
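For illustration, rendering one keypoint coordinate into a heatmap with a 2D Gaussian can be written as follows (the function name and the value of sigma are assumptions, not taken from the paper):

```python
import numpy as np

def keypoint_heatmap(x, y, height, width, sigma=2.0):
    """Render one keypoint as a 2D Gaussian centered at (x, y),
    peaking at 1 and decaying with distance from the keypoint."""
    ys, xs = np.mgrid[0:height, 0:width]
    return np.exp(-((xs - x) ** 2 + (ys - y) ** 2) / (2.0 * sigma ** 2))

h = keypoint_heatmap(10, 5, height=16, width=20)  # peak at row 5, col 10
```

Stacking one such map per keypoint yields the 16-channel target used to supervise PoseNet.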
DIH [55]: It has real and synthetic splits. We use the real split, which provides RGB images, depth maps, and keypoint information, as in Figure 5(c). We use 1,734 training and 750 test images. Since DIH provides 17 kinds of keypoints, we preprocess them to match the 16 keypoints of MPII, as will be detailed later.
HD + P: We construct the HD + P dataset, which is composed of 1,566 training and 318 test images. Each training sample contains an RGB image and the corresponding depth map, while each test sample contains an RGB image, a depth map, and keypoint coordinates. We collect RGB + D images for 145 scenes with 13 humans using an Intel RealSense D435 camera. Next, we manually mask missing depth regions and then fill in those regions using the colorization algorithm [56], as done in [24]. Also, 16 keypoints for each human are annotated as shown in Figure 5(d). Table 3 compares the numbers of scenes and humans in the HD + P dataset to those in the DIH dataset. In addition to the diversity in scenes and humans, the HD + P dataset contains more challenging scenes than DIH, such as humans with complex poses, people occluding one another, and humans with colors similar to the background. Figure 7 shows some examples of such challenging images.

FIGURE 7. Examples of challenging images in the HD + P dataset.

1) DATA PREPROCESSING
As mentioned earlier, the datasets have different numbers and kinds of keypoints. Furthermore, depth maps contain invalid regions at occlusions or transparent or reflective objects. Therefore, we perform data preprocessing to match pose and depth labels from different datasets. For keypoints, MPII [51] and DIH [55] provide 16 and 17 labels, respectively, as listed in Figure 8(a). We match the labels in DIH to those in MPII. Specifically, we remove ''Head,'' ''Left eye,'' and ''Right eye'' from DIH and interpolate ''Pelvis'' and ''Thorax'' from the midpoints of both hips and both shoulders, respectively. Since the ''Head'' label in MPII corresponds to the upper region of a head, we use ''HeadUp'' in DIH instead of ''Head'' in MPII.
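The keypoint remapping described above can be sketched as follows. The joint names and the dictionary interface are illustrative; only the remapping rules themselves come from the text:

```python
def dih_to_mpii(kp):
    """Remap DIH keypoints to the MPII set: drop 'Head', 'Left eye',
    and 'Right eye'; synthesize 'Pelvis' and 'Thorax' as the midpoints
    of the hips and shoulders; and use 'HeadUp' as the MPII 'Head'.
    kp maps joint names to (x, y) coordinate tuples."""
    def midpoint(a, b):
        return ((a[0] + b[0]) / 2.0, (a[1] + b[1]) / 2.0)
    out = {k: v for k, v in kp.items()
           if k not in ('Head', 'Left eye', 'Right eye')}
    out['Pelvis'] = midpoint(kp['Left hip'], kp['Right hip'])
    out['Thorax'] = midpoint(kp['Left shoulder'], kp['Right shoulder'])
    out['Head'] = out.pop('HeadUp')
    return out

kp = {'Head': (0, 0), 'HeadUp': (1, 1),
      'Left eye': (0, 0), 'Right eye': (0, 0),
      'Left hip': (2, 2), 'Right hip': (4, 4),
      'Left shoulder': (1, 3), 'Right shoulder': (3, 5)}
out = dih_to_mpii(kp)
```

Removing three joints and interpolating two keeps the keypoint count consistent with MPII's 16.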
For depth maps, we first identify invalid regions for each dataset. Specifically, for the DIH dataset, we remove depth values outside the range in the hardware specification of Kinect 2, whereas we mask invalid regions manually for the HD + P dataset. Then, we fill in invalid regions using the colorization scheme [56]. Figure 9 shows examples of raw depth maps and the corresponding preprocessed depth maps.

B. IMPLEMENTATION DETAILS

1) NETWORK ARCHITECTURE
We employ ResNet-50 [50] as the encoder backbone of both DepthNet and PoseNet. An input image is resized to 304 × 228, and the size of each encoder output is 10 × 8 × 2048. The encoder consists of four layers with three, four, six, and three bottleneck blocks, each of which has three convolutional layers with kernel sizes 1, 3, and 1, followed by batch normalization and ReLU activation. The output feature is then transformed into 10 × 8 × 320 and 10 × 8 × 640 features using 1 × 1 convolutions to be used as inputs for the DepthNet and PoseNet decoders, respectively. Each decoder includes four upsampling blocks, each of which consists of a bilinear interpolation layer and two 3 × 3 convolution layers with the ReLU activation. DepthNet outputs a depth map of resolution 152 × 114, while PoseNet generates 16 heatmaps, one for each keypoint type, of resolution 304 × 228.

2) TRAINING
We train the proposed HDEP algorithm in three steps. First, we train PoseNet using the MPII and DIH datasets, which contain keypoint labels. We use the Adam optimizer [57] for 140 epochs with an initial learning rate of $10^{-3}$. We decrease the learning rate by a factor of 0.1 at the 70th and 110th epochs. In this step, only the heatmap loss $\ell_{h}$ is used. Second, we train DepthNet only, using the NYUv2, DIH, and HD + P datasets containing depth labels. We perform the same training as in the first step, except that the depth loss $\ell_{d}$ is used instead of $\ell_{h}$. Finally, we fine-tune both DepthNet and PoseNet jointly for 60 epochs using all four datasets with the overall loss $\ell_{all}$ in (3). For the loss rebalancing, we gradually reduce the hyper-parameter $\lambda$ in (5) from 3 to −3, as done in [33]. Also, in the feature blending block, $\kappa_P$ in (1) is set to 0, 0, and 0.1 and $\kappa_D$ to 0, 1, and 1 in the first, second, and third steps, respectively.

C. COMPARATIVE ASSESSMENT

1) EVALUATION METRIC
We assess the depth estimation performance using two protocols. The first protocol (Entire) evaluates the estimation accuracy on an entire depth map, which is commonly used for assessing depth estimation performance. However, the performance evaluated using this protocol is strongly affected by large background regions, so it may fail to faithfully reflect the accuracy on human regions. Therefore, to assess the depth estimation performance on human regions, we develop another protocol (Human-oriented) that measures the accuracy only on the regions containing humans. Specifically, we compute the accuracy only within the bounding boxes that contain all 16 ground-truth keypoints. For both evaluation protocols, three evaluation metrics for depth estimation performance are employed, which are defined as
$\delta_k = \frac{1}{N} \sum_{i} \mathbb{I}\big( \max( \hat{d}_i / d_i, \, d_i / \hat{d}_i ) < 1.25^k \big)$,
$\mathrm{RMSE} = \sqrt{\frac{1}{N} \sum_{i} (\hat{d}_i - d_i)^2}$,
$\mathrm{ARD} = \frac{1}{N} \sum_{i} |\hat{d}_i - d_i| / d_i$,
where $d_i$ and $\hat{d}_i$ denote the ground-truth and predicted depths of pixel $i$, respectively, and $N$ is the number of pixels in a depth map.
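These three metrics can be computed directly from their definitions; a NumPy illustration on flattened depth maps (function name assumed):

```python
import numpy as np

def depth_metrics(d_pred, d_gt, k=1):
    """Threshold accuracy delta_k, RMSE, and absolute relative
    difference (ARD) over all pixels of a depth map."""
    ratio = np.maximum(d_pred / d_gt, d_gt / d_pred)
    delta = np.mean(ratio < 1.25 ** k)
    rmse = np.sqrt(np.mean((d_pred - d_gt) ** 2))
    ard = np.mean(np.abs(d_pred - d_gt) / d_gt)
    return delta, rmse, ard

d_gt = np.array([1.0, 2.0, 4.0])
d_pred = np.array([1.0, 2.0, 8.0])   # one pixel off by a factor of 2
delta1, rmse, ard = depth_metrics(d_pred, d_gt)
```

In the Human-oriented protocol, the same computation would simply be restricted to the pixels inside the keypoint bounding boxes.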

2) BASELINES
As mentioned in Section III-A, any monocular depth estimator can be used with the proposed HDEP algorithm. We test four conventional depth estimators: Chen et al. [20], Hu et al. [19], ResNet-50 [50], and EfficientNet-B3 [58]. Note that ResNet and EfficientNet are commonly used as backbone networks for both depth estimation [14], [59], [60] and pose estimation [61]. For Chen et al. [20] and Hu et al. [19], we use ResNet-34 [50] as the backbone and train them as specified in the respective papers.

TABLE 4. Comparison of the depth estimation results on the DIH and HD + P datasets using the two evaluation protocols. For δ scores, a higher score indicates a better result, while lower RMSE and ARD values are better. The scores in red indicate that the proposed algorithm improves the performance of each conventional estimator, while those in blue indicate that it degrades the performance.

In the Entire protocol on the DIH dataset, the proposed algorithm does not improve the conventional estimators in every case. This inconsistency is partly due to the lack of data diversity in the DIH dataset. In contrast, on the HD + P dataset, the proposed algorithm outperforms the backbone networks in all cases. Moreover, in the Human-oriented protocol, the proposed algorithm outperforms the backbone networks in almost all cases, in many cases by large margins. For example, the proposed algorithm provides a 0.044 higher $\delta_1$ score than Chen et al.'s estimator on the DIH dataset and yields a 0.041 lower RMSE value than ResNet-50 on the HD + P dataset. These results indicate that the proposed algorithm improves the depth estimation performance on human regions significantly, while providing comparable or even better performance on entire images.

We also conducted a statistical test to verify the performance improvement using the Wilcoxon signed-rank test [62]. The null hypothesis is that the proposed algorithm degrades the estimation performance over the baseline for each evaluation metric. The signed-rank sum is normalized by the rank sum so that it lies in the range [−1, 1].
A positive number indicates performance improvement over the baseline, while a negative number implies performance degradation. Table 5 shows the normalized signed-rank sums for comparing the proposed algorithm against Chen et al.'s. The obtained p-values satisfy p < 0.01 for all metrics except for $\delta_3$ in the Human-oriented protocol, where p = 0.11. These results imply that the null hypotheses are rejected for all the other metrics, so the proposed algorithm indeed improves the estimation performance of the baseline.

To demonstrate the generalization capability of the proposed algorithm, we compare the performance on the validation set of MPII [51] and the test set of the Human3.6M dataset [46], which contain more diverse scenes and humans than DIH and HD + P, in Figures 11 and 12, respectively. In this test, we use Chen et al.'s estimator [20]. It can be seen that the proposed algorithm improves the estimation performance of the baseline, especially in terms of shape and depth consistency on human regions, thereby reconstructing human silhouettes faithfully. Figure 11 also shows that the proposed algorithm improves the estimation performance consistently regardless of the sizes of humans.

3) QUANTITATIVE AND QUALITATIVE EVALUATION
Finally, we compare the depth estimation performance according to various factors affecting the difficulty of estimation. To this end, we split the test set of the HD + P dataset into three groups: images with complex poses, occluded humans, and normal scenes. Table 7 compares the estimation performance. These results indicate that the proposed algorithm improves the depth estimation performance on human regions consistently, regardless of the difficulty of the scene.

FIGURE 13. Visualization of error maps. The first two rows are from the DIH dataset [55] and the rest are from the HD + P dataset. The proposed rebalancing scheme focuses more on human regions during training.

D. ANALYSIS
We conduct ablation studies to analyze the contributions of the key components in the proposed algorithm: loss functions, weight rebalancing scheme, and feature blending strategy. All experiments are performed using ResNet-50 [50] as the baseline, and the Human-oriented protocol is used for comparison on the HD + P dataset, unless specified otherwise.

1) LOSS FUNCTIONS
To analyze the effectiveness of the losses, we train the proposed networks to estimate depth maps using different combinations of losses. Table 8 compares the results quantitatively. First, adding ($\ell_{n}$, $\ell_{m}$, $\ell_{x}$, $\ell_{y}$) improves the estimation performance. Second, $\ell_{h}$ alone slightly decreases the depth estimation performance, for it supervises only pose information without considering depth maps. Third, joint training of pose information using $\ell_{h}$ together with the joint losses $\ell_{g}$ and $\ell_{p}$ improves the depth estimation performance significantly.
In addition, we analyze how each loss helps to train the networks. Figure 13 visualizes some error maps for the eight losses in Table 1 on both the DIH and HD + P datasets. Note that $\ell_{g}$ cannot be computed because ground-truth keypoints are unavailable in the training set of HD + P. Each error map focuses on different aspects of a depth map. For example, $\ell_{d}$ and $\ell_{m}$ attend to overall depth maps, while $\ell_{g}$ and $\ell_{p}$ focus on human regions. Also, $\ell_{x}$ and $\ell_{y}$ are activated on human silhouettes, where depths vary abruptly.

2) LOSS REBALANCING STRATEGIES
To analyze the effectiveness of the proposed loss rebalancing algorithm, we train the proposed networks with three different weighting strategies: equal weighting without rebalancing, Lee and Kim's scheme [33], and the proposed rebalancing scheme. Table 9 compares the performances. The proposed algorithm provides significantly better depth estimation performance than the other two schemes by effectively equalizing the contributions of all losses to the overall loss.
In addition, we visualize the weighted sum of error maps for the equal weighting and the proposed rebalancing strategies in Figure 13. We see that the proposed rebalancing scheme focuses more on human regions than the baseline. This confirms the effectiveness of the proposed rebalancing strategy on the depth estimation performance on human regions. That is, as shown in Figure 4, if the weight for each loss is identical, the weighted losses are on different scales, and thus the overall loss is dominated by large losses. In contrast, with the proposed rebalancing scheme, the weighted loss values contribute equally to the overall loss.

3) FEATURE BLENDING STRATEGIES
We analyze the effects of different feature blending strategies and the number of feature blending layers. Table 10 compares the quantitative results for different configurations obtained by using Chen et al. [20] as the baseline. Using more layers in the feature blending block significantly increases the depth estimation performance, especially for concatenation and addition, by exploiting human pose information at various scales. Also, we see that the addition strategy provides better estimation performance than the other two blending strategies.

TABLE 11. Comparison of the depth estimation results of ResNet-based architectures on the DIH and HD + P datasets using the Human-oriented evaluation protocol.
Finally, we consider an extreme scenario in Table 11, where a single encoder is shared by two decoders (ResNet-50-2D). This setting degrades the performance, since sharing an encoder between the depth and pose tasks hinders the training. This result indicates that the performance improvement of the proposed algorithm comes from the effective use of pose information.

V. CONCLUSION
We proposed a monocular depth estimation algorithm that exploits pose information for images with humans. The proposed algorithm is composed of two networks, PoseNet and DepthNet, which estimate keypoint heatmaps and a depth map, respectively. We jointly trained the two networks by adopting the feature blending block and the loss weight rebalancing scheme. In addition, we constructed a new dataset for human depth estimation, called HD + P, in which an image is paired with a ground-truth depth map and a set of keypoint heatmaps. Experimental results demonstrated that the proposed algorithm improves depth estimation performance, especially on human regions. However, the proposed algorithm improves the depth estimation performance only on images containing humans. Thus, an important direction for future work is to develop adaptive schemes for feature transfer from PoseNet to DepthNet by adapting the network architecture according to the existence of humans in images.