ADR-MVSNet: A cascade network for 3D point cloud reconstruction with pixel occlusion

https://doi.org/10.1016/j.patcog.2021.108516

Highlights

  • The adaptive depth reduction module learns different depth ranges for each pixel.

  • The occlusion map reflects the visibility of pixels in different views.

  • The outliers in depth map are filtered by multi-scale photometric consistency.

Abstract

3D point cloud reconstruction is an urgent task in computer vision for environment perception. Nevertheless, the reconstructed scene is often inaccurate and incomplete because the visibility of pixels is not taken into account by existing methods. In this paper, a cascaded network with a multiple cost volume aggregation module, named ADR-MVSNet, is proposed. Three improvements are presented in ADR-MVSNet. First, to improve the reconstruction accuracy and reduce the time complexity, an adaptive depth reduction module is proposed, which adaptively adjusts the depth range of each pixel through a confidence interval. Second, to more accurately estimate the depth of occluded pixels in multiview images, a multiple cost volume aggregation module is proposed, in which the Gini impurity is introduced to measure the confidence of pixel depth prediction. Third, a multiscale photometric consistency filter module is proposed, which jointly considers the information in multiple confidence maps and accurately filters out pixels with low confidence as outliers. Therefore, the accuracy of point cloud reconstruction is improved. The experimental results on the DTU and Tanks and Temples datasets demonstrate that ADR-MVSNet achieves highly accurate and highly complete reconstruction compared with state-of-the-art benchmarks.

Introduction

The goal of 3D point cloud reconstruction is to reconstruct the original 3D geometry from single-view or multiview images. At present, 3D point cloud reconstruction is applied in fields such as human pose estimation [1], UAV detection [2], robot navigation [3] and point cloud semantic segmentation [4]. Environmental information is a prerequisite for positioning, path planning, and motion control, and it therefore largely determines the future development of these fields. As one of the most urgent tasks in computer vision and graphics [5], 3D point cloud reconstruction generates the 3D geometry of the scene, which contains abundant environmental information; consequently, an increasing number of researchers are studying 3D point cloud reconstruction.

According to the density of the reconstructed scene, 3D point cloud reconstruction is divided into two categories: structure from motion (SFM) [6] and multiview stereo (MVS) [7]. SFM is an algorithm for 3D point cloud reconstruction based on a series of time-ordered images [8]. It estimates the motion of the camera by extracting and matching handcrafted features between the time-ordered images and finally recovers the point cloud. MVS is another 3D reconstruction algorithm, based on a collection of images and the corresponding camera parameters [9]. Compared with SFM, MVS generates a dense reconstruction [10] by performing matching operations on every pixel of an image. Because dense reconstruction contains more abundant environmental information, which is the key to scene understanding and environmental perception, MVS for generating dense point clouds has been studied further in recent years. MVS methods are mainly divided into traditional methods and deep learning methods.

Traditional methods rely on photometric consistency and geometric consistency to compute multiview similarity. Campbell et al. [11] proposed a multiview stereo algorithm that improves performance on sparse image sets. They used normalized cross-correlation as a metric to extract a series of possible depth values and then generated depth hypotheses for each pixel through a Markov random field. Furukawa and Ponce [10] proposed a novel approach that enforces local photometric consistency and global visibility constraints by repeatedly expanding matched key points; this approach also detects and discards obstacles and outliers. Tola et al. [12] proposed an approach for large-scale 3D reconstruction using a robust descriptor, which significantly reduces the time consumption by checking match consistency and rejecting erroneous matches. Galliani et al. [13] presented Gipuma, a massively parallel method for multiview matching that operates on half of all pixels in an image in parallel with a red–black checkerboard scheme. However, these methods are not robust in low-texture and overexposed regions, where photometric consistency is unreliable. Thus, Schönberger et al. [7] presented the multiview stereo system COLMAP, which jointly estimates depth maps and normal information and performs pixel-wise view selection using photometric priors. Nevertheless, these traditional methods achieve complete reconstruction only under the assumption that the scene is Lambertian; thus, their further development is limited by the complexity of the scene.

Compared with traditional methods, deep learning methods generate more complete and accurate point clouds because of the robustness of convolutional neural networks (CNNs). CNNs are applied to multiview stereo in diverse ways. Voxel-based methods discretize the scene into 3D grids and determine whether each voxel lies on the surface; 3D CNNs are then utilized to regularize the voxels [10]. Nevertheless, large-scale reconstruction is difficult due to the high memory consumption of voxels. To avoid the inherent trade-off between memory consumption and reconstruction accuracy, patch-based methods, which convert the scene into multiple patches, have been proposed [14]. However, patch-based methods are not well suited to scenes with rich details. To preserve the details of the scene, point cloud-based methods [15] first generate point clouds from different views and then rely on a distance metric function to obtain the point cloud of the scene. Because the distance metric over point clouds is computed sequentially, point cloud-based methods are usually time-consuming. To overcome this limitation, MVSNet [16], a pioneering method that generates point clouds from depth maps, was proposed. MVSNet generates multiview depth maps through differentiable homography warping and then reconstructs the point cloud through a variance-based fusion algorithm. However, it does not consider the memory consumption of 3D convolution or the visibility of pixels across multiview images.

In summary, three issues remain unresolved in MVSNet. First, the memory consumption of the network grows cubically with increasing scene resolution. Thus, although the reconstruction strategy based on MVSNet achieves excellent performance on low-resolution images, the high memory requirement limits its application to high-resolution images. Second, not all pixels in one view are visible in the other views. Due to this occlusion phenomenon, mismatched pixels, i.e., nonidentical pixels that are matched together across different views, result in incomplete reconstruction results. Third, outliers still exist in the depth map in low-texture and overexposed regions. Since previous networks are unable to eliminate these outliers well, the dense 3D point clouds they generate are not sufficiently accurate.

To address the above problems, a cascaded network with a multiple cost volume aggregation module, named ADR-MVSNet, is proposed. First, to reduce the memory consumption of high-resolution 3D point cloud reconstruction, an adaptive depth reduction (ADR) module is proposed. In ADR, the depth map is estimated in a coarse-to-fine manner by utilizing a confidence interval, where an optimized confidence factor adaptively reduces the depth range according to the probability distribution in the depth direction. Specifically, ADR estimates the depth map in three stages. In the first stage, the depth range of each pixel is obtained directly from the dataset. The second stage utilizes the depth map estimated in the first stage and calculates a confidence interval from the kurtosis and skewness to adaptively adjust the depth range of each pixel. The third stage utilizes the depth map estimated in the second stage and calculates a more accurate confidence interval, again from the kurtosis and skewness, to further adjust the depth range. Second, to reduce the negative influence of mismatched pixels in the cost volume, a multiple cost volume aggregation (MCVA) module is proposed. In MCVA, a view-wise cost volume is first constructed for each reference-source image pair. Then, the matching degree of each pixel is measured by the Gini impurity, and an occlusion map is obtained for each view-wise cost volume through a CNN. Finally, the aggregated cost volume is obtained through a weighted summation of the view-wise cost volumes with their occlusion maps. Third, a multiscale photometric consistency filter (MPCF) module is proposed to filter the outliers in the depth map. All confidence maps are first upsampled to the original resolution. If the confidence values of a pixel on all confidence maps are greater than the correlation threshold, the depth value of that pixel is considered reliable; otherwise, it is filtered out as an outlier. In this way, the accuracy of the reconstructed scene is greatly improved compared with existing methods.
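To make the ADR idea above concrete, the following is a minimal PyTorch-style sketch of how a per-pixel depth range could be narrowed from the per-pixel depth probability distribution. The function name adaptive_depth_range, the tensor layout [B, D, H, W], and the heuristic mapping from kurtosis and skewness to an interval width are illustrative assumptions, not the paper's exact formulation.

```python
# Minimal sketch of adaptive depth-range reduction; heuristic and names are assumptions.
import torch


def adaptive_depth_range(prob_volume, depth_values, min_half_width):
    """Narrow the per-pixel depth search range from the depth probability distribution.

    prob_volume   : [B, D, H, W] softmax probabilities over D depth hypotheses
    depth_values  : [B, D, H, W] depth hypothesis of every plane for every pixel
    min_half_width: smallest allowed half-width of the new range (scalar, depth units)
    Returns (depth_min, depth_max), each of shape [B, H, W].
    """
    # Per-pixel moments of the distribution along the depth dimension.
    mean = torch.sum(prob_volume * depth_values, dim=1)              # expected depth, [B, H, W]
    centered = depth_values - mean.unsqueeze(1)
    var = torch.sum(prob_volume * centered ** 2, dim=1).clamp(min=1e-6)
    std = torch.sqrt(var)
    skew = torch.sum(prob_volume * centered ** 3, dim=1) / std ** 3  # asymmetry of the distribution
    kurt = torch.sum(prob_volume * centered ** 4, dim=1) / var ** 2  # peakedness of the distribution

    # Assumed heuristic: a peaked, symmetric distribution (large kurtosis, small |skew|)
    # yields a tight interval around the expected depth; a flat or skewed one keeps a
    # wider, shifted interval so the true depth is less likely to fall outside it.
    half_width = (3.0 * std * (1.0 + skew.abs()) / kurt.clamp(min=1.0)).clamp(min=min_half_width)
    center = mean + skew * std                                       # shift toward the heavy tail
    return center - half_width, center + half_width
```

The returned per-pixel range would then be re-sampled into a smaller number of depth planes for the next cascade stage, which is what allows the finer stages to use far fewer hypotheses than a fixed global range.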

ADR-MVSNet is trained and tested on the DTU [17] and Tanks and Temples datasets [18]. The experimental results show that ADR-MVSNet achieves superior performance compared with state-of-the-art benchmarks.

In summary, the main contribution of this paper is the proposal of a cascaded network with a multiple cost volume aggregation module. It consists of three innovations for multiview stereo.

  • (1)

    We propose an adaptive depth reduction module. The module adaptively learns a different depth range for each pixel from the probability distribution of the pixel in the depth direction. By exploiting the confidence interval, the module accurately estimates the pixel depth while constructing only a small number of depth planes.

  • (2)

    We propose a measure to learn occlusion maps from reference-source image pairs. Occlusion maps reflect the visibility of pixels in different views. This measure allows our network to reconstruct a complete object surface.

  • (3)

    We propose a multiscale photometric consistency filtering module. The module jointly exploits the different receptive fields of the low-resolution and high-resolution confidence maps to filter out outliers both inside and outside the object; a minimal filtering sketch is given after this list.
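As referenced in contribution (3), the sketch below implements only the filtering rule stated in the text: a depth value is kept if and only if its confidence exceeds a threshold on every upsampled confidence map. The function name mpcf_mask, the assumption of one confidence map per cascade stage, and the example thresholds are hypothetical.

```python
# Minimal sketch of multiscale confidence filtering; names and thresholds are assumptions.
import torch
import torch.nn.functional as F


def mpcf_mask(confidence_maps, thresholds, out_size):
    """confidence_maps: list of [B, 1, h_i, w_i] per-stage confidence maps
    thresholds     : one threshold per stage (assumed values)
    out_size       : (H, W) of the full-resolution depth map
    Returns a boolean mask of shape [B, 1, H, W]; False marks outliers."""
    mask = None
    for conf, thr in zip(confidence_maps, thresholds):
        # Upsample every confidence map to the original resolution before thresholding.
        conf_up = F.interpolate(conf, size=out_size, mode="bilinear", align_corners=False)
        stage_ok = conf_up > thr
        mask = stage_ok if mask is None else (mask & stage_ok)
    return mask


# Usage sketch: filter the full-resolution depth map before fusing it into a point cloud.
# keep = mpcf_mask([conf_s1, conf_s2, conf_s3], [0.3, 0.5, 0.7], depth.shape[-2:])
# depth_filtered = depth * keep.float()
```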

The rest of the paper is organized as follows. Related work is introduced in Section 2. Section 3 describes the cascaded network architecture named ADR-MVSNet in detail. Section 4 presents the experimental results for ADR-MVSNet and a comparison with other state-of-the-art methods. Finally, our conclusions are provided in Section 5.


Related work

In this section, 3D reconstruction is introduced first. Then some of the basic theories used in our research are briefly explained.

Multiview stereo network with adaptive depth reduction module (ADR-MVSNet)

Given a reference image $I_1$, the set of source images is defined as Eq. (1):

$$S = \{ I_i \in \mathbb{R}^{H \times W \times 3} \mid i = 2, \ldots, N \}, \tag{1}$$

where $N$ represents the number of input images and $H$ and $W$ are the height and width of the images, respectively. ADR-MVSNet aims to infer the depth map of the reference image with the help of the corresponding camera parameters. Fig. 1 presents the architecture diagram of ADR-MVSNet, which consists of an adaptive depth reduction module, a multiple cost volume aggregation module and a multiscale photometric consistency filter module.
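As one illustration of the multiple cost volume aggregation step summarized in the introduction, the rough sketch below weights each view-wise cost volume by an occlusion map predicted from a Gini-impurity confidence cue and then sums the weighted volumes. The small CNN, the class and function names, and the tensor layout [B, D, H, W] are assumptions made for illustration; the paper's exact architecture is not reproduced here.

```python
# Rough sketch of occlusion-weighted cost volume aggregation; architecture is an assumption.
import torch
import torch.nn as nn


class OcclusionWeight(nn.Module):
    """Predicts a per-pixel visibility weight for one view-wise cost volume."""

    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(1, 8, kernel_size=3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(8, 1, kernel_size=3, padding=1), nn.Sigmoid(),
        )

    def forward(self, view_cost_volume):
        # view_cost_volume: [B, D, H, W] matching cost for one reference-source pair.
        prob = torch.softmax(-view_cost_volume, dim=1)           # low cost -> high probability
        gini = 1.0 - torch.sum(prob ** 2, dim=1, keepdim=True)   # [B, 1, H, W]; low Gini = confident match
        return self.net(gini)                                    # occlusion/visibility map in (0, 1)


def aggregate_cost_volumes(view_volumes, occ_module):
    """Weighted summation of view-wise cost volumes with their occlusion maps."""
    weights = [occ_module(v) for v in view_volumes]              # each [B, 1, H, W]
    weight_sum = torch.stack(weights, dim=0).sum(dim=0).clamp(min=1e-6)
    aggregated = sum(w * v for w, v in zip(weights, view_volumes)) / weight_sum
    return aggregated                                            # [B, D, H, W]
```

The design intent, per the description in the introduction, is that views in which a pixel is occluded receive small weights, so mismatched pixels contribute little to the aggregated cost volume.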

Experiments and results

In this section, the DTU dataset [17] and the Tanks and Temples dataset [18] are introduced. Then, the evaluation metrics and implementation details are explained. A comparison with several state-of-the-art methods is presented in Section 4.4. To verify the effectiveness of the modules in ADR-MVSNet, ablation experiments are conducted in Section 4.5. In Section 4.6, to show that ADR-MVSNet is a lightweight network, we compare its memory consumption and time complexity with those of other state-of-the-art methods.

Conclusion

3D point cloud reconstruction is a challenging task due to the existence of low-texture and occluded areas. To overcome this challenge, a network architecture called ADR-MVSNet is proposed for 3D point cloud reconstruction. Three contributions are made in this paper. First, an adaptive depth reduction module, ADR, is proposed to effectively narrow the depth range. ADR narrows the depth range adaptively, which improves the accuracy of depth estimation and greatly reduces the

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgement

This research was supported by the Department of Science and Technology of Jilin Province (No. 20190303135SF).


References (40)

  • J.L. Schönberger et al., Pixelwise view selection for unstructured multi-view stereo

  • Y. Furukawa et al., Accurate, dense, and robust multiview stereopsis, IEEE Trans. Pattern Anal. Mach. Intell. (2009)

  • N.D.F. Campbell et al., Using multiple hypotheses to improve depth-maps for multi-view stereo

  • E. Tola et al., Efficient large-scale multi-view stereo for ultra high-resolution image sets, Mach. Vis. Appl. (2012)

  • S. Galliani et al., Massively parallel multiview stereopsis by surface normal diffusion

  • M. Goesele et al., Multi-view stereo for community photo collections

  • Y. Wei et al., Conditional single-view shape generation for multi-view stereo reconstruction

  • Y. Yao et al., MVSNet: depth inference for unstructured multi-view stereo

  • H. Aanæs et al., Large-scale data for multiple-view stereopsis, Int. J. Comput. Vis. (2016)

  • A. Knapitsch et al., Tanks and temples: benchmarking large-scale scene reconstruction, ACM Trans. Graph. (2017)

Ying Li received the B.S., M.S., and Ph.D. degrees from Jilin University. From 2000 to 2006, she was an Associate Professor with the Department of Space Information Processing, Jilin University. Since 2006, she has been a Professor of computer application technology at Jilin University. She is currently a fellow of the China Computer Federation. She has published over 60 papers in journals and at international conferences. Her research interests include big data, 3D visual modeling, 3D image processing, machine vision and machine learning.

Zhijie Zhao received the B.S. degree from the College of Software, Jilin University, Changchun, China, in 2019, where he is currently pursuing the master's degree. His research interests include 3D point cloud reconstruction, machine vision and machine learning.

Jiahao Fan received the B.S. degree from the Computer Science and Technology College, Jilin University, Changchun, China, in 2015, where he is currently pursuing the Ph.D. degree. His research interests include metaheuristic algorithms, machine learning, image processing, data mining, and 3D data processing.

Wenyue Li received the B.S. degree from the College of Software, Jilin University, Changchun, China, in 2019, where she is currently pursuing the master's degree. Her research interests include 3D point cloud reconstruction and image processing.
