Configurable 3D Scene Synthesis and 2D Image Rendering with Per-pixel Ground Truth Using Stochastic Grammars


Abstract

We propose a systematic learning-based approach to the generation of massive quantities of synthetic 3D scenes and arbitrary numbers of photorealistic 2D images thereof, with associated ground truth information, for the purposes of training, benchmarking, and diagnosing learning-based computer vision and robotics algorithms. In particular, we devise a learning-based pipeline of algorithms capable of automatically generating and rendering a potentially infinite variety of indoor scenes by using a stochastic grammar, represented as an attributed Spatial And-Or Graph, in conjunction with state-of-the-art physics-based rendering. Our pipeline synthesizes scene layouts with high diversity, and it is configurable in that it enables precise customization and control of important attributes of the generated scenes. It renders photorealistic RGB images of the generated scenes while automatically synthesizing detailed, per-pixel ground truth data, including visible surface depth and normals, object identity, and material information (down to object parts), as well as environment information (e.g., illumination and camera viewpoints). We demonstrate the value of our synthesized dataset by improving performance in certain machine-learning-based scene understanding tasks (depth and surface normal prediction, semantic segmentation, reconstruction, etc.) and by providing benchmarks for, and diagnostics of, trained models through controllable modification of object attributes and scene properties.
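To make the core generative mechanism concrete, the sketch below illustrates how a scene layout can be drawn from a stochastic grammar represented as an attributed And-Or graph: And-nodes expand to all of their children, Or-nodes select exactly one child according to branching probabilities, and terminal nodes carry sampled continuous attributes. This is a minimal, hypothetical illustration; the node names, attributes, and probabilities are invented for exposition, and the paper's full attributed Spatial And-Or Graph is considerably richer (e.g., it is learned from data and combined with physics-based rendering to produce the images and per-pixel ground truth).

```python
# Minimal sketch: sampling a scene layout from an attributed And-Or graph.
# All names, attributes, and probabilities here are hypothetical examples,
# not the paper's actual grammar or parameters.
import random
from dataclasses import dataclass, field

@dataclass
class Node:
    name: str
    kind: str                                       # "and" | "or" | "terminal"
    children: list = field(default_factory=list)    # for "and"/"or" nodes
    probs: list = field(default_factory=list)       # branching probs for "or"
    attributes: dict = field(default_factory=dict)  # e.g., placement ranges

def sample(node):
    """Recursively expand a node into a list of terminal (object) instances."""
    if node.kind == "terminal":
        # Sample continuous attributes, e.g., a 2D position inside the room.
        x = random.uniform(*node.attributes.get("x_range", (0.0, 1.0)))
        y = random.uniform(*node.attributes.get("y_range", (0.0, 1.0)))
        return [(node.name, (round(x, 2), round(y, 2)))]
    if node.kind == "and":
        # An And-node decomposes into all of its children.
        return [obj for child in node.children for obj in sample(child)]
    # An Or-node selects exactly one child according to branching probabilities.
    child = random.choices(node.children, weights=node.probs, k=1)[0]
    return sample(child)

# Hypothetical toy grammar: a bedroom contains a bed plus either a desk or a wardrobe.
bed = Node("bed", "terminal", attributes={"x_range": (0.0, 2.0)})
desk = Node("desk", "terminal")
wardrobe = Node("wardrobe", "terminal")
furniture = Node("furniture", "or", children=[desk, wardrobe], probs=[0.6, 0.4])
bedroom = Node("bedroom", "and", children=[bed, furniture])

print(sample(bedroom))  # e.g., [('bed', (1.31, 0.72)), ('desk', (0.24, 0.9))]
```

Repeated sampling from such a grammar yields a combinatorially large variety of layouts, and fixing or constraining particular nodes or attributes is what makes the generation configurable in the sense described above.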





Author information


Corresponding author

Correspondence to Yixin Zhu.

Additional information

Communicated by Adrien Gaidon, Florent Perronnin and Antonio Lopez.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Support for the research reported herein was provided by DARPA XAI Grant N66001-17-2-4029, ONR MURI Grant N00014-16-1-2007, and DoD CDMRP AMRAA Grant W81XWH-15-1-0147.


About this article


Cite this article

Jiang, C., Qi, S., Zhu, Y. et al. Configurable 3D Scene Synthesis and 2D Image Rendering with Per-pixel Ground Truth Using Stochastic Grammars. Int J Comput Vis 126, 920–941 (2018). https://doi.org/10.1007/s11263-018-1103-5

