Configurable 3D Scene Synthesis and 2D Image Rendering with Per-pixel Ground Truth Using Stochastic Grammars


Abstract

We propose a systematic learning-based approach to the generation of massive quantities of synthetic 3D scenes and arbitrary numbers of photorealistic 2D images thereof, with associated ground truth information, for the purposes of training, benchmarking, and diagnosing learning-based computer vision and robotics algorithms. In particular, we devise a learning-based pipeline of algorithms capable of automatically generating and rendering a potentially infinite variety of indoor scenes by using a stochastic grammar, represented as an attributed Spatial And-Or Graph, in conjunction with state-of-the-art physics-based rendering. Our pipeline synthesizes scene layouts with high diversity, and it is configurable in that it enables precise customization and control of important attributes of the generated scenes. It renders photorealistic RGB images of the generated scenes while automatically synthesizing detailed, per-pixel ground truth data, including visible surface depth and normals, object identity, and material information (down to object parts), as well as environment information (e.g., illumination and camera viewpoints). We demonstrate the value of our synthesized dataset by improving performance in certain machine-learning-based scene understanding tasks (depth and surface normal prediction, semantic segmentation, reconstruction, etc.) and by providing benchmarks for, and diagnostics of, trained models through controllable modification of object attributes and scene properties.
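To make the core generative mechanism concrete, the sketch below illustrates how a scene layout can be drawn from a stochastic grammar represented as an attributed And-Or graph: And-nodes expand to all of their children, Or-nodes select exactly one child according to branching probabilities, and terminal nodes carry sampled continuous attributes. This is a minimal, hypothetical illustration; the node names, attributes, and probabilities are invented for exposition, and the paper's full attributed Spatial And-Or Graph is considerably richer (e.g., it is learned from data and combined with physics-based rendering to produce the images and per-pixel ground truth).

```python
# Minimal sketch: sampling a scene layout from an attributed And-Or graph.
# All names, attributes, and probabilities here are hypothetical examples,
# not the paper's actual grammar or parameters.
import random
from dataclasses import dataclass, field

@dataclass
class Node:
    name: str
    kind: str                                       # "and" | "or" | "terminal"
    children: list = field(default_factory=list)    # for "and"/"or" nodes
    probs: list = field(default_factory=list)       # branching probs for "or"
    attributes: dict = field(default_factory=dict)  # e.g., placement ranges

def sample(node):
    """Recursively expand a node into a list of terminal (object) instances."""
    if node.kind == "terminal":
        # Sample continuous attributes, e.g., a 2D position inside the room.
        x = random.uniform(*node.attributes.get("x_range", (0.0, 1.0)))
        y = random.uniform(*node.attributes.get("y_range", (0.0, 1.0)))
        return [(node.name, (round(x, 2), round(y, 2)))]
    if node.kind == "and":
        # An And-node decomposes into all of its children.
        return [obj for child in node.children for obj in sample(child)]
    # An Or-node selects exactly one child according to branching probabilities.
    child = random.choices(node.children, weights=node.probs, k=1)[0]
    return sample(child)

# Hypothetical toy grammar: a bedroom contains a bed plus either a desk or a wardrobe.
bed = Node("bed", "terminal", attributes={"x_range": (0.0, 2.0)})
desk = Node("desk", "terminal")
wardrobe = Node("wardrobe", "terminal")
furniture = Node("furniture", "or", children=[desk, wardrobe], probs=[0.6, 0.4])
bedroom = Node("bedroom", "and", children=[bed, furniture])

print(sample(bedroom))  # e.g., [('bed', (1.31, 0.72)), ('desk', (0.24, 0.9))]
```

Repeated sampling from such a grammar yields a combinatorially large variety of layouts, and fixing or constraining particular nodes or attributes is what makes the generation configurable in the sense described above.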





Author information


Corresponding author

Correspondence to Yixin Zhu.

Additional information

Communicated by Adrien Gaidon, Florent Perronnin and Antonio Lopez.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Support for the research reported herein was provided by DARPA XAI Grant N66001-17-2-4029, ONR MURI Grant N00014-16-1-2007, and DoD CDMRP AMRAA Grant W81XWH-15-1-0147.


About this article


Cite this article

Jiang, C., Qi, S., Zhu, Y. et al. Configurable 3D Scene Synthesis and 2D Image Rendering with Per-pixel Ground Truth Using Stochastic Grammars. Int J Comput Vis 126, 920–941 (2018). https://doi.org/10.1007/s11263-018-1103-5

