A hybrid image dataset toward bridging the gap between real and simulation environments for robotics

Annotated desktop objects real and synthetic images dataset: ADORESet

  • Original Paper
  • Published in Machine Vision and Applications

Abstract

The primary motivation of computer vision in the robotics field is to obtain a level of perception as close as possible to that of the human visual system. Achieving this requires large datasets, sometimes including infrequent and seemingly irrelevant data to increase system robustness. To minimize the effort and time required to build such extensive datasets from the real world, the preferred method is to use simulation environments that replicate real-world conditions as closely as possible. Following this approach, machine vision problems in robotics (i.e., object detection, recognition, and manipulation) often employ synthetic images in their datasets, but these are rarely mixed with real-world images. When systems are trained only on synthetic images and tested within the simulated world, robotic tasks requiring object recognition can be accomplished. However, systems trained in this way cannot be used directly in real-world experiments or end-user products because of the inconsistencies between real and simulation environments. Therefore, we propose a hybrid image dataset of annotated desktop objects from real and synthetic worlds (ADORESet). This hybrid dataset provides purposeful object categories with a sufficient number of real and synthetic images. ADORESet consists of color images of \(300\times 300\) pixels in 30 categories. Each class contains 2500 real-world images collected from the web and 750 synthetic images generated within the Gazebo simulation environment. This hybrid dataset enables researchers to implement their own algorithms under both real-world and simulation conditions. All object images in ADORESet are fully annotated: object boundaries are manually specified, and bounding box coordinates are provided. Successor objects are also labeled to provide statistical information and likelihoods about the relations among objects in the dataset. To further demonstrate the benefits of the dataset, it is tested on object recognition tasks by fine-tuning state-of-the-art deep convolutional neural networks such as VGGNet, InceptionV3, ResNet, and Xception. The possible combinations of data types for these models are compared in terms of training time, accuracy, and loss. In the object recognition experiments, training on all-real images yields approximately \(49\%\) validation accuracy on simulation images. When training is performed on all-synthetic images and validated on all-real images, the accuracy falls below \(10\%\). When the complete ADORESet is used for training and validation, the hybrid validation accuracy reaches approximately \(95\%\). These results further demonstrate that including real and synthetic images together during training and validation increases overall system accuracy and reliability.
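As a rough illustration of the fine-tuning workflow summarized above (this is a sketch, not the authors' code), the snippet below adapts an ImageNet-pretrained ResNet50 to a 30-class hybrid dataset using TensorFlow/Keras. The directory layout (adoreset/train, adoreset/val), batch size, learning rate, and epoch count are illustrative assumptions; ADORESet itself only fixes the image size (300×300 RGB) and the number of classes (30).

```python
# Minimal fine-tuning sketch under assumed paths and hyperparameters.
import tensorflow as tf
from tensorflow.keras import layers, models
from tensorflow.keras.applications import ResNet50
from tensorflow.keras.applications.resnet50 import preprocess_input

IMG_SIZE = (300, 300)   # ADORESet images are 300x300 RGB
NUM_CLASSES = 30        # 30 desktop object categories

# Hypothetical layout: adoreset/{train,val}/<class>/*.jpg, mixing real and synthetic images.
train_ds = tf.keras.utils.image_dataset_from_directory(
    "adoreset/train", image_size=IMG_SIZE, batch_size=32, label_mode="categorical")
val_ds = tf.keras.utils.image_dataset_from_directory(
    "adoreset/val", image_size=IMG_SIZE, batch_size=32, label_mode="categorical")

# Apply the preprocessing expected by the pretrained backbone.
train_ds = train_ds.map(lambda x, y: (preprocess_input(x), y))
val_ds = val_ds.map(lambda x, y: (preprocess_input(x), y))

# Freeze the ImageNet-pretrained convolutional base and train only a new 30-way head.
base = ResNet50(weights="imagenet", include_top=False,
                input_shape=IMG_SIZE + (3,), pooling="avg")
base.trainable = False

model = models.Sequential([
    base,
    layers.Dense(NUM_CLASSES, activation="softmax"),
])
model.compile(optimizer=tf.keras.optimizers.Adam(1e-4),
              loss="categorical_crossentropy",
              metrics=["accuracy"])
model.fit(train_ds, validation_data=val_ds, epochs=10)
```

Under the same assumptions, rerunning this loop with real-only, synthetic-only, or hybrid train/validation splits reproduces the kind of cross-domain comparison reported in the abstract.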



Author information

Corresponding author

Correspondence to Ertugrul Bayraktar.

About this article

Cite this article

Bayraktar, E., Yigit, C.B. & Boyraz, P. A hybrid image dataset toward bridging the gap between real and simulation environments for robotics. Machine Vision and Applications 30, 23–40 (2019). https://doi.org/10.1007/s00138-018-0966-3
