
VisuoSpatial Foresight for physical sequential fabric manipulation


Abstract

Robotic fabric manipulation has applications in home robotics, textiles, senior care, and surgery. Existing fabric manipulation techniques, however, are designed for specific tasks, making it difficult to generalize across different but related tasks. We build upon the Visual Foresight framework to learn fabric dynamics that can be efficiently reused to accomplish different sequential fabric manipulation tasks with a single goal-conditioned policy. We extend our earlier work on VisuoSpatial Foresight (VSF), which learns visual dynamics on domain-randomized RGB images and depth maps simultaneously and entirely in simulation. In that earlier work, we evaluated VSF on multi-step fabric smoothing and folding tasks against 5 baseline methods in simulation and on the da Vinci Research Kit surgical robot without any demonstrations at train or test time. A key finding was that depth sensing significantly improves performance: RGBD data yields an \(\mathbf{80\%}\) improvement in fabric folding success rate in simulation over pure RGB data. In this work, we vary 4 components of VSF: data generation, the visual dynamics model, the cost function, and the optimization procedure. Results suggest that training visual dynamics models using longer, corner-based actions can improve the efficiency of fabric folding by 76% and enable, with 90% reliability, a physical sequential fabric folding task that VSF could not previously perform. Code, data, videos, and supplementary material are available at https://sites.google.com/view/fabric-vsf/.


References

  • Andrychowicz, O. M., Baker, B., Chociej, M., Józefowicz, R., McGrew, B., Pachocki, J. W., Petron, A., Plappert, M., Powell, G., Ray, A., Schneider, J., Sidor, S., Tobin, J., Welinder, P., Weng, L., & Zaremba, W. (2020). Learning dexterous in-hand manipulation. The International Journal of Robotics Research, 39, 20–23.

  • Babaeizadeh, M., Finn, C., Erhan, D., Campbell, R. H., & Levine, S. (2018). Stochastic variational video prediction. In International conference on learning representations (ICLR).

  • Balaguer, B., & Carpin, S. (2011). Combining imitation and reinforcement learning to fold deformable planar objects. In IEEE/RSJ international conference on intelligent robots and systems (IROS).

  • Balakrishna, A., Thananjeyan, B., Lee, J., Zahed, A., Li, F., Gonzalez, J. E., & Goldberg, K. (2019). On-policy robot imitation learning from a converging supervisor. In Conference on robot learning (CoRL).

  • Baraff, D., & Witkin, A. (1998). Large steps in cloth simulation. In ACM SIGGRAPH.

  • Berkenkamp, F., Schoellig, A. P., & Krause, A. (2016). Safe controller optimization for quadrotors with Gaussian processes. In IEEE international conference on robotics and automation (ICRA).

  • Borras, J., Alenya, G., & Torras, C. (2019). A Grasping-centered Analysis for Cloth Manipulation. arXiv preprint arXiv:1906.08202.

  • Chen, D., Zhou, B., Koltun, V., & Krahenbuhl, P. (2019). Learning by cheating. In Conference on robot learning (CoRL).

  • Chiuso, A., & Pillonetto, G. (2019). System identification: A machine learning perspective. Annual Review of Control, Robotics, and Autonomous Systems, 2, 281–304.


  • Chua, K., Calandra, R., McAllister, R., & Levine, S. (2018). Deep reinforcement learning in a handful of trials using probabilistic dynamics models. In Neural information processing systems (NeurIPS).

  • Community, B. O. (2018). Blender—a 3D modelling and rendering package. Blender Foundation, Stichting Blender Foundation. Retrieved Jan, 2020, from http://www.blender.org

  • Coumans, E., & Bai, Y. (2016–2019). PyBullet, a Python module for physics simulation for games, robotics and machine learning. Retrieved Jan, 2020, from http://pybullet.org.

  • Dasari, S., Ebert, F., Tian, S., Nair, S., Bucher, B., Schmeckpeper, K., Singh, S., Levine, S., & Finn, C. (2019). RoboNet: Large-scale multi-robot learning. In Conference on robot learning (CoRL).

  • Denton, E., & Fergus, R. (2018). Stochastic video generation with a learned prior. In International conference on machine learning (ICML).

  • Dhariwal, P., Hesse, C., Klimov, O., Nichol, A., Plappert, M., Radford, A., Schulman, J., Sidor, S., Wu, Y., & Zhokhov, P. (2017). OpenAI Baselines. https://github.com/openai/baselines.

  • Doumanoglou, A., Kargakos, A., Kim, T. K., & Malassiotis, S. (2014). Autonomous active recognition and unfolding of clothes using random decision forests and probabilistic planning. In IEEE international conference on robotics and automation (ICRA).

  • Ebert, F., Finn, C., Dasari, S., Xie, A., Lee, A., & Levine, S. (2018). Visual foresight: Model-based deep reinforcement learning for vision-based robotic control. arXiv preprint arXiv:1812.00568.

  • Ebert, F., Finn, C., Lee, A. X., & Levine, S. (2017). Self-supervised visual planning with temporal skip connections. In Conference on robot learning (CoRL).

  • Erickson, Z., Clever, H. M., Turk, G., Liu, C. K., & Kemp, C. C. (2018). Deep haptic model predictive control for robot-assisted dressing. In IEEE international conference on robotics and automation (ICRA).

  • Erickson, Z., Gangaram, V., Kapusta, A., Liu, C. K., & Kemp, C. C. (2020). Assistive gym: A physics simulation framework for assistive robotics. In IEEE international conference on robotics and automation (ICRA).

  • Finn, C., Goodfellow, I., & Levine, S. (2016). Unsupervised learning for physical interaction through video prediction. In Neural information processing systems (NeurIPS).

  • Finn, C., & Levine, S. (2017). Deep visual foresight for planning robot motion. In IEEE international conference on robotics and automation (ICRA).

  • Ganapathi, A., Sundaresan, P., Thananjeyan, B., Balakrishna, A., Seita, D., Grannen, J., Hwang, M., Hoque, R., Gonzalez, J. E., Jamali, N., Yamane, K., Iba, S., & Goldberg, K. (2021). Learning dense visual correspondences in simulation to smooth and fold real fabrics. In IEEE international conference on robotics and automation (ICRA).

  • Ganapathi, A., Sundaresan, P., Thananjeyan, B., Balakrishna, A., Seita, D., Hoque, R., Gonzalez, J. E., & Goldberg, K. (2020). MMGSD: Multi-modal Gaussian shape descriptors for correspondence matching in 1d and 2d deformable objects. In International conference on intelligent robots and systems (IROS) workshop on managing deformation. IEEE.

  • Gao, Y., Chang, H. J., & Demiris, Y. (2016). Iterative path optimisation for personalised dressing assistance using vision and force information. In IEEE/RSJ international conference on intelligent robots and systems (IROS).

  • Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep learning. MIT Press.

  • Hansen, N., & Auger, A. (2011). CMA-ES: Evolution strategies and covariance matrix adaptation. Association for Computing Machinery.

  • Hewing, L., Liniger, A., & Zeilinger, M. (2018). Cautious NMPC with Gaussian process dynamics for autonomous miniature race cars. In European control conference (ECC).

  • Hochreiter, S., & Schmidhuber, J. (1997). Long short-term memory. Neural Computation, 9, 1735–1780.


  • Hoque, R., Seita, D., Balakrishna, A., Ganapathi, A., Tanwani, A. K., Jamali, N., Yamane, K., Iba, S., & Goldberg, K. (2020). VisuoSpatial Foresight for multi-step, multi-task fabric manipulation. In Robotics: Science and systems (RSS).

  • Jangir, R., Alenya, G., & Torras, C. (2020). Dynamic cloth manipulation with deep reinforcement learning. In IEEE international conference on robotics and automation (ICRA).

  • Jia, B., Hu, Z., Pan, J., & Manocha, D. (2018). Manipulating highly deformable materials using a visual feedback dictionary. In IEEE international conference on robotics and automation (ICRA).

  • Jia, B., Pan, Z., Hu, Z., Pan, J., & Manocha, D. (2019). Cloth manipulation using random-forest-based imitation learning. In IEEE international conference on robotics and automation (ICRA).

  • Kazanzides, P., Chen, Z., Deguet, A., Fischer, G., Taylor, R., & DiMaio, S. (2014). An Open-Source Research Kit for the da Vinci surgical system. In IEEE international conference on robotics and automation (ICRA).

  • Kingma, D.P., & Ba, J. (2015). ADAM: A method for stochastic optimization. In International conference on learning representations (ICLR).

  • Kita, Y., Ueshiba, T., Neo, E. S., & Kita, N. (2009a). A method for handling a specific part of clothing by dual arms. In IEEE/RSJ international conference on intelligent robots and systems (IROS).

  • Kita, Y., Ueshiba, T., Neo, E. S., & Kita, N. (2009b). Clothes state recognition using 3D observed data. In IEEE international conference on robotics and automation (ICRA).

  • Kocijan, J., Murray-Smith, R., Rasmussen, C., & Girard, A. (2004). Gaussian process model based predictive control. In American control conference (ACC).

  • Kumar, M., Babaeizadeh, M., Erhan, D., Finn, C., Levine, S., Dinh, L., & Kingma, D. (2020). VideoFlow: A conditional flow-based model for stochastic video generation. In International conference on learning representations (ICLR).

  • Lee, A. X., Zhang, R., Ebert, F., Abbeel, P., Finn, C., & Levine, S. (2018). Stochastic adversarial video prediction. arXiv preprint arXiv:1804.01523.

  • Lee, R., Ward, D., Cosgun, A., Dasagi, V., Corke, P., & Leitner, J. (2020). Learning arbitrary-goal fabric folding with one hour of real robot experience. In Conference on robot learning (CoRL).

  • Li, Y., Hu, X., Xu, D., Yue, Y., Grinspun, E., & Allen, P. K. (2016). Multi-sensor surface analysis for robotic ironing. In IEEE international conference on robotics and automation (ICRA).

  • Li, Y., Yue, Y., Xu, D., Grinspun, E., & Allen, P. K. (2015). Folding deformable objects using predictive simulation and trajectory optimization. In IEEE/RSJ international conference on intelligent robots and systems (IROS).

  • Lillicrap, T. P., Hunt, J. J., Pritzel, A., Heess, N., Erez, T., Tassa, Y., Silver, D., & Wierstra, D. (2016). Continuous control with deep reinforcement learning. In International conference on learning representations (ICLR).

  • Lin, X., Wang, Y., Olkin, J., & Held, D. (2020). SoftGym: Benchmarking deep reinforcement learning for deformable object manipulation. In Conference on robot learning (CoRL).

  • Lippi, M., Poklukar, P., Welle, M. C., Varava, A., Yin, H., Marino, A., & Kragic, D. (2020). Latent space roadmap for visual action planning of deformable and rigid object manipulation. In IEEE/RSJ international conference on intelligent robots and systems (IROS).

  • Maitin-Shepard, J., Cusumano-Towner, M., Lei, J., & Abbeel, P. (2010). Cloth Grasp point detection based on multiple-view geometric cues with application to robotic towel folding. In IEEE international conference on robotics and automation (ICRA).

  • Mann, H. B., & Whitney, D. (1947). On a test of whether one of two random variables is stochastically larger than the other. Annals of Mathematical Statistics, 18, 50–60.

  • Matas, J., James, S., & Davison, A. J. (2018). Sim-to-real reinforcement learning for deformable object manipulation. In Conference on robot learning (CoRL).

  • Miller, S., Berg, J. V., Fritz, M., Darrell, T., Goldberg, K., & Abbeel, P. (2012). A geometric approach to robotic laundry folding. The International Journal of Robotics Research, 31, 249–267.

  • Nagabandi, A., Kahn, G., Fearing, R., & Levine, S. (2018). Neural network dynamics for model-based deep reinforcement learning with model-free fine-tuning. In IEEE international conference on robotics and automation (ICRA).

  • Nair, A., McGrew, B., Andrychowicz, M., Zaremba, W., & Abbeel, P. (2018). Overcoming exploration in reinforcement learning with demonstrations. In IEEE international conference on robotics and automation (ICRA).

  • Nair, S., Babaeizadeh, M., Finn, C., Levine, S., & Kumar, V. (2020). Time reversal as self-supervision. In IEEE international conference on robotics and automation (ICRA).

  • Nair, S., & Finn, C. (2020). Goal-aware prediction: Learning to model what matters. In International conference on machine learning (ICML).

  • Narain, R., Samii, A., & O’Brien, J. F. (2012). Adaptive anisotropic remeshing for cloth simulation. In ACM SIGGRAPH Asia.

  • Osawa, F., Seki, H., & Kamiya, Y. (2007). Unfolding of Massive Laundry and Classification Types by Dual Manipulator. Journal of Advanced Computational Intelligence and Intelligent Informatics, 11(5), 457–463.


  • Pomerleau, D. A. (1991). Efficient training of artificial neural networks for autonomous navigation. Neural Computation, 3(1), 88–97.


  • Provot, X. (1995). Deformation constraints in a mass-spring model to describe rigid cloth behavior. In Graphics interface.

  • Radford, A., Metz, L., & Chintala, S. (2016). Unsupervised representation learning with deep convolutional generative adversarial networks. In International conference on learning representations (ICLR).

  • Rosolia, U., & Borrelli, F. (2020). Learning how to autonomously race a car: A predictive control approach. IEEE Transactions on Control Systems Technology, 28, 2713–2719.

  • Ross, S., Gordon, G. J., & Bagnell, J. A. (2011). A reduction of imitation learning and structured prediction to no-regret online learning. In International conference on artificial intelligence and statistics (AISTATS).

  • Rubinstein, R. (1999). The cross-entropy method for combinatorial and continuous optimization. Methodology and Computing in Applied Probability, 1, 127–190.


  • Sanchez, J., Corrales, J., Bouzgarrou, B., & Mezouar, Y. (2018). Robotic manipulation and sensing of deformable objects in domestic and industrial applications: a survey. The International Journal of Robotics Research, 37, 688–716.

  • Schrimpf, J., & Wetterwald, L. E. (2012). Experiments towards automated sewing with a multi-robot system. In IEEE international conference on robotics and automation (ICRA).

  • Seita, D., Florence, P., Tompson, J., Coumans, E., Sindhwani, V., Goldberg, K., & Zeng, A. (2021). Learning to rearrange deformable cables, fabrics, and bags with goal-conditioned transporter networks. In IEEE international conference on robotics and automation (ICRA).

  • Seita, D., Ganapathi, A., Hoque, R., Hwang, M., Cen, E., Tanwani, A.K., Balakrishna, A., Thananjeyan, B., Ichnowski, J., Jamali, N., Yamane, K., Iba, S., Canny, J., & Goldberg, K. (2020). Deep imitation learning of sequential fabric smoothing from an algorithmic supervisor. In IEEE/RSJ international conference on intelligent robots and systems (IROS).

  • Seita, D., Jamali, N., Laskey, M., Berenstein, R., Tanwani, A.K., Baskaran, P., Iba, S., Canny, J., & Goldberg, K. (2019). Deep transfer learning of pick points on fabric for robot bed-making. In International symposium on robotics research (ISRR).

  • Seita, D., Krishnan, S., Fox, R., McKinley, S., Canny, J., & Goldberg, K. (2018). Fast and reliable autonomous surgical debridement with cable-driven robots using a two-phase calibration procedure. In IEEE international conference on robotics and automation (ICRA).

  • Shibata, S., Yoshimi, T., Mizukawa, M., & Ando, Y. (2012). A trajectory generation of cloth object folding motion toward realization of housekeeping robot. In International conference on ubiquitous robots and ambient intelligence (URAI).

  • Shin, C., Ferguson, P. W., Pedram, S. A., Ma, J., Dutson, E. P., & Rosen, J. (2019). Autonomous tissue manipulation via surgical robot using learning based model predictive control. In IEEE international conference on robotics and automation (ICRA).

  • Sun, L., Aragon-Camarasa, G., Cockshott, P., Rogers, S., & Siebert, J. P. (2014). A heuristic-based approach for flattening wrinkled clothes. In Towards autonomous robotic systems TAROS 2013 lecture notes in computer science (Vol. 8069).

  • Sun, L., Aragon-Camarasa, G., Rogers, S., & Siebert, J. P. (2015). Accurate garment surface analysis using an active stereo robot head with application to dual-arm flattening. In IEEE international conference on robotics and automation (ICRA).

  • Thananjeyan*, B., Balakrishna*, A., Nair, S., Luo, M., Srinivasan, K., Hwang, M., Gonzalez, J. E., Ibarz, J., Finn, C., & Goldberg, K. (2020). Recovery RL: Safe reinforcement learning with learned recovery zones. In NeurIPS robot learning workshop, NeurIPS.

  • Thananjeyan, B., Balakrishna, A., Rosolia, U., Li, F., McAllister, R., Gonzalez, J. E., Levine, S., Borrelli, F., & Goldberg, K. (2020). Safety augmented value estimation from demonstrations (SAVED): Safe deep model-based RL for sparse cost robotic tasks. In IEEE robotics and automation letters (RA-L).

  • Thananjeyan, B., Garg, A., Krishnan, S., Chen, C., Miller, L., & Goldberg, K. (2017). Multilateral surgical pattern cutting in 2D orthotropic gauze with deep reinforcement learning policies for tensioning. In IEEE international conference on robotics and automation (ICRA).

  • Tobin, J., Fong, R., Ray, A., Schneider, J., Zaremba, W., & Abbeel, P. (2017). Domain randomization for transferring deep neural networks from simulation to the real world. In IEEE/RSJ international conference on intelligent robots and systems (IROS).

  • Todorov, E., Erez, T., & Tassa, Y. (2012). MuJoCo: A physics engine for model-based control. In IEEE/RSJ international conference on intelligent robots and systems (IROS).

  • Torgerson, E., & Paul, F. (1987). Vision guided robotic fabric manipulation for apparel manufacturing. In IEEE international conference on robotics and automation (ICRA).

  • Vaswani, A., Bengio, S., Brevdo, E., Chollet, F., Gomez, A. N., Gouws, S., Jones, L., Kaiser, L., Kalchbrenner, N., Parmar, N., Sepassi, R., Shazeer, N., & Uszkoreit, J. (2018). Tensor2tensor for neural machine translation. CoRR arXiv:1803.07416.

  • Vecerik, M., Hester, T., Scholz, J., Wang, F., Pietquin, O., Piot, B., Heess, N., Rothörl, T., Lampe, T., & Riedmiller, M. (2017). Leveraging demonstrations for deep reinforcement learning on robotics problems with sparse rewards. arXiv preprint arXiv:1707.08817.

  • Verlet, L. (1967). Computer experiments on classical fluids: I. Thermodynamical properties of Lennard–Jones molecules. Physical Review, 159, 98.


  • Wang, Z., Bovik, A., Sheikh, H., & Simoncelli, E.P. (2004). Image quality assessment: from error visibility to structural similarity. IEEE Transactions on Image Processing, 13, 600–612.

  • Willimon, B., Birchfield, S., & Walker, I. (2011). Model for unfolding laundry using interactive perception. In IEEE/RSJ international conference on intelligent robots and systems (IROS).

  • Wu, Y., Yan, W., Kurutach, T., Pinto, L., & Abbeel, P. (2020). Learning to manipulate deformable objects without demonstrations. In Robotics: Science and systems (RSS).

  • Xie, A., Singh, A., Levine, S., & Finn, C. (2018). Few-shot goal inference for visuomotor learning and planning. In Conference on robot learning (CoRL).

  • Yan, W., Vangipuram, A., Abbeel, P., & Pinto, L. (2020). Learning predictive representations for deformable objects using contrastive estimation. In Conference on robot learning (CoRL).

  • Yang, P. C., Sasaki, K., Suzuki, K., Kase, K., Sugano, S., & Ogata, T. (2017). Repeatable folding task by humanoid robot worker using deep learning. In IEEE robotics and automation letters (RA-L).

Download references

Acknowledgements

This research was performed at the AUTOLAB at UC Berkeley in affiliation with Honda Research Institute USA, the Berkeley AI Research (BAIR) Lab, Berkeley Deep Drive (BDD), the Real-Time Intelligent Secure Execution (RISE) Lab, and the CITRIS “People and Robots” (CPAR) Initiative, and by the Scalable Collaborative Human-Robot Learning (SCHooL) Project, NSF National Robotics Initiative Award 1734633. The authors were supported in part by Siemens, Google, Amazon Robotics, Toyota Research Institute, Autodesk, ABB, Samsung, Knapp, Loccioni, Intel, Comcast, Cisco, Hewlett-Packard, PhotoNeo, NVIDIA, and Intuitive Surgical. Daniel Seita is supported by a Graduate Fellowship for STEM Diversity and Ashwin Balakrishna is supported by an NSF GRFP. We thank Mohammad Babaeizadeh for advice on extending the SVG model to be action-conditioned, and we thank Ellen Novoseller and Lawrence Chen for extensive writing advice.

Author information


Corresponding author

Correspondence to Ryan Hoque.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This is one of several papers published in Autonomous Robots comprising the Special Issue on Robotics: Science and Systems 2020.

Appendices


We structure this appendix as follows:

  • Appendix 9 compares and contrasts various fabric simulators.

  • Appendix 10 lists hyperparameters and provides details for training policies.

  • Appendix 11 provides more details on the smoothing experiments.

Appendix 9: Fabric simulators

As in the prior paper (Hoque et al. 2020), we use the fabric simulator originally developed in Seita et al. (2020). This simulator strikes a practical balance among ease of implementation, speed, and accuracy, and led to reasonable smoothing policies in prior work. We considered using simulators from ARCSim (Narain et al. 2012), MuJoCo (Todorov et al. 2012), PyBullet (Coumans and Bai 2016–2019), Blender (Community 2018), and NVIDIA FleX (Lin et al. 2020), but did not use them for the reasons outlined below.

High-fidelity simulators such as ARCSim are too slow to generate sufficient data for training visual dynamics models. Furthermore, it is difficult to simulate even rudimentary grasping behavior in ARCSim because it does not represent the fabric as a fixed grid of vertices, which means grasping cannot be simulated by pinning vertices.

Blender includes a new fabric simulator with substantial improvements after 2017 for more realistic shearing and tensioning. These changes, however, are only supported in Blender 2.8, not Blender 2.79; we used 2.79 because Blender 2.8 does not allow background processes to run on headless servers, which prevented us from running mass data collection. Additionally, Blender does not allow dynamic re-grasping of mesh vertices during simulation, which makes long-horizon cloth manipulation and data collection difficult.

MuJoCo (Todorov et al. 2012) is a widely used physics simulator for deep reinforcement learning benchmarks. The first MuJoCo version providing full support for fabric manipulation was released in October 2018. Currently, the only work that integrates this fabric simulator with simulated robot grasps is from Wu et al. (2020), which was developed concurrently with the prior work (Hoque et al. 2020). Upon investigating the open-source code, we found that MuJoCo’s fabric simulator did not handle fabric self-collisions better than the simulator from Seita et al. (2020), and hence we did not pursue it further.

The PyBullet simulator code from Matas et al. (2018) showed relatively successful fabric simulation, but it was difficult for us to adapt the authors’ code, which makes significant changes to off-the-shelf PyBullet, to the proposed work. PyBullet’s fabric simulator was upgraded and tested on more fabric-related tasks in Seita et al. (2021), but it still suffers from self-collisions and fabric that tends to crumple.

In concurrent work, SoftGym (Lin et al. 2020) benchmarks deep reinforcement learning algorithms on deformable object manipulation tasks, including those with fabrics. SoftGym provides fabric simulation environments built on NVIDIA FleX, which models deformable objects with a particle- and position-based dynamics system similar to the mass-spring system used in the fabric simulator from Hoque et al. (2020) and Seita et al. (2020), and which also incorporates self-collision handling. Because SoftGym is concurrent work, future work will investigate the feasibility of utilizing FleX. Additionally, we will compare the performance of the model-based policies presented in this work to the model-free policies evaluated in Lin et al. (2020) on similar smoothing and folding tasks.

Appendix 10: Details of learning-based methods

We describe implementation and training details of the three learning-based methods tested: imitation learning, model-free reinforcement learning, and model-based VisuoSpatial Foresight. The other baselines tested—random, highest point, and wrinkles—are borrowed unmodified from prior open-source code (Seita et al. 2020).

10.1 Imitation learning baseline: DAgger

This section contains details and results from our prior work (Hoque et al. 2020). Our DAgger (Ross et al. 2011) implementation comes directly from the open-source DAgger code in Seita et al. (2020), which in turn builds on the OpenAI Baselines library (Dhariwal et al. 2017) for parallel-environment support to overcome the time bottleneck of fabric simulation.

We ran the corner pulling demonstrator for 2000 trajectories, resulting in 6697 image-action pairs \(({\mathbf {o}}_t, {\mathbf {a}}_t')\), where the notation \({\mathbf {a}}_t'\) indicates the action is labeled and comes from the demonstrator. Each trajectory was randomly drawn from one of the three tiers in the simulator with equal probability. We then perform a behavior cloning (Pomerleau 1991) “pre-training” period for 200 epochs over this offline data, which does not require environment interaction.

After behavior cloning, each DAgger iteration rolls out 20 parallel environments for 10 steps each (hence, 200 new samples per iteration), with the visited states labeled by the corner pulling policy, the same policy that created the offline data and uses underlying state information. These samples are inserted into a replay buffer of image-action pairs, where all actions are labeled by the demonstrator. The replay buffer size is 50,000, but the original demonstrator data of size 6697 is never removed from it. After environment stepping, we draw 240 minibatches of size 128 each for training and optimize with Adam (Kingma and Ba 2015). The process repeats with the agent rolling out its updated policy. We run DAgger for 110,000 steps across all environments (hence, 5500 steps per parallel environment) so that the number of actions consumed is roughly the same as the number of actions used to train the video prediction model. This is significantly more than the 50,000 DAgger training steps in prior work (Seita et al. 2020). Table 7 contains additional hyperparameters.
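For concreteness, the sketch below shows the general shape of this loop under the hyperparameters above. The names `envs`, `policy`, `offline_demo_data`, and `demonstrator_action` are hypothetical stand-ins for the parallel fabric simulators, the actor network, the 6697 demonstrator image-action pairs, and the state-based corner pulling supervisor; the actual implementation follows Seita et al. (2020) and OpenAI Baselines.

```python
import numpy as np

# Sketch of the DAgger loop described above (hyperparameters from Table 7).
# `envs`, `policy`, `offline_demo_data`, and `demonstrator_action` are
# hypothetical stand-ins, not names from the released code.

NUM_DEMO       = 6_697    # offline demonstrator pairs, never evicted
STEPS_PER_ITER = 10       # steps per environment per DAgger iteration
TOTAL_STEPS    = 110_000  # total environment steps across all environments
BUFFER_LIMIT   = 50_000   # replay buffer capacity
MINIBATCHES    = 240
BATCH_SIZE     = 128

replay_buffer = list(offline_demo_data)   # starts with the demonstrator data
steps_taken = 0

while steps_taken < TOTAL_STEPS:
    # 1) Roll out the current policy; label every visited state with the supervisor.
    for env in envs:                                      # 20 parallel environments
        obs = env.reset()
        for _ in range(STEPS_PER_ITER):
            labeled_action = demonstrator_action(env.state())   # supervisor label
            replay_buffer.append((obs, labeled_action))
            obs, _, done, _ = env.step(policy.act(obs))          # agent's own action
            if done:
                obs = env.reset()
            steps_taken += 1
    # Evict the oldest *agent* samples only; the demo data stays in the buffer.
    if len(replay_buffer) > BUFFER_LIMIT:
        replay_buffer = (replay_buffer[:NUM_DEMO]
                         + replay_buffer[-(BUFFER_LIMIT - NUM_DEMO):])

    # 2) Supervised updates on the aggregated dataset (Adam optimizer).
    for _ in range(MINIBATCHES):
        idx = np.random.randint(len(replay_buffer), size=BATCH_SIZE)
        obs_b, act_b = zip(*[replay_buffer[i] for i in idx])
        policy.train_step(np.stack(obs_b), np.stack(act_b))
```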

The actor (i.e., policy) neural network for DAgger uses a design based on Seita et al. (2020) and Matas et al. (2018). The input to the policy is an RGBD image of size \((56 \times 56 \times 4)\), where the four channels are formed by stacking an RGB image and a single-channel depth image. The policy processes the input through four convolutional layers, each with 32 filters of size \(3\times 3\), followed by four fully connected layers with 256 nodes each. The actor network has 0.8 million parameters.
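As an illustration, here is a minimal PyTorch sketch of an actor with this shape. The strides, padding, ReLU activations, and framework choice are our assumptions (the text specifies only the input size, filter counts, kernel sizes, and layer widths), so the parameter count differs somewhat from the 0.8 million reported.

```python
import torch
import torch.nn as nn

class FabricActor(nn.Module):
    """Sketch of the DAgger actor: 56x56x4 RGBD input -> 4D pick-and-pull action.

    The four 3x3 conv layers (32 filters each) and four 256-unit fully connected
    layers follow the description in the text; strides and padding are assumptions.
    """
    def __init__(self):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(4, 32, 3, stride=2, padding=1), nn.ReLU(),   # 56 -> 28
            nn.Conv2d(32, 32, 3, stride=2, padding=1), nn.ReLU(),  # 28 -> 14
            nn.Conv2d(32, 32, 3, stride=2, padding=1), nn.ReLU(),  # 14 -> 7
            nn.Conv2d(32, 32, 3, stride=1, padding=1), nn.ReLU(),  # 7  -> 7
        )
        self.fc = nn.Sequential(
            nn.Linear(7 * 7 * 32, 256), nn.ReLU(),
            nn.Linear(256, 256), nn.ReLU(),
            nn.Linear(256, 256), nn.ReLU(),
            nn.Linear(256, 256), nn.ReLU(),
            nn.Linear(256, 4),
        )

    def forward(self, rgbd):               # rgbd: (B, 4, 56, 56)
        z = self.conv(rgbd).flatten(1)
        return torch.tanh(self.fc(z))      # each action component lies in [-1, 1]

# Example: one forward pass on a random batch.
actor = FabricActor()
print(actor(torch.rand(8, 4, 56, 56)).shape)  # torch.Size([8, 4])
```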

Table 7 DAgger hyperparameters
Fig. 13 Average coverage over 50 simulated test-time episodes at checkpoints (marked “X”) during the behavior cloning and DAgger phases. For each setting (no action truncation and action truncation), a single DAgger policy is deployed on all tiers. Dashed lines annotate the average starting coverage and the corner pulling demonstrator’s average final coverage

The output of the actor policy is a 4D vector representing the action choice \({\mathbf {a}}_t \in {\mathbb {R}}^4\) at each time step t. The last layer is a hyperbolic tangent, which keeps each of the four components of \({\mathbf {a}}_t\) within \([-1,1]\). During action truncation, we further limit the two components of \({\mathbf {a}}_t\) corresponding to the deltas to \([-0.4, 0.4]\).
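A minimal sketch of this truncation step, assuming the action is ordered as (pick x, pick y, \(\varDelta x\), \(\varDelta y\)) (the ordering is our assumption):

```python
import numpy as np

def truncate_action(action):
    """Clip the delta components of a tanh-squashed action.

    Assumes the 4D action is ordered as (pick x, pick y, delta x, delta y);
    the pick point stays in [-1, 1] while the pull deltas are limited to
    [-0.4, 0.4], matching the action-truncation variant described above.
    """
    action = np.asarray(action, dtype=np.float32).copy()
    action[2:] = np.clip(action[2:], -0.4, 0.4)
    return action

print(truncate_action([0.9, -0.3, 0.7, -0.55]))  # -> [ 0.9  -0.3   0.4  -0.4 ]
```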

Learning progress for DAgger is shown in Fig. 13: for each marked snapshot, we roll out the policy in the environment for 50 episodes and measure final coverage. Results suggest that the single DAgger policy, when trained with 110,000 total steps on RGBD images, performs well on all three tiers, with performance nearly matching the 95–96% coverage of the demonstrator.

We trained two variants of DAgger, one with and one without the action truncation to \([-0.4, 0.4]\) for the two deltas \(\varDelta x\) and \(\varDelta y\). The model trained on truncated actions outperforms the alternative setting, and truncation is also the setting used in VSF-1.0; hence, we use it for the physical robot experiments. We choose the final snapshot, as it has the highest test-time performance, and use it as the policy for the simulated and real benchmarks in the main part of the paper.

10.2 Model-free reinforcement learning baseline: DDPG

This section contains details and results from our prior work (Hoque et al. 2020). To provide a second competitive baseline, we apply model-free reinforcement learning. Specifically, we use a variant of Deep Deterministic Policy Gradients (DDPG) (Lillicrap et al. 2016) with several improvements proposed in the research literature. Briefly, DDPG is a deep reinforcement learning algorithm that trains parameterized actor and critic models, each of which is normally a neural network. The actor is the policy, and the critic is a value function.

Table 8 DDPG hyperparameters

First, as with DAgger, we use demonstrations (Vecerik et al. 2017) to improve the performance of the learned policy. We use the same demonstrator data of 6697 samples from DAgger, except that each sample is now a tuple \(({\mathbf {o}}_t, {\mathbf {a}}_t', r_t, {\mathbf {o}}_{t+1})\), including a scalar reward \(r_t\) (described below) and a successor state \({\mathbf {o}}_{t+1}\). This data is added to the replay buffer and never removed. We use a pre-training phase (of 200 epochs) to initialize the actor and critic. We also apply \(L_2\) regularization to both the actor and critic networks. In addition, we use the Q-filter from Nair et al. (2018), which may help the actor learn better actions than the demonstrator provides, for example in cases where naive corner pulling is not ideal. For a fairer comparison, the actor network for DDPG uses the same architecture as the actor for DAgger. The critic has a similar architecture to the actor, with the only change being that the action input \({\mathbf {a}}_t\) is concatenated with the features of the image \({\mathbf {o}}_t\) after the four convolutional layers and before the fully connected portion. As with the imitation learning baseline, the inputs are RGBD images of size \((56\times 56 \times 4)\). Further hyperparameters are provided in Table 8.
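To make the Q-filter concrete, the sketch below shows one way to form an actor loss with a filtered behavior-cloning term, in the spirit of Nair et al. (2018). The names `critic` and `actor` are placeholders (the critic is assumed to return a 1-D tensor of Q-values), and the weighting of the two terms is an assumption rather than a value from the paper.

```python
import torch

def actor_loss_with_q_filter(critic, actor, obs, demo_actions, is_demo,
                             bc_weight=1.0):
    """Sketch of a DDPG actor loss with a Q-filtered behavior-cloning term.

    `critic(obs, act)` returns a 1-D tensor of Q-values and `actor(obs)` returns
    actions; both are placeholder callables. `is_demo` marks which samples come
    from the demonstrator. The cloning term applies only where the critic scores
    the demonstrator's action higher than the actor's own action (the Q-filter).
    """
    pi_actions = actor(obs)
    ddpg_loss = -critic(obs, pi_actions).mean()       # standard DDPG actor objective

    with torch.no_grad():
        keep = is_demo & (critic(obs, demo_actions) > critic(obs, pi_actions))

    bc_loss = (keep.float() * ((pi_actions - demo_actions) ** 2).sum(dim=1)).mean()
    return ddpg_loss + bc_weight * bc_loss

# Toy check with stand-in networks (loss computation only):
B = 16
obs = torch.rand(B, 10)
demo_actions = torch.rand(B, 4) * 2 - 1
is_demo = torch.zeros(B, dtype=torch.bool)
is_demo[:8] = True
critic = lambda o, a: a.sum(dim=1)        # placeholder Q-function
actor = lambda o: torch.tanh(o[:, :4])    # placeholder policy
print(actor_loss_with_q_filter(critic, actor, obs, demo_actions, is_demo))
```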

Fig. 14 Average coverage over 50 simulated test-time episodes at checkpoints (marked “X”) during the pre-training DDPG phase and the DDPG phase with agent exploration, presented in the same manner as Fig. 13 for DAgger. Results suggest that DDPG has difficulty training a policy that achieves high coverage

We design a dense reward to encourage the agent to achieve high coverage. At each time step, the agent receives a reward composed of the following terms (a sketch of the resulting reward function follows the list):

  • A small negative living reward of −0.05.

  • A small negative reward of −0.05 for failing to grasp any point on the fabric (i.e., a wasted grasp attempt).

  • A delta in coverage, i.e., the change in coverage between the prior state and the current state.

  • A +5 bonus for triggering 92% coverage.

  • A −5 penalty for triggering an out-of-bounds condition, where the fabric significantly exceeds the boundaries of the underlying fabric plane.
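A minimal sketch of such a reward, assuming coverage is expressed as a fraction in [0, 1] (the exact scaling of the delta term is our assumption):

```python
def smoothing_reward(prev_coverage, coverage, grasped_fabric, out_of_bounds):
    """Sketch of the dense smoothing reward described in the list above.

    Coverage values are fractions in [0, 1]; the scaling of the delta term is an
    assumption, since the text only states that a change in coverage is used.
    """
    reward = -0.05                          # living penalty
    if not grasped_fabric:
        reward -= 0.05                      # wasted grasp attempt
    reward += coverage - prev_coverage      # delta in coverage
    if coverage >= 0.92:
        reward += 5.0                       # high-coverage bonus
    if out_of_bounds:
        reward -= 5.0                       # fabric far outside the plane
    return reward

# Example: an action that grasps the fabric but slightly reduces coverage.
print(smoothing_reward(0.75, 0.73, grasped_fabric=True, out_of_bounds=False))
```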

We designed the reward function by informal tuning and by borrowing ideas from the reward in Andrychowicz et al. (2020), which used a delta in joint angles and a similar bonus for moving a block towards a target, or a penalty for dropping it. Intuitively, an agent may learn to take a slightly counter-productive action which decreases coverage (making the delta reward component negative) but which enables an easier subsequent action that triggers the high bonus. This reward design is only suited for smoothing. As with the imitation learning baseline, the model-free DDPG baseline is not designed for non-smoothing tasks.

Figure 14 suggests that the pre-training phase, where the actor and critic are trained on the demonstrator data, helps increase coverage. The DDPG portion of training, however, results in a performance collapse, with the policy achieving no net coverage. Upon further inspection, this is because the actions collapsed to having no “deltas,” so the robot reduces to picking up and then immediately releasing the fabric. Due to the weak performance of DDPG, we do not benchmark the policy on the physical robot.

10.3 VisuoSpatial Foresight

Table 9 Visual MPC hyperparameters for CEM and CMA-ES

The main technique considered in this paper and our prior work (Hoque et al. 2020) is VisuoSpatial Foresight (VSF), an extension of Visual Foresight (Ebert et al. 2018). It consists of a training phase followed by a planning phase. An overview of VisuoSpatial Foresight is provided in Sect. 4, and practical implementation details are in Sect. 5. For the planning phase described in Sect. 4.2, we tuned the hyperparameters in Table 9. The CEM variance reported is the diagonal covariance used for folding and double folding. We found that for smoothing, a lower CEM variance (0.25, 0.25, 0.04, 0.04) results in better performance, though it may encourage the policy towards taking shorter actions. For CMA-ES, we use the open source Python implementation PyCMA (https://pypi.org/project/cma/), changing only the number of iterations, initial mean, and initial variance from default parameters. CMA-ES and CEM take a similar amount of computation time.
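For intuition, the sketch below shows the general shape of a CEM planner for visual MPC. The functions `dynamics_model` and `cost_fn` are hypothetical stand-ins for the learned visuospatial dynamics model and the Pixel/Vertex L2 cost, and the default hyperparameters are illustrative rather than the tuned values in Table 9.

```python
import numpy as np

def cem_plan(dynamics_model, cost_fn, current_obs, goal_obs,
             horizon=5, iterations=10, population=2000, elite_frac=0.1,
             init_var=(0.5, 0.5, 0.3, 0.3)):
    """Minimal sketch of a CEM planner for visual MPC.

    `dynamics_model(current_obs, action_seq)` (predicted observation) and
    `cost_fn(predicted_obs, goal_obs)` are hypothetical stand-ins for the learned
    visuospatial dynamics model and the Pixel/Vertex L2 cost. The defaults here
    are illustrative, not the tuned values in Table 9.
    """
    act_dim = 4
    mean = np.zeros((horizon, act_dim))
    var = np.tile(np.asarray(init_var), (horizon, 1))
    n_elite = max(1, int(elite_frac * population))

    for _ in range(iterations):
        # Sample candidate action sequences and clip to the action bounds.
        samples = np.random.normal(mean, np.sqrt(var),
                                   size=(population, horizon, act_dim))
        samples = np.clip(samples, -1.0, 1.0)
        # Score each sequence by the cost of its predicted observation vs. the goal.
        costs = np.array([cost_fn(dynamics_model(current_obs, seq), goal_obs)
                          for seq in samples])
        # Refit the sampling distribution to the lowest-cost (elite) sequences.
        elites = samples[np.argsort(costs)[:n_elite]]
        mean, var = elites.mean(axis=0), elites.var(axis=0) + 1e-6

    return mean[0]   # execute only the first action, then re-plan (MPC)
```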

As described in Sect. 4.3, we evaluate with a Pixel L2 cost function and a learned Vertex L2 cost function. For the Pixel L2 cost function (Eq. 3), we remove the 7 pixels on each side of the image to get rid of the impact of the dark border, using only the inner \(42\times 42\) region of the \(56\times 56\) image. For the Vertex L2 cost function, as described in Sect. 4.3.2, we generate a second dataset from the primary dataset collected. Each of the 9,932 episodes in Fabric-CornerBias can contribute up to \(\binom{10}{2}\) image pairs to use in the second dataset, but we sample only 10 of these possible pairs from each episode to keep the dataset size modest (and the same size as the primary dataset). Specifically, we use the following 10 pairs, chosen for their variable gaps in temporal distance: {\(({\mathbf {o}}_1, {\mathbf {o}}_2)\), \(({\mathbf {o}}_1, {\mathbf {o}}_3)\), \(({\mathbf {o}}_1, {\mathbf {o}}_5)\), \(({\mathbf {o}}_1, {\mathbf {o}}_9)\), \(({\mathbf {o}}_6, {\mathbf {o}}_8)\), \(({\mathbf {o}}_6, {\mathbf {o}}_{10})\), \(({\mathbf {o}}_6, {\mathbf {o}}_7)\), \(({\mathbf {o}}_3, {\mathbf {o}}_4)\), \(({\mathbf {o}}_3, {\mathbf {o}}_7)\), \(({\mathbf {o}}_3, {\mathbf {o}}_9) \)}. During training, we flip the order of half of the data points to encourage the network to ignore the direction of time in its estimation of mesh distance. As mentioned in Sect. 4.3.2, we annotate all data points with the sum of the squared distances between corresponding points in the ground truth mesh states, i.e.

$$\begin{aligned} \sum _{i=0}^{625} ||p_1^{(i)} - p_2^{(i)} ||_2^2 \end{aligned}$$

where \(p_1^{(i)}\) is the (x, y, z) coordinates of the ith point of the mesh shown in the first image and \(p_2^{(i)}\) is the (x, y, z) coordinates of the ith point of the mesh in the second image. We divide all labels by the maximum value for more stable training. Finally, for the network architecture, we use the same CNN as the DAgger actor described in Sect. 10.1. However, to accommodate the second image input, we pass both images through the same convolutional layers and concatenate the outputs into a 5184-dimensional vector before applying the fully connected layers. The resulting network has about 1.5 million parameters.
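A small sketch of how the ground-truth label for one image pair could be computed, assuming a 625-vertex fabric mesh as the summation bound above suggests:

```python
import numpy as np

def vertex_l2_label(mesh_1, mesh_2):
    """Ground-truth label for one image pair: the summed squared distances
    between corresponding mesh vertices, as in the equation above.

    mesh_1, mesh_2: arrays of shape (625, 3) holding the (x, y, z) coordinates
    of the fabric mesh vertices in the two simulator states (the 625-vertex
    mesh size is inferred from the summation bound).
    """
    return float(np.sum(np.square(np.asarray(mesh_1) - np.asarray(mesh_2))))

# Labels are divided by the dataset maximum before training, e.g.:
# labels = np.array([vertex_l2_label(a, b) for a, b in mesh_pairs])
# labels /= labels.max()
```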

Appendix 11: Supplementary smoothing results

11.1 Statistical significance tests

We run the Mann–Whitney U test (Mann and Whitney 1947) on the coverage and number-of-actions results reported in Table 3, comparing VSF-1.0 against all baselines other than imitation learning, since we aim for VSF-1.0 to perform comparably to the latter. See Table 10 for the computed p values. We conclude that we can confidently reject the null hypothesis that the values are drawn from the same distribution (\(p < 0.02\)) for all metrics except Tier 2 coverage for Wrinkle and the Tier 1 and Tier 3 number of actions for DDPG. Note that Tier 3 results are the most informative, as Tier 3 is the most difficult tier.
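For reference, such a test can be computed with SciPy; the coverage values below are made-up placeholders, not results from Table 3, and the two-sided alternative is our assumption:

```python
from scipy.stats import mannwhitneyu

# Illustrative computation of one Table 10 entry: final coverage of VSF-1.0
# versus a baseline over matched sets of test episodes (placeholder values).
vsf_coverage      = [0.94, 0.91, 0.97, 0.89, 0.95, 0.92]
baseline_coverage = [0.71, 0.78, 0.66, 0.80, 0.74, 0.69]

stat, p_value = mannwhitneyu(vsf_coverage, baseline_coverage,
                             alternative='two-sided')
print(p_value)   # reject the null hypothesis when p < 0.02, as in the text
```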

Table 10 Mann–Whitney test p values for coverage and number of actions of VSF-1.0 compared with Random, Highest, Wrinkle and DDPG baselines across all tiers of difficulty for smoothing

11.2 Domain randomization ablation

For Fabric-Random, we run 50 simulated smoothing episodes per tier with a policy trained without domain randomization and compare with the 200 episodes from Table 3. In the episodes without domain randomization, we keep fabric color, camera angle, background plane shading, and brightness constant at training and testing time. In the episodes with domain randomization, we randomize these parameters in the training data and test in the same setting as the experiments without domain randomization, which can be interpreted as one random initialization of the domain-randomized parameters. In particular, we vary the following (a sampling sketch follows the list):

  • Fabric color RGB values uniformly between (0, 0, 128) and (115, 179, 255), centered around blue.

  • Background plane color RGB values uniformly between (102, 102, 102) and (153, 153, 153).

  • RGB gamma correction with gamma uniformly between 0.7 and 1.3.

  • A fixed amount, between 40 and 50, to subtract from the depth image to simulate changing the height of the depth camera.

  • Camera position (x, y, z) as \((0.5+\delta _1, 0.5+\delta _2, 1.45+\delta _3)\) meters, where each \(\delta _i\) is sampled from \({\mathcal {N}}(0, 0.04)\).

  • Camera rotation with Euler angles sampled from \({\mathcal {N}}(0, 90^{\circ })\).

  • Random noise at each pixel uniformly between −15 and 15.
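A minimal sketch of sampling one configuration of these parameters; the dictionary keys, the 56x56 image size for the per-pixel noise, and treating the camera-rotation value as a standard deviation in degrees are our assumptions:

```python
import numpy as np

def sample_domain_randomization(rng=None):
    """Sketch of sampling one configuration of the randomized parameters above."""
    rng = rng or np.random.default_rng()
    return {
        "fabric_rgb":   rng.uniform([0, 0, 128], [115, 179, 255]),   # around blue
        "plane_rgb":    rng.uniform(102, 153, size=3),                # gray background
        "gamma":        rng.uniform(0.7, 1.3),                        # RGB gamma correction
        "depth_offset": rng.uniform(40, 50),                          # subtracted from depth image
        "camera_xyz":   np.array([0.5, 0.5, 1.45]) + rng.normal(0.0, 0.04, size=3),
        "camera_euler": rng.normal(0.0, 90.0, size=3),                # degrees
        "pixel_noise":  rng.uniform(-15, 15, size=(56, 56)),          # added per pixel
    }

print(sample_domain_randomization()["fabric_rgb"])
```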

From the results in Table 11, we find that final coverage values are similar whether or not we use domain randomization on training data, suggesting our domain randomization techniques do not have an adverse effect on performance in simulation.

To analyze the robustness of the policy to variation in the randomized parameters, we also evaluate the two policies above (trained with and without domain randomization) with randomization in the test environment on Tier 3 starting states. Specifically, we change the color of the fabric in fixed increments from its non-randomized setting (RGB (25, 89, 217)) until performance starts to deteriorate. In Table 12, we observe that the domain-randomized policy maintains high coverage within the training range (RGB (0, 0, 128) to (115, 179, 255)), while the policy without domain randomization suffers as soon as the fabric color is slightly altered.

Table 11 Coverage and number of actions for simulated smoothing episodes from Fabric-Random, with and without domain randomization on training data, where the domain randomized results are from Table 3
Table 12 Coverage and number of actions for Tier 3 simulated smoothing episodes with and without domain randomization on Fabric-Random training data, where we vary fabric color in fixed increments. (26, 89, 217) is the default blue color and (128, 191, 115) is slightly outside the domain randomization range


Cite this article

Hoque, R., Seita, D., Balakrishna, A. et al. VisuoSpatial Foresight for physical sequential fabric manipulation. Auton Robot 46, 175–199 (2022). https://doi.org/10.1007/s10514-021-10001-0
