
TidyBot: personalized robot assistance with large language models


Abstract

For a robot to personalize physical assistance effectively, it must learn user preferences that can be generally reapplied to future scenarios. In this work, we investigate personalization of household cleanup with robots that can tidy up rooms by picking up objects and putting them away. A key challenge is determining the proper place to put each object, as people’s preferences can vary greatly depending on personal taste or cultural background. For instance, one person may prefer storing shirts in the drawer, while another may prefer them on the shelf. We aim to build systems that can learn such preferences from just a handful of examples via prior interactions with a particular person. We show that robots can combine language-based planning and perception with the few-shot summarization capabilities of large language models to infer generalized user preferences that are broadly applicable to future interactions. This approach enables fast adaptation and achieves 91.2% accuracy on unseen objects in our benchmark dataset. We also demonstrate our approach on a real-world mobile manipulator called TidyBot, which successfully puts away 85.0% of objects in real-world test scenarios.
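As a rough illustration of the summarize-then-apply idea described above, here is a minimal sketch assuming an OpenAI-style chat-completion client; the model name, prompt wording, and example placements are hypothetical stand-ins, not the system's actual implementation.

```python
# Minimal sketch of few-shot preference summarization (illustrative only).
# Assumes the `openai` Python SDK (>= 1.0) and an OPENAI_API_KEY in the
# environment; the model name and all example data are hypothetical.
from openai import OpenAI

client = OpenAI()

def complete(prompt: str) -> str:
    """Query the LLM and return its text completion."""
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # stand-in; not the model used in the paper
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return resp.choices[0].message.content.strip()

# A handful of placements observed from one particular user.
placements = [
    ("yellow shirt", "drawer"),
    ("dark purple shirt", "drawer"),
    ("white socks", "drawer"),
    ("black wool coat", "closet"),
]
receptacles = ["drawer", "closet"]

# Step 1: ask the LLM to summarize the examples into a generalized rule.
summary_prompt = "\n".join(
    [f'pick_and_place("{obj}", "{rec}")' for obj, rec in placements]
    + ["Summarize the placement rule in one sentence:"]
)
rule = complete(summary_prompt)

# Step 2: apply the summarized rule to an unseen object.
query = f'{rule}\nreceptacles = {receptacles}\nobject: "red sweater"\nreceptacle:'
print(complete(query))
```

Because the inferred preference is plain text, it can be stored and reapplied in later episodes, which is what enables the fast adaptation described above.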



Acknowledgements

The authors would like to thank William Chong, Kevin Lin, and Jingyun Yang for fruitful technical discussions, and Bob Holmberg for mentorship and support in building up the mobile platforms.

Funding

This work was supported in part by the Princeton School of Engineering, Toyota Research Institute, and the National Science Foundation under CCF-2030859, DGE-1656466, and IIS-2132519.

Author information


Contributions

JW, RA, AK, ML, and AZ contributed to system implementation, experiments, or analysis. RA, AZ, SS, JB, SR, and TF supervised the project. All authors contributed to the manuscript.

Corresponding author

Correspondence to Jimmy Wu.

Ethics declarations

Conflict of interest

The authors have no competing interests to declare that are relevant to the content of this article.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendix A: LLM prompts

This section contains the full prompts used for all LLM text completion tasks. Each prompt consists of 1–3 in-context examples followed by a test example that we ask the LLM to complete; the completion generated by the LLM forms the final portion of the test example. We use the same in-context examples across all scenarios in both the benchmark and the real-world system. For each scenario, only the final test example is modified.
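To make this structure concrete, the following is a hypothetical sketch of how one such prompt might be assembled; the object names, receptacle names, and the pick_and_place notation are illustrative assumptions, not text copied from the original prompt figures.

```python
# Hypothetical reconstruction of the prompt skeleton described above:
# one or more in-context examples, then a test example whose final
# line the LLM completes. All strings are illustrative, not the
# article's actual prompts.

IN_CONTEXT_EXAMPLE = """\
objects = ["dried figs", "protein bar", "cornmeal", "macadamia nuts"]
receptacles = ["top rack", "middle rack"]
pick_and_place("dried figs", "top rack")
pick_and_place("protein bar", "top rack")
pick_and_place("cornmeal", "middle rack")
pick_and_place("macadamia nuts", "middle rack")
Summary: Put snacks on the top rack and dry goods on the middle rack."""

# Only this test example changes between scenarios; the LLM
# generates the text that follows "Summary:".
TEST_EXAMPLE = """\
objects = ["yellow shirt", "black wool coat", "white socks"]
receptacles = ["drawer", "closet"]
pick_and_place("yellow shirt", "drawer")
pick_and_place("black wool coat", "closet")
pick_and_place("white socks", "drawer")
Summary:"""

prompt = IN_CONTEXT_EXAMPLE + "\n\n" + TEST_EXAMPLE
print(prompt)
```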

A.1 Summarization for receptacle selection

[Prompt figures omitted]

A.2 Receptacle selection

[Prompt figure omitted]

A.3 Summarization for primitive selection

[Prompt figure omitted]

A.4 Primitive selection

[Prompt figures omitted]

A.5 Category extraction for real-world system

[Prompt figure omitted]

A.6 Receptacle selection for real-world system

[Prompt figure omitted]

A.7 Primitive selection for real-world system

[Prompt figure omitted]

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Wu, J., Antonova, R., Kan, A. et al. TidyBot: personalized robot assistance with large language models. Auton Robot 47, 1087–1102 (2023). https://doi.org/10.1007/s10514-023-10139-z
