DOI: 10.1145/3641399.3641419 · ISEC Conference Proceedings · Short paper

How much SPACE do metrics have in GenAI assisted software development?

Published: 22 February 2024

ABSTRACT

Large Language Models (LLMs) are changing the way developers create software, with natural language prompts replacing hand-written code as the primary driver. While many initial assessments of such LLMs suggest that they improve developer productivity, other research studies have pointed out areas in the Software Development Life Cycle (SDLC) and the developer experience where such tools fail miserably. Many studies are now dedicated to evaluating LLM-based AI-assisted software tools, but the lack of standardization across studies and metrics may prove to be a hindrance to the adoption of metrics and to reproducible studies. The primary objective of this survey is to assess recent user studies and surveys aimed at evaluating different aspects of the developer's experience of using code-based LLMs, and to highlight any existing gaps among them. We have leveraged the SPACE framework to enumerate and categorise metrics from studies that conduct some form of controlled user experiment. In a Generative AI-assisted SDLC, the developer experience should encompass the ability to perform the task at hand efficiently and effectively, with minimal friction, using these LLM tools. Our exploration has led to some critical insights: a complete absence of user studies on the collaborative aspects of teams, a bias towards certain LLM models and metrics, and a lack of diversity in metrics within the productivity dimensions. We also propose recommendations to the research community that will help bring some conformity to the evaluation of such LLMs.
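
To make the categorisation exercise concrete: the survey maps each metric reported in a controlled user study onto one of the five SPACE dimensions (Satisfaction and well-being, Performance, Activity, Communication and collaboration, Efficiency and flow) and then examines how the metrics are distributed across those dimensions. The sketch below is not from the paper; the study and metric names are illustrative assumptions. It shows one way such a tally could be computed.

```python
# A minimal sketch (not the survey's actual method or data) of categorising
# metrics from user studies of code LLMs under the five SPACE dimensions.
from collections import defaultdict

SPACE_DIMENSIONS = {
    "S": "Satisfaction and well-being",
    "P": "Performance",
    "A": "Activity",
    "C": "Communication and collaboration",
    "E": "Efficiency and flow",
}

# Hypothetical (study, metric, dimension) records; placeholders only.
observations = [
    ("study-1", "acceptance rate of suggestions", "A"),
    ("study-1", "task completion time", "E"),
    ("study-2", "perceived usefulness (survey)", "S"),
    ("study-2", "correctness of completed task", "P"),
]

def categorise(records):
    """Group (study, metric) pairs under each SPACE dimension code."""
    by_dimension = defaultdict(list)
    for study, metric, dim in records:
        by_dimension[dim].append((study, metric))
    return by_dimension

grouped = categorise(observations)
for code, name in SPACE_DIMENSIONS.items():
    print(f"{name}: {len(grouped[code])} metric(s)")
# "Communication and collaboration: 0 metric(s)" surfaces exactly the kind
# of coverage gap the survey reports for the collaborative aspects of teams.
```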

References

1. Naser Al Madi. 2022. How readable is model-generated code? Examining readability and visual inspection of GitHub Copilot. In Proceedings of the 37th IEEE/ACM International Conference on Automated Software Engineering. 1–5.
2. Rohan Anil, Andrew M. Dai, Orhan Firat, Melvin Johnson, Dmitry Lepikhin, Alexandre Passos, Siamak Shakeri, Emanuel Taropa, Paige Bailey, Zhifeng Chen, et al. 2023. PaLM 2 technical report. arXiv preprint arXiv:2305.10403 (2023).
3. Shraddha Barke, Michael B. James, and Nadia Polikarpova. 2022. Grounded Copilot: How programmers interact with code-generating models. CoRR arXiv 2206 (2022).
4. Christian Bird, Denae Ford, Thomas Zimmermann, Nicole Forsgren, Eirini Kalliamvakou, Travis Lowdermilk, and Idan Gazit. 2023. Taking flight with Copilot. Commun. ACM 66, 6 (2023), 56–62.
5. Victor Dibia, Adam Fourney, Gagan Bansal, Forough Poursabzi-Sangdeh, Han Liu, and Saleema Amershi. 2022. Aligning offline metrics and human judgments of value of AI-pair programmers. arXiv preprint arXiv:2210.16494 (2022).
6. Mikhail Evtikhiev, Egor Bogomolov, Yaroslav Sokolov, and Timofey Bryksin. 2023. Out of the BLEU: How should we assess quality of the code generation models? Journal of Systems and Software 203 (2023), 111741.
7. Nicole Forsgren, Margaret-Anne Storey, Chandra Maddila, Thomas Zimmermann, Brian Houck, and Jenna Butler. 2021. The SPACE of developer productivity: There's more to it than you think. Queue 19, 1 (2021), 20–48.
8. Saki Imai. 2022. Is GitHub Copilot a substitute for human pair-programming? An empirical study. In Proceedings of the ACM/IEEE 44th International Conference on Software Engineering: Companion Proceedings. 319–321.
9. Ellen Jiang, Edwin Toh, Alejandra Molina, Kristen Olson, Claire Kayacik, Aaron Donsbach, Carrie J. Cai, and Michael Terry. 2022. Discovering the syntax and strategies of natural language programming with generative language models. In Proceedings of the 2022 CHI Conference on Human Factors in Computing Systems. 1–19.
10. Raymond Li, Loubna Ben Allal, Yangtian Zi, Niklas Muennighoff, Denis Kocetkov, Chenghao Mou, Marc Marone, Christopher Akiki, Jia Li, Jenny Chim, et al. 2023. StarCoder: May the source be with you! arXiv preprint arXiv:2305.06161 (2023).
11. OpenAI. 2023. GPT-4 technical report. arXiv:2303.08774 [cs.CL].
12. Baptiste Rozière, Jonas Gehring, Fabian Gloeckle, Sten Sootla, Itai Gat, Xiaoqing Ellen Tan, Yossi Adi, Jingyu Liu, Tal Remez, Jérémy Rapin, et al. 2023. Code Llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023).
13. Gustavo Sandoval, Hammond Pearce, Teo Nys, Ramesh Karri, Brendan Dolan-Gavitt, and Siddharth Garg. 2022. Security implications of large language model code assistants: A user study. arXiv preprint arXiv:2208.09727 (2022).
14. Jiao Sun, Q. Vera Liao, Michael Muller, Mayank Agarwal, Stephanie Houde, Kartik Talamadupula, and Justin D. Weisz. 2022. Investigating explainability of generative AI for code through scenario-based design. In 27th International Conference on Intelligent User Interfaces. 212–228.
15. Priyan Vaithilingam, Tianyi Zhang, and Elena L. Glassman. 2022. Expectation vs. experience: Evaluating the usability of code generation tools powered by large language models. In CHI Conference on Human Factors in Computing Systems Extended Abstracts. 1–7.
16. Helena Vasconcelos, Gagan Bansal, Adam Fourney, Q. Vera Liao, and Jennifer Wortman Vaughan. 2022. Generation probabilities are not enough: Improving error highlighting for AI code suggestions. In Human-Centered AI Workshop at NeurIPS (HCAI@NeurIPS '22). Virtual Event, USA. 1–4.
17. Frank F. Xu, Bogdan Vasilescu, and Graham Neubig. 2022. In-IDE code generation from natural language: Promise and challenges. ACM Transactions on Software Engineering and Methodology (TOSEM) 31, 2 (2022), 1–47.
18. Daoguang Zan, Bei Chen, Fengji Zhang, Dianjie Lu, Bingchao Wu, Bei Guan, Wang Yongji, and Jian-Guang Lou. 2023. Large language models meet NL2Code: A survey. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Association for Computational Linguistics, Toronto, Canada, 7443–7464. https://doi.org/10.18653/v1/2023.acl-long.411
19. Albert Ziegler, Eirini Kalliamvakou, X. Alice Li, Andrew Rice, Devon Rifkin, Shawn Simister, Ganesh Sittampalam, and Edward Aftandilian. 2022. Productivity assessment of neural code completion. In Proceedings of the 6th ACM SIGPLAN International Symposium on Machine Programming. 21–29.

Published in

ISEC '24: Proceedings of the 17th Innovations in Software Engineering Conference
February 2024, 144 pages
ISBN: 9798400717673
DOI: 10.1145/3641399

      Copyright © 2024 ACM

      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      • Published: 22 February 2024


      Qualifiers

      • short-paper
      • Research
      • Refereed limited

      Acceptance Rates

Overall acceptance rate: 76 of 315 submissions, 24%
