Published March 7, 2023 | Version 0.2
Dataset | Open Access

Lost at C: Data from the Security-focused User Study

Description

2022 Study on the security implications of Large Language Model Code Assistants

This repository contains the results of the 2022 study described in the paper `Lost at C: A User Study on the Security Implications of Large Language Model Code Assistants` (https://arxiv.org/pdf/2208.09727.pdf).

Here, the overall goal is to determine whether users with access to code suggestions from a Large Language Model (OpenAI code-cushman-001) in a GitHub Copilot-like arrangement produce code with a higher incidence rate of security-related bugs than those without such access. In particular, we concern ourselves with low-level, memory-related bugs such as those often present in buggy C code.

To answer this question, a user study was conducted (N=58) in which participants implemented a shopping list in C as a singly-linked list. Half of the participants had access to a custom Copilot-like extension that generated suggestions using code-cushman-001, and half had no access to coding hints other than those provided by Visual Studio Code's default IntelliSense.
The study was performed in a controlled environment (a virtualized, cloud-based desktop).

The task was made deliberately more difficult than usual via the specification: participants had to implement the shopping list according to an unusual API containing a number of pitfalls. They had to write only the implementation of that specification (i.e., the `list.c` file); participants were provided with `list.h` as well as a suite of automated (if basic) tests.
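For a sense of the kind of code involved, below is a minimal, self-contained sketch of a singly-linked shopping list in C. It is illustrative only: the `node` struct and the function names (`list_add`, `list_free`) are hypothetical and do not reproduce the study's actual `list.h` API, but they show the sort of manual memory management (paired allocation and freeing of nodes and the strings they own) that the task exercised.

```c
/* Illustrative sketch only -- the study's actual list.h API differs. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

typedef struct node {
    char *item;           /* heap-allocated item name, owned by the node */
    struct node *next;
} node;

/* Prepend a copy of `name` to the list; returns the new head, or NULL on
 * allocation failure. strdup is POSIX (standardized in C23). */
node *list_add(node *head, const char *name) {
    node *n = malloc(sizeof *n);
    if (!n) return NULL;
    n->item = strdup(name);
    if (!n->item) { free(n); return NULL; }  /* avoid leaking the node */
    n->next = head;
    return n;
}

/* Free every node and the string each node owns. */
void list_free(node *head) {
    while (head) {
        node *next = head->next;
        free(head->item);
        free(head);
        head = next;
    }
}

int main(void) {
    node *list = NULL;
    list = list_add(list, "milk");
    list = list_add(list, "eggs");
    for (node *p = list; p; p = p->next)
        printf("%s\n", p->item);
    list_free(list);
    return 0;
}
```

Even in this simple form, the pairing of `malloc`/`strdup` with matching `free` calls is where memory-related bugs (leaks, double frees, use-after-free) tend to creep in, which is the class of bug the study measured.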

For more details, see the associated paper.

The repository contains the user study data as well as the scripts used for analysis and for reproducing the results.

Files

llm-user-study-for-security-data-full.zip
Size: 4.6 MB
md5: b7d13622ca1c83ec4401c68e4cb7b7a9