Authors:
Ole Werger; Stefan Hanenberg; Ole Meyer; Nils Schwenzfeier and Volker Gruhn
Affiliation:
University of Duisburg–Essen, Essen, Germany
Keyword(s):
Large Language Model, ChatGPT, Empirical Study, User Study.
Abstract:
It is now widely accepted that ML models can solve tasks that involve the generation of source code. An interesting follow-up question is whether the tasks themselves can be generated as well. In this paper, we evaluate how well ChatGPT can generate tasks that require writing simple SQL statements. To do this, ChatGPT generated tasks for 10 different database schemas at three difficulty levels (easy, medium, hard). The generated tasks were then evaluated for suitability and difficulty by raters experienced in exam correction. With substantial inter-rater agreement (α=.731), 90.67% of the tasks were considered appropriate (p<.001). However, while the raters agreed that tasks ChatGPT considers more difficult are indeed more difficult (p<.001), there is in general no agreement between ChatGPT's assigned task difficulty and the rated difficulty (α=.310). Additionally, we checked in an N-of-1 experiment whether the use of ChatGPT helped in the design of exams. It turned out that ChatGPT increased the time required to design an exam by 40% (p=.036; d=-1.014). Altogether, the present study raises doubts about whether ChatGPT, in its current version, is a practical tool for the design of source code tasks.
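The abstract does not reproduce the prompts used; purely as an illustration of the generation step described above, here is a minimal sketch using the OpenAI Python SDK. The model name, prompt wording, and schema format are assumptions, not the authors' actual setup.

```python
# Minimal sketch of the task-generation step: for each schema and each
# difficulty level, ask the model for SQL exercises. Model name, prompt
# wording, and schema format are assumptions, not the authors' setup.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

DIFFICULTIES = ("easy", "medium", "hard")

def generate_sql_tasks(schema_ddl: str, difficulty: str) -> str:
    """Ask the model for SQL exercises for one schema/difficulty pair."""
    response = client.chat.completions.create(
        model="gpt-4",  # assumption: any chat-capable model
        messages=[
            {"role": "system",
             "content": "You write SQL exercises for a database course."},
            {"role": "user",
             "content": (f"Given the following schema:\n{schema_ddl}\n"
                         f"Write three {difficulty} tasks that each require "
                         f"a single SQL SELECT statement to solve.")},
        ],
    )
    return response.choices[0].message.content

# Example: one hypothetical schema, tasks at all three levels.
schema = "CREATE TABLE student(id INT, name TEXT, semester INT);"
for level in DIFFICULTIES:
    print(f"--- {level} ---")
    print(generate_sql_tasks(schema, level))
```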
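The agreement values reported above (α=.731 and α=.310) are most plausibly Krippendorff's alpha, the standard chance-corrected agreement measure for multiple raters; assuming that, such values can be computed with the `krippendorff` package as sketched below. The rating matrix is invented illustration data, not the study's.

```python
# Sketch: inter-rater agreement as Krippendorff's alpha.
# Requires: pip install krippendorff numpy
# The ratings below are invented illustration data, not the study's.
import numpy as np
import krippendorff

# One row per rater, one column per task; np.nan marks a missing rating.
# Difficulty codes: 1 = easy, 2 = medium, 3 = hard.
ratings = np.array([
    [1, 2, 3, 2, 1, np.nan],
    [1, 2, 3, 3, 1, 2],
    [1, 3, 3, 2, 1, 2],
])

# Difficulty is an ordered scale (easy < medium < hard), so 'ordinal' fits.
alpha = krippendorff.alpha(reliability_data=ratings,
                           level_of_measurement="ordinal")
print(f"Krippendorff's alpha: {alpha:.3f}")
```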