ChatGPT Code Detection: Techniques for Uncovering the Source of Code

In recent times, large language models (LLMs) have made significant strides in generating computer code, blurring the lines between code created by humans and code produced by artificial intelligence (AI). As these technologies evolve rapidly, it is crucial to explore how they influence code generation, especially given the risk of misuse in areas like higher education. This paper explores this issue by using advanced classification techniques to differentiate between code written by humans and code generated by ChatGPT, a type of LLM. We employ a new approach that combines powerful embedding features (black-box) with supervised learning algorithms, including Deep Neural Networks, Random Forests, and Extreme Gradient Boosting, to achieve this differentiation with an accuracy of 98%. For the successful combinations, we also examine their model calibration, showing that some of the models are extremely well calibrated. Additionally, we present white-box features and an interpretable Bayes classifier to elucidate critical differences between the code sources, enhancing the explainability and transparency of our approach. Both of these approaches work well but provide at most 85-88% accuracy. We also show that untrained humans solve the same task no better than random guessing. This study is crucial in understanding and mitigating the potential risks associated with using AI in code generation, particularly in the context of higher education, software development, and competitive programming.


Introduction
The recent performance of ChatGPT has astonished scholars and the general public, not only regarding the seemingly human way of using natural language but also its proficiency in programming languages. While the training data (e.g., for the Python programming language) ultimately stems from human programmers, ChatGPT has likely developed its own coding style and idiosyncrasies, as human programmers do. In this paper, we train and evaluate various machine learning (ML) models on distinguishing human-written from ChatGPT-written Python code. These models achieve very high performance even for code samples with few lines, a seemingly impossible task for humans. The following subsections present our motivation, scientific description of the task, and the elaboration of research questions.

Motivation
Presently, the world is looking ambivalently at the development and opportunities of powerful large language models (LLMs). On the one hand, such models can execute complex tasks and augment human productivity due to their enhanced performance in various areas [1,2]. On the other hand, these models can be misused for malicious purposes, such as generating deceptive articles or cheating in educational institutions and other competitive environments [3][4][5][6][7].
The inherently opaque nature of these black-box LLMs, combined with the difficulty of distinguishing between human- and AI-generated content, poses a problem that can make it challenging to trust these models [8]. Earlier research has extensively focused on detecting natural language (NL) text content generated by LLMs [9][10][11]. A recent survey [12] discusses the strengths and weaknesses of those approaches. However, detecting AI-generated code is an equally important and relatively unexplored area of research. As LLMs are utilized more often in the field of software development, the ability to distinguish between human- and AI-generated code becomes increasingly important. Thereby, it is not exclusively about the distinction but also the implications, such as the code's trustworthiness, efficiency, and security. Moreover, the rapid development and improvement of AI models may lead to an arms race where detection techniques must continuously evolve to stay ahead of the curve.
Earlier research showed that using LLMs to generate code can lead to security vulnerabilities, and 40% of the code fails to solve the given task [13]. In contrast, using LLMs can also lead to a significant increase in productivity and efficiency [2]. This double-edged nature of LLMs necessitates a balanced approach. Harnessing the potential benefits of such models while mitigating risks is the key. Ensuring the authenticity of code is especially crucial in academic environments, where the integrity of research and educational outcomes is paramount. Fraudulent or AI-generated submissions can undermine the foundation of academic pursuits, leading to a loss of trust in research findings and educational qualifications. Moreover, in the context of examinations, robust fraud detection is essential to prevent cheating, ensuring that the assessments accurately reflect the student's capabilities rather than the non-deterministic output of a prompt, which results from the stochastic decision-making of LLMs based on transformer models (TM) during inference. Beyond the educational context, and under the assumption that AI-generated code has a higher chance of security vulnerabilities, it can also be critical in the software industry to be able to distinguish between human- and AI-generated code when testing an unknown piece of software. As the boundaries of what AI can achieve expand, our approach to understanding, managing, and integrating these capabilities into our societal fabric, including academic settings, will determine our success in the AI-augmented era.

Problem Introduction
This paper delves deep into the challenge of distinguishing between human-generated and AI-generated code, offering a comprehensive overview of state-of-the-art methods and proposing novel strategies to tackle this problem.
Central to our methodology is a reduction of complexity: the intricate task of differentiating between human- and AI-generated code is represented as a fundamental binary classification problem. Specifically, given a code snippet x ∈ C as input, we aim for a function f : C → {0, 1} = Y, which indicates whether the code's origin is human {0} or GPT {1}. This allows us to use well-known and established ML models. In order to apply ML models, we represent the code snippets either as human-designed (white-box) features or as embedding (black-box) features. The usage of embeddings requires prior tokenization, which is carried out either implicitly by the model or explicitly by us. Hence, we use embeddings to obtain constant dimensionality across all code snippets or single tokens. Figure 1 gives an overview of our approach in the form of a flowchart; the details are explained in the following sections.
For a model to truly generalize in the huge field of software development across many tasks, it requires training on vast amounts of data. However, volume alone is insufficient. The data's quality is paramount, ensuring meaningful and discriminating features are present within the code snippets. An ideal dataset would consist of code snippets that satisfy a variety of test cases to guarantee their syntactic and semantic correctness for a given task. Moreover, to overcome the pitfalls of biased data, we emphasize snippets written prior to the proliferation of GPT in code generation. However, a significant blocker emerges from a stark paucity of publicly available GPT-generated solutions that match the criteria mentioned above. This lack underscores the necessity to find and generate solutions that can serve as fitting data sources for our classifiers.

Research Questions and Contributions
We formulate the following research questions based on the problem introduction and the need to obtain detection techniques for AI-generated code. In addition, we provide our hypotheses regarding the questions we want to answer with this work.

RQ2: To what extent can we explain the difference between human- and AI-generated code?
• H1: There are detectable differences in style between AI-generated and human-generated code.
• H2: The differences are only to a minor extent attributable to code formatting: If both code snippets are formatted in the same way, there are still many detectable differences.
Upon rigorous scrutiny of the posed research questions, an apparent paradox emerges. LLMs have been trained using human-generated code. Consequently, the question arises: Does AI-generated code diverge from its human counterpart? We postulate that LLMs follow learning trajectories similar to individual humans. Throughout the learning process, humans and machines are exposed to many code snippets, each encompassing distinct stylistic elements. They subsequently develop and refine their unique coding style, e.g., by using consistent variable naming conventions, commenting patterns, code formatting, or selecting specific algorithms for given scenarios [14]. Given the vast amount of code the machine has seen during training, it is anticipated to adopt a more generalized coding style. Thus, identifiable discrepancies between machine-generated code snippets and individual human-authored code are to be expected.
The main contributions of this paper are:
1. Several classification models are evaluated on a large corpus of code data. While the human-generated code comes from many different subjects, the AI-generated code is (currently) only produced by ChatGPT-3.5.
2. The best model-feature combinations are models operating on high-dimensional vector embeddings (black-box) of the code data.
3. Formatting all snippets with the same code formatter decreases the accuracy only slightly. Thus, the format of the code is not the key feature of distinction.
4. The best models achieve classification accuracies of 98%. An explainable classifier with almost 90% accuracy is obtained with the help of Bayes classification.

Structure
The remainder of this article is structured as follows: Section 2 provides a comprehensive review of the existing literature, evaluating and discussing its current state. Section 3 offers an overview of the techniques and frameworks employed throughout this study, laying the foundation for understanding the subsequent sections. Section 4 details our experimental setup, outlining procedures from data collection and preprocessing to the training of models and their parameters. Section 5 presents the empirical findings of our investigations, followed by a careful analysis of the results. In Section 6, we provide a critical discussion and contextualization within the scope of alternative approaches. Finally, Section 7 synthesizes our findings and outlines potential directions for further research.

Related Work
Fraud detection is a well-established area of research in the domain of AI. However, most methodologies focus on AI-generated NL content [9,10,15,16] rather than on code [17,18]. Nevertheless, findings from text-based studies remain relevant, given the potential for cross-application and transferability of techniques.

Zero-shot detection
One of the most successful models for differentiating between human- and AI-generated text is DetectGPT by Mitchell et al. [9], which employs zero-shot detection, i.e., it requires neither labeled training data nor specific model training. Instead, DetectGPT follows a simple hypothesis: Minor rewrites of model-generated text tend to have lower log probability under the model than the original sample, while minor rewrites of human-written text may have higher or lower log probability than the original sample. DetectGPT only requires log probabilities computed by the model of interest and random perturbations of the passage from another generic pre-trained LLM (e.g., T5 [19]). Applying this method to textual data of different origins, Mitchell et al. [9] report very good classification results, namely AUC = 0.90-0.99, which is considerably better than other zero-shot detection methods on the same data.
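The hypothesis behind DetectGPT can be illustrated with a small sketch. It is not the implementation of Mitchell et al. [9]: the function name perturbation_discrepancy is our own, and toy_log_prob and toy_perturb are hypothetical stand-ins for the scoring LLM and the perturbation model (e.g., T5). The sketch only shows the quantity being compared, namely the log probability of the original passage minus the mean log probability of its perturbed variants.

```python
import random
import statistics


def perturbation_discrepancy(text, log_prob, perturb, n_perturbations=20, seed=0):
    """Log probability of the original minus the mean over perturbed variants.

    A clearly positive value suggests the text sits near a local maximum of
    the model's log probability, which DetectGPT treats as a sign of
    model-generated text.
    """
    rng = random.Random(seed)
    perturbed_scores = [log_prob(perturb(text, rng)) for _ in range(n_perturbations)]
    return log_prob(text) - statistics.mean(perturbed_scores)


# Hypothetical stand-in for an LLM: this toy "model" prefers four-letter words.
def toy_log_prob(text):
    return -float(sum(abs(len(word) - 4) for word in text.split()))


# Hypothetical stand-in for a perturbation model: replace one random word.
def toy_perturb(text, rng):
    words = text.split()
    words[rng.randrange(len(words))] = rng.choice(["a", "variable", "extraordinary"])
    return " ".join(words)


score = perturbation_discrepancy("this code runs fast", toy_log_prob, toy_perturb)
print(score)
```

In a real setting, a threshold on this discrepancy (possibly after normalization) would separate the two classes; the threshold itself has to be chosen on held-out data.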
Yang et al. [18] developed DetectGPT4Code, an adaptation of DetectGPT for code detection that also operates as a zero-shot detector, by introducing three modifications: (1) Replacing T5 with the code-specific Incoder-6B [20] for code perturbations, addressing the need for maintaining the code's syntactic and semantic integrity. (2) Employing smaller surrogate LLMs to approximate the probability distributions of closed, black-box models, like GPT-3.5 [21] or GPT-4 [22]. (3) Using fewer tokens as anchors, which turned out to be better than using the full-length code. Preliminary experiments found that the ending tokens are more deterministic given enough preceding text and are thus better indicators. Yang et al. [18] tested DetectGPT4Code on a relatively small set of 102 Python and 165 Java code snippets. Their results with AUC = 0.70-0.80 were clearly better than using plain DetectGPT (AUC = 0.50-0.60) but still not reliable enough for practical use.

Text Detectors Applied to Code
Recently, several detectors like DetectGPT [9], GPTZero [23] and others [10,15,16,24] were developed that are good at distinguishing AI-generated from human-generated NL texts, often with an accuracy better than 90%. This also makes it tempting to apply those NL text detectors to code snippets. As written above, Yang et al. [18] used DetectGPT as a baseline for their code detection method.
There are two recent works [25,26] that compare a larger variety of NL text detectors on code detection tasks: Wang et al. [25] collect a large code-related content dataset, among them 226k code snippets, and apply 6 different text detectors to it. When using the text detectors as-is, they only reach a low AUC = 0.40-0.50 on code snippets, which they consider unsuitable for reliable classification. In a second experiment, they fine-tune one of the open-source detectors (RoBERTa-QA [24]) by training it with a portion of their code data. It is unclear whether this fine-tuning and its evaluation used training and test samples originating from the same coding problem. Interestingly, after fine-tuning, they report a considerably higher AUC = 0.77-0.98. The authors conclude that "While fine-tuning can improve performance, the generalization of the model still remains a challenge".
Pan et al. [26] provide a similar study on a medium-size database with 5k code snippets, testing 5 different text detectors on their ability to recognize the origin. As a special feature, they consider 13 variant prompts. They report an accuracy of 50-60% for the tested detectors, only slightly better than random choice.
In general, text detectors work well on NL detection tasks but are not reliable enough on code detection tasks.

Embedding-and Feature-Based Methods
In modern LLMs, embeddings constitute an essential component, transforming text or code into continuous representations within a dense vector space of constant dimension, where proximity indicates similarity among elements. Hoq et al. [17] use a prior term frequency-inverse document frequency (TF-IDF) [27] embedding for classic ML algorithms, code2Vec [28] and abstract syntax tree-based neural networks (ASTNN) [29] for predicting the code's origin. While TF-IDF reflects the frequency of a token in a code snippet over a collection of code snippets, code2Vec converts a code snippet, represented as an abstract syntax tree (AST), into a set of path-contexts, linking pairs of terminal nodes. Subsequently, it computes the attention weights of the paths and uses them to compute a single aggregated weighted code vector. Similarly, ASTNN parses code snippets as an AST and uses preorder traversal to segment the AST into a sequence of statement trees, which are further encoded into vectors with pre-trained embedding parameters of Word2Vec [30]. These vectors are processed through a Bidirectional Gated Recurrent Unit (Bi-GRU) [31] to model statement naturalness, with pooling of Bi-GRU hidden states to represent the code fragment. Hoq et al. [17] used 3.162 × 10^3 human- and 3 × 10^3 ChatGPT-generated code snippets in Java from a CS1 course with a total of 10 distinct problems, yielding 300 solutions per problem. Further, they select 4 × 10^3 random code snippets for training and distribute the remaining samples equally between the test and validation sets. All models achieve similar accuracies, ranging from 0.90 to 0.95. However, the small number of unique problems, the large number of similar solutions, and their splitting procedure render the results challenging to generalize beyond the study's specific context, potentially limiting the applicability of the findings to broader scenarios.
Li et al. [32] present an interesting work where they generate features in three groups (lexical, structural layout, and semantic) for discrimination of code generated by ChatGPT from human-generated code. Based on these rich feature sets, they reach detection accuracies between 0.93 and 0.97 with traditional ML classifiers like random forests (RF) or sequential minimal optimization (SMO). Limitations of the method, as mentioned by the authors, are the relatively small ChatGPT code dataset (1206 code snippets) and the lack of prompt engineering (specific prompt instructions may lead to different results).

Algorithms and Methodology
In this section, we present fundamental algorithms and describe the approaches required for our methodology. We start by detailing the general prerequisites and preprocessing needed for algorithm application. Subsequently, we explain our strategies for code detection and briefly outline the models used for code sample classification.

General Prerequisites
Detecting fraudulent use of ChatGPT in software development or coding assessment scenarios requires an appropriate dataset. Coding tasks consisting of a requirement text, human solutions, and test cases are of interest. To the best of our knowledge, one of the most common methods of fraudulent usage is to pass the requirement text as the prompt to ChatGPT and to submit the code extracted from its output. Therefore, tasks that fulfill the above criteria are sampled from several coding websites, and human solutions are used as a baseline to compare to the code from ChatGPT. The attached test cases were applied to both the human and AI solutions before comparison to guarantee that the code is not arbitrary but functional and correct. For generating code, we used gpt-3.5-turbo [33], the most commonly used AI tool for fraudulent content, which also powers the application ChatGPT.

ChatGPT
Fundamentally, ChatGPT is a fine-tuned sequence-to-sequence learning [34] model with an encoder-decoder structure based on a pre-trained transformer [21,35]. Due to its positional encoding and self-attention mechanism, it can process data in parallel rather than sequentially, unlike previously used models such as recurrent neural networks [36] or long short-term memory (LSTM) models [37]. Limited only by the maximum number of input tokens, it is capable of capturing long-term dependencies. During inference, the decoder is detached from the encoder and is used solely to output further tokens. Interacting with the model requires the user to provide input that is then processed and passed into the decoder, which generates the output sequence token by token. Once a token is generated, the model incorporates this new token into the input from the preceding forward pass, continuously generating subsequent tokens until a termination criterion is met. Upon completion, the model stands by for the next user input, seamlessly integrating it with the ongoing conversation. This process effectively simulates an interactive chat with a GPT model, maintaining the flow of the conversation.

Embeddings
Leveraging the contextual representation of embeddings in a continuous and constant space allows ML models to perform mathematical operations and understand patterns or similarities in the data. In our context, we use the following three models to embed all code snippets:
TF-IDF [38] incorporates an initial step of prior tokenization of code snippets, setting the foundation to capture two primary components: (1) term frequency (TF), which is the number of times a token appears in a code snippet, and (2) inverse document frequency (IDF), which reduces the weight of tokens that are common across multiple code snippets. Formally, TF-IDF is defined as:
$$\text{TF-IDF}(t, d) = \frac{N_{t,d}}{N_d} \cdot \log\left(\frac{N}{N_t}\right),$$
where $N$ is the number of code snippets, $N_{t,d}$ the number of times token $t$ appears in code snippet $d$, $N_d$ the number of tokens in code snippet $d$, and $N_t$ the number of code snippets that include token $t$. The score emphasizes tokens that occur frequently in a particular code snippet but are less frequent in the entire collection of code snippets, thereby underlining the unique relevance of those tokens to that particular code snippet.
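As a minimal sketch of the TF-IDF score, the following code computes term frequency times inverse document frequency from the four counts defined above. The function name tf_idf, the whitespace tokenization, and the use of the natural logarithm are our assumptions for illustration; library implementations (e.g., in scikit-learn) typically apply additional smoothing and normalization and will give different numbers.

```python
import math
from collections import Counter


def tf_idf(snippets):
    """Compute TF-IDF scores for a list of tokenized code snippets.

    snippets: list of token lists. Returns one dict (token -> score) per snippet.
    """
    n_snippets = len(snippets)
    # N_t: number of snippets that contain token t at least once.
    doc_freq = Counter()
    for tokens in snippets:
        doc_freq.update(set(tokens))

    scores = []
    for tokens in snippets:
        counts = Counter(tokens)  # N_{t,d}
        n_tokens = len(tokens)    # N_d
        scores.append({
            token: (count / n_tokens) * math.log(n_snippets / doc_freq[token])
            for token, count in counts.items()
        })
    return scores


# Toy corpus with naive whitespace tokenization (an assumption for this sketch).
snippets = [
    "def f ( x ) : return x".split(),
    "for x in xs : print ( x )".split(),
]
scores = tf_idf(snippets)
print(scores[0]["def"], scores[0]["x"])
```

Tokens that occur in every snippet, such as x in this toy corpus, receive a score of zero because log(N / N_t) vanishes, while def is weighted by its relative frequency in the first snippet.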
Word2Vec [39] is a neural network-based technique used to generate dense vector representations of words in a continuous vector space. It fundamentally operates on one of two architectures: (1) Skip-gram (SG), where the model predicts the surrounding context given a word, or (2) continuous bag of words (CBOW), where the model aims to predict a target word from its surrounding context. Given a sequence of words $w_1, \ldots, w_T$, the objective of the skip-gram model is to maximize the average log probability:
$$\frac{1}{T} \sum_{t=1}^{T} \sum_{-c \le j \le c,\, j \ne 0} \log p(w_{t+j} \mid w_t), \qquad p(w_O \mid w_I) = \frac{\exp\left({v'_{w_O}}^{\top} v_{w_I}\right)}{\sum_{w=1}^{W} \exp\left({v'_{w}}^{\top} v_{w_I}\right)},$$
where $v_w$ and $v'_w$ denote the input and output vector representations of a word $w$ in the vocabulary $V$, $W \in \mathbb{N}$ the number of words in that vocabulary, and $c \in \mathbb{N}$ the size of the training context. The probability $p(w_O \mid w_I)$ of a word given its context is calculated by the softmax function. Training the model efficiently involves the use of hierarchical softmax and negative sampling to avoid the computational challenges of the softmax over large vocabularies [30].
OpenAI ADA [33] does not have an official paper, but we strongly suspect that a methodology related to OpenAI's paper [40] was used to train the model. In their approach, Neelakantan et al. [40] use a contrastive objective on semantically similar paired samples $\{(x_i, y_i)\}_{i=1}^{N}$ with in-batch negatives during training. To this end, a pre-trained transformer encoder $E$ [35], initialized with GPT models [21,41], is used to map the elements of each pair to their embeddings and to calculate the cosine similarity:
$$v_x = E([\text{SOS}]_x \oplus x \oplus [\text{EOS}]_x), \qquad v_y = E([\text{SOS}]_y \oplus y \oplus [\text{EOS}]_y), \qquad \text{sim}(x, y) = \frac{v_x \cdot v_y}{\lVert v_x \rVert \, \lVert v_y \rVert},$$
where $\oplus$ denotes the operation of string concatenation and EOS and SOS are special tokens delimiting the sequences. Fine-tuning the model includes contrasting the paired samples against in-batch negatives, given by supervised training data like natural language inference (NLI) [42]. Mini-batches of $M$ samples are considered for training, which consist of $M - 1$ negative samples from NLI and one positive example $(x_i, y_i)$. Thus, the logits for one batch form an $M \times M$ matrix, where each logit is defined as $\hat{y}_{ij} = \text{sim}(x_i, y_j) \cdot \exp(\tau)$, with $\tau$ being a trainable temperature parameter. The loss is calculated as the cross-entropy loss across the row and column directions, where the positive examples lie on the diagonal of the matrix. Currently, embeddings from ADA can be obtained by using OpenAI's API, namely text-embedding-ada-002 [33], which returns an embedding of fixed dimension, $x \in \mathbb{R}^{1536}$.
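The contrastive objective with in-batch negatives can be sketched on precomputed embedding vectors. This is an illustration of the loss described for [40] and not OpenAI's training code: the function names are ours, the encoder E is omitted (the vectors are given directly), and the temperature tau is a fixed number here instead of a trainable parameter. The row and column cross-entropy terms are averaged.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equally long vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def log_softmax_at(row, index):
    """Numerically stable log-softmax of row, evaluated at position index."""
    peak = max(row)
    log_norm = peak + math.log(sum(math.exp(z - peak) for z in row))
    return row[index] - log_norm


def contrastive_loss(xs, ys, tau=0.0):
    """Symmetric cross-entropy over the M x M logit matrix sim(x_i, y_j) * exp(tau).

    The positives (x_i, y_i) lie on the diagonal; all other entries of a row
    or column act as in-batch negatives.
    """
    m = len(xs)
    logits = [[cosine(x, y) * math.exp(tau) for y in ys] for x in xs]
    columns = [[logits[i][j] for i in range(m)] for j in range(m)]
    row_loss = -sum(log_softmax_at(logits[i], i) for i in range(m)) / m
    col_loss = -sum(log_softmax_at(columns[j], j) for j in range(m)) / m
    return (row_loss + col_loss) / 2


xs = [[1.0, 0.0], [0.0, 1.0]]
aligned = contrastive_loss(xs, [[1.0, 0.0], [0.0, 1.0]])
shuffled = contrastive_loss(xs, [[0.0, 1.0], [1.0, 0.0]])
print(aligned, shuffled)
```

The loss is small when each x_i is most similar to its own partner y_i and grows when the pairing is broken, which is what drives semantically similar pairs towards nearby embeddings.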

Supervised Learning Methods
Feature extraction and embedding derivation constitute integral components in distinguishing between AI-generated and human-generated code, serving as inputs for classification models. Subsequently, we list the supervised learning (SL) models employed in our analysis:
• Logistic Regression (LR) [43]: A method that makes a linear regression model usable for classification tasks.
• Classification and Regression Tree (CART) [44]: A well-known form of decision trees (DTs) that offers transparent decision-making. Its simplicity, consisting of simple rules, makes it easy to use and understand.
• Oblique Predictive Clustering Tree (OPCT) [45]: In contrast to regular DTs like CART, an OPCT split at a decision node is not restricted to a single feature, but rather uses a linear combination of features, cutting the feature space along arbitrary slanted (oblique) hyperplanes.
• Random Forest (RF) [46]: A random forest is an ensemble method, i.e., the application of several DTs, and is subject to the idea of bagging. RFs tend to be much more accurate than individual DTs due to their ensemble nature, usually at the price of reduced interpretability.
• eXtreme Gradient Boosting (XGB) [47]: Boosting is an ensemble technique that aims to create a strong classifier from several weak classifiers. In contrast to RF with its independent trees, in boosting the weak learners are trained sequentially, with each new learner attempting to correct the errors of its predecessors. In addition to gradient boosting [48], XGB employs a more sophisticated objective function with regularization to prevent overfitting and improve computational efficiency.
• Deep Neural Network (DNN) [49,50]: A feedforward neural network with multiple layers. DNNs can learn highly complex patterns and hierarchical representations, making them extremely powerful for various tasks. However, they require large amounts of data and computational resources for training, and their highly non-linear nature makes them, in contrast to other methods, somewhat of a "black-box", making it difficult to interpret their predictions.

Gaussian Mixture Models
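Beyond SL methods, we also incorporate Gaussian mixture models (GMMs). Generally, a GMM is characterized by a set of $K$ Gaussian distributions $\mathcal{N}(x \mid \vec{\mu}, \Sigma)$. Each distribution $k = 1, \ldots, K$ has a mean vector $\vec{\mu}_k$ and a covariance matrix $\Sigma_k$. Additionally, there are mixing coefficients $\psi_k$ associated with each Gaussian component $k$, satisfying the condition $\sum_{k=1}^{K} \psi_k = 1$ to ensure the probability is normalized to 1. Further, all components $k$ are initialized with k-means, modeling each cluster with the corresponding Gaussian distribution. The probability density function of a GMM is defined as:
$$p(x) = \sum_{k=1}^{K} \psi_k \, \mathcal{N}(x \mid \vec{\mu}_k, \Sigma_k). \tag{1}$$
The pre-defined clusters serve as the starting point for optimizing the GMM with the expectation-maximization (EM) algorithm, which refines the model through iterative expectation and maximization steps. In the expectation step, it calculates the posterior probabilities $\hat{\gamma}_{ik}$ of data points $x_i$, $i = 1, \ldots, n$, belonging to each Gaussian component $k$, using the current parameter estimates according to Eq. (2). Subsequently, the maximization step in Eq. (3) updates the model parameters $(\hat{\psi}_k, \hat{\vec{\mu}}_k, \hat{\Sigma}_k)$ to maximize the data likelihood:
$$\hat{\gamma}_{ik} = \frac{\psi_k \, \mathcal{N}(x_i \mid \vec{\mu}_k, \Sigma_k)}{\sum_{j=1}^{K} \psi_j \, \mathcal{N}(x_i \mid \vec{\mu}_j, \Sigma_j)}, \tag{2}$$
$$\hat{\psi}_k = \frac{1}{n} \sum_{i=1}^{n} \hat{\gamma}_{ik}, \qquad \hat{\vec{\mu}}_k = \frac{\sum_{i=1}^{n} \hat{\gamma}_{ik} \, x_i}{\sum_{i=1}^{n} \hat{\gamma}_{ik}}, \qquad \hat{\Sigma}_k = \frac{\sum_{i=1}^{n} \hat{\gamma}_{ik} \, (x_i - \hat{\vec{\mu}}_k)(x_i - \hat{\vec{\mu}}_k)^{\top}}{\sum_{i=1}^{n} \hat{\gamma}_{ik}}. \tag{3}$$
Iterating these two steps never decreases the data likelihood, so the procedure converges to a local optimum, which may, but need not, be the global optimum.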
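The EM iteration can be sketched for the one-dimensional case, where the covariance matrix reduces to a scalar variance. This is a simplified illustration under our own assumptions: the function name fit_gmm_1d is hypothetical, the means are initialized from user-supplied values instead of k-means, and a small variance floor guards against degenerate components.

```python
import math


def normal_pdf(x, mean, var):
    """Density of a one-dimensional Gaussian."""
    return math.exp(-((x - mean) ** 2) / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def fit_gmm_1d(data, init_means, n_iter=50, min_var=1e-6):
    """Fit a 1-D Gaussian mixture with EM; returns (weights, means, variances)."""
    k, n = len(init_means), len(data)
    means = list(init_means)
    variances = [1.0] * k
    weights = [1.0 / k] * k
    for _ in range(n_iter):
        # Expectation step: posterior probability of each component per point.
        resp = []
        for x in data:
            joint = [weights[j] * normal_pdf(x, means[j], variances[j]) for j in range(k)]
            total = sum(joint)
            resp.append([p / total for p in joint])
        # Maximization step: re-estimate weights, means, and variances.
        for j in range(k):
            n_j = sum(r[j] for r in resp)
            weights[j] = n_j / n
            means[j] = sum(r[j] * x for r, x in zip(resp, data)) / n_j
            spread = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, data)) / n_j
            variances[j] = max(spread, min_var)
    return weights, means, variances


data = [0.9, 1.0, 1.1, 4.9, 5.0, 5.1]
weights, means, variances = fit_gmm_1d(data, init_means=[min(data), max(data)])
print(weights, means, variances)
```

With two well-separated groups of points, the two components settle on the group centers with roughly equal mixing coefficients. For multivariate embeddings, a library implementation with full covariance matrices and k-means initialization would be the more practical choice.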

Experimental Setup
In this section, we outline the requirements to carry out our experiments. We cover basic hardware and software components, as well as the collection and preprocessing of data to apply the methodology and models presented in the previous section.
For data preparation and all our experiments, we used Python version 3.10 with different packages, as delineated in our repository https://github.com/MarcOedingen/ChatGPT-Code-Detection (accessed on July 4, 2024).Due to large amounts of code snippets, we recommend a minimum of 32 GB of RAM, especially when experimenting with Word2Vec.

Data Collection
As previously delineated in the general prerequisites, an ideal dataset for our intended purposes is characterized by the inclusion of three fundamental elements: (1) Problem description, (2) one or more human solutions, and (3) various test cases. This is exemplified in Figure 2. The problem description (1) should clearly contain the minimum information required to solve a programming task or to generate a solution through ChatGPT. In contrast, unclear problem descriptions may lead to solutions that overlook the main problem, thereby lowering the solution quality and potentially omitting useful solutions from the limited available samples. The attached human solutions (2) for a coding problem play an important role in the subsequent analysis and serve as a reference benchmark for the output of ChatGPT. Furthermore, a set of test cases (3) facilitates the elimination of syntactically correct solutions that do not fulfill the functional requirements specified in the problem description. To this end, a controlled environment is created in which the code's functionality is rigorously tested, preventing the inclusion of snippets of code that are based on incorrect logic or could potentially produce erroneous output. Hence, we strongly focus on syntactically correct and executable code but ignore a possibly typical behavior of ChatGPT in case of uncertainties or wrong answers. For some problems, a function is expected to solve them, while others expect a console output. We have considered both by using either the function name or the entire script, referred to as the entry point, for the enclosed test cases.
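How a candidate solution can be checked against test cases via a function entry point is sketched below. The helper passes_tests and the format of the test cases (argument tuple, expected result) are our assumptions for illustration, not the evaluation harness used in this work; script-style entry points with console output are not covered. Since exec runs arbitrary code, untrusted snippets should only be evaluated in a sandbox with time and resource limits.

```python
def passes_tests(code, entry_point, test_cases):
    """Return True if the function named entry_point in code passes all test cases.

    test_cases: list of (args, expected) pairs, where args is a tuple.
    Any exception (syntax error, missing function, runtime error) counts as failure.
    """
    namespace = {}
    try:
        exec(code, namespace)  # caution: only run untrusted code in a sandbox
        func = namespace[entry_point]
        return all(func(*args) == expected for args, expected in test_cases)
    except Exception:
        return False


# Hypothetical test cases for "minimum element in a sorted and rotated array".
tests = [(([3, 4, 5, 1, 2],), 1), (([1, 2, 3],), 1)]

correct = "def find_min(arr):\n    return min(arr)\n"
wrong_logic = "def find_min(arr):\n    return arr[0]\n"
broken_syntax = "def find_min(arr) return min(arr)\n"

print(passes_tests(correct, "find_min", tests))
print(passes_tests(wrong_logic, "find_min", tests))
print(passes_tests(broken_syntax, "find_min", tests))
```

The second candidate is syntactically valid but functionally wrong, which is exactly the kind of snippet that the test cases are meant to eliminate.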

[Figure 2: Example of a coding problem. Problem description: Write a python function to find the minimum element in a sorted and rotated array.]
Programming tasks from programming competitions are particularly suitable for the above criteria; see Table 1 for the sources of the coding problems. Due to the variety and high availability of such tasks in Python, we decided to use this programming language. Thereby, we exclusively include human solutions from a period preceding the launch of ChatGPT. We used OpenAI's gpt-3.5-turbo API with the default parameters to generate code.
The OpenAI report [22] claims that GPT-3.5 has an accuracy of 48.1% in a zero-shot evaluation for generating a correct solution on the HumanEval dataset [51]. After a single generation, our experimental verification yielded a notably lower average success rate of 21.3%. Due to the low success rate, we conducted five distinct API calls for each collected problem. This strategy improved the accuracy considerably to 45.6%, converging towards the accuracy reported in [22] and substantiating the theory that an increase in generation attempts correlates positively with higher accuracy [51]. Furthermore, it is noteworthy that the existence of multiple AI-generated solutions for a single problem does not pose an issue, given that the majority of problems possess various human solutions; see Table 1 in column 'before pre-processing' for the number of problems $n_{\text{PROBLEMS}}$, the average number of human solutions per problem $\bar{n}_{\text{SOLUTIONS}}$, and the total number of human samples $n_{\text{SAMPLES}}$.
When generating code with gpt-3.5-turbo, the prompt strongly influences the success rate. Prompt engineering is a separate area of research that aims to utilize the intrinsic capabilities of an LLM while mitigating potential pitfalls related to unclear problem descriptions or inherent biases. During the project, we tried different prompts to increase the yield of successful solutions. Our most successful prompt, which we subsequently used, is the following: "Question: <Coding_Task_Description> Please provide your answer as Python code. Answer:". Other prompts, e.g., "Question: <Coding_Task_Description> You are a developer writing Python code. Put all python code $PYTHON in between [[[$PYTHON]]]. Answer:", led to a detailed explanation of the problem and an associated solution strategy of the model, but without the solution in code.

Data Preprocessing
Due to impurities in both the GPT-generated and human solutions, the data must be preprocessed before it can be used as input for ML models. Hence, we first extracted the code from the GPT-generated responses and checked whether it and the human solutions can be executed, reducing the whole dataset to 3.68 × 10^5 samples. This also eliminated missing values due to miscommunication with the API, server overloads, or the absence of Python code in the answer. Further, to prevent the overpopulation of particular code snippet subsets, we removed duplicates in both classes. A duplicate is a code snippet for problem P that is identical to another code snippet for the same problem P. This first preprocessing step leaves us with 3.14 × 10^5 samples.
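The first preprocessing steps can be sketched as follows. All helper names are hypothetical, and two simplifications are assumed: responses are expected to wrap code in Markdown-style fences, and a successful compile call is used as a cheap proxy for "can be executed" (it only catches syntax errors, whereas actual execution also reveals runtime errors).

```python
import re

TICKS = "`" * 3  # Markdown fence marker, built this way to keep the sketch readable
FENCE = re.compile(TICKS + r"(?:python)?[ \t]*\n(.*?)" + TICKS, re.DOTALL | re.IGNORECASE)


def extract_code(response):
    """Return the concatenated fenced code blocks of a response, or None if absent."""
    blocks = FENCE.findall(response)
    if not blocks:
        return None
    return "\n".join(block.strip() for block in blocks)


def is_valid_python(code):
    """Cheap syntax check; does not run the code."""
    try:
        compile(code, "<snippet>", "exec")
        return True
    except (SyntaxError, ValueError):
        return False


def deduplicate(samples):
    """Drop snippets that are identical to an earlier snippet for the same problem.

    samples: iterable of (problem_id, code) pairs; order is preserved.
    """
    seen = set()
    unique = []
    for problem_id, code in samples:
        if (problem_id, code) not in seen:
            seen.add((problem_id, code))
            unique.append((problem_id, code))
    return unique


response = "Here is the code:\n" + TICKS + "python\nprint(1)\n" + TICKS + "\nDone."
code = extract_code(response)
samples = deduplicate([("p1", "print(1)"), ("p1", "print(1)"), ("p2", "print(1)")])
print(code, is_valid_python(code), len(samples))
```

Note that identical code for two different problems is kept, in line with the definition of a duplicate given above.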
Numerically, the largest reduction of the remaining samples, especially of the GPT-generated code snippets, results from the application of the test cases. This reduces the number of remaining GPT samples by 72.39% and the number of human samples by 28.59%, leaving a total of 1.71 × 10^5 samples. Furthermore, we consider a balanced dataset so that our models are less likely to develop biases or favor a particular class, reducing the risk of overfitting and making the evaluation of the model's performance more straightforward. Given n individual coding problems P_i, i = 1, . . ., n, with h_i human solutions and g_i GPT solutions, we take the minimum k_i = min(h_i, g_i) and choose k_i random and distinct solutions from each of the two classes for P_i. The figures for human samples after the pre-processing procedure are listed in Table 1 in column 'after pre-processing'. Since the dataset is balanced, the average number of GPT solutions $\bar{n}_{\text{SOLUTIONS}}$ and the total number of GPT samples $n_{\text{SAMPLES}}$ equal those of the human solutions and samples for each source in the final processed dataset. Thus, the pre-processed, balanced and cleaned dataset contains 3.14 × 10^4 samples in total.
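The balancing rule k_i = min(h_i, g_i) can be sketched as follows. The function name balance, the dictionary layout (problem identifier mapped to a list of snippets), and the fixed seed are assumptions made for this illustration.

```python
import random


def balance(human, gpt, seed=0):
    """Draw k_i = min(h_i, g_i) distinct solutions per class and problem.

    human, gpt: dicts mapping a problem id to its list of code snippets.
    Returns a list of (problem_id, code, label) with label 0 = human, 1 = GPT.
    Problems lacking solutions from either class are dropped (k_i = 0).
    """
    rng = random.Random(seed)
    balanced = []
    for problem_id in sorted(human.keys() & gpt.keys()):
        k = min(len(human[problem_id]), len(gpt[problem_id]))
        balanced += [(problem_id, code, 0) for code in rng.sample(human[problem_id], k)]
        balanced += [(problem_id, code, 1) for code in rng.sample(gpt[problem_id], k)]
    return balanced


human = {"p1": ["h1", "h2", "h3"], "p2": ["h4"]}
gpt = {"p1": ["g1"], "p2": ["g2", "g3"], "p3": ["g4"]}
balanced = balance(human, gpt)
print(balanced)
```

In this toy example, each of the problems p1 and p2 contributes one human and one GPT snippet, while p3 is dropped because it has no human solution.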

Optional Code Formatting
A discernible method for distinguishing between the code sources lies in the analysis of code formatting patterns. Variations in these patterns may manifest through the use of spaces rather than tabs for indentation or the uniform application of extended line lengths. Thus, we use the Black code formatter [58], a Python code formatting tool, for both human- and GPT-generated code, standardizing all samples into a uniform formatting style in an automated way. This methodology effectively mitigates the model's tendency to focus on stylistic properties of the code. Consequently, it allows the models to emphasize more significant features beyond mere formatting. A comparison of the number of tokens for all code snippets of the unformatted and formatted datasets is shown in Figure 3.

Training / Test Set Separation
In dividing our dataset into training and test sets, we employed a problem-wise division, allocating 80% of the problems to the training set and 20% to the test set instead of a sample-wise approach. This decision stems from our dataset's structure, which includes multiple solutions per problem. The sample-wise approach could include similar solutions for the same problem within the training and test sets. We opted for a problem-wise split to avoid this issue and enhance the model's generalization, ensuring the model is tested on unseen problem instances. Additionally, we repeat each experiment ten times for statistical reliability, each time using another seed for a different distribution of problems into training and test sets.
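A problem-wise split with repeated seeds might look like the following sketch. The data layout and the function name are assumptions; the toy `samples` list only serves to make the example runnable.

```python
import random


def problem_wise_split(samples, test_fraction=0.2, seed=0):
    """Split by problem so that no problem appears in both training and test set.

    samples: list of (problem_id, label, code) tuples.
    """
    problem_ids = sorted({problem_id for problem_id, _, _ in samples})
    rng = random.Random(seed)
    rng.shuffle(problem_ids)
    n_test = max(1, round(test_fraction * len(problem_ids)))
    test_problems = set(problem_ids[:n_test])
    train = [s for s in samples if s[0] not in test_problems]
    test = [s for s in samples if s[0] in test_problems]
    return train, test


# Toy data: 10 problems with one human and one GPT solution each.
samples = [(p, label, f"code_{p}_{label}") for p in range(10) for label in ("human", "gpt")]

# Ten repetitions, each with a different seed and hence a different problem assignment.
splits = [problem_wise_split(samples, seed=seed) for seed in range(10)]
```

In contrast, a sample-wise split would shuffle the individual tuples, so that solutions to the same problem could land on both sides of the split.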

Modeling Parameters and Tokenization
For all SL methods, we use the default parameters proposed by scikit-learn [59] for RF, GB, LR, and DT, those of xgboost [47] for XGB, those of spyct [45] for OPCT, and those of TensorFlow [60] for DNN, with two notable exceptions for DNNs (1) and OPCTs (2). For DNNs (1), we configure the network architecture to [1536, 768, 512, 128, 32, 8, 1], employing the relu activation function across all layers except for the output layer, where sigmoid was used alongside binary cross-entropy as the loss function. In response to the strong fluctuation of the OPCTs (2), we create 10 individual trees and select the best of them. Additionally, we standardized the number of Gaussian components to K = 2 for all experiments with GMMs.
While the ADA embedding uses the cl100k_base encoding as its internal tokenization, we must explicitly tokenize the prepared formatted or unformatted code for TF-IDF and Word2Vec. We decided to use the same cl100k_base encoding, as implemented in the tiktoken library [33], and applied it uniformly across all code snippets. Given the fixed size of the embedding of text-embedding-ada-002 with x ∈ R^1536, we adopted this dimensionality for the other embeddings. For TF-IDF, we retain sklearn's default parameters, while for gensim's [61] Word2Vec, we adjusted the threshold for a word's occurrence in the vocabulary to min_count = 1 and use CBOW as the training algorithm.

Results
In this section, we present the primary outcomes from deploying the models introduced in Section 3, operating on the pre-processed dataset as shown in Section 4. We discuss the impact of different kinds of features (human-designed vs. embeddings) and the calibration of ML models. We then put the results into perspective by comparing them to the performance of untrained humans and a Bayes classifier.

Similarities between Code Snippets
The representation of the code snippets as embeddings describes a context-rich and high-dimensional vector space. However, the degree of similarity among code snippets within this space remains to be determined. Based on our balanced dataset and the code's functionality, we assume that the code samples are very similar. They are potentially even more similar when a code formatter, e.g., the Black code formatter [58], is used, which presents the models with considerable challenges in distinguishing subtle differences. Mathematically, similarities in high-dimensional spaces can be particularly well calculated using cosine similarity. Let $H_P, G_P \in C$ be code snippets for problem P originating from humans and GPT, respectively. We compute the cosine similarity analogous to Eq. (1) as $\mathrm{sim}(H_P, G_P) = \frac{H_P \cdot G_P}{\lVert H_P \rVert \, \lVert G_P \rVert}$. Concerning the embeddings of all formatted and unformatted code snippets generated by ADA, the resulting distributions of cosine similarities are presented in Figure 4. This allows us to mathematically confirm our assumption that the embeddings of the codes are very similar. The cosine similarities for both the embeddings of the formatted and unformatted code samples in the ADA case are approximately normally distributed, resulting in very similar mean and standard deviation: x̄_FORM = 0.859 ± 0.065 and x̄_UNFORM = 0.863 ± 0.067. Figure 4 also shows the TF-IDF embeddings for both datasets. In contrast to the ADA embeddings, significantly lower cosine similarities can be identified. We suspect that this discrepancy arises because TF-IDF embeddings are sparse and based on exact word matches. In contrast, ADA embeddings are dense and capture semantic relationships and context. Finally, the average cosine similarities in the TF-IDF case are ȳ_FORM = 0.316 ± 0.189 for the formatted dataset and ȳ_UNFORM = 0.252 ± 0.166 for the unformatted dataset.
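For illustration, the cosine similarity between two embedding vectors can be computed as in the following sketch. It uses plain Python lists instead of the actual 1536-dimensional embeddings, and the function name is chosen for the example.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity of two equally long vectors given as sequences of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


# Toy vectors standing in for the embeddings of a human and a GPT snippet.
similarity = cosine_similarity([1.0, 1.0], [1.0, 0.0])
```

Sparse vectors with few shared non-zero entries, as is typical for TF-IDF, produce small dot products and hence low similarities, which is in line with the lower TF-IDF values reported above.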
Figure 4 demonstrates that the cosine similarities of embeddings vary largely, depending on the kind of embedding. But, as the results in Section 5.3 will show, ML models can effectively detect the code's origin from those embeddings. However, embeddings are black-box in the sense that the meaning of certain embedding dimensions is not explainable to humans. Consequently, we also investigated features that can be interpreted by humans to avoid the black-box setting with embeddings.

Human-designed Features (white-box)
A possible and comprehensible differentiation of the code samples can be attributed to their formatting. Even if these differences are not immediately visible to the human eye, they can be determined with the help of calculations. To illustrate this, we have defined the features in Table 2. We assess their applicability using the presented SL models from Section 3.4. The results for the unformatted samples are displayed in Table 3, and for the formatted samples in Table 4. We find that the selected features capture a large proportion of the differences on the unformatted dataset. The XGB model stands out as the most effective one, achieving an average accuracy of 88.48% across various problem splits. However, when assessing the formatted dataset, there is a noticeable performance drop, with the XGB model's effectiveness decreasing by approximately 8 percentage points across all metrics. While outperforming all other models on the unformatted dataset, XGB slightly trails behind the RF model in the formatted dataset, where the RF model leads with an accuracy of 80.50%. This aligns with our expectations that the differences are minimized when employing formatting. While human-designed features are valuable for distinguishing unformatted code, their effectiveness diminishes significantly when formatting variations are reduced. This fact is further supported by Figure 5, which reflects the normalized values of the individual features from Table 2.
Despite their astonishing performance at first glance, human-designed features do not lead to near-perfect classification results. This is likely due to the small number of features and the fact that they do not capture all available information. ML models that use the much higher-dimensional (richer) embedding space address these limitations by capturing implicit patterns not readily recognizable by human-designed features.

Embedding Features (black-box)
ML models operating on embeddings achieve superior performance compared to the models using the human-designed features; see Table 3 and Table 4 for results on the unformatted and formatted dataset, respectively. The results on the unformatted dataset show that the highest values are achieved by XGB + TF-IDF with an accuracy greater than 98% and an astonishing AUC value of 99.84%. With a small gap to XGB + TF-IDF across all metrics, RF + TF-IDF is in second place. With less than one percentage point difference to XGB + TF-IDF, DNN + ADA is in third place. An identical ranking of the top 3 models can be found on the formatted dataset. As with the human-designed features, 'formatted' shows weaker performance than 'unformatted', with a deterioration of about 4 percentage points on all metrics except AUC, which only decreased by about 1 percentage point. In a direct comparison of models based on either human-designed features or embeddings, the best-
The disparities in performance between the white- and black-box approaches, despite employing identical models, highlight the significance of embedding techniques. While traditional feature engineering is based on domain-specific expertise or interpretability, it often cannot capture the complex details of code snippets as effectively as embedding features.

Gaussian Mixture Models
Although GMMs are, in principle, capable of unsupervised learning, they reach better classification accuracies in a supervised or semi-supervised setting: the class labels are provided during training, but the assignment of data points to one of the K GMM components has to be found by the EM method described in Section 3.5. Our method proceeds as follows: First, we train two independent GMMs, one exclusively on human samples and the other on GPT samples. For the prediction of an embedded sample x⃗, we then calculate the likelihoods p(x⃗; G_AI) and p(x⃗; G_HU), i.e., the probability density functions of x⃗ under the two GMMs, and assign x⃗ to the class with the higher likelihood. When using ADA, the embedding can be performed on individual code snippets x directly before the GMM finds clusters in the embeddings E(x). When applying TF-IDF embedding, instead, tokenization T is required (to determine the number of tokens in each snippet) before the embeddings are computed on the tokenized code snippets E(T(x)). With both embeddings, GMMs achieve accuracies of over 90%, outperforming any of the models based on human-designed features (see Table 3).
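The decision rule with two class-specific GMMs can be illustrated by the following simplified sketch. It is not our implementation: it works on one-dimensional inputs instead of 1536-dimensional embeddings, and the mixture parameters (K = 2 components per class) are fixed by hand for the example rather than fitted with EM.

```python
import math


def gaussian_pdf(x, mean, std):
    """Density of a univariate normal distribution."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def gmm_pdf(x, components):
    """Mixture density; components is a list of (weight, mean, std), weights sum to 1."""
    return sum(w * gaussian_pdf(x, m, s) for w, m, s in components)


def classify(x, gmm_human, gmm_gpt):
    """Assign x to the class whose GMM gives the higher likelihood."""
    return "gpt" if gmm_pdf(x, gmm_gpt) > gmm_pdf(x, gmm_human) else "human"


# Hand-picked toy parameters (assumption); in practice they would come from EM.
GMM_HUMAN = [(0.5, -2.0, 1.0), (0.5, -0.5, 0.5)]
GMM_GPT = [(0.5, 1.0, 0.5), (0.5, 2.5, 1.0)]

label = classify(2.0, GMM_HUMAN, GMM_GPT)
```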
The underlying concept of GMMs is the approximation of the probability distributions of ChatGPT, which do not extend over entire snippets but rather over individual tokens used for the prediction of the following token. It therefore appears natural to apply Word2Vec for the embedding of single tokens instead of the snippet-level embedding used in ADA and TF-IDF. With the single-token Word2Vec embedding, GMMs reach an accuracy of 93.57%.
To summarize, we show in Figure 6 the box plots for all our main results. Each box group contains all models trained on a particular (feature set, format)-combination. Within a certain format choice, the box plots for the embedding feature sets do not overlap with those for the human-designed feature set. This shows that the choice 'embedding vs. human-designed' is more important than the specific ML model.

Model Calibration
Previously, we focused on how well the considered algorithms can discriminate between code snippets generated by humans and GPT-generated code snippets. This is an important aspect when judging the performance of these algorithms. However, it also reduces the problem to one of classification, i.e., the algorithm only tells us whether a code snippet is generated by humans or by GPT. In many cases, we may also be interested in more nuanced judgments, e.g., in how likely it is that GPT or humans generated a code snippet. All the considered algorithms are theoretically able to output such predicted class probabilities. In this section, we evaluate how well these predicted probabilities correspond to the proportion of actually observed cases.
We mainly use calibration plots to do this for a subset of the considered algorithms. In such plots, the predicted probability of a code snippet being generated by GPT is shown on the x-axis, while the actually observed value (0 if human-generated, 1 if GPT-generated) is shown on the y-axis. A simple version of this plot would divide the predicted probabilities into categories, for example, ten equally wide ones [62]. The proportion of code snippets labeled as GPT inside those categories should ideally be equal to the mean predicted probability inside this category. For example, in the category 10-20%, the proportion of actual GPT samples should roughly equal 15%. We use a smooth variant of this plot by calculating and plotting a non-parametric locally weighted regression (LOESS) instead, which does not require the use of arbitrary categories [63].
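The simple binned variant of the calibration plot described above can be sketched as follows. Note that this illustrates the version with ten equally wide categories, not the LOESS-based variant we actually plot; the function name is chosen for the example.

```python
def calibration_bins(probabilities, labels, n_bins=10):
    """Per non-empty bin: (mean predicted probability, observed GPT proportion, count).

    probabilities: predicted probabilities of GPT origin in [0, 1].
    labels: 0 for human-generated, 1 for GPT-generated.
    """
    sums = [0.0] * n_bins
    positives = [0] * n_bins
    counts = [0] * n_bins
    for p, y in zip(probabilities, labels):
        index = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes into the last bin
        sums[index] += p
        positives[index] += y
        counts[index] += 1
    return [
        (sums[i] / counts[i], positives[i] / counts[i], counts[i])
        for i in range(n_bins)
        if counts[i] > 0
    ]
```

For a well-calibrated model, the first two entries of each tuple are close to each other. Bins with very small counts give unreliable proportions, which is the binned analogue of the unstable LOESS fits in sparsely populated probability ranges.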
Figures 7 and 9 show these calibration plots for each algorithm, separately for formatted and unformatted data. Most of the considered algorithms show adequate calibration, with the estimated LOESS regression being close to the line that goes through the origin.
There are only minor differences between algorithms fitted on formatted vs. unformatted data. For some algorithms, however (RF + ADA, RF + TF-IDF, GMM + WORD2VEC, GMM + TF-IDF), there seem to be some issues with the calibration in predicted probabilities between 0.15 and 0.85. However, one possible reason for this is not a lack of calibration but a lack of suitable data points to correctly fit the LOESS regression line. For example, the GMM-based models almost always predict only probabilities very close to 0 or 1. This is not necessarily a problem of calibration (if the algorithm is correct most of the time), but it may lead to unstable LOESS results [63]. We, therefore, additionally plotted kernel density estimates of the predicted probabilities for each algorithm to show the range of predictions made by each one (Figures 8 and 10). As can be seen quite clearly, most of the DNN and GMM-based models relying on embeddings generated only very few predicted probabilities between 0.1 and 0.9, making the validity of the calibration curves in these ranges questionable for those algorithms. Since these models do not discriminate perfectly between the two classes, these models may not be the best choice when the main interest lies in predicting the probability of a code snippet being generated by GPT because their output always suggests certainty, even when it is wrong. On the other hand, algorithms such as XGB + TF-IDF or XGB + ADA show a nearly perfect calibration and a very high accuracy.

Human Agents and Bayes Classifiers
To gain a better understanding of the "baseline" performance, to which we compare the more sophisticated models, we briefly present the results of untrained humans and a Bayes classifier.

Untrained Human Agents
To assess how difficult it is for humans to classify Python code snippets according to their origin, we conducted a small study with 20 participants. Using Google Forms, the individuals were asked to indicate their educational background and their experience in Python, and to self-assess their programming proficiency on a 10-point Likert scale. They were afterward asked to judge whether 20 randomly selected and ordered code snippets were written by humans or by ChatGPT. The dataset was balanced, but the participants were not told. The participants, of which 50% had a Master's degree and 40% held a PhD degree, self-rated their programming skills at an average of 6.85 ± 1.69 with a median of 7 and indicated an average of 5.05 ± 4.08 (median 5) years of Python programming experience. The result in Table 5 shows a performance slightly below random guessing, confirming this task's difficulty for untrained humans. Figure 11 shows the distribution of the participants' performance. When comparing these results to the ones obtained by ML models, it should be taken into account that the participants were not trained on the task and are, therefore, considered zero-shot learners. How well humans can be trained in the task using labeled data remains to be investigated.

Trained Bayes Classifier
Viewed from a broader perspective, the question raised by the results from the preceding subsection is why ML models with an F1 score between 83% and 98% excel so much over (untrained) humans who show a performance close to random guessing. In the following, we investigate the possibility (which also appeared briefly in Li et al. [32]) that ML models might use subtle differences in conditional probabilities for the appearance of tokens for decision-making. Such probabilities might be complex for humans to calculate, memorize, and combine.
To demonstrate this, we build a simple Bayes classifier. In the following, we abbreviate H = human origin, G = GPT origin, and X = either origin. For each token t_k in our training dataset, we calculate the probabilities P(t_k | X). To obtain reliable estimates, we keep only those tokens t_k for which the absolute frequencies n(t_k | H) and n(t_k | G) are both greater than or equal to some predefined threshold τ. We denote the set of tokens above the threshold by T. For illustration, Figure 12 shows the tokens with the largest probability ratio.
Given a new code document D from the test dataset, we determine the intersection D ∩ T of tokens and enumerate the elements in this intersection as {T_1, ..., T_K}, where T_k means the event "Document D contains token t_k". We assume statistical independence:

\[ P\Big(\bigcap_{k=1}^{K} T_k \,\Big|\, X\Big) = \prod_{k=1}^{K} P(T_k \mid X). \tag{3} \]

Now we can calculate with Bayes' law the probabilities of origin:

\[ P\Big(H \,\Big|\, \bigcap_{k=1}^{K} T_k\Big) = \frac{P(H) \prod_{k=1}^{K} P(T_k \mid H)}{P(H) \prod_{k=1}^{K} P(T_k \mid H) + P(G) \prod_{k=1}^{K} P(T_k \mid G)} \tag{4} \]

and, similarly, P(G | ∩_{k=1}^{K} T_k). By definition, we have P(H | ∩_{k=1}^{K} T_k) + P(G | ∩_{k=1}^{K} T_k) = 1. The Bayes classifier classifies document D as being of GPT origin if P(G | ∩_{k=1}^{K} T_k) > P(H | ∩_{k=1}^{K} T_k). Eq. (4) might look complex, but it is just a simple multiply-add of probabilities, straightforward to calculate from the training dataset. It is a simple calculation for a machine but difficult for a human just looking at a code snippet. We conducted the Bayes classifier experiment with τ = 32 (best out of a number of tested τ values), and obtained the results shown in Table 6.
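A token-based Bayes classifier of this kind can be sketched in a few lines. The sketch makes several assumptions that are not fixed by the description above: P(t_k | X) is estimated as the fraction of training documents of origin X that contain t_k, the absolute frequency n(t_k | X) is the corresponding document count, the priors are taken from the class proportions, and the products are evaluated in log space for numerical stability. All names are hypothetical.

```python
import math
from collections import Counter


def train_token_model(documents, tau=1):
    """documents: list of (label, tokens) with label "H" or "G"; both labels must occur."""
    doc_counts = Counter(label for label, _ in documents)
    token_counts = {"H": Counter(), "G": Counter()}
    for label, tokens in documents:
        token_counts[label].update(set(tokens))  # count each token once per document
    # Keep only tokens that reach the threshold tau in both classes.
    vocabulary = {
        t for t in token_counts["H"]
        if token_counts["H"][t] >= tau and token_counts["G"][t] >= tau
    }
    likelihood = {
        label: {t: token_counts[label][t] / doc_counts[label] for t in vocabulary}
        for label in ("H", "G")
    }
    total = sum(doc_counts.values())
    prior = {label: doc_counts[label] / total for label in ("H", "G")}
    return vocabulary, likelihood, prior


def posterior_gpt(tokens, vocabulary, likelihood, prior):
    """Posterior probability of GPT origin given the vocabulary tokens present in a document."""
    present = set(tokens) & vocabulary
    log_h = math.log(prior["H"]) + sum(math.log(likelihood["H"][t]) for t in present)
    log_g = math.log(prior["G"]) + sum(math.log(likelihood["G"][t]) for t in present)
    shift = max(log_h, log_g)  # subtract the maximum before exponentiating
    exp_h, exp_g = math.exp(log_h - shift), math.exp(log_g - shift)
    return exp_g / (exp_h + exp_g)


# Toy training set (assumption): token sets of two human and two GPT documents.
docs = [
    ("H", {"print", "input", "x"}),
    ("H", {"print", "x", "tmp"}),
    ("G", {"print", "def", "result"}),
    ("G", {"print", "def", "x"}),
]
vocabulary, likelihood, prior = train_token_model(docs, tau=1)
p_gpt = posterior_gpt({"print", "x", "unknown"}, vocabulary, likelihood, prior)
```

A document is classified as GPT-generated if the returned posterior exceeds 0.5. Tokens outside the vocabulary, such as "unknown" in the example, are ignored.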
It is astounding that such a simple ML model can achieve results comparable to the best human-designed-features (white-box) models (unformatted, Table 3) or the average of the GMM models (formatted, Table 4). On the other hand, the Bayes calculation is built upon a large number of features (about 1366 tokens in the training token set T and on average 144 tokens per tested document), which makes it clear that an (untrained) human cannot perform such a calculation by just looking at the code snippets. This is notwithstanding the possibility that a trained human could somehow learn an equivalent complex pattern matching that associates the presence or absence of specific tokens with the probability of origin.

Discussion
In this section, we will explore the strengths and weaknesses of the methodologies employed in this paper, assess the practical implications of our findings, and consider how emerging technologies and approaches could further advance this field of study. Additionally, we compare our results with those of other researchers attempting to detect the origin of code.

Strengths and Weaknesses
Large code dataset: To the best of our knowledge, this study is currently the largest in terms of coding problems collected and solved by both humans and AI (see Table 1, 3.14 × 10^4 samples in total). The richness of the training data is probably responsible for the high accuracy and recall of around 98% that we achieve with some of the SL models.

Train-test-split along problem instances: A potentially important finding from our research is that we initially reached a somewhat higher accuracy (+4 percentage points for most models, not shown in the tables above) with a random-sample split. But this split method is flawed if problems have many human solutions or many GPT solutions, because the test set may contain problems already seen in the training set. A split along problem instances, as we do in our final experiments, ensures that each test case comes from a so-far unseen problem, and we recommend this choice to other researchers as well. It yields a lower accuracy, but this is the more realistic accuracy we expect to see on genuinely new problem instances.
Formatting: Somewhat surprisingly, after subjecting all samples to the Black code formatter tool [58], our classification models still yielded satisfactory outcomes, degrading only slightly in performance. This shows a remarkable robustness. However, it is essential to note that the Black formatter represents just one among several code formatting tools available, each with its distinct style and rules. Utilizing a different formatter such as AutoPEP8 [64] could potentially introduce variability in the formatting of code snippets, impacting the differentiation capability of the models. Examining the robustness of our classification models against diverse formatting styles remains an area warranting further exploration. Additionally, adapting models to various formatters could enhance their generalization ability, ensuring consistent performance across different coding styles.

White-box vs. black-box features: Our experiments on unformatted code have shown that the human-designed white-box features of Table 2 can achieve a good accuracy level above 80% with most ML models. However, our exceptionally good results of 92-98% are only achieved with black-box embedding features. This is relatively independent of the ML model selected; what is more important is the type of input features.
Feature selection: Although the human-designed features (white-box) were quite successful for both the original and formatted code snippets, there is still a need for features that lead to higher accuracy. The exemplary performance of the embedding features (black-box) indicates the existence of features with higher discriminatory power, thereby necessitating a more profound analysis of the embedding space.
Bayes Classifier: The Bayes classifier introduced in Section 5.6.2 shows another possibility to generate a rich and interpretable feature set. The statistical properties that can be derived from the training set enable an explainable classifier that achieves an accuracy of almost 90%.
Test cases: Tested code, having undergone rigorous validation, offers a reliable and stable dataset, enhancing the model's accuracy and generalization by mitigating the risk of incorporating errors or anomalies. This reliability fosters a robust training environment, enabling the model to learn discernible patterns and characteristics intrinsic to human- and AI-generated code. However, focusing solely on tested code may limit the model's exposure to diverse and unconventional coding styles or structures, potentially narrowing its capability to distinguish untested, novel, or outlier instances. Integrating untested code could enrich the diversity and comprehensiveness of the training dataset, accommodating a broader spectrum of coding styles, nuances, and potential errors, thereby enhancing the model's versatility and resilience in varied scenarios. In future work, exploring the trade-off between the reliability of tested code and the diversity of untested code might be beneficial to optimize the balance between model accuracy and adaptability across various coding scenarios.

Code generators: It is important to note that only AI examples generated with OpenAI's gpt-3.5-turbo API were used in this experiment. This model is probably the one most frequently used by students, rendering it highly suitable for the objectives of this research paper. However, it is not the most capable of the GPT series. Consequently, the integration of additional models such as gpt-4 [22] or T5+ [65], which have demonstrated superior performance in coding-related tasks, can serve to not only enhance the utilization of the available data but also introduce increased variability. It is an open research question whether one classification model can disentangle several code generators and human code or whether separate models for each code generator are needed. Investigating whether distinct models formulate their unique distributions or align with a generalized AI distribution would be intriguing. This exploration could provide pivotal insights into the heterogeneity or homogeneity of AI-generated code, thereby contributing to the refinement of methodologies employed in differentiating between human- and AI-generated instances.
Programming languages: Moreover, it is essential to underscore that this experiment exclusively encompassed Python code. However, our approach remains programming-language-agnostic: Given a code dataset in another programming language, the same methods shown here for Python could be used to extract features or to embed the code snippets (token-wise or as a whole) in an embedding space. We suspect that, given a similar dataset, the performance of classifiers built in such a way would be comparable to the ones presented in this article.

Publicly available dataset and trained models: To support further research on more powerful models or explainability in detecting AI-generated code, we made the pre-processed dataset publicly available in our repository https://github.com/MarcOedingen/ChatGPT-Code-Detection. The dataset can be downloaded via a link from there. Moreover, we offer several trained models and a demo version of the XGB model using the TF-IDF embedding. This demonstration serves as a counterpart to public AI-text detectors, allowing for the rapid online classification of code snippets.

Comparison with Other Approaches to Detect the Source of Code
Two of the works mentioned in Section 2, namely Hoq et al. [17] and Yang et al. [18], are striving for the same goal as our paper: the detection of the source of code. Here, we compare their results with ours.
Hoq et al. [17] approach the source-of-code detection with two ML models (SVM, XGB) and with two DL models (code2vec, ASTNN). They find that all models deliver quite similar accuracies in the range of 90-95%. The drawback is, however, that they have a dataset with only 10 coding problems, for each of which they generate 300 solutions. They describe it as "limiting the variety of code structures and syntax that ChatGPT would produce". Moreover, given this small number of problems, a potential flaw is that a purely random train-test-split will have a high probability that each test problem is also represented in the training set (overfitting). In our approach with the larger code dataset, we perform the train-test-split in such a way that no test problem occurs in the training set. This somewhat tougher task may be the reason for seeing a larger gap between simple ML models and more complex DL-embedding models than is reported in [17].
Yang et al. [18] pursue the ambitious task of zero-shot classification, i.e., predicting the source of code without training. Even more ambitious, they use three different advanced LLMs as code generators (and not only ChatGPT3.5 as we do). Their number of code samples for testing is 267, which is pretty low. Nothing is said about the number of AI-generated samples. Their specific method, sketched in Section 2, is shown in [18] to be much better than text detectors applied to the code detection task. However, without specific training, their TPR (= recall) of 20-60% is much lower than the recall of 98% we achieve with the best of our trained models.

Conclusion
This research aimed to find a classification model capable of differentiating AI- and human-written code samples. To make such a model feasible in an application, emphasis was also put on explainability. After thoroughly examining existing research on detecting AI-generated text samples, we transferred the acquired knowledge to the novel challenge of identifying AI-generated code samples.
We experimented with a variety of feature sets and a large number of ML models, including DNNs and GMMs. It turned out that the choice of the input feature set was more important than the model used. The best combinations are DNN + ADA with 97.8% accuracy and XGB + TF-IDF with 98.3% accuracy. The accuracies with human-designed, low-dimensional feature sets are 10-15 percentage points lower.
Within the context of our experimental framework, and using the methods and evaluation techniques described in this paper, we have confirmed our hypotheses H_1 and H_2, which postulate that AI-generated and human coding styles are distinct, regardless of formatting. The interpretability of our approach was improved by a Bayes classifier, which made it possible to highlight individual tokens and provide a more differentiated understanding of the decision-making process.
One notable outcome of this experiment is the acquisition of a substantial dataset of Python code examples created by humans and AI, originating from various online coding task sources. This dataset serves as a valuable resource for conducting in-depth investigations into the fundamental structural characteristics of AI-generated code, and it facilitates a comparative analysis between AI-generated and human-generated code, highlighting distinctions and similarities.
Since we only experimented with Python code, further research could investigate the coding style of AI in other programming languages. Moreover, AI code samples were only created with OpenAI's gpt-3.5-turbo API. To make a more general statement about the coding style of language models, further research on other models should be conducted. This study, one of the first of its kind, can be used as a foundation for further research to understand how language models write code and how it differs from human-written code.
In light of the rapid evolution and remarkable capabilities of recent language models, which, while designed to benefit society, also harbor the potential for malicious use, developers and regulators must implement stringent guidelines and monitoring mechanisms to mitigate risks and ensure ethical usage. To effectively mitigate the potential misuse of these advances, the continued development of detection applications, informed by research such as that presented in this paper, remains indispensable.

Figure 1. Flowchart of our Code Detection Methodology

Figure 2. Example of a row in the dataset

Figure 3. Distribution of the code length (number of tokens according to cl100k_base encoding) across the unformatted and formatted dataset. Values larger than the 99% quantile were removed to avoid a distorted picture.

Figure 4. Cosine similarity between all human and GPT code samples embedded using ADA and TF-IDF, both formatted and unformatted.

Figure 5. Box plot of human-designed features for formatted and unformatted dataset.

Figure 6. Distribution of the F1 Score of all considered algorithms in different conditions. In both ADA cases, the single outlier is CART.

Figure 8. Kernel density estimates of the probabilities predicted by each considered classifier in the unformatted test set. The plot contains one line per run.

Figure 10. Kernel density estimates of the probabilities predicted by each considered classifier in the formatted test set. The plot contains one line per run.

Figure 11. Performance distribution of study participants

Table 1. Code datasets overview: n_PROBLEMS - number of distinct problems, n̄_SOLUTIONS - average number of human solutions per problem, and n_SAMPLES - total human samples per data source.

Table 2. Detailed description of all human-designed features.
Over the two datasets, Figure 5 demonstrates an alignment of the feature value distributions for humans and ChatGPT after formatting, especially evident in features such as n_LWL and n_TW, which show nearly identical values post-formatting.

Table 3. Final results. Shown are the mean µ ± standard deviation σ from 10 independent runs on unformatted data; each run corresponds to a different seed, i.e., different distribution of training and test samples. Boldface: column maximum of µ in the 'Human-designed (white-box)' box and for each individual embedding model in the 'Embedding (black-box)' box.

Table 4. Same as Table 3 but for formatted data.

Table 5. Results of human performance on the classification task.

Figure 12. Top 40 tokens with the largest absolute discrepancies in their log probabilities (which corresponds to the largest ratio of probabilities).

Table 6. Results of Bayes classifier. Mean and standard deviation from 10 runs with 10 different training-test-set separations (problem-wise).