Syntax-Aware On-the-Fly Code Completion

Code completion aims to improve developers' productivity by suggesting the next code tokens from a given context. Various approaches have been proposed to incorporate abstract syntax tree (AST) information for model training, ensuring that code completion is aware of the syntax of the programming language. However, existing syntax-aware code completion approaches are not on-the-fly: we found that for two-thirds of the characters that developers type, the AST cannot be extracted because AST parsing requires syntactically correct source code, limiting its practicality in real-world scenarios. On the other hand, existing on-the-fly code completion approaches do not yet consider syntactic information. In this paper, we propose PyCoder to leverage token types, a kind of lightweight syntactic information that is readily available and aligns with the natural order of source code. PyCoder is trained in a multi-task manner so that, by learning the supporting task of predicting token types during the training phase, the model achieves better performance on predicting tokens and lines of code without needing token types in the inference phase. Comprehensive experiments show that PyCoder achieves the first rank on the CodeXGLUE leaderboard with an accuracy of 77.12% for token-level predictions, which is 0.43%-24.25% more accurate than the baselines. In addition, PyCoder achieves an exact match of 43.37% for line-level predictions, which is 3.63%-84.73% more accurate than the baselines. These results lead us to conclude that token type information (an alternative to syntactic information), which has rarely been used in the past, can greatly improve the performance of code completion approaches without requiring syntactically correct source code as AST-based approaches do. PyCoder is publicly available on HuggingFace and GitHub.


INTRODUCTION
Code completion, or AutoCompletion, is one of the most essential features in modern Integrated Development Environments (IDEs) (e.g., GitHub's Copilot, IntelliSense in Visual Studio Code [1]). The goal of code completion is to automatically recommend source code based on a given context, which could help developers reduce the amount of typing and coding iteration time and reduce the number of typos. A recent study conducted by Google found that the current code completion feature could reduce developers' effort by 6% and context switching by 7% [2].
Recent code completion approaches often leverage modern deep learning architectures (e.g., Recurrent Neural Networks, the Transformer architecture) to exploit their strong representation power. More specifically, state-of-the-art code completion models (e.g., CodeGPT [3], GPT-2 [4], GPT-C [5], TravTrans [6], CodeFill [7]) are based on code-focused large language models (LLMs) that are trained on large codebases and natural language corpora (e.g., the CodeSearchNet corpus with 2 million GitHub repositories). These LLMs are fine-tuned on a specific dataset to perform specific tasks (e.g., code completion). However, existing code completion approaches have the following limitations.
Limitation 1: On-the-fly code completion approaches do not consider syntactic information. On-the-fly code completion approaches are designed to generate code tokens based on a given context without requiring the completeness of the prior context. Representative techniques include GPT-2 [4], a Transformer-based decoder model for generative tasks pre-trained on English webpage datasets; CodeGPT [3], a GPT-2 model architecture pre-trained on source code datasets; and GPT-C [5], a GPT-2 model architecture pre-trained on multi-language source code. In their pre-training, these models learn only to complete the next code tokens. However, the performance of these on-the-fly code completion approaches is limited by their lack of consideration of syntactic information.
Limitation 2: Existing syntax-aware code completion approaches are not on-the-fly. To ensure that the generated source code is syntactically correct [8], researchers proposed to leverage Abstract Syntax Tree (AST) information [9], [10], [6], [11], [7], [12]. For example, Kim et al. [6] proposed TravTrans, a Transformer-based architecture consuming syntactic information from a variety of representations of AST traversals; Izadi et al. [7] proposed CodeFill, a multi-task, Transformer-based architecture consuming source code and AST types. While existing AST-based code completion approaches may generate code that is more syntactically correct, their application scenarios remain limited. In particular, the existing AST-based code completion approaches [9], [10], [6], [11], [7], [12] require the source code to be complete (i.e., all the previous tokens are valid and parsable) at inference time so that the AST information can be obtained from the source code. However, our motivating analysis found that in practice, for two-thirds of the characters typed, the source code is incomplete and not parsable (e.g., it contains syntax errors), making the existing AST-based code completion approaches inapplicable in real-world scenarios.
In this paper, we propose PyCoder, an automated code completion approach that can generate source code at any time regardless of the completeness of the source code, i.e., syntax-aware on-the-fly code completion. Our approach is designed to consider the syntactic information of the source code during the learning phase, but does not require syntactic information during the inference phase. Instead of using the AST information as in previous works [6], [7], [9], [10], [11], [12], we propose to leverage the token type information (e.g., String, Number, Name, Keyword), which is readily available, lightweight syntactic information that does not require the completeness of the source code. During the learning process, we design our approach to carry out two prediction tasks, i.e., the token prediction task and the type prediction task. To ensure that our model captures both syntactic and semantic information during the training process, we leverage Multi-Task Training (MTT) techniques to learn both the token prediction task and the type prediction task. Given a sequence of code tokens, our approach performs the following steps: (1) extract the token type information of each token, (2) perform sub-word tokenization on each token, (3) align the token type data with the sub-word source code data, and (4) build a code completion model using a GPT-2 architecture based on a pre-trained CodeGPT language model with a multi-task training technique.
arXiv:2211.04673v1 [cs.SE] 9 Nov 2022
In our experiment, we compare our PyCoder with four existing state-of-the-art models (i.e., Pointer Mixture Network [9], TravTrans [6], GPT-2 [4], and CodeGPT [3]). During the inference phase, we evaluate our approach based on the token-level and line-level prediction tasks. Through an extensive evaluation on the PY150 [13] standard benchmark Python dataset for the code completion task that is used in Microsoft's CodeXGLUE benchmark [3], we answer the following research questions:

RQ1)
What is the performance of our PyCoder for the token-level and line-level code completion tasks when compared to state-of-the-art models? Results. PyCoder achieves the first rank on the CodeXGLUE leaderboard with an accuracy of 77.12% for the token-level predictions, which is 0.43%-24.25% more accurate than baselines. In addition, PyCoder achieves an exact match of 43.37% for the line-level predictions, which is 3.63%-84.73% more accurate than baselines.

RQ2)
What is the impact of the training strategies on the performance of our PyCoder? Results. Multi-task training strategies have an impact on PyCoder for both token-level and line-level predictions. We find that PyCoder-Hard performs best, followed by PyCoder-IFT and PyCoder-Soft.

RQ3)
What is the impact of the task weighting parameters in multi-task learning on the performance of our PyCoder? Results. PyCoder is generally robust to the task weighting parameters, achieving comparable (without task weighting) or better (with task weighting) performance when compared to the baselines.

RQ4)
What is the impact of the decoding methods on the performance of our PyCoder? Results. Decoding methods have an impact on the performance of PyCoder with an exact match varying from 33.80% to 41.52% for line-level predictions. Beam Search performs best, while Sampling performs worst.

Novelty.
The key novelty of our work is as follows:
• PyCoder is the first to leverage the token type information for code completion with a variety of multi-task training techniques.
• PyCoder is the first to extensively explore the sensitivity of the task weighting parameter and decoding methods in code completion.
• PyCoder surpasses four state-of-the-art code completion techniques and the CodeXGLUE Benchmark, achieving the highest performance in both token-level and line-level predictions.
• PyCoder is publicly available on HuggingFace (https://huggingface.co/Wannita/PyCoder) together with the token type dataset (https://huggingface.co/datasets/Wannita/PyCoder).
Paper Organization. The paper is organized as follows. Section 2 describes the background, motivation, and limitations of the state-of-the-art approaches. Section 3 presents our PyCoder approach. Section 4 describes the experimental setup and state-of-the-art baselines. Section 5 presents the experimental results and discussions. Section 6 discusses the results of our PyCoder. Section 7 discloses the threats to validity. Section 8 draws the conclusion.

RELATED WORK AND MOTIVATION
In this section, we discuss related work about automated code completion to situate the problems and present a motivating analysis.

Code Completion
Code completion is the task of suggesting the next code token from a given context. More formally, given a sequence of m tokens $x_1 \ldots x_m$ as a context, code completion aims to predict the next n tokens to complete a sentence $x_1 \ldots x_{m+n}$. The learning objective of a language model for code completion is to model the conditional probability distribution given by the following function:
$$P(x_{1:m+n}) = \prod_{i=1}^{m+n} P(x_i \mid x_1, \ldots, x_{i-1})$$
Statistical language models. Previously, several studies proposed code completion approaches using various types of techniques (e.g., heuristic, statistical, and deep learning). Heuristic-based approaches aim to recommend source code based on rules [14], program history [15], and code examples [16]. However, heuristic-based approaches rely heavily on rules and patterns that researchers need to develop, which is time-consuming and expensive. Therefore, statistical language models have been proposed to automatically learn the naturalness of source code based on the probability of occurrence of source code. For example, Hindle et al. [17], [18] argued that source code is natural and repetitive (similar to natural language) and found that an n-gram approach can accurately predict the next code token based on a given context. Raychev et al. [13] proposed TGEN, a probabilistic learning approach with decision tree structures. However, statistical language models can learn only from a limited number of n consecutive tokens (according to the n-gram algorithm), which does not reflect the nature of source code, which is usually long (i.e., it has long-term dependencies).

LSTM-based language models.
To address the limitation of the statistical language models, Long Short-Term Memory (LSTM)-based deep learning approaches have been applied to the code completion task. However, existing LSTM-based language models can only learn the semantic information of the source code, without considering its syntactic structure. Thus, to ensure that LSTM-based code completion models recognize the syntactic information, the Abstract Syntax Tree (AST) has been widely used in previous work. For example, Li et al. [9] proposed Pointer Mixture Networks, an LSTM-based architecture for predicting the AST node. Similarly, Svyatkovskiy et al. [10] proposed Pythia, an LSTM-based approach that incorporates AST information through the Word2Vec embedding approach. While such RNN-based and LSTM-based approaches are able to handle longer sequences of source code than statistical language models, they remain inaccurate due to the sequential nature of their source code processing, their limited ability to capture long-term dependencies, and their limited ability to recognize the importance of different code tokens.
Transformer-based language models. To address the limitations of LSTM-based language models, the Transformer architecture was introduced for the code completion task. Generally, the development of Transformer-based language models consists of two steps: pre-training and fine-tuning. Pre-training is a process to train a Transformer-based language model in a self-supervised manner (i.e., without labels), allowing the language model to learn from the given data (i.e., natural language or programming languages) by itself. Normally, the language models for code completion are trained using a Causal Language Modeling (CLM) objective (i.e., predicting the unknown token after a sequence of known tokens). Once a language model is pre-trained, the model is then fine-tuned on a specific dataset (e.g., PY150 [13]) with the same learning objective as the pre-training process (i.e., CLM). For example, Lu et al. [3] proposed CodeGPT-based models, which are based on a GPT-2 architecture [4] pre-trained on a Natural Language (NL) corpus (i.e., WebText) and/or a Programming Language (PL) corpus (i.e., CodeSearchNet): PL only for CodeGPT, and NL+PL for CodeGPT-adapt.
To ensure that the Transformer-based language models recognize the syntactic structure of source code, Kim et al. proposed TravTrans [6], a vanilla Transformer-based language model that incorporates AST information through different encoding styles. Similarly, Wang et al. [19] leverage AST information with a vanilla Transformer-based language model, but using a different AST encoding technique (i.e., flattening the AST nodes). However, these AST-based code completion approaches also leverage AST information at the inference phase, which requires the source code to be complete at inference time so that the AST information can be parsed and obtained from the source code. In practice, source code is often incomplete and not compilable (e.g., it contains syntax errors), making the existing AST-based code completion approaches inapplicable in real-world scenarios.
Fig. 1: The comparison between AST and Token Type representations and the ideal deployment scenarios.

A Motivating Example
Let's consider the code snippet logging.getLogger() as an example (see Figure 1). logging. is the input code token, while getLogger() is the code token to be predicted. Below, we illustrate two key limitations of the AST-based code completion approach, using TravTrans [6] as an example, which make the existing AST-based code completion unable to predict the next code tokens on-the-fly. First, the learning objective of TravTrans does not reflect the natural order of typing source code sequences. Because the source code is represented as a sequence of AST nodes obtained by traversing the AST, the order of the node sequence is inconsistent with the order of the token sequence [12]. For example, at the learning phase, TravTrans [6] represents the input code tokens as a sequence of AST nodes (i.e., [AttributeLoad, NameLoad, logging, Attr]) in order to predict the next AST node (i.e., [getLogger]). However, this learning objective does not mimic the natural sequence of code tokens (i.e., [logging, ., getLogger, (, )]), meaning that the programming language-specific characters (e.g., dot [.] and parentheses [(, )]) are ignored. Therefore, in many deployment scenarios, such AST node information needs to be post-processed in order to successfully perform code completion in practice (e.g., adding the missing tokens [(, )], converting [Attr] to [.]).
Second, in order to use AST information as an input, TravTrans [6] requires the source code to be complete at inference time so that the AST information can be parsed and obtained from the source code. For example, in Figure 1, if developers type logging., TravTrans can successfully recommend the next token (e.g., getLogger). However, source code is often incomplete and not compilable. For example, in Figure 1, if developers type logging.get, TravTrans cannot correctly recommend the next token, due to the syntax errors raised during the AST parsing step.

A Motivating Analysis
To demonstrate the significance of the problem of the AST-based code completion approaches, we perform a motivating analysis to investigate how often AST information can be provided at the inference phase, i.e., how often AST-based code completion can actually be executed at the inference phase.
Let's assume that a developer is typing a Python program character by character. We aim to analyze how often an AST parser can or cannot successfully parse the Python program at each character. To do so, we select a statistically representative sample of 383 syntactically correct Python files from the PY150 dataset (with a confidence level of 95% and a confidence interval of 5%).¹ Since we simulate the application of AST-based code completion at the character level, we execute a Python AST parser² at each character incrementally. In total, we execute the Python AST parser 1,263,296 times, once for each of the 1,263,296 characters. We find that 33.96% of the executions parse successfully, while 66.04% of the executions fail to parse due to syntax errors.
Finding: For every two out of three characters that developers type, AST-based code completion cannot be performed at all due to the failed execution of the Python AST parser, limiting its ability to perform code completion on-the-fly at inference time.
Since existing syntax-aware code completion is not on-the-fly and existing on-the-fly code completion is not syntax-aware, this paper aims to address these significant gaps by proposing a syntax-aware on-the-fly Python code completion approach.
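The character-by-character simulation described above can be approximated in a few lines. The following is a minimal sketch rather than the script used for the analysis: it assumes Python's built-in `ast.parse` as the parser, and the helper names `parses` and `parse_success_rate` are illustrative.

```python
import ast


def parses(prefix: str) -> bool:
    """Return True if the AST parser accepts this prefix of a program."""
    try:
        ast.parse(prefix)
        return True
    except (SyntaxError, ValueError):
        # SyntaxError covers incomplete or invalid code;
        # ValueError covers inputs such as embedded null bytes.
        return False


def parse_success_rate(source: str) -> float:
    """Fraction of character-level prefixes of `source` that can be parsed."""
    if not source:
        return 0.0
    ok = sum(parses(source[:end]) for end in range(1, len(source) + 1))
    return ok / len(source)


if __name__ == "__main__":
    snippet = "import logging\nlog = logging.getLogger()\n"
    print(f"parsable prefixes: {parse_success_rate(snippet):.2%}")
```

Running the function over a corpus of files and averaging over all characters would give a figure comparable in spirit to the 33.96% reported above, although the exact value depends on the parser version and the files sampled.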

SYNTAX-AWARE ON-THE-FLY CODE COMPLETION
In this section, we present an overview of our syntax-aware on-the-fly Python code completion approach (PyCoder). Conceptually, PyCoder aims to generate source code at any time regardless of the completeness of the source code, considering both the syntactic and semantic information of the source code during the learning phase without requiring syntactic information during the inference phase. To ensure that the learning process considers both semantic and syntactic information, we design our approach to focus on two prediction tasks, i.e., the code token prediction task and the token type prediction task. In particular, we leverage a Multi-Task Training (MTT) technique to cooperatively learn both the code token prediction task (Task 1: Predict the next code token, considered as the Target Task) and the token type prediction task (Task 2: Predict its token type, considered as the Supporting Task). For the type prediction task, we propose to leverage the standard Python token type information (e.g., String, Number, Name, Keyword), which is readily available and lightweight, instead of the AST information [6], [7], [9], [10], [11], [12], which we found to be unavailable for two-thirds of the parser executions (see our finding in Section 2.3), limiting its ability to support on-the-fly code completion. In contrast, our PyCoder does not require syntactic information at the inference phase. Thus, the completeness of the source code at inference time is not required.
1. https://www.surveysystem.com/sscalc.htm
2. https://docs.python.org/3/library/ast.html
Overview. Figure 2 presents the overview of our PyCoder, which consists of two phases: training and inference. During the training phase, PyCoder performs six main steps: Step 1 Type Extraction, to extract the token type information from the source code; Step 2 Tokenization, to perform subword tokenization on the source code; Step 3 Data Alignment, to align the type information, which is at the word level, with the code information, which is at the subword level; Step 4 Multi-task Training Architecture, with three training techniques: hard parameter sharing (MTL), soft parameter sharing (MTL), and intermediate fine-tuning (IFT); and Step 5 Hyperparameter Task Weighting and Step 6 Decoding Methods, which are exploration steps to maximize the performance. For the inference phase, Step 7 Code Generation describes the details of token-level prediction and line-level prediction.

(Step 1) Type Extraction
Syntactic information can be represented in many forms, e.g., Abstract Syntax Tree (AST) which is widely used in the previous work, and Token Type information which remains largely unexplored. In fact, both AST and token type information have their own advantages and disadvantages. While AST provides a formal representation of syntactic information of source code, it requires syntactically correct source code in order to be successfully parsed by a Python AST parser. Since our finding in Section 2.3 shows that the Python AST parser failed to execute for every two out of three characters that developers type, the usage scenarios of the existing AST-based code completion approach are still limited in practice.
To address this challenge, we leverage standard Python token type information, which offers a more abstract representation of the syntactic structure of source code (e.g., Name, String, Number) and which (1) is more lightweight, (2) follows the natural order of code sequences, and (3) can be obtained at any time without requiring complete and syntactically correct source code. Generally, a standard Python token consists of two pieces of information, i.e., (1) the token type, which provides syntactic meaning, and (2) the token value, which provides semantic meaning. For example, given a logging token, the token type is NAME and its value is logging. Since the token type information is not available in the existing code completion benchmark, we describe the steps to extract the type information below.

[Figure 2: overview of PyCoder. Recoverable panel labels: Code Completion, Data Alignment, Inference Phase, Source code data.]
[...] <GREATER>, <EQUAL>, <DOT>. Then, we perform the following pre-processing steps.
• First, we discard the following three token types that will not be executed, i.e., <ENCODING>, which describes the encoding of the Python file; <ENDMARKER>, which describes the end position of the Python file; and <COMMENT>, which describes a code comment in the Python file.
• Second, a <NAME> provided by the Python tokenizer could be either an identifier name (e.g., logging) or a Python reserved name (e.g., True). Thus, a code completion approach may not be able to recognize the difference between identifier names and Python reserved names, which does not reflect reality. To ensure that our code completion approach can recognize the difference between the different types of names, we use the keyword.iskeyword() function⁴ to check and rename all of the Python reserved words that were originally extracted as <NAME> to <KEYWORDS>.
With this approach, the representation of the token types (i.e., each token has its own type) follows the natural order of source code rather than the AST structure, which addresses the limitations of the AST-based code completion approaches. As shown in Figure 2, logging.getLogger() will be tokenized as [logging, ., getLogger, (, )] with the following token types: [NAME, DOT, NAME, LPAR, RPAR].
4. https://docs.python.org/3/library/keyword.html
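The two pre-processing steps above can be illustrated with Python's standard library. This is a minimal sketch under stated assumptions, not the authors' extraction script: it assumes the standard `tokenize` module as the Python tokenizer, uses `exact_type` to obtain operator-specific names such as DOT and LPAR, and the function name `extract_token_types` is illustrative. Error handling for inputs the tokenizer rejects is omitted.

```python
import io
import keyword
import token
import tokenize

# Token types discarded in the first pre-processing step.
DISCARDED = {"ENCODING", "ENDMARKER", "COMMENT"}


def extract_token_types(source: str):
    """Return (token value, token type) pairs in source order."""
    pairs = []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        # exact_type distinguishes operators (DOT, LPAR, ...) from generic OP.
        type_name = token.tok_name[tok.exact_type]
        if type_name in DISCARDED:
            continue
        # Second step: reserved words arrive as NAME and are renamed.
        if type_name == "NAME" and keyword.iskeyword(tok.string):
            type_name = "KEYWORDS"
        pairs.append((tok.string, type_name))
    return pairs


if __name__ == "__main__":
    # The tokenizer also emits layout tokens such as NEWLINE, which this
    # sketch keeps; only the three types listed above are dropped.
    for value, type_name in extract_token_types("logging.getLogger()\n"):
        print(f"{value!r:14} <{type_name}>")
```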

(Step 2) Tokenization
Tokenization is an important step in automated code completion, aiming to split the source code into meaningful units. There are three general levels of granularity, i.e., the word level, the subword level, and the character level. While the word-level representation is the simplest tokenization approach, it may produce a massive vocabulary. Limiting the vocabulary size based on word frequency, however, may cause an Out-of-Vocabulary (OOV) problem. While the character-level representation can diminish the OOV problem with a limited vocabulary size (e.g., English characters), models may not be able to handle the resulting excessively long sequences of source code (i.e., each character has its own vector). Instead, we use sub-word tokenization with the Byte-Pair Encoding (BPE) algorithm [20], as prior studies found that BPE can substantially reduce the vocabulary size [21], [22], while being able to generate new identifiers that never appear in the dataset [23]. First, BPE splits the source code into characters. Then, BPE iteratively merges the characters into subwords based on the frequency of their occurrences, building the vocabulary until it reaches the desired size. In this paper, we use the CodeGPT tokenizer, which has a vocabulary size of 50,000 subwords. To ensure that the CodeGPT tokenizer can recognize the token types, we represent the token types in the angle-bracket form <...>, and include them in the special token vocabulary of the BPE tokenizer to avoid any subword tokenization of these token types.
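The merge procedure described above can be shown on a toy corpus. The sketch below is a simplified, character-level illustration of BPE and is not the CodeGPT tokenizer; the function names `learn_bpe`, `merge_pair`, and `apply_bpe` and the tie-breaking rule are assumptions made for this example.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` with the merged symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def learn_bpe(words, num_merges):
    """Learn up to `num_merges` merges from a list of words."""
    vocab = Counter(tuple(word) for word in words)  # start from characters
    merges = []
    for _ in range(num_merges):
        pair_counts = Counter()
        for symbols, freq in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break
        # Most frequent pair; ties broken lexicographically for determinism.
        best = max(pair_counts, key=lambda p: (pair_counts[p], p))
        merges.append(best)
        merged_vocab = Counter()
        for symbols, freq in vocab.items():
            merged_vocab[merge_pair(symbols, best)] += freq
        vocab = merged_vocab
    return merges


def apply_bpe(word, merges):
    """Split a word into subwords by replaying the learned merges in order."""
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


if __name__ == "__main__":
    merges = learn_bpe(["logging", "logger", "log", "getLogger"], num_merges=5)
    print(merges)
    print(apply_bpe("logging", merges))
```

Because unseen words are still decomposed into known characters and subwords, this scheme avoids the OOV problem discussed above while keeping the vocabulary bounded by the number of merges.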

(Step 3) Data Alignment
Data alignment is an important step to ensure that the sequence of code tokens and their corresponding token types are correctly matched and aligned. With the use of BPE, some words may be tokenized as subwords, while their type is not tokenized into the subword level, making the sequence of code tokens and the corresponding token types not correctly matched. For example, as shown in Figure 2, BPE splits logging into [logg, ing] with a single corresponding <NAME> token type. To address this problem, we repeat the token type for any word that is split by BPE. Therefore, in Figure 2, the token type <NAME> is repeated twice in order to match the subword-level code sequence of [logg, ing]. This data alignment step will produce a sequence of code tokens and their corresponding token types with the same length, which is ready to be fed into our code completion approach to learn both syntactic and semantic meanings of source code.
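The alignment rule above (repeat a token's type once per subword) can be sketched as follows. The `toy_subword_tokenize` function is a hypothetical stand-in for the BPE tokenizer and only reproduces the logging example; `align_types` is an illustrative name.

```python
def toy_subword_tokenize(token):
    """Hypothetical stand-in for a BPE tokenizer (covers the example only)."""
    splits = {"logging": ["logg", "ing"]}
    return splits.get(token, [token])


def align_types(tokens, types, subword_tokenize=toy_subword_tokenize):
    """Repeat each word-level type so it matches the subword-level tokens."""
    sub_tokens, sub_types = [], []
    for tok, typ in zip(tokens, types):
        pieces = subword_tokenize(tok)
        sub_tokens.extend(pieces)
        sub_types.extend([typ] * len(pieces))  # one type per subword
    return sub_tokens, sub_types


if __name__ == "__main__":
    tokens = ["logging", ".", "getLogger", "(", ")"]
    types = ["<NAME>", "<DOT>", "<NAME>", "<LPAR>", "<RPAR>"]
    sub_tokens, sub_types = align_types(tokens, types)
    for piece, typ in zip(sub_tokens, sub_types):
        print(piece, typ)
```

After alignment the two sequences always have the same length, which is the property the training step relies on.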

(Step 4) Multi-Task Training Architectures
Our PyCoder leverages a Multi-Task Training (MTT) paradigm, which is a set of techniques designed to learn multiple tasks, allowing the model to capture multiple sources of information. Traditionally, deep learning is designed for one single learning objective (e.g., only predicting the next code token), limiting its ability to capture other important and useful sources of information (e.g., syntactic information of source code). Instead of training a model with one single learning objective, the MTT paradigm aims to provide a generalist model with multiple learning objectives, providing a more robust vector representation. For our PyCoder approach, we design the target task to predict the next token, while the supporting task (aka. an auxiliary task or additional related non-target task) is to predict the token type. In addition, we build three variants of PyCoder, with three different MTT techniques, according to two learning styles [24] as follows.

Multi-Task Learning (MTL)
Multi-Task Learning (MTL) is an MTT technique to learn multiple tasks simultaneously instead of learning them separately. Normally, during the learning process, the model aims to optimize a loss function for one single learning objective. With the MTL approaches, multiple loss functions are optimized together during the learning process, allowing the MTL-based model to simultaneously learn against multiple objectives and share the knowledge from multiple related sources. In this paper, we consider two main MTL approaches [25], i.e., Hard Parameter Sharing (PyCoder-Hard) and Soft Parameter Sharing (PyCoder-Soft). For Hard Parameter Sharing, the key principle is to train a code completion model against two learning objectives, where the loss functions of the two learning objectives ($\mathcal{L}_{code}$ and $\mathcal{L}_{type}$) are optimized together within the same model. Formally, the PyCoder-Hard model aims to minimize the following loss function:
$$\min_{\omega} \; \mathcal{L}_{code}(d_{code}, \omega) + \mathcal{L}_{type}(d_{type}, \omega)$$
where $d_{code}$ and $d_{type}$ denote the code token dataset and the token type dataset, respectively, and $\omega$ denotes the model's parameters. With Hard Parameter Sharing, the weights and model parameters are shared between tasks, allowing the model to explicitly learn the input representations between tasks (i.e., code and type vectors) that are closely related.
For Soft Parameter Sharing, the key principle is similar to Hard Parameter Sharing in that the goal is to train a code completion model with two learning objectives. However, instead of training one model against two tasks as in Hard Parameter Sharing, Soft Parameter Sharing trains an individual model for each task ($\mathcal{L}_{code}$ and $\mathcal{L}_{type}$), allowing each model to learn separately for its task. Therefore, each learning objective has an individual model (i.e., separate weights and parameters for each learning objective). To allow the models to share knowledge between tasks (i.e., to learn the similarities between the related parameters), a shared loss function $\mathcal{L}_{share}$ is also used, which is computed over the parameters of the two models, where $\omega_n$ denotes the model parameters of the learning objective $n$. Finally, the PyCoder-Soft model aims to minimize the following loss function:
$$\min_{\omega_{code},\, \omega_{type}} \; \mathcal{L}_{code}(d_{code}, \omega_{code}) + \mathcal{L}_{type}(d_{type}, \omega_{type}) + \mathcal{L}_{share}$$
With Soft Parameter Sharing, each learning objective has its own model parameters and weights, allowing the models to implicitly learn the input representations that might be more closely connected to a specific task.

Intermediate Fine-Tuning (IFT)
Intermediate Fine-Tuning (IFT) [24] adapts the transfer learning concept (i.e., pre-training then fine-tuning), where the goal is to learn multiple tasks sequentially. The model is first fine-tuned on the supporting task (token type prediction) and then on the target task (code token prediction). Thus, the fine-tuning step on the supporting task can be considered a second stage of model pre-training. In other words, the Intermediate Fine-Tuning model (PyCoder-IFT) is first trained on an intermediate self-supervised task (token type prediction) and then trained on the target task (code token prediction), allowing the model to gain knowledge of the token types prior to predicting the next code tokens.

GPT-2 Model Architecture
For all three variants of the MTT techniques (i.e., PyCoder-Hard, PyCoder-Soft, and PyCoder-IFT), we use the GPT-2 architecture as the base model. GPT-2 [4] is a decoder-only Transformer model. The GPT-2 architecture for code completion consists of three main components: the embedding layer, the decoder block, and the language model head. First, the embedding layer embeds the input tokens into vectors with positional encoding, allowing the model to learn the semantic meaning and the position of each code token. Then, the embedding vectors are fed into the decoder block, which contains the decoder layers. Each decoder layer includes masked self-attention layers, feed-forward neural network layers, and normalization layers. The masked self-attention layer indicates which tokens to focus on, while the masking prevents the attention mechanism [26] from seeing the future tokens. The feed-forward neural network layer is a sophisticated network with hidden nodes to capture the related information between each data point. The normalization layer makes the learning process more effective by enabling smoother gradients and better generalization. After L decoder layers, the output of the last layer is fed to the language model head, i.e., a linear layer, which converts the output to a vector whose dimension is the same as the vocabulary size. Lastly, the vector is converted to a probability distribution by the softmax activation function. Formally, to predict the next token $x_t$ based on a given input sequence, GPT-2 can be represented as follows:
$$h_0 = C W_e + W_p$$
$$h_l = \mathrm{decoder\_block}(h_{l-1}), \quad l = 1, \ldots, L$$
$$P(x_t) = \mathrm{softmax}(h_L W_e^{\top}), \quad t = 0, \ldots, N$$
where $W_e$ is the token embedding matrix, $C$ denotes the context vector of tokens, $W_p$ is the position embedding matrix, $L$ is the number of decoder layers, and $N$ is the length of the sequence. We follow traditional language models by maximizing the log-likelihood:
$$\mathcal{L}(X) = \sum_{i=1}^{N} \log P(x_i \mid x_1, \ldots, x_{i-1}; \omega)$$
where $\omega$ denotes the model parameters that are learned during the optimization process.
In particular, PyCoder uses CodeGPT [3], which is pre-trained on the CodeSearchNet dataset [27], as its starting checkpoint.
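To make the masking and the language model head more concrete, the following is a minimal pure-Python sketch, not the PyCoder or GPT-2 implementation. It assumes a single attention head without learned projections, and it assumes the language model head reuses the token embedding matrix; the function names and toy vectors are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def causal_attention_weights(queries, keys):
    """Return one attention distribution per position.

    Position t only attends to positions 0..t. Future positions are
    excluded before the softmax, so they receive a weight of exactly 0.
    """
    dim = len(queries[0])
    weights = []
    for t, query in enumerate(queries):
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in keys[: t + 1]
        ]
        weights.append(softmax(scores) + [0.0] * (len(keys) - t - 1))
    return weights


def lm_head(hidden, embedding_matrix):
    """Project a hidden vector onto the vocabulary and apply softmax.

    Assumption: the head shares its weights with the embedding matrix.
    """
    logits = [sum(h * w for h, w in zip(hidden, row)) for row in embedding_matrix]
    return softmax(logits)


# Hypothetical toy inputs: three positions, two-dimensional vectors.
toy_vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
attn = causal_attention_weights(toy_vectors, toy_vectors)
probs = lm_head([0.5, -0.5], toy_vectors)
```

In this sketch the first position can only attend to itself, and the output of `lm_head` is a probability distribution over a three-token toy vocabulary.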

(Step 5) Hyperparameter Task Weighting
Since PyCoder leverages multi-task learning (MTL) techniques to learn multiple different tasks simultaneously, some tasks may have a higher influence than others, which may later produce unsatisfactory accuracy for the other tasks (the so-called conflicting gradient problem). To prevent such conflicting gradients between tasks, it is important to find the task weights that minimize the loss. Therefore, we treat the task weights (α_i) as hyperparameters and optimize them for our architecture. Specifically, we aim to minimize the loss of the code prediction task together with the loss of the type prediction task, where each task loss is scaled by its task weight α_i.
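As an illustration of task weighting, the sketch below combines two task losses as a weighted sum. This form is an assumption made for illustration: the weighted sum, the normalization of a 1:9 (Type:Code) ratio to 0.1 and 0.9, and all names and probabilities are hypothetical rather than taken from the PyCoder implementation.

```python
import math


def cross_entropy(probs, target_index):
    """Negative log-likelihood of the target class under `probs`."""
    return -math.log(probs[target_index])


def weighted_multitask_loss(task_losses, task_weights):
    """Combine per-task losses as sum_i alpha_i * loss_i (assumed form)."""
    if task_losses.keys() != task_weights.keys():
        raise ValueError("every task needs exactly one weight")
    return sum(task_weights[name] * task_losses[name] for name in task_losses)


# Hypothetical predictions for one position: a 3-token vocabulary and 2 token types.
code_loss = cross_entropy([0.7, 0.2, 0.1], target_index=0)
type_loss = cross_entropy([0.5, 0.5], target_index=1)

# A Type:Code ratio of 1:9, normalized to sum to 1 (an assumption).
total_loss = weighted_multitask_loss(
    {"code": code_loss, "type": type_loss},
    {"code": 0.9, "type": 0.1},
)
```

With this weighting, the code prediction loss dominates the update while the type prediction loss still contributes a small signal.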

(Step 6) Decoding Methods
Decoding is the method used to select the next token from the vocabulary when generating a sequence. Although selecting only the most probable token is suitable for a single step, it might be sub-optimal for the sequence as a whole.
Since the search space of the next tokens is large, different decoding methods rely on different mechanisms and therefore provide different predictions of the next tokens. Thus, the choice of decoding method may have an impact on the overall performance of PyCoder. In the code completion literature, we found that Beam Search is one of the most commonly used decoding methods. However, Holtzman et al. [28] found that there exist other decoding methods that are widely used in the NLP area, yet remain largely unexplored in the code completion literature. Thus, we experiment with the following six decoding methods.
• Greedy selects the most probable token in the vocabulary as the next token. This method assumes that the model already outputs the best probability at every timestep.
• Beam Search applies a search algorithm that expands all possible tokens in the vocabulary and then keeps the top b (i.e., the beam size) most probable candidates to continue. Beam Search is one of the most commonly used decoding methods in text generation tasks [29], [30].
• Sampling randomly selects the next token from the actual probability distribution assigned by the model. Unlike Greedy and Beam Search, which always recommend the same next tokens for the same context, the sampling method may recommend different next tokens on different runs (i.e., it is non-deterministic).
• Sampling with Temperature applies a temperature parameter to shape the probability distribution [31], unlike the original sampling method, where the randomness is left uncontrolled. The temperature is used to increase the probability of the most probable next tokens while decreasing the probability of the others. We note that the probability of the least probable next tokens is only decreased; these tokens are not removed from the recommendation. The temperature value is usually in the range 0 < temp ≤ 1, where temp = 1 corresponds to normal sampling.
• Top-K Sampling truncates the probability distribution by keeping the top-k most probable next tokens from the vocabulary, then re-scales the distribution and samples from the new distribution. This method ensures that less probable next tokens are never generated, since only the top-k most probable next tokens are considered during the sampling process.
• Top-P Sampling (Nucleus Sampling) is similar to Top-K sampling in that it also truncates the probability distribution, but with a different criterion. Top-P sampling prunes the distribution to the tokens whose cumulative probability at the current step is ≥ p [28], then re-scales the distribution and samples from it. Formally, given the probability distribution P, the top-p vocabulary V^(p) is the smallest set of tokens such that

$$\sum_{x \in V^{(p)}} P(x \mid x_{1:t-1}) \geq p$$

The benefit of this method is that it dynamically adjusts the number of candidate tokens k depending on the certainty of the model. If the model is very certain about some tokens, the search space is small, and vice versa.
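The truncation and re-scaling steps described above can be sketched in a few lines of pure Python. This is an illustrative sketch that operates on an already-normalized probability list; it is not the decoding code of CodeXGLUE or HuggingFace, and the function names and the toy distribution are hypothetical. Beam Search is omitted for brevity.

```python
import random


def renormalize(probs, keep):
    """Zero out tokens outside `keep` and re-scale the rest to sum to 1."""
    kept = set(keep)
    total = sum(probs[i] for i in kept)
    return [probs[i] / total if i in kept else 0.0 for i in range(len(probs))]


def greedy(probs):
    """Index of the most probable token."""
    return max(range(len(probs)), key=lambda i: probs[i])


def apply_temperature(probs, temp):
    """Sharpen the distribution for temp < 1 (p_i ** (1 / temp), re-scaled)."""
    scaled = [p ** (1.0 / temp) for p in probs]
    total = sum(scaled)
    return [s / total for s in scaled]


def top_k_filter(probs, k):
    """Keep only the k most probable tokens."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return renormalize(probs, order[:k])


def top_p_filter(probs, p):
    """Keep the smallest set of tokens whose cumulative probability is >= p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, cumulative = [], 0.0
    for i in order:
        keep.append(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    return renormalize(probs, keep)


def sample(probs, rng):
    """Draw one token index from the distribution."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


# Hypothetical next-token distribution over a 4-token vocabulary.
toy_probs = [0.5, 0.3, 0.15, 0.05]
rng = random.Random(0)
next_token = sample(top_p_filter(toy_probs, 0.9), rng)
```

With the toy distribution, `top_p_filter(toy_probs, 0.9)` keeps three tokens while `top_p_filter(toy_probs, 0.1)` keeps only the most probable one, which illustrates how the number of candidates adapts to the value of p and to the shape of the distribution.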

(Step 7) Code Completion
PyCoder performs predictions at two granularity levels, i.e., at the token level and at the line level. Token-level code completion is a process to predict the next token (the right side), given the prior code tokens as a context (the left side).
Line-level code completion is similar to token-level prediction, but the model aims to predict the next tokens until the whole line of code is completed (i.e., not just a single next token). For the line-level prediction, we leverage the same model used for the token-level code completion task to iteratively generate the next token, where each newly generated token is appended to the context for the next prediction step. This process is repeated until the model generates an EOL token or until it reaches a threshold of n generated tokens (n = 100, following CodeXGLUE [3]).
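The iterative procedure can be sketched as a simple loop. This is a minimal sketch under stated assumptions: `predict_next` is a hypothetical stand-in for the trained model combined with a decoding method, the `<EOL>` spelling of the end-of-line token is assumed, and the canned lookup table exists only to make the example runnable.

```python
def complete_line(context, predict_next, eol_token="<EOL>", max_tokens=100):
    """Generate tokens one at a time until EOL or the token threshold."""
    tokens = list(context)
    generated = []
    for _ in range(max_tokens):
        token = predict_next(tokens)
        if token == eol_token:
            break
        generated.append(token)
        # The new token becomes part of the context for the next step.
        tokens.append(token)
    return generated


def toy_predict_next(tokens):
    """Hypothetical stand-in for a model: a canned next-token lookup."""
    canned = {"import": "os", "os": "<EOL>"}
    return canned.get(tokens[-1], "<EOL>")


line = complete_line(["import"], toy_predict_next)
```

The `max_tokens` guard mirrors the threshold n, so generation also terminates when the model never emits an end-of-line token.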

EXPERIMENTAL SETUP
In this section, we present the goal of our experiment, along with the research questions, followed by the experimental setup in detail.

Goal and Research Questions
The goal of this paper is to empirically evaluate PyCoder and compare it with the state-of-the-art approaches on the token-level and line-level code completion tasks, and to provide a better understanding of the impact of the components of PyCoder. To achieve this goal, we present the motivation and the research questions below.

RQ1) What is the performance of our PyCoder for the token-level and line-level code completion tasks when compared to state-of-the-art models?
Motivation. As motivated earlier, existing syntax-aware code completion approaches are not on-the-fly, while existing on-the-fly code completion approaches are not syntax-aware. To address this important gap, we introduce PyCoder (a syntax-aware on-the-fly code completion approach). Thus, we formulate this RQ to investigate how well PyCoder performs when compared to the state-of-the-art approaches for both token-level and line-level code completion tasks based on the CodeXGLUE benchmark.

RQ2) What is the impact of the training strategies on the performance of our PyCoder?
Motivation. There exist various training strategies for multi-task learning used in code completion. For example, Liu et al. [12] found that hard parameter sharing performs best, while Izadi et al. [7] found that soft parameter sharing performs better than hard parameter sharing for code completion. These contradictory findings motivate us to investigate the impact of training strategies on the performance of PyCoder.

RQ3) What is the impact of the task weighting parameters in multi-task learning on the performance of our PyCoder?
Motivation. PyCoder relies on two prediction tasks, i.e., the code prediction and token type prediction tasks. These two tasks may conflict with each other, or one task may have a higher influence than the other during the learning process. Indeed, prior studies [32], [33] raised concerns that this issue (aka. conflicting gradients) may degrade the performance of multi-task learning. Therefore, task weighting parameters are used to weigh the importance of each task to achieve optimal accuracy. However, PyCoder may be sensitive to the task weighting parameters. Thus, we set out this RQ to investigate the impact of the task weighting parameters on the performance of PyCoder.

RQ4) What is the impact of the decoding methods on the performance of our PyCoder?
Motivation. Decoding methods are an important component of code completion, used to generate the most probable next code tokens. To date, only a few methods have been used for code completion (e.g., Beam Search, Greedy) [5], [6], [7], [9]. However, there are other decoding methods that have been used for text generation in the natural language processing field, yet have never been explored in software engineering. Thus, there is a lack of understanding of whether the decoding methods widely used in code completion are the best.

Dataset
We use the ETH PY150 Python dataset (the standard code completion benchmark) provided by Raychev et al. [13] to ensure a fair comparison with prior studies [6], [7], [9], [34]. The dataset is collected from open-source software projects in GitHub repositories with non-viral licenses (e.g., MIT, Apache, and BSD), i.e., licenses under which the owner grants permission for free use under specific terms, thus mitigating potential licensing issues. Note that this dataset is also used in Microsoft's CodeXGLUE benchmark [3], a worldwide competition for the AI4Code area. Since duplicated code has already been removed by Raychev et al. [13], arriving at a total of 150,000 Python files, there is no code duplication between the training set and the testing set, which mitigates potential biases such as code duplication in our experiment. Following CodeXGLUE, for token-level predictions, the dataset is split into 95,000 files for the training set, 5,000 files for the validation set, and 50,000 files for the testing set, with 72.1M, 4.4M, and 37.3M tokens, respectively. For line-level predictions, it is common practice to reuse the same model trained for token-level predictions. Thus, only a testing set is required; a training set and a validation set are not. Therefore, we use the 10,000 Python files provided by CodeXGLUE [3] as the testing set for line-level predictions.

Pre-processing Methods
Sensitive data (e.g., names, numbers, credentials, IP addresses) could appear in the source code. To prevent the models from unnecessarily paying attention to this information, we mask such data by creating a placeholder for every string and numeric literal in the source code. Particularly, following CodeXGLUE [3], we first identify tokens based on their STRING and NUMBER types. Then, for the 200 most frequent string literals and the 30 most frequent numeric literals, we replace the string with <STR_LIT:value> and the number with <NUM_LIT:value>. The remaining, uncommon literals are masked by <STR_LIT> or <NUM_LIT>. Finally, these placeholders are added to the special tokens of the tokenizer, which avoids any subword tokenization of these special tokens.
In addition, we preserve the original indentation of the source code, which is ignored by CodeXGLUE's pre-processing step. Indentation plays an important role in the Python grammar, as it indicates the group of statements that belongs to a particular code block, helping the Python interpreter decide which statement to execute next. To do so, we represent indentation with the INDENT and DEDENT special tokens. INDENT denotes an indentation and appears once at the beginning of a code block, not once per line, while DEDENT denotes the dedentation at the end of the code block.
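A minimal sketch of this pre-processing can be written with Python's standard `tokenize` module, which already emits INDENT and DEDENT tokens once per block. The sketch is not the authors' or CodeXGLUE's script: the function name, the angle-bracket spelling of the placeholders, the `<EOL>` marker, and the simplified quote stripping are assumptions, and the sets of frequent literals would have to be computed from the training data.

```python
import io
import tokenize


def preprocess(source, common_strings=frozenset(), common_numbers=frozenset()):
    """Mask literals and keep indentation as explicit special tokens."""
    output = []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type == tokenize.STRING:
            # Simplification: strip surrounding quotes, ignore string prefixes.
            value = tok.string.strip("\"'")
            output.append(f"<STR_LIT:{value}>" if value in common_strings else "<STR_LIT>")
        elif tok.type == tokenize.NUMBER:
            output.append(f"<NUM_LIT:{tok.string}>" if tok.string in common_numbers else "<NUM_LIT>")
        elif tok.type == tokenize.INDENT:
            output.append("<INDENT>")  # emitted once per block, not per line
        elif tok.type == tokenize.DEDENT:
            output.append("<DEDENT>")
        elif tok.type == tokenize.NEWLINE:
            output.append("<EOL>")
        elif tok.type in (tokenize.NL, tokenize.COMMENT, tokenize.ENDMARKER):
            continue  # blank lines, comments, and the end marker are dropped
        else:
            output.append(tok.string)
    return output


example = "def f(x):\n    y = 1\n    return 'utf-8'\n"
masked = preprocess(example, common_strings={"utf-8"}, common_numbers={"1"})
```

Calling `preprocess(example)` without the frequent-literal sets masks both literals with the generic `<STR_LIT>` and `<NUM_LIT>` placeholders instead.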

Model Training
We use PyTorch 5 [35] and HuggingFace 6 [36] libraries for the implementation of our GPT-2 based model with the pre-trained checkpoint of CodeGPT. The base model is the default GPT-2 small configuration [4], consisting of 12 layers of Transformer decoders, 12 attention heads, n_position = 1024, n_ctx = 1024, and n_embd = 768. We train our models for 200,000 steps with an Adam optimizer [37]. The hyperparameter settings are shown in Table 2.
We do not tune the hyperparameters due to limited resources. Therefore, our results could serve as a lower bound, and further optimization may improve the accuracy of our model. Overall, we train 12 variants of PyCoder (3 multi-task training techniques + 9 task weighting parameters) for a total of more than 850 training hours. For the baselines, we use the best hyperparameter settings described in their respective papers. Our experiments are run on one NVIDIA GeForce RTX 3090 GPU with 24 GB of memory, an Intel(R) Core(TM) i9-9980XE CPU @ 3.00GHz with 36 logical cores, and 64 GB of RAM.

Evaluation Measurement
We evaluate our models based on the following evaluation measures: Accuracy (Acc) for token-level predictions; Exact Match (EM), Edit Similarity (ES), and Mean Reciprocal Rank (MRR) for line-level predictions. Accuracy (Acc) is the proportion of predicted code tokens that match the ground-truth tokens.
Exact Match (EM) is similar to Accuracy, but is evaluated at the line level, meaning that the whole predicted lines must be exactly matched with the ground-truth lines.
Edit Similarity (ES) uses a Levenshtein distance [38] to measure the edit distance between the predicted lines and ground-truth lines. The Levenshtein distance is the minimum number of edits in characters (either an insertion, a deletion, or a replacement of a character) between the predicted line and the ground-truth line.
Mean Reciprocal Rank (MRR) evaluates the top-R possible results using the multiplicative inverse of the rank of the first correct prediction. Formally, MRR is defined as:

$$\mathrm{MRR} = \frac{1}{Q} \sum_{i=1}^{Q} \frac{1}{\mathrm{rank}_i}$$

where Q is the number of samples, and rank_i is the rank of the correct prediction given by the model. If the rank of the correct prediction exceeds R, then the reciprocal rank is 0. In this paper, we use R = 5.

5. https://pytorch.org
6. https://huggingface.co
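The four measures can be sketched in pure Python as follows. This is an illustrative sketch rather than the CodeXGLUE evaluation script. In particular, normalizing the Levenshtein distance by the longer line to obtain a similarity in [0, 1] is an assumption, and the function names and toy inputs are hypothetical.

```python
def token_accuracy(predicted, truth):
    """Share of positions where the predicted token equals the ground truth."""
    return sum(p == t for p, t in zip(predicted, truth)) / len(truth)


def exact_match(predicted_lines, true_lines):
    """Share of lines that are predicted exactly."""
    return sum(p == t for p, t in zip(predicted_lines, true_lines)) / len(true_lines)


def levenshtein(a, b):
    """Minimum number of character insertions, deletions, and replacements."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, 1):
        current = [i]
        for j, char_b in enumerate(b, 1):
            current.append(min(
                previous[j] + 1,                       # deletion
                current[j - 1] + 1,                    # insertion
                previous[j - 1] + (char_a != char_b),  # replacement or match
            ))
        previous = current
    return previous[-1]


def edit_similarity(predicted, truth):
    """Assumed normalization: 1 - distance / length of the longer line."""
    longest = max(len(predicted), len(truth))
    return 1.0 if longest == 0 else 1.0 - levenshtein(predicted, truth) / longest


def mean_reciprocal_rank(ranked_candidates, truths, r=5):
    """Average of 1 / rank of the first correct candidate within the top r."""
    total = 0.0
    for candidates, truth in zip(ranked_candidates, truths):
        for rank, candidate in enumerate(candidates[:r], 1):
            if candidate == truth:
                total += 1.0 / rank
                break
    return total / len(truths)


# Hypothetical ranked predictions for two samples.
mrr = mean_reciprocal_rank([["x", "y"], ["a", "b"]], ["y", "z"])
```

In the toy example, the first sample contributes a reciprocal rank of 1/2 and the second contributes 0 because its ground truth is not among the candidates.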

Baselines
There exist various non-AST-based code completion approaches in CodeXGLUE [3], [4] and AST-based code completion approaches [6], [7], [9] in the literature. To ensure that our evaluation is reasonably comprehensive, we consider a total of seven (7) baselines with respect to two evaluation settings: (1) externally evaluate the prediction results through the CodeXGLUE leaderboard, 7 and (2) internally evaluate the prediction results within our own setting. For the CodeXGLUE evaluation setting, we compare our approach with CodeGPT-adapt, CodeGPT, GPT-2, Transformer (12L), and LSTM+BPE. To do so, we apply our PyCoder to the testing set provided by CodeXGLUE for both token-level and line-level predictions. Then, the prediction results are submitted to the CodeXGLUE team to obtain the results based on their evaluation setting.
For our own evaluation setting, we consider two AST-based approaches (i.e., Pointer Mixture Network [9] and TravTrans [6]) and two non-AST-based approaches (i.e., GPT-2 and CodeGPT). We do not consider CodeFill [7], since the available replication package is not executable. We also do not consider Codex (i.e., a descendant of GPT-3 for source code) in our experiment due to the very different model size. GPT-3, the base model of Codex, has 175B model parameters, which is roughly 1,500x larger than our GPT-2 based model, which has only 117M model parameters. Below, we describe the details of each approach.
• Pointer Mixture Network (PMN), proposed by Li et al. [9], is an LSTM-based code completion approach that leverages AST information for syntactic structures. The model is designed with pointer networks to mitigate the OOV problem in code completion. Their replication package is available on GitHub 8 and also in a PyTorch version. 9
• TravTrans, proposed by Kim et al. [6], is a Transformer-based model that considers the syntactic structure of source code via AST information. Their replication package is available on GitHub. 10
• GPT-2, proposed by Radford et al. [4], is a Transformer-based language model for text generation tasks. The GPT-2 model is first pre-trained on millions of English web pages (the WebText corpus) to build a language model through self-supervised learning without any explicit labels. The model is available on HuggingFace. 11
• CodeGPT, proposed by Lu et al. [3], is a GPT-2-based model for source code generation. The CodeGPT model is a GPT-2 model that is pre-trained on monolingual Python source code from CodeSearchNet [27].

EXPERIMENTAL RESULTS
In this section, we present the experimental results according to our four research questions (RQs).

(RQ1) What is the performance of our PyCoder for the token-level and line-level code completion tasks when compared to state-of-the-art models?
Across our comprehensive investigation, the best setting for PyCoder is to train with the hard parameter sharing strategy (PyCoder-Hard) and a task weight of 9:1 (code:type), using Beam Search as the decoding method. We use this setting as a reference for comparison with other approaches throughout the paper. PyCoder achieves the first rank on the CodeXGLUE leaderboard for the code completion task (as of 13 October 2022, see Table 3). We find that PyCoder achieves an accuracy of 76.93% for the token-level predictions, while achieving an exact match of 43.91% for the line-level predictions. The evaluation results confirm that PyCoder is more accurate than the other baselines by 0.43%-24.25% for token-level predictions and 3.63%-84.73% for line-level predictions.
Similarly, PyCoder outperforms existing AST-based and non-AST-based code completion approaches in our own setting. Table 4 shows that PyCoder achieves an accuracy of 77.12% for the token-level predictions, while achieving an exact match of 43.37% for the line-level predictions. For the token-level predictions (Acc), we find that PyCoder is more accurate than Pointer Mixture Network by 11.74%, GPT-2 by 4.37%, TravTrans by 2.15%, and CodeGPT by 1.89%. This finding indicates that PyCoder, which is both syntax-aware and on-the-fly, performs better than a code completion approach that is either syntax-aware alone or on-the-fly alone. It is worth noting that the accuracy of PyCoder-Hard, CodeGPT, and GPT-2 achieved on the CodeXGLUE leaderboard is slightly different from the accuracy obtained in our experimental setup. This difference stems from the datasets used in CodeXGLUE and in our experiment. CodeXGLUE removes the indentation, while the dataset used in our experiment preserves the original indentation. To mimic the practical deployment scenario, we opt to preserve the original indentation.
In addition, the token-type information can improve the line-level code completion task by 8.34%-15.22%. For the line-level predictions (EM), we find that PyCoder is more accurate than GPT-2 by 15.22% and CodeGPT by 8.34%. This finding confirms that token-type information, which is largely ignored by the literature, is useful for improving the performance of line-level code completion.
Finally, when comparing PyCoder with the existing AST-based code completion approaches (i.e., TravTrans and Pointer Mixture Network), we find that the existing AST-based approaches are designed for token-level predictions only. Thus, line-level predictions cannot be performed with them. This highlights the limitation of AST-based approaches, which require AST information at inference time, and demonstrates the benefit of our approach, which considers token-type information (i.e., syntax-aware) while still being able to predict code at any point in time (i.e., on-the-fly).

RQ1 Summary.
PyCoder achieves the first rank on the CodeXGLUE leaderboard with an accuracy of 77.12% for the token-level predictions, which is 0.43%-24.25% more accurate than baselines. In addition, PyCoder achieves an exact match of 43.37% for the line-level predictions, which is 3.63%-84.73% more accurate than baselines.

(RQ2) What is the impact of the training strategies on the performance of our PyCoder?
Table 5 presents the results of PyCoder when using various multi-task training strategies.
Hard parameter sharing (PyCoder-Hard) as a multi-task training strategy performs the best. Table 5 shows that different multi-task training strategies have an impact on the performance of PyCoder for both token-level and line-level predictions. Particularly, we observe that PyCoder with hard parameter sharing achieves an exact match of 42.83%, while PyCoder with soft parameter sharing achieves an exact match of 38.29%. The difference of 4.54 percentage points (i.e., max − min) confirms the impact that the training strategies have on the performance of PyCoder. In addition, our results contradict Izadi et al. [7], who found that soft parameter sharing performs best for code completion. This finding highlights the importance of investigating various choices of multi-task training strategies for code completion, instead of following prior suggestions or practices. Different from Izadi et al. [7], our PyCoder-Hard is designed to take the sequences of code tokens and of their types as inputs one at a time and to learn both tasks simultaneously with loss functions that are optimized together within the same model. With this method, the inputs can be detached from each other at the inference phase, resulting in the better performance confirmed by our results. The high performance of the hard parameter-sharing training strategy (PyCoder-Hard) stems from the tight relationship between the learning tasks (i.e., code predictions and type predictions). Since token types are directly aligned with the sequence of code tokens, these two pieces of information are tightly related. Therefore, PyCoder-Hard, which completely shares the model's weights and parameters between tasks, gains the most benefit from the shared relationship between the code and type information. In contrast, the soft parameter sharing model (PyCoder-Soft) learns each task separately, making the learning process across the two related tasks harder and resulting in sub-optimal performance.

RQ2 Summary.
Multi-task training strategies have an impact on PyCoder for both token-level and line-level predictions. We find that PyCoder-Hard performs best, followed by PyCoder-IFT and PyCoder-Soft.

(RQ3) What is the impact of the task weighting parameters in multi-task learning on the performance of our PyCoder?
Table 6 presents the results of different task weighting parameters for PyCoder-Hard.
PyCoder is generally robust to the task weighting parameters, achieving comparable (without task weighting) or better (with task weighting) performance when compared to the baselines. Table 6 shows that when varying the task weighting parameters (Type:Code) from 1:9 to 9:1, PyCoder achieves an exact match between 41.19% and 43.37%, which is still greater than the existing approaches (i.e., 40.03% for CodeGPT and 37.64% for GPT-2), with the exception of the 9:1 weighting. Even when the tasks are not weighted (cf. No Weight), PyCoder still achieves an exact match of 42.83%, which also outperforms the existing approaches. In line with the other measures for both line-level and token-level predictions, this finding confirms that by adding token-type information with at least a small weighting of 10%, PyCoder often performs better than the existing approaches. This means that the task objectives of PyCoder rarely suffer from conflicting gradients (i.e., gradients of different task objectives that are not aligned, leading to sub-optimal performance of the averaged gradient), showing that type prediction and code prediction are complementary and beneficial to each other. In our setting, the best task weight is 1:9 for the type prediction task to the code prediction task.

Note on the decoding results table: for all sampling methods, we report both the mean and its standard deviation (SD).

RQ3 Summary.
PyCoder is generally robust to the task weighting parameters, achieving comparable (without task weighting) or better (with task weighting) performance when compared to the baselines.

(RQ4) What is the impact of the decoding methods on the performance of our PyCoder?
Since decoding methods are specifically designed for generating code predictions as a sequence (i.e., not an individual code token), the rest of this RQ focuses on the line-level predictions only, not the token-level predictions. We note that some decoding methods (i.e., Beam Search and Sampling with a probability shaping function) require parameter settings to be specified. The evaluated configurations are (1 × Greedy + 5 × Beam Search) + 5 repeats × (1 × Sampling, 6 × Temp, 5 × k, 6 × p).
Beam Search performs the best, while Sampling performs the worst. Table 7 shows that the performance of PyCoder differs greatly when different decoding methods are used. For example, Beam Search (CodeXGLUE) achieves an exact match of 43.37%, while Sampling achieves an exact match of 33.80%, confirming that the decoding methods have a substantial impact on the performance of PyCoder for line-level code completion. In addition, we find that not only the methods but also different libraries with different implementations produce different results. In particular, when comparing Beam Search between the CodeXGLUE and HuggingFace libraries (see Table 7), we find that Beam Search from the CodeXGLUE library achieves an exact match of 43.37% (used by PyCoder), which is greater than that from the HuggingFace library. This finding suggests that future studies should use Beam Search (CodeXGLUE) for code completion and should report the library used for decoding to improve reproducibility and replicability.
We find that Sampling is the lowest-performing decoding method, while advanced Sampling (i.e., Sampling with Probability Shaping) tends to perform better, depending on the specified parameter settings. In our comprehensive investigation, Top-P sampling performs best when p = 0.1, and Sampling with Temperature performs best when temp = 0.1. These optimal parameter settings are specific to the domain and context of code completion and differ from those of Holtzman et al. [28], who recommend temp ∈ [0.5, 1], k ∈ [1, 100], and p ∈ [0.9, 1) for text generation tasks. The fact that the optimal settings for code completion differ from the recommendations in the NLP text generation field suggests that researchers should experiment with various parameter settings for the problem they tackle, instead of solely relying on suggestions or recommendations from prior work.

RQ4 Summary.
Decoding methods have an impact on the performance of PyCoder with an exact match varying from 33.80% to 41.52% for line-level predictions. Beam Search performs best, while Sampling performs worst.

DISCUSSION
Intuitively, the performance of PyCoder may depend on the amount of data (either training or testing). Since PyCoder is specifically designed to incorporate token type information, we perform an additional analysis to investigate the relationship between the accuracy of code token predictions for each token type and the frequency with which each token type appears in the training and testing datasets (see Figure 3).
In general, tokens of syntax-related types tend to be predicted more accurately than tokens of other types (e.g., operational tokens, boolean and logical expressions, strings, and numbers). The difference in accuracy could be due to the amount of data in the training/testing dataset. Figure 3 shows that tokens related to syntax types (i.e., LPAR, RPAR, COLON, KEYWORD, INDENT, DEDENT, EOL) generally achieve an accuracy of 68.35%-100.00%, where these types account for 58.50% and 58.33% of the training and testing datasets, respectively. On the other hand, operation-related tokens (e.g., PLUS, STAR, GREATER, NOTEQUAL) tend to be less accurate than syntax-related tokens, since these operation-related tokens occur less frequently in the dataset. The relationship between the code token accuracy and its frequency is also confirmed by a Spearman's rank correlation of 0.85 (high, p-value = 1.59 × 10^{-15}), suggesting that more training data may improve the predictions of code tokens that are less frequent in the dataset.

THREATS TO VALIDITY
Threats to construct validity relate to the selection of baseline approaches. In this paper, we select competitive state-of-the-art approaches that have been made publicly available by their authors as the baselines, which reduces biases and increases the transparency of the comparison of the experimental results. We run all the experiments using the replication packages and the best hyperparameter settings reported in their papers.
Threats to internal validity relate to the impact of the hyperparameters on the performance of PyCoder. To mitigate this threat, we conduct experiments with various hyperparameter settings (see RQ3 and RQ4). We find that PyCoder is generally robust to the task weights. Thus, we suspect that hyperparameters will have a minimal impact on the performance of PyCoder. Nevertheless, optimizing the hyperparameters of the Transformer model could be expensive and is not the main goal of this paper. Due to the limited access to premium GPU computing resources, our results serve as a lower bound, which could be further improved after optimization and with premium GPU access. To support future replication, we report the hyperparameter settings in our replication package.
Threats to external validity relate to the degree to which our approach can be generalized to other contexts. We evaluate PyCoder with 50,000 Python files from the PY150 dataset, which is used in many prior studies [3], [6], [7], [9], [11], [12], [34]. We also evaluate the model with the code completion benchmark in CodeXGLUE [3]. However, we limit the scope of this paper to Python and have not demonstrated the results for other languages. Thus, other languages and datasets can be explored in future work.

CONCLUSION
In this work, we propose PyCoder to leverage token types, a kind of lightweight syntactic information, with a multi-task training strategy that learns the supporting task of predicting token types during the training phase. We extensively train and test PyCoder with different multi-task training techniques, task weighting parameters, and decoding methods to find the most suitable architecture. Our study underlines the following conclusions:
• PyCoder surpasses all the state-of-the-art models in our setting and also achieves first place in CodeXGLUE's Python code completion benchmark. The results indicate that token type syntactic information can be beneficial in code completion.
• In our setting, hard parameter sharing (PyCoder-Hard) with a task weight (Type:Code) of 1:9 and Beam Search performs the best.
• Our study highlights the importance of investigating various choices of settings (e.g., multi-task training strategies, parameter settings) instead of solely relying on suggestions from prior work.
PyCoder extends on-the-fly code completion with lightweight syntax-aware information. However, we acknowledge that there is still room to develop a fully syntactically correct code completion model with the on-the-fly feature. We leave this exploration for future research.