Log Statements Generation via Deep Learning: Widening the Support Provided to Developers

Logging assists in monitoring events that transpire during the execution of software. Previous research has highlighted the challenges confronted by developers when it comes to logging, including dilemmas such as where to log, what data to record, and which log level to employ (e.g., info, fatal). In this context, we introduced LANCE, an approach rooted in deep learning (DL) that has demonstrated the ability to correctly inject a log statement into Java methods in ~15% of cases. Nevertheless, LANCE grapples with two primary constraints: (i) it presumes that a method necessitates the inclusion of logging statements; and (ii) it allows the injection of only a single (new) log statement, even in situations where the injection of multiple log statements might be essential. To address these limitations, we present LEONID, a DL-based technique that can distinguish between methods that do and do not require the inclusion of log statements. Furthermore, LEONID supports the injection of multiple log statements within a given method when necessary, and it also enhances LANCE's proficiency in generating meaningful log messages through the combination of DL and Information Retrieval (IR).


Introduction
The practice of injecting log statements in applications' code is widely adopted both in industry and open source projects (Oliner et al., 2012). Indeed, log statements are instrumental to support several software-related activities, including program comprehension and debugging (Lu et al., 2017; Gurumdimma et al., 2016). Given its popularity, the proliferation of libraries supporting logging activities comes as no surprise: just for Java, some possible options are Log4j (Log4J, 2022), JCL (Java, 2022), slf4j (QOS.ch, 2022b), and logback (QOS.ch, 2022a).
While logging is usually perceived as a good practice, it comes with its own drawbacks: excessive logging could negatively impact performance and, if not carefully conceived, log statements can result in security issues such as exposing user credentials or sensitive information. Also, researchers documented several bad practices that should be avoided while logging code (Chen and Jiang, 2017a; Li et al., 2019).
In general, logging poses several challenges to software developers. First, they need to decide what to log, by finding the right amount of log statements needed in the application without, however, flooding it with [...] Once pre-trained, the model has been fine-tuned for the specific task of interest. In this case, we selected ∼62k Java methods and removed from them exactly one log statement, asking the model to generate and inject it, thus deciding where to log (i.e., in which part of the method), which log level to use, and what to log (i.e., generate a meaningful log message in natural language). LANCE is the first approach supporting developers in all these activities. The empirical evaluation we ran (Mastropaolo et al., 2022) showed that LANCE was able to correctly predict the appropriate location of a log statement and its level in ∼66% of cases, while it struggled to predict a meaningful log message, being successful in 15.2% of test instances.
While LANCE represents a step forward in logging automation, it comes with some limitations. First, it assumes that only one log statement is needed in a Java method provided as input. This is due to the training procedure we employed, which asks the model to always generate a single log statement. Second, given a Java method, LANCE cannot assess whether log statements are needed at all. Indeed, in some cases, enough log statements may already be present in the method or, maybe, the method does not feature statements that would benefit from logging. Finally, LANCE showed substantial limitations in synthesizing meaningful natural language log messages. In this work we study how to partially address these limitations.
We start by replicating LANCE, training and testing it on a dataset 3.6 times larger than the one we used originally (Mastropaolo et al., 2022) (230k training instances vs. 63k). Besides being larger, the new dataset features a more varied set of log statements. Then, we present LEONID as an extension of LANCE able to (i) discriminate between methods needing and not needing the injection of new log statements; and (ii) in case a need for log statements is identified, decide, differently from LANCE, the proper number of log statements to inject (which can be higher than one) and place them in the correct position. We found that LEONID can correctly predict the need for log statements with an accuracy higher than 90%. Also, when log statements are needed, LEONID can generate and inject in the right position multiple complete log statements in ∼17% of cases.
Finally, in LEONID we attempted to improve the performance achieved in the generation of meaningful log messages by exploiting a combination of DL and Information Retrieval (IR). Indeed, based on the results we achieved with LANCE, the generation of log messages really looked like the Achilles' heel of DL-based log generation. Results show that by increasing the size of the training dataset, the ability of LANCE to predict meaningful log messages substantially improves (+100% as compared to what we reported in Mastropaolo et al. (2022)). Instead, the combination of DL and IR we propose in LEONID only marginally improves the results for this specific task (+5% relative improvement).
Table 1 shows how LEONID widens the support provided to developers in the automation of logging activities. Indeed, it is the only approach that decides whether log statements are needed in a method and, in case of a positive answer, synthesizes multiple complete log statements and injects them in the correct position.

LEONID
We start by providing an introduction to the T5 model we use (Section 2.1), the same we also exploited in LANCE (Mastropaolo et al., 2022). Then, we describe how we built the datasets used for the different training phases we deal with (Section 2.2). Section 3 will then explain how we used these datasets to run the actual training process.

Text-to-Text-Transfer-Transformer (T5)
T5 has been introduced by Raffel et al. (2020) as a Transformer (Vaswani et al., 2017) model to support multitask learning. The idea behind T5 is to reframe NLP tasks in a unified text-to-text format in which the input and output of the model are text strings. The training of T5 includes two phases. The first is the pre-training, in which the model is trained with a self-supervised objective to acquire general knowledge about the language(s) of interest. For example, this may mean providing as input to the model English sentences having a subset of their words masked and asking the model to generate as output the masked words. Being self-supervised (i.e., the training instances can be automatically generated by masking random words), the pre-training can usually be performed on large-scale datasets. Once pre-trained, T5 can be fine-tuned to support specific tasks with supervised training objectives. This means, for example, providing it with pairs of sentences <english, spanish> to train a translator.
In our work, we rely on the same T5 architecture (i.e., T5 small) we exploited in LANCE (Mastropaolo et al., 2022). T5 small is characterized by six blocks for encoders and decoders. The feed-forward networks in each block consist of a dense layer with an output dimensionality (d_ff) of 2048. The key and value matrices of all attention mechanisms have an inner dimensionality (d_kv) of 64, and all attention mechanisms have eight heads. All the other sub-layers and embeddings have a dimensionality (d_model) of 512. We acknowledge that employing larger models such as T5 base or T5 large can influence the performance of LEONID when automating logging activities, but this comes at the expense of increased time and computational power requirements during the training process. The code implementing T5 is available in our replication package (Mastropaolo, 2023).

Datasets needed for training, validation, and testing
We start by describing the dataset used for pre-training T5 (Section 2.2.1). Then, we detail the several fine-tuning datasets we built (each featuring a training, validation, and test set). The first, aimed at replicating LANCE (Mastropaolo et al., 2022), teaches T5 how to inject a single log statement in a Java method (Section 2.2.2). The second fine-tuning dataset also focuses on the problem of injecting a single log statement, but this time exploits IR to provide T5 with concrete examples of log messages that might be relevant for the prediction at hand (Section 2.2.3). This allows us to compare LANCE with LEONID in the task of single log statement injection. The third fine-tuning dataset trains LEONID for the task of multi-log statements prediction, i.e., injecting from 1 to n log statements in a given method (Section 2.2.4). Finally, we describe the fine-tuning dataset to train a T5 able to discriminate between methods needing and not needing log statements (Section 2.2.5). The datasets are summarized in Tables 2 and 3 and available in Mastropaolo (2023).
All datasets have been built starting from the same set of GitHub repositories that we selected using the GHS (GitHub Search) tool by Dabic et al. (2021). GHS allows querying GitHub for projects meeting specific criteria. We used the same selection criteria exploited in our former work on LANCE (Mastropaolo et al., 2022), selecting all public non-forked Java projects having at least 500 commits, 10 contributors, and 10 stars. These selection criteria aim at excluding personal/toy projects and reducing the chance of collecting duplicated code (non-forked repositories). We cloned the latest snapshot of the 6352 projects returned by GHS. We scanned all cloned repositories to assess whether they featured a POM (Project Object Model) or a build.gradle file. Both these files allow declaring external dependencies towards libraries, the former using Maven, the latter Gradle. Such a check was performed since, as a subsequent step, we verify whether projects had a dependency towards Apache Log4j (Log4J, 2022) (i.e., a well-known Java logging library) or SLF4J (Simple Logging Facade for Java) (QOS.ch, 2022b) (i.e., an abstraction for Java logging frameworks similar to Log4j). Indeed, to train a T5 for the task of injecting complete log statement(s) in Java methods, we need examples of methods featuring log statements. The usage of popular Java logging libraries was thus a prerequisite for the project's selection.
We found 3865 projects having either a POM or a build.gradle file, and 2978 of them featured a dependency towards at least one logging library. The overall projects' selection is very similar to the one we performed in Mastropaolo et al. (2022), with the main differences being the additional mining of projects: (i) using Gradle as build system (in Mastropaolo et al. (2022) only Maven was considered); and (ii) having a dependency towards SLF4J (in Mastropaolo et al. (2022) only Log4j was considered). These choices help in increasing the size and variety of both the training and the testing datasets, making the prediction more challenging.
We used srcML (SrcML, 2022) to extract all Java methods in the selected projects. Then, we identified the log statements within each method (if any) and removed all methods featuring log statements exploiting custom log levels (i.e., log levels that do not belong to any of the two libraries we consider, but that have been defined within a specific project). The valid log levels we considered are: FATAL, ERROR, WARN, DEBUG, INFO, and TRACE. At this point we were left with two sets of methods: those not having any log statement and those having at least one log statement using one of the ''valid'' log levels.
We ran javalang (Thunes) on these methods to tokenize them and excluded all those having fewer than 10 tokens or at least 512 tokens. The upper-bound filtering has been done in previous works (Mastropaolo et al., 2021; Tufano et al., 2021; Ciniselli et al., 2021; Tufano et al., 2019a,b) to limit the computational expenses of training DL-based models. The lower bound of 10 tokens aims at removing empty methods. We also removed all methods containing non-ASCII characters in an attempt to exclude at least some of the methods featuring log messages not written in English. Finally, to avoid any possible overlap between the training, evaluation, and test datasets we are going to create from the collected set of methods, we removed all exact duplicates, obtaining the final set of 12,916,063 Java methods, of which 244,588 contain at least one log statement.

Pre-training dataset
Since the goal of pre-training is to provide T5 with general knowledge about the language of interest (i.e., Java), we used for pre-training all methods not featuring a log statement (the latter will be used for the fine-tuning datasets).We adopted a classic masked language model task, which consists in randomly masking 15% of the tokens composing a training instance (i.e., a Java method) asking the model to predict them.
Fig. 1 depicts the masking procedure of instances used to pre-train the model.
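As a rough illustration of the masking step, the following Python sketch masks 15% of the tokens of a tokenized method and builds the corresponding input/target pair. The sentinel naming (`<extra_id_0>`, `<extra_id_1>`, ...) and the merging of adjacent masked tokens under one sentinel follow the usual T5 convention and are assumptions here, not details confirmed by the text; `mask_tokens` is a hypothetical helper.

```python
import random


def mask_tokens(tokens, mask_rate=0.15, seed=0):
    """Mask a random subset of tokens and return (masked_input, target).

    Assumption: adjacent masked tokens share one sentinel, as in the
    span-corruption format commonly used with T5.
    """
    rng = random.Random(seed)
    n_to_mask = max(1, round(len(tokens) * mask_rate))
    masked_positions = set(rng.sample(range(len(tokens)), n_to_mask))

    masked_input, target = [], []
    sentinel = 0
    for i, tok in enumerate(tokens):
        if i in masked_positions:
            # Open a new sentinel only at the start of a masked run.
            if i - 1 not in masked_positions:
                marker = f"<extra_id_{sentinel}>"
                masked_input.append(marker)
                target.append(marker)
                sentinel += 1
            target.append(tok)
        else:
            masked_input.append(tok)
    return masked_input, target


method_tokens = "public int add ( int a , int b ) { return a + b ; }".split()
masked_input, target = mask_tokens(method_tokens)
```

The model receives `masked_input` and is asked to produce `target`, i.e., the masked tokens grouped by sentinel.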

Fine-tuning dataset: Single Log Generation
We build a fine-tuning dataset aimed at replicating what we did in the training of LANCE (Mastropaolo et al., 2022). We process each method M having n ≥ 1 log statements by removing from it one log statement (i.e., leaving it with n−1 log statements). This allows us to create a training pair ⟨M_s, M_t⟩ with M_s representing the input provided to the model (i.e., M with one removed log statement) and M_t being the expected output (i.e., M in its original form, with all its log statements). This is the dataset used to train LANCE (Mastropaolo et al., 2022) and it allows us to train a model able, given a Java method as input, to inject in it one new log statement. For methods having n > 1 (i.e., more than one log statement), we created n pairs ⟨M_s, M_t⟩, each of them having one of the n log statements removed (i.e., a different M_s). To ensure that after the log statement removal our instances still featured valid Java methods, we parsed each M_s using JavaParser (JavaParser, 2022) and removed all pairs including an invalid M_s.
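The pair construction can be sketched as follows. This is a simplified illustration only: it assumes each log statement sits on a single line and is recognizable through a hypothetical regular expression (`LOG_CALL`), whereas the pipeline described above works on parsed Java code and validates each result with JavaParser.

```python
import re

# Hypothetical pattern: a single-line call such as `LOG.info("...");`
LOG_CALL = re.compile(r"^\s*\w+\.(fatal|error|warn|debug|info|trace)\(.*\);\s*$")


def single_log_pairs(method_source):
    """Create one (input, target) pair per log statement in the method.

    Each input lacks exactly one log statement; the target is always
    the original method with all its log statements.
    """
    lines = method_source.splitlines()
    log_indices = [i for i, line in enumerate(lines) if LOG_CALL.match(line)]
    pairs = []
    for i in log_indices:
        source = "\n".join(lines[:i] + lines[i + 1:])
        pairs.append((source, method_source))
    return pairs


example = "\n".join([
    "void run() {",
    '    LOG.info("starting");',
    "    work();",
    '    LOG.debug("done");',
    "}",
])
pairs = single_log_pairs(example)
```

A method with n log statements yields n pairs, each with n−1 log statements left in the input.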
We split the remaining pairs into training (80%), validation (10%), and test (10%) sets, as reported in Table 2. Training and testing a T5 model on this dataset basically means performing a differentiated replication of LANCE on a 3.6× larger and more varied (multiple logging libraries) dataset.

Fine-tuning dataset: Single Log Generation with IR
In LEONID, we combine DL and IR with the goal of boosting performance, especially in the generation of meaningful log messages. The main idea is to augment the input provided to the model (i.e., M_s) with log messages belonging to methods similar to M_s which are featured in the training set. For each of the 244,588 ⟨M_s, M_t⟩ pairs in the fine-tuning dataset described in Section 2.2.2 (this includes training, validation, and test), we identify the k most similar pairs in the training set. The similarity between two pairs is based on the similarity of their M_s (i.e., the method in which the log statement must be created) and it is computed using the Jaccard similarity index (Hancock, 2004), based on the percentage of code tokens shared across the two methods. We then use these k similar methods to extract from them examples of log messages used in coding contexts which are similar to the M_s at hand.
Two clarifications are needed. First, independently of whether a given pair is in the training, validation, or test set, we extract its k most similar pairs only from the training set. This is needed since, while predicting the log statement to inject, the training set must be the only knowledge available to the model (i.e., the test set must be composed of previously unseen instances). Second, when computing the Jaccard similarity, we remove from the compared methods all log statements, since we want to identify similar ''coding contexts'' that may require similar log statements. We created three different fine-tuning datasets using different values of k ∈ {1, 3, 5} (thus, a lower/higher number of exemplar log messages provided to the model). Fig. 2 shows an example of a training instance for this fine-tuning dataset. The method on top represents the M_s Java method in which a log statement must be injected (i.e., the one highlighted in red). The method is enriched with the exemplar log messages that have been found in the k = 1 most similar method, shown at the bottom. Besides the log messages, we also provide T5 with the Jaccard similarity between the M_s at hand (top of the figure in this case) and the method of the training set from which the exemplar log message(s) has been extracted. This is meant to provide T5 with an additional hint in terms of which exemplar message comes from the most similar coding context (when more messages are retrieved). Note that the instances in this dataset are exactly the same as those previously described to replicate LANCE (see Table 2). This allows a direct comparison in terms of performance, which will provide information about the gain, if any, provided by the IR integration.
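The retrieval step can be illustrated with a minimal Python sketch. It assumes that methods are already tokenized with their log statements removed, and it computes the Jaccard index over sets of distinct tokens; whether the original implementation uses sets or multisets is not stated, so this is an assumption. `top_k_similar` and the toy training set are hypothetical.

```python
def jaccard(tokens_a, tokens_b):
    """Jaccard index between two token sequences, treated as sets."""
    a, b = set(tokens_a), set(tokens_b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def top_k_similar(query_tokens, training_set, k=1):
    """Return the k (similarity, log_messages) entries most similar to the query.

    training_set: list of (tokens_without_logs, log_messages) drawn from
    the training split only, so no test information leaks into the input.
    """
    scored = [(jaccard(query_tokens, toks), msgs) for toks, msgs in training_set]
    scored.sort(key=lambda entry: entry[0], reverse=True)
    return scored[:k]


training_set = [
    ("open file read close".split(), ["failed to read file"]),
    ("connect send close".split(), ["connection closed"]),
]
hits = top_k_similar("open file write close".split(), training_set, k=1)
```

Each retrieved entry carries both the exemplar log messages and the similarity score, mirroring the two pieces of information added to the model input.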

Fine-tuning dataset: Multi-log Injection with IR
One limitation of LANCE (Mastropaolo et al., 2022) we aim at addressing in this extension is the assumption that a Java method provided as input always requires one new log statement to be injected.
Also for this dataset, LEONID exploits a combination of DL and IR, thus we follow a process similar to the one described in Section 2.2.3, with the main difference being the number of log statements we ask the model to generate. Given a method M featuring n log statements, we randomly select y log statements to remove from it, with 1 ≤ y ≤ n. This means that we create pairs ⟨M_s, M_t⟩ in which M_s lacks a ''random'' number of log statements that must be generated by the model to obtain the target method M_t. This makes the prediction task substantially more challenging as compared to the single-log injection scenario experimented in LANCE. Also in this case we parsed each M_s using JavaParser (JavaParser, 2022) and removed all pairs including an invalid M_s. The remaining part of the process (i.e., identifying the k most similar pairs to inject examples of log messages) is the same described in Section 2.2.3. Table 2 shows the distribution of instances among the training, evaluation, and test set for this dataset as well.

Fine-tuning dataset: Deciding whether log statements are needed
While the dataset described in Section 2.2.4 allows us to build a model able to inject multiple log statements in a given Java method, such a model still assumes that at least one log statement must be injected in the input method. Thus, LEONID also includes a T5 model trained as a binary classifier in charge of deciding whether a method provided as input requires the addition of log statements or not. In case of an affirmative answer, the method can then be passed to the previously trained model, which will decide how many and which log statements to inject. To train such a classifier we again start from the original set of 244,588 Java methods having at least one log statement. Then, similarly to what was done in Section 2.2.4, given a method M featuring n log statements, we randomly select y log statements to remove from it with, however, 0 ≤ y ≤ n. Thus, differently from the training dataset used for multi-log injection, we have instances from which we did not remove any log statement (y = 0). Then, we create a pair ⟨M_s, b⟩ in which M_s is the original method M possibly lacking a random number of log statements, while b is a boolean variable that can be equal to true (i.e., M_s needs the addition of log statements, since y ≥ 1) or false (i.e., no log statements are needed in M_s, since y = 0). Non-parsable methods resulting after the removal of the log statements have then been removed, as well as duplicates resulting from different methods that, after the removal of log statements, become equal (i.e., their only differences were the removed log statements). This process resulted in a dataset featuring 190,974 training instances (98,848 needing at least a log statement and 92,126 not needing it), accompanied by the evaluation and test sets summarized in Table 3.
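The two sampling schemes (at least one removed log statement for multi-log injection, possibly none for the classifier) can be sketched in Python as follows. The sketch assumes log statements occupy single lines whose indices are already known, and that the number of removed statements is drawn uniformly; both are simplifying assumptions, and `make_instance` is a hypothetical helper.

```python
import random


def make_instance(lines, log_line_indices, rng, min_removed=0):
    """Remove a random number of log statements from a method.

    min_removed=1 mimics the multi-log injection dataset (at least one
    statement removed); min_removed=0 mimics the classifier dataset,
    where removing nothing yields a "no log needed" instance.
    Returns (method_source, needs_log).
    """
    n = len(log_line_indices)
    n_removed = rng.randint(min_removed, n)  # inclusive on both ends
    removed = set(rng.sample(log_line_indices, n_removed))
    source = "\n".join(l for i, l in enumerate(lines) if i not in removed)
    return source, n_removed >= 1


method_lines = [
    "void run() {",
    '    LOG.info("starting");',
    "    work();",
    '    LOG.debug("done");',
    "}",
]
source, needs_log = make_instance(method_lines, [1, 3], random.Random(42))
```

For the classifier, the label is true exactly when at least one log statement was removed, which is what the second element of the returned pair encodes.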
As can be seen, four different versions of the test set have been created, to experiment with LEONID in different scenarios. Let us explain such a choice. The test set should be representative of the real distribution of methods needing and not needing log statements. However, such a distribution cannot be computed in a reliable way. Indeed, one possibility we considered to build our dataset was to just consider all methods with and without log statements as training instances (as opposed to working only with methods having at least a log statement, as we do). In a nutshell, the process would have been: (i) remove a random number of log statements from the methods with at least one log statement to create instances needing logs; and (ii) assume that all methods without log statements do not require logging. However, assuming that all methods in a project not having log statements do not require logging is a very strong assumption. It is indeed possible that the project's developers just did not yet consider the usage of logs in a specific method or that, in a given project, logging is not yet a practice at all (thus no method uses log statements). This makes it difficult to reliably compute the number of methods needing and not needing logging. Also, such a problem justifies our decision to create instances of methods needing/not needing a log statement starting from all methods having at least one log statement and using the process described above (i.e., removing a random number of statements to create instances in need of logging, and not removing any log statement to create instances not needing logging). At least, we are sure that these are methods for which developers considered logging (since they have at least one log statement) and, thus, can be seen as a sort of ''oracle''.
The four test sets in Table 3 simulate four different distributions of methods needing/not needing log statements: balanced (50% per category), unbalanced towards needing (75%-25%), unbalanced towards not needing (25%-75%), and strongly unbalanced towards not needing (2%-98%). The latter is a distribution we computed based on all 12M+ methods we mined, in which 98% of methods do not have log statements, while 2% have them. As said, this distribution is not completely reliable but, at least, gives an idea of what we found in the mined projects.

Training and hyperparameter tuning
All training runs were performed using a Google Colab 2 × 2, 8-core TPU topology with a batch size of 128.

Tokenizer training
Since we use software-specific corpora for pre-training and fine-tuning, we trained a tokenizer (i.e., a SentencePiece model; Kudo and Richardson, 2018) on a corpus that, besides Java code, includes English sentences (Raffel et al., 2020). We included English sentences since, once fine-tuned, the models may be required to synthesize complex (natural language) log messages. We set the size of the vocabulary to 32k word-pieces.

Pre-training
We pre-trained T5 for 500k steps on the pre-training dataset composed of 12,671,475 Java methods (Table 2). Given the size of our dataset and the batch size, 500k steps correspond to ∼5 epochs. The maximum size of the input/output was set to 512 tokens.
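The ∼5 epochs figure can be checked with a short computation, assuming that each training step consumes exactly one batch of 128 instances (the batch size reported for all training runs):

```python
steps = 500_000
batch_size = 128
dataset_size = 12_671_475  # pre-training methods without log statements

instances_seen = steps * batch_size
epochs = instances_seen / dataset_size
print(f"{instances_seen:,} instances seen, about {epochs:.2f} epochs")
```

The dataset size itself is consistent with the mining step: 12,916,063 collected methods minus the 244,588 featuring at least one log statement gives 12,671,475.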

Hyperparameter tuning
Once the model was pre-trained, we tuned its hyperparameters following the same procedure we employed when developing LANCE. Such a procedure has been executed for each of the fine-tuning datasets previously described. In particular, we assessed the performance of T5 when using four different learning rate schedulers: (i) Constant Learning Rate (C-LR): the learning rate is fixed during the whole training; (ii) Inverse Square Root Learning Rate (ISR-LR): the learning rate decays as the inverse square root of the training step; (iii) Slanted Triangular Learning Rate (Howard and Ruder, 2018) (ST-LR): the learning rate first linearly increases and then linearly decays to the starting learning rate; and (iv) Polynomial Decay Learning Rate (PD-LR): the learning rate decays polynomially from an initial value to an ending value in the given decay steps. The exact configuration of all the parameters used for each scheduling strategy is reported in Table 5.
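To make the four strategies concrete, the sketch below implements each of them as a plain Python function of the training step. All default values (learning rates, warm-up steps, cut fraction, power) are placeholders chosen for illustration and are not the configuration of Table 5; the slanted triangular variant follows the simplified description given above rather than the exact formulation by Howard and Ruder.

```python
import math


def constant_lr(step, lr=1e-3):
    """C-LR: the same learning rate at every step."""
    return lr


def inverse_sqrt_lr(step, warmup_steps=10_000):
    """ISR-LR: constant during warm-up, then decays as 1/sqrt(step)."""
    return 1.0 / math.sqrt(max(step, warmup_steps))


def slanted_triangular_lr(step, total_steps, lr_start=1e-4, lr_max=1e-3, cut_frac=0.1):
    """ST-LR: linear increase up to lr_max, then linear decay back to lr_start."""
    cut = int(total_steps * cut_frac)
    if step < cut:
        frac = step / cut
    else:
        frac = 1 - (step - cut) / (total_steps - cut)
    return lr_start + (lr_max - lr_start) * frac


def polynomial_decay_lr(step, lr_start=1e-2, lr_end=1e-3, decay_steps=10_000, power=1.0):
    """PD-LR: polynomial decay from lr_start to lr_end over decay_steps."""
    step = min(step, decay_steps)
    return (lr_start - lr_end) * (1 - step / decay_steps) ** power + lr_end
```

Plotting these functions over the 100k fine-tuning steps is a quick way to compare how aggressively each schedule reduces the learning rate.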
Each model has been run for 100k training steps on the fine-tuning dataset. Then, its performance has been assessed on the evaluation set in terms of correct predictions (i.e., cases in which the generated output is equal to the target one).
For the generative models injecting log statements, this means that they output the Java method featuring all correct log statements in the expected positions. For the classifier, it means that it correctly predicted the need for log statements in a given method. The results achieved with each learning rate are reported in Table 4. Our hyperparameter tuning required training and evaluating 28 models: for each of the 7 fine-tuning datasets in Table 4 we experimented with 4 different learning rate schedulers. Given the achieved results, we will use the ISR-LR for the generative models, and the PD-LR for the classifier when fine-tuning the models. Concerning the ''replication of LANCE'' (i.e., fine-tuning T5 on the dataset Fine-tuning: Single Log Generation in Table 2), we did not perform any hyperparameter tuning, but relied on the best configuration reported in the original paper (Mastropaolo et al., 2022), thus using the PD-LR.

Fine-tuning
Once the best learning rates were identified, we fine-tuned the final models using early stopping, with checkpoints saved every 10k steps, a delta of 0.01, and a patience of 5. This means training the model on the fine-tuning dataset and evaluating its performance (again in terms of correct predictions) on the evaluation set every 10k steps. The training process stops if a gain lower than delta (0.01) is observed over a 50k-step interval. This means that, for example, after 60k steps the performance of the model is compared against that of the 10k checkpoint and, if the gain in performance is lower than 0.01, the training stops and the best-performing checkpoint up to that training step is selected. This process has been used for all models, including the one replicating LANCE. Our replication package (Mastropaolo, 2023) reports the convergence of all models (i.e., the steps after which the early stopping criterion was met).
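One possible reading of this early stopping criterion is sketched below in Python. It assumes that the score of each checkpoint is compared with the score obtained five checkpoints (50k steps) earlier, and that the scores are expressed in the same unit as delta; `early_stopping` is a hypothetical helper, not the code of the replication package.

```python
def early_stopping(scores, delta=0.01, patience=5, checkpoint_every=10_000):
    """Apply the stopping rule to a list of evaluation scores.

    scores[i] is the score at step (i + 1) * checkpoint_every.
    Returns (stop_step, best_step), or None if the criterion is never met.
    """
    for i in range(patience, len(scores)):
        gain = scores[i] - scores[i - patience]
        if gain < delta:
            # Select the best checkpoint observed up to the stopping step.
            best = max(range(i + 1), key=lambda j: scores[j])
            return (i + 1) * checkpoint_every, (best + 1) * checkpoint_every
    return None


# Scores at 10k, 20k, ..., 60k steps (made-up values for illustration).
eval_scores = [0.10, 0.15, 0.18, 0.19, 0.195, 0.105]
result = early_stopping(eval_scores)
```

In this made-up run the 60k checkpoint gains less than 0.01 over the 10k one, so training stops at 60k steps and the 50k checkpoint is kept.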

Generating predictions
Once the T5 models have been pre-trained and fine-tuned, they can be used to generate predictions for the targeted tasks. We generate predictions using a greedy decoding strategy, meaning that the generated prediction is the result of selecting at each decoding step the token with the highest probability of appearing in a specific position. Thus, a single prediction (i.e., the one maximizing the likelihood of each produced token) is generated for an input sequence, as compared to strategies such as beam-search (Freitag and Al-Onaizan, 2017) that generate multiple predictions.
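Greedy decoding can be illustrated independently of T5 with the following Python sketch. The model is replaced by a hypothetical callable returning a token-to-probability dictionary for a given prefix, and the end-of-sequence marker `</s>` is an assumption.

```python
def greedy_decode(next_token_probs, max_len=20, eos="</s>"):
    """Generate one sequence by always picking the most probable next token."""
    output = []
    for _ in range(max_len):
        probs = next_token_probs(output)
        token = max(probs, key=probs.get)
        if token == eos:
            break
        output.append(token)
    return output


# Toy stand-in for a trained model: next-token distributions keyed by prefix.
TOY_TABLE = {
    (): {"LOG": 0.7, "log": 0.3},
    ("LOG",): {".": 0.9, "</s>": 0.1},
    ("LOG", "."): {"info": 0.6, "debug": 0.4},
    ("LOG", ".", "info"): {"</s>": 0.8, "(": 0.2},
}


def toy_model(prefix):
    return TOY_TABLE[tuple(prefix)]


prediction = greedy_decode(toy_model)
```

Beam search would instead keep several candidate prefixes at each step, at the cost of producing and ranking multiple predictions.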

Study design
The goal of our study is to evaluate the performance of LEONID in supporting logging activities in Java methods. We focus on three scenarios: single log injection, in which we compare with our previous approach LANCE (Mastropaolo et al., 2022); multi-log injection; and deciding whether log statements are needed or not in a given Java method. The context is represented by the test datasets described in Section 2.2. Additionally, we assess LEONID as a whole, using it to both predict the need for log statements and, subsequently, generate and inject them (if needed).

Data collection and analysis
To answer RQ 1 we run both LEONID and LANCE against the test set described in Table 2 for the single log generation task. The only difference is that LANCE has been trained on the dataset not featuring the exemplar log messages added through IR (row Fine-tuning: Single Log Generation in Table 2), while LEONID exploits this information (row Fine-tuning: Single Log Generation with IR in Table 2). However, the training and test instances are exactly the same, allowing for a direct comparison. We assess the performance of the two techniques using the same evaluation schema employed in Mastropaolo et al. (2022). In particular, we contrast the predictions generated by the two models against the expected output (i.e., the Java method provided as input with the addition of the correct log statement). Note that generating and injecting a log statement (e.g., LoggerUtil.debug("execution ok")) involves correctly predicting several pieces of information: (i) the name of the variable used for the logging (i.e., LoggerUtil); (ii) the log level (i.e., debug); (iii) the log message (i.e., "execution ok"); and (iv) the position in the method in which the log statement must be injected. Thus, when a prediction is generated, three scenarios are possible:
Correct prediction: A prediction that correctly captures all above-described information, i.e., it matches the name used for the variable, the log level, message, and position as written by the original developers.
Partially correct prediction: A prediction that correctly captures a subset of the needed information (e.g., it correctly generates the log statement but injects it in the wrong position).
Wrong prediction: None of the above-described information is correctly predicted.
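The three scenarios can be expressed as a small Python function comparing the four components of a predicted log statement with those of the target one. The `LogStatement` container and the representation of the position as a line index are illustrative assumptions.

```python
from dataclasses import dataclass

COMPONENTS = ("variable", "level", "message", "position")


@dataclass(frozen=True)
class LogStatement:
    variable: str   # e.g., LoggerUtil
    level: str      # e.g., debug
    message: str    # e.g., "execution ok"
    position: int   # hypothetical: line index of the statement in the method


def classify_prediction(predicted, target):
    """Return the scenario label and the per-component match flags."""
    matches = {c: getattr(predicted, c) == getattr(target, c) for c in COMPONENTS}
    if all(matches.values()):
        return "correct", matches
    if any(matches.values()):
        return "partially correct", matches
    return "wrong", matches


target = LogStatement("LoggerUtil", "debug", "execution ok", 4)
predicted = LogStatement("LoggerUtil", "debug", "execution ok", 6)
label, matches = classify_prediction(predicted, target)
```

The per-component flags are what allows reporting, for partially correct predictions, how often each component was predicted correctly.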
We answer RQ 1 through the following combination of quantitative and qualitative analysis. On the quantitative side, we report for both LEONID and LANCE the percentage of correct, partially correct, and wrong predictions. For the partially correct, we report the percentage of cases in which each of the ''log statement components'' (i.e., variable name, log level, log message, and log position) has been correctly predicted. As for the percentage of correct and partially correct predictions, we pairwise compare them among the experimented techniques, using McNemar's test (McNemar, 1947), which is a proportion test suitable to pairwise compare dichotomous results of two different treatments. We complement McNemar's test with the Odds Ratio (OR) effect size. We use Holm's correction procedure (Holm, 1979) to account for multiple comparisons.
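For intuition, McNemar's test only looks at the discordant pairs, i.e., the test instances on which exactly one of the two techniques is correct. The Python sketch below uses the exact (binomial) variant of the test and the matched-pairs odds ratio b/c; the text does not state which variant of the test or of the OR was used, so both choices are assumptions.

```python
from math import comb


def mcnemar_exact(b, c):
    """Two-sided exact McNemar p-value.

    b: instances where only the first technique is correct.
    c: instances where only the second technique is correct.
    """
    n = b + c
    if n == 0:
        return 1.0
    tail = sum(comb(n, i) for i in range(min(b, c) + 1)) / 2 ** n
    return min(1.0, 2 * tail)


def matched_odds_ratio(b, c):
    """Odds ratio for paired data: ratio of the two discordant counts."""
    return b / c if c else float("inf")


# Made-up discordant counts for illustration only.
p_value = mcnemar_exact(30, 10)
odds = matched_odds_ratio(30, 10)
```

An OR above 1 indicates that the first technique is correct where the second fails more often than the opposite.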
Concerning the quality of the log messages generated by the two techniques, looking for exact matches (i.e., cases in which the generated log message is identical to the one written by developers) is quite limiting, considering that a prediction including a message different from but semantically equivalent to the target one could still be valuable. For this reason, we also compute the following four metrics used in Natural Language Processing (NLP) for the assessment of automatically generated text:
BLEU (Papineni et al., 2002) assesses the quality of the automatically generated text in terms of n-gram overlap with respect to the target text. The BLEU score ranges between 0 (the sequences are completely different) and 1 (the sequences are identical) and can be computed considering four different values of n (i.e., BLEU-{1, 2, 3, 4}). Besides these four variants, we also compute their geometric mean (i.e., BLEU-A).
METEOR (Banerjee and Lavie, 2005) is a metric based on the harmonic mean of unigram precision and recall. Compared to BLEU, METEOR uses stemming and synonyms matching to better reflect the human perception of sentences with similar meanings. Values range from 0 to 1, with 1 being a perfect match.
ROUGE (Lin, 2004) is a set of metrics focusing on automatic summarization tasks. We use the ROUGE-LCS (Longest Common Subsequence) variant which returns three values: the recall computed as LCS(X,Y)/length(X), the precision computed as LCS(X,Y)/length(Y), and the F-measure computed as the harmonic mean of recall and precision, where X and Y represent two sequences of tokens.
LEVENSHTEIN Distance (Levenshtein, 1966) provides an indication of the percentage of words that must be changed in the synthesized log message to match the target log message. This is accomplished by computing the normalized token-level Levenshtein distance (NTLev) between the predicted log message and the target one. Such a metric can act as a proxy to estimate the effort required from a developer to fix a non-perfect log message suggested by the model.
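Two of these metrics are simple enough to sketch directly. The Python code below computes the ROUGE-LCS recall, precision, and F-measure as defined above, and a token-level Levenshtein distance normalized by the length of the longer sequence; the normalization factor is an assumption, since it is not spelled out in the text.

```python
def lcs_length(x, y):
    """Length of the longest common subsequence of two token lists."""
    dp = [[0] * (len(y) + 1) for _ in range(len(x) + 1)]
    for i, xi in enumerate(x, 1):
        for j, yj in enumerate(y, 1):
            if xi == yj:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(x)][len(y)]


def rouge_lcs(x, y):
    """Recall LCS/len(x), precision LCS/len(y), and their harmonic mean."""
    lcs = lcs_length(x, y)
    recall = lcs / len(x) if x else 0.0
    precision = lcs / len(y) if y else 0.0
    total = recall + precision
    f_measure = 2 * recall * precision / total if total else 0.0
    return recall, precision, f_measure


def normalized_token_levenshtein(predicted, target):
    """Token-level edit distance divided by the longer sequence length."""
    prev = list(range(len(target) + 1))
    for i, p in enumerate(predicted, 1):
        curr = [i]
        for j, t in enumerate(target, 1):
            curr.append(min(prev[j] + 1,              # delete a predicted token
                            curr[j - 1] + 1,          # insert a target token
                            prev[j - 1] + (p != t)))  # substitute if different
        prev = curr
    longest = max(len(predicted), len(target))
    return prev[-1] / longest if longest else 0.0


target_msg = "failed to open file".split()
predicted_msg = "failed to read file".split()
recall, precision, f_measure = rouge_lcs(target_msg, predicted_msg)
ntlev = normalized_token_levenshtein(predicted_msg, target_msg)
```

In this example one token out of four differs, so a developer would need to change 25% of the suggested message to obtain the target one.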
We also statistically compare the distributions of the BLEU-4 (computed at sentence level), METEOR, ROUGE, and LEVENSHTEIN distance values of the predictions generated by LEONID and LANCE. We assume a significance level of 95% and use the Wilcoxon signed-rank test (Wilcoxon, 1945), adjusting p-values using Holm's correction (Holm, 1979). Cliff's Delta (d) is used as effect size (Grissom and Kim, 2005) and it is considered: negligible for |d| < 0.10, small for 0.10 ≤ |d| < 0.33, medium for 0.33 ≤ |d| < 0.474, and large for |d| ≥ 0.474 (Grissom and Kim, 2005).
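The effect size and the p-value adjustment described above can be sketched in a few lines of Python. The snippet below is only illustrative: the function names are ours, the quadratic-time Cliff's Delta is the textbook definition rather than an optimized implementation, and the Wilcoxon test itself is left to a statistics library.

```python
from typing import List, Sequence


def cliffs_delta(a: Sequence[float], b: Sequence[float]) -> float:
    """Cliff's Delta: (#pairs with a > b minus #pairs with a < b) / (|a| * |b|)."""
    greater = sum(1 for x in a for y in b if x > y)
    smaller = sum(1 for x in a for y in b if x < y)
    return (greater - smaller) / (len(a) * len(b))


def magnitude(d: float) -> str:
    """Interpret |d| with the thresholds reported in the text."""
    size = abs(d)
    if size < 0.10:
        return "negligible"
    if size < 0.33:
        return "small"
    if size < 0.474:
        return "medium"
    return "large"


def holm_adjust(p_values: Sequence[float]) -> List[float]:
    """Holm's step-down correction; results follow the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        # The smallest p-value is multiplied by m, the next by m - 1, and so on.
        candidate = min(1.0, (m - rank) * p_values[idx])
        # Enforce monotonicity of the adjusted p-values.
        running_max = max(running_max, candidate)
        adjusted[idx] = running_max
    return adjusted
```

For instance, `magnitude(cliffs_delta(scores_leonid, scores_lance))` would label the effect size for two hypothetical lists of per-instance metric values.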
On the qualitative side, we manually inspected 300 of the partially correct predictions generated by each technique in which all information but the log message was correctly predicted. The goal of the inspection is to verify whether the generated log message, while different from the target one, is semantically equivalent to it. To this aim, two of the authors independently inspected all 600 log messages (300 for each approach), with the conflicts that arose in ∼11% (70) of cases being solved by a third author. We report the percentage of ''wrong'' log messages generated by both techniques that were classified as semantically equivalent to the target one.
To answer RQ 2 and evaluate the extent to which LEONID is able to correctly inject multiple log statements, we run LEONID against the test set reported in Table 2 (see row Fine-tuning: Multi-log Injection with IR). We then report the percentage of correct predictions generated by the approach (i.e., methods for which all the log statements that LEONID was supposed to generate and inject have been correctly predicted). In this case we do not compute the partially correct predictions since, if a prediction is not completely correct, it is not possible to match the generated log statements with the target ones to compare them. To make this concept clearer, consider the case in which LEONID was asked to generate two log statements s1 and s2 but it only injects one statement sp, which differs from both s1 and s2. We cannot know whether sp should be compared with s1 or with s2 to assess the percentage of partially correct predictions in terms of, e.g., log level. For this reason, we only focus on the predictions that are 100% correct (i.e., the output method is identical to the target one).
To answer RQ 3 , we run LEONID against the test sets presented in Table 3, reporting the confusion matrix of the generated predictions and the corresponding accuracy, recall, and precision. We compare these results with those of: (i) an optimistic classifier always predicting true (i.e., the method is in need of log statements); (ii) a pessimistic classifier always predicting false (i.e., no need for log statements); and (iii) a random classifier, randomly predicting true or false for each input instance. We use the same statistical analysis described for RQ 1 to compare LEONID with the baselines.
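The three baselines are trivial to express in code. The Python sketch below shows them together with an accuracy function on a toy, hypothetical test set of 100 methods in which only 2 need log statements; the method names, the seed, and the set itself are invented for illustration and are not part of our datasets.

```python
import random
from typing import Callable, Sequence


def optimistic(_method: str) -> bool:
    """Always predicts that the method needs log statements."""
    return True


def pessimistic(_method: str) -> bool:
    """Always predicts that the method does not need log statements."""
    return False


def make_random_classifier(seed: int = 0) -> Callable[[str], bool]:
    """Predicts true or false with equal probability for each instance."""
    rng = random.Random(seed)
    return lambda _method: rng.random() < 0.5


def accuracy(classifier: Callable[[str], bool],
             methods: Sequence[str],
             oracle: Sequence[bool]) -> float:
    """Fraction of instances whose prediction matches the oracle."""
    hits = sum(classifier(m) == label for m, label in zip(methods, oracle))
    return hits / len(oracle)


# Toy test set (hypothetical): 2 methods needing logs, 98 not needing them.
methods = [f"method_{i}" for i in range(100)]
oracle = [i < 2 for i in range(100)]

acc_optimistic = accuracy(optimistic, methods, oracle)
acc_pessimistic = accuracy(pessimistic, methods, oracle)
acc_random = accuracy(make_random_classifier(seed=42), methods, oracle)
```

On such an unbalanced set the pessimistic classifier is right on every ''no need'' instance, which is why accuracy alone is not sufficient and precision and recall are also reported.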

Results discussion
We discuss the achieved results by research question.

RQ 1 : Injecting a single log statement
Table 6 reports the results achieved by LEONID and LANCE in terms of correct and partially correct predictions for the task of single-log injection. For LEONID we only report the results when k = 5 (k being the number of similar coding contexts from which exemplar log messages are extracted), since this is the variant that achieved the best performance (results with k = 1 and k = 3 are available in Mastropaolo (2023)). The first row of Table 6 shows the percentage of correct predictions by both approaches, which is slightly higher for LEONID (+1.8% relative improvement, from 26.78% to 27.26%). This difference is statistically significant (adj. p-value < 0.01), with 1.12 higher odds of obtaining a correct prediction from LEONID as compared to LANCE.
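To make explicit how the relative improvement is obtained from the two percentages of correct predictions, the following Python snippet reproduces the computation. The function name is ours and only serves the example.

```python
def relative_improvement(baseline: float, improved: float) -> float:
    """Relative improvement of `improved` over `baseline`, in percent."""
    return (improved - baseline) / baseline * 100.0


# Percentages of correct predictions reported in the text for LANCE and LEONID.
lance_correct, leonid_correct = 26.78, 27.26
gain = relative_improvement(lance_correct, leonid_correct)
print(f"relative improvement: +{gain:.1f}%")
```

Note that the improvement is relative to the LANCE result; the absolute difference between the two approaches is 0.48 percentage points.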
The four subsequent rows report the cases in which one of the four log-statement components (variable, level, message, and position) was correctly predicted (✓), independently of whether the other three components were correct or not (−). As can be seen, there is no significant difference in the prediction of the log position, with both techniques correctly predicting it in ∼82.3% of cases. Differences are observed for the log variable and level in favor of LANCE (+1.0% and +0.9% relative improvement), and for the log message in favor of LEONID (+4.6% relative improvement). The log message is the part for which we observed the highest odds ratio (OR) among all comparisons. Considering that the only difference between LEONID and LANCE is the usage of IR, the improvement in the generation of meaningful log messages we targeted has been at least partially achieved. The latter comes, however, at a small price in the correct prediction of the log variable and level. Still, for these elements LEONID is able to generate a correct prediction in over 73.5% of cases, while the correct generation of the log message still represents the Achilles' heel of these techniques, with 31.55% correct predictions achieved by LEONID. Thus, we believe that improvements in the log message predictions should be favored even at the expense of losing a bit of prediction capability on other elements.
Digging further into the quality of the generated log messages, Table 7 reports the results computed using the four NLP metrics presented in Section 4 for both models (in bold the best results). All metrics suggest that the log messages generated by LEONID are closer to those written by humans. According to our statistical analysis (results in Table 8), all these differences are statistically significant (adj. p-value < 0.001) with, however, a negligible effect size.
The result of our manual inspection of 300 partially correct predictions by LEONID and by LANCE points to a similar story: We found 198 of those generated by LEONID (66%) to report the same information as the target log message, despite being lexically different. The remaining 102 (34%) predictions, instead, reported a log message completely different from the target one or not meaningful at all. For LANCE, the number of semantically equivalent log messages is slightly lower, 192 (64%), but in line with that observed for LEONID. Examples of different but semantically equivalent log messages generated by LEONID are reported in Fig. 3. The methods labeled with ''Target Java Method'' represent the ''oracle'', namely the log statement that LEONID was supposed to generate. Those instead labeled with ''Predicted Method'' represent the generated prediction, which is different from the expected target but, according to our manual analysis, still valid.
Answer to RQ 1. The 3.6 times larger training dataset (as compared to the original one we used in Mastropaolo et al. (2022)) resulted in a boost of performance when predicting the log message (15.20% in Mastropaolo et al. (2022) vs. 30.16%). Such a result has been further improved by LEONID, which achieves a +4.6% relative improvement (i.e., 31.55% of correctly generated log messages).
All metrics used to assess the quality of the log messages generated by LEONID indicate improvements over LANCE.However, these improvements are marginal, showing that more research is needed to further improve the automated generation of log messages.

RQ 2 : Injecting multiple log statements
As explained in Section 4, it is not possible to compute the partially correct predictions in the scenario of multiple log injection. Thus, we limit our discussion to the correct predictions generated by LEONID. Independently of the value of k (i.e., the number of similar coding contexts from which exemplar log messages are extracted), LEONID can correctly predict all log statements to inject in a given method in >23% of cases. Also in this scenario, k = 5 is confirmed as the best configuration, with 23.51% of correct predictions. Fig. 4 depicts two cases for which LEONID correctly recommended more than one log statement: four in example 1 and three in example 2.
Interestingly, the drop in performance as compared to the simpler scenario of single-log injection is there but is not substantial (27.26% vs. 23.51%). Remember that in this experiment we removed from a given Java method M a random number y of log statements, with 1 ≤ y ≤ n and n being the number of log statements in M. Thus, it is possible that most of the methods in our dataset had n = 1 and, as a consequence, y = 1 (i.e., LEONID must generate one log statement), thus making the task similar to the single-log injection. For this reason, we inspected our test set and found indeed that 85% of methods in it featured, in their original form, a single log statement. On top of this, there is another 6.7% of methods which originally had more than one log statement and from which we randomly removed y = 1 statement, thus again resulting in instances requiring the addition of a single log statement. We clustered the instances in the test set based on the number of log statements that LEONID was required to generate. We created two subsets: (i) one-log, having y = 1; and (ii) at-least-two-log, having y ≥ 2. The one-log subset features 91.7% of the instances in the test set (22,104 out of 24,088) and, on those, LEONID achieves 24.1% correct predictions; the at-least-two-log subset features 1984 instances (8.3%), on which LEONID has a 17.0% success rate. Thus, there is an actual performance drop when LEONID needs to predict multiple log statements in a given method. Still, in 17% of cases, LEONID is able to inject the same log statements manually written by developers. To give a term of comparison, in our original paper presenting LANCE (Mastropaolo et al., 2022), we reported a 15.2% success rate for the task of single-log injection.
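As a sanity check on the figures above, the overall success rate can be recovered as the average of the two subset success rates weighted by the subset sizes. The short Python snippet below performs this computation using only numbers reported in the text; the variable names are ours.

```python
# Test set composition reported in the text.
TOTAL = 24_088
ONE_LOG = 22_104
MULTI_LOG = TOTAL - ONE_LOG  # at-least-two-log subset

# Success rates (in percent) reported for the two subsets.
ONE_LOG_RATE = 24.1
MULTI_LOG_RATE = 17.0

share_one = ONE_LOG / TOTAL
share_multi = MULTI_LOG / TOTAL

# Size-weighted average of the two subset success rates.
overall = share_one * ONE_LOG_RATE + share_multi * MULTI_LOG_RATE
print(f"one-log share: {share_one:.1%}, multi-log share: {share_multi:.1%}")
print(f"weighted overall success rate: {overall:.2f}%")
```

Up to rounding of the subset rates, the weighted average agrees with the 23.51% of correct predictions reported for the whole test set.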
Answer to RQ 2. LEONID can support the task of multiple log injection, achieving 17.0% of correct predictions when more than one log statement must be injected. It is important to highlight that in this task it is up to the model to infer how many log statements are actually needed in the method given as input, making it more complex than the single-log injection experiment even when only a single log statement must be injected.

RQ 3 : Deciding whether log statements are needed
Fig. 5 reports the confusion matrices for the test sets in Table 3, which differ in the proportion of need/no need instances they feature. The rows in the matrices represent the oracle and the columns the predictions. For example, the first matrix to the left indicates that out of the 11,627 (11,013+614) methods in need of log statements, LEONID correctly identified 11,013 of them, wrongly reporting the remaining 614 as no need.
The overall accuracy of the classifier is always very high (≥0.95), indicating that most instances are correctly classified. Similarly, the recall for the ''need'' class is always ≥0.94 (see Fig. 5), suggesting that most of the methods in need of log statements are identified.
Instead, the precision drops to 0.51 when the test set is very unbalanced towards the ''no need'' class, with only 238 need instances. Indeed, every classification error weighs a lot more on the precision when the number of need instances is so low: The 219 misclassifications represent 49%, i.e., 219/(230+219), of the instances that LEONID classifies as in need of log statements. Given the overall very good performance achieved by LEONID, we decided to inspect these 219 instances to understand the rationale behind the recommendation by LEONID (i.e., add log statements). What we found is that, indeed, these are cases which are worth the attention of the developers since they may benefit from additional logging.
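The precision and recall values discussed above follow directly from the confusion-matrix counts. The Python sketch below recomputes them; the number of false negatives on the most unbalanced test set (8) is not stated explicitly in the text and is derived here as 238 need instances minus the 230 correctly identified ones.

```python
def precision(true_pos: int, false_pos: int) -> float:
    """Fraction of instances predicted as ''need'' that really need logs."""
    return true_pos / (true_pos + false_pos)


def recall(true_pos: int, false_neg: int) -> float:
    """Fraction of instances needing logs that are predicted as ''need''."""
    return true_pos / (true_pos + false_neg)


# First confusion matrix: 11,013 need instances found, 614 missed.
recall_first = recall(11_013, 614)

# Most unbalanced test set: 230 correct ''need'' predictions, 219 false alarms.
precision_unbalanced = precision(230, 219)
# Derived assumption: 238 need instances in total, hence 238 - 230 = 8 missed.
recall_unbalanced = recall(230, 238 - 230)

print(f"recall (first test set): {recall_first:.2f}")
print(f"precision (unbalanced test set): {precision_unbalanced:.2f}")
print(f"recall (unbalanced test set): {recall_unbalanced:.2f}")
```

The example shows why precision is the metric most affected by class imbalance: the 219 false alarms are few compared to the size of the ''no need'' class, but comparable to the number of true positives.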
Fig. 6 shows two examples of ''no need methods'' classified by LEONID as in need of additional log statements. We added the LOG_STMT text bordered in red to indicate positions which may benefit from logging, especially considering the other log statements present in the method. For example, in method run (example 2) the developers used a log statement to document the reason for the InterruptedException in the second try/catch, while a similar scenario in the first try/catch is not logged. Overall, based on our manual inspection of the ''false positives'', we are confident that these could still represent valuable recommendations for developers.
When comparing the correct predictions achieved by LEONID with those of the optimistic, pessimistic, and random classifier, we always found a statistically significant difference in favor of LEONID (adj. p-value < 0.001) accompanied by an odds ratio (OR) going from a minimum of 6.17 to a maximum of 1426. The only exception is, as expected, the comparison with the pessimistic classifier on the 2-98 test set, on which the pessimistic classifier achieves 98% of correct predictions. In this case, we found no statistically significant difference (adj. p-value = 0.63) with LEONID (detailed results in Mastropaolo (2023)).
Finally, we conducted a full-system assessment in which we integrated the classifier and generator into a pipeline that first determines whether log statements are necessary and, if so, activates the module responsible for injecting the logs. The achieved results showed that our end-to-end logging system can correctly inject the needed log statements in ∼23% (5538/24,088) of the cases. This must be compared with the 27.26% achieved in RQ 1 when we only assessed the generation of log statements, ''providing'' LEONID only with instances that needed a log statement. Thus, while there is a slight loss in performance, the achieved results confirm the ability of LEONID to automatically assess the need for log statements.
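The control flow of the end-to-end pipeline can be summarized with the following Python sketch. The two callables stand in for the trained classifier and generator: their names and signatures are hypothetical, and the stubs used in the usage example are toy stand-ins rather than actual models.

```python
from typing import Callable


def log_pipeline(method_source: str,
                 needs_logging: Callable[[str], bool],
                 inject_logs: Callable[[str], str]) -> str:
    """Run the classifier first and call the injector only when needed."""
    if not needs_logging(method_source):
        # The classifier reports ''no need'': the method is left untouched.
        return method_source
    # The injector returns the method with one or more log statements added.
    return inject_logs(method_source)


# Toy stubs (hypothetical), used only to exercise the control flow.
def stub_classifier(source: str) -> bool:
    return "catch" in source


def stub_injector(source: str) -> str:
    return source.replace("{ }", '{ LOG.error("failure", e); }')


with_catch = "try { run(); } catch (Exception e) { }"
without_catch = "int sum(int a, int b) { return a + b; }"

logged = log_pipeline(with_catch, stub_classifier, stub_injector)
untouched = log_pipeline(without_catch, stub_classifier, stub_injector)
```

In this arrangement a false negative of the classifier prevents the injector from running at all, which helps explain why the end-to-end result is lower than the one obtained when the generator is assessed in isolation.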
Answer to RQ 3. LEONID can discriminate between methods needing and not needing additional log statements, with an accuracy higher than 0.95. This allows LEONID to both predict the need for log statements and generate them.

Threats to validity
Construct validity. The building of our fine-tuning datasets relies on the assumption that the exploited code instances, as written by developers, represent the ''correct'' predictions that the models should generate. This is especially true for the classifier aimed at predicting whether log statements are needed. For example, the instances that we labeled as ''not needing log statements'' are methods featuring n ≥ 1 log statements from which we did not remove any log statement. Thus, we assume that these methods need exactly n log statements (i.e., the ones injected by the developers), not one more. This is a strong assumption, as confirmed by the examples in Fig. 6.
In addition, there is evidence in the literature showing that some projects may adopt suboptimal logging practices (Patel et al., 2022), thus again raising questions about the quality of the adopted ground truth. Future work should involve developers in the assessment of the recommendations generated by LEONID or similar techniques. Still, using the code written by developers as oracle is a popular practice in DL for SE (Tufano et al., 2022b, 2019b,a; Watson et al., 2020b; Tufano et al., 2022a).
It is important to notice that, when preparing the fine-tuning datasets, we removed log statements from any location within a Java method. As a consequence, certain methods may contain empty blocks (e.g., an empty if block that only contained the log statement), thus hinting to the model the right location in which the log statement should be injected (since there is likely something missing in that unusual empty block). To address this problem, we assessed the model's performance on a subset of our initial test set featuring 17,455 instances (∼73% of the original test set) in which there were no empty blocks left within the test method after removing the log statements. The results indicate that LEONID remains competitive even in this more challenging scenario, correctly generating and injecting log statements in 25.30% (4416/17,455) of the test instances (as compared to the 27.26% obtained on the full test set).
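To give an idea of what such a filter may look like, the Python sketch below flags methods that contain an empty block once the log statements have been removed. It is a rough heuristic written for illustration and not the procedure used in our study: the regular expression is an assumption, it works on raw text rather than on a parsed method, and it would also flag constructs such as empty array initializers or braces inside string literals.

```python
import re

# A pair of braces enclosing only whitespace (heuristic, text-based).
EMPTY_BLOCK = re.compile(r"\{\s*\}")


def has_empty_block(java_method: str) -> bool:
    """True if the method text contains at least one empty `{ }` block."""
    return EMPTY_BLOCK.search(java_method) is not None


# Hypothetical method after the removal of its only log statement.
after_removal = """
void check(Object x) {
    if (x == null) {
    }
    process(x);
}
"""

# Hypothetical method in which the block still contains other statements.
no_hint = """
void check(Object x) {
    if (x == null) {
        return;
    }
    process(x);
}
"""

leaks_position = has_empty_block(after_removal)
is_safe = not has_empty_block(no_hint)
```

Instances for which the check returns true would be excluded from the more challenging subset, since the empty block may reveal where the log statement was located.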
Internal validity. We performed a limited hyperparameter tuning only focused on identifying the best learning rate, while we relied on the best architecture identified by Raffel et al. (2020) for the other parameters. We acknowledge that additional tuning can result in improved performance. Also, different similarity measures used to retrieve similar coding contexts from the training set may lead to different results. Our choice of the Jaccard similarity was due to practical reasons: Since a given input method to LEONID must be compared with all entries in the training set, we needed a very efficient similarity measure in terms of required computational time. For example, we also implemented a variant of LEONID exploiting CodeBLEU (Ren et al., 2020) as a similarity measure. Considering that larger and larger training sets will likely be used in the future, a scalable solution is a must also to make LEONID usable in practice.
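A minimal Python sketch of a Jaccard-based retrieval step is reported below to illustrate why this measure is cheap to compute. It is not the implementation used in LEONID: the regular-expression tokenizer, the treatment of methods as sets of tokens, and the function names are assumptions made for the example.

```python
import heapq
import re
from typing import List, Sequence, Set, Tuple


def tokenize(code: str) -> Set[str]:
    """Split code into a set of identifiers/keywords and single symbols
    (a simplistic tokenizer, assumed for illustration)."""
    return set(re.findall(r"[A-Za-z_]\w*|\S", code))


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Size of the intersection divided by the size of the union."""
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)


def top_k_similar(query: str,
                  training_methods: Sequence[str],
                  k: int = 5) -> List[Tuple[float, int]]:
    """Return the k (similarity, index) pairs with the highest similarity."""
    query_tokens = tokenize(query)
    scored = ((jaccard(query_tokens, tokenize(m)), i)
              for i, m in enumerate(training_methods))
    return heapq.nlargest(k, scored)


# Hypothetical training entries and query method.
training = [
    "void close() { stream.close(); }",
    "int size() { return items.size(); }",
    "void open() { stream.open(); }",
]
query = "void close() { stream.close(); }"
best = top_k_similar(query, training, k=2)
```

Each comparison only requires set operations on the tokens of the two methods, and the token sets of the training entries could be computed once and cached.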
External validity. Our research questions have been answered using a dataset 3.6 times larger than the one we originally used when proposing LANCE (Mastropaolo et al., 2022). Also, the new dataset is more varied, featuring projects using different build systems (as compared to the Maven-only policy we relied on in Mastropaolo et al. (2022)) and having dependencies towards different logging libraries (differently from the original Log4j-only policy we ended up using in Mastropaolo et al. (2022)). Still, we do not claim generalizability of our findings for different populations of projects, especially those written in other programming languages. This holds not only when looking at the performance achieved on our test set (i.e., different test sets can yield different results), but also when considering the usage in LEONID of information collected via IR from the training set (i.e., the performance observed for LEONID is bound to the variety of data present in our training set). Additional experiments are needed to corroborate/contradict our findings.

Empirical studies on logging practices
Yuan et al. (2012a) conducted one of the first empirical studies on logging practices in open-source systems, analyzing C and C++ projects. They show that developers make massive usage of log statements and continuously evolve them with the goal of improving debugging and maintenance activities. Fu et al. (2014) studied the logging practices in two industrial projects at Microsoft, investigating in particular which code blocks are typically logged. They also propose a tool to predict the need for a new log statement, reporting a 90% F-Score.
Chen and Jiang (2017b) and Zeng et al. (2019) extended the study of Yuan et al. (2012a) to Java and Android systems, respectively. In particular, Chen and Jiang analyzed 21 Java-based open-source projects while Zeng et al. considered 1444 open-source Android apps mined from F-Droid. Both studies confirmed the results of Yuan et al. (2012a), finding a massive presence of log statements in the analyzed systems. Zhi et al. (2019) investigated how logging configurations are used and evolve, distilling 10 findings about practices adopted in logging management, storage, formatting, and configuration quality. Other researchers studied the evolution and stability of log statements. For example, Kabinna et al. (2018) examined how developers of four open source applications evolve log statements. They found that nearly 20%-45% of log statements change throughout the software lifetime. Zhou et al. (2020) explored the impact of logging practices on data leakage in mobile apps. In addition, they propose MobiLogLeak to automatically identify log statements in deployed apps that leak sensitive data. Their study shows that 4% of the analyzed apps leak sensitive data.
Recently, Li et al. (2020a) conducted an extensive investigation on logging practices from a developer's perspective. The goal of this research is to push the design of automated tools based on actual developers' needs (rather than on researchers' intuition). The authors surveyed 66 developers and analyzed 223 logging-related issue reports, shedding light on the trade-off between costs and benefits of logging practices in open source. The results show that developers adopt an ad hoc strategy to balance costs and benefits while inserting logging statements for various activities (e.g., debugging).
The above-described papers lay the empirical foundations for techniques supporting developers in logging activities (including our work).Approaches such as LEONID can help in reducing the cost of logging while supporting developers in taking proper decisions when they wish to add log statements.

Automating logging activities
Researchers proposed techniques and tools to support developers in logging activities.
Log message enhancement. Yuan et al. (2012b) proposed LogEnhancer as a prototype to automatically recommend relevant variable values for each log statement, refactoring its message to include such values. Their evaluation on eight systems demonstrates that LogEnhancer can dramatically reduce the set of potential root failure causes when inspecting log messages. Liu et al. (2019) tackled the same problem using, however, a customized deep learning network. Their evaluation showed that the mean average precision of their approach is over 84%. Ding et al. proposed LoGenText (Ding et al., 2022), an NMT (Neural Machine Translation) approach for improving the quality of log messages: By taking the code preceding a given log statement, LoGenText can translate it into a short textual description that can be used for logging. Such an approach can be considered complementary to the one presented in our paper.
Log placement. Other researchers targeted the suggestion of the best code location for log statements (Jia et al., 2018; Li et al., 2018; Li, 2020). For example, Zhu et al. (2015) presented LogAdvisor, an approach to recommend where to add log statements. The evaluation of LogAdvisor on two Microsoft systems and two open-source projects reported an accuracy of 60% when applied on pieces of code without log statements. Yao et al. (2018) tackled the same problem in the specific context of monitoring the CPU usage of web-based systems, showing that their approach helps developers when logging. Li et al. (2020b) proposed a deep learning framework to recommend logging locations at the code block level. They report an 80% accuracy in suggesting logging locations using within-project training, with slightly worse results (67%) in a cross-project setting. Cândido et al. (2021) investigated the effectiveness of log placement techniques in an industrial context. Their findings (e.g., 79% accuracy) show that models trained on open source code can be effectively used in industry.
Log level recommendation. A third family of techniques focuses on recommending the proper log level (e.g., error, warning, info) for a given log statement (Yuan et al., 2012a; Oliner et al., 2012). Mizouchi et al. (2019) proposed PADLA as an extension for the Apache Log4j framework to automatically change the log level in order to better record runtime information in case of anomalies. The DeepLV approach proposed by Li et al. (2021) uses instead a deep learning model to recommend the level of existing log statements in methods. DeepLV aggregates syntactic and semantic information of the source code and showed its superiority with respect to the state-of-the-art.
Lastly, in our previous work (Mastropaolo et al., 2022) we introduced LANCE, a tool to inject complete log statements by automatically selecting a proper log level, log message and log location.

Combining DL and IR to automate code related tasks
Although DL showed great potential in supporting various software engineering tasks (Watson et al., 2022), recent work showed how its performance can be further boosted by combining it with IR-based techniques. Lam et al. (2017) proposed to use IR alongside DL for bug localization. The IR technique assesses the textual similarity between bug reports and code files. The DL model is then used to learn relationships between terms in the two different vocabularies (i.e., bug reports vs. source code) and compute the final similarity score. The reported results show that DL and IR complement each other well, with their combination outperforming the individual techniques used in isolation. Similarly, Choetkiertikul et al. (2018) proposed to combine IR and DL for identifying software components relevant for a given open issue. Yu et al. (2022) combined DL with IR for the task of automated assertion generation. The idea is to use IR to retrieve the test method most similar to the target one for which an assert statement must be generated. If the similarity between the retrieved method and the target one is higher than a threshold, the assert of the retrieved method is reused. Otherwise, a DL-based approach is used to generate the assert.
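The retrieve-or-generate strategy described for assertion generation can be summarized with the Python sketch below. It only mirrors the control flow described above and is not the implementation of Yu et al. (2022): the function names, the similarity function, the threshold value, and the toy data are all hypothetical.

```python
from typing import Callable, Sequence, Tuple


def retrieve_or_generate(target: str,
                         corpus: Sequence[Tuple[str, str]],
                         similarity: Callable[[str, str], float],
                         generate: Callable[[str], str],
                         threshold: float) -> str:
    """Reuse the assert of the most similar test method if it is similar
    enough to the target; otherwise fall back to the DL-based generator.

    `corpus` holds (test_method, assert_statement) pairs.
    """
    best_score, best_assert = -1.0, ""
    for method, assert_stmt in corpus:
        score = similarity(target, method)
        if score > best_score:
            best_score, best_assert = score, assert_stmt
    if best_score > threshold:
        return best_assert
    return generate(target)


def word_overlap(a: str, b: str) -> float:
    """Toy similarity (hypothetical): Jaccard overlap of whitespace tokens."""
    ta, tb = set(a.split()), set(b.split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0


corpus = [
    ("test add two numbers", "assertEquals(3, add(1, 2));"),
    ("test list is empty", "assertTrue(list.isEmpty());"),
]


def fallback(_target: str) -> str:
    return "assertNotNull(result);"


reused = retrieve_or_generate("test add two numbers", corpus,
                              word_overlap, fallback, threshold=0.5)
generated = retrieve_or_generate("check parser output", corpus,
                                 word_overlap, fallback, threshold=0.5)
```

The threshold controls how often the retrieved assert is trusted over the generated one.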
In this work, we combine IR and DL to improve the performance of log statement generation, especially for what concerns the definition of a meaningful log message.

Conclusions and future work
We started by discussing the limitations of LANCE (Mastropaolo et al., 2022), the approach we presented at ICSE'22 for the generation of complete log statements. LANCE always assumes that a single log statement must be injected in a method provided as input. This is a strong assumption considering that a method may not need logging or may need more than one log statement. Thus, we presented LEONID, an extension of LANCE able to partially address these two limitations, making a further step ahead in the automation of logging activities. Also, we experimented in LEONID with a combination of DL and IR with the goal of improving the generation of meaningful log messages achieving, however, only limited improvements over LANCE. In light of the results we have obtained, LEONID can ensure up to 27.26% correct predictions when asked to inject a single log statement in Java methods. On the other hand, when the model is requested to inject multiple logging statements, we observed that they were correctly added in 17% of the methods. In addition, LEONID is capable of differentiating between methods that necessitate additional log statements and those that do not, achieving an accuracy surpassing 0.95.
We are working on the implementation of LEONID as a tool to be deployed to developers.This is the next step needed to perform in vivo studies, thus better understanding the main weaknesses of current DL-based log generation.

Fig. 2. Example of instance in the ''Single Log Generation with IR'' dataset.(For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)

A. Mastropaolo et al.

Fig. 4. Correct predictions made by LEONID when injecting more than one log statement.

Fig. 5. RQ 3 : Results achieved by LEONID when deciding whether log statements are needed or not in Java methods.


Fig. 6. RQ 3 : Examples of methods that may benefit from further logging. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)
Fig. 7 provides an overview of how LEONID operates in an end-to-end logging scenario. In this context, the CLASSIFIER module first determines whether log statements are required for the target method. If log statements are necessary, the INJECTOR component inserts one or more log statements into the provided Java method.
to Java and Android systems, respectively. In particular, Chen analyzed 21 Java-based open-source projects while Zeng et al. considered 1444 open-source Android apps mined from F-Droid. Both studies confirmed the results of Yuan et al. (2012a), finding a massive presence of log statements in the analyzed systems. Zhi et al. (

Table 1
State-of-the-art approaches supporting developers in logging activities.

Table 2
Number of methods in the datasets used in our study.

Table 3
Number of methods in the datasets used to predict the need for log statements.

Table 4
T5 hyperparameter tuning results (in bold the best learning rate).

Table 2 (single and multi-log injection) and Table 3 (deciding whether logging is needed).
RQ 1 : To what extent is LEONID able to correctly inject a single complete logging statement in Java methods? RQ 1 mirrors the study we performed when presenting LANCE. We experiment with LEONID in the same scenario presented in Mastropaolo et al. (2022): The injection of a single log statement in a given Java method. We compare the performance of LEONID with that of LANCE when training and testing them on the same dataset.
RQ 2 : To what extent is LEONID able to correctly inject multiple log statements when needed? RQ 2 tests LEONID in the more challenging scenario of injecting from 1 to n log statements in a Java method, as needed.
RQ 3 : To what extent is LEONID able to properly decide when to inject log statements? RQ 3 analyzes the accuracy of LEONID in predicting whether or not log statements are needed in a given Java method.

Table 6
RQ 1 : Correct and partially correct predictions by LEONID and LANCE on the single-log injection task.