Follow-up Attention: An Empirical Study of Developer and Neural Model Code Exploration

Recent neural models of code, such as OpenAI Codex and AlphaCode, have demonstrated remarkable proficiency at code generation due to the underlying attention mechanism. However, it often remains unclear how the models actually process code, and to what extent their reasoning and the way their attention mechanism scans the code matches the patterns of developers. A poor understanding of the model reasoning process limits the way in which current neural models are leveraged today, so far mostly for their raw prediction. To fill this gap, this work studies how the processed attention signal of three open large language models - CodeGen, InCoder and GPT-J - agrees with how developers look at and explore code when each answers the same sensemaking questions about code. Furthermore, we contribute an open-source eye-tracking dataset comprising 92 manually-labeled sessions from 25 developers engaged in sensemaking tasks. We empirically evaluate five heuristics that do not use the attention and ten attention-based post-processing approaches of the attention signal of CodeGen against our ground truth of developers exploring code, including the novel concept of follow-up attention which exhibits the highest agreement between model and human attention. Our follow-up attention method can predict the next line a developer will look at with 47% accuracy. This outperforms the baseline prediction accuracy of 42.3%, which uses the session history of other developers to recommend the next line. These results demonstrate the potential of leveraging the attention signal of pre-trained models for effective code exploration.


I. INTRODUCTION
Large language models (LLMs) pre-trained on code, such as Codex [1], CodeGen [2], and AlphaCode [3], have demonstrated remarkable proficiency at program synthesis and competitive programming tasks. Yet our understanding of why they produce a particular solution is limited. In large-scale practical applications, the models are often used for their prediction alone, i.e., as generative models, and the way they reason about code internally remains largely untapped.
These models are often based on the attention mechanism [4], a key component of the transformer architecture [5]. Besides providing substantial performance benefits, attention weights have been used to provide interpretability of neural models [6,7,8]. Additionally, existing work [9,10,11,12] suggests that the attention mechanism reflects or encodes objective properties of the source code processed by the model. We argue that just as software developers consider different locations in the code individually and follow meaningful connections between them, the self-attention of transformers connects and creates information flow between similar and linked code locations. This raises a question: Are human attention and model attention comparable? And if so, can the knowledge about source code conveyed by the attention weights of neural models be leveraged to support code exploration?

Matteo Paltenghi is with the University of Stuttgart, Stuttgart, Germany. E-mail: mattepalte@live.it. Work done while at GitHub Next for a research internship. Rahul Pandita and Albert Ziegler are with GitHub Inc, San Francisco, CA, USA. E-mail: {rahulpandita, wunderalbert}@github.com. Austin Z. Henley is with Microsoft Research, Redmond, WA, USA. E-mail: azh321@gmail.com.

Although there are other observable signals that might capture the concept of relevance, such as gradient-based methods [13,14] or layer-wise relevance propagation [15], this work focuses on approaches using only the attention signal. There are two reasons for this choice: (1) almost all state-of-the-art models of code are based on the transformer block [5], and the attention mechanism is ultimately its fundamental component, so we expect the corresponding attention weights to carry directly meaningful information about the models' decision process; (2) attention weights can be extracted almost for free during generation, with little runtime overhead, since the attention is computed automatically during a single forward pass.
Answering the main question of this study requires a dataset tracking developers' attention. In this work, we use visual attention as a proxy for the elements to which developers are paying mental attention while looking at code. However, the existing datasets of visual attention are not suitable for our purposes. Indeed, they either put the developers in an unnatural environment where most of the vision is blurred [8] and participants must move the mouse over tokens to reveal them, which forces unnatural interactions and may bias how developers explore and understand code, or they contain few and very specific code comprehension tasks [16] on code snippets too short to exhibit any interesting code navigation pattern. To address these limitations and stimulate developers not only to glance at code but also to reason deeply about it, we prepare an ad-hoc code understanding assignment called the sensemaking task. This involves questions on code, including mental code execution, side-effect detection, algorithmic complexity, and deadlock detection. Moreover, using eye tracking, we collect and share a dataset of 92 valid sessions with developers.
On the neural model side, motivated by some recent successful applications of few-shot learning in code generation and code summarization [17,18] and even zero-shot learning in program repair [19], the sensemaking task is designed to be a zero-shot task for the model, with a specific prompt that triggers it to reason about the question at hand. Then we query three LLMs of code, namely CodeGen [2], InCoder [20], and GPT-J [21], on the same sensemaking task and compare their attention signal to the attention of developers. The correlation with CodeGen, the largest model, is the highest among the LLMs studied (r=+0.23), motivating the use of raw and processed versions of CodeGen's attention signal for code exploration. To that end, we experimentally evaluate how well existing and novel attention post-processing methods align with the code exploration patterns derived from our dataset's chronological sequence of eye-fixation events. To the best of our knowledge, this work is the first to investigate the attention signal of these pre-trained models to support code exploration, a specific code-related task directly related to code reading work [22,23].
We empirically demonstrate that post-processing methods based on the attention signal can be well aligned with the way developers explore code. In particular, using the novel concept of follow-up attention, we achieve the highest overlap with the developers' ground truth on which line to explore next.
Contributions: This paper makes the following contributions:
⋆ Sensemaking Task: A novel task and setup to deepen our understanding of how the LLM attention connects to the temporal sequence of location shifts regarding developer focus.
⋆ Eye-Tracking Dataset: A novel dataset of 92 eye-tracking sessions of 25 developers engaged in sensemaking tasks while using a common code editor with code written in three popular programming languages (Python, C++, and C#).
⋆ Follow-up Attention: The analytical formula for follow-up attention, a novel post-processing approach derived solely from the attention signal, which aligns well with the developer interaction of which line to look at next when exploring code.
⋆ Empirical Study: The first comparison of both effectiveness and visual attention of LLMs and developers when reasoning on sensemaking questions, together with an empirical evaluation comprising ten post-processing approaches of the attention signal, five heuristics, and an ablation study of the follow-up attention against the collected ground truth of developers exploring code.

II. RELATED WORK
This section provides an overview of related work around the explanatory role of attention and previous studies of the attention of neural models and developers when reasoning on code.
Attention as explanation. Preliminary work [24] studying attention weights of recurrent neural models found that the attention weights do not always agree with other explanation methods and that alternative weights can be adversarially constructed while still preserving the same model prediction. In response, Wiegreffe and Pinter [25] have shown that the alternative attention weights can be constructed only for a single instance prediction, whereas obtaining a model which is consistently wrong in its explanations is very unlikely. Along the same lines, Tutek and Šnajder [26] have proposed four regularization methods to mitigate the adversarial exploitation of attention weights for recurrent models, including the use of residual connections, which are natively embedded into transformers [5], the building blocks of the LLMs studied in this work. To further corroborate this connection between attention and explanation, Rabin et al. [27] have shown that even Sivand, an explainability technique based on program simplification, pinpoints important tokens which largely overlap with those reported by the attention mechanism.
Attention studies of neural models of code. Paltenghi and Pradel [8] have compared the attention weights of neural models of code and developers' visual attention when performing a code summarization task, and found a strong positive correlation on the copy attention mechanism for an instance of a pointer network [28]. Further works [9,11] have then shown how the attention weights of pre-trained models on source code capture important properties of the abstract syntax tree of the program. However, none of them considered the use of the attention signal for a code-related task, such as code exploration. Moreover, they are limited to relatively small self-attention transformer models, whereas we study the attention of CodeGen [2], InCoder [20], and GPT-J [21], large generative models with masked self-attention.
Eye-tracking studies. Turner et al. [29] conducted an eye-tracking study involving 38 students fixing or describing five simple Python and C++ programs (5-13 LoC), showing that the fixation duration is comparable between the two languages. Beelders [30] has qualitatively observed the eye movement of 36 students and four lecturers when reading and mentally executing a short C# program (12 LoC). An eye-tracking dataset with 216 participants has been collected by [16]; however, they only consider two short snippets of code (11-22 LoC), since they do not support scrolling. Similarly, Blascheck and Sharif [22] and Busjahn et al. [23] have studied the reading order in C++ and Java code comprehension tasks, focusing on six small programs that could fit into a single screen, whereas we consider longer snippets and a much larger dataset of 45 unique tasks. Sharifi et al. [31] have recently studied code navigation strategies on Java code with eye tracking involving 36 participants, focusing on the bug-fixing process; in contrast, we study the sensemaking task, which might elicit a different kind of reasoning compared to bug fixing. To more closely mimic real-world setups in integrated development environments (IDEs), Guarnera et al. [32] propose iTrace, an eye-tracking plugin for IDEs that can track developers' eye movements in more realistic and dynamic coding environments beyond a single screen of code. Further studies, including Fakhoury et al. [33], have proposed Gazel, an IDE plugin that supports eye tracking in the context of source code editing. Following this latest trend, we also use an IDE plugin to collect the eye-tracking data, allowing for a more realistic coding environment.

III. SENSEMAKING TASK
To study developers' and models' attention, we prepare a code understanding task called the sensemaking task, because the developer has to "make sense" of code to answer the question correctly. One sensemaking task is contained in a single source code file p composed of four sections: (1) a brief description of the context of the main code snippet (e.g., The following code reasons about triangles in the geometrical sense.), (2) the main code snippet, either sourced from the internet or written from scratch by the authors, (3) a sensemaking question to stimulate the reasoning (i.e., Question:), and (4) a final prompt to trigger the model's answer (i.e., Answer:). Note that all the sections except the main snippet are in the form of code comments. Figure 1 shows an example task, whereas the full list of questions can be seen in Table I.
To source the tasks for our study, we rely on GeeksforGeeks, a well-known website for programming education and practice. This website offers a variety of problem statements that are commonly used in typical technical interviews by modern software companies, as shown by previous research [34]. Therefore, we expect that the software developers would have some familiarity with this type of program. We then create specific sensemaking questions about these programs, inspired by the kind of questions that an interviewer might pose, such as asking about the output, complexity, correctness, or code modification. Indeed, many of our questions are concrete instances of question templates such as "What is the purpose of the code?" (nqueens_Q1), "What is the program supposed to do?" (tree_Q3), or "What code could have caused this behavior?" (triangle_Q1), which have also been identified as questions that software engineers often ask themselves in a real working setting [35]. To stimulate code exploration, many of them are also instances of reachability questions [36]; namely, they involve the search over all feasible paths of a program to locate target statements matching search criteria. Some examples of these are "What are the implications of this change?" (triangle_Q3) or "How does application behavior vary in these different situations that might occur?" (triangle_Q2, tree_Q1, multithread_Q3).
We prepare five main snippets and create three unique questions for each of them. Then we translate the same task into three programming languages: Python, C++, and C#. In total, we have 45 unique tasks. Although the sensemaking task includes questions that might have also been asked in studies focused on code comprehension [37], the main difference is that those studies typically restrict the scope of their questions to either bottom-up [37] or top-down [38] comprehension tasks. In our sensemaking task, besides the code snippet and the question, participants also receive the header of the file with some contextual information, which creates a blend of bottom-up and top-down comprehension that is typically not seen in code comprehension studies. This decision is motivated by our goal of stimulating code exploration, where the participants have to combine pieces of information found at different locations into an integrated mental model.
Neural Model's Task. We feed the entire source file of a single task as input, also referred to as the prompt, to the generative model and query it for three different answers in the form of text completion. A model processes the input file p by splitting it into tokens via a deterministic tokenizer ($p = t_1, \ldots, t_n$) and then generates a sequence of tokens as output, as shown on the left of Figure 2. We allow the models to generate an answer of at most 100 tokens, which is more than enough to respond to all the questions. We use three widely used open-source pre-trained models, namely CodeGen [2] in its language-agnostic variant, InCoder [20], and GPT-J [21], all in their largest variants with 16B, 6B, and 6B parameters, respectively. To query the model multiple times, we use the temperature sampling strategy with a temperature of 0.2.
Developers' Task. We recruit 25 software developers via direct contacts at a large software company, ranging from interns to more senior software engineers, thus having diverse degrees of familiarity with software development and programming. We track the eye gaze of each participant during a session of 19 minutes (on average) while they answer as many questions as possible, typically three or four. We ensure they see each main code snippet only once to avoid bias in answering a question on a snippet they have already explored in a previous task. The eye-tracking setup is calibrated at the beginning of each task to ensure consistent data collection.

IV. PROBLEM FORMULATION
The majority of modern large language models (LLMs) are based on the architecture of generative pre-trained transformers (GPT) [39], such as Codex [1], CodeGen [2], and AlphaCode [3]. Self-attention is a mechanism used in these models that allows each processed token to weigh its own importance with respect to other tokens in the same sequence, enabling the model to capture relationships and dependencies within the sequence. In particular, the representation of each token can incorporate information from tokens that come earlier in the sequence, but cannot incorporate information from tokens that come later in the sequence. In this work, when a token A incorporates information from another token B, we say that A attends to B, or equivalently that token A pays attention to token B. This attention is usually quantified by a scalar value, called attention weight, which is computed by the model in its attention mechanism.
When the model takes as input a sequence of x tokens, the attention mechanism is applied to each token in the sequence. Figure 2 on the left shows a toy example with a model of three layers and two attention heads, together with the attention generated by the model. For each token, the attention is computed sequentially through the L layers of the neural model and, at each layer, the attention is computed in parallel H times, once for each sub-network called attention head. Fixing a combination of layer and head, the attention given by the i-th token to the other tokens of the sequence can be represented by a vector of weights $a_i = (a_{i,1}, a_{i,2}, \ldots, a_{i,x})$, where $a_{i,j}$ is the weight given by the token at position i to the token at position j. Note that a token cannot attend to any token that comes later in the sequence, thus the weights $a_{i,j}$ are zero for $j > i$. Stacking the attention vectors one after the other as rows, we obtain an attention matrix $A = (a_1, a_2, \ldots, a_x)$ for the specific combination of layer and attention head; note that it is a lower triangular matrix.
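The lower triangular structure described above can be illustrated with a toy single-head example. This is only a sketch under stated assumptions: the raw scores are random numbers rather than the output of any of the studied models, and the function name `causal_attention` is hypothetical.

```python
import numpy as np


def causal_attention(scores: np.ndarray) -> np.ndarray:
    """Row-wise softmax over raw scores, masked so token i only sees j <= i."""
    x = scores.shape[0]
    visible = np.tril(np.ones((x, x), dtype=bool))
    masked = np.where(visible, scores, -np.inf)
    masked = masked - masked.max(axis=1, keepdims=True)  # numerical stability
    weights = np.exp(masked)  # exp(-inf) == 0 for the masked positions
    return weights / weights.sum(axis=1, keepdims=True)


# Toy sequence of x = 5 tokens with random (assumed) raw scores.
rng = np.random.default_rng(0)
A = causal_attention(rng.normal(size=(5, 5)))
```

Each row of `A` plays the role of one attention vector $a_i$: it sums to one and has zeros for every position j > i.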
Thus, when the input file comprising n tokens ($t_1, \ldots, t_n$) is fed to the model f, besides a predicted answer of m newly generated tokens ($t_{n+1}, \ldots, t_{n+m}$), the model also computes an attention tensor A of shape $(L, H, n + m, n + m)$, where L is the number of layers and H is the number of attention heads. In particular, when comparing developers' and the model's attention, we focus on studying the attention weights referring to the prompt tokens only, even if some post-processing approaches may use the entire tensor A.
Note that, by construction, not all tokens can attend to all other tokens; thus we define the notion of followers of a token $t_i$ as the set of tokens that can pay attention to $t_i$. This set is defined as $F(t_i) = \{t_j \mid j > i\}$, where the subscript represents the position of the token in the sequence.

A. Views of Attention
In our problem formulation, we model an extraction function g that takes as input the attention tensor A and returns either a measure of how much attention the model pays to each part of the prompt or a measure of how much each part is linked to other parts of the prompt. Depending on the case, we refer to the outputs as visual attention vector or interaction matrix, respectively.
Visual Attention Vector. It is a static view telling us which part of the input is important for the model when solving the sensemaking task. We define the visual attention of a model as a vector $a = (a_1, \ldots, a_c)$ over the c characters of the prompt, where each $a_i$ intuitively tells us how much attention was given to the i-th character when solving the task. We use $g_{viz}(A)$ to model a function that takes as input the attention tensor A and returns a visual attention vector a.
Interaction Matrix. It is a dynamic view that tells us, given a position in the prompt, which other position of the prompt is more deeply connected to it. We define an interaction matrix S as a right stochastic matrix with size $n \times p$, where n is the number of tokens in the prompt and p is the number of admissible target positions in the prompt. We distinguish two kinds of interaction matrices depending on the granularity of the target position, which points either to another token or to a line in the source code (the latter being of interest primarily with developer tooling in mind, which is often line based). Respectively, we call them: (1) token-level, where S has size $n \times n$ and n is the number of tokens in the prompt; (2) line-level, where S has size $n \times n_l$ and $n_l$ is the number of lines in the prompt. We use $g_{token}(A)$ and $g_{line}(A)$ to model two functions that take as input the tensor A and output an interaction matrix, either $S_{token}$ or $S_{line}$, respectively.

Fig. 2: Overview of the three extraction functions for the visual attention vector and the interaction matrix, both follow-up and mean. Note that a and b represent specific aggregation functions as explained in the text (e.g., mean, max or sum). The darker the red color, the more attention is paid by the token on row i to the token on column j.

V. EXTRACTION FUNCTIONS
We investigate two algorithms for extracting the visual attention vector and four for the interaction matrix.

A. Attention Extraction Overview
Figure 2 illustrates the process of querying the model and extracting its attention signals, leading to the comparison with human developers. In particular, it shows how the attention tensor A is derived by querying the model, and how it is post-processed to extract both the interaction matrix $S_{token}$ and the visual attention vector a. The process is split into three phases: generation, post-processing, and human comparison. In the generation phase, a neural model with L layers and H attention heads processes n prompt tokens to generate both m tokens and an attention tensor A. In the figure, the model has three layers and two attention heads, handling five prompt tokens and generating four new tokens. During the post-processing phase, the attention tensors are aggregated to form interaction matrices $S_{token}$ using two techniques: mean aggregation and follow-up attention, as explained in Section V-C. Note that the matrices are cut to consider only the attention to the n tokens of the prompt. Then, the token-to-token interaction matrix is converted to the line-level interaction matrix $S_{line}$ by aggregating the attention weights of the tokens belonging to the same line. At this point, the visual attention vector a is extracted from the $S_{token}$ matrix (see Section V-B) and converted to the character level. In the last phase, the interaction matrices $S_{line}$ and the visual attention vectors a are compared with human data collected via eye tracking.
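The conversion from a token-level to a line-level interaction matrix can be sketched as follows. This is a minimal illustration, not the authors' implementation: summing the token weights per line and re-normalizing each row are assumptions (the text only states that weights of tokens on the same line are aggregated), and `to_line_level` and `token_line` are hypothetical names.

```python
import numpy as np


def to_line_level(S_token: np.ndarray, token_line: np.ndarray, n_lines: int) -> np.ndarray:
    """Aggregate columns of tokens on the same line, then row-normalize.

    token_line[j] is the (0-based) line index of token j. The sum is an
    assumed aggregation function; mean or max could be swapped in.
    """
    S_line = np.zeros((S_token.shape[0], n_lines))
    for line in range(n_lines):
        S_line[:, line] = S_token[:, token_line == line].sum(axis=1)
    row_sums = S_line.sum(axis=1, keepdims=True)
    # Rows without any weight stay all-zero instead of dividing by zero.
    return np.divide(S_line, row_sums, out=np.zeros_like(S_line), where=row_sums > 0)


# Four tokens: the first two on line 0, the last two on line 1.
S_token = np.array([
    [1.0, 0.0, 0.0, 0.0],
    [0.5, 0.5, 0.0, 0.0],
    [0.2, 0.3, 0.5, 0.0],
    [0.1, 0.1, 0.4, 0.4],
])
token_line = np.array([0, 0, 1, 1])
S_line = to_line_level(S_token, token_line, n_lines=2)
```

Row-normalizing keeps the result right stochastic, so each row can be read as a distribution over the lines a token points to.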

B. Visual Attention Vector Extraction
We introduce two alternative approaches, called attention mean and attention max, to condense the attention tensor A to the visual attention vector a, namely to implement $g_{viz}(A): A \to a$. The first approach is visualized in the bottom part of Figure 2.
Attention mean. It aggregates over all the layers L and attention heads H by taking the average attention weight for each token position. After keeping only the prompt tokens, this step outputs a matrix A with shape (n, n) where each element $A_{i,j}$ is the average attention paid by the i-th token to the j-th token across all layers and heads. Note that it is a lower triangular matrix because, by construction, a token cannot attend to tokens coming after it. Then, we compute the mean of each column excluding the zeros, to avoid penalizing more recent tokens with fewer followers. This step corresponds to representing each token $t_i$ with the average attention given to it by its followers $F(t_i)$, thus we call the step mean of followers. It outputs a token-level visual attention vector a that is converted to a character-level vector by dividing the attention weight of a single token in equal shares among all its characters.
Attention max.This approach differs from the previous one in how it condenses layers and heads in the first step, replacing the mean with the max function to favor the extremely positive signals appearing only in one or few layers and heads; the rest is unchanged.
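The two variants above can be sketched in a few lines of NumPy. This is an illustrative sketch under stated assumptions: the structural zeros above the diagonal are excluded via a mask (so the diagonal self-attention entry is kept in the column mean), and `visual_attention` and `token_lengths` are hypothetical names.

```python
import numpy as np


def visual_attention(A, n_prompt, token_lengths, reduce="mean"):
    """Condense A of shape (L, H, n+m, n+m) into a char-level vector.

    token_lengths[j] is the number of characters of prompt token j.
    """
    token_lengths = np.asarray(token_lengths)
    flat = A.reshape(-1, A.shape[2], A.shape[3])  # merge layers and heads
    M = flat.mean(axis=0) if reduce == "mean" else flat.max(axis=0)
    M = M[:n_prompt, :n_prompt]                   # keep prompt tokens only
    # Mean of followers: average each column over the rows that can attend
    # to it (assumption: rows i >= j, i.e. the lower triangle).
    followers = np.tril(np.ones_like(M, dtype=bool))
    token_att = np.where(followers, M, 0.0).sum(axis=0) / followers.sum(axis=0)
    # Split each token weight in equal shares among its characters.
    return np.repeat(token_att / token_lengths, token_lengths)


# Toy tensor: 2 layers, 2 heads, 4 tokens, uniform causal attention.
causal = np.tril(np.ones((4, 4)))
causal = causal / causal.sum(axis=1, keepdims=True)
A = np.broadcast_to(causal, (2, 2, 4, 4)).copy()
a = visual_attention(A, n_prompt=3, token_lengths=[2, 1, 3])
a_max = visual_attention(A, n_prompt=3, token_lengths=[2, 1, 3], reduce="max")
```

Passing `reduce="max"` switches only the first condensation step, mirroring the difference between attention mean and attention max.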

C. Interaction Matrix Extraction
We study four approaches: mean, max, rollout, and follow-up attention. Apart from the rollout attention, which has been introduced by [40], the other three are either inspired by the work of [8] or a novel contribution of this work, such as the follow-up attention.
Attention mean. It computes the mean among all the L layers and H attention heads: $S = \frac{1}{L \cdot H} \sum_{l=1}^{L} \sum_{h=1}^{H} A_{l,h}$, where $A_{l,h}$ is the attention matrix of head h at layer l.
Rollout attention. Introduced by [40], it propagates attention across layers by multiplying the attention weights along multiple paths starting and ending in the same input-output pair. Since it does not model the attention head dimension, we condense that dimension via a simple sum. There is no a priori criterion for which attention layer should be used in the end. Thus, we sum the rollout values of all the layers. See [40] for more details.
Follow-up attention. It is our novel approach that centers on modeling the flow of information between subsequent layers. The intuition behind it is that follow-up attention tracks whether a token being attended to in one layer will cause a different token to be attended to in the next layer. This is the model analogue of tracking humans jumping from attending one token to attending a different token next.
Algorithm 1 summarizes the entire procedure, which is also represented in Figure 2. Similarly to the rollout attention, we aggregate the attention weights over the H attention heads by summing the weights of A along the attention head dimension and obtaining a layer-wise attention L, a 3-dimensional tensor (Line 3). The follow-up attention explicitly models the temporal relationship between the attention weights computed at different layers, since the attention weights in the later layers depend on the earlier ones. The intuition is that the layer-after-layer transformation reflects how the models explore code through time, similar to multiple successive fixations of a developer when navigating and exploring source code. Instead of looking at how a token gives attention to other tokens in the same layer, the follow-up attention adopts a differential approach which compares the attention received by token i at layer z with the attention received by token j at layer z − 1. To represent this received attention, we define the follower score $f_i^{(z)}$ of token i at layer z as the vector of the attention quota that each other token (which we call observers) gives to token i at the same layer (Line 9). Note that, similarly to the attention vector, the follower score is also a vector of real numbers with the same length as the input sequence, thus representing a complementary viewpoint. To quantify the agreement between follower scores at two consecutive layers, we use the cosine similarity as a soft version of the intersection between the sets of followers of the two tokens (Line 11). Then we compute the follow-up attention for each ordered pair of tokens i → j (Lines 6-7) and for each pair of consecutive layers (Line 4), and condense all layer pairs into a single matrix via sum (Line 13). We aggregate attention over multiple layers since [41] have empirically shown that token identifiability is retained over layers, thus a generic embedding $e_i^l$ at position i in any layer l is traceable to the input embedding $x_i$ in the input sequence.
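The procedure described above can be approximated by the following sketch. It is not a reproduction of Algorithm 1, which is not included in this text: the layer indexing follows the wording of the paragraph (token i at layer z versus token j at layer z − 1), no normalization or prompt cut is applied, and `follow_up_attention` and `eps` are hypothetical names.

```python
import numpy as np


def follow_up_attention(A: np.ndarray, eps: float = 1e-12) -> np.ndarray:
    """A has shape (L, H, n, n); returns an (n, n) matrix S for pairs i -> j."""
    layerwise = A.sum(axis=1)  # (L, n, n): heads summed within each layer
    n_layers, n, _ = layerwise.shape
    S = np.zeros((n, n))
    for z in range(1, n_layers):
        # Column i of a layer holds what every observer gives to token i,
        # so transposing places the follower score of token i on row i.
        f_curr = layerwise[z].T
        f_prev = layerwise[z - 1].T
        f_curr = f_curr / (np.linalg.norm(f_curr, axis=1, keepdims=True) + eps)
        f_prev = f_prev / (np.linalg.norm(f_prev, axis=1, keepdims=True) + eps)
        # S[i, j] accumulates cos(f_i at layer z, f_j at layer z - 1).
        S += f_curr @ f_prev.T
    return S


# Toy tensor: 3 identical layers, 2 heads, 4 tokens, uniform causal attention.
causal = np.tril(np.ones((4, 4)))
causal = causal / causal.sum(axis=1, keepdims=True)
A = np.broadcast_to(causal, (3, 2, 4, 4)).copy()
S = follow_up_attention(A)
```

With identical layers, each token's follower score matches itself across layers, so the diagonal of `S` equals the number of consecutive layer pairs; real models produce off-diagonal structure instead.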

VI. CODE EXPLORATION DATASET
The sessions are single-purpose and live-monitored by an experimenter to ensure correct setup and focus on the code exploration task. While each participant has 45 minutes to solve as many tasks as possible, due to calibration and transition times they spend an average of 18.54 minutes exploring code using the IDE, with an average of 4.92 minutes per question. A pair of code snippet and question is looked at by a median of 3 different participants, and each code snippet is looked at by a median of 7 different participants. No participant is presented with the same code snippet more than once.
Each session consists of a sequence of eye fixation events $evt_{eye}$, each represented as a tuple (t, x, y, d), where t is the timestamp in milliseconds, x and y are the coordinates of the fixation point in pixels, and d is the duration of the fixation in milliseconds. The average number of fixations per session is 603.66. Each session is recorded in Visual Studio Code to have a natural coding environment. Based on the size of the parafoveal region [42], each eye fixation event is converted to column and line coordinates: $evt_{eye}^{(char)} = (t, c, l, d)$, where c is the column and l is the line of the original source file.

A. Eye Tracking Setup
To collect the eye-tracking data, we use an eye tracker from GazePoint (model GP3, with 0.5-1 degree of visual angle accuracy), which is placed below the monitor, thus not requiring the user to wear any additional device. Note that our setup is as close as possible to a normal coding session, without any invasive or unnatural methods. The participants can see between 21 and 26 lines of code. The screen size is 527 mm x 296 mm with a resolution of 1920x1080 pixels. The participant sits at a fixed distance of approximately 30 cm from the screen. Fixations are computed by the internal fixation filter of the eye tracker, which uses a custom algorithm based on displacement [43], using the FPOG (Fixation Point of Gaze) data stream. Eye-tracking data are pre-processed using custom Python code (version 3.8), described below and openly shared (see Data Availability).
Besides collecting the eye-tracking data $evt_{eye}$, our setup also collects $evt_{txt}$ events coming from a custom VSCode plugin that logs the visible text on the screen. A visible text event $evt_{txt}$ corresponds to a tuple (t, txt, f, l), where t is the timestamp in milliseconds, txt is the visible text, f is the file name shown in the code area, and l is the line number of the first visible line with respect to the given file. Note that this event is crucial since we study long code snippets and also allow screen scrolling. To ensure a consistent grid mapping between pixel positions and character positions in the text, we use a monospace font, prevent partial scrolling, and prevent any resizing of the code area during the experiment. Then, for each timeframe, we map the pixel of each eye fixation event to a specific character position in the code area by using a grid over the character positions identified by a line and column.
To derive the developer attention maps from the eye-tracking data, we first synchronize data from the VSCode plugin and the eye tracker via their timestamps. Then, we convert the x and y coordinates of the fixation point of gaze of each $evt_{eye}$ to the corresponding character line and column coordinates in the relative coordinate system of the code area. Knowing the line number l of the first visible line, we can attribute the fixation to a specific character position in the original file. In this way, we convert each $evt_{eye}$ to its equivalent event in character coordinates $evt_{eye}^{(char)} = (t, c, l, d)$, where t is the timestamp, c and l are the column and line coordinates of the fixation point with respect to the original file, and d is the duration of the fixation.
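The synchronization and grid mapping steps can be sketched as below. All geometry values (code area origin, character width, line height) and the names `CodeArea`, `first_visible_line_at`, and `fixation_to_char` are hypothetical; the shared pre-processing code may differ in how it aligns the two event streams.

```python
import bisect
from dataclasses import dataclass


@dataclass(frozen=True)
class CodeArea:
    """Assumed pixel geometry of the code area with a monospace font."""
    left_px: float
    top_px: float
    char_w_px: float
    line_h_px: float


def first_visible_line_at(t, txt_events):
    """Return l from the latest (t, txt, f, l) event at or before time t."""
    timestamps = [e[0] for e in txt_events]  # assumed sorted by timestamp
    idx = bisect.bisect_right(timestamps, t) - 1
    return None if idx < 0 else txt_events[idx][3]


def fixation_to_char(evt, area, txt_events):
    """Map a pixel fixation (t, x, y, d) to file coordinates (t, c, l, d)."""
    t, x, y, d = evt
    first_line = first_visible_line_at(t, txt_events)
    col = int((x - area.left_px) // area.char_w_px)
    row = int((y - area.top_px) // area.line_h_px)
    if first_line is None or col < 0 or row < 0:
        return None  # fixation outside the code area or before any text event
    return (t, col, first_line + row, d)


area = CodeArea(left_px=100, top_px=50, char_w_px=10, line_h_px=20)
txt_events = [(0, "<visible text>", "task.py", 1), (500, "<visible text>", "task.py", 12)]
mapped = fixation_to_char((1000, 135, 95, 200), area, txt_events)
```

The grid lookup is only valid because the font is monospace and partial scrolling and resizing are disabled, as stated above.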
Since it is hard to tell whether, during a single fixation event, a participant is looking at a specific character or a group of neighboring characters, we attribute the developer's attention to neighboring characters. In particular, if the developer looks at position (c, l) in the original file, we augment our data by introducing new events which point to all the neighboring characters within a vertical offset $v_{off}$ and a horizontal offset $h_{off}$ from our coordinate (c, l). As a result, we replace each basic $evt_{eye}^{(char)} = (t, c, l, d)$ with the set of derived events $(t, c_{new}, l_{new}, d)$, where $c - h_{off} \le c_{new} \le c + h_{off}$ and $l - v_{off} \le l_{new} \le l + v_{off}$. This is strictly connected to the concept of the fovea region. Indeed, as reported by [42], our fovea region, which is responsible for sharp central vision, accounts for 2° of the visual field, whereas the parafoveal region, which is used for visual search and scene perception, accounts for 5° of the visual field. Thus, considering a 5° visual region and our screen size (527 mm x 296 mm), the developer can see 7.16 characters horizontally and 2.92 characters vertically. Rounding those quantities, we set $v_{off} = 1$ and $h_{off} = 4$.

Fig. 3: Example of two events where the yellow area corresponds to their contribution to the connection strength from token i to token j.

This approach also contributes to mitigating any small x and y errors in the eye-tracking data collection.
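The neighborhood augmentation described above can be illustrated with a minimal sketch. The `CharEvent` type and the `expand_fixation` helper are hypothetical names introduced here for illustration; the sketch also assumes 0-based coordinates and simply drops positions with negative indices, which the text does not specify.

```python
from typing import List, NamedTuple


class CharEvent(NamedTuple):
    """Fixation in character coordinates: (t, c, l, d)."""
    t: float  # timestamp
    c: int    # column in the original file (assumed 0-based)
    l: int    # line in the original file (assumed 0-based)
    d: float  # fixation duration


def expand_fixation(evt: CharEvent, h_off: int = 4, v_off: int = 1) -> List[CharEvent]:
    """Replace one fixation with one event per character in its neighborhood."""
    return [
        CharEvent(evt.t, c_new, l_new, evt.d)
        for l_new in range(evt.l - v_off, evt.l + v_off + 1)
        for c_new in range(evt.c - h_off, evt.c + h_off + 1)
        # Assumption: positions before the start of the file are discarded.
        if c_new >= 0 and l_new >= 0
    ]


derived = expand_fixation(CharEvent(t=0.0, c=10, l=5, d=0.25))
corner = expand_fixation(CharEvent(t=0.0, c=0, l=0, d=1.0))
```

With the offsets reported above, a fixation away from the file borders yields (2·4 + 1) × (2·1 + 1) = 27 derived events, each carrying the timestamp and duration of the original fixation.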

B. Ground Truth Visual Attention
Here we borrow from the concept of human attention proposed by [8] and define the analogous developer attention as the total time that a specific character was visible to the participant (i.e., in their field of vision): $d = (d_1, \ldots, d_c)$, where c is the number of characters in the prompt and $d_i$ is the total time that the i-th character was visible to the participant according to the eye-tracking data. In contrast to [8], we consider the character level instead of the token level because it is more natural for our eye-tracking data.
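As a rough illustration of this definition, the sketch below accumulates the durations of character-level events into one value per character of the prompt. The function name `char_attention`, the tuple layout `(t, c, l, d)`, and the 0-based line and column convention are assumptions made for this example only.

```python
from typing import Iterable, List, Tuple


def char_attention(source: str, events: Iterable[Tuple[float, int, int, float]]) -> List[float]:
    """Sum the durations of (t, c, l, d) events into one value per character."""
    lines = source.split("\n")
    # Offset of the first character of each line within the flat source string.
    starts, offset = [], 0
    for line in lines:
        starts.append(offset)
        offset += len(line) + 1  # +1 accounts for the newline character
    attention = [0.0] * len(source)
    for _t, c, l, d in events:
        # Events pointing outside the text (e.g. past the end of a line) are skipped.
        if 0 <= l < len(lines) and 0 <= c < len(lines[l]):
            attention[starts[l] + c] += d
    return attention


d_vec = char_attention("ab\ncd", [(0.0, 1, 0, 0.5), (1.0, 0, 1, 0.25), (2.0, 1, 0, 0.25)])
```

Here the character `b` is seen twice and accumulates 0.75 time units, `c` accumulates 0.25, and all other characters remain at zero.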

C. Ground Truth Interaction Matrix
From each developer session, we derive a ground truth interaction matrix S. For a fair comparison of neural models with developers, we take into account the tokenization used by the neural model; namely, we use the CodeGen tokenizer, which is based on byte-level byte-pair encoding [44].
To convert char-level events into token-level ones, for each timestamp, if at least one character of a given token is visible, then the token is considered visible as well, and we count the corresponding event $evt^{(token)}_{eye} = (t, i, d)$, where t is the timestamp, i is the token index, and d is the event duration. Based on the pairs of events involving token i and token j, we quantify how likely it is that the developer looks at token j after having looked at token i.
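The char-to-token conversion can be sketched as follows. The representation of tokens as half-open character spans `(start, end)` and the helper names are assumptions of this example; in practice the spans would come from the tokenizer, and the four tokens below are a hypothetical split rather than actual CodeGen output.

```python
from typing import List, Set, Tuple


def visible_tokens(visible_chars: Set[int], token_spans: List[Tuple[int, int]]) -> List[int]:
    """Indices of the tokens that have at least one visible character.

    token_spans[i] = (start, end) are half-open character offsets of token i.
    """
    return [
        i
        for i, (start, end) in enumerate(token_spans)
        if any(start <= ch < end for ch in visible_chars)
    ]


def to_token_events(
    t: float, d: float, visible_chars: Set[int], token_spans: List[Tuple[int, int]]
) -> List[Tuple[float, int, float]]:
    """Emit one (t, i, d) token-level event per visible token."""
    return [(t, i, d) for i in visible_tokens(visible_chars, token_spans)]


# "print(x)" split into four hypothetical tokens: "print", "(", "x", ")".
spans = [(0, 5), (5, 6), (6, 7), (7, 8)]
events = to_token_events(t=1.5, d=0.2, visible_chars={4, 5}, token_spans=spans)
```

Characters 4 and 5 belong to the first and second token respectively, so both tokens receive an event with the timestamp and duration of the underlying fixation.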
Intuitively, we want a stronger connection when a fixation on token i is shortly followed by a fixation on token j, and when this second fixation has a significant duration. Thus, we define the strength of the temporal connection between token i and token j as: where $P_{i \to j}$ is the set of all pairs of events in which token i is seen before token j, and the discounting factor α controls how quickly the connection decays as the two events move further apart in time. For our experiments we empirically set α = 0.1, accounting for the observed behavior where developers often spend several seconds scrolling and presumably shallowly searching the code. In Figure 3 we show an example of the integral connecting two consecutive events.

We noticed a strong neighboring effect across the whole dataset, where the connection between closer tokens tends to be relatively stronger irrespective of context and content. Indeed, developers do not jump randomly between likely locations within the code: they have a significant bias for staying close to their current position. We therefore begin by attempting to predict the observed strength of the temporal connection between tokens solely on the basis of their relative position. We suggest a two-tiered approach: first, consider whether the developer is traversing forward or backward; then, use a relative model for how far they will move in that direction.

We expect the ratio of forward to backward traversal to depend on the exact task; indeed, in our dataset the proportion of forward traversal ranged from 45.8% to 78.2%. For each task individually, as well as to some extent in general, the best simple predicting feature for traversal direction appears to be the current token position divided by the total number of tokens, i.e., a measure of how much of the document is still in front of the developer. Fitting individual linear regressions for going forward (Eq. 2) and backward (Eq. 3) (which do not sum up to 1 because of the chance of returning to the token itself) gives the predictions of where the $R^2$ values indicate the goodness of fit, i is the token index, max(i) is the total number of tokens in the document, and the numeric values are the coefficients of the linear regression.

We expect, and find, the distribution of the distance between consecutive gaze points to be less dependent on the task. Among the standard distributions we tested (normal, Poisson, lognormal, exponential, Pareto, Weibull), the distance is best modeled by a fitted Weibull distribution, with a best fit of shape = 0.89, scale = 98.14 tokens going forward and shape = 0.88, scale = 105.61 tokens going backward (see Figure 4).

Fig. 4: The strength of the connection $S_{i,j}$ depends significantly on the difference i − j. Both cases i > j and j > i can be well modelled using a Weibull distribution.
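To make the fitted distance model more concrete, the following sketch evaluates the standard two-parameter Weibull density and cumulative distribution with the shape and scale values reported above. Only the four fitted parameters come from the text; the function names are hypothetical, and the fitting procedure itself is not reproduced here.

```python
import math


def weibull_pdf(x: float, shape: float, scale: float) -> float:
    """Density of the two-parameter Weibull distribution at distance x > 0."""
    if x <= 0:
        return 0.0
    z = x / scale
    return (shape / scale) * z ** (shape - 1) * math.exp(-(z ** shape))


def weibull_cdf(x: float, shape: float, scale: float) -> float:
    """Probability that the jump distance is at most x."""
    if x <= 0:
        return 0.0
    return 1.0 - math.exp(-((x / scale) ** shape))


FORWARD = {"shape": 0.89, "scale": 98.14}    # fitted values reported in the text
BACKWARD = {"shape": 0.88, "scale": 105.61}  # fitted values reported in the text

near = weibull_pdf(10, **FORWARD)
far = weibull_pdf(100, **FORWARD)
within_scale = weibull_cdf(FORWARD["scale"], **FORWARD)
```

Because both fitted shapes are below 1, the density decreases monotonically with distance, which matches the neighboring effect described above; by the definition of the Weibull scale parameter, about 63% of the probability mass lies within one scale length.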
Thus, for the code exploration task, to extract relevance beyond mere closeness, we normalize each row of the interaction matrix S by dividing by the average empirical ground truth distribution, in which the probability of going to a token constantly decreases the further away the target token is.
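The normalization step can be sketched as an element-wise division of each row by a distance prior. This is only an illustration under stated assumptions: the prior is passed in as a function of the signed offset j − i, the `eps` guard against division by zero is an addition of this example, and the toy prior 1 / (1 + |k|) is hypothetical and stands in for the average empirical distribution used in the study.

```python
from typing import Callable, List


def normalize_by_distance(
    S: List[List[float]], prior: Callable[[int], float], eps: float = 1e-12
) -> List[List[float]]:
    """Divide S[i][j] by the prior probability of moving by the offset j - i."""
    return [
        [S[i][j] / max(prior(j - i), eps) for j in range(len(S[i]))]
        for i in range(len(S))
    ]


def toy_prior(offset: int) -> float:
    """Hypothetical prior that decays with the distance from the current token."""
    return 1.0 / (1.0 + abs(offset))


S = [[0.5, 0.25], [0.25, 0.5]]
S_norm = normalize_by_distance(S, toy_prior)
```

In this toy case the raw matrix only reflects closeness, so after normalization all entries become equal and no target stands out beyond what distance alone explains.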

VII. RESULTS
In this section, we compare the visual attention and interaction matrix extracted from the attention tensor of neural models against the ground truth computed from the developers. We organize our empirical investigation around the following research questions:
• RQ1: How effective are developers and neural models in solving sensemaking tasks?
• RQ2: How does the visual attention of developers and neural models compare?
• RQ3: How is the agreement between developers and neural models influenced by the programming language?
• RQ4: How do the interaction matrices of developers and neural models compare?
• RQ5: How is the effectiveness of follow-up attention influenced by layer choice and number of newly generated tokens?
RQ1 and RQ2 consider three neural models: CodeGen, GPT-J, and InCoder [20]. For the remaining questions, we focus on the larger and more effective CodeGen model.

A. RQ1: Answer Correctness
To evaluate the effectiveness of the developers and models in solving the sensemaking task, we annotate each answer generated by both groups, involving four annotators in the process. We use a scale of three values of correctness: (1) correct, when the answer touches all the expected correct points; (2) partial, if at least part of the correct answer is present or if the answer is wrong but in the same style as the correct solution (e.g., the Big-O notation); (3) wrong, when the answer does not contain any correct part. Note that, especially for the models, if extra text is generated beyond the correct or partial answer, we ignore the rest if it is incorrect. Moreover, whenever the question is under-specified, we accept multiple correct answers as long as they are compatible with the question. To ensure a reproducible annotation process, all the authors collectively come up with a shared set of gold-standard answers for each question. Then, two of the authors independently annotate more than 20% of the answers generated by the models and the developers, and within two rounds of annotation followed by discussion, the final set of gold-standard answers is agreed upon. The final agreement on the 20% of data led to a Cohen's Kappa of 0.711, 0.898, 0.833, and 0.783 for developers, CodeGen, GPT-J, and InCoder respectively, which is considered a very high agreement [45]. Finally, the remaining 80% of the data is split in half, and each half is annotated by only one of the two authors. The upper part of Figure 5

B. RQ2: Agreement on Visual Attention
To measure the agreement between the visual attention of developers and the neural model, we regard them as vectors with meaningful ordinal content and compute their Spearman rank correlation coefficient [46], in line with related work [8]. In Figure 6, we report the Spearman rank correlation between the developer attention vector and the model attention vector. For completeness, we also report the comparisons among developers. Note that we only compare the attention maps of two different subjects from different groups (e.g., developer vs. CodeGen, developer vs. GPT-J, etc.) when they look at the same code snippet and question. Moreover, neither the data from the participants nor those extracted from multiple model predictions are aggregated across subjects of the same group. Instead, we consider the comparisons of all the possible combinations of subjects from the two groups. We avoid aggregation because it may be sensitive to strongly deviating data of single participants, and because the identification of a suitable aggregation function is a non-trivial task, as reported by [47]. Related work [8] avoids aggregation for similar reasons. This approach is adopted in all the comparisons among subjects in the paper. The observed agreement exceeds that observed in previous work [8], which we hypothesize to be due to a combination of our use of a more advanced model [48] and a more natural data collection setup (eye tracking vs. a deblurring interface). To investigate whether higher model-human agreement is connected to higher effectiveness on the sensemaking task, we compare the agreement in the cases where both human and model are correct with the cases where both are wrong. We run a statistical t-test and, similarly to [8], we find that for InCoder the comparisons where both human and model are correct have a higher agreement than those where both are incorrect (p-value = 2.23e-11). We use the t-test under the assumption that the agreement values are normally distributed, which we confirmed by visual inspection. For the other two models, there is no statistical significance.
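For readers unfamiliar with the agreement measure used above, the following dependency-free sketch computes the Spearman rank correlation as the Pearson correlation of average ranks, which is the standard treatment of ties. It is an illustration only and not the implementation used in the study; the two attention vectors at the end are made-up values.

```python
import math
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks, where tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda k: values[k])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        shared = (pos + end) / 2 + 1
        for k in range(pos, end + 1):
            ranks[order[k]] = shared
        pos = end + 1
    return ranks


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation; NaN if either vector is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return float("nan")
    return cov / math.sqrt(vx * vy)


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman rank correlation between two attention vectors."""
    return pearson(average_ranks(x), average_ranks(y))


developer = [0.9, 0.1, 0.4, 0.0]   # made-up per-character viewing times
model = [0.30, 0.05, 0.20, 0.01]   # made-up model attention weights
rho = spearman(developer, model)
```

Since only ranks matter, the two vectors above agree perfectly even though their scales differ, which is why the measure suits comparing viewing times in seconds against unitless attention weights.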
Answer to RQ2: The attention of neural models trained on code, like CodeGen and InCoder, exhibits a significantly higher agreement (+0.22, +0.20) with the developers when answering sensemaking questions as compared to GPT-J (+0.04), which was mainly trained on natural language text.

C. RQ3: Programming Language Analysis
We investigate the differences in the agreement between developers and models across programming languages. In the lower part of Figure 5 we report the answer correctness (RQ1) divided into groups across the three programming languages under study: C#, C++, and Python. Since each developer participated in the study using only a single programming language, the difference in answer correctness of developers might be a result of either the programming language or the skill and background of the specific developer. On the other hand, the neural models are applied equally to the different languages, and thus we expect that the difference in answer correctness is due to the specific programming language. We compare the agreement between developers and models across programming languages with the Mann-Whitney U test [49], which compares two distributions. Table III illustrates that the agreement between developers and models is significantly higher on C# than on Python (p-value < 0.05), and marginally significantly higher on C# than on C++ (p-value < 0.1), making C# the language with the highest agreement between developers and models.
Answer to RQ3: The effectiveness in answering sensemaking questions is influenced by the specific programming language in which the task is posed to the neural model, with a gap of up to 13.3 absolute points. The agreement between developers' attention and neural models, in turn, is higher for C# than for Python, and marginally higher for C# than for C++.

D. RQ4: Agreement on Interaction Matrix
Considering CodeGen's better answer effectiveness and higher developer agreement, we restrict our subsequent investigation to this model.
To compute the agreement between the interaction matrices we use a row-by-row approach, thus considering the task from the perspective of the starting location and asking ourselves: to which code location should I look next, given that I am now looking at this token? As possible target code locations, we consider the other lines in the code; thus we compare line-level interaction matrices $S_{line}$. We focus on lines for two reasons: a) because it is well-known that LLMs attend to different kinds of tokens within a line than humans, such as punctuation or newlines [8], and b) because we argue that the line level is perhaps the most useful granularity from a hypothetical user perspective (see Section X). To obtain $S_{line}$, we take S and sum the probabilities referring to tokens on the same line; thus the probability of going to a line x is the probability of going to any token of that line. To compare corresponding rows of the ground truth interaction matrix and the one derived from our attention signal, we use both the Spearman rank correlation coefficient and the top-3 overlap, defined as the number of top-3 target positions shared between the ground truth row and the model-derived row. Moreover, to balance the fact that some potential starting tokens might be rarely (or only very transiently) looked at by the developers, we weight each comparison based on the total number of seconds spent by the developer on the corresponding starting token. We also cap this weight at 10 seconds to prevent long-observed tokens from dominating the comparison.
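The three ingredients of this comparison (line-level aggregation, top-3 overlap, and the capped time weighting) can be sketched as below. All function names and the toy numbers are hypothetical; how ties among equally scored lines are broken is not stated in the text, so the sketch simply keeps the earliest line.

```python
from typing import List, Sequence


def to_line_level(row: Sequence[float], token_to_line: Sequence[int], n_lines: int) -> List[float]:
    """Sum the probabilities of all target tokens that lie on the same line."""
    out = [0.0] * n_lines
    for j, p in enumerate(row):
        out[token_to_line[j]] += p
    return out


def top_k_overlap(truth_row: Sequence[float], model_row: Sequence[float], k: int = 3) -> int:
    """Number of target lines shared by the two top-k sets."""
    def top(row: Sequence[float]) -> set:
        # Assumption: ties keep the earliest line (sorted() is stable).
        return set(sorted(range(len(row)), key=lambda j: row[j], reverse=True)[:k])
    return len(top(truth_row) & top(model_row))


def weighted_mean(scores: Sequence[float], seconds: Sequence[float], cap: float = 10.0) -> float:
    """Average per-row scores, weighting each row by its capped viewing time."""
    weights = [min(s, cap) for s in seconds]
    total = sum(weights)
    if total == 0:
        return float("nan")
    return sum(w * x for w, x in zip(weights, scores)) / total


line_row = to_line_level([0.1, 0.2, 0.3, 0.4], token_to_line=[0, 0, 1, 2], n_lines=3)
shared = top_k_overlap([0.5, 0.1, 0.3, 0.05, 0.05], [0.1, 0.05, 0.4, 0.3, 0.15])
score = weighted_mean([1.0, 0.0], seconds=[30.0, 10.0])
```

In the last call, the first starting token was observed for 30 seconds but is capped at 10, so both rows contribute equally and the weighted score is 0.5 rather than 0.75.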
We run the different extraction functions on the model attention signal to obtain interaction matrices S line , which we compare to developer-derived ground truth.We distinguish between: (1) attention-based code traversal predictions, which are those introduced in Section V-C, and (2) attention-agnostic code traversal predictions.The attention-based methods comprise raw attention in the first and last layer, max, and mean, with their respective symmetric versions where the triangular matrix is mirrored and added to replace the zero values, the rollout, and follow-up attention.The attention-agnostic methods comprise: copycat recommending all the positions containing tokens identical to the starting token (e.g., starting from token print it recommends all other lines containing print with equal weight), uniform recommending all the positions preceding the current token, and position recommending the neighboring positions of the current token with a Gaussian distribution centered on the current token.
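The three attention-agnostic baselines can be approximated at the token level as follows. This is a sketch under stated assumptions: the normalization to a probability distribution, the exclusion of the starting token in copycat, and the Gaussian width `sigma` are choices made for this example and are not specified in the text; aggregation to lines would follow as described earlier.

```python
import math
from typing import List, Sequence


def copycat(tokens: Sequence[str], i: int) -> List[float]:
    """Equal weight on every other position holding the same token as position i."""
    matches = [1.0 if (j != i and tok == tokens[i]) else 0.0 for j, tok in enumerate(tokens)]
    total = sum(matches)
    return [m / total for m in matches] if total else matches


def uniform_preceding(n_tokens: int, i: int) -> List[float]:
    """Equal weight on all positions that precede position i."""
    if i <= 0:
        return [0.0] * n_tokens
    return [1.0 / i if j < i else 0.0 for j in range(n_tokens)]


def position(n_tokens: int, i: int, sigma: float = 5.0) -> List[float]:
    """Gaussian weights centered on position i (sigma is a hypothetical width)."""
    weights = [math.exp(-0.5 * ((j - i) / sigma) ** 2) for j in range(n_tokens)]
    total = sum(weights)
    return [w / total for w in weights]


tokens = ["print", "(", "x", ")", "print"]
cc = copycat(tokens, 0)
un = uniform_preceding(4, 2)
po = position(5, 2)
```

Starting from the first `print`, copycat places all of its weight on the second occurrence, whereas the position baseline ignores content entirely and peaks at the current token.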
We find that attention-based methods do carry predictive power, and in particular that follow-up attention performs best among all methods for both Spearman rank and top-3 overlap (Figures 7 and 8). We note that a purely position-based approach performs better than the copycat method despite being completely content-agnostic. We attribute this to developers' tendency to often read source code in (piecewise) linear order, as described by [22]. Regarding raw attention, [11] demonstrated that deeper semantic information is concentrated in later layers. Yet for both the triangular and symmetric versions, higher levels appear inferior to earlier levels at predicting eye movement, possibly because such deeper semantic information may not be apparent to developers.
Answer to RQ4: The follow-up attention function performs best in predicting the next code location to look at, with a Spearman rank correlation of +0.49 and a top-3 overlap of 47%. This outperforms the baseline prediction accuracy of 42.3%, which uses the session history of another developer to recommend the next line.

E. RQ5: Ablation Study
We investigate two key design choices for follow-up attention: (1) the selection of layers to use, and (2) the number of generated tokens, i.e., observers. In the top part of Figure 9, we report the top-3 overlap for different pairs of layers among the 34 available in CodeGen, together with the configuration where all layers are considered on top. In the lower part of Figure 9, we show the top-3 overlap when restricting the usage to the next 10, 50, or 100 generated tokens, or "followers". There is significant agreement with the ground truth even for smaller numbers of layers, particularly the very first one. This suggests that little processing, and maybe even only the token embedding, might be needed to extract valuable information. Additionally, a higher number of observers has a positive impact on the agreement of the follow-up attention with the ground truth.
Answer to RQ5: The follow-up attention benefits more from using the attention signal produced by early layers, and its performance is robust to the number of generated tokens.

VIII. THREATS TO VALIDITY
There are potential threats to validity that may limit the generalizability of our findings. First, our sample of developers may not represent all such developers, especially since we recruited from one large technology company. However, we did screen participants and require that they have professional programming experience. Second, the tasks are sensemaking tasks, as opposed to ecologically valid debugging or feature enhancement tasks, involving code that participants are not familiar with, and the participants were restricted from running the code, using a debugger, and performing web searches. Such tasks are commonly used in technical interviews, and the participants did not indicate that the tasks were atypical. Third, reactivity effects may occur since participants knew they were being observed and may have believed that their technical ability was being assessed. To minimize this threat, we advised participants that their individual performance was not being reported or analyzed, and the observations were performed remotely, with the researcher not being present in the same physical room as the participant. Fourth, it is possible that a developer's gaze does not always represent their attention, though such eye-tracking data has been well studied for decades in numerous domains. Fifth, regarding possible "cheating" by the models due to memorization [50], we acknowledge that we cannot exclude that these programs have been seen by the models during training. However, we note that we evaluate the models on their performance on the sensemaking task, for which the combination of snippet and question is novel to the models since it was created for this study. Regarding the attention distribution, whether models exhibit different attention patterns on code snippets that have been seen during training, as opposed to those that are novel to them, remains an open question for future work. Finally, the answers generated by the neural models depend on the prompts we provide, and thus results may vary with more elaborate prompt design, which is an active research area in prompt engineering [51,52,53,54,55,17].
Generalizability. In the design of the study, we aimed to make the evaluation as generalizable as reasonably possible and in line with the current state of the art in the field. On the human side, comparing our sample size to what is found in other eye-tracking studies [56], we note that both the number of participants (25 participants vs. µ = 19.6, σ = 13.5) and the number of programs (15 programs vs. µ = 7.6 programs, σ = 17.2) are in line with other studies. In contrast, most of the current literature focuses on Java (38%) [56], while we have a more diverse set of programming languages (C#, C++, and Python), possibly making our results more generalizable. On the model side, we picked a diverse set of widely popular models in terms of downloads on HuggingFace: CodeGen (81K+), GPT-J (2.5M+), and InCoder (58K+). Regarding the applicability of the follow-up attention to other models, especially closed-source ones, although current model inference APIs do not yet expose attention information, we note that the vast majority of closed-source models are transformer-based, making the extraction of attention possible in principle.

IX. DISCUSSION AND IMPLICATIONS
Generally, each part of a codebase holds myriad disparate connections to other parts of the codebase in such forms as documentation, calls and tests, pattern and format parallelism, examples, data, and control flow. However, the positive moderate Spearman rank correlation among developers (+0.56) shows that developers tend to navigate a single file along similar paths when trying to make sense of it with the same goal, i.e., answering the same question. To some extent, this points to a common notion representing a relationship of general "relevance" of one location to the other, at least as far as we consider a single file as done in the current study; more work is needed to generalize to larger codebases and across files. At the same time, we also find that this general relevance relation in human understanding is reflected in the neural processing of large transformer models of code, which also show a remarkably promising correlation with the human exploration paths (+0.49). Surprisingly, the follow-up attention agrees with the ground truth on what line to look at next even more than the developers agree with each other: 47% vs. 42.3% (Figure 8). Note that the two runners-up are still attention-based and they also outperform or match developers' agreement: 44.6% for Max att. (Sym) and 42.3% for Raw att. (1st) (Sym). This shows that attention-based approaches, and follow-up attention above all, are promising for recommending the next code location to look at. Thus, this motivates further work on the analysis and use of the neural attention layers as a promising way to support developers in their code exploration tasks.
IDEs with 360° Vision. Developers spend a large part of their time understanding existing code [57,58], and a central role of advanced code editing environments is to facilitate navigation to relevant places, whether the developers' involvement is active (e.g., search), passive (e.g., highlighting of identical tokens), or semi-active (e.g., jump-to-definition). Such tools typically rely on a proxy for current developer focus, such as the mouse pointer position or the user's cursor in a code editor. The challenge is to find the locations relevant to that focus location, since the possible reasons for relevance are heterogeneous and syntactical methods can only surface a limited number of them. Nevertheless, such tools have been an active research area [59], with results such as Strathcona [60], Suade [61], Team Tracks [62], Navtracks [63], Mylar [64], PFIS [65], Prodet [66], and Hipikat [67]. In fact, Singh et al. evaluated various operationalizations of human attention (i.e., cursor location, which code is visible on screen, and a qualitative human judgment) and their impact on predictive accuracy [68], though they did not include eye-tracking data. The high success rate of follow-up attention in recommending at least one relevant line among the top-3 (47%) shows the effectiveness of the attention of neural models in providing one such possible proxy for relevant code locations connected to the current statement. Further research is needed to explore how to best incorporate such a proxy into existing tools, and how to best use it to support developers in their code exploration tasks, e.g., by highlighting neurally attended lines, offering links to them, or listing them in a side panel.
LLMs and Human Collaboration. Although the sensemaking task spans a diverse set of topics and has mostly open-ended questions, CodeGen, the largest LLM studied, already achieves non-trivial performance with 39.3% of correct or partially correct answers. This is a promising result for the use of LLMs in supporting developers when reasoning about code, and motivates further research on perhaps more specialized sensemaking questions directly linked to specific traditional software engineering tasks, such as "Is there a bug in this code?" for bug detection [69,70] or "Is this code vulnerable to SQL injection?" for vulnerability detection [71].
Context Prioritization for LLMs. In existing tooling employing LLMs, such as GitHub Copilot [72], the model can process only a limited part of the code at a time, given by the maximum size of the prompt it can take, also called the context window [73,54]. In practice, heuristics are needed to decide which parts should be included in the prompt. Our findings show that attention-based methods exhibit a moderate positive agreement with human experts, especially on the top-3 next lines (47%), and thus could be used to prioritize the context to be included in the prompt. From our ablation study, it is encouraging to see that even with only 10 newly generated tokens, the agreement is still higher than the agreement between developers: 45.2% vs. 42.3%. These results suggest that real-life deployment of the follow-up attention as a relevance provider could benefit from two important optimizations to reduce the computational cost: (1) restricting the number of layers to consider to the first two, and (2) restricting the number of generated tokens to consider.

X. CONCLUSION
We presented and shared a novel dataset of eye-tracking data, comprising 92 visual attention sessions of 25 developers answering sensemaking questions in three popular programming languages (Python, C++, and C#). We confirmed that neural models provide promising but less accurate answers than developers to these questions while paying attention to similar parts of the code. We formalized a new code exploration task of predicting developer code traversal and confirmed the attention signal's relevance for this task by evaluating multiple processing approaches. Besides evaluating existing approaches on the sensemaking task, we contributed the concept of follow-up attention, which shows the best agreement with the developer attention data.

XI. DATA AVAILABILITY
All our code is publicly available at https://github.com/githubnext/followup-attention and the dataset is available here (see footnote 10).

Austin Z. Henley is a Senior Researcher at Microsoft where he works on the human factors of AI-powered developer tools. Previously, he was a tenure-track professor at the University of Tennessee where he led an NSF-funded lab researching developer productivity and taught software engineering courses. He received his Ph.D. in Computer Science from the University of Memphis in 2018. For more information, visit http://austinhenley.com/.

Albert Ziegler is a Principal Researcher at GitHub Next where he works on Artificial Intelligence for the Software Development Lifecycle. He was one of the three original inventors of GitHub Copilot and has since turned to LLM-guided tooling in both the IDE (Copilot NES, Copilot Radar) and the pull request workflow (AI for Pull Requests, Gentest). He holds a PhD in Mathematics from Leeds University and has previously worked on developer productivity and diverse industry ML projects.

Fig. 1: Example of sensemaking task with code and question to be answered in the bottom comment. Completely empty lines have been removed for space reasons.

Fig. 9: Effect of layer pair and number of generated tokens on top-3 overlap.

TABLE I: Code snippets and related questions for each sensemaking task.

TABLE II: Participants' Professional Software Development Experience

We conduct an eye-tracking study and collect a novel dataset comprising 25 participants across 92 valid sessions. Participants' professional software development experience is summarized in Table II. Additionally, 24% of participants have more than 4 years of experience. The dataset contains 17.4% (16) of sessions on C++ code, 43.5% (40) on C# code, and 39.1% (36) on Python code.

shows the percentage of correct,

TABLE III: Results of the Mann-Whitney U statistical tests when comparing the distributions of developer-model agreement (Spearman rank coefficients) across models.
, we report the top-3 overlap for different pairs of layers among the 34 available in CodeGen, together with the configuration