Brain and autonomic nervous system activity measurement in software engineering: A systematic literature review

In the past decade, brain and autonomic nervous system activity measurement has received increasing attention in the study of software engineering (SE). This paper presents a systematic literature review (SLR) to survey the existing NeuroSE literature. Based on a rigorous search protocol, we identified 89 papers (hereafter denoted as NeuroSE papers). We analyzed these papers to develop a comprehensive understanding of who has published NeuroSE research and classified the contributions according to their type. The 47 articles presenting completed empirical research were analyzed in detail. The SLR revealed that the number of authors publishing NeuroSE research is still relatively small. The thematic focus so far has been on code comprehension, while code inspection, programming, and bug fixing have been studied less frequently. NeuroSE publications primarily used methods related to brain activity measurement (particularly fMRI and EEG), while methods related to the measurement of autonomic nervous system activity (e.g., pupil dilation, heart rate, skin conductance) received less attention. We also present details of how the empirical research was conducted, including stimuli and independent and dependent variables, and discuss implications for future research. The body of NeuroSE literature is still small. Yet, high-quality contributions exist that constitute a valuable basis for future studies. © 2021 The Author(s). Published by Elsevier Inc. This is an open access article under the CC BY license (http://creativecommons.org/licenses/by/4.0/).


Introduction
We use the term NeuroSE to describe a research field in software engineering (SE) that makes use of neurophysiological methods and knowledge to better understand the software development process, as well as its outcome, the software system. Because humans both develop and use software systems, a better understanding of the human nervous system, which constitutes the basis of any human perception, thought, emotion, and behavior, is likely to contribute to a better understanding of the SE process and, as a consequence, should also positively affect the software system itself. Neurophysiology is a scientific field concerned with the investigation of the functioning of the nervous system, which consists of the brain and other neural tissue in the body. The NeuroSE field is relatively young and characterized by collaboration of researchers from various disciplines (e.g., computer science, cognitive neuroscience, and psychology). To the best of our knowledge, the earliest study using brain or autonomic nervous system activity measurements (also referred to as neurophysiological measurements) in an SE context is from 2006 (Aschwanden and Crosby, 2006).

✩ Editor: Alexander Serebrenik.
After that inaugural publication, further studies followed, and today a body of literature exists which we characterize as scattered. It follows that a cumulative research tradition does not exist. As we outline in more detail below, for NeuroSE research to progress, a more cumulative tradition is beneficial. A comprehensive review, along with a critical evaluation of the field, constitutes a valuable foundation for the future development of a viable research field. In this article, we present such a review. To the best of our knowledge, such a comprehensive NeuroSE review paper does not exist. The related work that we identified consisted of five related reviews (Goncales et al., 2019; Menzen et al., 2020; Obaidellah et al., 2018; Riedl et al., 2017b, 2020a; Sharafi et al., 2015b), yet none of them had the goal of covering the entire body of NeuroSE literature comprehensively, both from a thematic and from a neuroscience methodology perspective. Specifically, eye tracking research in an SE context is analyzed in the reviews by Sharafi et al. (2015b) and Obaidellah et al. (2018). However, most papers covered by these reviews did not collect neurophysiological data (i.e., eye activity that is largely controlled by the autonomic nervous system (ANS)), but rather focused on eye movements which are not directly controlled by the ANS, namely fixations and saccades. In another review, Goncales et al. (2019) put an emphasis on one specific construct only, namely cognitive load. Moreover, this review neither analyzed the use of neurophysiological methods comprehensively, nor did it focus on the SE context (rather, it included several papers that investigated cognitive load in a broader context). Review work recently published by Riedl et al. (2017b, 2020a) investigated studies using neurophysiological methods in the field of Information Systems (referred to as NeuroIS).
However, our SLR focuses on software development and not on the use and impact of software systems, the focus of IS research (Sidorova et al., 2006). Most closely related to our work is the review by Menzen et al. (2020), which investigates the usage of biometrics in an SE context (covering both neurophysiological and behavioral measures). However, in their review they did not make a distinction between studies that collected and analyzed biometrics and studies that only describe their potential use. In addition, our SLR covers 89 papers, a much larger database compared to the 40 studies in the paper by Menzen et al. (2020). Despite the fact that our review offers a unique contribution, we stress that it is intended as a complement to the presented related work, and not a substitute for it.
The goal of this SLR is to provide a comprehensive overview of existing research using brain and/or autonomic nervous system activity measurements to investigate SE phenomena. In this literature review we answer the following research questions:

• RQ1: Who published NeuroSE research and where? An answer to this question shows who the most productive NeuroSE authors are and in which outlets they have published. In particular, in an area that is relatively nascent and highly interdisciplinary, such an analysis is valuable since it can help to identify reviewers with NeuroSE experience and editors that could handle NeuroSE submissions. Moreover, it can help researchers who are new to NeuroSE to identify potential collaboration partners, as well as possible publication outlets.
• RQ2: What kind of NeuroSE research has been published?
  - RQ2.1: What is the type of contribution? An answer to this question classifies the studies as ''empirical'', ''research in progress'', ''methodological'', ''review'', and ''conceptual''. We argue that for NeuroSE research to progress, a study of contribution type is essential. It may both provide valuable insights into the maturity level of the field and outline critical avenues for the future development of NeuroSE. Specifically, a high rate of completed empirical studies, along with the availability of a relatively large number of review papers and methodological contributions, indicates a relatively high maturity level (Vessey et al., 2002).
  - RQ2.2: Which major thematic orientation did NeuroSE researchers choose? An answer to this question reveals which software development activity (e.g., code comprehension, code inspection, programming) the existing body of literature focuses on.
• RQ3: Which neurophysiological methods and measures were applied in NeuroSE publications? An answer to this question reveals the neurophysiological methods used (e.g., fMRI, EEG, eye tracking), describes the measures that were applied, and explains their usage in answering different types of research questions. Such an analysis can therefore guide future research.
• RQ4: How were the empirical NeuroSE studies designed? An answer to this question reveals various methodological aspects, including study population, stimuli, experimental design, dependent and independent constructs, as well as data analysis.
• RQ5: What are the main findings of existing NeuroSE research? An answer to this question provides a condensed overview of the main insights that have been obtained thus far based on the application of neurophysiological methods.
The major contributions of this paper are (1) a systematic mapping of the existing NeuroSE literature and (2) a discussion of the implications of the results and an outline of directions for future research. It is hoped that this review instigates a more cumulative research tradition in the future. The remainder of this paper is organized as follows: Section 2 outlines fundamentals of human neurophysiology. The material in this section is presented at a high level of abstraction, as it is intended to serve as a brief introduction to the field for interested software practitioners, mainstream computer scientists, and SE researchers. Section 3 describes the research methodology of this review. Results are presented in Section 4. We discuss implications and outline limitations in Section 5. Finally, we present concluding remarks in Section 6.

Background
The nervous system constitutes the basis for human perceptions, thoughts, feelings, and behavior; it consists of different parts. At a high level of abstraction, we can distinguish the central nervous system (CNS) and the peripheral nervous system (PNS). The CNS comprises the neural tissue of the brain and the spinal cord; the PNS comprises all neural tissue outside the CNS. The PNS can be further subdivided into the somatic nervous system (SNS) and the autonomic nervous system (ANS). The SNS consists of cranial and spinal nerves to and from the sensory organs, muscles, joints, and skin. The main functions of the SNS are the production of movements and the transmission of sensory information (e.g., temperature, touch). The ANS, by contrast, consists of the sympathetic division (which activates the body), the parasympathetic division (which relaxes the body), and the enteric nervous system (which governs the function of the gastrointestinal tract). Based on this overview of the human nervous system, it becomes clear that the brain (i.e., the information processing unit), as well as the sympathetic and parasympathetic divisions of the ANS (which keep the body in balance, referred to as homeostasis), are the major units of analysis in NeuroSE research (Mack et al., 2013; Riedl and Léger, 2016).
Different neurophysiological methods exist to capture neural activity. Functional magnetic resonance imaging (fMRI), electroencephalography (EEG), and functional near-infrared spectroscopy (fNIRS) are important methods to study brain activity. Measurement of heart rate and heart rate variability, electrodermal activity, as well as eye-related measures such as pupil dilation and eye blinks are major methods to study ANS activity. We briefly summarize the main characteristics of these methods in the remainder of this section.
fMRI: A magnetic resonance imaging (MRI) scanner measures blood oxygenation in the brain and exploits the different magnetic properties of oxygenated and deoxygenated blood. Details of this mechanism can be found in the literature on blood-oxygen-level dependent (BOLD) contrast (Kwong et al., 1992). Evidence indicates that the BOLD signal is a good proxy for neuronal activity (Logothetis, 2008; Logothetis et al., 2001), and hence fMRI can be used to investigate the neural correlates of cognitive processes. Note that neurons do not have internal reserves of energy in the form of oxygen and sugar. It follows that their firing causes a need for more energy to be provided quickly. Based on a process called the hemodynamic response, blood releases oxygen to active neurons at a greater rate than to inactive neurons. The consequence of this process is a change in the relative levels of oxygenated and deoxygenated hemoglobin that can be detected on the basis of their differential magnetic susceptibility. The BOLD contrast, importantly, is sensitive to the presence of deoxygenated hemoglobin (Toga and Mazziotta, 2002). An MRI scanner is a cylindrical tube equipped with an electromagnet that generates a magnetic field (measured in Tesla) about 50,000 times stronger than the field produced by the earth. Based on fMRI, it is possible to identify activity in a specific brain area within the millimeter range. Thus, spatial resolution is very high. Changes in brain activation that result from stimulus perception can be identified within a few seconds (Riedl and Léger, 2016). Therefore, the temporal resolution of fMRI is in the order of seconds, as is the hemodynamic response lag (for details, please see Table 1). Methodologically, the experimental design of an fMRI study typically involves one of two designs: block or event-related (Riedl and Léger, 2016).
In a block design, stimuli pertaining to the same experimental condition are grouped and presented in blocks of time that are separated by resting periods. In an event-related design, stimuli of different experimental conditions are shown in random order.
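The two designs can be illustrated with a small scheduling sketch. This is a hypothetical illustration; the function names, condition labels, and trial counts are ours, not taken from any reviewed study:

```python
import random

def block_design(conditions, trials_per_block, blocks_per_condition):
    """Group same-condition trials into blocks separated by rest periods."""
    schedule = []
    for _ in range(blocks_per_condition):
        for cond in conditions:
            schedule.extend([cond] * trials_per_block)
            schedule.append("rest")  # resting period separating blocks
    return schedule

def event_related_design(conditions, trials_per_condition, seed=42):
    """Present trials of all conditions in a single randomized sequence."""
    schedule = [c for c in conditions for _ in range(trials_per_condition)]
    random.Random(seed).shuffle(schedule)
    return schedule
```

For example, `block_design(["code", "prose"], 3, 2)` yields runs of three same-condition trials each followed by a rest period, whereas `event_related_design` interleaves the two conditions in random order.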
EEG: Electroencephalograms (EEG) are recordings of the electrical activity of neurons in the brain. Using electrodes placed on the scalp, EEG measures the summation of synchronous postsynaptic action potentials produced by a population of neurons with a very high temporal precision (milliseconds) (Bronzino, 1995). The EEG system is composed of electrodes (which are usually placed on the scalp with a net or cap), amplifiers, an analog-to-digital converter, and a recording device (typically a computer) (Riedl and Léger, 2016). EEG systems amplify and record small voltage fluctuations measured between pairs of electrodes, usually an electrode and a reference electrode. Amplitudes of tens of microvolts are typical for EEG studies. The most common analysis of EEG activity is in terms of frequency (Müller-Putz et al., 2015; Riedl and Léger, 2016). Identification of neurocognitive processes specific to a particular event is a challenge, given that EEG measures refer to the summation of the electrical influx of a large number of neurons. To tackle this challenge, the Event-Related Potential (ERP) technique was developed. An ERP, or evoked potential, is a patterned fluctuation of voltage recorded by the EEG that represents a cognitive process specific to a discrete event. If the background EEG activity is not filtered, it is hardly possible to identify an ERP signal, because it has low amplitude in comparison to the general EEG signal and other factors such as cardiac activity or muscle contractions that are referred to as ''noise'' in the EEG literature. Hence, many trials are needed to average responses and to filter the signal (Müller-Putz et al., 2015; Riedl and Léger, 2016). The temporal resolution of EEG is excellent, that is, changes of EEG patterns that result from changes in stimulus perception can be observed within the milliseconds range.
For example, the first substantial peaks in the waveform that often occur about 100 ms after stimulus onset are called the P100 and N100 (attributes: positive or negative, 100 ms latency) or the P1 and N1 (indicating the first positive or negative peak) (Müller-Putz et al., 2015, p. 46). Spatial resolution, however, is highly limited. The so-called inverse problem (Helmholtz, 1853), which indicates that an infinite number of source configurations can generate identical surface potentials as measured by EEG, does not allow for an unambiguous identification of the neural generators (i.e., the location of the neural activity within the brain). This explains why localization of brain activity through EEG requires appropriate a priori assumptions about sources and parameters of volume conduction (Michel et al., 2004). It is critical to note that consumer-grade instruments (e.g., the Emotiv Epoc+ 14-channel wireless EEG headset) are increasingly used for research purposes. As we outline in Section 5, it is a matter of ongoing discussion whether or not, and if so in which situations, consumer-grade instruments offer reasonable reliability and validity.
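The trial-averaging logic behind ERP extraction can be sketched numerically. The following is a synthetic illustration with an invented ''P100-like'' component; the amplitudes, noise level, and trial count are assumptions for demonstration, not values from the reviewed studies:

```python
import numpy as np

rng = np.random.default_rng(0)
n_trials, n_samples = 1000, 300        # 300 samples at 1 kHz = 300 ms epochs
t = np.arange(n_samples) / 1000.0      # time in seconds after stimulus onset

# Synthetic ERP: a small positive deflection peaking ~100 ms after onset,
# only a few microvolts in amplitude.
true_erp = 5e-6 * np.exp(-((t - 0.100) ** 2) / (2 * 0.015 ** 2))

# Each single trial buries the ERP in much larger background "noise"
# (ongoing EEG, cardiac activity, muscle contractions).
trials = true_erp + rng.normal(0.0, 20e-6, size=(n_trials, n_samples))

# Averaging time-locked trials attenuates zero-mean noise by ~1/sqrt(n),
# letting the event-related component emerge.
erp_estimate = trials.mean(axis=0)
peak_ms = 1000 * t[np.argmax(erp_estimate)]
```

With these settings, the averaged waveform recovers the peak latency near 100 ms even though no single trial shows it clearly, which is exactly why many trials are needed.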
fNIRS: Functional Near-Infrared Spectroscopy (fNIRS) is a brain imaging technique that (like fMRI) uses hemodynamic responses to indirectly capture neuronal activity. However, compared to fMRI, fNIRS is less expensive and more portable, offering higher ecological validity (Riedl and Léger, 2016). Moreover, unlike fMRI, fNIRS is an optical imaging technique that uses near-infrared spectroscopy to detect cerebral blood flow and hemoglobin oxygenation level changes (Bunce et al., 2006;Villringer and Chance, 1997). The proper functioning of neurons relies on the oxygen and glucose supply provided by cerebral blood flow. Brain activity reduces local oxygen and glucose concentrations. Hence, the neurovascular coupling mechanism will increase blood flow in this region, supplying it with the appropriate concentration of the constituents needed to metabolize energy (Riedl and Léger, 2016). This phenomenon is the type of event that fNIRS captures at different points in time in order to assess changes that are a function of different experimental conditions (Bunce et al., 2006;Villringer and Chance, 1997). The common apparatus is composed of light sources applied to the scalp and light detectors sensitive to the light that is reflected by the different components of the cerebral cortex (Riedl and Léger, 2016). Given that a higher concentration of oxygen is needed in brain areas activated by an experimental task, the functional map provided by fNIRS informs the researcher on the different functionalities of brain areas (Bunce et al., 2006;Gefen et al., 2014;Villringer and Chance, 1997). Due to the physics of light propagation and corresponding propagation loss, application of fNIRS to study brain mechanisms has spatial resolution limitations. It follows, then, that light intensity is attenuated in tissue and therefore penetration depth is limited to the first 2-3 cm of the cortex (Bunce et al., 2006;Villringer and Chance, 1997). 
Thus, compared to fMRI, fNIRS offers inferior spatial resolution and limited penetration depth. However, when it comes to temporal resolution, fNIRS is substantially faster than fMRI (sampling rates are typically 1 Hz-10 Hz; note that Hz is equivalent to cycles per second). Yet, the hemodynamic response lag is a few seconds. Table 1 provides a comparison of the above-described imaging methods along different dimensions.2

2 Table 1 is based on https://medizinio.de/en/medical-equipment/mri (a), https://imotions.com/blog/eeg-headset-prices/ (b), https://plux.info/kits/438fnirs-pioneer-820201240.html (c), Bunce et al. (2006) (d), Lystad and Pollard (2009) (e), Quaresima and Ferrari (2016) (f), and Müller-Putz et al. (2015) (g). Note (h): Spatial resolution benefits from ultra-high field strength, today usually 7T scanners. Most studies in the extant cognitive neuroscience literature published in the past decade use 3T scanners. Thus, spatial resolution can even be lower than 2 mm if 7T scanners are used (for a recent example, see Rutland et al. (2019)). All fMRI studies (completed research) in this current NeuroSE review used a 3T scanner.

Heart-related measurements: Heart rate (HR) is typically measured by an electrocardiogram (ECG). Application of ECG includes placing a cathode electrode beneath the right clavicle, a ground electrode under the left clavicle, and an anode electrode on the left side of the abdomen (Riedl and Léger, 2016). Note that HR can also be measured with photoplethysmography (PPG) sensors (placed at the wrist or finger; Rajala et al. (2018)), which use a light-based technology to sense the rate of blood flow as controlled by the heart's pumping action (Elgendi, 2012). Thus, PPG is an optical approach for measuring the blood volume pulse (BVP). ECG sensors, in contrast, detect the electrical activity produced by a heartbeat. A heartbeat consists of a P wave, which reflects atrial depolarization, a QRS complex, which represents ventricular depolarization and contraction of the large ventricular muscles, and a T wave, which reflects the rapid repolarization of the ventricles (Riedl and Léger, 2016). Compared to PPG, ECG is a more direct measurement of heart activity, as it directly captures the electrophysiological signals that result from heartbeats rather than downstream effects such as blood properties at distant locations such as the wrist or finger. It follows that while ECG measurement is accurate at the level of milliseconds, PPG-based measurement is not, due to a delay influenced, among other factors, by the pulse wave velocity and the vascular path from the heart to the location of the PPG sensor (Lekkala and Kuntamalla, 2017); consistent with this fact, high-frequency components of the signal are attenuated due to the long distance the blood has to travel through the body before being measured at distant body locations (e.g., Buxi et al. (2015)). Recent research shows that today it is even possible to use non-contact imaging of peripheral hemodynamics (such as the blood volume pulse) to study cognitive and emotional constructs such as stress (McDuff et al., 2020). Respiration rate (i.e., breathing frequency) is closely related to heart rate and can also be very informative for SE researchers, primarily because increased mental activity implies greater consumption of oxygen, which in turn affects respiration and heart rate. Increases in mental activity typically imply increases in respiration and heart rate (Riedl and Léger, 2016). Heart rate variability (HRV) is predominantly a function of ANS activity. The sympathetic part, among others, increases the heart's contraction rate and force (cardiac output) but decreases HRV. Conversely, the parasympathetic part reduces the heart rate but increases HRV.
This interplay between the sympathetic and parasympathetic parts of the ANS constitutes the physiological basis for the heart's instantaneous response to different situations and needs (Task Force of the European Society of Cardiology and the North American Society of Pacing and Electrophysiology, 1996). It follows that a low HRV is undesirable, while a high HRV is desirable. Major HRV indicators, also referred to as features of the signal, are:
• SDNN: The standard deviation of the NN intervals in the signal (note that an NN interval refers to the time between two consecutive heartbeats, measured in milliseconds).
• SDANN: The standard deviation of the averages (taken over specific time segments) of the NN intervals in the signal.
• RMSSD: The root mean square of successive differences: square the differences between subsequent NN intervals, then take the square root of their arithmetic mean.
• LF-HRV: The power of the signal in the low-frequency spectrum (0.04-0.15 Hz).
• HF-HRV: The power of the signal in the high-frequency spectrum (0.15-0.5 Hz) (Task Force of the European Society of Cardiology and the North American Society of Pacing and Electrophysiology, 1996).
The first three HRV features are from the time domain, measured in milliseconds, while the remaining two are from the frequency domain, measured in squared milliseconds.
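The two most widely used time-domain features can be computed directly from a series of NN intervals, as in this plain-Python sketch (function and variable names are ours):

```python
import math

def hrv_time_domain(nn_ms):
    """Compute SDNN and RMSSD from a list of NN intervals in milliseconds."""
    n = len(nn_ms)
    mean_nn = sum(nn_ms) / n
    # SDNN: standard deviation of all NN intervals (population formula).
    sdnn = math.sqrt(sum((x - mean_nn) ** 2 for x in nn_ms) / n)
    # RMSSD: square the successive differences, average them,
    # then take the square root of that mean.
    diffs = [b - a for a, b in zip(nn_ms, nn_ms[1:])]
    rmssd = math.sqrt(sum(d ** 2 for d in diffs) / len(diffs))
    return sdnn, rmssd
```

For instance, for the intervals [800, 810, 790, 805, 795] ms, SDNN reflects the spread of all intervals around their 800 ms mean, while RMSSD reflects only beat-to-beat changes.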
Skin-related measurements: Electrodermal activity (EDA) is a property of the human body that causes continuous variation in the skin's electrical characteristics. A galvanometer is used to assess the degree to which the skin permits transmission of an applied current, and the conductance is influenced by the galvanic state of the skin at different moments in time (Naqvi and Bechara, 2006). EDA reflects two types of activities: tonic and phasic (Boucsein et al., 2012; Naqvi and Bechara, 2006). Tonic activity is typically expressed in units of skin conductance level (SCL) and refers to a smooth and slowly changing level on a time scale of tens of seconds to minutes. Phasic activity is usually expressed as electrodermal response (EDR) or skin conductance response (SCR) and refers to short-lasting changes in EDA that appear as a response to a particular stimulus. Evidence indicates that the frequency of non-specific (i.e., non-stimulus related) changes in SCL is typically 1-3 per minute during rest and over 20 per minute in high-arousal situations (Boucsein et al., 2012; Braithwaite et al., 2015; Dawson et al., 2016). Several algorithms have been proposed for the decomposition of the EDA signal into a tonic and a phasic component (for an overview, see Posada-Quintero and Chon (2020)). EDA is related to cognitive, emotional, and attentional states, and it is simple to use, completely non-invasive, and provides data that is easily attributable to a single stimulus if EDR or SCR is used (Riedl and Léger, 2016). Physiologically, a galvanic skin response is a result of changes in the sympathetic part of the ANS. Such changes often occur when an individual is facing specific events and situations, including novelty, anticipation of an outcome, decision making, loud noises, fear, or surprise (Riedl and Léger, 2016), and they occur between 1-3 s after stimulus onset (Dawson et al., 2016).
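As a naive illustration of such a decomposition (deliberately simpler than the published algorithms cited above), a moving average can serve as a crude tonic (SCL) estimate, with the residual approximating the phasic (SCR) component:

```python
import numpy as np

def decompose_eda(signal, fs, window_s=10.0):
    """Crude tonic/phasic split of an EDA signal sampled at fs Hz:
    a centered moving average over window_s seconds estimates the
    slowly changing tonic level; the residual is the phasic part.
    Note: near the edges of the recording, the 'same'-mode convolution
    damps the tonic estimate toward zero."""
    win = max(1, int(window_s * fs))
    tonic = np.convolve(signal, np.ones(win) / win, mode="same")
    phasic = signal - tonic
    return tonic, phasic
```

Real EDA pipelines use more principled decompositions (see the Posada-Quintero and Chon overview cited above); this sketch only conveys the time-scale separation between tonic and phasic activity.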
Generally, because changes in arousal often do not reach people's awareness level, the use of EDA to capture arousal responses is a popular method (Riedl and Léger, 2016).

Eye-related measures: Oculometry concerns the biometric measurement of the condition and movements of the eye (Riedl and Léger, 2016). Many micro-movements of the eye and pupil-size modifications occur without conscious awareness, also because they are a function of ANS activity rather than deliberate thoughts in the brain. A tool used to measure the conditions and movements of the eyes is eye tracking. Researchers applying this tool know, at any moment in time, the target of the gaze, which according to the eye-mind hypothesis is correlated with visual attention; hence, it is assumed that we also know what information is being processed by the participant (e.g., on a computer screen) (Just and Carpenter, 1980). Knowing what the eyes are focusing on provides information about what the brain is processing (Riedl and Léger, 2016). To measure the point-of-regard, most systems apply the pupil-corneal reflection method, which uses an infrared camera to locate the features of the eye (Holmqvist and Andersson, 2017). One of the dominant aspects reported in eye tracking analysis is the fixation. When the gaze temporarily stops on a specific stimulus (from 200 ms up to several seconds), it is possible to capture the time spent in a specific position (e.g., on a computer screen). Another measure is the saccade. Saccades are the fastest movements made by the human eye, lasting from 30 to 80 ms, so that conscious and deliberate information encoding is hardly possible during the movements (Holmqvist and Andersson, 2017). Yet, saccades provide information on search behavior, along with rudimentary color detection and lightweight shape detection. Duchowski (2017) comprehensively summarizes the physiological, methodological, and technological characteristics of eye tracking.
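The distinction between fixations and saccades is often operationalized with a dispersion-threshold algorithm. The following is a simplified I-DT-style sketch; the threshold values and names are illustrative assumptions, not parameters from the reviewed studies:

```python
def detect_fixations(gaze, fs, max_dispersion=35.0, min_duration_ms=200.0):
    """Simplified dispersion-threshold (I-DT) fixation detection.

    gaze: sequence of (x, y) samples recorded at fs Hz.
    Returns (start, end) index pairs (end exclusive) for windows whose
    spatial dispersion stays below max_dispersion for at least the
    minimum fixation duration (~200 ms, matching the text above).
    """
    def dispersion(pts):
        xs = [p[0] for p in pts]
        ys = [p[1] for p in pts]
        return (max(xs) - min(xs)) + (max(ys) - min(ys))

    min_len = int(min_duration_ms / 1000.0 * fs)
    fixations, i, n = [], 0, len(gaze)
    while i + min_len <= n:
        j = i + min_len
        if dispersion(gaze[i:j]) <= max_dispersion:
            # Grow the window while the points stay tightly clustered.
            while j < n and dispersion(gaze[i:j + 1]) <= max_dispersion:
                j += 1
            fixations.append((i, j))
            i = j
        else:
            i += 1  # slide past the (presumed) saccade sample
    return fixations
```

Applied to a 60 Hz recording with two stable gaze positions, the sketch returns two fixations separated by the jump between them.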
Moreover, in stressful situations, the sympathetic division of the ANS becomes active and stimulates a number of responses, including pupil dilation (i.e., increased visual attention). When parasympathetic activation occurs, pupils constrict (Riedl and Léger, 2016). In this context, it is critical to mention that pupil dilation, along with more complex related indexes (Duchowski et al., 2020), can be used to measure cognitive load. For a recent review, see Wel and Steenbergen (2018). Finally, eye blinks are also important measures in eye tracking research (Kanoga et al., 2016; Walla et al., 2015). Three blinking types exist: intentional (e.g., if one decides to close the eyes to avoid external stimulation), spontaneous (e.g., corneal lubrication), and reflexive (e.g., Nakano et al. (2012) and Sforza et al. (2008)). Specifically, the startle eye-blink is a reflexive response that typically occurs when an individual encounters a sudden and unexpected stimulus (e.g., a loud noise or an increase in light). The startle reflex is influenced by brain and ANS activity and can be used to infer affective processing in humans (Lang et al., 1990; Walla et al., 2015).
It has to be noted that we focused on fMRI, EEG, fNIRS, heart rate measurement, electrodermal activity, and eye-related measures because, as our review of the SE literature revealed, these methods were applied frequently in the reviewed studies. In neuroscience and related disciplines, however, a number of further methods are described (Senior et al., 2009). For a description of how each of these tools functions, and of the common data processing steps required to enable someone who is not familiar with a specific technology to make use of it, we refer the reader to Newman (2019), Harmon-Jones and Beer (2009), and Senior et al. (2009).

Methodology of the literature review
In order to identify NeuroSE publications, we conducted a literature search and considered peer-reviewed journal and conference publications. The review process was based on existing recommendations for conducting literature reviews (Kitchenham and Charters, 2007; vom Brocke et al., 2009; Webster and Watson, 2002). Therefore, we first identified keywords based on landmark publications and then selected our outlets for the literature search phase. Based on the initially selected papers, we conducted an initial review, followed by backward snowballing, another preliminary review, and forward snowballing, and finally merged papers with a high degree of overlap referring to the same empirical study (cf. Fig. 1).

Search strategy
Keywords for the literature search were mainly derived from landmark publications that offer an introduction to the field of NeuroIS (Dimoka et al., 2012; Riedl et al., 2010a; Riedl and Léger, 2016). NeuroIS (Neuro-Information-Systems) ''relies on neuroscience and neurophysiological knowledge and tools to better understand the development, use, and impact of information and communication technologies. NeuroIS seeks to contribute to the development of new theories that enable possible accurate predictions of IS-related behaviors, and the design of information systems that positively affect economic and non-economic variables (e.g., productivity, satisfaction, adoption, well-being)'' (Riedl et al., 2010a; for further details see www.NeuroIS.org). Considering the NeuroSE definition in the Introduction, we argue that NeuroIS is the thematically closest research field with the prefix ''neuro''. Hence, we based our initial search on landmark publications from this field. Similar to Riedl et al. (2017a, 2020a), we used terms that are representative of the data collection methods highlighted in these landmark publications, such as ''eye tracking'', or ''cardiovascular'' as representative of cardiovascular measurements. In addition, we combined each ''neuro'' term with SE terms (e.g., source code or software design) derived from SE handbooks (Sommerville, 2010; Topi and Tucker, 2014). A list of the ''neuro'' terms and SE terms used is provided in Table 2. After the filtering strategy was applied and an article was selected for inclusion (cf. Section 3.2), we used backward snowballing followed by another review and forward snowballing (cf. Section 3.3).
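The pairing of terms can be pictured as a Cartesian product of the two term lists. The lists below are illustrative subsets only (the full lists appear in Table 2; ''eye tracking'' and ''cardiovascular'' are the examples named above, the others are assumed):

```python
from itertools import product

# Illustrative subsets of the search terms; full lists are in Table 2.
neuro_terms = ["eye tracking", "EEG", "fMRI", "cardiovascular"]
se_terms = ["source code", "software design", "software engineering"]

# Each "neuro" term is paired with each SE term to form one search query.
queries = [f'"{n}" AND "{s}"' for n, s in product(neuro_terms, se_terms)]
```

With m neuro terms and n SE terms, this produces m x n query strings that can be submitted to the searched databases.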

Filtering strategy
We first removed totally unrelated papers based on title and abstract, which left us with 199 papers (i.e., 157 articles from the database search and 42 articles from the search in the conference proceedings). We then also removed duplicates, which left us with 138 unique papers. These 138 papers were then analyzed in-depth based on the full text, and we applied the following inclusion and exclusion criteria:
• Inclusion Criterion IC: The article focuses on the application of neurophysiological methods and/or knowledge to investigate the software development process, and/or its outcome, the software system.
• Exclusion Criterion EC1: Main focus on the development of new measurement techniques or the improvement of existing measurement techniques based on neurophysiological data with little or no emphasis on the SE process. For example, papers which proposed new brain-computer interface technology (e.g., Huang and Tognoli (2014)) or self-adaptive systems (e.g., Huang and Miranda (2015)) were excluded.
• Exclusion Criterion EC2: Papers focusing on the creation of SE artifacts which integrate neurophysiological data (i.e., neuroadaptive systems), and not on SE-related phenomena (i.e., the investigation of the development process of the SE artifact itself), were excluded (e.g., Riseberg et al. (1998), Scheirer et al. (2002)).
• Exclusion Criterion EC3: Papers were excluded if they focused on the measurement of user experience not directly related to the SE artifact (e.g., Jimenez-Molina et al. (2018)), or if the neurophysiological data stems from the end user of the software rather than the developer (e.g., Lin and Imamiya (2006), Phukan (2009)).
• Exclusion Criterion EC4: Similar to Riedl et al. (2020a), articles applying eye tracking measurements that are not predominantly reflexive (e.g., gaze and saccade measurement) were excluded (e.g., Goldberg (2012, 2014)). It follows that we only included eye tracking studies in this review if their focus was the investigation of pupil dilation or eye blinks/startle reflex.
After applying these inclusion and exclusion criteria to the initially identified 138 papers, 49 remained for further analysis. 14 out of 138 papers did not meet the inclusion criterion and were thus not considered. 6 papers were excluded based on exclusion criterion EC1, 0 based on EC2, 6 based on EC3, and 69 based on criterion EC4.

Backward snowballing and forward snowballing
This selection was then used for backward snowballing (i.e., searching the references), which resulted in an additional 37 publications being identified. After the application of the inclusion and exclusion criteria, 24 papers were removed (10 due to IC, 3 based on EC1, 3 based on EC2, 6 based on EC3, and 7 based on EC4), resulting in 13 remaining papers. Based on the previously identified 62 publications, forward snowballing (i.e., tracking the citations) using Google Scholar was conducted from 02/05/2018 to 02/08/2018, on 10/26/2018, and from 07/06/2020 to 07/07/2020. This resulted in 1336 hits, out of which 52 publications were selected for further investigation based on title and abstract (resulting in a total of 114 papers). After the application of the inclusion and exclusion criteria, 23 papers were removed (7 due to IC, 1 based on EC1, 1 based on EC2, 19 based on EC3, and 29 based on EC4). Therefore, 91 publications constitute the final sample for our literature review. A list of the selected NeuroSE publications is provided in Appendix A.

Merging of papers
Finally, we merged overlapping papers referring to the same empirical study. In particular, we merged Fakhoury et al. (2018, 2020) as well as Doukakis (2019) and Doukakis et al. (2020), retaining the more comprehensive paper version for further analysis. As a result of this merging, we obtained 89 publications that were included in our SLR.

Data extraction strategy
To answer the posed research questions (RQs), we extracted the following data.
0. General Info: General information about the paper, i.e., title, authors, outlet, type of outlet (e.g., conference, journal), and publication year. This data was used to answer RQ1.
1. Contribution: To classify the contribution of each publication, in line with Riedl et al. (2020a) we used one of five categories. ''Empirical'' papers focus on testing the relationship between at least two variables and feature information on their study design including data collection and analysis procedures as well as the results of their investigation. ''Research in progress'' papers are also empirical in nature, but do not offer all components of an ''empirical'' paper, for example, only reporting on preliminary results, or presenting their study design without having completely analyzed or even collected data. ''Methodological'' papers present information on new or existing methodological approaches for NeuroSE research, such as the introduction of eye tracking measures to investigate visual effort (including pupil dilation and blink rates) in the context of different software engineering tasks (e.g., Sharafi et al. (2016)) or the assessment of different emotion recognition methods in terms of their suitability for monitoring the emotional states of developers (e.g., Wróbel and Wrobel (2018)), in some cases supplemented by an exemplary empirical study (e.g., Peitek et al. (2018d)). ''Conceptual'' papers present a discussion on potential constructs for NeuroSE research, related research models (e.g., Brown et al. (2018)), or the design of a new SE artifact. Finally, ''Review'' articles focus on the analysis of previous research, based on a review of the literature (e.g., Sharafi et al. (2015b)). RQ2.1 was answered based on this data.
Moreover, for each completed empirical study we further extracted the following data:
2. Software Development Activities: We classified papers based on the following software development activities: code comprehension, code inspection, programming, change task, bug fixing, documenting code, and general. We classified a paper as ''code comprehension'' when it was about reading and understanding source code snippets (e.g., Siegmund et al. (2014)). The label ''code inspection'' was used for papers that required the detection of errors in the source code going beyond purely syntactical errors (e.g., Castelhano et al. (2018)) or required participants to decide whether or not changes should be approved (e.g., Floyd et al. (2017)). We classified a paper as ''programming'' when it was about writing source code (e.g., Yamamoto et al. (2016)). Papers that required participants to change existing source code (e.g., to add additional functionality) were labeled as ''change tasks'' (e.g., Müller and Fritz (2015)). In turn, ''bug fixing'' was used to refer to papers that required the detection and correction of bugs (e.g., González et al. (2015)). Papers that were concerned with documenting existing source code were labeled as ''documenting code'' (e.g., González et al. (2015)). Finally, the category ''general'' refers to papers that look at software development as a whole rather than a specific activity (e.g., Müller and Fritz (2016)). Please note that several of the fMRI studies used a code inspection task (more specifically, a syntax task) as contrast, while the main focus of interest was code comprehension. In these cases, we did not label the paper as ''code inspection'' (e.g., Siegmund et al. (2014)). RQ2.2 was answered based on this data.
3. Methods and Measures
3a. Methods: In this section, we indicate which types of neurophysiological data were collected in completed empirical studies. We consider fMRI, EEG, and fNIRS (the three methods for brain activity measurement), as well as HR, Skin, Eye Tracking, and Other. ''Other'' is a category used for types that have been used infrequently (fewer than 3 times), such as breath-related measures.
Please note that, in accordance with Exclusion Criterion 4, we excluded studies tracking eye movements that are not directly controlled by the ANS. It follows that we considered neither studies that only collected and analyzed fixation-based measures nor studies based on saccadic measures. Since the focus of this literature review is on neurophysiological data, we only marked a study as ''eye tracking'' when the study mentioned pupil data or eye blinks (as they are strongly related to ANS activity). Additionally, note that several studies collected EMG (electromyographic) or EOG (electrooculographic) data for artifact removal from the EEG signal. Since this data was not the thematic focus of these papers, we did not consider it here.
3b. Collected measures: We highlight for each method which types of measures are used (e.g., types of skin-related measures, such as skin conductance level or skin conductance response) and how they are measured (e.g., if HR-related data is collected whether the data has been collected using ECG or PPG).
3c. Measurement instruments: Indicates which measurement instruments were used for each of the methods applied (e.g., the device that was used for data collection such as a Tobi TX300 eye tracker or an Emotiv EPOC EEG device).
RQ3 was answered based on this data.
4. Study Participants: provides details concerning the study population.
4a. Sample size: Indicates how many individuals participated in the study and also (if applicable) highlights in which part of the study they participated. For example, in some cases several studies are reported in one publication with their respective sample sizes, and in some cases different samples were used for each type of data collected.
4b. Gender: Indicates the gender distribution in the sample (male, female).
4c. Age: Indicates the age distribution in the sample (i.e., chronological age).
4d. Study population: Indicates the occupation of participants (e.g., students or individuals of a certain profession).
4e. Background: Specifies (if indicated) the level of experience of the participants and lists further details (if available) that give some indication of the participants' background.
5. Stimuli: provides details concerning the stimuli used.
5a. Task characteristics: provides details concerning the tasks.
5b. Programming language: specifies the programming language in which the stimuli were represented.
5c. Size: indicates the size (in lines) of the code snippets used.
6. Experimental Design: provides details concerning the experimental design.
6a. Setting (''Laboratory/Field''): We classified whether an empirical study collected data in a controlled environment (''laboratory'') or in a context that is natural to the study population (''field'').
6b. Manipulation design (''between-subject/within-subject''): We indicate whether participants experienced all, or at least several, of the conditions of an independent variable (''within-subject''), only one (''between-subject''), or whether the design was ''mixed''. We indicate studies without different experimental conditions with ''no conditions''.
6c. Experimental procedure: We provide details concerning the experimental procedure. In particular, we mention the presence of repeated measurement (either under different conditions or the presence of multiple trials for the same condition at different points in time).
7. Constructs and Research Questions
7a. Constructs: Includes a list of variables that were investigated in the respective studies. Variables were included here if they were actually measured, as highlighted mostly in the ''Methods'' section of a publication.
Further, we offer a classification for the type of variable, indicating the relationship between the involved variables. The categories are independent variable (''IV'') and dependent variable (''DV''). If available, we based this classification on a study's research model or hypotheses. By default, if a research model or hypotheses were not available, we assumed that manipulated variables were IVs, while measured responses were DVs. When a publication followed a data-driven approach using machine learning, we assumed the input variables (i.e., the features) to be IVs and the predicted variables to be DVs.
7b. Research questions and hypotheses: lists research questions (and if available hypotheses) investigated in the studies. If not mentioned explicitly in the paper, research questions and hypotheses were reconstructed from the analyses actually conducted in the paper.
7c. Relationship between constructs: states which constructs were used to answer a certain research question or to test a specific hypothesis. Causal relationships are indicated as ''IV=>DV'', while correlations are indicated as ''DV<=>DV''.
8. Data Analysis: Indicates the statistical methods used to test hypotheses or to investigate the stated research questions. In some cases, statistical tools used for data cleaning are also mentioned, although we did not focus on data preparation (e.g., artifact removal) at this point.
RQ4 was answered based on this data.
9. Main findings: This section includes summaries of the research findings, predominantly those related to neurophysiological data. Findings as reported in our review are those related to the research questions and hypotheses investigated in the paper.
RQ5 was answered based on this data.

Results
This section presents major findings of our literature review structured along our four research questions.

Who published NeuroSE research and where (RQ1)?
Based on the analysis of N = 89 NeuroSE publications, we identified 191 different authors. The average number of authors per publication is 4.25, and the maximum number of authors is 12. Specifically, we found the following distribution: 8 papers had 1 author (abbreviated: 8P/1A), 17P/2A, 15P/3A, 13P/4A, 12P/5A, 9P/6A, 6P/7A, 2P/8A, 5P/9A, 1P/10A, and 1P/12A. Another finding of our analysis is that out of the 191 different authors, 26 researchers (∼14%) authored at least 4 publications and together contributed to 42% of all publications. Furthermore, 14 researchers (∼7%) published at least 5 papers and were involved in 28% of all publications (cf. Fig. 2). Fig. 2 shows the concentration of NeuroSE publications across authors. Based on our dataset, we calculated the Gini coefficient (GC), a popular measure of inequality. The GC is 0.39; GC = 0 expresses perfect equality, where all authors would have contributed an equal number of publications to the NeuroSE literature, and GC = 1 expresses maximal inequality.
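The Gini coefficient used above can be computed directly from per-author publication counts via the mean absolute difference. A minimal sketch (the counts in the example are hypothetical, not our dataset):

```python
def gini(counts):
    """Gini coefficient via mean absolute difference (0 = perfect equality, 1 = maximal inequality)."""
    n = len(counts)
    mean = sum(counts) / n
    # sum of |x - y| over all ordered pairs of authors
    diff_sum = sum(abs(x - y) for x in counts for y in counts)
    return diff_sum / (2 * n * n * mean)

# hypothetical per-author publication counts
print(gini([1, 1, 1, 1]))  # all authors equally productive -> 0.0
print(gini([1, 1, 2, 8]))  # one dominant author -> closer to 1
```

This O(n²) formulation is fine for a few hundred authors; for large datasets a sorted O(n log n) variant would be preferable.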

What kind of NeuroSE research was published (RQ2)?
This section classifies all 89 NeuroSE papers in terms of their contribution and for the completed empirical research papers we also elaborate on their thematic orientation.

Which major thematic orientation did NeuroSE researchers choose (RQ2.2)?
This section summarizes the thematic orientation of the completed empirical research papers (N = 47). An overview is provided in Table 4. Our results show that existing NeuroSE research has a strong focus on code comprehension (30 out of 47). For example, Siegmund et al. (2014) collected fMRI data while participants were processing short source code snippets to map the brain regions that are active during code comprehension. Code inspection tasks were the focus of 6 papers; for example, Floyd et al. (2017) looked into differences in brain activation between code comprehension and code inspection. Programming tasks were addressed in 6 studies; for example, Yamamoto et al. (2016) used EEG to predict whether or not a programmer found an implementation strategy during a programming task. Moreover, change tasks were picked up by 3 papers. For example, Müller and Fritz (2015) predicted developers' affective states and perceived progress during change tasks using multi-modal measurements (i.e., EEG, eye tracking, and skin- and heart-related measurements). In addition, bug fixing as well as documentation were each picked up by only 1 paper. For example, González et al. (2015) used a consumer-grade EEG tool to compare the signals of participants when documenting code versus bug fixing and programming. Moreover, 5 papers were classified as general; one of these, for example, collected heart-related measurements in a field study with professional developers to predict interruptibility during general software development activities.

Which methods and measures were applied in NeuroSE publications (RQ3)?
This section summarizes which methods and measures were applied in NeuroSE publications and provides examples of typical research questions addressed by the different methods. Fig. 4 shows the extent to which different neurophysiological methods have been used in completed empirical NeuroSE research (N=47).

Overview of neurophysiological method usage
Methods related to brain activity measurement have been used frequently. Specifically, EEG was used 20 times, fMRI 10 times, and fNIRS 4 times. Regarding ANS activity measurement, we found the following: eye tracking was used 10 times, heart-related measurements 10 times, skin-related measurements 8 times, and we counted one use of a measurement in the category ''Other''. Moreover, we found that 38 studies used only one neurophysiological method, while 3 studies combined two neurophysiological methods, 5 studies combined three, and 1 study combined four. For example, Fritz et al. (2014) combined EEG with eye tracking and skin-related measurements, while Müller and Fritz (2016) additionally used heart-related measurements.

Overview of collected neurophysiological measures
An overview of the collected neurophysiological measures categorized by method is provided in Tables 5 and 6. An overview of the used measurement instruments categorized by method is provided in Appendix E.
EEG measures. Most studies using EEG focused on the analysis of frequency bands (i.e., the EEG signal was decomposed into the frequency bands Alpha, Beta, Gamma, Delta, and Theta, which were then analyzed for power differences). In addition, several studies considered frequency band ratios, e.g., the ratio Alpha/Beta, and one study used simple time-domain features like the (normalized) signal average and variance. A few studies considered compound signals (e.g., attention) provided by the used EEG device through pre-built algorithms. Moreover, 3 studies used the eye blinking rate as a feature, i.e., the number of eye blinks per minute. Two studies captured interhemispheric differences (i.e., power differences between the right and left hemisphere) and two studies considered non-directed functional connectivity measures (i.e., statistical associations between spatially distinct brain areas). More specifically, one study used cross correlations of frequency band power between electrodes and one study used Phase Locking Values, i.e., measures of phase synchrony between pairs of electrodes. Our analysis also shows that apart from Ikramov et al. (2019), Lee et al. (2016, 2017), and Medeiros et al. (2019), who used research-grade EEG instruments by the companies Brain Products (2 studies), Compumedics (1 study), and Mitsar (1 study), all remaining EEG studies used low-cost consumer-grade EEG devices for data collection. The Emotiv EPOC was used 5 times, the NeuroSky mindset headset 3 times, and the NeXus 10 MARK II 3 times, while the NeuroSky mindwave headset, Emotiv EPOC+, BrainLink Pro, and BIOPAC MP150 were each used once (for a discussion of potential limitations of using consumer-grade EEG devices in a research context, please see Section 5.3).
Table 5. Overview of collected neurophysiological measures (Part 1, brain activity measurement tools).
fMRI measures. All fMRI studies explicitly stated the captured signal, the BOLD contrast. In addition, one study used directed functional connectivity measures, i.e., it exploited temporal precedence information to detect the influence of brain regions and its direction (Kim et al., 2013). 7 out of 10 studies were run in a 3T Magnetom Trio Tim MRI scanner by Siemens. The three remaining studies were conducted in a 3T Magnetom Prisma scanner by Siemens, a 3T General Electric MR750, and a 3T Philips Achieva Multix X-Series scanner.
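The EEG frequency-band features described above (band power and ratios such as Alpha/Beta) can be illustrated with a naive discrete Fourier transform. A minimal sketch with a hypothetical sampling rate and a synthetic signal; the reviewed studies use dedicated toolchains (FFT-based spectral estimation with artifact removal), not this simplistic approach:

```python
import math

def band_power(signal, fs, f_lo, f_hi):
    """Power of `signal` in the band [f_lo, f_hi) Hz via a naive one-sided DFT."""
    n = len(signal)
    power = 0.0
    for k in range(1, n // 2):
        f = k * fs / n  # frequency of DFT bin k
        if f_lo <= f < f_hi:
            re = sum(x * math.cos(2 * math.pi * k * i / n) for i, x in enumerate(signal))
            im = sum(x * math.sin(2 * math.pi * k * i / n) for i, x in enumerate(signal))
            power += (re * re + im * im) / (n * n)
    return power

fs = 128  # hypothetical sampling rate in Hz
# synthetic 2 s "EEG": a 10 Hz (Alpha) component plus a weaker 20 Hz (Beta) component
sig = [math.sin(2 * math.pi * 10 * i / fs) + 0.5 * math.sin(2 * math.pi * 20 * i / fs)
       for i in range(2 * fs)]
alpha = band_power(sig, fs, 8, 13)   # Alpha band: 8-13 Hz
beta = band_power(sig, fs, 13, 30)   # Beta band: 13-30 Hz
alpha_beta_ratio = alpha / beta      # ratio feature as used in several studies
```

Since the 10 Hz component has twice the amplitude of the 20 Hz component, the Alpha/Beta power ratio comes out at roughly 4 here.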
fNIRS measures. 2 studies used the concentration of oxygenated hemoglobin in the cerebral blood flow (i.e., Oxy-Hb), while 2 studies used the ratio of oxygenated and deoxygenated hemoglobin (i.e., Oxy-Hb/DeOxy-Hb). Devices for data collection included a NeXus10 with a Nexus HEG sensor, a Wearable Hikari Topography WOT-200, an fNIR100 by BIOPAC, and a CW6 fNIRS by TechEn Inc.
Eye-related measures. The studies covered by our review analyzed pupil size, i.e., a measurement describing how large the pupil is; the LF/HF ratio, i.e., the low frequency/high frequency ratio of the pupil size variability (Shaffer and Ginsberg, 2017); the eye blink rate, i.e., the number of eye blinks over time (Holmqvist and Andersson, 2017); and the eye blink duration, i.e., the average duration of the eye blinks (Holmqvist and Andersson, 2017). Devices included a low-cost eye tracker by Eye Tribe 60 Hz (3 studies), SMI eye tracking glasses 60 Hz (2 studies), an SMI eye tracker (2 studies), a Tobii TX 300 (1 study), a Tobii X3-120 (1 study), and an ASL eye tracking system (1 study). Please note that fixations and saccades are not included here, since they are not predominantly reflexive (cf. Exclusion Criterion 4). Note also that the BIOPAC MP150 and NEXUS 10 MARK II can be used not only to measure EEG, but can also be equipped with sensors to measure other physiological indicators such as HR; the authors of the papers that used these devices did, however, not specify the exact types of sensors used aside from their intention to measure EEG.
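The blink-based measures above are straightforward to derive once an eye tracker has segmented blink events. A minimal sketch, assuming blinks are given as hypothetical (start, end) timestamps in seconds:

```python
def blink_metrics(blinks, duration_s):
    """Blink rate (blinks/minute) and mean blink duration (s) from (start, end) events."""
    rate = len(blinks) / (duration_s / 60.0)                 # blinks per minute
    mean_dur = sum(end - start for start, end in blinks) / len(blinks)
    return rate, mean_dur

# hypothetical blink events over a 60 s recording
rate, mean_dur = blink_metrics([(1.0, 1.2), (10.0, 10.3), (20.0, 20.1)], 60.0)
```

Real pipelines additionally have to detect blinks in the first place (e.g., from pupil-signal dropouts), which is where most of the practical effort lies.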
Heart-related measures. Our results show that studies used ECG measures, heart rate (HR), i.e., the number of heart beats per minute (7 studies), and heart rate variability (HRV), i.e., the changes in the time intervals between consecutive heartbeats, called interbeat intervals (IBIs) (Shaffer and Ginsberg, 2017) (8 studies). More specifically, we could observe the usage of several HRV time-domain measures like SDNN (i.e., the standard deviation of NN intervals), RMSSD (i.e., the root mean square of successive RR interval differences), and pNN20 and pNN50 (i.e., the percentage of successive RR intervals that differ by more than 20 ms and 50 ms, respectively). In addition, two studies used frequency-domain features like the low/high frequency ratio of the ECG RR interval variability (Shaffer and Ginsberg, 2017).
Table 6. Overview of collected neurophysiological measures (Part 2, autonomic nervous system measurement tools).
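The HRV time-domain measures listed above follow directly from their definitions once a series of interbeat intervals is available. A minimal sketch (the IBI values are hypothetical):

```python
import math
import statistics

def hrv_time_domain(ibis_ms):
    """Common HRV time-domain measures from interbeat intervals in milliseconds."""
    diffs = [b - a for a, b in zip(ibis_ms, ibis_ms[1:])]  # successive interval differences
    return {
        "HR": 60000 / statistics.mean(ibis_ms),                    # beats per minute
        "SDNN": statistics.stdev(ibis_ms),                         # std. dev. of intervals
        "RMSSD": math.sqrt(sum(d * d for d in diffs) / len(diffs)),
        "pNN20": 100 * sum(abs(d) > 20 for d in diffs) / len(diffs),
        "pNN50": 100 * sum(abs(d) > 50 for d in diffs) / len(diffs),
    }

# hypothetical interbeat intervals in milliseconds
metrics = hrv_time_domain([800, 810, 790, 860, 800])
```

In practice, the intervals must first be cleaned of ectopic beats and measurement artifacts before such statistics are meaningful (Shaffer and Ginsberg, 2017).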
Other measures. 1 study considered the respiratory rate, i.e., the number of breaths taken per minute, and used the chest strap SenseCore for data collection (this sensor is not available anymore).

Usage of methods
To guide the interested reader to the actual usage of the different neurophysiological methods, Appendix G provides an overview of the research questions answered using the different methods. In the remainder of this section, we briefly summarize the type of research questions addressed, structured by themes.
This overview shows that fMRI studies primarily looked into which brain regions are activated during different software development activities (e.g., code comprehension) or task-related events (e.g., ''bug suspicion''). For example, Siegmund et al. (2014) investigated the question ''Which brain regions are activated during program comprehension?'' and Duraes et al. (2016) looked into the question ''What are the brain activation patterns associated with bug confirmation?''. Moreover, several fMRI studies investigated differences in brain activation and in the brain regions involved between different experimental conditions. For example, Floyd et al. (2017) examined the question ''Are neural representations of programming languages and natural languages distinct?'' and Liu et al. (2020) investigated ''Is the neural signature of code comprehension similar to other culturally derived symbol systems (i.e., logic and math) or similar to natural language?''. Differences in brain activation between different conditions were also the focus of several EEG studies. For example, Kosti et al. (2018) investigated ''How do comprehension and syntax tasks differ in terms of patterns of brain activation?''.
Another group of studies looked for correlates between brain activity and constructs such as cognitive load, task difficulty, or task performance. For example, using fMRI, Huang et al. (2019) investigated ''What is the impact of task difficulty on brain activation?''. As another example, Duraisingam et al. (2017) asked ''Is task difficulty reflected in EEG electrical signal within programming comprehension tasks?'', and based on fMRI Castelhano et al. (2018) examined ''Does activation in the anterior insula correlate with bug detection precision?''. Note that the insula is a brain region which is related to emotionally aversive stimuli, including spiders and snakes, anticipation of physical pain, excessive prices in purchase decisions, as well as situations characterized by uncertainty, ambiguity, and distrust (for a collection of references, see the brief review by Riedl et al. (2010b, p. 405)).
Various EEG studies and studies using (multi-modal) measurements related to ANS activity looked into the efficacy of neurophysiological measures to predict different dependent variables (e.g., cognitive load, interruptibility, affective state) in both offline and online settings. For example, Couceiro et al. (2019b) examined ''Can a developer's cognitive load be measured during code reading using pupillography?'' and Fritz et al. (2014) addressed the question ''Can we acquire psycho-physiological measures from eye tracking, EDA and EEG sensors to accurately predict task difficulty?''. Similarly, another study investigated the question ''Can we build a classifier that predicts a software developer's interruptibility accurately in the field?''. Moreover, several studies investigated which combination of neurophysiological measures is best suited for a certain prediction and aimed to shed light on how they compare to more traditional measures. For example, Fritz et al. (2014) addressed the question ''Which combination of sensor and associated features works best?'' and Müller and Fritz (2016) posed the question ''How do biometrics compare to more traditional metrics for predicting perceived difficulty and detecting quality concerns?''.
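To make this prediction setting concrete, the following deliberately simple nearest-centroid sketch frames biometric signals as features and a construct (here, task difficulty) as the predicted label. All feature values and labels are hypothetical; the reviewed studies use richer feature sets and stronger learners (e.g., decision trees or SVMs with cross-validation):

```python
import math

# hypothetical training data: (mean pupil size mm, heart rate bpm, EDA level) -> difficulty
train = [
    ((3.1, 72.0, 0.20), "easy"),
    ((3.0, 70.0, 0.18), "easy"),
    ((4.2, 85.0, 0.55), "hard"),
    ((4.0, 88.0, 0.60), "hard"),
]

def centroids(data):
    """Mean feature vector per label."""
    by_label = {}
    for features, label in data:
        by_label.setdefault(label, []).append(features)
    return {label: tuple(sum(col) / len(rows) for col in zip(*rows))
            for label, rows in by_label.items()}

def predict(features, cents):
    """Label of the nearest class centroid (Euclidean distance)."""
    return min(cents, key=lambda label: math.dist(features, cents[label]))

cents = centroids(train)
label = predict((4.1, 86.0, 0.50), cents)
```

Note that without feature standardization, large-scale features such as heart rate dominate the Euclidean distance, which is one reason real studies normalize their biometric features.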
Eye tracking was used together with other modalities for the fine-grained analysis of cognitive load. For example, Couceiro et al. (2019a) investigated ''Can eye tracking together with HRV and pupillography be used to identify non intrusively code lines (and even lexical tokens inside code lines) that correspond to mental effort peaks?''. Similarly, Fakhoury et al. (2018, 2020) posed the question ''Can developers' cognitive load be accurately associated with identifiers' terms using fNIRS and eye tracking devices?''. In addition, several studies using eye tracking investigated pupil size and/or eye blinks along with behavioral eye tracking metrics (e.g., fixations and/or saccades) (Ahrens et al., 2019; Aschwanden and Crosby, 2006; Jbara and Feitelson, 2015; Wulff-Jensen et al., 2019). For example, Aschwanden and Crosby (2006) investigated ''What are the scanning patterns during program comprehension?'', while Jbara and Feitelson (2015) looked into ''Are developers' visual efforts equally divided among regular segments?''. Based on the formulation of these questions it is obvious that the eye measures related to ANS activity (pupil size, eye blinks) did not play a major role in these studies; rather, they focused on fixation patterns. We nevertheless kept these studies in our review, but only analyzed those aspects of the papers that were related to ANS activity.

How was the empirical NeuroSE research conducted (RQ4)?
In this section we report additional details on the completed empirical studies (N=47). This includes details concerning study participants, stimuli, experimental design, independent and dependent variables as well as data analysis approaches used.

Study participants, stimuli, and experimental design
Study Participants. Sample sizes of the empirical studies ranged from 2 to 70 participants with a median of 17. The distribution of sample sizes is depicted in Fig. 5.
Female participation ranged from 0% to 63.16% with a median of 13.39%. Participants' ages ranged from 16 to 60 years with a median age of 26 years (note that information on age was not provided in 20 out of 47 studies). When analyzing sample sizes per method we found the following median sample sizes: fMRI (18, 10 studies), fNIRS (13, 4 studies), EEG (10, 20 studies), eye tracking (18.5, 10 studies), heart-related measurements (22, 10 studies), and skin-related measurements (16, 8 studies). The majority of studies (27) relied on student subjects. Two studies used faculty members in addition to students (i.e., Jbara and Feitelson (2015) and Kosti et al. (2018)). Five studies used both students and professional software developers (e.g., Ahrens et al. (2019)). 7 studies did not further specify whether their participants were students or professional developers. Appendix I summarizes details concerning the study participants.
Stimuli. Most studies used simple code snippets between 3 and 60 lines. In turn, a few studies used realistic tasks like adding a feature to an existing library (i.e., Ahrens et al. (2019), Müller and Fritz (2015), and Züger and Fritz (2015)), and 3 studies were even conducted in a real-world setting (e.g., Müller and Fritz (2016) and Vrzakova et al. (2020)). The programming languages used were primarily Java (28 out of 47), C (7 out of 47), Python (3 out of 47), and Scratch (2 out of 47), but also C/C++, C#, and Processing (1 in each case). Detailed information about the tasks and their complexity is not provided in all studies. Appendix J provides an overview of task characteristics, programming languages, and the size of the used code snippets.
Experimental Design. With the exception of Ahonen et al. (2016, 2018), Müller and Fritz (2016), Vrzakova et al. (2020), and Züger and Fritz (2015), all studies were conducted in the lab. Züger and Fritz (2015) reports on the results of two studies (one in the lab and one in the field). Most of the studies (namely 37) used a within-subject design and conducted repeated measurements of the same condition at different points in time (e.g., by conducting multiple trials) and/or applied different experimental conditions. In addition, 3 studies used a between-subject design, two studies used a mixed design, and 5 studies did not apply an experimental design.

Dependent and independent variables
The majority of studies used neurophysiological data as dependent variable. An overview of independent variables whose effect on neurophysiological data was investigated is shown in Table 7. The independent variables are categorized into task-specific factors, developer-specific factors, team-specific factors, task performance, and context factors. Task-specific factors are further sub-divided into software development activities, task-related events, and task characteristics. These categories emerged as a result of a bottom-up coding process of the literature.
We found a substantial number of studies that looked into brain activation depending on different task-specific factors. In particular, several studies looked into brain activation patterns for specific software development activities (e.g., code comprehension, code inspection). For example, Peitek et al. (2018a) and Siegmund et al. (2014, 2017) investigated which areas of the brain become activated while engaging in code comprehension tasks in contrast to syntax tasks (i.e., tasks that require subjects to spot syntactical errors without requiring them to understand the behavior of the program). Moreover, differences in brain activation patterns as well as brain connectivity patterns during code comprehension vs. syntax tasks are investigated in Kosti et al. (2018). Brain activation during code comprehension versus code inspection is examined in Castelhano et al. (2018); specifically, this study contrasted code inspection (i.e., searching for bugs) with code understanding (i.e., reading neutral code). Programming, bug fixing, and documenting code are compared in González et al. (2015). Moreover, several studies looked into the difference between code and prose. For example, Castelhano et al. (2018) contrasted brain activation during source code understanding and pseudo-code text reading. In addition, Floyd et al. (2017) compared code inspection with prose review. Several recent studies examined differences in brain activation during code-related activities (code comprehension or data structure manipulations), as well as brain activation in regions related to mental rotation, math/logic/language/multi-source interference, sentence comprehension, nonword reading, and hard memory tasks.
Existing studies not only contrasted different software development activities, but also different task-related events. For example, based on fMRI, Duraes et al. (2016) and Castelhano et al. (2018) analyzed how brain activation differs for the events ''bug suspicion'' and ''bug confirmation''. Moreover, several studies investigated the impact of different task characteristics on neurophysiological measures. For example, the study described in Siegmund et al. (2017) investigated the role of bottom-up program comprehension and comprehension with semantic cues in terms of brain activation and examined how layout and beacons in source code influence program comprehension. In addition, Fakhoury et al. (2018) investigated the impact of linguistic antipatterns and structural inconsistencies on perceived task difficulty and brain activation. The role of task difficulty on brain activation was investigated, in turn, in Couceiro et al. Additionally, studies looked into developer-specific factors like expertise (i.e., Lee et al. (2016)). For example, Lee et al. (2016) investigated the impact of expertise on brain activation. Several studies picked up team-specific factors. For example, Ahonen et al. (2018, 2016) investigated synchrony in heart- and skin-related signals during pair programming. A few studies looked into the relationship between task performance and brain activation (i.e., Yamamoto et al. (2016)). For example, one of these studies examined the relationship of expertise and task performance (in terms of correct answers) with brain activation patterns. Finally, context factors were addressed in one study; Ikramov et al. (2019) investigated the role of music on brain activation during programming.

10 Syntax tasks require subjects to spot syntactical errors without requiring them to understand the behavior of the program.

Table 7
Overview of independent variables whose effect on neurophysiological data was investigated.

Task-specific factors / Software development activities:
- code comprehension vs. syntax tasks (Peitek et al., 2018a; Siegmund et al., 2017, 2014)
- code comprehension vs. fake code/math/logic/language/multi-source interference
- code vs. sentence comprehension vs. nonword reading
- code comprehension vs. hard working memory task
- code understanding vs. pseudocode reading
- code comprehension vs. inspection
- programming vs. bug fixing vs. documenting (González et al., 2015)
- data structure manipulation vs. mental rotation

Task-specific factors / Task-related events:
- bug suspicion vs. bug confirmation (Duraes et al., 2016)

Task-specific factors / Task characteristics:
- task difficulty (Couceiro et al., 2019a,b; Duraisingam et al., 2017; Huang et al., 2019; Ikutani and Uwano, 2014; Nakagawa et al., 2014; Yeh et al., 2017)
- problem type (Ikutani and Uwano, 2014)
- bottom-up versus semantic cues
- presence of English identifiers
- layout
- linguistic antipatterns and structural inconsistencies
- structural and textual features (Wulff-Jensen et al., 2019)
- code regularity (Jbara and Feitelson, 2015)
- data representation (Ahrens et al., 2019)
- textual vs. visual programming language (Doukakis, 2019; Doukakis et al., 2020)
- paper versus whiteboard

Developer-specific factors:
- expertise (Lee et al., 2016)

Team-specific factors:
- pair programming (collaborating dyads vs. shuffled pairs) (Ahonen et al., 2018)
- programming role (solo, pair/navigator, pair/driver)
- role differences (driver, navigator) in task-relevant events (running and testing code) (Ahonen et al., 2018)

Task performance:
- implementation strategy
- correctness of responses
- passing and failing in task-relevant events (running and testing code) (Ahonen et al., 2018)
- completion of comprehension step (Ishida and Uwano, 2019b,a)
- completion of judgment step (Ishida and Uwano, 2019a)

Context factors:
- listening to music

Fig. 6 shows a Sankey diagram summarizing how the studies in which neurophysiological data was used as dependent variable were conducted. The studies by Ahonen et al. (2018, 2016) … Fig.
6 connects the software development activities (column 1), the category of the independent variables (column 2), the independent variables (column 3), the neurophysiological measures (column 4), and the neurophysiological methods (column 5). The thickness of the links provides an indication of how often a certain connection was observed. Colors were chosen randomly and hence do not have a specific meaning. For example, focusing on column 1, the Sankey diagram shows that the majority of studies focused on code comprehension, followed by programming. Looking at column 2, the figure shows that a strong focus was on task-specific factors. The links between column 1 and column 2 show, for example, that task-specific factors were investigated in the context of various software development activities. Looking at the connections between column 2 and column 3, it becomes apparent that studies put a strong focus on task characteristics, followed by software development activities. The links between column 3 and column 4 show that differences between software development activities were mainly investigated using measures obtained from fMRI, EEG, and fNIRS. The substantially higher number of outgoing links when compared to incoming links in column 3 shows that several studies used different measures in combination. The figure also shows that eye tracking was mainly used to investigate different task characteristics. Moreover, it is clearly visible that methods related to brain activation (i.e., fMRI, EEG, and fNIRS) were much more frequently used than methods related to ANS activity.
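The link thicknesses in such a Sankey diagram are, in essence, co-occurrence counts over study records. The following minimal sketch derives Sankey-style links from purely hypothetical study records (the records below are illustrative only, not the SLR's actual data):

```python
from collections import Counter

# Hypothetical study records: (development activity, variable category,
# neurophysiological method). Illustrative only.
studies = [
    ("code comprehension", "task-specific", "fMRI"),
    ("code comprehension", "task-specific", "EEG"),
    ("code comprehension", "developer-specific", "EEG"),
    ("programming", "team-specific", "EDA"),
    ("programming", "task-specific", "EEG"),
]

# One Sankey link per adjacent pair of columns; the link "value"
# (drawn as thickness) is how often the pair co-occurs across studies.
links = Counter()
for activity, category, method in studies:
    links[(activity, category)] += 1
    links[(category, method)] += 1

for (src, dst), value in sorted(links.items()):
    print(f"{src} -> {dst}: {value}")
```

Feeding these (source, target, value) triples into any Sankey plotting library reproduces the reading described above: thicker links correspond to more frequent combinations.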
In addition to the studies in which neurophysiological data were used as dependent variable, several studies also used neurophysiological measures as independent variables. An overview of the dependent variables that were predicted based on neurophysiological data is shown in Table 8. The dependent variables are categorized into task-specific factors, developer-specific factors, mental states, and task performance. These categories emerged as a result of a bottom-up coding process of the literature.
All of these studies followed a data-driven approach and extracted various features from the neurophysiological data (in several cases along with additional data sources such as behavioral data). Neurophysiological data was used to predict task-specific factors like the software development activity (e.g., Floyd et al. (2017)), the task (sub)category, the problem structure, or the presence of quality concerns (Müller and Fritz, 2016). For example, Floyd et al. (2017) distinguished code comprehension, code inspection, and prose review. Moreover, neurophysiological measures were used to predict developer-specific factors like expertise (Crk and Kluthe, 2014; Lee et al., 2017) or task performance (e.g., correctness of responses). In addition, several studies used neurophysiological data to predict different psychological constructs related to cognitive or emotional processes of the developer. In the remainder of this paper, we refer to these constructs as mental states (both cognitive and affective). Drawing upon Cowley et al. (2016), we define a mental state as any interesting aspect of an individual's state that can be interpreted from this individual's physiology and thus measured by sensors. Examples of mental states include cognitive load, affective state, perceived progress, interruptibility, and stress. For example, Fritz et al. (2014) used EEG, eye tracking, and EDA features to predict cognitive load during program comprehension. Similarly, Duraisingam et al. (2017) investigated whether the difficulty of a program comprehension task and the associated cognitive load can be predicted from EEG. In addition, Müller and Fritz (2015) used EEG, eye tracking, skin-related measures, and heart-related measures to predict a developer's affective state and perceived progress. The study by Züger and Fritz (2015) predicted interruptibility from EEG, skin-related measures, and heart-related measures.
Similarly, another study considered heart-related data along with computer interaction, sleep-, and physical-activity-related data to predict interruptibility.
Appendix F provides a detailed overview of all the measures, grouped by neurophysiological method, that were used to predict the various dependent variables. Fig. 7 shows a Sankey diagram providing an overview of all the studies in which neurophysiological data was used as independent variable to predict an outcome variable (i.e., it is based on the studies by Behroozi and Parnin (2018), Müller and Fritz (2015, 2016), Vrzakova et al. (2020), Züger and Fritz (2015), and others). The Sankey diagram connects the software development activities (column 1), the category of the dependent variables (column 2), the dependent variables (column 3), and the neurophysiological methods that were used to predict the dependent variables (column 4). Looking at column 1, we see that code comprehension was the focus of the majority of studies, followed by studies on general software development activities. Looking at column 2, we see that mental states were the prevalent category of dependent variables. The links between column 1 and column 2, for example, show that dependent variables of all four categories were predicted during code comprehension, while for change tasks only mental states were predicted. For the node ''Mental State'' the figure shows one more outgoing link than incoming links, which indicates that one of the studies predicted two distinct mental states. In turn, for the node ''Task-specific factor'' there is a higher number of incoming than outgoing links, which signifies that one study investigated more than one software development activity. Focusing on the connections between column 2 and column 3, we can observe that cognitive load was by far the most popular dependent variable predicted using neurophysiological measurements. The high number of links between column 3 and column 4 highlights that the same neurophysiological methods were used to predict several dependent variables.
For example, for predicting cognitive load, EEG, eye tracking, as well as heart-, skin-, and breath-related measures were used. Moreover, the higher number of outgoing links when compared to incoming links at column 3 signifies that several studies relied on multi-modal measurements combining several neurophysiological methods. For example, to predict the presence of quality concerns, skin-, heart-, and breath-related measures were used. Compared to Fig. 6, methods related to ANS activity appeared much more frequently than methods related to the CNS (i.e., fMRI, EEG, and fNIRS). Thus, our review reveals that brain activity measurement is predominantly used as dependent variable (predicted by other factors), while ANS activity measurement is frequently used as independent variable (to predict other factors).

Table 8
Overview of dependent variables that were predicted based on neurophysiological data.

Task-specific factors:
- problem structure (if and for statements)
- presence of quality concerns (Müller and Fritz, 2016)

Developer-specific factors:
- expertise (Crk and Kluthe, 2014; Lee et al., 2017)

Mental states:
- cognitive load (Couceiro et al., 2019c; Duraisingam et al., 2017; Fritz et al., 2014; Kosti et al., 2018; Lee et al., 2017; Medeiros et al., 2019; Müller and Fritz, 2016)
- interruptibility (Züger and Fritz, 2018, 2015)
- affective state (Girardi et al., 2020; Müller and Fritz, 2015; Vrzakova et al., 2020)
- perceived progress (Müller and Fritz, 2015)
- stress

Task performance:
- correctness of responses
For each of the 47 completed empirical studies, Appendix K provides a detailed overview of the research questions along with the independent and dependent variables that were used to answer these questions and a concise summary of the main findings (cf. supplementary material). Whenever neurophysiological data was used to operationalize these variables, the measures used are listed.

Data analysis
Data analysis methods applied to examine research questions where neurophysiological data was the dependent variable include descriptive statistics and visual analysis, tests for differences in means, tests of relationships (i.e., regression and correlation), and brain connectivity analysis.
• Descriptive statistics and visual analysis: applied, for example, by Jbara and Feitelson (2015) and Nakagawa et al. (2014).
• Testing for differences: Multiple studies applied statistics to test for differences in means, i.e., the Ryan method (Ikutani and Uwano, 2014), ANOVA (González et al., 2015; Lee et al., 2016), t-test (Ahrens et al., 2019; Huang et al., 2019; Ishida and Uwano, 2019a; Jbara and Feitelson, 2015; Yamamoto et al., 2016), ANCOVA, paired t-test and ANOVA (Wulff-Jensen et al., 2019; Yeh et al., 2017), Mann-Whitney U test (Ikramov et al., 2019), Wilcoxon signed-rank test (Fakhoury et al., 2018), Kruskal-Wallis test, and a permutation test (Ahonen et al., 2018). Moreover, the study described in Fakhoury et al. (2018) used the Simple Matching Coefficient to compare the similarity and diversity of sample sets. In addition, Ahonen et al. (2018) used the minimum-width envelope method to compare physiological signals and identify periods where the different conditions significantly differed.
• Testing relationships: Several studies applied correlation analyses. Moreover, regression analyses were used by Girardi et al. (2020), Müller and Fritz (2015), and Vrzakova et al. (2020). In addition, to analyze the BOLD response in brain imaging studies, 5 fMRI studies (e.g., Duraes et al., 2016; Peitek et al., 2018a; Siegmund et al., 2017, 2014) used a Random Effects General Linear Model (a type of regression analysis) to determine how brain activation changes across the different experimental conditions. Moreover, 3 studies (e.g., Ivanova et al., 2020; Liu et al., 2020) used a multi-level approach for analyzing the BOLD response: in a first step they estimated parameters at an individual level for each subject using a General Linear Model (GLM) and then conducted a group-level analysis using a Random Effects GLM.
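To make the difference-testing approaches concrete, the permutation test mentioned above (as used, e.g., by Ahonen et al.) is simple to sketch end to end. Below is a minimal, self-contained two-sided paired permutation test; the per-subject pupil-dilation means for hard vs. easy tasks are purely hypothetical, not data from any of the cited studies:

```python
import random
import statistics

def paired_permutation_test(a, b, n_perm=5000, seed=0):
    """Two-sided paired permutation test on the mean within-subject
    difference. Each permutation flips the sign of each difference at
    random; the p-value is the fraction of permuted mean differences
    at least as extreme as the observed one."""
    rng = random.Random(seed)
    diffs = [x - y for x, y in zip(a, b)]
    observed = abs(statistics.fmean(diffs))
    hits = 0
    for _ in range(n_perm):
        permuted = [d if rng.random() < 0.5 else -d for d in diffs]
        if abs(statistics.fmean(permuted)) >= observed:
            hits += 1
    # +1 correction so the p-value is never exactly zero
    return (hits + 1) / (n_perm + 1)

# Hypothetical per-subject pupil-dilation means (mm): hard vs. easy tasks
hard = [4.1, 3.9, 4.4, 4.0, 4.3, 4.2, 3.8, 4.5]
easy = [3.6, 3.7, 3.9, 3.5, 4.0, 3.8, 3.6, 4.1]
p = paired_permutation_test(hard, easy)
```

Unlike the parametric t-test, this test makes no normality assumption, which is why several of the cited studies preferred non-parametric alternatives for small samples.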
• Brain connectivity analysis: 3 papers performing brain-related measurements not only analyzed brain activation, but additionally focused on the analysis of functional brain connectivity (i.e., Castelhano et al. (2018), Kosti et al. (2018), Lee et al. (2016)). For example, Castelhano et al. (2018) focused on brain connectivity analysis using Granger Causality Maps. Similarly, Kosti et al. (2018) not only analyzed differences in brain activation between code comprehension and syntax tasks, but additionally examined connectivity patterns and showed a topographical representation of phase interactions.
Data analysis methods applied to analyze research questions where neurophysiological data was the independent variable included supervised learning approaches (in particular classification) and unsupervised learning in the form of clustering. Almost all studies where neurophysiological data was used as an independent variable (18 out of 19 studies, listed in Table 8) relied on supervised learning to predict dependent constructs from independent constructs (also denoted as features in this context), while one study used unsupervised learning (more specifically clustering) to predict expertise from EEG signals (Crk and Kluthe, 2014). Of the studies using supervised learning, 5 studies used feature selection methods to automatically choose the best features for classification (e.g., Couceiro et al. (2019c), Kosti et al. (2018), Medeiros et al. (2019), Müller and Fritz (2015), Züger and Fritz (2015)). Feature selection experiments investigating different combinations of features in terms of their classification accuracy were conducted by 9 studies (e.g., Fritz et al. (2014), Fucci et al. (2019), Girardi et al. (2020), Lee et al. (2017), Müller and Fritz (2016), Vrzakova et al. (2020)). In addition, model selection experiments comparing different classification algorithms were conducted by 5 studies to identify the best classification technique for a particular setting (e.g., Fucci et al. (2019), Girardi et al. (2020), Züger and Fritz (2015)). Furthermore, a comparison of different window sizes was performed by 3 studies (e.g., Fritz et al. (2014), Züger and Fritz (2015)).
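The feature selection and model selection experiments described above generally follow a common pattern: extract features per window, select a feature subset, and compare classifiers under cross-validation. The sketch below illustrates that pattern with scikit-learn on synthetic stand-in data; the feature semantics, classifier choices, and parameters are hypothetical, not those of the cited studies:

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for per-window neurophysiological features
# (e.g., EEG band powers, HRV, EDA statistics); labels = high/low load.
X, y = make_classification(n_samples=200, n_features=30,
                           n_informative=5, random_state=0)

results = {}
for name, clf in [("logreg", LogisticRegression(max_iter=1000)),
                  ("tree", DecisionTreeClassifier(random_state=0))]:
    # Pipeline keeps scaling and feature selection inside each CV fold,
    # avoiding information leakage from test folds.
    pipe = make_pipeline(StandardScaler(),
                         SelectKBest(f_classif, k=10),  # feature selection
                         clf)
    results[name] = cross_val_score(pipe, X, y, cv=5).mean()

print(results)
```

Repeating the loop over different values of `k` (feature selection experiment) or over additional classifiers (model selection experiment) mirrors the comparisons reported in the studies above.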

Neurophysiological data used as dependent variable
This section provides a synthesis of the findings of the studies that looked into the effects of various task-specific factors, developer-specific factors, team-specific factors, task performance, and context factors on neurophysiological measures. For task-specific factors, we distinguish software development activities, task-related events, and task characteristics. For an overview of the main findings, we group studies focusing on the same independent variable and discuss all studies related to an independent variable in a paragraph each. Whenever an independent variable was investigated for different software development activities, we group studies referring to the same activity together, since findings for one software engineering activity (e.g., code comprehension) cannot necessarily be transferred to another software engineering activity (e.g., code inspection).
Task-specific factors: Software Development Activities. The research reviewed in this paper provided novel insights into the brain activation patterns that emerge while performing different software development activities.
The first study investigating brain activation patterns in a software engineering context was the study by Siegmund et al. (2014). For program comprehension, based on fMRI, Siegmund et al. (2014) found that five different brain regions associated with working memory (BA 6, BA 40), attention (BA 6), and language processing (BA 21, BA 44, and BA 47) (all in the left hemisphere) are activated (using syntax tasks as a contrast). These findings could be largely replicated by Siegmund et al. (2017): they found activation in BAs 21, 40, and 44 within the left hemisphere, but no activation of BAs 6 and 47. Additionally, Siegmund et al. (2017) found activation in BA 39, which is related to the integration of multi-sensory information. These findings are in line with our understanding of bottom-up program comprehension. Bottom-up program comprehension tasks require participants to analyze words and symbols and integrate them into semantic chunks (using the language network BA 21, BA 44, and BA 47, and presumably BA 39 for integration) and to manipulate numbers and words according to the intention of the source code, which requires keeping the values of manipulated numbers and words in mind (BA 6 and BA 40) (Peitek et al., 2018a). Siegmund et al. (2017) further demonstrated that during semantic-cue comprehension and bottom-up comprehension the same brain regions (with the exception of BA 39) were activated: BA 39 is deactivated during semantic-cue comprehension, but activated during bottom-up comprehension. For all areas, the activation is significantly lower for semantic-cue comprehension than for bottom-up comprehension, which confirms that beacons ease comprehension. The study additionally found that neither beacons nor program layout seem to significantly affect the program comprehension process.
Based on fMRI, Ivanova et al. (2020) aimed to disentangle brain activation due to program comprehension from brain activation that results from the underlying problem content (an aspect that had not been differentiated in the studies by Siegmund et al. (2017, 2014) and Peitek et al. (2018a)). Their results show strong bilateral responses to code comprehension tasks in the multiple demand (MD) system (i.e., regions in the frontal and parietal lobes, as well as a region in the anterior cingulate cortex). These responses were significantly stronger than for sentence problems. The fact that the MD system responds more strongly to code comprehension tasks than to tasks involving textual descriptions of the code demonstrates that the response of the MD system is specific to code comprehension and not just activated due to the underlying problem content. The activation in the MD system occurs irrespective of the problem type and problem structure, can be observed across most MD regions, and generalizes across programming languages. The results further suggest that the involvement of the language system (left-lateralized activation) was mostly driven by the processing of problem content rather than code comprehension, suggesting that the language system does not support code comprehension in proficient programmers. This work further showed that code comprehension is broadly supported by the MD system. At the same time, the paper found that no MD regions are functionally specialized to process source code, which is in agreement with the findings of Liu et al. (2020).
Based on fMRI, Liu et al. (2020) showed that during code comprehension (in contrast to tasks based on fake code) the lateral prefrontal cortex (middle/inferior frontal gyri, inferior frontal sulcus; mainly BA 44/46, with partial activation in BAs 6, 8, 9, 10, and 47), the parietal cortex (the intraparietal sulcus, angular and supramarginal gyri; BA 7), and the posterior middle temporal gyrus and superior temporal sulcus (BA 22/37) were activated in expert programmers. Activity was also observed in early visual cortices. Fronto-parietal responses were also observed in the studies by Siegmund et al. (2017, 2014) during code comprehension (in contrast to syntax tasks), by Ivanova et al. (2020) during code comprehension (when compared to textual descriptions of the code), by Huang et al. (2019) during data structure tasks, and by Floyd et al. (2017) during code inspection tasks (when compared to prose review). The study further found that brain activation patterns within this network differ for different control structures (i.e., for and if statements). This work further demonstrates that in terms of the underlying neural basis, code comprehension largely overlaps with other culturally derived symbol systems, in particular formal logic and, to a smaller degree, math. Moreover, consistent with Floyd et al. (2017) and Ivanova et al. (2020), this work showed that the neural bases of code comprehension and language are distinct. However, the laterality of code and language covaried across participants, an observation that was not made by Ivanova et al. (2020).
Based on fMRI and fNIRS, Huang et al. (2019) investigated differences between mental rotation and data structure operations (i.e., tree and sequence). fMRI results show that ''a number of Default Mode Network (DMN) regions involved in mental simulation were recruited more heavily during mental rotation than during code tasks; Still, 95% of voxels were statistically indistinguishable between Mental and Tree tasks'' (p. 402). DMN denotes a network of brain regions that are active ''when an individual is awake and alert and yet not actively engaged in an attention-demanding task'' (p. 682); this network comprises, among other areas, the medial prefrontal cortex (MPFC), as well as the posterior cingulate and precuneus (Raichle et al., 2001). fNIRS results demonstrate that mental rotation and data structure operations involve activation of the same brain regions (i.e., BAs 6-9, 17-19, 39 and 46). When comparing brain activation results of sequence, tree, and mental operations, only the comparison ''Sequence > Mental'' showed differences. This suggests that spatial ability operations and data structure operations are related. While both fMRI and fNIRS found brain activation in similar areas during both mental rotation and data structure tasks, it has to be noted that several of the task differences observed with fMRI could not be observed with fNIRS (presumably due to the lower spatial resolution of fNIRS). The paper also reports differences in task performance between the two different methods (i.e., significantly lower task performance in terms of accuracy with fMRI). A comparison of fMRI and fNIRS with self-reports showed that the similarities between mental rotations and data structure operations (including the overlap in brain regions) were not well reflected in self-reports. It follows that the use of brain imaging is critical to identify differences between tasks that would not have been observable with introspective methods such as surveys.
While the studies by Siegmund et al. (2017, 2014), Peitek et al. (2018a), Ivanova et al. (2020), Liu et al. (2020), and Huang et al. (2019) focused on comprehension tasks, the studies of Duraes et al. (2016) and Castelhano et al. (2018) tried to understand brain activation patterns and connectivity patterns during code inspection. For code inspection, based on fMRI, Duraes et al. (2016) found activation in several areas related to program comprehension, including and beyond known language regions and areas related to working memory and decision making. At the moment of bug detection, stronger activation was found (in the medial frontal region), while regions related to visual perception and decision making became deactivated (right insula and bilateral occipital areas). This result suggests that the judgment about the presence of a bug had already been reached before the ''bug detection'' event. Moreover, Castelhano et al. (2018) showed through a connectivity analysis that evolutionarily older brain regions initially used for different tasks in the history of mankind seem to be reutilized for recent and complex tasks such as code inspection.
While all of the above studies used fMRI (and additionally, in one case, fNIRS), two studies looked into differences in brain activation between software development activities using EEG. Kosti et al. (2018) showed that code comprehension (requiring mental simulation) is a more demanding task when compared to syntax tasks. In particular, increased brain activation over frontal areas (mainly in the Upper Beta band [20-30 Hz]) could be observed. The differences are even more pronounced when looking at connectivity patterns. In particular, interhemispheric interactions within the Theta and Upper Beta frequency bands distinguish code comprehension from syntax tasks. Moreover, the study by Kosti et al. (2018) could show that a programmer's workload correlates with brain activation patterns and the pattern of functional connectivity. More specifically, a strong correlation between brain activation in the higher bands (i.e., Upper Beta [20-30 Hz] and Gamma [30-100 Hz]) and a programmer's workload could be shown. In addition, the results indicate positive correlations between couplings (i.e., phase couplings between signals) and workload within the lowest bands and within Upper Beta, and significant negative correlations between couplings and workload within the lowest 4 bands. For programming and bug fixing tasks, based on EEG, González et al. (2015) reported significant differences in terms of EEG compound signals in both the physical and the digital setting.
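As a rough illustration of the band-based EEG measures discussed above, the power within a frequency band can be estimated from the discrete Fourier transform of the signal. The sketch below uses a synthetic signal with a dominant 25 Hz (Upper Beta) oscillation; the sampling rate, duration, and noise level are hypothetical choices for the example:

```python
import numpy as np

fs = 256  # sampling rate in Hz (hypothetical)
t = np.arange(0, 4, 1 / fs)
# Synthetic "EEG": a dominant 25 Hz (Upper Beta) oscillation plus noise
rng = np.random.default_rng(0)
signal = np.sin(2 * np.pi * 25 * t) + 0.3 * rng.standard_normal(t.size)

def band_power(x, fs, lo, hi):
    """Sum of the power spectrum within [lo, hi) Hz."""
    freqs = np.fft.rfftfreq(x.size, d=1 / fs)
    psd = np.abs(np.fft.rfft(x)) ** 2
    return psd[(freqs >= lo) & (freqs < hi)].sum()

theta = band_power(signal, fs, 4, 8)
upper_beta = band_power(signal, fs, 20, 30)   # [20-30 Hz], as in the text
gamma = band_power(signal, fs, 30, 100)
```

For this synthetic signal, the Upper Beta band dominates; in practice, studies such as Kosti et al. (2018) compute such band powers per electrode and condition and then compare or correlate them with workload.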
Task-specific factors: Task-relevant events. Duraes et al. (2016) and Castelhano et al. (2018) are the only fMRI studies in our SLR that went beyond contrasting software development activities and additionally contrasted task-relevant events. Their studies showed that initially finding a bug (''bug suspicion'') and confirming a bug (''bug confirmation'') led to differences in brain activation, in particular in the right anterior insula. Moreover, they showed that the activity in this region during bug intuition is positively correlated with the precision of bug detection, suggesting ''that this brain region signals the quality of programmers' intuitive capacity to identify bugs when facing the inspection or analysis of challenging code'' (Castelhano et al., 2018, p. 634).
Task-specific factors: Task characteristics. Numerous studies investigated the effect of task characteristics on neurophysiological measures. An overview of the results is summarized in Table 9.
It is worthwhile mentioning that most of the papers focusing on task characteristics analyzed the data at the task level. In turn, Fakhoury et al. (2020, 2018) and Couceiro et al. (2019a) showed the potential for analyzing neurophysiological data at a more fine-grained level. More specifically, Fakhoury et al. (2020, 2018) showed that developers' cognitive load can be accurately associated with identifiers in source code and text (i.e., a similarity of 78% compared to self-reported high cognitive load). Similarly, Couceiro et al. (2019a) showed that eye tracking has enough resolution to pinpoint specific code lines that correspond to moments when measurements of heart rate variability and pupillography show high levels of cognitive load.
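Conceptually, such fine-grained analyses map physiological samples onto individual code lines and flag the lines where the load signal is elevated. Below is a deliberately simplified stdlib sketch of this idea; the sample values and the threshold are hypothetical, and this is not the actual pipeline of Fakhoury et al. or Couceiro et al.:

```python
from collections import defaultdict
from statistics import fmean

# Hypothetical eye-tracking samples: (fixated source-code line,
# pupil diameter in mm). Illustrative values only.
samples = [
    (1, 3.1), (1, 3.2),
    (2, 3.9), (2, 4.1), (2, 4.0),
    (3, 3.3),
]

def lines_with_high_load(samples, threshold=3.8):
    """Mean pupil diameter per fixated code line; lines above the
    (hypothetical) threshold are flagged as candidate high-load lines."""
    by_line = defaultdict(list)
    for line, pupil in samples:
        by_line[line].append(pupil)
    means = {line: fmean(vals) for line, vals in by_line.items()}
    flagged = [line for line, m in means.items() if m > threshold]
    return means, flagged

means, flagged = lines_with_high_load(samples)
```

Here, line 2 would be flagged as a candidate high-cognitive-load line; real analyses additionally align the physiological signal in time with the gaze data and account for signal latency.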
Developer-specific factors: Expertise. For program comprehension, using EEG, Lee et al. (2016) investigated differences in information processing between novices and experts. The study showed that program comprehension involves Beta and Gamma frequencies (suggesting that high levels of concentration are used during comprehension). The results showed that the expert group experienced significantly higher power in the Beta range, presumably utilizing more brain resources. In addition, both novices and experts experienced Gamma frequency activation during program comprehension. Similarly, for program comprehension, applying EEG, another study investigated the usage of EEG indicators of working memory for evaluating expertise-related differences in subject performance. The results showed significant differences in Upper Alpha (ranging from the Individual Alpha Frequency (IAF)11 to IAF+2) and Lower-1 Alpha (ranging from IAF-4 to IAF-2) for both expertise level and correctness, and significant interaction effects between correctness and expertise level. For Lower-2 Alpha (ranging from IAF-2 to IAF) and Theta (ranging from IAF-6 to IAF-4), significant differences were only found for expertise level, but not correctness. Overall, the paper shows that more direct measures of cognitive load can be used to quantify comprehension task performance across different levels of expertise and that cognitive demands depend on the expertise level.
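The IAF-anchored bands used in this line of work can be derived mechanically once the individual Alpha peak is known. A minimal sketch (the spectrum below is synthetic; the band definitions follow the ranges quoted above):

```python
def iaf_bands(freqs, psd):
    """Individual Alpha Frequency: the frequency of maximal power within
    the canonical Alpha range (8-14 Hz); the bands are then anchored
    to that peak, as described in the text."""
    power, iaf = max((p, f) for f, p in zip(freqs, psd) if 8.0 <= f <= 14.0)
    return iaf, {
        "Theta": (iaf - 6, iaf - 4),
        "Lower-1 Alpha": (iaf - 4, iaf - 2),
        "Lower-2 Alpha": (iaf - 2, iaf),
        "Upper Alpha": (iaf, iaf + 2),
    }

# Synthetic resting-state spectrum with an Alpha peak at 10.5 Hz:
# 1/f background plus a narrow bump (illustrative only)
freqs = [f / 2 for f in range(2, 61)]  # 1.0 ... 30.0 Hz
psd = [1.0 / f + (2.0 if f == 10.5 else 0.0) for f in freqs]
iaf, bands = iaf_bands(freqs, psd)
```

With a peak at 10.5 Hz, Upper Alpha becomes 10.5-12.5 Hz rather than the fixed 10-12 Hz a canonical band scheme would use, which is exactly the adjustment the IAF approach is meant to provide.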
Team-specific factors: Pair Programming. Team-specific factors were in the focus of Ahonen et al. (2018, 2016). For programming, using heart- and skin-related measurements, Ahonen et al. (2018, 2016) investigated the synchrony in physiological signals during pair programming. Results show evidence for social psychophysiological compliance (SPC) in both the heart rate variability signal and the EDA signal of collaborating dyads. Moreover, self-reported task difficulty ratings were associated with SPC. In turn, heart-related measurements of physical activity (HR) were, in line with expectations, insensitive to pair collaboration and were inconsistent with self-reports. Moreover, Ahonen et al. (2018) investigated role differences (driver versus navigator) concerning task-relevant events (i.e., running and testing code). The results show that failure events led to a significantly higher skin conductance response (SCR) for developers in the driver role, which ''might reflect engagement, or the liability of being in the leading role'' (Ahonen et al., 2018, p. 8). Moreover, they found that for drivers the SCR increases up to the event time, while the physiological response of the navigator is delayed by several seconds.

11 A fixed Alpha frequency band (8-14 Hz) might not cover all Alpha activity, since the peak Alpha frequency differs between individuals. The Individual Alpha Frequency (IAF) can be calculated for each individual, and frequency bands can then be defined dynamically based on the IAF.

Table 9
Summary of findings for studies on task characteristics.

Task difficulty
- fNIRS (Oxy-Hb): Normalized Oxy-Hb significantly larger for hard vs. easy tasks (Nakagawa et al., 2014).
- fNIRS: A comparison of hard and easy tasks did not yield any significant effects (Ikutani and Uwano, 2014).
- fMRI and fNIRS: Increasing brain activation with increasing task difficulty using fMRI; significantly larger effect for difficult sequence tasks than for mental rotation tasks; no significant effects for fNIRS.
- Eyetracking (pupillography): Programmers' cognitive load measured using pupillography is consistent with the subjective perception of task difficulty (Couceiro et al., 2019b).
- Eyetracking (pupillography), heart: Eye tracking has enough resolution to pinpoint specific code lines that correspond to moments when measurements of heart rate variability and pupillography show high levels of cognitive load.

Structural inconsistencies and linguistic antipatterns
- Eyetracking and fNIRS: Developers' cognitive load can be accurately associated with identifiers in source code and text, with a similarity of 78% compared to self-reported high cognitive load; linguistic antipatterns in the source code led to a significant increase of the cognitive load experienced by the participants; no statistical effect of structural inconsistencies on average cognitive load (however, participants report frustration); source code containing both lexical and structural inconsistencies misled 60% of the participants; participants who successfully completed the task experienced higher cognitive load when both inconsistencies were present (Fakhoury et al., 2020, 2018).

Problem structure
- fNIRS: Contrasting numeric, variable, and control tasks shows significantly higher brain activation in the condition ''Variable'' compared to the other two conditions for problem type (Ikutani and Uwano, 2014).

Semantic-cues comprehension and bottom-up comprehension
- fMRI: For all areas involved during program comprehension, the activation is significantly lower for semantic-cue comprehension than for bottom-up comprehension, which confirms that beacons ease comprehension; neither beacons nor program layout seem to significantly affect the program comprehension process (Siegmund et al., 2017).
- fMRI: In line with our understanding of bottom-up comprehension, data-flow complexity and vocabulary size are highly correlated with the concentration level needed for program comprehension (i.e., there are no beacons that could act as cues for data-flow aspects) (Peitek et al., 2018a).

Interview setting
- Eyetracking (eye blinks): Whiteboard setting perceived as more stressful by participants than the paper setting; significant differences between settings for several measures associated with stress and cognitive load (incl. longer blinks).

Code regularity
- Eyetracking (pupil size): Descriptive statistics show a decreasing tendency of visual effort (for fixation count and total fixation duration, but less clearly for pupil dilation) for repeated instances (Jbara and Feitelson, 2015).

Presence/absence of structural and textual features
- Eyetracking (pupil size): No significant differences between conditions could be observed in terms of pupil size. The analysis of questionnaire data showed that the condition combining both structural and textual elements was significantly more comprehensible and readable than the three other conditions (Wulff-Jensen et al., 2019).

Attention data representation
- Eyetracking (pupil size): No significant effect of attention data representation on efficiency and quality of software maintenance tasks (including cognitive load) (Ahrens et al., 2019).

Programming languages
- EEG: Initial exploration of differences in brain activation dependent on the programming language using descriptive statistics (Doukakis, 2019; Doukakis et al., 2020).

For pass events, the SCR decreases for the driver before the event and then rebounds once the driver relaxes. In turn, for the navigator, an increase after the pass event occurs. In addition, using EEG, Ikramov et al. (2019) conducted an initial exploration of the influence of the programming role (i.e., solo, pair/navigator, and pair/driver) on brain activation. For the condition Navigator, a significantly higher activity in the L1 Alpha sub-band could be observed. No significant differences in the other Alpha sub-bands and the Theta band could be observed.
Task Performance. For programming, Yamamoto et al. (2016) showed that the EEG signal can be used as an index for classifying programmers who fail to find an implementation strategy. The results show that the EEG signal during task execution contains a significantly larger Alpha wave power and Beta/Alpha ratio when an implementation strategy is successfully found. The EEG signal after task execution contains a significantly larger Alpha wave power when an implementation strategy is successfully found. When analyzing the effect of individual differences, the analysis showed that more than half of the participants had a significantly larger Alpha wave power and Beta/Alpha ratio when they succeeded in finding an implementation strategy. In a related study using EEG, Ishida and Uwano (2019a) investigated differences in brain activation depending on the success in performing a code comprehension task and obtained similar findings. Results show that both Alpha and Beta power significantly increased between the start and end of the code comprehension task for developers who could successfully complete the task. For the failure group, the difference in brain activation between start and end was not significant. In addition, Ishida and Uwano (2019a,b) showed an earlier increase of Alpha activity in the success group. Similarly, for code inspection, Ishida and Uwano (2019a) showed that both Alpha and Beta power for the success group significantly increase. For the failure group, a significant increase of Beta, but not Alpha, power could be observed.
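To make the Alpha- and Beta-power terminology above concrete, the following sketch shows how band powers and a Beta/Alpha ratio can be derived from a raw EEG trace via a simple periodogram. The band edges and the synthetic signal are illustrative assumptions, not the exact preprocessing used by Yamamoto et al. (2016).

```python
import numpy as np

def band_power(signal, fs, low, high):
    """Power of `signal` within [low, high) Hz via the periodogram."""
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / fs)
    psd = np.abs(np.fft.rfft(signal)) ** 2
    mask = (freqs >= low) & (freqs < high)
    return psd[mask].sum()

def beta_alpha_ratio(signal, fs):
    """Beta/Alpha power ratio; band edges are conventional choices,
    not necessarily the exact bands used in the reviewed studies."""
    alpha = band_power(signal, fs, 8.0, 13.0)
    beta = band_power(signal, fs, 13.0, 30.0)
    return beta / alpha

# Synthetic 4-second trace: strong 10 Hz (Alpha) plus weaker 20 Hz (Beta) component.
fs = 256
t = np.arange(0, 4, 1.0 / fs)
eeg = 2.0 * np.sin(2 * np.pi * 10 * t) + 0.5 * np.sin(2 * np.pi * 20 * t)
print(beta_alpha_ratio(eeg, fs))  # well below 1, since Alpha dominates
```

In a real pipeline, such ratios would be computed per epoch (e.g., per task window) after artifact removal, and then compared between success and failure groups.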
Context Factors: Listening to Music. Using EEG, Ikramov et al. (2019) explored the effect of programming with and without music in an initial study. Results remained inconclusive. Readers interested in the effects of music in the context of software development and information systems are referred to a recent paper by Gefen and Riedl (2018).

Neurophysiological data used as independent variable
This section provides a synthesis of the findings of studies that looked into the efficacy of neurophysiological measures to predict different dependent variables (e.g., task-specific factors, developer-specific factors, mental states, and task performance). Appendix F provides an overview of the measures the different studies used as features for their machine learning models. A comparison of the different classifiers in terms of their performance is presented in Appendix H. For a better comparison of the findings, we group studies focusing on the same dependent variable and discuss all studies related to one dependent variable in one paragraph. When a dependent variable was investigated for different software development activities, we group studies referring to the same activity together. The rationale for this approach is that findings for one software engineering activity (e.g., code comprehension) cannot necessarily be transferred to another software engineering activity (e.g., code inspection).
Task-specific factors: Software Development Activities. Based on fMRI, Floyd et al. (2017) showed that brain activation patterns of programming languages and natural languages are distinct; code review, code comprehension, and prose review have largely distinct activation patterns (a finding which is in line with Ivanova et al. (2020)). They further demonstrated that the classification of these tasks based on brain activity alone is possible. Moreover, the study showed that a number of prefrontal regions reliably distinguished between code and prose. For experienced programmers, code and prose were hardly distinguishable, which signifies that programming languages are increasingly treated like natural languages with increasing expertise. Similarly, based on EEG as well as skin- and heart-related measurements, Fucci et al. (2019) showed that code and prose comprehension can be accurately differentiated. The best performing classifier reached an accuracy of 87% and was based on heart-related measurements. However, in contrast to the study by Floyd et al. (2017), no relationship between classification accuracy and expertise was found.
Task-specific factors: Task Category and Task Subcategory. Based on fMRI, Ikutani et al. (2020) showed that brain activation patterns for functional categories of source code can be distinguished during program categorization tasks. Results further showed that classification accuracies for various regions in the frontal, parietal, and temporal cortices were significantly linked to task performance. In addition, classification accuracies of subcategories on the left supramarginal gyrus and superior temporal gyrus regions were associated with task performance. Since task performance was also highly correlated with expertise, this suggests that cortical representations of functional categories (subcategories) might be associated with advanced-level programming expertise.
Developer-specific factors: Expertise. For code comprehension, expertise was predicted by Lee et al. (2017) using EEG and eye tracking. The best overall performance could be achieved by using EEG and eye tracking features in combination with an SVM classifier (97.7% precision and 96.4% recall). Similarly, Crk and Kluthe (2014) used EEG (more specifically, Upper Alpha and Theta) to predict the expertise of a developer during code comprehension tasks. The accuracy for detecting expertise ranged between 55% and 59% (and 56% and 67% when considering only correct answers). Therefore, compared to Lee et al. (2017), classifier performance is considerably lower. While both studies focused on a 2-state classification of expertise, the participants in the study by Lee et al. (2017) had clearly different levels of expertise (novices with 1 year of programming experience versus experts with at least 6 years of experience). In contrast, Crk and Kluthe (2014) examined students who differed in their class level, most likely signifying a much lower level of expertise variance. Differences in the measurement instruments (i.e., Lee et al. (2017) used a research-grade EEG device, while Crk and Kluthe (2014) used a consumer-grade device) and differences in the neurophysiological methods (i.e., Lee et al. (2017) used multi-modal measurements, while Crk and Kluthe (2014) used a single modality) probably also contributed to the observed differences in classifier performance.
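As an illustration of the 2-state expertise classification and the precision/recall evaluation discussed above, the following sketch trains a simple nearest-centroid classifier (a deliberately simplified stand-in for the SVM used by Lee et al. (2017)) on synthetic two-dimensional feature vectors; all feature values and group labels are invented for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 2-D feature vectors (e.g., an EEG band-power feature and a
# fixation-duration feature); the two expertise groups are synthetic.
novices = rng.normal(loc=[0.0, 0.0], scale=0.5, size=(40, 2))
experts = rng.normal(loc=[2.0, 2.0], scale=0.5, size=(40, 2))
X = np.vstack([novices, experts])
y = np.array([0] * 40 + [1] * 40)  # 0 = novice, 1 = expert

# Split into train/test, fit class centroids, and predict by nearest centroid.
idx = rng.permutation(len(X))
train, test = idx[:60], idx[60:]
centroids = np.stack([X[train][y[train] == c].mean(axis=0) for c in (0, 1)])
dists = np.linalg.norm(X[test][:, None, :] - centroids[None, :, :], axis=2)
pred = dists.argmin(axis=1)

# Precision and recall for the "expert" class, as reported in such studies.
tp = np.sum((pred == 1) & (y[test] == 1))
precision = tp / max(np.sum(pred == 1), 1)
recall = tp / max(np.sum(y[test] == 1), 1)
print(precision, recall)
```

With well-separated synthetic groups the classifier is near perfect; the studies' much harder real-world settings explain why reported accuracies vary so widely.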

Mental State: Cognitive Load. Several studies developed classifiers for cognitive load using different neurophysiological methods. More specifically, EEG, heart, skin, and breath-related measurements as well as measurements of pupil size were used along with behavioral measures like eye gazes, code metrics, interaction metrics, and change metrics.
Classifiers for predicting cognitive load (2-state classification) during code comprehension tasks based on a single modality are proposed by Duraisingam et al. (2017), Couceiro et al. (2019c), and Medeiros et al. (2019). Couceiro et al. (2019c) predicted cognitive load (2-state classification) during code comprehension using heart-related measurements (more specifically, HRV), while Medeiros et al. (2019) used EEG. In both studies, code snippets differing in terms of code complexity were used as stimuli. Results show that neurophysiological measurements of cognitive load correspond to the subjective load perceptions of the participants, but not to the code complexity of the stimuli. These findings are in line with the fMRI work of Peitek et al. (2018a), who examined the association of different complexity metrics and brain activation. While several metrics were tested, only in one case could a significant correlation be established (i.e., DepDegree). In the light of cognitive load theory (Sweller, 2011), which differentiates between intrinsic load (inherent complexity of the task) and extraneous load (complexity stemming from the task representation), such discrepancies in study findings are not surprising. Existing metrics typically capture only some of the aspects that are known to contribute to the cognitive load perceived by a developer. Additional research is needed to better understand which metrics, or combinations of metrics, are best suited to capture what is cognitively demanding and what is not.
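The HRV-based measurements mentioned above typically reduce to a few time-domain features computed from inter-beat (RR) intervals. A minimal sketch, with invented RR interval values, could look as follows; reduced variability under load is a common but not universal pattern, and the studies' actual feature sets are not detailed here.

```python
import numpy as np

def hrv_features(rr_ms):
    """Two standard time-domain HRV features from RR intervals (in ms)."""
    rr = np.asarray(rr_ms, dtype=float)
    sdnn = rr.std(ddof=1)                       # overall variability
    rmssd = np.sqrt(np.mean(np.diff(rr) ** 2))  # beat-to-beat variability
    return sdnn, rmssd

# Invented RR interval sequences: a variable resting trace and a
# shorter, more uniform trace as might be seen under higher load.
rest = [820, 810, 845, 790, 835, 800, 850, 805]
load = [700, 705, 698, 702, 699, 703, 701, 700]
print(hrv_features(rest))  # higher SDNN and RMSSD
print(hrv_features(load))  # lower SDNN and RMSSD
```

Features like these would then be fed into a 2-state classifier, with labels derived from subjective load ratings rather than from code complexity metrics.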
Similar to Medeiros et al. (2019), the classifier proposed by Duraisingam et al. (2017) is based on EEG. A comparison of different feature groups shows that the best results were obtained by considering interhemispheric differences (i.e., the asymmetry ratio). In turn, Lee et al. (2017) used a combination of EEG and eye tracking. The best overall performance (for a 2-state classification) was achieved by using EEG and eye tracking features in combination. A classifier for predicting cognitive load (2-state classification) during code comprehension based on low-cost, off-the-shelf sensors (EEG, EDA, eye tracking) was developed by Fritz et al. (2014). The authors tested the performance of the classifier when predicting a new participant, a new task, and a new participant-task pair. Their results showed that different combinations of sensors performed best depending on the setting. The study found that precision and recall for predicting new participants were 15% and 5% lower than for predicting new tasks. Interestingly, in contrast to several other studies (e.g., Lee et al. (2017) and Müller and Fritz (2016)), the combination of all sensors did not always lead to the best classifier performance. Thus, it is not possible to establish the general rule that a combination of different physiological indicators is always a better predictor of a dependent variable than one physiological measure alone.
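An interhemispheric asymmetry ratio of the kind that worked best for Duraisingam et al. (2017) can be sketched generically as follows; the (L - R) / (L + R) formula and the single left/right channel pair are one common convention, not necessarily the exact definition used in that study.

```python
import numpy as np

def band_power(signal, fs, low, high):
    """Power of `signal` within [low, high) Hz via the periodogram."""
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / fs)
    psd = np.abs(np.fft.rfft(signal)) ** 2
    mask = (freqs >= low) & (freqs < high)
    return psd[mask].sum()

def asymmetry_ratio(left, right, fs, band=(8.0, 13.0)):
    """(L - R) / (L + R) band power asymmetry for one electrode pair."""
    pl = band_power(left, fs, *band)
    pr = band_power(right, fs, *band)
    return (pl - pr) / (pl + pr)

# Synthetic pair of channels with stronger Alpha on the left channel.
fs = 128
t = np.arange(0, 2, 1.0 / fs)
left = 1.5 * np.sin(2 * np.pi * 10 * t)
right = 0.5 * np.sin(2 * np.pi * 10 * t)
print(asymmetry_ratio(left, right, fs))  # positive value: left dominance
```

In practice, one such ratio per homologous electrode pair and frequency band yields a compact feature vector for the classifier.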
While all of the above studies focused on a 2-state classification, Kosti et al. (2018) developed a 4-state classifier for cognitive load using a consumer-grade EEG. The results show that functional connectivity measures better express the relation between the difficulty of a comprehension task and the workload of a programmer, if compared to measures of signal power in the different frequency bands. Moreover, the functional connectivity approach offered an additional advantage: it assessed cognitive load independent of the participant, a property that increases practical applicability.
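Functional connectivity, as opposed to per-channel signal power, captures how channels co-vary over time. A minimal proxy is the pairwise Pearson correlation between channel time series, sketched below on synthetic data; the actual connectivity metric used by Kosti et al. (2018) may differ.

```python
import numpy as np

def connectivity_matrix(channels):
    """Pairwise Pearson correlation between channels (rows = channels).
    A simple stand-in for richer functional connectivity measures."""
    return np.corrcoef(channels)

rng = np.random.default_rng(1)
t = np.linspace(0, 2, 512)
common = np.sin(2 * np.pi * 10 * t)           # shared 10 Hz rhythm
ch1 = common + 0.1 * rng.normal(size=t.size)  # two channels driven by it
ch2 = common + 0.1 * rng.normal(size=t.size)
ch3 = rng.normal(size=t.size)                 # unrelated channel

C = connectivity_matrix(np.stack([ch1, ch2, ch3]))
print(C[0, 1], C[0, 2])  # ch1/ch2 strongly correlated, ch1/ch3 near zero
```

The upper triangle of such a matrix (one value per channel pair) can serve directly as the feature vector, which is one reason connectivity features can transfer across participants better than absolute band powers.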
While all of the previously mentioned studies focused on code comprehension, Müller and Fritz (2016) focused on general software development activities and developed classifiers for predicting cognitive load (6-state classification) from neurophysiological data (heart, skin, and breath) as well as code, change and interaction metrics at both the class and the method level. The evaluation showed that the best results could be achieved by combining all metrics. The classifier using neurophysiological data outperformed the classifier using interaction, code, and change metrics in 3 out of 4 cases. In a replication, confirming the original study results, the classifier using all features performed best. The comparison of neurophysiological data with interaction, code, and change metrics again yielded mixed results. The findings further show that the classifier is highly sensitive to the individual subject. Specifically, it was found that when training the classifier based on the data of all subjects, predictions were not better than chance.
In general, the overlap of features between studies is rather limited. Thus, additional research is needed to advance our understanding of which feature combination works best in different software engineering contexts.
Mental State: Affective State. A classifier that can distinguish positive and negative emotions (i.e., valence) based on several features combining EEG, eye tracking, and skin- and heart-related measurements during change tasks was developed by Müller and Fritz (2015). Similarly, Girardi et al. (2020) developed a classifier for valence using EEG as well as skin- and heart-related measurements during a change task. The performance of the classifier using only the skin- and heart-related measurements provided by the Empatica E4 wristband was comparable to a classifier additionally using EEG. While the results in terms of classifier performance are comparable with Müller and Fritz (2015), they were achieved using a smaller sensor set. In addition to valence, Girardi et al. (2020) developed a classifier for arousal. Again, the performance of the classifier using only the Empatica E4 wristband (i.e., using skin- and heart-related measurements) was comparable to a classifier additionally using EEG.
The development of a classifier for both valence and arousal during code inspection, combining skin-related measurements with behavioral data (touch and eye gazes), was the focus of Vrzakova et al. (2020). For both valence and arousal, the best performance could be achieved by a classifier combining all modalities (valence: accuracy = 90.0%, arousal: accuracy = 83.9%). The results also show that positive valence could be better predicted than negative valence (as indicated by a true positive rate of 95.7% compared to a true negative rate of 69.5%). The best classifier using one modality alone was based on eye gaze (valence: accuracy = 85.8%, arousal: accuracy = 76.6%).
Classifier performance is superior when compared to the work by Müller and Fritz (2015) and Girardi et al. (2020), however, it needs to be kept in mind that Müller and Fritz (2015) and Girardi et al. (2020) focused on change tasks, while Vrzakova et al. (2020) focused on code inspections. In addition, Vrzakova et al. (2020) showed that affect builds up over time. More specifically, it was found that when using data from the beginning of the task, compared to data from the end of the task, performance of the classifier combining all modalities decreased by 4% for both valence and arousal.
Mental State: Interruptibility. Interruptibility could be classified with high accuracy into two states for general software development activities (both in a lab and a field study) using EEG as well as skin- and heart-related measurements (Züger and Fritz, 2015). Features that proved useful for a 2-state and 5-state classification in both studies include activity in the EEG's Beta and Gamma frequency bands as well as mean skin temperature. In turn, a further study focused on predicting interruptibility based on heart-related measurements as well as behavioral data (computer interactions, sleep, and physical activity) in the field. Results indicate that interaction metrics slightly outperformed heart-related measures. The best results were achieved by a classifier combining interaction metrics and heart-related measures. The study also showed that the optimal time window varies per feature. Furthermore, the paper demonstrated that even a generally trained model can accurately predict interruptibility for new subjects. This is an important result, because the possibility to use a machine learning model that was not specifically trained for a particular subject significantly increases practical applicability. The results of our review show that for other mental states, like cognitive load, classifiers were often sensitive to the individual. For example, Müller and Fritz (2016) report that their classifier trained on the data of all participants did not work better than chance.
Mental State: Stress. Stress was classified using eye tracking measures (including both neurophysiological and behavioral measures) during general software development activities. The best performing classifier, irrespective of the applied labeling strategy, was Random Forest. Classifier performance was best when considering the interview setting (i.e., paper or whiteboard) along with the stress rating in the labeling (accuracy 88%). This implies that eye tracking measurements reflect stress differently as a function of the setting. This result is hardly surprising, as most of the measures that were considered as features in the machine-learning model were based on eye movements (which were likely influenced by the setting). The results further showed that fixation-based measures, as well as pupil size, are more predictive than saccadic measurements.
Mental State: Perceived Progress. A classifier that can distinguish high from low progress was developed by Müller and Fritz (2015) based on multi-modal measurements combining features from EEG, eye tracking, and EDA. High and low progress could be distinguished in 67.70% of all cases. Change in Alpha activity, change in Beta/Theta ratio, change in mean temperature peak amplitude, maximum pupil size, change in mean pupil size, and change in mean skin conductance level were the most predictive features. Change in Alpha activity and change in Beta/Theta ratio were also shown to be among the most predictive features for distinguishing positive and negative affect.
Task performance. While Müller and Fritz (2015) focused on perceived progress, a further study predicted task performance during code comprehension tasks. The study found that both the individual alpha frequency (IAF)12 and programming experience play a statistically significant role. The results further show that experience increased the likelihood of correct answers substantially more than IAF.
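The Individual Alpha Frequency mentioned above can be estimated as the peak of the spectrum within a search range around the Alpha band, with frequency bands then anchored at that peak. In the sketch below, the +/-2 Hz window is one common convention (not prescribed by the reviewed studies), and the EEG trace is synthetic.

```python
import numpy as np

def individual_alpha_frequency(signal, fs, search=(7.0, 14.0)):
    """Peak frequency of the periodogram within the Alpha search range."""
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / fs)
    psd = np.abs(np.fft.rfft(signal)) ** 2
    mask = (freqs >= search[0]) & (freqs <= search[1])
    return freqs[mask][psd[mask].argmax()]

def iaf_bands(iaf):
    """Frequency bands anchored at IAF; the +/-2 Hz window is one
    common convention, not a fixed standard."""
    return {"alpha": (iaf - 2.0, iaf + 2.0), "theta": (iaf - 6.0, iaf - 2.0)}

# Synthetic resting trace whose Alpha peak sits at 11 Hz.
fs = 256
t = np.arange(0, 4, 1.0 / fs)
eeg = np.sin(2 * np.pi * 11 * t) + 0.2 * np.sin(2 * np.pi * 6 * t)
iaf = individual_alpha_frequency(eeg, fs)
print(iaf, iaf_bands(iaf))  # IAF of this synthetic trace is 11 Hz
```

Defining bands per participant in this way avoids the problem the footnote describes: a fixed 8-14 Hz window can miss Alpha activity for individuals whose peak lies near the band edges.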
Presence of Quality Concerns. For general software development activities, Müller and Fritz (2016) developed a classifier for predicting quality concerns from neurophysiological data (heart, skin, and breath) and compared this physio-classifier with classifiers using interaction, code, and change metrics. The results show that for a within-subject setting, the classifier based on neurophysiological data outperformed the classifier based on code and interaction metrics as well as the classifier combining all features. For the across-subject setting, in turn, the classifier based on all data sources performed best.

12 A fixed Alpha frequency band (8-14 Hz) might not cover all Alpha activity, since peak Alpha frequency differs between individuals. Individual Alpha Frequency (IAF) can be calculated for each individual.

Who published NeuroSE research and where (RQ1)?
Based on the analysis of N = 89 NeuroSE publications, we identified 191 different authors. A recent review of the NeuroIS literature found that 432 different authors published 200 NeuroIS articles (Riedl et al., 2020a). Using this finding from NeuroSE's major sister discipline as a benchmark, we can conclude that the group of researchers involved in NeuroSE research is still relatively small. Moreover, our results show that there is a small group of authors that is highly active (14 authors are involved in more than 28% of the NeuroSE publications). It follows that there is a notable inequality in research contributions. The small absolute number of highly engaged researchers, together with the observed inequality, can be seen as critical. One major consequence of this still low number of highly engaged NeuroSE researchers is that potential candidates to serve as editors and reviewers of NeuroSE papers are a scarce resource. However, our list of top-14 contributors (cf. Fig. 2) can be used as a basis to select associate editors and reviewers for corresponding publications.
Our analysis of outlets shows that more than 40% of the NeuroSE publications appeared in five outlets (EMIP, ICSE, ICPC, SEmotion, and FSE), and over 60% of the NeuroSE publications were concentrated in 13 outlets (cf. Table 3). The remaining papers are distributed across more than 30 further outlets. So far, only three journals have published more than one NeuroSE paper (i.e., Information and Software Technology, Empirical Software Engineering, and eLife). The low ratio of journal papers is an indication that the field is still rather nascent. At the same time, since 2014 five NeuroSE papers have been published at ICSE, which underlines the relevance of the topic and highlights that several high quality contributions already exist. The increasing attention to neurophysiological measurements at workshops like EMIP or SEmotion, or at conferences like ICPC, makes us confident that we will see more conference and journal papers with a NeuroSE focus in the near future.

What kind of NeuroSE research was published (RQ2)?
Our results indicate that, except for 2019, there was no year in which more than ten completed empirical research papers with a NeuroSE focus were published (see Fig. 3, 2019, blue bar). This shows that the field of NeuroSE is still at a relatively nascent stage. Our literature review also identified methodological contributions. This includes recent work by Peitek et al. (2018d) investigating the integrated collection of fMRI and eye tracking data, the introduction of an infrastructure to collect eye tracking data linked to the software artifact (Guarnera et al., 2018), and a novel tool for multi-modal data exploration during code comprehension experiments (Peitek et al., 2019). With the exception of recently proposed methodological guidelines for conducting eye tracking studies (Sharafi et al., 2020), we find that what is missing so far are papers providing methodological guidelines on how to conduct NeuroSE research based on specific methods. Such papers should consider the idiosyncrasies of the field. The NeuroIS field may serve as an example. A number of methodological papers have been published in this field, including EEG guidelines (Müller-Putz et al., 2015), fMRI guidelines (Dimoka et al., 2012), fNIRS guidelines (Gefen et al., 2014), and eye tracking guidelines (Djamasbi, 2014); even more specific papers on analysis techniques, such as those related to network analysis of brain imaging data, have also been published in this field (Hubert et al., 2017). In this context, another fruitful avenue for future work is to focus on methodological contributions related to specific themes. Again, the NeuroIS field may serve as an example. In this field, methodological contributions related to technostress (i.e., stress caused by the use and ubiquity of digital technologies) have been published, such as a paper on blood pressure measurement and an article on heart rate variability (Baumgartner et al., 2019).
The thematic focus of the majority of studies was on code comprehension (30 of 47). This is not surprising, since code comprehension forms the basis of other software development activities such as code inspection or change tasks. While earlier fMRI studies focused on code comprehension and, to a smaller extent, code inspection (Duraes et al., 2016), and hence contributed to a better understanding of the underlying cognitive processes, more recent fMRI studies have focused on the question of to what extent code comprehension is distinct from other culturally derived symbol systems like math, logic, or language (Liu et al., 2020) (cf. Table 4). We expect that fMRI studies will continue to focus on code comprehension and code inspection, because participants are highly restricted in their motor movements during fMRI studies, and code comprehension and code inspection tasks (unlike programming or change tasks) can be designed in such a way that user interactions are minimized. However, this focus on code comprehension and code inspection was also apparent when other neurophysiological methods were used. In particular, methods related to ANS activity (as shown, for example, by Ahonen et al. (2018, 2016), Müller and Fritz (2016), Vrzakova et al. (2020), and Züger and Fritz (2015)) have the potential to be used in real-world settings to assist programmers during various software development activities. With the ongoing maturation of the field, corresponding contributions and hence a more diverse focus on different software development activities can be expected in the future. Several studies compared brain activation between different software development activities (e.g., code comprehension and code inspection) and could show corresponding differences in brain activation patterns (e.g., Floyd et al. (2017)). This constitutes evidence that the mental processes underlying different SE activities vary.
Moreover, this insight suggests that we cannot expect that classifiers for mental states trained on one software development activity generalize towards another software development activity. Thus, replication studies are needed to test the usage of the developed classifiers in different settings.

Which methods and measures did existing NeuroSE publications apply (RQ3)?
Our literature review revealed that brain activity (fMRI, EEG, fNIRS) was studied more frequently than ANS activity (heart rate, skin conductance, pupil dilation). Moreover, our results indicate that EEG is the dominant method in NeuroSE research (20 out of 47 papers; 43% of the completed empirical studies used EEG), followed by fMRI (10 papers; 21%), eye tracking (10 papers; 21%), heart-related measurements (10 papers; 21%), skin-related measurements (8 papers; 17%), fNIRS (4 papers; 9%), and measurement of respiration (1 paper; 2%). The mentioned review of the NeuroIS literature by Riedl et al. (2020a) also identified EEG as the most frequently applied method. One major explanation for this finding is that consumer-grade EEG measurement devices are increasingly available on the market, motivating researchers to explore the usage of this method in an SE context.
The consequence of this development is that researchers can measure brain activity at low cost and with relatively little effort (e.g., because the application of electrodes is usually much more time-consuming with research-grade instruments). However, whether or not these consumer-grade EEG devices offer the necessary reliability and validity is a topic of ongoing discussion (Riedl et al., 2020b). In our review, we identified 20 completed empirical EEG studies, six of which used the Emotiv EPOC (a consumer-grade instrument). Other consumer-grade tools that were used are the NeuroSky mindset headset (three studies), a NeXus 10 MARK II (three studies), a NeuroSky mindwave headset (one study), and a BrainLink Pro headset (one study).
Some validation studies exist which suggest that these low-cost EEG tools may offer measurement quality similar to high-end research devices. For example, Badcock et al. (2013) showed that Emotiv EPOC may prove a valid alternative to research-grade EEG tools like NeuroScan for recording reliable auditory event-related potentials. Moreover, Sánchez Reolid et al. (2018) showed that Emotiv EPOC+ can be used with high confidence to classify the emotional state of a user. Other researchers, based on empirical evidence, are less optimistic. For example, Duvinage et al. (2012) write that ''the Emotiv headset performs significantly worse than the [ANT system, Advanced Neuro Technology]'' (p. 1). Based on this finding, Emotiv should only be used ''for noncritical applications such as games'' (Duvinage et al., 2012, p. 1). This clearly highlights the need for future methodological studies that compare low-cost EEG tools with research-grade instruments in different SE contexts to establish an enhanced understanding of the measurement properties of different instruments. We call on NeuroSE researchers to actively contribute to this discussion, also because this issue pertains not only to EEG, but also to other measurement instruments, such as heart rate measurement based on smart watches (compared to high-end ECG devices). As a starting point for EEG, please refer to Riedl et al. (2020b).
While most studies applied only one neurophysiological method, the studies by Ahonen et al. (2018), Müller and Fritz (2015, 2016), and Züger and Fritz (2015) were multi-modal and hence combined several neurophysiological methods. In addition, the studies by Castelhano et al. (2018), Couceiro et al. (2019a), Fakhoury et al. (2020, 2018), Ishida and Uwano (2019b), Lee et al. (2017), and Vrzakova et al. (2020) combined one or several neurophysiological method(s) with behavioral data. Moreover, the studies by Ahrens et al. (2019), Aschwanden and Crosby (2006), Jbara and Feitelson (2015), and Wulff-Jensen et al. (2019) combined pupil size and/or blink rate with behavioral eye tracking metrics (i.e., fixation-based measures and/or saccadic measures). The collection of multi-modal neurophysiological data is likely to increase in the future, since one modality often balances the limitations of another modality regarding the measurement of a specific theoretical construct like cognitive load (Işbilir et al., 2019). In this context, diagnosticity (''a property of a measure that describes how precisely it captures a target construct as opposed to other constructs'' (Riedl et al., 2014, p. 29)) and confounding factors are essential. For example, measurements of pupil dilation are related to attention and cognitive load, among other constructs, and are influenced by various context factors, such as ambient light (Duchowski et al., 2018). In turn, EEG offers high temporal resolution, but to achieve high spatial resolution, EEG instruments with a sufficient number of electrodes and algorithms which draw upon different neurophysiological assumptions are required (Müller-Putz et al., 2015). Moreover, artifacts due to eye movements, muscular movements, and heart beats pose challenges to the use of EEG in the field (Işbilir et al., 2019). fNIRS provides a balance of spatial and temporal resolution. The latency of the signal, however, might pose limits to the method.
Research has shown that the combination of multiple modalities may lead to superior results when compared to single modalities (e.g., Aghajani et al. (2017), Liu et al. (2017)), yet this is not necessarily the case.
The benefits of combining different modalities are supported by several of the studies included in our review. For example, the results described in Lee et al. (2017) showed that combining EEG and eye tracking features led to more accurate predictions of task difficulty and expertise when compared to using each modality alone. Similarly, the results described by Müller and Fritz (2016) and Vrzakova et al. (2020) suggest that the combination of modalities can improve classifier performance. The complementary strengths of fMRI and fNIRS, in turn, are discussed in Huang et al. (2019). While both fMRI and fNIRS found brain activation in similar areas during both mental rotation and data structure tasks, several of the task differences observed with fMRI could not be observed with fNIRS, which might be a result of the lower spatial resolution of fNIRS and its property that areas located deep in the brain (such as limbic areas mainly related to affective information processing) cannot be studied at all. In turn, the usage of fMRI led to significantly lower task performance in terms of accuracy, which presumably stems from its higher intrusiveness (''the extent to which a measurement instrument interferes with an ongoing task, thereby distorting the investigated construct [whereat the] three major dimensions of intrusiveness are degree of movement freedom, degree of natural position, and the invasiveness of an instrument'' (Riedl et al., 2014, p. 29)). While the benefits of multi-modal experiments in providing a more holistic understanding have been shown in several studies, tool support is still scarce. This need is taken up by the work of Peitek et al. (2019), who propose CodersMuse, a tool for exploring multi-modal data during program comprehension experiments. Similarly, Roy et al. (2020) proposed a tool called VITALISE for combining physiological data and eye tracking data.
The results of the literature review also show that there is a tendency to explore the combined use of neurophysiological data and eye tracking. For example, Ishida and Uwano (2019b) explored differences in brain activation and eye movements over time depending on task performance. Moreover, Castelhano et al. (2018) used eye tracking to determine fixations inside and outside of bug AOIs (Areas of Interest) with the goal of controlling for visual attention. Furthermore, Peitek et al. (2018d) explored the combined collection of fMRI and eye tracking data in order to detect fixations at the level of source code identifiers. In both cases, eye tracking data (more specifically, fixations) are used to enable a more fine-grained analysis of the fMRI BOLD signal. Similarly, Fakhoury et al. (2018, 2020) demonstrated that a developer's cognitive load can be accurately assessed using fNIRS and eye tracking. Couceiro et al. (2019a) explored the usage of eye tracking together with HRV to identify code lines (and even lexical tokens inside of code lines) that correspond to high cognitive load. We expect to see more studies going in this direction, since the combined usage of eye tracking with neurophysiological measurements allows researchers to link neurophysiological data with the software artifact the subject is looking at and enables a fine-grained analysis of a developer's mental state. One notable example in this domain is a paper which introduces the eye-fixation-related potential (EFRP) method. This method allows one to synchronize eye tracking with EEG recording to precisely capture a person's neural activity at the exact time at which he or she starts to cognitively process a stimulus (e.g., an event on the screen) (Léger et al., 2014).
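To make the linking idea concrete, the following minimal Python sketch maps fixations to line-level areas of interest and aggregates mean pupil diameter per code line, as a crude proxy for line-level cognitive load. All names, coordinates, and the fixed line height are illustrative assumptions, not the setup of any reviewed study.

```python
# Minimal AOI-linking sketch: map fixations (vertical screen position, pupil
# diameter) to 1-based code lines and aggregate mean pupil diameter per line.
# LINE_HEIGHT_PX and TOP_MARGIN_PX are assumed display parameters.

from collections import defaultdict

LINE_HEIGHT_PX = 20   # assumed rendered height of one code line
TOP_MARGIN_PX = 40    # assumed vertical offset of the first code line


def fixation_to_line(y_px: float) -> int:
    """Return the 1-based code line hit by a fixation at vertical position y_px."""
    return int((y_px - TOP_MARGIN_PX) // LINE_HEIGHT_PX) + 1


def mean_pupil_per_line(fixations):
    """fixations: iterable of (y_px, pupil_diameter_mm) tuples."""
    per_line = defaultdict(list)
    for y_px, pupil in fixations:
        per_line[fixation_to_line(y_px)].append(pupil)
    return {line: sum(v) / len(v) for line, v in per_line.items()}


# Toy usage: five fixations over a short snippet
fixations = [(45, 3.1), (52, 3.3), (65, 3.8), (70, 4.0), (105, 3.0)]
print(mean_pupil_per_line(fixations))
```

Real studies would of course use calibrated tracker output and two-dimensional AOIs; the sketch only illustrates the aggregation step that lets per-line physiological summaries be attached to the software artifact.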

How was the empirical NeuroSE research conducted (RQ4) and what are the main findings (RQ5)?
Study Participants. Our review shows that the mean sample size is 17.00, ranging from 2 to 70 participants. In comparison, Riedl et al. (2020a) found an average sample size of 45 subjects (min: 5, max: 166, median: 30, SD: 35) in a review of NeuroIS papers. Thus, the average sample size in NeuroIS research is significantly larger than in NeuroSE research. However, sample sizes of the fMRI studies in our review (mean: 18.00, 10 studies) are almost identical to those in the NeuroIS review (17.9, 11 studies) (Riedl et al., 2020a). Moreover, according to Riedl et al. (2017b), these numbers are comparable with sample sizes in brain imaging studies, including investigations published in prestigious journals such as Neuron, Science, and Nature (average N = 18). For all other methods, the sample sizes we observed are considerably below the ones reported for the field of NeuroIS (Riedl et al., 2020a). In particular, mean sample sizes of EEG studies (min: 2; mean: 10; max: 38) are very small, indicating overall relatively low maturity in this specific methodological domain. The studies included in the review range from studies that mostly explored the potential of using EEG in an SE context to studies conducting advanced analyses including connectivity measures. Nevertheless, some of the EEG studies constitute high-quality work.
Our review also showed that study participants were predominantly male, with a mean of 13.39% female participants. This distribution is far from balanced. However, it has to be noted that it roughly corresponds to the gender distribution in typical computer science programs (Huyer, 2015). Because experimental research often draws upon student samples, our finding regarding the gender distribution is not surprising. Importantly, as NeuroSE research moves from the laboratory to the field, the proportion of female subjects will hopefully increase.
Stimuli. Our analysis showed that the stimuli used (i.e., code snippets) were relatively small, ranging for most studies from just a few lines up to 60 lines. One exception is the study by Ahrens et al. (2019), whose stimulus is considerably larger at over 900 lines. In particular, when using measurement methods like fMRI, where participants are highly restricted in their movements, the usage of small code snippets that can be read without substantial need for navigation is reasonable. At the same time, however, tasks involving just a few lines of code do not represent the complexity a professional developer encounters when developing software. Combined with the observation that all but six studies were conducted in the lab, our results suggest that the studies' emphasis was on controlled settings and that their focus was less on ecological validity than on internal validity. The usage of methods related to ANS activity (e.g., heart-related measurements, skin-related measurements, eye tracking), along with mobile EEG, has the potential to complement the existing lab studies with findings from real-world settings (i.e., Ahonen et al. (2016, 2018), Müller and Fritz (2016), Vrzakova et al. (2020)).
To fully exploit the potential of neurophysiological measurements, the combined use with behavioral data seems highly promising. Recent research on eye tracking in the context of SE increasingly exploits the benefits of linking eye tracking data to the software artifact. For example, Abid et al. (2019) present an eye tracking study on reading and summarizing Java methods. Unlike previous studies that only used short Java methods in isolation, this study had access to all source code (using iTrace; Guarnera et al., 2018), and subjects could freely scroll and navigate between files. Following a similar approach, future research could link neurophysiological measures with the software artifacts, or with other behavioral datasets such as clickstream or mouse data. For example, using a combination of user interactions and eye tracking data, Burattin et al. (2019) were able to predict (during the creation of a conceptual model) the task in which a user was engaged at a specific point in time. Information about what the user is currently working on allows contextualization of the observed mental states and can give rise to the development of neuro-adaptive systems that support the user in a context-specific manner. As an example, Adam et al. (2017) presented a blueprint for stress-sensitive adaptive enterprise systems. A major characteristic of such systems is that neuro-signals (e.g., heart rate or skin conductance) are integrated as real-time stress measures, with the goal that systems automatically adapt to the users' stress levels, thereby improving human-computer interactions. A major source for corresponding research is IEEE Transactions on Affective Computing. We refer the reader to this outlet to learn more about the foundations and applications in the domain of neuro-adaptive systems.
Experimental design. Most of the studies used a within-subject design with repeated measurements. This is not uncommon for neurophysiological experiments, since the range of differences across individuals is, for most neurophysiological measures, larger than the range of expected changes as a result of a stimulus (Jennings and Allen, 2016). This can also be a challenge for the development of robust classifiers of mental states (e.g., frustration, cognitive load), which ideally should work with high accuracy not only for the subjects that were used for training, but also for new subjects.
Dependent and independent constructs. The results of the review show that task-specific factors (e.g., software development activity, task-related events, task characteristics) were primarily used as independent variables when neurophysiological data was used as a dependent variable (cf. Fig. 6). An example question, yet one of the most fundamental ones, is ''What happens in the brain, or the nervous system, while a developer is engaged in a specific SE task?''. Importantly, such brain processes depend on an individual's expertise. The study described in Floyd et al. (2017) is one of the few studies considering developer-specific factors like expertise (collected via subjective measurements). The paper showed that for experienced programmers, code and prose were hardly distinguishable. It follows that, with increasing expertise, source code is increasingly processed like a natural language. This suggests that papers looking into brain activity should consider expertise in their research models (in particular when study subjects differ in their level of expertise). Similarly, when developing classifiers for mental states like perceived task difficulty or cognitive load, the inclusion of developer-specific factors like expertise seems critical, since individuals differ in how efficiently they can make use of their working memory capacity and other cognitive resources.
While several studies focused on cognitive processes and the prediction of cognitive states like cognitive load or interruptibility, so far only three studies (i.e., Girardi et al. (2020), Müller and Fritz (2016), Vrzakova et al. (2020)) present completed research using neurophysiological measurements with a focus on emotional processes. Yet, emotions in an SE context have received increasing attention in recent years. For example, emotions in SE are the focus of the International Workshop on Emotion Awareness in Software Engineering (SEmotion). Six of the papers published at SEmotion either discussed the usage of neurophysiological measurements at a conceptual level or presented suggestions for study designs. Moreover, sentiments and emotions in SE were the focus of a recent IEEE Software special issue. Similarly, a recent Journal of Systems and Software special issue had affect awareness as its focus. The papers of these special issues primarily relied on natural language processing and sentiment analysis for detecting affective states. In addition, a recently published review by Sánchez-Gordón and Colomo-Palacios (2019) focuses on emotions in SE. The review showed that neurophysiological data for measuring emotions has only obtained limited attention so far. This increasing interest in the role of emotions in SE is an opportunity for the field of NeuroSE, particularly when it comes to the detection of affective states. Future research is needed that investigates the detection of emotions using neurophysiological data in comparison and combination with other modalities. The paper by Vrzakova et al. (2020) is a first example in which neurophysiological and behavioral measurements for affect prediction are compared (i.e., skin-related measurements, eye gaze, and touch). The results showed that the classifier based on eye gaze data achieved the best performance.
The potential of neurophysiological methods for recognizing and inducing emotions during programming has also been investigated in recent work by Girardi et al. (2020). A section on ''emotion research'' in a recent research agenda paper may serve as a starting point for future research. In particular, we stress that conceptual clarity about the meaning of emotion, and how it differs from related concepts such as feelings and affect, is critical to move the discipline forward with the necessary scientific rigor. This issue is discussed in a research agenda paper that pertains to the NeuroIS field (vom Brocke et al., 2020), but the arguments presented in this paper are directly relevant to the NeuroSE field too.
Data analysis and Findings. Our data analysis showed that most of the studies related to brain activity (fMRI, EEG, fNIRS) focused on simple brain mapping and brain activation identification during specific tasks and in response to specific events, and analyzed differences in brain activation depending on different conditions. Three studies (i.e., Castelhano et al. (2018), Kosti et al. (2018), Lee et al. (2016)) performed a more in-depth analysis and additionally looked into functional brain connectivity (i.e., the organization, inter-relationship, and integrated performance of different brain regions (Bastos and Schoffelen, 2016; Rogers et al., 2008)). NeuroSE researchers must keep in mind that the emergence of complex mental processes such as code comprehension is based on activity in a network of brain regions rather than on activity in one area alone; see, for example, Appendix C in Riedl et al. (2017a), which summarizes cognitive neuroscience work without using terminology specific to neuroscience. Accordingly, more sophisticated data analysis techniques are needed in future brain research studies which consider that simple one-to-one mappings between cognition and brain areas do not exist (see, for example, Hubert et al. (2017), who recently introduced some such techniques).
In this context, Friston (1994) distinguishes four major concepts: functional specialization (i.e., analyses of region-specific effects: Which brain regions are involved in a specific mental process?), functional integration (i.e., analyses of effects between brain regions: Which interactions exist between brain regions so that a specific mental process emerges?), functional connectivity (i.e., the temporal correlation between regionally separate brain processes), and effective connectivity (i.e., the influence of one brain region on another region). Note that functional and effective connectivity are subcategories of functional integration. First, current NeuroSE fMRI research has focused on functional specialization. Second, in cognitive neuroscience, functional integration and its subcategories have received much attention in the fMRI literature in the past two decades, based on seminal methods papers (e.g., Friston (2002), O'Reilly et al. (2012), Penny et al. (2004)). It follows that in cognitive neuroscience, compared to current NeuroSE research, brain mechanisms underlying human cognition and emotion are often studied in a more realistic manner. Yet, because NeuroSE is a young discipline, this fact is more a call for corresponding future research than a fundamental critique of the currently available research. Third, unlike functional connectivity analyses in fMRI research, network examinations in fNIRS have only recently begun to become popular (e.g., Li and Yu (2018)). However, because toolboxes are increasingly available to study near-infrared spectroscopy data from a connectivity perspective (e.g., Xu et al. (2015)), we foresee more corresponding studies in the future. Note that in the current fNIRS literature, network studies predominantly refer to resting state situations, in which connectivity between brain regions is studied while an explicit task is not being performed (e.g., Buckner et al. (2013), Sharaev et al. (2016)).
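In its simplest form, the functional connectivity concept defined above reduces to the temporal correlation between region-wise signal time series. The following sketch computes a pairwise Pearson correlation matrix over toy "region" signals; the region labels and values are made-up illustrations, not real fMRI/fNIRS recordings.

```python
# Functional connectivity as temporal correlation (Friston, 1994):
# pairwise Pearson correlations between regionally separate signals.

import math


def pearson(x, y):
    """Pearson correlation between two equally long time series."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def connectivity_matrix(regions):
    """Pairwise correlation matrix over a dict {region_name: time_series}."""
    names = list(regions)
    return {(a, b): pearson(regions[a], regions[b]) for a in names for b in names}


# Toy data: "BA44" is strongly coupled with "BA21", "BA6" is largely unrelated
regions = {
    "BA21": [1.0, 2.0, 3.0, 4.0, 5.0],
    "BA44": [1.1, 2.2, 2.9, 4.1, 4.8],
    "BA6":  [5.0, 1.0, 4.0, 2.0, 3.0],
}
cm = connectivity_matrix(regions)
```

Actual connectivity analyses involve preprocessing, confound regression, and statistics far beyond this sketch; it only illustrates why connectivity requires analyzing relations between regions rather than activation in single regions.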
Yet, in NeuroSE research a number of research contexts are imaginable in which this kind of network analysis may be useful. Imagine, for example, the study of functional resting state differences between highly and less experienced programmers. Moreover, in fNIRS, optode placement is critical for regions-of-interest analyses, and recently a toolbox for probe arrangement has been presented in the literature (Zimeo Morais et al., 2018).
Studies related to the ANS have mostly relied on data-driven approaches (in particular classification). Due to advances in machine learning and the availability of low-cost measurement instruments, we particularly expect an increase in such studies in the future. The biggest challenge here, however, is that it is unclear to what extent the developed models can be generalized or need to be developed and trained for a specific context. These insights call for additional research to establish a better understanding of the extent to which classifiers for mental states trained on one software development activity can be used for another software development activity.
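The generalization concern can be illustrated with a toy experiment: a classifier for "high" vs. "low" cognitive load trained on features from one activity degrades when the feature distribution shifts in another activity. The single feature (think of a normalized HRV-derived score), the activity names, and all values below are fabricated for illustration only.

```python
# Toy cross-activity generalization check with a one-feature
# nearest-centroid classifier. All data are fabricated.


def nearest_centroid_fit(samples):
    """samples: list of (feature_value, label). Returns per-label centroids."""
    sums, counts = {}, {}
    for x, label in samples:
        sums[label] = sums.get(label, 0.0) + x
        counts[label] = counts.get(label, 0) + 1
    return {label: sums[label] / counts[label] for label in sums}


def predict(centroids, x):
    """Assign the label whose centroid is closest to x."""
    return min(centroids, key=lambda label: abs(x - centroids[label]))


def accuracy(centroids, samples):
    return sum(predict(centroids, x) == label for x, label in samples) / len(samples)


# Activity A (e.g., code comprehension): low load ~0.2, high load ~0.8
train_a = [(0.1, "low"), (0.2, "low"), (0.3, "low"),
           (0.7, "high"), (0.8, "high"), (0.9, "high")]
# Activity B (e.g., bug fixing): the whole feature range is shifted upwards
test_b = [(0.55, "low"), (0.6, "low"), (1.1, "high"), (1.2, "high")]

centroids = nearest_centroid_fit(train_a)
print(accuracy(centroids, train_a), accuracy(centroids, test_b))
```

The drop from perfect in-activity accuracy to chance-level cross-activity accuracy in this toy setup mirrors why cross-context evaluation (e.g., train on one activity or subject group, test on another) is needed before such classifiers are deployed.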

Validity threats
The following four major validity threats were identified and mitigated in relation to the review:
Descriptive Validity. Descriptive validity concerns the extent to which observations are described accurately and objectively. To mitigate this threat and to objectify the data collection process, we designed a data collection form that we could always revisit.
Theoretical Validity. Theoretical validity refers to the ability to capture what we intend to capture. To mitigate the risk of missing studies, we carefully designed the search string by systematically combining a ''Neuro'' term with an SE term and complemented the search with a backward and forward search. To reduce bias in data extraction and classification, the first and second authors extracted the data; borderline cases were discussed between them, with the involvement of the third author.
Interpretive validity. Interpretive validity is ensured when the conclusions drawn are grounded in the data. A threat to interpretive validity is researcher bias. None of the authors co-authored any of the reviewed papers. Moreover, collectively the coauthors of this review have approximately three decades of experience with neurophysiological measurements, which may help in the interpretation of data.
Repeatability. Repeatability requires a detailed description of the research process including data analysis. We explained in detail the process we followed. Moreover, we followed existing guidelines on how to conduct SLRs and have made the data extraction form and a file with all study details available as supplementary material.

Conclusion and future work
This paper maps the literature using measurements of brain and autonomic nervous system activity in a software engineering context and provides a comprehensive overview of the NeuroSE literature. We hope that this literature review will make it easier for other researchers to engage in NeuroSE research. To lower the entry barriers for conducting NeuroSE research, methodological contributions providing guidelines on how to conduct NeuroSE research would be helpful. Moreover, we also regard infrastructural contributions that support the collection and analysis of neurophysiological data in an SE context as critical (e.g., tools for linking neurophysiological measures with software artifacts in the vein of iTrace (Guarnera et al., 2018), or tools for the multi-modal exploration of data (Peitek et al., 2019; Roy et al., 2020)). Both methodological and infrastructural contributions can play an important role in helping to increase the number of studies collecting neurophysiological data in the future.
Overall, neurophysiological measurements can contribute to the field of software engineering in several ways: (1) Neurophysiological measurements can contribute to enhancing our understanding of human factors in software engineering. For example, neurophysiological measurements can be used to investigate cognitive as well as emotional processes of developers and hence can complement subjective (self-reported) and behavioral measurements.
(2) Studies using neurophysiological measurements can inform the development of methods, tools, and techniques to improve the development of software.
(3) Neurophysiological measurements can play an important role in understanding the use of software systems, i.e., software analytics, and provide insights into how software is perceived by its users. (4) Neurophysiological measurements can be used to develop so-called neuro-adaptive systems, software systems that are able to adapt to the mental state of their users (based on neurophysiological information, which constitutes a correlate of that mental state). This can include neuro-adaptive integrated development environments, but also neuro-adaptive learning platforms (e.g., for programming). While the NeuroSE field is still at a nascent stage, we conclude that what has been revealed so far constitutes a valuable basis for future research. We believe that a better understanding of brain processes and processes related to ANS activity will contribute to advancements in software engineering and better software systems.

Declaration of competing interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Appendix A. Overview of selected empirical papers
See Table 10.

Appendix B. Overview of remaining selected papers
See Table 11.

Appendix C. Overview of types of contributions
See Table 12.

Appendix D. Usage of neurophysiological methods
See Table 13.

Appendix E. Overview of used measurement instruments
See Table 14.

Appendix F. Measures used as features for machine learning classifiers
See Table 15.

Appendix G. Research questions categorized by methods
See Table 16.

Appendix H. Overview of classifier performance
See Table 17.

Appendix I. Summary of study participant information
See Table 18.

Appendix J. Overview of used stimuli
See Table 19.

Appendix K. Supplementary data
Supplementary material related to this article can be found online at https://doi.org/10.1016/j.jss.2021.110946.

Table 12
Types of contributions of NeuroSE research (N=89).

Table 14
Measurement instruments used in completed empirical NeuroSE research (N=47).

Table 16
Overview of research questions addressed by different neurophysiological methods (N=47).
Research questions answered using fMRI
• Which brain regions are activated during program comprehension? (Siegmund et al., 2014)
• Can we replicate the results of Siegmund et al. (2014)?
• What is the difference between bottom-up program comprehension and comprehension with semantic cues in terms of activation and the brain areas involved?
• How do layout and beacons in source code influence program comprehension?
• Which brain regions are activated during bottom-up program comprehension? (Peitek et al., 2018a)
• Does source-code complexity correlate with concentration levels during bottom-up program comprehension? (Peitek et al., 2018a)
• Does programming experience correlate with brain activation strength during bottom-up program comprehension? (Peitek et al., 2018a)
• Are neural representations of programming languages and natural languages distinct? (Floyd et al., 2017)
• Can we relate tasks to brain regions? (Floyd et al., 2017)
• Can we relate expertise to classification accuracy? (Floyd et al., 2017)
• What are the brain activation patterns associated with bug confirmation? (Duraes et al., 2016)
• What are the brain activation patterns associated with bug suspicion (in contrast to bug confirmation)? (Duraes et al., 2016)
• What are the brain activation patterns for software program processing in expert participants while performing bug detection tasks?
Research questions answered using EEG
• Can we build a classifier for code complexity using EEG?
• Are there differences in brain activation dependent on the programming language? (Doukakis, 2019; Doukakis et al., 2020)
Research questions answered using eye tracking
• What are the scanning patterns during program comprehension? (Aschwanden and Crosby, 2006)
• Are developers' visual efforts equally divided among regular segments? (Jbara and Feitelson, 2015)
• Does attention data representation have any effect on the efficiency and quality of software maintenance tasks? (Ahrens et al., 2019)
• How do structural and textual readability features affect program comprehension? (Wulff-Jensen et al., 2019)
• How much do the characteristics of the interview setting affect eye tracking measurements?
• Can we detect differences in stress and cognitive load between the paper and whiteboard technical interview settings?
• Can a developer's cognitive load be measured during code reading using pupillography? (Couceiro et al., 2019b)
• Do measurements of cognitive load based on pupillography correlate with subjective measurements of cognitive load (NASA-TLX)? (Couceiro et al., 2019b)
Research questions answered using heart-related measurements
• Can we build a classifier that predicts a software developer's interruptibility accurately in the field?
• Can a developer's cognitive load be measured during code reading using HRV? (Couceiro et al., 2019c)
• Do HRV measurements of cognitive load correlate with subjective measurements of cognitive load? (Couceiro et al., 2019c)
• Do source code complexity metrics correlate with subjective measurements of cognitive load? (Couceiro et al., 2019c)
• Can we extract Social Physiological Compliance (SPC) from an ECG signal in a natural protocol?
Research questions answered using multiple modalities
Eye tracking, heart
• Can eye tracking together with HRV and pupillography be used to non-intrusively identify code lines (and even lexical tokens inside code lines) that correspond to mental effort peaks? (Couceiro et al., 2019a)
Skin, heart
• Can we extract Social Physiological Compliance (SPC) from an ECG signal in a natural protocol? (Ahonen et al., 2018)
• Can windowed heart-related measurements be substituted by fast biosignals for examining SPC in a natural protocol? (Ahonen et al., 2018)
• Can the physiological signals with high temporal resolution, found to reflect SPC, be associated with task-related emotional valence and engagement? (Ahonen et al., 2018)
EEG, eye tracking, skin
• Can we acquire psycho-physiological measures from eye tracking, EDA and EEG sensors to accurately predict task difficulty? (Fritz et al., 2014)
• Which combination of psycho-physiological sensors and associated features best predicts task difficulty? (Fritz et al., 2014)
• Can we use psycho-physiological measures to predict task difficulty as the developer is working? (Fritz et al., 2014)
EEG, skin, heart
• Can we predict the interruptibility of a knowledge worker in a real-world working context using a combination of psycho-physiological sensors? (Züger and Fritz, 2015)
• Can we classify which task a participant is undertaking based on signals collected from lightweight biometric sensors? (Fucci et al., 2019)
• Can we relate expertise to classification accuracy? (Fucci et al., 2019)
• What is the minimal set of non-invasive biometric sensors to recognize developers' emotions? (Girardi et al., 2020)
Heart, skin, breath
• Can we use biometrics to identify places in the code that are perceived to be more difficult by developers? (Müller and Fritz, 2016)
• Can we use biometrics to identify code quality concerns found through peer code reviews? (Müller and Fritz, 2016)
• How do biometrics compare to more traditional metrics for predicting perceived difficulty and detecting quality concerns? (Müller and Fritz, 2016)
EEG, eye tracking, skin, heart
• Can we use biometric sensors to determine developers' emotions and progress during change tasks? (Müller and Fritz, 2015)
Skin, eye tracking, touch
• How do nonverbal physiological signals predict components of affect after the code review task? (Vrzakova et al., 2020)
a Lee et al. (2017) uses eye tracking, but the eye tracking measures used are not detailed. Thus, it is unclear whether the paper uses neurophysiological measures like pupil size or eye blinking rate. We therefore classified this study as an EEG study only.
[Extraction residue of Tables 18 and 19 (summary of study participant information; overview of used stimuli). The recoverable fragments list, per study, sample sizes and participant backgrounds (e.g., 3rd-year computer science students, PhD students, professional developers, active code reviewers), the tasks used (code comprehension, change, programming/documenting/bug fixing, and code review tasks), the programming languages (C, Java, Python, Scratch), and stimulus sizes (mostly small programs of up to roughly 60 lines; Müller and Fritz (2016) instead recorded developers over a two-week period of general development work).]