The effects of change-decomposition on code review - A Controlled Experiment

Background: Code review is a cognitively demanding and time-consuming process. Previous qualitative studies hinted that changesets divided according to a logical partitioning could be easier to review. Aims: (1) Quantitatively measure the effects of change-decomposition on the outcome of code review (in terms of number of found defects, wrongly reported issues, suggested improvements, time, and understanding); (2) Qualitatively analyze how subjects approach the review and navigate the code, building knowledge and addressing existing issues, in large vs. decomposed changes. Method: Controlled experiment using the pull-based development model involving 28 software developers among professionals and graduate students. Results: Change-decomposition leads to fewer wrongly reported issues and influences how subjects approach and conduct the review activity (by increasing context-seeking), yet impacts neither understanding the change rationale nor the number of found defects. Conclusions: Change-decomposition not only reduces the noise for subsequent data analyses but also significantly supports the tasks of the developers in charge of reviewing the changes. As such, commits belonging to different concepts should be separated, adopting this as a best practice in software engineering.


INTRODUCTION
Code review is the activity performed by software teams to check code quality, with the purpose of identifying issues and shortcomings [1]. Nowadays, reviews are mostly performed in an iterative, informal, change- and tool-based fashion, also known as Modern Code Review (MCR) [10]. Both open-source and industrial software teams employ MCR to check code changes before they are integrated into their codebases [32]. Past research has provided evidence that MCR is associated with improved key software quality aspects, such as maintainability [24] and security [11], as well as with fewer defects [23].
Reviewing a source code change is a cognitively demanding process, and researchers have provided evidence that understanding the code change under review is among the most challenging tasks for reviewers [1]. In this light, past studies have argued that code changes that address multiple, possibly unrelated concerns at the same time (also known as noisy [25] or tangled changes [18]) can hinder the review process [18, 19] by increasing the cognitive load for reviewers. Indeed, it is reasonable to think that grasping the rationale behind a change that spans multiple concepts in a system requires more effort than understanding the same changes committed separately. Moreover, the noise could put a reviewer on the wrong track, thus leading to missed defects (false negatives) or to raising unfounded issues in sound code (false positives in this paper).
Although tools that automatically decompose tangled changes are expectedly useful, the effects of change-decomposition on review remain an open research problem. Tao and Kim presented the earliest and most significant results in this area [38], showing that change-decomposition allows practitioners to accomplish their tasks better in a similar amount of time.
In this paper, we continue this research line and focus on evaluating the effects of change-decomposition on code review. We aim to answer questions such as: Is change-decomposition beneficial for understanding the rationale of the change? Does it have an impact on the number and types of issues raised? Are there differences in time to review? Are there variations with respect to defect lifetime?
To this end, we designed a controlled experiment focusing on pull requests, a widespread approach to submit and review changes [15]. Our work investigates whether the results from Tao and Kim [38] can be replicated, and extends the knowledge on the topic. With a Java system as subject, we asked 28 software developers among professionals and graduate students to review a refactoring and a new feature (according to professional developers [37], these are the most difficult to review when tangled). We measure how partitioning vs. not partitioning the changes impacts defects found, false positive issues, suggested improvements, time to review, and understanding of the change rationale. We also perform qualitative observations on how subjects conduct the review and address defects or raise false positives in the two scenarios.
This paper makes the following contributions:
• the design of an asynchronous controlled experiment to assess the benefits of change decomposition in code review using pull requests, available for replication [12];
• empirical evidence that change decomposition in the pull-based review environment leads to fewer false positives.
The paper proceeds as follows: Section 2 illustrates the related work; Section 3 describes our research objectives; the design of our experiment is described in Section 4; limits of the experiment are discussed in Section 5; results are presented in Section 6; Section 7 reports the discussion based on the results; finally, Section 8 summarizes our study.

RELATED WORK
Several studies explore tangled changes and concern separation in code reviews. Tao et al. investigate the role of understanding code changes during the software development process, exploring practitioners' needs [37]. Their study outlined that grasping the rationale is indispensable when performing code review. Moreover, to understand a composite change, it is useful to break it into smaller ones, each concerning a single issue. Rigby et al. empirically studied the peer review process for six large, mature OSS projects, showing that small change size is essential to the more fine-grained style of peer review [31]. Kirinuki et al. provide evidence about problems with the presence of multiple concepts in a single code change [19]. They show that such changes are unsuitable for merging code from different branches, and that tangled changes are difficult to review because practitioners have to seek the changes for the specified task in the commit.
Regarding empirical controlled experiments on the topic of code reviews, the most relevant work is by Uwano et al. [40]. They use an eye-tracker to characterize the performance of subjects reviewing source code. Their experimentation environment enabled them to identify a pattern called scan, consisting of the reviewer reading the entire code before investigating the details of each line. In addition, their qualitative analysis found that participants who did not spend enough time during the scan took more time to find defects. Uwano's experiment was replicated by Sharif et al. [34].
Their results indicate that the longer participants spent in the scan, the quicker they were able to find the defect. Conversely, review performance decreases when participants do not spend sufficient time on the scan, because they find irrelevant lines.
Even if MCR is now a mainstream process, adopted in both open-source and industrial projects, we found only two studies on change partitioning and its benefits for code review. The work by Barnett et al. [2] analyzed the usefulness of an automatic technique for decomposing changesets. They found a positive association between change decomposition and the level of understanding of the changesets. According to their results, this would help the time to review, as the different contexts are separated. Tao and Kim [38] proposed a heuristic-based approach to decompose changesets with multiple concepts. They conducted a user study with students, investigating whether their untangling approach affected the time and the correctness in performing review-related tasks. Results were promising: participants completed the tasks better with untangled changes in a similar amount of time. Despite the innovative techniques they proposed to untangle code changes and these promising results, the evaluation of the effects of change-decomposition was preliminary.
In contrast, our research focuses on setting up and running an experiment to empirically assess the benefits of change-decomposition for the code review process, rather than evaluating the performance of an approach.

MOTIVATION AND RESEARCH OBJECTIVES

3.1 Experiment definition and context
Our analysis of the literature showed that there is only preliminary empirical evidence on how change decomposition affects code review outcomes, in terms of change-understanding, time to completion, effectiveness (i.e., number of defects found), false positives (issues mistakenly identified as defects by the reviewer), and suggested improvements. This motivates us to set up a controlled experiment, exploiting the popular pull-based development model, to assess the conjecture that a proper separation of concerns in code review is beneficial to the efficiency and effectiveness of the review.
Pull requests feature asynchronous, tool-based activities in the bigger scope of pull-based software development [14]. The pull-based software process features a distributed environment where changes to a system are proposed through patch submissions that are pulled and merged locally, rather than being directly pushed to a central repository.
Pull requests are the way contributors submit changes for review in GitHub. Change acceptance has to be granted by other team members called integrators [15]. They have the crucial role of managing and integrating contributions and are responsible for inspecting the changes for functional and non-functional requirements. 80% of integrators use pull requests as the means to review changes proposed to a system.
In the context of distributed software development and change integration, GitHub is one of the most popular code hosting sites with support for pull-based development. GitHub pull requests contain a branch from which changes are compared by an automatic discovery of commits to be merged. Changes are then reviewed online. If further changes are requested, the pull request can be updated with new commits to address the comments. The inspection can be repeated and, when the patchset fits the requirements, the pull request can be merged.

Research questions
The motivation behind modern code review is to find defects and improve code quality [1]. We are interested in checking whether reviewers are able to address defects (referred to in this paper as effectiveness). Furthermore, we focus on comments pointing out wrongly reported defects and suggested improvements (which are non-critical, non-functional issues). Suggested improvements highlight reviewer participation [22], and these comments are considered very useful [6].
Our first research question is: RQ1. Do tangled pull requests influence effectiveness (i.e., number of defects found), false positives, and suggested improvements of reviewers, when compared to untangled pull requests?
Based on the first research question, we formulate the following null-hypotheses for (statistical) testing. Tangled pull requests do not reduce:
H0e: the effectiveness of the reviewers during peer-review
H0f: the false positives detected by the reviewers during peer-review
H0c: the suggested improvements written by the reviewers during peer-review
Given the structure and the settings of our experimentation, we can also measure the time spent on the review activity and defect lifetime. Thus, our next research question is: RQ2. Do tangled pull requests influence the time necessary for a review and defect lifetime, when compared to untangled pull requests?
For the second research question, we formulate the following null-hypotheses. Tangled pull requests do not reduce:
H0t1: time to review
H0t2: defect lifetime
Further details on how we measure time and define defect lifetime are described in Section 4.7.
In our study, we aim to measure whether change-decomposition has an effect on understanding the rationale of the change under review. Understanding the rationale is the most important information need when analyzing a change, according to professional software developers [37]. As such, the question we set out to answer is: RQ3. Do tangled pull requests influence the reviewers' understanding of the change rationale, when compared to untangled ones?
For our third research question, we test the following null-hypothesis. Tangled pull requests do not reduce:
H0u: change-understanding of reviewers during peer-review when compared to untangled pull requests
Finally, we qualitatively investigate how participants individually perform the review to understand how they address defects or potentially raise false positives. Our last research question is then: RQ4. What are the differences in patterns and features used between reviews of tangled and untangled pull requests?

EXPERIMENTAL DESIGN AND METHOD
In this section, we detail how we designed the experiment and the research method that we followed.
4.1 Object system chosen for the experiment
The system considered in the reviews for the experiment is JPacman, an open-source Java system available on GitHub that emulates a popular arcade game, used at the Delft University of Technology to teach software testing. The system has about 3,000 SLOC and was selected because a more complex and larger project would require participants to grasp the rationale of a more elaborate system. In addition, the training phase required for the experiment would imply hours of effort, increasing the consequent fatigue that participants might experience. In the end, the experiment targets assessing differences in review partitioning and is tailored to a process rather than a product.

Recruiting of the subject participants
The study was conducted with 28 participants recruited by means of convenience sampling [41] among experienced and professional software developers, PhD students, and MSc students. We involved as many different roles as possible to have a larger sample for our study and increase its external validity. Using a questionnaire, we asked about development experience, language-specific skills, and review experience as the number of reviews per week. We also included a question asking whether a participant knows the source code of the game. Table 1 reports the results of the questionnaire, which are used to characterize our population and to identify key attributes of each subject participant.

Monitoring versus realism
Designing an experiment requires carefully managing the trade-off between realism and control. In controlled experiments, researchers strive to have the most realistic scenario possible, while compromising as little as possible on the control of the settings that are put in place.
Due to the nature of pull-based software development and its peer review with pull requests, we established that the experimentation phase would be executed asynchronously. This implies that participants can run the experiment when and where they feel most comfortable, with no explicit constraints on place, time, or equipment.
With this choice, we are purposefully giving up some degree of control to increase realism. Having a more strictly controlled experimental environment would not replicate the usual way of running such tasks (that is, asynchronous and informal). Besides, an experiment run synchronously in a laboratory would raise control challenges of its own: it might be distracting for some participants, or even induce a follow-the-crowd behavior, thus leading to people rushing to finish their tasks.
To regain some degree of control, participants run all the tasks in a provided virtual machine available in our replication package [12]. In addition, the screencast of the experiment is recorded, therefore leaving no room for misaligned results and mitigating issues of incorrect interpretation. Subjects were provided with instructions on how to use the virtual machine, but no time window was set.
The independent variable of our study is change-decomposition in pull requests. We split our subjects between a control group and a treatment group: the control group receives one pull request containing a single commit with all the changes tied together; the treatment group receives two pull requests with changes separated according to a logical decomposition.
Participants are randomly assigned to either the control or the treatment group using strata based on experience as developers and previous knowledge. Previous research has shown that these factors have an impact on review outcome [6, 30]: developers who previously made changes to files to be reviewed had a higher proportion of useful comments.
All subjects are asked to run the experiment in a single session so that external distracting factors can be eliminated as much as possible. If a participant needs a pause, the pause is identified and excluded from the final result, as we measure and monitor periods of inactivity. We seek to reduce the impact of fatigue by limiting the expected time required for the experiment to an average of 60 minutes; this value is closer to the minimum rather than the median for similar experiments [20]. No learning effect is present, as every participant runs the experiment only once.

Pilot experiments
We ran two pilot experiments to assess the settings. The first subject (a developer with 5 FTE years of experience) took too long to complete the training and showed some issues with the virtual machine. Consequently, we restructured the training phase, addressing the potential environment issues in the material provided to participants.
The second subject (an MSc student with low experience) successfully completed the experiment in 50 minutes with no issues. Both pilot experiments were executed asynchronously in the same way as the actual experiment.
4.6 Tasks of the experiment
The participants were asked to conduct the following four tasks. Further details are available in the online appendix [12].
2 - Training the participants. Before starting with the review phase, we first ensure that the participants have sufficient familiarity with the system. It is likely that the participants had never seen the codebase before: this situation would limit the realism of the subsequent review task.
To train our participants, we ask subjects to implement three different features in the system: (1) change the way the player moves on the board game, using different keys; (2) check if the game board has null squares (a board is made of multiple squares) and perform this check when the board is created; and (3) implement a new enemy in the game, with artificial intelligence similar to another enemy but with different parameters. This learning-by-doing approach is expected to have higher effectiveness than providing training material to participants [36]. By definition, this approach is a method of instruction where the focus is on the role of feedback in learning. The desired features require changes across the system's codebase. The third feature to be implemented targets the classes and components of the game that will be addressed by the review tasks. This feature was chosen as the last one to progressively increase the level of difficulty.
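To give a concrete flavor of the training tasks, the sketch below illustrates one possible implementation of feature (2), the null-square check on board creation. It is a minimal sketch: the class and method names (Board, Square, hasNoNullSquares) are hypothetical and do not necessarily match JPacman's actual codebase.

```java
// Hypothetical sketch of training feature (2): reject boards that contain
// null squares at construction time. Names do not necessarily match JPacman.
public class Board {

    static class Square { /* placeholder for the real square type */ }

    private final Square[][] squares;

    public Board(Square[][] squares) {
        this.squares = squares;
        if (!hasNoNullSquares()) {
            throw new IllegalArgumentException("Board must not contain null squares");
        }
    }

    // Returns true only if every position on the board holds a non-null square.
    private boolean hasNoNullSquares() {
        for (Square[] column : squares) {
            for (Square square : column) {
                if (square == null) {
                    return false;
                }
            }
        }
        return true;
    }
}
```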
No time window is given to participants, with the aim of having a more realistic scenario. As explicitly mentioned in the provided instructions, participants are allowed to use any source for retrieving information about something they do not know. This is permitted as the study does not aim to assess skills in implementing some functionality in a programming language. The only limitation is that the participants must use the tools within the virtual machine.
The virtual machine provides the participants with the Eclipse Java IDE. The setup already has the project imported in Eclipse's workspace. We use an Eclipse plugin, WatchDog [4], to monitor development activity. With this plugin, we measure how much time participants spend reading, typing, or using the IDE. The purpose is to quantify the time to understand code among participants and whether this relates to a different outcome in the following phases. Results for this phase are shown in Figure 1, which contains boxplots depicting the data. It shows that there is no significant difference between the two groups.
3 - Perform code review on proposed change(s). Participants are asked to review two changes made to the system: (1) the implementation of the artificial intelligence for one of the enemies and (2) the refactoring of a method in all enemy classes (moving its logic to the parent class). These changes can be inspected in the online appendix [12] and have been chosen to meet the same criteria used by Herzig et al. [17] when choosing tangled changes. The proposed changes can be classified as a refactoring and an enhancement. Previous literature gave insight into how these two kinds of changes, when tangled together, are the hardest to review [37]. Although recent research proposed a theory for the optimal ordering of code changes in a review [3], we use the default ordering and presentation provided by GitHub, because it is the de facto standard. Changesets are included in pull requests on private GitHub repositories so that participants perform the review in a real-world environment. Pull requests have identical descriptions for both the control and the treatment, with no additional information except their descriptive title. While research shows that a short description may lead to poor review participation [39], this does not apply to our experiment as there is no interaction among subjects. Subjects were instructed to understand the change and check its functional correctness. We asked the participants to comment on the pull request(s) if they found any problem in the code, such as any functional error related to correctness or issues with code quality. The proposed changes had three different functional issues that were intentionally injected into the source code. Participants could see the source code in case more context was needed, but only through GitHub's browser-based UI. The size of the changeset is around 100 lines of code and it involves seven files. Gousios et al. showed that the number of total lines changed by pull requests is on average less than 500, with a median of 20 [14]. Therefore, the number of lines of the changeset used in this study is between the median and the average.
4 - Post-experiment questionnaire. In the last phase, participants are asked to answer the questions shown in Table 4.
Questions Q1 to Q4 are about change-understanding, while Q5 to Q12 involve subjects' opinions about changeset comprehension and its correctness, rationale, understanding, etc. Q5 to Q12 summarize interesting aspects that developers need to grasp in a code change, as mentioned in the study of Tao et al. [37]. The answers must be provided on a Likert scale [26] ranging from 'Strongly disagree' (1) to 'Strongly agree' (5).
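For concreteness, the refactoring reviewed in Task 3 is essentially a pull-up-method transformation, in which logic duplicated across the enemy classes is moved to their common parent. The sketch below only illustrates the shape of such a change; all names (Ghost, Blinky, nextMove, Direction) are hypothetical and are not taken from the actual changeset.

```java
import java.util.Random;

// Hypothetical illustration of the reviewed refactoring: a method duplicated in
// every enemy class is "pulled up" into the common parent class.
abstract class Ghost {

    // After the refactoring, the shared movement logic lives here instead of
    // being repeated in each subclass.
    public Direction nextMove() {
        Direction[] values = Direction.values();
        return values[new Random().nextInt(values.length)];
    }
}

class Blinky extends Ghost {
    // The subclass no longer overrides nextMove(); it inherits the parent's
    // logic and keeps only its enemy-specific parameters.
}

enum Direction { NORTH, SOUTH, EAST, WEST }
```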

Outcome measurements
Effectiveness, false positives, suggested improvements. Subjects are asked to comment on a pull request either in the pull request discussion or as an in-line comment on a commit belonging to that pull request. The number of comments addressing functional issues is counted as the effectiveness. At the same time, we also measure false positives (i.e., comments in a pull request that do not address a real issue in the code) and suggested improvements (i.e., remarks on other non-critical, non-functional issues). We can identify such comments because the three functional defects were intentionally put in the source code. Comments that do not directly and correctly tackle one of these three issues are classified either as false positives or as suggested improvements. They are identified by the first author by looking at the description provided by the subject. A correctly identified issue needs to highlight the problem and, optionally, provide a short description.
Time. Having the screencast of the whole experiment, as well as using tools that give time measures, we gather the following measurements:
• Time for Task 2, in particular: total time Eclipse is [opened/active] - total time the user is [active/reading/typing]; as collected by WatchDog (Section 4.6).
• Total net time for Task 3, defined as from when the subject opens a pull request until when (s)he completes the review, purged of eventual breaks.
• Defect lifetime, defined as from when the subject opens a pull request to when (s)he writes a comment. For the case of multiple comments on the same pull request, this is the time between finishing with one issue and addressing the next. A similar measure was previously used by Prechelt et al. [29].
All the above measures are collected in seconds elapsed.
Change-understanding. In this experiment, change-understanding is measured by means of a questionnaire submitted to participants after the review activity, as mentioned in Task 4 in Section 4.6.
Questions Q1 to Q4 in Table 4 are used for this purpose. Their aim is to evaluate differences in change-understanding. A similar technique is used by Binkley et al., who adopt a SAT-style questionnaire [5].
Final Survey. Lastly, participants are asked to give their opinion on statements targeting the perception of correctness, understanding, rationale, logical partitioning of the changeset, difficulty in navigating the changeset in the pull request, comprehensibility, and the structure of the changes. This phase, as well as the previous one, is included in Task 4, corresponding to questions Q5 to Q12 (Table 4). Results are given on a Likert scale from "Strongly disagree" (1) to "Strongly agree" (5) [26], reported as mean, median, and standard deviation over the two groups, and tested for statistical significance with the Mann-Whitney U-test.

Research method for RQ4
For our last research question, we aim to build some initial hypotheses to explain the results of the previous research questions. We seek the actions and patterns that lead a reviewer to find an issue or raise a false positive, as well as other comments. The method to map actions to concepts starts by annotating the screencasts retrieved after the conclusion of the experimental phase. Subjects perform a series of actions that define and characterize both the outcome and the execution of the review. The first author inserted notes regarding actions performed by participants to build a knowledge base of steps (i.e., participant opens fileName, participant uses GitHub search box with the keyword, etc.).
Using the methodology for qualitative content analysis delineated by Schreier [33], we first defined the coding frame. Our purpose is to characterize the review activity based on patterns and behaviors. As previous studies already tackled this problem and came up with reliable categories, we used the investigations by Tao et al. [37] and Sillito et al. [35] as the basis for our frame. We used the concepts from Tao et al. [37] regarding Information needs for reasoning and assessing the change and Exploring the context and impact of the change, as well as the Initial focus points and Building on initial focus points steps from Sillito et al. [35].
To code the transcriptions, we use deductive category application, resembling the data-driven content analysis technique by Mayring [21]. We read the transcribed material, checking whether a concept covers the transcribed action (e.g., participant opens file fileName, so (s)he is looking for context). We group actions covered by the same concept (e.g., a participant opens three files, but always for context purposes) and continue until we build a pattern that leads to a specific outcome (i.e., addressing a defect or a false positive). We split the patterns according to their concept ordering such that those that lead to more defects found or more false positive issues are visible. Ultimately, we identify the major differences between the two experimental groups.

THREATS TO VALIDITY AND LIMITATIONS
We describe the threats according to the structure given by Cook and Campbell [7] using the threats' catalog by Wohlin et al. [41].
Conclusion validity. Threats to conclusion validity can be found in the participant selection. If the group is very heterogeneous, there is a risk that the variation due to individual differences is larger than that due to the treatment. Our settings put in place countermeasures to mitigate this issue: we do not ask subjects to quantify objective measures (i.e., time to review), as this has a tendency to reduce the reliability of results. The measures chosen to evaluate the dependent variables allowed us to assess results in an objective manner, avoiding subjective scores. Choosing more homogeneous groups would, on the other hand, affect the external validity [7]. Questionnaires were designed using standard methods and scales [26]. Statistical tests used in our study were not performed with the Bonferroni correction, following the advice by Perneger: "Adjustments are, at best, unnecessary and, at worst, deleterious to sound statistical inference" [28].
Internal validity. Regarding the instrumentation, artifacts, and environment used in our experimentation, the development environment provided is a standard Linux installation with the Eclipse IDE (v. 4.4.2) and the plugin from the work by Beller et al. [4]. Concerning our participants, they were drawn from a population sample that volunteered to participate. Software developers belong to three software companies, PhD students belong to three universities, and MSc students to different faculties, despite being from the same university.
Construct validity. Construct threats are related to mono-operation bias and mono-method bias [41]. The settings of our experiment analyze only the pull-based development process, with reviews through pull requests, in the frame of a single Java system. This was necessary as, given the structure of our experiment, we wanted to replicate a real-world scenario for this particular process. This also helps increase external validity.
External validity. The most important threat is related to the interaction of the settings and the treatment. In fact, one might argue that the system used is not fully representative of a real-world scenario. Our choice, however, as explained in Section 4.1, aims to reduce the training phase effort required from participants and to encourage the completion of the experiment. Furthermore, we evaluate the effects that external environment factors could have. We accept this threat as we run our experiment in an asynchronous environment, even if we believe that our description of the settings, tools, and methods provides adequate countermeasures in this context.

RESULTS

RQ1. Effectiveness, false positives, and suggestions
For our first research question, descriptive statistics about the results are shown in Table 2. It contains data about the effectiveness of participants (i.e., the correct number of issues addressed), false positives, and the number of suggested improvements. Given the sample size, we applied a non-parametric test and performed a Mann-Whitney U-test to test for differences between the control and the treatment group. This test, unlike a t-test, does not require the assumption of a normal distribution of the samples. Results of the statistical test are considered significant at a 95% confidence level (i.e., p < 0.05).
Results indicate a significant difference between the control and the treatment group regarding the number of false positives, with a p-value of 0.03. On the contrary, there is no statistically significant difference regarding the number of defects found (effectiveness) and the number of suggested improvements. An example of a false positive is when subject C9 writes: "This doesn't sound correct to me. Might want to fix the for, as the variable varName is never used".
This is not a defect, as varName is used to check how many times the for-statement has to be executed, despite not being used inside the statement. This is also written in a code comment. Another false positive example is provided by subject C1 who, reading the refactoring proposed by the changeset under review, writes: "The method methodName is used only in Class ClassName, so fix this". This is not a defect, as the same methodName is used by the other classes in the hierarchy. As such, we can reject only the null hypothesis H0f regarding the false positives, while we cannot provide statistically significant evidence about the other two variables tested in H0e and H0c. The statistical significance alone for the false positives does not provide a measure of the actual impact of the treatment. To measure the effect size of the factor over the dependent variable we chose Cliff's delta [8], a non-parametric measure of effect size. For data with a skewed marginal distribution, it is a more robust measure than Cohen's standardized effect size [9]. The computed value shows a positive (i.e., tangled pull requests lead to more false positives) effect size (δ = 0.36), revealing a medium effect.
Result 1: Untangled pull requests (treatment) lead to fewer false positives, with a statistically significant, medium-size effect.
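As a side note on the analysis above, the following minimal sketch shows how Cliff's delta can be computed for two independent samples and interpreted with the conventional magnitude thresholds of Romano et al.; it is not the authors' analysis script, and the sample data are made up for illustration only.

```java
// Minimal sketch of Cliff's delta for two independent samples (e.g., false
// positives per subject in the control vs. treatment group). The data below
// are made up for illustration and are not the experiment's actual results.
public class CliffsDelta {

    // delta = (#pairs where x > y - #pairs where x < y) / (n * m)
    static double cliffsDelta(double[] x, double[] y) {
        int greater = 0, smaller = 0;
        for (double xi : x) {
            for (double yj : y) {
                if (xi > yj) greater++;
                else if (xi < yj) smaller++;
            }
        }
        return (greater - smaller) / (double) (x.length * y.length);
    }

    // Conventional magnitude thresholds (Romano et al.): |d| < 0.147 negligible,
    // < 0.33 small, < 0.474 medium, otherwise large.
    static String magnitude(double d) {
        double a = Math.abs(d);
        if (a < 0.147) return "negligible";
        if (a < 0.33) return "small";
        if (a < 0.474) return "medium";
        return "large";
    }

    public static void main(String[] args) {
        double[] control = {2, 1, 3, 2, 1};   // hypothetical counts, tangled PR
        double[] treatment = {1, 0, 1, 2, 0}; // hypothetical counts, untangled PRs
        double d = cliffsDelta(control, treatment);
        System.out.printf("Cliff's delta = %.2f (%s)%n", d, magnitude(d));
    }
}
```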
Regarding suggested improvements, the control group wrote seven in total, while the participants in the treatment group wrote nineteen. This difference is interesting, calling for further classification of the suggestions. For the control group, participants wrote three improvements regarding code readability, two concerning functional checks, one regarding understanding of the source code, and one regarding other code issues. For the treatment group, we classified five suggestions for code readability, eight for functional checks, and seven for maintainability. Although subjects were explicitly given the goal of finding and commenting exclusively on functional issues (Section 4.6), they wrote these suggestions spontaneously. The suggested improvements are included in the online appendix [12], along with their classification.

RQ2. Review time and defect lifetime
To answer RQ2, we measured and analyzed the time subjects took to review the pull requests, as well as the amount of time they used to address each of the issues present. Descriptive statistics about the results for our second research question are shown in Table 3. It contains data about the time participants used to review the patch, complemented by measurements of how long they took to address two of the three issues present in the changeset. All measures are in seconds. We exclude data relative to the third defect as only one participant detected it. To analyze these data, we used the same statistical means described for the previous research question.
When computing the net review time used by the subjects, results show no significant difference, thus we are not able to reject null-hypothesis H0t1. This indicates that, in the average case, the treatment group takes the same time to deliver the review, despite having two pull requests to deal with instead of one. Analyzing the results regarding defect lifetime, we also see no significant difference and cannot reject H0t2. The data show that the mean time to address the first issue is about 14% lower in the treatment group compared with the control. This is because subjects have to deal with less code concerning a single concept, rather than having to extrapolate context information from a tangled change. At the same time, the treatment group takes longer (median) to address the second defect. We believe that this is due to the presence of two pull requests, and the context switch thus has an overhead effect: subjects had to close one pull request and then review the next, where they need to gain knowledge of different code changes.
Result 2: Our experiment was not able to provide evidence for a difference in net review time between the untangled pull requests (treatment) and the tangled one (control), despite the additional overhead of dealing with two separate pull requests in the treatment group.

RQ3. Understanding the Change's Rationale
For our third research question, we seek to measure whether subjects' understanding of the rationale of the change is affected by the independent variable. Rationale understanding questions are Q1 to Q4 (Table 4), and Figure 2 reports the results. Higher scores for Q1, Q2, and Q4 mean better understanding, whereas for Q3 a lower score signifies a correct understanding. As for the previous research questions, we test our hypothesis with a non-parametric statistical test. Given the result, we cannot reject the null hypothesis H0u. Participants are in fact able to answer the questions correctly, independent of their experimental group.
After the review, our experimentation also included a final survey (Q5 to Q12 in Table 4) that participants filled in at the end. Results shown in Figure 2 indicate that subjects judge the correctness of the changeset equally (Q5), found no difficulty in understanding the changeset (Q6), agree on having understood the rationale behind the changeset (Q7), did not find the changeset hard to navigate (Q9), and believe that the changeset was comprehensible (Q11).
Result 3: Our experiment was not able to provide evidence of a difference in understanding the rationale of the changeset between the experimental groups. Subjects reviewing the untangled pull requests (treatment) recognize the benefits of untangled pull requests, as they evaluate the changeset as being (1) better divided according to a logical separation of concerns, (2) better structured, and (3) not spanning too many features.

RQ4. Tangled vs. Untangled review patterns
For our last research question, we seek to identify differences in patterns and features during review, and their association with the quantitative results. We derived such patterns from Tao et al. [37] and Sillito et al. [35]. These two studies are relevant as they investigate the role of understanding code during the software development process. Tao et al. [37] lay out a series of information needs derived from state-of-the-art research in Software Engineering, while Sillito et al. [35] focus on questions asked by professional, experienced developers while working on implementing a change. The mapping found in the screencasts is shown in Table 5. Table 6 contains the qualitative characterization, ordered by the sum of defects found. Values in each row correspond to how many times a participant in either group used that pattern to address a defect or point to a false positive.
Results indicate that pattern P1 is the one that led to most issues being addressed in the control group (eight), but at the same time it is the most imprecise one (three false positives). We stress the absence of the context-seeking concept in this pattern. Patterns P1 and P3 account for most false positives in the control group. In the treatment group, pattern P2 led to more issues being addressed (five), followed by the previously mentioned P1 (four). Analyzing the transcribed screencasts, we noted that the control group tended to review the code changes with less context exploration than the treatment group. Among the participants belonging to the treatment, we witnessed a much more structured way of conducting the review. The overall behavior is that of getting the context of the single change, looking for the files involved, called, or referenced by the changeset, in order to grasp the rationale. All of the subjects except three repeated this step multiple times to explore a chain of method calls, or to seek more context in the same file by opening it in GitHub. We consider this the main reason why untangled pull requests lead to more precise results (fewer false positives).
Result 4: Our experiment revealed that review patterns for untangled pull requests (treatment) show more context-seeking steps, in which the participants open more referenced/related classes to review the changeset.

DISCUSSION

7.1 Implications for Researchers
In past studies, researchers found that developers call for tool and research support for decomposing a composite change [37]. For this reason, we were surprised that our experiment was not able to highlight differences in terms of reviewers' effectiveness (number of defects found) and reviewers' understanding of the change rationale, when the subjects were presented with smaller, self-contained changes.
If we exclude latent problems with the experiment design that we did not account for, this result may indicate that reviewers are still able to conduct their work properly, even when presented with tangled changes. However, the results may change in different contexts. For example, the cognitive load for reviewers may be higher with tangled changes, thus the negative effects in terms of effectiveness could become visible when a reviewer has to assess a large number of changes every day, as happens with integrators of popular projects on GitHub [15]. Moreover, the changes we considered are of average size and difficulty, yet results may be impacted by larger changes and/or more complex tasks. Finally, participants were not core developers of the considered software system; it is possible that core developers would be more surprised by tangled changes, find them more convoluted or less "natural," and thus reject them [16]. We did not investigate these scenarios further, but studies can be designed and carried out to determine whether and how these aspects influence the results of the code review effort.
Given the remarks and comments of professional developers on tangled changes [37], we also found it surprising that the experiment did not highlight any differences in the net review time between the two experimental groups. Barring experimental design issues, this result can be explained by the additional context switch, which does not happen in the tangled pull request (control) because the changes are done in the same files. An alternative explanation could be that the reviewers with the untangled pull requests (treatment) spend more time "wandering around" and pinpointing small issues because they found the important defects more quickly; this would be in line with the cognitive bias known as Parkinson's Law [27] (all the available time is consumed). However, the time to find the first and second defects (Table 3) is the same for both experimental groups, thus voiding this hypothesis. Moreover, similarly to us, Tao and Kim also did not find a difference with respect to time to completion in their preliminary user study [38]. Further studies should be designed to replicate our experiment and, if results are confirmed, to derive a theory on why there is no reduction in review time.
Our initial hypothesis on why time does not decrease with untangled code changes is that reviewers of untangled changes (treatment) may be more willing to build a more appropriate context for the change. This behavior seems to be backed up by our qualitative analysis (Section 6), through the context-seeking actions that we witnessed for the treatment group. If our hypothesis is not refuted by further research, this could indicate that untangled changes may lead to a more thorough low-level understanding of the codebase. Although we did not measure this in the current study, it may explain why we see fewer false positives with untangled changes. Finally, untangled changes may lead to better transfer of code knowledge, one of the positive effects of code review [1].

Recommendation for Practitioners
Our experiment did not show any negative effects when changes are presented as separate, untangled changesets, despite the fact that reviewers have to deal with two pull requests instead of one, with the consequent added overhead and a more prominent context switch. With untangled changesets, our experiment highlighted an increased number of suggested improvements, more context-seeking actions (which, it is reasonable to assume, increase the knowledge transfer created by the review), and a lower number of wrongly reported issues.
For the aforementioned reasons, we support the recommendation that change authors prepare self-contained, untangled changesets when they need a review. In fact, untangled changesets are not detrimental to code review (despite the overhead of having more pull requests to review), and we found evidence of positive effects. We expect the untangling of code changes to be minimal in terms of cognitive effort and time for the author. This practice, in fact, is supported by the answers to Q8, Q10, and Q12 of the questionnaire and by comments written by reviewers in the control group (e.g., "Please make different commit for these two features", "I would prefer having two pull requests instead of one if you are fixing two issues").

CONCLUSION
The goal of the study presented in this paper is to investigate the effects of change-decomposition on modern code review [10], particularly in the context of the pull-based development model [14].
We involved 28 subjects, who performed a review of pull request(s) pertaining to (1) a refactoring and (2) the addition of a new feature in a Java system. The control group received a single pull request with both changes tangled together, while the treatment group received two pull requests (one per type of change). We compared control and treatment groups in terms of effectiveness (number of defects found), number of false positives (wrongly reported issues), number of suggested improvements, time to complete the review(s), and level of understanding of the rationale of the change. Our investigation also involved a qualitative analysis of the reviews performed by the subjects involved in our study.
Our results suggest that untangled changes (treatment group) lead to: (1) fewer wrongly reported defects (false positives), (2) more suggested improvements for the changeset, (3) the same time to review (despite the overhead of two different pull requests), (4) the same level of understanding of the rationale behind the change, and (5) more context-seeking patterns during review.
Our results support the case that committing changes belonging to different concepts separately should be an adopted best practice in contemporary software engineering. In fact, untangled changes not only reduce the noise for subsequent data analyses [17], but also support the tasks of the developers in charge of reviewing the changes by increasing context-seeking patterns.