Can We Trust Tests To Automate Dependency Updates? A Case Study of Java Projects

Developers are increasingly using services such as Dependabot to automate dependency updates. However, recent research has shown that developers perceive such services as unreliable, as they heavily rely on test coverage to detect conflicts in updates. To understand the prevalence of tests exercising dependencies, we calculate the test coverage of direct and indirect uses of dependencies in 521 well-tested Java projects. We find that tests only cover 58% of direct and 20% of transitive dependency calls. By creating 1,122,420 artificial updates with simple faults covering all dependency usages in 262 projects, we measure the effectiveness of test suites in detecting semantic faults in dependencies; we find that tests can only detect 47% of direct and 35% of indirect artificial faults on average. To increase reliability, we investigate the use of change impact analysis as a means of reducing false negatives; on average, our tool can uncover 74% of injected faults in direct dependencies and 64% for transitive dependencies, nearly two times more than test suites. We then apply our tool in 22 real-world dependency updates, where it identifies three semantically conflicting cases and five cases of unused dependencies. Our findings indicate that the combination of static and dynamic analysis should be a requirement for future dependency updating systems.


Introduction
Modern package managers facilitate reuse of open source software libraries by enabling applications to declare them as versioned dependencies. Crucially, when a new version of a dependency is released, package managers will automatically make it available to the client application. This mechanism helps projects stay up-to-date with upstream developments, such as performance improvements or bug fixes, with minimal fuss. Typically, package managers implement a set of interval operators (dependency version ranges) on top of the SemVer protocol [1] that developers use to declare update constraints. For example, a dependency declared with the range >= 1.0.0 < 1.5.0 restricts updates to backward-compatible changes up to, but not including, 1.5.0. On the other hand, >= 1.0.0 welcomes automatic updates to all new releases starting from 1.0.0. Given a new library release with version 1.5.0, the latter constraint will allow an update but the former will not.
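The range example above can be made concrete with a minimal sketch. It assumes plain three-part versions and half-open ranges; the class SemVerSketch and the method allowsUpdate are hypothetical names introduced for illustration and do not correspond to the resolver of any particular package manager.

```java
/** Hypothetical three-part version (MAJOR.MINOR.PATCH); not a real package manager API. */
final class SemVerSketch implements Comparable<SemVerSketch> {
    final int major;
    final int minor;
    final int patch;

    SemVerSketch(int major, int minor, int patch) {
        this.major = major;
        this.minor = minor;
        this.patch = patch;
    }

    /** Parses "1.5.0" style strings; pre-release tags are out of scope for this sketch. */
    static SemVerSketch parse(String text) {
        String[] parts = text.trim().split("\\.");
        if (parts.length != 3) {
            throw new IllegalArgumentException("expected MAJOR.MINOR.PATCH but got: " + text);
        }
        return new SemVerSketch(
                Integer.parseInt(parts[0]), Integer.parseInt(parts[1]), Integer.parseInt(parts[2]));
    }

    @Override
    public int compareTo(SemVerSketch other) {
        if (major != other.major) return Integer.compare(major, other.major);
        if (minor != other.minor) return Integer.compare(minor, other.minor);
        return Integer.compare(patch, other.patch);
    }

    /** True if lower <= candidate < upper; a null upper bound means the range is open-ended. */
    static boolean allowsUpdate(String lowerInclusive, String upperExclusive, String candidate) {
        SemVerSketch version = parse(candidate);
        boolean aboveLower = version.compareTo(parse(lowerInclusive)) >= 0;
        boolean belowUpper = upperExclusive == null || version.compareTo(parse(upperExclusive)) < 0;
        return aboveLower && belowUpper;
    }

    public static void main(String[] args) {
        System.out.println(allowsUpdate("1.0.0", "1.5.0", "1.5.0")); // prints false
        System.out.println(allowsUpdate("1.0.0", null, "1.5.0"));    // prints true
    }
}
```

The bounded range rejects release 1.5.0 while the open-ended range accepts it, which matches the behavior described in the text.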
In practice, most package managers use a liberally interpreted version of the SemVer protocol with no vetting, allowing library maintainers to release new changes based on their own interpretation of backward compatibility [1,2]. As a consequence, client programs may unexpectedly discover regression-inducing changes, such as bugs or semantic changes that break code contracts. Discovering, debugging, and resolving such issues, as exemplified in Figure 1, remains a challenging task for development teams [2]. In fact, unexpected regressions are one of the main reasons that deter developers from upgrading dependencies to new versions [3].
Developers can mitigate the risk of integration errors by either using restrictive strategies, such as version locking, or permissive strategies involving dependency update tooling. Version locking effectively makes the dependency tree of client programs immutable and disables automated updates. This strategy offers maximum stability but is prone to incurring technical debt due to outdated dependencies. Moreover, developers need to manually discover and apply security hotfixes. On the other hand, dependency update checkers analyze version compatibility before deciding to update. There are two main techniques for deciding version compatibility: breaking change detection [4,5,6] and regression testing [7,8]. Detecting potential breaking changes (i.e., backward API incompatibilities) prevents client programs from updating to versions that will result in compile failures. A major shortcoming of this technique is that it depends on compilation and the existence of a static type system; many of today's most popular languages are dynamically typed. A more popular option among developers is the use of services providing automated dependency updating, such as greenkeeper.io [9], Dependabot [10], and renovate [11], which use project test suites to detect regression changes on every new update.
The effectiveness of such services depends highly on the quality of end users' test suites [12]. Poor test coverage of dependency usage in client code can lead to missing update-induced regressions. Recent studies [13,14] suggest that high statement coverage in test suites does not guarantee that regressions in code changes are found. Failing to detect regressions stemming from updates can have dire consequences for client programs: for example, users dependent on npm's event-stream package did not notice a malicious maintainer planting a hidden backdoor for stealing bitcoin wallets inside the library's source code [15]. Moreover, a recent qualitative study [16] also revealed that developers are generally suspicious of automatically updating their dependencies. One of the prime reasons is that developers perceive their tests as unreliable. To reduce the number of false negative updates, we develop a static change impact analysis for dependency updates called Uppdatera. By statically identifying changed functions and approximating call relationships between an application and its dependencies, change impact analysis can fill in gaps where test suites have limited coverage or cannot reach.
In this paper, we set out to empirically understand how reliable developer tests are in automated dependency updating by addressing the following research questions:
• RQ1: Do tests cover uses of third-party libraries in projects?
• RQ2: How effective are project test suites and change impact analysis in detecting semantic changes in third-party library updates?
• RQ3: How useful is static analysis in complementing tests for compatibility checking of new library versions?
To study the prevalence of tests exercising dependencies in projects, we first establish all uses of library functionality from direct and transitive dependencies in 521 well-tested projects and then measure how much test suites cover those usages. By systematically mutating dependency uses in 262 projects, we then conduct a comparative study on the adequacy of test suites and change impact analysis in detecting artificial updates with simple faults. To understand the strengths and weaknesses of using static analysis as a complement to tests in a practical setting, we evaluate the performance of test suites and Uppdatera on 22 newly created pull requests that update dependencies.
Our results indicate that tests lack considerable coverage of function calls in projects that target library dependencies; average coverage is 58% for direct dependencies and 20% for transitive dependencies. Similarly, the average effectiveness of test suites is 47% for direct dependencies and 35% for transitive dependencies. When using change impact analysis, the average effectiveness increases to 74% for direct dependencies and 64% for transitive dependencies, suggesting that static analysis can close coverage gaps left by tests. Through our manual analysis, we identify that Uppdatera is more effective in avoiding faulty updates than tests. However, it is prone to false positives due to difficulties in evaluating over-approximated execution paths.
Our findings raise awareness of the risks involved with automated dependency updating. Tool creators should consider reporting how adequately project tests exercise changed functionality in libraries under update. In future updating systems, tool creators should investigate hybrid workflows to complement gaps in regression testing with static analysis and help developers with prioritizing testing efforts.

Package Managers
Package managers such as Java's Maven, JavaScript's npm, or Rust's Cargo provide tooling to simplify the complexities of maintaining, distributing, and importing external third-party software libraries in development projects. As a community service to their users, package managers also host a public Online Package Repository (OPR) where developers can freely contribute new packages (e.g., a database driver) or build upon existing packages (e.g., use a parser library to build a JSON parser). This helps package manager users reduce development effort by benefitting from existing functionality in their language environments. In a nutshell, a package is a distributable, versioned software library.
Because of the relative ease of building packages on top of each other, OPRs today grow quickly and become ever more inter-dependent [17,18]. As a consequence, package manager users experience a dynamic growth of new hidden dependency imports in their projects and frequent dependency updates that increase the risk of build failures due to breaking backward compatibility [19,20,21,2]. The risk of breaking backward compatibility varies between OPRs: npm and Maven Central move the burden of checking incompatible changes onto their users, while R/CRAN minimizes this risk by requiring a mutual change-cost negotiation between library maintainers and their users [2]. Users of OPRs such as npm or Maven either use additional tooling or disable dependency updates through version locking as a protective measure. Version-locking dependencies guarantees a stable build environment. Additional tooling provides an extra layer of control by scanning dependencies for vulnerabilities [20], freshness [22], or update compatibility [4,6].

Safe Backward Compatible Updates
Update checkers such as cargo-crusador [23], JAPICC [24], and dont-break [25] typically determine backward compatibility by ensuring that the new version is consistent with the public API contract of the old version. Removals or changes in method signatures, access modifiers, and types (e.g., classes and interfaces) are examples of inconsistencies that can lead to compile failures in client code [26,27].
Checking dependency updates for API inconsistencies is a necessary precondition to a safe update, but not a sufficient one. From Listing 1, we consider an additional class of changes, semantic changes, that are API-compatible (i.e., respect the public API contract) but introduce incompatible behavior (i.e., regression changes) for clients after dependency updates. The code example illustrates a client that depends on p2, which in turn depends on p1. There are two changes in p1 that are not semantics-preserving: a() returns 1 instead of 0 (lines 27-28) and v(int a) compares variable a with a different comparison operator (lines 30-31). On the other hand, the change in p2 is semantics-preserving: z() still returns false despite its body being replaced with a method call to make false (lines 18-21). Consider a scenario in which client automatically updates to the next release of p2, and p2 updates to the next release of p1. The changes made in p1 will indirectly impact the behavior of client despite seeming hidden and distant. The change in a() of p1 causes b() in p2 to match the if statement on line 14 and return 0 instead of adding x and y (line 15). This further propagates to the client where b() is called. Similarly, the change in v() flips the condition to false instead of true in p2, which results in skipping y+2 at line 12. These two code changes illustrate how the client behavior or the execution flow is not preserved after updating to a newer version.
Unlike breaking API contracts, semantic changes are not inherently bad: the refactoring of z() in p2 introduces a new execution path (e.g., new behavior) to make false, yet z() continues to return false after the change. Source code changes that preserve the same behavior before and after an update are semantic backward compatible changes. Deciding semantic backward compatibility is also a contextual problem. Consider another client, client2, that uses the same dependency p2 as client but does not call b() and z() (line 4). The same update we illustrate for client is semantic backward compatible for client2, as it functions the same way before and after the update.
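To make the notion of an API-compatible but behavior-changing update tangible, the following is a minimal sketch loosely modeled on the description above. It is not a reproduction of Listing 1: the interface ThresholdLib, its two implementations, and both client methods are hypothetical names chosen for illustration.

```java
/** Hypothetical library API; both versions expose exactly the same signature. */
interface ThresholdLib {
    int a();
}

/** Old release: a() returns 0. */
final class ThresholdLibV1 implements ThresholdLib {
    public int a() { return 0; }
}

/** New release: same signature, but a() now returns 1 (a semantic change). */
final class ThresholdLibV2 implements ThresholdLib {
    public int a() { return 1; }
}

final class SemanticChangeSketch {
    /** Client code whose result depends on a(); it compiles against both versions. */
    static int clientUsingA(ThresholdLib lib, int x, int y) {
        if (lib.a() == 1) {
            return 0;      // branch only taken with the new release
        }
        return x + y;      // behavior the client relied on with the old release
    }

    /** A second client that never calls a(); the update cannot affect it. */
    static int clientNotUsingA(ThresholdLib lib, int x, int y) {
        return x * y;
    }

    public static void main(String[] args) {
        System.out.println(clientUsingA(new ThresholdLibV1(), 2, 3));    // prints 5
        System.out.println(clientUsingA(new ThresholdLibV2(), 2, 3));    // prints 0
        System.out.println(clientNotUsingA(new ThresholdLibV1(), 2, 3)); // prints 6
        System.out.println(clientNotUsingA(new ThresholdLibV2(), 2, 3)); // prints 6
    }
}
```

Both releases compile against the same client code, so an API-level checker would accept the update. Only the first client observes a different result, which illustrates why safety has to be judged per client.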
Following the observations in Listing 1, we define a semantic backward compatible update, or safe update, as follows. We denote $Lib_1, Lib_2 \in Library$ as two versions of the same library and a client $C$ with dependency tree $T_C = (V, E)$, where $V$ is a set of resolved versioned libraries used by $C$, and $E$ is the directed dependence between them. Let $PDG_{T_C}$ represent a sound program-dependence graph [28] true where $i$ varies from 1 to $n$ and $n$ is the cardinality of set $D$.

Research Questions
The goal of this paper is to understand how reliable test suites are as a means to evaluate the compatibility of updated library versions in projects. To that end, we study a large number of test suites from Maven-based Java projects that depend on external libraries.
Bogart et al. [2] report that developers create strategies to select high-quality libraries based on signals such as active contributors, project history, and personal trust in project maintainers to reduce the exposure to unwanted changes. Thus, in our first research question, we investigate whether testing of third-party libraries is prevalent and a strategy to minimize the risk of breaking changes: RQ1: Do test suites cover the uses of third-party libraries in projects?
Mirhosseini et al.'s [16] qualitative study suggests that developers have trust issues with automated updates and perceive tests as unreliable. A compelling complement to evaluate the effect of dependency changes is the use of change impact analysis. We set out to measure how well both test suites and change impact analysis catch simple semantic faults in both direct and indirect uses of third-party libraries: RQ2: How effective are project test suites and change impact analysis in detecting semantic changes in third-party library updates? While static analysis can yield higher coverage, it is also more prone to false positives by classifying safe updates as unsafe. To understand the strengths and weaknesses of static analysis in a practical environment, we ask: RQ3: How useful is static analysis in complementing tests for compatibility checking of new library versions?
We extract a set of real-world update cases from pull requests generated by the popular service Dependabot and manually investigate the correctness of each pull request. Then, we analyze each pull request using change impact analysis to compare the results with the test suite and our ground truth.

Research Method
We follow the study design depicted in Figure 2 to evaluate the reliability of test suites for automated dependency updating and the potential of using static analysis. First, we select Java repositories from GitHub with high-quality assurance badges and at least one test class (step 1). Then, we build each repository to infer a complete dependency tree of the project along with its source and test classes (step 2). Second, we feed the source classes together with the dependencies of a project to the call extractor and statically extract all its direct and indirect uses of third-party libraries (step 3). Third, we use instrumentation to learn all invocations from a project to its dependencies via its test suite (step 4). Then, we use the information from the previous step to calculate the dependency coverage of a project. Fourth, we generate mutations of dependencies by inserting simple faults (see Table 1) in dependency functions executed by tests. Here, we use dynamic call paths (from step 4) to identify such functions (step 5). We can then run both the test suite and the change impact analysis to measure the detection score (step 6). Finally, we harvest Dependabot pull requests in real time and then manually evaluate how both test suites and change impact analysis perform in practice (step 7).

Identifying Usages of Third-Party Libraries
We refer to the use of third-party libraries as using functionality from externally developed libraries in software projects. Specifically, we focus on functionality exposed as functions in libraries, as they are among the most widespread forms of code reuse. Thus, we consider a function call from a project to a library dependency as a third-party library use. As projects depend on an ordered tree of library dependencies, there are both implicit and explicit third-party uses. An explicit use is a direct function call between a project and one of its declared libraries. On the other hand, an implicit use is when a function in a project transitively calls underlying libraries in a dependency tree. Consider the following example scenario: project A depends on library B, and library B depends on library C. If there is a function call path from a function a() in A to a function c() in C via called functions in B, project A is implicitly using functionality in library C.
To identify explicit use of third-party libraries, we statically extract all function calls to functions that are part of neither the project under analysis nor the Java standard library. By deduction, all such method invocations represent calls to third-party libraries. For implicit use of third-party libraries, we statically derive call graphs capturing call paths between a project and its dependency tree, similar to Ponta et al. [29]. Finally, we prune functions and call sequences belonging to the Java standard library to derive a graph representing interactions between a project and its transitive dependencies.
To measure dependency coverage in a project in RQ1, we use instrumentation to record all project invocations to third-party libraries during test suite execution. Using the recorded set, we calculate dependency coverage as the proportion of statically inferred (declared) functions that are covered by the set of functions recorded during test execution, where $\text{Recorded functions} \subset \text{Declared functions}$:

$$\text{Dependency coverage} = \frac{|\text{Recorded functions}|}{|\text{Declared functions}|}$$

Effectively, dependency coverage is function coverage [30], but only restricted to dependency calls.
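As a minimal sketch of this ratio, assuming functions are identified by plain strings, the computation could look as follows. The class DependencyCoverageSketch and the function identifiers are hypothetical and are not taken from our pipeline.

```java
import java.util.HashSet;
import java.util.Set;

final class DependencyCoverageSketch {
    /**
     * Fraction of statically declared dependency functions that were also
     * recorded during test execution.
     */
    static double dependencyCoverage(Set<String> declared, Set<String> recorded) {
        if (declared.isEmpty()) {
            throw new IllegalArgumentException("no declared dependency functions");
        }
        Set<String> covered = new HashSet<>(recorded);
        // Assumption of this sketch: ignore recorded calls that the static extraction missed.
        covered.retainAll(declared);
        return (double) covered.size() / declared.size();
    }

    public static void main(String[] args) {
        // Hypothetical function identifiers.
        Set<String> declared = Set.of(
                "lib.Parser.parse", "lib.Parser.close", "lib.Util.hash", "lib.Util.join");
        Set<String> recorded = Set.of("lib.Parser.parse", "lib.Util.hash");
        System.out.println(dependencyCoverage(declared, recorded)); // prints 0.5
    }
}
```

Intersecting the recorded set with the declared set keeps the ratio at or below 1 even if the instrumentation observes calls that the static extraction did not infer.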

Heuristics for Static Impact Analysis
The central task of automated dependency updating is to facilitate the continuous integration of new compatible library versions with minimal developer intervention. Unlike static analysis, which may produce false warnings [31], automated updating suffers instead from false negatives. A faulty update has a potentially high maintenance penalty if merged into the project and could cascade into breaking the build of externally depending projects.
As a step towards reducing false negatives, we investigate change impact analysis as a means to reduce coverage gaps in dependencies that tests are not able to reach. Change impact analysis estimates the reach and fraction of affected execution paths in a program given a set of code changes [32]. While there are advancements towards inference of semantic changes in static analysis, such as data flow analysis with equivalence relations [33] and mining techniques [34], precise static interpretation of semantic changes such as faulty updates is an undecidable problem [35]. Moreover, most of these techniques only analyze method bodies and are thus not practical for inter-procedural analysis of projects and their dependencies.
Without the possibility to precisely determine whether an update is faulty or not, we approximate a faulty update (or semantic change) as a change to the execution flow of a project. We use control flow graphs (CFGs) [36] to represent all possible execution paths of functions. There are two types of statements in CFG terminology that affect the execution flow of a program, namely control and write statements [37]. A change to a write statement can affect the program state (i.e., assign a value to a variable). A change to a control statement changes the program counter (i.e., determines which statement is executed next). Because read statements read the program state, changes to these two kinds of statements passively impact them. Thus, we derive the following heuristics to classify unsafe updates: The definition is an over-approximation; code changes such as refactorings are also classified as unsafe updates whenever the affected functions are reachable. Whereas services such as Dependabot present only the outcome of test results and a changelog between the old and new version of a library, change impact analysis pinpoints the affected execution paths in an update. Such information helps project maintainers prioritize testing efforts or determine the potential risk of the update.

Creating Unsafe Updates in Project Dependencies
For seamless integration, it is important for automated dependency updating to detect incompatibilities that arise when updating a library dependency. By using mutation analysis to seed artificial faults in all uses of third-party libraries in a project, we can derive an adequacy test for detecting incompatibilities in automated dependency updates. We first dynamically extract the set of called third-party functions in a project and then apply the mutation operators defined in Table 1 to construct a set of artificial updates that act as false-negative cases. As static analysis can over-approximate execution paths (i.e., risk creating false-positive cases), we resort to dynamic analysis to ensure that we mutate only functions that are truly invoked. For the selection of mutation operators, we choose operators common in mutation testing studies [38,39] that focus on simple logical flaws and exclude mutation operators with a limited effect, such as deleting statements [39].
In comparison to using actual update cases, the mutation setup provides a systematic way to introduce simple faults in all uses of third-party libraries in a project to measure the effectiveness of detecting faulty updates. Manually curated false-negative cases of dependency updates are limited to specific project-library pairs and may not generalize to other projects that use the same library. Moreover, finding such pairs for all libraries in a project to create an overall assessment may not be possible in most projects.
For RQ2, we define the mutation detection score for dependencies (an adaptation of the mutation score [40]) as a tool's ability to detect mutated reachable dependency functions (mutants):

$$\text{Mutation detection score} = \frac{|\text{Detected mutants}|}{|\text{All mutants}|}$$
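The following minimal sketch illustrates how such a score could be computed for a single dependency function. The original predicate, the four relational-operator mutants, and the two-assertion test suite are all hypothetical; they are not the operators of Table 1 and not our experimental pipeline.

```java
import java.util.List;
import java.util.function.IntPredicate;

final class MutationScoreSketch {
    /** Hypothetical original dependency function: true when a is below 10. */
    static final IntPredicate ORIGINAL = a -> a < 10;

    /** Hypothetical mutants, each replacing the relational operator. */
    static final List<IntPredicate> MUTANTS = List.of(
            a -> a <= 10,   // boundary fault
            a -> a > 10,
            a -> a >= 10,
            a -> a == 10);

    /** A tiny stand-in for a client test suite with two assertions. */
    static boolean testSuitePasses(IntPredicate function) {
        return function.test(5) && !function.test(20);
    }

    /** Detected mutants divided by all mutants. */
    static double detectionScore(List<IntPredicate> mutants) {
        long detected = mutants.stream().filter(m -> !testSuitePasses(m)).count();
        return (double) detected / mutants.size();
    }

    public static void main(String[] args) {
        System.out.println(testSuitePasses(ORIGINAL)); // prints true
        System.out.println(detectionScore(MUTANTS));   // prints 0.75
    }
}
```

The boundary mutant survives because no test input probes the value 10, which is the kind of gap that a low detection score exposes.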

Manual Analysis of Pull Requests
As the artificially created updates address only false negatives, we also need to understand how static analysis performs in practice. Thus, we manually analyze the applicability of static analysis using pull requests through a lightweight code review. Due to the absence of an established ground truth or benchmark, we resort to manually creating a ground truth of libraries under update. As understanding the use context of a project-library pair is challenging, we also attempt to corroborate our findings by posting our assessment as pull request comments. Below, we define our setup for the manual analysis.
Selection criteria. We select pull requests generated by the popular service Dependabot on GitHub, which supports automated updates of Java projects using the Maven build system. To select significant, high-impact projects and increase the chance of a response from a project maintainer, we harvest newly created pull requests using GHTorrent's event stream [41] and adopt the following filter criteria: (1) high stargazer, watcher, or fork counts indicate popularity, (2) no passive users indicates projects that assign reviewers and frequently merge Dependabot pull requests, (3) dependency type indicates that we only consider Maven compile and runtime dependencies, and (4) project buildability indicates that we can compile the project out of the box.
Code review protocol. After a pull request meets the selection criteria, we first inspect the diff in the pull request to identify the old and the new version number of the library under update. Then, we download the source jars of the old and new versions from Maven Central and use a diffing tool to localize the set of changes. By reviewing the change location, consulting the changelog, and inspecting the tests of the library, we classify the nature of a change as refactoring, structural (i.e., breaking change), or behavioral (i.e., semantic change). Next, we check out the project at the commit described in the pull request and manually localize uses of the library by first performing a keyword search for import statements leading to the library under update. Then, we track the data and control flow of imported items (e.g., object instantiations, function invocations, and interface implementations) to map out how the project uses the library under update. If the library under update is a transitive dependency, we first trace how the project uses its direct dependency and then how the used subset of the direct dependency uses the transitive dependency. After mapping out uses of the library under analysis, we can then establish whether a project directly or indirectly uses any of the changed classes and function signatures identified in the diff and whether those changes make the update safe or not. If the changes do not alter the logic of the project (e.g., refactorings), we consider the update safe. Whether a change is safe can be highly contextual and yield different outcomes, as exemplified in the following: the changed function foo(x) adds a new IF-statement with the condition x > 50 that breaks the original functionality. Project A uses foo(x) indirectly, and through the manual analysis (including inspection of its tests), we can establish that x < 20 holds in all cases, and thus the update is safe to make. On the other hand, project B has a public function bar(x) that passes x in a function call to foo(x). Here, we cannot assume anything about x, as users of B could call bar(x) with any x. In this case, we consider the update unsafe.
After manually evaluating pull requests, we classify them using one of three categories:
• Safe: the update is safe to perform and will not negatively impact the functionality of the project.
• Unsafe: the update is risky and could lead to potential unexpected runtime changes.
• Unused: the update of an unused dependency (i.e., it is only declared in the project but not used).
We then compare the outcome of the update tooling with the classification above and consider the following:
• False Negative (FN) when classifying an unsafe update as safe.
• False Positive (FP) when classifying a safe update as unsafe or falsely updating an unused dependency.
• True Positive (TP) when both our manual classification and the update tooling reach the same conclusion.
• True Negative (TN) when not creating an update for an unused dependency.
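One possible reading of these four outcome definitions is sketched below. The enums and the method compare are hypothetical names. The sketch additionally assumes that the tooling reports one of the same three verdicts as the manual review, where a tool verdict of UNUSED means that the tool creates no update for the dependency. Combinations that the definitions above do not cover are rejected rather than guessed.

```java
final class UpdateOutcomeSketch {
    enum Verdict { SAFE, UNSAFE, UNUSED }

    enum Outcome { TRUE_POSITIVE, TRUE_NEGATIVE, FALSE_POSITIVE, FALSE_NEGATIVE }

    /** Compares the manual ground truth with the verdict of the update tooling. */
    static Outcome compare(Verdict manual, Verdict tool) {
        if (manual == Verdict.UNSAFE && tool == Verdict.SAFE) {
            return Outcome.FALSE_NEGATIVE;  // unsafe update classified as safe
        }
        if (manual == Verdict.SAFE && tool == Verdict.UNSAFE) {
            return Outcome.FALSE_POSITIVE;  // safe update classified as unsafe
        }
        if (manual == Verdict.UNUSED && tool == Verdict.SAFE) {
            return Outcome.FALSE_POSITIVE;  // unused dependency updated anyway
        }
        if (manual == Verdict.UNUSED && tool == Verdict.UNUSED) {
            return Outcome.TRUE_NEGATIVE;   // no update created for an unused dependency
        }
        if (manual == tool) {
            return Outcome.TRUE_POSITIVE;   // both reach the same conclusion
        }
        // Assumption of this sketch: remaining combinations are left undefined.
        throw new IllegalArgumentException(
                "combination not covered by the definitions: " + manual + " / " + tool);
    }

    public static void main(String[] args) {
        System.out.println(compare(Verdict.UNSAFE, Verdict.SAFE));   // prints FALSE_NEGATIVE
        System.out.println(compare(Verdict.UNUSED, Verdict.SAFE));   // prints FALSE_POSITIVE
        System.out.println(compare(Verdict.UNSAFE, Verdict.UNSAFE)); // prints TRUE_POSITIVE
    }
}
```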

Dataset Construction
We sample 1,823 repositories from GitHub that have Java as the primary language, Maven as the primary build system, and at least one high-quality assurance badge (i.e., Travis CI, CodeClimate, coveralls, and CodeCov) as a signal for having tests [42]. Services such as Dependabot can update dependencies in projects as long as there is a valid pom.xml file. Next, we build and then dry-run projects on both the instrumentation and mutation pipeline to eliminate incompatible projects. In total, there are 818 repositories that compile to Java 8 bytecode and have at least one compiled test class. Out of the 818 built projects, 521 projects successfully run the instrumentation pipeline, and a subset of 262 projects are compatible with the mutation pipeline. The number of projects in the mutation pipeline is nearly double that of a recent study [43]. Table 2 presents descriptive statistics on four aggregated variables for projects belonging to the instrumentation pipeline. The median number of declared methods is 210 (mean: 668) with a heavily positively skewed distribution. 75% of all projects in our sample have 588 or fewer declared methods, with 36 projects having more than 1400 methods. The largest project is oracle/oci-java-sdk with 22,264 methods. As per Section 4.1, we measure test coverage of all function calls made in a project. We can observe that the test coverage is generally high: half of the projects have coverage of 67% or more. For the number of dependencies, we can observe that the distribution does not drastically change: the median changes from 7 to 16, indicating a small expansion of transitive dependencies. Overall, our dataset represents mid-sized projects that use a significant number of dependencies with varying test coverage.

Implementation
We discuss the implementation of Uppdatera, a tool for performing change impact analysis of library dependencies in Maven, and the pipeline that runs our experiments. We have open-sourced the tooling and Docker images for automation and reproducibility of our study (see Section 6.3).

Uppdatera
Given a request to update a dependency to a new version in a pom.xml file, Uppdatera first performs AST differencing of the current and new version of the dependency to identify a list of functions with potential behavioral changes using SpoonLabs/GumTree [44]. Then, Uppdatera computes a call graph inferring all control-flow paths between client and dependency functions, following the approach of Ponta et al. [29] for call graph construction (using WALA). Finally, Uppdatera performs a reachability analysis using the list of possible behavioral changes on the call graph to find reachable paths to the client code. Figure 3 demonstrates an example of using Uppdatera for updating the library io.reactivex:rxjava from version 1.3.4 to 1.3.8 in opentracing-contrib/java-rxjava. The report features a call stack to the changed function along with a set of AST diffs. In this particular case, the onError() function in the class TracingSubscriber transitively calls getPluginImplementationViaProperty() in the dependency class RxJavaPlugins. The addition of a try-catch block in the function takes care of unhandled exceptions that may have been handled by clients in previous versions (i.e., a potential regression change). In the following, we motivate our implementation choices for a change impact analysis tool designed for updating library dependencies.
Diffing. Uppdatera performs source code differencing at the abstract syntax tree (AST) level of both the current and the new version of a dependency to identify functions with code changes. AST differencing algorithms [45,44] produce fine-grained and accurate information about the type and structure of source code changes. Following Definition 4.1, we capture AST transformations at the statement level and map the following as regression changes:
• Any method-level move operation mirrors moving a statement from line x to y.
• Deletion, update, or insertion of Expression ASTs mirrors data-flow changes.
• Deletion, update, or insertion of control struct ASTs such as IF, WHILE, and FOR mirrors control-flow changes.
• Deletion, update, or insertion of Call-Expression ASTs mirrors control-flow changes.
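The mapping in the list above can be sketched as a small classification function. The enums below are hypothetical stand-ins for the edit actions and node types that an AST differencing tool reports; they do not mirror the GumTree API. Treating all other node kinds as not flagged is an assumption of this sketch.

```java
final class EditClassifierSketch {
    enum Action { INSERT, UPDATE, DELETE, MOVE }

    enum NodeKind { EXPRESSION, CONTROL_STRUCT, CALL_EXPRESSION, OTHER }

    enum ChangeKind { STATEMENT_MOVE, DATA_FLOW, CONTROL_FLOW, NOT_FLAGGED }

    /** Maps one AST edit operation to the kind of regression change it mirrors. */
    static ChangeKind classify(Action action, NodeKind node) {
        if (action == Action.MOVE) {
            return ChangeKind.STATEMENT_MOVE;   // any method-level move operation
        }
        switch (node) {
            case EXPRESSION:
                return ChangeKind.DATA_FLOW;    // expression edits mirror data-flow changes
            case CONTROL_STRUCT:
            case CALL_EXPRESSION:
                return ChangeKind.CONTROL_FLOW; // control structs and calls mirror control-flow changes
            default:
                return ChangeKind.NOT_FLAGGED;  // assumption: other node kinds are not flagged
        }
    }

    /** A function is treated as changed if any of its edits is flagged. */
    static boolean isRegressionChange(Action action, NodeKind node) {
        return classify(action, node) != ChangeKind.NOT_FLAGGED;
    }

    public static void main(String[] args) {
        System.out.println(classify(Action.UPDATE, NodeKind.EXPRESSION));      // prints DATA_FLOW
        System.out.println(classify(Action.INSERT, NodeKind.CONTROL_STRUCT));  // prints CONTROL_FLOW
        System.out.println(classify(Action.MOVE, NodeKind.OTHER));             // prints STATEMENT_MOVE
    }
}
```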
As an alternative to AST differencing, we could consider bytecode differencing. Bytecode (e.g., LLVM's IR or JVM code) differencing computes edit scripts at the instruction level. Although this technique offers a fine-grained and compelling alternative to AST differencing, instruction-level changes can be difficult to understand for developers not familiar with low-level details.
Call Graph Construction. Uppdatera constructs a call graph capturing inter-procedural control-flow paths between client and dependency functions. Each node in the call graph represents a fully resolved function identifier and should be identical to the identifiers in the changeset of the Diffing phase.
We advocate the use of call graph algorithms that are both soundy [46] and scalable for analyzing projects in the wild as a general guideline. The call graph algorithm should support and resolve as many language features as possible. Limited support of language features could potentially leave gaps in the coverage of projects making use of unsupported features. Similar to static analyses of security applications, achieving high recall is more crucial than precision to avoid recommending faulty updates.
As recent studies [18,19] suggest that irrespective of the OPR, the majority of packages have a small number of direct dependencies, but a high and growing number of transitive dependencies.For example, 50% of all packages in Crates.iohave a dependency tree depth of at least 6 [19].Therefore, performing static analysis at the boundary of a project and its dependency tree can become computationally expensive and impractical in DevOps environments.Moreover, as Uppdatera can expect to analyze any compatible project in the wild, the algorithm should be scalable to cater large projects and cheap to construct to cut down computation time.
Finally, a potential trade-off of using call graphs instead of CFGs is the loss of analysis precision due to the absence of data-flow paths in the graph.However, taking into account program features such as aliases, arrays, structs, and class objects in dataflow analysis adds additional complexity and scalability problems when moving the analysis boundary to include project dependencies.Supporting such analysis adds extra precision but may not yield extra actionability.
Reachability Analysis.For each changed function identified in the Diffing phase, Uppdatera performs a reachability analysis on the call graph to detect paths connecting changed dependency functions to functions in the analyzed project.If Uppdatera finds such paths, it marks the update as potentially unsafe.If no such paths are found, Uppdatera marks it as a potential safe update and recommends the update to the package manager.Finally, Uppdatera also reports the impacted paths between dependencies and project functions, to inform developers of the program paths that need to be inspected in response to an update in a dependency.
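The reachability step can be illustrated with a breadth-first search over a call graph stored as a caller-to-callees map. This is a minimal sketch under stated assumptions, not Uppdatera's actual code: the graph representation, the class name `ReachabilitySketch`, and all function identifiers in the example are hypothetical.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical sketch: which client functions can reach a changed dependency function?
final class ReachabilitySketch {

    // Breadth-first search from one function; true if any changed function is reachable.
    static boolean reaches(Map<String, List<String>> callGraph, String start, Set<String> changed) {
        Deque<String> worklist = new ArrayDeque<>();
        Set<String> visited = new HashSet<>();
        worklist.add(start);
        visited.add(start);
        while (!worklist.isEmpty()) {
            String current = worklist.poll();
            if (changed.contains(current)) {
                return true;
            }
            for (String callee : callGraph.getOrDefault(current, List.of())) {
                if (visited.add(callee)) {
                    worklist.add(callee);
                }
            }
        }
        return false;
    }

    // Client functions with at least one path to a changed dependency function.
    static Set<String> impactedClientFunctions(Map<String, List<String>> callGraph,
                                               Set<String> clientFunctions,
                                               Set<String> changed) {
        Set<String> impacted = new TreeSet<>();
        for (String function : clientFunctions) {
            if (reaches(callGraph, function, changed)) {
                impacted.add(function);
            }
        }
        return impacted;
    }

    public static void main(String[] args) {
        // Illustrative identifiers only.
        Map<String, List<String>> callGraph = Map.of(
            "app.Main.run", List.of("lib.Json.parse"),
            "lib.Json.parse", List.of("lib.Json.readToken"),
            "app.Util.log", List.of("lib.Log.write"));
        Set<String> clientFunctions = Set.of("app.Main.run", "app.Util.log");
        Set<String> changed = Set.of("lib.Json.readToken");

        Set<String> impacted = impactedClientFunctions(callGraph, clientFunctions, changed);
        System.out.println(impacted.isEmpty()
            ? "potentially safe"
            : "potentially unsafe, inspect: " + impacted);
    }
}
```

An empty result corresponds to the "potentially safe" verdict described above; a non-empty result lists the client entry points whose paths a developer would need to inspect.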

Experimental Pipeline
To implement our methodology, we first develop a call extractor that records complete call sequences between a project and its library dependencies. The implementation builds on instrumenting library classes using ASM [47] and the Maven Dependency Plugin. To infer function calls to libraries from a project (RQ1), we use ASM to statically extract call sites for direct dependencies. For transitive dependencies, we generate call graphs using WALA [48] configured with the CHA algorithm. Following the comprehensive benchmark of call graph algorithms for Java by Reif et al. [49], we find that the CHA algorithm supports the most language features and has a lower runtime than more precise points-to analysis algorithms such as 0-1-CFA or N-CFA.
For RQ2, we implement the update emulation pipeline (i.e., mutation analysis) on top of PITest [50], a popular in-memory mutation testing framework that works with the test runners JUnit and TestNG. We limit mutations to the library functions identified by the call extractor and exclude experimental mutation operators that cannot guarantee non-equivalent mutations. For Uppdatera, we use Procyon [51] to decompile each mutated class into a source file for AST diffing.

Results
Here, we report the results of our research questions.

RQ1: Dependency coverage
Figure 4 presents a violin plot of dependency coverage on the left-hand side, and dependency coverage including transitive dependencies on the right-hand side. Overall, 13% (67/521) of the projects have less than 10% coverage, suggesting that, at large, a majority of projects have some tests exercising at least one dependency use. We observe that the median coverage is 58% (mean: 55%): half of the GitHub projects miss coverage of at least 42% of all dependency function calls. In practice, there is a risk that projects receiving automated dependency updates do not have tests that exercise the changes in dependencies.
The right-hand side of Figure 4 shows the dependency coverage taking into account reachable paths to transitive dependencies in projects. The distribution has a bimodal shape with two peaks, at 9% and at 52%, suggesting two classes of projects. In the first class, half of the projects have a median dependency coverage of 21% (mean: 26%), indicating that project test suites at large do not exercise dependencies in depth. This is not surprising: an ergonomic factor of third-party libraries is that they are well-tested and should in principle not need extra tests [52]. In the second class, we observe that projects have tests that exercise dependencies in depth, suggesting the presence of projects with adequate test suites. As mentioned in Section 4.1, these results are indicative, as we compare against statically inferred call paths, which, being over-approximating, may not be representative of actual calls.
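As a rough illustration of how such a coverage figure can be computed, the sketch below assumes that dependency coverage is the share of statically inferred dependency calls that also appear in the dynamic trace of a test run. The precise definition is given in Section 4.1 and may differ; the class name and call identifiers here are hypothetical.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch: coverage = |static calls also seen at run time| / |static calls|.
final class DependencyCoverageSketch {

    static double coverage(Set<String> staticCalls, Set<String> executedCalls) {
        if (staticCalls.isEmpty()) {
            return 0.0; // Assumption: a project without dependency calls has no coverage to report.
        }
        Set<String> covered = new HashSet<>(staticCalls);
        covered.retainAll(executedCalls); // keep only calls observed while running tests
        return (double) covered.size() / staticCalls.size();
    }

    public static void main(String[] args) {
        // Illustrative identifiers only.
        Set<String> staticCalls = Set.of("lib.A.read", "lib.A.write", "lib.B.parse", "lib.B.format");
        Set<String> executedCalls = Set.of("lib.A.read", "lib.B.parse", "lib.C.unrelated");
        System.out.printf("dependency coverage: %.0f%%%n", 100 * coverage(staticCalls, executedCalls));
    }
}
```

Because the static call set over-approximates actual calls, a figure computed this way is a lower bound on how well tests exercise the calls that can really occur.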
Findings from RQ1: Half of the 521 projects exercise less than 60% of all direct dependency calls from their tests; this drops to 20% if paths to transitive dependencies are considered.

RQ2: Detecting Simple faults in Dependencies
Our benchmark generated in total 1,122,420 artificial updates for 311 Maven modules belonging to 262 GitHub projects. Figure 5 shows a violin plot of the mutation detection score for both direct and transitive dependencies, split by project test suites on the left-hand side and Uppdatera on the right-hand side. The median detection score is 51% (mean: 47%) for direct dependencies and 36% (mean: 35%) for transitive dependencies. We observe that 25% of the projects have a high test suite effectiveness, greater than or equal to 80%, for direct dependencies. The median for direct dependencies and the third quartile for transitive dependencies are similar, showing that only one-fourth of the test suites remain effective in detecting faults in transitive dependencies. Moreover, we see more dispersion in effectiveness among direct dependencies than among transitive dependencies: half of the projects have a detection score ranging from 16% to 54% for transitive dependencies. Overall, the results indicate that tests are effective for a limited number of cases and dependencies. At large, however, only a small minority of projects have test suites that can comprehensively detect faulty updates.
Uppdatera, on the other hand, has a median detection score of 97% (mean: 74%) for direct dependencies and 88% (mean: 64%) for transitive dependencies. Generally, we see that static analysis is highly effective in detecting simple faults, with slightly decreased effectiveness for transitive dependencies. Half of the projects with a low detection score (< 50%) using tests now have a detection score greater than 80%. Below the median, for both direct and transitive dependencies, we see large variations between the projects. As change impact analysis is largely a generic technique, we manually investigate why Uppdatera was unable to detect changes in 76 modules having a low detection score of less than or equal to 39% and 22% for direct and transitive dependencies, respectively. We perform the manual investigation using the following protocol: (1) back-track from the dynamic call trace to the test suite, (2) identify potential test cases that invoke the path in the call trace, and (3) investigate both the test case setup and source code in depth to understand how Uppdatera could miss the regression change in the update.
In total, we identified four potential reasons for Uppdatera to miss faulty updates: 29 cases involving code generation, 26 cases involving class loading, 19 cases involving instrumentation, and 2 cases of instantiations of generic methods. Dynamic class loading, along with code generation, makes use of Java's Reflection API, for example Class.forName("DynClass");. A majority of the inspected cases stem from libraries such as FasterXML Jackson-databind, Jersey REST framework, Spring framework, JAI ImageIO, Hibernate Validator, and Google Guice. Reflection is useful in cases such as the creation of data bindings (jackson-databind), data validation (hibernate or guice), or generation of HTTP endpoints from annotated user methods (jersey or spring framework). Resolving cases involving reflection is a known limitation of static analysis [49].
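To illustrate why reflection is hard for a static call graph, the sketch below invokes a method whose class and name are only known as strings at run time. It uses only the standard Java Reflection API; the class name `ReflectionBlindSpot` and the choice of `java.util.ArrayList` as the target are illustrative assumptions, not cases from the study.

```java
import java.lang.reflect.Method;

// Sketch: the call target exists only as strings, so a call graph built from
// declared types (e.g., with CHA) typically has no edge to the invoked method.
final class ReflectionBlindSpot {

    static Object callNoArgMethod(String className, String methodName) {
        try {
            Class<?> type = Class.forName(className);                      // dynamic class loading
            Object instance = type.getDeclaredConstructor().newInstance(); // reflective instantiation
            Method method = type.getMethod(methodName);                    // lookup by name at run time
            return method.invoke(instance);                                // no static call site to the target
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("Reflective call failed: " + className + "#" + methodName, e);
        }
    }

    public static void main(String[] args) {
        // In real frameworks, the strings usually come from configuration or annotations.
        System.out.println(callNoArgMethod("java.util.ArrayList", "size"));
    }
}
```

If the strings come from configuration files or annotations, as in data-binding or dependency-injection frameworks, a change in the reflectively invoked method can go unnoticed by reachability analysis even though tests exercise it.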
Although we do not instrument JUnit and Maven (which we use to power our setup), projects can bypass our exclusion filter by putting those libraries under a different namespace, a practice known as class shading. We identify several instances of bypassing the filter, an effect we cannot easily control. Finally, in two cases, generic methods defined in user projects were only instantiated in tests but not in the project source code. Generally, call graph generators do not resolve a generic method unless there is a concrete instantiation of it.
Findings from RQ2: Project tests are effective in a limited number of cases but not at large. Uppdatera can detect twice as many faulty artificial updates as project test suites. Libraries making use of Java's reflection API could affect its applicability.

RQ3: Change Impact Analysis in Practice
We conducted our online monitoring for two weeks, between 13 and 27 April 2020, evaluating in total 22 Dependabot pull requests. On average, we harvested around 350 pull requests per day between Mondays and Wednesdays, 150 pull requests per day on Thursdays and Fridays, and 50 pull requests per day on weekends. While the number of pull requests may seem high, a majority of them were updates of Maven plugins or test dependencies, uncompilable, or superseding previous pull requests. Thus, we handled on average two pull requests per day, taking anywhere between one and four hours to manually evaluate pull requests and post our findings as comments.
Table 3 presents the analyzed pull requests along with the update type, ground truth class (i.e., Class column), external confirmation (i.e., Confirm column), results from the tooling, and execution times (in minutes). In total, our ground truth consists of 15 pull requests where the update is safe (i.e., S class), three pull requests where the dependency under update is unused and only declared (i.e., N class), and four pull requests where the update is unsafe (i.e., U class). The test suites of the analyzed pull requests classified 15 update cases as true positives (TP), four update cases as false positives (FP), and three update cases as false negatives (FN). Uppdatera classified 12 update cases as true positives (TP), seven cases as false positives (FP), and three cases as true negatives (TN). There are 12 cases where the two techniques report differently, as highlighted by the colors in Table 3.
Most notable are the false positives: Uppdatera incorrectly reports six updates (highlighted yellow in the table) as unsafe that test suites can detect as safe. In those cases, the heuristics failed to account for refactorings or falsely derived call paths due to dynamic dispatch. In four cases, Uppdatera could not detect that the changes were refactorings (i.e., semantic-preserving changes). One such example is a confirmed minor update of the Apache commons-lang3 library that refactors array length and null checks into a new function. In the two remaining cases, all reachable call paths were over-approximations. The update of org.eclipse.emf.common in one project included changes to List structures implementing methods of Java's List interface (such as addAll()), resulting in unrelated interface calls being linked to it. This is a limitation of the CHA algorithm, as it links interface calls to all available implementations.
In three confirmed N-cases (highlighted blue in the table) where tests would falsely pass the updates, Uppdatera correctly identified no use of the dependency under update in the projects. The project maintainers in two of the reported cases have started refactoring work to remove those identified dependencies.
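The over-approximation caused by CHA can be illustrated with a small sketch that resolves an interface call to every known implementation of the declared type. This is a simplification for illustration, not WALA's implementation; the class `ChaResolutionSketch` and the type `org.eclipse.emf.common.HypotheticalList` are hypothetical names.

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical sketch of CHA-style resolution: a call on a declared type is
// linked to the method in every known implementation of that type.
final class ChaResolutionSketch {

    static List<String> resolveTargets(Map<String, List<String>> implementers,
                                       String declaredType,
                                       String method) {
        return implementers.getOrDefault(declaredType, List.of()).stream()
            .map(implementation -> implementation + "." + method)
            .sorted()
            .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Class hierarchy of the project plus its dependencies (illustrative).
        Map<String, List<String>> implementers = Map.of(
            "java.util.List",
            List.of("java.util.ArrayList", "org.eclipse.emf.common.HypotheticalList"));

        // A client that only ever passes an ArrayList still gets an edge to the
        // dependency's implementation, so a change there is reported as reachable.
        System.out.println(resolveTargets(implementers, "java.util.List", "addAll"));
    }
}
```

More precise algorithms narrow the target set using allocation or points-to information, at the cost of the higher runtime noted in the experimental pipeline.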
Uppdatera was able to complement test suites in three false-negative cases (highlighted red in the table). In our confirmed case of an unsafe update, Uppdatera identified that the Apache commons-lang3 library would break the application logic of a project due to changes in calculating string edits using the Jaro-Winkler distance. Generally, we observe that solely using static analysis risks falsely classifying safe updates as unsafe. Finally, we also compare the execution times of running tests and Uppdatera. The results reveal that Uppdatera has faster or comparable times in 16 out of 22 cases, suggesting that change impact analysis can be a viable option to complement tests in CI environments.
Findings from RQ3: Semantically equivalent changes (refactorings) and over-approximated function calls are the main sources of false positives in Uppdatera. However, Uppdatera helped project maintainers identify risky updates and unused dependencies.

Evaluating Library Updates
Updating to a new version of a third-party library is not a trivial task, and for good reasons: interface refactorings induce additional maintenance burden, and integrating untested behavior can jeopardize project stability. Services such as Dependabot advocate a modest update strategy focusing on project compatibility: only update if the tests pass with the new library version. Effectively, developer-written tests act as the first line of defense against library updates introducing regression changes.
A key insight of our work is that automated dependency updates are not reliable. Our results strongly suggest that existing developer-written tests lack specifications that exercise dependencies in depth. This finding is in line with the work by Mirhosseini et al. [16], where developers report being suspicious of integrating automated updates due to fear of breakage. Bogart et al. report that, when selecting a third-party library, developers look at aspects such as reputation, code quality standards, and active maintenance to build up trust [2]. Perceived high-quality libraries can seem to eliminate the need for extensive testing. In our case, we found evidence against this practice. In our manual analysis, the minor backward-compatible update of org.apache.commons:commons-lang3, a high-quality library, had changes that would break the application logic in one project if the pull request were merged. In addition, testing third-party libraries is not a common practice in popular testing books [53,30,54], and very few research papers suggest testing third-party libraries [55,56]. Directing testing efforts to dependencies would be a potential solution to the problem. Therefore, we recommend that practitioners use automated updating services cautiously and complement them with tests for critical library dependencies. For tool creators in the domain, we argue for increased transparency in automated updating. With only a small minority of projects having both coverage and tests capable of detecting simple regressions, pull requests could feature a confidence score indicating how well the project's tests are able to exercise new changes in a library under update. As a first step, tool creators can make use of our study setup to measure both coverage and quality of tests as an indication of confidence. A confidence score could also help reduce false negatives: if no tests exercise a changed functionality of a dependency under update, Dependabot could avoid recommending it.
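One possible shape of such a confidence score is sketched below. It is a hypothetical illustration, not a feature of Dependabot or Uppdatera: it assumes the updater knows which changed dependency functions are reachable from the project and which functions the tests execute, and the threshold is an arbitrary placeholder.

```java
import java.util.Set;

// Hypothetical sketch: confidence = share of changed, reachable dependency
// functions that the project's tests actually execute.
final class UpdateConfidenceSketch {

    static double confidence(Set<String> changedReachable, Set<String> exercisedByTests) {
        if (changedReachable.isEmpty()) {
            return 1.0; // Assumption: no changed function is used, so there is nothing for tests to miss.
        }
        long exercised = changedReachable.stream()
            .filter(exercisedByTests::contains)
            .count();
        return (double) exercised / changedReachable.size();
    }

    // Withhold the recommendation when tests exercise too little of the change.
    static boolean shouldRecommend(double confidence, double threshold) {
        return confidence >= threshold;
    }

    public static void main(String[] args) {
        // Illustrative identifiers and threshold only.
        Set<String> changedReachable = Set.of("lib.Text.distance", "lib.Text.normalize");
        Set<String> exercisedByTests = Set.of("lib.Text.normalize", "lib.Io.read");
        double score = confidence(changedReachable, exercisedByTests);
        System.out.println("confidence: " + score + ", recommend: " + shouldRecommend(score, 0.8));
    }
}
```

A score of zero corresponds to the case described above in which no test exercises the changed functionality, so passing tests say nothing about the update.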

Strengths and Weaknesses of Static Analysis
Without needing to maintain additional dependency-specific tests, static analysis can be effective in deterring updates with potential regression changes. For a large number of projects with limited test quality, change impact analysis can fill the gap where tests are unable to reach and would be a compelling option for tool creators to consider. For a minority of projects, however, we identify certain third-party libraries that impede the overall analysis accuracy. Libraries that rely heavily on code generation, such as the Spring framework, make use of the Java Reflection API, which is known to be difficult to analyze statically [57]; as a result, the analysis could miss critical execution paths in projects that use them. Moreover, by linking interface calls to all their implementations, call graphs contain over-approximated call paths. During the manual analysis, we observed non-existing interface calls from functions in an unused dependency to classes implementing the interface in the project. As the approach of Ponta et al. [29] is based on building a call graph of the project and its dependencies together, we make the preliminary observation that projects having library dependencies with several common interfaces between them are likely to have many unrelated function calls. Exploring improvements such as using type hints with data-flow analysis could potentially eliminate such function calls. Overall, we argue that static analysis is a useful complement in use cases where tests lack coverage. By also revealing and presenting gaps and quality issues in test suites, static analysis can help developers prioritize testing efforts for dependencies.

Threats to Validity
Sampling random projects from GitHub poses threats to our results: tests or dependencies in projects may not exercise production classes. To mitigate this risk, we configure our call extractor to only record call paths originating from the project source code. Call paths that do not traverse project source code are excluded (e.g., a test class directly calling a dependency).
The use of mutation analysis to emulate source code changes in dependency functions has several potential threats to validity. First, we acknowledge that the applied mutation operators do not substitute actual regression changes in library updates. Our objective is to exercise all uses of libraries in a project by injecting simple faults to uncover potential coverage gaps in updating tools. Using real-world cases for this purpose would be challenging and could add hidden, uncontrolled factors. Second, our ground truth in RQ2 represents reachable call paths inferred from running project tests; this makes it a subset of all possible executions, which is a limitation of the benchmark. A potential avenue to explore is the use of test generation techniques such as EvoSuite [58] to discover new call paths. However, EvoSuite generates tests at the class level without considering interactions with other classes or dependencies, generating artificial tests that may not represent valid use cases.
The false-positive rate in RQ3 is indicative and not representative. Without domain knowledge of the interplay between a project and a dependency, the code reviews may state incorrect or incomplete information. To mitigate this risk, we post our code review assessment in the update for the project maintainer to react to in case of incorrect analysis. Finally, for the reproducibility of our study, we have made the source code,1 the experimental pipeline,2 and our data publicly available [59]. Specifically, we include the examined projects, the applied mutation changes, and their dynamic and static call graphs.

Related Work
Updating library dependencies in projects. To assist developers with updating dependencies in projects, researchers have studied practices around updating dependencies [3,16,2,60,26,27] and proposed tools leveraging both static and dynamic analysis [6,4,61]. Kula et al.'s [3] empirical study of 2,700 library dependencies in 4,600 Java projects found that 81.5% remain outdated, even with security problems. The study identified factors such as uncertainty around estimating refactoring efforts and other task priorities as reasons for developers not to update dependencies. To address update fatigue among developers, automated dependency updaters such as Dependabot and greenkeeper.io actively remind developers of and suggest dependency updates through pull requests. A study by Mirhosseini et al. [16] found that pull requests encourage developers to update dependencies more frequently, but the frequency of updates and lack of convincing arguments deter them from updating. Along similar lines, the work of Bogart et al. [2] suggests that developers perceive monitoring tools as having a low signal-to-noise ratio rather than giving actionable insights. Finally, the empirical work of Dietrich et al. [26] suggests that 75% of emulated library updates in the Qualitas dataset have breaking changes. However, only a few updates resulted in an error, motivating the need for contextual analysis.
Recently, researchers have started to explore the use of static and dynamic analysis to identify library updates with breaking changes, saving developers time and review effort. NoRegrets [4,61] is a tool that detects breaking changes using the test suites of dependent npm packages before an update of the library is released. Although helpful in minimizing the chances of breaking changes for clients, the identified subset of clients may not be representative of other clients. Similarly, Foo et al. [6] describe a static approach using simple diffing and querying Veracode's SGL [62] graph to find clients affected by breaking changes. In contrast to this approach, Uppdatera analyzes at the project level (i.e., it does not search for affected clients), targets diffs with data- and control-flow changes (i.e., not only interface changes), and includes a benchmark to compare updating tools.
Change Impact Analysis. Change impact analysis is a widely studied problem in program analysis research [63,64]. Propagation of changes in package repositories has become an important research area in light of incidents such as the left-pad incident, and recent moves to emulate these problems on package-based networks [65,18]. Several techniques [66,67,68,69,70] use call graphs as an intermediate representation for change impact analysis. Alternative techniques to call graphs are static and dynamic slicing [71,32], profiling [72,73], and execution traces [74]. Due to cost-precision trade-offs, several proposed approaches use a combination of these techniques. One such example is the work of Alimadadi et al. on Tochal, which leverages both runtime data and call graphs to more accurately represent changes to dynamic features such as the DOM. For a comprehensive overview of impact analysis techniques and change estimations, we refer the reader to Li et al.'s [63] survey on code-based change impact analysis techniques. An application of change impact analysis is regression test selection (RTS) [75], with techniques such as class-based STARTS [76,77] and probabilistic test selection [78] that find relevant tests for evaluating new code changes. We found in our evaluation that test suites have limited coverage of dependencies; thus, RTS may not be able to find tests relevant for changes in dependencies or have enough test data to build a prediction model for average GitHub projects. Finally, Danglot et al. [79] and Da Silva et al. [80] investigate the use of search-based methods such as test amplification and automated test generation for detecting semantically conflicting changes. Although search-based methods are effective in reducing false positives and, to some degree, eliminating false negatives present in static analysis, they are limited in integration test scenarios such as automated dependency updating. Da Silva et al. [80] found that automated test generation tools such as EvoSuite [58] have difficulties in generating effective tests for complex objects with internal or external dependencies.

Conclusions and future work
In this paper, we empirically investigate the reliability of test suites for automating dependency updates. With an increasing number of developers relying on services that automate updates of dependencies, our goal was to uncover to what degree project tests exercise utilized functionality in library dependencies, how effective they are in catching simple regressions, and their performance in practice. As recent research highlights the need for conservative techniques, we explored the use of change impact analysis to reduce false negatives.
Our findings show that, in half of the 521 well-tested projects, tests cover less than 60% of the function calls to direct dependencies. The coverage drops to 20% when considering call paths to transitive dependencies. By artificially injecting simple faults in library dependencies of 262 projects, we observe that one-fourth of the projects can detect 80% or more of the faults in functions of direct libraries. When considering transitive dependencies, the number of projects drops to one-eighth. Conversely, change impact analysis can detect 80% of potentially breaking changes in both direct and transitive dependencies, two times more than using test suites. Although change impact analysis is a promising direction to flag faulty updates, we also manually investigate whether it can complement tests in 22 Dependabot pull requests. Our results show that change impact analysis could avoid unsafe updates in three cases where tests failed and spotted unused libraries in two cases. However, there are more false positives with change impact analysis, as it is more imprecise than tests.
Our findings suggest that developers who make use of automated dependency updating need to be aware of the risks of using project tests for compatibility checking. Without coverage or adequate tests for all usages of library dependencies, updates can silently introduce unintended functionality over time. As services such as Dependabot do not advertise the risks involved with updating dependencies, tool creators could introduce reliability measurements such as scoring test suites in pull requests. As we investigate the use of change impact analysis, we argue that tool creators should explore combining dynamic and static analysis to derive verification techniques that do not strongly depend on users' test suites.
In future work, we aim to establish best practices for updating third-party libraries. As a first step, we aim to understand whether developers direct testing efforts towards dependencies and to uncover the strategies they use. Moreover, we also intend to explore hybrid workflows through data-driven methods for efficient update checking by combining dynamic and static analysis.

Figure 2: Overview of our study infrastructure

Definition 4.1. Given a diff mapping $D = Lib_1 \setminus Lib_2$ between code entities in $Lib_1$ and $Lib_2$, we consider a code change as not semantic preserving if and only if $d_i \in D$ has a source location with a reachable control flow path to client $C$ and maps to the following potential actions in a CFG:
1. $d_i$ translates to a change in the expression of write or read statements (data-flow change)
2. $d_i$ translates to moving a statement from position $x$ to $y$ (control-flow change)
3. $d_i$ translates to removing or expanding with new control flow paths (control-flow change)

of $T_C$ connecting data and control dependencies between program statements in both client and dependency code. The transition $[Lib_1 \to Lib_2]_C$ represents replacing $Lib_1$ with $Lib_2$ in client $C$. We arrive at the following definition of a safe update:
Definition 2.1. Given that $Lib_1 \in T_C$ and a request by a package manager to perform $[Lib_1 \to Lib_2]_C$, let $D = Lib_1 \setminus Lib_2$ be a source code diff mapping between $Lib_1$ and $Lib_2$, and function $f : D \to Y$ determine semantic compatibility for diff $d_i \in D$ in client $C$ where $Y \in \{true, false\}$, an automatic update (or safe update) can only be made if and only if ∀d

Table 2: Descriptive Statistics for 521 GitHub projects (each variable aggregated per project)

Table 3: Results of running Uppdatera on 22 Dependabot pull requests