The State of Research on Function-as-a-Service Performance Evaluation: A Multivocal Literature Review

Function-as-a-Service (FaaS) is one form of the serverless cloud computing paradigm and is defined through FaaS platforms (e.g., AWS Lambda) executing event-triggered code snippets (i.e., functions). Many studies that empirically evaluate the performance of such FaaS platforms have started to appear but we are currently lacking a comprehensive understanding of the overall domain. In our work, we survey existing research on FaaS performance evaluation and present results from a multivocal literature review (MLR) covering 112 studies from academic (51) and grey literature (61). We find that existing work heavily focuses on AWS Lambda and prevalently evaluates micro-benchmarks using simple functions to measure CPU speed and FaaS platform overhead (i.e., container cold starts). Further, we identify a mismatch between academic and industrial sources on tested platform configurations, conclude that function triggers remain insufficiently studied, and list HTTP API gateways and cloud storages as the most used external service integrations. Following existing guidelines on experimentation in cloud systems, we discover many flaws threatening the reproducibility of experiments presented in the surveyed studies. We conclude with a discussion of gaps in literature and highlight methodological suggestions that may serve to improve future FaaS performance evaluation studies.


Introduction
Cloud computing continues to evolve, moving from low-level services such as Amazon Web Services (AWS) EC2, towards integrated ecosystems of specialized high-level services. Early Infrastructure-as-a-Service (IaaS) cloud services are very generalist solutions, which only provide a low-level abstraction of computing resources, typically in the form of self-administered virtual machines. In contrast, the emerging serverless 1 paradigm aims to liberate users entirely from operational concerns, such as managing or scaling server infrastructure, with fully-managed services offered with fine-grained billing. These high-level services become increasingly specialized, ranging from simple object storage (e.g., Amazon S3) to deep learning-powered conversational agents (e.g., Amazon Lex, the technology behind Alexa). However, to connect the different services (e.g., feed images from S3 into a transcoding service), a serverless-but-generalist service is required as "glue" to bridge the gaps (in triggers, data formats, etc.) between services. This is the primary niche that Function-as-a-Service (FaaS) platforms, such as AWS Lambda 2, have emerged to fill. In FaaS, developers provide small snippets of source code (often JavaScript or Python) in the form of programming language functions adhering to a well-defined interface. These functions can be connected to trigger events, such as incoming HTTP requests, or data being added to a storage service. The cloud provider executes the function (with the triggering event as input) on-demand and automatically scales underlying virtualized resources to serve elastic workloads of varying concurrency. Such high elasticity is exemplified by PyWren [1], a Python scale-out data analytics framework, where a central coordinator parallelizes tiny individual processing tasks into up to thousands of Lambda function instances.
Previous research has shown that FaaS is used for a wide variety of tasks [2], including as the glue holding together a larger serverless application, as a backend technology to implement REST services, and for a variety of data analytics and machine learning tasks. This makes their performance crucial to the efficient functioning of a wide range of cloud applications.
Unfortunately, previous research has indicated a wide variety of performance-related challenges common to many FaaS platforms, including fully-managed providers and self-managed open source systems (e.g., Apache OpenWhisk 3). Among others, cold start times (the time required to launch a new container to execute a function in) can lead to execution delays of multiple seconds [3], hardware heterogeneity makes predicting the execution time of a function difficult [4], and complex triggering mechanisms can lead to significant delays in function executions on some platforms [5]. However, so far, reports about performance-related challenges in FaaS are disparate and originate from different studies, executed with different setups and different experimental assumptions. We are lacking a consolidated view on the state of research on FaaS performance. This paper addresses this gap. We conduct a multivocal literature review (MLR) [6] to consolidate academic and industrial (i.e., grey literature) sources that are published between 2016 and 2019 and report performance measurements of FaaS offerings of different platforms. The field of our study is the performance evaluation (also referred to as performance benchmarking) of FaaS offerings, both of commercial public services and open source systems intended to be installed in private data centers. Our research goal is two-fold. Firstly, we characterize the landscape of existing isolated FaaS performance studies. Secondly, we identify gaps in current research (and, consequently, in our understanding of FaaS performance). We also provide methodological recommendations aimed at future FaaS performance evaluation studies.
The remainder of this paper is structured as follows. Section 2 introduces FaaS performance benchmarking. Section 3 defines and motivates our research questions. Section 4 describes our MLR study design before we present and discuss the results in Section 5. The main findings then lead to the implications of our study in Section 6, where we also identify gaps in current literature. Section 7 relates our work and results to other research in the field. Finally, Section 8 summarizes and concludes this paper.

Background
This section introduces FaaS performance benchmarking based on the two benchmark types covered in our work. Micro-level benchmarks target a narrow performance aspect (e.g., floating point CPU performance) with artificial workloads, whereas application-level benchmarks aim to cover the overall performance (i.e., typically end-to-end response time) of real-world application scenarios. We clarify this distinction of benchmark types based on example workloads from our analyzed studies.

Micro-Benchmarks
Listing 1 shows a simple CPU-intensive AWS Lambda function written in the Python programming language. This example function serves as a CPU micro-benchmark in one of our surveyed studies [A21]. It implements a provider-specific handler function to obtain the parameter n from its triggering invocation event (see line 13). The floating point operations helper function (see line 4) exemplifies how common FaaS micro-benchmarks measure latency for a series of CPU-intensive operations.

Application-Benchmarks
Figure 1 depicts the architecture of an AWS Lambda FaaS application that performs machine learning (ML) inferencing. The diagram is based on the mxnet-lambda reference implementation 4 used in some adjusted form by one study [A16] to benchmark ML inferencing. The application predicts image labels for a user-provided image using a pre-trained deep learning model. A user interacts with the application by sending an HTTP request to the HTTP API gateway, which transforms the incoming HTTP request into a cloud event and triggers an associated lambda function. The API gateway serves as an example for a common function trigger. However, lambda functions can also be triggered programmatically (e.g., via CLI or SDK), by other cloud events, such as file uploads (e.g., creation or modification of objects in S3), or various other trigger types.
Lambda functions implement the actual application logic, in our example application by loading the pre-trained ML model from S3, downloading the image from the user-provided URL from the internet, and then performing the inference computation within the lambda function. Such lambda functions commonly make use of cloud services for data storage (e.g., object storage S3, document database DynamoDB), logging (e.g., CloudWatch monitoring), analytics (e.g., Amazon EMR including Hadoop, Spark, HBase and other big data frameworks), machine learning (e.g., natural language translation with AWS Translate) and many more purposes.
The image download exemplifies other potential interactions with third-party services, such as REST APIs. Our interactive example application finally returns the address and geographical coordinates (i.e., the predicted image labels) to the user through the HTTP API gateway as an HTTP response. In other non-interactive scenarios, lambda functions typically deliver their results to other cloud services, which might themselves trigger further actions or even other lambda functions as part of a workflow.
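A CPU micro-benchmark function of the kind described for Listing 1 can be sketched as follows. This is an illustrative sketch under stated assumptions, not the code from [A21]: the helper name float_operations, the event key "n", and the returned fields are hypothetical, and the line numbers cited for Listing 1 refer to the original listing rather than to this sketch.

```python
import math
import time


def float_operations(n):
    """Run n rounds of floating point math and return the elapsed seconds."""
    start = time.perf_counter()
    for i in range(n):
        # The results are discarded; only the time spent computing matters.
        math.sin(i)
        math.cos(i)
        math.sqrt(i)
    return time.perf_counter() - start


def lambda_handler(event, context):
    """Handler in the AWS Lambda Python style: (event, context) -> result."""
    # Assumption: the invocation event carries the workload size under "n".
    n = int(event["n"])
    latency = float_operations(n)
    return {"n": n, "latency_seconds": latency}
```

Invoking lambda_handler({"n": 100000}, None) locally returns the measured latency for that workload size. On a FaaS platform, the same measurement would typically be repeated across function sizes and invocations.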

Research Questions
In the context of studies on FaaS performance evaluation, our main research questions address publication trends (RQ1), studied platforms (RQ2), evaluated performance characteristics (RQ3), used platform configurations (RQ4), and reproducibility (RQ5):

RQ1 Publication Trends: What are the publication trends related to FaaS performance evaluations?
This question helps us understand how active research on FaaS performance evaluation has been over the years, and gives insights into publication types and academic venues.

RQ5 Reproducibility: This question addresses an inherently important quality of experimental designs by assessing how well the FaaS community follows existing guidelines on reproducible experimentation in cloud systems [7].

Study Design
This section describes the methodology of our Multivocal Literature Review (MLR) based on the guidelines from Garousi et al. [6]. We first summarize the overall process, then detail the strategies for search, selection, and data extraction and synthesis, followed by a discussion of threats to validity.

MLR Process Overview
We divide the MLR process into two parts, one for academic and one for grey literature. We classify peer-reviewed papers (e.g., papers published in journals, conferences, workshops) as academic literature (i.e., white literature) and other works (e.g., preprints of unpublished papers, student theses, blog posts) as grey literature. The search process and source selection for academic literature follows a conventional systematic literature review (SLR) process. Figure 2 summarizes this multi-stage process originating from three different search sources and annotates the number of studies after each stage. The process for grey literature studies is summarized in Figure 3 with sources originating prevalently from web search. Notice that the numbers of relevant studies are already deduplicated, meaning that we found 25 relevant studies through Google search and the additional +8 studies from Twitter search only include new, non-duplicate studies. A key motivation for the inclusion of grey literature is the strong industrial interest in FaaS performance and the goal to identify potential mismatches between the academic and industrial perspectives.

Search Strategies
We first describe manual and database search for academic publications, then highlight the adjustments for web search, and finally discuss how alert-based search and snowballing complement the classic search strategies.

Manual Search for Academic Literature
We use manual search to establish an initial seed of relevant sources to refine the database search query and to complement database search results with sources from third-party literature collections. We screen the following sources for potentially relevant studies:
a) Studies from the preliminary results of an SLR targeting benchmarking of FaaS platforms [8]: Their references from Table 1 are all relevant for our MLR but limited to 9 FaaS benchmarking studies, from which we removed one due to duplication (a journal extension covering more experiments than the initial extended abstract).
c) Studies from a systematic mapping study (SMS) on engineering FaaS platforms and tools [10]: Their 62 selected publications focus on novel approaches and thus explicitly exclude benchmarking studies "without proposing any modifications" [10]. We still identify a total of 10 relevant studies for our MLR in their categories related to benchmarking and performance and by screening their complete references.

Database Search for Academic Literature
Following standard SLR guidelines [11], we define a search string to query common digital libraries for potentially relevant papers. We make use of logical OR operators to consider alternative terms given the lack of terminology standardization in the field of serverless computing. We refine the search string based on the insights from manual search, as suggested by Zhang et al. [12], by adding a keyword (i.e., lambda appeared in all full texts) but omitting double quotes for exact matching. Our final search string is defined as follows:
(serverless OR faas) AND (performance OR benchmark) AND experiment AND lambda
We apply the search string to 7 common digital libraries, namely ACM Digital Library, IEEE Xplore, ISI Web of Science, Science Direct, Springer Link, Wiley InterScience, and Scopus. The libraries are configured in their advanced query modes (if available) to search within full texts and metadata fields for maximal search coverage. The exact search query for each library can be found in the online appendix 5 including direct links and instructions for reproducing the search. The search was performed in Oct 2019 and all raw search results were exported into the BibTeX format.

Web Search for Grey Literature
For querying grey literature, we modified our original search string to account for less formal language in online articles. We replicate our academic search for one Google query but omit the terms experiment and lambda for all remaining queries using the following simplified search string:
(serverless OR faas) AND (performance OR benchmark)
We apply the search string to 5 search engines, namely Google Search, Twitter Search, Hacker News Algolia Search, Reddit Search, and Medium Search. These engines (with the exception of Google Search) lack support for logical OR expressions. Therefore, we compose four subqueries and combine their results, which together are logically equivalent to the defined search string. Most searches were performed in Dec 2019 and, for replicability, we save the output of every search query as PDF and HTML files.
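The four subqueries follow from distributing AND over OR in the simplified search string. The sketch below illustrates this expansion; the function name is hypothetical, and it assumes that a search engine treats space-separated terms as an implicit AND, which may not hold for every engine.

```python
from itertools import product


def expand_to_subqueries(*or_groups):
    """Distribute AND over OR: one conjunctive subquery per combination."""
    return [" ".join(terms) for terms in product(*or_groups)]


# (serverless OR faas) AND (performance OR benchmark)
subqueries = expand_to_subqueries(
    ("serverless", "faas"),
    ("performance", "benchmark"),
)

for query in subqueries:
    print(query)
```

The union of the result sets of these four subqueries corresponds to the result set of the original boolean expression.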

Alert-based Search
We configure web-alerts to discover recently published literature. Our previous search strategies often miss recent literature because manual search heavily relies on previous work and database search might suffer from outdated query indices or omit academic literature in press (i.e., accepted but not yet published). Therefore, we configure Google Scholar alerts 6 for the broad search term serverless and the more specific search term serverless benchmark over a period of 5 months (2019-10 till 2020-02) and screen all articles for potential relevance. This search strategy discovers many preprints (e.g., from arXiv.org) for which we explicitly check whether they are accepted manuscripts (i.e., white literature) or unpublished preprints (i.e., grey literature).
5 https://github.com/joe4dev/faas-performance-mlr

Snowballing
After applying the selection criteria, we perform snowballing for academic and grey literature. For academic literature, we apply backward snowballing by screening their reference lists and forward snowballing by querying citations to relevant papers through Google Scholar. For grey literature, we prevalently apply backward snowballing by following outgoing links and occasionally (particularly for popular and highly relevant sources) apply forward snowballing by querying for incoming links through a backlink checker 7.

Selection Strategy
Following established SLR study guidelines [11], we define the following inclusion (I) and exclusion (E) criteria for our study:
I1 Studies performed at least one performance-related experiment (i.e., excluding purely theoretical works, simulations, and works where a performance experiment was only mentioned as a side note) with a real FaaS environment as System-Under-Test (SUT). The FaaS environment can be fully managed or self-hosted.
I2 Studies presented empirical results of at least one performance metric.
I3 Studies published after Jan 1st 2015, as the first FaaS offering (AWS Lambda) was officially released for production use on April 9, 2015 8 .
E1 Studies written in any other language than English
E2 Secondary or tertiary studies (e.g., SLRs, surveys)
E3 Re-posted or republished content (e.g., sponsored repost, conference paper with a journal extension)
As suggested by Wohlin et al. [13], we only consider the most complete study as relevant primary study in cases of partial republication, for instance in the case of a journal extension of a conference paper. The two authors classified each potentially relevant study either as relevant, uncertain (with an indication whether rather relevant or not), or not relevant. All studies classified as uncertain were examined again and the rationale for the final decision was documented following the selection strategy presented above. If the title, keywords, and abstract were insufficient for obviously excluding a study, we read the full text of the study to take a final decision as practiced for all included studies.
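The inclusion and exclusion criteria can be read as a simple predicate over a candidate study. The sketch below encodes them under stated assumptions: the Study record and its field names are hypothetical, and the actual selection was a manual judgment by the authors (including the uncertain category), which this sketch does not model.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Study:
    """Hypothetical record of the facts needed to apply the criteria."""
    published: date
    language: str
    has_real_faas_experiment: bool    # I1
    reports_performance_metric: bool  # I2
    is_secondary_study: bool          # E2
    is_republication: bool            # E3


CUTOFF = date(2015, 1, 1)  # I3: published after Jan 1st 2015


def is_relevant(study):
    """A study is relevant if it meets I1-I3 and none of E1-E3."""
    included = (
        study.has_real_faas_experiment
        and study.reports_performance_metric
        and study.published > CUTOFF
    )
    excluded = (
        study.language != "English"  # E1
        or study.is_secondary_study
        or study.is_republication
    )
    return included and not excluded


# Illustrative, made-up candidates (not taken from the surveyed studies).
benchmark_post = Study(date(2018, 5, 1), "English", True, True, False, False)
survey_paper = Study(date(2019, 3, 1), "English", True, True, True, False)
```

Here is_relevant(benchmark_post) holds, whereas survey_paper is rejected by E2.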

Data Extraction and Synthesis
Guided by the research questions, we extract the corresponding information based on a structured review sheet.
Publication Trends (RQ1). To capture how many studies of which type are published, we extract the following metadata: (i) the publication date, (ii) the venue type for academic literature (i.e., journal, conference, workshop, doctoral symposium) and grey literature (i.e., preprint, thesis, blog post), and (iii) the name of the venue (e.g., IEEE CLOUD, USENIX ATC) and a ranking of the venue (i.e., A*, A, B, C, W for workshop, unranked). The venue ranking follows the CORE ranking for conferences (CORE2018 9) and journals (ERA2010 10).
Studied Platforms (RQ2). To assess which offerings are particularly well-understood or insufficiently researched, we extract the names of all FaaS platforms that are empirically investigated in a study.
Evaluated Performance Characteristics (RQ3). To understand which performance characteristics have been benchmarked, we distinguish between micro- and application-benchmarks, collect a list of micro-benchmarks (e.g., CPU speed, network performance), and capture more general performance characteristics (e.g., use of concurrent execution, inspection of infrastructure). We start with an initial list of characteristics and iteratively add popular characteristics from an open Others field.
Reproducibility (RQ5). To review the potential regarding reproducibility, we follow existing guidelines on experimentation in cloud systems [7]. The authors propose eight fundamental methodological principles on how to measure and report performance in the cloud and conduct an SLR to analyze the current practice concerning these principles covering top venues in the general field of cloud experimentation. As part of our work, we replicate their survey study in the more specific field of FaaS experimentation. We largely follow the same study protocol by classifying for each principle whether it is fully met (yes), partially present (partial) but not comprehensively following all criteria, or not present (no). Additionally, we collect some more fine-grained data for certain principles. For example, we distinguish between dataset availability and benchmark code availability for P4 (open access artifact) because we consider public datasets to be essential for replicating (statistical) analyses, while public benchmark code is practically essential for reproducing the empirical experiment. For P3 (experimental setup description), we additionally capture whether a study describes the time of experiment (i.e., dates when the experiment was conducted), cloud provider region (i.e., location of data center), and function size (i.e., used memory configurations).
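The yes/partial/no classification per principle lends itself to a simple tally. The following sketch shows one way such review-sheet data could be summarized; the function name and the example labels are made up for illustration and do not reflect the extracted data of this study.

```python
from collections import Counter

LEVELS = ("yes", "partial", "no")


def summarize(classifications):
    """Return, per principle, the share of studies at each level."""
    summary = {}
    for principle, labels in classifications.items():
        counts = Counter(labels)
        total = len(labels)
        summary[principle] = {level: counts[level] / total for level in LEVELS}
    return summary


# Hypothetical labels for four studies (not real results).
example = {
    "P3": ["yes", "partial", "partial", "no"],
    "P4": ["yes", "no", "no", "no"],
}

shares = summarize(example)
```

For the made-up P4 labels above, shares["P4"]["no"] evaluates to 0.75, meaning three of the four hypothetical studies provide no open access artifact.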

Threats to Validity
We discuss potential threats to validity and mitigation strategies for selection bias, data extraction and internal validity, replicability of the study, and external validity.
Selection Bias. The representativeness of our selected studies is arguably one of the main threats to this study. We used a multi-stage process (see Section 4.1) with sources originating from different search strategies. Initial manual search based on existing academic literature collections allowed us to fine-tune the query string for database searches against 7 well-established electronic research databases. We optimize our search string for more informal grey literature and query 5 search engines specializing in general-purpose search, social search, and developer-community search. Additionally, our complementary search strategies aim to discover studies that were recently published, found in the other context (i.e., academic vs grey), or spotted through more exploratory search (e.g., looser adaptation of search terms).
Data Extraction and Internal Validity. Tedious manual data extraction could potentially lead to inaccuracies in extracted data. To mitigate this threat, we define our MLR process based on well-established guidelines for SLR [11] and MLR [6] studies, methodologically related publications [14], and topically relevant publications [8,10]. Further, we set up a structured review sheet with practical classification guidelines and further documentation that was incrementally refined (e.g., with advice on classifying borderline cases). We implemented traceability through over 700 additional comments, at least for all borderline cases. The data extraction process was conducted by both authors, with the first author as main data extractor and the second author focusing on discussing and verifying borderline cases. We also repeatedly went over all sources to verify certain data (e.g., based on refined classification scheme) and collect more details (e.g., individual aspects of more vague P3 on experimental setup description). For the reproducibility part (see RQ5), we refer to the statistical evaluation on inter-reviewer agreement in the original study [7], which achieved very high agreement.
Replicability of the Study. We publish a replication package [15] to foster verification and replication of our MLR study. Our package includes all search queries with direct links and step-by-step instructions on how to replicate the exact same queries, query results in machine-readable (BibTeX/HTML) and human-readable (PDF) formats, a structured review sheet containing all extracted data and over 700 comments with guidance, decision rationales, and extra information, and code to reproduce all figures in our study. The latest version of the replication package and further documentation is also available online 11 .
External Validity. Our study is designed to systematically cover the field of FaaS performance benchmarking for peer-reviewed academic white literature and unpublished grey literature including preprints, theses, and articles on the internet. However, we cannot claim generalizability to all academic or grey literature as we might have missed some studies with our search strategies. The inclusion of grey literature aims to address an industrial perspective but is obviously limited to published and indexed content freely available and discoverable on the internet (e.g., excluding paywall articles or internal corporate feasibility studies).

Study Results and Discussion
This section presents and discusses the main outcomes of our MLR study guided by our research questions stated in Section 3. The results are based on the extracted and synthesized (according to Section 4.4) survey data from 112 selected (according to Section 4.3) primary studies including 51 academic publications and 61 grey literature sources. For each research question, we briefly describe context, motivation and methodology, followed by relevant results and their subsequent discussion.

Publication Trends (RQ1)
Description. We describe the publication trends on FaaS performance evaluations by summarizing the publication statistics over years and venue types, the venue rankings for academic literature, and the most popular publication venues. The venue ranking follows the CORE ranking for conferences (CORE2018) and journals (ERA2010).
Results. Figure 4 shows the distribution of published studies for academic and grey literature over years and venue types. We observe a growing interest for both types of literature, with early studies appearing in mid 2016 [A15, A35], followed by a drastic increase in 2017, and a surge
Discussion. Our results are generally in line with the related systematic mapping study from Yussupov et al. [10]. However, we see a stronger emphasis on workshop publications, which appears plausible for a narrower topic of investigation. Additionally, our work indicates that grey literature follows a similar but possibly more pronounced hype trend with blog posts spiking in 2018 and declining more strongly in 2019 than cumulative academic literature. Related to academic venue rankings, we interpret the relative over-representation of top-ranked publications (in comparison to relatively few full papers in C-ranked venues) as a positive sign for this young field of research. The strong representation of workshop papers, particularly at WoSC, is plausible for a relatively narrow topic in a young line of research.

Studied Platforms (RQ2)
Description. The two main types of FaaS platforms are hosted platforms and platforms intended to be installed in a private cloud. Hosted platforms are fully managed by a cloud provider and often referred to as FaaS providers. All major public cloud providers offer FaaS platforms, including AWS Lambda, Microsoft Azure Functions, Google Cloud Functions, and IBM Cloud Functions. Installable platforms are provided as open source software and can be self-hosted in on-premise deployments. Prominent open source platforms include Apache OpenWhisk, Fission, Knative, and OpenFaaS. Self-hosting requires extra setup, configuration, and maintenance efforts, but allows for full control and inspection during experimentation. Dozens more hosted services 12 and many more FaaS development frameworks 13 and installable platforms have emerged in this fast-growing market.
Results. The first row of the bubbleplot in Figure 6a summarizes the total number of performance evaluation experiments in absolute frequency counts for the 5 most popular hosted FaaS platforms in our study. Self-hosted platforms are only depicted in aggregation due to their low prevalence in literature. The x-axis is ordered by cumulative platform frequency, where AWS Lambda leads with a total of 99 studies divided into 45 academic and 54 grey literature studies. Thus, 88% of all our selected studies perform experiments on AWS Lambda, followed by Azure (26%), Google (23%), self-hosted platforms (14%), IBM (13%), and CloudFlare (4%). For hosted platforms, we omit Lambda@Edge 14 (3) and Binaris 15 (1) because Lambda@Edge is covered in the same experiments as CloudFlare and Binaris only occurred once. Within self-hosted platforms, academic literature mostly focuses on OpenWhisk (70%), whereas grey literature covers other platforms, such as Fission, Fn, or OpenFaaS. We observe that the attention by literature type is appropriately balanced for AWS Lambda and Azure Functions. Academic literature exhibits a higher coverage than grey literature for Google (+7%) and in particular for IBM (+12%), as well as for self-hosted platforms (+17%). Grey literature covers CloudFlare (+7%), which is not covered at all in academic literature.
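The reported AWS Lambda shares can be recomputed from the study counts given in the text. The sketch below does so; the helper name is hypothetical, and it assumes that coverage by literature type is measured relative to the number of studies of that type (51 academic, 61 grey).

```python
TOTAL_ACADEMIC = 51
TOTAL_GREY = 61


def share(count, total):
    """Percentage of studies, rounded to the nearest integer."""
    return round(100 * count / total)


aws_academic, aws_grey = 45, 54

overall = share(aws_academic + aws_grey, TOTAL_ACADEMIC + TOTAL_GREY)
academic = share(aws_academic, TOTAL_ACADEMIC)
grey = share(aws_grey, TOTAL_GREY)

print(f"AWS Lambda: {overall}% overall, {academic}% academic, {grey}% grey")
```

Under this assumption, the academic and grey shares for AWS Lambda differ by only about one percentage point, which matches the observation that attention is balanced across literature types.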
Discussion. In comparison to other surveys, our overall results for percentage by provider closely (±5%) match the self-reported experience per cloud provider in a 2018 FaaS survey [2] (N=182). Our results are also reasonably close (±5% except for AWS +13%) to self-reported use in organizations in a 2019 O'Reilly survey on serverless architecture adoption (N>1500) 16. Initial results of the latest 2020 survey (N>120 for the first day) 17 indicate similar results for FaaS products currently in use, with a very close match for AWS (1% deviation). However, this survey shows even lower numbers (up to -12%) for other providers.
Hence, our results show that AWS is currently overstudied in absolute numbers (by a factor of 3). However, the strong emphasis on AWS appears to be justified in relative numbers given the industrial importance of AWS in this domain. IBM appears to be over-represented in academic studies, potentially influenced by the active contributions to academic literature originating from IBM Research.

Evaluated Performance Characteristics (RQ3)
To answer RQ3, the faceted bubbleplot in Figure 6 combines performance characteristics for a) benchmark types, b) micro-benchmarks, and c) general characteristics across FaaS platforms. All these plots can be interpreted as a heatmap ranging from few studies in the bottom-left corner to many studies in the top-right corner for a given characteristic-platform combination. We provide relative
We distinguish between micro- and application-benchmarks as summarized by our benchmark taxonomy in Figure 7 and introduced in Section 2.
Results. Figure 6a summarizes which high-level types of benchmarks are used across which FaaS platforms. The rightmost Total per Characteristic column indicates that micro-benchmarks are the most common benchmark type, used by 75% of all selected studies (84/112). Interestingly, we observe a particularly strong emphasis on micro-benchmarks in grey literature (50 studies, or 82%). However, two thirds of the selected academic literature (34 studies) also uses micro-benchmarks. Application-level benchmarks are used by 48 (43%) of all selected studies and are much more prevalent among academic literature with 29 (57%) studies compared to grey literature with only 19 (31%) studies. Further, 12 (24%) academic studies combine micro-benchmarks and application-benchmarks, which can be derived from the difference (i.e., 12) between the sum of academic studies using micro-benchmarks and application-benchmarks (34 + 29 = 63) and the total number of academic literature studies (51). For grey literature, only 8 (13%) studies combine the two benchmark types and thus the vast majority of studies (87%) uses micro- or application-benchmarks in isolation. Finally, micro-benchmarks are more commonly used across different providers, whereas application-benchmarks are prevalently (> 84%) used for benchmarking AWS.
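The number of studies combining both benchmark types follows from the inclusion-exclusion principle. The sketch below reproduces the derivation with the counts stated in the results; the function name is hypothetical, and the calculation assumes that every selected study uses at least one of the two benchmark types.

```python
def combined(micro, application, total):
    """Studies using both types, assuming each study uses at least one."""
    return micro + application - total


# Counts as reported in the results for each literature type.
academic_both = combined(micro=34, application=29, total=51)
grey_both = combined(micro=50, application=19, total=61)

academic_share = round(100 * academic_both / 51)
grey_share = round(100 * grey_both / 61)

print(f"academic: {academic_both} studies ({academic_share}%)")
print(f"grey: {grey_both} studies ({grey_share}%)")
```

If some studies used neither benchmark type, the true overlap would be larger than this difference, so the assumption matters for the interpretation.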
Discussion. We were surprised to see relatively many academic studies using application-benchmarks. Closer investigation reveals that many of these academic studies demonstrate or evaluate their proposed prototypes on a single FaaS platform (e.g., "MapReduce on AWS Lambda" [A12]) focusing on thorough insights and leaving cross-platform comparisons for future work. While we agree that such studies on a single FaaS platform can be great demonstrators of ideas and capabilities, the general usefulness of application-benchmarks evaluated in isolation on a single platform is limited, as the inability to relate results from such a work against any baseline or other reference platform makes it hard to derive meaningful conclusions. This threat is particularly relevant for hosted platforms, as the performance observed from such experiments depends strongly on the underlying hardware and software infrastructure. Therefore, we argue that reproducibility (see RQ5) is particularly important for this type of study.
Some studies clearly intend to conduct end-to-end (i.e., application-level) measurements; however, their applications and workloads are insufficiently described such that it is unclear what exactly they do. This lack of clarity is reinforced by the tendency of application-benchmarks to remain closed source, with only 35% of the studies publishing at least partial benchmark code compared to 50% overall.

Evaluated Micro-Benchmarks (RQ3.2)
Description. We cover micro-benchmarks targeting CPU, file I/O, and network performance. The Others category summarizes other types of micro-benchmarks such as cold start evaluations (i.e., platform overhead). Notice that we grouped platform overhead under general performance characteristics because some studies alternatively use application-benchmarks with detailed tracing.
Results. Figure 6b summarizes which micro-benchmark performance characteristics are used across which FaaS platforms. The rightmost Total per Characteristic column shows that CPU is by far the most evaluated micro-benchmark characteristic, used by 40% of all studies. Network and file I/O performance are less common for academic literature studies, and even rarer in grey literature. These two less common characteristics are all evaluated on the AWS platform (with the exception of 2 file I/O grey literature studies) but practically not covered on self-hosted platforms (only 2 studies overall). The Others category mainly consists of platform overhead and workload concurrency evaluated through micro-benchmarks.
Discussion. Our results suggest that CPU performance is an overstudied performance characteristic among FaaS micro-benchmarks. Many studies confirm that CPU processing speed scales proportionally to the amount of allocated memory (i.e., configured function size) for AWS [A8, A49, A3, G10, G43] and Google [A8, A49, A3, G10]. This empirically validated behavior is also explicitly stated in the documentation of the providers. For instance, the AWS Lambda documentation states that "Lambda allocates CPU power linearly in proportion to the amount of memory configured." 18 . The Google Cloud Functions documentation also used to mention proportional scaling explicitly. A subset of these studies [A8, A3, G2] also cover Azure and IBM and conclude that these platforms assign the same computational power for all functions. Notice that Azure does not expose an explicit memory size configuration option as common for the other providers, but rather determines available memory sizes based on a customer's service subscription plan 19 .
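To illustrate the kind of CPU micro-benchmark discussed here, the following minimal Python sketch times a prime number calculation inside a function handler. It is not taken from any surveyed study: the `handler(event, context)` signature, the `limit` event field, and the default workload size are assumptions chosen for illustration. Deploying the same function with different memory sizes and comparing `elapsed_ms` would be one way to check the proportional scaling behavior.

```python
import time


def count_primes(limit: int) -> int:
    """Count primes below `limit` with a sieve of Eratosthenes (CPU-bound)."""
    if limit < 2:
        return 0
    sieve = bytearray([1]) * limit
    sieve[0] = sieve[1] = 0
    for n in range(2, int(limit ** 0.5) + 1):
        if sieve[n]:
            # Mark all multiples of n, starting at n*n, as composite.
            sieve[n * n :: n] = bytearray(len(range(n * n, limit, n)))
    return sum(sieve)


def handler(event, context=None):
    """Hypothetical function entry point; `limit` is an assumed event field."""
    limit = int(event.get("limit", 200_000))
    start = time.perf_counter()
    primes = count_primes(limit)
    elapsed_ms = (time.perf_counter() - start) * 1000
    # Report the measured runtime so a client can relate it to the memory size.
    return {"limit": limit, "primes": primes, "elapsed_ms": elapsed_ms}
```

Measuring inside the function excludes network latency and platform overhead, which is why such a benchmark isolates CPU speed rather than end-to-end latency.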

Evaluated General Characteristics (RQ3.3)
Description. We cover four general performance characteristics, namely platform overhead, workload concurrency, instance lifetime, and infrastructure inspection. These general characteristics are orthogonal to previously discussed characteristics, and can be measured using either micro- or application-level benchmarks. Platform overhead (e.g., provisioning of new function instances) mainly focuses on startup latency and in particular on quantifying the latency of cold starts. Workload concurrency refers to workloads that issue parallel requests, or to benchmarks evaluating platform elasticity or scaling behavior (e.g., [A26, A36]). Instance lifetime or infrastructure retention attempts to reverse-engineer the provider policy on how long function instances are kept alive until they get recycled and trigger a cold start upon a new function invocation. Infrastructure inspection aims to reverse-engineer underlying hardware characteristics (e.g., CPU model to detect hardware heterogeneity) or instance placement policy (e.g., instance identifier and IP address to detect co-residency on the same VM/container [A49]).
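A common way to observe cold starts and instance lifetime from inside a function can be sketched as follows. This is a minimal illustration under the assumption that the platform keeps the language runtime process alive between invocations of a warm instance, so that module-level state survives. The names `INSTANCE_ID` and `handler` and the returned fields are hypothetical.

```python
import time
import uuid

# Module-level state is initialized once per function instance (assumption:
# the runtime process is reused for subsequent warm invocations).
INSTANCE_ID = uuid.uuid4().hex
INSTANCE_STARTED_AT = time.time()
_invocations = 0


def handler(event, context=None):
    """Hypothetical entry point reporting cold start and instance age."""
    global _invocations
    _invocations += 1
    return {
        # A new identifier indicates that a new instance was provisioned.
        "instance_id": INSTANCE_ID,
        # Only the first invocation handled by this instance is a cold start.
        "cold_start": _invocations == 1,
        # Age of the instance, useful for estimating instance lifetime.
        "instance_age_s": time.time() - INSTANCE_STARTED_AT,
        "invocations": _invocations,
    }
```

Invoking such a function at increasing idle intervals and recording when `instance_id` changes would give an estimate of how long idle instances are retained.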
Results. Figure 6c summarizes which general performance characteristics are benchmarked across which FaaS platforms. Workload concurrency is a commonly studied characteristic, but more so in academic literature (65%) than in grey literature (48%). On the other hand, grey literature seems to focus more on platform overhead (56%) than academic literature (45%). Infrastructure inspection is exclusively analyzed in academic literature studies (12%). Note that this line of inquiry does not make sense for self-hosted platforms, and hence is not studied in this context. Finally, the data from the Others column shows that there is currently a lack of cross-platform comparisons of function triggers.
Discussion. General performance characteristics focus on particularly relevant aspects of FaaS and only a few studies aim towards reverse-engineering hosted platforms. Elasticity and automatic scalability have been identified as the most significant advantage of using FaaS in a previous survey [2], which justifies the widespread evaluation of concurrency behavior. Given the importance of this characteristic, we argue that concurrent workloads should be an inherent part of all FaaS performance evaluations going forward (going beyond the 50% of studies observed in our corpus). Container start-up latency has been identified as one of the major challenges for using FaaS services in prior work [2] and motivates a large body of work related to quantifying platform overheads.
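As a minimal sketch of a concurrent workload, the following Python snippet issues parallel invocations from a thread pool and collects per-request latencies. The `invoke` callable is a placeholder for whatever client call triggers the function under test (e.g., an HTTP request), and `fake_invoke` is a stand-in used only to keep the example self-contained. Neither is taken from a surveyed study.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def run_concurrent(invoke, concurrency: int, requests_per_worker: int = 1):
    """Call `invoke()` from `concurrency` threads; return latencies in ms."""

    def worker(_worker_index):
        latencies = []
        for _ in range(requests_per_worker):
            start = time.perf_counter()
            invoke()
            latencies.append((time.perf_counter() - start) * 1000)
        return latencies

    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        results = pool.map(worker, range(concurrency))
        # Flatten the per-worker lists into one list of latencies.
        return [ms for worker_latencies in results for ms in worker_latencies]


def fake_invoke():
    """Stand-in for a real function invocation (assumption for this sketch)."""
    time.sleep(0.01)


latencies_ms = run_concurrent(fake_invoke, concurrency=4, requests_per_worker=2)
print(len(latencies_ms))  # 8
```

Increasing `concurrency` stepwise while recording the latency distribution per step is one simple way to probe scaling behavior. A real load generator would additionally need to handle errors and client-side bottlenecks.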
In prior IaaS cloud performance evaluation research, reverse-engineering cloud providers was a common theme and led to exploitation approaches for hardware heterogeneity [16,17]. However, as hardware heterogeneity became less relevant over time [18], we refrain from interpreting the current lack of infrastructure inspection studies as a research gap that requires more attention. The lack of studies from grey literature might also hint that this characteristic is currently of less interest to practitioners.

Used Platform Configurations (RQ4)
To answer RQ4, we present a facetted barplot (Figure 8), visualizing the share of studies using a given configuration. We report the share as percentage against all academic and all grey literature studies.

Used Language Runtimes (RQ4.1)
Description. The language runtime is the execution environment of a FaaS function. Fully managed platforms offer a list of specific runtimes (e.g., Node.js, Python) determining the operating system, programming language, and software libraries. Some providers support the definition of custom runtimes by following a documented interface, often in the form of Docker images. If customization is not available in a platform, shims can be used to invoke an embedded target language runtime through a support runtime via system calls (e.g., invoking binaries through Node.js).
Results. Figure 8a shows how frequently different language runtimes are evaluated. Overall, Python and Node.js are the most popular runtimes, followed by Java. Interestingly, Node.js and Java are twice as popular in grey literature as in academic literature. Grey literature generally covers more, and more diverse, languages in comparison to academic works. The category of Others includes a list of 13 languages (e.g., F#, Scala, Haskell) evaluated through custom runtimes or shims.
Discussion. The large differences between academic and grey literature indicate a potential mismatch of academic and industrial interests. This assumption is supported by other studies reporting the use of FaaS languages that similarly conclude Node.js to be roughly 20% more popular than Python 20 [2].

Used Function Triggers (RQ4.2)
Description. Function triggers cover alternative ways of invoking FaaS functions. FaaS functions can be triggered explicitly (e.g., through code) or implicitly through events happening in other services (e.g., image uploaded to cloud storage triggers function). HTTP triggers invoke functions on incoming HTTP requests. SDK and CLI triggers use [...]
Results. Figure 8b shows how frequently different types of function triggers are evaluated. HTTP triggers are by far the most commonly evaluated type of trigger, and are used by 57% of all studies. Invocation through storage triggers is surprisingly uncommon for grey literature (10%). In general, only two studies cover more than two trigger types [A41, A27], with the vast majority focusing on a single type of trigger.
Discussion. It appears function triggering has received little attention given that most studies go for the de-facto default option of exposing a function via HTTP. There is a wide range of other ways to trigger function execution (e.g., through a message queue, data streaming service, scheduled timer, database event, an SDK, etc.), which are currently not widely used and evaluated.
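A benchmark that compares triggers needs to record which trigger type caused each invocation. The following minimal Python sketch classifies an incoming event by inspecting its payload. The field names (`Records`, `eventSource`, `EventSource`, `httpMethod`, `source`) reflect our understanding of common AWS event payloads and should be treated as assumptions to verify against the provider documentation. Other platforms use different event shapes, and the function name and return labels are hypothetical.

```python
def trigger_type(event: dict) -> str:
    """Best-effort classification of the trigger behind an event payload."""
    records = event.get("Records")
    if records:
        # Assumed shape: storage, queue, and stream events carry a list of
        # records whose source looks like "aws:s3" or "aws:sqs".
        first = records[0]
        source = first.get("eventSource") or first.get("EventSource") or ""
        return source.split(":")[-1] or "unknown"
    if "httpMethod" in event:
        # Assumed shape: HTTP events forwarded by an API gateway.
        return "http"
    if event.get("source") == "aws.events":
        # Assumed shape: scheduled timer events.
        return "timer"
    # Direct SDK/CLI invocations pass arbitrary payloads, so they cannot be
    # distinguished from unknown event shapes by inspection alone.
    return "sdk_or_unknown"


print(trigger_type({"Records": [{"eventSource": "aws:s3"}]}))  # s3
print(trigger_type({"httpMethod": "GET", "path": "/"}))         # http
```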

Used External Services (RQ4.3)
Description. We now discuss which external services are commonly used in FaaS performance evaluations. Cloud API gateways offer a fully managed HTTP service, which is commonly used to trigger functions upon incoming HTTP requests. Cloud storages offer object storage for blob data, such as images. Cloud databases offer structured data storage and querying. Cloud workflow engines manage the state of complex processes across multiple functions. Cloud stream, cloud queue, and cloud pub/sub are different types of message processing services. Cloud networks refer to configurable private network services, such as AWS Virtual Private Cloud (VPC).
Results. Figure 8c shows how frequently different external services are used. Cloud API gateways are the most commonly used external service, which is unsurprising given that most studies use HTTP events to trigger functions. About half of the academic literature studies use cloud storage compared to only 10% of grey literature studies. Overall, database services are among the most popular integrations. The Others category includes caching services, self-hosted databases, and special services such as artificial intelligence APIs. In general, given how central service ecosystems are to the value proposition of cloud functions, it is surprising how rarely FaaS benchmarking studies incorporate external services beyond API gateways.
Discussion. The result from function triggers explains the strong emphasis on cloud API gateway services. However, other studies indicate that database services are more commonly used in conjunction with FaaS 21 [2]. In this context we are particularly surprised that even in grey literature, FaaS solutions are typically evaluated in isolation. A possible explanation lies in the strong focus of grey literature on micro-benchmarks, which typically use no external services or only an API gateway for easy invocation. We conclude that the integration of external services in FaaS performance evaluations in a meaningful way remains a gap in current literature.

Reproducibility (RQ5)
Description. To evaluate the maturity of literature with regard to reproducibility, we rely on recent work by Papadopoulos et al. [7]. They propose eight methodological principles for reproducible performance evaluation in cloud computing, which we now summarize and apply to our corpus:
P1 Repeated Experiments: Repeat the experiment with the same configuration and quantify the confidence in the final result.
P2 Workload and Configuration Coverage: Conduct experiments with different (preferably randomized) workloads and configurations motivated by real-world scenarios.
P3 Experimental Setup Description: For each experiment, describe the hardware and software setup, all relevant configuration and environmental parameters, and its objective.
P4 Open Access Artifact: Publish technical artifacts related to the experiment including software (e.g., benchmark and analysis code) and datasets (e.g., raw and cleaned data).
P5 Probabilistic Result Description: Describe and visualize the empirical distribution of the measured performance appropriately (e.g., using violin or CDF plots for complex distributions), including suitable aggregations (e.g., median, 95th percentile) and measures of dispersion (e.g., coefficient of variation also known as relative standard deviation).
P6 Statistical Evaluation: Use appropriate statistical tests (e.g., Wilcoxon rank-sum) to evaluate the significance of the obtained results.
P7 Measurement Units: For all the reported measurements also report the corresponding unit.
P8 Cost: Report the calculated (i.e., according to cost model) and charged (i.e., based on accounted resource usage) costs for every experiment.
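To make P1 and P5 more concrete, the following minimal Python sketch summarizes repeated latency measurements with a median, a 95th percentile, and the coefficient of variation, and quantifies confidence in the median with a percentile bootstrap. The bootstrap is one possible choice that we swap in for illustration and is not prescribed by the principles. The function names, the sample data, and the default parameters are assumptions.

```python
import random
import statistics


def summarize(latencies_ms):
    """Aggregations and dispersion measures in the spirit of P5."""
    percentiles = statistics.quantiles(latencies_ms, n=100, method="inclusive")
    mean = statistics.fmean(latencies_ms)
    return {
        "n": len(latencies_ms),
        "median": statistics.median(latencies_ms),
        "p95": percentiles[94],  # 95th of the 99 percentile cut points
        "cv": statistics.stdev(latencies_ms) / mean,  # relative std. deviation
    }


def bootstrap_median_ci(samples, confidence=0.95, resamples=2000, seed=42):
    """Percentile bootstrap confidence interval for the median (P1)."""
    rng = random.Random(seed)
    medians = sorted(
        statistics.median(rng.choices(samples, k=len(samples)))
        for _ in range(resamples)
    )
    tail = (1 - confidence) / 2
    lower = medians[int(tail * resamples)]
    upper = medians[int((1 - tail) * resamples) - 1]
    return lower, upper


# Hypothetical measurements with one slow outlier (e.g., a cold start).
data = [100, 110, 120, 130, 1000]
summary = summarize(data)
ci_low, ci_high = bootstrap_median_ci(data)
print(summary["median"])  # 120
```

For this sample, the arithmetic mean (292) lies far above the median (120), which illustrates why reporting averages alone can be misleading for skewed latency distributions.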
Results. Figure 9 shows to what extent the reproducibility principles from Papadopoulos et al. [7] are followed by our selected academic and grey literature. Overall, we find that 7 out of 8 principles are not followed by the majority of studies and, although academic literature performs better (>20%) than grey literature for 3 principles (i.e., P2, P3, P8), we do not see a clear trend that academic work follows the proposed principles more strictly. Interestingly, grey literature is even better than academic literature with regard to providing open access (P4) and probabilistic result descriptions (P5).

[Figure 9 legend: Literature Type (Academic, Grey)]

Further, academic studies tend to define their experiment goals more formally based on testable hypotheses (e.g., Figiela et al. [A8] or Manner et al. [A34]).
P4: Open Access Artifact. Technical artifacts are unavailable for 61% of the academic and 43% of the grey literature studies. Grey literature more commonly publishes their benchmark code (43% vs 16%) but more academic studies provide complete open source access to benchmark code and collected datasets (24% vs 15%). Within the partial fulfillment category, only two grey literature studies are exceptions in that they publish solely their dataset but not their benchmark code, rather than vice versa. We discovered one of the following three practical issues related to handling open access artifacts in 9% of all studies. Firstly, we found inaccessible links in 3 studies that claim their artifacts are open source. Secondly, we noticed obviously incomplete implementations (e.g., only for one provider, isolated legacy code snippet, code within inaccessible pictures) in another 3 studies. Thirdly, we discovered open source artifacts that were not explicitly linked in 4 studies but could be discovered via manual Google or GitHub search or were implicitly linked in user comments (e.g., upon request of commenters).
The violation of P4 is particularly severe in combination with insufficient experimental setup description (P3). A total of 19 (17%) studies provide neither technical artifacts nor a proper experimental setup description, rendering these studies practically impossible to replicate. Another 20 (18%) studies violate P4 and omit relevant details in their experimental setup description. Thus, these studies are hard to replicate under similar conditions (but a "similar" experiment could be conducted).
P5: Probabilistic Result Description. About 40% of all studies appropriately visualize or characterize their empirical performance data, but roughly the same percentage of all studies ignore complex distributions and primarily focus on reporting averages. These nearly 40% of the studies fulfilling P5 commonly use CDFs, histograms, or boxplots complemented with additional percentiles. The 15% of academic and 26% of grey literature studies partially fulfilling P5 often give some selective characterization of the empirical distribution by plotting (raw) data over time or by violating P1 (i.e., insufficient repetitions).
P6: Statistical Evaluation. Almost none of the selected studies perform any statistical evaluations. Only two academic papers and one preprint use statistical tools such as the Spearman correlation coefficient [A34] or a nonparametric Mann-Whitney U test [G49].
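As an illustration of P6, the following minimal Python sketch implements a two-sided Mann-Whitney U test (also known as the Wilcoxon rank-sum test) using the normal approximation. It is a simplified teaching version: it applies no tie or continuity correction and is only reasonable for moderately large samples, so an established statistics library would be preferable in practice. The function name and the sample values are assumptions.

```python
import math
from statistics import NormalDist


def mann_whitney_u(x, y):
    """Two-sided Mann-Whitney U test via the normal approximation.

    Returns the U statistic of `x` and an approximate p-value.
    No tie or continuity correction is applied (simplification).
    """
    n, m = len(x), len(y)
    # U counts the pairs where the x value exceeds the y value; ties count 0.5.
    u = sum(1.0 if a > b else 0.5 if a == b else 0.0 for a in x for b in y)
    mean_u = n * m / 2
    sd_u = math.sqrt(n * m * (n + m + 1) / 12)
    z = (u - mean_u) / sd_u
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return u, p_value


# Hypothetical latencies (ms) of the same function on two configurations.
config_a = [101, 98, 105, 110, 99, 102, 97, 104]
config_b = [120, 118, 125, 117, 130, 122, 119, 121]
u_stat, p = mann_whitney_u(config_a, config_b)
print(u_stat)  # 0.0
```

Here every value of `config_a` is smaller than every value of `config_b`, so U is 0 and the approximate p-value is small, indicating a difference that is unlikely to be due to chance alone.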
P7: Measurement Units. Overall, P7 is followed almost perfectly with no major violations. Grey literature occasionally (16%) omits measurement units (most commonly in figures) but the missing unit can be derived relatively easily from the context (or is mentioned in the text).
P8: Cost. Cost models are missing in 55% of the academic and 79% of grey literature. Two academic studies fulfill P8 partially by discussing costs in a general sense (e.g., as motivational example), but without discussing actual costs of the experiments. While there are some studies that particularly focus on costs (e.g., Kuhlenkamp and Klems [A24]), most studies typically calculate costs based on accounted or self-measured resource usage (e.g., runtime), but omit the actually charged cost.
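The distinction between calculated and charged cost can be illustrated with a simple cost model. The following minimal Python sketch computes the calculated cost of an experiment from the number of invocations, the average duration, and the configured memory size. All prices and the billing increment are parameters supplied by the caller. The values used below are purely hypothetical and not actual provider prices, and rounding the average duration (rather than each invocation) is a simplification.

```python
import math


def calculated_cost(invocations, avg_duration_ms, memory_mb,
                    price_per_gb_second, price_per_million_requests,
                    billing_increment_ms=1):
    """Calculated (model-based) cost of an experiment in currency units."""
    # Round the duration up to the provider's billing increment.
    billed_ms = (math.ceil(avg_duration_ms / billing_increment_ms)
                 * billing_increment_ms)
    gb_seconds = invocations * (billed_ms / 1000) * (memory_mb / 1024)
    compute_cost = gb_seconds * price_per_gb_second
    request_cost = invocations / 1_000_000 * price_per_million_requests
    return compute_cost + request_cost


# Hypothetical prices for illustration only.
cost = calculated_cost(
    invocations=1_000_000,
    avg_duration_ms=120,
    memory_mb=1024,
    price_per_gb_second=0.00001,
    price_per_million_requests=0.2,
    billing_increment_ms=100,
)
print(round(cost, 2))  # 2.2
```

Comparing such a calculated value against the amount that actually appears on the provider's bill is what distinguishes the two cost figures requested by P8.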
Discussion. We expected peer-reviewed academic literature to consistently achieve more methodological rigour than largely individually-authored grey literature. Surprisingly, we do not see a clear trend that academic literature disregards the principles less often than grey literature. It is concerning that even simple principles such as publishing technical artifacts are frequently neglected, and grey literature is even better in providing at least partial open access. Methodologically long-known principles from academic literature are still commonly overlooked in academia, exemplified by statistical guidance from 1986 on avoiding misleading arithmetic mean values [21]. The presumably "more informal" grey literature is often on par or even better in appropriately describing performance results.
On the other hand, we emphasize that the clear lead of academic literature for three principles (i.e., P2, P3, P8) goes beyond the expressiveness of a 3-point discrete scale (i.e., yes, partial, no). Experimental setup description (P3) has many facets and our results prevalently cover the presence or absence of relevant conditions, but fail to appropriately account for other important facets, such as clear structure and presentation. Grey literature includes examples of unstructured studies, where results are presented without any discussion of methodology and scarce details about the experiment setup are scattered throughout a blog post. In terms of P2, grey literature frequently picks one of the easiest available workloads, whereas academic studies more often motivate their workloads and attempt to link them to real-world applications.
We found that although many studies seemingly evaluate similar performance characteristics, comparing actual performance results is very difficult due to a large parameter space, continuously changing environments, and insufficient experimental setup descriptions (P3). We collected some exemplary results for the hosted AWS platform and find dramatic differences in numbers reported for platform overhead/cold starts ranging from 2 ms (80th percentile, Python, 512 MB but presumably reporting something else, maybe warm-start execution runtime of an empty function) [G5] up to 5 s (median, Clojure via Java JVM, 256 MB) [G54]. More common results for end-to-end (i.e., including network latency of typically pre-warmed HTTPS connection) cold start overhead (i.e., excluding actual function runtime) for the Node.js runtime on AWS (according to live data from 2020-02) are on the order of ≈50 ms (median) to ≈100 ms (90th percentile) [A8, G11]. Studies from 2019 tend to report slightly higher numbers mostly around 200-300 ms (median) [G11, G33, G3].
In the following, we highlight some insights into practical reproducibility related to P3 and P4. We strongly agree with Papadopoulos et al. [7] that preserving and publishing experiment artifacts (P4) may be the only way to achieve practical reproducibility given that an exhaustive description (P3) of a complex experiment is often unrealistic. We further argue that at least any time-consuming repetitive manual steps (but preferably any error-prone manual setup step that could lead to potential misconfiguration and affect the outcome of a study) should be fully automated [22]. We are pleased to find many automated setup and evaluation approaches in open source artifacts (P4) accompanying our studies, but still encounter too many studies with nonexistent or tedious manual setup instructions.

Implications and Gaps in Literature
We now discuss the main findings and implications of our study and identify gaps in current literature.

Publication Trends (RQ1)
FaaS performance evaluation is a growing field of research in academic as well as grey literature, with a surge of new studies appearing in 2018. Around 20% of the selected academic studies are published in top-ranked conferences or journals, surprisingly few studies appear in mid-tier venues, and a majority of studies appear in workshops and unranked venues. The most popular target venues for FaaS benchmarking studies are the International Workshop on Serverless Computing (WoSC), USENIX Annual Technical Conference (ATC), and IEEE International Conference on Cloud Computing (CLOUD).

Studied Platforms (RQ2)
The most evaluated platforms are AWS Lambda (88%), Azure Functions (26%), Google Cloud Functions (23%), IBM Cloud Functions (13%), and self-hosted platforms (14%), predominantly Apache OpenWhisk. In absolute numbers, AWS is currently overstudied (by a factor of 3x). However, other sources have reported that AWS is also predominant in actual production usage by a similar margin (see Section 5.2-Discussion). Despite current industrial practice, future FaaS benchmarking studies should go beyond performance evaluations for the most popular platforms (e.g., avoid studying only AWS) to broaden our understanding of the field in general. Particularly concerning in this context is that quickly rising cloud providers (e.g., Alibaba Cloud Function Compute as the leading Asian cloud provider 26 ) currently see virtually no attention in literature.

Evaluated Performance Characteristics (RQ3)
The lack of cross-platform benchmarks is a common theme across the performance characteristics discussed in the following.

Evaluated Benchmark Types (RQ3.1)
The predominant use of micro-benchmarks in 75% of all studies indicates an over-emphasis on simple easy-to-build benchmarks, compared to application-benchmarks, which are used in 57% of the academic and 31% of the grey literature studies (i.e., overall 18% use both). This insight is supported by the large percentage of studies conducting platform overhead benchmarks with trivial functions (e.g., returning a constant) and CPU benchmarks using common workloads (e.g., prime number calculations). Future work needs to go beyond such over-simplified benchmarks, and focus on more realistic benchmarks and workloads. We also identify a need to develop cross-platform application-level benchmarks as the current focus on a single platform (88% of all application-benchmarks are evaluated on AWS) limits their usefulness for comparing platforms. However, such cross-platform benchmarks are challenging to develop due to heterogeneous platforms and their complex ecosystems [23].

Evaluated Micro-Benchmarks (RQ3.2)
Most micro-benchmarks (40%) evaluate CPU performance, and show that CPU performance in FaaS systems is indeed proportional to the memory size of the selected function type for certain providers (i.e., AWS, Google). This is disappointing, as this behavior is well-documented by the cloud providers themselves and does not justify much further study. We understand the need for periodic re-evaluations due to the dynamic nature of continuously evolving FaaS platforms [24] and want to emphasize the importance of studies targeting continuous benchmarking efforts (see examples in Section 5.5-P1). However, given the large scientific support that CPU performance of FaaS services behaves as documented, we suggest that future studies de-emphasize this aspect and focus on other characteristics such as network or function trigger performance (or real-world application-benchmarks).

Evaluated General Characteristics (RQ3.3)
The most evaluated general performance characteristics are FaaS platform overhead (i.e., cold starts) and workload concurrency (i.e., invoking the same function in parallel), both used by about half of the studies. This makes sense, as these aspects link to FaaS specifics and the most significant advantages of using FaaS, as reported in other surveys [2]. No study currently evaluates function triggers across platforms. We think the integration through triggers is an important aspect for FaaS performance, where insights can guide decisions about function invocation, function coordination, and usage of appropriate external services. A major open research challenge towards such cross-platform benchmarks is the heterogeneous landscape of FaaS systems [23].

Used Platform Configurations (RQ4)
Our study indicates a broader coverage of language runtimes, but shows that other platform configurations focus on very few function triggers and external services.

Used Language Runtimes (RQ4.1)
We identify a mismatch between academic and industrial sources, as Node.js, Java, Go, and C# are evaluated two times more frequently in grey literature than in academic work. Grey literature generally covers more, and more diverse, runtimes than academic literature. We suggest that future academic literature studies diversify their choice of runtimes, potentially also including insufficiently researched runtimes, such as Go or C#.

Used Function Triggers (RQ4.2)
At the moment, a majority of studies (57%) focuses on HTTP triggers. We conclude that many trigger types remain largely insufficiently researched and suggest future studies to explore alternative triggers, such as message queues, data streams, timers, or SDKs.

Used External Services (RQ4.3)
Integrating external services in a meaningful way into FaaS performance evaluation studies remains an open challenge. Despite their importance to overall serverless application performance, most current evaluations choose to abstract away from external services. The only services we have seen used with some frequency are cloud API gateways (57%), cloud storage (47% academic vs 10% grey literature), and cloud databases (10-15%).

Reproducibility (RQ5)
We find that 7 of 8 reproducibility principles are not followed by the majority of the analyzed studies. This is in line with the results of the original study [7] on cloud experimentation in general. We classify one third of all studies as practically impossible or hard to replicate under reasonably similar conditions due to the simultaneous lack of sufficient experimental setup description and available artifacts. Overall, academic studies tend to satisfy the principles more comprehensively than grey literature but we do not see a clear trend that academic literature is less susceptible to disregarding the principles. Academic work is considerably better (principle fully met >20%) than grey literature in choosing appropriate workloads (P2), describing the experimental setup (P3), and reporting costs (P8). However, grey literature is considerably better in providing at least partial open access to experimental artifacts (i.e., code and data). We support the trend towards artifact evaluations 27 and recommend focusing on artifact availability first (e.g., explicitly include availability check in reviewer guidelines) and subsequently targeting more qualitative attributes (e.g., ACM Functional, defined as documented, consistent, complete, exercisable).

Related Work
We compare and relate our results to existing literature reviews on FaaS and more generally on cloud performance evaluations, and compare our FaaS-specific results on reproducibility principles to the original study on cloud experimentation.

Literature Reviews on FaaS
Kuhlenkamp and Werner [8] introduce a methodology for a collaborative SLR on FaaS benchmarking and report on preliminary results for 9 studies. They capture more fine-grained experiments within each paper and extract data regarding workload generator, function implementation, platform configuration, and whether external services are used. A completeness score of these categories represents the reproducibility of FaaS experiments and indicates insufficient experimental description. Somu et al. [26] summarize the capabilities of 7 FaaS benchmarking studies along 34 characteristics for parameters, benchmarks, and metrics. Their results indicate a strong focus on the AWS Lambda platform and identify a lack of support for function chaining, especially in combination with different trigger types. These two most related works hint towards some of our results but cannot confidently identify overall trends due to their limited scope.
27 https://www.acm.org/publications/policies/artifact-review-badging
Taibi et al. [27] conduct an MLR on serverless cloud computing patterns to catalogue 32 patterns originating from 24 sources. Their MLR has a strong practitioner perspective but is limited to 7 peer-reviewed sources. Our work focuses on performance whereas their pattern catalogue only occasionally mentions performance as part of discussing a pattern.
Yussupov et al. [10] conduct a systematic mapping study on FaaS platforms and tools to identify overall research trends and underlying main challenges and drivers in this field across 62 selected publications. Their work covers a broader range of FaaS research and explicitly excludes FaaS benchmarking studies "without proposing any modifications" [10] through their exclusion criteria. Nevertheless, they identify 5 benchmarking studies and 26 function execution studies on performance optimization. Al-Ameen and Spillner [28] introduced a curated "Serverless Literature Dataset" that initially covered 60 scientific publications and preprints related to FaaS and Serverless computing in general, but in its latest Version 0.4 (2019-10-23) [9] the dataset has been extended to 188 articles. The authors do not classify their work as a survey itself, but rather envision its potential as input for future surveys such as ours. We demonstrate this potential in the manual search process for academic literature where the serverless literature dataset covers 34 out of 35 relevant studies. These two general studies identify publication trends, common technologies, and categories of research but do not extract and synthesize more specific data on the FaaS benchmarking aspects we cover in our work. To the best of our knowledge, we present the first comprehensive and systematic literature review on FaaS performance evaluation covering academic as well as grey literature.

Literature Reviews on Cloud Performance
We relate our results to existing literature reviews on general cloud performance topics. These studies apply methods similar to ours, but in the context of cloud performance evaluation in general. Li et al. [29] conducted an SLR on evaluating commercial cloud services for 82 relevant studies. Their work is methodologically closely related to our MLR, but targets a more general field of research than our FaaS benchmarking study. Their SLR has a strong focus on publication trends and performance metrics building upon the authors' previous work on cataloguing [30] and classifying [31] performance evaluation metrics. In contrast, our work specializes on performance characteristics in the field of FaaS, extends the scope beyond academic research by including grey literature, and reports on the reproducibility of the analyzed studies. Leitner and Cito [24] used an SLR methodology and open coding for identifying hypotheses seeding their principled experimental validation study on performance predictability of public IaaS clouds. They performed experimental validation on common patterns of results and conclusions but did not extract further data on benchmarking studies. A recent preprint (March 2020) [32] conducts an SLR on benchmarks and metrics within software engineering in the context of migrating from monolithic to microservice architectures. The most frequent metrics for their 33 selected articles are latency, CPU, throughput, and network, indicating that their study partially uses similar characteristics but in a less structured way (e.g., network and throughput are orthogonal aspects).

Reproducibility Principles
We compare our FaaS-specific results to the results of the original study by Papadopoulos et al. [7] on more general experimentation in cloud environments. Our MLR study specifically targets FaaS experiments for academic and grey literature resulting in a largely disjoint set of studies with only 2 of our studies matching their stricter venue and impact criteria (i.e., >= 15 citations). Overall, our results for academic literature studies are reasonably similar (±10%) except for P1 and P5. For P1, we speculate that we might have been more lenient in classifying studies, especially when no long-time experiments were present. For P5, we see an improvement and notice more widespread use of CDFs, histograms, and boxplots or dotplots with error margins and accompanying percentiles. Smaller trends suggest that more of our selected studies tend to open source technical artifacts (P4) and report costs (P8), but perform slightly worse in workload and configuration coverage (P2).

Conclusion
This paper presented results from the first systematic and comprehensive survey on FaaS performance evaluation studies. We conducted a multivocal literature review (MLR) across 112 studies from academic (51) and grey (61) literature. Our main findings are that AWS Lambda is the most evaluated FaaS platform (88%), that micro-benchmarks are the most common type of benchmark (75%), and that application benchmarks are currently prevalently evaluated on a single platform. We further find that reproducibility principles on cloud experimentation from prior work are not followed by the majority of studies. Academic studies tend to satisfy the principles more comprehensively than grey literature, but we do not see a clear trend that academic literature is less susceptible to disregarding the principles. We identify gaps in literature and give actionable recommendations highlighting the next steps towards compliance with the reproducibility principles. We recommend that future studies broaden their scope of platforms beyond AWS as a single platform and, in particular, contribute cross-platform application-level benchmarks. FaaS performance evaluation studies need to address flaws threatening their reproducibility and should particularly focus on choosing relevant workloads, publishing collected datasets, and statistically evaluating their results. Our survey consolidates existing work and can guide future research directions. It provides a useful instrument for systematic discovery of related works and thus helps future studies to relate and discuss their results in a wider context.

Declaration of Competing Interests
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

CRediT authorship contribution statement