MLLess: Achieving Cost Efficiency in Serverless Machine Learning Training

Function-as-a-Service (FaaS) has raised a growing interest in how to "tame" serverless computing to enable domain-specific use cases such as data-intensive applications and machine learning (ML), to name a few. Recently, several systems have been implemented for training ML models. Certainly, these research articles are significant steps in the right direction. However, they do not completely answer the nagging question of when serverless ML training can be more cost-effective than traditional "serverful" computing. To help in this endeavor, we propose MLLess, a FaaS-based ML training prototype built atop IBM Cloud Functions. To boost cost-efficiency, MLLess implements two innovative optimizations tailored to the traits of serverless computing: on one hand, a significance filter, to make indirect communication more effective; on the other hand, a scale-in auto-tuner, to reduce cost by benefiting from the FaaS sub-second billing model (often per 100ms). Our results certify that MLLess can be 15X faster than serverful ML systems at a lower cost for sparse ML models that exhibit fast convergence, such as sparse logistic regression and matrix factorization. Furthermore, our results show that MLLess can easily scale out to increasingly large fleets of serverless workers.


Introduction
A vivid interest has recently arisen around serverless computing and its implications for general-purpose computation. Originally geared towards web microservices and IoT applications, serverless computing has recently started to be examined by researchers for its potential in data-intensive applications [2,3,4,5,6,7,8,9]. Altogether, these works have led to a clear identification of "what" workloads are best suited to serverless computing.
Similarly, a recent trend of building machine learning (ML) systems on top of Function-as-a-Service (FaaS) platforms has emerged as a new research area [10,11,12,13,14,15]. Since ML inference is a trivial use case for FaaS computing [13,14], attention has turned to ML model training, which is a great deal more difficult. Despite all the preceding efforts, it still remains uncertain under what conditions ML training on top of FaaS may be beneficial. This is not a trivial question, as evaluating serverless ML training is not as simple as running VM-based ML systems such as PyTorch or TensorFlow on top of cloud functions. The fundamental reason is that traditional ML systems are not prepared to deal with the idiosyncrasies of the FaaS model, such as the impossibility of function-to-function communication and the limited memory and transient nature of serverless functions [16,11].
Our aim in this work is to understand the feasibility of supporting distributed ML training over FaaS platforms. Concretely, we are interested in the following question: When can a FaaS platform be more cost-efficient than a VM-based, "serverful" substrate (IaaS) for distributed ML training?
To help in this endeavor, we introduce MLLess, a prototype FaaS-based ML training system atop IBM Cloud Functions. To pick a point in the design space that is more cost-efficient than prior serverless ML systems [10,11,12], MLLess introduces two novel optimizations tailored to the traits of the FaaS model.
Limitations. Despite the good news, it is important to remind ourselves of the most prominent hurdles to serverless ML training. First, today's FaaS platforms only support stateless function calls with limited resources and duration. For instance, a function call in IBM Cloud Functions can use up to 2GB of RAM and must finish within 10 minutes. Such limits automatically rule out some natural practices, such as loading all training data into local memory; instead, the data must be downloaded from shared storage in mini-batches. They also inhibit the use of any ML library that has not been designed with these constraints in mind. For instance, the authors of Cirrus [11], a serverless ML system, found it impossible to run TensorFlow [19] or Spark [20] on AWS Lambdas in such resource-constrained setups.
However, the most critical issue is the impossibility of direct communication, which requires a trip through shared external storage to pass state between functions. This not only adds significant extra latency, often hundreds of milliseconds, but also prevents exploiting the HPC communication topologies adopted in ML, such as tree-structured and ring-structured all-reduce [21]. For this reason, careful optimization of communication, ranging from the serialization of highly sparse models to the development of communication-reduction techniques such as our significance filter, is crucial.

MLLess
We implement MLLess, a prototype FaaS-based ML training system built on top of IBM Cloud Functions. In this section, we describe its main components and defer the explanation of our two key optimizations to §4.

System Overview
An architectural overview of MLLess is illustrated in Fig. 1. MLLess consists of a driver that runs on the local machine of the data scientist. When the user launches an ML training job, the driver invokes the requested number of serverless workers, which execute the job in a data-parallel manner. Each worker maintains a local replica of the model and uses the MLLess library to train it. We have chosen this decentralized design for MLLess since it abides better by a pure FaaS architecture than the VM-based parameter server [22] model followed by other works such as Cirrus [11].
Supervisor. Since the driver is typically far from the data center (e.g., at a university lab), tasks such as aggregating statistics to determine whether the convergence criterion has been reached can introduce significant delays. To minimize latency, the driver also starts up a serverless function which acts as a supervisor. The role of the supervisor is to collect and aggregate statistics, synchronize worker progress (e.g., in order to bound the divergence between model copies), and terminate the training job when the stopping criterion is fulfilled, among other tasks. Nevertheless, one of the core duties of the supervisor is to automatically remove workers when their marginal contribution to convergence is minor, or even negative due to increased communication costs (see §4.2 for details).
Since the supervisor is a serverless function, it is subject to the time constraints of the underlying FaaS platform, in this case a maximum execution time of 600 seconds (IBM Cloud Functions). Although the supervisor never ran out of time in our experiments, it would not be laborious for the supervisor to pause execution when the 10-minute timeout is close, checkpoint its internal state to storage, and re-launch itself as a new function.
Synchronization. The iterative nature of ML algorithms may imply certain dependencies across successive iterations. To keep consistency, synchronizations between workers must happen at certain boundary points. To this aim, MLLess supports different consistency models: the Bulk Synchronous Parallel (BSP) model where the workers must wait for each other at the end of every iteration, and the Stale Synchronous Parallel (SSP) [17], a synchronization model that relaxes consistency by permitting workers to read stale parameter values as long as they are not too "stale". SSP was proposed to overcome the straggler problem suffered by BSP, where each iteration proceeds at the pace of the slowest worker. For this reason, SSP defines an explicit "slack" parameter for coordinating progress among the workers. The slack specifies how many iterations out-of-date the local replica of a worker can be, which implicitly dictates how far ahead of the slowest worker any worker is allowed to progress. For instance, with a slack of s, a worker at iteration t is guaranteed to see all updates from iterations 1 to t − s − 1, and it may see (not guaranteed) the updates from iterations t − s to t − 1. We set BSP as the default synchronization model because it simplifies the reasoning about the impact of our optimizations on model convergence.
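To make the two models concrete, the sketch below shows the gating rule a worker could apply before starting its next iteration; `my_iter` and `slowest_iter` are illustrative names, not MLLess's API:

```python
def may_proceed(my_iter: int, slowest_iter: int, slack: int) -> bool:
    """Gate a worker before it starts iteration my_iter + 1.

    slack = 0 gives BSP (lockstep progress); slack = s gives SSP, where a
    worker at iteration t is guaranteed to see all updates from iterations
    1 to t - s - 1, and may run ahead of the slowest worker by at most s.
    """
    return my_iter - slowest_iter <= slack
```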
MLLess also includes a variant of BSP where the workers only send those updates that are significant. This variant reduces communication costs, but allows local model copies to diverge across workers (see §4.1). Compared with SSP, our variant restricts how inaccurate the aggregated update for a model parameter can be, in comparison to its current value, instead of bounding staleness in terms of the iteration count.
Communication channels. Due to the absence of direct communication between the workers or with the supervisor, MLLess establishes two channels of indirect communication:
• Signaling channel. For exchanging control messages between the workers and the supervisor (e.g., to signal a worker to advance to the next iteration), MLLess leverages a messaging service built on RabbitMQ, though it could be replaced by IBM's native MQ messaging service without complications.
• Intermediate state. For sharing the intermediate state generated during model training (e.g., local gradients), MLLess employs Redis, a low-latency, in-memory key-value store that supports thousands of requests/s [23].
To store the input dataset mini-batches, MLLess uses IBM COS, the serverless object storage service of IBM Cloud. Since it is an "always-on" service, it does not incur any startup delay; it is somewhat slower than Redis, but sufficient for our purposes.

Model Training
MLLess assumes that a training dataset $D$ consists of $N$ independent and identically distributed (IID) data samples drawn from the underlying data distribution:

$$D = \{(v_i, l_i)\}_{i=1}^{N},$$

where $v_i$ denotes the feature vector and $l_i$ represents the label of the $i$-th data sample. The objective of training is to find an ML model $x$ that minimizes a loss function $f$ over the dataset $D$:

$$\arg\min_x f(x) = \arg\min_x \frac{1}{N} \sum_{i=1}^{N} f_i(x).$$

In today's systems, one typical optimizer is Stochastic Gradient Descent (SGD) [24], an iterative training algorithm that adjusts $x$ based on a few samples at a time:

$$x_{t+1} = x_t - \eta_t \nabla f_{B_t}(x_t),$$

where $\eta_t$ is the learning rate at step $t$ of the algorithm, $B_t$ is a mini-batch of $B$ training samples, and $\nabla f_{B_t}(x_t)$ is the gradient of the loss function, averaged over the batch samples:

$$\nabla f_{B_t}(x_t) = \frac{1}{B} \sum_{i \in B_t} \nabla f_i(x_t).$$

MLLess supports different optimizers (see Table 1 for further details).
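For concreteness, a minimal NumPy sketch of one such SGD step; the helper `grad_fn` is an assumption standing in for the model-specific per-sample gradient code:

```python
import numpy as np

def sgd_step(x, batch_v, batch_l, grad_fn, eta_t):
    """One mini-batch SGD step: x_{t+1} = x_t - eta_t * grad f_{B_t}(x_t).

    batch_v holds the B feature vectors of the mini-batch, batch_l the B
    labels; grad_fn(x, v, l) returns the per-sample gradient of the loss.
    """
    grads = np.stack([grad_fn(x, v, l) for v, l in zip(batch_v, batch_l)])
    return x - eta_t * grads.mean(axis=0)  # average gradient over B samples
```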
Since serverless workers have very limited memory (e.g., IBM Cloud Functions can access at most 2 GB of local RAM), it is infeasible to load all training data into memory. Hence, MLLess assumes that the training dataset is stored in an object store, i.e., IBM COS, and partitioned into mini-batches of size B. To generate the mini-batches in the appropriate format (e.g., with feature normalization), MLLess leverages PyWren-IBM [7], a FaaS-based map-reduce framework. For instance, by chaining two map-reduce jobs, it is straightforward to normalize a dataset using min-max scaling: the first map-reduce job obtains the minimum and maximum values of each feature, and the second one does the actual scaling.
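As an illustration, the two jobs could be sketched as follows, with `map_stats`, `reduce_stats` and `map_scale` as hypothetical user-defined functions that PyWren-IBM would run over the dataset partitions:

```python
import numpy as np

# Job 1: per-partition min/max (map), then global min/max (reduce).
def map_stats(partition):              # partition: (rows, features) array
    return partition.min(axis=0), partition.max(axis=0)

def reduce_stats(stats):
    mins, maxs = zip(*stats)
    return np.min(mins, axis=0), np.max(maxs, axis=0)

# Job 2: the actual min-max scaling, mapped over the same partitions.
def map_scale(partition, fmin, fmax):
    return (partition - fmin) / np.maximum(fmax - fmin, 1e-12)
```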
Job execution. A typical execution of a training job with MLLess involves the following three steps: ❶ Once up and running, each worker creates a local copy of the model with the aid of the MLLess library, and starts to optimize the loss function f; ❷ In each iteration, each worker separately fetches a mini-batch from IBM COS, and then computes a local update from its model replica before synchronization takes place. The type of local update depends on the ML algorithm; in the case of Stochastic Gradient Descent (SGD) [24], local gradients are averaged to obtain a global gradient update; ❸ Due to the lack of direct communication, each worker independently pulls all the local updates from external storage (Redis), and aggregates them to update its local model copy. The availability of a local update is announced to the rest of the workers through the signaling channel.
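A condensed sketch of steps ❷-❸ follows; the Redis calls are the real redis-py API, but the key names, the hypothetical helpers (`fetch_minibatch`, `local_gradient`), and the assumption that the learning-rate scaling is folded into each local update are ours:

```python
import pickle

import redis

r = redis.Redis(host="redis.internal", port=6379)  # hypothetical host

def apply_update(model, u, n_workers):
    """Fold a peer's local gradient into the local replica (SGD-style)."""
    return model - u / n_workers

def run_iteration(t, worker_id, n_workers, model, fetch_minibatch, local_gradient):
    # Step 2: fetch a mini-batch from IBM COS and compute a local update.
    batch = fetch_minibatch(worker_id, t)
    update = local_gradient(model, batch)
    r.set(f"update:{t}:{worker_id}", pickle.dumps(update))
    # ...availability is announced on the RabbitMQ signaling channel (elided)...
    # Step 3: pull every worker's update from Redis and aggregate it locally.
    for w in range(n_workers):
        u = pickle.loads(r.get(f"update:{t}:{w}"))  # waiting handled via signaling
        model = apply_update(model, u, n_workers)
    return model
```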
It is worth noting here that this decentralized design is easy to scale out, since no single component is responsible for merging all the local updates, as occurs in LambdaML [12]. Indeed, to scale to large input data sizes or numbers of workers, it suffices to add more Redis instances and shard the local, ephemeral updates from the workers across them, so that the load is evenly distributed among all Redis shards. Further, the separation of control and data flows makes it easy to support different consistency models, from strict models such as BSP to relaxed ones such as SSP, with little or no change to the iterative optimizers.
Weak scaling. MLLess's parallelism strategy keeps the mini-batch size B constant when the number of workers is reduced by the action of our scale-in auto-tuner (see §4.2). The reason is to avoid that every change in the number of workers incurs costly data-repartitioning transfers to adjust the mini-batch size, which could lower the net benefit of worker downscaling. Nonetheless, this entails that the global batch size B_g = PB decreases linearly with the number of workers P, which may affect the convergence speed of the optimizer [25]. For example, with B = 512 samples, scaling in from P = 24 to P = 12 workers halves B_g from 12,288 to 6,144 samples. To prevent significant deviation, the auto-tuner only removes a worker if the degradation in loss reduction does not exceed a certain threshold (see §4.2 for details).

Optimizations
FaaS is typically more expensive in terms of $ per CPU cycle than "serverful" computing. This means that, a priori, a user optimizing for cost would likely prefer IaaS over FaaS. Fortunately, FaaS-based training runtimes still show a large margin of improvement that can lead to more cost-effective training, particularly for models that converge fast. Here we describe two optimizations that confirm this intuition: in §4.1, we elaborate on an optimization to improve throughput, and in §4.2 we discuss the details of the scale-in auto-tuner.

Significance Filter
As cloud providers disallow direct communication between functions, fast aggregation of gradients cannot be performed with optimal primitives such as ring all-reduce [21], and must be done through external storage. Although MLLess uses a low-latency key-value store (Redis) for this purpose, the exchange of updates is, as expected, a high-cost operation, which can significantly diminish the benefits of parallelism. This is particularly visible for the Bulk Synchronous Parallel (BSP) model of computation, where no worker can proceed to the next step until all workers have finished the current one.
To reduce the strain on external storage, MLLess comes with a variant of the Approximate Synchronous Parallel (ASP) model [26], which we name "Insignificance-bounded Synchronous Parallel" (ISP) to distinguish it from the original consistency model. In short, ASP was originally proposed to break the communication bottleneck over WANs in geo-distributed ML systems. The central idea of ASP was to eliminate insignificant communication across data centers while ensuring the correctness of ML algorithms. In this sense, ISP borrows from ASP the idea of filtering non-significant updates, but applies it to accelerate the broadcast of local gradients between workers within the same data center or cluster. Since ISP operates at the cluster level, its implementation is much simpler than ASP's: it does not need complex synchronization mechanisms between data centers, such as ASP's selective barrier and mirror mechanisms [26], which facilitates its adoption in current serverless architectures. Further, ASP was originally implemented using the parameter server model [22], not for fully decentralized training systems such as MLLess.
Overview. In a nutshell, ISP can be viewed as a technique to reduce the per-step communication complexity while preserving the convergence rate, which results in an improvement in system throughput, and hence in job training times. More concretely, its goal is to reduce the size of the local update to be transferred to the rest of the workers after the local worker goes through its mini-batch. This is possible because ISP benefits from the robustness of many ML algorithms (e.g., logistic regression, matrix factorization, collapsed Gibbs sampling, etc.), which tolerate a bounded amount of inconsistency. To ensure equal algorithmic progress per step, ISP enables users to tune the strictness of the significance filter to achieve the sweet spot. Typically, the strictness is controlled by a threshold v, which is reduced over time: if the initial threshold is v, then the threshold value $v_t$ at step t of the ML algorithm is given by $v_t = v/\sqrt{t}$. Very importantly, ISP is synchronous in nature; that is, all workers must finish the current step before proceeding to the next iteration. Therefore, ISP differs from bounded asynchronous consistency models such as SSP [17], which sets a fixed upper bound on the iteration gap between the fastest worker and the slowest one. The focus of ISP is thus on reducing communication requirements rather than on alleviating system heterogeneity as SSP does. Observe that a smaller communication complexity also means a lower computation complexity, since fewer model parameters must be updated per iteration.
It is worth noting here that the original definition of the ASP model [26] does not presume a specific significance function, as clearly reflected in its proof of convergence, which only provides a general analysis of ASP. However, to yield more robust evidence of ISP's validity, we "incarnate" the ISP model with a concrete significance filter, and prove its convergence exclusively for this function.
Significance function. To trim communication while bounding deviation between any two model replicas, a "clever" compression technique is to have each worker aggregate its local updates while they are non-significant. In this way, if the accumulated update eventually becomes significant, the worker will be able to broadcast the complete history of its non-significant updates encoded as a single update, thereby minimizing both the communication burden and deviation from the "true" mini-batch gradient.
More formally, let $x_t \in \mathbb{R}^n$ be the parameters of the model at step $t$, and $u_t$ the associated update such that $x_t = x_{t-1} + u_t$. As the update operation is associative and commutative, we simply aggregate the non-significant updates for any model parameter by summing them up. Eventually, the per-parameter accumulated update may become significant and be pushed to the rest of the workers. Let $t_p^i$ be the last propagation time for the $i$-th parameter. Then, we define the per-parameter significance filter as: propagate the accumulated update of the $i$-th parameter at step $t$ whenever

$$\left|\frac{\sum_{\tau = t_p^i + 1}^{t} u_\tau^i}{x_t^i}\right| > v_t.$$

Note that with the above significance filter, the compression factor becomes proportional to the number of accumulated updates, i.e., $m_t := t - t_p^i$. This number can be arbitrarily big, provided that the magnitude of the accumulated update relative to the current model parameter value is less than $v_t$. For this reason, it is key to show that ISP is able to maintain an approximately-correct copy of the global model in each worker. We formulate this in Theorem 1.

Figure 2: Sample execution of the scale-in scheduler. In the low-convergence zone, workers are increasingly eliminated from the pool.

Theorem 1. Assume the loss functions $f_t$ are $L$-Lipschitz and the distance between any two model replicas remains bounded. Then, under suitable conditions, the regret $R[X]$ grows sublinearly in the number of steps $T$, and thus $\lim_{T \to \infty} R[X]/T = 0$.

We provide the details of the proof of Theorem 1 and the notations in Appendix A.
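A minimal per-parameter sketch of this filter, assuming the relative-magnitude test given above and the decaying threshold $v_t = v/\sqrt{t}$ (all names are ours, not MLLess's API):

```python
import numpy as np

class SignificanceFilter:
    """Accumulate per-parameter updates; release only significant ones."""

    def __init__(self, n_params, v):
        self.v = v                     # initial significance threshold
        self.acc = np.zeros(n_params)  # updates accumulated since last push

    def filter(self, x, update, t):
        """x: current model, update: this step's raw update, t >= 1.

        Returns the update to broadcast: the full history of accumulated
        updates for significant parameters, zeros elsewhere.
        """
        v_t = self.v / np.sqrt(t)      # threshold shrinks over time
        self.acc += update
        significant = np.abs(self.acc) > v_t * np.abs(x)  # relative test
        out = np.where(significant, self.acc, 0.0)
        self.acc[significant] = 0.0    # history flushed once broadcast
        return out
```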

Scale-in Scheduler
Compared with cluster computing, one major advantage of the FaaS model is that it enables the rapid adjustment of the number of workers over time. For instance, the removal of a worker in the middle of a training job does not leave cluster resources unallocated, nor does it demand a prompt re-allocation of them to other concurrent jobs, as in the case of reserved VMs [27,28]. This ability opens the door to novel schedulers that, for example, minimize monetary cost by dynamically adjusting the number of workers as the job progresses. This may result in more cost-effective training compared to traditional "serverful" cloud computing, which charges customers based on the time that the reserved VMs remain active.
To show that a better cost-efficiency ratio is possible with FaaS computing, MLLess includes a dynamic and fine-grained scheduler designed to remove "unneeded" workers. ML training is typically an iterative process where the level of quality improvement decreases as the number of training steps increases [27]. For instance, SGD reduces loss approximately as a geometric series on convex problems [31]. This implies that, while a higher number of workers is desirable during the first training steps to steeply diminish the loss, a large worker pool yields only marginal returns once loss reduction slows down, which ends up worsening the cost-efficiency ratio. In this sense, the primary objective of MLLess's scale-in scheduler is to increasingly cut down the training cost as the job progresses in order to maintain cost-effectiveness. Fig. 2 depicts an example of how workers are increasingly removed from the system when loss reduction stagnates as a result of the action of the scale-in scheduler.
Algorithm. From an initial number of workers P, the scale-in scheduler dynamically reduces the worker pool based on the feedback of the ML algorithm, which includes not only the loss values but also the speed of the training steps.

Figure 3: Training PMF [29] on MovieLens-1M data [30].
Using the loss information, the scheduler first detects the "knee" in the convergence rate, after which loss reduction slows down significantly, and uses the history of loss values up to this time to fit the reference training loss curve $L_P(t)$. This curve will be used by the scheduler to quantify the deviation from the original convergence rate introduced by a future removal of a worker. Further, the scheduler estimates the reference step duration $d_P$ by averaging the duration of all training steps up to this time. After estimating these quantities, the scheduler removes the worker with the lowest-quality replica of the model from the pool, and waits for the next scheduling interval. Now let $1 < p \le P-1$ denote the current number of workers. Then, the scheduler repeats the following sequence of operations upon each scheduling interval:

1. Estimation phase. It fits a new training loss curve $\ell_p(t)$, but this time using only the loss values collected since the last worker removal. The key reason is that the removal of a worker may affect convergence due to weak scaling [25], so a new fitting is required to capture the potential deviation from the reference curve. It also estimates the current step duration $d_p$ by the same procedure as above. Computing this estimate is necessary because $d_p < d_P$: the per-step communication overhead is $O(p)$, where $O$ hides the dependence on the model size. This is easy to see in Fig. 3a, where a matrix factorization model is trained with a varying number of workers. The figure shows how training speed decreases linearly with the number of workers. As we fix the local mini-batch size to avoid repartitioning data, fewer workers imply less data to pull from external storage per iteration, and hence a smaller communication overhead.

2. Decision phase. In this phase, the scheduler decides whether to remove another worker based on the relative error in the projected loss reduction over the time horizon $\Delta$:

$$s_\Delta(t) := \frac{\ell_p(t + \Delta/d_p) - L_P(t + \Delta/d_P)}{L_P(t + \Delta/d_P)}. \quad (1)$$

The scaling-down condition is then simply whether this term falls below a certain threshold $\varepsilon$, i.e., $s_\Delta(t) \le \varepsilon$. Intuitively, this term tells how much the convergence rate of the ML algorithm may worsen with $p$ workers compared to the original $P$-worker configuration in the region of slow convergence. Note that the value of $s_\Delta(t)$ can be negative, which means that system throughput is indeed better as a result of removing workers. This can happen if the decrease in communication cost outweighs the loss of parallelism, for instance.
Finally, we want to point out that although the parameter ∆ can take arbitrary values, it has been designed to anticipate the behavior of the system before a new scheduling interval arrives. Presume a fixed scheduling epoch of duration T. Since a new scheduling decision can be made after time T, the idea is to choose ∆ ≤ T to ascertain whether the removal of a worker is beneficial within a short time horizon T. In general, the value of ∆ will vary depending upon the specific ML job: while iterations may last 10-100 ms in some ML jobs, they may take a few seconds to complete in others. Irrespective of the ML algorithm, performing scheduling on short intervals could be disproportionately expensive due to the scheduling overhead, which involves function fitting in our case.
Loss deviation. To predict how far a shrinking worker pool may deviate from the initial convergence rate, as defined in Eq. (1), the scheduler performs online fitting of two types of learning curves: the reference curve $L_P(t)$, and the family of curves $\ell_p(t)$, $1 < p \le P-1$, drawn as the number of workers decreases over time. To improve prediction accuracy, each type of curve has a different shape, for the following reason: while $L_P(t)$ is built on the loss values from the region of fast convergence, the curves $\ell_p(t)$ are much flatter, as they correspond to the region where loss reduction slows down and stabilizes; assuming an appropriate curve for each region thus makes prediction more fine-grained.
We observe that most ML jobs use first-order algorithms such as mini-batch SGD (we assume the loss function f is convex and differentiable, and that ∇f is Lipschitz continuous), which exhibit a convergence rate of $O(1/\sqrt{Bt} + 1/t)$ [32], where B denotes the mini-batch size. Consequently, we model the reference curve $L_P(t)$ as a non-negative linear combination, with coefficients $\theta_0$, $\theta_1$, $\theta_2$ and $\theta_3$, of basis functions derived from this rate. An example of online curve fitting when training a PMF model is depicted in Fig. 3b. For the slow-convergence curves $\ell_p(t)$, we adopt the curve model of [27], whose coefficients $\theta_0$, $\theta_1$, $\theta_2$ and $\theta_3$ are also non-negative. We utilize a non-negative least squares solver [33] to fit the points of all the curves. Before curve fitting, the loss values are always passed through an exponentially weighted moving average (EWMA) filter to remove outliers.
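The sketch below illustrates the fitting step with SciPy's non-negative least squares; the concrete basis functions and the EWMA smoothing factor are illustrative assumptions, not the exact models used by MLLess:

```python
import numpy as np
from scipy.optimize import nnls

def ewma(losses, alpha=0.3):
    """Exponentially weighted moving average to damp outlier loss values."""
    smoothed, prev = [], losses[0]
    for loss in losses:
        prev = alpha * loss + (1 - alpha) * prev
        smoothed.append(prev)
    return np.array(smoothed)

def fit_reference_curve(steps, losses, B):
    """Fit L_P(t) by non-negative least squares [33] over a basis suggested
    by the O(1/sqrt(B t) + 1/t) rate; the basis itself is an assumption."""
    t = np.asarray(steps, dtype=float)
    basis = np.column_stack([np.ones_like(t), 1.0 / np.sqrt(B * t), 1.0 / t])
    theta, _ = nnls(basis, ewma(losses))   # non-negative coefficients
    return lambda s: theta @ np.array([1.0, 1.0 / np.sqrt(B * s), 1.0 / s])
```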
Retaking the MF training example, Fig. 3c gives the error when estimating the loss values for an increasing number of steps ahead of the "knee". Here the prediction error is the difference between the actual and estimated loss values, divided by the actual one. As shown in Fig. 3c, both the reference curve $L_P(t)$ and the slow-convergence model $\ell_p(t)$ achieve a prediction error below 1.5%, even when predicting up to 200 steps in advance. Finally, Fig. 3d shows how the estimation improves as more and more data points are collected for fitting the curve $\ell_p(t)$, irrespective of how many steps are predicted in advance.

Figure 4: Speedup of two threads relative to single-thread performance within a function as the memory size is varied.
Automatic "knee" detection. To favor convergence, the scale-in scheduler never eliminates a worker before passing the "knee". The reason is to maximize the time that the ML algorithm stays within the region of fast convergence, only scaling down the number of workers once the learning curve starts to flatten out. There are several methods out there to automatically identify "knee" points from discrete data (e.g., [34] and Kneedle [35]), which can be plugged into MLLess without further adaptations. For all ML jobs considered in this work, though, a simple threshold-based heuristic on the first derivative of the learning curve, i.e., the slope of the tangent line, worked well in all cases.
Eviction policy. By default, the scheduler eliminates the worker with the lowest-quality model replica from the pool. If the significance filter is enabled, i.e., $v_t > 0$, the leaving worker $p$ stores its local replica of the model $x_{t,p}$ to external storage before terminating itself. Subsequently, each active worker $p' \neq p$ downloads $x_{t,p}$ from external storage and averages it with its local model, i.e., $x_{t,p'} = \frac{1}{2}(x_{t,p} + x_{t,p'})$, to reintegrate the non-significant updates from the leaving worker into its local model. For $v_t = 0$, we note that the ISP model reduces to the BSP model (see Appendix A), and thus this additional one-shot synchronization is unneeded.

Implementation
We implement MLLess by extending PyWren-IBM [7], a Python-based serverless data analytics framework. Although PyWren-IBM allows users to execute user-defined functions (UDFs) as serverless workers, it is painfully slow for ML training [11]. So, to make MLLess competitive with "serverful" ML libraries, we reimplemented part of PyWren-IBM's runtime, the models, and the optimizers (SGD, SGD with momentum, ADAM, etc.), including sparse data structures, in Cython, using C-style static type declarations that allow compilation. ML frameworks such as PyTorch rely heavily on C++ and math libraries such as Intel MKL to speed up computations on the CPU; a pure Python implementation of MLLess would thus have degraded system throughput to a large extent.
Intra-function parallelism. A final important observation is the lack of thread-level parallelism in IBM Cloud Functions. For the maximum memory allocation of 2GB, we get the equivalent of one vCPU. This implies that we cannot exploit data parallelism within a worker as ML systems such as PyTorch do, e.g., through OpenMP. To corroborate this, we ran a small micro-benchmark: a Probabilistic Matrix Factorization (PMF) [29] model was trained running SGD on either one or two threads. We measured the per-step running time of the computations inside the workers and computed the speedup of two threads relative to single-threaded performance. The results are plotted in Fig. 4. As can be seen in the figure, PyTorch is able to extract some parallelism within a worker, but clearly not enough to exploit data parallelism. For workers with 1536 MiB of memory, we even found that the performance with 2 threads was worse than single-threaded performance due to a misallocation of resources.

Evaluation
In this section, we perform a series of experiments to answer the following main questions:
• What is the individual contribution of each optimization to cost-efficiency? To this end, we perform a number of micro-benchmarks.
• Is it possible to achieve better cost-efficiency with an optimized FaaS platform than with a VM-based, i.e., "serverful", substrate (IaaS) for distributed ML training? To this end, we run several ML training jobs of different flavors, including both dense and sparse ML models. We use PyTorch [1], a specialized "serverful" ML library, but also a non-specialized, serverless data-analytics system, to determine what happens when FaaS is not specialized for model training.
• Is the ISP consistency model more effective than other bounded-staleness models such as SSP? The goal is to infer what type of synchronization strategy is more appropriate for the indirect communication model of FaaS cloud platforms.
To conclude, we also evaluate the scalability of MLLess with respect to the exchange of intermediate training state, which is the main system bottleneck due to the impossibility of function-to-function communication. Note that the scalability of object storage for storing training datasets has already been assessed in other works [11]. Consistent with these works, we have observed no bottleneck in the download of mini-batches from IBM COS.

Methodology
Competing systems. Concretely, we compare MLLess with the following implementations:
• PyTorch [1]. We use PyTorch as the representative of a specialized, IaaS-based ML library, running on reserved VMs.
• PyWren-IBM [7]. We use PyWren-IBM as the non-specialized serverless ML representative. PyWren-IBM has been optimized to run on IBM Cloud Functions. Since it is a MapReduce framework, we leverage the map phase to process mini-batches in parallel and reduce tasks to aggregate the local updates. All communication is done through IBM COS, including the sharing of updates, to keep its pure serverless, general-purpose architecture.
Datasets. We utilize three datasets in our evaluation. First, we use the Criteo display ads dataset [38], which contains 47M samples and is 11GB in size. Each sample consists of 13 numerical and 26 categorical features. Before training, we normalize the dataset. In particular, we manipulate this dataset in two forms: on one hand, we only use the 13 numerical features to produce a dense dataset; on the other hand, we hash all the categorical dimensions to a sparse vector of size $10^5$ (the "hashing trick"), along with the 13 numerical features, to produce a sparse dataset (a sketch follows below). In this way, we can evaluate the impact of sparsity on the cost-efficiency of FaaS over IaaS as another evaluation dimension. Also, we use the MovieLens-10M and MovieLens-20M datasets [30]. The former consists of 10M movie reviews from $N_u$ = 10,681 users on $N_m$ = 71,567 movies. The latter bears around 20M reviews from $N_u$ = 27,278 users on $N_m$ = 138,493 movies. Notice that all the datasets are highly sparse, which lets us verify MLLess's support for sparse data.

ML models. As shown in Table 1, we train different models on different datasets, i.e., Criteo for logistic regression (LR), and MovieLens-10M/20M for probabilistic matrix factorization (PMF) [29]. Concretely, for PMF, we factorize the partially filled matrix of review ratings R of size $N_u \times N_m$ into two latent matrices, U of size $N_u \times r$ and M of size $N_m \times r$, such that $R \approx UM^\top$.
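As an illustration, a minimal sketch of the hashing trick mentioned above; Python's built-in hash stands in for a stable hash (e.g., MurmurHash), which would be needed for consistency across workers:

```python
import numpy as np

D = 10**5  # size of the hashed sparse feature vector

def hash_sample(categorical, numerical):
    """Map the 26 categorical fields of a Criteo sample into indices of a
    sparse binary vector of size D, keeping the 13 numerical features as-is.
    Field index and value are hashed together so identical values in
    different fields land on different coordinates."""
    indices = sorted({hash(f"{field}={value}") % D
                      for field, value in enumerate(categorical)})
    return np.array(indices), np.array(numerical, dtype=np.float64)
```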
Setup. The VM instances used in the experiments are deployed on the IBM Cloud. Unless otherwise noted, when running MLLess, we use two VM instances: a C1.4x4 instance (4 vCPUs, 4GB of RAM) to host the messaging service, and a single M1.2x16 instance (2 vCPUs, 16GB of RAM) to deploy Redis, in addition to the chosen number of FaaS workers. To use as many workers for PyTorch as for MLLess, the PyTorch cluster consists of 3 or 6 B1.4x8 instances (4 vCPUs, 8GB of RAM). All instances have a 1Gbps NIC. As MLLess workers, we use the largest-sized functions of 2GB of memory. All VMs and MLLess workers are deployed in the same region (us-east). With four workers per B1.4x8 VM, a PyTorch worker costs 0.2 $/hour ÷ 4 = 0.05 $/hour, which is more than two times cheaper than a serverless worker (0.122 $/hour). We finally observe that the use of VMs confers some extra advantage to PyTorch, as the exchange of intermediate training state across all processes (AllReduce) can leverage the fact that some PyTorch workers are physically located on the same machine.
Sanity check. Before conducting any experiment, we first performed a sanity check to make sure that all the models were identical across systems. To this end, we fixed a random seed and trained all models in each system using a single worker. We then verified that the convergence rate at each step was exactly the same in all systems. This guarantees that no system gains a technical advantage over another due to subtle model artifacts such as ℓ1- and ℓ2-regularization, etc.

Micro-benchmarks
To better understand the individual contribution of each optimization to cost-efficiency, we run a number of microbenchmarks.

Significance Filter
We first evaluate the effectiveness of ISP at improving system throughput as the significance threshold v increases, i.e., becomes more strict, thereby filtering out more aggressively those updates that produce small changes to the model. As a metric, we use the execution time until algorithm convergence. For LR, we fix a Binary Cross-Entropy (BCE) loss threshold of 0.58, and stop training when the threshold is reached. For PMF, we set a Root Mean Squared Error (RMSE) loss threshold of 0.82. Because of the "pay-as-you-go" model of cloud functions, the key point to note here is that by decreasing the execution time, ISP cuts cost forthwith. We use the BSP synchronization model.
The results are plotted in Fig. 5. When training PMF on both MovieLens datasets, ISP is able to improve system throughput significantly with no side effects on convergence. For ML-20M, the speedup reaches 3X.

Figure 5(a): LR, Criteo dense.

Scale-in Auto-tuner
We assess in isolation the effect of dynamically scaling down the number of workers. To draw an unbiased picture of its performance, it is insufficient to only look at the cost profile: a bad adjustment policy could trade off convergence speed for cost, e.g., by aggressively evicting workers from the pool. Ideally, both metrics should decrease in tandem.
To capture the effect of the auto-tuner in a single metric, we use Perf/$, defined as:

$$\text{Perf}/\$ := \frac{1}{\text{Exec. time (s)}} \times \frac{1}{\text{Price (\$)}},$$

so that any improvement in latency, cost, or both, caused by the auto-tuner is reflected in this composite metric. For instance, a job that converges in 200 s at a cost of $0.10 yields Perf/$ = 1/(200 × 0.10) = 0.05. We also use raw execution time as a secondary metric, to detach the $-cost normalization effect. For Perf/$, higher is better. As before, we run all the ML algorithms until convergence, defined as a threshold on the observed loss. Concrete threshold values are given in the caption of Fig. 6 itself. For the scale-in auto-tuner, we set the scheduling interval to 20s and fix the parameter ∆ at half of the scheduling epoch, that is, ∆ = 10s.
The results are illustrated in Fig. 6. For sparse LR, the results of the auto-tuner are excellent: it improves Perf/$ by 1.4X-1.5X, while also reducing the running time slightly, by up to 10%. For dense LR, the auto-tuner in isolation is only capable of slightly increasing Perf/$ for 12 workers, leading to an improvement of 1.1X over the baseline. For P = 24 workers, Perf/$ worsened a little due to a 9% underestimation of the original convergence rate caused by an imprecise fitting of the reference curve L_P(t). We leave the development of a more precise estimation method for the reference curve, to prevent any degradation of Perf/$, for future work.
Interestingly, the fact that the execution time increases with more workers in the LR use case is attributable to a loss of statistical efficiency [39] due to weak scaling, rather than to poor scalability of MLLess. To corroborate this claim, we repeated the same experiment, but now adjusting the mini-batch size B as we varied the number of workers to keep the global batch size B_g constant at all times. We got comparable results, as listed in Table 3, which shows that the convergence rate was equivalent in all worker configurations for a constant B_g. By adapting the mini-batch size B, model replicas synchronized more frequently as the number of workers grew, thus preserving statistical efficiency. For PMF, the results were also positive: for all settings, the auto-tuner improved Perf/$. For the ML-20M dataset, it even led to a 1.6X gain, since it also delivered a significant improvement in speed. The small degradation of around 7.1% in execution time for the ML-10M dataset was due to an overly aggressive purge of workers too early by the auto-tuner, which can be solved by adjusting the "knee" finder (see §4.2).
As a main insight, we see that for users who must curtail costs, a competent exploitation of the FaaS "pay-as-you-go" model such as ours can be of great help in managing their budgets.

Cost-Efficiency
In this section, we explore the cost-efficiency of MLLess, while seeking to answer the nagging question of whether FaaS can outperform a VM-based, IaaS infrastructure for distributed ML training.

Performance comparison.
To assess the benefits of a specialized system for serverless ML training, we compare MLLess against PyTorch [1] and PyWren-IBM [7]. We use PyTorch as a representative of an IaaS-based ML library. We adopt PyWren-IBM to verify that a vanilla, non-specialized design of MLLess would have been dramatically inefficient.
For this experiment, we execute three variants of MLLess: the baseline version using the BSP synchronization model, labeled 'MLLess' in the figures; a second variant with ISP replacing BSP, termed 'MLLess + ISP'; and a third one with both optimizations at once, labeled 'MLLess + All'. For ISP, we set the significance threshold v = 0.7. For the auto-tuner, we set the scheduling epoch to 20s with ∆ = 10s. For all the systems, we only report the results for P = 24 workers; the trends were similar for 12 workers.
Results. The results are shown in Fig. 7. The first observation to be made is that PyWren-IBM is very inefficient in all jobs. This is mostly due to two facts: first, local updates are communicated across workers through slow storage only, i.e., IBM COS; second, PyWren-IBM is not specialized for iterative ML training.
The second observation is that MLLess converges significantly faster than PyTorch. To give a sense of the performance gap, let us focus on the PMF+ML-10M application. To achieve a loss value of 0.9, MLLess needs 23 seconds, while PyTorch gets to this loss only after 90 seconds. This gap increases over time: to converge to a "prudent" RMSE loss of 0.738, PyTorch spends 2,029 seconds, whereas MLLess reaches this loss value after 140 seconds. This yields a speedup of 14.49X.
For the PMF+ML-20M application, we get similar results. To converge to a loss of 0.821, PyTorch spends 1,800 seconds; MLLess achieves this loss within 115 seconds, 15.65X faster than PyTorch. Through thorough analysis, we found that PyTorch's speed is affected by the high sparsity of the datasets, as also occurs with TensorFlow [40]. Unlike PyTorch, MLLess employs Cython to operate directly on sparse data and sparse gradients, and hence saves significant time on serializing and deserializing data. In this way, MLLess achieves faster convergence. Either way, the gap between plain MLLess and the optimized variants is significant for PMF, which demonstrates that an optimized treatment of sparsity on its own cannot realize such savings. To wit, plain MLLess spends 334 seconds to reduce the RMSE to 0.821, 3X slower than with all the optimizations present.
The LR+Criteo dense job produced another interesting result, which further buttresses the idea that optimizations tailored to the FaaS environment are crucial to be competitive against IaaS-based ML training. Unlike in all the other experiments, PyTorch is able to outperform plain MLLess in this case. However, when the MLLess optimizations are enabled, MLLess overtakes PyTorch in the middle of the execution and is able to converge to a lower BCE level.
As a final observation, it is worth noting that the auto-tuner does not slow down convergence in any job, as shown by the 'MLLess + All' curves. On the contrary, it helps to improve convergence speed in addition to decreasing cost. Also, the use of ISP consistency for large models such as ML-20M has been vital to ensure fast convergence during the first few seconds.
Main insight. As a key conclusion, we find that FaaS can be more performant than IaaS under the same conditions (i.e., number of workers and memory per worker) for, at least, fast-converging models, provided ML training is specialized to FaaS architectures.

Cost comparison.
As shown above in §6.3.1, FaaS can outperform IaaS-based training. However, the price per time unit of a serverless worker is typically higher than that of an IaaS-based worker. As given in Table 2, a PyTorch worker costs 0.2 $/hour ÷ 4 = 0.05 $/hour, which is more than two times cheaper than a serverless worker: 0.122 $/hour. Therefore, a better cost-efficiency for FaaS-based ML training is a priori more difficult, but plausible, mostly because of the possibility to dynamically adjust the number of workers, among other abilities.
Following the same path traced above, here we compare MLLess against PyTorch and PyWren-IBM in terms of cost. We extract the cost of each system from the executions in the prior evaluation to ease cross comparison.
General results. As a headline observation, MLLess is cheaper than PyTorch in all applications, but the improvement gap is not as big as in the performance dimension. For example, when training PMF on ML-20M, MLLess spends $0.0948 to reach a loss of 0.82, compared to the $0.6 invested by PyTorch. This leads to a 6.32X saving in cost. Likewise, PyTorch spends $0.667 to achieve a loss value of 0.738 for the PMF+ML-10M job, while MLLess cuts this cost to $0.1348, 4.94X cheaper than PyTorch.
Fixed-budget cost results. While MLLess saves money, for some users the "pay-as-you-go" model conflicts with the way they manage their budgets; for instance, budgets may be fixed in advance. Therefore, it is interesting to examine what the performance of MLLess would be under a fixed budget. To answer this question, Fig. 8 illustrates to what extent each system is able to converge under a fixed budget in dollars. The numbers above the bars report the maximum execution time affordable with each possible budget.
As can be seen in the figure, MLLess + All provides the best cost-performance trade-off in all applications, even for the tiny budget of 9 cents. Unsurprisingly, PyTorch is able to run longer than the rest of the systems due to the lower pricing of the rented VM instances. For the largest budget, it even doubles the maximum execution time affordable by MLLess. By contrast, MLLess is significantly more efficient per time unit and better adjusts to the cost plan. We note that the auto-tuner helps to gain some extra seconds, up to 115 seconds, as shown by the MLLess + All-labeled bars. This is another piece of experimental evidence of the economic utility of our scale-in auto-tuner. The minimal exception to the above rule of thumb is LR+Criteo dense (Fig. 8a). For this task, PyTorch is able to deliver the best performance for the 9¢ and 18¢ budgets, mostly because of its optimal, ring-based all-reduce primitive for dense data. Fortunately, by leveraging the combined effect of the scale-in auto-tuner and ISP, MLLess manages to incrementally improve the cost-efficiency ratio, achieving a lower BCE value for 36¢.

Main insight.
As a main takeaway, FaaS can be more cost-efficient than IaaS for fast-converging models if FaaS-based model training is crafted for the serverless environment.

SSP vs. ISP for FaaS-based ML training
Due to the need for indirect communication in FaaS-based ML training, another interesting question is to ascertain whether ISP is better suited to serverless model training than other popular yet loose consistency models such as SSP [17]. To this end, we integrated SSP into MLLess and compared it with ISP, as well as with the baseline BSP-based version. To carry out this comparison, we experimented with PMF on the ML-20M dataset for an increasing number of workers P. In order not to compromise statistical efficiency due to weak scaling (see §6.2.2 and Table 3 for further details), we fixed the global batch size B_g and adjusted the mini-batch size B accordingly. Concretely, we set B = 12K for 12 workers, B = 6K for 24 workers, and B = 3K for 48 workers. In this way, we made sure that the effect of stale updates came out neatly for each synchronization model. For SSP, we set a slack of s = 3 iterations.
Results. The results are depicted in Fig. 9. As expected, SSP shows an average speedup of 1.1X over the default BSP implementation for 12 and 24 workers. For 48 workers, however, SSP performs worse than the synchronous BSP model due to the lack of intra-function parallelism. More technically, SSP is agnostic to the computation capacity of workers; it merely ensures that the number of iterations between the fastest and the slowest workers does not exceed the staleness bound s. In a distributed setting such as that of MLLess, where each worker is responsible for aggregating the updates from the rest, i.e., there are no global parameters, the lack of intra-function parallelism means there is no way for the slowest workers to hide the latency of downloading and applying the missing updates. A solution to this problem would be to use a serverless backend such as Crucial [6] to perform storage-side aggregation of gradients.
The most relevant finding of this experiment is, however, that ISP outperforms SSP in all cases, yielding speedups of 1.9X and 1.4X for 12 and 24 workers, respectively. The reason why ISP is far better than SSP is that ISP permits any worker to delay the synchronization of a parameter indefinitely as long as its aggregated update is non-significant. Under SSP, however, update synchronization is delayed by at most s iterations for the fastest workers; sooner or later, all the committed updates from the workers are added to the model replicas, thus not reducing the communication overhead at all. Put another way, the loose synchronization property of SSP is not enough to outweigh the reduction in network traffic achieved by ISP, the latter being more effective at yielding fast convergence for FaaS-based ML training due to the need for indirect communication.
Main insight. For FaaS-based model training, where the exchange of intermediate updates is the main limiting factor, parameter staleness is not of practical utility unless it saves network traffic.

Scalability
Finally, scalability is a critical property of any ML training system, and MLLess is no exception. Cloud object storage as a means to hold mini-batches has proven to provide good scalability [11], and we empirically found that the signaling channel was able to support thousands of messages per second, enough to scale to hundreds of concurrent workers. Certainly, among all the MLLess components, we observed that the indirect communication channel built to exchange updates is the one under the greatest strain. Fortunately, this channel can be easily "scaled out" by adding more Redis instances and "sharding" intermediate updates over the pool of servers using the worker IDs (a sketch follows below). To verify this claim, we ran multiple training jobs with the ML-20M dataset for an increasing number of workers. For each worker pool size, we trained the model with 1 and 2 Redis instances, stopping at an RMSE value of 0.77 in all settings. To preserve statistical efficiency, i.e., a similar convergence progress per second, we adjusted the batch size as in the prior test.
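A minimal sketch of this sharding scheme (hostnames and key layout are illustrative):

```python
import redis

# One client per Redis instance in the pool; endpoints are hypothetical.
SHARDS = [redis.Redis(host="redis-0.internal"),
          redis.Redis(host="redis-1.internal")]

def shard_for(worker_id: int) -> redis.Redis:
    """Statically assign each worker's intermediate updates to one shard,
    so the update-exchange load spreads evenly over the pool."""
    return SHARDS[worker_id % len(SHARDS)]

def publish_update(worker_id: int, step: int, payload: bytes) -> None:
    shard_for(worker_id).set(f"update:{step}:{worker_id}", payload)
```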
Results. The scalability results are illustrated in Fig. 10a. For ease of comparison, we normalized the execution times, choosing the configuration of 24 workers with 1 Redis server as the baseline. Unsurprisingly, doubling the number of Redis instances has almost no effect for a small number of workers. Nonetheless, its effect becomes more apparent as the number of workers increases and a single server cannot keep up with the high rate of updates. With two Redis servers, MLLess is able to deliver a speedup of 1.4X for 64 workers, despite a super-linear increase in the worker-to-Redis-server ratio with respect to the baseline setup, i.e., (64 workers / 2 Redis servers) ÷ (24 workers / 1 Redis server) = 1.33. Likewise, for 96 workers, the execution time is 10% better with 2 Redis servers, despite featuring 4X more workers than the baseline setup. This confirms that the addition of more servers enables MLLess to scale to a larger number of workers.
It is worth mentioning here that, as in IaaS-based ML training, scaling the training process in FaaS platforms is equally challenging. Simply put, users expect the training time to go down with the number of workers. However, even if the sharing of updates is not a bottleneck thanks to the addition of more Redis instances, the relation between the mini-batch size and the number of workers may hinder statistical efficiency. This is reflected in Fig. 10b, where the number of training steps until reaching the loss threshold is shown. We use the number of training steps, instead of time, as a metric because the total number of steps to convergence is independent of the size of the Redis cluster. It is a quality measure that depends on the global batch size and the number of workers, and it captures very well how frequently the workers synchronize in relation to the processed training data.
As seen in this figure, the number of iterations to convergence decreases up to 96 workers, the point beyond which adding more workers starts hurting convergence. At this point, adding more Redis servers can help reduce I/O time, and hence the training time. But eventually, the increase in the number of iterations caused by the use of more workers will limit scalability. For this reason, users try to compensate for this reduction in statistical efficiency by either increasing the learning rate [41] or adjusting the batch size adaptively [42].
Main insight. We conclude that MLLess is scalable, and that it can be easily "scaled out" through in-memory storage sharding. As in traditional VM-based systems, the ultimate scaling of the training process depends on the mini-batch size, the synchronization model, etc., irrespective of whether storage sharding can eliminate the bottlenecks.

Related Work
Serverless Data Processing. A large body of previous work has proposed high-level frameworks for running large-scale analytics on serverless functions. For example, PyWren [3], IBM-PyWren [7] and Lithops [8,9] are map-reduce frameworks running over FaaS executors that take advantage of object storage to store intermediate data. Lithops [8,9] is multi-cloud and also implements the native multiprocessing module available in Python to enable the transparent execution of multiprocessing applications over FaaS platforms. Further, gg [4] is a library that uses AWS Lambda for CPU-intensive jobs such as video encoding. Numpywren [5] is an elastic linear algebra library on top of a pure serverless architecture. Starling [43] proposes a serverless query execution engine. Serverless ML systems, including MLLess, build upon the lessons learned from these works to increase their performance and cost-efficiency.
Another important work is Crucial [6,44]. Crucial is a framework for building stateful FaaS-based multi-threaded applications, and as such, it includes fine-grained synchronization primitives such as semaphores and barriers. As part of its evaluation, Crucial was compared to Spark using two classical ML algorithms, K-means and logistic regression, showing on-par performance with Spark. Although Crucial is not cost-efficient per se, we believe that it would be a good option to implement a parameter server-like interface for serverless ML training.
Serverless ML. A number of works have been devoted to leveraging FaaS platforms for building ML systems. Since ML model inference is a representative use case of serverless computing [13,14], recent research efforts have been directed towards model training [11,10,12]. All these works use AWS Lambda, which confers them some advantage over MLLess and makes direct comparison problematic. First, AWS Lambda enables multi-threaded parallelism [45,11], while IBM Cloud Functions are limited to 1 vCPU at most. Also, AWS Lambda workers can access 10GB of local RAM, which allows them to hold larger data partitions and mini-batches than MLLess, which is restricted to 2GB of memory. Despite this, MLLess overcomes these limitations and manages to deliver speedups above 15X at 6.3X lower cost than PyTorch, and, very importantly, excluding the start-up time, which is longer in PyTorch (e.g., a cluster of 6 VMs takes > 1 min. to boot up).
In a nutshell, Cirrus [11] is a serverless ML system that implements a parameter server (PS) [22] on top of VMs, where all FaaS workers communicate with this centralized PS layer. Such a hybrid design has its merits, mainly because the ability of PS servers to do computation delivers 200% communication savings compared with indirect communication via external storage. According to [16], Cirrus is 3X-5X faster than VMs, but up to 7X more costly. Compared to MLLess, Cirrus is thus not cost-efficient, mostly because it does not exploit well the defining properties of the FaaS model such as "pay-per-usage", which allows saving money through the fine-grained, dynamic allocation of serverless workers.
Siren [10] presents an asynchronous ML framework where each worker runs independently: it reads a (stale) model from remote storage (e.g., AWS S3), updates it with a mini-batch of local data, and writes the new model back to storage. Its major strength is its scheduler, built upon reinforcement learning (RL), which adjusts the number of workers dynamically, subject to a certain budget. Compared to MLLess, its scheduler is more coarse-grained, as it adjusts the number of workers once per epoch, and it achieves a lower cost-efficiency ratio. Concretely, Siren reduces job execution time by up to 44.3% at the same cost as EC2 clusters.
Finally, [12] proposes LambdaML, a FaaS-based training system built to determine the cases where FaaS holds sway over IaaS. Interestingly, our results mirror their observation that FaaS is more cost-efficient for models that converge quickly. Unlike LambdaML, however, we reach the same conclusion with start-up times taken out of the equation: if the start-up time is excluded, LambdaML is slower than PyTorch, while MLLess outperforms PyTorch while being cheaper. In this sense, we believe that MLLess opens the door to the adoption of serverless ML training as a truly cost-efficient option in the cloud.

Conclusion and Future Work
We have examined the question of whether serverless ML training can be more cost-effective than traditional IaaS-based computing. To answer this question, we have developed MLLess, a prototype system for FaaS-based ML model training built on top of IBM Cloud Functions, and empowered it with two new optimizations: one aimed at reducing communication bandwidth, the other intended to exploit the essential qualities of the FaaS model to jointly decrease cost and execution time. Our results demonstrate that MLLess is faster than serverful ML libraries at a lower cost for ML models with fast convergence. We also validate the scalability of MLLess, and the benefits of loose synchronization models that allow smoothly trading off communication bandwidth against convergence time.
The cost-effectiveness of serverless ML training suggests a variety of potential future work. An interesting avenue of research would be to investigate the advantage of supporting ML-specific logic on the server side through serverless data stores (e.g., Crucial [6,44]). Another topic would be to adapt MLLess to Federated Learning (FL) environments. A commonplace practice in FL is to select a random subset of the available clients in each training step, which results in many clients staying idle for a long time. By extending MLLess to run the clients as functions on edge devices only when needed, it would be possible to improve cost-efficiency. A final research topic would be to examine the potential effects of lossy gradient compression techniques such as gradient quantization on serverless ML training.