PiPar: Pipeline Parallelism for Collaborative Machine Learning

Collaborative machine learning (CML) techniques, such as federated learning, have been proposed to train deep learning models across multiple mobile devices and a server. CML techniques preserve privacy since the model trained locally on each device, rather than the raw data on the device, is shared with the server. However, CML training is inefficient due to low resource utilization. We identify idling resources on the server and devices due to sequential computation and communication as the principal cause of low resource utilization. A novel framework, PiPar, that leverages pipeline parallelism for CML techniques is developed to substantially improve resource utilization. A new training pipeline is designed to parallelize the computations on different hardware resources and the communication on different bandwidth resources, thereby accelerating the training process in CML. A low-overhead automated parameter selection method is proposed to optimize the pipeline and maximize the utilization of available resources. The experimental results confirm the validity of the underlying approach of PiPar and highlight that, compared to federated learning: (i) the idle time of the server can be reduced by up to 64.1x, and (ii) the overall training time can be accelerated by up to 34.6x under varying network conditions for a collection of six small and large popular deep neural networks and four datasets, without sacrificing accuracy. It is also experimentally demonstrated that PiPar achieves performance benefits when incorporating differential privacy methods and operating in environments with heterogeneous devices and changing bandwidths.


I. INTRODUCTION
Deep learning has found application across a range of fields including computer vision [1,2], natural language processing [3,4] and speech recognition [5,6]. However, there are important data privacy and regulatory concerns in sending data generated on mobile devices to geographically distant cloud servers for training deep learning models. A new class of machine learning techniques has therefore been developed under the umbrella of collaborative machine learning (CML) to mitigate these concerns [7]. CML does not require data to be sent to a server for training deep learning models; rather, the server shares models with devices that are then trained locally on the device.
CML is used in many real-world use-cases comprising a central server and multiple homogeneous mobile devices. Smartphone manufacturers, for example, analyze user data to improve the performance of a specific smartphone model [8,9]. For instance, CML can be employed to analyze the battery usage patterns of individual users on their phones to offer personalized plans for optimizing battery life. Similarly, CML can be used to analyze users' typing habits and then automatically complete and correct their typing.
There are three notable CML techniques reported in the literature, namely federated learning (FL) [10,11,12,13], split learning (SL) [14,15] and split federated learning (SFL) [16,17]. However, these techniques under-utilize both compute and network resources, which results in training times that do not meet real-world requirements. The cause of resource under-utilization and the resulting performance inefficiency in the three CML techniques is considered next.
In FL, each device trains a local model of a deep neural network (DNN) using the data it generates. Local models are uploaded to the server and aggregated as a global model at a pre-defined frequency. However, the workload of the devices and the server is usually imbalanced [18,19,17]. This is because the server is only employed when the local models are aggregated and is idle for the rest of the time.
In SL, a DNN is usually decomposed into two parts, such that the initial layers of the DNN are deployed on a device and the remaining layers on the server. A device trains the partial model and sends the intermediate outputs to the server where the rest of the model is trained. The training of the model on devices occurs in a round-robin fashion. Hence, only one device or the server will utilize its resources while the rest of the devices or the server are idle [16,7].
In SFL, which is a hybrid of FL and SL, the DNN is split across the devices and the server. Unlike SL, however, the devices train the local models concurrently. Nevertheless, the server must wait while the devices train the model and transfer data, and vice versa.
Therefore, the following two challenges need to be addressed for improving resource utilization in CML: a) Sequential execution on devices and the server causes resource under-utilization: For FL, the server aggregates the models obtained from all devices after they complete training; for SL, after the training of the initial layers is completed on the devices, the remaining layers of the DNN are trained on the server. Since device-side and server-side computations in CML techniques occur in sequence, there are long idle times on both the devices and the server.
b) Communication between devices and the server results in resource under-utilization: Data transfer in CML techniques is time-consuming [20,21,22], during which no training occurs on either the server or the devices. This increases the overall training time.
Although low resource utilization of CML techniques makes training inefficient, there is currently limited research that is directed at addressing this problem. This paper aims to address the above challenges by developing a framework, PiPar (pronounced 'piper'), that leverages pipeline parallelism to improve the resource utilization of devices and servers in CML techniques when training DNNs, thereby increasing training efficiency. The framework distributes the computation of DNN layers on the server and devices, balances the workload on both the server and devices, and reorders the computation for different inputs in the training process. PiPar overlaps the device-side and server-side computations with communication between the devices and server, thereby improving resource utilization, which in turn accelerates CML training.
PiPar redesigns the training process of DNNs. Traditionally, training a DNN involves the forward propagation pass (or forward pass) and backward propagation pass (or backward pass). In the forward pass, one batch of input data (also known as a mini-batch) is used as input for the first DNN layer and the output of each layer is passed on to subsequent layers to compute the loss function. In the backward pass, the loss function is passed layer by layer from the last layer to the first layer to compute the gradients of the DNN model parameters.
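As a concrete illustration of the forward and backward passes described above, the following is a minimal sketch using a hypothetical one-parameter model (not from the paper): the forward pass computes the prediction, the backward pass applies the chain rule to obtain the gradient, and the parameter is updated.

```python
# Toy illustration of one training iteration on a mini-batch for y = w * x
# with squared loss L = (y - target)^2. Purely illustrative, not PiPar code.

def train_step(w, mini_batch, lr=0.1):
    """One forward + backward pass followed by a gradient-descent update."""
    grad = 0.0
    for x, target in mini_batch:
        y = w * x                       # forward pass
        loss_grad = 2.0 * (y - target)  # dL/dy
        grad += loss_grad * x           # chain rule: dL/dw = dL/dy * dy/dw
    grad /= len(mini_batch)             # average gradient over the mini-batch
    return w - lr * grad                # parameter update

w = 0.0
batch = [(1.0, 2.0), (2.0, 4.0)]        # samples drawn from y = 2x
for _ in range(100):
    w = train_step(w, batch)            # w converges towards 2.0
```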
PiPar divides the DNN into two parts and deploys them on the server and devices as in SFL. Then the forward and backward passes are reordered for multiple mini-batches to reduce idle time. Each device executes the forward pass for multiple mini-batches in sequence. The intermediate result of each forward pass (the activations) is transmitted to the server, which executes the forward and backward passes for the remaining layers and transfers the gradients of the activations back to the device. The device then sequentially performs the backward passes for the mini-batches. The devices operate in parallel, and the local models are aggregated at a set frequency. Since many forward passes occur sequentially on the device, the communication for each forward pass overlaps the computation of the subsequent forward passes. Also, in PiPar, the server and device computations occur simultaneously for different mini-batches. Thus, PiPar reduces the idle time of devices and servers by overlapping server-side and device-side computations and server-device communication.
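The effect of this reordering can be sketched as a toy timeline simulation. The per-stage costs and both helper functions below are illustrative assumptions, not the paper's implementation; they only show why overlapping the stages shortens an iteration compared to running every stage of every mini-batch in series.

```python
# Toy timeline for one device and N mini-batches under PiPar's reordering:
# forward passes run back to back on the device; each upload, server step and
# download overlaps later device forwards; backward passes run once gradients
# return. All stage costs are illustrative.

def pipar_iteration_time(N, t_fc, t_up, t_fs, t_bs, t_down, t_bc):
    device_free = 0.0   # time at which the device finishes its latest stage
    server_free = 0.0   # time at which the server finishes its latest stage
    grads_ready = []    # per mini-batch, when its gradients reach the device
    for n in range(N):
        device_free += t_fc                              # device forward pass
        upload_done = device_free + t_up                 # overlaps next forwards
        server_free = max(server_free, upload_done) + t_fs + t_bs
        grads_ready.append(server_free + t_down)
    for n in range(N):
        device_free = max(device_free, grads_ready[n]) + t_bc  # device backward
    return device_free

def sequential_time(N, t_fc, t_up, t_fs, t_bs, t_down, t_bc):
    """SFL-like baseline: every stage of every mini-batch in strict sequence."""
    return N * (t_fc + t_up + t_fs + t_bs + t_down + t_bc)

pipelined = pipar_iteration_time(4, 1, 1, 1, 1, 1, 1)  # 12 time units
serial = sequential_time(4, 1, 1, 1, 1, 1, 1)          # 24 time units
```

With four equal-cost stages per side, pipelining halves the iteration time in this toy setting because the server works while the device computes later forward passes.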
This paper makes the following contributions: (1) The development of a novel framework PiPar to accelerate collaborative training of DNNs by improving resource utilization. PiPar is the first work to reduce resource idling in CML by reordering training tasks across a server and the participating devices. Idle time is reduced by leveraging pipeline parallelism to overlap device and server computations and device-server communication.
(2) Development of a low-overhead automated parameter selection approach for further optimizing CML workloads across devices and servers to maximize the overall training efficiency.
PiPar and the automated parameter selection approach are evaluated on a lab-based testbed. The experimental results demonstrate that: a) compared to FL, PiPar can accelerate the training process by up to 34.6×, and the idle time of hardware resources is reduced by up to 64.1×; b) the automated parameter selection approach can find the optimal or near-optimal parameters in less time than an exhaustive search. It is also experimentally demonstrated that PiPar achieves performance benefits when incorporating differential privacy methods and operating in environments with heterogeneous devices and changing bandwidths.
The rest of this paper is organized as follows. Section II considers the background and work related to this research. The PiPar framework and the two approaches that underpin the framework are detailed in Section III. Section IV provides a theoretical analysis of model convergence and accuracy using PiPar. Experiments in Section V demonstrate the effectiveness of the PiPar framework. Section VI concludes this article.

II. BACKGROUND AND RELATED WORK
Section II-A provides the background of collaborative machine learning (CML), and Section II-B introduces the related research on improving the training efficiency in CML.

A. Background
The training process of three popular CML techniques, namely federated learning (FL), split learning (SL) and split federated learning (SFL), and their limitations due to resource under-utilization are presented.
1) Federated learning: FL [10,11,12,13] uses a set of devices coordinated by a central server to train deep learning models collaboratively.
Assume K devices participate in the training process as shown in Figure 1(a). In Step ①, the devices train the complete model M_k locally, where k = 1, 2, ..., K. In each iteration, the local model trains on a mini-batch of data by completing the forward and backward passes to compute gradients of all model parameters and then updating the parameters with the gradients. A training epoch involves training over the entire dataset, which consists of multiple iterations. In Step ②, after a predefined number of local epochs, the devices send the local models M_k to the server, where k = 1, 2, ..., K. In Step ③, the server aggregates the local models to obtain a global model M, using the FedAvg algorithm [13]. Typically, local model training on devices (Step ①) takes most of the time, while the server with greater compute capability is idle. Therefore, PiPar utilizes the idle resources on the server during training.
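The aggregation in Step ③ can be sketched as follows. The dict-of-lists model representation and the `fed_avg` helper are illustrative assumptions; the sketch only shows the data-weighted averaging at the heart of FedAvg.

```python
# Sketch of FedAvg-style aggregation: the global model is the sample-count-
# weighted average of the K local models. Models are plain dicts mapping
# parameter names to lists of floats (illustrative representation).

def fed_avg(local_models, num_samples):
    """Aggregate local models, weighting each by its device's sample count."""
    total = sum(num_samples)
    global_model = {}
    for name in local_models[0]:
        length = len(local_models[0][name])
        global_model[name] = [
            sum(m[name][i] * n / total for m, n in zip(local_models, num_samples))
            for i in range(length)
        ]
    return global_model

m1 = {"w": [1.0, 2.0]}                      # local model of device 1
m2 = {"w": [3.0, 4.0]}                      # local model of device 2
g = fed_avg([m1, m2], num_samples=[100, 300])  # device 2 weighted 3x heavier
```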
2) Split learning: SL [14,15] is another privacy-preserving CML method. Since a DNN consists of consecutive layers, SL splits the entire DNN M into two parts at the granularity of layers and deploys them on the server (M^s) and the devices (M^c_k, where k = 1, 2, ..., K).
As shown in Figure 1(b), the devices train the initial layers of the DNN, the server trains the remaining layers, and the devices work in a round-robin fashion. In Step ①, the first device executes the forward pass of M^c_1 on its local data. Compared to FL, device-side computation is significantly reduced because only a few layers are trained on devices. However, since the devices work in sequence (instead of in parallel as in FL), the overall training efficiency decreases as the number of devices increases.
3) Split federated learning: Since FL is computationally intensive on devices and SL works inefficiently on the device, SFL [16] was developed to alleviate both limitations. Similar to SL, SFL also splits the DNN across the devices (M^c_k, where k = 1, 2, ..., K) and the server (M^s) and collaboratively trains the DNN. However, in SFL, the devices train in parallel and utilize a 'main' server for training the server-side model and a 'Fed' server for aggregation.
The training process is shown in Figure 1(c). In Step ①, the forward passes of M^c_k, where k = 1, 2, ..., K, are executed on the devices in parallel, and in Step ②, the activations are uploaded to the main server. In Step ③, the main server trains M^s, and in Step ④, the gradients are sent back to all devices before they complete the backward pass in Step ⑤. At a predefined frequency, the models M^c_k, where k = 1, 2, ..., K, are uploaded to the Fed server in Step ⑥. In Step ⑦, the models are aggregated to the global model M^c. In Step ⑧, M^c is downloaded to the devices and used for the next round of training.
SFL utilizes device parallelism to improve the training efficiency of SL [7]. However, the server still waits while the devices are training the model (Step ①) and transmitting data (Step ②), and vice versa, which leads to resource under-utilization. PiPar addresses this problem by parallelizing the steps performed on the server and the devices.

B. Related work
Existing research to improve the training efficiency of CML focuses on the three aspects considered below.
1) Improving resource utilization using pipeline parallelism: Approaches employing pipeline parallelism have been proposed to improve the compute and network utilization of resources. GPipe [23] and PipeDream [24] use pipeline parallelism when a DNN is distributed to multiple computing nodes and parallelize the computations on different nodes. They both reduce the idle time on computing resources. To further improve hardware utilization, PipeMare [25] implements asynchronous training, and Chimera [26] uses bidirectional pipelines instead of a unidirectional one. PipeFisher [27] takes advantage of idle resources to execute second-order optimization to accelerate model convergence. However, these approaches for distributed DNN training cannot be directly applied to CML for three reasons.
Firstly, the context in which current pipeline parallelism approaches were designed to operate is completely different from CML. They were designed for GPU clusters with substantial compute resources where the data flow is sequential (data is provided as input to one node and then the output goes to the next node, and so on). However, in CML, the data generated on end-user devices is not shared with other devices or servers, in order to preserve privacy. PiPar is therefore proposed to tackle the problem of distributed training on devices in centralized topologies.
Secondly, in existing pipeline parallelism approaches, different layers of the DNN are mapped onto different nodes in a cluster and they do not share weights. However, in CML, each device trains a local model on its data, and the model weights are subsequently synchronized with other devices. PiPar splits the model across the server and all the devices to alleviate the computational burden on the devices and proposes a method to synchronize the server-side and client-side models.
Thirdly, the bandwidth between different devices and servers in CML will be variable, as seen in real-world mobile environments. The bandwidth between the nodes of a GPU cluster, in contrast, is relatively less prone to such variability. Since the communication time between nodes of a GPU cluster is relatively small, existing pipeline parallelism approaches tend to hide communication behind computation. However, the communication of activations and gradients in CML involves a substantial data volume, which is not handled by existing approaches. PiPar takes this into account, and hence a parameter selection method is proposed to overlap communication and computation.
Given the above limitations, we propose a novel framework, PiPar, that fully utilizes the computing resources on the server and devices and the bandwidth available between them to improve the training efficiency of CML.
2) Reducing the impact of stragglers: Stragglers among the devices used for training increase the overall training time of CML. A device selection method was proposed based on the resource availability of devices to minimize the impact of stragglers [28]. Certain neurons of the DNN on a straggler are masked to accelerate computation [29]. Local gradients were aggregated hierarchically to accelerate FL on heterogeneous devices [30]. To balance workloads across heterogeneous devices, FedAdapt [17] offloaded DNN layers from devices to a server. An adaptive asynchronous federated learning mechanism [31] was proposed to mitigate stragglers. These methods alleviated the impact of stragglers but did not address the fundamental challenge of sequential computation and communication between the devices and server that results in low resource utilization.
3) Reducing communication overhead: In limited bandwidth environments, communication overhead limits the training efficiency of CML techniques. To reduce the communication traffic in FL, a relay-assisted two-tier network was developed [32]. Models and gradients were transmitted simultaneously and aggregated on the relay nodes. Pruning, quantization and selective updating were used to reduce the model size and thus reduce the computation and communication overhead [33]. The communication involved in the backward pass of SFL was improved by averaging the gradients on the server-side model and broadcasting them to the devices instead of unicasting the unique gradients to devices [34]. Overlap-FedAvg [35] was proposed to decouple the computation and communication during training and overlap them to reduce idle resources. However, the use of computing resources located at the server was not fully leveraged.
These methods are effective in reducing the data volume transferred over the network, thus reducing the communication overhead. However, this can come at the cost of model accuracy.

III. PiPar
This section develops PiPar, a framework to improve the resource utilization of CML in the context of FL and SFL. PiPar accelerates the execution of sequential DNNs for the first time by leveraging pipeline parallelism to improve the overall resource utilization in centralized CML.
The PiPar framework is underpinned by two approaches, namely pipeline construction and automated parameter selection. The first approach constructs a training pipeline to balance the overall training workload by (a) reallocating the computations for different DNN layers on the server and devices, and (b) reordering the forward and backward passes for multiple mini-batches of data by scheduling them onto idle resources. Consequently, not only is resource utilization improved by using PiPar, but the overall training of the DNN is also accelerated. The second approach of PiPar enhances the performance of the first approach by automatically selecting the optimal control parameters (such as the point at which the DNN must be split across the device and the server, and the number of mini-batches that can be executed concurrently in the pipeline).

A. Motivation
The following three observations on low resource utilization when training DNNs in CML motivate PiPar.
(1) The server and devices need to work simultaneously: The devices and server work in an alternating manner in the current CML methods, which is a limitation that must be addressed to improve resource utilization. In FL, the server starts to aggregate local models only after all devices have completed training their local models. In SL/SFL, the sequential computation of DNN layers results in the sequential working of the devices and the server. The dependencies between server-side and device-side computations need to be eliminated to reduce the resulting idle time on the resources. PiPar attempts to make the server and the devices work simultaneously by reallocating and reordering training tasks.
(2) Compute-intensive and I/O-intensive tasks need to be overlapped: Compute-intensive tasks, such as model training, involve large-scale computations performed by computing units (CPU/GPU), while I/O-intensive tasks refer to input and output tasks of disk or network, such as data transmission, which usually do not have a high CPU requirement. A compute-intensive and an I/O-intensive task can be executed in parallel on the same resource without mutual dependencies. However, in current CML methods, both server-side and device-side computations are paused when communication is in progress, which creates idle time on compute resources. PiPar improves this by overlapping compute-intensive and I/O-intensive tasks.
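A minimal sketch of such overlap, using a background thread as a stand-in for the network: the main thread keeps producing "activations" (the compute-intensive task) while the sender thread drains a queue (the I/O-intensive task). All names here are hypothetical, not PiPar's actual implementation.

```python
# Overlapping compute with I/O via a producer-consumer queue: the compute
# loop never blocks on the "network"; it hands each result to a sender thread.

import queue
import threading

def send_activations(outbox, sent_log):
    """I/O-intensive task: drain the queue (stands in for a network send)."""
    while True:
        item = outbox.get()
        if item is None:           # sentinel: no more mini-batches
            return
        sent_log.append(item)

outbox = queue.Queue()
sent = []
sender = threading.Thread(target=send_activations, args=(outbox, sent))
sender.start()

for n in range(4):
    activations = f"activations-{n}"  # stands in for a forward pass
    outbox.put(activations)           # hand off; compute continues immediately
outbox.put(None)                      # signal completion
sender.join()
```

Because `queue.Queue` is thread-safe and FIFO, the activations arrive in order even though computation and transmission proceed concurrently.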
(3) Workloads on the server-side and client-side need to be balanced: Idle time on resources is also caused by imbalanced workloads on the server and devices. PiPar balances the workloads on the server and device sides by splitting the DNN carefully.

B. Pipeline construction
Assume that K devices and a server train a sequential DNN collaboratively by using data residing on each device. Conventionally, the dataset on each device is divided into multiple mini-batches that are fed to the DNN in sequence. Training on each mini-batch involves a forward pass that computes a loss function and a backward pass that computes the gradients of the model parameters. A training epoch ends after the entire dataset has been fed to the DNN. To solve the problem of low resource utilization faced by the current CML methods, PiPar constructs a training pipeline that reduces the idle time on resources during collaborative training.
Each forward and backward pass of CML methods comprises four tasks: (i) the device-side compute-intensive task, such as model training; (ii) the device-to-server I/O-intensive task, such as data uploading; (iii) the server-side compute-intensive task, such as model training (only in SL and SFL) and model aggregation; (iv) the server-to-device I/O-intensive task, such as data downloading. The four tasks can only be executed in sequence in current CML methods, resulting in idle resources. To solve this problem, a pipeline is developed to balance and parallelize the above tasks. The pipeline construction approach involves three phases, namely DNN splitting, training stage reordering and multi-device parallelization.
Phase 1 - DNN splitting: The approach aims to overlap the above-mentioned four tasks to reduce idle time on computing resources on the server and devices as well as idle network resources. Since this approach does not reduce the computation and communication time of each task, it needs to balance the time required by the four tasks to avoid straggler tasks from increasing the overall training time. For example, in FL, the device-side compute-intensive task is the most time-consuming, while the other three tasks require relatively less time. In this case, overlapping the four tasks will not significantly reduce the overall training time. Therefore, it is more appropriate to split the DNN and divide the training task across the server and the devices (similar to previous works [16,17]). In addition, since the output of each DNN layer has a variable size, different split points of the DNN will result in different volumes of transmitted data. Thus, changing the split point based on the computing resources and bandwidth can also balance the I/O-intensive tasks with compute-intensive tasks. The selection of the best splitting point is presented in Section III-C.
Splitting DNNs does not affect model accuracy, since it does not alter the computations but only the resource on which they are executed. In FL, each device k, where k = 1, 2, ..., K, trains a complete model M_k. PiPar splits M_k into a device-side model M^c_k and a server-side model M^s_k, represented as:

M_k = M^c_k ⊕ M^s_k (1)

where the binary operator ⊕ stacks the layers of two partitions of a DNN as a complete DNN.
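Splitting at the granularity of layers can be sketched as list partitioning; the layer names and helper functions below are illustrative, with ⊕ realized as simple concatenation, which makes it clear that no computation changes, only placement.

```python
# Sketch of splitting a sequential DNN at layer index P: layers 1..P form the
# device-side model and layers P+1..Q the server-side model.

def split_model(layers, split_point):
    """Return (device_side, server_side) partitions of a layer list."""
    return layers[:split_point], layers[split_point:]

def stack(device_side, server_side):
    """The ⊕ operator: recombine the two partitions into the full model."""
    return device_side + server_side

model = ["conv1", "conv2", "pool", "dense1", "dense2"]  # illustrative layers
m_c, m_s = split_model(model, split_point=3)
assert stack(m_c, m_s) == model   # M_k = M^c_k ⊕ M^s_k holds by construction
```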
There are K pairs of {M^c_k, M^s_k}, where M^c_k is deployed on device k while all of the M^s_k are deployed on the server. This is different from SL and SFL, where only one model is deployed on the server side.
Phase 2 - Training stage reordering: The forward and backward passes of multiple mini-batches are reordered so that they proceed through the pipeline concurrently. Figure 2(b) shows that, compared to conventional training (Figure 2(a)), the four tasks can be considerably overlapped, and it is possible to significantly reduce the idle time of the server and the devices.
To guarantee model accuracy similar to classic FL, the gradients must be obtained from the same number of data samples when the model is updated. This requires that the number of data samples involved in each training iteration in PiPar is the same as the original batch size in FL. Since N mini-batches are used in each training iteration, the size of each mini-batch B′ is reduced to 1/N of the original batch size B in FL.
Reordering training stages does not impact model accuracy, which is demonstrated in Section IV-B.
Phase 3 - Multi-device parallelization: The workloads of multiple devices involved in collaborative training need to be coordinated. On the device side, each device k is responsible for training its model M^c_k, and PiPar allows the devices to train in parallel for efficiency. On the server side, the counterpart K models (M^s_1 to M^s_K) are deployed and trained simultaneously. However, this may result in contention for compute resources.
Figure 3(a) shows the case of a single device (same as Figure 2(b), but without showing communication), whereas Figure 3(b) and Figure 3(c) show the case of multiple devices. Figure 3(b) offers a solution in which the server-side models are trained sequentially. However, the server-side models that are trained relatively late delay the backward passes of the corresponding device-side models, for example, b^{c_2}_n, where n = 1, 2, ..., N, in Figure 3(b).
Alternatively, data parallelism can be employed. The activations from different devices are treated as different inputs, and the server-side models are trained in parallel on these inputs. This is shown in Figure 3(c). It is worth noting that, compared to training a single model, training multiple models at the same time may result in longer training time for each model on a resource-limited server. This approach, nonetheless, mitigates stragglers on devices.
At the end of each training epoch, the device-side models M^c_k are uploaded to the server and, combined with the server-side models M^s_k, constitute the entire models M_k (Equation 1). The complete model M_k of each device is aggregated into a complete global model M, using the FedAvg algorithm [13]. If a given device disconnects from the server, the aggregation carried out on the server excludes the model of that device. If the device reconnects to the server, it downloads the latest global model and continues training.
Algorithm 1 and Algorithm 2 have the same computational complexity as the FedAvg algorithm [13]. However, PiPar introduces parallelism within training.

C. Automated parameter selection
To maximize the utilization of idle resources, two parameters of PiPar that impact the performance of the training pipeline are considered: a) Split point of a DNN, denoted as P: all layers with indices less than or equal to P are deployed on the device and the remaining layers on the server. The number of layers determines the amount of computation on the server/device, and the volume of data output from the split layer determines the communication traffic. Therefore, finding the most suitable value of P for each device will balance the time required for computation on the server and the device as well as the communication between them.
b) Parallel batch number, denoted as N: the number of mini-batches used for concurrent training in each iteration. The computations of the mini-batches fill up the pipeline, so the number of mini-batches for each training iteration must be determined.
The naive choice of {P, N } makes the results of PiPar no worse than FL and SFL.When P is the layer number and N = 1, PiPar is the same as FL; when P is the same split point as SFL and N = 1, PiPar is the same as SFL.However, carefully selected {P, N } values can further optimize the performance of PiPar.The optimal values of {P, N } can be obtained by an exhaustive search.The model will be trained with all parameter combinations, and then the optimal parameter combination with the shortest training time can be selected.This is unsuitable to be adopted in PiPar in practical as it is time consuming.In addition, we can also select {P, N } values empirically.Empirical selections will make PiPar a better solution than FL and SFL, but cannot make it achieve its optimal performance as the exhaustive search.Therefore, we propose an automated parameter selection approach that identifies an optimal or near-optimal combination of parameters in a shorter time than exhaustively searching.These parameters vary with DNNs, server/device combinations, and network conditions.Therefore, the developed approach relies on estimating the training time for different parameters given the DNN and the network condition.
The approach aims to select the best pair {N_k, P_k} for each device k to minimize idle resources, and proceeds in three phases. Firstly, several training iterations are profiled to identify the size of the output data and the training time of each layer of the DNN. Secondly, given a pair {N_k, P_k}, the training time of each epoch is estimated using dynamic programming. Thirdly, the candidates for {N_k, P_k} are shortlisted; since the training time can be estimated for every candidate, the one with the lowest training time is selected. The three phases are explained in detail as follows.
Phase 1 - Profiling: In this phase, an additional training period is required. The complete model is trained on each device and the server separately for a predefined number of iterations. If the entire model cannot fit in the memory of the devices, the devices train as many layers as possible and the server trains the complete model. The following information is empirically collected: a) Time spent in the forward/backward pass of each layer deployed on each device and the server. Assume that f^{c_k}_q, b^{c_k}_q, f^s_q and b^s_q denote the forward and backward passes of layer q on device k and on the server, and t(·) denotes time. Then t(f^{c_k}_q), t(b^{c_k}_q), t(f^s_q) and t(b^s_q) are the times taken for the forward and backward passes on the devices and the server, which are measured and recorded during training.
b) Output data volume of each layer in the forward and backward passes. ṽ^f_q and ṽ^b_q denote the output data volume of layer q in the forward and backward passes, respectively.
Phase 2 - Training time estimation: Assume that f^{c_k}_n, b^{c_k}_n, f^{s_k}_n and b^{s_k}_n are the time spent in the forward and backward passes of M^c_k and M^s_k for mini-batch n, where n = 1, 2, ..., N_k. The time spent in each stage is the sum of the time spent in all relevant layers. Since the size of each mini-batch in PiPar is reduced to 1/N_k, the time required for each layer is also reduced to 1/N_k. The time of each training stage is therefore estimated as:

t(f^{c_k}_n) = (1/N_k) Σ_{q=1}^{P_k} t(f^{c_k}_q), t(b^{c_k}_n) = (1/N_k) Σ_{q=1}^{P_k} t(b^{c_k}_q)

t(f^{s_k}_n) = (1/N_k) Σ_{q=P_k+1}^{Q} t(f^s_q), t(b^{s_k}_n) = (1/N_k) Σ_{q=P_k+1}^{Q} t(b^s_q)

Assume that u^k_n and d^k_n are the time required for uploading and downloading between device k and the server for mini-batch n, where n = 1, 2, ..., N_k, and w^k_u and w^k_d are the uplink and downlink bandwidths. Since the size of the transmitted data is also reduced to 1/N_k:

t(u^k_n) = ṽ^f_{P_k} / (N_k w^k_u), t(d^k_n) = ṽ^b_{P_k+1} / (N_k w^k_d)

The time required by all training stages is estimated using the above equations, and the training time of each epoch can then be estimated using dynamic programming. Within each training iteration, a given training stage has previous and next stages (except the first and last stages), as shown in Table I. The first stage is f^{c_k}_1 and the last stage is b^{c_k}_{N_k}. We use T(r) to denote the total time from the beginning of the training iteration to the end of stage r, and t(r) to denote the time spent in stage r; the overall training time is thus T(b^{c_k}_{N_k}). Since any stage can start only after all of its previous stages have completed:

T(r) = t(r) + max_{r′ ∈ prev(r)} T(r′)

where prev(·) is the function that returns all previous stages of the input stage. Since every t(r) is obtained from the profiled measurements, the recursion can be solved starting from the first stage, and the overall time of one training iteration can then be estimated.
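The recursion can be sketched as follows; the stage names, costs and dependency structure below are illustrative stand-ins for the profiled values and the stage graph of Table I, not the paper's data.

```python
# Dynamic-programming estimate of iteration time: a stage r can start only
# after every stage in prev(r) finishes, so T(r) = t(r) + max over prev(r).

from functools import lru_cache

def make_estimator(stage_time, prev):
    """Build a memoized T(r) for the given stage costs and dependencies."""
    @lru_cache(maxsize=None)
    def T(r):
        if not prev.get(r):
            return stage_time[r]                       # a first stage
        return stage_time[r] + max(T(p) for p in prev[r])
    return T

# Toy pipeline for one mini-batch plus a second device forward pass:
# f1 -> u1 -> s1 -> d1 -> b1, with f2 depending on f1 and b1 on both d1 and f2.
stage_time = {"f1": 2, "f2": 2, "u1": 1, "s1": 3, "d1": 1, "b1": 2}
prev = {"f2": ["f1"], "u1": ["f1"], "s1": ["u1"],
        "d1": ["s1"], "b1": ["d1", "f2"]}
T = make_estimator(stage_time, prev)
iteration_time = T("b1")   # longest dependency path ending at the last stage
```

Here T("b1") = 2 + max(T("d1"), T("f2")) = 2 + max(7, 4) = 9, i.e. the critical path through the stage graph.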
Phase 3 - Parameter selection: In this phase, the candidates of $\{N_k, P_k\}$ are shortlisted. Since the training time can be estimated for each candidate, the one with the shortest training time can be selected.
Assume that the DNN has $Q$ layers, such as dense, convolutional and pooling layers, and that the memory of the devices can only accommodate the training of $Q'$ layers, where $Q' \le Q$. The range of $P_k$ is $\{P_k \mid 1 \le P_k \le Q', P_k \in \mathbb{Z}^+\}$, where $\mathbb{Z}^+$ is the set of all positive integers.
Given $P_k$, the idle time of device $k$ between the forward pass and backward pass of each mini-batch (the blank timeline between $f^c$ and $b^c$ in Figure 2(a)) needs to be filled by the forward passes of the following mini-batches. As a result, the original mini-batch and the following mini-batches are executed concurrently in one training iteration.
For example, as shown in Figure 2(a), the device idle time between $f^c$ and $b^c$ is equal to $t(u) + t(f^s) + t(b^s) + t(d)$. Thus, the forward passes of the subsequent $\left\lceil \frac{t(u) + t(f^s) + t(b^s) + t(d)}{t(f^c)} \right\rceil$ mini-batches can be used to fill in the idle time. Since the batch size used in PiPar is reduced to $1/N$, the time required for the forward and backward passes of each layer, uploading and downloading is also reduced to $1/N$; the scaling therefore cancels in the ratio. The parallel batch number for device $k$ is estimated as:
$$N_k = \left\lceil \frac{t(u^k) + t(f^{s_k}) + t(b^{s_k}) + t(d^k)}{t(f^{c_k})} \right\rceil + 1$$
For each device $k$, the best $\{N_k, P_k\}$ can be selected from the shortlisted candidates by estimating the training time.
Since the training time of PiPar with the parameter pair $\{N_k, P_k\}$ is estimated based on profiling data obtained from training complete models with the original batch size, this approach does not guarantee the selection of optimal parameters. However, our experiments in Section V-D show that the parameters selected by this approach are close to the optimal values.
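The selection phase can be sketched end to end: enumerate the feasible split points, derive a parallel batch number for each via the idle-time rule, estimate the epoch time, and keep the minimum. All per-layer times and transfer times below are illustrative assumptions, and the epoch-time model is a crude steady-state sketch (pipeline fill plus a bottleneck term), not the paper's estimator.

```python
import math

# Hypothetical profiled per-layer times (seconds, full batch).
Q = 5                    # layers in the DNN
Q_prime = 4              # layers that fit in device memory
fwd_c = [0.30, 0.25, 0.20, 0.15, 0.10]   # device-side forward, per layer
bwd_c = [0.35, 0.30, 0.25, 0.20, 0.15]   # device-side backward, per layer
fwd_s = [0.03, 0.03, 0.02, 0.02, 0.01]   # server-side forward, per layer
bwd_s = [0.04, 0.03, 0.03, 0.02, 0.02]   # server-side backward, per layer
up    = [0.50, 0.40, 0.20, 0.10, 0.05]   # upload time of layer-q activations
down  = [0.50, 0.40, 0.20, 0.10, 0.05]   # download time of layer-q gradients

def parallel_batches(P: int) -> int:
    """Idle-time rule of Section III-C: fill the gap between f^c and b^c
    with forward passes (the 1/N scaling cancels in the ratio)."""
    t_fc = sum(fwd_c[:P])
    idle = up[P - 1] + sum(fwd_s[P:]) + sum(bwd_s[P:]) + down[P - 1]
    return math.ceil(idle / t_fc) + 1

def epoch_time_estimate(P: int, N: int, n_iter: int = 100) -> float:
    """Crude estimate: fill time for the first mini-batch, then the
    slowest resource (device, link, or server) bounds each of the rest."""
    dev  = (sum(fwd_c[:P]) + sum(bwd_c[:P])) / N
    link = (up[P - 1] + down[P - 1]) / N
    srv  = (sum(fwd_s[P:]) + sum(bwd_s[P:])) / N
    fill = dev + link + srv
    return n_iter * (fill + (N - 1) * max(dev, link, srv))

candidates = [(P, parallel_batches(P)) for P in range(1, Q_prime + 1)]
best = min(candidates, key=lambda pn: epoch_time_estimate(*pn))
print("shortlist:", candidates, "selected {P, N}:", best)
```

Under these made-up numbers the shallowest split wins because the device is the bottleneck; with faster links or slower devices a deeper split would be selected.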

IV. CONVERGENCE ANALYSIS
This section analyzes the impact of splitting the neural network and reordering the training stages on model convergence and final accuracy.

A. Splitting DNNs and model accuracy
We will demonstrate that splitting a DNN does not impact model accuracy. Assume that $x_0$ is a mini-batch of data and $y$ is the corresponding label set; $f_q$ denotes the forward pass function of layer $q$ and $x_q$ denotes the output of layer $q$, where $q = 1, 2, ..., Q$.

TABLE I. Previous and next stages of each training stage.

The forward pass of the complete model $M_k$ in FL is:
$$\hat{y} = f_Q(f_{Q-1}(\dots f_1(x_0)))$$
where $\hat{y}$ is the output of the final layer. If the model is split at layer $P$, the training that occurs on the device and the server is also split into two phases. On the device:
$$a = f_P(f_{P-1}(\dots f_1(x_0))) \quad (17)$$
$$\hat{y}' = f_Q(f_{Q-1}(\dots f_{P+1}(a))) \quad (18)$$
where $a$ denotes the activations that are transferred from device $k$ to the server and $\hat{y}'$ is the final output. Substituting $a$ into Equation 18 gives $\hat{y}' = \hat{y}$.
Thus, the loss function when splitting the model is the same as the original loss function when the model is not split.
We use $b_q$ to denote the backward pass function of layer $q$, which is the derivative of $f_q$.
The weights in layer $q$ of the original model and the split model are denoted as $w_q$ and $w'_q$, respectively. Assume $g$ is the gradient function; then:
$$g(w_q) = \frac{\partial l(y, \hat{y})}{\partial \hat{y}} \, b_Q(b_{Q-1}(\dots b_{q+1}(x_q))) \quad (22)$$
$$g(w'_q) = \frac{\partial l(y, \hat{y}')}{\partial \hat{y}'} \, b_Q(b_{Q-1}(\dots b_{q+1}(x_q))) \quad (23)$$
Based on Equation 22 and Equation 23, since $\hat{y}' = \hat{y}$, we have $g(w_q) = g(w'_q)$. Since splitting a DNN does not change the gradients, it consequently does not impact model accuracy.
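The argument can be checked numerically on a tiny two-layer network: computing the forward and backward passes end to end, versus splitting at $P = 1$ and transferring the activations $a$ one way and their gradient $g(a)$ the other, yields identical outputs and weight gradients. The network shape, loss and values below are illustrative, not from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 3))          # mini-batch x0: 4 samples, 3 features
y = rng.normal(size=(4, 2))          # labels
W1 = rng.normal(size=(3, 5))         # layer 1 (device side, P = 1)
W2 = rng.normal(size=(5, 2))         # layer 2 (server side)

def loss_grads(x, y, W1, W2):
    """Unsplit model: forward f2(f1(x)) and backward in one place,
    for the loss l = 0.5 * sum((yhat - y)^2)."""
    h = np.tanh(x @ W1)              # f1
    yhat = h @ W2                    # f2
    d_yhat = yhat - y                # dl/dyhat
    gW2 = h.T @ d_yhat
    d_h = d_yhat @ W2.T              # gradient w.r.t. the activations
    gW1 = x.T @ (d_h * (1 - h**2))   # backward through tanh
    return yhat, gW1, gW2

# Split execution: device runs f1 and "uploads" a; the server runs f2,
# computes its weight gradient, and returns the activation gradient g(a);
# the device finishes the backward pass.
a = np.tanh(x @ W1)                  # device forward, then upload a
yhat_s = a @ W2                      # server forward
d_yhat = yhat_s - y
gW2_s = a.T @ d_yhat                 # server-side weight gradient
g_a = d_yhat @ W2.T                  # downloaded gradient of activations
gW1_s = x.T @ (g_a * (1 - a**2))     # device backward

yhat, gW1, gW2 = loss_grads(x, y, W1, W2)
assert np.allclose(yhat, yhat_s) and np.allclose(gW1, gW1_s) and np.allclose(gW2, gW2_s)
print("split and unsplit outputs/gradients match")
```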

B. Reordering training stages and model accuracy
We will demonstrate that the model accuracy of a DNN remains the same before and after reordering the training stages. The dataset on client $k$ is denoted as $D_k$. $B^k$ denotes a mini-batch in the original training process, and $B^k_n$, where $n = 1, 2, ..., N$, denotes the mini-batches in a training round after reordering the training stages, where $\bigcup_{n=1}^{N} B^k_n = B^k$ and $|B^k_n| = B' = \lfloor B/N \rfloor$. In the original training process, the model is updated after the backward pass of each mini-batch $B^k$. Assuming $M_k$ is the original model and $\eta$ is the learning rate, the updated model is:
$$M_k' = M_k - \eta \, \nabla l(B^k; M_k)$$
In PiPar, the model is updated after the backward pass of the last mini-batch $B^k_N$ in each training round. The updated model is:
$$M_k'' = M_k - \eta \, \frac{1}{N} \sum_{n=1}^{N} \nabla l(B^k_n; M_k)$$
We have:
$$\frac{1}{N} \sum_{n=1}^{N} \nabla l(B^k_n; M_k) = \frac{1}{N B'} \sum_{n=1}^{N} \sum_{(x,y) \in B^k_n} \nabla l(x, y; M_k) \approx \frac{1}{B} \sum_{(x,y) \in B^k} \nabla l(x, y; M_k) = \nabla l(B^k; M_k)$$
Therefore, the updated models with and without reordering the training stages are nearly the same (and exactly the same if $N B' = B$). Thus, reordering the training stages does not impact model accuracy.
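The equality case $N B' = B$ can be checked directly: one full-batch gradient step coincides with a single step taken from the average of $N$ mini-batch gradients computed at the same weights. The model (linear regression with a mean-squared loss) and all values are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
B, N = 100, 4                        # FL batch size and parallel batch number
Bp = B // N                          # PiPar batch size B' = floor(B/N); here N*B' = B
X = rng.normal(size=(B, 3))
y = rng.normal(size=(B,))
w = rng.normal(size=(3,))
eta = 0.1

def grad(Xb, yb, w):
    """Per-sample-averaged gradient of l = mean of 0.5*(Xb w - yb)^2."""
    return Xb.T @ (Xb @ w - yb) / len(yb)

# Original process: one update from the full mini-batch B^k.
w_fl = w - eta * grad(X, y, w)

# PiPar: N smaller mini-batches B^k_n, gradients accumulated at the same
# weights and applied in a single update after the last backward pass.
g_acc = np.zeros_like(w)
for n in range(N):
    Xn, yn = X[n*Bp:(n+1)*Bp], y[n*Bp:(n+1)*Bp]
    g_acc += grad(Xn, yn, w) / N     # average over the N mini-batches
w_pipar = w - eta * g_acc

assert np.allclose(w_fl, w_pipar)    # identical when N * B' == B
print("reordered update matches the original update")
```

When $B$ is not divisible by $N$, a few samples are dropped and the two updates differ slightly, which is the "nearly the same" case above.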

V. EXPERIMENTAL STUDIES
This section quantifies the benefits of PiPar and demonstrates its superiority over existing CML techniques. We first consider the experimental environment in Section V-A. The training efficiency and the model accuracy and convergence of PiPar are compared against existing CML techniques in Section V-B and Section V-C, respectively. In Section V-D, the performance of the proposed automated parameter selection approach is evaluated. Section V-E analyses the impact of batch size on the performance of PiPar. Section V-F explores the impact on performance when using heterogeneous devices, when using differential privacy methods and when the bandwidth changes.

A. Setup
The test platform consists of one server and 100 devices. An 8-core i7-11850H processor with 32GB RAM is used as the server, and Raspberry Pis are used as the devices. Three network conditions, namely 4G, 4G+ and WiFi, are considered; WiFi provides 50Mbps uplink bandwidth and 50Mbps downlink bandwidth. A regular network with a normal error rate is used in the experiments. The TCP/IP protocol used will handle packet loss: when the protocol detects packet loss, it retransmits the packet.
Two settings, the first using small DNNs and the second using large DNNs, are used in the experiments. The small DNNs, namely VGG-5 [36], ResNet-18 [37] and MobileNetV3-Small [38] (Table II), are trained on the MNIST [39] and CIFAR-10 [40,41] datasets. The large DNNs, namely VGG-16 [36], ResNet-101 [37] and MobileNetV3-Large [38] (Table III), are trained on the CIFAR-100 [41] and Tiny ImageNet [42] datasets. The VGG, ResNet and MobileNet series models are convolutional neural networks (CNNs) and are representative of high-performing models from the computer vision community for testing CML methods on devices [43,44,16]. Since the Raspberry Pis have limited memory, the large DNNs cannot be trained using FL, as the entire model needs to fit in the device memory. The small and large DNNs can be trained using SFL and PiPar since the models are split across the device and server, and the device only executes a few layers. We have chosen a range of small and large DNNs to demonstrate that PiPar can work across a range of settings. MNIST and CIFAR-10 have ten classes, while CIFAR-100 and Tiny ImageNet have 100. Each dataset is split into training, validation and test datasets, as shown in Table IV. During training, the data samples are provided to the DNN as mini-batches. The size of each mini-batch (referred to as the batch size), unless otherwise specified, is 100 for each device in FL and SFL. The batch size in PiPar is $\lfloor 100/N_k \rfloor$, where $N_k$ is the parallel batch number for device $k$ and $k = 1, 2, ..., 100$ (refer to Equation 2).

B. Efficiency results
The experiments in this section compare the efficiency of PiPar with FL and SFL. Although SL is a popular CML technique, it is significantly slower than SFL since each device operates sequentially. Hence, SL is not considered in these experiments. All possible split points for SFL are benchmarked (based on the benchmarking method adopted in Scission [45]), and the efficiency of SFL with the best split point is reported. The split point and parallel batch number for PiPar are selected by the approach proposed in Section III-C.
2) Comparing resource utilization: The metric used to compare the utilization of hardware resources is the idle time of the server and devices, which is the total time that the server/device does not contribute to training models in an epoch. The device-side idle time is the average idle time across all devices. A lower idle time corresponds to a higher hardware resource utilization. Since the devices are homogeneous, it is assumed that there is a negligible impact of stragglers.
As shown in Figure 6, PiPar reduces the server-side idle time under all network conditions when training VGG-5, ResNet-18 and MobileNetV3-Small on MNIST and CIFAR-10. Since the server has more computing resources than the devices, model training is faster on the server. Hence, reducing the server-side idle time takes precedence over reducing the device-side idle time. Since FL trains complete models on the devices, the devices are rarely idle. However, the server is idle for a large proportion of the time when the model is trained. Compared to FL, SFL utilizes more resources on the server because the server trains multiple layers. PiPar reduces the server-side idle time by overlapping the server-side computations, device-side computations and communication between the server and the devices. Compared to FL and SFL, the server-side idle time using PiPar is reduced by up to 64.1× and 2.9×, respectively. PiPar also reduces the device-side idle time of SFL by up to 23.1× in all cases.

C. Convergence and model accuracy results
It is theoretically proven in Section IV that PiPar achieves comparable model accuracy and convergence to FL. It will be empirically demonstrated that PiPar does not adversely impact the convergence and accuracy of models.

The convergence curves and test accuracy of the small and large DNNs using FL, SFL and PiPar are reported. Note that due to the limited memory of devices, the large DNNs could not be executed using FL. Since network conditions do not affect model convergence and accuracy in FL and SFL, only the results for WiFi are reported.
1) Comparing convergence: Figure 8 and Figure 9 report the loss curves of FL, SFL and PiPar on the validation datasets using the small and large DNNs, respectively. The results highlight that for all combinations of DNNs and datasets, the loss curves of PiPar generally overlap those of FL and SFL. Therefore, PiPar does not affect model convergence.
It is noted that regardless of the DNN and dataset choice, PiPar converges within the same number of epochs as FL and SFL. Since PiPar reduces the training time per epoch, as presented in Section V-B1, the overall training time is also reduced.
2) Comparing accuracy: In Table V, the test accuracy of the small DNNs using FL, SFL and PiPar is reported. The last row shows the difference between the model accuracy of PiPar and the higher of FL and SFL, denoted as ∆. The results for FL, SFL and PiPar on the large DNNs are shown in Table VI. As seen in both tables, in all cases PiPar achieves comparable accuracy to FL and SFL on the test dataset, with the difference in accuracy ranging from -0.2% to +2.07%. Specifically, in the worst case, the test accuracy of training MobileNetV3-Small on CIFAR-10 achieved by PiPar is 0.2% lower than FL but still 0.7% higher than SFL.
These results empirically demonstrate that splitting a DNN and reordering the training stages in PiPar does not sacrifice model accuracy while obtaining a higher training efficiency.

D. Evaluation of automated parameter selection
The results presented here demonstrate the effectiveness of the automated parameter selection approach in PiPar. Initially, we exhaustively benchmarked all possible parameters to obtain the optimal parameters. We then show that PiPar selects parameters in less time than an exhaustive search, while achieving optimal or near-optimal training time.
The control parameters of PiPar, namely the split point $P_k$ and parallel batch number $N_k$ for device $k$, where $k = 1, 2, ..., K$, affect training efficiency (Section III-C). $P_k$ and $N_k$ are the same for all devices in our experiments since we consider homogeneous devices; so we use $P$ and $N$.
The optimal split point $P_{opt}$ and parallel batch number $N_{opt}$ can be found by exhaustive search given a finite search space. As shown in Table II and Table III, VGG-5, ResNet-18, MobileNetV3-Small, VGG-16, ResNet-101 and MobileNetV3-Large consist of 5, 10, 15, 16, 35 and 19 sequential layers, respectively. We have $P \in [1, 5]$ for VGG-5, $P \in [1, 10]$ for ResNet-18, $P \in [1, 15]$ for MobileNetV3-Small, $P \in [1, 16]$ for VGG-16, $P \in [1, 35]$ for ResNet-101 and $P \in [1, 16]$ for MobileNetV3-Large. Note that DNNs such as ResNet-18 and ResNet-101 have parallel branches that cannot be split; in this case, only connections between sequential layers can be selected as split points. Assuming the batch size for FL is $B$, since the batch size for PiPar, $\lfloor B/N \rfloor$, is no less than 1, we have $N \in [1, B]$. To exhaustively search for the optimal pair $\{P_{opt}, N_{opt}\}$, the DNNs are trained for one iteration using all possible $\{P, N\}$ pairs in PiPar, and the pair with the shortest training time is considered optimal.
The proposed method is used to select $P$ and $N$ for each experiment; only one training iteration is required in the profiling stage. We compare the $\{P, N\}$ selected by our approach against the $\{P_{opt}, N_{opt}\}$ determined by the exhaustive search in terms of training time and search time. $T_{P,N}$ denotes the training time of each epoch given $P$ and $N$; we have $T_{P,N} \ge T_{P_{opt},N_{opt}}$. The score in Equation 28 measures how close $T_{P,N}$ is to $T_{P_{opt},N_{opt}}$ and lies between 0 and 1. The higher the score, the better the $\{P, N\}$ values perform in terms of training time.
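Equation 28 is not reproduced in this excerpt; one ratio consistent with its stated properties (bounded by 0 and 1, higher is better, equal to 1 when the selected parameters match the optimum) is the hypothetical sketch below.

```python
def score(t_selected: float, t_optimal: float) -> float:
    """Closeness of T_{P,N} to T_{Popt,Nopt}, assuming the score is the
    ratio T_{Popt,Nopt} / T_{P,N}; since T_{P,N} >= T_{Popt,Nopt}, the
    result lies in (0, 1], with 1.0 meaning the optimum was selected."""
    return t_optimal / t_selected

assert score(100.0, 100.0) == 1.0          # optimal parameters selected
assert 0.0 < score(104.0, 100.0) < 1.0     # near-optimal selection
print(round(score(104.0, 100.0), 3))
```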
The results for small and large DNNs are shown in Table VII and Table VIII, respectively. $S_{P,N}$ is the search time to obtain $P$ and $N$ for the automated parameter selection approach or the exhaustive search; smaller is better. For both small and large DNNs, the proposed method selects near-optimal parameters in all cases (with a score ≥ 0.96) and optimal parameters in 83.3% of cases. In addition, our approach selects the optimal split point $P$ in all cases.
The results highlight that the time to search exhaustively is too high to be practical in the real world. The average cost of our approach is 27% of one training epoch, which is 6957× faster than exhaustive search. The approach is only executed once before training. Since training consists of hundreds of epochs or more, the overhead of executing this algorithm is negligible (less than 0.3%). Therefore, our approach provides a practical means to determine the parameters of PiPar.

E. Batch size analysis
Compared to FL, PiPar in Phase 2 increases the number of mini-batches involved in each training iteration and reduces the batch size. Assume $B$ is the FL batch size. The batch size in PiPar is $B' = \lfloor B/N \rfloor$, where $N$ is the parallel batch number (Equation 2). The DNNs are trained using FL and SFL for different values of $B$, and using PiPar for the corresponding $B'$, under different network conditions.
Figure 10 shows the training time per epoch for FL, SFL and PiPar using VGG-5 and CIFAR-10, while Figure 11 shows the training time per epoch for SFL and PiPar using VGG-16 and CIFAR-100. The training time of FL/SFL/PiPar decreases as the batch size increases because intra-batch parallelisation can be leveraged for matrix multiplication operations when a larger batch is trained. However, increasing the batch size is not an effective way to speed up training because it requires more memory and reduces model accuracy [46]. The results highlight that PiPar is consistently faster than FL and SFL for VGG-5 and faster than SFL for VGG-16 under all network conditions, regardless of batch size. The same trend is seen when training the other four DNNs and the other two datasets.

F. Robustness Analysis
We explore the robustness of PiPar in more complex environments. In Section V-F1, heterogeneous devices are used to evaluate the performance of PiPar against FL and SFL. The impact of using differential privacy methods is considered in Section V-F2. Finally, the impact of changing network bandwidth on the overhead of the automated parameter selection approach is considered in Section V-F3. In this section, only representative results are shown, which are obtained from an evaluation using VGG-5, ResNet-18 and MobileNetV3-Small on CIFAR-10 under different network conditions. A similar trend is noted for other datasets.
1) Impact on Performance with Heterogeneous Devices: The impact on performance using a homogeneous and a heterogeneous testbed is considered. The setup of the homogeneous testbed was presented in Section V-A. In the heterogeneous testbed, the same number of devices is used, but the CPU frequency of half of the devices is reduced from 1.2 GHz to 600 MHz to create an environment with devices of different compute capabilities.
Figure 12 shows the training time per epoch for FL, SFL and PiPar. Compared to the homogeneous testbed, the training time on the heterogeneous testbed increases since there are slower devices; the faster devices have to wait for the stragglers before model aggregation. In all cases, PiPar has the lowest training time compared to FL and SFL on both the homogeneous and heterogeneous testbeds. Specifically, on the heterogeneous testbed, PiPar accelerates the training of FL by up to 32× and SFL by up to 1.8×. In addition, FL has a larger difference in performance between the testbeds than SFL and PiPar, since the latter two train the last several layers of the DNN (as the DNN is partitioned) on the server, which is not affected by the heterogeneity of devices.
2) Impact on Performance When Using Differential Privacy Methods: Differential Privacy (DP) [47,48] is used to enhance privacy in CML methods by adding noise to the data transferred between the devices and the server. We consider the performance overhead introduced when using DP methods in FL, SFL and PiPar.
Two DP methods are considered. First, classic DP [47] is used to add noise to the local models on the devices before they are sent to the server, to make them irreversible. Second, PixelDP [48] adds an additional noise layer before the first layer of the device models, which prevents activations from being restored to raw data via reverse engineering.
Figure 13 shows the training time per epoch for the different CML methods with and without DP methods. Classic DP introduces an overhead of up to 11.7s to FL, 0.16s to SFL and 0.15s to PiPar. Compared to SFL and PiPar, FL has the largest overhead when using DP because in FL the entire model is trained on the device and classic DP adds noise to each parameter in the model. The overhead of PixelDP on FL, SFL and PiPar (up to 0.3s) is comparable since they use the same size of inputs and PixelDP adds noise to these inputs. The results highlight that the two DP methods applied to PiPar do not introduce a larger overhead than FL and SFL.
3) Impact of Changing Bandwidth on the Overhead for Automated Parameter Selection: It was shown in Section V-D that the automated parameter selection approach only needs to execute once before training in a stable network environment. Therefore, the overhead incurred is negligible. In this section, we measure the overhead of the approach in an unstable network where the bandwidth changes between 4G, 4G+ and WiFi conditions periodically in a controlled manner.
The overhead is measured for different intervals in which the bandwidth changes; the network is more unstable for smaller intervals. Figure 14 shows the percentage overhead of running the parameter selection approach with respect to training time for different intervals in which the bandwidth changes. The intervals considered range from 1 minute to 60 minutes. If the bandwidth changes every hour, the overhead of parameter selection is only 0.14%, 0.11% and 0.1% of the training time for VGG-5, ResNet-18 and MobileNetV3-Small, respectively. Considering the worst case in the experiments, a change of bandwidth every minute, the overhead of the approach is up to 8.3% of the training time. If the bandwidth change occurs on average every 10 minutes or more, then the overhead incurred is less than 1%.

VI. CONCLUSION
Deep learning models are collaboratively trained using paradigms such as federated learning, split learning or split federated learning on a server and multiple devices. However, these paradigms are limited in that the computation and communication across the server and devices are inherently sequential. This results in low compute and network resource utilization and leads to idle time on the resources. We propose a novel framework, PiPar, that addresses this problem for the first time by taking advantage of pipeline parallelism, thereby accelerating the entire training process. A novel training pipeline is developed to parallelize server-side and device-side computations as well as server-device communication. In the training pipeline, the DNN is split and deployed on the server and devices, and the training process on different mini-batches of data is reordered. A low overhead parameter selection approach is then proposed to maximize the resource utilization of the pipeline. Consequently, when compared to existing paradigms, our pipeline significantly reduces idle time on compute resources by up to 64.1× when training popular DNNs under different network conditions. An overall training speedup of up to 34.6× is observed. It is also experimentally demonstrated that PiPar achieves performance benefits when incorporating differential privacy methods and operating in environments with heterogeneous devices and changing bandwidths.
where $D_k$ is the local dataset on device $k$ and $|\cdot|$ is the function to obtain the set size. In Step ④, the global model is downloaded to the devices. The next round of training continues until the model converges.

Fig. 1. Training of CML methods, assuming K devices. The training steps (circled numbers) are explained in Section II-A.
(a) Conventional training pipeline when using a split DNN: Server-Side Comp., Uploading, Device-Side Comp., Downloading. (b) Training pipeline in PiPar: N mini-batches are trained in parallel in each training iteration; subscripts indicate the index of the mini-batch.

Fig. 2. Pipelines for one training iteration in conventional training and PiPar when using a split DNN. "Comp" is an abbreviation for computation. $f$, $b$, $u$ and $d$ represent forward pass, backward pass, upload and download, respectively. Superscripts indicate server-side ($s$) or client-side ($c$) computation or communication.

Figure 2(a) shows the pipeline of one training iteration of a split DNN for one pair $\{M^{s_k}, M^{c_k}\}$ (the device index $k$ is not shown). Any forward pass ($f$), backward pass ($b$), upload task ($u$) and download task ($d$) for each mini-batch is called a training stage. Idle time exists on the device between the forward pass $f^c$ and the backward pass $b^c$ of the device-side model. Thus, PiPar inserts the forward passes of the next few mini-batches into the device-side idle time to fill up the pipeline. As shown in Figure 2(b), in each training iteration, the forward passes for $N$ mini-batches, $f^c_1$ to $f^c_N$, are performed on the device in sequence. The activations of each mini-batch are sent to the server ($u_1$ to $u_N$) once the corresponding forward pass is completed, which utilizes idle network resources. Once the activations of any mini-batch arrive, the server performs the forward and backward passes, $(f^s_1, b^s_1)$ to $(f^s_N, b^s_N)$, and sends the gradients of the activations back to the device ($d_1$ to $d_N$). After completing the forward passes of the mini-batches and receiving the gradients, the device performs the backward passes, $b^c_1$ to $b^c_N$. Then the model parameters are updated and the training iteration ends. A training epoch ends when the entire dataset has been processed, which involves multiple training iterations.
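The schedule above can be simulated with a few lines of resource bookkeeping, comparing the pipelined iteration against processing the same $N$ mini-batches strictly one after another. The stage durations are illustrative assumptions, not measurements from the paper.

```python
# Per-mini-batch stage durations (seconds), illustrative values for a split
# DNN: device forward, upload, server forward+backward, download, device
# backward. Conventional training runs f^c, u, f^s+b^s, d, b^c back to back.
t_fc, t_u, t_s, t_d, t_bc = 0.4, 0.2, 0.2, 0.2, 0.5
N = 4                                # mini-batches per training iteration

def pipelined_makespan(N):
    free = {"dev": 0.0, "up": 0.0, "srv": 0.0, "down": 0.0}
    done_fwd, done = [0.0] * N, [0.0] * N
    # Forward passes f^c_1..f^c_N run back to back on the device; each
    # upload/server pass/download starts as soon as both its resource and
    # the mini-batch's previous stage are ready.
    for n in range(N):
        free["dev"] = done_fwd[n] = free["dev"] + t_fc
        free["up"] = max(free["up"], done_fwd[n]) + t_u
        free["srv"] = max(free["srv"], free["up"]) + t_s
        free["down"] = max(free["down"], free["srv"]) + t_d
        done[n] = free["down"]       # gradient of mini-batch n downloaded
    # Backward passes b^c_1..b^c_N then run on the device.
    for n in range(N):
        free["dev"] = max(free["dev"], done[n]) + t_bc
    return free["dev"]

sequential = N * (t_fc + t_u + t_s + t_d + t_bc)
pipelined = pipelined_makespan(N)
print(f"sequential {sequential:.1f}s vs pipelined {pipelined:.1f}s per iteration")
```

Under these numbers the pipelined iteration finishes well ahead of the sequential one because the device, links and server work concurrently instead of waiting on each other.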

Fig. 3. PiPar using single and multiple devices. Comp, $f$, $b$, $u$ and $d$ represent computation, forward pass, backward pass, upload and download, respectively. The superscripts $s_k$ and $c_k$ represent the index of the models $M^{s_k}$ and $M^{c_k}$, $k = 1, 2$, respectively.

Algorithm 1 (device-side training in PiPar, excerpt):
for $n = 1$ to $N_k$ do
  Receive $g(a_n)$ from the server
  ...
end
Send 'stop epoch' signal to the server
Send $M^{c_k}$ to the server
Receive $M^{c_k'}$ from the server
Update $M^{c_k} \leftarrow M^{c_k'}$
end
Send 'stop training' signal to the server
Return $M^{c_k}$

The server receives the activations and labels from device $k$ (Line 7) and uses them to compute the loss function (Line 9 to Line 10). After that, the gradients of the activations and the model weights are computed (Line 11 to Line 12). The former is then sent to device $k$ (Line 13), and the latter is used to update $M^{s_k}$ at the end of the training iteration (Line 15). After receiving the 'stop epoch' signal, the server receives the device-side model $M^{c_k}$ from device $k$ (Line 17) and assembles a complete model $M_k$ (Line 18). The $K$ models $M_k$, where $k = 1, 2, ..., K$, are aggregated into a global model $M$ (Line 20). $M$ is then split into a server-side model $M^{s_k'}$ and a device-side model $M^{c_k'}$ (Line 21). $M^{c_k'}$ is sent to device $k$ (Line 22), and $M^{s_k'}$ is used to update $M^{s_k}$ (Line 23). A training epoch ends. Training is completed when the 'stop training' signal is received from all devices.

Algorithm 2: Server-Side Training in PiPar (excerpt). /* Run on the server. */
Input: number of devices $K$; structure of the DNN with $Q$ layers; learning rate $\eta$; model split point $P_k$ and number of mini-batches in each iteration $N_k$, where $k = 1, 2, ..., K$.
Output: server-side models $M^{s_k}$, where $k = 1, 2, ..., K$.
for $k = 1$ to $K$ in parallel do
  Initialise the size of the dataset on device $k$: $D_k \leftarrow 0$
  while 'stop epoch' signal not received from all devices do  // start a training iteration
    for $n = 1$ to $N_k$ do
      Receive activations $a_n$ and labels $y_n$
      Update $D_k \leftarrow D_k + |a_n|$
      ...

The data volumes are measured and recorded during training.
Phase 2 - Training time estimation: To estimate the time spent in each training epoch of $\{M^{c_k}, M^{s_k}\}$, given the pairs $\{N_k, P_k\}$ for device $k$, the time for each training stage must be estimated.

1) Comparing efficiency: The efficiency of the CML techniques is measured by the training time per epoch. Section V-C1 will highlight that the loss curves of FL, SFL and PiPar overlap, so the same number of epochs is required for model convergence using the three techniques. Hence, if PiPar reduces the training time per epoch, it reduces the overall training time.

Fig. 6. Idle time per epoch on the server and devices in FL, SFL and PiPar under different network conditions for small DNNs. 'S' and 'D' in the legend represent server-side and device-side idle time, respectively; they are shown in the upward and downward bars.

Fig. 7. Idle time per epoch on the server and devices in SFL and PiPar under different network conditions for large DNNs. 'S' and 'D' in the legend represent server-side and device-side idle time, respectively; they are shown in the upward and downward bars. FL results are not shown as the entire DNN does not fit in the device memory.

Fig. 11. Training time per epoch for SFL and PiPar using VGG-16 and the CIFAR-100 dataset with different batch sizes B. FL results are not shown as the entire DNN does not fit in the device memory.

Figure 7 highlights that, compared to SFL, PiPar also reduces idle time when training VGG-16, ResNet-101 and MobileNetV3-Large on CIFAR-100 and Tiny ImageNet. The server-side and device-side idle times of SFL are reduced by up to 2.3× and 2.5×, respectively.

Fig. 12. Training time per epoch for FL, SFL and PiPar using small DNNs on CIFAR-10 under different network conditions on homogeneous and heterogeneous testbeds.

Fig. 13. Training time per epoch for FL, SFL and PiPar using small DNNs on CIFAR-10 under different network conditions with and without differential privacy methods.

Fig. 14. The percentage overhead of the automated parameter selection approach with respect to training time for different intervals in which the bandwidth changes.

TABLE V. MODEL ACCURACY (PERCENTAGE) FOR FL, SFL AND PiPar USING SMALL DNNS.

TABLE VI. MODEL ACCURACY (PERCENTAGE) FOR FL, SFL AND PiPar USING LARGE DNNS.

TABLE VII. PARAMETERS SELECTED BY THE APPROACH IN PiPar IN CONTRAST TO THE OPTIMAL PARAMETERS FOR SMALL DNNS.

TABLE VIII. PARAMETERS SELECTED BY THE APPROACH IN PiPar IN CONTRAST TO THE OPTIMAL PARAMETERS FOR LARGE DNNS.