An Efficient Message Filtering Strategy Based on Asynchronous ADMM with L1 Regularization

As the scale of data grows, distributed machine learning has received increasing attention. However, growing data also means rapidly increasing dimensionality, which inflates the communication traffic in a distributed computing cluster and degrades the performance of distributed algorithms. This paper proposes a message filtering strategy based on the asynchronous alternating direction method of multipliers (ADMM), which effectively reduces the communication time of the algorithm while preserving its convergence. A soft-threshold filtering strategy based on L1 regularization is proposed to filter the parameters sent by the master node, and a gradient truncation filtering strategy is proposed to filter the parameters sent by the slave nodes. In addition, the algorithm is updated asynchronously to reduce the waiting time of the master node. Experiments on large-scale sparse data show that our algorithm effectively reduces message traffic and reaches convergence in a shorter time.


1. Introduction
In the era of big data, the scale of data is increasing exponentially. For machine learning, "big" is reflected not only in the number of samples but also in the dimensionality of the data. For example, in the click-through rate prediction problem [1,2] of online advertising, datasets can exceed 10 million dimensions, and the data is typically stored in a sparse format. At this scale, it is difficult to solve traditional machine learning problems on a single machine. Distributed optimization [3,4] is a better solution: a global solution can be obtained through cooperative computation and communication among the nodes of a computing cluster.
ADMM [3] is a distributed optimization algorithm that can effectively solve global consensus problems and has a wide range of applications in machine learning [5] and computer vision [6]. ADMM accelerates the solution of a machine learning problem by splitting the convex loss function $f(x)$ into sub-functions $f_i(x_i)$ that are computed in parallel, i.e., it solves

$$\min_{x_1,\dots,x_N,\,z} \ \sum_{i=1}^{N} f_i(x_i) + g(z) \quad \text{s.t. } x_i - z = 0, \ i = 1,\dots,N.$$

The update of the distributed ADMM algorithm can be described as:

$$x_i^{k+1} = \arg\min_{x_i} \ f_i(x_i) + (y_i^k)^T (x_i - z^k) + \frac{\rho}{2}\left\|x_i - z^k\right\|_2^2,$$
$$z^{k+1} = \arg\min_{z} \ g(z) + \sum_{i=1}^{N} \left( -(y_i^k)^T z + \frac{\rho}{2}\left\|x_i^{k+1} - z\right\|_2^2 \right), \tag{1}$$
$$y_i^{k+1} = y_i^k + \rho \left( x_i^{k+1} - z^{k+1} \right),$$

where $k$ is the iteration index of the ADMM algorithm and $\rho > 0$ is the penalty parameter. In the distributed ADMM algorithm, a master node is responsible for updating the global variable $z$. The other nodes are slave nodes; each solves its local objective on its local dataset to obtain the local solution $x_i^{k+1}$ and the dual variable $y_i^{k+1}$. In each iteration, the master node needs to collect all local solutions $x_i^{k+1}$ and $y_i^{k+1}$ from the slave nodes to update the global variable $z$. ADMM reaches convergence after a number of iterations by alternately updating $x_i$, $z$, and $y_i$.
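The alternating updates in equation (1) can be sketched in a few lines. The following Python sketch is not the paper's C++/MPICH implementation: it solves a toy consensus problem in which each $f_i$ is a least-squares loss (so the $x$-update has a closed form) with $g(z) = \lambda\|z\|_1$; the function names and the quadratic choice of $f_i$ are illustrative assumptions.

```python
import numpy as np

def soft_threshold(v, kappa):
    """Elementwise soft-threshold operator S_kappa(v)."""
    return np.sign(v) * np.maximum(np.abs(v) - kappa, 0.0)

def consensus_admm(A_parts, b_parts, lam=0.1, rho=1.0, iters=200):
    """Global-consensus ADMM, equation (1), for N least-squares
    sub-problems f_i(x) = (1/2)||A_i x - b_i||^2 with g(z) = lam*||z||_1.
    Illustrative sketch only; the paper's f_i is the logistic loss."""
    N = len(A_parts)
    d = A_parts[0].shape[1]
    z = np.zeros(d)
    xs = [np.zeros(d) for _ in range(N)]
    ys = [np.zeros(d) for _ in range(N)]
    for _ in range(iters):
        # x-update: closed form for quadratic local losses
        for i in range(N):
            A, b = A_parts[i], b_parts[i]
            xs[i] = np.linalg.solve(A.T @ A + rho * np.eye(d),
                                    A.T @ b + rho * z - ys[i])
        # z-update: soft threshold of the averaged local/dual variables
        x_bar = sum(xs) / N
        y_bar = sum(ys) / N
        z = soft_threshold(x_bar + y_bar / rho, lam / (N * rho))
        # y-update: dual ascent on the consensus constraint
        for i in range(N):
            ys[i] = ys[i] + rho * (xs[i] - z)
    return z
```

In a real deployment the `for i in range(N)` loops run on separate slave nodes, and only the z-update is performed on the master.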
In the distributed ADMM algorithm, the master node needs to communicate with all slave nodes in every iteration. When the dimensionality of the communicated parameters is high, the communication time in the network increases greatly. In addition, because the computing performance of the slave nodes differs, the master node must wait for the slowest node to finish computing in each iteration, so the algorithm wastes a lot of time on waiting.
Many studies address optimization problems on large-scale sparse data [7,8]. Their main idea is that each dimension of the model parameter converges at a different speed; in each iteration, only a subset of valid dimensions is selected to update the model parameter, which lets the algorithms converge faster. In addition, asynchronous update mechanisms [4,9,10] have recently been proposed to reduce the waiting time of distributed algorithms.
Motivated by the above research, this paper proposes an efficient message filtering strategy based on the ADMM algorithm to address the communication bottleneck of distributed optimization on large-scale sparse datasets. In our strategy, messages are filtered before the parameters are sent, which reduces message traffic and significantly reduces communication time. In addition, we adopt an asynchronous update scheme that reduces waiting time. Experiments on sparse logistic regression show that our algorithm achieves better convergence and lower communication latency than the same algorithm without message filtering. This paper is organized as follows. Section 2 surveys related work. Section 3 describes the proposed algorithm, including the message filtering strategy and the asynchronous update mechanism. Section 4 reports distributed experiments on large-scale datasets. Section 5 concludes the paper.

2. Related work
To improve the computational efficiency of distributed algorithms, much research has focused on reducing communication time and the latency of the computing network. In this section, we review ADMM algorithms with asynchronous updates and distributed algorithms with message filtering.
In 2014, R.L. Zhang [9] proposed an asynchronous ADMM algorithm that reduces waiting time while ensuring convergence. In their approach, the algorithm can update asynchronously under two control conditions: a partial barrier and bounded delays. Reference [11] extends asynchronous ADMM to non-convex optimization problems and gives a detailed convergence analysis. Reference [12] proposed an ADMM algorithm based on the parameter server architecture, which partitions the data according to the model: the server nodes are responsible for aggregating the model parameters, and the worker nodes perform asynchronous updates. Although this algorithm partitions the data by dimension and reduces communication traffic, it also increases the complexity of the algorithm.
Li Mu [13] proposed an asynchronous distributed block proximal gradient algorithm that can filter network messages, using a bounded delay to let the algorithm update asynchronously. In their approach, several message filtering strategies are proposed to compress the data and reduce the communication time of the network. For optimization on large-scale sparse data, [8] proposed the FOBOS algorithm: a gradient truncation threshold is set, and at each iteration any dimension whose gradient value is smaller than the threshold is filtered out. In this way the algorithm converges faster. Similarly, [7] proposed a strategy that selects only a subset of valid dimensions to update at each iteration.

3. Algorithm Development
In this section, we describe the message filtering strategy and the asynchronous update mechanism. The distributed ADMM algorithm updates with one master node and several slave nodes. For the master node, we propose a soft-threshold filtering strategy to filter the message carrying the global variable $z$. For the slave nodes, we propose a gradient truncation strategy to filter the messages carrying the local variables $x_i$ and $y_i$. For the asynchronous control of the algorithm, we follow the implementation of the asynchronous ADMM algorithm in [9].

Message filtering strategy
In distributed optimization on a large-scale sparse dataset, the dimensionality of the model parameters is very high. Because different dimensions of the parameters converge at different speeds, the dimensions that have already converged can be filtered out before a message is sent, reducing the communication traffic. In addition, when a large-scale sparse dataset is divided into several sub-datasets, some dimensions are invalid for a given sub-dataset, and their values in the local model parameter are 0. These dimensions contribute nothing to the update of the local model parameter, so they can also be filtered out to reduce communication.
Before introducing the message filtering strategy, we first define the message format for communication. Each message meta is defined as the pair

$$(j, v),$$

where $j$ denotes the index of a dimension of the parameter and $v$ denotes the value of the $j$-th dimension. As shown in figure 1, before a message is sent, the valid elements of the parameter are collected together with their indices. The collected elements and indices are then sent to the receiver. After receiving the filtered message, the receiver aggregates it by writing each valid element into the corresponding index of its message cache.
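The compression and aggregation steps around this $(j, v)$ message format can be sketched as follows; the function names and the dense-list cache are illustrative assumptions, not the paper's C++ implementation.

```python
def compress(param, eps=0.0):
    """Pack the valid (non-filtered) dimensions of a dense parameter
    vector into (index, value) message metas, as in figure 1.
    `eps` is an illustrative filtering threshold (0 keeps all non-zeros)."""
    return [(j, v) for j, v in enumerate(param) if abs(v) > eps]

def decompress(metas, cache):
    """Scatter received (index, value) pairs into the receiver's
    message cache; untouched dimensions keep their cached values."""
    for j, v in metas:
        cache[j] = v
    return cache
```

For a sparse parameter, the message carries only the surviving pairs instead of the full dense vector, which is where the traffic reduction comes from.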

Soft threshold filtering strategy.
In the distributed ADMM algorithm, the master node is responsible for updating the global variable $z$; its update formula is shown in equation (2), where $g(z)$ is generally a regularization term:

$$z^{k+1} = \arg\min_{z} \ g(z) + \sum_{i=1}^{N} \left( -(y_i^k)^T z + \frac{\rho}{2}\left\|x_i^{k+1} - z\right\|_2^2 \right). \tag{2}$$

When the ADMM algorithm uses L1 regularization, $g(z) = \lambda\|z\|_1$. Since the update of $z$ collects the variables from all slave nodes and the smooth part of (2) is quadratic, the update of $z$ can be simplified to

$$z^{k+1} = \arg\min_{z} \ \lambda\|z\|_1 + \frac{N\rho}{2}\left\| z - \bar{x}^{k+1} - \frac{1}{\rho}\bar{y}^k \right\|_2^2,$$

where $N$ denotes the number of slave nodes, $\bar{x}^{k+1}$ denotes the average of the local variables $x_i^{k+1}$, and $\bar{y}^k$ denotes the average of the dual variables $y_i^k$. Although the L1 regularization term is non-differentiable, we can still solve this problem through its sub-gradient, which yields the soft-threshold operator $S_{\lambda/(N\rho)}(\cdot)$, and we obtain the final update formula of the global variable $z$:

$$z^{k+1} = S_{\lambda/(N\rho)}\!\left( \bar{x}^{k+1} + \frac{1}{\rho}\bar{y}^k \right). \tag{6}$$

Let $[z^{k+1}]_j$ denote the value of the $j$-th dimension of $z^{k+1}$. From the update formula above, when $\left| [\bar{x}^{k+1}]_j + \frac{1}{\rho}[\bar{y}^k]_j \right| \le \lambda/(N\rho)$, the value of $[z^{k+1}]_j$ is 0. In this case we deem that $[z^{k+1}]_j$ does not help the update of the slave nodes, so this dimension is filtered out. The master node filters the updated global variable $z$ according to this soft-threshold rule, compresses it according to the defined message format, and sends it to each slave node. After receiving the filtered message, a slave node decompresses it, stores the global variable $z^{k+1}$, and then updates the local variable $x_i^{k+1}$ and the dual variable $y_i^{k+1}$ using $z^{k+1}$.
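The master-side soft-threshold update and the associated filtering can be sketched as below. This is an illustrative Python sketch (the function name and return convention are assumptions): the soft threshold drives small dimensions exactly to zero, and only the non-zero dimensions survive into the outgoing message.

```python
import numpy as np

def update_z(x_bar, y_bar, rho, lam, N):
    """Master-side z-update, equation (6): soft threshold of the
    averaged local and dual variables. Dimensions whose argument falls
    inside the threshold lam/(N*rho) come out exactly 0 and are
    filtered out of the outgoing message."""
    v = x_bar + y_bar / rho
    kappa = lam / (N * rho)
    z = np.sign(v) * np.maximum(np.abs(v) - kappa, 0.0)
    # message filtering: send only the surviving (index, value) pairs
    message = [(j, z[j]) for j in np.nonzero(z)[0]]
    return z, message
```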

Gradient truncation strategy.
In the distributed ADMM algorithm, each slave node is responsible for updating its local variable $x_i$ and dual variable $y_i$. The dual variable can be regarded as an auxiliary variable that keeps the global variable and the local variable consistent. According to the global consensus constraint, when the algorithm converges, the optimal solution $(x^*, z^*)$ of the ADMM algorithm satisfies $x^* - z^* = 0$, and the update formula for the dual variable can be written as

$$y_i^{k+1} = y_i^k + \Delta y_i^k, \qquad \Delta y_i^k = \rho \left( x_i^{k+1} - z^{k+1} \right),$$

where $\Delta y_i^k = y_i^{k+1} - y_i^k$ can be seen as the gradient of the parameter $y_i^{k+1}$; as the algorithm approaches convergence, $\Delta y_i^k$ approaches 0. However, because each dimension of the parameter converges differently, the values of the dimensions of $\Delta y_i^k$ differ. We set a truncation value $\theta$ for $\Delta y_i^k$: when the absolute value of a dimension of $\Delta y_i^k$ is smaller than $\theta$, we consider that dimension of the parameter to have converged, and it is filtered out before the parameter is sent. According to equation (1), the update of $x_i$ and $y_i$ with message filtering is:

$$x_i^{k+1} = \arg\min_{x_i} \ f_i(x_i) + (y_i^k)^T (x_i - z^k) + \frac{\rho}{2}\left\|x_i - z^k\right\|_2^2,$$
$$y_i^{k+1} = y_i^k + \rho \left( x_i^{k+1} - z^{k+1} \right), \tag{8}$$
$$\text{send only the dimensions } j \text{ with } \left| [y_i^{k+1} - y_i^k]_j \right| \ge \theta.$$

Since [11] shows that the ADMM algorithm converges at a rate of $O(1/\sqrt{k})$, we adjust the truncation dynamically as $\theta_k = O(1/\sqrt{k})$. At each iteration, a slave node first performs the updates of $x_i$ and $y_i$, and the parameters are then filtered by the truncation rule before being sent to the master node.
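The slave-side truncation filter can be sketched as follows. This is an illustrative assumption of the rule above: `theta0` and the exact decay schedule `theta0 / sqrt(k)` are placeholders for the paper's dynamically adjusted $O(1/\sqrt{k})$ truncation, not the authors' exact implementation.

```python
import numpy as np

def slave_filter(y_new, y_old, k, theta0=1e-3):
    """Slave-side gradient-truncation filter. Dimensions whose dual
    increment |dy_j| = |y_new_j - y_old_j| falls below the truncation
    theta_k are treated as converged and dropped from the message.
    theta_k decays as O(1/sqrt(k)); theta0 and the schedule are
    illustrative assumptions."""
    theta_k = theta0 / np.sqrt(k)
    dy = np.abs(y_new - y_old)
    keep = np.nonzero(dy >= theta_k)[0]
    return [(j, y_new[j]) for j in keep]
```

Note the trade-off in the decaying threshold: early iterations filter aggressively, while later iterations (small $\theta_k$) let even small corrections through so convergence is not stalled.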

Asynchronous update of the algorithm
In the synchronous ADMM algorithm, the master node needs to collect the local variables $x_i$, $y_i$ from all slave nodes before updating the global variable $z$. This means that, in each iteration, the master node must wait for the slowest node to complete its update, and much time is wasted on waiting. In the asynchronous ADMM algorithm, the master node collects local variables from only some of the slave nodes, and the slave nodes can update asynchronously within a bounded delay. In this way the asynchronous ADMM algorithm can reach convergence in a shorter time. Next, we introduce the update method of the asynchronous ADMM algorithm.

Partial barrier.
A partial barrier means that, in each iteration, the master node collects local variables from only a subset of the slave nodes and does not wait for the slowest slave node to finish its update. In this paper we use $s$ to denote the partial barrier and use $\Phi$ to denote the set of slave nodes from which the master node has received messages. In each iteration, only when the number of collected messages reaches $s$ ($|\Phi| \ge s$) can the master node update the global variable $z$. After the update of $z$ is complete, the global variable is sent back to the slave nodes in $\Phi$.

Bounded delay.
Under the partial barrier condition, the slave nodes can update asynchronously. However, if each node updates completely asynchronously, the convergence of the ADMM algorithm is no longer guaranteed. Therefore, a bounded delay is set to control the updates of the slave nodes. We use $\tau$ to denote the bounded delay: the master node must receive at least one message from every slave node within every $\tau$ iterations. In this way the convergence of the ADMM algorithm can be guaranteed. Specifically, the master node keeps a delay counter $t_i$ for each slave node $i$. When the master node receives a message from node $i$, the counter $t_i$ is set to 1; when the global variable $z$ has been updated, the counters of all nodes the master did not hear from are incremented by 1. When the delay of a node exceeds the bound ($t_i > \tau$), the other slave nodes must stop updating and wait for node $i$ to complete its update. Figure 2 illustrates the message filtering strategy under asynchronous updates. In a network with one master node and three slave nodes, we set the partial barrier $s$ to 2 and the bounded delay $\tau$ to 10. As shown in the figure, at iteration 0 the master node receives messages from slave node 1 and slave node 3, so the partial barrier of 2 is reached. The master node can then update the variable $z$ and, after message filtering, send it to node 1 and node 3. At this point the delay counters $t_1$ and $t_3$ are set to 1, and $t_2$ is incremented to 2. If $t_2$ reaches 10, slave node 1 and node 3 must stop updating and wait for slave node 2 to complete its update. Based on the above analysis, the asynchronous ADMM algorithm with message filtering is described in algorithm 1.
Algorithm 1. Asynchronous ADMM with message filtering.
Algorithm of the Master node:
1: Initialize $z$, set $k = 0$;
2: repeat
3:   repeat
4:     Wait to receive a message, and store the sender nodes in $\Phi$;
5:   until $|\Phi| \ge s$ and $\max(t_1, t_2, \dots, t_N) \le \tau$
6:   for node $i \in \Phi$ do
7:     Set $t_i = 1$;
8:     Decompress the message and aggregate it into $x_i$ and $y_i$;
9:   end for
10:  for node $i \notin \Phi$ do
11:    Set $t_i = t_i + 1$;
12:  end for
13:  Update $z$ and filter the message according to equation (6);
14:  for node $i \in \Phi$ do
15:    Send the filtered message to node $i$;
16:  end for
17:  Set $k = k + 1$;
18: until $z$ satisfies the stopping condition;
Algorithm of the Slave node:
1: Initialize $x_i$, $y_i$ and $z$, set $k = 0$;
2: repeat
3:   Wait to receive a message from the master node, decompress it and aggregate it into $z$;
4:   Update $x_i$, $y_i$ and filter the message according to equation (8);
5:   Send the filtered message to the master node;
6: until the stopping condition is satisfied;
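The master node's control conditions (steps 4-12 of algorithm 1) can be sketched as simple bookkeeping. This is an illustrative sketch, not the paper's implementation: a real master would block on MPI receives, while here we only track the partial barrier $s$ and the bounded delay $\tau$; the class and method names are assumptions.

```python
class MasterBarrier:
    """Bookkeeping for the master node's partial barrier s and bounded
    delay tau (figure 2). Real nodes would block on MPI receives; here
    we only track the control conditions of algorithm 1."""
    def __init__(self, n_slaves, s, tau):
        self.s, self.tau = s, tau
        self.delays = {i: 1 for i in range(n_slaves)}   # counters t_i
        self.received = set()                           # node set Phi

    def on_message(self, node):
        self.received.add(node)

    def can_update(self):
        # partial barrier: at least s messages collected, and no node's
        # delay counter may exceed the bound tau
        return (len(self.received) >= self.s
                and max(self.delays.values()) <= self.tau)

    def finish_iteration(self):
        # nodes heard from reset to 1, all others age by 1
        for i in self.delays:
            self.delays[i] = 1 if i in self.received else self.delays[i] + 1
        self.received = set()
```

When `can_update()` returns false because of the delay bound, the master keeps waiting for the lagging node's message, which reproduces the "stop and wait" behaviour described above.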

4. Experiments
In this section, we implement the ADMM algorithm with the message filtering strategy and use it to solve the sparse logistic regression problem on large-scale datasets. Logistic regression is a classic machine learning algorithm, widely applied in data mining, click-through rate prediction and other fields. The logistic regression problem can be formulated as

$$\min_{w} \ \sum_{i=1}^{n} \log\!\left( 1 + \exp\!\left( -b_i\, a_i^T w \right) \right) + \lambda \|w\|_1,$$

where $w$ is the model parameter and $\lambda$ is the penalty parameter (both distinct from the variables of the ADMM algorithm above), $a_i$ is a sample of the dataset, and $b_i$ is its label.
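The objective above can be evaluated as follows; this is a dense NumPy stand-in for illustration (the paper's datasets are sparse and the solver is a C++ trust-region Newton method), with labels assumed in $\{-1, +1\}$.

```python
import numpy as np

def logloss_l1(w, A, b, lam):
    """L1-regularized logistic regression objective
    sum_i log(1 + exp(-b_i * a_i^T w)) + lam * ||w||_1,
    with labels b_i in {-1, +1}. Dense stand-in for the paper's
    sparse datasets."""
    margins = b * (A @ w)
    # log(1 + exp(-m)) computed stably as logaddexp(0, -m)
    return np.logaddexp(0.0, -margins).sum() + lam * np.abs(w).sum()
```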
The algorithm is implemented in C++, and MPICH is adopted for distributed computation and communication. We tested the algorithm on the high-performance cluster of Shanghai University, using 65 cores: one core simulates the master node and the other 64 cores simulate the slave nodes. Each core is allocated 4 GB of memory, and the communication bandwidth between nodes is 5.6 Gb/s. We use the trust-region Newton method [14] to solve the local optimization of $x_i$, since it performs better than methods such as L-BFGS [15] on large-scale logistic regression.
We perform experiments on 3 large-scale sparse datasets (kdd2010, avazu, url), which can be found on the Internet. Among them, kdd2010 has 19,264,097 samples and 1,163,024 dimensions; avazu has 40,428,967 samples and 1,000,000 dimensions; url has 2,396,130 samples and 3,231,961 dimensions.
Following [3], we define the primal residual as

$$r^k = \sum_{i=1}^{N} \left\| x_i^k - z^k \right\|_2,$$

where $k$ denotes the iteration index of the algorithm and $N$ is the number of slave nodes. The convergence stopping condition of the algorithm can then be described as

$$r^k \le \sqrt{n}\,\varepsilon^{\mathrm{abs}} + \varepsilon^{\mathrm{rel}} \max\left\{ \sum_{i=1}^{N} \|x_i^k\|_2, \; N\|z^k\|_2 \right\},$$

where $\varepsilon^{\mathrm{abs}} > 0$ is the absolute error and $\varepsilon^{\mathrm{rel}} > 0$ is the relative error; here we set both to $10^{-3}$.

We first analyse the convergence of the algorithm on the kdd2010 dataset. As shown in figure 3, we compare the primal residual of the algorithm under different settings. The figure shows that the convergence speeds of the synchronous and asynchronous algorithms differ little when messages are not filtered, while the asynchronous algorithm with the message filtering strategy reaches convergence in a shorter time. This is because the message filtering strategy significantly reduces the message traffic and thus the communication time of each iteration.

We then run the proposed algorithm on the 3 datasets. In this experiment, the penalty parameter $\rho$ of the ADMM is set to 2 and the penalty parameter $\lambda$ of the L1 regularization is set to 1. When the algorithm updates with the message filtering strategy, the gradient truncation $\theta$ is set to $10^{-3}$. When the algorithm updates asynchronously, the partial barrier $s$ is set to 32 and the bounded delay $\tau$ is set to 10.

Table 1 shows the detailed results of the proposed algorithm under different settings. Here, we use C-time to denote the communication time of the algorithms and R-time to denote their run time. As the table shows, the message traffic of each iteration is greatly reduced when the algorithm adopts the message filtering strategy. Compared with the synchronous algorithm, the asynchronous algorithm with message filtering converges faster and has better communication performance. When the algorithm is updated asynchronously, its accuracy is slightly lower; however, it is worthwhile to improve the performance of the algorithm at the cost of a small loss of accuracy.
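The stopping test can be sketched as below. This is an illustrative sketch in the spirit of the mixed absolute/relative criterion of [3]; the exact way the tolerances are combined here (the $\sqrt{n}$ scaling and the $\max$ term) is an assumption, not necessarily the authors' implementation.

```python
import numpy as np

def stopped(xs, z, eps_abs=1e-3, eps_rel=1e-3):
    """Primal-residual stopping test in the spirit of [3]: the stacked
    residual (x_i - z) must fall below a mixed absolute/relative
    tolerance. The exact tolerance combination is an assumption."""
    r = np.concatenate([x - z for x in xs])
    x_norm = np.linalg.norm(np.concatenate(xs))
    z_norm = np.sqrt(len(xs)) * np.linalg.norm(z)
    eps_pri = np.sqrt(r.size) * eps_abs + eps_rel * max(x_norm, z_norm)
    return np.linalg.norm(r) <= eps_pri
```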

5. Conclusion
This paper proposes a message filtering strategy based on the asynchronous ADMM algorithm to address the communication problem of distributed optimization on large-scale sparse data. Using L1 regularization, a soft-threshold filtering strategy is applied on the master node, and a gradient truncation strategy is derived from the update behaviour of the slave nodes. In addition, the algorithm adopts an asynchronous update mechanism to reduce the waiting time in the network. The experiments show that the message filtering algorithm based on asynchronous ADMM can significantly reduce the communication time of the algorithm while ensuring its convergence. In future work, we will consider applying the message filtering strategy to model-parallel ADMM to further improve the efficiency of the algorithm.