CAI Weibo, YANG Shulin, SUN Gang, ZHANG Qiming, YU Hongfang
(1. University of Electronic Science and Technology of China, Chengdu 611731, China;
2. ZTE Corporation, Shenzhen 518057, China)
Abstract: In distributed machine learning (DML) based on the parameter server (PS) architecture, the unbalanced communication load distribution of PSs leads to a significant slowdown of model synchronization in heterogeneous networks due to low bandwidth utilization. To address this problem, a network-aware adaptive PS load distribution scheme is proposed, which accelerates model synchronization by proactively adjusting the communication load on PSs according to network states. We evaluate the proposed scheme on MXNet, a real-world distributed training platform, and the results show that our scheme achieves up to a 2.68 times speed-up of model training in a dynamic and heterogeneous network environment.
Keywords: distributed machine learning; network awareness; parameter server; load distribution; heterogeneous network
Machine learning is widely used in many fields such as image classification [1], speech recognition [2], and natural language processing [3]. With the continuous increase in training data and model size, the huge time cost of single-machine training is unacceptable to users. Therefore, distributed machine learning (DML) based on multi-machine parallelism has drawn more and more attention. Usually, distributed training is carried out within a single cluster, since it is considered that networks with limited bandwidth and complex, changeable states across clusters will seriously slow down the communication process of DML. However, due to the limitations of data privacy protection [4], data aggregation among clusters for model training is not allowed in some cases. In addition, with the proposal of the Computing First Network [5–6], DML model training based on the integrated computing power of the whole network gradually shows great application prospects. Based on the considerations mentioned above, DML in heterogeneous networks across clusters has great research value.
There are mainly two communication architectures for DML: one is a centerless architecture, represented by AllReduce [7–8], and the other is a centered architecture, represented by the parameter server (PS) architecture [9–11]. In the PS architecture, there are usually two types of nodes in a DML system: workers responsible for model training and servers responsible for model aggregation and parameter updates. During a typical training iteration in the data-parallel, synchronous update mode, workers send model gradients after completing training based on the local model and data, and the server receives the model gradients from the workers. Thereafter, the model aggregation operation is performed to generate a global model, and the global model is sent back to the workers. Workers immediately replace their local models after receiving the global model from the server and start a new training iteration.
In this process, since the data from all workers need to be aggregated on the server, servers with limited bandwidth resources can become the transmission bottleneck, which is an inherent problem of the PS architecture [12]. To tackle this problem, a traditional solution [13] is to increase the number of servers and let multiple servers share the heavy communication load. Since the load distribution across servers usually follows the principle of fairness, this scheme works well in homogeneous networks. However, in networks with heterogeneous bandwidth resources, since the system is agnostic about the network, the communication load undertaken by each server cannot be matched with its communication capability. As a consequence, servers with low communication capacity prolong the communication time of the entire iteration due to excessive load.
To efficiently handle this problem, this paper proposes an adaptive load balancing scheme for network-aware PS-based DML over heterogeneous networks. The scheme senses the throughput of each link in the network in real time through a designed network awareness mechanism, evaluates the communication capability of each server accordingly, and then selects appropriate servers to undertake appropriate model aggregation tasks according to their communication capabilities. Finally, each server is assigned a communication load that matches its communication capability. The main contributions of this paper are as follows:
• We achieve an effective estimation of link throughput through a low-cost and high-precision statistical method for data transmission time with a simple and ingenious design, so as to learn the global network state;
• We conduct an in-depth theoretical analysis of fine-grained data transmission and derive a method to solve for the optimal granularity of data slices;
• We design a simple and effective aggregation node selection method and a specific data slice assignment method, which together achieve efficient slice assignment.
Multiple servers are typically used to alleviate heavy traffic on a single server in the PS architecture. However, the specific implementations of the traditional PS architecture are network-unaware (e.g., MXNet [14], TensorFlow [15], and Petuum [16]), making it impossible to distribute the communication load according to the actual communication capability of each server. Therefore, it is generally assumed that the servers' communication capabilities are basically the same, and the load is distributed according to the principle of fairness [17]. This usually results in poor performance in heterogeneous networks.
The authors in Ref. [18] proposed an elastic PS load distribution scheme, which analyzes the performance of servers by estimating the transmission time of the parameters with linear regression, and distributes the communication load accordingly. Considering that load distribution takes place in a complex network environment, the primary problem is awareness of the network state. However, the authors do not provide a statistical method for measuring parameter transmission time to implement network awareness, which makes an engineering solution to this kind of problem practically impossible. In addition, this scheme does not deeply consider the optimal granularity of fine-grained transmission and only uses empirical values, which cannot make the transmission reach the optimal state.
Based on the understanding of the related work on PS load distribution for DML and in-depth analysis of the problem, our approach proceeds as follows. First, the data are segmented according to the established slice granularity. The system senses the network state in real time through the designed network awareness mechanism, then evaluates the network communication capability of each node accordingly, and selects a subset of the nodes as aggregation nodes. Finally, the complete distribution of fine-grained data is realized according to the PS load distribution and slice assignment algorithms.
3.1 Slice Granularity
During model aggregation in DML, the process of workers sending data to the server for aggregation (PUSH) and the process of workers receiving the aggregated data returned from the server (PULL) are usually carried out concurrently, as shown in Fig. 1. The system performs the PULL process of data Slice 1 after all workers have completed the PUSH process of data Slice 1 (the time of data aggregation can be ignored), while the PUSH process of data Slice 2 is performed at the same time, thus overlapping PUSH and PULL. Theoretically, the smaller the data slice is, the better PUSH and PULL overlap, ultimately making the aggregation complete more quickly. However, in practice, there is a certain overhead in the data segmentation process and a certain additional network overhead in the transmission of each data slice, so the slice granularity cannot be infinitely small.
▲Figure 1. Illustration of data transmission, where the green block is the additional synchronization delay and the orange block is the transmission time of each slice
Taking as many factors as possible into account, we analyze and solve this problem from a theoretical point of view. Consider the situation under a simple homogeneous network, in a complete data aggregation process with a single server, for a distributed system with a fixed data size $M$ on every worker, where the network bandwidth is $W$ and the number of nodes is $N$. The slice granularity $x$ determines how many times the data are sent separately, i.e., the number of slices $m = M/x$. In addition to the inherent transmission delay under the bandwidth limit, each data transmission incurs other network delays, which we compensate for with the latency factor $\beta$. Our study finds that the segmentation cost per slice is less than 1 ms, which can be ignored. In addition, considering that the start times of transmission at each node are difficult to synchronize absolutely in practice, there is an additional synchronization delay $\Delta t$ in the total data transmission time. Eq. (1) shows the relationship between the granularity $x$ and the total data synchronization time $T$.
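The typeset Eqs. (1)–(4) did not survive extraction. The following LaTeX block is a hedged reconstruction consistent with the surrounding text: pipelined PUSH/PULL over $m = M/x$ slices, a per-slice latency term, and the synchronization delay $\Delta t$. It additionally assumes the per-slice latency scales as $\alpha/W$, which is what makes the optimizer independent of $W$ as Sections 3.1 states; the exact form and constant placement in the original may differ.

```latex
% Plausible form of Eq. (1): total synchronization time as a function of slice size x.
% Assumption of this reconstruction: the per-slice latency is beta = alpha / W,
% so that the optimal x does not depend on W, as the text claims.
T(x) = \frac{NM}{W} + \frac{Nx}{W} + \frac{M}{x}\cdot\frac{\alpha}{W} + \Delta t

% Setting dT/dx = 0 gives a plausible form of Eq. (4):
\frac{\mathrm{d}T}{\mathrm{d}x} = \frac{N}{W} - \frac{M\alpha}{W x^{2}} = 0
\quad\Longrightarrow\quad
x^{*} = \sqrt{\frac{\alpha M}{N}}
```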
When $M$ and $N$ are determined, $\alpha = 1.2 \times 10^{5}$ can be obtained through actual testing. Obviously, at this point, the value $x$ is only related to the data size $M$ and the number of nodes $N$. This illustrates that during distributed training of machine learning, once the training scale and the number of model parameters are determined, the value $x$ is determined.
In a heterogeneous network, system performance is limited by the node with the smallest communication bandwidth (the bottleneck node). If the bottleneck node acts as a server, $W$ is calculated according to the bandwidth of the parameter server. If the bottleneck node is a worker, $W$ takes the worker's bandwidth, which maximizes $T$. In any case, the result does not depend on $W$, so Eq. (4) is still of reference value for heterogeneous networks.
3.2 Network Awareness
In this scheme, the only network state information that needs to be measured is link throughput (available bandwidth). To avoid the large injection of probe traffic by conventional network measurement techniques [19–20], which occupies scarce network bandwidth, this scheme directly uses the model parameter data as probe traffic. The granularity size_probe and the number probe_num of probe packets should be the minimum values that allow the scheme to achieve an accurate measurement (i.e., the training iteration time remains stable in a stable heterogeneous network within a certain period of time), and they need to be determined in the specific engineering implementation. Probe packets are produced by each worker, which uses the probability partition_rate to select the probe granularity size_probe when segmenting its local data. In Eq. (5), where the coefficient $\gamma$ is fixed at 0.6 in the experiment, the value of the probability partition_rate must ensure that the number of probe packets sent by the worker to each server is not less than probe_num, so as to realize a complete measurement of the links between the worker and all servers.
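Eq. (5) itself was lost in extraction; the following Python sketch shows one way the stated constraint could be met, assuming a worker holds M parameters, cuts a fraction partition_rate of them into probe slices of size_probe parameters, and spreads them evenly over the servers. The function name and the exact role of $\gamma$ are assumptions for illustration, not the authors' formula.

```python
def choose_partition_rate(M, num_servers, size_probe, probe_num, gamma=0.6):
    """Pick a segmentation probability such that each server still receives
    at least probe_num probe packets from this worker.

    Hypothetical reconstruction of the lost Eq. (5); gamma (0.6 in the
    paper's experiments) is applied here as a safety coefficient, which is
    an assumption.
    """
    # Probe packets produced by one worker: (M * rate) / size_probe.
    # Spread over num_servers servers, each must get >= probe_num of them:
    #   (M * rate) / (size_probe * num_servers) >= probe_num
    min_rate = probe_num * num_servers * size_probe / M
    rate = min_rate / gamma          # enlarge by 1/gamma as a margin
    return min(rate, 1.0)            # a probability cannot exceed 1

# Example: 25M-parameter model, 12 servers, 10 000-parameter probes, 2 probes each
print(choose_partition_rate(25_000_000, 12, 10_000, 2))
```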
From the perspective of measurement implementation, measuring link throughput only requires knowing the data size of the probe packet and the completion time of its transmission. Since the probe packet receiving node (receiver) has received the probe packet, the data size of the probe packet is known, but its transmission completion time is not easy to determine. To calculate the transmission completion time, the start time and end time of the transmission have to be figured out. When the probe packet is delivered to the upper layer, the receiver only knows the time at which the application layer received it, which is the end time of the probe packet transmission; the receiver does not know the start time. To obtain the start time of the probe packet transmission, the receiver could start from the lower transport layer protocol and analyze the transmission in more detail, for example by analyzing acknowledgment (ACK) packets when the transmission is based on the Transmission Control Protocol (TCP). But in complex heterogeneous networks, where different nodes may be deployed on different types of devices and use different network protocols, obtaining the transmission start time by analyzing the underlying communication protocol is obviously not sufficiently general.
In fact, the start time of probe packet transmission can be obtained without analyzing the underlying protocol. Although the application layer of the receiver does not directly know the start time of the probe packet transmission, the sending node (sender) does. Therefore, it is only necessary for the sender to tell the receiver the start time of the probe packet transmission.
Specifically, before the probe packet is sent, sender $i$ sends a forecast message to receiver $j$. After receiving the forecast message, the receiver takes the end time of the forecast message transmission as the start time $t_{start}$ of the following probe message transmission. When the following probe message arrives, the receiver obtains the end time $t_{end}$ of the probe message transmission and the data size size_probe of the probe message. Finally, according to Eq. (6), the average rate of the probe message transmission from node $i$ to node $j$ can be calculated as $s_{ij} = size\_probe / (t_{end} - t_{start})$. The process of link throughput measurement is shown in Fig. 2. We use $s_{ij}$ as an estimate of the throughput of the link over which the probe message is transmitted, and then use the estimated throughput as a reference for evaluating the communication capability of the node associated with the link. In this process, although additional traffic (the forecast message) is also injected into the network, it is not probe traffic; it is just a signaling message responsible for the state forecast, and its data size is very small. Thus, its transmission overhead over the network is almost negligible.
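As a minimal sketch of this forecast-then-probe handshake, assuming TCP sockets and a trivial framing of our own invention (none of these identifiers come from the MXNet code), the receiver timestamps the forecast arrival and the probe completion and applies Eq. (6):

```python
import socket, struct, time

def send_probe(sock: socket.socket, payload: bytes) -> None:
    """Sender side: announce the probe with a tiny forecast message,
    then send the probe payload itself."""
    sock.sendall(struct.pack("!cI", b"F", len(payload)))  # forecast: type + size
    sock.sendall(payload)                                  # probe data

def recv_probe(sock: socket.socket) -> float:
    """Receiver side: estimate link throughput s_ij in bytes per second.
    The forecast arrival time stands in for the probe's start time."""
    header = _recv_exact(sock, struct.calcsize("!cI"))
    _, size_probe = struct.unpack("!cI", header)
    t_start = time.monotonic()          # end of forecast ~= start of probe
    _recv_exact(sock, size_probe)       # the probe payload itself
    t_end = time.monotonic()
    return size_probe / (t_end - t_start)   # Eq. (6)

def _recv_exact(sock: socket.socket, n: int) -> bytes:
    buf = bytearray()
    while len(buf) < n:
        chunk = sock.recv(n - len(buf))
        if not chunk:
            raise ConnectionError("socket closed during probe")
        buf.extend(chunk)
    return bytes(buf)
```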
From the overall perspective of the network awareness mechanism, the actual measurements are distributed across the nodes: every link used for transmission needs to be measured. To further enhance the reliability and stability of the measurement, we not only use dedicated probe messages but also treat ordinary data messages as probe messages. Although this introduces some overhead, the final throughput value between nodes is the average of the recorded throughput samples, so this design further improves the measurement quality. The measurements are obtained by the receivers and then reported to the central scheduling node (scheduler), which is responsible for evaluating the communication capacity of nodes and distributing the communication load. Whenever a node reports link throughput information, the scheduler updates its recorded throughput values, re-evaluates capacities, and makes decisions under the new network state in a timely manner, so that the system has strong adaptive ability.
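A minimal sketch of the scheduler-side bookkeeping follows; ThroughputTable is a hypothetical name, and the sliding-window size is an assumption, since the paper only says the final value is an average of the recorded samples.

```python
from collections import defaultdict, deque

class ThroughputTable:
    """Scheduler-side record of per-link throughput reports.

    Keeps a sliding window of samples per (sender, receiver) link and
    exposes their mean as the current estimate, as described in the text.
    """
    def __init__(self, window: int = 16):
        self._samples = defaultdict(lambda: deque(maxlen=window))

    def report(self, sender: int, receiver: int, rate: float) -> None:
        self._samples[(sender, receiver)].append(rate)

    def estimate(self, sender: int, receiver: int) -> float:
        s = self._samples[(sender, receiver)]
        return sum(s) / len(s) if s else 0.0

    def node_capacity(self, node: int) -> float:
        """Crude capacity score of a node acting as a server: the mean of
        the estimates on all links terminating at it (an assumption)."""
        rates = [self.estimate(i, j) for (i, j) in list(self._samples) if j == node]
        return sum(rates) / len(rates) if rates else 0.0
```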
▲Figure 2. Link throughput measurement
3.3 Load Distribution
Load distribution is decided by the scheduler and mainly involves distributing the communication load across servers and assigning data slices. For the distribution of communication load, system deployment needs to be considered first. Since bandwidth resources are scarce in heterogeneous networks, placing servers and workers on separate machines would require more physical nodes and lower the utilization of the link bandwidth between nodes. To avoid these problems, we co-locate a server with each worker to obtain higher network resource utilization. In such a deployment, each node not only receives and distributes aggregated data as a server but also sends and receives aggregated data as a worker. It is important to note that, in such a deployment, the node acting as a worker does not need to actually send its share of the communication load to itself acting as a server. Since every node acting as a server bears a proportion of the communication load, and the part of the load it handles for itself does not need to be actually transmitted, this is equivalent to reducing the data transmission by one worker.
Specifically, when the number of nodes is $N$, the local data size of each node is $M$, and the communication load of server $i$ ($i \in V$) is assumed to be $m_i$, the communication load $L_i$ of node $i$ is:
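The typeset equation was lost in extraction; a plausible reconstruction, assuming both the PUSH and the PULL for the $N-1$ remote workers are counted and self-traffic is excluded as explained above:

```latex
% Plausible form of the lost equation: load on node i acting as a server,
% counting both PUSH and PULL for the N-1 remote workers.
L_i = 2\,(N - 1)\, m_i
```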
Based on this, we can calculate the transmission time $t_i$ for node $i$ to complete communication load $L_i$ under throughput $S_i$ by Eq. (9):
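Eq. (9) follows directly from the definitions; reconstructed here, with little room for it to differ:

```latex
t_i = \frac{L_i}{S_i}   % Eq. (9): transmission time of node i
```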
In the model aggregation stage, the data transmission of each node is carried out simultaneously, so the total transmission completion time of the training iteration is the maximum of the transmission completion times of the nodes, $\max_{i \in V} t_i$. The purpose of reasonable communication load distribution is to minimize $\max_{i \in V} t_i$. In other words, the current problem model can be determined as:
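Eq. (10) was lost in extraction; its content follows directly from the text as a min-max program over the per-server loads:

```latex
% Plausible form of Eq. (10)
\min_{\{m_i\}} \ \max_{i \in V} t_i
\quad \text{s.t.} \quad \sum_{i \in V} m_i = M, \qquad m_i \ge 0 .
```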
In Eq. (10), $M$, $N$ and $S_i$ are constants, and only $m_i$ is variable. The objective requires minimizing the maximum value of $t_i$. Under the strong constraint that the sum of all $m_i$ is fixed, and considering that adjusting the load of one node inevitably affects the load of other nodes, it is intuitively difficult to determine the optimal values of $m_i$. However, we can rewrite Eq. (10) as:
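Eq. (11) is likewise lost; substituting the reconstructed $L_i$ and $t_i$ gives the linear-in-$m_i$ form the next sentence refers to (the factor $2(N-1)$ is part of the hedged reconstruction above):

```latex
% Plausible form of Eq. (11): t_i as a linear function of m_i
t_i = \frac{2\,(N-1)}{S_i}\; m_i , \qquad i \in V .
```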
Eq. (11) expresses $t_i$ as a linear function of $m_i$. For a training system with $V = \{1, 2, 3\}$, we draw the curve of $t_i$ against $m_i$ for each node, as shown in Fig. 3.
The problem in Eq. (10) can thus be transformed into determining a point $(m_i, t_i)$ on each line $l_i$ in Fig. 3 so as to minimize the maximum of the $t_i$, under the premise that the sum of the abscissas $m_i$ of these points is the constant value $M$. If the position of $(m_i, t_i)$ is initialized randomly on each line and then moved gradually to reduce $\max_{i \in V} t_i$, the minimum of $\max_{i \in V} t_i$ is achieved if and only if all points lie on the same horizontal line $l_h$. Otherwise, there must be a horizontal line $l_{h'}$ with at least one point above it and at least one point below it; we can then bring the points closer together by moving a point above $l_{h'}$ down its line and a point below $l_{h'}$ up its line, until all points are on the same horizontal line.
We first distribute the communication load of each node according to the principle of egalitarianism: the positions of $(m_i, t_i)$ are initialized at the intersections of the vertical line $m = M/N$ and each line $l_i$. Each point $(m_i, t_i)$ is then moved by iteratively forcing $\max_{i \in V} t_i$ and $\min_{i \in V} t_i$ to equalize. Specifically, in one moving iteration, let $i = \max$ denote the node with $t_{max} = \max_{i \in V} t_i$ and $i = \min$ the node with $t_{min} = \min_{i \in V} t_i$. Requiring $t_{max} = t_{min}$ after the move, the $x$-coordinates $m'_{max}$ and $m'_{min}$ of the moved points $(m'_{max}, t_{max})$ and $(m'_{min}, t_{min})$ satisfy the relationship shown in Eqs. (12) and (13).
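Eqs. (12) and (13) did not survive extraction; solving for equal times while conserving the combined load $m_{max} + m_{min}$ under the linear model above yields a plausible reconstruction:

```latex
% Plausible forms of Eqs. (12) and (13): equalize the two nodes' times
% while conserving their combined load.
m'_{max} = \frac{S_{max}}{S_{max} + S_{min}} \left( m_{max} + m_{min} \right)

m'_{min} = \frac{S_{min}}{S_{max} + S_{min}} \left( m_{max} + m_{min} \right)
```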
▲Figure 3. Geometrization of the load distribution problem
Now $(m'_{max}, t_{max})$ and $(m'_{min}, t_{min})$ have moved to the same ordinate, and the next iteration can be started, until $\max_{i \in V} t_i = \min_{i \in V} t_i$. Algorithm 1 shows the detailed steps of this process.
Algorithm 1: Load distribution
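The listing of Algorithm 1 did not survive extraction; below is a minimal Python sketch reconstructed from the step-by-step description in the next paragraph (fair initialization, repeated max/min equalization, stop when the spread is within similarity_threshold). Variable names, and the $2(N-1)$ load model carried over from the reconstruction above, are assumptions rather than the authors' code.

```python
def load_distribution(M, S, similarity_threshold=1.0):
    """Distribute total load M over nodes with throughputs S[i].

    Sketch of Algorithm 1: iteratively equalize the slowest and fastest
    nodes' transmission times until they differ by at most
    similarity_threshold seconds. Returns per-node loads m[i].
    """
    N = len(S)
    m = [M / N] * N                                    # fair initialization
    t = [2 * (N - 1) * m[i] / S[i] for i in range(N)]  # reconstructed Eq. (11)
    while True:
        i_max = max(range(N), key=lambda i: t[i])
        i_min = min(range(N), key=lambda i: t[i])
        if t[i_max] - t[i_min] <= similarity_threshold:
            break                                      # close enough to equal
        total = m[i_max] + m[i_min]                    # conserved pair load
        m[i_max] = S[i_max] / (S[i_max] + S[i_min]) * total  # Eq. (12) (assumed)
        m[i_min] = S[i_min] / (S[i_max] + S[i_min]) * total  # Eq. (13) (assumed)
        t[i_max] = 2 * (N - 1) * m[i_max] / S[i_max]
        t[i_min] = 2 * (N - 1) * m[i_min] / S[i_min]
    return m

# Example: three nodes with heterogeneous throughputs (bytes/s)
print(load_distribution(90e6, [1e7, 2e7, 4e7]))
```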
The first line of Algorithm 1 distributes the communication load of each node according to the principle of fairness. Lines 2–9 determine the maximum and minimum transmission times of the nodes under the current communication load distribution, as well as the corresponding nodes. At Line 10, we judge whether the moving iteration should stop. To reduce the number of iterations, we treat the maximum and minimum node transmission times as approximately equal if their difference is no more than similarity_threshold (the empirical value is 1 s in our experiments). Lines 11–13 adjust the communication load of the nodes with the maximum and minimum transmission times. Lines 14–22 determine the maximum and minimum transmission times of the nodes after adjusting the load distribution, which are used for the judgment in Line 10. Based on the above process, Algorithm 1 has a low time complexity when the initial $t_i$ are uniformly distributed on the timeline and similarity_threshold is not too small.
In the specific process of slice assignment, data are transmitted in slices as the basic granularity, so the final work of load distribution is the assignment of data slices. Algorithm 2 shows the data slice assignment.
Algorithm 2: Slice assignment
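Algorithm 2's listing is likewise missing; the sketch below follows the description in the next paragraph: first hand each server num_probe probe slices so idle nodes keep being measured, then assign every remaining slice to the aggregation node with the largest remaining load. All names are assumptions.

```python
def slice_assignment(slices, servers, loads, num_probe):
    """Assign data slices to servers.

    Sketch of Algorithm 2: 'slices' is a list of slice sizes, 'loads' maps
    each server to the communication load granted by Algorithm 1, and
    num_probe probe slices per server keep idle nodes measurable.
    Returns a mapping slice_index -> server.
    """
    assignment = {}
    remaining = dict(loads)
    it = iter(range(len(slices)))
    # Phase 1: give every server num_probe probe slices (network awareness
    # of idle nodes that would otherwise receive nothing).
    for server in servers:
        for _ in range(num_probe):
            try:
                k = next(it)
            except StopIteration:
                return assignment          # fewer slices than probes needed
            assignment[k] = server
            remaining[server] -= slices[k]
    # Phase 2: each remaining slice goes to the server with the largest
    # remaining load, i.e., the best-provisioned node at this moment.
    for k in it:
        server = max(servers, key=lambda s: remaining[s])
        assignment[k] = server
        remaining[server] -= slices[k]
    return assignment
```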
Lines 2–10 of Algorithm 2 distribute to the servers the probe slices, whose number num_probe is generated with segmentation probability partition_rate during data segmentation; this mainly maintains awareness of the network state of idle nodes that are not assigned any other slices. Lines 11–20 assign the remaining slices. Specifically, for each slice, we traverse all current aggregation nodes and select the node with the largest remaining load as the receiving node of this slice. In this way, the receiving node with the best network state can be chosen for each slice as far as possible, and the excess load a node must bear when the slice granularity exceeds its remaining load is reduced as much as possible. Based on the above process, Algorithm 2 has a $O(\min(N \times num\_probe, num\_slice) + N \times num\_slice)$ time complexity, which shows that the execution time of the algorithm is mainly related to the numbers of nodes and data slices.
To make the system adaptive, the scheme provides a standard execution process. Specifically, at the beginning of the communication process of each iteration, each node first reports to the scheduler the link throughput information measured during the communication of the previous iteration, and then waits for the scheduler to compute the latest distribution strategy from this information and send it to each node. After receiving the latest strategy, each node updates its local strategy, transmits data according to the new strategy, and records the link throughput measured during transmission. Based on this interactive process, the training system achieves adaptability almost in real time.
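Tying the sketches above together, one possible shape of the scheduler side of this loop is given below; Strategy, on_report, and latest_strategy are hypothetical names, and the composition of the earlier sketches is our own.

```python
from dataclasses import dataclass

@dataclass
class Strategy:
    loads: dict            # server -> communication load (Algorithm 1)
    assignment: dict       # slice index -> server (Algorithm 2)

@dataclass
class Scheduler:
    table: ThroughputTable  # from the bookkeeping sketch above

    def on_report(self, reports):
        """Absorb (sender, receiver, rate) tuples from the last iteration."""
        for sender, receiver, rate in reports:
            self.table.report(sender, receiver, rate)

    def latest_strategy(self, M, servers, slices, num_probe):
        """Recompute the distribution strategy under the new network state."""
        S = [self.table.node_capacity(s) or 1.0 for s in servers]  # avoid /0
        loads = dict(zip(servers, load_distribution(M, S)))
        return Strategy(loads, slice_assignment(slices, servers, loads, num_probe))
```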
4.1 Environment and Deployment
We simulate a 12-node cluster with Intel(R) Xeon(R) E5-2678 v3 CPUs and NVIDIA 2080 Ti GPUs and use MXNet as the DML training platform. We have implemented our scheme by modifying the source code of MXNet, and we deploy servers and workers in a 1:1 ratio, i.e., one server and one worker on each physical node in the cluster. The bandwidth between nodes is limited to below a typical Wide Area Network (WAN) bandwidth of 220 Mbit/s with the Linux TC tool [21]. The specific bandwidth values are randomly determined and randomly adjusted periodically (every 300 s) to simulate a dynamic heterogeneous network environment. In addition, the hyperparameter configuration of the training system is shown in Table 1.
4.2 Experiment Design
We set up two related schemes to compare with our scheme (Aware). One is Average [17], which is based on the equal distribution principle and is network-agnostic; the other is the elastic parameter distribution scheme named Elastic [18]. Since the network awareness mechanism of Elastic is unknown, we test Elastic on top of our own network awareness mechanism in the experiments. For these three schemes, we test the performance on the AlexNet (228 MB), ResNet50 (93 MB), and MobileNet (21 MB) models respectively.
▼Table 1. Key hyperparameters
4.3 Performance Metrics
In our experiments, we use the training speed, namely the number of images per minute trained by the system, as the main performance metric. The higher the speed, the better the performance of the scheme. Eq. (16) shows the definition of speed, where num_iters is the number of iterations completed in time t_iters.
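Eq. (16) was lost in extraction; given the definition above, a plausible form (assuming batch_size images per worker per iteration across $N$ workers) is:

```latex
% Plausible form of Eq. (16): images trained per unit time
speed = \frac{num\_iters \times batch\_size \times N}{t\_iters}
```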
In addition, the single-round iteration time (SRIT) and the average single-round iteration time (ASRIT) are used in the verification of network awareness validity, the verification of segmentation granularity rationality, and the overhead analysis. SRIT is the time to complete one model training iteration, which is directly measured in the tests. The shorter the SRIT, the better the performance of the scheme.
5.1 Training Speed
▲Figure 4. Training speed of different schemes on different models
Fig. 4 shows the training speed of the compared schemes on different models. As we can see, the network-aware Elastic and Aware schemes significantly improve performance over the network-agnostic Average scheme: 1.14 times and 2.68 times for MobileNet, 1.56 times and 1.76 times for ResNet50, and 1.23 times and 1.32 times for AlexNet, respectively. This shows that adaptive PS load balancing based on network awareness is feasible and effective. Compared with Elastic, Aware achieves a further performance improvement of 2.34 times for MobileNet, 1.13 times for ResNet50, and 1.08 times for AlexNet; on the MobileNet model in particular, it achieves over 2 times acceleration. This suggests that the load distribution strategy of Aware is indeed better than that of Elastic.
In addition, by comparing the speed gains on different models, it can be found that the gain achieved by Aware is more obvious on the smaller model (MobileNet). This is because the network load of the small model is small and the iteration time of model training is short, so the optimization effect of Aware is more significant in the same experimental network, which ultimately shows up as a significant increase in training speed. On the larger model (AlexNet), Aware has almost no gain compared with Elastic. The reason is that, under the excessively large communication load, there is no obvious room to optimize the data aggregation process in the bandwidth-limited experimental network.
5.2 Effectiveness Verification of Network Awareness
Fig. 5 shows how the SRIT of the Aware and Average schemes changes with the iteration round in the same dynamic network. The system parameters num_probe and size_probe are set to their best values of 2 and 10 000, respectively, which are determined by actual tests in the experiment; due to space limitations, the details are omitted. In the figure, the curve of the network-agnostic Average scheme lies above the curve of Aware, which indicates that the optimization effect of the Aware scheme is significant and lasting. Additionally, the curve of Aware exhibits periodic spikes, which can be attributed to its delayed response to abrupt changes in the network state near the end of each strategy period. However, once a new strategy is released based on the latest network state, the performance of Aware recovers rapidly. This also verifies the effectiveness and reliability of the network awareness mechanism of our scheme.
▲Figure 5. SRIT comparison of Aware and Average schemes in dynamic networks
5.3 Reasonableness Verification of Segmentation Granularity
To verify the rationality of the theoretical analysis of slice granularity, we take the ResNet50 model as an example and test how ASRIT changes with the slice granularity under multiple network states. As shown in Fig. 6, in different network states, ASRIT remains almost unchanged within the logarithmic range 5–5.5 of the slice granularity (i.e., $10^5$–$3.16 \times 10^5$ parameters), and our theoretical value of 5.38 falls exactly within this range. This indicates that our theoretical slice granularity can indeed achieve almost the lowest ASRIT in different network states.
5.4 Overhead Analysis
The overhead of the Aware scheme is likely to be concentrated in the frequent forecast messages and in the per-round synchronization of strategy requests. As for the former, there should be no significant overhead because the forecast message only contains extremely short header fields of fixed length. As for the latter, because the experiments are based on the synchronous training mode and each round already synchronizes, there should also be no obvious overhead. To verify this analysis, in a stable (static and homogeneous) network environment, we test the ASRIT of the Average scheme under four conditions: requiring both probing and strategy request synchronization (Probe+Request), only probing (Probe), only strategy request synchronization (Request), and neither (Original). The ASRIT over dozens of iterations is shown in Fig. 7. Adding probing or strategy request synchronization does incur some overhead, but even Probe+Request, which has the largest overhead, only adds 0.44 s (2.12%) over Original, which is negligible compared with the huge gains shown in Fig. 5.
▲Figure 6. ASRIT of the Aware scheme in different network states
▲Figure 7. ASRIT of the Average scheme in different conditions
In this paper, we study the problem of PS load distribution for DML in heterogeneous networks. State-of-the-art schemes cannot match the communication load to the communication capacity of PSs to achieve load balancing, due to the lack of network awareness, and the existing schemes with network awareness do not give specific network measurement methods, which makes them difficult to realize in practice. This paper proposes a well-designed network awareness mechanism that realizes low-cost and high-precision network measurement. In addition, the slice granularity determination and slice assignment of fine-grained transmission are studied. We have implemented the scheme in MXNet and completed function verification and performance measurement on an experimental cluster. The results show that the proposed scheme can significantly accelerate DML.