JIA Haonan,HE Zhenqing,TAN Wanlong,RUI Hua,LIN Wei
(1. National Key Laboratory of Science and Technology on Communications, University of Electronic Science and Technology of China, Chengdu 611731, China;
2. ZTE Corporation, Shenzhen 518057, China;
3. State Key Laboratory of Mobile Network and Mobile Multimedia Technology, Shenzhen 518055, China)
Abstract: The sum rate maximization beamforming problem for a multi-cell multi-user multiple-input single-output interference channel (MISO-IC) system is considered. Conventionally, the centralized and distributed beamforming solutions to the MISO-IC system have high computational complexity and bear a heavy burden of channel state information exchange between base stations (BSs), which becomes even worse in a large-scale antenna system. To address this, we propose a distributed deep reinforcement learning (DRL) based approach with limited information exchange. Specifically, the original beamforming problem is decomposed into beam direction design and power allocation problems, and the cost of information exchange between BSs is significantly reduced. In particular, each BS is provided with an independent deep deterministic policy gradient network that can learn to choose the beam direction scheme and simultaneously allocate power to users. Simulation results illustrate that the proposed DRL-based approach achieves sum rate performance comparable to that of the conventional distributed beamforming solutions with much less information exchange.
Keywords: deep reinforcement learning;downlink beamforming;multiple-input single-output interference channel
To meet the increasing wireless communication traffic demand, the frequency reuse factor of cellular systems is expected to be relaxed to one, which means that all cells operate in the same frequency band. However, such a small frequency reuse factor brings an inter-cell interference problem that heavily degrades the achievable sum rate of the wireless system. Therefore, inter-cell interference should be managed carefully. A multi-cell multiple-input single-output (MISO) downlink beamforming technique with cooperation among the base stations (BSs) has been introduced as a promising solution. The typical zero-forcing beamforming [1] works in a highly coordinated scenario where each piece of user equipment (UE) is served by all the BSs, which requires all the transmit data and channel state information (CSI) to be shared among the BSs. Nevertheless, it is impractical due to the heavy data-sharing burden [2]. A centralized solution [3] collects the global CSI and jointly designs the beamforming vectors based on fractional programming (FP). Although it can achieve near-optimal performance, it has high computational complexity and incurs unavoidable delays when collecting CSI and sending beamforming vectors, making it unsuitable for a dynamic channel environment.
Many distributed schemes have been proposed to reduce the computational cost of centralized solutions. In particular, an achievable rate region of the multi-cell MISO Gaussian interference channel (MISO-IC) system is analyzed in Ref. [4], which proves that well-designed distributed schemes can reach the Pareto boundary. To reduce information sharing among BSs, a data-sharing-based distributed scheme is proposed in Ref. [5]. A fully distributed scheme with no CSI or data sharing is discussed in Ref. [6], which works well at a high signal-to-interference-plus-noise ratio (SINR). However, these works assume that the BSs are capable of obtaining the instantaneous downlink CSI of the UE in other cells without CSI sharing, which is infeasible in a practical system.
Deep reinforcement learning (DRL) has shown great potential in decision-making problems. By converting the multi-cell downlink beamforming problem into a decision-making problem, several distributed approaches based on DRL have been developed [7–8]. In particular, a multi-agent deep Q-learning based approach is introduced in Ref. [7], in which each BS learns to make beam direction and power decisions for each UE based on the local CSI and the information exchanged among the BSs. However, because of the curse of dimensionality [9], the Q-learning based approach in Ref. [7] can hardly be applied when there are multiple user devices in the same cell or when the BSs are equipped with large-scale antennas, since the discrete action space expands exponentially with the number of user devices and antennas.
In this paper, we develop a distributed-training distributed-execution (DTDE) multi-agent deep deterministic policy gradient (MADDPG) based algorithm to maximize the instantaneous sum rate of the multi-cell multi-user MISO-IC system under a per-BS power constraint. Thanks to the features of DDPG, the policy network directly outputs continuous-valued actions, which significantly reduces the dimension of the action space. Our main contributions are summarized as follows:
1) A new distributed MADDPG-based scheme is proposed, capable of solving the instantaneous sum rate maximization problem when cells have multiple user devices and BSs are equipped with large-scale antennas. By decomposing the original beamforming problem into the beam direction design and power allocation problems, each BS as an agent can learn to choose beam direction and power allocation based on the wireless environment.
2) A new limited information exchange protocol is proposed for the distributed BSs to share information for beamforming design. Instead of sharing CSI directly, we choose the equivalent channel gains of UE, the received interference of UE, and the sum rate of UE in one cell as the shared information. Different from other DRL-based algorithms which only consider equivalent channel gains and the sum rate of UE, we consider the received interference (also known as the interference temperature) as the crucial information.
3) Extensive experiments are conducted to evaluate the efficiency and scalability of the proposed MADDPG approach by comparing it with conventional distributed and centralized solutions. The simulation results show that the proposed MADDPG can reach state-of-the-art sum rate performance with a much smaller amount of information sharing among BSs.
As far as we know, this is the first attempt to tackle multi-cell MISO beamforming via MADDPG-based DRL. In contrast to the related work [7], this paper aims to solve the multi-cell sum rate maximization problem in a continuous action space by using the MADDPG method, which is more flexible for different wireless environments and easier for the agents (i.e., BSs) to learn, since the dimension of the action space is much smaller than that of the codebook space in Ref. [7].
We consider a wireless cellular downlink system of N cells, in each of which a multi-antenna transmitter (e.g., a BS) with M antennas serves K single-antenna receivers (e.g., UE). We use N = {1, …, N} to denote the set of all BSs. We assume that all UE in this system shares the same frequency band, thereby causing both intra-cell and inter-cell interference to each UE. As a result, the received signal of the k-th UE in the n-th cell at time t can be expressed as:
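With w_{n,k}(t) denoting the beamforming vector of BS n for its k-th UE, s_{n,k}(t) the unit-power data symbol, and z_{n,k}(t) the additive noise (the latter two symbols are introduced here for illustration), a standard MISO-IC form of this received signal is

y_{n,k}(t) = \mathbf{h}_{n,n,k}^{H}(t)\mathbf{w}_{n,k}(t)s_{n,k}(t) + \sum_{i \neq k}\mathbf{h}_{n,n,k}^{H}(t)\mathbf{w}_{n,i}(t)s_{n,i}(t) + \sum_{j \neq n}\sum_{i=1}^{K}\mathbf{h}_{j,n,k}^{H}(t)\mathbf{w}_{j,i}(t)s_{j,i}(t) + z_{n,k}(t),

where the last three terms correspond to intra-cell interference, inter-cell interference, and noise, respectively.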
We assume that the BS in each cell is equipped with a uniform rectangular array (URA) with M = M_x M_y antenna elements, where M_x and M_y denote the horizontal and vertical scales of the URA, respectively. (We assume the URA model here for simplicity; nevertheless, the proposed scheme is applicable to arbitrary array geometries.) According to the ray-based channel modeling [10], the dynamic URA channel response h_{n,j,k}(t) with L paths from the n-th BS to the k-th UE in cell j can be expressed as:
where κ_{n,j,k} is the large-scale fading factor related to the path loss and shadowing, and g_{n,j,k,l}(t) is the dynamic small-scale Rayleigh fading factor. The steering vectors ā and b̄ of the URA in Eq. (3) are given by:
where θ_{n,j,k} and φ_{n,j,k} are the nominal AoDs from the BS in cell n to the UE k in cell j, respectively, and Δ is the associated angular perturbation.
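As an illustrative sketch (assuming half-wavelength element spacing and per-path angular offsets \Delta\theta_l and \Delta\varphi_l; the exact parameterization of the original equations may differ), such a ray-based URA channel and its steering vectors can be written as

\mathbf{h}_{n,j,k}(t) = \sqrt{\kappa_{n,j,k}/L}\,\sum_{l=1}^{L} g_{n,j,k,l}(t)\,\bar{\mathbf{a}}(\theta_{n,j,k}+\Delta\theta_l)\otimes\bar{\mathbf{b}}(\varphi_{n,j,k}+\Delta\varphi_l),

\bar{\mathbf{a}}(\theta) = \left[1, e^{-j\pi\sin\theta}, \dots, e^{-j\pi(M_x-1)\sin\theta}\right]^{T},\qquad \bar{\mathbf{b}}(\varphi) = \left[1, e^{-j\pi\sin\varphi}, \dots, e^{-j\pi(M_y-1)\sin\varphi}\right]^{T}.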
To characterize the small-scale fading dynamics of the time-varying channel, we utilize the following first-order Gauss-Markov process [11].
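In its standard form, this recursion reads

g_{n,j,k,l}(t) = \rho\, g_{n,j,k,l}(t-1) + \sqrt{1-\rho^{2}}\,\delta_{n,j,k,l}(t),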
where ρ denotes the fading correlation coefficient between any two consecutive time slots and δ_{n,j,k,l}(t) ~ CN(0, 1).
Considering the channel variation, we aim to solve the following instantaneous achievable sum-rate maximization problem at time slot t:
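In standard form, with the SINR written under the signal model above and the noise power denoted by σ² (a symbol introduced here for illustration), the problem is

\max_{\{\mathbf{w}_{n,k}(t)\}} \sum_{n=1}^{N}\sum_{k=1}^{K} \log_{2}\left(1+\mathrm{SINR}_{n,k}(t)\right) \quad \text{s.t.} \quad \sum_{k=1}^{K}\left\|\mathbf{w}_{n,k}(t)\right\|^{2} \le P_{\max},\ \forall n \in N,

\mathrm{SINR}_{n,k}(t) = \frac{\left|\mathbf{h}_{n,n,k}^{H}(t)\mathbf{w}_{n,k}(t)\right|^{2}}{\sum_{i\neq k}\left|\mathbf{h}_{n,n,k}^{H}(t)\mathbf{w}_{n,i}(t)\right|^{2} + \sum_{j\neq n}\sum_{i=1}^{K}\left|\mathbf{h}_{j,n,k}^{H}(t)\mathbf{w}_{j,i}(t)\right|^{2} + \sigma^{2}},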
where P_max denotes the maximum transmit power of each BS. Unfortunately, Problem (7) is generally NP-hard even with global CSI, and finding its globally optimal solution requires exponential running time [3]. Conventionally, several centralized methods have been proposed to find a sub-optimal solution. All centralized algorithms assume that there is a central node that collects the global CSI from all BSs, computes the beamformers, and returns them to the BSs. However, it is hard to obtain the global CSI of all BSs. Moreover, due to the channel dynamics, the beamformers are already outdated by the time the BSs receive them. Therefore, it is more reasonable to apply a distributed approach. However, the design of information sharing between the BSs is also a problem for distributed methods. Generally, the BSs communicate with each other through backhaul links. Conventional beamforming methods require the BSs to exchange global or cross-talk CSI, which is an unacceptable burden for the rate-limited backhaul links. Therefore, the amount of information shared for beamforming should be limited, and we seek a sub-optimal distributed solution with limited information exchange between the BSs in different cells.
From the perspective of BS n, the beamformer w_{n,k} can be expressed as:
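A natural expression, consistent with the beam direction/power allocation decomposition used in this paper (the unit-norm direction \tilde{\mathbf{w}}_{n,k} and the per-UE power p_{n,k} are illustrative symbols), is

\mathbf{w}_{n,k}(t) = \sqrt{p_{n,k}(t)}\,\tilde{\mathbf{w}}_{n,k}(t),\qquad \left\|\tilde{\mathbf{w}}_{n,k}(t)\right\| = 1,\qquad \sum_{k=1}^{K} p_{n,k}(t) \le P_{\max},

so that the beamforming task splits into choosing the direction \tilde{\mathbf{w}}_{n,k}(t) and allocating the power p_{n,k}(t).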
The typical beam direction solutions include the virtual SINR [12] and the weighted minimum mean-square error (WMMSE) [13] beamformers. However, these solutions are closely coupled with power allocation strategies and cannot be easily adopted to guide the beam direction design alone. According to Ref. [14], given global CSI in the multi-cell scenario, the optimal beamformer can be expressed as a linear combination of the conventional zero-forcing (ZF) and maximum ratio transmission (MRT) beamformers. However, it is difficult to obtain the instantaneous global CSI. This inspires us to use ZF and MRT, which are available based on only the local CSI [5], as heuristic candidate solutions in our proposed approach. Specifically, the ZF and MRT solutions are given by:
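In their standard form (with each column subsequently normalized to unit norm to serve as a beam direction), the ZF and MRT precoders are

\mathbf{W}_{n}^{\mathrm{ZF}} = \mathbf{H}_{n}^{H}\left(\mathbf{H}_{n}\mathbf{H}_{n}^{H}\right)^{-1},\qquad \mathbf{W}_{n}^{\mathrm{MRT}} = \mathbf{H}_{n}^{H},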
where H_n ∈ C^{K×M} denotes the downlink channel matrix of all K users in the n-th cell. Note that MRT works well in low-SNR regions [15], especially when most UE is located at the cell edge, since it only focuses on maximizing the received signal power. In contrast, ZF tries to minimize the received interference of the UE, which makes it outperform MRT in high-SNR regions where the performance is dominated by intra-cell interference. Therefore, we introduce a DRL-based method to choose the appropriate approach according to the dynamic wireless communication environment, which can be viewed as a typical decision-making problem. On the other hand, DRL-based approaches, e.g., deep Q-learning [17], have been introduced to solve the power allocation task. However, the conventional deep Q-learning approach can only output discrete power levels, which may make training intractable as the action dimension increases. This motivates us to apply the deep deterministic policy gradient (DDPG) approach, which will be introduced in the following section, to tackle the challenging beam direction and power allocation tasks for each BS.
In principle, all the BSs share information through the backhaul links between them. However, it is an unaffordable burden for the backhaul links to carry the global CSI among all BSs, especially when the BSs are equipped with large-scale antennas. Therefore, we develop a limited information exchange protocol, in which the BSs only need to share a small amount of equivalent channel gain and interference information rather than the global CSI.
Assuming a flat and block-fading downlink channel, we propose a framework for the downlink data transmission process as shown in Fig. 1. In this framework, the channels are invariant during one time slot. Each time slot is divided into two phases. The first phase is a preparation phase in which the BSs collect local information, exchange information, and design beamformers. The second phase is the downlink data transmission phase. Conventionally, the BSs only estimate downlink channels in the local information collection phase. To be specific, the BSs first send reference symbols to the UE, then the UE estimates the downlink channel according to the reference symbols, and finally the UE feeds the local CSI back to the corresponding BSs.
▲Figure 1. Framework of the designed downlink data transmission process
In this section, we introduce a DTDE MADDPG-based scheme for the MISO-IC system, as illustrated in Fig. 3, where each BS acts as a trainable agent. In the following, we take BS n as an example to elaborate on the online decision and offline training processes in detail.
▲Figure 2. Illustration of the information exchange process for BS n
5.1 Online Decision Process
In the online decision process, BS n observes the states from the wireless communication environment and takes actions based on the online policy network.
▲Figure 3. Illustration of the MADDPG-based scheme for the multi-cell multi-user multiple-input single-output interference channel (MISO-IC) system
5.2 Offline Training Process
In the offline training process, the sampler first randomly samples a batch of transition data {s_n(i), a_n(i), r(i), s_n(i+1)} from the memory replay buffer for training. By inputting the training transition i into the two target networks, the output of the target Q-network y_i can be expressed as:
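In standard DDPG notation, with Q′(·) and μ′(·) denoting the target Q-network and target policy network and η the discount factor (cf. the hyper-parameter settings in Section 6), this is

y_i = r(i) + \eta\, Q'\left(s_n(i+1), \mu'\left(s_n(i+1)\right)\right),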
where the Q-value of the state-action pair {s_n(i), a_n(i)} is composed of the instantaneous reward r(i) and the Q-value of the next state-action pair {s_n(i+1), a_n(i+1)}. Note that the output of the target Q-network, y_i, is actually the estimated Q-value of the state-action pair {s_n(i), a_n(i)}.
According to the deterministic policy gradient theorem [19], the gradients of the online Q-network and policy network are:
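With θ^Q and θ^μ denoting the online Q-network and policy-network parameters (symbols introduced here for readability), the standard sampled DDPG gradients are

\nabla_{\theta^{Q}} L = \frac{1}{N_b}\sum_{i=1}^{N_b}\nabla_{\theta^{Q}}\left(y_i - Q\left(s_n(i), a_n(i)\mid\theta^{Q}\right)\right)^{2},

\nabla_{\theta^{\mu}} J \approx \frac{1}{N_b}\sum_{i=1}^{N_b}\nabla_{a} Q\left(s_n(i), a\mid\theta^{Q}\right)\Big|_{a=\mu\left(s_n(i)\mid\theta^{\mu}\right)}\,\nabla_{\theta^{\mu}}\mu\left(s_n(i)\mid\theta^{\mu}\right),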
where N_b is the batch size of the sampled training data. The parameters of the online networks are updated by the optimizer. For the target networks, the parameters are softly updated as
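In the usual soft-update form (θ^{Q′} and θ^{μ′} are the target network parameters):

\theta^{Q'} \leftarrow \tau\theta^{Q} + (1-\tau)\theta^{Q'},\qquad \theta^{\mu'} \leftarrow \tau\theta^{\mu} + (1-\tau)\theta^{\mu'},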
where τ ∈ (0, 1) is the soft update factor.
The basic structures of the Q and policy networks, as shown in Fig. 4, are similar: both include a fully connected input layer, hidden layers, and an output layer. To reduce the computational complexity, we design two hidden layers for both the Q and policy networks. The number of neurons in the input layer is the same as the dimension of the input vector. Hence, the scale of the input layer of the policy network equals the length of the input vector s_n(t), whereas the input vector of the Q-network is the concatenation of s_n(t) and μ_n(t). We apply ReLU as the activation function due to its simplicity. Note that the output layer of the policy network consists of two sub-layers that apply the softmax and sigmoid activation functions to produce p_n(t) and [P_n,sum(t), D_n(t)], respectively, whereas the output of the Q-network is a single real value denoting the Q-value of the corresponding state-action pair.
▲Figure 4. Structures of the action and Q-networks
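The structure described above can be sketched in PyTorch as follows. The layer width N_H = 200 (Section 6) and the split of the policy output into a softmax head for p_n(t) and a sigmoid head for [P_n,sum(t), D_n(t)] follow the description in the text; the state/action dimensions and all class and variable names are illustrative.

import torch
import torch.nn as nn

NH = 200  # neurons per hidden layer (Section 6)

class PolicyNet(nn.Module):
    """Actor: maps the local state s_n(t) to the action mu_n(t)."""
    def __init__(self, state_dim, num_ue):
        super().__init__()
        self.body = nn.Sequential(
            nn.Linear(state_dim, NH), nn.ReLU(),
            nn.Linear(NH, NH), nn.ReLU(),
        )
        self.softmax_head = nn.Linear(NH, num_ue)  # p_n(t): per-UE power split
        self.sigmoid_head = nn.Linear(NH, 2)       # [P_n,sum(t), D_n(t)]

    def forward(self, s):
        x = self.body(s)
        p = torch.softmax(self.softmax_head(x), dim=-1)
        ps_d = torch.sigmoid(self.sigmoid_head(x))
        return torch.cat([p, ps_d], dim=-1)

class QNet(nn.Module):
    """Critic: maps the concatenated (state, action) pair to a scalar Q-value."""
    def __init__(self, state_dim, action_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, NH), nn.ReLU(),
            nn.Linear(NH, NH), nn.ReLU(),
            nn.Linear(NH, 1),
        )

    def forward(self, s, a):
        return self.net(torch.cat([s, a], dim=-1))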
For clarity, we summarize the overall procedure of the distributed multi-cell multi-user beamforming in Algorithm 1, referred to as the MADDPG algorithm. The proposed MADDPG algorithm requires the perfect CSI obtained via the proposed information exchange protocol. However, imperfect CSI may shift the objective function in Eq. (7), resulting in significant performance degradation of the current approach. A possible remedy for imperfect CSI is to redesign the reward function based on appropriate historical information, which will be addressed in our future work.
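A minimal, self-contained sketch of one DDPG agent (one BS) covering the decision and training steps summarized above is given below. It reuses the hyper-parameters reported in Section 6 (discount factor 0.6, soft update factor 0.01, buffer size 2 000, batch size 64, Adam with learning rate 10^-3); the environment interaction and the exact state/action/reward definitions are abstracted away, and the method names are illustrative rather than the paper's Algorithm 1.

import copy
import random
import collections
import torch

class DDPGAgent:
    """One BS agent: online decision (act) and offline training (train_step)."""
    def __init__(self, actor, critic, gamma=0.6, tau=0.01, lr=1e-3, buf_size=2000):
        self.actor, self.critic = actor, critic
        self.actor_t, self.critic_t = copy.deepcopy(actor), copy.deepcopy(critic)
        self.opt_a = torch.optim.Adam(actor.parameters(), lr=lr)
        self.opt_c = torch.optim.Adam(critic.parameters(), lr=lr)
        self.buffer = collections.deque(maxlen=buf_size)  # memory replay buffer
        self.gamma, self.tau = gamma, tau

    def act(self, s, noise_std=0.1):
        # Online decision: policy output plus exploration noise, kept in [0, 1].
        with torch.no_grad():
            a = self.actor(s)
        return (a + noise_std * torch.randn_like(a)).clamp(0.0, 1.0)

    def remember(self, s, a, r, s_next):
        # Transitions are stored as tensors (r as a 0-dim tensor).
        self.buffer.append((s, a, r, s_next))

    def train_step(self, batch_size=64):
        batch = random.sample(list(self.buffer), batch_size)
        s, a, r, s2 = map(torch.stack, zip(*batch))
        with torch.no_grad():  # target y_i = r + gamma * Q'(s', mu'(s'))
            y = r.unsqueeze(-1) + self.gamma * self.critic_t(s2, self.actor_t(s2))
        critic_loss = ((y - self.critic(s, a)) ** 2).mean()
        self.opt_c.zero_grad(); critic_loss.backward(); self.opt_c.step()
        actor_loss = -self.critic(s, self.actor(s)).mean()  # deterministic policy gradient
        self.opt_a.zero_grad(); actor_loss.backward(); self.opt_a.step()
        # Soft update of the target networks.
        for t_net, net in ((self.actor_t, self.actor), (self.critic_t, self.critic)):
            for p_t, p in zip(t_net.parameters(), net.parameters()):
                p_t.data.mul_(1 - self.tau).add_(self.tau * p.data)

Each BS runs such an agent independently (distributed training and execution); the only coupling between agents comes through the shared state information defined by the proposed exchange protocol.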
5.3 Information Exchange Analysis
We list the required information and the information exchange of different schemes in Table 1. Note that the fractional programming (FP) [3] and zero-gradient (ZG) [6] schemes need to exchange much more instantaneous CSI than MADDPG, whereas MADDPG only needs to exchange a few real values describing previous information about the wireless environment. Moreover, thanks to the local-CSI-based beam direction design, the information exchange of our proposed MADDPG-based scheme does not grow with the number of antennas M and is much smaller than that of FP and ZG, which makes it suitable for the case of a large number of antennas.
5.4 Computational Complexity Analysis
5.5 Convergence Discussion
The convergence behavior of a DRL algorithm, including the proposed MADDPG algorithm, depends on many factors, such as the dynamics of the environment, the power of the action noise, and the size of the memory replay buffer. Up to now, the theoretical convergence analysis of DRL-based algorithms is still an open problem and is generally based on empirical attempts. For example, when the action noise is small, the MADDPG algorithm can converge faster; however, its performance will degrade since the agents cannot explore the whole action space. Therefore, we need a large number of simulations to choose appropriate network hyper-parameters that achieve fast convergence and good performance.
To test the convergence behavior of the proposed MADDPG approach, we give an experimental result in Fig. 5, which illustrates the achievable sum rate versus the time slot under different initializations of the network weights. The simulation settings are the same as those of Fig. 6 in Section 6. There are 7 simulation curves with different initial network weights in the same environment, and all weights are randomly initialized following the standard Gaussian distribution. The simulation result shows that the curves with different network initializations basically converge to a similar performance after around 4 000 time slots. This indicates that the proposed MADDPG method is insensitive to the network initialization.
▲Figure 5. Convergence behavior of the proposed multi-agent deep deterministic policy gradient (MADDPG) algorithm under different initializations of network weights
▼Table 1. Comparison of the information exchange
▲Figure 6. Average achievable rate of various schemes versus the number of time slots, where each point is a moving average over the previous 300 time slots, with the UE distribution factor ϵ = 0.3
The MADDPG scheme is deployed with the PyTorch framework and the hyper-parameters are set as follows. Both the policy and Q-networks have L_H = 2 fully connected hidden layers with N_H = 200 neurons each. The discount factor η is set to 0.6 and the soft update factor τ is equal to 0.01. The size of the memory replay buffer is set to 2 000 and the size of the sampled batch N_b is set to 64. Furthermore, we choose the Adam optimizer to update the parameters, with the learning rate set to 10^{-3}.
In Fig. 7, we evaluate the average achievable rate of different schemes versus the UE distribution factor. As the UE distribution factor increases, the average received power of the UE decreases and the inter-cell interference problem becomes worse. The ZG algorithm, which is derived under a high-SINR assumption, has the worst performance in the 19-cell scenarios. The FP algorithm with global instantaneous CSI undoubtedly has the best performance in all scenarios. However, as the users get closer to the cell edge, the performance gap between FP and our proposed MADDPG shrinks.
▲Figure 7.Average achievable rate of various schemes versus the UE distribution factor ϵ
In this paper, we study the instantaneous sum rate maximization problem in the multi-cell MISO interference channel scenario. We propose a MADDPG scheme, in which each BS learns to choose an appropriate beam direction solution and allocate power based on the local CSI and the limited information exchanged among the BSs. The simulation results show that the proposed MADDPG scheme can achieve a relatively high sum rate with much less information exchange than the conventional centralized and distributed solutions.