QECO: A QoE-Oriented Computation Offloading Algorithm based on Deep Reinforcement Learning for Mobile Edge Computing

arXiv:2311.02525v2 [cs.NI] 14 Aug 2024

Iman Rahmati, Hamed Shah-Mansouri, and Ali Movaghar

I. Rahmati and A. Movaghar are with the Department of Computer Engineering, Sharif University of Technology, Tehran, Iran (email: {iman.rahmati, movaghar}@sharif.edu). H. Shah-Mansouri is with the Department of Electrical Engineering, Sharif University of Technology, Tehran, Iran (email: hamedsh@sharif.edu).

Abstract—In the realm of mobile edge computing (MEC), efficient computation task offloading plays a pivotal role in ensuring a seamless quality of experience (QoE) for users. Maintaining a high QoE is paramount in today's interconnected world, where users demand reliable services, and doing so is a key factor in handling the dynamic and uncertain mobile environment. In this study, we delve into computation offloading in MEC systems, where strict task processing deadlines and energy constraints can adversely affect the system performance. We formulate the computation task offloading problem as a Markov decision process (MDP) to maximize the long-term QoE of each user individually. We propose a distributed QoE-oriented computation offloading (QECO) algorithm based on deep reinforcement learning (DRL) that empowers mobile devices to make their offloading decisions without requiring knowledge of the decisions made by other devices. Through numerical studies, we evaluate the performance of QECO. Simulation results validate that QECO efficiently exploits the computational resources of edge nodes. Consequently, it can complete 14% more tasks and reduce task delay and energy consumption by 9% and 6%, respectively. These together contribute to a significant improvement of at least 37% in average QoE compared to an existing algorithm.

Index Terms—Mobile edge computing, computation task offloading, quality of experience, deep reinforcement learning.

I. INTRODUCTION

Mobile edge computing (MEC) [1] has emerged as a promising technological solution to overcome the challenges faced by mobile devices (MDs) when performing computation-intensive tasks, such as real-time data processing and artificial intelligence applications [2], [3]. In spite of the MDs' technological advancements, their limited computing power and battery capacity may lead to task drops, processing delays, and an overall poor user experience. By offloading intensive tasks to nearby edge nodes (ENs), MEC effectively augments the MDs' computation capability and reduces delay and energy consumption. This improvement enhances the users' QoE, especially for time-sensitive computation tasks [4], [5].

Efficient task offloading in MEC is a complex optimization challenge due to the dynamic nature of the network and the variety of MDs and servers involved [6], [7]. In particular, determining the optimal offloading strategy, scheduling the tasks, and selecting the most suitable EN for task offloading are the main challenges that demand careful consideration. Furthermore, the uncertain requirements and latency-sensitive properties of computation tasks pose nontrivial challenges that can significantly impact the computation offloading performance in MEC systems with limited resources.

A. Related Work

To cope with the dynamic nature of the network, recent research has proposed several task offloading algorithms using machine learning methods.
In particular, deep reinforcement learning (DRL) holds promise for determining optimal decision-making policies by capturing the dynamics of environments and learning strategies that accomplish long-term objectives [8]. DRL can effectively tackle the challenges of MEC arising from the ever-changing nature of networks and the heterogeneity of MDs and servers, ultimately improving the MD users' QoE. In [9], Huang et al. focused on a wireless-powered MEC system. They proposed a DRL-based approach capable of attaining near-optimal decisions by selectively considering a compact subset of candidate actions in each iteration. In [10], the authors proposed an offloading algorithm using deep Q-learning for wireless-powered Internet of Things (IoT) devices in MEC systems. This algorithm aims to minimize the task drop rate while the devices rely solely on harvested energy for operation. In [11], Zhao et al. proposed a computation offloading algorithm based on DRL, which addresses the competition for wireless channels to optimize long-term downlink utility. In this approach, each MD requires quality-of-service information from other MDs. Tang et al. in [12] investigated the task offloading problem for indivisible and deadline-constrained computation tasks in MEC systems. The authors proposed a distributed DRL-based offloading algorithm designed to handle uncertain workload dynamics at the ENs. Sun et al. in [13] explored both computation offloading and service caching problems in MEC. They formulated an optimization problem that aims to minimize the long-term average service delay and then proposed a hierarchical DRL framework, which effectively handles both problems under heterogeneous resources. Dai et al. in [14] introduced the integration of action refinement into DRL and designed an algorithm to concurrently optimize resource allocation and computation offloading. In [15], Huang et al. proposed a DRL-based method built on a partially observable MDP, which guarantees the deadlines of real-time tasks while minimizing the total energy consumption of MDs. Liu et al. in [16] investigated a two-timescale computation offloading and resource allocation problem and proposed a resource coordination algorithm based on multi-agent DRL, which can generate interactive information along with resource decisions. Zhou et al. in [17] used an MDP to study MEC and modeled the interactions with the environment. They proposed a Q-learning approach to achieve optimal resource allocation strategies and computation offloading. In [18], Gao et al. introduced an attention-based multi-agent algorithm designed for decentralized computation offloading. This algorithm effectively tackles the challenges of dynamic resource allocation in large-scale heterogeneous networks. Gong et al. in [19] proposed a DRL-based network structure for industrial IoT systems to jointly optimize task offloading and resource allocation in order to achieve lower energy consumption and decreased task delay. Liao et al. in [20] introduced a double reinforcement learning algorithm for online computation offloading in MEC. This algorithm optimizes transmission power and CPU frequency scheduling while minimizing both task computation delay and energy consumption.

Fig. 1. An illustration of MD i ∈ I and EN j ∈ J in the MEC system.
B. Motivation and Contributions

Although DRL-based methods have demonstrated their effectiveness in handling network dynamics, task offloading still encounters several challenges that require further attention. QoE is a time-varying performance measure that reflects user satisfaction and is affected not only by delay, as assumed in [9]–[13], but also by energy consumption. Although some existing works, such as [14]–[20], have investigated the trade-off between delay and energy consumption, they fail to properly address user demands and fulfill QoE requirements. A more comprehensive approach is required to address the dynamic requirements of individual users in real-time scenarios with multiple MDs and ENs. In contrast to the aforementioned works [9]–[20], we propose a DRL-based distributed algorithm that provides users with an appropriate balance among QoE factors based on their demands. We also explore a more realistic MEC scenario involving delay-sensitive tasks with processing deadlines, posing a more intricate challenge.

In this study, we delve into the computation task offloading problem in MEC systems, where strict task processing deadlines and energy constraints can adversely affect the system performance. We propose a distributed QoE-oriented computation offloading (QECO) algorithm that leverages DRL to efficiently handle task offloading under uncertain loads at the ENs. This algorithm empowers MDs to make offloading decisions utilizing only locally observed information, such as task size, queue details, battery status, and historical workloads at the ENs. By adopting the appropriate policy based on each MD's specific requirements at any given time, the QECO algorithm significantly improves the QoE for individual users. Our main contributions are summarized as follows:

• Task Offloading Problem in the MEC System: We formulate the task offloading problem as an MDP for time-sensitive tasks. This approach takes into account the dynamic nature of workloads at the ENs and concentrates on providing high performance in the MEC system while maximizing the long-term QoE.

• DRL-based Offloading Algorithm: To address the problem of long-term QoE maximization, we focus on task completion, task delay, and energy consumption to quantify the MDs' QoE. We propose the QECO algorithm based on DRL, which empowers each MD to make offloading decisions independently, without prior knowledge of the other MDs' tasks and offloading models. With a focus on the MD's battery level, our approach leverages deep Q-network (DQN) [21] and long short-term memory (LSTM) [22] to prioritize and strike an appropriate balance among QoE factors. We also analyze the training convergence and complexity of the proposed algorithm.

• Performance Evaluation: We conduct comprehensive experiments to evaluate QECO's performance as well as its training convergence under different computation workloads. The results demonstrate that our algorithm quickly converges and effectively utilizes the processing capabilities of MDs and ENs, resulting in a substantial improvement of at least 37% in average QoE. This advantage is achieved through a 14% increase in the number of completed tasks, along with 9% and 6% reductions in task delay and energy consumption, respectively, when compared to the potential game-based offloading algorithm (PGOA) [23] and several benchmark methods.

The structure of this paper is as follows. Section II presents the system model, followed by the problem formulation in Section III. In Section IV, we present the algorithm, while Section V provides an evaluation of its performance. Finally, we conclude in Section VI.

II. SYSTEM MODEL

We investigate an MEC system consisting of a set of MDs denoted by I = {1, 2, ..., I}, along with a set of ENs denoted by J = {1, 2, ..., J}, where I and J represent the number of MDs and ENs, respectively. We regard time as an episode containing a series of T time slots denoted by T = {1, 2, ..., T}, each of duration τ seconds.
As shown in Fig. 1, we consider two separate queues for each MD to organize tasks for local processing or dispatching to the ENs, both operating in a first-in-first-out (FIFO) manner. The MD's scheduler is responsible for assigning each newly arrived task to one of the queues at the beginning of the time slot. On the other hand, we assume that each EN j ∈ J maintains I FIFO queues, where each queue corresponds to an MD i ∈ I. When a task arrives at an EN, it is enqueued in the corresponding MD's queue.

We define z_i(t) as the index assigned to the computation task arriving at MD i ∈ I in time slot t ∈ T. Let λ_i(t) denote the size of this task in bits. The size of task z_i(t) is selected from a discrete set Λ = {λ^1, λ^2, ..., λ^θ}, where θ represents the number of these values. Hence, λ_i(t) ∈ Λ ∪ {0}, where λ_i(t) = 0 covers the case that no task has arrived. We also denote the task's processing density by ρ_i(t), which indicates the number of CPU cycles required to execute one unit (bit) of the task. Furthermore, we denote the deadline of this task by Δ_i(t), which is the number of time slots within which the task must be completed to avoid being dropped.

We define two binary variables, x_i(t) and y_{i,j}(t) for i ∈ I and j ∈ J, to determine the offloading decision and the offloading target, respectively. Specifically, x_i(t) indicates whether task z_i(t) is assigned to the computation queue (x_i(t) = 0) or to the transmission queue (x_i(t) = 1), and y_{i,j}(t) indicates whether task z_i(t) is offloaded to EN j ∈ J. If the task is dispatched to EN j, we set y_{i,j}(t) = 1; otherwise, y_{i,j}(t) = 0.

A. Communication Model

We consider that the tasks in the transmission queue are dispatched to the appropriate ENs via the MD's wireless interface. We denote the transmission rate of MD i's interface when communicating with EN j ∈ J in time slot t by r_{i,j}(t). In time slot t ∈ T, if task z_i(t) is assigned to the transmission queue for computation offloading, we define l_i^T(t) ∈ T to represent the time slot in which the task is either dispatched to the EN or dropped. We also define δ_i^T(t) as the number of time slots that task z_i(t) should wait in the queue before transmission. It should be noted that MD i computes the value of δ_i^T(t) before making a decision. The value of δ_i^T(t) is computed as follows:

$\delta_i^T(t) = \Big[ \max_{t' \in \{0,1,\ldots,t-1\}} l_i^T(t') - t + 1 \Big]^+,$  (1)

where [·]^+ = max(0, ·) and l_i^T(0) = 0 for simplicity of presentation. Note that the value of δ_i^T(t) only depends on l_i^T(t') for t' < t. If MD i ∈ I schedules task z_i(t) for dispatching in time slot t ∈ T, then it will either be dispatched or dropped in time slot l_i^T(t), which is

$l_i^T(t) = \min\big\{ t + \delta_i^T(t) + \lceil D_i^T(t) \rceil - 1,\; t + \Delta_i(t) - 1 \big\},$  (2)

where D_i^T(t) refers to the number of time slots required for the transmission of task z_i(t) from MD i ∈ I to EN j ∈ J. We have

$D_i^T(t) = \sum_{j \in \mathcal{J}} y_{i,j}(t)\, \frac{\lambda_i(t)}{r_{i,j}(t)\,\tau}.$  (3)

Let E_i^T(t) denote the energy consumption of the transmission from MD i ∈ I to EN j ∈ J. We have

$E_i^T(t) = D_i^T(t)\, p_i^T(t)\, \tau,$  (4)

where p_i^T(t) represents the power consumption of the communication link of MD i ∈ I in time slot t.
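To make the bookkeeping in (1)–(4) concrete, the short Python sketch below evaluates the waiting time, dispatch slot, and transmission energy of one task placed in the transmission queue. It is only an illustration of the equations above: the function name, arguments, and numerical values are ours, not part of the system model.

```python
import math

def transmission_schedule(t, lam, deadline, rates, y, tau, p_tx, l_T_history):
    """Evaluate (1)-(4) for a task of `lam` bits queued for transmission in slot t.
    `l_T_history` holds l_i^T(t') for t' < t, `rates[j]` is r_{i,j}(t) in bit/s,
    and `y[j]` is the offloading-target indicator y_{i,j}(t)."""
    # (1): slots the task waits behind earlier queued tasks.
    delta_T = max(0, max(l_T_history, default=0) - t + 1)
    # (3): slots needed to push the task through the wireless link.
    D_T = sum(y_j * lam / (r_j * tau) for y_j, r_j in zip(y, rates))
    # (2): dispatch slot, capped by the task deadline.
    l_T = min(t + delta_T + math.ceil(D_T) - 1, t + deadline - 1)
    # (4): transmission energy of the MD's radio.
    E_T = D_T * p_tx * tau
    return delta_T, D_T, l_T, E_T

# Illustrative numbers: a 2 Mbit task, a 14 Mbps link, 0.1 s slots, 2.3 W radio.
print(transmission_schedule(t=3, lam=2e6, deadline=10, rates=[14e6], y=[1],
                            tau=0.1, p_tx=2.3, l_T_history=[4]))
```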
B. Computation Model

The computation tasks can be executed either locally on the MD or at an EN. In this subsection, we provide a detailed explanation of these two cases.

1) Local Execution: We model the local execution by a queuing system consisting of the computation queue and the MD's processor. Let f_i denote MD i's processing power (in cycles per second). When task z_i(t) is assigned to the computation queue at the beginning of time slot t ∈ T, we define l_i^C(t) ∈ T as the time slot during which task z_i(t) will either be processed or dropped. If the computation queue is empty, l_i^C(t) = 0. Let δ_i^C(t) denote the number of remaining time slots before processing task z_i(t) in the computation queue. We have:

$\delta_i^C(t) = \Big[ \max_{t' \in \{0,1,\ldots,t-1\}} l_i^C(t') - t + 1 \Big]^+.$  (5)

In the equation above, the term max_{t'∈{0,1,...,t−1}} l_i^C(t') denotes the time slot at which every task already in the computation queue, having arrived before time slot t, is either processed or dropped. Consequently, δ_i^C(t) denotes the number of time slots that task z_i(t) should wait before being processed. We denote the time slot in which task z_i(t) will be completely processed by l_i^C(t) if it is assigned to the computation queue in time slot t. We have

$l_i^C(t) = \min\big\{ t + \delta_i^C(t) + \lceil D_i^C(t) \rceil - 1,\; t + \Delta_i(t) - 1 \big\}.$  (6)

The task z_i(t) will be immediately dropped if its processing is not completed by the end of time slot t + Δ_i(t) − 1. In addition, we introduce D_i^C(t) as the number of time slots required to complete the processing of task z_i(t) on MD i ∈ I. It is given by:

$D_i^C(t) = \frac{\lambda_i(t)}{f_i\,\tau / \rho_i(t)}.$  (7)

To compute the MD's energy consumption for local processing in time slot t ∈ T, we define E_i^L(t) as:

$E_i^L(t) = D_i^C(t)\, p_i^C\, \tau,$  (8)

where p_i^C = 10^{-27} (f_i)^3 represents the power consumption of MD i's CPU operating at frequency f_i [24].
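The local-execution counterpart of the previous sketch, again with illustrative names and values only, evaluates (5)–(8) for a task placed in the computation queue.

```python
import math

def local_schedule(t, lam, rho, deadline, f_md, tau, l_C_history):
    """Evaluate (5)-(8) for a task of `lam` bits with processing density `rho`
    (cycles per bit) executed locally at CPU speed `f_md` (cycles per second)."""
    # (5): remaining slots before the task reaches the head of the queue.
    delta_C = max(0, max(l_C_history, default=0) - t + 1)
    # (7): slots needed to process the task locally.
    D_C = lam / (f_md * tau / rho)
    # (6): completion slot, capped by the deadline.
    l_C = min(t + delta_C + math.ceil(D_C) - 1, t + deadline - 1)
    # (8): local energy with the cubic CPU power model p_i^C = 1e-27 * f_i^3.
    p_C = 1e-27 * f_md ** 3
    E_L = D_C * p_C * tau
    return delta_C, D_C, l_C, E_L

# Illustrative numbers: a 2 Mbit task, 297 cycles/bit, a 2.6 GHz CPU, 0.1 s slots.
print(local_schedule(t=1, lam=2e6, rho=297, deadline=10,
                     f_md=2.6e9, tau=0.1, l_C_history=[]))
```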
2) Edge Execution: We model the edge execution by the queues associated with the MDs deployed at the ENs. If computation task z_i(t') is dispatched to EN j in time slot t' < t, we let z_{i,j}^E(t) and λ_{i,j}^E(t) (in bits) denote the unique index of the task and the size of the task in the ith queue at EN j. We define η_{i,j}^E(t) (in bits) as the length of this queue at the end of time slot t ∈ T. We refer to a queue as an active queue in a certain time slot if it is not empty. That is, if at least one task remains in the queue from previous time slots or a task arrives at the queue, that queue is active. We define B_j(t) to denote the set of active queues at EN j in time slot t:

$\mathcal{B}_j(t) = \big\{\, i \;\big|\; i \in \mathcal{I},\; \lambda_{i,j}^E(t) > 0 \ \text{or}\ \eta_{i,j}^E(t-1) > 0 \,\big\}.$  (9)

We introduce B_j(t) ≜ |B_j(t)| to represent the number of active queues at EN j ∈ J in time slot t ∈ T. In each time slot t ∈ T, the EN's processing power is divided among its active queues using a generalized processor sharing method [25]. Let f_j^E (in cycles per second) represent the computational capacity of EN j. Therefore, EN j can allocate a computational capacity of f_j^E/(ρ_i(t) B_j(t)) to each MD i ∈ B_j(t) during time slot t. To calculate the length of the computation queue of MD i ∈ I at EN j ∈ J, we define ω_{i,j}(t) (in bits) to represent the number of bits of dropped tasks in that queue at the end of time slot t ∈ T. The backlog of the queue, referred to as η_{i,j}^E(t), is given by:

$\eta_{i,j}^E(t) = \Big[ \eta_{i,j}^E(t-1) + \lambda_{i,j}^E(t) - \frac{f_j^E\,\tau}{\rho_i(t)\,B_j(t)} - \omega_{i,j}(t) \Big]^+.$  (10)

We also define l_{i,j}^E(t) ∈ T as the time slot during which the offloaded task z_{i,j}^E(t) is either processed or dropped by EN j. Given the uncertain workload ahead at EN j, neither MD i nor EN j has information about l_{i,j}^E(t) until the corresponding task z_{i,j}^E(t) is either processed or dropped. Let l̂_{i,j}^E(t) represent the time slot at which the execution of task z_{i,j}^E(t) starts. In mathematical terms, for i ∈ I, j ∈ J, and t ∈ T, we have:

$\hat{l}_{i,j}^E(t) = \max\Big\{ t,\; \max_{t' \in \{0,1,\ldots,t-1\}} l_{i,j}^E(t') + 1 \Big\},$  (11)

where l_{i,j}^E(0) = 0. Indeed, the initial processing time slot of task z_{i,j}^E(t) at the EN should not precede the time slot when the task was enqueued or when the previously arrived tasks were processed or dropped. Therefore, l_{i,j}^E(t) is the time slot that satisfies the following constraints:

$\sum_{t' = \hat{l}_{i,j}^E(t)}^{l_{i,j}^E(t)} \frac{f_j^E\,\tau}{\rho_i(t)\,B_j(t')}\, \mathbb{1}\big(i \in \mathcal{B}_j(t')\big) \;\geq\; \lambda_{i,j}^E(t),$  (12)

$\sum_{t' = \hat{l}_{i,j}^E(t)}^{l_{i,j}^E(t)-1} \frac{f_j^E\,\tau}{\rho_i(t)\,B_j(t')}\, \mathbb{1}\big(i \in \mathcal{B}_j(t')\big) \;<\; \lambda_{i,j}^E(t),$  (13)

where 1(z ∈ Z) is the indicator function. In particular, the total processing capacity that EN j allocates to MD i from time slot l̂_{i,j}^E(t) to time slot l_{i,j}^E(t) should exceed the size of task z_{i,j}^E(t). Conversely, the total allocated processing capacity from time slot l̂_{i,j}^E(t) to time slot l_{i,j}^E(t) − 1 should be less than the task's size.

Additionally, we define D_{i,j}^E(t) to represent the number of processing time slots allocated to task z_{i,j}^E(t) when executed at EN j. This value is given by:

$D_{i,j}^E(t) = \frac{\lambda_{i,j}^E(t)\,\rho_i(t)}{f_j^E\,\tau / B_j(t)}.$  (14)

We also define E_{i,j}^E(t) as the energy consumption of processing at EN j in time slot t by MD i. This can be calculated as:

$E_{i,j}^E(t) = \frac{D_{i,j}^E(t)\,p_j^E\,\tau}{B_j(t)},$  (15)

where p_j^E is a constant that denotes the power consumption of EN j's processor when operating at full capacity. In addition to the energy consumed by EN j for task processing, we also take into account the energy consumed by MD i's user interface in the standby state while waiting for task completion at EN j. We define E_i^I(t) as the energy consumption associated with the user interface of MD i ∈ I, which is given by

$E_i^I(t) = \sum_{j \in \mathcal{J}} D_{i,j}^E(t)\, p_i^I\, \tau,$  (16)

where p_i^I is the standby power consumption of MD i ∈ I. Accordingly, the total energy consumption associated with offloading is

$E_i^O(t) = E_i^T(t) + \sum_{j \in \mathcal{J}} E_{i,j}^E(t) + E_i^I(t).$  (17)
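The per-slot sharing rule behind (9), (10), (14), and (15) can be illustrated as follows. The sketch lets one EN split its capacity equally among its active per-MD queues and update each backlog; the dictionary-based data structures, and the assumption in `edge_task_cost` that the number of active queues stays fixed during a task's service, are simplifications for illustration only.

```python
def edge_step(backlogs, arrivals, dropped, rho, f_en, tau):
    """One time slot of edge execution at a single EN j.
    `backlogs[i]` is eta_{i,j}^E(t-1) in bits, `arrivals[i]` is lambda_{i,j}^E(t),
    `dropped[i]` is omega_{i,j}(t), and `rho[i]` is MD i's processing density."""
    # (9): queues with leftover bits or a new arrival are active in this slot.
    active = [i for i in backlogs if backlogs[i] > 0 or arrivals.get(i, 0) > 0]
    B = len(active)  # B_j(t), the number of active queues
    new_backlogs = dict(backlogs)
    for i in active:
        served_bits = f_en * tau / (rho[i] * B)  # generalized processor sharing
        # (10): backlog update, with dropped bits removed and floored at zero.
        new_backlogs[i] = max(0.0, backlogs[i] + arrivals.get(i, 0)
                              - served_bits - dropped.get(i, 0))
    return B, new_backlogs

def edge_task_cost(lam_E, rho_i, f_en, tau, B, p_en):
    """(14)-(15): processing slots and EN energy for one offloaded task,
    assuming the number of active queues stays at B during its service."""
    D_E = lam_E * rho_i / (f_en * tau / B)
    E_E = D_E * p_en * tau / B
    return D_E, E_E
```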
III. TASK OFFLOADING PROBLEM FORMULATION

Based on the introduced system model, we present the computation task offloading problem in this section. Our primary goal is to enhance each MD's QoE individually by taking the dynamic demands of MDs into account. To achieve this, we approach the optimization problem as an MDP, aiming to maximize the MD's QoE by striking a balance among key QoE factors, including task completion, task delay, and energy consumption. To prioritize QoE factors, we utilize the MD's battery level, which plays a crucial role in decision-making. Specifically, when an MD observes its state (e.g., task size, queue details, and battery level) and encounters a newly arrived task, it selects an appropriate action for that task. The selected action, based on the observed state, will result in enhanced QoE. Each MD strives to maximize its long-term QoE by optimizing the policy mapping from states to actions. In what follows, we first present the state space, the action space, and the QoE function, respectively. We then formulate the QoE maximization problem for each MD.

A. State Space

A state in our MDP represents a conceptual space that comprehensively describes the situation of an MD facing the environment. We represent MD i's state in time slot t as a vector s_i(t) that includes the newly arrived task size, the queue information, the MD's battery level, and the workload history at the ENs. The MD observes this vector at the beginning of each time slot. The vector s_i(t) is defined as follows:

$s_i(t) = \big( \lambda_i(t),\, \delta_i^C(t),\, \delta_i^T(t),\, \boldsymbol{\eta}_i^E(t-1),\, \phi_i(t),\, \mathbf{H}(t) \big),$  (18)

where the vector η_i^E(t − 1) = (η_{i,j}^E(t − 1))_{j∈J} represents the queue lengths of MD i at the ENs in the previous time slot and is computed by the MD according to (10). Let ϕ_i(t) denote the battery level of MD i in time slot t. Considering the power modes of a real mobile device, ϕ_i(t) is drawn from the discrete set Φ = {ϕ^1, ϕ^2, ϕ^3}, corresponding to the ultra power-saving, power-saving, and performance modes, respectively. In addition, to predict future EN workloads, we define the matrix H(t) as historical data indicating the number of active queues at all ENs. This data is recorded over T^s time slots, ranging from t − T^s to t − 1, in a T^s × J matrix. For the ith of these time slots and EN j, we define the entry h_{i,j}(t) as:

$h_{i,j}(t) = B_j(t - T^s + i - 1).$  (19)

EN j ∈ J broadcasts B_j(t) at the end of each time slot. We define S as the discrete and finite state space of each MD. The size of the set S is given by |Λ| × T^2 × |U| × 3 × I^{T^s × J}, where U is the set of available queue length values at an EN over T time slots.

B. Action Space

The action space represents the agent's behavior and decisions. In this context, we define a_i(t) to denote the action taken by MD i ∈ I in time slot t ∈ T. Each action involves two decisions: (a) the offloading decision, which determines whether or not to offload the task, and (b) the offloading target, which determines the EN to which the offloaded task is sent. Thus, the action of MD i in time slot t can be concisely expressed as the following action tuple:

$a_i(t) = \big( x_i(t),\, \boldsymbol{y}_i(t) \big),$  (20)

where the vector y_i(t) = (y_{i,j}(t))_{j∈J} represents the EN selected for offloading this task. In Section IV-B, we will discuss the size of this action space.

C. QoE Function

The QoE function evaluates the influence of the agent's actions by taking several key performance factors into account. Given the selected action a_i(t) in the observed state s_i(t), we represent D_i(s_i(t), a_i(t)) as the delay of task z_i(t), which indicates the number of time slots from time slot t to the time slot in which task z_i(t) is processed. It is calculated by:

$D_i(s_i(t), a_i(t)) = (1 - x_i(t))\big( l_i^C(t) - t + 1 \big) + x_i(t) \sum_{j \in \mathcal{J}} \sum_{t'=t}^{T} \mathbb{1}\big( z_{i,j}^E(t') = z_i(t) \big)\big( l_{i,j}^E(t') - t + 1 \big),$  (21)

where D_i(s_i(t), a_i(t)) = 0 when task z_i(t) is dropped. Correspondingly, we denote the energy consumption of task z_i(t) when taking action a_i(t) in the observed state s_i(t) as E_i(s_i(t), a_i(t)), which is:

$E_i(s_i(t), a_i(t)) = (1 - x_i(t))\, E_i^L(t) + x_i(t) \sum_{j \in \mathcal{J}} \sum_{t'=t}^{T} \mathbb{1}\big( z_{i,j}^E(t') = z_i(t) \big)\, E_i^O(t).$  (22)

Given the delay and energy consumption of task z_i(t), we also define C_i(s_i(t), a_i(t)), which denotes the associated cost of task z_i(t) given the action a_i(t) in the state s_i(t):

$C_i(s_i(t), a_i(t)) = \phi_i(t)\, D_i(s_i(t), a_i(t)) + (1 - \phi_i(t))\, E_i(s_i(t), a_i(t)),$  (23)

where ϕ_i(t) represents MD i's battery level. When the MD is operating in performance mode, the primary focus is on minimizing task delay, so the delay contributes more to the cost. On the other hand, when the MD switches to ultra power-saving mode, the main attention is directed toward reducing power consumption. Finally, we define q_i(s_i(t), a_i(t)) as the QoE associated with task z_i(t) given the selected action a_i(t) and the observed state s_i(t). The QoE function is defined as follows:

$q_i(s_i(t), a_i(t)) = \begin{cases} R - C_i(s_i(t), a_i(t)), & \text{if task } z_i(t) \text{ is processed},\\ -\,E_i(s_i(t), a_i(t)), & \text{if task } z_i(t) \text{ is dropped}, \end{cases}$  (24)

where R > 0 represents a constant reward for task completion. If z_i(t) = 0 (i.e., no task arrives), then q_i(s_i(t), a_i(t)) = 0. Throughout the rest of this paper, we adopt the shortened notation q_i(t) to represent q_i(s_i(t), a_i(t)).
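The battery-driven weighting of (23) and (24) reduces to a few lines of Python. In this sketch the battery level is assumed to be normalized to [0, 1], and the reward R and the inputs are illustrative.

```python
def task_qoe(delay_slots, energy, phi, R=1.0, processed=True):
    """Per-task QoE of (23)-(24): phi (normalized battery level) weights delay,
    (1 - phi) weights energy, and a dropped task is penalized by its energy only."""
    if not processed:
        return -energy                                     # task dropped
    cost = phi * delay_slots + (1.0 - phi) * energy        # (23)
    return R - cost                                        # (24)

# Performance mode (high battery) penalizes delay more heavily, while
# ultra power-saving mode (low battery) penalizes energy instead.
print(task_qoe(delay_slots=3, energy=0.8, phi=0.75))
print(task_qoe(delay_slots=3, energy=0.8, phi=0.25))
```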
D. Problem Formulation

We define the task offloading policy of MD i ∈ I as a mapping from its state to its corresponding action, denoted by π_i : S → A. Specifically, MD i determines an action a_i(t) ∈ A according to policy π_i given the observed environment state s_i(t) ∈ S. The MD aims to find its optimal policy π_i^*, which maximizes the long-term QoE:

$\pi_i^* = \arg\max_{\pi_i} \; \mathbb{E}\Big[ \sum_{t \in \mathcal{T}} \gamma^{\,t-1}\, q_i(t) \;\Big|\; \pi_i \Big],$  (25)

where γ ∈ (0, 1] is a discount factor that determines the balance between instant QoE and long-term QoE. As γ approaches 0, the MD prioritizes the QoE of the current time slot exclusively. Conversely, as γ approaches 1, the MD increasingly factors in the cumulative long-term QoE. The expectation E[·] is taken with respect to the time-varying system environment. Solving the optimization problem in (25) is particularly challenging due to the dynamic nature of the network. To address this challenge, we introduce a DRL-based offloading algorithm to learn the mapping between each state-action pair and its long-term QoE.
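For a fixed policy, the objective in (25) is simply the discounted sum of the per-slot QoE values over an episode; a minimal sketch with an illustrative QoE trace is given below.

```python
def discounted_qoe(qoe_per_slot, gamma=0.9):
    """Discounted long-term QoE of (25) for one episode: sum_t gamma^(t-1) * q_i(t)."""
    return sum(gamma ** (t - 1) * q for t, q in enumerate(qoe_per_slot, start=1))

print(discounted_qoe([0.4, 0.0, 0.7, -0.2]))  # illustrative per-slot QoE values
```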
IV. DRL-BASED OFFLOADING ALGORITHM

We now present the QECO algorithm, which addresses the distributed offloading decision-making of MDs. The aim is to empower MDs to identify the most efficient action that maximizes their long-term QoE. In the following, we introduce a neural network that characterizes the MD's state-action Q-value mapping, followed by a description of the information exchange between the MDs and ENs.

A. DQN-based Approach

We utilize the DQN technique to find the mapping from each state-action pair to its Q-value in the formulated MDP. As shown in Fig. 2, each MD i ∈ I is equipped with a neural network comprising six layers: an input layer, an LSTM layer, two dense layers, an advantage-and-value (A&V) layer, and an output layer. The parameter vector θ_i of MD i's neural network maintains the connection weights and neuron biases across all layers. For MD i ∈ I, we use the state information as the input of the neural network. The state information λ_i(t), δ_i^C(t), δ_i^T(t), ϕ_i(t), and η_i^E(t − 1) is passed directly to the dense layer, while the state information H(t) is first supplied to the LSTM layer, whose output is then sent to the dense layer. The role and responsibilities of each layer are detailed as follows.

Fig. 2. The neural network of MD i ∈ I, which characterizes the Q-value of each action a ∈ A under state s_i(t) ∈ S.

1) Predicting Workloads at ENs: To capture the dynamic behavior of workloads at the ENs, we employ an LSTM network [22]. This network maintains a memory state that evolves over time, enabling the neural network to predict future workloads at the ENs based on historical data. By taking the matrix H(t) as an input, the LSTM network learns the patterns of workload dynamics. The architecture of the LSTM consists of T^s units, each equipped with a set of hidden neurons, and it processes the individual rows of the matrix H(t) sequentially. Through this interconnected design, the MD tracks the variations in the sequence from h_1(t) to h_{T^s}(t), where h_i(t) = (h_{i,j}(t))_{j∈J}, thereby revealing the workload fluctuations at the ENs across different time slots. The final LSTM unit produces an output that encapsulates the anticipated workload dynamics and is connected to the neurons of the subsequent layer for further learning.

2) State-Action Q-Value Mapping: The pair of dense layers plays a crucial role in learning the mapping from the current state and the learned load dynamics to the Q-values of the corresponding actions. The dense layers consist of clusters of neurons that employ rectified linear units (ReLUs) as their activation functions. In the first dense layer, connections are established from the neurons of the input layer and the LSTM layer to each neuron of the dense layer. The resulting output of each neuron in this dense layer is connected to every neuron in the subsequent dense layer. In the second dense layer, the outputs of each neuron connect to all neurons of the A&V layer.

3) Dueling-DQN Approach for Q-Value Estimation: In the neural network architecture, the A&V layer and the output layer incorporate the principles of dueling-DQN [26] to compute the action Q-values. The fundamental concept of dueling-DQN involves two separate learning components: one for action-advantage values and another for the state-value. This approach enhances Q-value estimation by separately evaluating the long-term QoE attributed to states and actions. The A&V layer consists of two distinct dense networks, referred to as network A and network V. Network A's role is to learn the action-advantage value of each action, while network V focuses on learning the state-value. For an MD i ∈ I, we define V_i(s_i(t); θ_i) and A_i(s_i(t), a; θ_i) to denote the state-value and the action-advantage value of action a ∈ A under state s_i(t) ∈ S, respectively. The parameter vector θ_i determines these values and is adjusted when training the QECO algorithm. For an MD i ∈ I, the A&V layer and the output layer collectively determine Q_i(s_i(t), a; θ_i), the resulting Q-value under action a ∈ A and state s_i(t) ∈ S, as follows:

$Q_i(s_i(t), a; \theta_i) = V_i(s_i(t); \theta_i) + \Big( A_i(s_i(t), a; \theta_i) - \frac{1}{|\mathcal{A}|} \sum_{a' \in \mathcal{A}} A_i(s_i(t), a'; \theta_i) \Big),$  (26)

where θ_i establishes the functional relationship that maps state-action pairs to Q-values.
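A compact NumPy sketch of the aggregation in (26) is shown below; the state-value and advantage inputs stand in for the outputs of network V and network A, and the action count is illustrative.

```python
import numpy as np

def dueling_q_values(state_value, advantages):
    """Combine the state-value V(s) and action advantages A(s, a) as in (26):
    Q(s, a) = V(s) + (A(s, a) - mean over a' of A(s, a'))."""
    advantages = np.asarray(advantages, dtype=float)
    return state_value + (advantages - advantages.mean())

# Placeholder outputs of network V and network A for J + 1 candidate actions
# (local execution or one of J edge nodes).
print(dueling_q_values(state_value=0.5, advantages=[0.2, -0.1, 0.4, 0.0, -0.3, 0.1]))
```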
B. QoE-Oriented DRL-Based Algorithm

The QECO algorithm is designed to optimize the allocation of computation tasks between MDs and ENs. Since training a neural network imposes an extensive computational workload on an MD, we enable MDs to utilize ENs for training their neural networks, effectively reducing their computational workload. For each MD i ∈ I, there is an associated EN, denoted by EN j_i ∈ J, which assists in the training process. This EN possesses the highest transmission capacity among all ENs. We define I_j ⊂ I as the set of MDs whose training is executed by EN j ∈ J, i.e., I_j = {i ∈ I | j_i = j}. This approach is feasible due to the minimal information exchange and processing requirements of training compared with the MDs' tasks. The algorithms to be executed at MD i ∈ I and EN j ∈ J are given in Algorithms 1 and 2, respectively.

Algorithm 1 QECO Algorithm (Offloading Decision)
Input: state space S, action space A
Output: MD i ∈ I experience
1: for episode 1 to N^ep do
2:   Initialize s_i(1);
3:   for time slot t ∈ T do
4:     if MD i receives a new task z_i(t) then
5:       Send an UpdateRequest to EN j_i;
6:       Receive network parameter vector θ_i^E;
7:       Select action a_i(t) based on (27);
8:     end if
9:     Observe a set of QoEs {q_i(t'), t' ∈ F_i^t};
10:    Observe the next state s_i(t + 1);
11:    for each task z_i(t') where t' ∈ F_i^t do
12:      Send (s_i(t'), a_i(t'), q_i(t'), s_i(t' + 1)) to EN j_i;
13:    end for
14:  end for
15: end for

The core concept involves training the neural networks with MD experiences (i.e., state, action, QoE, next state) to map each state-action pair to a Q-value. This mapping allows the MD to identify the action with the highest Q-value in the observed state and thereby maximize its long-term QoE. In detail, EN j ∈ J maintains a replay buffer, denoted by M_i, together with two neural networks for each MD i ∈ I_j: Net_i^E, the evaluation network, and Net_i^T, the target network, which have the same architecture but possess distinct parameter vectors θ_i^E and θ_i^T, respectively. Their Q-values are represented by Q_i^E(s_i(t), a; θ_i^E) and Q_i^T(s_i(t), a; θ_i^T), respectively, for action a ∈ A under state s_i(t) ∈ S. The replay buffer records the observed experiences (s_i(t), a_i(t), q_i(t), s_i(t + 1)) of MD i. Moreover, Net_i^E is responsible for action selection, while Net_i^T characterizes the target Q-values, which represent the estimated long-term QoE resulting from an action in the observed state. The target Q-value serves as the reference for updating the network parameter vector θ_i^E. This update occurs through the minimization of the disparity between the Q-values under Net_i^E and Net_i^T. In the following, we introduce the offloading decision algorithm of MD i ∈ I and the training process algorithm running at EN j ∈ J.

1) Offloading Decision Algorithm at MD i ∈ I: We consider a series of episodes, where N^ep denotes their number. At the beginning of each episode, MD i ∈ I initializes its state s_i(1). In each time slot, if MD i receives a new task z_i(t), it sends an UpdateRequest to EN j_i. After receiving the requested parameter vector θ_i^E of Net_i^E from EN j_i, MD i chooses the following action for task z_i(t):

$a_i(t) = \begin{cases} \arg\max_{a \in \mathcal{A}} Q_i^E(s_i(t), a; \theta_i^E), & \text{w.p. } 1 - \epsilon,\\ \text{a random action from } \mathcal{A}, & \text{w.p. } \epsilon, \end{cases}$  (27)

where w.p. stands for "with probability" and ϵ represents the random exploration probability. The value of Q_i^E(s_i(t), a; θ_i^E) indicates the Q-value under the parameter vector θ_i^E of the neural network Net_i^E. Specifically, with probability 1 − ϵ, the MD selects the action associated with the highest Q-value under Net_i^E in the observed state s_i(t).
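The selection rule in (27) is a standard epsilon-greedy step over the Q-values reported by Net_i^E; a minimal sketch, with illustrative Q-values, is:

```python
import random

def select_action(q_values, epsilon):
    """Epsilon-greedy action selection of (27): exploit the highest Q-value with
    probability 1 - epsilon, otherwise pick a random action from A."""
    if random.random() < 1.0 - epsilon:
        return max(range(len(q_values)), key=lambda a: q_values[a])
    return random.randrange(len(q_values))

# Q-values from the evaluation network for J + 1 candidate actions.
print(select_action([0.7, 0.1, 0.55, -0.2, 0.3, 0.05], epsilon=0.1))
```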
In the next time slot t + 1, MD i observes the state s_i(t + 1). However, since tasks may extend across multiple time slots, the QoE q_i(t) associated with task z_i(t) may not be observable in time slot t + 1. On the other hand, MD i may observe a group of QoEs associated with some tasks z_i(t') from time slots t' ≤ t. For each MD i, we define the set F_i^t ⊂ T to denote the time slots whose arriving tasks z_i(t') are either processed or dropped in time slot t, as given by:

$\mathcal{F}_i^t = \Big\{ t' \;\Big|\; t' \leq t,\; \lambda_i(t') > 0,\; (1 - x_i(t'))\, l_i^C(t') + x_i(t') \sum_{j \in \mathcal{J}} \sum_{n=t'}^{t} \mathbb{1}\big( z_{i,j}^E(n) = z_i(t') \big)\, l_{i,j}^E(n) = t \Big\}.$

Therefore, MD i observes a set of QoEs {q_i(t') | t' ∈ F_i^t} at the beginning of time slot t + 1, where the set F_i^t may be empty for some i ∈ I. Subsequently, MD i sends the experience (s_i(t'), a_i(t'), q_i(t'), s_i(t' + 1)) to EN j_i for each task z_i(t') with t' ∈ F_i^t.

Algorithm 2 QECO Algorithm (Training Process)
1: Initialize replay buffer M_i for each MD i ∈ I_j;
2: Initialize Net_i^E and Net_i^T with random parameters θ_i^E and θ_i^T, respectively, for each MD i ∈ I_j;
3: Set Count := 0;
4: while True do    ▷ infinite loop
5:   if an UpdateRequest is received from MD i ∈ I_j then
6:     Send θ_i^E to MD i;
7:   end if
8:   if an experience (s_i(t), a_i(t), q_i(t), s_i(t + 1)) is received from MD i ∈ I_j then
9:     Store (s_i(t), a_i(t), q_i(t), s_i(t + 1)) in M_i;
10:    Sample a collection of experiences N from M_i;
11:    for each experience n ∈ N do
12:      Get experience (s_i(n), a_i(n), q_i(n), s_i(n + 1));
13:      Generate Q̂_{i,n}^T according to (28);
14:    end for
15:    Set vector Q̂_i^T := (Q̂_{i,n}^T)_{n∈N};
16:    Update θ_i^E to minimize L(θ_i^E, Q̂_i^T) in (30);
17:    Count := Count + 1;
18:    if mod(Count, ReplaceThreshold) = 0 then
19:      θ_i^T := θ_i^E;
20:    end if
21:  end if
22: end while

2) Training Process Algorithm at EN j ∈ J: Upon initializing the replay buffer M_i and the neural networks Net_i^E and Net_i^T for each MD i ∈ I_j, EN j ∈ J waits for messages from the MDs in the set I_j. When EN j receives an UpdateRequest signal from an MD i ∈ I_j, it responds by transmitting the current parameter vector θ_i^E, obtained from Net_i^E, back to MD i. On the other hand, if EN j receives an experience (s_i(t), a_i(t), q_i(t), s_i(t + 1)) from MD i ∈ I_j, the EN stores this experience in the replay buffer M_i associated with that MD. The EN then randomly selects a sample collection of experiences from the replay buffer, denoted by N. For each experience n ∈ N, it calculates the value Q̂_{i,n}^T. This value represents the QoE in experience n plus a discounted Q-value of the action anticipated to be taken in the subsequent state of experience n, according to the network Net_i^T, given by

$\hat{Q}_{i,n}^T = q_i(n) + \gamma\, Q_i^T\big( s_i(n+1), \tilde{a}_n; \theta_i^T \big),$  (28)

where ã_n denotes the optimal action for state s_i(n + 1) based on its highest Q-value under Net_i^E, as given by:

$\tilde{a}_n = \arg\max_{a \in \mathcal{A}} Q_i^E\big( s_i(n+1), a; \theta_i^E \big).$  (29)

In particular, for experience n, the target Q-value Q̂_{i,n}^T represents the long-term QoE for action a_i(n) under state s_i(n). This value corresponds to the QoE observed in experience n, together with the approximate expected upcoming QoE.
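The target construction in (28) and (29) follows the double-DQN pattern: the evaluation network selects the next action and the target network scores it. In the sketch below the two networks are reduced to plain Q-value lists for one next state, which is an illustrative simplification.

```python
def target_q(q_next_eval, q_next_target, qoe_n, gamma=0.9):
    """Compute the target of (28)-(29) for one sampled experience n.
    `q_next_eval[a]` are Q-values of Net^E at s_i(n+1), `q_next_target[a]` are
    Q-values of Net^T at the same state, and `qoe_n` is the observed QoE q_i(n)."""
    a_tilde = max(range(len(q_next_eval)), key=lambda a: q_next_eval[a])  # (29)
    return qoe_n + gamma * q_next_target[a_tilde]                         # (28)

print(target_q(q_next_eval=[0.2, 0.6, 0.1], q_next_target=[0.3, 0.5, 0.4], qoe_n=0.8))
```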
Based on the set N, the EN trains the MD's neural network using previously sampled experiences. Specifically, it computes the vector Q̂_i^T = (Q̂_{i,n}^T)_{n∈N} and updates θ_i^E in Net_i^E. The key idea of updating Net_i^E is to minimize the disparity in Q-values between Net_i^E and Net_i^T, as indicated by the following loss function:

$L(\theta_i^E, \hat{Q}_i^T) = \frac{1}{|\mathcal{N}|} \sum_{n \in \mathcal{N}} \Big( Q_i^E\big( s_i(n), a_i(n); \theta_i^E \big) - \hat{Q}_{i,n}^T \Big)^2.$  (30)

Every ReplaceThreshold iterations, Net_i^T is updated by duplicating the parameters of Net_i^E (θ_i^T := θ_i^E). The objective is to consistently refresh the network parameter vector θ_i^T of Net_i^T, which enhances the approximation of the long-term QoE when computing the target Q-values in (28).

3) Computational Complexity: The computational complexity of the QECO algorithm is determined by the number of experiences required to discover the optimal offloading policy. Each experience involves backpropagation for training, which has a computational complexity of O(C), where C represents the number of multiplication operations in the neural network. During each training round, triggered by the arrival of a new task, a sample collection of experiences of size |N| is drawn from the replay buffer. Since the training process encompasses N^ep episodes and there are K expected tasks in each episode, the computational complexity of the proposed algorithm is O(N^ep K |N| C), which is polynomial. Given the integration of neural networks for function approximation, the convergence guarantee of the DRL algorithm remains an open problem. In this work, we empirically evaluate the convergence of the proposed algorithm in Section V-B.
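Before turning to the evaluation, the training round of Algorithm 2 and the update in (30) can be summarized in a compact, runnable sketch. The tabular Q-structures, the plain gradient stand-in, and the ReplaceThreshold value below are simplifications of ours, not the paper's neural-network implementation.

```python
import random

class EdgeTrainerSketch:
    """Miniature version of the training process in Algorithm 2: targets follow
    (28)-(29), the loss follows (30), and the evaluation table is nudged toward
    the target (a stand-in for the DQN gradient step on theta^E)."""

    def __init__(self, n_actions, gamma=0.9, lr=0.001, batch_size=16,
                 replace_threshold=100):       # replace_threshold is illustrative
        self.q_eval, self.q_target = {}, {}    # stand-ins for Net^E and Net^T
        self.buffer = []                       # replay buffer M_i
        self.n_actions = n_actions
        self.gamma, self.lr = gamma, lr
        self.batch_size, self.replace_threshold = batch_size, replace_threshold
        self.count = 0

    def _q(self, table, s):
        return table.setdefault(s, [0.0] * self.n_actions)

    def store(self, experience):               # experience = (s, a, qoe, s_next)
        self.buffer.append(experience)

    def train_round(self):
        batch = random.sample(self.buffer, min(self.batch_size, len(self.buffer)))
        if not batch:
            return 0.0
        losses = []
        for s, a, qoe, s_next in batch:
            a_tilde = max(range(self.n_actions),
                          key=lambda x: self._q(self.q_eval, s_next)[x])          # (29)
            target = qoe + self.gamma * self._q(self.q_target, s_next)[a_tilde]   # (28)
            error = self._q(self.q_eval, s)[a] - target
            losses.append(error ** 2)
            self.q_eval[s][a] -= self.lr * 2 * error   # stand-in gradient step on (30)
        self.count += 1
        if self.count % self.replace_threshold == 0:   # theta^T := theta^E
            self.q_target = {s: list(q) for s, q in self.q_eval.items()}
        return sum(losses) / len(losses)               # empirical loss of (30)
```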
V. PERFORMANCE EVALUATION

In this section, we first present the simulation setup and training configuration. We then illustrate the convergence of the proposed DRL-based QECO algorithm and evaluate its performance in comparison with three baseline schemes in addition to the existing work [23].

TABLE I
SIMULATION PARAMETERS
Computation capacity of MD, f_i: 2.6 GHz
Computation capacity of EN, f_j^E: 42.8 GHz
Transmission capacity of MD, r_{i,j}(t): 14 Mbps
Task arrival rate: 150 task/sec
Size of task, λ_i(t): {1.0, 1.1, ..., 7.0} Mbits
Required CPU cycles of task, ρ_i(t): {0.197, 0.297, 0.397} × 10^3
Deadline of task, Δ_i: 10 time slots (1 sec)
Battery level percentage of MD, ϕ_i(t): {25, 50, 75}
Computation power of EN, p_j^E: 5 W
Transmission power of MD, p_i^T: 2.3 W
Standby power of MD, p_i^I: 0.1 W

A. Simulation Setup

We consider an MEC environment with 50 MDs and 5 ENs, similar to [12]. We also follow the model presented in [17] to determine the energy consumption. All the parameters are given in Table I. To train the MDs' neural networks, we adopt a scenario comprising 1000 episodes. Each episode contains 100 time slots, each of length 0.1 second. The QECO algorithm incorporates real-time experience into its training process to continuously enhance the offloading strategy. Specifically, we employ a batch size of 16, maintain a fixed learning rate of 0.001, and set the discount factor γ to 0.9. The probability of random exploration gradually decreases from an initial value of 1 toward 0.01, all of which is facilitated by an RMSProp optimizer.

We use the following methods as benchmarks.
1) Local Computing (LC): The MDs execute all of their computation tasks using their own computing capacity.
2) Full Offloading (FO): Each MD dispatches all of its computation tasks while choosing the offloading target randomly.
3) Random Decision (RD): When an MD receives a new task, it randomly makes the offloading decision and, if it decides to dispatch the task, randomly selects the offloading target.
4) PGOA [23]: This existing method is a distributed optimization algorithm designed for delay-sensitive tasks in an environment where MDs interact strategically with multiple ENs. We select PGOA as a benchmark method due to its similarity to our work.

B. Performance Comparison and Convergence

We first evaluate the number of completed tasks, comparing our proposed QECO algorithm with the other four schemes. As illustrated in Fig. 3(a), the QECO algorithm consistently outperforms the benchmark methods as we vary the task arrival rate. At a lower task arrival rate (i.e., 50), most of the methods demonstrate similar proficiency in completing tasks. However, as the task arrival rate increases, the efficiency of QECO becomes more evident. Specifically, when the task arrival rate increases to 250, our algorithm can increase the number of completed tasks by 73% and 47% compared to RD and PGOA, respectively. Similarly, in Fig. 3(b), as the number of MDs increases, QECO shows significant improvements in the number of completed tasks compared to the other methods, especially when faced with a large number of MDs. When there are 110 MDs, our proposed algorithm can effectively increase the number of completed tasks by at least 34% compared with the other methods. This achievement is attributed to QECO's ability to effectively handle unknown workloads and prevent congestion at the ENs.

Fig. 3. The number of completed tasks under different computation workloads: (a) task arrival rate; (b) the number of MDs.
Figs. 4(a) and 4(b) illustrate the overall energy consumption for different values of the task arrival rate and the number of MDs, respectively. At lower task arrival rates, the total energy consumption of all methods is close to one another, and the total energy consumption increases as the task arrival rate grows. As can be observed from Fig. 4(a), at a task arrival rate of 450, QECO effectively reduces the overall energy consumption by 18% and 15% compared to RD and PGOA, respectively, as it takes the MD's battery level into account in its decision-making process. However, it consumes more energy than LC and FO because they do not utilize all computing resources. In particular, LC only uses the MD's computational resources, while FO only utilizes the allocated EN computing resources.

In Fig. 4(b), an increasing trend in the overall energy consumption is observed as the number of MDs increases, since the number of resources available in the system increases, which leads to higher energy consumption. The QECO algorithm consistently outperforms the RD and PGOA methods in overall energy consumption, especially when there is a large number of MDs. Specifically, QECO demonstrates a 27% and 16% reduction in overall energy consumption compared to RD and PGOA, respectively, when the number of MDs increases to 110.

Fig. 4. The overall energy consumption under different computation workloads: (a) task arrival rate; (b) the number of MDs.

As shown in Fig. 5(a), the QECO algorithm maintains a lower average delay compared to the other methods as the task arrival rate increases from 50 to 350. Specifically, when the task arrival rate is 200, it reduces the average delay by at least 12% compared to the other methods. However, for task arrival rates exceeding 350, QECO may experience a higher average delay than some of the other methods. This can be attributed to the fact that the other algorithms drop more tasks, while our proposed algorithm is capable of completing a higher number of tasks, potentially leading to an increase in average delay.

In Fig. 5(b), as the number of MDs increases, we observe a rising trend in the average delay. It can be inferred that an increase in the computational load of the system can lead to higher queuing and computation delays at the ENs. Owing to QECO's ability to schedule workloads, when the number of MDs increases from 30 to 110, it consistently maintains a lower average delay, which is at least 8% less than the other methods.

Fig. 5. The average delay under different computation workloads: (a) task arrival rate; (b) the number of MDs.

We further investigate the overall improvement achieved by the QECO algorithm in comparison to the other methods in terms of the average QoE. This metric signifies the advantages MDs obtain by utilizing different algorithms. Fig. 6(a) shows the average QoE for different values of the task arrival rate. This figure indicates the superiority of the QECO algorithm in providing MDs with an enhanced experience. Specifically, when the task arrival rate is moderate (i.e., 250), QECO improves the average QoE by 57% and 33% compared to RD and PGOA, respectively. Fig. 6(b) illustrates the average QoE as we increase the number of MDs. The ENs' workload grows when there is a larger number of MDs, leading to a reduction in the average QoE of all methods except LC. However, QECO effectively manages the uncertain load at the ENs. When the number of MDs increases to 90, QECO achieves at least a 29% higher QoE compared with the other methods. It is worth noting that although improvements in each of the QoE factors can contribute to enhancing system performance, it is essential to consider the user's demands in each time slot. Therefore, the key difference between QECO and the other methods is that it prioritizes users' demands, enabling it to strike an appropriate balance among them, ultimately leading to a higher QoE for MDs.

Fig. 6. The average QoE under different computation workloads: (a) task arrival rate; (b) the number of MDs.
We finally delve into the convergence performance of the QECO algorithm, shown through the average QoE across episodes in Figs. 7(a) and 7(b). We explore the impact of two main hyper-parameters on the convergence speed and the converged result of the proposed algorithm. Fig. 7(a) illustrates the convergence of the proposed algorithm under different learning rates, where the learning rate regulates the step size per iteration toward minimizing the loss function. The QECO algorithm achieves an average QoE of 0.77 after around 400 episodes when the learning rate is 0.001, indicating relatively rapid convergence. However, with smaller learning rates (e.g., 0.0001) or larger values (e.g., 0.01), slower convergence is observed. Fig. 7(b) shows the convergence of the proposed algorithm under different batch sizes, which refer to the number of sampled experiences in each training round. An improvement in convergence performance is observed as the batch size increases from 4 to 16. However, further increasing the batch size from 16 to 32 does not notably enhance the converged QoE or the convergence speed. Hence, a batch size of 16 may be more appropriate for the training process.

Fig. 7. The convergence of the average QoE across episodes under different hyper-parameters: (a) learning rate; (b) batch size.

VI. CONCLUSION

In this paper, we focused on addressing the challenge of computation offloading in MEC systems, where strict task processing deadlines and energy constraints adversely impact system performance. We formulated an optimization problem that aims to maximize the QoE of each MD individually, where the QoE reflects the energy consumption and task completion delay. To address the dynamic and uncertain mobile environment, we proposed a QoE-oriented DRL-based computation offloading algorithm called QECO. Our proposed algorithm empowers MDs to make offloading decisions without relying on knowledge of task models or other MDs' offloading decisions. The QECO algorithm not only adapts to the uncertain dynamics of the load levels at ENs, but also effectively manages the ever-changing system environment. Through extensive simulations, we showed that QECO outperforms several established benchmark techniques while demonstrating rapid training convergence. Specifically, QECO increases the average user's QoE by 37% compared to an existing work. This advantage can lead to improvements in key performance metrics, including task completion rate, task delay, and energy consumption, under different system conditions and varying user demands.

There are multiple directions for future work. A complementary approach involves extending the task model by considering interdependencies among tasks. This can be achieved by incorporating a task call graph representation. Furthermore, in order to accelerate the learning of optimal offloading policies, it will be beneficial to take advantage of federated learning techniques in the training process. This will allow MDs to collectively contribute to improving the offloading model and enable continuous learning when new MDs join the network.

REFERENCES

[1] Y. Mao, C. You, J. Zhang, K. Huang, and K. B. Letaief, "A survey on mobile edge computing: The communication perspective," IEEE Commun. Surv. Tutor., vol. 19, no. 4, pp. 2322–2358, Aug 2017.
[2] Z. Zhou, X. Chen, E. Li, L. Zeng, K. Luo, and J.
Zhang, “Edge intelligence: Paving the last mile of artificial intelligence with edge computing,” Proc IEEE, vol. 107, no. 8, pp. 1738–1762, Aug 2019. [3] A. Yousefpour, C. Fung, T. Nguyen, K. Kadiyala, F. Jalali, A. Niakanlahiji, J. Kong, and J. P. Jue, “All one needs to know about fog computing and related edge computing paradigms: A complete survey,” J. Syst. Archit., vol. 98, pp. 289–330, Sep 2019. [4] A. Kaur and A. Godara, “Machine learning empowered green task offloading for mobile edge computing in 5G networks,” IEEE Trans. Netw. Sci. Eng., vol. 21, no. 1, pp. 810–820, Feb 2024. [5] H. Shah-Mansouri and V. W. Wong, “Hierarchical fog-cloud computing for IoT systems: A computation offloading game,” IEEE Internet Things J., vol. 5, no. 4, pp. 3246–3257, May 2018. [6] C. Jiang, X. Cheng, H. Gao, X. Zhou, and J. Wan, “Toward computation offloading in edge computing: A survey,” IEEE Access, vol. 7, pp. 131 543–131 558, Aug 2019. [7] L. Wu, P. Sun, H. Chen, Y. Zuo, Y. Zhou, and Y. Yang, “NOMA-enabled multiuser offloading in multicell edge computing networks: A coalition game based approach,” IEEE Trans. Netw. Sci. Eng., vol. 11, no. 2, pp. 2170–2181, Mar 2024. [8] K. Arulkumaran, M. P. Deisenroth, M. Brundage, and A. A. Bharath, “Deep reinforcement learning: A brief survey,” IEEE Signal Process Mag, vol. 34, no. 6, pp. 26–38, Nov 2017. [9] L. Huang, S. Bi, and Y.-J. A. Zhang, “Deep reinforcement learning for online computation offloading in wireless powered mobile-edge computing networks,” IEEE Trans. Mob. Comput., vol. 19, no. 11, pp. 2581–2593, Jul 2019. [10] M. Bolourian and H. Shah-Mansouri, “Deep Q-learning for minimum task drop in SWIPT-enabled mobile-edge computing,” IEEE Wireless Commun. Letters, vol. 13, no. 3, pp. 894–898, Mar 2024. [11] N. Zhao, Y.-C. Liang, D. Niyato, Y. Pei, M. Wu, and Y. Jiang, “Deep reinforcement learning for user association and resource allocation in heterogeneous cellular networks,” IEEE Trans. Wireless Commun., vol. 18, no. 11, pp. 5141–5152, Aug 2019. [12] M. Tang and V. W. Wong, “Deep reinforcement learning for task offloading in mobile edge computing systems,” IEEE Trans. Mob. Comput., vol. 21, no. 6, pp. 1985–1997, Nov 2020. [13] C. Sun, X. Li, C. Wang, Q. He, X. Wang, and V. C. Leung, “Hierarchical deep reinforcement learning for joint service caching and computation offloading in mobile edge-cloud computing,” accepted for publication in IEEE Trans. Services Computing, 2024. [14] Y. Dai, K. Zhang, S. Maharjan, and Y. Zhang, “Edge intelligence for energy-efficient computation offloading and resource allocation in 5G beyond,” IEEE Trans. Veh. Technol., vol. 69, no. 10, pp. 12 175–12 186, Aug 2020. [15] H. Huang, Q. Ye, and Y. Zhou, “Deadline-aware task offloading with partially-observable deep reinforcement learning for multi-access edge computing,” IEEE Trans. Netw. Sci. Eng., vol. 9, no. 6, pp. 3870–3885, Sep 2021. [16] Z. Liu, Y. Zhao, J. Song, C. Qiu, X. Chen, and X. Wang, “Learn to coordinate for computation offloading and resource allocation in edge computing: A rational-based distributed approach,” IEEE Trans. Netw. Sci. Eng., vol. 9, no. 5, pp. 3136–3151, Dec 2021. [17] H. Zhou, K. Jiang, X. Liu, X. Li, and V. C. Leung, “Deep reinforcement learning for energy-efficient computation offloading in mobile-edge computing,” IEEE Internet Things J., vol. 9, no. 2, pp. 1517–1530, Jun 2021. [18] Z. Gao, L. Yang, and Y. 
Dai, “Large-scale computation offloading using a multi-agent reinforcement learning in heterogeneous multi-access edge computing,” IEEE Trans. Mob. Comput., vol. 22, no. 6, pp. 3425–3443, Jan 2023. [19] Y. Gong, H. Yao, J. Wang, M. Li, and S. Guo, “Edge intelligence-driven joint offloading and resource allocation for future 6G Industrial Internet of Things,” accepted for publication in IEEE Trans. Netw. Sci. Eng., 2024. [20] L. Liao, Y. Lai, F. Yang, and W. Zeng, “Online computation offloading with double reinforcement learning algorithm in mobile edge computing,” J Parallel Distrib Comput, vol. 171, pp. 28–39, Jan 2023. [21] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski et al., “Human-level control through deep reinforcement learning,” Nature, vol. 518, no. 7540, pp. 529–533, Feb 2015. [22] S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural Comput, vol. 9, no. 8, pp. 1735–1780, Sep 1997. [23] L. Yang, H. Zhang, X. Li, H. Ji, and V. C. Leung, “A distributed computation offloading strategy in small-cell networks integrated with mobile edge computing,” IEEE ACM Trans. Netw., vol. 26, no. 6, pp. 2762–2773, Dec 2018. [24] Y. Mao, J. Zhang, and K. B. Letaief, “Dynamic computation offloading for mobile-edge computing with energy harvesting devices,” IEEE J. Sel. Areas Commun., vol. 34, no. 12, pp. 3590–3605, Sep 2016. [25] A. Parekh and R. G. Gallager, “A generalized processor sharing approach to flow control in integrated services networks: The single-node case,” IEEE/ACM Trans. Netw, vol. 1, no. 3, pp. 344–357, Jun 1993. [26] Z. Wang, T. Schaul, M. Hessel, H. Hasselt, M. Lanctot, and N. Freitas, “Dueling network architectures for deep reinforcement learning,” in Proc. of International Conference on Machine Learning. New York, NY, Jun 2016.