A Brief Summary of Deep Q-Network

Deep Q-Network (DQN) is a deep reinforcement learning method first proposed by the Google DeepMind team in 2013 [Paper]. The algorithm uses a neural network in place of the table that stores cumulative expected returns in the Q-learning algorithm, which lets it handle problems with high-dimensional state spaces effectively. The DeepMind team creatively combined convolutional neural networks from deep learning with Q-learning from reinforcement learning to train agents that learn to control characters in Atari games and achieve strong game performance.


Neural Network

The DQN algorithm takes the agent's state as the input to a neural network, which then outputs the Q values of all actions.
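
As a rough sketch of such a network (this summary does not prescribe a framework, so PyTorch is assumed here, and `state_dim`, `n_actions`, and the hidden layer size are illustrative):

```python
import torch
import torch.nn as nn


class QNetwork(nn.Module):
    """Maps a state vector to one Q value per action."""

    def __init__(self, state_dim: int, n_actions: int, hidden: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, hidden),
            nn.ReLU(),
            nn.Linear(hidden, n_actions),  # one output per action
        )

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        # Input: a batch of states, shape (batch, state_dim).
        # Output: Q values for every action, shape (batch, n_actions).
        return self.net(state)
```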

Online Network and Target Network

Two neural networks are used in DQN: an online network and a target network. The online network continuously updates its parameters $\theta$ (the weights $\omega$ and biases $b$ of each neuron) during training and is used to compute the Q estimates. The parameters of the target network are temporarily frozen and only updated every so often; they are used to compute the Q target values.

The target network $\hat{Q}(s,a;\hat{\theta}_i)$ has the same structure as the online network $Q(s,a;\theta_i)$. The weights $\theta$ of the online network $Q$ are updated continuously during training, and every fixed number of time steps the target network $\hat{Q}$ synchronizes with the parameters of the online network, that is, $\hat{\theta}_i=\theta_i$. Keeping the Q target value fixed for a period of time reduces the correlation between the Q estimate and the Q target, which improves the stability of the algorithm.
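
A sketch of the synchronization, reusing the `QNetwork` above (the state/action sizes and the `SYNC_EVERY` interval are illustrative assumptions):

```python
online_net = QNetwork(state_dim=4, n_actions=2)       # continuously trained
target_net = QNetwork(state_dim=4, n_actions=2)       # same structure, temporarily frozen
target_net.load_state_dict(online_net.state_dict())   # start from identical weights

SYNC_EVERY = 1000  # fixed number of time steps between synchronizations


def maybe_sync(step: int) -> None:
    # Every SYNC_EVERY steps, copy the online parameters into the target network,
    # i.e. set theta_hat = theta; in between, the Q target stays fixed.
    if step % SYNC_EVERY == 0:
        target_net.load_state_dict(online_net.state_dict())
```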

For the neural network itself, either a fully connected network or a convolutional neural network can be used, depending on the characteristics of the problem. If the input consists of images or video frames, a convolutional neural network is the better choice.
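
For image input, a convolutional Q-network might look like the following sketch (the layer sizes assume four stacked 84x84 grayscale frames, in the spirit of the Atari setup, and are not specified by this summary):

```python
import torch
import torch.nn as nn


class ConvQNetwork(nn.Module):
    """Convolutional Q-network for stacked grayscale game frames."""

    def __init__(self, in_channels: int = 4, n_actions: int = 6):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(in_channels, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),
        )
        self.head = nn.Sequential(
            nn.Flatten(),
            nn.Linear(64 * 7 * 7, 512), nn.ReLU(),  # 7x7 feature map for 84x84 input
            nn.Linear(512, n_actions),
        )

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (batch, in_channels, 84, 84) with pixel values in [0, 255]
        return self.head(self.features(frames / 255.0))
```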

Loss Function

The loss function measures the prediction performance of the model. The mean squared error loss $L_i(\theta_i)$ is defined as follows:

$$L_i(\theta_i)=\mathbb{E}_{s,a,r,s'}\left[\left(y_i-Q(s,a;\theta_i)\right)^2\right]=\mathbb{E}_{s,a,r,s'}\left[\left(r+\gamma \max_{a'}\hat{Q}(s',a';\hat{\theta}_i)-Q(s,a;\theta_i)\right)^2\right]$$
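
A sketch of this loss for one mini-batch of transitions, in the same assumed PyTorch setting (the `dones` mask for terminal states is a common practical detail that the formula leaves implicit):

```python
import torch
import torch.nn.functional as F


def dqn_loss(batch, online_net, target_net, gamma: float = 0.99):
    states, actions, rewards, next_states, dones = batch  # tensors sampled from the replay buffer

    # Q estimate: Q(s, a; theta) for the actions that were actually taken.
    q_values = online_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)

    # Q target: r + gamma * max_a' Q_hat(s', a'; theta_hat), computed without gradients.
    with torch.no_grad():
        next_q = target_net(next_states).max(dim=1).values
        targets = rewards + gamma * next_q * (1.0 - dones)  # no bootstrapping on terminal states

    return F.mse_loss(q_values, targets)
```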

Mini-batch gradient descent can be used to update the weights of the online network: each update randomly selects $b$ samples from the stored training data for learning. The gradient of the loss function $L_i(\theta_i)$ is as follows:

$$\nabla_{\theta_i}L_i(\theta_i)=\mathbb{E}_{s,a,r,s'}\left[\left(r+\gamma \max_{a'}\hat{Q}(s',a';\hat{\theta}_i)-Q(s,a;\theta_i)\right)\nabla_{\theta_i}Q(s,a;\theta_i)\right]$$
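
In the assumed PyTorch setting this gradient is produced by automatic differentiation, so one update only needs to sample a mini-batch, evaluate the loss above, and take an optimizer step (the buffer, optimizer, and batch size are illustrative):

```python
import torch

BATCH_SIZE = 32  # the b samples drawn for each update


def train_step(replay_buffer, online_net, target_net, optimizer) -> float:
    batch = replay_buffer.sample(BATCH_SIZE)        # b randomly selected transitions
    loss = dqn_loss(batch, online_net, target_net)  # mean squared TD error (see sketch above)
    optimizer.zero_grad()
    loss.backward()   # autograd computes the gradient of the loss w.r.t. theta
    optimizer.step()  # one gradient descent step on the online network's weights
    return loss.item()


# Example wiring, reusing the QNetwork sketch above:
# optimizer = torch.optim.Adam(online_net.parameters(), lr=1e-4)
```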


Experience Replay

In deep learning, a large amount of manually labeled training data is usually prepared in advance as input, and the algorithm learns the corresponding patterns from it. In reinforcement learning, by contrast, the agent learns autonomously by continuously taking actions and interacting with the environment.

In addition, deep learning generally requires the training data to be independently and identically distributed. In reinforcement learning, the states of the agent at adjacent time steps are often highly correlated, and the data distribution keeps changing as the agent learns.

Experience replay is designed precisely to solve the problems mentioned above. The experience $e_t = (s_t, a_t, r_t, s_{t+1})$ gained from each step of interaction between the agent and the environment is stored in the replay buffer $D = [e_1, e_2, \cdots, e_n]$. During each training step, some experiences are randomly sampled from the buffer for learning, which effectively reduces the correlation between data.
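
A minimal replay buffer sketch built on a fixed-size deque (the capacity and the extra `done` flag are assumptions of this sketch rather than of the description above):

```python
import random
from collections import deque

import numpy as np
import torch


class ReplayBuffer:
    """Stores experiences e_t = (s_t, a_t, r_t, s_{t+1}, done) and samples them uniformly."""

    def __init__(self, capacity: int = 100_000):
        self.buffer = deque(maxlen=capacity)  # oldest experiences are discarded when full

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size: int):
        batch = random.sample(self.buffer, batch_size)  # random sampling breaks temporal correlation
        states, actions, rewards, next_states, dones = map(np.asarray, zip(*batch))
        return (torch.as_tensor(states, dtype=torch.float32),
                torch.as_tensor(actions, dtype=torch.int64),
                torch.as_tensor(rewards, dtype=torch.float32),
                torch.as_tensor(next_states, dtype=torch.float32),
                torch.as_tensor(dones, dtype=torch.float32))

    def __len__(self):
        return len(self.buffer)
```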


Epsilon Greedy Strategy

Like Q-learning, DQN requires a balance between exploration and exploitation. If too much emphasis is placed on exploration, the algorithm tends to choose actions completely at random. If too much emphasis is placed on exploitation, the algorithm always selects the action with the largest Q value, which may cause the agent to fall into a local optimum and fail to reach the target state.

Commonly used strategies to balance exploration and exploitation include Epsilon Greedy and Boltzmann Exploration.

In the Epsilon Greedy strategy, a random number between 0 and 1 is generated every time an action is selected. If the random number is smaller than $\epsilon$, an action is chosen at random; if it is larger than $\epsilon$, the action with the largest Q value is selected. As training proceeds, $\epsilon$ keeps decaying as the time step grows. The update rule for $\epsilon$ is as follows:

$$\epsilon=\epsilon_{\min}+(\epsilon_{\max}-\epsilon_{\min})\,e^{-\lambda_{\epsilon} E}$$

$E$ is the cumulative number of training iterations so far. As training continues, $\epsilon$ decays exponentially from $\epsilon_{\max}$ to $\epsilon_{\min}$. In the initial stage, $\epsilon$ takes a larger value, which encourages the agent to explore randomly. As the neural network becomes better trained, $\epsilon$ gradually decreases and the agent relies more on the network's predictions when choosing actions.
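
A sketch of the selection rule together with this decay schedule (the values of $\epsilon_{\max}$, $\epsilon_{\min}$, and $\lambda_{\epsilon}$ below are illustrative, not values given here):

```python
import math
import random

import torch

EPS_MAX, EPS_MIN, EPS_LAMBDA = 1.0, 0.05, 0.001  # epsilon_max, epsilon_min, lambda_epsilon


def epsilon_at(iteration: int) -> float:
    """Exponentially decay epsilon from EPS_MAX towards EPS_MIN as training progresses."""
    return EPS_MIN + (EPS_MAX - EPS_MIN) * math.exp(-EPS_LAMBDA * iteration)


def select_action(state, iteration: int, online_net, n_actions: int) -> int:
    epsilon = epsilon_at(iteration)
    # Exploration: with probability epsilon, pick a random action.
    if random.random() < epsilon:
        return random.randrange(n_actions)
    # Exploitation: otherwise pick the action with the largest predicted Q value.
    with torch.no_grad():
        q_values = online_net(torch.as_tensor(state, dtype=torch.float32).unsqueeze(0))
    return int(q_values.argmax(dim=1).item())
```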


Double Deep Q-Network

Although the DQN algorithm (2015) also uses two neural networks, its target Q value is obtained by directly taking the largest Q value predicted by the target network, so the Q values tend to be overestimated.

The Double Deep Q-Network (DDQN) separates action selection from the calculation of the target Q value, which mitigates the overestimation problem and improves the stability of the algorithm. DDQN is otherwise basically the same as DQN; the only difference lies in how the target Q value is calculated.

In the DQN algorithm, the target Q value is calculated directly from the maximum output of the target network:

$$Q(s_t,a_t)=r_t+\gamma \max_{a'}\hat{Q}(s_{t+1},a';\hat{\theta})$$
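
In the assumed PyTorch setting, this target can be sketched as:

```python
import torch


def dqn_target(rewards, next_states, target_net, gamma: float = 0.99):
    # DQN: the target network both selects and evaluates the greedy next action.
    with torch.no_grad():
        max_next_q = target_net(next_states).max(dim=1).values
        return rewards + gamma * max_next_q
```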

In the DDQN algorithm, the action with the largest Q value is first selected with the online network $Q(s,a;\theta)$, and then the Q value of that action is computed with the target network $\hat{Q}(s,a;\hat{\theta})$. The formula is as follows:

$$Q(s_t,a_t)=r_t+\gamma\,\hat{Q}\left(s_{t+1},\arg\max_{a}Q(s_{t+1},a;\theta);\hat{\theta}\right)$$
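
The corresponding sketch makes the two-step split explicit: the online network selects the action and the target network evaluates it.

```python
import torch


def ddqn_target(rewards, next_states, online_net, target_net, gamma: float = 0.99):
    with torch.no_grad():
        # Step 1 (selection): the online network picks argmax_a Q(s_{t+1}, a; theta).
        best_actions = online_net(next_states).argmax(dim=1, keepdim=True)
        # Step 2 (evaluation): the target network scores those actions.
        next_q = target_net(next_states).gather(1, best_actions).squeeze(1)
        return rewards + gamma * next_q
```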