Deep Reinforcement Learning (DRL) combines deep learning with reinforcement learning and has achieved notable results in recent years in areas such as automated control and games. Deep Deterministic Policy Gradient (DDPG) is an algorithm that combines value-function methods with policy-gradient methods and is designed for problems with continuous action spaces. This article walks through the DDPG model and provides a corresponding PyTorch code example.
DDPG Model Explained
DDPG is an Actor-Critic algorithm with two main components:
- Actor: the policy network, which outputs the action to take in the current state.
- Critic: the value network, which evaluates the quality of the action produced by the Actor.
DDPG's main innovation is the combination of experience replay and target networks, which together improve the stability and efficiency of training.
Key Techniques
- Experience Replay:
  - As the agent interacts with the environment, each transition it observes (state, action, reward, next state) is stored in a replay buffer.
  - During training, mini-batches are drawn from the buffer uniformly at random. Random sampling breaks the correlation between consecutive transitions and improves learning efficiency.
- Target Networks:
  - The Actor and the Critic each have a target network, used to compute the temporal-difference (TD) target for the Critic update and the target action for policy improvement.
  - The target networks are updated with a soft update, which prevents the targets from changing abruptly and destabilizing training.
- Inference and training:
  - At every time step, the Actor outputs an action for the current state and passes it to the environment, which returns the next state and a reward.
  - During training, the Critic first estimates the value of the current state-action pair, and the Actor then updates its policy using the Critic's feedback, as formalized in the update rules below.
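For reference, these two updates can be written compactly in standard DDPG notation (textbook material, matching what the code below implements). Let μ and Q denote the Actor and Critic, and μ' and Q' their target networks. For a sampled transition (s, a, r, s'), the Critic is trained to minimize the squared error between Q(s, a) and the TD target

y = r + γ · Q'(s', μ'(s'))

while the Actor is updated by gradient ascent on Q(s, μ(s)), i.e. by minimizing the loss −Q(s, μ(s)). After each update, every target parameter θ' is softly moved toward its online counterpart θ:

θ' ← τ · θ + (1 − τ) · θ'

with a small τ (0.005 in the code below).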
PyTorch Implementation
Below is a complete PyTorch code example implementing DDPG:
import numpy as np
import torch
import torch.nn as nn
import torch.optim as optim
import random
from collections import deque
# Hyperparameters
class Config:
    STATE_DIM = 3
    ACTION_DIM = 1
    MAX_EPISODES = 1000
    MAX_STEP = 200
    BUFFER_SIZE = 100000
    BATCH_SIZE = 64
    GAMMA = 0.99   # discount factor
    TAU = 0.005    # soft-update coefficient
    LR_ACTOR = 0.001
    LR_CRITIC = 0.001

# Actor network: maps a state to a deterministic action
class Actor(nn.Module):
    def __init__(self, state_dim, action_dim):
        super(Actor, self).__init__()
        self.fc1 = nn.Linear(state_dim, 400)
        self.fc2 = nn.Linear(400, 300)
        self.fc3 = nn.Linear(300, action_dim)

    def forward(self, state):
        x = torch.relu(self.fc1(state))
        x = torch.relu(self.fc2(x))
        return torch.tanh(self.fc3(x))  # tanh squashes the output into (-1, 1)

# Critic network: maps a (state, action) pair to a scalar Q-value
class Critic(nn.Module):
    def __init__(self, state_dim, action_dim):
        super(Critic, self).__init__()
        self.fc1 = nn.Linear(state_dim + action_dim, 400)
        self.fc2 = nn.Linear(400, 300)
        self.fc3 = nn.Linear(300, 1)

    def forward(self, state, action):
        x = torch.relu(self.fc1(torch.cat([state, action], dim=1)))
        x = torch.relu(self.fc2(x))
        return self.fc3(x)

# DDPG agent
class DDPG:
    def __init__(self):
        self.actor = Actor(Config.STATE_DIM, Config.ACTION_DIM)
        self.actor_target = Actor(Config.STATE_DIM, Config.ACTION_DIM)
        self.critic = Critic(Config.STATE_DIM, Config.ACTION_DIM)
        self.critic_target = Critic(Config.STATE_DIM, Config.ACTION_DIM)
        # Initialize the target networks with the same weights as the online networks
        self.actor_target.load_state_dict(self.actor.state_dict())
        self.critic_target.load_state_dict(self.critic.state_dict())
        self.memory = deque(maxlen=Config.BUFFER_SIZE)  # replay buffer
        self.actor_optimizer = optim.Adam(self.actor.parameters(), lr=Config.LR_ACTOR)
        self.critic_optimizer = optim.Adam(self.critic.parameters(), lr=Config.LR_CRITIC)

    def update(self):
        if len(self.memory) < Config.BATCH_SIZE:
            return
        # Sample a random mini-batch from the replay buffer
        batch = random.sample(self.memory, Config.BATCH_SIZE)
        state, action, reward, next_state = zip(*batch)
        state = torch.FloatTensor(np.array(state))
        action = torch.FloatTensor(np.array(action))
        reward = torch.FloatTensor(reward).unsqueeze(1)
        next_state = torch.FloatTensor(np.array(next_state))

        # Update the Critic: regress Q(s, a) toward the TD target r + gamma * Q'(s', mu'(s'))
        next_action = self.actor_target(next_state)
        target_q = self.critic_target(next_state, next_action)
        target_q = reward + Config.GAMMA * target_q.detach()
        current_q = self.critic(state, action)
        critic_loss = nn.MSELoss()(current_q, target_q)
        self.critic_optimizer.zero_grad()
        critic_loss.backward()
        self.critic_optimizer.step()

        # Update the Actor: maximize Q(s, mu(s)) by minimizing its negative
        actor_loss = -self.critic(state, self.actor(state)).mean()
        self.actor_optimizer.zero_grad()
        actor_loss.backward()
        self.actor_optimizer.step()

        # Soft-update the target networks: theta' <- tau * theta + (1 - tau) * theta'
        for target_param, param in zip(self.actor_target.parameters(), self.actor.parameters()):
            target_param.data.copy_(Config.TAU * param.data + (1.0 - Config.TAU) * target_param.data)
        for target_param, param in zip(self.critic_target.parameters(), self.critic.parameters()):
            target_param.data.copy_(Config.TAU * param.data + (1.0 - Config.TAU) * target_param.data)

# Training loop on a toy environment; each episode runs a fixed MAX_STEP steps
def train():
    agent = DDPG()
    for episode in range(Config.MAX_EPISODES):
        state = np.random.rand(Config.STATE_DIM)  # initial state
        total_reward = 0
        for step in range(Config.MAX_STEP):
            action = agent.actor(torch.FloatTensor(state)).detach().numpy()
            next_state = state + action  # toy state transition in place of a real environment
            reward = -np.sum(np.square(action))  # toy reward: quadratic action penalty
            # Store the transition and run one training update
            agent.memory.append((state, action, reward, next_state))
            agent.update()
            state = next_state
            total_reward += reward
        print(f'Episode {episode+1}, Total Reward: {total_reward}')

if __name__ == "__main__":
    train()
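The toy transition above (state + action with a quadratic penalty) exists only to keep the listing self-contained. As a rough sketch of how the same agent could be driven by a real environment, the snippet below assumes the gym package and its Pendulum-v1 task (whose observation and action dimensions happen to match STATE_DIM = 3 and ACTION_DIM = 1), and it assumes the classic gym API in which reset() returns an observation and step() returns (obs, reward, done, info); newer gymnasium releases return extra values and would need small adjustments. It also adds simple Gaussian exploration noise, which standard DDPG relies on but the minimal loop above omits. The function name train_on_gym and the noise_std parameter are illustrative choices, not part of the original code.

import gym  # assumed extra dependency, not used in the listing above

def train_on_gym(env_name="Pendulum-v1", noise_std=0.1):
    env = gym.make(env_name)
    agent = DDPG()
    for episode in range(Config.MAX_EPISODES):
        state = env.reset()  # classic gym API: reset() returns the observation
        total_reward = 0
        for step in range(Config.MAX_STEP):
            with torch.no_grad():
                action = agent.actor(torch.FloatTensor(state)).numpy()
            # Add Gaussian exploration noise and keep the action in the Actor's (-1, 1) range
            action = np.clip(action + np.random.normal(0, noise_std, size=action.shape), -1.0, 1.0)
            # Pendulum-v1 expects torques in [-2, 2], so rescale the normalized action
            next_state, reward, done, _ = env.step(action * 2.0)
            agent.memory.append((state, action, reward, next_state))
            agent.update()
            state = next_state
            total_reward += reward
            if done:
                break
        print(f'Episode {episode+1}, Total Reward: {total_reward:.2f}')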
Conclusion
DDPG is a powerful algorithm for problems with continuous action spaces. Thanks to techniques such as experience replay and target networks, it performs well in many practical applications. The code above provides a basic DDPG implementation that can be adapted and tuned for a specific application. Hopefully this article helps readers better understand DDPG and its role in deep reinforcement learning.