Deep Reinforcement Learning (DRL) combines deep learning with reinforcement learning and has achieved remarkable results in recent years in areas such as automatic control and games. Deep Deterministic Policy Gradient (DDPG) is an algorithm that combines value-function methods with policy-gradient methods and is designed specifically for problems with continuous action spaces. This article walks through the DDPG model and provides a corresponding PyTorch code example.

Understanding the DDPG Model

DDPG is an Actor-Critic algorithm consisting of two main components:

  1. Actor: the policy model, responsible for outputting the action to take in the current state.
  2. Critic: the value model, responsible for evaluating the quality of the action produced by the Actor.

The main innovation of DDPG is that it combines the ideas of experience replay and target networks, which improves the stability and efficiency of training.

Key Techniques

  1. Experience Replay
     - During training, the agent interacts with the environment, and each transition (state, action, reward, next state) is stored in a replay buffer.
     - Each update samples a random mini-batch from the buffer; random sampling breaks the correlation between consecutive samples and makes learning more efficient.

  2. Target Network
     - The Actor and the Critic each have a target network, used to compute the temporal-difference target for the Critic update and the target action for policy improvement.
     - The target networks are updated with a soft update, which prevents drastic oscillations during training (the update rules are sketched in code right after this list).

  3. Training and Inference
     - At each time step, the Actor outputs an action for the current state and passes it to the environment, which returns the next state and a reward.
     - During training, the Critic first evaluates the value of the current state-action pair, and the Actor then updates its policy based on the Critic's feedback.
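
To make these update rules concrete before the full implementation, here is a minimal, self-contained sketch of the TD target, the two losses, and the soft update. The tiny linear networks and random tensors below are placeholders chosen purely for illustration; they are not part of the implementation that follows.

import torch
import torch.nn as nn

# Placeholder networks and a random mini-batch -- for illustration only
critic = nn.Linear(3 + 1, 1); critic_target = nn.Linear(3 + 1, 1)
actor = nn.Linear(3, 1); actor_target = nn.Linear(3, 1)
s = torch.randn(64, 3); a = torch.randn(64, 1)
r = torch.randn(64, 1); s_next = torch.randn(64, 3)
gamma, tau = 0.99, 0.005

# TD target computed entirely from the target networks
y = r + gamma * critic_target(torch.cat([s_next, actor_target(s_next)], dim=1)).detach()

# Critic loss: regress Q(s, a) toward the TD target
critic_loss = nn.MSELoss()(critic(torch.cat([s, a], dim=1)), y)

# Actor loss: maximize Q(s, actor(s)), i.e. minimize its negative
actor_loss = -critic(torch.cat([s, actor(s)], dim=1)).mean()

# Soft update: theta_target <- tau * theta + (1 - tau) * theta_target
for target_param, param in zip(critic_target.parameters(), critic.parameters()):
    target_param.data.copy_(tau * param.data + (1 - tau) * target_param.data)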

PyTorch Implementation

Below is a complete PyTorch code example implementing DDPG:

import numpy as np
import torch
import torch.nn as nn
import torch.optim as optim
import random
from collections import deque

# Hyperparameters
class Config:
    STATE_DIM = 3
    ACTION_DIM = 1
    MAX_EPISODES = 1000
    MAX_STEP = 200
    BUFFER_SIZE = 100000
    BATCH_SIZE = 64
    GAMMA = 0.99
    TAU = 0.005
    LR_ACTOR = 0.001
    LR_CRITIC = 0.001

# Actor network
class Actor(nn.Module):
    def __init__(self, state_dim, action_dim):
        super(Actor, self).__init__()
        self.fc1 = nn.Linear(state_dim, 400)
        self.fc2 = nn.Linear(400, 300)
        self.fc3 = nn.Linear(300, action_dim)

    def forward(self, state):
        x = torch.relu(self.fc1(state))
        x = torch.relu(self.fc2(x))
        return torch.tanh(self.fc3(x))  # tanh constrains the output to (-1, 1)

# Critic network
class Critic(nn.Module):
    def __init__(self, state_dim, action_dim):
        super(Critic, self).__init__()
        self.fc1 = nn.Linear(state_dim + action_dim, 400)
        self.fc2 = nn.Linear(400, 300)
        self.fc3 = nn.Linear(300, 1)

    def forward(self, state, action):
        x = torch.relu(self.fc1(torch.cat([state, action], dim=1)))
        x = torch.relu(self.fc2(x))
        return self.fc3(x)

# DDPG agent
class DDPG:
    def __init__(self):
        self.actor = Actor(Config.STATE_DIM, Config.ACTION_DIM)
        self.actor_target = Actor(Config.STATE_DIM, Config.ACTION_DIM)
        self.critic = Critic(Config.STATE_DIM, Config.ACTION_DIM)
        self.critic_target = Critic(Config.STATE_DIM, Config.ACTION_DIM)

        self.actor_target.load_state_dict(self.actor.state_dict())
        self.critic_target.load_state_dict(self.critic.state_dict())

        self.memory = deque(maxlen=Config.BUFFER_SIZE)
        self.actor_optimizer = optim.Adam(self.actor.parameters(), lr=Config.LR_ACTOR)
        self.critic_optimizer = optim.Adam(self.critic.parameters(), lr=Config.LR_CRITIC)

    def update(self):
        if len(self.memory) < Config.BATCH_SIZE:
            return

        # Sample a random mini-batch from the replay buffer
        batch = random.sample(self.memory, Config.BATCH_SIZE)
        state, action, reward, next_state = zip(*batch)

        state = torch.FloatTensor(np.array(state))
        action = torch.FloatTensor(np.array(action))
        reward = torch.FloatTensor(np.array(reward)).unsqueeze(1)
        next_state = torch.FloatTensor(np.array(next_state))

        # Critic update: regress Q(s, a) toward the TD target
        next_action = self.actor_target(next_state)
        target_q = self.critic_target(next_state, next_action)
        target_q = reward + Config.GAMMA * target_q.detach()

        current_q = self.critic(state, action)
        critic_loss = nn.MSELoss()(current_q, target_q)
        self.critic_optimizer.zero_grad()
        critic_loss.backward()
        self.critic_optimizer.step()

        # Actor update: maximize Q(s, actor(s)) by minimizing its negative
        actor_loss = -self.critic(state, self.actor(state)).mean()
        self.actor_optimizer.zero_grad()
        actor_loss.backward()
        self.actor_optimizer.step()

        # Soft-update the target networks
        for target_param, param in zip(self.actor_target.parameters(), self.actor.parameters()):
            target_param.data.copy_(Config.TAU * param.data + (1.0 - Config.TAU) * target_param.data)

        for target_param, param in zip(self.critic_target.parameters(), self.critic.parameters()):
            target_param.data.copy_(Config.TAU * param.data + (1.0 - Config.TAU) * target_param.data)

# Training loop
def train():
    agent = DDPG()

    for episode in range(Config.MAX_EPISODES):
        state = np.random.rand(Config.STATE_DIM)  # random initial state (toy environment)
        total_reward = 0

        for step in range(Config.MAX_STEP):
            action = agent.actor(torch.FloatTensor(state)).detach().numpy()
            next_state = state + action  # simulated state transition of the toy environment
            reward = -np.sum(np.square(action))  # toy reward

            # Store the transition in the replay buffer
            agent.memory.append((state, action, reward, next_state))
            agent.update()

            state = next_state
            total_reward += reward

        print(f'Episode {episode+1}, Total Reward: {total_reward}')

if __name__ == "__main__":
    train()
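
The train() function above uses a stand-in environment so the code runs without extra dependencies. As a hedged sketch of how the same agent could drive a real continuous-control task, the snippet below reuses the DDPG class and Config defined above and assumes the gymnasium package is installed; Pendulum-v1 is chosen because its 3-dimensional observation and 1-dimensional action happen to match Config.STATE_DIM and Config.ACTION_DIM. The train_pendulum name, the Gaussian exploration noise, and the action rescaling are assumptions added here; they are not part of the original code.

import gymnasium as gym  # assumed dependency, only needed for this sketch

def train_pendulum(noise_std=0.1):
    env = gym.make("Pendulum-v1")
    agent = DDPG()
    action_scale = float(env.action_space.high[0])  # Pendulum actions lie in [-2, 2]

    for episode in range(Config.MAX_EPISODES):
        state, _ = env.reset()
        total_reward = 0.0

        for step in range(Config.MAX_STEP):
            with torch.no_grad():
                action = agent.actor(torch.FloatTensor(state)).numpy()
            # Gaussian exploration noise, clipped back to the actor's (-1, 1) range
            action = np.clip(action + noise_std * np.random.randn(Config.ACTION_DIM), -1.0, 1.0)

            next_state, reward, terminated, truncated, _ = env.step(action * action_scale)

            agent.memory.append((state, action, reward, next_state))
            agent.update()

            state = next_state
            total_reward += reward
            if terminated or truncated:
                break

        print(f'Episode {episode+1}, Total Reward: {total_reward:.1f}')

Pendulum-v1 episodes end only by time-limit truncation rather than true termination, so bootstrapping the TD target on the final transition, as update() does, remains a reasonable choice here.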

Conclusion

DDPG is a powerful algorithm for problems with continuous action spaces. By using techniques such as experience replay and target networks, it performs well in many practical applications. The code above provides a basic DDPG implementation that can be adapted and tuned for specific use cases. I hope this article helps readers better understand the DDPG model and its role in deep reinforcement learning.
