Reinforcement Learning Fundamentals

Advanced Actor-Critic Methods


Overview

In this lesson, we will delve into advanced Actor-Critic methods, focusing on Proximal Policy Optimization (PPO) and Deep Deterministic Policy Gradient (DDPG). These algorithms address some of the limitations of basic Actor-Critic methods and are widely used in practice for more complex environments.


1. Proximal Policy Optimization (PPO)

Proximal Policy Optimization (PPO) is an advanced Actor-Critic algorithm designed to improve training stability and efficiency. It aims to optimize the policy while keeping changes within a "trust region" to prevent large, destabilizing updates.

Key Concepts:

  • Surrogate Objective: Rather than differentiating the true policy-gradient objective directly, PPO optimizes a surrogate built from the probability ratio between the new and old policies.
  • Clipped Objective Function: The surrogate is clipped so that a single batch of updates cannot move the policy too far from the one that collected the data.
  • Advantage Function: PPO typically uses Generalized Advantage Estimation (GAE) to compute the advantage estimates.

Objective Function: The PPO objective function is: [ L^{\text{PPO}}(\theta) = \mathbb{E}_t \left[ \min \left( \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\text{old}}(a_t \mid s_t)} \hat{A}_t, \ \text{clip} \left( \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\text{old}}(a_t \mid s_t)}, 1 - \epsilon, 1 + \epsilon \right) \hat{A}_t \right) \right] ] Where:

  • ( \pi_\theta ) is the current policy.
  • ( \pi_{\text{old}} ) is the previous policy.
  • ( \hat{A}_t ) is the advantage estimate.
  • ( \epsilon ) is the clipping parameter.
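
To make the clipping concrete, here is a small self-contained sketch (the ratio and advantage values are made up purely for illustration) that evaluates the per-timestep clipped surrogate:

import numpy as np

def clipped_surrogate(ratio, advantage, epsilon=0.2):
    # PPO takes the minimum of the unclipped and clipped terms, which
    # removes the incentive to push the probability ratio outside
    # [1 - epsilon, 1 + epsilon]
    unclipped = ratio * advantage
    clipped = np.clip(ratio, 1 - epsilon, 1 + epsilon) * advantage
    return np.minimum(unclipped, clipped)

# A ratio of 1.5 with a positive advantage is capped at 1.2 * 2.0 = 2.4,
# so the objective stops rewarding further movement in that direction
print(clipped_surrogate(1.5, 2.0))   # 2.4
# With a negative advantage, the min keeps the more pessimistic term:
# min(0.5 * -1.0, 0.8 * -1.0) = -0.8
print(clipped_surrogate(0.5, -1.0))  # -0.8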

Algorithm Steps:

  1. Initialize policy and value networks.
  2. For each iteration:
    1. Collect data using the current policy.
    2. Compute advantages and returns.
    3. Update the policy using the clipped surrogate objective.
    4. Update the value function.

Python Implementation: Here is a simplified implementation of PPO using TensorFlow.

Step 1: Install Required Libraries

pip install gym numpy tensorflow

Step 2: Implement PPO

import numpy as np
import gym
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.optimizers import Adam

# Initialize parameters
gamma = 0.99  # Discount factor
lambda_ = 0.95  # GAE lambda
epsilon = 0.2  # Clipping parameter
learning_rate = 0.001
num_episodes = 1000
train_epochs = 4  # Gradient steps per batch; the clipping only matters after the first step

# Create environment
env = gym.make('CartPole-v1')
n_actions = env.action_space.n
n_states = env.observation_space.shape[0]

# Define the actor network
actor = Sequential([
    Dense(24, activation='relu', input_shape=(n_states,)),
    Dense(24, activation='relu'),
    Dense(n_actions, activation='softmax')
])
actor_optimizer = Adam(learning_rate=learning_rate)

# Define the critic network
critic = Sequential([
    Dense(24, activation='relu', input_shape=(n_states,)),
    Dense(24, activation='relu'),
    Dense(1)  # Output a single value
])
critic_optimizer = Adam(learning_rate=learning_rate)

def policy_action(state):
    # Sample an action from the categorical distribution output by the actor
    state = np.expand_dims(state, axis=0)
    probs = actor.predict(state, verbose=0)[0]
    return np.random.choice(n_actions, p=probs)

def compute_advantages(rewards, values, gamma, lambda_):
    # Generalized Advantage Estimation (GAE). `values` holds one entry per
    # step plus a bootstrap value for the state after the final transition.
    rewards = np.asarray(rewards, dtype=np.float32)
    values = np.asarray(values, dtype=np.float32)
    deltas = rewards + gamma * values[1:] - values[:-1]
    advantages = np.zeros_like(rewards)
    running_sum = 0.0
    for t in reversed(range(len(rewards))):
        running_sum = deltas[t] + gamma * lambda_ * running_sum
        advantages[t] = running_sum
    return advantages

def compute_returns(rewards, gamma):
    # Discounted Monte Carlo returns, used as regression targets for the critic
    returns = np.zeros(len(rewards), dtype=np.float32)
    running_sum = 0.0
    for t in reversed(range(len(rewards))):
        running_sum = rewards[t] + gamma * running_sum
        returns[t] = running_sum
    return returns

# Training loop
for episode in range(num_episodes):
    state, _ = env.reset()  # gym >= 0.26 returns (observation, info)
    states, actions, rewards, values = [], [], [], []

    done = False
    while not done:
        states.append(state)
        action = policy_action(state)
        next_state, reward, terminated, truncated, _ = env.step(action)
        done = terminated or truncated

        actions.append(action)
        rewards.append(reward)
        values.append(critic.predict(np.expand_dims(state, axis=0), verbose=0)[0][0])

        state = next_state

    # Bootstrap value for the state after the last transition; the episode
    # has ended, so we treat it as zero for simplicity
    values.append(0.0)

    # Compute advantages and returns
    advantages = compute_advantages(rewards, values, gamma, lambda_)
    returns = compute_returns(rewards, gamma)

    # Update policy with the clipped surrogate objective. The old-policy
    # probabilities are computed once, outside the tape, so they stay fixed
    # while several gradient steps are taken on the same batch.
    states_tensor = tf.convert_to_tensor(np.array(states), dtype=tf.float32)
    actions_one_hot = tf.one_hot(actions, n_actions)
    advantages_tensor = tf.convert_to_tensor(advantages, dtype=tf.float32)
    old_probs = tf.reduce_sum(actions_one_hot * actor(states_tensor), axis=1)

    for _ in range(train_epochs):
        with tf.GradientTape() as tape:
            new_probs = tf.reduce_sum(actions_one_hot * actor(states_tensor), axis=1)
            ratio = new_probs / (old_probs + 1e-10)
            clipped_ratio = tf.clip_by_value(ratio, 1 - epsilon, 1 + epsilon)
            loss = -tf.reduce_mean(tf.minimum(ratio * advantages_tensor,
                                              clipped_ratio * advantages_tensor))
        grads = tape.gradient(loss, actor.trainable_variables)
        actor_optimizer.apply_gradients(zip(grads, actor.trainable_variables))

    # Update critic toward the discounted returns
    with tf.GradientTape() as tape:
        values_pred = tf.squeeze(critic(states_tensor), axis=1)  # Shape (T,), matching returns
        loss = tf.reduce_mean(tf.square(tf.convert_to_tensor(returns, dtype=tf.float32) - values_pred))

    grads = tape.gradient(loss, critic.trainable_variables)
    critic_optimizer.apply_gradients(zip(grads, critic.trainable_variables))

env.close()

2. Deep Deterministic Policy Gradient (DDPG)

Deep Deterministic Policy Gradient (DDPG) is an advanced Actor-Critic algorithm designed for continuous action spaces. It maintains both actor and critic networks, together with target networks and an experience replay buffer, to improve stability and sample efficiency.

Key Concepts:

  • Deterministic Policy: Unlike PPO's stochastic policy, DDPG learns a deterministic policy that maps each state directly to an action; exploration comes from noise added to the chosen actions during training.
  • Experience Replay: Past transitions are stored in a buffer and sampled at random, breaking the correlation between consecutive experiences.
  • Target Networks: Slowly updated copies of the actor and critic provide stable targets for the Bellman update (both components are sketched below).
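
A minimal sketch of these two components, using only NumPy and the standard library (the class name ReplayBuffer, the buffer capacity, and the rate tau = 0.005 are illustrative choices, not fixed by the algorithm):

import random
from collections import deque
import numpy as np

class ReplayBuffer:
    # Fixed-size buffer that stores transitions and samples them
    # uniformly at random, breaking temporal correlations
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)

    def add(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        batch = random.sample(self.buffer, batch_size)
        states, actions, rewards, next_states, dones = map(np.array, zip(*batch))
        return states, actions, rewards, next_states, dones

def soft_update(target_weights, online_weights, tau=0.005):
    # Polyak averaging: target <- tau * online + (1 - tau) * target.
    # A small tau makes the targets drift slowly, which stabilizes learning
    return [tau * w + (1.0 - tau) * tw
            for w, tw in zip(online_weights, target_weights)]

With Keras models, the soft update would be applied after each learning step as target.set_weights(soft_update(target.get_weights(), online.get_weights())).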

Algorithm Steps:

  1. Initialize actor and critic networks, along with target networks and experience replay buffer.
  2. For each episode:
    1. Collect data using the current policy.
    2. Store experiences in the replay buffer.
    3. Sample a batch from the replay buffer.
    4. Update the critic by minimizing the Bellman error on the sampled batch.
    5. Update the actor by following the deterministic policy gradient.
    6. Soft-update the target networks (steps 4-6 are sketched below).
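
As a rough sketch of steps 4-6, assuming that actor, critic, target_actor, and target_critic are Keras models (with the critic taking a concatenated state-action vector), that actor_optimizer and critic_optimizer are Adam instances, and that the sampled batch comes from a replay buffer like the one above (all of these names are illustrative):

import tensorflow as tf

gamma = 0.99  # Discount factor
tau = 0.005   # Target network update rate

def ddpg_update(states, actions, rewards, next_states, dones):
    states = tf.convert_to_tensor(states, dtype=tf.float32)
    actions = tf.convert_to_tensor(actions, dtype=tf.float32)
    rewards = tf.convert_to_tensor(rewards, dtype=tf.float32)
    next_states = tf.convert_to_tensor(next_states, dtype=tf.float32)
    dones = tf.convert_to_tensor(dones, dtype=tf.float32)

    # Step 4: critic update. The Bellman target uses the target networks
    # and is cut off at terminal states
    next_actions = target_actor(next_states)
    target_q = tf.squeeze(target_critic(tf.concat([next_states, next_actions], axis=1)), axis=1)
    y = rewards + gamma * (1.0 - dones) * target_q
    with tf.GradientTape() as tape:
        q = tf.squeeze(critic(tf.concat([states, actions], axis=1)), axis=1)
        critic_loss = tf.reduce_mean(tf.square(y - q))
    grads = tape.gradient(critic_loss, critic.trainable_variables)
    critic_optimizer.apply_gradients(zip(grads, critic.trainable_variables))

    # Step 5: actor update. Maximize the critic's value of the actor's
    # own actions by minimizing its negative
    with tf.GradientTape() as tape:
        actor_loss = -tf.reduce_mean(critic(tf.concat([states, actor(states)], axis=1)))
    grads = tape.gradient(actor_loss, actor.trainable_variables)
    actor_optimizer.apply_gradients(zip(grads, actor.trainable_variables))

    # Step 6: soft-update both target networks (Polyak averaging)
    for target, online in ((target_actor, actor), (target_critic, critic)):
        target.set_weights([tau * w + (1.0 - tau) * tw
                            for w, tw in zip(online.get_weights(), target.get_weights())])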

3. Summary and Next Steps

In this lesson, we explored advanced Actor-Critic methods, specifically Proximal Policy Optimization (PPO) and Deep Deterministic Policy Gradient (DDPG). We discussed their principles and implemented PPO in Python. In the next lesson, we will explore Multi-Agent Reinforcement Learning, which deals with environments where multiple agents interact.