Reinforcement Learning Fundamentals
Policy-Based Methods in Reinforcement Learning
Overview
In this lesson, we will focus on Policy-Based Methods in Reinforcement Learning (RL). Unlike Value-Based Methods, which learn the value of actions or states, Policy-Based Methods directly learn the policy that maps states to actions. We will explore the REINFORCE algorithm and understand how it optimizes policies.
1. Introduction to Policy-Based Methods
Policy-Based Methods focus on learning a policy function ( \pi(a|s) ) that represents the probability of taking action ( a ) in state ( s ). The policy can be deterministic or stochastic; a short numerical sketch at the end of this section illustrates the stochastic case.
Key Concepts:
- Policy Function ( \pi(a|s) ): A function that gives the probability of taking action ( a ) given state ( s ).
- Objective: Maximize the expected cumulative reward by optimizing the policy function.
- Gradient Ascent: Policy-Based Methods use gradient ascent to optimize the policy directly.
Advantages:
- Can handle high-dimensional and continuous action spaces.
- Can represent stochastic policies.
Disadvantages:
- Typically require more samples to converge.
- Gradient estimates have high variance, which can make training less stable than Value-Based Methods.
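As a quick illustration of a stochastic policy (the numbers below are made up purely for demonstration), a softmax turns unnormalized action preferences for a state into the probability distribution ( \pi(a|s) ) that the agent samples actions from:
import numpy as np
# Hypothetical preferences for 3 actions in some state s (illustrative values only)
preferences = np.array([2.0, 1.0, 0.1])
pi = np.exp(preferences) / np.exp(preferences).sum()  # softmax -> pi(a|s)
print(pi)                                             # approx. [0.66, 0.24, 0.10]
action = np.random.choice(len(pi), p=pi)              # sample an action from the policy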
2. REINFORCE Algorithm
REINFORCE is a Monte Carlo policy-gradient algorithm: it generates complete episodes with the current policy and uses the observed returns (cumulative discounted rewards) to update the policy parameters.
Policy Gradient Theorem: The gradient of the expected return with respect to the policy parameters ( \theta ) is: [ \nabla_{\theta} J(\theta) = \mathbb{E}_{\pi} \left[ \nabla_{\theta} \log \pi(a|s; \theta) \cdot R \right] ] Where:
- ( J(\theta) ) is the expected return.
- ( \pi(a|s; \theta) ) is the policy function with parameters ( \theta ).
- ( R ) is the return or cumulative reward.
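To make the estimator concrete, here is a minimal sketch of a single-sample REINFORCE update for a small linear-softmax policy; the state, action, return, and step size are made-up values, and the lesson's actual implementation below uses a neural network instead:
import numpy as np
# Hypothetical linear-softmax policy: pi(a|s) = softmax(theta @ s), 2 actions, 4 state features
theta = np.zeros((2, 4))                      # policy parameters
s = np.array([0.1, -0.2, 0.05, 0.3])          # an observed state (made up)
a, R = 1, 5.0                                 # sampled action and its observed return (made up)
probs = np.exp(theta @ s) / np.sum(np.exp(theta @ s))      # pi(.|s; theta)
# Gradient of log pi(a|s; theta) w.r.t. each row theta[k]: (1{k == a} - pi(k|s)) * s
grad_log_pi = (np.eye(2)[a] - probs)[:, None] * s[None, :]
grad_estimate = R * grad_log_pi               # single-sample estimate of grad_theta J(theta)
theta += 0.01 * grad_estimate                 # one gradient-ascent step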
Algorithm Steps:
- Initialize policy parameters ( \theta ).
- For each episode:
  - Generate an episode by following the current policy.
  - For each time step, compute the return ( R ) as the discounted sum of rewards from that step to the end of the episode.
  - Update the policy parameters ( \theta ) using the policy gradient.
Python Implementation: We'll build a small policy network for CartPole and train it with the REINFORCE update described above.
Step 1: Install Required Libraries
pip install gym numpy tensorflow
Step 2: Implement the REINFORCE Algorithm
import numpy as np
import gym
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.optimizers import Adam
# Initialize parameters
gamma = 0.99 # Discount factor
num_episodes = 1000
learning_rate = 0.01
# Create environment
env = gym.make('CartPole-v1')
n_actions = env.action_space.n  # CartPole-v1 has 2 discrete actions (push left, push right)
n_states = env.observation_space.shape[0]  # 4-dimensional observation
# Define the policy network
model = Sequential([
    Dense(24, activation='relu', input_shape=(n_states,)),
    Dense(24, activation='relu'),
    Dense(n_actions, activation='softmax')  # outputs action probabilities pi(a|s)
])
optimizer = Adam(learning_rate=learning_rate)
def policy_action(state):
    # Sample an action from the probability distribution produced by the policy network.
    state = np.expand_dims(state, axis=0).astype(np.float32)
    probs = model(state).numpy()[0]
    probs = probs / probs.sum()  # guard against floating-point drift so the probabilities sum to 1
    return np.random.choice(n_actions, p=probs)
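# Illustrative usage (made-up state): policy_action(np.zeros(n_states, dtype=np.float32))
# returns 0 or 1, sampled according to the network's current action probabilities.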
def compute_returns(rewards, gamma):
    # Discounted return-to-go for every time step, accumulated backwards from the end of the episode.
    returns = np.zeros(len(rewards), dtype=np.float32)
    running_sum = 0.0
    for t in reversed(range(len(rewards))):
        running_sum = rewards[t] + gamma * running_sum
        returns[t] = running_sum
    return returns
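# Quick illustrative check (hypothetical rewards, not part of training):
# compute_returns([1.0, 1.0, 1.0], 0.9) -> approximately [2.71, 1.9, 1.0],
# since each entry is rewards[t] + gamma * returns[t+1], accumulated from the final step.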
# Training loop
for episode in range(num_episodes):
    state, _ = env.reset()  # gym >= 0.26 returns (observation, info); older versions return only the observation
    states, actions, rewards = [], [], []
    done = False
    while not done:
        states.append(state)
        action = policy_action(state)
        # gym >= 0.26 returns (obs, reward, terminated, truncated, info)
        next_state, reward, terminated, truncated, _ = env.step(action)
        done = terminated or truncated
        actions.append(action)
        rewards.append(reward)
        state = next_state
    # Compute discounted returns for the episode
    returns = compute_returns(rewards, gamma)
    with tf.GradientTape() as tape:
        probs = model(tf.convert_to_tensor(np.array(states), dtype=tf.float32))
        # Log-probability of each action that was actually taken
        log_probs = tf.math.log(tf.reduce_sum(tf.one_hot(actions, n_actions) * probs, axis=1))
        # REINFORCE loss: negative of the policy-gradient objective
        loss = -tf.reduce_sum(log_probs * tf.convert_to_tensor(returns, dtype=tf.float32))
    # Update the policy by gradient ascent (descent on the negative objective)
    grads = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(grads, model.trainable_variables))
env.close()
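Step 3: Evaluate the Trained Policy
As an optional sanity check (not part of the REINFORCE algorithm itself), the short sketch below rolls out one episode with the trained network, always picking the most probable action, and prints the total reward. It assumes the model defined above is still in scope.
eval_env = gym.make('CartPole-v1')
state, _ = eval_env.reset()
total_reward, done = 0.0, False
while not done:
    # Act greedily: pick the most probable action under the learned policy
    probs = model(np.expand_dims(state, axis=0).astype(np.float32)).numpy()[0]
    state, reward, terminated, truncated, _ = eval_env.step(int(np.argmax(probs)))
    total_reward += reward
    done = terminated or truncated
print(f"Evaluation return: {total_reward}")
eval_env.close()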
3. Summary and Next Steps
In this lesson, we covered Policy-Based Methods in Reinforcement Learning, focusing on the REINFORCE algorithm. We discussed its principles and trade-offs, and implemented it using TensorFlow. In the next lesson, we will explore Actor-Critic Methods, which combine both value-based and policy-based approaches for reinforcement learning.