Reinforcement Learning Fundamentals

Value-Based Methods in Reinforcement Learning


Overview

In this lesson, we will focus on Value-Based Methods in Reinforcement Learning (RL). These methods involve estimating the value of states or actions to guide the agent's decision-making. We will cover key algorithms like Q-Learning and SARSA, delve into their mechanics, and implement them in code.


1. Understanding Value-Based Methods

Value-Based Methods aim to find the optimal policy by estimating the value functions associated with states or state-action pairs. These methods are grounded in the concept that an agent should choose actions that lead to states with higher expected rewards.

Key Concepts:

  • State Value Function (V(s)): Estimates the expected return (the cumulative discounted reward) starting from state s and following a specific policy.
  • Action Value Function (Q(s, a)): Estimates the expected return of taking action a in state s and then following a specific policy.

Objective: The goal is to find a policy that maximizes the expected cumulative reward. This is achieved by learning the optimal value functions.
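
To make the relationship between the two functions concrete, here is a minimal sketch (assuming a hypothetical problem with two states and two actions, and purely illustrative Q-values): under a greedy policy, the value of a state is simply its largest action value.

import numpy as np

# Hypothetical Q-values for a 2-state, 2-action problem (illustrative numbers only)
Q = np.array([[0.2, 0.7],   # Q(s0, a0), Q(s0, a1)
              [0.5, 0.1]])  # Q(s1, a0), Q(s1, a1)

V = Q.max(axis=1)                  # V(s) under a greedy policy: [0.7, 0.5]
greedy_actions = Q.argmax(axis=1)  # Greedy action per state: [1, 0]
print(V, greedy_actions)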


2. Q-Learning

Q-Learning is an off-policy algorithm: it learns the value of the greedy (optimal) action in each state regardless of the policy used to collect experience. It updates the Q-values (action-value function) using a target derived from the Bellman optimality equation.

Q-Learning Update Rule (derived from the Bellman optimality equation; a worked numerical example follows the definitions below): [ Q(s, a) \leftarrow Q(s, a) + \alpha [r + \gamma \max_{a'} Q(s', a') - Q(s, a)] ] Where:

  • ( \alpha ) is the learning rate.
  • ( r ) is the reward received after taking action ( a ) in state ( s ).
  • ( \gamma ) is the discount factor.
  • ( \max_{a'} Q(s', a') ) is the maximum Q-value for the next state ( s' ).
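
For example, suppose ( \alpha = 0.1 ), ( \gamma = 0.99 ), the current estimate is ( Q(s, a) = 0.5 ), the observed reward is ( r = 1 ), and the best available next-state estimate is ( \max_{a'} Q(s', a') = 0.6 ) (illustrative numbers only). The update gives: [ Q(s, a) \leftarrow 0.5 + 0.1 [1 + 0.99 \times 0.6 - 0.5] = 0.5 + 0.1 \times 1.094 = 0.6094 ]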

Algorithm Steps:

  1. Initialize Q-values arbitrarily.
  2. For each episode:
    1. Initialize state ( s ).
    2. For each step of the episode:
      1. Choose an action ( a ) using an exploration strategy (e.g., ε-greedy).
      2. Take action ( a ), observe reward ( r ) and next state ( s' ).
      3. Update Q-value using the Bellman equation.
      4. Set state ( s ) to ( s' ).

Python Implementation:

import numpy as np
import gym

# Hyperparameters
alpha = 0.1        # Learning rate
gamma = 0.99       # Discount factor
epsilon = 0.1      # Exploration rate
num_episodes = 1000
n_bins = 20        # Bins per observation dimension

# Initialize environment (classic gym API: reset() returns obs, step() returns 4 values)
env = gym.make('CartPole-v1')
n_actions = env.action_space.n
obs_dim = env.observation_space.shape[0]  # 4 for CartPole

# Bin edges per observation dimension (position, velocity, angle, angular velocity);
# ranges chosen heuristically to cover typical CartPole values
bins = [
    np.linspace(-2.4, 2.4, n_bins - 1),
    np.linspace(-3.0, 3.0, n_bins - 1),
    np.linspace(-0.21, 0.21, n_bins - 1),
    np.linspace(-3.0, 3.0, n_bins - 1),
]

# Q-table: one axis per discretized observation dimension, plus one for actions
Q = np.zeros([n_bins] * obs_dim + [n_actions])

def get_discrete_state(obs):
    # Map each continuous observation component to a bin index
    return tuple(int(np.digitize(x, b)) for x, b in zip(obs, bins))

for episode in range(num_episodes):
    state = get_discrete_state(env.reset())
    done = False

    while not done:
        if np.random.rand() < epsilon:
            action = env.action_space.sample()   # Explore
        else:
            action = int(np.argmax(Q[state]))    # Exploit

        next_obs, reward, done, _ = env.step(action)
        next_state = get_discrete_state(next_obs)

        # Q-learning update: bootstrap from the best action in the next state
        Q[state + (action,)] += alpha * (
            reward + gamma * np.max(Q[next_state]) - Q[state + (action,)]
        )
        state = next_state

env.close()
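
As a quick sanity check, you can run the learned greedy policy with no exploration and average its return. This is a minimal sketch that assumes the Q table and get_discrete_state helper defined above, and the same classic gym step API returning four values:

# Evaluate the learned greedy policy over a few episodes (no exploration)
eval_env = gym.make('CartPole-v1')
eval_episodes = 20
returns = []
for _ in range(eval_episodes):
    state = get_discrete_state(eval_env.reset())
    done, total_reward = False, 0.0
    while not done:
        action = int(np.argmax(Q[state]))        # Always exploit
        obs, reward, done, _ = eval_env.step(action)
        total_reward += reward
        state = get_discrete_state(obs)
    returns.append(total_reward)
print('Average return over', eval_episodes, 'greedy episodes:', np.mean(returns))
eval_env.close()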

3. SARSA (State-Action-Reward-State-Action)

SARSA is an on-policy algorithm: it updates the Q-values using the action the current (ε-greedy) policy actually selects in the next state, so exploration directly influences the learned values.

SARSA Update Rule (a worked numerical example follows the definitions below): [ Q(s, a) \leftarrow Q(s, a) + \alpha [r + \gamma Q(s', a') - Q(s, a)] ] Where:

  • ( s' ) is the next state.
  • ( a' ) is the next action taken in state ( s' ).
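
Continuing the illustrative numbers from the Q-Learning example ( ( \alpha = 0.1 ), ( \gamma = 0.99 ), ( Q(s, a) = 0.5 ), ( r = 1 ) ), suppose the ε-greedy policy actually selects an exploratory action in ( s' ) with ( Q(s', a') = 0.4 ) rather than the maximum of 0.6. The SARSA update gives: [ Q(s, a) \leftarrow 0.5 + 0.1 [1 + 0.99 \times 0.4 - 0.5] = 0.5 + 0.1 \times 0.896 = 0.5896 ] a smaller target than the 0.6094 produced by Q-Learning, because SARSA bootstraps from the action the policy actually takes rather than the greedy one.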

Algorithm Steps:

  1. Initialize Q-values arbitrarily.
  2. For each episode:
    1. Initialize state ( s ) and choose action ( a ) using an exploration strategy.
    2. For each step of the episode:
      1. Take action ( a ), observe reward ( r ) and next state ( s' ).
      2. Choose next action ( a' ) using the same strategy.
      3. Update Q-value using the Bellman equation for SARSA.
      4. Set state ( s ) to ( s' ) and action ( a ) to ( a' ).

Python Implementation:

import numpy as np
import gym

# Hyperparameters
alpha = 0.1        # Learning rate
gamma = 0.99       # Discount factor
epsilon = 0.1      # Exploration rate
num_episodes = 1000
n_bins = 20        # Bins per observation dimension

# Initialize environment (classic gym API: reset() returns obs, step() returns 4 values)
env = gym.make('CartPole-v1')
n_actions = env.action_space.n
obs_dim = env.observation_space.shape[0]  # 4 for CartPole

# Bin edges per observation dimension (position, velocity, angle, angular velocity);
# ranges chosen heuristically to cover typical CartPole values
bins = [
    np.linspace(-2.4, 2.4, n_bins - 1),
    np.linspace(-3.0, 3.0, n_bins - 1),
    np.linspace(-0.21, 0.21, n_bins - 1),
    np.linspace(-3.0, 3.0, n_bins - 1),
]

# Q-table: one axis per discretized observation dimension, plus one for actions
Q = np.zeros([n_bins] * obs_dim + [n_actions])

def get_discrete_state(obs):
    # Map each continuous observation component to a bin index
    return tuple(int(np.digitize(x, b)) for x, b in zip(obs, bins))

def epsilon_greedy(state):
    # Random action with probability epsilon, otherwise the greedy action
    if np.random.rand() < epsilon:
        return env.action_space.sample()
    return int(np.argmax(Q[state]))

for episode in range(num_episodes):
    state = get_discrete_state(env.reset())
    action = epsilon_greedy(state)
    done = False

    while not done:
        next_obs, reward, done, _ = env.step(action)
        next_state = get_discrete_state(next_obs)
        next_action = epsilon_greedy(next_state)

        # SARSA update: bootstrap from the action the policy actually takes next
        Q[state + (action,)] += alpha * (
            reward + gamma * Q[next_state + (next_action,)] - Q[state + (action,)]
        )
        state = next_state
        action = next_action

env.close()
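
Both implementations above keep ( \epsilon ) fixed at 0.1. A common refinement, sketched below with hypothetical values, is to decay ( \epsilon ) across episodes so the agent explores heavily at first and exploits more as the Q-table improves:

# Exponential epsilon decay schedule (hypothetical values; tune per task)
epsilon_start, epsilon_min, decay_rate = 1.0, 0.05, 0.995

def epsilon_for(episode):
    # Epsilon after `episode` decay steps, floored at epsilon_min
    return max(epsilon_min, epsilon_start * decay_rate ** episode)

# Inside the training loop, replace the fixed epsilon with epsilon_for(episode).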

4. Summary and Next Steps

In this lesson, we explored Value-Based Methods in Reinforcement Learning, focusing on Q-Learning and SARSA. We discussed their principles, differences, and implementations in Python. In the next lesson, we will delve into Policy-Based Methods, starting with REINFORCE, and explore how they differ from Value-Based Methods.