Reinforcement Learning Fundamentals

Hierarchical Reinforcement Learning (HRL)


Overview

In this lesson, we will explore Hierarchical Reinforcement Learning (HRL). HRL addresses the challenge of learning complex tasks by breaking them down into simpler, more manageable sub-tasks. This hierarchical approach can improve learning efficiency and policy generalization. We will discuss key concepts, algorithms, and implement a basic HRL example.


1. Introduction to Hierarchical Reinforcement Learning

Hierarchical Reinforcement Learning (HRL) involves structuring the learning process into multiple levels, where higher-level policies define goals or sub-tasks, and lower-level policies handle the execution of these sub-tasks.

Key Concepts:

  • Hierarchy of Policies: HRL introduces a hierarchy where high-level policies select sub-tasks or goals, and low-level policies execute these tasks.
  • Options Framework: A common HRL formalism in which an option consists of an initiation set, a policy, and a termination condition.
  • Macro-Actions: Higher-level actions that are sequences of lower-level actions.
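
As a concrete illustration, a macro-action can be as simple as a fixed list of primitive actions. The sketch below is hypothetical; the action encoding matches the grid world built in Section 3.

# Hypothetical macro-action: three steps Right, then three steps Up
# (action encoding from Section 3 -- 0: Left, 1: Down, 2: Right, 3: Up)
MOVE_TO_TOP_RIGHT = [2, 2, 2, 3, 3, 3]

def run_macro_action(env, macro_action):
    # Execute the primitive actions in order, stopping early if the episode ends
    for action in macro_action:
        state, reward, done, _ = env.step(action)
        if done:
            break
    return state, reward, done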

Advantages:

  • Scalability: Simplifies the learning problem by breaking it into smaller sub-tasks.
  • Reusability: Allows the reuse of learned policies for different tasks.
  • Faster Learning: Focuses learning on sub-tasks, potentially speeding up the training process.

Challenges:

  • Complexity: Designing and managing hierarchical structures can be complex.
  • Coordination: Ensuring proper coordination between different levels of the hierarchy.

2. Key Algorithms in HRL

a. Options Framework: The Options Framework introduces options, which are temporal abstractions consisting of three components (sketched as a data structure below):

  • Initiation Set: The set of states in which the option can be invoked.
  • Policy: The action-selection strategy followed while the option is active.
  • Termination Condition: Determines (possibly stochastically) when the option ends.
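
As a data structure, an option can be written out like this. This is a minimal illustration assuming discrete states and actions; the callable field types are one reasonable encoding, not a fixed API.

from dataclasses import dataclass
from typing import Callable

@dataclass
class Option:
    initiation_set: Callable[[int], bool]  # states in which the option may start
    policy: Callable[[int], int]           # action to take while the option is active
    termination: Callable[[int], float]    # probability of terminating in a state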

Algorithm Steps:

  1. Initialize the options and their policies.
  2. For each episode:
    1. Choose an option using the high-level policy.
    2. Execute the option's policy until its termination condition is met (this inner loop is sketched below).
    3. Update the high-level policy with the option's accumulated discounted reward, and the low-level policy with the per-step rewards.
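
The inner loop of step 2 can be sketched as follows. This is a minimal sketch assuming an Option object like the one above and the classic Gym step API; the discounted-return bookkeeping is what the high-level update in step 3 would consume.

import random

def run_option(env, state, option, gamma=0.99):
    # Follow the option's policy until its termination condition fires,
    # accumulating the discounted reward along the way (the SMDP view)
    total_reward, discount, done = 0.0, 1.0, False
    while not done and random.random() > option.termination(state):
        action = option.policy(state)
        state, reward, done, _ = env.step(action)
        total_reward += discount * reward
        discount *= gamma
    return state, total_reward, discount, done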

b. Hierarchical DQN (h-DQN): An extension of Deep Q-Networks (DQN) that uses two Q-networks at different levels of abstraction: a meta-controller that selects goals, and a controller that executes primitive actions to achieve them, typically trained on intrinsic rewards for reaching the chosen goal.

Algorithm Steps:

  1. Initialize the high-level (meta-controller) and low-level (controller) Q-networks.
  2. For each episode:
    1. Choose a goal (sub-task) using the high-level Q-network.
    2. Execute primitive actions with the low-level Q-network until the goal is reached or the episode ends.
    3. Update the low-level network with intrinsic rewards for goal progress and the high-level network with the accumulated extrinsic reward (see the skeleton below).
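
In skeleton form, one h-DQN episode looks roughly like this. This is a hedged sketch: meta_q, ctrl_q, reached, and intrinsic_reward are hypothetical stand-ins, not a real library API.

def hdqn_episode(env, meta_q, ctrl_q):
    state = env.reset()
    done = False
    while not done:
        goal = meta_q.select_goal(state)  # high-level: pick a sub-task/goal
        extrinsic_return = 0.0
        start_state = state
        while not done and not reached(state, goal):
            action = ctrl_q.select_action(state, goal)  # low-level: act toward goal
            next_state, reward, done, _ = env.step(action)
            # The controller learns from intrinsic reward (e.g., 1 on reaching the goal)
            ctrl_q.store(state, goal, action, intrinsic_reward(next_state, goal), next_state)
            extrinsic_return += reward
            state = next_state
        # The meta-controller learns from the accumulated extrinsic reward
        meta_q.store(start_state, goal, extrinsic_return, state)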

3. Python Implementation: Basic HRL with Options Framework

Step 1: Define the Environment

We will use a simple 4x4 grid world environment in which the agent can perform macro-actions (e.g., "move to a specific goal"). The code below assumes the classic Gym API (gym < 0.26), where reset() returns only the observation and step() returns a 4-tuple.

import numpy as np
import gym
from gym import spaces

class GridWorldEnv(gym.Env):
    def __init__(self):
        super(GridWorldEnv, self).__init__()
        self.action_space = spaces.Discrete(4)  # Actions: 0: Left, 1: Down, 2: Right, 3: Up
        self.observation_space = spaces.MultiDiscrete([4, 4])  # 4x4 grid
        self.state = [0, 0]  # Initial state

    def reset(self):
        self.state = [0, 0]
        return np.array(self.state)

    def step(self, action):
        if action == 0:  # Left
            self.state[0] = max(0, self.state[0] - 1)
        elif action == 1:  # Down
            self.state[1] = max(0, self.state[1] - 1)
        elif action == 2:  # Right
            self.state[0] = min(3, self.state[0] + 1)
        elif action == 3:  # Up
            self.state[1] = min(3, self.state[1] + 1)

        reward = 1 if self.state == [3, 3] else 0  # reward only at the goal cell (3, 3)
        done = reward == 1
        return np.array(self.state), reward, done, {}

    def render(self, mode='human'):
        grid = np.zeros((4, 4))
        grid[tuple(self.state)] = 1
        print(grid)

env = GridWorldEnv()
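
A quick sanity check of the environment (this, like the class above, assumes the classic Gym 4-tuple step API):

obs = env.reset()
obs, reward, done, info = env.step(2)  # move Right
print(obs, reward, done)               # expected: [1 0] 0 False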

Step 2: Implement Basic HRL with Options

We'll implement a simple HRL setup in which a high-level policy selects goals and a low-level policy learns to act in the environment. To keep the example minimal, the chosen goal does not yet condition the low-level policy; a full implementation would index the low-level Q-table by (state, goal) pairs.
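
Both levels share the same one-step tabular Q-learning update:

    Q(s, a) <- Q(s, a) + alpha * (r + gamma * max_a' Q(s', a') - Q(s, a))

At the low level, (s, a) is a (grid state, primitive action) pair; at the high level, it is a (grid state, goal) pair.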

import numpy as np
import random

class HighLevelPolicy:
    def __init__(self, n_states, n_goals, alpha=0.1, gamma=0.99, epsilon=0.1):
        # Q-table over (state, goal) pairs: which goal to pursue from each state
        self.q_table = np.zeros((n_states, n_goals))
        self.alpha = alpha
        self.gamma = gamma
        self.epsilon = epsilon
        self.n_goals = n_goals

    def choose_goal(self, state):
        # Epsilon-greedy goal selection
        if random.uniform(0, 1) < self.epsilon:
            return random.randrange(self.n_goals)
        return int(np.argmax(self.q_table[state]))

    def update(self, state, goal, reward, next_state):
        # One-step Q-learning update over (state, goal) pairs
        best_next_goal = np.argmax(self.q_table[next_state])
        td_target = reward + self.gamma * self.q_table[next_state, best_next_goal]
        td_error = td_target - self.q_table[state, goal]
        self.q_table[state, goal] += self.alpha * td_error

class LowLevelPolicy:
    def __init__(self, n_states, n_actions, alpha=0.1, gamma=0.99, epsilon=0.1):
        self.q_table = np.zeros((n_states, n_actions))
        self.alpha = alpha
        self.gamma = gamma
        self.epsilon = epsilon
        self.n_actions = n_actions  # stored so we don't reach for the global env

    def choose_action(self, state):
        # Epsilon-greedy action selection
        if random.uniform(0, 1) < self.epsilon:
            return random.randrange(self.n_actions)
        # Break ties randomly so an untrained (all-zero) Q-table still explores
        q_values = self.q_table[state]
        return int(np.random.choice(np.flatnonzero(q_values == q_values.max())))

    def update(self, state, action, reward, next_state):
        # Standard one-step Q-learning update
        best_next_action = np.argmax(self.q_table[next_state])
        td_target = reward + self.gamma * self.q_table[next_state, best_next_action]
        td_error = td_target - self.q_table[state, action]
        self.q_table[state, action] += self.alpha * td_error

# Initialize policies: 16 states for the 4x4 grid, 4 candidate goals, 4 primitive actions
high_level_policy = HighLevelPolicy(n_states=16, n_goals=4)
low_level_policy = LowLevelPolicy(n_states=16, n_actions=4)

def to_index(state):
    # Flatten an (x, y) grid position into a single index in 0..15
    return state[0] * 4 + state[1]

num_episodes = 1000
for episode in range(num_episodes):
    state = env.reset()
    # The high-level policy picks one goal per episode; in this minimal
    # version the goal does not condition the low-level policy (see above)
    goal = high_level_policy.choose_goal(to_index(state))
    done = False

    while not done:
        action = low_level_policy.choose_action(to_index(state))
        next_state, reward, done, _ = env.step(action)

        low_level_policy.update(to_index(state), action, reward, to_index(next_state))

        if done:
            # Credit the chosen goal with the final (extrinsic) reward
            high_level_policy.update(to_index(state), goal, reward, to_index(next_state))

        state = next_state
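
After training, a short greedy rollout (capped at 50 steps in case the policy is imperfect) checks whether the learned low-level Q-table actually reaches the goal:

state = env.reset()
done, steps = False, 0
while not done and steps < 50:
    action = int(np.argmax(low_level_policy.q_table[to_index(state)]))
    state, _, done, _ = env.step(action)
    steps += 1
print(f"Reached goal: {done} after {steps} steps")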

env.close()

4. Summary and Next Steps

In this lesson, we covered Hierarchical Reinforcement Learning (HRL), including hierarchical policies, the Options Framework, and hierarchical algorithms such as h-DQN. We implemented a basic two-level setup, inspired by the Options Framework, in a grid world environment. In the next lesson, we will explore Transfer Learning in Reinforcement Learning, which leverages knowledge from one task to improve learning on another.