Saturday, August 7, 2010

PyBrain: Reinforcement Learning, a Tutorial


2 – Q Learning and its variations:

Q Learning is the most common learning algorithm for reinforcement learning scenarios. Essentially, reinforcement learning scenarios are described in terms of states, actions and rewards. The environment places the learning agent in a certain state, and the agent reacts with an action. The environment (more specifically, the "task", but we'll get into the details later) then gives feedback about this action in the form of a reward, and the agent adapts (learns). The process can be repeated ad nauseam.
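To make that cycle concrete, here is a bare-bones sketch in plain Python (illustrative placeholders only, not PyBrain's classes; the random stand-ins for the environment, agent and task are assumptions made purely for the sake of the example):

import random

def environment_state():            # the environment places the agent in a state
    return random.randrange(7)      # 7 possible states, as in the table below

def agent_action(state):            # the agent reacts with an action
    return random.randrange(2)      # 2 possible actions, chosen at random here

def task_reward(state, action):     # the task gives feedback about the action
    return random.random()          # a made-up reward, for illustration only

for interaction in range(10):
    state = environment_state()
    action = agent_action(state)
    reward = task_reward(state, action)
    # ...this is where the agent would adapt (learn), then the cycle repeats...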

Q Learning is implemented by internally holding a representation of the relative value of each action in the context of each state, in a two-dimensional table (or matrix), such as the following:




            Action 1                        Action 2
State 1     Q-value for action 1, state 1   Q-value for action 2, state 1
State 2     Q-value for action 1, state 2   Q-value for action 2, state 2
State 3     Q-value for action 1, state 3   Q-value for action 2, state 3
State 4     Q-value for action 1, state 4   Q-value for action 2, state 4
State 5     Q-value for action 1, state 5   Q-value for action 2, state 5
State 6     Q-value for action 1, state 6   Q-value for action 2, state 6
State 7     Q-value for action 1, state 7   Q-value for action 2, state 7


A scenario with 2 possible actions and 7 possible states.


What these Q-values are initialized to doesn't matter terribly, although it might affect the speed at which the learning will "converge".
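As a concrete illustration (using plain NumPy here rather than PyBrain's own table class), such a table could be created and initialized like this:

import numpy as np

num_states, num_actions = 7, 2

# One row per state, one column per action. Zeros are a common starting point;
# small random values would work too, and mostly affect how fast learning converges.
q_table = np.zeros((num_states, num_actions))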

And now, without any prior explanation, I'm going to throw at you the magic formula that updates these Q-values:

Q(t-1) := Q(t-1) + alpha * [r + gamma * max(Q(t)) - Q(t-1)]

In this formula:

The ":=" means "we update the old value with...". In other words, it is an assignment operator.

t refers to the current interaction, t-1 refers to the previous one.

Q(t-1): the Q-value associated with the previous interaction that occurred, NOT the current one!

alpha means: the learning rate, between 0.0 (exclusive) and 1.0 (inclusive). The closer it is to 1.0, the faster the Q-values will adjust to the current feedback (more aggressive learning); the closer it is to 0.0, the slower they will adjust (more conservative learning).

r means: the reward associated with the previous interaction. What is most confusing about this is that although the reward applies to the previous interaction, its value only arrives as a response during the current interaction... The interactions go like this (the code sketch further down, after the parameter definitions, makes this ordering explicit):

Interaction 1:
    • observe state 1
    • perform action 1
    • receive reward for "interaction 0" (there was no interaction 0, so this value is meaningless)
No learning occurs here; this is the first interaction.

Interaction 2:
    • observe state 2
    • perform action 2
    • receive reward for interaction 1
The first update of the Q-value table occurs here!

gamma means: the "discount factor", between 0.0 (inclusive) and 1.0 (exclusive!). The closer it is to 1.0, the more the agent will try to optimize its learning for the long term; the closer it is to 0.0, the more it will optimize for the short term (greedy learning). The nature of the reinforcement learning problem (specifically, whether or not the agent's actions influence the environment and thereby determine its future states) has an important impact on the value one should choose for this parameter. I will get back to it later.

max(Q(t)) means: the maximum Q-value in the row corresponding to the given state at time t (i.e. the current state). In other words, the Q-value associated with the optimal action for the given state at time t.
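Now that every symbol has been defined, here is a minimal, self-contained sketch of the whole mechanism in plain Python/NumPy (this is not PyBrain's API; the random environment and the alpha/gamma values are arbitrary stand-ins chosen just for illustration). Note how the reward received at each interaction updates the Q-value of the previous one:

import random
import numpy as np

num_states, num_actions = 7, 2
alpha, gamma = 0.5, 0.9                        # arbitrary example values

q_table = np.zeros((num_states, num_actions))  # the Q-value table from above

prev_state, prev_action = None, None           # nothing to update yet

for interaction in range(1000):
    state = random.randrange(num_states)       # observe the current state
    action = random.randrange(num_actions)     # perform an action (random policy here)
    reward = random.random()                   # reward for the PREVIOUS interaction

    if prev_state is not None:                 # no learning on the very first interaction
        old_q = q_table[prev_state, prev_action]         # Q(t-1)
        best_next_q = q_table[state].max()               # max(Q(t))
        q_table[prev_state, prev_action] = old_q + alpha * (
            reward + gamma * best_next_q - old_q)        # the update formula

    prev_state, prev_action = state, action    # remember this interaction for next time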

First, let's take a look at what this formula means when we set the discount factor (gamma) to 0.

We get:

Q(t-1) := Q(t-1) + alpha * (r - Q(t-1))

In other words, this formula simply means: add to the Q-value a fraction (alpha) of the difference between the reward and the Q-value itself. Note that there is only one time index used here: t-1. In fact, we don't even need a time index at all. We could just update the newest value (Q(t)) as the interaction occurs. This formula could very well read:

Q(t) := Q(t) + alpha * (r - Q(t))

When alpha is 1, we get Q := Q + (r - Q), hence Q = r directly. This is a bit too simplistic...

At the other extreme, if alpha were to be 0, we would have Q = Q, hence no update.

When alpha equals 0.5, for example, the updated Q-value lands half-way between the old Q-value and the reward received.

Overall, the gamma = 0 case is fairly easy to understand, and does not even necessitate a time dimension. It can happen instantly, for each new interaction. Let's see what happens when we add a discount factor...
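Here is a quick numeric illustration of the effect of alpha in this gamma = 0 case (the starting Q-value and the reward are arbitrary choices: Q starts at 0 and the reward is always 1):

reward = 1.0
for alpha in (1.0, 0.5, 0.1):
    q = 0.0
    history = []
    for _ in range(4):                     # four consecutive identical rewards
        q = q + alpha * (reward - q)       # the gamma = 0 update
        history.append(round(q, 3))
    print(alpha, history)

# alpha = 1.0 -> [1.0, 1.0, 1.0, 1.0]       Q jumps straight to r
# alpha = 0.5 -> [0.5, 0.75, 0.875, 0.938]  Q closes half the gap each time
# alpha = 0.1 -> [0.1, 0.19, 0.271, 0.344]  Q creeps toward r slowly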

The formula with gamma = 0 represents a very near-sighted, "impulsive" agent. It does not look even one step ahead in time to see what lies in store as a consequence of its action. The discount factor is what makes the agent look at time t (or t+1, depending on your perspective) and take the optimal Q-value of that interaction into account when determining how to update Q(t-1).

Of course, this is only valuable for scenarios where the state of the experiment isn't reset (re-initialized or randomized) at each interaction... because in an experiment that is reset every time, the agent has no influence over its next state.

Hence, going back to the original (complete) formula:

Q(t-1) := Q(t-1) + alpha * [r + gamma * max(Q(t)) - Q(t-1)]

Suppose we have a gamma of 1 and an alpha of 1; we then have the following formula:

Q(t-1) := Q(t-1) + r + max(Q(t)) - Q(t-1)

which simplifies to:

Q(t-1) := r + max(Q(t))

What this means is that we replace the previous Q-value with its reward plus the maximum possible Q-value given the new state. Therefore, if the previous interaction led us to a new state where even the best action isn't that great in terms of Q-value, that previous interaction will be penalized accordingly, compared to a previous interaction that led us to a state where a particularly valuable action can be taken.
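A tiny illustration of this, with made-up numbers (the table values, the reward and the state indices are all arbitrary): the same previous interaction ends up with a very different Q-value depending on whether it led to a "poor" state or to a "rich" one.

import numpy as np

# Arbitrary example Q-table: state 1 is a "poor" state, state 2 a "rich" one.
q_table = np.array([[0.0, 0.0],     # state 0 (where the previous interaction happened)
                    [0.2, 0.3],     # state 1: best available action is worth 0.3
                    [5.0, 9.0]])    # state 2: best available action is worth 9.0

reward = 1.0                        # reward for the previous interaction

for new_state in (1, 2):
    new_q = reward + q_table[new_state].max()   # r + max(Q(t)), since alpha = gamma = 1
    print("led to state", new_state, "-> previous Q-value becomes", new_q)

# led to state 1 -> previous Q-value becomes 1.3
# led to state 2 -> previous Q-value becomes 10.0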

(coming up next: SARSA learning and other variations)
