Saturday, August 7, 2010

PyBrain: Reinforcement Learning, a Tutorial

2 – Q Learning and its variations (continued):

SARSA versus Q Learning, and exploratory policies:


In SARSA learning, instead of using the maximum Q-value at time t, we use the Q-value of the action that the agent actually chose at time t. The update formula therefore becomes:


Q(t-1) := Q(t-1) + alpha * [r + gamma * Q(t) - Q(t-1)]
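To make this concrete, here is a minimal sketch of such an update in Python. This is purely illustrative code using a plain dictionary as the Q-table; it is not the PyBrain API (we'll get to that later), and all the names are mine:

    # Illustrative SARSA update, not PyBrain code.
    # Q is a dict mapping (state, action) pairs to values.
    def sarsa_update(Q, s_prev, a_prev, reward, s_next, a_next, alpha=0.5, gamma=0.99):
        old = Q.get((s_prev, a_prev), 0.0)
        # The target uses the Q-value of the action actually chosen at the next step.
        td_target = reward + gamma * Q.get((s_next, a_next), 0.0)
        Q[(s_prev, a_prev)] = old + alpha * (td_target - old)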

The fundamental difference between Q learning and SARSA learning, therefore, is that SARSA is "policy-dependent" (on-policy, in the standard terminology), while Q learning is "policy-independent" (off-policy). "Policy" here means the decision process the agent uses to select an action, given a certain state.

So far we've discussed how the agent learns, i.e. updates its value table, but we haven't discussed how it selects the action it is going to perform. This digression is necessary to understand the difference between the two learning algorithms presented so far.

One might initially think that the agent would simply choose the action with the highest Q-value given the input state. The problem with that policy is that it is vulnerable to what we call "local maxima": points in the space of solutions where any small step in any direction decreases the Q-value, and yet a better solution exists farther away in the space, one we will never reach if we refuse to ever accept a slight temporary decrease in our current solution's value.

To solve this problem, reinforcement learning agents usually include in their decision policy a factor generally referred to as epsilon, which determines how often an action is chosen at random. Occasionally randomizing an action (i.e. choosing an action that is NOT currently considered optimal!) can bring the agent to states offering higher Q-values, even though the step that led there was less beneficial.
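Here is one common way such an "epsilon-greedy" policy can be written, again as an illustrative sketch rather than PyBrain code (the names are mine):

    import random

    # Illustrative epsilon-greedy action selection.
    def epsilon_greedy(Q, state, actions, epsilon=0.1):
        if random.random() < epsilon:
            return random.choice(actions)      # explore: pick a random action
        # exploit: pick the action with the highest Q-value for this state
        return max(actions, key=lambda a: Q.get((state, a), 0.0))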

Coming back to SARSA versus Q, then, we see that SARSA's learning function depends on the action actually taken, and therefore on the agent's decision policy (essentially, on the exploration factor). Q learning, on the other hand, does not care about the action that was actually taken next, but only about the best action that could have been taken. This is the fundamental (and only) difference.
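To see the difference in code, here is a small sketch that computes only the two learning targets side by side (same dictionary-style Q-table as above; illustrative names, not PyBrain's):

    # Illustrative comparison of the two TD targets.
    def td_targets(Q, reward, s_next, a_next, actions, gamma=0.99):
        # Q-learning: value of the best action that could be taken from s_next.
        q_target = reward + gamma * max(Q.get((s_next, a), 0.0) for a in actions)
        # SARSA: value of the action a_next that the policy actually took from s_next.
        sarsa_target = reward + gamma * Q.get((s_next, a_next), 0.0)
        return q_target, sarsa_target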

It should be noted that Q learning can be proven to converge to a close approximation of the optimal value function given "sufficient" training, regardless of the exploration policy being followed. SARSA, on the other hand, is not guaranteed to converge to the optimal values unless the exploration policy itself becomes greedy over time.



Q-lambda learning:

Here I introduce the last variation of Q-learning. Q-lambda learning is a bit more elaborate, as it goes back in time to update the Q-values associated with the most recent interactions. The formula goes like this:

  • For all T time steps t of this learning episode (0 being the first step, T - 1 being the most recent):

Q(t-1) := Q(t-1) + alpha * etr * [r + gamma * max(Q(t)) - Q(t-1)]

  • Where:
etr = lambda ^ (T - 1 - t)

In plain English, Q-lambda learning updates the Q-values of all state-action pairs visited during the current episode (indexed by t from 0 to T - 1), attenuating each update by the factor etr, which decays exponentially as you go farther back in time and is equal to 1 for the most recent interaction.
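As a rough sketch of one way to read this formula (the error of the most recent interaction is propagated backwards through the episode, attenuated by etr), again with a plain dictionary Q-table and illustrative names rather than PyBrain's API:

    # Illustrative Q-lambda backward update over one episode.
    # 'episode' is a list of (state, action) pairs, index 0 being the first step.
    def q_lambda_update(Q, episode, reward, s_next, actions, alpha=0.5, gamma=0.99, lam=0.9):
        T = len(episode)
        best_next = max(Q.get((s_next, a), 0.0) for a in actions)
        for t, (s, a) in enumerate(episode):
            etr = lam ** (T - 1 - t)   # equal to 1 for the most recent interaction
            old = Q.get((s, a), 0.0)
            Q[(s, a)] = old + alpha * etr * (reward + gamma * best_next - old)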

(coming next: the actual PyBrain API!)
