Sunday, August 15, 2010

PyBrain: Reinforcement Learning, a Tutorial

3 – The PyBrain API:

To download/install PyBrain, see:

PyBrain's Reinforcement Learning API can be visually represented like this:


The Experiment class essentially coordinates the interaction between the Environment/Task and the Agent. It decides what an interaction, or episode (in the case of the EpisodicExperiment), entails. Usually, that means obtaining the observations from the environment, feeding them to the agent, obtaining the agent's corresponding action, feeding it back into the environment, and receiving the reward from the task, which is then fed to the agent for the learning process.

The classes that can be used to implement an Experiment are:

  • Experiment: the basic experiment class that implements the concept of an interaction via the doInteractions() method.
  • EpisodicExperiment: an extension of the Experiment class that implements the concept of episodic tasks (an episode being a series of interactions that terminates when some condition based on the environment's state is met) via the doEpisodes() method.
  • ContinuousExperiment: an extension to the Experiment class that handles continuous tasks, i.e. tasks where no "reset" is involved, and learning occurs after each interaction.
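The interaction loop described above can be sketched in plain Python. This is not PyBrain's actual implementation, just a conceptual illustration of what doInteractions() does; the CountingTask and ConstantAgent stand-ins are made up for the example:

```python
class Experiment:
    """Conceptual sketch of an Experiment's interaction loop."""

    def __init__(self, task, agent):
        self.task = task      # wraps the environment and computes rewards
        self.agent = agent    # produces actions, learns from rewards

    def doInteractions(self, number=1):
        for _ in range(number):
            self._oneInteraction()

    def _oneInteraction(self):
        # 1. obtain the observation from the environment (via the task)
        observation = self.task.getObservation()
        self.agent.integrateObservation(observation)
        # 2. obtain the agent's corresponding action and apply it
        action = self.agent.getAction()
        self.task.performAction(action)
        # 3. feed the resulting reward back to the agent
        reward = self.task.getReward()
        self.agent.giveReward(reward)


# Minimal stand-ins to exercise the loop (purely illustrative):
class CountingTask:
    def __init__(self):
        self.state = 0
    def getObservation(self):
        return [self.state]
    def performAction(self, action):
        self.state += action[0]
    def getReward(self):
        return 1.0 if self.state > 0 else 0.0

class ConstantAgent:
    def __init__(self):
        self.totalReward = 0.0
    def integrateObservation(self, obs):
        self.lastObs = obs
    def getAction(self):
        return [1]
    def giveReward(self, r):
        self.totalReward += r

experiment = Experiment(CountingTask(), ConstantAgent())
experiment.doInteractions(5)
```

The real PyBrain classes follow the same observation / action / reward cycle, with the LoggingAgent additionally enforcing that the three agent calls happen in that order.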


The Task class implements the nature of the task, and focuses on the reward aspect: it decides what counts as a successful action and what counts as a less successful one. You will have to implement your own derivation of this class for your application.


The Environment class manages the inputs (observations) that go to the agent and the outputs (actions performed) that come from it. You will have to implement your own derivation of this class for your application.


The Agent is the "intelligent" component: the module that does the actual interacting and learning.

The classes that can be used to implement an agent are:

  • Agent: the basic agent class.
  • LoggingAgent: this extension stores a history of interactions, and makes sure integrateObservation, getAction and getReward are called in that order.
  • LearningAgent: this extension includes a learning module and does the actual learning. This is generally the class you want to instantiate (it is not clear in what scenarios you would limit yourself to the two classes above, unless you are implementing your own Agent class for whatever reason).

The learning algorithm in the diagram depends on which parameter you pass when you instantiate your Agent, and can be:

  • Q: Q learning. Described in section 2(a)
  • NFQ: Neural Fitted Q learning. A variant of Q learning that approximates the behaviour to be learned using a neural network rather than the Q-value table used in traditional Q learning. This method is much slower (the learning occurs offline, on a set of sampled transitions between state t and state t + 1); however, it can approximate non-linear functions, which is something traditional Q learning cannot do.
  • QLambda: Q-lambda learning. Described in section 2(b).
  • SARSA: SARSA (State-Action-Reward-State-Action) learning. Described in section 2(b)
The corresponding learning data structure can be:
  • ActionValueTable: for discrete actions.
  • ActionValueNetwork: for continuous actions.
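To make the ActionValueTable concrete, here is a minimal tabular Q-learning update written from scratch (this is not PyBrain code; the table is just a 2-D array of Q(state, action) values, and the state/action counts, learning rate and discount below are arbitrary):

```python
n_states, n_actions = 4, 2
alpha, gamma = 0.5, 0.9          # learning rate and discount factor (arbitrary)

# The "ActionValueTable": one Q value per (state, action) pair,
# initialized to zero.
Q = [[0.0] * n_actions for _ in range(n_states)]

def update(state, action, reward, next_state):
    """Classic Q-learning update:
    move Q(s, a) toward reward + gamma * max_a' Q(s', a')."""
    best_next = max(Q[next_state])
    Q[state][action] += alpha * (reward + gamma * best_next - Q[state][action])

# One sample transition: in state 0, action 1 yields reward 1.0
# and leads to state 1.
update(0, 1, 1.0, 1)
# Q[0][1] is now 0.5, i.e. 0 + 0.5 * (1.0 + 0.9 * 0 - 0)
```

An ActionValueNetwork plays the same role, but replaces the table with a neural network so that continuous actions (and non-linear value functions, as with NFQ) can be handled.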

I suggest referring to the documentation that comes with PyBrain for technical details on the API. This section is more of an introductory "How To" than a technical reference.

In order to implement your own specific reinforcement learning scenario, the steps to follow generally are:

1 – implement your own derived class of Task

This only entails implementing your own getReward(self) function.

2 – implement your own derived class of Environment

Methods to define:


getSensors(self):

This method determines and returns the observations (i.e. the state) that the agent will obtain from the environment during an interaction. This is usually an array of doubles.

performAction(self, action):

This method applies the action received from the agent (the "action" parameter) to the environment. The action is also, by default, an array of doubles.


reset(self):

This method can be implemented if you want to be able to re-initialize the state of the environment after a certain number of interactions.
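Putting steps 1 and 2 together, a skeleton might look like this. The scenario (a one-dimensional walk toward a goal position) and the bare base classes are made up for illustration; in a real application you would subclass PyBrain's own Task and Environment classes instead:

```python
class Environment:           # stand-in for PyBrain's Environment base class
    pass

class Task:                  # stand-in for PyBrain's Task base class
    def __init__(self, environment):
        self.env = environment


class WalkEnvironment(Environment):
    """Toy environment: an agent at an integer position on a line."""

    def __init__(self):
        self.reset()

    def getSensors(self):
        # The observation (state) handed to the agent: an array of doubles.
        return [float(self.position)]

    def performAction(self, action):
        # Apply the agent's action: move left (-1) or right (+1).
        self.position += int(action[0])

    def reset(self):
        # Re-initialize the environment between episodes.
        self.position = 0


class WalkTask(Task):
    """Toy task: reward the agent for reaching position 3."""

    GOAL = 3

    def getReward(self):
        return 1.0 if self.env.position == self.GOAL else 0.0


env = WalkEnvironment()
task = WalkTask(env)
env.performAction([1])
env.performAction([1])
env.performAction([1])   # position is now 3, so getReward() returns 1.0
```

The task only needed getReward, while the environment provided the getSensors/performAction/reset trio, exactly matching the steps above.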

3 – Code your main function that brings it all together... see examples in the next section.
