Saturday, August 21, 2010

PyBrain: Reinforcement Learning, a Tutorial

4(a) – A Blackjack-playing agent:

First, we will start with a very basic, minimalist scenario: a hand is dealt, and the agent is asked whether it should take another card or stop. There is no splitting, no betting, etc. From the business perspective, then, the scenario can be represented as 21 states, each calling for one of 2 possible actions. The 21 states are the possible total values of the hand the agent was dealt, and the 2 possible actions are Hit and Stand.

To simplify the code, the interaction will occur on the command line, and the user will manually enter the hand dealt and the result of the agent's action. The steps of the interaction are:

1 – Ask user to input the hand value that was dealt.
2 – The agent performs the action, outputs it to the user.
3 – Ask user to input the reward, i.e. whether the hand was winning (1.0), a draw (0.0), or losing (-1.0).
4 – The agent learns.
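
The four steps above can be sketched without PyBrain as a plain loop. This is only an illustration under stated assumptions: the names `choose_action` and `run_interactions` are hypothetical, the hands and rewards are passed in as a list instead of being typed at the prompt, and the learning step is a simple stand-in (move the stored value toward the reward with a step size of 0.5) rather than PyBrain's own learner.

```python
STAND, HIT = 0, 1  # action encoding used throughout this post

def choose_action(q_table, state):
    """Greedy choice: Hit only if its stored value is strictly larger (ties go to Stand)."""
    return HIT if q_table[state][HIT] > q_table[state][STAND] else STAND

def run_interactions(q_table, scripted, alpha=0.5):
    """Play through scripted (hand_value, {action: reward}) pairs.

    The reward dictionary stands in for the user, who would type the
    reward only after seeing which action the agent picked.
    """
    log = []
    for hand_value, reward_for in scripted:
        state = hand_value - 1                   # step 1: hand values 1-21 -> indices 0-20
        action = choose_action(q_table, state)   # step 2: the agent acts
        reward = reward_for[action]              # step 3: the reward for that action
        # step 4: the agent learns (stand-in update, assumed for illustration)
        q_table[state][action] += alpha * (reward - q_table[state][action])
        log.append((hand_value, action, reward))
    return log

# 21 states x 2 actions, all values starting at zero
q_table = [[0.0, 0.0] for _ in range(21)]
log = run_interactions(q_table, [
    (20, {STAND: 1.0, HIT: -1.0}),
    (20, {STAND: 1.0, HIT: -1.0}),
])
```

After these two scripted hands the agent has stood on 20 twice, and the stored value for standing on 20 has moved from 0.0 to 0.5 and then to 0.75.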

We will start with gamma = 0.0 and see what our results are.
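
To see what gamma = 0.0 means in practice, here is a minimal sketch of the standard tabular Q-learning update, Q(s,a) <- Q(s,a) + alpha * (r + gamma * max Q(s',.) - Q(s,a)). The function name `q_update` and the list-of-lists table are assumptions made for illustration, not PyBrain's internals. With gamma = 0.0 the next state drops out entirely, so each update just moves the stored value a fraction alpha of the way toward the reward.

```python
def q_update(q_table, state, action, reward, next_state, alpha=0.5, gamma=0.0):
    """One tabular Q-learning update; returns the new Q(state, action)."""
    best_next = max(q_table[next_state])   # value of the best action in the next state
    target = reward + gamma * best_next    # with gamma = 0.0 this is just the reward
    q_table[state][action] += alpha * (target - q_table[state][action])
    return q_table[state][action]

q_table = [[0.0, 0.0] for _ in range(21)]
q_table[19] = [5.0, 5.0]  # arbitrary next-state values, ignored because gamma = 0.0

# Six consecutive rewards of 1.0 for hitting (action 1) on a hand of 10 (state 9)
values = [q_update(q_table, state=9, action=1, reward=1.0, next_state=19)
          for _ in range(6)]
# values == [0.5, 0.75, 0.875, 0.9375, 0.96875, 0.984375]
```

Starting from 0.0, n consecutive rewards of 1.0 with alpha = 0.5 give 1 - 0.5^n. Several "Hit" entries in the results table further down (0.984375, 0.9921875) have exactly this form, which would be consistent with a short run of winning hits.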

For the actual implementation, we will first write an Environment specialization such as the following, in pybrain/rl/environments/:
from pybrain.rl.environments.environment import Environment

class BlackjackEnv(Environment):
    """ A (terribly simplified) Blackjack game implementation of an environment. """

    # the number of action values the environment accepts
    indim = 2
    # the number of sensor values the environment produces
    outdim = 21

    def getSensors(self):
        """ the currently visible state of the world (the observation may be stochastic - repeated calls returning different values)
            :rtype: by default, this is assumed to be a numpy array of doubles
        """
        # hand values 1-21 are mapped to state indices 0-20
        hand_value = int(raw_input("Enter hand value: ")) - 1
        return [float(hand_value),]

    def performAction(self, action):
        """ perform an action on the world that changes its internal state (maybe stochastically).
            :key action: an action that should be executed in the Environment.
            :type action: by default, this is assumed to be a numpy array of doubles
        """
        print "Action performed: ", action

    def reset(self):
        """ Most environments will implement this optional method that allows for reinitialization.
        """
        # nothing to reinitialize: the user supplies every hand manually
        pass

Then, we have the specialization of the Task, in pybrain/rl/environments/:
from pybrain.rl.environments.task import Task

class BlackjackTask(Task):
    """ A task is associating a purpose with an environment. It decides how to evaluate the observations, potentially returning reinforcement rewards or fitness values.
    Furthermore it is a filter for what should be visible to the agent.
    Also, it can potentially act as a filter on how actions are transmitted to the environment. """

    def __init__(self, environment):
        """ All tasks are coupled to an environment. """
        self.env = environment
        # we will store the last reward given, remember that "r" in the Q learning formula is the one from the last interaction, not the one given for the current interaction!
        self.lastreward = 0

    def performAction(self, action):
        """ A filtered mapping towards performAction of the underlying environment. """
        self.env.performAction(action)

    def getObservation(self):
        """ A filtered mapping to getSensors of the underlying environment. """
        sensors = self.env.getSensors()
        return sensors

    def getReward(self):
        """ Compute and return the current reward (i.e. corresponding to the last action performed) """
        # raw_input returns a string, so convert it to a number
        reward = float(raw_input("Enter reward: "))
        # retrieve last reward, and save current given reward
        cur_reward = self.lastreward
        self.lastreward = reward
        return cur_reward

    @property
    def indim(self):
        return self.env.indim

    @property
    def outdim(self):
        return self.env.outdim
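
The lastreward bookkeeping in getReward means that each call returns the reward entered on the previous call, not the one just typed. Here is a stripped-down sketch of that behaviour alone. The class name `DelayedReward` and the `get_reward` method are hypothetical, and the rewards are passed as arguments rather than read from the prompt.

```python
class DelayedReward(object):
    """Returns the previously entered reward and stores the new one."""

    def __init__(self):
        self.lastreward = 0.0

    def get_reward(self, entered):
        cur_reward = self.lastreward       # reward from the previous interaction
        self.lastreward = float(entered)   # keep the new one for the next call
        return cur_reward

buffer = DelayedReward()
returned = [buffer.get_reward(r) for r in ("1.0", "-1.0", "0.0")]
# returned == [0.0, 1.0, -1.0]
```

The first call returns the initial 0.0, and every later call lags one interaction behind what the user typed.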

Finally, we have the main code that brings everything together:

from pybrain.rl.environments.blackjacktask import BlackjackTask
from pybrain.rl.environments.blackjackenv import BlackjackEnv
from pybrain.rl.learners.valuebased import ActionValueTable
from pybrain.rl.agents import LearningAgent
from pybrain.rl.learners import Q
from pybrain.rl.experiments import Experiment
from pybrain.rl.explorers import EpsilonGreedyExplorer

# define action-value table
# number of states is:
#    current value: 1-21
# number of actions:
#    Stand=0, Hit=1
av_table = ActionValueTable(21, 2)

# define Q-learning agent
learner = Q(0.5, 0.0)
agent = LearningAgent(av_table, learner)

# define the environment
env = BlackjackEnv()

# define the task
task = BlackjackTask(env)

# finally, define experiment
experiment = Experiment(task, agent)

# ready to go, start the process
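while True:
    # one interaction: read a hand, act, read the reward
    experiment.doInteractions(1)
    # update the Q-table from the stored interaction
    agent.learn()
    # clear the agent's stored history before the next hand
    agent.reset()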

To summarize the training parameters:

alpha = 0.5
gamma = 0.0
epsilon (the exploratory parameter) = 0.0
the learning module is classical Q learning (Q)
the agent is a LearningAgent
the experiment is a basic Experiment class instance
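
For reference, epsilon-greedy exploration picks a random action with probability epsilon and the best-known action otherwise, so epsilon = 0.0 makes the agent purely greedy. The following is a generic sketch of that rule, not PyBrain's EpsilonGreedyExplorer; the function name `epsilon_greedy` and its arguments are assumptions made for illustration.

```python
import random

def epsilon_greedy(q_values, epsilon, rng=random):
    """Pick an action index from a list of Q-values.

    With probability epsilon a uniformly random action is returned,
    otherwise the action with the highest Q-value.
    """
    if rng.random() < epsilon:   # never true when epsilon == 0.0
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])

# Stand = 0, Hit = 1: with epsilon = 0.0 the better-valued action is always chosen
action = epsilon_greedy([-0.5, 0.984375], epsilon=0.0)
# action == 1 (Hit)
```

With no exploration, an action whose first few outcomes were unlucky may never be tried again, which is worth keeping in mind when reading the results.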

I trained this agent by playing against it for 300 interactions. My playing policy was to hit until 15, then stand. Also, aces are treated as a hard 11, not a soft "1 or 11". I would then reward with 1.0 if the agent wins the hand, 0.0 if it's a draw, and -1.0 if the agent loses the hand.

After the 300th interaction, the Q-table looked like this:

State            Q-value of "Stand"   Q-value of "Hit"   Relative value of Hitting over Standing
Hand value 1:    N/A                  N/A                N/A
Hand value 2:    N/A                  N/A                N/A
Hand value 3:    N/A                  N/A                N/A
Hand value 4:    -0.625               0.984375           +1.609375
Hand value 5:    -0.625               0.984375           +1.609375
Hand value 6:    -0.25                0.984375           +1.234375
Hand value 7:    -0.5                 0.9921875          +1.4921875
Hand value 8:    -0.5                 0.984375           +1.484375
Hand value 9:    -0.5                 0.984375           +1.484375
Hand value 10:   -0.5                 0.998046875        +1.498046875
Hand value 11:   -0.5                 0.994140625        +1.494140625
Hand value 12:   -0.85302734375       -0.636352539063    +0.216674804687
Hand value 13:   -0.875               0.40990447998      +1.28490447998
Hand value 14:   -0.671875            -0.64518815279     +0.02668684721
Hand value 15:   -0.875               0.1123046875       +0.9873046875
Hand value 16:   -0.825504302979      -0.875             -0.049495697021
Hand value 17:   -0.4501953125        -0.880859375       -0.4306640625
Hand value 18:   -0.577491760254      -0.95849609375     -0.381004333496
Hand value 19:   0.707023620605       -0.5               -1.207023620605
Hand value 20:   0.499875545385       -0.5               -0.999875545385
Hand value 21:   0.937499523163       0.0                -0.937499523163

Keep in mind that this is the "optimal" strategy for playing against me specifically (or, rather, against the playing policy I adopted for the training). Against a completely different adversary, or multiple adversaries, the strategy could end up being different. Also, we can see that the Q-values are unstable and somewhat vulnerable to noise (the value of hitting versus standing on 14 is 0.027, yet the same value for a hand of 15 is 0.99?)...
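
Reading a policy off the table amounts to comparing the two Q-values for each hand. Here is a small sketch using a few rows copied from the table above; the dictionary `q_values` and the helper `best_action` are names introduced here for illustration.

```python
# hand value -> (Q-value of "Stand", Q-value of "Hit"), copied from the table
q_values = {
    12: (-0.85302734375, -0.636352539063),
    14: (-0.671875, -0.64518815279),
    15: (-0.875, 0.1123046875),
    16: (-0.825504302979, -0.875),
    17: (-0.4501953125, -0.880859375),
    19: (0.707023620605, -0.5),
}

def best_action(stand, hit):
    """The greedy action is whichever of the two has the larger Q-value."""
    return "Hit" if hit > stand else "Stand"

policy = dict((hand, best_action(*q)) for hand, q in q_values.items())
# 12, 14 and 15 map to "Hit"; 16, 17 and 19 map to "Stand"
```

Over the full table the relative value is positive for hands 4 through 15 and negative for 16 through 21, so the learned greedy policy hits up to 15 and stands from 16 on.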


  1. Amazing introduction to Q-learning and PyBrain. The only part that I didn't understand is how the hand values of less than 10 get Q-values, since rewards are issued only after either winning or losing. Thank you.

  3. How did you display the q table? Like where are these values if I want to print after every hand?

  4. I would also like to know how to print the q-table.

  5. For anyone like me wondering how to print the Q-table, I think I have a solution.

    First, in the last script, we have to increase the number of interactions from 1 to a higher value like 50 or 100. A much higher value would be inconvenient if you are entering rewards manually.
    # ready to go, start the process
    while True:
        experiment.doInteractions(50)  # changed from experiment.doInteractions(1)
    The next step is to print our action-value table, which we initialised as av_table:
    # ready to go, start the process
    while True:
        print av_table.params.reshape(21,2)

  6. For some reason, every time I print out av_table.params.reshape(21,2) it's just a table of 0s. I have not changed anything from your code either...