Saturday, August 21, 2010

PyBrain: Reinforcement Learning, a Tutorial

4(a) – A Blackjack-playing agent:

First, we will start with a very basic, minimalist scenario: a hand is dealt, and the agent is asked whether it should take another card or stop. There is no splitting, no betting, etc. From a modeling perspective, then, the scenario can be represented as 21 states and 2 possible actions. The 21 states are, of course, the possible total values of the hand the agent was dealt, and the 2 possible actions are Hit or Stand.

To simplify the code, the interaction occurs on the command line, and the user manually feeds in the hand dealt and the result of the agent's action. The steps of one interaction are as follows (a sample console session is sketched just after the list):

1 – Ask the user to input the hand value that was dealt.
2 – The agent performs its action and outputs it to the user.
3 – Ask the user to input the reward, i.e. whether the hand was a win (1.0), a draw (0.0), or a loss (-1.0).
4 – The agent learns.
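
For illustration, a single interaction might look like the following on the console (the exact formatting of the action output depends on how the numpy action array is printed, so treat this as a sketch rather than a literal transcript):
-----------------------------------------------------------------------------------------------------------
Enter hand value: 14
Action performed:  [ 1.]
Enter reward: 1.0
-----------------------------------------------------------------------------------------------------------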

We will start with gamma = 0.0 and see what our results are.
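
Since gamma = 0.0 removes the bootstrapping term, each Q-value is simply a running average of the immediate rewards received for that state/action pair: Q(s,a) <- Q(s,a) + alpha * (r - Q(s,a)). As a quick illustrative sketch (standalone arithmetic, not part of the PyBrain code that follows), this is how the Q-value of an action that keeps winning evolves with alpha = 0.5:
-----------------------------------------------------------------------------------------------------------
# illustrative only: the tabular Q update with gamma = 0.0 and alpha = 0.5
alpha = 0.5
q = 0.0
for i in range(6):
    r = 1.0                  # reward for a winning hand
    q = q + alpha * (r - q)  # no discounted next-state term, since gamma = 0.0
    print q                  # 0.5, 0.75, 0.875, 0.9375, 0.96875, 0.984375
-----------------------------------------------------------------------------------------------------------
Values such as 0.984375 in the Q-table at the end of this post fit this pattern.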

For the actual implementation, we will first write an Environment specialization, such as the following pybrain/rl/environments/blackjackenv.py:
-----------------------------------------------------------------------------------------------------------
from pybrain.rl.environments.environment import Environment

class BlackjackEnv(Environment):
    """ A (terribly simplified) Blackjack game implementation of an environment. """      

    # the number of action values the environment accepts
    indim = 2
   
    # the number of sensor values the environment produces
    outdim = 21
   
    def getSensors(self):
        """ the currently visible state of the world (the    observation may be stochastic - repeated calls returning different values)
            :rtype: by default, this is assumed to be a numpy array of doubles
        """

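        # the ActionValueTable has 21 states indexed 0..20, while hand values run 1..21, hence the - 1 below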
        hand_value = int(raw_input("Enter hand value: ")) - 1
        return [float(hand_value),]
                   
    def performAction(self, action):
        """ perform an action on the world that changes it's internal state (maybe stochastically).
            :key action: an action that should be executed in the Environment.
            :type action: by default, this is assumed to be a numpy array of doubles
        """

        print "Action performed: ", action

    def reset(self):
        """ Most environments will implement this optional method that allows for reinitialization.
        """
-----------------------------------------------------------------------------------------------------------
        
Then, we have the specialization of the Task, in pybrain/rl/environments/blackjacktask.py:
-----------------------------------------------------------------------------------------------------------
from pybrain.rl.environments.task import Task

class BlackjackTask(Task):
    """ A task is associating a purpose with an environment. It decides how to evaluate the observations, potentially returning reinforcement rewards or fitness values.
    Furthermore it is a filter for what should be visible to the agent.
    Also, it can potentially act as a filter on how actions are transmitted to the environment. """


    def __init__(self, environment):
        """ All tasks are coupled to an environment. """
        self.env = environment
        # we will store the last reward given, remember that "r" in the Q learning formula is the one from the last interaction, not the one given for the current interaction!
        self.lastreward = 0

    def performAction(self, action):
        """ A filtered mapping towards performAction of the underlying environment. """               
        self.env.performAction(action)
       
    def getObservation(self):
        """ A filtered mapping to getSample of the underlying environment. """
        sensors = self.env.getSensors()
        return sensors
   
    def getReward(self):
        """ Compute and return the current reward (i.e. corresponding to the last action performed) """
        # raw_input returns a string; convert it so the learner receives a numeric reward
        reward = float(raw_input("Enter reward: "))
       
        # retrieve last reward, and save current given reward
        cur_reward = self.lastreward
        self.lastreward = reward
    
        return cur_reward

    @property
    def indim(self):
        return self.env.indim
   
    @property
    def outdim(self):
        return self.env.outdim
-----------------------------------------------------------------------------------------------------------
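
For context, each call to experiment.doInteractions(1) in the main script below runs one interaction roughly as follows (a paraphrased sketch of PyBrain's Experiment loop, included only to show where the observation, action and reward calls fit in; it is not code to add to the project):
-----------------------------------------------------------------------------------------------------------
# paraphrased sketch of one Experiment interaction (not code to be added)
obs = task.getObservation()      # prompts "Enter hand value: "
agent.integrateObservation(obs)
action = agent.getAction()       # the agent picks Stand (0) or Hit (1)
task.performAction(action)       # prints "Action performed: ..."
reward = task.getReward()        # prompts "Enter reward: "
agent.giveReward(reward)
-----------------------------------------------------------------------------------------------------------
Note that getReward() is called once per interaction, right after the action is performed; the lastreward bookkeeping in BlackjackTask then hands the learner the reward entered during the previous interaction, as explained in the comment in its __init__.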


Finally, we have the main code that brings everything together:
-----------------------------------------------------------------------------------------------------------

from pybrain.rl.environments.blackjacktask import BlackjackTask
from pybrain.rl.environments.blackjackenv import BlackjackEnv
from pybrain.rl.learners.valuebased import ActionValueTable
from pybrain.rl.agents import LearningAgent
from pybrain.rl.learners import Q
from pybrain.rl.experiments import Experiment
from pybrain.rl.explorers import EpsilonGreedyExplorer

# define action-value table
# number of states is:
#
#    current value: 1-21
#
# number of actions:
#
#    Stand=0, Hit=1
av_table = ActionValueTable(21, 2)
av_table.initialize(0.)

# define Q-learning agent
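# Q(alpha, gamma): alpha is the learning rate, gamma the discount factor (here 0.5 and 0.0)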
learner = Q(0.5, 0.0)
learner._setExplorer(EpsilonGreedyExplorer(0.0))
agent = LearningAgent(av_table, learner)

# define the environment
env = BlackjackEnv()

# define the task
task = BlackjackTask(env)

# finally, define experiment
experiment = Experiment(task, agent)

# ready to go, start the process
while True:
    experiment.doInteractions(1)
    agent.learn()
    agent.reset()
-----------------------------------------------------------------------------------------------------------
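
If you want to inspect the learned Q-values at any point (this is essentially how a table like the one further down can be produced), one simple option, not shown in the script above, is to print the ActionValueTable's parameter array reshaped into 21 rows (hand values) by 2 columns (Stand, Hit). Placed inside the loop, right after agent.learn(), it lets you watch the table evolve:
-----------------------------------------------------------------------------------------------------------
    # optional: dump the current Q-table, one row per hand value, columns are (Stand, Hit)
    print av_table.params.reshape(21, 2)
-----------------------------------------------------------------------------------------------------------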



To summarize the training parameters:

alpha = 0.5
gamma = 0.0
epsilon (the exploration parameter) = 0.0
the learning module is classical Q learning (Q)
the agent is a LearningAgent
the experiment is a basic Experiment class instance


I trained this agent by playing against it for 300 interactions. My playing policy was to hit until 15, then stand. Also, aces are treated as a hard 11, not a soft "1 or 11". I would then reward with 1.0 if the agent won the hand, 0.0 if it was a draw, and -1.0 if the agent lost the hand.


After the 300th interaction, the Q-table looked like this:

State          | Q-value of "Stand" | Q-value of "Hit" | Relative value of Hitting over Standing
---------------|--------------------|------------------|----------------------------------------
Hand value 1:  | N/A                | N/A              | N/A
Hand value 2:  | N/A                | N/A              | N/A
Hand value 3:  | N/A                | N/A              | N/A
Hand value 4:  | -0.625             | 0.984375         | +1.609375
Hand value 5:  | -0.625             | 0.984375         | +1.609375
Hand value 6:  | -0.25              | 0.984375         | +1.234375
Hand value 7:  | -0.5               | 0.9921875        | +1.4921875
Hand value 8:  | -0.5               | 0.984375         | +1.484375
Hand value 9:  | -0.5               | 0.984375         | +1.484375
Hand value 10: | -0.5               | 0.998046875      | +1.498046875
Hand value 11: | -0.5               | 0.994140625      | +1.494140625
Hand value 12: | -0.85302734375     | -0.636352539063  | +0.216674804687
Hand value 13: | -0.875             | 0.40990447998    | +1.28490447998
Hand value 14: | -0.671875          | -0.64518815279   | +0.02668684721
Hand value 15: | -0.875             | 0.1123046875     | +0.9873046875
Hand value 16: | -0.825504302979    | -0.875           | -0.049495697021
Hand value 17: | -0.4501953125      | -0.880859375     | -0.4306640625
Hand value 18: | -0.577491760254    | -0.95849609375   | -0.381004333496
Hand value 19: | 0.707023620605     | -0.5             | -1.207023620605
Hand value 20: | 0.499875545385     | -0.5             | -0.999875545385
Hand value 21: | 0.937499523163     | 0.0              | -0.937499523163


Keep in mind that this is the "optimal" strategy for playing against me specifically (or, rather, against the playing policy I adopted for the training). Against a completely different adversary, or multiple adversaries, the strategy could end up being different. We can also see that the Q-values are somewhat noisy and unstable (the relative value of hitting over standing on 14 is only 0.027, yet the same value for a hand of 15 is 0.99)...

5 comments:

  1. Amazing introduction to Q-learning and PyBrain. The only part that I didn't understand is how hand values of less than 10 get Q-values, since rewards are issued only after either winning or losing. Thank you.

  2. How did you display the Q-table? Where are these values if I want to print them after every hand?

  3. I would also like to know how to print the Q-table.

  4. For anyone like me wondering how to print the Q-table, I think I have a solution.

    Firstly, in the last script, we could increase the number of interactions from 1 to a higher value like 50 or 100 (a much higher value would be inconvenient if you are entering rewards manually), i.e. change experiment.doInteractions(1) to experiment.doInteractions(50).

    The next step is to print the action-value table we initialised as av_table:
    --------------------------------------------------------------------------
    # ready to go, start the process
    while True:
        experiment.doInteractions(1)
        agent.learn()
        print av_table.params.reshape(21,2)
        agent.reset()
    --------------------------------------------------------------------------

  5. For some reason, every time I print out av_table.params.reshape(21,2) it's just a table of 0s. I have not changed anything from your code either...
