4(a) – A Blackjack playing agent:
We will start with a very basic, minimalist scenario where a hand is dealt and the agent is asked whether it should take another card or stop. There is no splitting, no betting, etc. From the modelling perspective, then, the scenario can be represented as 21 states calling for 2 possible actions. The 21 states are, of course, the possible total values of the hand the agent was dealt, and the 2 possible actions are Hit or Stand.
To simplify the code, the interaction will occur on the command line, and the user will manually feed in the hand dealt and the result of the agent's action. The steps of the interaction are:
1 – Ask the user to input the hand value that was dealt.
2 – The agent chooses an action and outputs it to the user.
3 – Ask the user to input the reward, i.e. whether the hand was a win (1.0), a draw (0.0), or a loss (-1.0).
4 – The agent learns.
We will start with gamma = 0.0 and see what our results are.
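Before diving into the PyBrain code, it is worth recalling what the learning rule does in this setting. The classical Q-learning update is Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a)); with gamma = 0.0 the lookahead term vanishes, and each table entry simply tracks a running average of the rewards received for that (hand value, action) pair. Here is a minimal standalone sketch of that tabular update, for illustration only (PyBrain's Q learner will perform the equivalent update for us later; the names q_table and q_update are mine):
-----------------------------------------------------------------------------------------------------------
import numpy as np

ALPHA, GAMMA = 0.5, 0.0

# one row per hand value 1-21 (stored as index 0-20), one column per action: Stand=0, Hit=1
q_table = np.zeros((21, 2))

def q_update(state, action, reward, next_state):
    # classical tabular Q-learning step; with GAMMA = 0.0 the max-over-next-state term vanishes
    target = reward + GAMMA * q_table[next_state].max()
    q_table[state, action] += ALPHA * (target - q_table[state, action])

# example: the agent hit on a dealt hand of 12 (row index 11) and the hand was eventually lost
q_update(11, 1, -1.0, 15)
-----------------------------------------------------------------------------------------------------------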
For the actual implementation, we will first write an Environment specialization, in pybrain/rl/environments/blackjackenv.py:
-----------------------------------------------------------------------------------------------------------
from pybrain.rl.environments.environment import Environment
from scipy import zeros

class BlackjackEnv(Environment):
    """ A (terribly simplified) Blackjack game implementation of an environment. """

    # the number of action values the environment accepts
    indim = 2

    # the number of sensor values the environment produces
    outdim = 21

    def getSensors(self):
        """ the currently visible state of the world (the observation may be stochastic - repeated calls returning different values)
            :rtype: by default, this is assumed to be a numpy array of doubles
        """
        # the hand value entered by the user (1-21) becomes a zero-based state index (0-20)
        hand_value = int(raw_input("Enter hand value: ")) - 1
        return [float(hand_value),]

    def performAction(self, action):
        """ perform an action on the world that changes its internal state (maybe stochastically).
            :key action: an action that should be executed in the Environment.
            :type action: by default, this is assumed to be a numpy array of doubles
        """
        print "Action performed: ", action

    def reset(self):
        """ Most environments will implement this optional method that allows for reinitialization. """
-----------------------------------------------------------------------------------------------------------
Then, we have the specialization of the Task, in pybrain/rl/environments/blackjacktask.py:
-----------------------------------------------------------------------------------------------------------
from scipy import clip, asarray
from pybrain.rl.environments.task import Task
from numpy import *

class BlackjackTask(Task):
    """ A task is associating a purpose with an environment. It decides how to evaluate the observations, potentially returning reinforcement rewards or fitness values.
        Furthermore it is a filter for what should be visible to the agent.
        Also, it can potentially act as a filter on how actions are transmitted to the environment. """

    def __init__(self, environment):
        """ All tasks are coupled to an environment. """
        self.env = environment
        # we will store the last reward given; remember that "r" in the Q-learning formula is the one
        # from the last interaction, not the one given for the current interaction!
        self.lastreward = 0

    def performAction(self, action):
        """ A filtered mapping towards performAction of the underlying environment. """
        self.env.performAction(action)

    def getObservation(self):
        """ A filtered mapping to getSample of the underlying environment. """
        sensors = self.env.getSensors()
        return sensors

    def getReward(self):
        """ Compute and return the current reward (i.e. corresponding to the last action performed) """
        # the reward entered by the user is converted to a float: 1.0 (win), 0.0 (draw) or -1.0 (loss)
        reward = float(raw_input("Enter reward: "))
        # retrieve last reward, and save current given reward
        cur_reward = self.lastreward
        self.lastreward = reward
        return cur_reward

    @property
    def indim(self):
        return self.env.indim

    @property
    def outdim(self):
        return self.env.outdim
-----------------------------------------------------------------------------------------------------------
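Note the small bookkeeping dance in getReward() above: the value it returns is the reward the user typed during the previous interaction, while the newly typed reward is stashed in self.lastreward for next time. A tiny standalone trace of that mechanism (the RewardShift class is just an illustration of the code above, with an argument standing in for the raw_input() call):
-----------------------------------------------------------------------------------------------------------
# standalone trace of the lastreward bookkeeping used in BlackjackTask.getReward():
# the reward returned at interaction t is the one entered at interaction t-1
class RewardShift(object):
    def __init__(self):
        self.lastreward = 0

    def getReward(self, entered):
        # 'entered' stands in for float(raw_input("Enter reward: "))
        cur_reward = self.lastreward
        self.lastreward = entered
        return cur_reward

shift = RewardShift()
print shift.getReward(-1.0)   # interaction 1: prints 0, there is nothing to credit yet
print shift.getReward(1.0)    # interaction 2: prints -1.0, credited to the previous action
-----------------------------------------------------------------------------------------------------------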
Finally, we have the main code that brings everything together:
-----------------------------------------------------------------------------------------------------------
from pybrain.rl.environments.blackjacktask import BlackjackTask
from pybrain.rl.environments.blackjackenv import BlackjackEnv
from pybrain.rl.learners.valuebased import ActionValueTable
from pybrain.rl.agents import LearningAgent
from pybrain.rl.learners import Q
from pybrain.rl.experiments import Experiment
from pybrain.rl.explorers import EpsilonGreedyExplorer
# define action-value table
# number of states is the current hand value: 1-21
# number of actions: Stand=0, Hit=1
av_table = ActionValueTable(21, 2)
av_table.initialize(0.)
# define Q-learning agent
learner = Q(0.5, 0.0)
learner._setExplorer(EpsilonGreedyExplorer(0.0))
agent = LearningAgent(av_table, learner)
# define the environment
env = BlackjackEnv()
# define the task
task = BlackjackTask(env)
# finally, define experiment
experiment = Experiment(task, agent)
# ready to go, start the process
while True:
    experiment.doInteractions(1)
    agent.learn()
    agent.reset()
-----------------------------------------------------------------------------------------------------------
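When the script runs, each pass through the loop walks through the four steps listed at the beginning: the environment asks for the hand value, the agent's chosen action is echoed back, the user enters the reward, and the agent learns. A single interaction on the command line therefore looks roughly like this (the action is printed as a one-element array, 0 for Stand and 1 for Hit; its exact formatting may differ):
-----------------------------------------------------------------------------------------------------------
Enter hand value: 12
Action performed:  [ 1.]
Enter reward: -1.0
-----------------------------------------------------------------------------------------------------------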
To summarize the training parameters:
alpha = 0.5
gamma = 0.0
epsilon (the exploratory parameter) = 0.0
the learning module is classical Q learning (Q)
the agent is a LearningAgent
the experiment is a basic Experiment class instance
I trained this agent by playing against it for 300 interactions. My playing policy was to hit until 15, then stand. Also, aces are treated as a hard 11, not a soft "1 or 11". I would then reward with 1.0 if the agent wins the hand, 0.0 if it is a draw, and -1.0 if the agent loses the hand.
After the 300th interaction, the Q-table looked like this:
State | Q-value of "Stand" | Q-value of "Hit" | Relative value of Hitting over Standing |
Hand value 1: | N/A | N/A | N/A |
Hand value 2: | N/A | N/A | N/A |
Hand value 3: | N/A | N/A | N/A |
Hand value 4: | -0.625 | 0.984375 | +1.609375 |
Hand value 5: | -0.625 | 0.984375 | +1.609375 |
Hand value 6: | -0.25 | 0.984375 | +1.234375 |
Hand value 7: | -0.5 | 0.9921875 | +1.4921875 |
Hand value 8: | -0.5 | 0.984375 | +1.484375 |
Hand value 9: | -0.5 | 0.984375 | +1.484375 |
Hand value 10: | -0.5 | 0.998046875 | +1.498046875 |
Hand value 11: | -0.5 | 0.994140625 | +1.494140625 |
Hand value 12: | -0.85302734375 | -0.636352539063 | +0.216674804687 |
Hand value 13: | -0.875 | 0.40990447998 | +1.28490447998 |
Hand value 14: | -0.671875 | -0.64518815279 | +0.02668684721 |
Hand value 15: | -0.875 | 0.1123046875 | +0.9873046875 |
Hand value 16: | -0.825504302979 | -0.875 | -0.049495697021 |
Hand value 17: | -0.4501953125 | -0.880859375 | -0.4306640625 |
Hand value 18: | -0.577491760254 | -0.95849609375 | -0.381004333496 |
Hand value 19: | 0.707023620605 | -0.5 | -1.207023620605 |
Hand value 20: | 0.499875545385 | -0.5 | -0.999875545385 |
Hand value 21: | 0.937499523163 | 0.0 | -0.937499523163 |
Keep in mind that this is the "optimal" strategy for playing against me specifically (or rather, against the playing policy I adopted for the training). Against a completely different adversary, or multiple adversaries, the learned strategy could end up being different. We can also see that the Q-values are somewhat vulnerable to noise and unstable (the relative value of hitting over standing on 14 is only 0.027, yet for a hand of 15 it is 0.99). With alpha = 0.5 and gamma = 0.0, each update simply averages the old estimate with the latest reward, so a state's values can swing considerably with the last few outcomes observed in it.
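As a side note, the greedy policy encoded by a table like the one above can be read straight out of av_table by taking, for each hand value, the action with the larger Q-value. A quick way to dump it at the end of the training script, reusing the av_table.params layout (rows are hand values, columns are Stand and Hit; states that were never visited keep their initial value of 0.0 and show up as Stand here):
-----------------------------------------------------------------------------------------------------------
# dump the greedy policy implied by the learned action-value table
q_values = av_table.params.reshape(21, 2)
for hand_value in range(1, 22):
    row = q_values[hand_value - 1]                 # row index is the hand value minus one
    best = "Hit" if row[1] > row[0] else "Stand"   # column 0 is Stand, column 1 is Hit
    print "Hand value %2d: %s" % (hand_value, best)
-----------------------------------------------------------------------------------------------------------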
Comments:

Amazing introduction to Q-learning and PyBrain. The only part that I didn't understand is how the hand values of less than 10 get Q-values, since rewards are issued only after either winning or losing. Thank you.

I know this is an old post, but I found the blog useful so will answer the questions. The action is "draw or stand". The reward is whether the correct action was presented. There is no "winning" in the blackjack sense.

How did you display the Q-table? Where are these values if I want to print them after every hand?

I would also like to know how to print the Q-table.

For anyone like me wondering how to print the Q-table, I think I have a solution. First, in the last script, increase the number of interactions per call from 1 to a higher value like 50 or 100 (a much higher value would be inconvenient if you are entering rewards manually):
--------------------------------------------------------------------------
# ready to go, start the process
while True:
    experiment.doInteractions(50)   # changed from doInteractions(1)
--------------------------------------------------------------------------
The next step is to print the action-value table we initialised as av_table:
--------------------------------------------------------------------------
# ready to go, start the process
while True:
    experiment.doInteractions(1)
    agent.learn()
    print av_table.params.reshape(21, 2)
    agent.reset()
--------------------------------------------------------------------------
For some reason, every time I print out av_table.params.reshape(21,2) it's just a table of 0s. I have not changed anything from your code either...
The agent.reset() command is resetting the Q table. Either print before that, or simply comment it out.