Tuesday, August 2, 2011

Time-Series calculations using SQL Server 2008: the CROSS APPLY method

For these examples I will use RSI (Relative Strength Index) calculations, which require a chronologically ordered set of data where:

RSI = 100 - ( 100 / (1 + (MA(U) / MA(D)) ) )

MA: simple or exponential moving average, depending on the kind of RSI we are interested in.
U: set of current close price - previous close price, if current close price > previous close price (0 otherwise), for each period.
D: set of previous close price - current close price, if current close price < previous close price (0 otherwise), for each period.

This method uses CROSS APPLY, to avoid iterative or cursor-based processes.
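
For reference, here is the same calculation expressed procedurally, i.e. roughly the kind of row-by-row iteration the CROSS APPLY approach avoids. This is a minimal, purely illustrative Python sketch of Cutler's RSI (simple moving averages); the function name and signature are mine and are not part of the SQL solution:

def cutler_rsi(closes, num_values):
    """Cutler's RSI (simple moving averages) over a chronologically ordered list of closes.
    Returns one RSI value per differential, i.e. starting from the second close."""
    # differentials: close(now) - close(yesterday)
    diffs = [closes[i] - closes[i - 1] for i in range(1, len(closes))]
    rsis = []
    for i in range(len(diffs)):
        # last num_values differentials up to and including position i
        window = diffs[max(0, i - num_values + 1): i + 1]
        avg_u = sum(d for d in window if d > 0) / float(num_values)
        avg_d = sum(-d for d in window if d < 0) / float(num_values)
        rsis.append(100.0 if avg_d == 0 else 100.0 - (100.0 / (1.0 + avg_u / avg_d)))
    return rsis

# e.g. cutler_rsi([44.0, 44.3, 44.1, 44.5, 44.2], 3)
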

First, I create a user-defined table type called OrderedPriceSet:


CREATE TYPE dbo.OrderedPriceSet AS TABLE
(
  ValueRank INT PRIMARY KEY,
  ValuePrice SMALLMONEY
);

Then I write a table-valued user-defined function that calculates the RSI value for each row in a given table of close prices:


CREATE FUNCTION Utility.udfCutlerRSI
(
  @priceSet OrderedPriceSet READONLY,
  @numValues INT 
)
RETURNS @RSI_table TABLE
(
  ValueRank INT PRIMARY KEY,
  ValuePrice SMALLMONEY
)
AS 
BEGIN
  -- using CROSS APPLY

  -- first, generate the ordered set of differentials (close(now) - close(yesterday))
  DECLARE @differentials As OrderedPriceSet;


  INSERT @differentials (ValuePrice, ValueRank)
  SELECT apply_ps.ValuePrice, ps.ValueRank
  FROM @priceSet As ps
  CROSS APPLY ( SELECT (ps.ValuePrice - inner_ps.ValuePrice) As ValuePrice
                FROM @priceSet As inner_ps
                WHERE inner_ps.ValueRank = ps.ValueRank - 1 ) As apply_ps;


  -- now we need to generate the U's and D's of each @numValues period, and calculate the RSI (can be done in one shot I think)
  INSERT @RSI_table (ValueRank, ValuePrice)
  SELECT diffs.ValueRank,
         CASE sub_calcs.Dsum
           WHEN 0 THEN 100.0
           ELSE (100.0 - (100.0 / (1.0 + ((sub_calcs.Usum / CONVERT(FLOAT, @numValues)) / (sub_calcs.Dsum / CONVERT(FLOAT, @numValues))))))
         END AS ValuePrice
  FROM @differentials As diffs
  CROSS APPLY ( SELECT inner_diffs.ValueRank,
                       SUM(CASE WHEN last_vals.ValuePrice > 0 THEN last_vals.ValuePrice ELSE 0 END) As Usum,
                       SUM(CASE WHEN last_vals.ValuePrice < 0 THEN ABS(last_vals.ValuePrice) ELSE 0 END) As Dsum
                FROM @differentials As inner_diffs
                CROSS APPLY ( SELECT TOP(@numValues) ValuePrice
                              FROM @differentials
                              WHERE ValueRank <= inner_diffs.ValueRank
                              ORDER BY ValueRank DESC ) As last_vals
                WHERE inner_diffs.ValueRank = diffs.ValueRank
                GROUP BY inner_diffs.ValueRank ) As sub_calcs;


  RETURN;
END;

Finally, I use this UDF by passing it an OrderedPriceSet of chronologically ranked close prices and the number of periods for my RSI, and I INSERT the output into a table variable, which then contains the RSI for each ranked row.

Saturday, July 30, 2011

Time-series calculations using SQL Server 2008

I am currently in the process of generating technical analysis result sets for my stock market historical data, in SQL Server 2008. In doing so I am realizing more and more how terrible relational databases / SQL are at this kind of thing.

One possible solution is to use CLR functions and do these things in good old procedural algorithms using C#. However, CLR functions don't support (yet?) being passed table-valued parameters, and as a result I would have to marshal my input time series into an XML or CSV parameter. This is fine when you are operating on a small set of values -- however I have moral objections to doing this when you have hundreds of thousands, or even millions of rows, to operate on.

You can also have the CLR function fetch the data set itself, but extensive data retrieval inside CLR functions is generally a no-no from a performance standpoint.

Additional NOTE: this can't be done in CLR custom aggregates, because aggregates don't support ordered sets.

Here I will outline the various strategies I have tried, and conclude with which approach I think is the best for this kind of problem.

Saturday, August 21, 2010

PyBrain: Reinforcement Learning, a Tutorial

4(a) – A Black Jack playing agent:

First, we will start with a very basic, minimalist scenario where a hand is dealt, and the agent is asked whether it should get another card, or stop. There is no such thing as splitting, there is no betting, etc. From the business perspective, then, the scenario can be represented as 21 states calling for 2 possible actions. The 21 states are, of course, the total hand value that the agent was dealt, and the 2 possible actions are Hit or Stand.

To simplify the code, the interaction will occur on the command line, and the user will feed in manually the hand dealt and the results of the agent's action. The steps of the interaction are:

1 – Ask user to input the hand value that was dealt.
2 – The agent performs the action, outputs it to the user.
3 – Ask user to input the reward, i.e. whether the hand was winning (1.0), a draw (0.0), or losing (-1.0).
4 – The agent learns.

We will start with gamma = 0.0 and see what our results are.

For the actual implementation, we will first write an Environment specialization such as the following, pybrain/rl/environments/blackjackenv.py:
-----------------------------------------------------------------------------------------------------------
from pybrain.rl.environments.environment import Environment
from scipy import zeros

class BlackjackEnv(Environment):
    """ A (terribly simplified) Blackjack game implementation of an environment. """      

    # the number of action values the environment accepts
    indim = 2
   
    # the number of sensor values the environment produces
    outdim = 21
   
    def getSensors(self):
        """ the currently visible state of the world (the    observation may be stochastic - repeated calls returning different values)
            :rtype: by default, this is assumed to be a numpy array of doubles
        """

        # hand values 1-21 are mapped to ActionValueTable indices 0-20
        hand_value = int(raw_input("Enter hand value: ")) - 1
        return [float(hand_value),]
                   
    def performAction(self, action):
        """ perform an action on the world that changes it's internal state (maybe stochastically).
            :key action: an action that should be executed in the Environment.
            :type action: by default, this is assumed to be a numpy array of doubles
        """

        print "Action performed: ", action

    def reset(self):
        """ Most environments will implement this optional method that allows for reinitialization.
        """
-----------------------------------------------------------------------------------------------------------
        
Then, we have the specialization of the Task, in pybrain/rl/environments/blackjacktask.py:
-----------------------------------------------------------------------------------------------------------
from scipy import clip, asarray

from pybrain.rl.environments.task import Task
from numpy import *

class BlackjackTask(Task):
    """ A task is associating a purpose with an environment. It decides how to evaluate the observations, potentially returning reinforcement rewards or fitness values.
    Furthermore it is a filter for what should be visible to the agent.
    Also, it can potentially act as a filter on how actions are transmitted to the environment. """


    def __init__(self, environment):
        """ All tasks are coupled to an environment. """
        self.env = environment
        # we will store the last reward given, remember that "r" in the Q learning formula is the one from the last interaction, not the one given for the current interaction!
        self.lastreward = 0

    def performAction(self, action):
        """ A filtered mapping towards performAction of the underlying environment. """               
        self.env.performAction(action)
       
    def getObservation(self):
        """ A filtered mapping to getSample of the underlying environment. """
        sensors = self.env.getSensors()
        return sensors
   
    def getReward(self):
        """ Compute and return the current reward (i.e. corresponding to the last action performed) """
        # rewards are numeric: 1.0 for a win, 0.0 for a draw, -1.0 for a loss
        reward = float(raw_input("Enter reward: "))
       
        # retrieve last reward, and save current given reward
        cur_reward = self.lastreward
        self.lastreward = reward
    
        return cur_reward

    @property
    def indim(self):
        return self.env.indim
   
    @property
    def outdim(self):
        return self.env.outdim
-----------------------------------------------------------------------------------------------------------


Finally, we have the main code that brings everything together:
-----------------------------------------------------------------------------------------------------------

from pybrain.rl.environments.blackjacktask import BlackjackTask
from pybrain.rl.environments.blackjackenv import BlackjackEnv
from pybrain.rl.learners.valuebased import ActionValueTable
from pybrain.rl.agents import LearningAgent
from pybrain.rl.learners import Q
from pybrain.rl.experiments import Experiment
from pybrain.rl.explorers import EpsilonGreedyExplorer

# define action-value table
# number of states is:
#
#    current value: 1-21
#
# number of actions:
#
#    Stand=0, Hit=1
av_table = ActionValueTable(21, 2)
av_table.initialize(0.)

# define Q-learning agent
learner = Q(0.5, 0.0)
learner._setExplorer(EpsilonGreedyExplorer(0.0))
agent = LearningAgent(av_table, learner)

# define the environment
env = BlackjackEnv()

# define the task
task = BlackjackTask(env)

# finally, define experiment
experiment = Experiment(task, agent)

# ready to go, start the process
while True:
    experiment.doInteractions(1)
    agent.learn()
    agent.reset()
-----------------------------------------------------------------------------------------------------------



To summarize the training parameters:

alpha = 0.5
gamma = 0.0
epsilon (the exploratory parameter) = 0.0
the learning module is classical Q learning (Q)
the agent is a LearningAgent
the experiment is a basic Experiment class instance


I trained this agent by playing against it for 300 interactions. My playing policy was to hit until 15, then stand. Also, aces are treated as a hard 11, not a soft "1 or 11". I would then reward with 1.0 if the agent wins the hand, 0.0 if it's a draw, and -1.0 if the agent loses the hand.


After the 300th interaction, the Q-table looked like this:

State          | Q-value of "Stand" | Q-value of "Hit"  | Relative value of Hitting over Standing
Hand value 1   | N/A                | N/A               | N/A
Hand value 2   | N/A                | N/A               | N/A
Hand value 3   | N/A                | N/A               | N/A
Hand value 4   | -0.625             | 0.984375          | +1.609375
Hand value 5   | -0.625             | 0.984375          | +1.609375
Hand value 6   | -0.25              | 0.984375          | +1.234375
Hand value 7   | -0.5               | 0.9921875         | +1.4921875
Hand value 8   | -0.5               | 0.984375          | +1.484375
Hand value 9   | -0.5               | 0.984375          | +1.484375
Hand value 10  | -0.5               | 0.998046875       | +1.498046875
Hand value 11  | -0.5               | 0.994140625       | +1.494140625
Hand value 12  | -0.85302734375     | -0.636352539063   | +0.216674804687
Hand value 13  | -0.875             | 0.40990447998     | +1.28490447998
Hand value 14  | -0.671875          | -0.64518815279    | +0.02668684721
Hand value 15  | -0.875             | 0.1123046875      | +0.9873046875
Hand value 16  | -0.825504302979    | -0.875            | -0.049495697021
Hand value 17  | -0.4501953125      | -0.880859375      | -0.4306640625
Hand value 18  | -0.577491760254    | -0.95849609375    | -0.381004333496
Hand value 19  | 0.707023620605     | -0.5              | -1.207023620605
Hand value 20  | 0.499875545385     | -0.5              | -0.999875545385
Hand value 21  | 0.937499523163     | 0.0               | -0.937499523163


Keep in mind that this is the "optimal" strategy for playing against me specifically (or, rather, against the playing policy I adopted for the training). Against a completely different adversary, or multiple adversaries, the strategy could end up being different. Also, we can see that the Q-values are somewhat noisy and unstable (the relative value of hitting over standing on a hand of 14 is 0.027, yet the same value for a hand of 15 is 0.99?)...

Sunday, August 15, 2010

PyBrain: Reinforcement Learning, a Tutorial

3 – The PyBrain API:

To download/install PyBrain, see: http://www.pybrain.org/

Pybrain's Reinforcement Learning API can be visually represented like this:
[Diagram: the Experiment coordinates the Task/Environment on one side and the Agent (which contains the learning module and explorer) on the other.]

Experiment:

This class essentially coordinates the interaction between the Environment/Task and the Agent. It decides what an Interaction, or Episode (in the case of the EpisodicExperiment), entails. Usually, that means obtaining the observations from the environment, feeding them into the agent, obtaining the agent's corresponding action, feeding that action back into the environment, and receiving the reward from the task, which is then fed to the agent for the learning process.

The classes that can be used to implement an Experiment are listed below (a short usage sketch follows the list):

  • Experiment: the basic experiment class; it implements the concept of an interaction via the doInteractions method.
  • EpisodicExperiment: an extension of the Experiment class that implements episodic tasks (an episode being a series of interactions whose termination may depend on a condition evaluated against values from the environment, etc.) via the doEpisodes() method.
  • ContinuousExperiment: an extension of the Experiment class that handles continuous tasks, i.e. tasks where no "reset" is involved and learning occurs after each interaction.
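
As a quick orientation, here is a hedged sketch of how these three classes are typically driven. The task and agent variables are placeholders for objects built as in the Blackjack example above (EpisodicExperiment additionally expects an episodic task that knows when an episode is finished), and the ContinuousExperiment method name is how I remember the API, so check it against your PyBrain version:
-----------------------------------------------------------------------------------------------------------
from pybrain.rl.experiments import Experiment, EpisodicExperiment, ContinuousExperiment

# `task` and `agent` are placeholders: any Task and LearningAgent wired up as in the
# Blackjack example above will do.

# basic Experiment: you decide when interactions happen and when the agent learns
experiment = Experiment(task, agent)
experiment.doInteractions(100)

# EpisodicExperiment: expects an episodic task; the task signals when an episode ends
episodic = EpisodicExperiment(task, agent)
episodic.doEpisodes(10)

# ContinuousExperiment: no reset involved; learning is interleaved with the interactions
# (method name as I recall it; verify against your PyBrain version)
continuous = ContinuousExperiment(task, agent)
continuous.doInteractionsAndLearn(100)
-----------------------------------------------------------------------------------------------------------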

Task:

This class implements the nature of the task, and focuses on the reward aspect. It decides what is a successful action, and what is a less successful one. You will have to implement your own derivation of this class for your application.

Environment:

This class manages the inputs (observations) and outputs (actions performed) that go to and come from the agent. You will have to implement your own derivation of this class for your application.

Agent:

The agent is the "intelligent" component, the module that does the actual interacting and learning.

The classes that can be used to implement an agent are:

  • Agent: the basic agent class.
  • LoggingAgent: this extension stores a history of interactions, and makes sure integrateObservation, getAction and getReward are called in that order.
  • LearningAgent: this extension includes a learning module and does actual learning. This is generally the class you want to instantiate (not sure in what scenarios you would limit yourself to the 2 classes mentioned above – unless you're going to implement your own Agent class for whatever reason).

The learning algorithm in the diagram depends on which learner you pass when you instantiate your Agent, and can be one of the following (a short sketch of how these pieces fit together appears after the lists):

  • Q: Q learning. Described in section 2(a)
  • NFQ: Neuro-fitted Q learning. Q learning that approximates the behaviour to be learned using a neural network rather than the Q-value table used in traditional Q learning. This method is much slower (the learning occurs offline, on a set of samples of transitions between state t and state t + 1), but it can approximate non-linear functions, which is something traditional Q learning cannot do.
  • QLambda: Q-lambda learning. Described in section 2(b).
  • SARSA: SARSA (State-Action-Reward-State-Action) learning. Described in section 2(b)
The corresponding learning data structure can be:
  • ActionValueTable: for discrete actions.
  • ActionValueNetwork: for continuous actions.
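
To tie these names back to the code in the Blackjack example above, the pieces fit together roughly like this. This is a sketch, not a recipe: the constructor arguments are illustrative, and the SARSA/QLambda/NFQ import paths simply mirror the one used for Q, so verify them against your PyBrain version:
-----------------------------------------------------------------------------------------------------------
from pybrain.rl.learners.valuebased import ActionValueTable, ActionValueNetwork
from pybrain.rl.learners import Q, SARSA, QLambda, NFQ
from pybrain.rl.agents import LearningAgent

# discrete case: a Q-value table (21 states x 2 actions, as in the Blackjack example)
table = ActionValueTable(21, 2)
table.initialize(0.)
agent = LearningAgent(table, Q(0.5, 0.0))       # or SARSA(0.5, 0.0), or QLambda(0.5, 0.0, 0.9)

# continuous / large state spaces: a network approximator, typically paired with NFQ
network = ActionValueNetwork(3, 2)              # 3 state variables, 2 actions (illustrative numbers)
agent = LearningAgent(network, NFQ())
-----------------------------------------------------------------------------------------------------------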

I suggest referring to the documentation that comes with PyBrain for technical details on the API. This section is more of an introductory "How To" than a technical reference.

In order to implement your own specific reinforcement learning scenario, the steps to follow generally are:

1 – implement your own derived class of Task

This only entails implementing your own getReward(self) function.

2 – implement your own derived class of Environment

Methods to define:

getSensors(self):

This method is what determines and returns the observations (i.e. State) that the agent will obtain from the environment during an interaction. This is usually an array of doubles.

performAction(self, action):

This method performs on the environment the action that comes from the agent through the parameter "action". This is also, by default, an array of doubles.

reset(self):

This method can be implemented if you want to be able to re-initialize the state of the environment after a certain number of interactions.
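
Putting steps 1 and 2 together, a bare skeleton looks like the following. The class names are hypothetical and the bodies are stubs; the Blackjack post above shows a complete, working version of the same structure:
-----------------------------------------------------------------------------------------------------------
from pybrain.rl.environments.environment import Environment
from pybrain.rl.environments.task import Task

class MyEnvironment(Environment):
    # number of action values the environment accepts / sensor values it produces
    indim = 1
    outdim = 1

    def getSensors(self):
        # return the current observation (state), usually an array of doubles
        return [0.0]

    def performAction(self, action):
        # apply the agent's action to the environment's internal state
        pass

    def reset(self):
        # optional: re-initialize the environment between runs
        pass

class MyTask(Task):
    def getReward(self):
        # compute the reward corresponding to the last action performed
        return 0.0
-----------------------------------------------------------------------------------------------------------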


3 – Code your main function that brings it all together... see examples in the next section.


Saturday, August 7, 2010

PyBrain: Reinforcement Learning, a Tutorial

2 – Q Learning and its variations (continued):

SARSA versus Q Learning, and exploratory policies:


In SARSA learning, instead of using the maximum Q-value at time t, we use the Q-value of the action that was chosen by the agent at time t. So the update formula will be instead:


Q_{t-1} := Q_{t-1} + alpha * [r + (gamma * Q_t) - Q_{t-1}]

The fundamental difference between Q learning and SARSA learning, therefore, is that SARSA is "policy-dependent", while Q is "policy-independent". "Policy" is the decision process used to select an action, given a certain state.

So far we've discussed how the agent learns, or updates its value table, but we haven't discussed how it selects the action it is going to perform. This digression is necessary to understand the difference between the two learning algorithms presented so far.

One might initially think that the agent would simply choose the action having the highest Q value given the input state. However, the problem with that policy is that it is vulnerable to what we call "local maxima". Local maxima are local optimizations in the search space of all solutions, such that all gradual movement towards one direction or another would lead to a decrease in Q value and yet, farther away in the search space, there exists a more optimal solution (which we will never reach if we refuse to temporarily decrease our current solution value ever so slightly).

To solve this problem, reinforcement learning agents usually include in their decision policy a factor generally referred to as epsilon, which determines to what extent actions are randomized. Occasionally randomizing an action (i.e. choosing an action that is actually NOT optimal!) can bring the agent to states that offer higher Q-values, even though the immediate step to get there was less beneficial.
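
To make the idea concrete, here is a small illustrative sketch of epsilon-greedy action selection (my own minimal version, not PyBrain's EpsilonGreedyExplorer, although that class implements the same idea):
-----------------------------------------------------------------------------------------------------------
import random

def epsilon_greedy(q_values, epsilon):
    """Pick the greedy action with probability 1 - epsilon,
    otherwise pick a uniformly random action (exploration)."""
    if random.random() < epsilon:
        return random.randrange(len(q_values))                        # explore
    return max(range(len(q_values)), key=lambda a: q_values[a])       # exploit

# e.g. epsilon_greedy([0.2, -0.1, 0.7], 0.1) returns 2 most of the time
-----------------------------------------------------------------------------------------------------------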

Coming back to SARSA versus Q, then, we see that SARSA's learning function depends on the action actually taken, and therefore on the agent's decision policy (essentially, on the exploration factor). Q, on the other hand, does not care about the action that was taken next, but about the optimal action that could have been taken. This is the fundamental (and only) difference.

It should be noted that, given "sufficient" training, Q learning can be proven to converge to a close approximation of the function being learned. SARSA, on the other hand, does not necessarily converge.
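
To make the difference concrete, here is a minimal tabular sketch of the two update rules (illustrative only, not PyBrain's internals). Q is assumed to be a dict holding a value for every (state, action) pair, and alpha and gamma are the learning rate and discount factor:
-----------------------------------------------------------------------------------------------------------
def q_learning_update(Q, s_prev, a_prev, r, s_next, actions, alpha, gamma):
    """Q learning: bootstrap on the BEST action available in s_next (policy-independent)."""
    best_next = max(Q[(s_next, a)] for a in actions)
    Q[(s_prev, a_prev)] += alpha * (r + gamma * best_next - Q[(s_prev, a_prev)])

def sarsa_update(Q, s_prev, a_prev, r, s_next, a_next, alpha, gamma):
    """SARSA: bootstrap on the action the agent ACTUALLY chose in s_next (policy-dependent)."""
    Q[(s_prev, a_prev)] += alpha * (r + gamma * Q[(s_next, a_next)] - Q[(s_prev, a_prev)])
-----------------------------------------------------------------------------------------------------------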



Q-lambda learning:

Here I introduce the last variation of Q-learning. Q-lambda learning is a bit more elaborate, as it goes back in time to update the most recent interactions' associated Q-values. The formula goes like this:

  • For all T time steps t of this learning episode (0 being the first step, T - 1 being the most recent):

Q_{t-1} := Q_{t-1} + alpha * etr * [r + (gamma * max(Q_t)) - Q_{t-1}]

  • Where:
etr = lambda ^ ((T - 1) - t)

In plain English, Q-lambda learning re-applies the update to the state-action pairs of every interaction in the current episode (indexed by t from 0 to T - 1), attenuating each update by the factor etr, which decays exponentially the farther back in time the interaction occurred (and equals 1 for the most recent interaction).
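
The same idea as a minimal sketch (again illustrative, not PyBrain's QLambda code): Q is a dict keyed by (state, action), and episode is the chronologically ordered list of (s_prev, a_prev, r, s_next) transitions of the current episode:
-----------------------------------------------------------------------------------------------------------
def q_lambda_episode_update(Q, episode, actions, alpha, gamma, lam):
    """Re-apply the Q update to every interaction of the episode, attenuated by
    etr = lam ** ((T - 1) - t): 1 for the most recent interaction, smaller further back."""
    T = len(episode)
    for t, (s_prev, a_prev, r, s_next) in enumerate(episode):
        etr = lam ** ((T - 1) - t)
        best_next = max(Q[(s_next, a)] for a in actions)
        Q[(s_prev, a_prev)] += alpha * etr * (r + gamma * best_next - Q[(s_prev, a_prev)])
-----------------------------------------------------------------------------------------------------------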

(coming next: the actual PyBrain API!)