Excalidraw Data

Text Elements

Deep neural networks are popular because they are in many cases
the most accurate machine learning models for a given task.

Control tasks

Problems where decisions must be made or some behavior
must be enacted. The algorithm needs to learn how to act,
not merely to classify or predict.

Added complexity: ELEMENT OF TIME

The data set upon which the algorithm is training is not
necessarily FIXED but changes based on decisions the algorithm makes.

Reinforcement Learning

a generic framework for representing and solving
control tasks (free to choose any algorithm within this framework)

In RL we don’t know exactly what the right thing to do is at every step => maximize reward

Overall objective, constraints
Environment
State (a snapshot of the environment at t)
objective function, reward, decisions, actions

Spectrum: Dynamic Programming vs Monte Carlo

DP: 1957 aka GOAL DECOMPOSITION =>
decompose into smaller problems

Difficult to realize this decomposition in the real world

Monte Carlo = Random trial & error

throwing stones in each direction

Often mixed strategy is used

divide into subproblems and then solve each by throwing stones in each direction

Richard Bellman

core set of terms
and concepts that
every RL problem can be phrased in

Forces us to formulate our problem
in a way that is amenable to dynamic
programming-like problem decomposition,
such that we can iteratively optimize over
local sub-problems and make progress towards
the global high-level objective

RL is a field of its own, separate from the concerns of any particular learning algorithm

Generalization: One algorithm, multiple games (DeepMind’s deep Q-network (DQN))

without feature engineering or explicitly programming game rules

Simulation first => Real world second

An RL algorithm has the advantage of being able to adapt to changing conditions in real time (continuously learning)

Tradeoff between exploration and exploitation

Quantopian

compresses game states into
a finite set of parameters

The secret sauce of RL: Deep Learning

The Agent is the focus of any RL problem: processes input & determines action

When the agent is implemented as a deep learning network,
each iteration evaluates a loss function based on reward signal
and backpropagates to improve the performance of the agent.

String diagrams

Matrix x vector -> nonlinear activation function
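
A minimal sketch of one such layer in plain Python, assuming a ReLU activation and no bias term (both are choices made for this illustration; the notes only say "matrix x vector -> nonlinear activation"). The names `relu` and `layer` are hypothetical.

```python
def relu(x):
    """Nonlinear activation: pass positive values through, clamp the rest to 0."""
    return max(0.0, x)


def layer(weights, x, activation=relu):
    """One layer: multiply the weight matrix by the input vector,
    then apply the activation to every element of the result."""
    return [activation(sum(w * xi for w, xi in zip(row, x))) for row in weights]


# Stacking layers: the output vector of one layer is the input of the next.
hidden = layer([[1.0, -1.0], [2.0, 0.0]], [1.0, 2.0])
output = layer([[0.5, 0.5]], hidden)
```

In practice the same thing is usually written with numpy or PyTorch arrays; the list version is only meant to show the two steps explicitly.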

DNN = layers that do:

Adapted from Category theory where they tend to use

a lot of diagrams to supplement or replace traditional symbolic math notation

Slot Machine = One-Armed bandit: it has one arm (lever) and steals your money

n-armed bandit problem, where n = number of slot machines

The proper balance of exploitation and exploration will be important to maximizing rewards

The method of simply choosing the best lever that we know of so far is called a greedy (or exploitation) method.
no exploration

Epsilon Greedy Strategy

with probability epsilon we will choose a random action,
with probability 1 - epsilon we will choose the greedy action
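
The rule above can be sketched as follows. This is an illustration only: the function names, the Bernoulli (0 or 1) rewards, and the running-mean value estimate are assumptions, not something the notes specify.

```python
import random


def epsilon_greedy(values, epsilon, rng=random):
    """With probability epsilon pick a random arm, otherwise the best-known arm."""
    if rng.random() < epsilon:
        return rng.randrange(len(values))                     # explore
    return max(range(len(values)), key=lambda a: values[a])   # exploit (greedy)


def run_bandit(probs, epsilon=0.1, steps=2000, seed=0):
    """Play an n-armed bandit where arm i pays 1 with probability probs[i]."""
    rng = random.Random(seed)
    counts = [0] * len(probs)
    values = [0.0] * len(probs)   # estimated mean reward per arm
    for _ in range(steps):
        arm = epsilon_greedy(values, epsilon, rng)
        reward = 1.0 if rng.random() < probs[arm] else 0.0
        counts[arm] += 1
        # Running mean: nudge the estimate toward the observed reward.
        values[arm] += (reward - values[arm]) / counts[arm]
    return values, counts


values, counts = run_bandit([0.2, 0.5, 0.8])
```

Setting epsilon to 0 gives the pure greedy method (no exploration); setting it to 1 gives purely random play.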

Stationary Problem = underlying reward probability distribution for the arms does not change
over time

Softmax selection policy

Instead of just choosing an action at random during
exploration, softmax gives us a probability distribution
across our options.

AVOIDS CHOOSING THE WORST OPTION

converts actions into probabilities using softmax equation
randomly selects from these probabilities meaning
best action will get chosen more often because it will have
the highest softmax probability, but other actions will be chosen
at lower frequencies
tau is a temperature that scales probability distribution of actions.
- A high tau will cause probs to be very similar, a low tau will exaggerate
  differences
converges faster than random epsilon greedy for n-armed bandits, but you
need to select a tau value
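
A small sketch of softmax selection with a temperature, using only the standard library. The function names are hypothetical, and subtracting the maximum before exponentiating is a common numerical-stability step added here, not something from the notes.

```python
import math
import random


def softmax(values, tau=1.0):
    """Turn action-value estimates into probabilities; tau is the temperature."""
    scaled = [v / tau for v in values]
    m = max(scaled)                              # subtract max to avoid overflow
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def softmax_select(values, tau=1.0, rng=random):
    """Sample an action index in proportion to its softmax probability."""
    probs = softmax(values, tau)
    return rng.choices(range(len(values)), weights=probs, k=1)[0]


estimates = [1.0, 2.0, 3.0]
flat = softmax(estimates, tau=100.0)   # high tau: probabilities nearly equal
sharp = softmax(estimates, tau=0.1)    # low tau: best action dominates
```

Because every action keeps a nonzero probability, the worst option can still be picked, only much less often than the best one.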

Contextual bandits

Add State (state space) and we start to get a combinatorial explosion of possible
state-action-reward tuples

No state spaces, only action spaces: only need to learn to associate each action with its reward

for most problems, the state space is intractably large

Deep Learning

If you’re comfortable with the numpy multidimensional array,
you can replace almost everything you do with numpy with PyTorch

One-Hot encoder: a one-hot encoded vector is a vector where all but one element are set to 0. The only
nonzero element is set to 1 and indicates a particular state in the state space.
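
As a quick illustration, a one-hot encoder can be written in a few lines. The helper name `one_hot` and the choice of a plain list of floats are assumptions for this sketch.

```python
def one_hot(index, size):
    """Return a vector of length `size` that is 0 everywhere except at `index`."""
    if not 0 <= index < size:
        raise ValueError("index must be in range(size)")
    vec = [0.0] * size
    vec[index] = 1.0
    return vec


# State number 2 out of 5 possible states.
state_vec = one_hot(2, 5)
```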

The Markov Property

A game / control task that exhibits the Markov Property is said to be a MARKOV DECISION PROCESS (MDP)
With MDP, the current state alone contains enough information to choose optimal actions to maximize future rewards
Modelling a control task as an MDP is a key concept in RL => it simplifies an RL problem DRAMATICALLY
as we don’t need to take into account all previous states or actions - WE DON’T NEED TO HAVE MEMORY,
we just need to analyze the present situation.
Hence, we ALWAYS ATTEMPT TO MODEL A PROBLEM AS (at least approximately) a Markov decision process.
e.g. Blackjack is an MDP, as is choosing the shortest route (don’t need to know what happened yesterday);
the stock market is not an MDP (you need to know past performance)

Many problems may not naturally have the Markov property, but often we can induce it
by jamming more information into the state

DeepMind’s implementation feeds in the last 4 frames of gameplay,
effectively changing a non-MDP into an MDP (current state now has all the info it needs)
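
The idea of packing the last 4 frames into the state can be sketched with a fixed-length buffer. This is not DeepMind's code: the class name `FrameStack` is hypothetical, and filling the buffer with copies of the first frame at the start of an episode is an assumption made for this example.

```python
from collections import deque


class FrameStack:
    """Keep the most recent frames so the current state carries short-term history."""

    def __init__(self, num_frames=4):
        self.frames = deque(maxlen=num_frames)   # oldest frame drops out automatically

    def reset(self, first_frame):
        """Start an episode: fill the buffer with copies of the first frame."""
        self.frames.clear()
        for _ in range(self.frames.maxlen):
            self.frames.append(first_frame)
        return self.state()

    def push(self, frame):
        """Add the newest frame and return the updated stacked state."""
        self.frames.append(frame)
        return self.state()

    def state(self):
        return tuple(self.frames)


stack = FrameStack(num_frames=4)
s0 = stack.reset("f0")
s1 = stack.push("f1")
```

Strings stand in for image frames here; with real frames each element would be an array of pixels, and the stacked tuple is what the agent receives as its state.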