Excalidraw Data
Text Elements
Deep neural networks are popular because they are in many cases
the most accurate machine learning models for a given task.
Control tasks
Problems where decisions must be made or some behavior
must be enacted. The algorithm needs to learn how to act
not merely to classify or predict.
Added complexity: ELEMENT OF TIME
The data set on which the algorithm trains is not
necessarily FIXED but changes based on the decisions the algorithm makes.
Reinforcement Learning
a generic framework for representing and solving
control tasks (free to choose any algorithm within this framework)
In RL we don’t know exactly what the right thing to do is at every step => maximize reward
- Overall objective, constraints
- Environment
- State (a snapshot of the environment at t)
- objective function, reward, decisions, actions
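The pieces listed above (environment, state, action, reward) fit together as a loop. A minimal sketch, not any library's API: the CorridorEnv class, its reset/step methods, the policies, and the reward values are all hypothetical and chosen only for illustration.

```python
import random


class CorridorEnv:
    """Hypothetical toy environment: walk from cell 0 to the last cell."""

    def __init__(self, length=5):
        self.length = length
        self.position = 0

    def reset(self):
        self.position = 0
        return self.position  # state = snapshot of the environment at time t

    def step(self, action):
        # action is -1 (left) or +1 (right); position is clipped to the corridor
        self.position = max(0, min(self.length - 1, self.position + action))
        done = self.position == self.length - 1
        reward = 1.0 if done else -0.1  # assumed values: step penalty, goal bonus
        return self.position, reward, done


def run_episode(env, policy, max_steps=100):
    """Run one episode and return the total reward collected."""
    state = env.reset()
    total_reward = 0.0
    for _ in range(max_steps):
        action = policy(state)                  # the agent decides
        state, reward, done = env.step(action)  # the environment responds
        total_reward += reward
        if done:
            break
    return total_reward


def random_policy(state):
    return random.choice([-1, 1])


def always_right(state):
    return 1
```

Calling run_episode(CorridorEnv(5), always_right) reaches the goal in four steps for a total reward of about 0.7; random_policy usually does worse. The objective is the total reward, not a per-step label of the "right" action.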
Spectrum: Dynamic Programming vs Monte Carlo
DP: 1957 aka GOAL DECOMPOSITION =>
decompose into smaller problems
Difficult to realize this decomposition in the real world
Monte Carlo = Random trial & error
throwing stones in each direction
Often mixed strategy is used
- divide into subproblems and then solve each by throwing stones in each direction
Richard Bellman
core set of terms
and concepts that
every RL problem can be phrased in
Forces us to formulate our problem
in a way that is amenable to dynamic
programming-like problem decomposition,
such that we can iteratively optimize over
local sub-problems and make progress towards
the global high-level objective
RL is a field of its own, separate from the concerns of any particular learning algorithm
Generalization: One algorithm, multiple games (DeepMind’s deep Q-network (DQN))
without feature engineering, explicitly programming game rules
Simulation first => Real world second
An RL algorithm has the advantage of being able to adapt to changing conditions in real time (continuously learning)
Tradeoff between exploration vs exploitation
Quantopian
compresses game states into
a finite set of parameters
The secret sauce of RL: Deep Learning
The Agent is the focus of any RL problem: processes input & determines action
When the agent is implemented as a deep learning network,
each iteration evaluates a loss function based on reward signal
and backpropagates to improve the performance of the agent.
String diagrams
DNN = layers that do:
Matrix x vector -> nonlinear activation function
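A rough sketch of the "matrix x vector -> nonlinear activation" idea in plain Python. The function names dense_layer, relu and network are made up for this note, and the bias term is an assumption that the note itself does not mention.

```python
import math


def relu(z):
    """A common nonlinear activation: negative inputs become 0."""
    return max(0.0, z)


def dense_layer(weights, bias, x, activation=math.tanh):
    """One layer: matrix-vector product, plus bias, then elementwise nonlinearity.

    weights is a list of rows; each row has the same length as x.
    The bias is an assumption added here; set it to zeros to match the note.
    """
    outputs = []
    for row, b in zip(weights, bias):
        pre_activation = sum(w * xi for w, xi in zip(row, x)) + b
        outputs.append(activation(pre_activation))
    return outputs


def network(x, layers):
    """A DNN as a chain of layers; layers is a list of (weights, bias, activation)."""
    for weights, bias, activation in layers:
        x = dense_layer(weights, bias, x, activation)
    return x


print(dense_layer([[1, -1], [2, 0]], [0, 0], [1.0, 3.0], relu))  # → [0.0, 2.0]
```

In practice a library such as PyTorch would do the same with tensors; the loop form is only there to make each step visible.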
Adapted from Category theory where they tend to use
a lot of diagrams to supplement or replace traditional symbolic math notation
Slot Machine = One-Armed bandit: it has one arm (lever) and steals your money
n-armed bandit problem, where n = number of slot machines
The proper balance of exploitation and exploration will be important to maximizing rewards
The method of simply choosing the best lever that we know of so far is called a greedy (or exploitation) method.
no exploration
Epsilon Greedy Strategy
with probability epsilon we choose a random action,
with probability 1 - epsilon we choose the greedy action
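A minimal sketch of epsilon-greedy on an n-armed bandit, assuming each arm pays 1 with a fixed hidden probability (a stationary problem). The names epsilon_greedy, run_bandit and true_probs are hypothetical, and the running-mean update is one common way to track each arm's estimated reward.

```python
import random


def epsilon_greedy(estimates, epsilon, rng=random):
    """Return an arm index: random with probability epsilon, else the best-known arm."""
    if rng.random() < epsilon:
        return rng.randrange(len(estimates))                       # explore
    return max(range(len(estimates)), key=lambda a: estimates[a])  # exploit (greedy)


def run_bandit(true_probs, epsilon=0.1, steps=2000, seed=0):
    """Play the bandit and return (estimated mean reward per arm, pull counts)."""
    rng = random.Random(seed)
    n = len(true_probs)
    estimates = [0.0] * n  # running mean reward observed for each arm
    counts = [0] * n       # how many times each arm was pulled
    for _ in range(steps):
        arm = epsilon_greedy(estimates, epsilon, rng)
        reward = 1.0 if rng.random() < true_probs[arm] else 0.0
        counts[arm] += 1
        # incremental mean: new_mean = old_mean + (reward - old_mean) / count
        estimates[arm] += (reward - estimates[arm]) / counts[arm]
    return estimates, counts
```

Setting epsilon=0 gives the purely greedy method (no exploration); setting epsilon=1 gives pure random trial and error.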
Stationary Problem = underlying reward probability distribution for the arms does not change
over time
Softmax selection policy
Instead of just choosing an action at random during
exploration, softmax gives us a probability distribution
across our options.
AVOIDS CHOOSING THE WORST OPTION
- converts actions into probabilities using softmax equation
- randomly selects from these probabilities meaning
best action will get chosen more often because it will have
the highest softmax probability, but other actions will be chosen
at lower frequencies
- tau is a temperature that scales the probability distribution of actions
- A high tau will cause the probabilities to be very similar, a low tau will exaggerate
differences
- converges faster than random epsilon greedy for n-armed bandits, but you
need to select tau value
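The softmax policy described above can be sketched as follows. The probability of action a is exp(Q(a)/tau) divided by the sum of exp(Q(i)/tau) over all actions i, where Q holds the current reward estimates. The function names are hypothetical, and subtracting the maximum before exponentiating is a standard numerical-stability step that does not change the result.

```python
import math
import random


def softmax(values, tau):
    """Convert action-value estimates into probabilities; tau is the temperature."""
    scaled = [v / tau for v in values]
    m = max(scaled)  # subtract the max so exp() cannot overflow
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def softmax_select(values, tau, rng=random):
    """Sample one action index according to the softmax probabilities."""
    probs = softmax(values, tau)
    return rng.choices(range(len(values)), weights=probs, k=1)[0]


high = softmax([1.0, 2.0, 3.0], tau=100.0)  # nearly uniform
low = softmax([1.0, 2.0, 3.0], tau=0.1)     # almost all mass on the best action
```

With a high tau the three probabilities are nearly equal (close to random exploration); with a low tau the policy is close to greedy. The worst action still has the smallest probability in both cases.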
Contextual bandits
- Add State (state space) and we start to get a combinatorial explosion of possible
state-action-reward tuples
No state spaces, only action spaces: we only need to learn to associate each action with a reward
for most problems, the state space is intractably large
Deep Learning
If you’re comfortable with the numpy multidimensional array,
you can replace almost everything you do with numpy with PyTorch
One-Hot encoder: a one-hot encoded vector is a vector where all but one element are set to 0. The only
nonzero element is set to 1 and indicates a particular state in the state space.
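A small sketch of one-hot encoding for a discrete state space. The function name one_hot and its arguments are made up for this note; with numpy or PyTorch the same vector would normally be built as an array or tensor.

```python
def one_hot(state_index, num_states):
    """Return a vector of zeros with a single 1.0 at position state_index."""
    if not 0 <= state_index < num_states:
        raise ValueError("state_index must be in range(num_states)")
    vec = [0.0] * num_states
    vec[state_index] = 1.0
    return vec


print(one_hot(2, 5))  # → [0.0, 0.0, 1.0, 0.0, 0.0]
```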
The Markov Property
- A game / control task that exhibits the Markov Property is said to be a MARKOV DECISION PROCESS (MDP)
With an MDP, the current state alone contains enough information to choose optimal actions to maximize future rewards
- Modelling a control task as an MDP is a key concept in RL => it simplifies an RL problem DRAMATICALLY,
as we don’t need to take into account all previous states or actions
- WE DON’T NEED TO HAVE MEMORY, we just need to analyze the present situation.
- Hence, we ALWAYS ATTEMPT TO MODEL A PROBLEM AS (at least approximately) a Markov decision process.
e.g. Blackjack is an MDP, and so is choosing the shortest route (don’t need to know what happened yesterday),
The stock market is not an MDP (you need to know past performance)
Many problems may not naturally have the Markov property, but often we can induce it
by jamming more information into the state
DeepMind’s implementation feeds in the last 4 frames of gameplay,
effectively changing a non-MDP into an MDP (current state now has all the info it needs)
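A sketch of the "jam more information into the state" idea: keep the last few frames and treat the whole stack as the state. The FrameStack class is hypothetical and is not DeepMind's code; padding the stack with copies of the first frame at the start of an episode is an assumption made here.

```python
from collections import deque


class FrameStack:
    """Keep the most recent num_frames observations and expose them as one state."""

    def __init__(self, num_frames=4):
        self.frames = deque(maxlen=num_frames)  # old frames fall off automatically

    def reset(self, first_frame):
        # Assumption: pad with copies of the first frame at the start of an episode.
        self.frames.clear()
        for _ in range(self.frames.maxlen):
            self.frames.append(first_frame)
        return self.state()

    def push(self, frame):
        self.frames.append(frame)
        return self.state()

    def state(self):
        return tuple(self.frames)  # oldest frame first, newest last


stack = FrameStack(4)
stack.reset("f0")
print(stack.push("f1"))  # → ('f0', 'f0', 'f0', 'f1')
```

A single frame cannot show which way an object is moving; the stacked state can, so the agent no longer needs separate memory of earlier steps.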
Element Links
xNahfLZR: https://github.com/Farama-Foundation/Gymnasium
ZvFmwX5D: https://github.com/DeepReinforcementLearning/DeepReinforcementLearningInAction
K6zDyzCo: https://www.youtube.com/watch?v=2iF9PRriA7w
4A60ySgT: https://www.youtube.com/watch?v=BEFY7IHs0HM&t=486s
myKcn11l: https://www.youtube.com/watch?v=ytbYRIN0N4g&t=28s
BrVf6LpY: https://en.wikipedia.org/wiki/String_diagram