RL Course by David Silver Notes
Lecture 1: Introduction to Reinforcement Learning
-
Planning: the rules of the game are given, the agent has a perfect model of the environment in its head, and it plans ahead to find the optimal policy (e.g. look-ahead search or tree search).
-
In RL the environment is initially unknown; in planning the environment is known.
-
Types of RL agents:
-
Policy based
-
Value function based
-
Actor-critic (combines policy and value function)
-
-
Agent’s model is a representation of the environment in the agent’s head.
-
The agent is our “brain”: the algorithm we come up with.
-
The goal is to maximize expected cumulative reward; since it is an expectation, the risk from randomness is already accounted for.
-
Value function: measures the goodness/badness of states, so it can be used to select between actions.
-
Full observability: the agent directly observes the environment state.
-
Data are not i.i.d. (observations are sequential and correlated).
-
Reward is delayed.
-
There is only a reward signal, no supervision.
Reinforcement learning is the science of decision making.
Lecture 2: Markov Decision Process
-
Partial ordering over policies:
\[\pi' \geq \pi \Longleftarrow v_{\pi'}(s) \geq v_\pi(s), \forall s\] -
Action-Value function Q is the same as value function, but also takes action as input:
\[Q_\pi (s, a) = \mathbb{E}_\pi [G_t | S_t = s, A_t = a] = \mathbb{E}_\pi [R_{t+1} + \gamma * Q_\pi (S_{t+1}, A_{t+1}) | S_t = s, A_t = a]\] -
Policy is a distribution over actions given states: \(\pi (a | s) = P [ A_t = a | S_t = s]\)
Policies are stationary (time-independent): \(A_t \sim \pi (\cdot | S_t), \forall t > 0\)
-
The returns are random samples; the value function is an expectation over these random samples.
-
Value function: \(v_\pi (s) = \mathbb{E}_\pi [G_t | S_t = s] = \mathbb{E}_\pi [R_{t+1} + \gamma * v_\pi (S_{t+1}) | S_t = s]\)
It can also be expressed in matrix form: \(v_\pi = R^\pi + \gamma * P^\pi * v_\pi\), which can be solved directly as \(v_\pi = (I - \gamma P^\pi)^{-1} R^\pi\) (a small numerical sketch follows).
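A minimal numpy sketch of solving the matrix form directly; the 2-state transition matrix and rewards are made-up examples, not from the lecture:

```python
# Solving the Bellman expectation equation in matrix form:
#   v = R + gamma * P * v   =>   v = (I - gamma * P)^{-1} R
# The 2-state transition matrix and rewards below are made-up examples.
import numpy as np

gamma = 0.9
P = np.array([[0.8, 0.2],   # P[s, s'] under the (fixed) policy
              [0.3, 0.7]])
R = np.array([1.0, -1.0])   # expected immediate reward in each state

v = np.linalg.solve(np.eye(2) - gamma * P, R)
print(v)                    # value of each state under the policy
```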
-
Total discounted reward from time-step \(t\): \(G_t = R_{t+1} + \gamma * R_{t+2} + \gamma^2 * R_{t+3} + \ldots + \gamma^{T-t-1} * R_T\) (a tiny numerical sketch follows).
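A tiny sketch of computing the discounted return for one sampled reward sequence; the rewards are made up:

```python
# Discounted return G_t = R_{t+1} + gamma*R_{t+2} + ... for one sampled episode.
gamma = 0.9
rewards = [1.0, 0.0, 0.0, 5.0]                    # R_{t+1}, R_{t+2}, ... (made-up sample)
G = sum(gamma**k * r for k, r in enumerate(rewards))
print(G)                                          # 1.0 + 0.9*0 + 0.81*0 + 0.729*5 = 4.645
```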
-
Why have a discount factor?
-
Uncertainty in the future.
-
Mathematically convenient.
-
Avoids infinite returns in cyclic Markov processes.
-
Immediate rewards are worth more than delayed rewards.
-
-
Reward function: \(R_s^a = \mathbb{E} [R_{t+1} | S_t = s, A_t = a]\)
-
Sample episodes from an MDP: each episode is a different random trajectory through the states.
-
State transition probability: \(P_{ss'}^a = P [S_{t+1} = s' | S_t = s, A_t = a]\)
-
Showing an ad on a website is actually a bandit problem.
-
Almost all RL problems can be modeled using MDPs.
Lecture 3: Planning by Dynamic Programming
-
Value Iteration:
-
Goal: find optimal policy \(\pi\)
-
Start with an arbitrary value function \(v_1\), repeatedly apply the Bellman optimality backup to get \(v_2, v_3, \ldots\), converging to \(v^*\).
-
Unlike policy iteration, there is no explicit policy.
-
Intermediate value functions may not correspond to any policy.
-
Will converge.
-
Intuition: start with the final reward and work backwards (see the sketch after this list).
-
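A minimal value iteration sketch, assuming a small made-up MDP given as transition and reward arrays (all numbers are hypothetical):

```python
# Value iteration: repeatedly apply the Bellman optimality backup
#   v_{k+1}(s) = max_a [ R(s,a) + gamma * sum_s' P(s'|s,a) v_k(s') ]
# on a tiny made-up MDP with 3 states and 2 actions.
import numpy as np

n_states, n_actions, gamma = 3, 2, 0.9
# P[a, s, s'] and R[s, a] are hypothetical example numbers.
P = np.array([[[0.9, 0.1, 0.0], [0.0, 0.9, 0.1], [0.0, 0.0, 1.0]],
              [[0.0, 1.0, 0.0], [0.0, 0.0, 1.0], [0.0, 0.0, 1.0]]])
R = np.array([[0.0, 0.0], [0.0, 1.0], [0.0, 0.0]])

v = np.zeros(n_states)
for _ in range(1000):
    q = R + gamma * np.einsum('ast,t->sa', P, v)   # q[s, a]
    v_new = q.max(axis=1)                          # Bellman optimality backup
    converged = np.max(np.abs(v_new - v)) < 1e-8
    v = v_new
    if converged:
        break
pi = q.argmax(axis=1)                              # greedy policy from final q
print(v, pi)
```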
-
Policy Iteration:
-
Given a policy \(\pi\) (any policy works):
-
Evaluate the policy:
\[V_\pi (s) = \mathbb{E} [ R_{t+1} + \gamma * R_{t+2} + ... | S_t = s ]\] -
Improve the policy: \(\pi' = \text{greedy}(V_\pi)\)
-
Will converge.
-
-
Policy Evaluation:
-
Start with an arbitrary value function \(v_1\), repeatedly apply the Bellman expectation backup to get \(v_2, v_3, \ldots\), converging to \(v_\pi\).
-
Backups can be applied synchronously or asynchronously.
-
Will converge.
- \[V^{k+1} = R^\pi + \gamma * P^\pi * V^k\]
-
-
Input: MDP \((S, A, P, R, \gamma)\). Output: optimal value function \(V^*\) and optimal policy \(\pi^*\).
-
In planning, full knowledge of the MDP is given. We want to solve the MDP, i.e. find the optimal policy.
-
MDPs satisfy both DP properties.
-
Dynamic programming applies to problems that have:
-
Optimal substructure.
-
Overlapping subproblems.
-
-
Dynamic Programming: a method for solving complex problems by breaking them down into subproblems, solving the subproblems, and combining their solutions.
-
Dynamic: sequential component to the problem.
-
Programming: optimizing a “program”, i.e. a mapping such as a policy.
-
Lecture 4: Model-Free Prediction
-
Bootstrapping: the update involves an estimate of the return (e.g. the current value estimate of the successor state) rather than the complete actual return.
-
Monte-Carlo Learning:
-
In a nutshell, run all the way through a trajectory and estimate the value as the mean of the sampled returns (see the sketch after this list).
-
Useful only for episodic (terminating) settings; updates are applied once an episode is complete. No bootstrapping.
-
High variance, zero bias.
-
Does not exploit the Markov property, so in partially observed environments MC is usually the better choice.
-
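A minimal first-visit Monte-Carlo prediction sketch; `sample_episode` is a hypothetical stand-in for rolling out your own environment and policy:

```python
# First-visit Monte-Carlo prediction: V(s) is the running mean of the returns
# observed from the first visit of s in each episode.
from collections import defaultdict

def mc_prediction(sample_episode, n_episodes=1000, gamma=0.9):
    V = defaultdict(float)
    counts = defaultdict(int)
    for _ in range(n_episodes):
        episode = sample_episode()               # list of (state, reward) pairs
        states = [s for s, _ in episode]
        G = 0.0
        returns = [0.0] * len(episode)
        for t in reversed(range(len(episode))):
            G = episode[t][1] + gamma * G        # return following time t
            returns[t] = G
        for t, (s, _) in enumerate(episode):
            if s not in states[:t]:              # first visit of s only
                counts[s] += 1
                V[s] += (returns[t] - V[s]) / counts[s]   # incremental mean
    return V
```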
-
Temporal-Difference Learning:
-
Looks one step ahead, then estimates the remaining return using the current value function.
-
Uses bootstrapping, so it can learn from incomplete episodes and learn online (see the TD(0) sketch after this list).
-
Low variance, some bias.
-
Exploits the Markov property: it implicitly builds the MDP structure and solves for that MDP (refer to the AB example). More efficient in Markov environments.
-
\(TD(\lambda)\): averages over all n-step returns, weighting the n-step return by \((1 - \lambda) \lambda^{n-1}\); the update target is the \(\lambda\)-return \(G_t^\lambda\).
-
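A minimal TD(0) prediction sketch; the `env` object with `reset`/`step` is an assumed gym-style interface, not something from the lecture:

```python
# TD(0) prediction: after every step, move V(S) toward the bootstrapped target
#   R + gamma * V(S')   (learning online, from incomplete episodes).
from collections import defaultdict

def td0_prediction(env, policy, n_episodes=1000, alpha=0.1, gamma=0.9):
    V = defaultdict(float)
    for _ in range(n_episodes):
        s = env.reset()                          # hypothetical gym-style interface
        done = False
        while not done:
            a = policy(s)
            s_next, r, done = env.step(a)
            target = r + (0.0 if done else gamma * V[s_next])
            V[s] += alpha * (target - V[s])      # TD(0) update
            s = s_next
    return V
```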
Lecture 5: Model Free Control
-
On-policy Learning:
-
Learn about policy \(\pi\) from experience sampled from \(\pi\).
-
\(\epsilon\)-greedy Exploration:
-
With probability \(1 - \epsilon\) choose the greedy action; with probability \(\epsilon\) choose a random action (see the sketch after this list).
-
Guarantees policy improvement: for any \(\epsilon\)-greedy policy \(\pi\), the \(\epsilon\)-greedy policy \(\pi'\) with respect to \(q_\pi\) satisfies \(v_{\pi'}(s) \geq v_\pi(s)\).
-
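A minimal \(\epsilon\)-greedy action-selection sketch; it assumes `Q` is a mapping from `(state, action)` to value (e.g. a `defaultdict(float)`):

```python
import random

def epsilon_greedy(Q, state, n_actions, epsilon=0.1):
    """Pick a random action with probability epsilon, else the greedy action.

    Q: mapping (state, action) -> value, e.g. a defaultdict(float)."""
    if random.random() < epsilon:
        return random.randrange(n_actions)
    return max(range(n_actions), key=lambda a: Q[(state, a)])
```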
-
Greedy in the Limit with Infinite Exploration (GLIE):
- All state-action pairs are explored infinitely many times.
-
Policy converges to the greedy policy.
-
SARSA:
- \[Q(S, A) \leftarrow Q(S, A) + \alpha * (R + \gamma * Q(S', A') - Q(S, A))\]
- Converges to the optimal action-value function under some conditions (GLIE policies and Robbins-Monro step sizes); a sketch follows this list.
-
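A minimal on-policy SARSA sketch; the gym-style `env` interface and hyperparameters are assumptions:

```python
# SARSA (on-policy TD control): update Q(S,A) toward R + gamma * Q(S',A'),
# where A' is the action actually taken next by the epsilon-greedy policy.
import random
from collections import defaultdict

def sarsa(env, n_actions, n_episodes=1000, alpha=0.1, gamma=0.9, epsilon=0.1):
    Q = defaultdict(float)                       # Q[(state, action)] -> value

    def act(s):
        if random.random() < epsilon:
            return random.randrange(n_actions)
        return max(range(n_actions), key=lambda a: Q[(s, a)])

    for _ in range(n_episodes):
        s = env.reset()                          # hypothetical gym-style interface
        a = act(s)
        done = False
        while not done:
            s2, r, done = env.step(a)
            a2 = act(s2)
            target = r + (0.0 if done else gamma * Q[(s2, a2)])
            Q[(s, a)] += alpha * (target - Q[(s, a)])
            s, a = s2, a2
    return Q
```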
-
Off-policy Learning:
-
Learn about policy \(\pi\) from experience sampled from \(\mu\).
-
Evaluate target policy \(\pi (a | s)\) to compute \(v_\pi (s)\) or \(q_\pi (s, a)\) while following the behavior policy \(\mu (a | s)\).
-
Can learn from observing human behaviors or other agents.
-
Learn optimal policy while following exploratory policy.
-
Monte-Carlo learning doesn’t work well off-policy because of extremely high variance: the importance-sampling corrections multiply over many steps, and the target and behavior policies rarely match closely enough to be useful.
-
TD learning works much better off-policy, because the policies only need to be similar over a single step.
-
Q-Learning:
-
\(S'\) is generated by following the behavior policy, but the update bootstraps from the best (greedy) action in \(S'\) (see the sketch after this list).
- \[Q(S, A) \leftarrow Q(S, A) + \alpha * (R + \gamma * \max_{a'} Q(S', a') - Q(S, A))\]
- Converges to the optimal action-value function.
-
-
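A minimal Q-learning sketch: behave \(\epsilon\)-greedily, but bootstrap from the greedy action (same assumed `env` interface as above):

```python
# Q-learning (off-policy TD control): behave epsilon-greedily, but update
# Q(S,A) toward R + gamma * max_a' Q(S', a').
import random
from collections import defaultdict

def q_learning(env, n_actions, n_episodes=1000, alpha=0.1, gamma=0.9, epsilon=0.1):
    Q = defaultdict(float)                       # Q[(state, action)] -> value
    for _ in range(n_episodes):
        s = env.reset()                          # hypothetical gym-style interface
        done = False
        while not done:
            if random.random() < epsilon:        # behavior policy: epsilon-greedy
                a = random.randrange(n_actions)
            else:
                a = max(range(n_actions), key=lambda b: Q[(s, b)])
            s2, r, done = env.step(a)
            best_next = max(Q[(s2, b)] for b in range(n_actions))
            target = r + (0.0 if done else gamma * best_next)
            Q[(s, a)] += alpha * (target - Q[(s, a)])
            s = s2
    return Q
```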
Lecture 6: Value Function Approximation
-
Large-scale RL problems have huge state spaces, so tabular methods become infeasible.
-
So far, we had \(V(s)\) or \(Q(s, a)\). But with large scale problems, calculating them takes too much memory and time. Solution: estimate the value function with function approximation:
\(\hat{v}(s, w) \approx v_\pi(s)\) or \(\hat{q}(s, a, w) \approx q_\pi(s, a)\)
Generalize from seen states to unseen ones. Update parameter \(w\) using MC or TD learning.
-
Incremental Methods:
-
Table lookup is a special case of linear value function approximation.
-
The target for MC is the return \(G_t\); for TD(\(\lambda\)) it is the \(\lambda\)-return \(G_t^\lambda\) (for TD(0), the one-step target \(R_{t+1} + \gamma \hat{v}(S_{t+1}, w)\)). A linear-approximation sketch follows this list.
-
Gradient descent is simple and appealing.
-
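A minimal sketch of linear value function approximation with a semi-gradient TD(0) update; the polynomial feature function is a made-up example:

```python
# Linear value function approximation: v_hat(s, w) = w . x(s).
# Semi-gradient TD(0) update: w += alpha * (target - v_hat(s, w)) * x(s).
import numpy as np

def features(s, n_features=4):
    """Made-up example feature vector for a scalar state s."""
    return np.array([1.0, s, s**2, s**3])[:n_features]

def td0_update(w, s, r, s_next, done, alpha=0.01, gamma=0.9):
    x = features(s)
    v = w @ x
    v_next = 0.0 if done else w @ features(s_next)
    target = r + gamma * v_next                  # bootstrapped target
    return w + alpha * (target - v) * x          # gradient of v_hat w.r.t. w is x(s)

w = np.zeros(4)
w = td0_update(w, s=0.5, r=1.0, s_next=0.7, done=False)
print(w)
```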
-
Batch Methods:
-
Plain gradient descent is not sample efficient: each experience is used once and then discarded.
-
Least squares algorithms find parameter vector \(w\) minimizing sum squared error.
-
Example: experience replay in DQN.
-
Lecture 7: Policy Gradient Methods
-
Policy gradient methods optimize the policy directly, instead of the value function.
-
Parametrize the policy: \(\pi_\theta (s, a) = P [a | s, \theta]\)
-
The point of using policy gradient methods is to be able to scale to large problems.
-
Nash equilibrium is the game-theoretic notion of optimality.
-
Finite Difference Policy Gradient:
-
For each dimension \(k\) of the parameters \(\theta\), perturb \(\theta\) by a small amount \(\epsilon\) in that dimension and estimate the partial derivative: \(\frac{\partial J(\theta)}{\partial \theta_k} \approx \frac{J(\theta + \epsilon u_k) - J(\theta)}{\epsilon}\), where \(u_k\) is the unit vector in dimension \(k\) (a numerical sketch follows this list).
-
Uses \(n\) evaluations to compute the policy gradient in \(n\) dimensions.
-
Simple, noisy, inefficient.
-
Works for arbitrary policies.
-
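A minimal numerical sketch of the finite-difference estimate; `evaluate_policy` (i.e. \(J(\theta)\)) is a hypothetical placeholder for however you score a policy:

```python
# Finite-difference policy gradient: perturb each parameter dimension by eps
# and estimate dJ/d(theta_k) numerically; needs n evaluations of J for n dims.
import numpy as np

def finite_difference_gradient(evaluate_policy, theta, eps=1e-2):
    grad = np.zeros_like(theta)
    base = evaluate_policy(theta)                # J(theta), assumed given
    for k in range(len(theta)):
        perturbed = theta.copy()
        perturbed[k] += eps
        grad[k] = (evaluate_policy(perturbed) - base) / eps
    return grad

# Example usage with a made-up quadratic "objective":
theta = np.array([1.0, -2.0])
print(finite_difference_gradient(lambda t: -np.sum(t**2), theta))
```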
-
Monte-Carlo Policy Gradient:
-
The score function is \(\nabla_\theta \log \pi_\theta (s, a)\)
-
In continuous action spaces, a Gaussian policy is a natural choice.
-
REINFORCE: update parameters by stochastic gradient ascent, using the sampled return \(v_t\) as an unbiased estimate of \(Q_{\pi_\theta}(s_t, a_t)\): \(\Delta \theta_t = \alpha \nabla_\theta \log \pi_\theta(s_t, a_t) v_t\) (see the sketch after this list).
-
Has high variance.
-
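A minimal REINFORCE sketch with a tabular softmax policy over integer-indexed states and discrete actions; the env interface and parametrization are assumptions:

```python
# REINFORCE (Monte-Carlo policy gradient) with a tabular softmax policy:
#   theta += alpha * grad(log pi(a|s)) * G_t, using the sampled return G_t.
import numpy as np

def reinforce(env, n_states, n_actions, n_episodes=1000, alpha=0.01, gamma=0.99):
    theta = np.zeros((n_states, n_actions))      # policy parameters (logits per state)

    def policy(s):
        prefs = theta[s] - theta[s].max()
        return np.exp(prefs) / np.exp(prefs).sum()

    for _ in range(n_episodes):
        s = env.reset()                          # hypothetical gym-style interface
        episode, done = [], False
        while not done:
            a = np.random.choice(n_actions, p=policy(s))
            s2, r, done = env.step(a)
            episode.append((s, a, r))
            s = s2
        G = 0.0
        for s, a, r in reversed(episode):        # returns computed backwards
            G = r + gamma * G
            grad_log = -policy(s)                # d log pi(a|s) / d theta[s, :]
            grad_log[a] += 1.0
            theta[s] += alpha * grad_log * G
    return theta
```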
-
Actor-Critic Policy Gradient:
-
Use a critic to estimate the action-value function.
-
Critic: updates action-value function parameters \(w\).
-
Actor: updates policy parameters \(\theta\) in direction suggested by critic.
-
Approximating the policy gradient introduces bias.
-
Use a baseline function \(B(s)\) to reduce variance; the state-value function \(V(s)\) is a good baseline.
-
Using the advantage function to reduce variance: \(A_\pi(s, a) = Q_\pi (s, a) - V_\pi (s)\)
-
Critic estimates the advantage function.
-
Lecture 8: Integrating Learning and Planning
-
So far, the course covered model-free RL.
-
Last lecture: learn policy directly from experience; Previous lectures: learn value function directly from experience; This lecture: learn model directly from experience.
-
Use planning to construct a value function or policy.
- Model-Based RL:
- Plan value function (and/or policy) from model.
- Advantages:
- Can learn model by supervised learning methods.
- Can reason about model uncertainty.
- Disadvantages:
- First learn a model, then construct a value function: two sources of approximation error.
- Model is a parametrized representation of the MDP, i.e. representing state transitions and rewards.
- Learning \(s, a \rightarrow r\) is a regression problem.
- Learning \(s, a \rightarrow s'\) is a density estimation problem.
- Planning in the MDP = solving the MDP, i.e. figuring out the best thing to do.
- Planning algorithms: value iteration, policy iteration, tree search, …
- Integrated Architectures:
- Put together the best parts of model-free and model-based RL.
- Two sources of experience: real: sampled from environment (true MDP); simulated: sampled from model (approximate MDP).
- Dyna:
- Learn a model from real experience.
- Learn and plan the value function from both real and simulated experience (a Dyna-Q sketch appears at the end of this lecture’s notes).
- Simulation-Based Search:
- Forward search: select best action by lookahead. Doesn’t explore the entire state space. Builds a search tree with current s as the root. Uses a model of MDP to look ahead. Doesn’t solve the whole MDP, just sub-MDP starting from now.
- Simulation-based search is a forward search paradigm using sample-based planning.
- Simulates episodes of experience from now with the model.
- Applies model-free RL to simulated episodes.
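A minimal Dyna-Q sketch (referenced from the Dyna bullets above): learn a deterministic tabular model from real experience and do extra planning updates from simulated experience. The gym-style `env` interface and hyperparameters are assumptions:

```python
# Dyna-Q: after each real step, (1) do a Q-learning update, (2) update a
# learned model, (3) do n_planning extra Q-learning updates from the model.
import random
from collections import defaultdict

def dyna_q(env, n_actions, n_episodes=200, n_planning=10,
           alpha=0.1, gamma=0.95, epsilon=0.1):
    Q = defaultdict(float)                       # Q[(state, action)] -> value
    model = {}                                   # model[(s, a)] = (r, s', done)

    def q_update(s, a, r, s2, done):
        best = 0.0 if done else max(Q[(s2, b)] for b in range(n_actions))
        Q[(s, a)] += alpha * (r + gamma * best - Q[(s, a)])

    for _ in range(n_episodes):
        s = env.reset()                          # hypothetical gym-style interface
        done = False
        while not done:
            if random.random() < epsilon:
                a = random.randrange(n_actions)
            else:
                a = max(range(n_actions), key=lambda b: Q[(s, b)])
            s2, r, done = env.step(a)
            q_update(s, a, r, s2, done)          # learn from real experience
            model[(s, a)] = (r, s2, done)        # learn the (deterministic) model
            for _ in range(n_planning):          # plan from simulated experience
                (ps, pa), (pr, ps2, pdone) = random.choice(list(model.items()))
                q_update(ps, pa, pr, ps2, pdone)
            s = s2
    return Q
```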
Lecture 9: Exploration and Exploitation
-
Three methods of exploration and exploitation:
-
Random exploration: e.g. \(\epsilon\)-greedy
-
Optimism in the face of uncertainty: estimate uncertainty on value, prefer to explore states/actions with highest uncertainty.
-
Information state space: consider agent’s information as part of its state, lookahead to see how information helps reward.
-
-
Types of exploration:
-
State-action exploration: e.g. pick a different action \(A\) each time state \(S\) is visited.
-
Parameter exploration: parameterize policy \(\pi (A | S , u)\), e.g. pick different parameters and try for a while.
-
-
Multi-Armed Bandit:
-
One-step decision making problem.
-
No state space, no transition function.
-
\(R^a(r) = P [ R = r | A = a ]\) is an unknown probability distribution over rewards.
-
Total regret is a function of the gaps \(\Delta_a = V^* - Q(a)\) and the counts: \(L_t = \sum_a \mathbb{E}[N_t(a)] \Delta_a\).
-
\(\epsilon\)-greedy has linear total regret. Decaying \(\epsilon\)-greedy (pick a decay schedule for \(\epsilon_1, \epsilon_2, \ldots\)) can achieve logarithmic regret, but the right schedule requires the gaps, and hence \(V^*\), so it cannot be used directly. Optimism in the face of uncertainty (e.g. UCB) is a practical alternative (see the sketch after this list).
-
One can transform multi-armed bandit problem into a sequential decision making problem.
-
Define an MDP over information states.
-
At each step, information state \(S'\) summarizes all information accumulated so far.
-
MDP can then be solved by RL.
-
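A minimal UCB1 sketch for the multi-armed bandit setting above, as an instance of optimism in the face of uncertainty; the Bernoulli arm probabilities are made up:

```python
# UCB1: pick the arm maximising Q(a) + sqrt(2 * ln(t) / N(a)), an optimistic
# upper confidence bound; arms with few pulls (high uncertainty) get a bonus.
import math
import random

def ucb1(pull, n_arms, n_steps=10000):
    counts = [0] * n_arms
    values = [0.0] * n_arms                      # running mean reward per arm
    for t in range(1, n_steps + 1):
        if t <= n_arms:
            a = t - 1                            # pull each arm once to initialise
        else:
            a = max(range(n_arms),
                    key=lambda i: values[i] + math.sqrt(2 * math.log(t) / counts[i]))
        r = pull(a)
        counts[a] += 1
        values[a] += (r - values[a]) / counts[a]
    return values, counts

# Made-up example: Bernoulli arms with success probabilities 0.2, 0.5, 0.7.
probs = [0.2, 0.5, 0.7]
print(ucb1(lambda a: 1.0 if random.random() < probs[a] else 0.0, len(probs)))
```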
-
Contextual Bandits:
-
\(S = P [ S ]\) is an unknown distribution over states (or “contexts”).
-
\(R_s^a(r) = P [ R = r | S = s, A = a ]\) is an unknown probability distribution over rewards.
-
-
MDPs:
-
For unknown or poorly estimated states, replace the reward function with \(r_{max}\); this means being very optimistic about uncertain states.
-
Augmented MDP: includes information state so that \(S' = (S , I)\).
-
Lecture 10: Classic Games
-
Nash equilibrium is a joint policy for all players such that every player is playing the best response to all the other players, i.e. no player has an incentive to deviate.
-
Single-Agent and Self-Play Reinforcement Learning: the Nash equilibrium is the fixed point of self-play RL. Experience is generated by playing games between agents. Each agent learns the best response to the other players. One player’s policy determines another player’s environment.
-
Two-Player Zero-Sum Games: a two-player game with (alternating) players and equal and opposite rewards: \(R_1 + R_2 = 0\).