Lecture 1: Introduction to Reinforcement Learning

  • Planning: the rules of the game are given (a perfect model inside the agent’s head); plan ahead to find the optimal policy (look-ahead search or tree search).

  • In RL the environment is unknown; in planning the environment is known.

  • Types of RL agents:

    • Policy based

    • Value function based

    • Actor-critic (combines policy and value function)

  • Agent’s model is a representation of the environment in the agent’s head.

  • The agent is our brain, i.e. the algorithm we come up with.

  • The objective is the maximum expected reward; any notion of risk is already folded into the expectation (and the reward definition).

  • Value function: the goodness/badness of states, so it can be used to select between actions.

  • Full observability: the agent directly observes the environment state.

  • Data are not i.i.d.; observations are sequential and correlated.

  • Reward is delayed.

  • There is only a reward signal, no supervision.

Reinforcement learning is the science of decision making.

Lecture 2: Markov Decision Process

  • Partial ordering over policies:

    \[\pi' \geq \pi \Longleftarrow v_{\pi'} (s) \geq v_\pi (s), \forall s\]
  • Action-Value function Q is the same as value function, but also takes action as input:

    \[Q_\pi (s, a) = \mathbb{E}_\pi [G_t | S_t = s, A_t = a] = \mathbb{E}_\pi [R_{t+1} + \gamma * Q_\pi (S_{t+1}, A_{t+1}) | S_t = s, A_t = a]\]
  • Policy is a distribution over actions given states: \(\pi (a | s) = P [ A_t = a | S_t = s]\)

    Policies are stationary: \(A_t \sim \pi (\cdot | S_t), \forall t > 0\)

  • The returns are random samples; the value function is an expectation over these random samples.

  • Value function: \(v_\pi (s) = \mathbb{E}_\pi [G_t | S_t = s] = \mathbb{E}_\pi [R_{t+1} + \gamma * v_\pi (S_{t+1}) | S_t = s]\)

    It can also be expressed in matrix form: \(v_\pi = R^\pi + \gamma * P^\pi * v_\pi\), which can be solved directly as \(v_\pi = (I - \gamma * P^\pi)^{-1} R^\pi\) (see the sketch at the end of this lecture’s notes).

  • Total discounted reward from time step \(t\): \(G_t = R_{t+1} + \gamma * R_{t+2} + \gamma^2 * R_{t+3} + ... + \gamma^{T-t-1} * R_T\)

  • Why have a discount factor?

    • Uncertainty in the future.

    • Mathematically convenient.

    • Avoids infinite returns in cyclic Markov processes.

    • Immediate rewards are worth more than delayed rewards.

  • Reward function: \(R_s^a = \mathbb{E} [R_{t+1} | S_t = s, A_t = a]\)

  • Sample episodes from an MDP: each episode is a different trajectory through the states.

  • State transition probability: \(P^a_{ss'} = P [S_{t+1} = s' | S_t = s, A_t = a]\)

  • Showing an ad on a website is actually a bandit problem.

  • Almost all RL problems can be modeled using MDPs.
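
A minimal numerical sketch of the matrix form above, assuming a hand-made 3-state toy problem (the transition matrix, rewards, and discount are invented purely for illustration):

```python
import numpy as np

# Toy Markov reward process induced by a fixed policy. The transition
# matrix, rewards, and discount below are invented purely for illustration.
P_pi = np.array([[0.5, 0.5, 0.0],   # P^pi: state-to-state transitions under pi
                 [0.2, 0.0, 0.8],
                 [0.0, 0.0, 1.0]])  # last state is absorbing (terminal)
R_pi = np.array([1.0, 2.0, 0.0])    # R^pi: expected immediate reward per state
gamma = 0.9

# Bellman expectation equation in matrix form: v = R^pi + gamma * P^pi * v,
# i.e. (I - gamma * P^pi) v = R^pi, solved directly for small state spaces.
v_pi = np.linalg.solve(np.eye(3) - gamma * P_pi, R_pi)
print(v_pi)
```

The direct solve costs \(O(n^3)\) for \(n\) states, which is why iterative methods (dynamic programming, Monte-Carlo, TD) are preferred for larger problems.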

Lecture 3: Planning by Dynamic Programming

  • Value Iteration:

    • Goal: find the optimal policy \(\pi^*\).

    • Start with an arbitrary value function \(v_1\) and repeatedly apply the Bellman optimality backup to get \(v_2, v_3, \dots\), converging to \(v^*\) (see the value-iteration sketch at the end of this lecture’s notes).

    • Unlike policy iteration, there is no explicit policy.

    • Intermediate value functions may not correspond to any policy.

    • Will converge.

    • Intuition: start with final reward and work backwards.

  • Policy Iteration:

    • Given a policy \(\pi\) (any policy works):

    • Evaluate the policy:

      \[V_\pi (s) = \mathbb{E}_\pi [ R_{t+1} + \gamma * R_{t+2} + ... | S_t = s ]\]
    • Improve the policy: \(\pi' = \text{greedy}(V_\pi)\)

    • Will converge.

  • Policy Evaluation:

    • Start with an arbitrary value function \(v_1\) and repeatedly apply the Bellman expectation backup to get \(v_2, v_3, \dots\), converging to \(v_\pi\).

    • Backups can be synchronous (every state updated in each sweep) or asynchronous.

    • Will converge.

    • \[V^{k+1} = R^\pi + \gamma * P^\pi * V^k\]
  • Input: \(MDP(S, A, P, R, \gamma)\) - Output: optimal value function \(V^*\) and optimal policy \(\pi^*\)

  • In planning, full knowledge of the MDP is given. We want to solve the MDP, i.e. find the optimal policy.

  • MDPs satisfy both DP properties.

  • Dynamic programming applies to problems that have:

    • Optimal substructure.

    • Overlapping subproblems.

  • Dynamic Programming: a method for solving complex problems by breaking them down into subproblems, solving the subproblems, and combining their solutions.

    • Dynamic: sequential component to the problem.

    • Programming: a mapping, e.g. a policy.
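
As a concrete illustration, here is a minimal value-iteration sketch, assuming a randomly generated toy MDP (the transition tensor `P[s, a, s']` and rewards `R[s, a]` are placeholders, not anything from the lecture):

```python
import numpy as np

# Randomly generated toy MDP: transition tensor P[s, a, s'] and rewards R[s, a].
n_states, n_actions, gamma, tol = 3, 2, 0.9, 1e-8
rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # P[s, a, :] sums to 1
R = rng.normal(size=(n_states, n_actions))

V = np.zeros(n_states)
while True:
    # Bellman optimality backup (synchronous):
    # V(s) <- max_a [ R(s, a) + gamma * sum_s' P(s' | s, a) V(s') ]
    Q = R + gamma * (P @ V)        # shape (n_states, n_actions)
    V_new = Q.max(axis=1)
    if np.max(np.abs(V_new - V)) < tol:
        break
    V = V_new

pi_star = Q.argmax(axis=1)         # greedy policy w.r.t. the converged values
print(V, pi_star)
```

Replacing the max over actions with an expectation under a fixed policy turns the same loop into iterative policy evaluation, the evaluation step inside policy iteration.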

Lecture 4: Model-Free Prediction

  • Bootstrapping: the update involves an estimated return (a value estimate) rather than the actual return.

  • Monte-Carlo Learning:

    • In a nutshell, it runs all the way through the trajectory and estimates the value as the mean of the sampled returns.

    • Useful only for episodic (terminating) settings; the update applies only once an episode is complete. No bootstrapping.

    • High variance, zero bias.

    • Does not exploit the Markov property, so MC is usually the better choice in partially observed environments.

  • Temporal-Difference Learning:

    • Looks one step ahead and then estimates the rest of the return from the current value estimate (see the TD(0) sketch after this list).

    • Uses bootstrapping to learn, so learns from incomplete episodes. Does online learning.

    • Low variance, some bias.

    • Exploits the Markov property: it implicitly builds the MDP structure and solves for that MDP (refer to the AB example). More efficient in Markov environments.

    • \(TD(\lambda)\): instead of looking a fixed \(n\) steps into the future (n-step TD), \(TD(\lambda)\) averages over all n-step returns with weights \((1 - \lambda) * \lambda^{n-1}\), giving the \(\lambda\)-return.
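
A minimal tabular TD(0) prediction sketch; the environment interface (`reset()`/`step(action)`) and the `policy` function are hypothetical stand-ins, not anything defined in the lecture:

```python
from collections import defaultdict

# Tabular TD(0) prediction: estimate v_pi from sampled transitions.
# `env` (with reset()/step(action)) and `policy` (state -> action) are
# hypothetical stand-ins for an episodic environment and a fixed policy.
def td0_prediction(env, policy, n_episodes=1000, alpha=0.1, gamma=1.0):
    V = defaultdict(float)
    for _ in range(n_episodes):
        state, done = env.reset(), False
        while not done:
            next_state, reward, done = env.step(policy(state))
            # TD target bootstraps from the current estimate of the next state
            # (a terminal next state contributes 0).
            target = reward + (0.0 if done else gamma * V[next_state])
            V[state] += alpha * (target - V[state])   # TD(0) update
            state = next_state
    return V
```

Replacing the one-step TD target with the full return \(G_t\), computed after the episode terminates, recovers Monte-Carlo prediction.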

Lecture 5: Model Free Control

  • On-policy Learning:

    • Learn about policy \(\pi\) from experience sampled from \(\pi\).

    • \(\epsilon\)-greedy Exploration:

      • With probability = \(1 - \epsilon\) choose the greedy action. With probability = \(\epsilon\) choose a random action.

      • Guarantees policy improvement.

    • Greedy in the Limit of Infinite Exploration (GLIE):

      • All state-action pairs are explored infinitely many times.
      • The policy converges to the greedy policy.

    • SARSA:

      • \[Q(S, A) \leftarrow Q(S, A) + \alpha * (R + \gamma * Q(S', A') - Q(S, A))\]
      • Converges to the optimal action-value function, under some conditions.
  • Off-policy Learning:

    • Learn about policy \(\pi\) from experience sampled from \(\mu\).

    • Evaluate target policy \(\pi (a | s)\) to compute \(v_\pi (s)\) or \(q_\pi (s, a)\) while following the behavior policy \(\mu (a | s)\).

    • Can learn from observing human behaviors or other agents.

    • Learn optimal policy while following exploratory policy.

    • Monte-Carlo learning does not work well off-policy because of very high variance: over many steps, the behavior policy rarely matches the target policy closely enough to be useful.

    • TD learning works best off-policy, because the policies only need to be similar over a single step.

    • Q-Learning:

      • The trajectory (including \(S'\)) is generated by the behavior policy; the update bootstraps from the greedy action in \(S'\).

      • \[Q(S, A) \leftarrow Q(S, A) + \alpha * (R + \gamma * \max_{a'} Q(S', a') - Q(S, A))\]
      • Converges to the optimal action-value function (see the sketch after this list).
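
A minimal tabular Q-learning sketch with an \(\epsilon\)-greedy behavior policy; the environment interface (`reset()`/`step(action)`) and the `actions` list are assumptions for illustration:

```python
import random
from collections import defaultdict

# Tabular Q-learning with an epsilon-greedy behaviour policy. The env
# interface (reset()/step(action)) and the `actions` list are assumptions.
def q_learning(env, actions, n_episodes=1000, alpha=0.1, gamma=0.99, eps=0.1):
    Q = defaultdict(float)

    def epsilon_greedy(state):
        if random.random() < eps:
            return random.choice(actions)                  # explore
        return max(actions, key=lambda a: Q[(state, a)])   # exploit

    for _ in range(n_episodes):
        state, done = env.reset(), False
        while not done:
            action = epsilon_greedy(state)                 # behaviour policy
            next_state, reward, done = env.step(action)
            # Off-policy target: bootstrap from the greedy action in S'.
            best_next = 0.0 if done else max(Q[(next_state, a)] for a in actions)
            Q[(state, action)] += alpha * (reward + gamma * best_next
                                           - Q[(state, action)])
            # SARSA would instead bootstrap from Q[(next_state, A')] for the
            # action A' actually chosen by the behaviour policy (on-policy).
            state = next_state
    return Q
```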

Lecture 6: Value Function Approximation

  • Large-scale RL problems have huge state spaces, so tabular methods no longer apply.

  • So far, we had \(V(s)\) or \(Q(s, a)\). But for large-scale problems, computing and storing them takes too much memory and time. Solution: estimate the value function with function approximation:

    \(\hat{v} (s, w) \approx v_\pi (s)\) or \(\hat{q} (s, a, w) \approx q_\pi (s, a)\)

    Generalize from seen states to unseen ones. Update parameter \(w\) using MC or TD learning.

  • Incremental Methods:

    • Table lookup is a special case of linear value function approximation.

    • The target for MC is the return \(G_t\); for TD(0) it is \(R_{t+1} + \gamma * \hat{v}(S_{t+1}, w)\); for TD(\(\lambda\)) it is the \(\lambda\)-return \(G^\lambda_t\) (see the semi-gradient TD(0) sketch after this list).

    • Gradient descent is simple and appealing.

  • Batch Methods:

    • Incremental gradient descent is not sample efficient: each experience is used only once.

    • Least squares algorithms find parameter vector \(w\) minimizing sum squared error.

    • Example: experience replay in DQN.
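
A minimal semi-gradient TD(0) sketch with linear value-function approximation; the `features` function and the environment/policy interfaces are hypothetical placeholders:

```python
import numpy as np

# Semi-gradient TD(0) with linear value-function approximation:
# v_hat(s, w) = features(s) . w. `features` (state -> fixed-length vector)
# and the env/policy interfaces are hypothetical placeholders.
def semi_gradient_td0(env, policy, features, n_features,
                      n_episodes=500, alpha=0.01, gamma=1.0):
    w = np.zeros(n_features)
    for _ in range(n_episodes):
        state, done = env.reset(), False
        while not done:
            next_state, reward, done = env.step(policy(state))
            x = features(state)
            v = x @ w                                          # v_hat(S, w)
            v_next = 0.0 if done else features(next_state) @ w
            # For a linear approximator the gradient of v_hat w.r.t. w is x(s),
            # so the update is w <- w + alpha * (target - v_hat) * x(s).
            w += alpha * (reward + gamma * v_next - v) * x
            state = next_state
    return w
```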

Lecture 7: Policy Gradient Methods

  • Policy gradient methods optimize the policy directly, instead of the value function.

  • Parametrize the policy: \(\pi_\theta (s, a) = P [a | s, \theta]\)

  • The point of using policy gradient methods is being able to scale (e.g. to high-dimensional or continuous action spaces).

  • Nash equilibrium is the game-theoretic notion of optimality.

  • Finite Difference Policy Gradient:

    • For each dimension \(k\) of the parameters \(\theta\), perturb \(\theta\) by a small amount \(\epsilon\) in that dimension and estimate the partial derivative with a finite difference: \(\partial J(\theta) / \partial \theta_k \approx (J(\theta + \epsilon u_k) - J(\theta)) / \epsilon\), where \(u_k\) is the unit vector in dimension \(k\).

    • Uses \(n\) evaluations to compute the policy gradient in \(n\) dimensions.

    • Simple, noisy, inefficient.

    • Works for arbitrary policies.

  • Monte-Carlo Policy Gradient:

    • Score function is \(\nabla_\theta \log \pi_\theta (s, a)\)

    • In continuous action spaces, a Gaussian policy is a natural choice.

    • REINFORCE: update parameters by stochastic gradient ascent (see the sketch after this list).

    • Has high variance.

  • Actor-Critic Policy Gradient:

    • Use a critic to estimate the action-value function.

    • Critic: updates action-value function parameters \(w\).

    • Actor: updates policy parameters \(\theta\) in direction suggested by critic.

    • Approximating the policy gradient introduces bias.

    • Subtract a baseline function \(B(s)\) to reduce variance without changing the expectation; the state-value function \(V(s)\) is a good baseline.

    • Using the advantage function \(A_\pi (s, a) = Q_\pi (s, a) - V_\pi (s)\) reduces the variance.

    • Critic estimates the advantage function.
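
A minimal REINFORCE sketch with a linear-softmax policy; `features(s, a)` and the environment interface are assumed placeholders, and no baseline is used, so the high variance noted above applies:

```python
import numpy as np

# REINFORCE with a linear-softmax policy over discrete actions.
# `features(s, a)` (returning a length-n_features vector) and the env
# interface are assumed placeholders; no baseline is used here.
def reinforce(env, features, n_features, n_actions,
              n_episodes=1000, alpha=0.01, gamma=0.99):
    theta = np.zeros(n_features)

    def policy_probs(state):
        prefs = np.array([features(state, a) @ theta for a in range(n_actions)])
        exp = np.exp(prefs - prefs.max())       # softmax over action preferences
        return exp / exp.sum()

    for _ in range(n_episodes):
        # Generate one full episode under the current policy.
        trajectory, state, done = [], env.reset(), False
        while not done:
            action = np.random.choice(n_actions, p=policy_probs(state))
            next_state, reward, done = env.step(action)
            trajectory.append((state, action, reward))
            state = next_state

        # Compute Monte-Carlo returns G_t backwards, then ascend the gradient.
        G = 0.0
        for state, action, reward in reversed(trajectory):
            G = reward + gamma * G
            probs = policy_probs(state)
            # Score function for linear softmax:
            # grad log pi(a|s) = x(s, a) - sum_b pi(b|s) x(s, b)
            score = features(state, action) - sum(
                probs[b] * features(state, b) for b in range(n_actions))
            theta += alpha * G * score          # stochastic gradient ascent
    return theta
```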

Lecture 8: Integrating Learning and Planning

  • So far, the course covered model-free RL.

  • Last lecture: learn policy directly from experience; Previous lectures: learn value function directly from experience; This lecture: learn model directly from experience.

  • Use planning to construct a value function or policy.

  • Model-Based RL:
    • Plan value function (and/or policy) from model.
    • Advantages:
      • Can learn model by supervised learning methods.
      • Can reason about model uncertainty.
    • Disadvantages:
      • First learn a model, then construct a value function: two sources of approximation error.
    • Model is a parametrized representation of the MDP, i.e. representing state transitions and rewards.
    • Learning \(s, a \rightarrow r\) is a regression problem.
    • Learning \(s, a \rightarrow s'\) is a density estimation problem.
    • Planning the MDP = Solving the MDP, i.e. figure out what’s the best thing to do.
    • Planning algorithms: value iteration, policy iteration, tree search, …
  • Integrated Architectures:
    • Put together the best parts of model-free and model-based RL.
    • Two sources of experience: real: sampled from environment (true MDP); simulated: sampled from model (approximate MDP).
    • Dyna:
      • Learn a model from real experience.
      • Learn and plan the value function from real and simulated experience (see the Dyna-Q sketch after this list).
  • Simulation-Based Search:
    • Forward search: select best action by lookahead. Doesn’t explore the entire state space. Builds a search tree with current s as the root. Uses a model of MDP to look ahead. Doesn’t solve the whole MDP, just sub-MDP starting from now.
    • Simulation-based search is a forward search paradigm using sample-based planning.
    • Simulates episodes of experience from now with the model.
    • Applies model-free RL to simulated episodes.
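
A minimal Dyna-Q sketch combining direct RL, model learning, and planning from simulated experience; the environment interface, the deterministic tabular model, and the hyper-parameters are illustrative assumptions:

```python
import random
from collections import defaultdict

# Dyna-Q: direct RL from real experience, plus n planning updates per real
# step using a learned tabular (deterministic) model. The env interface,
# model form, and hyper-parameters are illustrative assumptions.
def dyna_q(env, actions, n_episodes=200, n_planning=10,
           alpha=0.1, gamma=0.95, eps=0.1):
    Q = defaultdict(float)
    model = {}                                   # (s, a) -> (r, s') from real data

    def epsilon_greedy(state):
        if random.random() < eps:
            return random.choice(actions)
        return max(actions, key=lambda a: Q[(state, a)])

    def q_update(s, a, r, s2):
        best_next = max(Q[(s2, b)] for b in actions)
        Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])

    for _ in range(n_episodes):
        state, done = env.reset(), False
        while not done:
            action = epsilon_greedy(state)
            next_state, reward, done = env.step(action)
            q_update(state, action, reward, next_state)     # learn from real experience
            model[(state, action)] = (reward, next_state)   # learn the model
            for _ in range(n_planning):                     # plan from simulated experience
                (s, a), (r, s2) = random.choice(list(model.items()))
                q_update(s, a, r, s2)
            state = next_state
    return Q
```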

Lecture 9: Exploration and Exploitation

  • Three methods of exploration and exploitation:

    • Random exploration: e.g. \(\epsilon\)-greedy

    • Optimism in the face of uncertainty: estimate uncertainty on value, prefer to explore states/actions with highest uncertainty.

    • Information state space: consider agent’s information as part of its state, lookahead to see how information helps reward.

  • Types of exploration:

    • State-action exploration: e.g. pick different action \(A\) each time in state \(S\).

    • Parameter exploration: parameterize policy \(\pi (A | S , u)\), e.g. pick different parameters and try for a while.

  • Multi-Armed Bandit:

    • One-step decision making problem.

    • No state space, no transition function.

    • \(\mathcal{R}^a (r) = P [ R = r | A = a ]\) is an unknown probability distribution over rewards.

    • Regret is a function of gaps and the counts.

    • \(\epsilon\)-greedy has linear total regret. To resolve this, pick a decay schedule for \(\epsilon_1, \epsilon_2, \dots\) (see the sketch after this list). However, the schedule from the lecture cannot be used directly because computing the gaps requires knowing \(V^*\).

    • One can transform multi-armed bandit problem into a sequential decision making problem.

    • Define an MDP over information states.

    • At each step, information state \(S'\) summarizes all information accumulated so far.

    • MDP can then be solved by RL.

  • Contextual Bandits:

    • \(S = P [ S ]\) is an unknown distribution over states (or “contexts”).

    • \(\mathcal{R}^a_s (r) = P [ R = r | S = s, A = a ]\) is an unknown probability distribution over rewards.

  • MDPs:

    • For unknown or poorly estimated states, replace the reward function with \(r_{max}\). This means being very optimistic about uncertain states.

    • Augmented MDP: includes information state so that \(S' = (S , I)\).
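
A minimal decaying-\(\epsilon\)-greedy bandit sketch; the 1/t-style schedule is only an illustrative stand-in (the schedule discussed in the lecture depends on the unknown gaps), and `pull(arm)` is a hypothetical reward-sampling function:

```python
import random

# Decaying-epsilon-greedy multi-armed bandit. The 1/t-style schedule is just
# an illustrative choice, and `pull(arm)` is a hypothetical function that
# returns a sampled reward for the chosen arm.
def decaying_eps_greedy(pull, n_arms, n_steps=10000):
    counts = [0] * n_arms      # N(a): number of times each arm was pulled
    means = [0.0] * n_arms     # Q(a): empirical mean reward per arm
    for t in range(1, n_steps + 1):
        eps = min(1.0, n_arms / t)                            # epsilon_t decays to 0
        if random.random() < eps:
            arm = random.randrange(n_arms)                    # explore
        else:
            arm = max(range(n_arms), key=lambda a: means[a])  # exploit
        reward = pull(arm)
        counts[arm] += 1
        means[arm] += (reward - means[arm]) / counts[arm]     # incremental mean
    return means, counts
```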

Lecture 10: Classic Games

  • Nash equilibrium is a joint policy for all players such that every single player is playing the best response to all other players, i.e. no player would benefit from deviating from the Nash equilibrium.

  • Single-Agent and Self-Play Reinforcement Learning: the Nash equilibrium is a fixed point of self-play RL. Experience is generated by playing games between agents. Each agent learns the best response to the other players. One player’s policy determines another player’s environment.

  • Two-Player Zero-Sum Games: a two-player game has two (alternating) players with equal and opposite rewards: \(R_1 + R_2 = 0\).