The discounting factor γ ∈ [0, 1] penalizes rewards in the future
Both the policy and the value function are what we try to learn in reinforcement learning.
The model defines the reward function and transition probabilities.
It is a mapping from state s to action a.
The difference between the action-value and the state-value is the action advantage function ("A-value"): A(s, a) = Q(s, a) − V(s)
All states in an MDP have the "Markov" property, referring to the fact that the future depends only on the current state, not the history: P[S_{t+1} | S_t] = P[S_{t+1} | S_1, …, S_t]
When the model is fully known, following Bellman equations, we can use Dynamic Programming (DP) to iteratively evaluate value functions and improve policy.
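As a concrete illustration of the DP approach, here is a minimal value-iteration sketch on a toy MDP. The states, transition table, and rewards below are purely illustrative, not from the article:

```python
import numpy as np

# Hypothetical 3-state, 2-action MDP.
# P[s][a] -> list of (prob, next_state, reward).
n_states, n_actions, gamma = 3, 2, 0.9
P = {
    0: {0: [(1.0, 1, 0.0)], 1: [(1.0, 2, 0.0)]},
    1: {0: [(1.0, 0, 0.0)], 1: [(1.0, 2, 1.0)]},
    2: {0: [(1.0, 2, 0.0)], 1: [(1.0, 2, 0.0)]},  # absorbing state
}

V = np.zeros(n_states)
for _ in range(100):
    # Bellman optimality backup: V(s) <- max_a sum p * (r + gamma * V(s'))
    V_new = np.array([
        max(sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])
            for a in range(n_actions))
        for s in range(n_states)
    ])
    if np.max(np.abs(V_new - V)) < 1e-8:
        V = V_new
        break
    V = V_new

# Greedy policy improvement with respect to the converged V.
policy = [
    max(range(n_actions),
        key=lambda a: sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a]))
    for s in range(n_states)
]
print(V, policy)
```

In this toy model the iteration converges to V = [0.9, 1.0, 0.0], and the greedy policy takes the rewarding action in state 1.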
We can use functions (i.e. a machine learning model) to approximate Q values; this is called function approximation.
Training can be unstable and hard to converge when off-policy learning, nonlinear function approximation, and bootstrapping are combined in one RL algorithm (the "deadly triad").
DQN stabilizes training with two mechanisms: experience replay and an occasionally frozen target network.
Experience replay improves data efficiency, removes correlations in the observation sequences, and smooths over changes in the data distribution.
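A minimal replay-buffer sketch, assuming a simple FIFO buffer with uniform sampling (the class name and capacity are illustrative, not from the article):

```python
import random
from collections import deque

class ReplayBuffer:
    def __init__(self, capacity=10000):
        # Oldest transitions are evicted once capacity is reached.
        self.buffer = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Uniform sampling breaks the temporal correlation between
        # consecutive transitions within an episode.
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)

buf = ReplayBuffer(capacity=100)
for t in range(50):
    buf.push(t, 0, 1.0, t + 1, False)
batch = buf.sample(8)
```

Because each stored transition can be sampled many times, the same experience contributes to multiple gradient updates, which is where the data-efficiency gain comes from.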
The frozen target network overcomes short-term oscillations.
one idea … central and novel to reinforcement learning
to update the value function V(S_t) towards an estimated return R_{t+1} + γ V(S_{t+1}) (known as the "TD target").
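The TD update above can be sketched in tabular form. This toy example uses a two-state chain (the transition and reward are illustrative): state 0 moves to terminal state 1 with reward 1.

```python
gamma, alpha = 0.9, 0.1   # discount factor and learning rate
V = {0: 0.0, 1: 0.0}      # tabular value function

for _ in range(1000):
    # One observed transition per "episode": (s=0, r=1, s'=1), then done.
    s, r, s_next = 0, 1.0, 1
    td_target = r + gamma * V[s_next]       # the TD target
    V[s] += alpha * (td_target - V[s])      # move V(s) toward the target
    V[s_next] += alpha * (0.0 - V[s_next])  # terminal state: target is 0

print(V)
```

With repeated updates, V(0) converges to the true return of 1.0; note the update bootstraps from the current estimate V(S_{t+1}) rather than waiting for the full episode return.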
action-value (“Q-value”; Q as “Quality” I believe?) of a state-action pair
Value-based methods learn the state/action value function and then select actions accordingly.
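"Selecting actions accordingly" usually means acting greedily with some exploration. A minimal ε-greedy sketch over a learned Q table (the table entries here are illustrative):

```python
import random

def epsilon_greedy(Q, state, n_actions, epsilon=0.1):
    if random.random() < epsilon:
        return random.randrange(n_actions)  # explore: random action
    # Exploit: pick the action with the highest estimated Q-value.
    return max(range(n_actions), key=lambda a: Q[(state, a)])

# Hypothetical learned values for a single state with two actions.
Q = {(0, 0): 0.2, (0, 1): 0.8}
a = epsilon_greedy(Q, state=0, n_actions=2, epsilon=0.0)  # pure exploitation
```

With ε = 0 the agent always exploits; a small positive ε keeps it exploring so value estimates for all actions continue to improve.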
Policy Gradient methods instead learn the policy directly with a parameterized function π_θ(a|s). The loss function is the negative expected return, so minimizing it performs gradient ascent on the expected reward.
It is natural to expect policy-based methods to be more useful in continuous spaces: there is an infinite number of actions and/or states to estimate values for, so value-based approaches are computationally much more expensive there.
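A minimal policy-gradient sketch in the REINFORCE style, on a two-armed bandit (all numbers are illustrative; the policy is a softmax over two logits θ, updated with the log-likelihood-ratio gradient):

```python
import numpy as np

rng = np.random.default_rng(0)
theta = np.zeros(2)        # policy parameters: one logit per action
true_rewards = [0.1, 0.9]  # hypothetical mean reward of each arm
lr = 0.1

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

for _ in range(2000):
    probs = softmax(theta)
    a = rng.choice(2, p=probs)        # sample an action from the policy
    r = true_rewards[a]               # observe its reward
    # Gradient of log pi(a) w.r.t. theta for a softmax policy:
    # one-hot(a) - probs.
    grad_log_pi = -probs
    grad_log_pi[a] += 1.0
    theta += lr * r * grad_log_pi     # REINFORCE update: r * grad log pi

print(softmax(theta))  # probability mass should concentrate on arm 1
```

No value function is estimated anywhere: the update works directly on the policy parameters, which is exactly why this family of methods extends naturally to continuous action spaces (e.g. by parameterizing a Gaussian over actions instead of a softmax).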