Policy Gradient Estimation in Deep RL
Disclaimer: This is an early version of this post.
When we do deep Reinforcement Learning (RL) with a policy gradient method[^1], we have a neural network policy \( \pi_\theta \), and we're trying to optimize its parameters \(\theta\) in order to maximize the expected return \( J(\pi_\theta) = \mathbb{E} _ { \tau \sim \pi_\theta } \left[ R(\tau) \right] \), where \( R(\tau) \) is the return of a trajectory \(\tau\). In the case of finite-horizon undiscounted return over \( T+1 \) timesteps, we can write \( R(\tau) = \sum_{t=0}^{T} r_t \); for infinite-horizon \(\gamma\)-discounted return[^2], we can write \( R(\tau) = \sum_{t=0}^{\infty} \gamma^t r_t \), with \(\gamma \in (0, 1) \).[^3]
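To make the two definitions concrete, here is a minimal sketch of both, assuming `rewards` holds the scalar rewards \( r_0, \dots, r_T \) collected along a single rollout (the function names are my own, not from any library):

```python
def undiscounted_return(rewards):
    # R(tau) = sum_t r_t
    return sum(rewards)

def discounted_return(rewards, gamma=0.99):
    # R(tau) = sum_t gamma^t * r_t, truncated at the rollout length in practice
    return sum(gamma ** t * r for t, r in enumerate(rewards))
```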
There are many policy-gradient deep RL algorithms, with various approaches to the explore-exploit or bias-variance tradeoffs, stability tricks, data efficiency due to reusing stale rollouts, hyperparameter choices, etc. However, in order to use gradient-based optimization[^4], they invariably estimate the gradient of the expected return with respect to the policy parameters, also known as the policy gradient \( \nabla_\theta J \).
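For reference, the classic score-function / likelihood-ratio form of this quantity (stated here for the finite-horizon undiscounted case, without a baseline) is

\[ \nabla_\theta J(\pi_\theta) = \mathbb{E} _ { \tau \sim \pi_\theta } \left[ \sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t) \, R(\tau) \right] \]

though, as we'll see, exploiting this identity is only one of several ways to estimate \( \nabla_\theta J \).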
I find it extremely useful to categorize deep RL algorithms according to how they estimate \( \nabla_\theta J \). Everything else is best understood in the context of its influence on this estimation.
If we do this, three main categories emerge (each is sketched in toy code after the list):
- REINFORCE-based \( \nabla_\theta J \) estimation: relies on the Policy Gradient Theorem (PGT), using the score function / log-derivative trick to compute an _unbiased_ (but high-variance) estimate of \( \nabla_\theta J \).
- VPG/A2C/A3C, TRPO, and the widely-used/modern PPO and GRPO[^5] all rely on REINFORCE.
- Key autograd-derived quantity: \( \nabla_\theta \log \pi_\theta \).
- "How would changes to the policy parameters affect the log-likelihood of a given action?"
- Requires a stochastic policy.
- Remains unbiased even for categorical actions.
- A value estimate is not structurally or theoretically required. However, in practice, baseline-free REINFORCE is incredibly slow to converge even on toy problems. This means that one usually trains a critic network to estimate the value function, in order to reduce variance.
- While some sort of baseline is a no-brainer, don't be so quick to assume you need a critic network: in DeepSeekMath's GRPO, LLMs undergo RL training at scale without a critic, using simple averaging across \(k=64\) rollouts from the same prompt to produce a good-enough baseline.
- Critic-based \( \nabla_\theta J \) estimation, AKA "backprop through a learned critic": yields low-variance but biased gradient estimates (biased because the learned critic is only an approximation).
- This category includes DDPG, TD3, and SAC. SHAC, AHAC and SAPO (discussed in the next section) also partially rely on critic-based policy gradient estimation.
- Key autograd-derived quantities: \( \nabla_a Q \) and \( \nabla_\theta a \)
- \( \nabla_a Q \): "How would changes to the action affect my estimate of the state-action value function?"
- \( \nabla_\theta a \): "How would changes to the policy parameters affect the (sampled or deterministic) action?"
- By combining the above through the chain rule, we estimate \( \nabla_\theta Q \): "How would changes to the policy parameters affect my estimate of the state-action value function?"
- Here, the learned critic network is not just a variance-reduction trick; it is structurally indispensable for training the actor!
- This approach is compatible with deterministic actions, as seen in DPG/DDPG.
- When using stochastic actions (as in SAC), one normally relies on reparameterized sampling of actions in order to let autograd compute \( \nabla_\theta a \).[^6]
- Simply backpropagating through the environment
- Dead simple at its core: if the instantaneous reward and the dynamics are differentiable, we can just let autograd do the work, get an unbiased gradient estimate[^7], and maximize the empirical reward without doing anything special.
- While this sounds great in theory (after all, why jump through hoops if autograd gives you unbiased gradients?), the resulting gradients can be very high-variance/norm, particularly at long horizons.
- APG, SHAC, AHAC and SAPO all rely on a differentiable environment, which enables each of them to use backpropagation through time (BPTT).
- APG relies solely on BPTT; SHAC, AHAC and SAPO all combine BPTT with critic-based estimation at longer temporal horizons.
- If you're interested, you should read _Do Differentiable Simulators Give Better Policy Gradients?_ (2022).
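To make the three estimation strategies concrete, here is a self-contained toy sketch of my own (not taken from any of the algorithms above): a single-step "environment" whose reward \( r(a) = -\lVert a - a^* \rVert^2 \) is a differentiable function of the action, and a state-less Gaussian policy whose only parameter is its mean \( \mu \). The critic here is a function of the action only, and actions are reparameterized (SAC-style) rather than deterministic (DDPG-style). All names (`env_reward`, `a_star`, `mu`, `sigma`) are placeholders made up for illustration; the point is only to contrast how each estimator obtains a gradient with respect to the policy parameters.

```python
import torch

torch.manual_seed(0)
act_dim = 4
a_star = torch.randn(act_dim)                 # the (unknown to the policy) optimal action
sigma = 0.1                                   # fixed exploration noise

def env_reward(a):
    # A differentiable one-step "environment": reward is highest at a_star.
    return -((a - a_star) ** 2).sum(-1)

# State-less Gaussian policy pi_theta = N(mu, sigma^2 I); theta is just mu here.
mu = torch.zeros(act_dim, requires_grad=True)

def reinforce_grad(n_samples=256):
    """Score-function / likelihood-ratio estimator: unbiased, but high variance."""
    with torch.no_grad():                     # actions and returns are treated as data
        actions = mu + sigma * torch.randn(n_samples, act_dim)
        returns = env_reward(actions)
        baseline = returns.mean()             # batch-mean baseline for variance reduction
    logp = torch.distributions.Normal(mu, sigma).log_prob(actions).sum(-1)
    surrogate = (logp * (returns - baseline)).mean()
    return torch.autograd.grad(surrogate, mu)[0]       # grad flows only through log pi

def critic_based_grad(n_samples=256, fit_iters=200, fit_batch=512):
    """Backprop through a learned critic Q(a): low variance, biased by the critic's fit error."""
    critic = torch.nn.Sequential(
        torch.nn.Linear(act_dim, 64), torch.nn.Tanh(), torch.nn.Linear(64, 1))
    opt = torch.optim.Adam(critic.parameters(), lr=1e-2)
    for _ in range(fit_iters):                # regress Q(a) onto observed rewards
        a = mu.detach() + sigma * torch.randn(fit_batch, act_dim)
        loss = (critic(a).squeeze(-1) - env_reward(a)).pow(2).mean()
        opt.zero_grad()
        loss.backward()
        opt.step()
    a = mu + sigma * torch.randn(n_samples, act_dim)   # reparameterized actions: grad_theta a
    q = critic(a).squeeze(-1).mean()                   # grad_a Q, chained by autograd
    return torch.autograd.grad(q, mu)[0]

def backprop_through_env_grad(n_samples=256):
    """Differentiate the reward directly: only possible because env_reward is differentiable."""
    a = mu + sigma * torch.randn(n_samples, act_dim)
    return torch.autograd.grad(env_reward(a).mean(), mu)[0]

if __name__ == "__main__":
    print("analytic            :", 2 * (a_star - mu.detach()))
    print("REINFORCE           :", reinforce_grad())
    print("critic-based        :", critic_based_grad())
    print("backprop through env:", backprop_through_env_grad())
```

On this toy problem, all three estimates should land near the analytic gradient \( 2\,(a^* - \mu) \). Their real differences (the variance of the REINFORCE estimate, the bias inherited from the critic's fit, and the high-variance, potentially exploding gradients of BPTT over long horizons) only show up in less trivial, multi-step settings.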
[^1]: I will not discuss Deep Q-Learning (DQN) and other non-policy-gradient methods here, as they are out of scope for this post about deep policy gradient methods.
[^2]: For more details, see e.g. OpenAI's Spinning Up in Deep RL - Part 1: Key Concepts in RL.
[^3]: Consider that choosing \( \gamma < 1 \) still has a purpose in finite-horizon cases: it turns the Bellman expectation backup operator into a contraction mapping, enabling bootstrapping to work even with the meaningless value estimates produced early in training, before the critic has learned much.
[^4]: Via SGD, Adam(W), RMSProp, Muon, etc. The indisputably-successful and still somewhat-recent DreamerV3 (originally released in 2023) uses the LaProp optimizer (described as "RMSProp with momentum"), which I have never encountered elsewhere. Muon is now the default optimizer in pufferlib 3.0 (relevant line of code; tweet from the author, Joseph Suarez).
[^5]: See also GRPO's precursor RLOO and this interesting discussion on RL for LLM post-training.
[^6]: Note that there is no unbiased reparameterization of categorical sampling, i.e. sampling from a discrete distribution. The straight-through, Gumbel-Softmax/Concrete and ReinMax estimators are all biased, though each improves upon the previous one.
[^7]: Assuming a differentiable simulator, rather than a learned model of the environment dynamics. In the latter case, we would still get unbiased gradients, but they would be derived from an objective that is itself biased, as is the case when backpropagating through a learned critic.