PPO Explained and Its Constraints: Introducing PDPPO as an Alternative
Author(s): Leonardo Kanashiro Felizardo
Originally published on Towards AI.
What is PPO, and Why is it Popular?
Proximal Policy Optimization (PPO) has rapidly emerged as a leading model-free reinforcement learning (RL) method thanks to its simplicity and strong performance across a wide range of domains. PPO borrows the central idea of trust-region policy optimization and enforces it with a clipped surrogate objective, which keeps policy updates stable and efficient.
Explanation of PPO
PPO addresses the limitations of previous RL methods like vanilla policy gradient and TRPO (Trust Region Policy Optimization) by balancing exploration and exploitation through controlled policy updates. PPO specifically aims to stabilize training by preventing overly large policy updates, which could lead to catastrophic forgetting or divergence.
Actor-Critic and the Role of Advantage Estimation
PPO belongs to the family of actor-critic algorithms, where two models work together:
The actor updates the policy π_θ(a|s), selecting actions based on states.
The critic evaluates the actor’s decisions by estimating the value function V^π(s).
This architecture was first formalized by Konda and Tsitsiklis [1] in their seminal work Actor-Critic Algorithms, where they demonstrated convergence properties and laid the mathematical foundation for combining policy gradient methods with value function estimation.
The advantage function is a critical concept in this setting, defined as:
A(s, a) = Q(s, a) − V(s)
where Q(s, a) is the expected return of taking action a in state s and then following the policy, and V(s) is the expected return of following the policy from s.
This is a minimal and clean example of how to implement an Actor-Critic architecture in PyTorch:
import torch
import torch.nn as nn
import torch.optim as optim

class ActorCritic(nn.Module):
    def __init__(self, state_dim, action_dim):
        super().__init__()
        self.shared = nn.Sequential(nn.Linear(state_dim, 128), nn.ReLU())
        self.actor = nn.Linear(128, action_dim)
        self.critic = nn.Linear(128, 1)

    def forward(self, x):
        x = self.shared(x)
        return self.actor(x), self.critic(x)

# Example usage
state_dim = 4
action_dim = 2
model = ActorCritic(state_dim, action_dim)
optimizer = optim.Adam(model.parameters(), lr=3e-4)

state = torch.rand((1, state_dim))
logits, value = model(state)
dist = torch.distributions.Categorical(logits=logits)
action = dist.sample()
log_prob = dist.log_prob(action)

# Mock advantage and return
advantage = torch.tensor([1.0])
return_ = torch.tensor([[1.5]])

# Actor-critic loss
actor_loss = (-log_prob * advantage).mean()
critic_loss = (value - return_).pow(2).mean()
loss = actor_loss + critic_loss

# Backpropagation
optimizer.zero_grad()
loss.backward()
optimizer.step()
PPO Objective and Mathematics
The core idea behind PPO is the optimization of the policy network through a clipped objective function:
L^CLIP(θ) = E_t [ min( r_t(θ) A_t, clip( r_t(θ), 1 − ε, 1 + ε ) A_t ) ]
Here:
θ represents the parameters of the policy.
ε is a small hyperparameter (typically 0.2) that controls how much the policy can change at each step.
A is the advantage function, indicating the relative improvement of taking a specific action compared to the average action.
The probability ratio is defined as:
r_t(θ) = π_θ(a_t | s_t) / π_θ_old(a_t | s_t)
This ratio quantifies how much the probability of selecting an action has changed from the old policy to the new one.
PyTorch Code Example: PPO Core
import torch
import torch.nn as nn
import torch.optim as optim

# Assume we already have: states, actions, old_log_probs, returns, values
# and a model with .actor and .critic modules
clip_epsilon = 0.2
gamma = 0.99  # discount factor, used when computing the returns above

# Compute and normalize advantages
advantages = returns - values
advantages = (advantages - advantages.mean()) / (advantages.std() + 1e-8)

# New log probabilities under the current policy
log_probs = model.actor.get_log_probs(states, actions)
ratios = torch.exp(log_probs - old_log_probs.detach())

# Clipped surrogate objective
surr1 = ratios * advantages
surr2 = torch.clamp(ratios, 1.0 - clip_epsilon, 1.0 + clip_epsilon) * advantages
policy_loss = -torch.min(surr1, surr2).mean()

# Critic loss (value function)
value_estimates = model.critic(states)
critic_loss = nn.MSELoss()(value_estimates, returns)

# Total loss
total_loss = policy_loss + 0.5 * critic_loss

# Backpropagation
optimizer.zero_grad()
total_loss.backward()
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=0.5)
optimizer.step()
PPO’s Advantages and Popularity
PPO’s popularity stems from its:
Simplicity: Easier to implement and tune than more sophisticated methods such as TRPO.
Efficiency: Faster convergence due to the clipped surrogate objective, reducing the need for careful hyperparameter tuning.
Versatility: Robust performance across a wide range of tasks including robotics, games, and operational management problems.
Flaws and Limitations of PPO
Despite PPO’s successes, it faces several limitations:
High Variance and Instability: PPO’s reliance on sample-based estimates can cause significant variance in policy updates, especially in environments with sparse rewards or long horizons.
Exploration Inefficiency: PPO typically relies on Gaussian noise for exploration, which can lead to insufficient exploration, especially in complex, high-dimensional state spaces.
Sensitivity to Initialization: PPO’s effectiveness can vary greatly depending on initial conditions, causing inconsistent results across training runs.
Enter PDPPO: A Novel Improvement
To overcome these limitations, Post-Decision Proximal Policy Optimization (PDPPO) introduces a novel approach using dual critic networks and post-decision states.
Understanding Post-Decision States
Post-decision states, introduced by Warren B. Powell [2], provide a powerful abstraction in reinforcement learning. A post-decision state represents the environment immediately after an agent has taken an action but before the environment’s stochastic response occurs.
This allows the learning algorithm to decompose the transition dynamics into two parts:
Deterministic step (decision):
sˣ = f(s, a)
This represents the state immediately after the deterministic effects of the action take place.
Stochastic step (nature’s response):
s’ = g(sˣ, η)
Once the deterministic effects are observed, the stochastic variables that change the state come into play.
Where:
f represents the deterministic function mapping the current state and action to the post-decision state sˣ.
η is a random variable capturing the environment’s stochasticity.
g defines how this stochastic component affects the next state.
s’ is the resulting next state.
Example: Frozen Lake
Imagine the Frozen Lake environment. The agent chooses to move right from a given tile. The action is deterministic — the intention to move right is clear. This gives us the post-decision state sˣ: “attempted to move right.”
However, because the ice is slippery, the agent may not land on the intended tile. It might slide right, down, or stay in place, with a certain probability for each. That final position — determined after the slippage — is the true next state s’.
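To make this concrete, here is a minimal sketch of how a slippery Frozen Lake transition could be decomposed into the deterministic step f and the stochastic step g. The 4×4 grid, the slip probabilities, and the helper names (deterministic_step, stochastic_step) are illustrative assumptions, not part of the original environment.

import random

# Hypothetical 4x4 grid: states are (row, col) tuples, actions move one tile.
MOVES = {"left": (0, -1), "down": (1, 0), "right": (0, 1), "up": (-1, 0)}

def deterministic_step(state, action):
    # f(s, a): the post-decision state s^x -- the intended move, before any slippage.
    row, col = state
    d_row, d_col = MOVES[action]
    return (min(max(row + d_row, 0), 3), min(max(col + d_col, 0), 3))

def stochastic_step(state, post_state):
    # g(s^x, eta): nature's response -- eta decides whether the agent slips.
    eta = random.random()  # assumed slip probabilities, for illustration only
    if eta < 0.8:
        return post_state                      # lands on the intended tile
    elif eta < 0.9:
        return state                           # slips and stays in place
    else:
        row, col = post_state
        return (min(row + 1, 3), col)          # slips one tile further down

state = (0, 0)
post_state = deterministic_step(state, "right")   # s^x: "attempted to move right"
next_state = stochastic_step(state, post_state)   # s': where the agent actually ends up
print(post_state, next_state)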
This decomposition allows value functions to be better estimated:
Pre-decision value function:
V(s) = E_{a∼π} [ r(s, a) + γ Vˣ(f(s, a)) ]
Post-decision value function:
Vˣ(sˣ) = E_η [ V(g(sˣ, η)) ] = E_η [ V(s’) ]
This formulation helps decouple the decision from stochastic effects, reducing variance in value estimation and improving sample efficiency.
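As a worked illustration, assume for simplicity that the three slip outcomes in the Frozen Lake example are equally likely. The post-decision value of sˣ = “attempted to move right” is then just an average over where the agent can actually land:

Vˣ(sˣ) = (1/3)·V(intended tile) + (1/3)·V(tile below) + (1/3)·V(current tile)

The post-decision critic only has to learn this expectation over the slippage; the choice of action has already been factored out, which is where the variance reduction comes from.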
Post-Decision Advantage Calculation
Given both critics, PDPPO computes an advantage from each value estimate:
A_pre = R − V(s),    A_post = Rˣ − Vˣ(sˣ)
where R and Rˣ are the pre- and post-decision returns. It then selects the most informative advantage at each step:
A = max( A_pre, A_post )
This “maximum advantage” strategy allows the actor to favor the most promising value estimate during learning.
Updating the Critics and Policy
Critic loss functions:
L_V = E [ ( V(s) − R )² ],    L_Vˣ = E [ ( Vˣ(sˣ) − Rˣ )² ]
Combined actor-critic loss:
L_total = L_policy + c · ( L_V + L_Vˣ )
where L_policy is the clipped PPO policy loss and c is a weighting coefficient (0.5 in the code below).
This architecture, with separate value estimators for deterministic and stochastic effects, enables more stable learning in environments with complex uncertainty.
Dual Critic Networks
PDPPO employs two critics:
State Critic: Estimates the value function based on pre-decision states.
Post-Decision Critic: Estimates the value function based on post-decision states.
The dual-critic approach improves value estimation accuracy by capturing both deterministic and stochastic dynamics separately.
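The PDPPO snippet below assumes a model that exposes .actor (with a get_log_probs method), .critic, and .post_decision_critic. The paper does not prescribe an exact module layout, so the following is only a minimal sketch of such a dual-critic model; the hidden sizes and the discrete-action actor are assumptions.

import torch
import torch.nn as nn

class Actor(nn.Module):
    def __init__(self, state_dim, action_dim, hidden=128):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(state_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, action_dim))

    def get_log_probs(self, states, actions):
        # Log-probabilities of the taken actions under the current (discrete) policy.
        dist = torch.distributions.Categorical(logits=self.net(states))
        return dist.log_prob(actions)

class DualCriticModel(nn.Module):
    def __init__(self, state_dim, post_state_dim, action_dim, hidden=128):
        super().__init__()
        self.actor = Actor(state_dim, action_dim, hidden)
        # Critic for pre-decision states s
        self.critic = nn.Sequential(nn.Linear(state_dim, hidden), nn.ReLU(),
                                    nn.Linear(hidden, 1))
        # Critic for post-decision states s^x
        self.post_decision_critic = nn.Sequential(nn.Linear(post_state_dim, hidden), nn.ReLU(),
                                                  nn.Linear(hidden, 1))

model = DualCriticModel(state_dim=8, post_state_dim=8, action_dim=4)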
PyTorch Code Example: PDPPO Core
import torch
import torch.nn as nn
import torch.optim as optim

# Assume we already have: states, post_states, actions, old_log_probs, returns, post_returns,
# and a model with .actor, .critic, and .post_decision_critic modules
clip_epsilon = 0.2

# --- 1. Compute advantages from both critics ---
values = model.critic(states)
post_values = model.post_decision_critic(post_states)
adv_pre = returns - values
adv_post = post_returns - post_values

# Use the maximum advantage (the PDPPO twist); detach so the policy loss
# does not backpropagate into the critics
advantages = torch.max(adv_pre, adv_post).detach()
advantages = (advantages - advantages.mean()) / (advantages.std() + 1e-8)

# --- 2. Policy loss: same PPO-style clipping ---
log_probs = model.actor.get_log_probs(states, actions)
ratios = torch.exp(log_probs - old_log_probs.detach())
surr1 = ratios * advantages
surr2 = torch.clamp(ratios, 1.0 - clip_epsilon, 1.0 + clip_epsilon) * advantages
policy_loss = -torch.min(surr1, surr2).mean()

# --- 3. Dual critic loss ---
critic_loss = nn.MSELoss()(values, returns)
post_critic_loss = nn.MSELoss()(post_values, post_returns)

# Total loss with the dual critics
total_loss = policy_loss + 0.5 * (critic_loss + post_critic_loss)

# --- 4. Backpropagation ---
optimizer.zero_grad()
total_loss.backward()
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=0.5)
optimizer.step()
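The snippet above also assumes that post-decision states, returns, and post-decision returns have already been collected. Standard Gym environments do not expose the deterministic step separately, so the sketch below relies on a hypothetical env.post_decision(state, action) helper and an old-style Gym step API; the exact post-decision return targets are defined in the paper [3] and are left out here.

import torch

def collect_transitions(env, model, horizon=128):
    # Illustrative rollout that records post-decision states alongside the usual data.
    states, post_states, actions, rewards, log_probs = [], [], [], [], []
    state = env.reset()
    for _ in range(horizon):
        s = torch.as_tensor(state, dtype=torch.float32).unsqueeze(0)
        dist = torch.distributions.Categorical(logits=model.actor.net(s))
        action = dist.sample()
        # Hypothetical helper: deterministic effect of the action, before the noise.
        post_state = env.post_decision(state, action.item())
        next_state, reward, done, _ = env.step(action.item())
        states.append(s.squeeze(0))
        post_states.append(torch.as_tensor(post_state, dtype=torch.float32))
        actions.append(action.squeeze(0))
        rewards.append(reward)
        log_probs.append(dist.log_prob(action).squeeze(0).detach())
        state = env.reset() if done else next_state
    # returns and post_returns are then computed from these rewards (and bootstrapped
    # critic values) using the pre- and post-decision targets defined in [3].
    return (torch.stack(states), torch.stack(post_states), torch.stack(actions),
            torch.stack(log_probs), rewards)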
PDPPO vs PPO in Practice
Tests on environments such as Frozen Lake and stochastic lot-sizing, reported in Felizardo et al. [3], highlight PDPPO’s significant performance improvements:
Improved Stability Across Seeds
PDPPO showed lower variance in both cumulative and maximum rewards across different random seeds, particularly in stochastic environments like Frozen Lake. This indicates greater robustness to initialization compared to PPO, which often suffers from unstable learning in such settings.
Faster and Smoother Convergence
The learning curves of PDPPO are notably smoother and consistently trend upward, while PPO’s often stagnate or oscillate. This suggests that PDPPO’s dual-critic structure provides more accurate value estimates, enabling more reliable policy updates.
Better Scaling with Dimensionality
In the Stochastic Lot-Sizing tasks, PDPPO’s performance gap widened as the problem dimensionality increased (e.g., 25 items and 15 machines). This demonstrates that PDPPO scales better in complex settings, benefiting from its decomposition of dynamics into deterministic and stochastic parts.
More Informative Advantage Estimates
By using the maximum of pre- and post-decision advantages, PDPPO effectively captures the most optimistic learning signal at each step, leading to better exploitation of promising strategies without ignoring the stochastic nature of the environment.
Better Sample Efficiency
Empirical results showed that PDPPO achieved higher rewards using fewer training episodes, making it more sample-efficient, an essential trait for real-world applications where data collection is expensive.
Empirical comparison (20–30 Runs)
PDPPO significantly outperforms PPO across three environment configurations of the Stochastic Lot-Sizing Problem; the shaded areas in the learning curves represent 95% confidence intervals. The curves show:
Faster convergence,
Higher peak performance, and
Tighter variance bands for PDPPO.
A few other alternatives
A few other alternatives to address the limitations of PPO include:
Intrinsic Exploration Module (IEM)
Proposed by Zhang et al. [8], this approach enhances exploration by incorporating uncertainty estimation into PPO. It addresses PPO’s weak exploration signal by rewarding novelty, which is especially useful in sparse-reward settings.
Uncertainty-Aware TRPO (UA-TRPO)
Introduced by Queeney et al. [7], UA-TRPO aims to stabilize policy updates in the presence of finite-sample estimation errors by accounting for uncertainty in the policy gradients, offering a more robust learning process than standard PPO.
Dual-Critic Variants
Earlier methods such as SAC [4] and TD3 [5] use dual critics, mainly in continuous action spaces, to reduce overestimation bias. However, they typically do not incorporate post-decision states, nor are they designed for environments with both deterministic and stochastic dynamics.
Post-Decision Architectures in OR
Earlier work in operations research (e.g., Powell [2], Hull [6]) used post-decision states to manage the curse of dimensionality in approximate dynamic programming. PDPPO brings this insight into deep RL by using post-decision value functions directly in the learning process.
Each of these methods has its trade-offs, and PDPPO stands out by directly tackling the challenge of stochastic transitions via decomposition and dual critics — making it particularly effective in noisy, real-world-like settings.
Citation
[1] Konda, V. R., & Tsitsiklis, J. N. (2000). Actor-Critic Algorithms. In S.A. Solla, T.K. Leen, & K.-R. Müller (Eds.), Advances in Neural Information Processing Systems, Vol. 12. MIT Press.
[2] Powell, W. B. (2007). Approximate Dynamic Programming: Solving the Curses of Dimensionality (2nd ed.). John Wiley & Sons.
[3] Felizardo, L. K., Fadda, E., Nascimento, M. C. V., Brandimarte, P., & Del-Moral-Hernandez, E. (2024). A Reinforcement Learning Method for Environments with Stochastic Variables: Post-Decision Proximal Policy Optimization with Dual Critic Networks. arXiv preprint arXiv:2504.05150. https://arxiv.org/pdf/2504.05150
[4] Haarnoja, T., Zhou, A., Abbeel, P., & Levine, S. (2018). Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor. In Proceedings of the 35th International Conference on Machine Learning (ICML).
[5] Fujimoto, S., van Hoof, H., & Meger, D. (2018). Addressing Function Approximation Error in Actor-Critic Methods. In Proceedings of the 35th International Conference on Machine Learning (ICML).
[6] Hull, I. (2015). Approximate Dynamic Programming with Post-Decision States as a Solution Method for Dynamic Economic Models. Journal of Economic Dynamics and Control, 55, 57–70.
[7] Queeney, J., Paschalidis, I. C., & Cassandras, C. G. (2021). Uncertainty-Aware Policy Optimization: A Robust, Adaptive Trust Region Approach. In Proceedings of the AAAI Conference on Artificial Intelligence, 35(9), 9377–9385.
[8] Zhang, J., Zhang, Z., Han, S., & Lü, S. (2022). Proximal Policy Optimization via Enhanced Exploration Efficiency. Information Sciences, 609, 750–765.
Published via Towards AI