Learning Diffusion Planners from World Feedback: A No-Go Result on Bit-Exact Safety Rewards and an ODD-Adaptive Shared/Expert Decomposition
Workshop on Reinforcement Learning from World Feedback (RLxF), ICML 2026
Abstract
We present two results on reward fine-tuning of diffusion trajectory planners. The first is a no-go result for bit-exact safety rewards: under standard regularity assumptions on the safety oracle, no continuous reward whose zero set coincides with the safe-trajectory manifold admits a stable RL fine-tuning fixed point. The second is a shared/expert LoRA decomposition that adapts to the operational design domain (ODD): it sidesteps the no-go result by routing per-domain experts on top of a shared base, and comes with a PCDR diagnostic for catastrophic routing under distribution shift. The results are validated on driving benchmarks across four ODDs.
Accepted at the ICML 2026 RLxF Workshop, Seoul, South Korea.
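As a rough illustration of the shared/expert decomposition described in the abstract, the sketch below applies a frozen base weight plus one shared low-rank update and one ODD-specific low-rank update. It is not the paper's implementation: the class name `SharedExpertLoRA`, the hard routing by an ODD label, the ODD names, and the toy weights are all assumptions made for illustration, and the abstract does not specify how routing is actually performed.

```python
# Minimal sketch (assumed design, not the authors' code): a linear layer with
# a frozen base weight, one shared LoRA adapter, and one LoRA expert per ODD.

def matvec(M, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, x)) for row in M]


def lora_delta(A, B, x, scale=1.0):
    """Low-rank update scale * B @ (A @ x), with A: r x d_in, B: d_out x r."""
    return [scale * v for v in matvec(B, matvec(A, x))]


class SharedExpertLoRA:
    def __init__(self, W, shared, experts):
        self.W = W              # frozen base weight, d_out x d_in
        self.shared = shared    # (A, B) adapter used for every ODD
        self.experts = experts  # hypothetical mapping: ODD label -> (A, B)

    def forward(self, x, odd):
        # Hard routing by ODD label is an assumption of this sketch.
        y = matvec(self.W, x)
        for A, B in (self.shared, self.experts[odd]):
            delta = lora_delta(A, B, x)
            y = [a + b for a, b in zip(y, delta)]
        return y


# Toy rank-1 adapters on a 2x2 identity base; all values are illustrative.
layer = SharedExpertLoRA(
    W=[[1.0, 0.0], [0.0, 1.0]],
    shared=([[1.0, 1.0]], [[1.0], [0.0]]),
    experts={
        "highway": ([[0.0, 1.0]], [[0.0], [1.0]]),
        "urban": ([[1.0, 0.0]], [[0.0], [-1.0]]),
    },
)
highway_out = layer.forward([1.0, 2.0], "highway")
urban_out = layer.forward([1.0, 2.0], "urban")
```

The same input yields different outputs depending on which expert is routed in, while the shared adapter contributes identically in both cases. That separation is the property the decomposition relies on.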