Abstract

We present two results on reward fine-tuning of diffusion trajectory planners. First, a no-go result for bit-exact safety rewards: under standard regularity assumptions on the safety oracle, no continuous reward whose zero set matches the safe-trajectory manifold admits a stable RL fine-tuning fixed point. Second, a shared/expert LoRA decomposition that adapts to the operational design domain (ODD) and sidesteps the no-go result by routing per-domain experts on top of a shared base, together with a PCDR diagnostic for catastrophic routing under distribution shift. We validate on driving benchmarks across four ODDs.

Accepted at the ICML 2026 RLxF Workshop, Seoul, South Korea.