Yun Li | Academic Homepage

Abstract

PLDPO extends binary DPO to multi-dimensional preference tuning via the Plackett–Luce model. The multi-rejected dataset contains 148,080 sequences (592,320 prompt–response pairs) with risk-categorized alternatives. PLDPO outperforms DPO, IPO, and BCO, yielding an 11.0% overall improvement, an 83.6% reduction in infrastructure collisions, and perfect traffic-signal compliance on CARLA Town 04.
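
To illustrate how a Plackett–Luce objective generalizes the binary DPO loss, the following is a minimal sketch, not the paper's implementation. It assumes each response is scored by the usual DPO implicit reward, beta times the difference between policy and reference log-probabilities, and that the responses are fully ranked from most to least preferred. The function names, the value of beta, and the full-ranking form are assumptions for illustration; the abstract does not state which Plackett–Luce variant is used. Four responses per prompt is likewise only suggested by the dataset counts (592,320 / 148,080 = 4).

```python
import math


def logsumexp(xs):
    """Numerically stable log(sum(exp(x))) for a non-empty list."""
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))


def implicit_rewards(policy_logps, ref_logps, beta=0.1):
    """DPO-style implicit reward per response (beta is an assumed value)."""
    return [beta * (p - r) for p, r in zip(policy_logps, ref_logps)]


def plackett_luce_nll(scores):
    """Negative log-likelihood of a ranking under the Plackett-Luce model.

    `scores` must be ordered from most preferred to least preferred.
    At each position k, the item at k is "chosen" from the items that
    remain (k, k+1, ..., K-1) with softmax probability.
    """
    return sum(
        logsumexp(scores[k:]) - scores[k] for k in range(len(scores))
    )


# Hypothetical log-probabilities: one chosen response and three rejected ones.
policy = [-12.0, -14.5, -15.0, -18.0]
reference = [-13.0, -14.0, -14.0, -16.0]
loss = plackett_luce_nll(implicit_rewards(policy, reference))
```

With exactly two responses, this loss reduces to the binary DPO loss, -log sigmoid(s_chosen - s_rejected). With four equally scored responses it equals ln 24, the log of the number of possible orderings.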

Accepted and presented in Hangzhou, China.