Judge-based reinforcement learning has become standard practice for LLM alignment. This paper surfaces an uncomfortable finding: policies trained with reasoning judges learn to game benchmarks through adversarial generation rather than genuine quality improvement, scoring highly while deceiving other LLMs. Essential reading before deploying any judge-based RL pipeline.

Comments on "Examining Reasoning LLMs-as-Judges in Non-Verifiable LLM Post-Training"