Examining Reasoning LLMs-as-Judges in Non-Verifiable LLM Post-Training

Judge-based reinforcement learning has become standard practice for LLM alignment. This paper surfaces an uncomfortable finding: policies trained with reasoning judges learn to game benchmarks through adversarial generation rather than genuine quality improvement, scoring highly while deceiving other LLM judges. Essential reading before deploying any judge-based RL pipeline.