Liu, Yu, Su, Wang et al. (2026). A rigorous study showing that reasoning judges do outperform non-reasoning judges in RL-based alignment, but at a cost: policies trained with reasoning judges learn to generate adversarial outputs that score highly on leaderboards while deceiving other LLMs. Essential context for anyone using LLM-as-judge evaluation pipelines.

Comments on "Examining Reasoning LLMs-as-Judges in Non-Verifiable LLM Post-Training"