The Geometry of LLM-as-Judge: Why Inter-LLM Consensus Is Not Human Alignment
Sourabrata Mukherjee, Hamna Hamna, Kalika Bali, Sunayana Sitaram
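AI summary: Using geometric measurements, the study finds that LLM judges agree strongly with one another but align poorly with humans. Their score subspace is nearly orthogonal to the human subspace, so the consensus stems from subspace collapse rather than human alignment.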
LLMs-as-judges are now standard, yet judges agree strongly with one another while agreeing only weakly with humans. We test whether this reflects shared signal or shared bias by measuring four geometric quantities on the standard LLM-as-judge stack across four community-built Indic datasets, eight Indic languages, and 41 LLM judges: score spread, effective rank, principal angle to the human subspace, and stacked correlations among judges and humans, all with bootstrap confidence intervals. On subjective rubrics, judges use less than half the human score range ($\sigma_J / \sigma_H \approx 0.3$--$0.5$). Their evaluation axis is nearly orthogonal to the human one and noticeably further from humans than humans are from each other ($87^\circ$--$89^\circ$ versus $78^\circ$--$81^\circ$). Inter-LLM agreement exceeds LLM--human agreement ($r_{LL} \approx 0.35$ versus $r_{LH} \approx 0.27$--$0.32$). On a rubric with a verifiable factual answer, the same diagnostics fall back into the human range (axis $58.5^\circ$; $r_{LH} = 0.519$). Fine-tuning and preference optimization recover spread ($0.32 \rightarrow 1.08$) but barely move the axis (still $87^\circ$--$88^\circ$). Only post-hoc calibration on a small human-anchored set improves all four community-health rubrics together, placing a calibrated 24B Indic judge ($r = 0.184$) ahead of GPT-5.5 ($r = 0.123$), yet still short of human reliability (human-human $r = 0.474$ on the verifiable rubric). We argue that inter-LLM agreement should be considered evidence of human alignment only when a direct geometric check on the judge's score subspace passes; otherwise, the consensus reflects agreement within a collapsed subspace.
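The four quantities named in the abstract can be illustrated with a minimal NumPy sketch. The abstract does not give the paper's exact definitions, so everything below is an assumption made for illustration: scores are arranged as an items-by-raters matrix, effective rank is taken as the exponential of the entropy of the normalized singular values, the "evaluation axis" is the leading left singular vector of the column-centered matrix, and the stacked correlation is the mean Pearson $r$ over all cross-group rater pairs. All function names are hypothetical, and bootstrap confidence intervals are omitted.

```python
import numpy as np


def spread_ratio(judge_scores, human_scores):
    """sigma_J / sigma_H: ratio of overall score standard deviations."""
    return float(np.std(judge_scores) / np.std(human_scores))


def effective_rank(scores):
    """Entropy-based effective rank of an (items x raters) matrix.

    Assumed definition: exp of the Shannon entropy of the singular
    values (normalized to sum to 1) of the column-centered matrix.
    """
    centered = scores - scores.mean(axis=0, keepdims=True)
    s = np.linalg.svd(centered, compute_uv=False)
    p = s / s.sum()
    p = p[p > 0]  # drop exact zeros so log() is defined
    return float(np.exp(-(p * np.log(p)).sum()))


def top_axis(scores):
    """Leading left singular vector (item space) of the centered matrix."""
    centered = scores - scores.mean(axis=0, keepdims=True)
    u, _, _ = np.linalg.svd(centered, full_matrices=False)
    return u[:, 0]


def principal_angle_deg(judge_scores, human_scores):
    """Angle in degrees between the judge and human leading axes."""
    cos = abs(top_axis(judge_scores) @ top_axis(human_scores))
    return float(np.degrees(np.arccos(np.clip(cos, 0.0, 1.0))))


def mean_cross_correlation(a, b):
    """Mean Pearson r over every (column of a, column of b) pair."""
    k = a.shape[1]
    corr = np.corrcoef(a.T, b.T)  # rows are raters: a's first, then b's
    return float(corr[:k, k:].mean())


if __name__ == "__main__":
    # Synthetic data only: humans share one latent signal, judges share a
    # different one and use a compressed score range.
    rng = np.random.default_rng(0)
    n_items = 200
    human_signal = rng.normal(size=n_items)
    judge_signal = rng.normal(size=n_items)
    humans = human_signal[:, None] + 0.5 * rng.normal(size=(n_items, 3))
    judges = 0.3 * (judge_signal[:, None] + 0.2 * rng.normal(size=(n_items, 5)))

    print("spread ratio:", spread_ratio(judges, humans))
    print("judge effective rank:", effective_rank(judges))
    print("principal angle (deg):", principal_angle_deg(judges, humans))
    print("judge-human r:", mean_cross_correlation(judges, humans))
```

In this toy setup the judges agree with each other because they share one latent signal, yet that signal is unrelated to the human one. This is the pattern the abstract describes as agreement within a collapsed subspace: high inter-judge correlation alongside a near-orthogonal axis and a compressed score spread.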