A Two-Phase Stability Study of LLM Judges and Bar Council Examiners on Thai Bar-Exam Free-Form Essays
Pawitsapak Akarajaradwong, Wuttikrai Lertprasertphakorn, Chompakorn Chaksangchaichot, Sarana Nutanong
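AI summary: Using free-form essay grading on the Thai bar examination, the study examines an asymmetry in scoring agreement between LLM judges and human examiners, and finds that LLM judges lean toward the majority human reading and fail to reproduce the minority human reading.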
Free-form legal essay evaluation in NLP treats expert inter-rater stability as a single ceiling number, and treats LLM-judge agreement with that ceiling as evidence of judge stability. We test both assumptions on the Thai bar examination through an identical-inputs protocol: three Bar Council-trained examiners (A, B, C) and a 26-LLM judge panel score the same 15 cross-graded answers from the same four inputs (question, official Bar Council grading regulation, gold answer, candidate answer). The headline finding is asymmetric. On 10 of 15 cells where the rubric prescribes both axes, all 29 raters converge in a tight band: panel agreement is universal. On the remaining 5 cells where the rubric does not prescribe how to grade a correct final answer that omits a decisive statutory citation, the human panel splits between two coherent readings (B/C majority at the upper rubric band, score $6$--$8$; A minority at the lower band, score $1$--$2$). The LLM judge population does not split symmetrically: 22 of 26 LLMs score in or near B/C's contested band, 3 sit in the regulation-silent middle gap, and only 1 (GPT-5.4 Nano) approaches A's band without consistently scoring within it. \emph{Zero LLMs in our 26-judge panel reproduce the minority human reading on the contested cells.} The B/C-direction cluster spans every model size, vendor, and price tier we tested. An instrumented three-LLM anchor sub-panel (Claude 4.6 Opus, Gemini 3.1 Pro, GPT-5.4 Pro) carries determinism probes, input ablations, and bootstrap CIs, and reaches anchor panel $\alpha = 0.77$ on the 15 cells against human-panel $\alpha = 0.36$. The high LLM-panel $\alpha$ reflects systematic convergence on the majority reading rather than balanced reproduction of both readings; a benchmark that selects its LLM judge by maximising agreement with a human reference panel will inherit this asymmetry by construction.
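To make the panel-level agreement statistic concrete, the following is a minimal sketch of how an $\alpha$ coefficient of this kind could be computed. The abstract does not say which $\alpha$ is used; the sketch assumes Krippendorff's alpha with the interval (squared-difference) distance and no missing scores. The function name `krippendorff_alpha_interval` and the two toy score matrices are hypothetical illustrations, not the paper's code or data.

```python
from itertools import permutations


def krippendorff_alpha_interval(units):
    """Krippendorff's alpha with the interval (squared-difference) metric.

    `units` is a list of graded cells; each cell is a list holding one
    numeric score per rater. Assumption: no missing scores, so every
    cell with at least two scores is fully pairable.
    """
    units = [u for u in units if len(u) >= 2]
    values = [v for u in units for v in u]
    n = len(values)

    # Observed disagreement: squared differences between raters within
    # the same cell, each cell weighted by 1 / (m_u - 1).
    d_o = sum(
        sum((a - b) ** 2 for a, b in permutations(u, 2)) / (len(u) - 1)
        for u in units
    ) / n

    # Expected disagreement: squared differences between all pairs of
    # scores pooled across cells.
    d_e = sum((a - b) ** 2 for a, b in permutations(values, 2)) / (n * (n - 1))

    if d_e == 0:
        raise ValueError("alpha is undefined when all scores are identical")
    return 1.0 - d_o / d_e


# Hypothetical toy scores (rows = cells, columns = raters A, B, C).
# These are NOT the paper's data; they only mimic the two situations.
AGREEING_CELLS = [[7, 7, 8], [2, 2, 1], [6, 7, 7]]
SPLIT_CELLS = [[1, 7, 8], [2, 6, 7], [1, 8, 6]]  # A low, B/C high

if __name__ == "__main__":
    print("agreeing panel:", krippendorff_alpha_interval(AGREEING_CELLS))
    print("split panel:   ", krippendorff_alpha_interval(SPLIT_CELLS))
```

On these toy inputs the split panel yields a much lower coefficient than the agreeing panel. This is the same qualitative pattern as a panel whose members hold two different readings of the rubric on some cells.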