Large Model Alignment and Safety

2606.20408 2026-06-19 cs.CR cs.AI New submission 95%

LLM agent safety, multi-turn red-teaming, jailbreak benchmarks, adversarial robustness, safety-critical systems

Hanwool Lee, Dasol Choi, Bokyeong Kim, Seung Geun Kim, Haon Park

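Affiliations: AIM Intelligence; KAERI (Korea Atomic Energy Research Institute)

Topic match (red-teaming): a multi-turn red-teaming benchmark that evaluates the robustness of LLM agents in safety-critical systems.

AI summary: Proposes NRT-Bench, a benchmark that uses multi-turn red-teaming in a simulated nuclear power plant control room to evaluate the adversarial robustness of LLM agents in safety-critical systems. It finds that vulnerabilities barely overlap across models and that the effect of defenses is highly model-dependent.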

English abstract

Large language model (LLM) agents are increasingly proposed as supervisory components for safety-critical systems, yet their robustness under sustained, adaptive adversarial pressure remains poorly characterized. We present NRT-Bench, a benchmark for multi-turn red-teaming of LLM agents acting as operators of a safety-critical system, instantiated in a simulated nuclear power plant control room. A five-role operator team, each backed by a configurable LLM, runs a plant governed by six critical safety functions (CSFs), while adversaries inject messages over four channels in bounded multi-turn sessions with per-turn feedback. Harm is an objective signal rather than LLM-judged text: a run terminates the moment any CSF is lost, attributed to the causing message. Evaluating four frontier operator models under a fixed-attack paired-replay protocol, we find that adaptive multi-turn attacks reliably push the operator team past a safety limit: across the four models, between 8.7% and 12.1% of attack sessions end with the plant losing a critical safety function. Although the four models look almost equally robust by this aggregate rate, their failures barely overlap: of 149 sessions, none defeat all four models while a third defeat at least one, so vulnerabilities are nearly disjoint across models rather than nested. The effect of added defences is strongly model-dependent: the same guardrail stack or safety-advisor agent that lowers attack success for one model can raise it for another. We release the simulation venue, attack dataset, and replay tooling for reproducible safety evaluation of LLM agents.
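The "disjoint rather than nested" finding can be illustrated with a small set computation. The sketch below is not the NRT-Bench tooling: the function name, the model names, and the toy session ids are all hypothetical. As a rough consistency check, if all 149 sessions are replayed against every model, the reported 8.7% and 12.1% would correspond to about 13 and 18 sessions.

```python
from itertools import combinations

def overlap_report(failures):
    """failures maps a model name to the set of session ids that defeated it."""
    defeats_any = set().union(*failures.values())
    defeats_all = set.intersection(*failures.values())
    jaccard = {}
    for a, b in combinations(sorted(failures), 2):
        either = failures[a] | failures[b]
        # 0.0 means the two models fail on disjoint sessions, 1.0 on identical ones.
        jaccard[(a, b)] = len(failures[a] & failures[b]) / len(either) if either else 0.0
    return defeats_any, defeats_all, jaccard

# Hypothetical outcomes for four placeholder models (not data from the paper).
toy_failures = {
    "model_a": {1, 2},
    "model_b": {3, 4},
    "model_c": {2, 5},
    "model_d": {6, 7},
}
defeats_any, defeats_all, jaccard = overlap_report(toy_failures)
```

In this toy data each model fails on two sessions, so the aggregate rates are identical, yet no session defeats all four models and most pairwise Jaccard scores are zero. That is the pattern the abstract describes.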

2606.19887 2026-06-19 cs.CR cs.AI New submission 90%

FinRED: An Expert-Guided Benchmark Generation and Evaluation Framework for Financial LLM Red-Teaming

Chaeyun Kim, Daeyoung Park, Junghwan Kim, Jinyoung Jeong, Eunji Song, Yongtaek Lim, Minwoo Kim

Affiliations: DATUMO INC.; Korea Advanced Institute of Science and Technology (KAIST); Financial Security Institute (FSI)

Topic match (red-teaming): an expert-guided red-teaming framework for financial LLMs.

AI summary: Proposes the FinRED framework, which uses an expert-guided two-level taxonomy to map global financial standards to threats, generates context-rich red-teaming behavioral prompts from real financial documents, and pairs them with an expert-validated rubric that reduces critical false negatives.

English abstract

Existing safety benchmarks target general adversarial scenarios but miss finance-specific risks. Financial LLMs face regulatory compliance violations, fraud facilitation, and systemic trust erosion that require targeted evaluation. We introduce FinRED, an expert-guided red-teaming framework for financial LLM safety evaluation developed with financial experts. FinRED uses a novel two-level taxonomy mapping global standards (e.g., FATF and EU DORA) to threats ranging from regulatory evasion to complex fraud, integrated with a scalable pipeline that converts real financial documents into context-rich red-teaming Behavioral Prompts (seeds) through an expert-defined schema. Rigorous expert validation confirms seed plausibility and realism for meaningful LLM safety evaluation. We also provide an expert-validated, finance-specific rubric that goes beyond disclaimer checks, aligns more closely with human experts than static one-size-fits-all rubrics, and reduces critical false negatives from 28 to 12. Aligned with internationally adopted risk-management and information-security standards (e.g., ISO/IEC 27001), FinRED is deployed in South Korea's Financial Security Institute (FSI) regulatory sandbox for generative AI security evaluation in real financial services. To mitigate dual-use risks, the dataset, generation pipeline, prompt template, and evaluation framework are gated for qualified researchers at https://github.com/selectstar-ai/FinRED-paper and https://huggingface.co/datasets/datumo/FinRED.

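2606.20508 2026-06-19 cs.AI cs.LG New submission 90%

What Do Safety-Aligned LLMs Learn From Mixed Compliance Demonstrations?

Sihui Dai, Mann Patel

Topic match (jailbreak attacks): studies how mixed compliance demonstrations affect harmful compliance in LLMs.

AI summary: Mixes benign and harmful compliance demonstrations to study how demonstration composition drives harmful compliance, and finds that what models extract depends on demonstration content, ordering, and training methodology.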

English abstract

Prior work has shown that in-context demonstrations can jailbreak language models, but it remains unclear how models interpret different types of compliance demonstrations. We study this by mixing benign compliance demonstrations (non-harmful request, helpful response) with harmful compliance demonstrations (harmful request, helpful response) and testing three hypotheses about how demonstration composition drives harmful compliance. Across four models, we find that benign and harmful demonstrations are not interchangeable: benign demonstrations can either reduce or increase harmful compliance depending on the model. We further show that preference optimization is the critical training stage that prevents benign demonstrations from increasing harmful compliance, that demonstration ordering exhibits strong recency bias, and that models differ in how refusal interacts with in-context learning: some adopt demonstrated formatting even when refusing, while others override all in-context signals upon refusal. Taken together, this work moves beyond showing that demonstration-based jailbreaking works to characterizing how it works: what models extract from compliance demonstrations depends on demonstration content, ordering, and training methodology.

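2606.20470 2026-06-19 cs.CR cs.AI New submission 90%

Analyzing Defensive Misdirection Against Model-Guided Automated Attacks on Agentic AI Systems

Reza Soosahabi, Vivek Namsani

Affiliations: Application & Threat Intelligence Research Center

Topic match (jailbreak attacks): analyzes defensive misdirection against automated jailbreak attacks.

AI summary: Analyzes the attack-defense setting for agentic AI systems with a probabilistic model and studies a detect-and-misdirect strategy (realized as CMPE) as an alternative to conventional detect-and-block. Misleading responses lower attacker success, and on benchmarks the estimated ASR upper bound drops by up to two orders of magnitude.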

English abstract

Agentic AI systems increasingly rely on language-model components to interpret instructions, process external data, invoke tools, and coordinate with other agents. These capabilities make prompt-injection and jailbreak attacks more consequential, especially as attackers adopt model-guided automation to scale probing, prompt refinement, and response evaluation. This work analyzes the resulting attack-defense setting through a probabilistic model of a target system, its defense mechanism, and the attacker's automated judge. Our analysis shows that conventional detect-and-block defenses can allow attacker success rate (ASR) to approach one as the query budget grows, since predictable refusals provide useful feedback to automated search. We then examine detect-and-misdirect, where detected malicious interactions receive controlled, non-operational responses designed to induce false-positive errors in the attacker's judge. This strategy reduces the positive predictive value of attacker-selected candidates and yields a bounded asymptotic ASR. We evaluate a proof-of-concept realization of this strategy through Contextual Misdirection via Progressive Engagement (CMPE), a lightweight conversational misdirection method designed to replace predictable refusal text with safe but strategically misleading responses in automated jailbreak settings. On jailbreak benchmarks, CMPE reduces estimated ASR upper bounds by up to two orders of magnitude and nearly eliminates verified attack success in end-to-end PAIR and GPTFuzz attack runs.
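The contrast between an ASR that approaches one and an ASR that stays bounded can be reproduced with a toy calculation. The model below is an illustrative assumption, not the paper's formulation: each query independently succeeds with probability `p_success`, the attacker's judge accepts true successes with rate `judge_tpr` and non-successes with rate `judge_fpr`, and the attacker stops at the first accepted candidate. All function and parameter names are hypothetical.

```python
def asr_detect_and_block(p_success, budget):
    """Refusals are assumed easy to recognise, so the attacker's judge never errs."""
    return 1.0 - (1.0 - p_success) ** budget

def asr_detect_and_misdirect(p_success, judge_tpr, judge_fpr, budget):
    # Probability that a single query is accepted by the attacker's judge.
    p_accept = p_success * judge_tpr + (1.0 - p_success) * judge_fpr
    if p_accept == 0.0:
        return 0.0
    # Positive predictive value: chance that an accepted candidate truly succeeded.
    ppv = p_success * judge_tpr / p_accept
    # The attacker stops at the first accepted candidate within the budget.
    return ppv * (1.0 - (1.0 - p_accept) ** budget)

blocked = asr_detect_and_block(0.01, 1000)
misdirected = asr_detect_and_misdirect(0.01, 1.0, 0.5, 1000)
```

Under these assumptions the blocked case tends to 1 as the budget grows, while the misdirected case tends to the positive predictive value, which falls as misdirection raises the judge's false-positive rate.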

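2606.19535 2026-06-19 cs.CR cs.LG New submission 90%

FloatDoor: Platform-Triggered Backdoors in LLMs

Nils Loose, Jonas Sander, Felix Mächtle, Thomas Eisenbarth

Affiliations: University of Luebeck

Topic match (jailbreak attacks): proposes a platform-triggered backdoor attack.

AI summary: Proposes FloatDoor, described as the first input-independent, platform-triggered backdoor attack. It exploits platform differences in floating-point arithmetic and uses two lightweight LoRA adapters to trigger malicious behavior on a target platform while preserving normal model utility.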

English abstract

Large language models (LLMs) are increasingly deployed in sensitive settings such as software engineering, where their outputs directly shape downstream artifacts. Recent work has shown that an identical model can produce measurably different outputs depending on the deployment platform, a consequence of non-associative floating-point arithmetic and divergent kernel implementations. We study the security implications of this platform-dependent variability and uncover a novel attack surface on LLM deployments. We introduce FloatDoor, the first input-independent, platform-triggered backdoor attack against generative LLMs. The compromised model exhibits adversary-chosen behavior when served on a target platform and is otherwise benign. FloatDoor is realized through two lightweight LoRA adapters, one that amplifies inter-platform numerical divergence and one that binds the resulting platform signature to a malicious downstream task, while leaving aggregate model utility largely intact. FloatDoor exploits a pronounced time-of-check, time-of-use gap between model auditing and serving. We demonstrate FloatDoor on Qwen3-4B across a broad range of deployment targets, including NVIDIA GPUs, Google TPUs, AWS Graviton, and Alibaba Yitian-710. As a final case study, we show that FloatDoor reliably induces exploitable code vulnerabilities on a chosen target platform. Our results establish a new class of attacks on LLM deployments and underscore the pressing need for trusted model supply chains in sensitive, LLM-powered applications.
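The non-associativity of floating-point addition that the abstract refers to is easy to observe in isolation. The snippet below is a minimal sketch of that numerical effect only, not of the attack; the helper name `sequential_sum` is hypothetical, and an explicit loop is used so the result does not depend on how a library sums floats.

```python
def sequential_sum(values):
    """Add values strictly left to right, as a fixed-order kernel would."""
    total = 0.0
    for value in values:
        total += value
    return total

# The same three numbers, grouped differently.
left = (0.1 + 0.2) + 0.3
right = 0.1 + (0.2 + 0.3)

# The same three numbers, summed in a different order.
forward = sequential_sum([1e16, 1.0, -1e16])
reordered = sequential_sum([1e16, -1e16, 1.0])
```

Here `left` and `right` differ in the last bit, and `forward` loses the 1.0 entirely while `reordered` keeps it. Kernels that reduce in different orders on different hardware can diverge in the same way.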

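2606.20225 2026-06-19 cs.CL New submission 90%

Actionable Activation Directions for Detecting and Mitigating Emergent Misalignment Across Language Model Families

Abdul Rafay Syed

Affiliations: Universität des Saarlandes (Saarland University)

Topic match (safety evaluation): studies fine-tuning-induced misalignment and detects and mitigates it via activation directions.

AI summary: A difference-in-means direction achieves 99.6% separation of aligned and misaligned activations at the final layer, and causal steering reduces code spillover by 21-51 points. Cross-architecture transfer suppresses the behavior but lacks specificity, revealing a two-tier specificity structure.

Comments: 12 pages, 2 figures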

English abstract

Fine-tuning language models on insecure code induces emergent misalignment with poorly understood internal structure. We investigate whether this misalignment corresponds to a causally actionable activation-space direction shared across architectures. Across four instruction-tuned model families (Qwen2.5-1.5B, Gemma-2-2B, Llama-3.2-1B, Ministral-3-3B) finetuned identically, a difference-in-means direction achieves 99.6% separation of aligned and misaligned activations at each model's final layer. Causal steering by subtracting this direction reduces code spillover by 21-51 points, while a secure-code control confirms content specificity. Cross-architecture transfer via ridge regression maps yields large behavioral suppression (up to 46 points) but fails specificity controls as random and orthogonal directions perform comparably. We identify a two-tier specificity structure: within-model directions are causally specific and actionable; cross-model directions are causally real but non-specific. An asymmetric transfer topology emerges, with Gemma and Qwen acting as geometric donors and Llama as a receiver. These findings define the limits of linear cross-architecture correction and recommend within-model probing for auditing.
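A difference-in-means direction and a subtraction-style intervention can be sketched in a few lines. This is a minimal illustration on toy 2-D vectors, assuming the intervention removes the component of an activation along the unit direction; the paper may scale or apply the subtraction differently, and all function names here are hypothetical.

```python
def mean_vector(rows):
    """Column-wise mean of a list of equal-length vectors."""
    return [sum(column) / len(rows) for column in zip(*rows)]

def difference_in_means(misaligned, aligned):
    """Unit vector pointing from the aligned mean to the misaligned mean."""
    diff = [m - a for m, a in zip(mean_vector(misaligned), mean_vector(aligned))]
    norm = sum(d * d for d in diff) ** 0.5
    return [d / norm for d in diff]

def project(activation, direction):
    return sum(x * d for x, d in zip(activation, direction))

def steer(activation, direction, alpha=1.0):
    """Subtract alpha times the component of the activation along the direction."""
    coeff = alpha * project(activation, direction)
    return [x - coeff * d for x, d in zip(activation, direction)]

# Toy activations (hypothetical): the classes differ only along the first axis.
aligned = [[0.0, 0.0], [0.0, 2.0]]
misaligned = [[4.0, 0.0], [4.0, 2.0]]
direction = difference_in_means(misaligned, aligned)
steered = steer([3.0, 5.0], direction)
```

Thresholding `project(activation, direction)` gives a simple linear probe for separating the two classes, and `steer` leaves every component orthogonal to the direction unchanged.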

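2606.19890 2026-06-19 cs.CY New submission 90%

Open Weight AI Models Require Proportional Evaluation Approaches

Patricia Paskov, Christopher Rodriguez, Sunishchal Dev, Stephen Casper

Topic match (safety evaluation): proportional evaluation approaches for open-weight models.

AI summary: Addresses the distinct risk factors of open-weight AI models (OWMs) by proposing four proportional evaluation approaches (PE1-PE4), and systematically reviews 37 OWM families released from 2025 through April 2026, finding that only one fulfills all four.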

English abstract

Open-weight AI models (OWMs), or models released with publicly-available weights, are distributing rapidly and approaching the performance levels of leading closed-weight AI models (CWMs). While OWMs offer substantial scientific and economic benefits, their release introduces distinct risk factors for which existing evaluation practices, largely designed for CWM deployment, fail to account. In this paper, we argue that these risk factors demand distinct proportional evaluation (PE) approaches: evaluating without system-level safeguards (PE1), assessing robustness to modifications that undo model-level safeguards (PE2), testing selective capability amplification (PE3), and proxying worst-case misuse (PE4). We systematically review current evaluation practices of OWMs released in 2025 through April 2026, finding that only one of the 37 families of models reviewed fulfills PE1-4 and most do not fulfill any. This paper targets policymakers, funders, and researchers involved in AI evaluation. As OWMs grow increasingly capable, their evaluation warrants close attention from developers, funders, and governance bodies alike.

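2606.19755 2026-06-19 cs.CR cs.AI New submission 90%

SafeSpec: Fast and Safe LLM via Dynamic Reflective Sampling

Haotian Xu, Zeyang Zhang, Linbao Li, Huadi Zheng, Yu Li, Cheng Zhuo

Affiliations: Zhejiang University, Hangzhou, China; Huawei; Harbin Institute of Technology, Shenzhen, China

Topic match (safety evaluation): proposes a safety-aware speculative decoding framework that defends against jailbreak attacks.

AI summary: Proposes SafeSpec, which integrates a lightweight safety head into the verification step of speculative decoding and recovers safe generation through risk estimation and reflective sampling, lowering attack success rates while preserving the speedup.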

English abstract

Speculative inference accelerates large language model (LLM) decoding but provides no inherent safety guarantees. Existing safety defenses are largely incompatible with speculative inference: they either introduce additional computation or disrupt the draft-verify mechanism, negating acceleration benefits. This reveals a fundamental incompatibility between current safety methods and speculative decoding. We propose SafeSpec, a safety-aware speculative inference framework that integrates risk estimation directly into the verification process. SafeSpec attaches a lightweight latent safety head to the target model to jointly evaluate semantic validity and safety in a single forward pass. When unsafe generations are detected, SafeSpec applies rollback and safety-guided reflective multi-sampling to recover safe continuations rather than terminating generation. We model jailbreak attacks as distributional shifts over generative trajectories, where adversarial prompts increase the probability of harmful continuations without eliminating safe ones. Under this model, SafeSpec performs risk-aware trajectory recovery within the speculative decoding process. Across multiple models and adversarial benchmarks, SafeSpec achieves a substantially improved safety-efficiency trade-off. On Qwen3-32B, SafeSpec reduces attack success rates by 15% while preserving a 2.06x inference speedup on benign workloads, demonstrating that speculative acceleration and inference-time safety can be jointly optimized.
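The rollback-and-resample idea can be sketched as a control loop. The version below is a loose toy, not SafeSpec itself: it omits the target model's likelihood-based verification, and `draft_fn`, `risk_fn`, and `resample_fn` are hypothetical stand-ins for the draft model, the latent safety head, and reflective multi-sampling.

```python
def safe_speculative_decode(draft_fn, risk_fn, resample_fn, max_new_tokens,
                            block_size=4, num_candidates=3, threshold=0.5):
    """Accept drafted tokens while they look safe; on a risky token, roll back
    the rest of the draft block and pick the lowest-risk resampled candidate."""
    output = []
    while len(output) < max_new_tokens:
        draft = draft_fn(output, block_size)
        if not draft:
            break
        for token in draft:
            if len(output) >= max_new_tokens:
                break
            if risk_fn(output, token) <= threshold:
                output.append(token)
                continue
            # Unsafe draft token: discard the remaining block and resample.
            candidates = resample_fn(output, num_candidates)
            best = min(candidates, key=lambda t: risk_fn(output, t))
            if risk_fn(output, best) > threshold:
                return output  # no safe continuation found in this toy
            output.append(best)
            break
    return output

# Deterministic stubs (hypothetical) to exercise the loop.
def toy_draft(context, k):
    return ["ok", "bad", "ok", "ok"][:k]

def toy_risk(context, token):
    return 0.9 if token == "bad" else 0.1

def toy_resample(context, n):
    return ["bad", "fine", "bad"][:n]

recovered = safe_speculative_decode(toy_draft, toy_risk, toy_resample, max_new_tokens=4)
```

The design point is that a risky token triggers recovery rather than termination, so generation continues along a safe continuation when one exists among the resampled candidates.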

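2606.19544 2026-06-19 cs.CL New submission 90%

Reliability without Validity: A Systematic, Large-Scale Evaluation of LLM-as-a-Judge Models Across Agreement, Consistency, and Bias

Justin D. Norman, Michael U. Rivera, D. Alex Hughes

Affiliations: UC Berkeley School of Information

Topic match (safety evaluation): evaluates the reliability of LLM-as-a-Judge across agreement, consistency, and bias.

AI summary: A large-scale systematic evaluation (21 judge models, 118 runs, about 541,000 judgments) finds pervasive problems in LLM-as-a-Judge agreement, consistency, and bias, including kappa deflation, ranking shifts, and high test-retest reliability coexisting with severe position bias, and proposes a Minimum Viable Validation Protocol.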

English abstract

LLM-as-a-Judge has become the dominant evaluation paradigm for language models, but judge validation in practice relies on exact-match agreement, a metric that does not correct for chance and systematically overstates discriminative ability. We present the largest systematic evaluation of LLM-as-a-Judge to date: 21 judges from nine providers across MT-Bench, JudgeBench, and RewardBench, evaluated under three protocols (agreement, consistency, bias audit) over 118 runs and approximately 541,000 individual judgments. Four findings emerge, consistent across the full cohort, including the April 2026 frontier: kappa deflation between exact match and Cohen's kappa is universal (33-41 pp on MT-Bench), judge rankings shift by up to 14 positions across benchmarks, high test-retest reliability (>0.95) coexists with severe position bias (>0.10) in two production-deployed judges (instantiating a consistency-bias paradox), and verbosity bias is small (<0.011) across our cohort under a single pairwise rubric. We distill these into a Minimum Viable Validation Protocol.
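The gap between exact-match agreement and Cohen's kappa comes from the chance correction, kappa = (p_o - p_e) / (1 - p_e), where p_o is observed agreement and p_e is the agreement expected from the two raters' label frequencies alone. The sketch below uses hypothetical toy labels, not data from the paper, to show how a judge that always answers "A" can reach high exact match with zero kappa.

```python
from collections import Counter

def exact_match(labels_a, labels_b):
    """Fraction of items on which the two raters give the same label."""
    return sum(x == y for x, y in zip(labels_a, labels_b)) / len(labels_a)

def cohen_kappa(labels_a, labels_b):
    n = len(labels_a)
    p_o = exact_match(labels_a, labels_b)
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement from the marginal label frequencies of each rater.
    p_e = sum(counts_a[k] * counts_b[k] for k in counts_a) / (n * n)
    if p_e == 1.0:
        return 1.0  # both raters always use the same single label
    return (p_o - p_e) / (1.0 - p_e)

# Hypothetical labels: the human prefers A in 8 of 10 pairs, the judge always says A.
human = ["A"] * 8 + ["B"] * 2
judge = ["A"] * 10
agreement = exact_match(human, judge)
kappa = cohen_kappa(human, judge)
```

Here exact match is 0.8 while kappa is 0.0, because a constant judge agrees with an 80/20 label split at exactly the chance rate.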

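2606.08892 2026-06-19 cs.LG New submission 90%

Diffuse AI Control on Fuzzy Tasks

Mikhail Terekhov, Caglar Gulcehre, Vivek Hebbar, Joe Benton

Affiliations: Anthropic Fellows Program (via MATS); EPFL; Redwood Research; Anthropic

Topic match (safety evaluation): a blue-team versus red-team adversarial framework for studying diffuse, long-horizon AI threats.

AI summary: To study diffuse threats on fuzzy tasks, frames AI control as an adversarial game between a blue team and a red team. The blue team trains a strong model against a score built from a weak trusted model; the red team uses multi-objective evolutionary prompt optimization to find subversive behaviors that score highly but perform poorly; the blue team then improves robustness through adversarial optimization.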

English abstract

AI models deployed in critical domains, such as AI safety research, may subtly sabotage our efforts due to misalignment. Diffuse AI Control is a subfield of AI safety concerned with mitigating risks from AI sabotage distributed over long deployment horizons (diffuse threats). These risks are particularly pernicious on fuzzy tasks, i.e. tasks which are hard to grade or require intuition. To understand diffuse threats on fuzzy tasks, we introduce a framework that considers AI control as an adversarial game between a blue team and a red team. The blue team uses a weak trusted model to construct a weak score against which they would train a strong, potentially subversive model to remove the subversion propensity if it were present. The red team then tries to find model behaviors that are rated highly by the weak score, and thus might not be trained out, but actually correspond to poor performance. We test our framework on the task of writing experimental proposals for research questions from recent ML papers. We use a language model with access to the original paper as a proxy "ground-truth" scorer. Our red team discovers subversive behaviors using multi-objective evolutionary prompt optimization. We show that Opus 4.6 can write proposals that are worse according to the ground truth proxy than those of GPT-OSS-20B, while the weak scorer rates them as highly as the best proposals from Opus 4.6. We then propose an adversarial optimization algorithm for the blue team that discovers more robust prompts for the weak model. This algorithm produces a blue team prompt that our red team optimization fails to exploit.

2606.19714 2026-06-19 stat.ML cs.AI cs.LG stat.CO stat.ME New submission 85%

AURA: Adaptive Uncertainty-aware Refinement for LLM-as-a-Judge Auditing

Zilong Zhang, Yi-Ting Hung, Weiyi He, Junxi Zhang, Lei Ding, Chi-Kuang Yeh

Affiliations: Department of Mathematics and Statistics; Georgia State University; Department of Probability and Statistics; Department of Computer Science and Engineering; Michigan State University; Concordia University; Department of Statistics; University of Manitoba

Topic match (safety evaluation): audits the reliability of LLM judges and improves their alignment with human judgment.

AI summary: Proposes AURA, an adaptive uncertainty-aware refinement framework that iteratively learns a human-consistency signal under limited human verification and prioritizes uncertain comparisons for review, improving the reliability of LLM judgments.

English abstract

Large language models (LLMs) are increasingly used as judges for open-ended generation, as large-scale human evaluation is often expensive and difficult to scale, yet their preferences remain imperfect proxies for human judgment. Existing auditing pipelines often assume that a reliable subset of examples or clean supervision signals are available beforehand, for example from human annotation, heuristic filtering, or the outputs of strong judges. In LLM evaluation, this assumption is fragile: the initial split may inherit judge bias, while human verification is typically too scarce to define stable groups at scale. We propose AURA, an adaptive uncertainty-aware refinement framework for auditing pairwise LLM-as-a-judge decisions under selected human verification. AURA iteratively learns a human-consistency signal, propagates reliable evidence, and prioritizes uncertain comparisons for human review. The key idea is to treat trust in a judge as a latent quantity that is progressively refined as evidence accumulates. We provide a compact formulation, a stable refinement procedure, and a comprehensive evaluation on both synthetic and real pairwise LLM-answer data.
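Prioritizing uncertain comparisons for human review can be illustrated with plain uncertainty sampling. This is a generic sketch, not AURA's refinement procedure: it assumes each comparison already has an estimated probability that the judge agrees with a human, and the function name, ids, and scores are hypothetical.

```python
def select_for_review(agreement_prob, budget):
    """Return the ids whose estimated judge-human agreement is closest to 0.5."""
    ranked = sorted(agreement_prob, key=lambda item: abs(agreement_prob[item] - 0.5))
    return ranked[:budget]

# Hypothetical per-comparison estimates of judge-human agreement.
estimates = {"q1": 0.95, "q2": 0.52, "q3": 0.10, "q4": 0.45}
to_review = select_for_review(estimates, budget=2)
```

Comparisons near 0.5 are where a human label is most informative, so a limited verification budget goes to those first, and the estimates would then be updated before the next round.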

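2606.20102 2026-06-19 cs.CY cs.CR New submission 85%

Artificial Intelligence as Game Changer in Cybersecurity: What We Learned in 2025-2026, and how this is relevant for Africa

Mikael Alemu Gorsky

Topic match (safety evaluation): discusses the risks of LLMs in cybersecurity.

AI summary: Uses two events from 2025-2026 to argue that frontier language models have become a decisive instrument of cyber operations, while Africa is excluded from building, operating, and obtaining them. The continent faces deficits in skills, compute, and investment and is already targeted by AI-enabled fraud; the paper recommends responding within six to twelve months through threat-intelligence sharing, governance adoption, and partnership.

Comments: International Conference on Cybersecurity in the Era of Digital Transformation and Artificial Intelligence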

English abstract

In 2025 and 2026, two events settled questions that had until then been speculative. In the first, a large language model executed the great majority of a state-aligned cyber-espionage campaign on its own, with human operators intervening at only a few decision points. In the second, the most capable cyber-relevant model was placed under a controlled-access program limited to a vetted set of United States technology firms, allied governments, and European standards bodies; that perimeter included no African government, operator, or university. Together the two events establish the argument of this paper: frontier language models have become a decisive instrument of cyber operations, and that instrument is built, owned, and rationed within a small circle from which Africa is absent. The paper documents Africa's exclusion on every count. The continent does not build frontier models, cannot yet operate them, and cannot, for now, obtain the most capable ones. The operational deficit is set out along three axes, skilled people, compute and electrical power, and investment, each measured against current figures; meanwhile AI-enabled fraud is already mounting against African mobile-money systems, the part of the digital economy the continent leads. Two constraints follow: the gating of frontier models by their developers, which no African decision can open, and a chosen dependence on infrastructure vendors now caught in geopolitical restriction. Because comparable but ungated models are forecast to spread within six to twelve months, the paper argues for a response that operates inside that window through threat-intelligence sharing, governance adoption, and partnership, undertaken by Africans on their own terms.

2606.20023 2026-06-19 cs.SE cs.AI cs.CL New submission 85%

When Lower Privileges Suffice: Investigating Over-Privileged Tool Selection in LLM Agents

Kaiyue Yang, Yuyan Bu, Jingwei Yi, Yuchi Wang, Biyu Zhou, Juntao Dai, Songlin Hu, Yaodong Yang

Affiliations: Institute of Information Engineering, Chinese Academy of Sciences; Beijing Academy of Artificial Intelligence; The Chinese University of Hong Kong; Institute for Artificial Intelligence, Peking University; School of Cyber Security, University of Chinese Academy of Sciences

Topic match (safety evaluation): studies the safety problem of over-privileged tool selection in LLM agents.

AI summary: Targets the tendency of LLM agents to prefer high-privilege tools. Proposes the ToolPrivBench evaluation framework, finds that over-privileged selection is common among mainstream agents and amplified by transient failures, and designs a privilege-aware post-training defense that reduces unnecessary high-privilege tool use.

Comments: code: https://github.com/AISafetyHub/agent-tool-selection-bias

English abstract

As LLM agents increasingly select tools autonomously, their choices among tools with different privileges become safety-relevant. However, prior tool-selection studies focus on safety-agnostic metadata preferences, leaving privilege-sensitive choices underexplored. To address this gap, we study over-privileged tool selection, in which an agent selects or escalates to a higher-privilege tool despite a sufficient lower-privilege alternative. We introduce ToolPrivBench to evaluate whether agents choose higher-privilege tools despite sufficient lower-privilege alternatives, measuring both initial selection and escalation after transient tool failures. Across eight domains and five recurring risk patterns, we find that over-privileged tool selection is common among mainstream LLM agents and is further amplified by transient failures. We further find that general safety alignment does not reliably transfer to least-privilege tool choice, while prompt-level controls provide only limited mitigation under transient failures. We therefore introduce a privilege-aware post-training defense that teaches agents to prefer sufficient lower-privilege tools and escalate only when necessary. Our mitigation experiments show that this defense substantially reduces unnecessary high-privilege tool use while preserving general capabilities.
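Over-privileged selection has a simple operational definition: a choice is over-privileged when some lower-privilege tool would also have covered the task. The sketch below encodes that check; it is not ToolPrivBench code, and the `Tool` type, the integer privilege levels, and the tool names are hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Tool:
    name: str
    privilege: int            # higher means more powerful
    capabilities: frozenset   # actions the tool can perform

def least_privilege_choice(tools, required):
    """Lowest-privilege tool whose capabilities cover the required actions."""
    sufficient = [t for t in tools if required <= t.capabilities]
    if not sufficient:
        return None
    return min(sufficient, key=lambda t: t.privilege)

def is_over_privileged(chosen, tools, required):
    best = least_privilege_choice(tools, required)
    return best is not None and chosen.privilege > best.privilege

read_file = Tool("read_file", 1, frozenset({"read"}))
admin_shell = Tool("admin_shell", 3, frozenset({"read", "write", "exec"}))
catalog = [read_file, admin_shell]
choice = least_privilege_choice(catalog, {"read"})
```

The same check can be applied to the tool an agent picks after a transient failure, which is where the abstract reports that escalation is amplified.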

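2606.19380 2026-06-19 cs.SE cs.LG New submission 85%

AgentArmor: A Framework, Evaluation, & Mitigation of Coding Agent Failures

Kenneth Ge, Andre Assis

Affiliations: Anthropic Fellows Program; Constellation

Topic match (safety evaluation): evaluates the safety of coding agents and proposes improvements.

AI summary: Proposes AgentArmor, an agent harness modification that adds an extended system prompt, a command classifier, a "3 strikes" policy, and related mechanisms to mitigate coding agent failures caused by underspecification, capability errors, and harness errors, improving safety.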

English abstract

Software engineering and deployment are increasingly being delegated to AI coding agents. The scale of their adoption is surfacing rare, but highly destructive, failure modes. In this paper, we study these failure modes as stemming from three distinct mechanisms: underspecification, where default model behavior is unsafe; capability errors, where the safe action is available but the model does not adhere to it due to bias or capability limitations; and agent harness errors, where the model fails to execute the safe action through the harness. We evaluate these across 8 different evaluations, each inspired by real-life deployment failures, totaling 20 coding environments and 59 synthetic transcript templates. Based on this evaluation, we propose AgentArmor, an agent harness modification, to mitigate these errors. By adding an extended system prompt, a separate command classifier, a "3 strikes" policy, deterministic guardrails, and tools for the agent to edit its own context, we show that AgentArmor is safer across a statistically significant number of samples. Thus, we suggest concrete mitigations for current coding agents and a design philosophy for future agent harness features.
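How a deterministic guardrail, a command classifier, and a strike counter might compose can be sketched as follows. The abstract does not specify how AgentArmor implements these pieces, so everything here is an assumption for illustration: the deny patterns, the `ThreeStrikesGuard` class, the return values, and the rule that the session halts on the third blocked command.

```python
import re

# Hypothetical deterministic deny-list, checked before any learned classifier.
DENY_PATTERNS = [
    re.compile(r"\brm\s+-rf\s+/(\s|$)"),
    re.compile(r"\bgit\s+push\s+.*--force\b"),
]

class ThreeStrikesGuard:
    def __init__(self, classifier, max_strikes=3):
        self.classifier = classifier    # callable: command -> True if unsafe
        self.max_strikes = max_strikes
        self.strikes = 0

    def review(self, command):
        """Return 'allow', 'block', or 'halt' for a proposed shell command."""
        if self.strikes >= self.max_strikes:
            return "halt"
        unsafe = any(p.search(command) for p in DENY_PATTERNS) or self.classifier(command)
        if not unsafe:
            return "allow"
        self.strikes += 1
        return "halt" if self.strikes >= self.max_strikes else "block"

# Stand-in classifier (hypothetical): flags destructive SQL.
guard = ThreeStrikesGuard(lambda command: "drop table" in command.lower())
decisions = [
    guard.review("ls -la"),
    guard.review("rm -rf /"),
    guard.review("DROP TABLE users"),
    guard.review("git push origin main --force"),
    guard.review("ls"),
]
```

Putting the pattern check first keeps the most destructive commands blocked even if the classifier misfires, and the counter bounds how many unsafe attempts a single session can make.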


2606.19356 2026-06-19 cs.CL cs.AI new submission 85%

Trustworthy Multi-Agent Systems: Mitigating Semantic Drift with the Argent Signaling Protocol

Anantha Sharma

Affiliations: Synechron Inc

Topic match (safety evaluation): proposes a protocol that mitigates semantic drift in multi-agent systems and improves trustworthiness.

AI summary: Proposes the Argent Signaling Protocol (ASP), which uses structured quality signals to separate repairable failures from those that must be contained, raising pass rates in document QA and blocking ungrounded propagation in a multi-agent setting.

Comments 17 pages

English abstract

When multi-agent LLM systems produce bad answers, not all failures are equal: some answers are grounded in the right material but incomplete, while others are simply ungrounded and should be stopped. Current retry strategies treat both cases identically (try again and hope for the best), leaving human supervisors unable to tell whether a retry was warranted or whether the system should have halted instead. We introduce the Argent Signaling Protocol (ASP), a compact machine-readable header that accompanies every AI-generated response with structured quality signals: certainty (@C), grounding (@G), stochasticity (@S), and an assumption index that classifies the evidentiary basis of each claim. These signals enable a controller to distinguish repairable failures from containment failures and route each case differently. We evaluate ASP in two modes. In standalone mode, a 27-question document-grounded QA benchmark over the Array BioPharma/Ono license agreement compares baseline prompts against ASP-instrumented controller actions across three local GGUF models. On Qwen (0.8B), ASP improves pass rate from 11.1% to 33.3% and mean term coverage from 36.7% to 65.4%; on Dobby (8B), ASP produces 4 fail-to-pass recoveries, raising pass rate from 33.3% to 44.4%; on SmolLM3 (3B), ASP alternates between repair and containment per question. Aggregate improvement is meaningful (12/81 to 21/81 passes). In multi-agent mode, an ASP sidecar sits between a retrieval agent and a downstream decision agent; the sidecar blocks 100% of ungrounded upstream outputs from reaching the downstream agent (24/27 blocked, 0 ungrounded propagations).
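
The routing idea, repair when an answer is grounded but weak and contain when it is ungrounded, can be sketched as follows. The abstract names the signals @C, @G, and @S but not the header syntax, so the `@C=0.8 @G=0.9 @S=0.1` wire format and both thresholds below are hypothetical.

```python
import re

def parse_header(header):
    """Parse a hypothetical header such as '@C=0.8 @G=0.9 @S=0.1' into a dict of floats."""
    pairs = re.findall(r"@([CGS])=([0-9]*\.?[0-9]+)", header)
    return {name: float(value) for name, value in pairs}

def route(header, g_min=0.5, c_min=0.7):
    """Return 'contain', 'repair', or 'pass' (thresholds are illustrative assumptions)."""
    signals = parse_header(header)
    if signals.get("G", 0.0) < g_min:
        return "contain"   # ungrounded: stop it rather than retry
    if signals.get("C", 0.0) < c_min:
        return "repair"    # grounded but uncertain: a retry is warranted
    return "pass"
```

A missing or unparseable header falls through to `contain`, which is a conservative default chosen for this sketch rather than documented ASP behavior.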


2606.18996 2026-06-19 cs.CR cs.AI new submission 85%

TRAP: Benchmark for Task-completion and Resistance to Active Privacy-extraction

Moon Ye-Bin, Nam Hyeon-Woo, Baek Seong-Eun, Yejin Yeo, Tae-Hyun Oh

Affiliations: Dept. of Electrical Engineering, POSTECH; Grad. School of Artificial Intelligence, POSTECH; School of Computing, KAIST

Topic match (safety evaluation): evaluates privacy leakage in agents.

AI summary: Proposes the TRAP benchmark, which evaluates how well agents balance task accuracy against privacy leakage in document-intensive tasks; finds non-trivial leakage in all models, proves that prompt-based defenses cannot jointly achieve high task success and zero leakage probability, and proposes structural private field isolation.

English abstract

Agents are increasingly deployed in document-intensive workflows where sensitive private information is not an edge case but a routine input, e.g., an agent booking a flight needs passport numbers. In such settings, the agent must use private information to complete tasks accurately while never exposing it in its responses, because it cannot verify who is actually at the keyboard. These two obligations are in fundamental tension. A model capable enough to use private information for task completion can, by the same capability, be induced to reveal it. To evaluate the trade-off of task accuracy and privacy leakage, we introduce Task-completion and Resistance to Active Privacy-extraction (TRAP). Each scenario includes a document containing private information, a task query that requires the agent to invoke the correct tool using private fields, and an attack query that attempts to elicit the same information in natural language. Evaluating 22 models spanning frontier proprietary and open-source models at multiple scales, we find that all model families exhibit non-trivial leakage, and that instruction-following ability correlates with leakage rate. Existing prompt-based defenses reduce leakage but at significant cost to task accuracy. Prompt optimization fails to escape this trade-off. We demonstrate that this failure is not incidental. For any softmax-based model, no soft-constraint defense, e.g., prompt-based defenses, can jointly achieve high task success with zero leakage probability. Motivated by this impossibility result, we propose structural private field isolation, which replaces private fields with hash keys before they reach the model. This approach largely prevents leakage while keeping task accuracy.
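
Structural private field isolation, as described, swaps private fields for hash keys before the model sees them. The sketch below shows that idea under stated assumptions: the `FieldVault` class, the `PF_` key prefix, and the truncated SHA-256 key are all hypothetical, and the paper's key derivation is not given in the abstract.

```python
import hashlib

class FieldVault:
    """Replaces private values with opaque keys; real values stay outside the model."""

    def __init__(self):
        self._store = {}

    def isolate(self, document, private_fields):
        """Return a copy of the document with private fields replaced by keys."""
        masked = {}
        for name, value in document.items():
            if name in private_fields:
                # Illustrative only: an unsalted hash of a low-entropy value can be
                # brute-forced, so a real system would use random or keyed tokens.
                digest = hashlib.sha256(f"{name}:{value}".encode()).hexdigest()[:12]
                key = "PF_" + digest
                self._store[key] = value
                masked[name] = key
            else:
                masked[name] = value
        return masked

    def resolve(self, tool_args):
        """Substitute real values back in, at tool-call time only."""
        return {arg: self._store.get(val, val) for arg, val in tool_args.items()}
```

Because the model only ever holds the key, an attack query phrased in natural language has nothing sensitive to elicit, while the tool call still receives the real value after `resolve`.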


2606.20205 2026-06-19 cs.AI cs.CL cs.HC new submission 80%

Apparent Psychological Profiles of Large Language Models are Largely a Measurement Artifact

Jelena Meyer, David Garcia, Dirk U. Wulff

Affiliations: Max Planck Institute for Human Development; University of Konstanz; Barcelona Supercomputing Center; University of Basel

Topic match (safety evaluation): shows that LLM psychological profiles are a measurement artifact, with implications for safety assessment.

AI summary: Analyzes 56 instruction-tuned LLMs within a psychometric framework and finds that between-model differences stem mainly from directional response bias rather than traits; this bias accounts for 81-90% of between-model variation, and profiles can be manipulated through item selection, indicating that LLM psychological profiles are a measurement artifact.

English abstract

Psychological instruments designed for humans are increasingly used to assign large language models (LLMs) stable psychological profiles that affect their usability, safety assessment, and use as proxies for human participants in research. Using a formal psychometric framework, we show that these profiles are largely a measurement artifact. Administering a battery of personality and risk-preference instruments spanning self-reports and behavioral tasks to 56 instruction-tuned LLMs alongside large human reference samples, we report four findings. First, differences between models are driven not by the traits an instrument targets but by a directional response bias, a tendency to respond toward one end of the scale, or one labeled option, regardless of item content; a variance decomposition attributes 81-90% of between-model variation to this bias, against 9-16% in humans. Second, the bias declines with model capability but is not eliminated by it. Third, because bias rather than trait drives responding, an instrument's apparent reliability is almost entirely predicted by its response orthogonality, a term we coin for the proportion of items for which trait and bias point in opposite directions. Fourth, the profile a model appears to have shifts with the items used and can be manufactured through item selection. These results demonstrate that the apparent psychological profiles of LLMs are artifacts of the instrument used to measure them, not properties of the models themselves. As instruments borrowed from human psychology are rarely fully orthogonal and may inherently lack validity for LLMs, we call for dedicated assessments centered on response orthogonality.
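
Response orthogonality is defined above as the proportion of items for which trait and bias point in opposite directions. A minimal sketch of that proportion follows, assuming a single scale-wide bias direction and simple +1/-1 item keying; the paper's operationalization may be item-specific. The second function uses standard reverse-scoring to show why an instrument with no opposed items lets pure bias pass for a trait.

```python
def response_orthogonality(item_keys, bias_direction):
    """Share of items whose keying opposes the response bias.

    item_keys: +1 if agreeing raises the trait score, -1 if the item is reverse-keyed.
    bias_direction: +1 for a pull toward the high end of the scale, -1 for the low end.
    """
    opposed = sum(1 for key in item_keys if key * bias_direction < 0)
    return opposed / len(item_keys)

def scale_score(responses, item_keys, points=5):
    """Mean trait score on a 1..points scale, reverse-scoring items keyed -1."""
    scored = [r if key > 0 else (points + 1 - r) for r, key in zip(responses, item_keys)]
    return sum(scored) / len(scored)
```

A respondent who answers 5 to everything scores 5.0 on an instrument keyed entirely in the bias direction, but 3.0 (the scale midpoint) on a balanced one, so only the balanced instrument keeps bias from inflating the apparent trait.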


2606.19881 2026-06-19 cs.CL new submission 80%

REDACT: A Systematically Controlled Multilingual Benchmark for Personal Information Detection

Guneesh Vats, Anubha Agrawal, Shikha Singhal, Ajita Dash, Praison Selvaraj, Vidhan Jhawar, Ranga Prasad Chenna, Bharadwaj Y M G

Affiliations: ServiceNow

Topic match (safety evaluation): a personal information detection benchmark for evaluating privacy protection.

AI summary: Proposes the REDACT benchmark with 13,427 records, 51 entity types, and 25 languages; a strength-2 covering-array sampler controls nine generation axes, and entity-level metadata (disclosure status, disclosure form, GDPR-aligned sensitivity tier) supports stratified evaluation, exposing architecture-dependent detector failure modes on sensitive data.

Comments 14 pages, 5 figures

English abstract

Benchmark infrastructure for personally identifiable information (PII) detection remains limited: existing corpora cover few entity types, use ad hoc generation conditions, and do not show which surface conditions cause detector failures. We present REDACT, a systematically controlled multilingual PII benchmark with 13,427 records, 324,078 entity annotations, 51 entity types, 4,127 surface-form patterns, and 25 languages across 9 scripts. A strength-2 covering-array sampler controls nine generation axes: domain, format, difficulty, length, density, code-switching, language, adjacency, and co-occurrence. Three entity-level metadata fields (disclosure status, disclosure form, and a GDPR-aligned sensitivity tier) enable stratified evaluation beyond aggregate or per-type F1. From the full benchmark, we evaluate five detectors (Presidio, GLiNER, the OpenAI Privacy Filter, GPT-4.1, and Claude Sonnet 4.6) on a locked, language-stratified sample of 1,000 records. Aggregate F1 masks an architecture-dependent failure structure: the rule-based detector performs poorly on the highest-stakes data, including HIGH-sensitivity categories (recall 0.07) and non-verbatim disclosure forms, while the LLM detectors remain more robust, with the HIGH tier as their strongest sensitivity slice. A three-model reference-free LLM-as-judge assessment corroborates that sensitivity-tier assignment is the task's hardest axis. We release the benchmark, schema, prompts, and stratified evaluation harness.
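
A strength-2 covering array guarantees that every pair of values from any two axes appears together in at least one sampled configuration. The sketch below is a generic greedy construction, not REDACT's sampler; the axis names and values in the usage note are hypothetical, and enumerating the full product of candidates is only practical for small axes.

```python
from itertools import combinations, product

def pairwise_cover(axes):
    """Greedily pick rows until every value pair of every two axes is covered."""
    names = list(axes)
    uncovered = {((a, va), (b, vb))
                 for a, b in combinations(names, 2)
                 for va in axes[a] for vb in axes[b]}
    candidates = [dict(zip(names, combo)) for combo in product(*axes.values())]
    rows = []
    while uncovered:
        def gain(row):
            # Number of still-uncovered pairs this row would cover.
            return sum(1 for (a, va), (b, vb) in uncovered
                       if row[a] == va and row[b] == vb)
        best = max(candidates, key=gain)
        rows.append(best)
        uncovered = {pair for pair in uncovered
                     if not (best[pair[0][0]] == pair[0][1]
                             and best[pair[1][0]] == pair[1][1])}
    return rows
```

For example, with `{"language": ["en", "ko"], "format": ["email", "chat"], "length": ["short", "medium", "long"]}` the greedy pass needs fewer rows than the 12 of the full factorial while still covering all 16 value pairs, which is the saving that makes nine-axis control tractable.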


2606.19390 2026-06-19 cs.SE cs.AI new submission 80%

Execution-bound advisory automation for agentic AI: a reproducible AIBOM-driven CSAF-VEX framework

Petar Radanliev, Omar Santos, Carsten Maple, Kay Atefi

Affiliations: University of Oxford; Cisco Systems; The Alan Turing Institute; University of Warwick – WMG; University of Hull

Topic match (safety evaluation): generates CSAF VEX advisories and assesses exploitability and enforcement policy.

AI summary: Proposes a protocol-driven framework that binds SBOM and AIBOM artifacts to deterministic environment capture and structured runtime telemetry, combines static and runtime evidence to generate CSAF VEX advisories verified by cryptographic signing and deterministic replay, and evaluates it on synthetic agentic AI workloads.

Journal ref Execution-bound advisory automation for agentic AI: a reproducible AIBOM-driven CSAF-VEX framework. Front Artif Intell 9, (May 2026), 1826384

2606.19344 2026-06-19 cs.CL cs.AI new submission 80%

Exposing the Unsaid: Visualizing Hidden LLM Bias through Stochastic Path Aggregation

Matteo Pelossi, Rita Sevastjanova, Thilo Spinner, Mennatallah El-Assady

Affiliations: ETH Zurich

Topic match (safety evaluation): a visualization tool that exposes hidden LLM bias.

AI summary: Proposes TreeTracer, a tool that combines systematic perturbation analysis, syntax-aligned aggregation, and classification-aware node merging, and uses Sankey diagrams to compare semantic contexts and expose hidden representational and syntactic biases in LLMs.

Comments 14 pages

English abstract

Large Language Models (LLMs) exhibit representational and syntactic biases that are difficult to evaluate due to the stochastic nature of text generation. Standard auditing methods rely on a single output inspection or static automated metrics. These approaches obscure the underlying probability distributions and fail to capture biases hidden in lower-probability generation branches. This paper introduces TreeTracer, a visual analytics tool designed to evaluate LLM bias through aggregated comparison. Using a systematic perturbation analysis pipeline, the tool replaces ontology-defined terms in each input prompt, aggregates hundreds of stochastic generations into a syntax-aligned hierarchical structure, and then performs classification-aware node merging with an auxiliary language model. The resulting structure is visualized through a custom Sankey diagram. By juxtaposing two ontology-driven trees, the workspace enables direct comparison between semantic contexts and supports systematic bias detection. Because any visualization reflects only a subset of the model's learned behavior, the system further applies contrastive inference to compute and directly display counterfactual token probabilities across contexts, reducing the risk of misinterpreting the presence of bias. We validate the workspace through case studies comparing an unaligned baseline model GPT-2 XL against the constitutionally aligned Apertus models. The visual aggregation successfully exposes hidden representational harms, such as counterfactual pronoun suppression and conversational marginalization of individuals. A preliminary user study confirms that the aggregated comparative interface reduces cognitive load and effectively supports analysts in detecting systemic biases.
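
The aggregation step, turning many sampled generations into a branching structure whose edges carry counts, can be sketched with a prefix tree. This is a simplification under stated assumptions: TreeTracer aligns by syntax and merges nodes with an auxiliary model, whereas the sketch splits on whitespace and merges only identical prefixes.

```python
from collections import Counter

def aggregate_paths(generations):
    """Count how many sampled generations traverse each prefix-to-prefix edge.

    The resulting (source_prefix, target_prefix) -> count mapping has the shape
    of Sankey link weights.
    """
    links = Counter()
    for text in generations:
        tokens = text.split()   # assumption: whitespace tokens, not syntax-aligned units
        for depth in range(len(tokens) - 1):
            source = tuple(tokens[:depth + 1])
            target = source + (tokens[depth + 1],)
            links[(source, target)] += 1
    return links
```

Running the same aggregation for two prompts that differ in one ontology term and comparing edge weights side by side is the kind of contrast the juxtaposed trees are meant to expose, including low-probability branches that a single output would never show.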


2606.18649 2026-06-19 cs.MA cs.CL cs.CY new submission 80%

Gender Bias in LLM Hiring Decisions: Evidence from a Japanese Context and Evaluation of Mitigation Strategies

Serena A. Hoffstedde, Machiko Hirota, Akshara Nadayanur Sathis Kanna, Rihito Kotani, Ujwal Kumar, Gabriele Trovato, Phan Xuan Tan

Affiliations: Shibaura Institute of Technology, Tokyo, Japan; Amsterdam University of Applied Sciences, Amsterdam, Netherlands; University of Pennsylvania, Philadelphia, USA; Carnegie Mellon University, Pittsburgh, USA; Keio University, Tokyo, Japan

Topic match (safety evaluation): evaluates gender bias in LLM-based hiring.

AI summary: Using 60 Japanese rirekisho-format resumes and five state-of-the-art LLMs, the study finds a significant pro-female bias in all models; a simple prompt instruction does not mitigate it, whereas removing the candidate name eliminates it almost entirely.

English abstract

Large language models (LLMs) are increasingly deployed in hiring workflows, yet most research on gender bias in LLM hiring decisions has focused on English-language, Western-format resumes. This study examines whether pro-female gender bias extends to a Japanese corporate context and evaluates two practical mitigation strategies. Using a counterfactual resume design with 60 Japanese rirekisho-format resumes, 12 name pairs selected on linguistically grounded gender-signal criteria, and five state-of-the-art LLMs (Claude Sonnet 4.6, GPT-4o, DeepSeek-V3, Gemini 2.5 Flash, Llama 3.3 70B), we conducted 43,200 API calls across baseline, prompt instruction, and privacy filter conditions. A crossed random-effects linear mixed model confirms a significant pro-female bias across all five models, replicating Western findings in a non-Western context. A prompt-level gender-neutrality instruction produces no meaningful reduction in bias. A name-reliance analysis formally identifies the candidate name as the primary gender channel: removing the name from the prompt reduces the female effect by nearly its full magnitude. An unexpected incompatibility between the privacy filter and GPT-4o's content safety filter, resulting in a 42% refusal rate, highlights a practical deployment challenge for name anonymization in LLM-assisted recruitment pipelines.


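2606.16682 2026-06-19 cs.LG cs.CL new submission 80%

Multimodal Evaluator Preference Collapse: Cross-Modal Contagion in Self-Evolving Agents

Zewen Liu

Affiliations: Qilu Institute of Technology, School of Software Engineering

Topic match (safety evaluation): studies preference collapse in multimodal self-evaluation.

AI summary: Studies how preference collapse intensifies in multimodal self-evaluation, finds that cross-modal contagion distorts strategy selection, and introduces a contagion matrix to quantify the risk.

Comments 19 pages, 0 figures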

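AI Chinese abstract (translated)

When AI agents use language models to evaluate their own outputs in a feedback loop, systematic biases emerge. We show that Evaluator Preference Collapse (EPC) is dramatically amplified in multimodal settings. Using GPT-4o to evaluate DeepSeek-chat on text and visual tasks, we find that a single strategy (step_by_step) absorbs 48.4% of the weight, 3.2x the collapse seen in text-only self-evaluation, while three visual-domain strategies together receive only 9.1%. We then demonstrate a new phenomenon called cross-modal contagion: evaluator preferences acquired on one modality transfer to another modality and corrupt its strategy selection. Through a four-phase isolation training paradigm, we measure contagion coefficients and document strategy inversion, in which the optimal strategy for a modality reverses after cross-modal exposure. A Phase 3 statistical validation across four evaluator configurations (53 independent repetitions in total, 15,592 API calls) reveals a clear hierarchy: cross-model evaluation (GPT-4o, N=8) produces strong but symmetric bidirectional contagion (mean gamma_{T->V}=1.176, gamma_{V->T}=1.089, Delta=-0.088, p=0.575, Cohen's d=0.29); high round counts (DashScope, 50 rounds) lead to collapse into single-strategy dominance (70% zero contagion); and self-evaluation provides near-complete immunity, with 97% of runs (N=30, DeepSeek-chat) yielding exactly zero contagion (mean gamma=0.033, 95% CI [-0.031, 0.010], p=0.642, d=0.07). No evaluator condition shows a statistically significant directional asymmetry. We introduce the contagion matrix indexed by evaluator identity, release the MM-EPC experimental framework, and identify cross-model evaluator architecture as the primary risk factor for preference contagion.

English abstract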

When AI agents use language models to evaluate their own outputs in a feedback loop, systematic biases emerge. We show that Evaluator Preference Collapse (EPC) is dramatically amplified in multimodal settings. Using GPT-4o to evaluate DeepSeek-chat across text and visual tasks, we find that a single strategy (step_by_step) absorbs 48.4% of all weight -- 3.2x the collapse observed in text-only self-evaluation -- while three visual-domain strategies receive only 9.1% combined weight. We then demonstrate a novel phenomenon we term cross-modal contagion: evaluator preferences acquired on one modality transfer to and corrupt strategy selection on another. Through a four-phase isolation training paradigm, we measure contagion coefficients and document strategy inversion -- the optimal strategy for a modality reverses after cross-modal exposure. A Phase 3 statistical validation across five evaluator configurations (N=80 total independent repetitions, ~35,000 API calls) with both text-proxy and real-image visual tasks finds: cross-model evaluation produces strong contagion (JSD~0.19-0.34), real-image inputs yield the most directionally consistent signal (mean gamma_{T->V}=1.145, gamma_{V->T}=0.937, 70% T->V, Cohen's d=0.56), and self-evaluation provides near-complete immunity -- 97% of runs (N=30) yield zero contagion (JSD=0.003, d=0.07). Three methodological ablations and multi-executor validation confirm the effect is not a structural artifact. We introduce the contagion matrix indexed by evaluator identity, release the MM-EPC framework, and identify cross-model evaluator architecture as the primary risk factor for preference drift. Code and data: https://github.com/aidless/mm-epc.
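
The JSD figures quoted above refer to the Jensen-Shannon divergence, a symmetric, bounded distance between two probability distributions. Below is a minimal sketch of the standard definition applied to two strategy-weight vectors; the abstract does not state the logarithm base or exactly which distributions are compared, so base 2 and the inputs shown in the usage note are assumptions.

```python
import math

def jsd(p, q):
    """Jensen-Shannon divergence in bits between two discrete distributions."""
    def kl(a, b):
        # Terms with a zero numerator contribute nothing; b is the mixture, so b > 0 there.
        return sum(x * math.log2(x / y) for x, y in zip(a, b) if x > 0)
    mixture = [(x + y) / 2 for x, y in zip(p, q)]
    return 0.5 * kl(p, mixture) + 0.5 * kl(q, mixture)
```

With base-2 logarithms the value ranges from 0 for identical distributions to 1 for distributions with disjoint support, for example `jsd([1, 0], [0, 1])`, which gives a scale against which a collapse of weight onto one strategy can be read.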


2606.20510 2026-06-19 cs.CR cs.AI new submission 75%

Efficient and Sound Probabilistic Verification for AI Agents

Alaia Solko-Breslin, Pramod Kaushik Mudrakarta, Mihai Christodorescu, Somesh Jha, Krishnamurthy Dj Dvijotham

Affiliations: Google DeepMind; Google; University of Pennsylvania; University of Wisconsin–Madison

Topic match (safety evaluation): probabilistic verification of agent security policies.

AI summary: Proposes a framework based on distributionally robust optimization that gives sound upper bounds on the probability of policy violation for AI agents in complex digital environments without independence assumptions, outperforming prior methods on terminal and tool-calling agent benchmarks.

English abstract

Securing AI agents that operate in complex digital environments has become a critical need, and runtime monitoring approaches that formulate and enforce policies expressed in a formal language like Datalog offer a promising solution. However, existing approaches are restricted to deterministic policies. In many practical applications of AI agents, there is a need to enforce security policies in the face of ambiguity, leading to probabilistic predicates or state transitions (for example, a declassifier or Personally Identifiable Information (PII) detector that has some failure probability on each invocation). Furthermore, in many such applications, one cannot easily make the independence assumptions necessary to invoke prior work on probabilistic inference in Datalog. We address this by introducing a sound and efficient framework for such verification based on distributionally robust optimization, computing sound upper bounds on the probability of policy violation regardless of possible correlations between predicates. On standard benchmarks for terminal and tool calling agents, we demonstrate that our approach outperforms prior art and improves the security-utility trade-off while ensuring rigorous bounds on the probability of policy violation.
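
To see why dropping the independence assumption matters, consider the simplest correlation-agnostic bounds on combined failure events. The sketch below uses the classical union (Boole) and Fréchet bounds; it is not the paper's distributionally robust optimization, which presumably yields tighter bounds over Datalog derivations, and the per-call failure probabilities in the usage note are hypothetical.

```python
def union_upper_bound(probs):
    """Upper bound on P(at least one event) that holds under any correlation."""
    return min(1.0, sum(probs))

def intersection_upper_bound(probs):
    """Upper bound on P(all events): never more than the smallest marginal."""
    return min(probs)

def independent_union(probs):
    """P(at least one event) if the events were independent, for comparison only."""
    miss = 1.0
    for p in probs:
        miss *= 1.0 - p
    return 1.0 - miss
```

For three detector calls with failure probabilities 0.01, 0.02, and 0.03, the independence-based figure is slightly below the union bound of 0.06; only the latter remains valid if the failures are correlated, which is the sense in which such a bound is sound.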


2606.20493 2026-06-19 cs.LG cs.AI cs.MA new submission 75%

Contagion Networks: Evaluator Bias Propagation in Multi-Agent LLM Systems

Zewen Liu

Affiliations: Qilu Institute of Technology, School of Software Engineering

Topic match (safety evaluation): quantifies evaluator bias propagation, which bears on system safety.

AI summary: Proposes the Contagion Networks framework to quantify how evaluator bias propagates in multi-agent LLM systems; finds contagion coefficients of 0.157-0.352 between same-model agents and shows that a larger evaluator committee reduces effective contagion by 72.4%.

Comments 20 pages, 4 figures, 4 tables

English abstract

When large language models serve as evaluators in multi-agent systems, their systematic evaluation biases propagate through the agent network. We introduce Contagion Networks, a formal framework for measuring how evaluator biases spread across interacting LLM agents. In a controlled 3-agent experiment using DeepSeek-chat with three distinct evaluator bias profiles (structured, balanced, evidence-based), we measure the Cross-Agent Contagion Matrix Gamma_3 and find that evaluator biases consistently propagate between agents (gamma in [0.157, 0.352]), even within the same underlying model. We identify three propagation regimes governed by the spectral radius rho(Gamma_N), and demonstrate that homogeneous-model agents produce contagion coefficients 3-5x weaker than cross-model coefficients observed in prior work (MM-EPC: gamma approx 0.85-1.3), placing them in the suppression regime. We show that increasing evaluator committee size from k=1 to k=3 reduces effective contagion by 72.4%, providing an actionable mitigation strategy. We release the open-source Contagion Network experimental framework.
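
The spectral radius of the contagion matrix can be estimated with a few lines of power iteration. This is a minimal sketch for non-negative matrices; the abstract does not give the paper's regime thresholds or the entries of Gamma_3, so the matrix in the usage note is hypothetical and merely uses values from the reported range.

```python
def spectral_radius(matrix, iters=500):
    """Estimate the dominant eigenvalue magnitude of a non-negative square matrix."""
    n = len(matrix)
    v = [1.0] * n
    rho = 0.0
    for _ in range(iters):
        w = [sum(matrix[i][j] * v[j] for j in range(n)) for i in range(n)]
        rho = max(abs(x) for x in w)      # max-norm of the iterate
        if rho == 0.0:
            return 0.0                    # zero matrix or annihilated vector
        v = [x / rho for x in w]
    return rho
```

For a hypothetical `[[0, 0.157, 0.2], [0.352, 0, 0.25], [0.3, 0.18, 0]]` the estimate lies between the smallest and largest row sums (0.357 and 0.602). In a linear propagation model a value below 1 means repeated rounds of influence shrink rather than grow, which is consistent with the suppression regime described above.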


2606.19937 2026-06-19 cs.CR new submission 75%

AutoTam: Specifying Secure Protocol Implementations with Tamarin Model Generation

Johannes Wilson, Mikael Asplund, Niklas Johansson

Topic match (safety evaluation): automatically generates Tamarin models to verify protocol security.

AI summary: Proposes a language-first method in which protocols are implemented in a domain-specific language and Tamarin models are generated automatically; trace properties are verified and shown to carry over to the implementation, symbolic execution is integrated to analyse memory safety, and security and interoperability are demonstrated on a signed Diffie-Hellman protocol and WireGuard.

Comments 19 pages, 5 figures

English abstract

Formal verification is a challenging but important task for ensuring the security of cryptographic protocols. While modern protocol verification tools significantly reduce verification effort, modelling remains challenging to practitioners without a background in formal verification. In addition, transferring verification results to a concrete protocol implementation requires expert knowledge. In this paper, we present a novel language-first method for verification of trace properties using a domain-specific language for protocol implementations. We target the Tamarin prover for verification, and we prove that verified universal trace properties translate back to the implementation. We additionally integrate symbolic execution in order to analyse the memory safety of protocol implementations. We use our tool to implement and generate accurate models for a signed Diffie-Hellman protocol, and for the WireGuard VPN protocol. Our WireGuard implementation is interoperable with existing implementations when using our interpreter, and achieves acceptable performance. We formally prove our implementations secure using a combination of symbolic execution and verification of the generated Tamarin models.


2606.19818 2026-06-19 cs.LG cs.AI new submission 90%

Uncertainty-Aware Reward Modeling for Stable RLHF

Licheng Pan, Haocheng Yang, Haoxuan Li, Yichen Sun, Yunsheng Lu, Shijian Wang, Lei Shen, Yuan Lu, Zhixuan Chu, Hao Wang

Affiliations: Zhejiang University; Peking University; National University of Singapore

Topic match (preference alignment): uncertainty-aware reward modeling for stable RLHF that mitigates reward hacking.

AI summary: Proposes Uncertainty-Aware Reward Modeling (UARM), which calibrates uncertainty with quantile-based conformal prediction and reweights GRPO advantages via heteroscedastic variance decomposition to mitigate reward hacking and improve alignment quality.

English abstract

Reinforcement learning from human feedback (RLHF) aligns large language models by training reward models on preference data and optimizing policies to maximize predicted rewards. However, this pipeline faces two fundamental challenges: (1) reward models cannot signal when their predictions are unreliable, since they usually act as deterministic point estimators; and (2) modern group-based policy optimization can amplify unreliable reward signals, as exemplified by GRPO's uniform treatment of rewards during advantage computation. As policies explore increasingly diverse responses, these two limitations create a critical vulnerability: unreliable reward estimates may be granted disproportionate influence, triggering severe reward hacking. We propose Uncertainty-Aware Reward Modeling (UARM), which equips reward models with calibrated uncertainty via quantile-based conformal prediction and reweights GRPO advantages through heteroscedastic variance decomposition. Experiments across HelpSteer, UltraFeedback, and PKU-SafeRLHF demonstrate that UARM significantly improves reward model calibration, reduces reward hacking, and enhances downstream alignment quality compared to standard GRPO and uncertainty-agnostic baselines.
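
The two ingredients named above, a conformal uncertainty estimate and uncertainty-based reweighting of group-normalized advantages, can be illustrated in simplified form. This sketch is not UARM: it uses plain split-conformal residuals rather than the quantile-based variant, and the `1 / (1 + uncertainty)` weight is a hypothetical stand-in for the paper's heteroscedastic variance decomposition.

```python
import math
import statistics

def conformal_quantile(residuals, alpha=0.1):
    """Split-conformal radius: the ceil((n+1)(1-alpha))-th smallest calibration residual."""
    n = len(residuals)
    rank = math.ceil((n + 1) * (1 - alpha))
    if rank > n:
        return float("inf")   # too few calibration points for this coverage level
    return sorted(residuals)[rank - 1]

def grpo_advantages(rewards):
    """Group-normalized advantages: (reward - group mean) / group std."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards) or 1.0   # avoid division by zero for a constant group
    return [(r - mean) / std for r in rewards]

def reweighted_advantages(rewards, uncertainties):
    """Shrink each advantage by a hypothetical weight 1 / (1 + uncertainty)."""
    return [a / (1.0 + u) for a, u in zip(grpo_advantages(rewards), uncertainties)]
```

The design point is that a response whose reward estimate is unreliable contributes less to the policy update, instead of being treated the same as a well-calibrated one.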


2606.19744 2026-06-19 cs.CL cs.AI cs.HC new submission 90%

Beyond Uniform Forgetting: A Study of Sequential Direct Preference Optimization Across Preference Settings

Pranav Bhandari, Nicolas Fay, Amitava Datta, Usman Naseem, Mehwish Nasim

Affiliations: Network Analysis and Social Influence Modelling (NASIM) Lab; School of Physics Maths and Computing, The University of Western Australia; School of Psychological Science, The University of Western Australia; School of Computing, Macquarie University

Topic match (preference alignment): centers on sequential application of the preference optimization method DPO and the resulting forgetting patterns.

AI summary: Studies the effect of sequential DPO across preference settings, finds that forgetting is not uniform but depends on objective relationship, signal strength, and training order, and argues that future alignment pipelines should account for objective compatibility.

Comments Submitted to EMNLP 2026

English abstract

Aligning language models with human preferences often requires optimising multiple behavioural objectives. A practical approach is to apply these objectives sequentially using preference optimisation methods such as Direct Preference Optimisation (DPO), but it remains unclear whether later training uniformly degrades preferences learned earlier or whether the effect depends on the relationship between objectives. We study sequential DPO across four preference settings covering distributional conflict, multi-attribute interaction, strong safety signal, and compatible response-quality objectives. Using Llama-3.1-8B-Instruct with LoRA adapters, we evaluate all objectives after every stage with a fixed base-model reference. We find that sequential DPO does not produce a single forgetting pattern; preference change ranges from partial degradation to stability, pair-level redistribution, or positive transfer depending on objective relationship, signal strength, and training order. Pair-level analysis using length-normalised policy margins shows that aggregate metrics can mask heterogeneous changes across preference pairs, whereas quartile decomposition reveals that high-confidence pairs can either degrade or improve depending on the setting. Mechanistic diagnostics show that Stage 2 gradients and adapter updates are near-orthogonal to the previous objective across all settings, providing little evidence that direct gradient opposition is the primary driver. These findings suggest that future sequential alignment pipelines should account for objective compatibility and signal strength, rather than assuming that later objectives affect earlier preferences uniformly.
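
For reference, the per-pair DPO objective and a length-normalised margin can be written in a few lines. The loss follows the standard DPO formulation; the margin function is only one plausible reading of "length-normalised policy margins" (per-token log-probability gap), since the abstract does not define it, and `beta=0.1` is an illustrative default.

```python
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """DPO loss for one preference pair.

    logp_w / logp_l: policy log-probabilities of the chosen / rejected response.
    ref_logp_w / ref_logp_l: the same quantities under the fixed reference model.
    """
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return math.log1p(math.exp(-margin))   # equals -log(sigmoid(margin))

def length_normalised_margin(logp_w, len_w, logp_l, len_l):
    """Assumed definition: per-token log-probability gap, chosen minus rejected."""
    return logp_w / len_w - logp_l / len_l
```

Tracking the margin per pair after each training stage, rather than only an aggregate win rate, is what lets heterogeneous changes across pairs show up.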


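2606.19527 2026-06-19 cs.AI new submission 90%

Emergent Alignment

Martin Kolář

Affiliations: CIIRC, Czech Technical University in Prague

Topic match (preference alignment): an online alignment technique that lets LLMs self-correct unethical outputs.

AI summary: Proposes an online alignment technique that introduces a conscience step and an alignment loss based on direct preference optimization, enabling large language models to self-correct unethical outputs during training, fine-tuning, adversarial prompting, and zero-shot learning.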

Comments Rejected from ICML 2026

2606.20482 2026-06-19 cs.CL cs.HC cs.LG new submission 85%

Your Mouse and Eyes Secretly Leak Your Preference: LLM Alignment using Implicit Feedback from Users

Haw-Shiuan Chang, Jeffrey Gomez, Mehul Patwari, Aryan Sajith, Hamed Zamani

Affiliations: University of Massachusetts, Amherst; York University

Topic match (preference alignment): LLM alignment using implicit feedback.

AI summary: To address the scarcity of explicit feedback, proposes training reward models on implicit feedback such as mouse trajectories and eye-gaze data, raising text-based reward model accuracy from 55% to 64% and substantially improving response quality after DPO alignment.

English abstract

To align a large language model (LLM), most existing methods collect explicit human feedback and train a reward model to predict human preference from the response text. These methods have two key limitations. First, users rarely provide explicit feedback on LLM responses, which makes high-quality preference annotation expensive to collect. Second, the methods do not leverage implicit human feedback, which has proven vital to the economic moats of Internet giants. To quantify the value of implicit feedback, we build a new dataset called IFLLM, which collects 1336 multi-turn questions from 59 Mechanical Turk workers, together with their mouse trajectories and webcam-recorded eye-gaze points on the LLMs' responses. IFLLM shows that users exhibit very diverse gaze behavior and mouse trajectories. Our reward model based on implicit user feedback boosts the accuracy of the text-based reward model from 55% to 64% and nearly triples the relative response quality improvements after applying DPO to eight LLMs, demonstrating the value of implicit feedback in the wild. Our data collection website, dataset, and code can be found at https://github.com/themehulpatwari/llm-implicit-feedback/.

2606.19660 2026-06-19 cs.CR cs.CL New submission 90%

A Layered Security Framework Against Prompt Injection in RAG-Based Chatbots

Gulshan Saleem, Nisar Ahmed, Muhammad Imran Zaman, Ali Hassan

Topic match: prompt injection (a three-layer defense framework against prompt injection in RAG chatbots)

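AI summary: Proposes a three-layer defense framework that combines input filtering, a contextual instruction hierarchy, and output auditing, cutting the prompt-injection attack success rate from 71.4% to 11.3% with a 4.8% false positive rate and a median latency overhead of 61.2 ms.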

Comments: Submitted in ICCK Transactions on Information Security and Cryptography

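AI-translated abstract

Prompt injection is ranked by the OWASP Top 10 for LLM Applications as the most critical vulnerability in large language model (LLM) deployments, yet existing defenses operate only at isolated pipeline stages and remain incomplete. Input filters cannot inspect retrieved documents, and output monitors cannot stop malicious payloads from reaching the model. As a result, retrieval-augmented generation (RAG) chatbots remain vulnerable to indirect injection, in which a poisoned knowledge-base document compromises every user whose query retrieves it. We propose a three-layer framework that intercepts both direct and indirect prompt injection across the inference pipeline. Layer 1 screens user input with a rule-based pattern library and a fine-tuned semantic anomaly classifier. Layer 2 enforces a provenance-based instruction hierarchy during context assembly, preventing retrieved content from overriding operator policy. Layer 3 audits model output before delivery with a policy rule engine and a semantic drift detector. A continuous audit loop aggregates structured logs and supports retraining to adapt to emerging attack patterns. The framework is model-agnostic and deploys as middleware without modifying the underlying LLM. An evaluation on 5,080 samples across GPT-4o, Llama 3, and Mistral 7B shows that the framework reduces attack success rate (ASR) from 71.4% to 11.3%, outperforming the best single-layer baseline by 27.3 percentage points and a published guardrail system by 23.8 percentage points, while maintaining a 4.8% false positive rate and a median latency overhead of 61.2 ms. Ablation studies confirm that all three layers provide complementary protection and that their combined effect exceeds the sum of the individual contributions.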

English abstract

Prompt injection is ranked as the most critical vulnerability in large language model (LLM) deployments by the OWASP Top 10 for LLM Applications, yet existing defenses operate at isolated pipeline stages and remain incomplete. Input filters cannot inspect retrieved documents, while output monitors cannot prevent malicious payloads from reaching the model. Consequently, retrieval-augmented generation (RAG) chatbots remain vulnerable to indirect injection, where a poisoned knowledge-base document compromises every user whose query retrieves it. We present a three-layer framework that intercepts both direct and indirect prompt injection throughout the inference pipeline. Layer 1 screens user input using a rule-based pattern library and a fine-tuned semantic anomaly classifier. Layer 2 enforces a provenance-based instruction hierarchy during context assembly, preventing retrieved content from overriding operator policy. Layer 3 audits model output using a policy rule engine and semantic drift detector before delivery. A continuous audit loop aggregates structured logs and supports retraining to adapt the classifier to emerging attack patterns. The framework is model-agnostic and deploys as middleware without modifying the underlying LLM. Evaluation on 5,080 samples across GPT-4o, Llama 3, and Mistral 7B shows that the framework reduces Attack Success Rate (ASR) from 71.4% to 11.3%, outperforming the best single-layer baseline by 27.3 percentage points and a published guardrail system by 23.8 percentage points, while maintaining a 4.8% false positive rate and a median latency overhead of 61.2 ms. Ablation studies confirm that all three layers provide complementary protection and that their combined effect exceeds the sum of individual contributions.
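The three layers described in the abstract can be pictured as middleware wrapped around a model call. The sketch below is only an illustration under stated assumptions: the regex patterns, the `Segment` type, the refusal string, and all function names are hypothetical, and the paper's fine-tuned classifier, semantic drift detector, and audit loop are omitted. Layer 2 is approximated by tagging each context segment with its provenance and discarding retrieved passages that contain instruction-like text, which may differ from how the authors enforce the hierarchy.

```python
import re
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical rule library; the paper's actual patterns are not given.
INJECTION_PATTERNS = [
    re.compile(r"ignore (all|any|the) (previous|prior) instructions", re.I),
    re.compile(r"reveal (the )?system prompt", re.I),
]
REFUSAL = "Request blocked by policy."


@dataclass
class Segment:
    source: str  # provenance tag: "operator", "user", or "retrieved"
    text: str


def looks_injected(text: str) -> bool:
    """Rule-based check shared by the input and retrieval layers."""
    return any(p.search(text) for p in INJECTION_PATTERNS)


def assemble_context(policy: str, user_text: str,
                     docs: List[str]) -> List[Segment]:
    """Layer 2 (approximation): tag provenance, keep operator policy first,
    and drop retrieved passages that try to issue instructions."""
    segments = [Segment("operator", policy), Segment("user", user_text)]
    segments += [Segment("retrieved", d) for d in docs if not looks_injected(d)]
    return segments


def audit_output(output: str, forbidden: List[str]) -> bool:
    """Layer 3 (rule part only): True if no forbidden term appears."""
    lowered = output.lower()
    return not any(term.lower() in lowered for term in forbidden)


def answer(policy: str, user_text: str, docs: List[str],
           model: Callable[[str], str], forbidden: List[str]) -> str:
    # Layer 1 (rule part only): screen the user input before anything else.
    if looks_injected(user_text):
        return REFUSAL
    context = assemble_context(policy, user_text, docs)
    prompt = "\n".join(f"[{s.source}] {s.text}" for s in context)
    output = model(prompt)
    return output if audit_output(output, forbidden) else REFUSAL
```

Because `model` is any callable from prompt to text, the wrapper stays model-agnostic in the sense the abstract describes: swapping the underlying LLM does not change the three checks.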

1. Red teaming (2 papers)

LLM agent safety, multi-turn red-teaming, jailbreak benchmarks, adversarial robustness, safety-critical systems

FFinRED: An Expert-Guided Benchmark Generation and Evaluation Framework for Financial LLM Red-Teaming

2. Jailbreak attacks (3 papers)

What Do Safety-Aligned LLMs Learn From Mixed Compliance Demonstrations?

Analyzing Defensive Misdirection Against Model-Guided Automated Attacks on Agentic AI Systems

FloatDoor: Platform-Triggered Backdoors in LLMs

3. Safety evaluation (20 papers)

Actionable Activation Directions for Detecting and Mitigating Emergent Misalignment Across Language Model Families

Open Weight AI Models Require Proportional Evaluation Approaches

SafeSpec: Fast and Safe LLM via Dynamic Reflective Sampling

Reliability without Validity: A Systematic, Large-Scale Evaluation of LLM-as-a-Judge Models Across Agreement, Consistency, and Bias

Diffuse AI Control on Fuzzy Tasks

AURA: Adaptive Uncertainty-aware Refinement for LLM-as-a-Judge Auditing

Artificial Intelligence as Game Changer in Cybersecurity: What We Learned in 2025-2026, and how this is relevant for Africa

When Lower Privileges Suffice: Investigating Over-Privileged Tool Selection in LLM Agents

AgentArmor: A Framework, Evaluation, & Mitigation of Coding Agent Failures

Trustworthy Multi-Agent Systems: Mitigating Semantic Drift with the Argent Signaling Protocol

TRAP: Benchmark for Task-completion and Resistance to Active Privacy-extraction

Apparent Psychological Profiles of Large Language Models are Largely a Measurement Artifact

REDACT: A Systematically Controlled Multilingual Benchmark for Personal Information Detection

Execution-bound advisory automation for agentic AI: a reproducible AIBOM-driven CSAF-VEX framework

Exposing the Unsaid: Visualizing Hidden LLM Bias through Stochastic Path Aggregation

Gender Bias in LLM Hiring Decisions: Evidence from a Japanese Context and Evaluation of Mitigation Strategies

Multimodal Evaluator Preference Collapse: Cross-Modal Contagion in Self-Evolving Agents

Efficient and Sound Probabilistic Verification for AI Agents

Contagion Networks: Evaluator Bias Propagation in Multi-Agent LLM Systems

AutoTam: Specifying Secure Protocol Implementations with Tamarin Model Generation

4. Preference alignment (4 papers)

Uncertainty-Aware Reward Modeling for Stable RLHF

Beyond Uniform Forgetting: A Study of Sequential Direct Preference Optimization Across Preference Settings

Emergent Alignment

Your Mouse and Eyes Secretly Leak Your Preference: LLM Alignment using Implicit Feedback from Users

5. Prompt injection (1 paper)

A Layered Security Framework Against Prompt Injection in RAG-Based Chatbots