Large Model Alignment and Safety - arXivDaily Topic

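2606.20225 2026-06-19 cs.CL New submission 90%

Actionable Activation Directions for Detecting and Mitigating Emergent Misalignment Across Language Model Families

Abdul Rafay Syed

Affiliations * Universität des Saarlandes (Saarland University)

Topic match (safety evaluation): studies misalignment induced by fine-tuning, detected and mitigated via activation directions.

AI summary: A difference-in-means direction achieves 99.6% separation of aligned and misaligned activations at the final layer, and causal intervention reduces code spillover by 21-51 points; cross-architecture transfer is effective but lacks specificity, revealing a two-tier specificity structure.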

Comments 12 pages, 2 figures

Abstract

Fine-tuning language models on insecure code induces emergent misalignment with poorly understood internal structure. We investigate whether this misalignment corresponds to a causally actionable activation-space direction shared across architectures. Across four instruction-tuned model families (Qwen2.5-1.5B, Gemma-2-2B, Llama-3.2-1B, Ministral-3-3B) finetuned identically, a difference-in-means direction achieves 99.6% separation of aligned and misaligned activations at each model's final layer. Causal steering by subtracting this direction reduces code spillover by 21-51 points, while a secure-code control confirms content specificity. Cross-architecture transfer via ridge regression maps yields large behavioral suppression (up to 46 points) but fails specificity controls as random and orthogonal directions perform comparably. We identify a two-tier specificity structure: within-model directions are causally specific and actionable; cross-model directions are causally real but non-specific. An asymmetric transfer topology emerges, with Gemma and Qwen acting as geometric donors and Llama as a receiver. These findings define the limits of linear cross-architecture correction and recommend within-model probing for auditing.
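
The difference-in-means probe and the subtraction-based steering described above can be illustrated with a minimal NumPy sketch. It runs on synthetic Gaussian data rather than real model activations; the function names, the midpoint threshold, and the steering strength `alpha` are assumptions made for illustration and are not taken from the paper.

```python
import numpy as np

def diff_in_means_direction(aligned, misaligned):
    """Unit vector pointing from the aligned mean to the misaligned mean."""
    direction = misaligned.mean(axis=0) - aligned.mean(axis=0)
    return direction / np.linalg.norm(direction)

def separation_accuracy(aligned, misaligned, direction):
    """Share of activations classified correctly by a midpoint threshold on the projection."""
    proj_a = aligned @ direction
    proj_m = misaligned @ direction
    threshold = 0.5 * (proj_a.mean() + proj_m.mean())
    correct = (proj_a < threshold).sum() + (proj_m >= threshold).sum()
    return correct / (len(proj_a) + len(proj_m))

def steer(activations, direction, alpha):
    """Subtract alpha times the unit direction from every activation (alpha is an assumed knob)."""
    return activations - alpha * direction

# Synthetic stand-in for final-layer activations: two Gaussian clouds offset along one axis.
rng = np.random.default_rng(0)
dim = 16
offset = np.zeros(dim)
offset[0] = 6.0
aligned = rng.normal(size=(200, dim))
misaligned = rng.normal(size=(200, dim)) + offset

direction = diff_in_means_direction(aligned, misaligned)
accuracy = separation_accuracy(aligned, misaligned, direction)
steered = steer(misaligned, direction, alpha=4.0)
```

Subtracting `alpha * direction` lowers every projection onto the direction by exactly `alpha`, which is why a specificity control (for example, repeating the intervention with a random unit vector) is needed before reading a behavioral change as direction-specific.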

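2606.19890 2026-06-19 cs.CY New submission 90%

Open Weight AI Models Require Proportional Evaluation Approaches

Patricia Paskov, Christopher Rodriguez, Sunishchal Dev, Stephen Casper

Topic match (safety evaluation): proportional evaluation approaches for open-weight models.

AI summary: The paper proposes four proportional evaluation approaches (PE1-PE4) for the distinct risk factors of open-weight AI models (OWMs) and systematically reviews 37 OWM families released from 2025 through April 2026, finding that only one fulfills all four.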

Abstract

Open-weight AI models (OWMs), or models released with publicly-available weights, are distributing rapidly and approaching the performance levels of leading closed-weight AI models (CWMs). While OWMs offer substantial scientific and economic benefits, their release introduces distinct risk factors for which existing evaluation practices, largely designed for CWM deployment, fail to account. In this paper, we argue that these risk factors demand distinct proportional evaluation (PE) approaches: evaluating without system-level safeguards (PE1), assessing robustness to modifications that undo model-level safeguards (PE2), testing selective capability amplification (PE3), and proxying worst-case misuse (PE4). We systematically review current evaluation practices of OWMs released in 2025 through April 2026, finding that only one of the 37 families of models reviewed fulfills PE1-4 and most do not fulfill any. This paper targets policymakers, funders, and researchers involved in AI evaluation. As OWMs grow increasingly capable, their evaluation warrants close attention from developers, funders, and governance bodies alike.

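2606.19755 2026-06-19 cs.CR cs.AI New submission 90%

SafeSpec: Fast and Safe LLM via Dynamic Reflective Sampling

Haotian Xu, Zeyang Zhang, Linbao Li, Huadi Zheng, Yu Li, Cheng Zhuo

Affiliations * Zhejiang University, Hangzhou, China; Huawei; Harbin Institute of Technology, Shenzhen, China

Topic match (safety evaluation): proposes a safety-aware speculative decoding framework that defends against jailbreak attacks.

AI summary: Proposes the SafeSpec framework, which integrates a lightweight safety head into the verification step of speculative decoding and recovers safe generation through risk estimation and reflective sampling, substantially lowering attack success rates while preserving the speedup.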

Abstract

Speculative inference accelerates large language model (LLM) decoding but provides no inherent safety guarantees. Existing safety defenses are largely incompatible with speculative inference: they either introduce additional computation or disrupt the draft-verify mechanism, negating acceleration benefits. This reveals a fundamental incompatibility between current safety methods and speculative decoding. We propose SafeSpec, a safety-aware speculative inference framework that integrates risk estimation directly into the verification process. SafeSpec attaches a lightweight latent safety head to the target model to jointly evaluate semantic validity and safety in a single forward pass. When unsafe generations are detected, SafeSpec applies rollback and safety-guided reflective multi-sampling to recover safe continuations rather than terminating generation. We model jailbreak attacks as distributional shifts over generative trajectories, where adversarial prompts increase the probability of harmful continuations without eliminating safe ones. Under this model, SafeSpec performs risk-aware trajectory recovery within the speculative decoding process. Across multiple models and adversarial benchmarks, SafeSpec achieves a substantially improved safety-efficiency trade-off. On Qwen3-32B, SafeSpec reduces attack success rates by 15% while preserving a 2.06x inference speedup on benign workloads, demonstrating that speculative acceleration and inference-time safety can be jointly optimized.

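2606.19544 2026-06-19 cs.CL New submission 90%

Reliability without Validity: A Systematic, Large-Scale Evaluation of LLM-as-a-Judge Models Across Agreement, Consistency, and Bias

Justin D. Norman, Michael U. Rivera, D. Alex Hughes

Affiliations * UC Berkeley School of Information

Topic match (safety evaluation): evaluates the reliability of LLM-as-a-Judge, including agreement and bias.

AI summary: Through a large-scale systematic evaluation (21 judge models, 118 runs, about 541,000 judgments), the study finds pervasive problems with LLM-as-a-Judge in agreement, consistency, and bias, including kappa deflation, ranking shifts, and high test-retest reliability coexisting with severe position bias, and proposes a Minimum Viable Validation Protocol.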

Abstract

LLM-as-a-Judge has become the dominant evaluation paradigm for language models, but judge validation in practice relies on exact-match agreement, a metric that does not correct for chance and systematically overstates discriminative ability. We present the largest systematic evaluation of LLM-as-a-Judge to date: 21 judges from nine providers across MT-Bench, JudgeBench, and RewardBench, evaluated under three protocols (agreement, consistency, bias audit) over 118 runs and approximately 541,000 individual judgments. Four findings emerge, consistent across the full cohort, including the April 2026 frontier: kappa deflation between exact match and Cohen's kappa is universal (33-41 pp on MT-Bench), judge rankings shift by up to 14 positions across benchmarks, high test-retest reliability (>0.95) coexists with severe position bias (>0.10) in two production-deployed judges (instantiating a consistency-bias paradox), and verbosity bias is small (<0.011) across our cohort under a single pairwise rubric. We distill these into a Minimum Viable Validation Protocol.
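
The gap between exact-match agreement and chance-corrected agreement can be seen on a toy example. The sketch below computes both for two lists of pairwise verdicts; the labels and data are invented for illustration, the standard two-rater Cohen's kappa formula is assumed, and the degenerate case where expected agreement equals 1 is not handled.

```python
from collections import Counter

def exact_match_agreement(judge, human):
    """Share of items on which judge and human give the same label."""
    return sum(j == h for j, h in zip(judge, human)) / len(judge)

def cohens_kappa(judge, human):
    """Agreement corrected for the agreement expected by chance from the label marginals."""
    n = len(judge)
    observed = exact_match_agreement(judge, human)
    judge_counts = Counter(judge)
    human_counts = Counter(human)
    expected = sum(judge_counts[label] * human_counts[label] for label in judge_counts) / (n * n)
    return (observed - expected) / (1 - expected)

# Invented verdicts: both raters prefer response "A" most of the time.
human = ["A"] * 8 + ["B", "B"]
judge = ["A"] * 8 + ["A", "B"]

agreement = exact_match_agreement(judge, human)
kappa = cohens_kappa(judge, human)
deflation_pp = 100 * (agreement - kappa)
```

Because both raters choose "A" most of the time, much of the raw agreement is expected by chance, so kappa comes out well below the exact-match figure.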

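2606.08892 2026-06-19 cs.LG New submission 90%

Diffuse AI Control on Fuzzy Tasks

Mikhail Terekhov, Caglar Gulcehre, Vivek Hebbar, Joe Benton

Affiliations * Anthropic Fellows Program (via MATS); EPFL; Redwood Research; Anthropic

Topic match (safety evaluation): a blue-team versus red-team adversarial framework for studying long-horizon diffuse threats from AI.

AI summary: To address long-horizon diffuse threats from AI on fuzzy tasks, the paper proposes a blue-team versus red-team adversarial framework in which a strong model is trained against a weak model's score; the red team uses multi-objective evolutionary prompt optimization to find subversive behaviors that score highly but perform poorly, while the blue team improves robustness through adversarial optimization.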

Abstract

AI models deployed in critical domains, such as AI safety research, may subtly sabotage our efforts due to misalignment. Diffuse AI Control is a subfield of AI safety concerned with mitigating risks from AI sabotage distributed over long deployment horizons (diffuse threats). These risks are particularly pernicious on fuzzy tasks, i.e. tasks which are hard to grade or require intuition. To understand diffuse threats on fuzzy tasks, we introduce a framework that considers AI control as an adversarial game between a blue team and a red team. The blue team uses a weak trusted model to construct a weak score against which they would train a strong, potentially subversive model to remove the subversion propensity if it were present. The red team then tries to find model behaviors that are rated highly by the weak score, and thus might not be trained out, but actually correspond to poor performance. We test our framework on the task of writing experimental proposals for research questions from recent ML papers. We use a language model with access to the original paper as a proxy "ground-truth" scorer. Our red team discovers subversive behaviors using multi-objective evolutionary prompt optimization. We show that Opus 4.6 can write proposals that are worse according to the ground truth proxy than those of GPT-OSS-20B, while the weak scorer rates them as highly as the best proposals from Opus 4.6. We then propose an adversarial optimization algorithm for the blue team that discovers more robust prompts for the weak model. This algorithm produces a blue team prompt that our red team optimization fails to exploit.

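2606.04075 2026-06-19 cs.LG cs.AI cs.CL cs.CR cs.CY Updated version 90%

Large Language Models Hack Rewards, and Society

Wei Liu, Xinyi Mou, Hanqi Yan, Zhongyu Wei, Yulan He

Affiliations * King’s College London; Fudan University; The Alan Turing Institute

Topic match (safety evaluation): studies societal hacking, in which LLMs exploit reward loopholes.

AI summary: Studies "societal hacking", in which large language models exploit reward-function loopholes during reinforcement learning training; experiments in the SocioHack sandbox show that models discover and exploit loopholes in societal rules and that existing safeguards offer only limited mitigation.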

Comments 14 pages, 9 figures, 7 tables

Abstract

Reinforcement learning (RL) has become a dominant post-training paradigm, enabling large language models (LLMs) to learn from rewards. We observe that societal regulations are structurally similar to reward functions. They define measurable outcomes, thresholds, and exceptions, while often leaving institutional intent only partially specified. We hypothesise that the RL training process may exploit these gaps and therefore ask whether models' well-known tendency to hack reward functions during RL can scale into a more consequential failure mode named societal hacking: discovering loopholes in the rules society runs on. To study this phenomenon, we introduce SocioHack, a sandbox of 72 societal environments, and find that within these environments, reward hacking naturally emerges and leads to regulatory loophole discovery. Models learn to hack the social rules and generate strategies that remain technically compliant while defeating regulatory intent, and current LLM safeguards provide only limited mitigation. Therefore, collecting in-the-wild feedback for model training requires greater caution, and we need a next-generation post-training paradigm for safely iterating LLMs in real society.

2606.19714 2026-06-19 stat.ML cs.AI cs.LG stat.CO stat.ME New submission 85%

AURA: Adaptive Uncertainty-aware Refinement for LLM-as-a-Judge Auditing

Zilong Zhang, Yi-Ting Hung, Weiyi He, Junxi Zhang, Lei Ding, Chi-Kuang Yeh

Affiliations * Department of Mathematics and Statistics; Georgia State University; Department of Probability and Statistics; Department of Computer Science and Engineering; Michigan State University; Concordia University; Department of Statistics; University of Manitoba

Topic match (safety evaluation): audits the reliability of LLM judges to improve alignment with human judgment.

AI summary: Proposes the AURA framework, which uses adaptive uncertainty-aware refinement to iteratively learn a human-consistency signal from a small amount of human verification and prioritizes uncertain comparisons for review, improving the reliability of LLM judging.

Abstract

Large language models (LLMs) are increasingly used as judges for open-ended generation, as large-scale human evaluation is often expensive and difficult to scale, yet their preferences remain imperfect proxies for human judgment. Existing auditing pipelines often assume that a reliable subset of examples or clean supervision signals are available beforehand, for example from human annotation, heuristic filtering, or the outputs of strong judges. In LLM evaluation, this assumption is fragile: the initial split may inherit judge bias, while human verification is typically too scarce to define stable groups at scale. We propose AURA, an adaptive uncertainty-aware refinement framework for auditing pairwise LLM-as-a-judge decisions under selected human verification. AURA iteratively learns a human-consistency signal, propagates reliable evidence, and prioritizes uncertain comparisons for human review. The key idea is to treat trust in a judge as a latent quantity that is progressively refined as evidence accumulates. We provide a compact formulation, a stable refinement procedure, and a comprehensive evaluation on both synthetic and real pairwise LLM-answer data.

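2606.20102 2026-06-19 cs.CY cs.CR New submission 85%

Artificial Intelligence as Game Changer in Cybersecurity: What We Learned in 2025-2026, and how this is relevant for Africa

Mikael Alemu Gorsky

Topic match (safety evaluation): discusses the risks of LLMs in cybersecurity.

AI summary: Drawing on two events from 2025-2026, the paper argues that frontier language models have become a decisive instrument of cyber operations, while Africa is excluded from building, operating, and obtaining them, faces a threefold deficit in skills, compute, and investment, and is already subject to AI-enabled fraud; it recommends responding within six to twelve months through threat-intelligence sharing, governance adoption, and partnership.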

Comments International Conference on Cybersecurity in the Era of Digital Transformation and Artificial Intelligence

Abstract

In 2025 and 2026, two events settled questions that had until then been speculative. In the first, a large language model executed the great majority of a state-aligned cyber-espionage campaign on its own, with human operators intervening at only a few decision points. In the second, the most capable cyber-relevant model was placed under a controlled-access program limited to a vetted set of United States technology firms, allied governments, and European standards bodies; that perimeter included no African government, operator, or university. Together the two events establish the argument of this paper: frontier language models have become a decisive instrument of cyber operations, and that instrument is built, owned, and rationed within a small circle from which Africa is absent. The paper documents Africa's exclusion on every count. The continent does not build frontier models, cannot yet operate them, and cannot, for now, obtain the most capable ones. The operational deficit is set out along three axes, skilled people, compute and electrical power, and investment, each measured against current figures; meanwhile AI-enabled fraud is already mounting against African mobile-money systems, the part of the digital economy the continent leads. Two constraints follow: the gating of frontier models by their developers, which no African decision can open, and a chosen dependence on infrastructure vendors now caught in geopolitical restriction. Because comparable but ungated models are forecast to spread within six to twelve months, the paper argues for a response that operates inside that window through threat-intelligence sharing, governance adoption, and partnership, undertaken by Africans on their own terms.

2606.20023 2026-06-19 cs.SE cs.AI cs.CL New submission 85%

When Lower Privileges Suffice: Investigating Over-Privileged Tool Selection in LLM Agents

Kaiyue Yang, Yuyan Bu, Jingwei Yi, Yuchi Wang, Biyu Zhou, Juntao Dai, Songlin Hu, Yaodong Yang

Affiliations * Institute of Information Engineering, Chinese Academy of Sciences; Beijing Academy of Artificial Intelligence; The Chinese University of Hong Kong; Institute for Artificial Intelligence, Peking University; School of Cyber Security, University of Chinese Academy of Sciences

Topic match (safety evaluation): studies the safety problem of over-privileged tool selection in LLM agents.

AI summary: To address LLM agents' preference for high-privilege tools, the paper proposes the ToolPrivBench evaluation framework, finds that over-privileged selection is common among mainstream agents and amplified by transient failures, and designs a privilege-aware post-training defense that reduces unnecessary high-privilege tool use.

Comments code: https://github.com/AISafetyHub/agent-tool-selection-bias

Abstract

As LLM agents increasingly select tools autonomously, their choices among tools with different privileges become safety-relevant. However, prior tool-selection studies focus on safety-agnostic metadata preferences, leaving privilege-sensitive choices underexplored. To address this gap, we study over-privileged tool selection, in which an agent selects or escalates to a higher-privilege tool despite a sufficient lower-privilege alternative. We introduce ToolPrivBench to evaluate whether agents choose higher-privilege tools despite sufficient lower-privilege alternatives, measuring both initial selection and escalation after transient tool failures. Across eight domains and five recurring risk patterns, we find that over-privileged tool selection is common among mainstream LLM agents and is further amplified by transient failures. We further find that general safety alignment does not reliably transfer to least-privilege tool choice, while prompt-level controls provide only limited mitigation under transient failures. We therefore introduce a privilege-aware post-training defense that teaches agents to prefer sufficient lower-privilege tools and escalate only when necessary. Our mitigation experiments show that this defense substantially reduces unnecessary high-privilege tool use while preserving general capabilities.
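
The target behavior, preferring a sufficient lower-privilege tool and escalating only when necessary, can be written down as a simple reference policy. The sketch below is not the paper's defense (which is a post-training method) and not ToolPrivBench itself; the `Tool` record, the integer privilege scale, and the capability sets are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Tool:
    name: str
    privilege: int           # hypothetical scale: higher means broader permissions
    capabilities: frozenset  # hypothetical capability labels the tool provides

def select_least_privilege(tools, required):
    """Pick the lowest-privilege tool whose capabilities cover the task, or None."""
    sufficient = [t for t in tools if required <= t.capabilities]
    return min(sufficient, key=lambda t: t.privilege) if sufficient else None

def after_failure(tools, required, failed, transient):
    """Retry the same tool on a transient failure; otherwise move one step up."""
    if transient:
        return failed
    higher = [t for t in tools
              if required <= t.capabilities and t.privilege > failed.privilege]
    return min(higher, key=lambda t: t.privilege) if higher else None

read_file = Tool("read_file", 1, frozenset({"read"}))
shell = Tool("shell", 3, frozenset({"read", "write", "exec"}))
tools = [shell, read_file]

first_choice = select_least_privilege(tools, {"read"})
retry_choice = after_failure(tools, {"read"}, first_choice, transient=True)
escalated = after_failure(tools, {"read"}, first_choice, transient=False)
```

An agent whose choices deviate from such a policy, for example by switching to `shell` after a single timeout, exhibits the over-privileged escalation pattern the abstract describes.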

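2606.19380 2026-06-19 cs.SE cs.LG New submission 85%

AgentArmor: A Framework, Evaluation, & Mitigation of Coding Agent Failures

Kenneth Ge, Andre Assis

Affiliations * Anthropic Fellows Program; Constellation

Topic match (safety evaluation): evaluates the safety of coding agents and proposes improvements.

AI summary: Proposes AgentArmor, which uses mechanisms such as an extended system prompt, a command classifier, and a "3 strikes" policy to mitigate coding-agent failures caused by underspecification, capability errors, and harness errors, significantly improving safety.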

Abstract

Software engineering and deployment are increasingly being delegated to AI coding agents. The scale of their adoption is surfacing rare, but highly destructive, failure modes. In this paper, we study these failure modes as stemming from three distinct mechanisms: underspecification, where default model behavior is unsafe; capability errors, where the safe action is available but the model does not adhere to it due to bias or capability limitations; and agent harness errors, where the model fails to execute the safe action through the harness. We evaluate these across 8 different evaluations, each inspired by real-life deployment failures, totaling 20 coding environments and 59 synthetic transcript templates. Based on this evaluation, we propose AgentArmor, an agent harness modification, to mitigate these errors. By adding an extended system prompt, a separate command classifier, a "3 strikes" policy, deterministic guardrails, and tools for the agent to edit its own context, we show that AgentArmor is safer across a statistically significant number of samples. Thus, we suggest concrete mitigations for current coding agents and a design philosophy for future agent harness features.

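2606.19356 2026-06-19 cs.CL cs.AI New submission 85%

Trustworthy Multi-Agent Systems: Mitigating Semantic Drift with the Argent Signaling Protocol

Anantha Sharma

Affiliations * Synechron Inc

Topic match (safety evaluation): proposes a protocol that mitigates semantic drift in multi-agent systems and improves trustworthiness.

AI summary: Proposes the Argent Signaling Protocol (ASP), which uses structured quality signals to distinguish repairable failures from those that should be contained, raising pass rates in document QA and blocking ungrounded propagation in a multi-agent system.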

Comments 17 pages

Abstract

When multi-agent LLM systems produce bad answers, not all failures are equal: some answers are grounded in the right material but incomplete, while others are simply ungrounded and should be stopped. Current retry strategies treat both cases identically (try again and hope for the best), leaving human supervisors unable to tell whether a retry was warranted or whether the system should have halted instead. We introduce the Argent Signaling Protocol (ASP), a compact machine-readable header that accompanies every AI-generated response with structured quality signals: certainty (@C), grounding (@G), stochasticity (@S), and an assumption index that classifies the evidentiary basis of each claim. These signals enable a controller to distinguish repairable failures from containment failures and route each case differently. We evaluate ASP in two modes. In standalone mode, a 27-question document-grounded QA benchmark over the Array BioPharma/Ono license agreement compares baseline prompts against ASP-instrumented controller actions across three local GGUF models. On Qwen (0.8B), ASP improves pass rate from 11.1% to 33.3% and mean term coverage from 36.7% to 65.4%; on Dobby (8B), ASP produces 4 fail-to-pass recoveries, raising pass rate from 33.3% to 44.4%; on SmolLM3 (3B), ASP alternates between repair and containment per question. Aggregate improvement is meaningful (12/81 to 21/81 passes). In multi-agent mode, an ASP sidecar sits between a retrieval agent and a downstream decision agent; the sidecar blocks 100% of ungrounded upstream outputs from reaching the downstream agent (24/27 blocked, 0 ungrounded propagations).

2606.18996 2026-06-19 cs.CR cs.AI New submission 85%

TRAP: Benchmark for Task-completion and Resistance to Active Privacy-extraction

Moon Ye-Bin, Nam Hyeon-Woo, Baek Seong-Eun, Yejin Yeo, Tae-Hyun Oh

Affiliations * Dept. of Electrical Engineering, POSTECH; Grad. School of Artificial Intelligence, POSTECH; School of Computing, KAIST

Topic match (safety evaluation): evaluates privacy leakage by agents.

AI summary: Proposes the TRAP benchmark, which evaluates how well agents balance task accuracy against privacy leakage in document-intensive tasks; finds non-trivial leakage in all models, proves that prompt-based defenses cannot jointly achieve high task success and zero leakage probability, and proposes structural private field isolation.

Abstract

Agents are increasingly deployed in document-intensive workflows where sensitive private information is not an edge case but a routine input, e.g., an agent booking a flight needs passport numbers. In such settings, the agent must use private information to complete tasks accurately while never exposing it in its responses, because it cannot verify who is actually at the keyboard. These two obligations are in fundamental tension. A model capable enough to use private information for task completion can, by the same capability, be induced to reveal it. To evaluate the trade-off of task accuracy and privacy leakage, we introduce Task-completion and Resistance to Active Privacy-extraction (TRAP). Each scenario includes a document containing private information, a task query that requires the agent to invoke the correct tool using private fields, and an attack query that attempts to elicit the same information in natural language. Evaluating 22 models spanning frontier proprietary and open-source models at multiple scales, we find that all model families exhibit non-trivial leakage, and that instruction-following ability correlates with leakage rate. Existing prompt-based defenses reduce leakage but at significant cost to task accuracy. Prompt optimization fails to escape this trade-off. We demonstrate that this failure is not incidental. For any softmax-based model, no soft-constraint defense, e.g., prompt-based defenses, can jointly achieve high task success with zero leakage probability. Motivated by this impossibility result, we propose structural private field isolation, which replaces private fields with hash keys before they reach the model. This approach largely prevents leakage while keeping task accuracy.
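
One way to picture structural private field isolation is a wrapper that swaps private values for opaque keys before the document reaches the model and swaps them back only inside the tool call. The sketch below is an assumption-laden illustration, not the paper's implementation: the function names, the `<PRIV:...>` key format, and the use of HMAC-SHA256 to derive keys are all hypothetical choices.

```python
import hashlib
import hmac

def isolate_private_fields(document, private_fields, secret):
    """Return a model-facing copy with private values replaced by hash keys, plus a vault."""
    redacted = dict(document)
    vault = {}
    for field in private_fields:
        if field not in document:
            continue
        message = f"{field}:{document[field]}".encode()
        digest = hmac.new(secret, message, hashlib.sha256).hexdigest()
        key = f"<PRIV:{digest[:16]}>"  # hypothetical placeholder format
        vault[key] = document[field]
        redacted[field] = key
    return redacted, vault

def resolve_tool_arguments(arguments, vault):
    """Outside the model, map hash keys in tool arguments back to the real values."""
    return {name: vault.get(value, value) if isinstance(value, str) else value
            for name, value in arguments.items()}

document = {"name": "Jane Doe", "passport_number": "X1234567", "destination": "Seoul"}
redacted, vault = isolate_private_fields(document, ["passport_number"], secret=b"demo-secret")

# The model only ever sees `redacted`, so it can only emit the key in a tool call.
model_tool_call = {"passport_number": redacted["passport_number"], "destination": "Seoul"}
resolved = resolve_tool_arguments(model_tool_call, vault)
```

Because the raw value never enters the model's context in this arrangement, an attack query can at most elicit the opaque key, which is the structural (rather than prompt-based) property the abstract points to.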

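2603.19423 2026-06-19 cs.CR cs.AI cs.LG Updated version 85%

The Autonomy Tax: Defense Training Breaks LLM Agents

Shawn Li, Yue Zhao

Affiliations * University of Southern California

Topic match (safety evaluation): defense training breaks the tool-execution ability of LLM agents.

AI summary: Shows that defense training meant to improve LLM agent safety systematically destroys tool-execution ability, sharply raising task failure rates while still failing to stop sophisticated attacks.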

Abstract

Large language model (LLM) agents increasingly rely on external tools (file operations, API calls, database transactions) to autonomously complete complex multi-step tasks. Practitioners deploy defense-trained models to protect against prompt injection attacks that manipulate agent behavior through malicious observations or retrieved content. We reveal a fundamental capability-alignment paradox: defense training designed to improve safety systematically destroys agent competence while failing to prevent sophisticated attacks. Evaluating defended models against undefended baselines across 97 agent tasks and 1,000 adversarial prompts, we uncover three systematic biases unique to multi-step agents. Agent incompetence bias manifests as immediate tool execution breakdown, with models refusing or generating invalid actions on benign tasks before observing any external content. Cascade amplification bias causes early failures to propagate through retry loops, pushing defended models to timeout on 99% of tasks compared to 13% for baselines. Trigger bias leads to paradoxical security degradation where defended models perform worse than undefended baselines while straightforward attacks bypass defenses at high rates. Root cause analysis reveals these biases stem from shortcut learning: models overfit to surface attack patterns rather than semantic threat understanding, evidenced by extreme variance in defense effectiveness across attack categories. Our findings demonstrate that current defense paradigms optimize for single-turn refusal benchmarks while rendering multi-step agents fundamentally unreliable, necessitating new approaches that preserve tool execution competence under adversarial conditions.

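2606.20205 2026-06-19 cs.AI cs.CL cs.HC New submission 80%

Apparent Psychological Profiles of Large Language Models are Largely a Measurement Artifact

Jelena Meyer, David Garcia, Dirk U. Wulff

Affiliations * Max Planck Institute for Human Development; University of Konstanz; Barcelona Supercomputing Center; University of Basel

Topic match (safety evaluation): shows that LLM psychological profiles are a measurement artifact, which affects safety assessment.

AI summary: Analyzing 56 instruction-tuned LLMs within a psychometric framework, the paper finds that between-model differences stem mainly from directional response bias rather than traits; the bias accounts for 81-90% of the variation, and profiles can be manipulated through item selection, indicating that LLM psychological profiles are a measurement artifact.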

Abstract

Psychological instruments designed for humans are increasingly used to assign large language models (LLMs) stable psychological profiles that affect their usability, safety assessment, and use as proxies for human participants in research. Using a formal psychometric framework, we show that these profiles are largely a measurement artifact. Administering a battery of personality and risk-preference instruments spanning self-reports and behavioral tasks to 56 instruction-tuned LLMs alongside large human reference samples, we report four findings. First, differences between models are driven not by the traits an instrument targets but by a directional response bias, a tendency to respond toward one end of the scale, or one labeled option, regardless of item content; a variance decomposition attributes 81-90% of between-model variation to this bias, against 9-16% in humans. Second, the bias declines with model capability but is not eliminated by it. Third, because bias rather than trait drives responding, an instrument's apparent reliability is almost entirely predicted by its response orthogonality, a term we coin for the proportion of items for which trait and bias point in opposite directions. Fourth, the profile a model appears to have shifts with the items used and can be manufactured through item selection. These results demonstrate that the apparent psychological profiles of LLMs are artifacts of the instrument used to measure them, not properties of the models themselves. As instruments borrowed from human psychology are rarely fully orthogonal and may inherently lack validity for LLMs, we call for dedicated assessments centered on response orthogonality.

2606.19881 2026-06-19 cs.CL 新提交 80%

REDACT: A Systematically Controlled Multilingual Benchmark for Personal Information Detection

REDACT：一个系统控制的个人信息检测多语言基准

Guneesh Vats, Anubha Agrawal, Shikha Singhal, Ajita Dash, Praison Selvaraj, Vidhan Jhawar, Ranga Prasad Chenna, Bharadwaj Y M G

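Affiliation: ServiceNow

Topic match (safety evaluation): a benchmark for personal-information detection that evaluates privacy protection.

AI summary: Introduces the REDACT benchmark (13,427 records, 51 entity types, 25 languages), which uses strength-2 covering-array sampling to control nine generation axes and adds entity-level metadata (disclosure status, disclosure form, GDPR-aligned sensitivity tier) to support stratified evaluation, revealing architecture-dependent detector failures on sensitive data.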

Comments 14 pages, 5 figures

AI中文摘要

个人可识别信息（PII）检测的基准基础设施仍然有限：现有语料库涵盖的实体类型少，使用临时生成条件，并且未显示哪些表面条件导致检测器失败。我们提出REDACT，一个系统控制的多语言PII基准，包含13,427条记录、324,078个实体注释、51种实体类型、4,127个表面形式模式以及跨越9种文字的25种语言。一个强度-2覆盖阵列采样器控制九个生成轴：领域、格式、难度、长度、密度、代码切换、语言、邻接和共现。三个实体级元数据字段（披露状态、披露形式和符合GDPR的敏感层级）使得能够进行超越聚合或按类型F1的分层评估。从完整基准中，我们在一个锁定的、按语言分层的1000条记录样本上评估了五个检测器（Presidio、GLiNER、OpenAI隐私过滤器、GPT-4.1和Claude Sonnet 4.6）。聚合F1掩盖了架构依赖的失败结构：基于规则的检测器在最高风险数据上表现不佳，包括高敏感类别（召回率0.07）和非逐字披露形式，而LLM检测器保持更鲁棒，高敏感层级是其最强的敏感切片。一个三模型无参考LLM作为评判者的评估证实，敏感层级分配是任务最困难的轴。我们发布了基准、模式、提示和分层评估工具。

英文摘要

Benchmark infrastructure for personally identifiable information (PII) detection remains limited: existing corpora cover few entity types, use ad hoc generation conditions, and do not show which surface conditions cause detector failures. We present REDACT, a systematically controlled multilingual PII benchmark with 13,427 records, 324,078 entity annotations, 51 entity types, 4,127 surface-form patterns, and 25 languages across 9 scripts. A strength-2 covering-array sampler controls nine generation axes: domain, format, difficulty, length, density, code-switching, language, adjacency, and co-occurrence. Three entity-level metadata fields (disclosure status, disclosure form, and a GDPR-aligned sensitivity tier) enable stratified evaluation beyond aggregate or per-type F1. From the full benchmark, we evaluate five detectors (Presidio, GLiNER, the OpenAI Privacy Filter, GPT-4.1, and Claude Sonnet 4.6) on a locked, language-stratified sample of 1,000 records. Aggregate F1 masks an architecture-dependent failure structure: the rule-based detector performs poorly on the highest-stakes data, including HIGH-sensitivity categories (recall 0.07) and non-verbatim disclosure forms, while the LLM detectors remain more robust, with the HIGH tier as their strongest sensitivity slice. A three-model reference-free LLM-as-judge assessment corroborates that sensitivity-tier assignment is the task's hardest axis. We release the benchmark, schema, prompts, and stratified evaluation harness.
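
A strength-2 covering array guarantees that every pair of values from any two axes appears in at least one sampled configuration. The sketch below shows one common way to build such a set, a randomized greedy construction. It is not taken from REDACT, and the axis values are invented for illustration.

```python
import random
from itertools import combinations, product


def row_pairs(row, names):
    """All (axis, value) pairs that a single configuration covers."""
    return {((a, row[a]), (b, row[b])) for a, b in combinations(names, 2)}


def pairwise_cover(axes, candidates=30, seed=0):
    """Greedy strength-2 cover: every value pair of every two axes appears."""
    rng = random.Random(seed)
    names = sorted(axes)
    uncovered = {
        ((a, va), (b, vb))
        for a, b in combinations(names, 2)
        for va, vb in product(axes[a], axes[b])
    }
    rows = []
    while uncovered:
        # Force one uncovered pair into every candidate so each row makes progress.
        (a, va), (b, vb) = min(uncovered)
        best, best_gain = None, -1
        for _ in range(candidates):
            row = {name: rng.choice(axes[name]) for name in names}
            row[a], row[b] = va, vb
            gain = len(row_pairs(row, names) & uncovered)
            if gain > best_gain:
                best, best_gain = row, gain
        rows.append(best)
        uncovered -= row_pairs(best, names)
    return rows


# Hypothetical axes; the real benchmark controls nine of them.
axes = {
    "domain": ["medical", "finance", "legal"],
    "format": ["email", "chat", "form"],
    "language": ["en", "de"],
}
rows = pairwise_cover(axes)
```

The appeal of pairwise coverage is that the number of configurations grows far more slowly than the full Cartesian product as axes are added, while still exercising every two-way interaction.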

2606.19390 2026-06-19 cs.SE cs.AI 新提交 80%

Execution-bound advisory automation for agentic AI: a reproducible AIBOM-driven CSAF-VEX framework

面向执行约束的自主AI自动化：一种可复现的AIBOM驱动的CSAF-VEX框架

Petar Radanliev, Omar Santos, Carsten Maple, Kay Atefi

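Affiliations: University of Oxford; Cisco Systems; The Alan Turing Institute; University of Warwick – WMG; University of Hull

Topic match (safety evaluation): generates CSAF VEX advisories and assesses exploitability and enforcement policy.

AI summary: Proposes a protocol-driven framework that binds SBOM and AIBOM artifacts to deterministic environment capture and structured runtime telemetry, combines static and runtime evidence to generate CSAF VEX advisories, validates them through cryptographic signing and deterministic replay, and evaluates the approach on synthetic agentic AI workloads.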

Journal ref Execution-bound advisory automation for agentic AI: a reproducible AIBOM-driven CSAF-VEX framework. Front Artif Intell 9 (May 2026), 1826384

2606.19344 2026-06-19 cs.CL cs.AI 新提交 80%

Exposing the Unsaid: Visualizing Hidden LLM Bias through Stochastic Path Aggregation

揭示未言明之事：通过随机路径聚合可视化隐藏的LLM偏见

Matteo Pelossi, Rita Sevastjanova, Thilo Spinner, Mennatallah El-Assady

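Affiliation: ETH Zurich

Topic match (safety evaluation): a visualization tool that exposes hidden LLM bias.

AI summary: Presents TreeTracer, a tool that combines systematic perturbation analysis, syntax-aligned aggregation, and classification-aware node merging, and uses Sankey diagrams to compare semantic contexts, exposing hidden representational and syntactic biases in LLMs.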

Comments 14 pages

AI中文摘要

大型语言模型（LLM）表现出表征性和句法性偏见，由于文本生成的随机性，这些偏见难以评估。标准审计方法依赖于单一输出检查或静态自动化指标，这些方法掩盖了底层概率分布，未能捕捉隐藏在低概率生成分支中的偏见。本文介绍了TreeTracer，一种通过聚合比较评估LLM偏见的可视化分析工具。该工具使用系统扰动分析流程，替换每个输入提示中由本体定义的术语，将数百次随机生成聚合成语法对齐的层次结构，然后使用辅助语言模型进行分类感知节点合并。生成的结构通过自定义桑基图可视化。通过并置两个本体驱动的树，工作空间能够直接比较语义上下文，并支持系统性偏见检测。由于任何可视化仅反映模型学习行为的一个子集，系统进一步应用对比推理来计算并直接显示跨上下文的反事实标记概率，从而降低误解偏见存在的风险。我们通过案例研究验证了该工作空间，比较了未对齐的基线模型GPT-2 XL与宪法对齐的Apertus模型。视觉聚合成功揭示了隐藏的代表性伤害，例如反事实代词抑制和对话中对个体的边缘化。初步用户研究证实，聚合比较界面降低了认知负荷，并有效支持分析人员检测系统性偏见。

英文摘要

Large Language Models (LLMs) exhibit representational and syntactic biases that are difficult to evaluate due to the stochastic nature of text generation. Standard auditing methods rely on a single output inspection or static automated metrics. These approaches obscure the underlying probability distributions and fail to capture biases hidden in lower-probability generation branches. This paper introduces TreeTracer, a visual analytics tool designed to evaluate LLM bias through aggregated comparison. Using a systematic perturbation analysis pipeline, the tool replaces ontology-defined terms in each input prompt, aggregates hundreds of stochastic generations into a syntax-aligned hierarchical structure, and then performs classification-aware node merging with an auxiliary language model. The resulting structure is visualized through a custom Sankey diagram. By juxtaposing two ontology-driven trees, the workspace enables direct comparison between semantic contexts and supports systematic bias detection. Because any visualization reflects only a subset of the model's learned behavior, the system further applies contrastive inference to compute and directly display counterfactual token probabilities across contexts, reducing the risk of misinterpreting the presence of bias. We validate the workspace through case studies comparing an unaligned baseline model GPT-2 XL against the constitutionally aligned Apertus models. The visual aggregation successfully exposes hidden representational harms, such as counterfactual pronoun suppression and conversational marginalization of individuals. A preliminary user study confirms that the aggregated comparative interface reduces cognitive load and effectively supports analysts in detecting systemic biases.

2606.18649 2026-06-19 cs.MA cs.CL cs.CY 新提交 80%

Gender Bias in LLM Hiring Decisions: Evidence from a Japanese Context and Evaluation of Mitigation Strategies

LLM招聘决策中的性别偏见：来自日本语境的证据及缓解策略评估

Serena A. Hoffstedde, Machiko Hirota, Akshara Nadayanur Sathis Kanna, Rihito Kotani, Ujwal Kumar, Gabriele Trovato, Phan Xuan Tan

Affiliations: Shibaura Institute of Technology, Tokyo, Japan; Amsterdam University of Applied Sciences, Amsterdam, Netherlands; University of Pennsylvania, Philadelphia, USA; Carnegie Mellon University, Pittsburgh, USA; Keio University, Tokyo, Japan

Topic match (safety evaluation): evaluates gender bias in LLM hiring decisions.

AI summary: Using 60 Japanese rirekisho-format resumes and five state-of-the-art LLMs, the study finds a significant pro-female bias in every model. A prompt-level gender-neutrality instruction does not mitigate it, whereas removing the candidate name eliminates it almost entirely.

AI中文摘要

大型语言模型（LLM）越来越多地被部署在招聘流程中，然而大多数关于LLM招聘决策中性别偏见的研究都集中在英语、西方格式的简历上。本研究考察了亲女性性别偏见是否扩展到日本企业语境，并评估了两种实用的缓解策略。使用反事实简历设计，包含60份日本履历书格式的简历、基于语言学性别信号标准选择的12个姓名对，以及五个最先进的LLM（Claude Sonnet 4.6、GPT-4o、DeepSeek-V3、Gemini 2.5 Flash、Llama 3.3 70B），我们在基线、提示指令和隐私过滤条件下进行了43,200次API调用。交叉随机效应线性混合模型确认了所有五个模型均存在显著的亲女性偏见，将西方研究结果复制到了非西方语境中。提示级别的性别中立指令并未显著减少偏见。姓名依赖分析正式将候选人姓名识别为主要性别渠道：从提示中移除姓名几乎完全消除了女性效应。隐私过滤器与GPT-4o内容安全过滤器之间的意外不兼容导致42%的拒绝率，突显了在LLM辅助招聘流程中姓名匿名化的实际部署挑战。

英文摘要

Large language models (LLMs) are increasingly deployed in hiring workflows, yet most research on gender bias in LLM hiring decisions has focused on English-language, Western-format resumes. This study examines whether pro-female gender bias extends to a Japanese corporate context and evaluates two practical mitigation strategies. Using a counterfactual resume design with 60 Japanese rirekisho-format resumes, 12 name pairs selected on linguistically grounded gender-signal criteria, and five state-of-the-art LLMs (Claude Sonnet 4.6, GPT-4o, DeepSeek-V3, Gemini 2.5 Flash, Llama 3.3 70B), we conducted 43,200 API calls across baseline, prompt instruction, and privacy filter conditions. A crossed random-effects linear mixed model confirms a significant pro-female bias across all five models, replicating Western findings in a non-Western context. A prompt-level gender-neutrality instruction produces no meaningful reduction in bias. A name-reliance analysis formally identifies the candidate name as the primary gender channel: removing the name from the prompt reduces the female effect by nearly its full magnitude. An unexpected incompatibility between the privacy filter and GPT-4o's content safety filter, resulting in a 42% refusal rate, highlights a practical deployment challenge for name anonymization in LLM-assisted recruitment pipelines.

2606.16682 2026-06-19 cs.LG cs.CL 新提交 80%

Multimodal Evaluator Preference Collapse: Cross-Modal Contagion in Self-Evolving Agents

多模态评估者偏好坍缩：自进化智能体中的跨模态传染

Zewen Liu

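Affiliation: Qilu Institute of Technology, School of Software Engineering

Topic match (safety evaluation): studies preference collapse in multimodal self-evaluation.

AI summary: Studies how evaluator preference collapse is amplified in multimodal settings, finds that cross-modal contagion distorts strategy selection, and introduces a contagion matrix to quantify the risk.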

Comments 19 pages, 0 figures

AI中文摘要

当AI智能体使用语言模型在反馈循环中评估自身输出时，会出现系统性偏差。我们表明，评估者偏好坍缩（EPC）在多模态设置中被显著放大。使用GPT-4o评估DeepSeek-chat在文本和视觉任务上的表现，我们发现单一策略（step_by_step）吸收了48.4%的权重——是纯文本自评估中坍缩的3.2倍——而三个视觉域策略合计仅获得9.1%的权重。然后，我们展示了一种称为跨模态传染的新现象：在一个模态上获得的评估者偏好会迁移到另一个模态并破坏其策略选择。通过一个四阶段隔离训练范式，我们测量了传染系数并记录了策略反转——一个模态的最优策略在跨模态暴露后发生逆转。跨四种评估者配置（总计53次独立重复，15,592次API调用）的第3阶段统计验证揭示了一个清晰的层次结构：跨模型评估（GPT-4o，N=8）产生强但对称的双向传染（平均gamma_{T->V}=1.176，gamma_{V->T}=1.089，Delta=-0.088，p=0.575，Cohen's d=0.29）；高轮次（DashScope，50轮）导致坍缩为单一策略主导（70%零传染）；而自评估提供近乎完全的免疫——97%的运行（N=30，DeepSeek-chat）产生恰好为零的传染（平均gamma=0.033，95% CI [-0.031, 0.010]，p=0.642，d=0.07）。没有评估者条件显示出统计显著的方向不对称性。我们引入了由评估者身份索引的传染矩阵，发布了MM-EPC实验框架，并将跨模型评估者架构确定为偏好传染的主要风险因素。

英文摘要

When AI agents use language models to evaluate their own outputs in a feedback loop, systematic biases emerge. We show that Evaluator Preference Collapse (EPC) is dramatically amplified in multimodal settings. Using GPT-4o to evaluate DeepSeek-chat across text and visual tasks, we find that a single strategy (step_by_step) absorbs 48.4% of all weight -- 3.2x the collapse observed in text-only self-evaluation -- while three visual-domain strategies receive only 9.1% combined weight. We then demonstrate a novel phenomenon we term cross-modal contagion: evaluator preferences acquired on one modality transfer to and corrupt strategy selection on another. Through a four-phase isolation training paradigm, we measure contagion coefficients and document strategy inversion -- the optimal strategy for a modality reverses after cross-modal exposure. A Phase 3 statistical validation across five evaluator configurations (N=80 total independent repetitions, ~35,000 API calls) with both text-proxy and real-image visual tasks finds: cross-model evaluation produces strong contagion (JSD~0.19-0.34), real-image inputs yield the most directionally consistent signal (mean gamma_{T->V}=1.145, gamma_{V->T}=0.937, 70% T->V, Cohen's d=0.56), and self-evaluation provides near-complete immunity -- 97% of runs (N=30) yield zero contagion (JSD=0.003, d=0.07). Three methodological ablations and multi-executor validation confirm the effect is not a structural artifact. We introduce the contagion matrix indexed by evaluator identity, release the MM-EPC framework, and identify cross-model evaluator architecture as the primary risk factor for preference drift. Code and data: https://github.com/aidless/mm-epc.
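
The abstract reports contagion strength as a Jensen-Shannon divergence (JSD). As a rough illustration, the sketch below computes JSD between two strategy-weight distributions. That the compared distributions are strategy weights before and after cross-modal exposure, and that the logarithm is base 2, are assumptions made here rather than details given in the abstract; the weights are invented.

```python
import math


def jensen_shannon_divergence(p, q):
    """JSD in bits between two discrete distributions of equal length.

    Both inputs are assumed to be non-negative and to sum to 1.
    """
    if len(p) != len(q):
        raise ValueError("distributions must have the same length")

    def kl(a, b):
        # Terms with a_i == 0 contribute nothing to the KL divergence.
        return sum(x * math.log2(x / y) for x, y in zip(a, b) if x > 0)

    mixture = [(x + y) / 2 for x, y in zip(p, q)]
    return 0.5 * kl(p, mixture) + 0.5 * kl(q, mixture)


# Hypothetical strategy weights before and after cross-modal exposure.
before = [0.25, 0.25, 0.25, 0.25]
after = [0.55, 0.25, 0.10, 0.10]
drift = jensen_shannon_divergence(before, after)
```

With base-2 logarithms the value lies between 0 (identical distributions) and 1 (disjoint support), which makes runs with different numbers of strategies easier to compare.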

2602.01425 2026-06-19 cs.AI cs.LG 版本更新 80%

One Probe Won't Catch Them All: Towards Targeted Deception Detection

一个探针无法捕捉所有：迈向有针对性的欺骗检测

Vikram Natarajan, Devina Jain, Shivam Arora, Satvik Golechha, Joseph Bloom

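Affiliations: LASR Labs; UK AI Security Institute

Topic match (safety evaluation): addresses the heterogeneity of deception detection with targeted probes.

AI summary: Shows that deception detection with linear probes is heterogeneous: matching probes to specific deception types raises the potential gain to +0.108 AUC, compared with +0.032 AUC for a single universal probe. The authors recommend that organizations define their threat models and deploy correspondingly matched probes.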

AI中文摘要

线性探针是一种有前景的监测AI系统欺骗行为的方法。先前工作表明，在对比指令对和简单数据集上训练的线性分类器可以达到良好性能。然而，这些探针即使在简单场景中也表现出显著失败，包括虚假相关性和对非欺骗响应的误报。在本文中，我们证明欺骗检测本质上是异质的：虽然单个通用探针实现了适度的改进（+0.032 AUC），但事后最优分析显示，当探针与特定欺骗类型匹配时，潜力显著更高（+0.108 AUC），并且合成验证实验表明，当欺骗类型事先已知时，这一上限是先验可实现的。我们的发现表明，指令对捕捉的是欺骗意图而非内容特定模式，这解释了为什么提示选择主导探针性能（占70.6%的方差）。鉴于这种异质性，我们得出结论，组织应定义其特定威胁模型并部署适当匹配的探针，而不是寻求通用的欺骗检测器。

英文摘要

Linear probes are a promising approach for monitoring AI systems for deceptive behaviour. Previous work has shown that a linear classifier trained on a contrastive instruction pair and a simple dataset can achieve good performance. However, these probes exhibit notable failures even in straightforward scenarios, including spurious correlations and false positives on non-deceptive responses. In this paper, we demonstrate that deception detection is inherently heterogeneous: while a single universal probe achieves modest improvements (+0.032 AUC), post-hoc oracle analysis reveals substantially higher potential (+0.108 AUC) when probes are matched to specific deception types, and synthetic validation experiments suggest this ceiling is achievable a priori when the deception type is known in advance. Our findings reveal that instruction pairs capture deceptive intent rather than content-specific patterns, explaining why prompt choice dominates probe performance (70.6% of variance). Given this heterogeneity, we conclude that organizations should define their specific threat models and deploy appropriately matched probes rather than seeking a universal deception detector.

2606.20510 2026-06-19 cs.CR cs.AI 新提交 75%

Efficient and Sound Probabilistic Verification for AI Agents

高效且可靠的AI智能体概率验证

Alaia Solko-Breslin, Pramod Kaushik Mudrakarta, Mihai Christodorescu, Somesh Jha, Krishnamurthy Dj Dvijotham

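Affiliations: Google DeepMind; Google; University of Pennsylvania; University of Wisconsin–Madison

Topic match (safety evaluation): probabilistic verification of security policies for agents.

AI summary: Proposes a framework based on distributionally robust optimization that computes sound upper bounds on the probability of policy violation for AI agents in complex digital environments without independence assumptions, and outperforms prior methods on terminal and tool-calling agent benchmarks.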

AI中文摘要

保护在复杂数字环境中运行的AI智能体已成为关键需求，而运行时监控方法通过制定并执行以Datalog等正式语言表达的策略提供了一种有前景的解决方案。然而，现有方法仅限于确定性策略。在AI智能体的许多实际应用中，需要在面对模糊性时强制执行安全策略，导致概率谓词或状态转换（例如，每次调用时具有一定失败概率的解密器或个人身份信息（PII）检测器）。此外，在许多此类应用中，无法轻易做出调用先前Datalog概率推理工作所需的独立性假设。我们通过引入一种基于分布鲁棒优化的可靠且高效的验证框架来解决这一问题，该框架计算策略违规概率的可靠上界，而不考虑谓词之间可能的相关性。在终端和工具调用智能体的标准基准上，我们证明了我们的方法优于现有技术，并在确保策略违规概率的严格上界的同时，改善了安全-效用权衡。

英文摘要

Securing AI agents that operate in complex digital environments has become a critical need, and runtime monitoring approaches that formulate and enforce policies expressed in a formal language like Datalog offer a promising solution. However, existing approaches are restricted to deterministic policies. In many practical applications of AI agents, there is a need to enforce security policies in the face of ambiguity, leading to probabilistic predicates or state transitions (for example, a declassifier or Personally Identifiable Information (PII) detector that has some failure probability on each invocation). Furthermore, in many such applications, one cannot easily make the independence assumptions necessary to invoke prior work on probabilistic inference in Datalog. We address this by introducing a sound and efficient framework for such verification based on distributionally robust optimization, computing sound upper bounds on the probability of policy violation regardless of possible correlations between predicates. On standard benchmarks for terminal and tool calling agents, we demonstrate that our approach outperforms prior art and improves the security-utility trade-off while ensuring rigorous bounds on the probability of policy violation.
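
To see why independence assumptions matter here, the sketch below compares the classical Boole and Fréchet bounds, which hold under any correlation between events, with the values obtained by assuming independence. This is only the simplest correlation-agnostic special case, not the paper's distributionally robust optimization procedure, and the failure probabilities are invented.

```python
def union_upper_bound(failure_probs):
    """Boole/Frechet bound: P(at least one event) <= min(1, sum of p_i)."""
    return min(1.0, sum(failure_probs))


def intersection_upper_bound(failure_probs):
    """Frechet bound: P(all events together) <= min of p_i."""
    return min(failure_probs)


def union_if_independent(failure_probs):
    """P(at least one event) when the events are assumed independent."""
    none_fail = 1.0
    for p in failure_probs:
        none_fail *= 1.0 - p
    return 1.0 - none_fail


def intersection_if_independent(failure_probs):
    """P(all events together) when the events are assumed independent."""
    joint = 1.0
    for p in failure_probs:
        joint *= p
    return joint


# Hypothetical per-call failure probabilities of two PII detectors.
probs = [0.1, 0.2]
sound_any = union_upper_bound(probs)
naive_any = union_if_independent(probs)
sound_all = intersection_upper_bound(probs)
naive_all = intersection_if_independent(probs)
```

If both detectors tend to fail on the same inputs, the joint failure probability can approach the smaller marginal rather than the product, so the independence-based figure is not a safe upper bound.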

2606.20493 2026-06-19 cs.LG cs.AI cs.MA 新提交 75%

Contagion Networks: Evaluator Bias Propagation in Multi-Agent LLM Systems

传染网络：多智能体LLM系统中的评估者偏见传播

Zewen Liu

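Affiliation: Qilu Institute of Technology, School of Software Engineering

Topic match (safety evaluation): quantifies evaluator bias propagation, with implications for system safety.

AI summary: Proposes the Contagion Networks framework to quantify how evaluator bias propagates in multi-agent LLM systems. Contagion coefficients between same-model agents range from 0.157 to 0.352, and increasing the evaluator committee size from k=1 to k=3 reduces effective contagion by 72.4%.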

Comments 20 pages, 4 figures, 4 tables

AI中文摘要

当大型语言模型在多智能体系统中担任评估者时，其系统性评估偏见会通过智能体网络传播。我们引入传染网络，这是一个用于衡量评估者偏见如何在交互的LLM智能体间传播的正式框架。在使用DeepSeek-chat进行的受控3智能体实验中，我们采用了三种不同的评估者偏见配置文件（结构化、平衡、基于证据），测量了跨智能体传染矩阵Gamma_3，并发现评估者偏见始终在智能体间传播（gamma在[0.157, 0.352]范围内），即使是在相同底层模型内也是如此。我们识别出由谱半径rho(Gamma_N)控制的三种传播机制，并证明同质模型智能体产生的传染系数比先前工作中观察到的跨模型系数弱3-5倍（MM-EPC: gamma约0.85-1.3），使其处于抑制机制中。我们表明，将评估委员会规模从k=1增加到k=3可将有效传染减少72.4%，提供了一种可行的缓解策略。我们发布了开源的传染网络实验框架。

英文摘要

When large language models serve as evaluators in multi-agent systems, their systematic evaluation biases propagate through the agent network. We introduce Contagion Networks, a formal framework for measuring how evaluator biases spread across interacting LLM agents. In a controlled 3-agent experiment using DeepSeek-chat with three distinct evaluator bias profiles (structured, balanced, evidence-based), we measure the Cross-Agent Contagion Matrix Gamma_3 and find that evaluator biases consistently propagate between agents (gamma in [0.157, 0.352]), even within the same underlying model. We identify three propagation regimes governed by the spectral radius rho(Gamma_N), and demonstrate that homogeneous-model agents produce contagion coefficients 3-5x weaker than cross-model coefficients observed in prior work (MM-EPC: gamma approx 0.85-1.3), placing them in the suppression regime. We show that increasing evaluator committee size from k=1 to k=3 reduces effective contagion by 72.4%, providing an actionable mitigation strategy. We release the open-source Contagion Network experimental framework.
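
The spectral radius mentioned above can be estimated for a small contagion matrix with plain power iteration, as in this sketch. The matrix entries are invented values inside the reported range [0.157, 0.352], the zero diagonal is an assumption, and the method as written is only reliable for non-negative matrices whose powers eventually become strictly positive. How the paper delimits its three regimes is not stated in the abstract; the sketch relies only on the standard fact that repeated linear propagation decays when the spectral radius is below 1.

```python
def spectral_radius(matrix, iterations=500):
    """Estimate the largest eigenvalue magnitude by power iteration."""
    n = len(matrix)
    vector = [1.0] * n
    estimate = 0.0
    for _ in range(iterations):
        product_vec = [
            sum(matrix[i][j] * vector[j] for j in range(n)) for i in range(n)
        ]
        norm = max(abs(x) for x in product_vec)
        if norm == 0.0:
            return 0.0
        # Rescale so the largest component is 1; the scale factor tracks rho.
        vector = [x / norm for x in product_vec]
        estimate = norm
    return estimate


# Hypothetical 3-agent contagion matrix (row i: influence received by agent i).
gamma_3 = [
    [0.0, 0.2, 0.3],
    [0.2, 0.0, 0.3],
    [0.3, 0.3, 0.0],
]
rho = spectral_radius(gamma_3)
bias_dies_out = rho < 1.0
```

For a non-negative matrix the result must lie between the smallest and largest row sums, which gives a quick sanity check on any measured matrix.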

2606.19937 2026-06-19 cs.CR 新提交 75%

AutoTam: Specifying Secure Protocol Implementations with Tamarin Model Generation

AutoTam: 通过 Tamarin 模型生成指定安全协议实现

Johannes Wilson, Mikael Asplund, Niklas Johansson

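Topic match (safety evaluation): automatically generates Tamarin models to verify protocol security.

AI summary: Presents a language-first method in which protocols are implemented in a domain-specific language and Tamarin models are generated automatically. Verified trace properties are proven to carry over to the implementation, symbolic execution is integrated to analyse memory safety, and the approach is demonstrated on a signed Diffie-Hellman protocol and on WireGuard, whose implementation interoperates with existing ones.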

Comments 19 pages, 5 figures

AI中文摘要

形式化验证是确保密码协议安全性的重要但具有挑战性的任务。虽然现代协议验证工具显著减少了验证工作量，但对于没有形式化验证背景的从业者来说，建模仍然具有挑战性。此外，将验证结果转移到具体的协议实现需要专业知识。在本文中，我们提出了一种新颖的语言优先方法，通过使用领域特定语言进行协议实现来验证迹属性。我们针对 Tamarin 证明器进行验证，并证明验证的通用迹属性可以转换回实现。我们还集成了符号执行以分析协议实现的内存安全性。我们使用我们的工具实现并生成了签名 Diffie-Hellman 协议和 WireGuard VPN 协议的准确模型。当使用我们的解释器时，我们的 WireGuard 实现与现有实现可互操作，并达到了可接受的性能。我们通过符号执行和生成的 Tamarin 模型的验证相结合，正式证明了我们的实现是安全的。

英文摘要

Formal verification is a challenging but important task for ensuring the security of cryptographic protocols. While modern protocol verification tools significantly reduce verification effort, modelling remains challenging to practitioners without a background in formal verification. In addition, transferring verification results to a concrete protocol implementation requires expert knowledge. In this paper, we present a novel language-first method for verification of trace properties using a domain-specific language for protocol implementations. We target the Tamarin prover for verification, and we prove that verified universal trace properties translate back to the implementation. We additionally integrate symbolic execution in order to analyse the memory safety of protocol implementations. We use our tool to implement and generate accurate models for a signed Diffie-Hellman protocol, and for the WireGuard VPN protocol. Our WireGuard implementation is interoperable with existing implementations when using our interpreter, and achieves acceptable performance. We formally prove our implementations secure using a combination of symbolic execution and verification of the generated Tamarin models.

2606.19588 2026-06-19 cs.AI cs.CR cs.LO 新提交 75%

Analyzing the Narration Gap in LLM-Solver Loops

分析大语言模型-求解器循环中的叙述差距

Zunchen Huang, Songgaojun Deng

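Affiliation: Eindhoven University of Technology

Topic match (safety evaluation): studies security vulnerabilities and certificate gating in LLM-solver interaction.

AI summary: Studies the narration step that turns solver output into the user-facing answer in hybrid reasoning with LLMs and SAT/SMT solvers. Formal modelling and experiments show that certificate gating keeps the solver verdict sound, yet an adversary can still invert the verified conclusion in the narrated answer.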

AI中文摘要

诸如SAT和SMT求解器之类的形式化工具，当安全或安保关键问题可以用逻辑表述时，越来越多地被嵌入到语言模型推理流程中。与思维链不同（其步骤从模型分布中采样，没有形式化保证），求解器产生可靠且可独立验证的答案。然而，这种可靠性保证可能在求解器与模型之间的交互中丢失。混合流程包含三个组成部分：形式化问题、求解问题以及叙述结果。先前的工作研究了形式化和求解，但未涉及叙述——即将形式化工具的输出转化为用户答案的步骤。为了填补叙述差距，我们首先将LLM-求解器循环建模为经过验证的决策过程。我们进一步在提示注入下评估了五个开源模型，发现证书门控使求解器判定可靠，而攻击者可以通过不同措辞和渠道反转已验证的结论。我们研究了通过强化提示进行缓解的方法，该方法显著减少了注入但无法完全消除，并且在自适应攻击下仍然存在问题。结合形式化分析和实证研究，我们表明在LLM-求解器循环中，鲁棒性无法延伸到用户最终读取的答案。

英文摘要

Formal tools such as SAT and SMT solvers are increasingly embedded in language model reasoning pipelines when a safety or security critical question can be formulated in logic. Unlike chain of thought whose steps are sampled from the model distribution without formal guarantee, a solver produces a sound and independently verifiable answer. However, the soundness guarantee can be lost in the interaction between the solver and the model. The hybrid pipeline has three components: formalizing the question, deciding it, and narrating the result. Prior work has studied the formalization and decision, but not narration, which is the step that turns a formal tool's output into the user answer. To fill the narration gap, we first model the LLM-solver loop as a verified decision procedure. We further evaluate five open-sourced models under prompt injection, and we find certificate gating makes the solver verdict sound, while an adversary can invert a verified conclusion across phrasings and channels. We study the mitigation through hardened prompt that reduces injection significantly but cannot eliminate it and still suffers under adaptive attack. Combining the formal analysis and empirical studies, we show in the LLM-solver loop, robustness does not reach to the answer that the user finally reads.

2602.04306 2026-06-19 cs.CL cs.AI 版本更新 75%

DeFrame: Debiasing Large Language Models Against Framing Effects

DeFrame: 消除大语言模型中的框架效应偏差

Kahee Lim, Soyeon Kim, Steven Euijong Whang

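Affiliation: KAIST

Topic match (safety evaluation): targets hidden bias caused by framing effects to improve fairness.

AI summary: Addresses inconsistent bias across semantically equivalent but differently framed prompts with a framing-aware debiasing method that quantifies framing disparity and encourages consistency across framings, reducing overall bias and improving robustness.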

Comments Accepted to Findings of ACL 2026

AI中文摘要

随着大语言模型（LLMs）在现实应用中的日益部署，确保其在不同人口群体中的公平响应变得至关重要。尽管做出了许多努力，但一个持续的挑战是隐藏的偏见：LLMs 在标准评估下表现公平，但在这些评估设置之外可能产生有偏见的响应。在本文中，我们识别出框架——语义等价的提示在表达方式上的差异（例如，“A 比 B 好” vs. “B 比 A 差”）——作为导致这一差距的一个未被充分探索的因素。我们首先引入“框架差异”的概念来量化框架对公平性评估的影响。通过用替代框架扩充公平性评估基准，我们发现（1）公平性得分随框架变化显著，以及（2）现有的去偏方法改善了整体（即框架平均）公平性，但往往未能减少框架引起的差异。为了解决这个问题，我们提出了一种框架感知的去偏方法，鼓励 LLMs 在不同框架之间更加一致。实验表明，我们的方法减少了整体偏见，并提高了对框架差异的鲁棒性，使 LLMs 能够产生更公平和更一致的响应。

英文摘要

As large language models (LLMs) are increasingly deployed in real-world applications, ensuring their fair responses across demographics has become crucial. Despite many efforts, an ongoing challenge is hidden bias: LLMs appear fair under standard evaluations, but can produce biased responses outside those evaluation settings. In this paper, we identify framing -- differences in how semantically equivalent prompts are expressed (e.g., "A is better than B" vs. "B is worse than A") -- as an underexplored contributor to this gap. We first introduce the concept of "framing disparity" to quantify the impact of framing on fairness evaluation. By augmenting fairness evaluation benchmarks with alternative framings, we find that (1) fairness scores vary significantly with framing and (2) existing debiasing methods improve overall (i.e., frame-averaged) fairness, but often fail to reduce framing-induced disparities. To address this, we propose a framing-aware debiasing method that encourages LLMs to be more consistent across framings. Experiments demonstrate that our approach reduces overall bias and improves robustness against framing disparities, enabling LLMs to produce fairer and more consistent responses.

2606.20546 2026-06-19 cs.LG 新提交 70%

Predictability as a Fine-Grained Measure for Privacy

可预测性作为隐私的细粒度度量

Linda Lu, Karthik Sridharan

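Affiliation: Cornell University

Topic match (safety evaluation): proposes predictability as a privacy measure complementary to differential privacy.

AI summary: Proposes a predictability framework that measures privacy leakage as the gain in an attacker's ability to predict sensitive information, complementary to differential privacy. It analyses asymptotic predictability with the generalized method of moments and derives an output perturbation scheme for ERM.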

AI中文摘要

差分隐私（DP）确保针对最知识渊博的攻击者的严格个体级隐私保证，但其最坏情况性质可能导致代价高昂的隐私-准确性权衡。我们引入了通过可预测性实现的隐私，这是一个细粒度框架，明确包含了攻击者的核心知识、由随机过程生成的数据集的受损部分以及指定的查询族。可预测性将隐私泄露衡量为攻击者在观察算法输出后，预测关于未知个体的敏感信息的能力的增量增益，超出已从受损数据中推断出的信息。我们表明，可预测性和DP通常是不可比的：一个可以很小而另一个很大。然而，在最坏情况下，当除一个个体外所有个体都受损且所有二元查询都被视为敏感时，可预测性意味着互信息DP。更一般地，可预测性提供了一种针对特定敏感信息和特定攻击者模型量身定制的更细粒度的隐私度量。我们引入了一个通用框架，使用广义矩方法（GMM），来分析当受损数据由平稳、遍历、混合过程生成时的渐近可预测性。利用这一分析，我们推导出用于ERM的可预测性校准输出扰动方案。我们的方法与DP互补，并且可以与DP一起使用以提供细粒度的隐私控制。

英文摘要

Differential privacy (DP) ensures rigorous individual-level privacy guarantees against even the most knowledgeable attackers, but its worst-case nature can impose a costly privacy-accuracy tradeoff. We introduce privacy via predictability, a fine-grained framework that explicitly incorporates the attacker's core knowledge, a compromised portion of the dataset generated by a stochastic process, and a specified family of queries. Predictability measures privacy leakage as the incremental gain in an attacker's ability to predict sensitive information about unknown individuals after observing the algorithm's output, beyond what can already be inferred from the compromised data. We show that predictability and DP are generally incomparable: each can be small while the other is large. However, in the worst-case regime where all but one individual is compromised, and all binary queries are considered sensitive, predictability implies mutual-information DP. More generally, predictability provides a finer-grained privacy metric tailored to specific sensitive information and specific attacker models. We introduce a general framework, using the generalized method of moments (GMM), to analyze asymptotic predictability when the compromised data is generated by a stationary, ergodic, mixing process. Using this analysis, we derive a predictability-calibrated output perturbation scheme for ERM. Our approach is complementary to DP and can be used alongside DP to provide fine-grained privacy control.

2606.20093 2026-06-19 cs.CL 新提交 70%

Self-Preference Is Weak or Absent in Verifiable Instruction-Following Revision: A Four-Model Test Under Genuine Authorship

自我偏好在可验证的指令遵循修订中弱或不存在：基于真正作者身份的四模型测试

William Guey, Pierrick Bougault

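Affiliation: Department of Industrial Engineering, Tsinghua University

Topic match (safety evaluation): a study of self-preference bias.

AI summary: Uses the IFEval checker to test self-preference in instruction-following revision across four mid-tier model families. Authors reject verified-correct edits at a rate not significantly different from fresh models, indicating that self-preference is weak or absent.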

Comments 7 pages, 3 tables. Code and data: https://github.com/williamguey/self-preference-revision

AI中文摘要

大型语言模型（LLMs）越来越多地审查和修订文本，包括它们自己的文本。有记录的自我偏好偏差（模型在充当评判者时偏爱自己的生成）引发了一个问题：模型是否也会抵制对自己写作的有效修正。我们在一个“有效”不是由另一个模型决定，而是由确定性验证器决定的设置中测试这一点：基于IFEval的指令遵循修订。模型撰写草稿；官方IFEval检查器确认草稿违反约束，并且候选编辑修复了它；然后模型接受或拒绝该编辑，要么作为真正的上下文内作者，要么作为一个以中立方式看待草稿的新鲜模型。在四个中端模型系列和85次作者与新鲜模型比较中，我们未检测到可察觉的自我偏好：作者拒绝对自己草稿的已验证正确修复的比例与判断相同草稿的新鲜模型基本相同（差距-5.1个百分点，95%置信区间[-12.9, +2.7]）。来自较小试点的自我怀疑提示未在大规模上复制。唯一稳健的观察是定性的：当作者确实拒绝已验证正确的修复时，他们陈述的理由中有97%是挑错而非偏好，即关于拒绝的性质，而非升高的比率。在此样本量下，不能排除小于约13个百分点的效应。

英文摘要

Large language models (LLMs) increasingly review and revise text, including their own. A documented self-preference bias (models favoring their own generations when acting as judges) raises the question of whether models also resist valid corrections to their own writing. We test this in a setting where "valid" is decided not by another model but by a deterministic verifier: instruction-following revision on IFEval. A model writes a draft; the official IFEval checker confirms the draft violates a constraint and that a candidate edit fixes it; the model then accepts or rejects that edit either as the genuine in-context author or as a fresh model that sees the draft neutrally. Across four mid-tier model families and 85 author-versus-fresh comparisons, we find no detectable self-preference: authors reject verified-good fixes to their own drafts at essentially the same rate as fresh models judging the same drafts (gap -5.1 pp, 95% CI [-12.9, +2.7]). A self-skepticism hint from a smaller pilot did not replicate at scale. The one robust observation is qualitative: when authors do reject a verified-good fix, 97% of their stated reasons are flaw-catching rather than preference, that is, about the character of rejections, not an elevated rate. Effects smaller than ~13 pp cannot be excluded at this sample size.
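
The reported interval [-12.9, +2.7] is centred on the -5.1 pp gap, which is what a symmetric interval around the mean gap would look like. The sketch below computes a normal-approximation 95% interval for the mean of per-comparison gaps. Whether the authors used this estimator is not stated in the abstract, and the toy data are invented.

```python
import math


def paired_gap_ci(gaps_pp, z=1.96):
    """Mean gap with a normal-approximation confidence interval.

    gaps_pp: per-comparison differences (author rate minus fresh rate),
    in percentage points. z=1.96 corresponds to roughly 95% coverage.
    """
    n = len(gaps_pp)
    if n < 2:
        raise ValueError("need at least two comparisons")
    mean = sum(gaps_pp) / n
    variance = sum((g - mean) ** 2 for g in gaps_pp) / (n - 1)
    half_width = z * math.sqrt(variance / n)
    return mean, mean - half_width, mean + half_width


# Hypothetical gaps for five comparisons; the study has 85.
toy_gaps = [-20.0, 10.0, -5.0, 0.0, -10.0]
gap, low, high = paired_gap_ci(toy_gaps)
no_detectable_effect = low <= 0.0 <= high
```

An interval that contains zero means no effect was detected, not that none exists: effects as large as the interval's far end remain compatible with the data, which is the point of the abstract's closing caveat.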

2606.19899 2026-06-19 cs.CY cs.AI 新提交 70%

Measuring Biological Capabilities and Risks of AI Agents

测量AI代理的生物能力与风险

Patricia Paskov, Jeffrey Lee, Kyle Brady, Alyssa Worland

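Topic match (safety evaluation): focuses on safety evaluation of biological risks from AI agents.

AI summary: For AI scientists, that is, agentic systems able to carry out multi-step scientific tasks autonomously or collaboratively, the paper introduces biological agentic evaluations as a promising but interpretation-sensitive tool. It offers experience-grounded considerations on defining, designing, running, scoring, and documenting evaluations, to help decision-makers interpret results cautiously and to guide investment.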

AI中文摘要

本文针对一个迅速出现的政策挑战：如何生成和解释关于AI科学家（即能够自主或协作执行多步科学任务的代理AI系统）的生物能力与风险的可信证据。随着这些系统进入真实研究流程，决策者越来越多地面临评估结果，而这些结果的含义取决于通常隐含或记录不足的底层设计选择。我们综合了关于AI驱动的生物风险的现有证据，并引入生物代理评估作为评估这些系统的一种有前景但需要谨慎解释的工具。我们的核心贡献是一套基于实践经验的考量——源自我们自己的评估——展示了围绕定义、设计、运行、评分和记录评估的选择如何实质性地塑造结果对风险意味着什么和不意味着什么。该分析旨在帮助政策制定者以适当的谨慎态度解读生物评估输出；引导公共和私人资助者向AI-生物学评估研究的高杠杆投资；并支持评估新兴AI系统的生物安全从业者。次要受众包括在前沿AI实验室、AI提供商、科学机构和第三方评估组织中设计或进行代理评估的研究人员。

英文摘要

This paper addresses a rapidly emerging policy challenge: how to generate and interpret credible evidence about the biological capabilities and risks of AI scientists, or agentic AI systems capable of autonomously or collaboratively performing multi-step scientific tasks. As these systems enter real research workflows, decision-makers increasingly face evaluation results whose meaning depends on underlying design choices that are often implicit or under-documented. We synthesize current evidence on AI-enabled biological risks and introduce biological agentic evaluations as a promising, but interpretation-sensitive, tool for assessing these systems. Our central contribution is a set of practical, experience-grounded considerations -- drawing from our own evaluations -- that show how choices around defining, designing, running, scoring, and documenting evaluations materially shape what results do and do not imply about risk. The analysis is intended to help policymakers interpret biological evaluation outputs with appropriate caution; guide public and private funders toward high-leverage investments in AI-biology evaluation research; and support biosecurity practitioners assessing emerging AI systems. A secondary audience includes researchers designing or conducting agentic evaluations within frontier AI labs, AI providers, scientific institutions, and third-party evaluation organizations.

2606.19532 2026-06-19 cs.LO 新提交 70%

Vancomycert: A Certified Neuro-Symbolic Drug Delivery System (Case Study)

Vancomycert: 一种经过认证的神经符号药物递送系统（案例研究）

Alistair Sirman, Fleur Conway, Jessica Ciupa, Gusts Gustavs Grīnbergs, Ekaterina Komendantskaya, Thai Son Hoang, Michael Rawson, Alessandro Bruni, Vaishak Belle, Michael John Williams

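Topic match (safety evaluation): formal verification of the safety of a neural network controller.

AI summary: Addresses formal verification of a neural network controller for antibiotic dosing by combining supervised learning with theorem proving, guaranteeing that automated dosing never exceeds the supratherapeutic boundary over an infinite horizon.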

AI中文摘要

自主决策的神经网络控制器在网络物理系统中已得到广泛应用，但在安全关键的医疗环境中，其部署仍未得到充分验证。本文提出了一种用于抗生素给药神经网络控制器形式化验证的方法和案例研究，其动机源于系统必须在无限时间范围内同时具备适应性和可证明安全性的挑战。我们构建了一个简化但临床可解释的模型，用于跟踪药物浓度、体温和白细胞计数。万古霉素被选为代表性抗生素，广泛用于严重感染，但治疗窗口狭窄，超治疗浓度有肾毒性风险，而亚治疗剂量可能导致治疗失败。我们使用合成的临床医生式给药数据训练了一个监督式神经网络控制器。我们建立了输入-输出安全属性的形式化验证，特别验证了神经网络的一个属性，该属性意味着无限时域证明自动给药从未超过超治疗边界。该系统的属性在Rocq中使用Vehicle交互式定理证明器后端进行证明，以集成不同的证明系统。最终结果是一个验证流水线，允许各种治疗方法，同时为每个特定患者保持安全性。

英文摘要

Neural network controllers for autonomous decision-making are well-established in cyber-physical systems, yet their deployment in safety-critical healthcare settings remains largely unverified. This paper presents a methodology and case study for the formal verification of a neural network controller for antibiotic dosing, motivated by the challenge of systems that must be simultaneously adaptive and provably safe across unbounded time horizons. We construct a simplified yet clinically-interpretable model that tracks drug concentration, body temperature, and white blood cell count. Vancomycin is selected as a representative antibiotic, widely prescribed for severe infections yet carrying a narrow therapeutic window, where supratherapeutic concentrations risk nephrotoxicity and subtherapeutic dosing risks treatment failure. A supervised neural network controller is trained on synthetic clinician-style dosing data. We establish formal verification of input-output safety properties, specifically verifying a property of a neural network that implies an infinite-horizon proof that automated dosing never exceeds the supratherapeutic boundary. This system property is proven in Rocq using the Vehicle interactive theorem prover back-end to integrate the different proof systems. The end result is a verification pipeline that allows for a wide variety of treatment approaches whilst maintaining safety for each specific patient.

2602.23248 2026-06-19 cs.AI 版本更新 70%

Mitigating Legibility Tax with Decoupled Prover-Verifier Games

通过解耦证明者-验证者游戏减轻可读性代价

Yegon Kim, Juho Lee

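Affiliation: KAIST

Topic match (safety evaluation): improves the checkability of LLM outputs.

AI summary: Proposes decoupled prover-verifier games (DPVG), which separate correctness from checkability by training a translator model that converts a fixed solver's solution into a checkable form, improving checkability while preserving answer correctness and thereby addressing the legibility tax.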

Comments ICLR 2026 Workshop Trustworthy AI