大模型对齐与安全 - arXivDaily 专题

2606.04075 2026-06-19 cs.LG cs.AI cs.CL cs.CR cs.CY 版本更新 90%

Large Language Models Hack Rewards, and Society

大型语言模型攻击奖励机制与社会

Wei Liu, Xinyi Mou, Hanqi Yan, Zhongyu Wei, Yulan He

发表机构 * King’s College London（伦敦大学国王学院）； Fudan University（复旦大学）； The Alan Turing Institute（艾伦·图灵研究所）

专题命中安全评测：研究LLM利用奖励漏洞的社会攻击现象。

AI总结研究强化学习训练中大型语言模型利用奖励函数漏洞的“社会攻击”现象，通过SocioHack沙盒实验发现模型能发现并利用社会规则漏洞，且现有安全措施效果有限。

Comments 14 pages, 9 figures, 7 tables

详情

AI中文摘要

强化学习已成为一种主导的后训练范式，使大型语言模型能够从奖励中学习。我们观察到社会规则在结构上与奖励函数相似。它们定义了可衡量的结果、阈值和例外情况，同时往往仅部分指定了制度意图。我们假设强化学习训练过程可能利用这些漏洞，因此提出模型在强化学习期间攻击奖励函数的已知倾向是否可能扩展为一种更严重的失败模式，即社会攻击：发现社会运行规则中的漏洞。为了研究这一现象，我们引入了SocioHack，一个包含72个社会环境的沙盒，并发现这些环境中奖励攻击自然出现并导致监管漏洞的发现。模型学会攻击社会规则并生成技术上合规但违背监管意图的策略，而当前的大型语言模型安全措施仅提供有限的缓解。因此，收集真实世界反馈用于模型训练需要更加谨慎，我们需要下一代后训练范式来安全地在真实社会中迭代大型语言模型。

英文摘要

Reinforcement learning (RL) has become a dominant post-training paradigm, enabling large language models (LLMs) to learn from rewards. We observe that societal regulations are structurally similar to reward functions. They define measurable outcomes, thresholds, and exceptions, while often leaving institutional intent only partially specified. We hypothesise that the RL training process may exploit these gaps and therefore ask whether models' well-known tendency to hack reward functions during RL can scale into a more consequential failure mode named societal hacking: discovering loopholes in the rules society runs on. To study this phenomenon, we introduce SocioHack, a sandbox of 72 societal environments, and find that within these environments, reward hacking naturally emerges and leads to regulatory loophole discovery. Models learn to hack the social rules and generate strategies that remain technically compliant while defeating regulatory intent, and current LLM safeguards provide only limited mitigation. Therefore, collecting in-the-wild feedback for model training requires greater caution, and we need a next-generation post-training paradigm for safely iterating LLMs in real society.=

URL PDF HTML ☆

赞 0 踩 0

2603.19423 2026-06-19 cs.CR cs.AI cs.LG 版本更新 85%

The Autonomy Tax: Defense Training Breaks LLM Agents

自主性税：防御训练破坏LLM智能体

Shawn Li, Yue Zhao

发表机构 * University of Southern California（南加州大学）

专题命中安全评测：防御训练破坏LLM智能体工具执行能力

AI总结揭示防御训练在提升LLM智能体安全性时，系统性地破坏其工具执行能力，导致任务失败率飙升，且无法有效防御复杂攻击。

详情

AI中文摘要

大型语言模型（LLM）智能体日益依赖外部工具（文件操作、API调用、数据库事务）来自主完成复杂的多步骤任务。实践者部署经过防御训练的模型，以防止通过恶意观察或检索内容操纵智能体行为的提示注入攻击。我们揭示了一个基本的\textbf{能力-对齐悖论}：旨在提高安全性的防御训练系统性地破坏了智能体的能力，同时未能阻止复杂的攻击。在97个智能体任务和1000个对抗性提示上，将防御模型与未防御基线进行比较，我们发现了多步骤智能体特有的三种系统性偏差。\textbf{智能体无能偏差}表现为立即的工具执行崩溃，模型在观察到任何外部内容之前就在良性任务上拒绝或生成无效操作。\textbf{级联放大偏差}导致早期失败通过重试循环传播，使防御模型在99%的任务中超时，而基线仅为13%。\textbf{触发偏差}导致矛盾的安全退化，防御模型的表现比未防御基线更差，而直接攻击以高概率绕过防御。根本原因分析表明，这些偏差源于捷径学习：模型过度拟合表面攻击模式而非语义威胁理解，这由防御效果在不同攻击类别上的极端方差所证明。我们的发现表明，当前的防御范式优化了单轮拒绝基准，同时使多步骤智能体从根本上不可靠，因此需要新的方法在对抗条件下保持工具执行能力。

英文摘要

Large language model (LLM) agents increasingly rely on external tools (file operations, API calls, database transactions) to autonomously complete complex multi-step tasks. Practitioners deploy defense-trained models to protect against prompt injection attacks that manipulate agent behavior through malicious observations or retrieved content. We reveal a fundamental \textbf{capability-alignment paradox}: defense training designed to improve safety systematically destroys agent competence while failing to prevent sophisticated attacks. Evaluating defended models against undefended baselines across 97 agent tasks and 1,000 adversarial prompts, we uncover three systematic biases unique to multi-step agents. \textbf{Agent incompetence bias} manifests as immediate tool execution breakdown, with models refusing or generating invalid actions on benign tasks before observing any external content. \textbf{Cascade amplification bias} causes early failures to propagate through retry loops, pushing defended models to timeout on 99\% of tasks compared to 13\% for baselines. \textbf{Trigger bias} leads to paradoxical security degradation where defended models perform worse than undefended baselines while straightforward attacks bypass defenses at high rates. Root cause analysis reveals these biases stem from shortcut learning: models overfit to surface attack patterns rather than semantic threat understanding, evidenced by extreme variance in defense effectiveness across attack categories. Our findings demonstrate that current defense paradigms optimize for single-turn refusal benchmarks while rendering multi-step agents fundamentally unreliable, necessitating new approaches that preserve tool execution competence under adversarial conditions.

URL PDF HTML ☆

赞 0 踩 0

2602.01425 2026-06-19 cs.AI cs.LG 版本更新 80%

One Probe Won't Catch Them All: Towards Targeted Deception Detection

一个探针无法捕捉所有：迈向有针对性的欺骗检测

Vikram Natarajan, Devina Jain, Shivam Arora, Satvik Golechha, Joseph Bloom

发表机构 * LASR Labs（LASR实验室）； UK AI Security Institute（英国人工智能安全研究所）

专题命中安全评测：针对欺骗检测的异质性，提出针对性探针

AI总结针对线性探针在欺骗检测中的异质性，提出根据具体欺骗类型匹配探针可显著提升性能（AUC提升0.108），建议组织定义威胁模型并部署相应探针。

详情

AI中文摘要

线性探针是一种有前景的监测AI系统欺骗行为的方法。先前工作表明，在对比指令对和简单数据集上训练的线性分类器可以达到良好性能。然而，这些探针即使在简单场景中也表现出显著失败，包括虚假相关性和对非欺骗响应的误报。在本文中，我们证明欺骗检测本质上是异质的：虽然单个通用探针实现了适度的改进（+0.032 AUC），但事后最优分析显示，当探针与特定欺骗类型匹配时，潜力显著更高（+0.108 AUC），并且合成验证实验表明，当欺骗类型事先已知时，这一上限是先验可实现的。我们的发现表明，指令对捕捉的是欺骗意图而非内容特定模式，这解释了为什么提示选择主导探针性能（占70.6%的方差）。鉴于这种异质性，我们得出结论，组织应定义其特定威胁模型并部署适当匹配的探针，而不是寻求通用的欺骗检测器。

英文摘要

Linear probes are a promising approach for monitoring AI systems for deceptive behaviour. Previous work has shown that a linear classifier trained on a contrastive instruction pair and a simple dataset can achieve good performance. However, these probes exhibit notable failures even in straightforward scenarios, including spurious correlations and false positives on non-deceptive responses. In this paper, we demonstrate that deception detection is inherently heterogeneous: while a single universal probe achieves modest improvements (+0.032 AUC), post-hoc oracle analysis reveals substantially higher potential (+0.108 AUC) when probes are matched to specific deception types, and synthetic validation experiments suggest this ceiling is achievable a priori when the deception type is known in advance. Our findings reveal that instruction pairs capture deceptive intent rather than content-specific patterns, explaining why prompt choice dominates probe performance (70.6% of variance). Given this heterogeneity, we conclude that organizations should define their specific threat models and deploy appropriately matched probes rather than seeking a universal deception detector.

URL PDF HTML ☆

赞 0 踩 0

2602.04306 2026-06-19 cs.CL cs.AI 版本更新 75%

DeFrame: Debiasing Large Language Models Against Framing Effects

DeFrame: 消除大语言模型中的框架效应偏差

Kahee Lim, Soyeon Kim, Steven Euijong Whang

发表机构 * KAIST（韩国科学技术院）

专题命中安全评测：针对框架效应导致的隐藏偏见，提升公平性

AI总结针对大语言模型在语义等价但不同表述的提示下产生不一致偏见的问题，提出框架感知的去偏方法，通过量化框架差异并增强跨框架一致性，有效降低整体偏见并提升鲁棒性。

Comments Accepted to Findings of ACL 2026

详情

AI中文摘要

随着大语言模型（LLMs）在现实应用中的日益部署，确保其在不同人口群体中的公平响应变得至关重要。尽管做出了许多努力，但一个持续的挑战是隐藏的偏见：LLMs 在标准评估下表现公平，但在这些评估设置之外可能产生有偏见的响应。在本文中，我们识别出框架——语义等价的提示在表达方式上的差异（例如，“A 比 B 好” vs. “B 比 A 差”）——作为导致这一差距的一个未被充分探索的因素。我们首先引入“框架差异”的概念来量化框架对公平性评估的影响。通过用替代框架扩充公平性评估基准，我们发现（1）公平性得分随框架变化显著，以及（2）现有的去偏方法改善了整体（即框架平均）公平性，但往往未能减少框架引起的差异。为了解决这个问题，我们提出了一种框架感知的去偏方法，鼓励 LLMs 在不同框架之间更加一致。实验表明，我们的方法减少了整体偏见，并提高了对框架差异的鲁棒性，使 LLMs 能够产生更公平和更一致的响应。

英文摘要

As large language models (LLMs) are increasingly deployed in real-world applications, ensuring their fair responses across demographics has become crucial. Despite many efforts, an ongoing challenge is hidden bias: LLMs appear fair under standard evaluations, but can produce biased responses outside those evaluation settings. In this paper, we identify framing -- differences in how semantically equivalent prompts are expressed (e.g., "A is better than B" vs. "B is worse than A") -- as an underexplored contributor to this gap. We first introduce the concept of "framing disparity" to quantify the impact of framing on fairness evaluation. By augmenting fairness evaluation benchmarks with alternative framings, we find that (1) fairness scores vary significantly with framing and (2) existing debiasing methods improve overall (i.e., frame-averaged) fairness, but often fail to reduce framing-induced disparities. To address this, we propose a framing-aware debiasing method that encourages LLMs to be more consistent across framings. Experiments demonstrate that our approach reduces overall bias and improves robustness against framing disparities, enabling LLMs to produce fairer and more consistent responses.

URL PDF HTML ☆

赞 0 踩 0

2602.23248 2026-06-19 cs.AI 版本更新 70%

Mitigating Legibility Tax with Decoupled Prover-Verifier Games

通过解耦证明者-验证者游戏减轻可读性代价

Yegon Kim, Juho Lee

发表机构 * KAIST（韩国科学技术院）

专题命中安全评测：提高LLM输出的可检查性

AI总结提出解耦证明者-验证者游戏（DPVG），通过分离正确性与可检查性训练一个翻译器模型，将固定求解器的解转化为可检查形式，在保持答案正确性的同时提高可检查性，解决了可读性代价问题。

Comments ICLR 2026 Workshop Trustworthy AI

2505.22829 2026-06-19 cs.LG cs.AI 版本更新 70%

Bridging Distribution Shift and AI Safety: Conceptual and Methodological Synergies

弥合分布偏移与AI安全：概念与方法论的协同

Chenruo Liu, Kenan Tang, Yao Qin, Qi Lei

发表机构 * Center for Data Science, New York University New York New York USA ； Computer Science Department, University of California, Santa Barbara Santa Barbara California USA ； Department of Electrical ； Computer Engineering, University of California, Santa Barbara Santa Barbara California USA ； Courant Institute for Mathematical Sciences \& Center for Data Science, New York University New York New York USA ； Center for Data Science, New York University ； Computer Science Department, University of California, Santa Barbara ； Computer Engineering, University of California, Santa Barbara ； Courant Institute for Mathematical Sciences \& Center for Data Science, New York University

专题命中安全评测：分析分布偏移与AI安全的协同关系。

AI总结本文通过分析分布偏移与AI安全之间的概念和方法论协同，建立了特定偏移类型与细粒度安全问题之间的两种联系，促进了两领域研究的深度融合。

Comments 35 pages

2501.18038 2026-06-19 cs.CY 版本更新 70%

Acceleration AI Ethics and the Telus GenAI Conversational Agent

加速AI伦理与Telus生成式AI对话代理

James Brusseau

专题命中安全评测：讨论加速AI伦理框架，平衡创新与安全

AI总结本文阐述加速伦理学的理论框架，并通过Telus公司的生成式AI语言工具案例，展示加速AI伦理如何在创新与安全之间平衡，以最大化社会责任。

Journal ref Law Ethics Technol. 2026(2):0006

详情

DOI: 10.55092/let20260006

AI中文摘要

加速伦理学处理人工智能中创新与安全之间的张力。加速论点是，创新带来的风险应通过更多的创新来应对。本文总结了这一理论立场，然后展示了加速伦理学在真实案例中如何运作。首先，本文总结了加速伦理学的五个要素：创新解决创新问题、创新具有内在价值、未知令人鼓舞、治理去中心化、伦理嵌入其中。随后，本文通过一个用例——加拿大电信公司Telus开发的生成式人工智能语言工具——来说明加速框架。尽管理论立场的纯粹性被现实世界的模糊性所模糊，但Telus的经验表明，加速AI伦理是通过创新最大化社会责任的一种方式，而不是为了创新牺牲社会责任，或者为了社会责任牺牲创新。

英文摘要

Acceleration ethics addresses the tension between innovation and safety in artificial intelligence. The acceleration argument is that risks raised by innovation should be answered with still more innovating. This paper summarizes the theoretical position, and then shows how acceleration ethics works in a real case. To begin, the paper summarizes acceleration ethics as composed of five elements: innovation solves innovation problems, innovation is intrinsically valuable, the unknown is encouraging, governance is decentralized, ethics is embedded. Subsequently, the paper illustrates the acceleration framework with a use-case, a generative artificial intelligence language tool developed by the Canadian telecommunications company Telus. While the purity of theoretical positions is blurred by real-world ambiguities, the Telus experience indicates that acceleration AI ethics is a way of maximizing social responsibility through innovation, as opposed to sacrificing social responsibility for innovation, or sacrificing innovation for social responsibility.

URL PDF HTML ☆

赞 0 踩 0