LLM Alignment and Safety - arXivDaily Topic Digest

2606.18673 2026-06-18 cs.CR New submission 95%

Understanding and Mitigating Prompt Leaking Attacks in Real-World LLM-Based Applications

Yong Yang, Chong Fu, Tong Zhang, Rui Zeng, Qingming Li, Tianyu Du, Zonghui Wang, Shouling Ji, Wenzhi Chen

Topic match (safety evaluation): system prompt leaking attacks and defenses

AI summary: This study systematically measures 1,200 real-world LLM-based applications, finds that over 80% leak their system prompts, and proposes AREA, a defense informed by an attention-drift analysis that prevents leakage while preserving usability.

Comments: Accepted at ACM CCS 2026

Abstract

Large language model (LLM)-based applications rely on system prompts to encode core logic and developer-defined constraints, making these prompts important intellectual property. However, system prompts are vulnerable to prompt leaking attacks. Although prior work has shown such attacks in controlled settings, their prevalence, causes, and defenses in real-world deployments remain unclear. This paper presents a systematic study of prompt leaking in real-world LLM-based applications. We measure 1,200 applications across six major commercial platforms and find that over 80% of deployments leak system prompts under realistic adversarial queries, sometimes exposing sensitive information such as third-party API keys. We also show that existing defenses often fail to prevent leakage without degrading usability. To explain these failures, we conduct an attention-level mechanistic analysis and identify attention drift, where query-key alignment bias and softmax amplification cause LLMs to progressively ignore defensive constraints. Guided by this insight, we propose AREA, a practical defense that re-anchors the model's attention using an optimizable soft prompt. Experiments and real-world case studies show that AREA matches the leakage resistance of state-of-the-art defenses while improving average usability by over 33% and reducing optimization overhead by nearly 3x. Our responsible disclosure led two affected vendors to classify these leaks as medium-severity vulnerabilities.
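
The abstract attributes attention drift partly to softmax amplification. As a toy illustration only (this is not the paper's analysis, and the scores and span labels are made up), the sketch below shows how softmax turns a modest gap in query-key scores into a large gap in attention weight, which is the kind of effect that could leave a defensive instruction with very little attention.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of attention logits."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical query-key scores for two context spans: the defensive
# instruction in the system prompt and the adversarial user query.
for gap in (0.0, 1.0, 2.0, 4.0):
    defense_w, attack_w = softmax([0.0, gap])
    print(f"score gap {gap:.1f}: defense {defense_w:.3f}, attack {attack_w:.3f}")
```

With a score gap of 4, the higher-scoring span already receives more than 98% of the weight in this two-span toy case.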

2606.19222 2026-06-18 cs.LG cs.AI New submission 90%

Mechanism-Guided Selective Unlearning for RLVR-Induced Reasoning

Chenyu Zhou, Qiliang Jiang, Shuning Wu, Xu Zhou

Affiliations: School of Engineering, Institute of Science Tokyo, Japan; College of Control Science and Engineering, Zhejiang University, China; Department of Electrical and Computer Engineering, National University of Singapore, Singapore

Topic match (safety evaluation): an unlearning method for RLVR-induced reasoning, relevant to model safety

AI summary: Proposes MAST, a mechanism-guided method that selectively updates parameters and substantially reduces collateral damage to retained performance when unlearning RLVR-induced reasoning behavior.

Comments: 15 pages, 4 figures, 7 tables

Abstract

We propose MAST (Mechanism-Aligned Selective Targeting), a mechanism-guided method for unlearning RLVR-induced reasoning with substantially lower collateral damage than standard full-parameter updates. In matched SFT/RLVR checkpoints on Qwen2.5-Math-1.5B and Qwen3-1.7B-Base, the SFT-to-RLVR increment differs sharply from the SFT update in token-level delta-log-probability, and full-parameter gradient ascent forgets only by damaging retain MATH and GSM8K. MAST ranks attention-projection tensors by off-principal energy, update magnitude, and forget-gradient coupling magnitude, then updates only the top-ranked subset. On the primary model, MAST induces statistically significant target forgetting (MATH forget 45/150 to 37/150; McNemar p=0.0078) while preserving GSM8K (+0.8 pp) and MATH retain (-0.5 pp). The advantage reproduces across seeds, NPO/SimNPO objectives, and Qwen3, where MAST preserves GSM8K while full-parameter unlearning collapses it.
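
The reported McNemar p-value can be sanity-checked by hand. The abstract gives only the marginal counts (45/150 before, 37/150 after), not the discordant pairs, so the sketch below assumes the simplest split consistent with them: 8 problems flipped from solved to unsolved and none flipped the other way. Under that assumption, the exact two-sided test gives 2 × 0.5^8 ≈ 0.0078, which matches the reported value; other splits (for example 9 and 1) would give a different p-value.

```python
from math import comb

def mcnemar_exact(b, c):
    """Exact two-sided McNemar p-value from the two discordant counts."""
    n = b + c
    if n == 0:
        return 1.0
    tail = sum(comb(n, k) for k in range(min(b, c) + 1)) / 2 ** n
    return min(1.0, 2 * tail)

# Assumed split (not stated in the abstract): 8 items solved before and
# unsolved after unlearning, 0 items in the opposite direction.
p_value = mcnemar_exact(8, 0)
print(f"{p_value:.4f}")
```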

2606.19168 2026-06-18 cs.AI cs.LG New submission 90%

Beyond Safe Data: Pretraining-Stage Alignment with Regular Safety Reflection

Jinhan Li, Kexian Tang, Yihan Xu, Zhuorui Ye, Kaifeng Lyu

Affiliations: Institute for Interdisciplinary Information Sciences, Tsinghua University

Topic match (safety evaluation): a pretraining-stage safety alignment method

AI summary: Proposes Safety Reflection Pretraining, which inserts safety reflections into the pretraining corpus so that the model acquires a self-monitoring capability; experiments show that it reduces the success rates of inference-stage and finetuning attacks.

Abstract

To achieve deeper safety alignment for large language models (LLMs), recent efforts have studied how to push safety interventions earlier into the pretraining stage, primarily by filtering unsafe data or rewriting it into safer forms. We argue that pretraining-stage alignment should go beyond making the data safe: LLMs may compose seemingly benign knowledge and capabilities into unsafe behaviors. To this end, we propose Safety Reflection Pretraining, a pretraining-stage alignment method which regularly inserts short safety reflections into pretraining corpora to integrate self-monitoring directly into language modeling, establishing a foundational capability that is subsequently reinforced by compatible post-training. Our experiments with 1.7B models pretrained on FineWeb-Edu show that Safety Reflection Pretraining improves safety classification accuracy and substantially reduces the success rates of inference-stage and finetuning attacks. Complementary to our real-world experiments, we also introduce a fully controlled synthetic environment, MedSafetyWorld, with a clear definition of safety and a reasoning structure under which models can easily generalize unsafe behaviors from safe data. Ablations in MedSafetyWorld further demonstrate a clear advantage of Safety Reflection Pretraining in preventing models from acting on unsafe behaviors generalized from safe data, compared with data filtering and rewriting. Taken together, our findings suggest that pretraining alignment should not only make the training data safe, but also shape the behaviors that models are likely to acquire from safe data.

2606.19023 2026-06-18 cs.CR cs.LG New submission 90%

Lifecycle-Aware Dynamic Analysis for Secure ML Model Execution

Gabriele Digregorio, Marco Di Gennaro, Francesco Pastore, Stefano Zanero, Stefano Longari, Michele Carminati

Affiliations: Politecnico di Milano

Topic match (safety evaluation): proposes a dynamic, lifecycle-aware analysis for detecting malicious behavior in ML models

AI summary: Proposes Moat, a dynamic lifecycle-aware approach that detects malicious behavior by monitoring the structured interactions between a model and the host system in each execution phase, achieving a close-to-zero false-positive rate across multiple frameworks.

Abstract

The growing reliance on pre-trained Machine Learning (ML) models has introduced new attack surfaces. Recent vulnerabilities demonstrate that malicious behavior can be embedded within model artifacts, often bypassing existing defenses. Current model-scanning solutions primarily rely on static, format-specific rules or known attack signatures, which limit their ability to generalize across frameworks and to detect novel exploitation paths. In contrast, we propose a solution that focuses on the effects an attack has on the host system executing the model and builds on foundational intuitions about ML model execution. In particular, we observe that ML models operate within well-defined lifecycle phases and that, within each phase, interactions with the host system are highly structured and predictable. We translate these intuitions into Moat, a dynamic lifecycle-aware approach for securing ML model execution, and instantiate this design in Re-Moat, our reference implementation. We evaluate Re-Moat across multiple ML frameworks using 77,974 real-world model artifacts from the Hugging Face Hub, 31 Proofs-of-Concept (PoCs) from CVEs, and 334 models from a state-of-the-art dataset, and compare it against state-of-the-art model-scanning solutions. Our results show that our approach detects all evaluated attack classes while maintaining a close-to-zero false-positive rate, validating our intuitions and motivating dynamic analysis for securing ML model execution.

2606.18656 2026-06-18 cs.CL New submission 90%

The Wrong Kind of Right: Quantifying and Localizing Misfired Alignment in LLMs

Naihao Deng, Yiming Feng, Chimaobi Okite, Kaijian Zou, Lu Wang, Rada Mihalcea, Yulong Chen

Affiliations: University of Michigan; University of Cambridge; University of Aberdeen

Topic match (safety evaluation): proposes the misfired-alignment benchmark VETO and the metric MAR

AI summary: Proposes the VETO benchmark and the Misfired Alignment Rate (MAR) metric, finds that all evaluated LLMs show non-trivial misfired alignment on stereotype-related questions whereas human participants score 0%, and shows through priming experiments that alignment-induced cues amplify the effect.

Abstract

Warning: This paper studies stereotypes and biases, and contains potentially disturbing examples, used for illustration purposes only. Our findings should not be interpreted as an argument against alignment. Instead, this paper highlights the need for principled approaches to more advanced alignment. Alignment aims to ensure that large language models (LLMs) behave safely and reliably, including by avoiding unsafe inferences. However, we show that such safety-oriented behaviors can misfire: models may reject warranted conclusions even when they are explicitly supported by context. We call this failure mode misfired alignment, where alignment-induced changes cause LLMs to override explicit evidence. To quantify this phenomenon, specifically on stereotype-related alignment, we introduce VETO, a benchmark consisting of 2,032 BBQ-derived contrastive pairs, and define a new metric, Misfired Alignment Rate (MAR), which measures on a 0 to 100 scale how often a model fails on a stereotype-related question but succeeds on its contrastive counterpart. We benchmark 25 LLMs on VETO, and show that all LLMs, including the most recent ones, exhibit non-trivial (4.7 to 18.9%) MARs while all human participants achieve 0.0% MAR. Controlled priming experiments further show that alignment-induced cues can substantially amplify MAR across LLMs, indicating that these failures are not merely artifacts of individual examples but can be induced by safety-related framing. Mechanistic analyses on open-weight LLMs reveal late-layer suppression of evidence-supported answers, and comparisons between instruct and base LLMs suggest that this suppression emerges after instruction training. These findings show that current alignment methods can overgeneralize surface-level safety cues, to the point of overriding objective evidence, motivating more work on alignment objectives that better preserve contextual grounding.
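
The MAR definition in the abstract can be made concrete with a short sketch. The data layout and the choice of denominator (all contrastive pairs) are assumptions for illustration; the abstract states only that MAR counts how often a model fails the stereotype-related question while succeeding on its contrastive counterpart, reported on a 0 to 100 scale.

```python
def misfired_alignment_rate(pairs):
    """Percent of pairs failed on the stereotype-related item but passed on its contrast.

    `pairs` holds (stereotype_correct, contrast_correct) booleans, one tuple
    per contrastive pair. Dividing by all pairs is an assumption here.
    """
    if not pairs:
        raise ValueError("need at least one contrastive pair")
    misfires = sum(1 for stereo_ok, contrast_ok in pairs if contrast_ok and not stereo_ok)
    return 100.0 * misfires / len(pairs)

# Made-up results for five pairs: only the second pair counts as a misfire.
results = [(True, True), (False, True), (False, False), (True, False), (True, True)]
print(misfired_alignment_rate(results))
```

A pair where the model fails both questions is not counted, which separates misfired alignment from a general inability to answer.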

2606.18430 2026-06-18 cs.LG cs.CR New submission 90%

Signature filtering: a lightweight enhancement for statistical watermark detection in large language models

Chih-Duo Hong, Yen-Pang Chen, Fang Yu

Affiliations: National Chengchi University

Topic match (safety evaluation): proposes signature filtering to enhance LLM watermark detection

AI summary: Proposes a signature filtering module that removes signature tokens that interfere with watermark detection, raising detection rates in weak-signal and low-entropy settings from 8-31% to 78-99% while keeping the false-positive rate controllable.

Abstract

Statistical watermarks help organizations attribute large language model (LLM) outputs, yet existing detectors often struggle when watermark signals are weak, texts are repetitive, or watermarks are edited. We propose signature filtering, a detection-time module that enhances watermark detection without modifying watermark embedding and text generation. It learns a small set of "signature" tokens whose presence makes watermark tests unreliable, and removes these tokens before detection. The signatures are obtained by solving a mixed-integer linear program on a small training set, with constraints that maximize the true positive rate. We additionally derive finite-sample and asymptotic bounds under several attacker models (color-blind, color-adaptive, and distributionally correlated). On four well-known watermark families (Kgw, Sweet, Unigram, Exp), four benchmark corpora (C4, MBPP, HumanEval, Code-Search-Net), and six LLMs (Opt-1.3b, Opt-6.7b, Llama2-13b, Llama3.1-8b, Qwen2.5-14b, Phi-3-medium-14b), 2- and 3-gram signatures raise detection rates in weak-signal and low-entropy settings from 8-31% without filtering to 78-99% with filtering, while keeping false positives controllable and often negligible. In stress tests where we scramble sentences and perturb 25-50% of tokens by dilution, deletions, and substitutions, 2-gram filters for Kgw-style watermarks preserve most of the clean-text detection gains, often matching or outperforming the advanced WinMax watermark detector. Signature filtering thus provides a simple, scalable, and model-agnostic add-on to strengthen watermark-based provenance checks for LLM text in information processing workflows.
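
The filter-then-detect idea can be sketched as follows. This is a minimal illustration under several assumptions: the signature set is given by hand rather than learned with the mixed-integer program, every token inside a matched n-gram is dropped, and the detector is a generic green-list z-test with a fixed, context-free green list. The names `filter_signatures` and `green_z_score` are hypothetical, and none of this is the authors' implementation.

```python
import math

def filter_signatures(tokens, signatures, n=2):
    """Drop every token that falls inside an n-gram listed in `signatures`."""
    drop = set()
    for i in range(len(tokens) - n + 1):
        if tuple(tokens[i:i + n]) in signatures:
            drop.update(range(i, i + n))
    return [tok for i, tok in enumerate(tokens) if i not in drop]

def green_z_score(tokens, is_green, gamma=0.5):
    """One-proportion z-score for the fraction of green-list tokens."""
    total = len(tokens)
    if total == 0:
        return 0.0
    hits = sum(1 for tok in tokens if is_green(tok))
    return (hits - gamma * total) / math.sqrt(total * gamma * (1 - gamma))

# Toy text: tokens starting with "g" are treated as green (an assumption).
tokens = ["g1", "g2", "r1", "r2", "g3", "r1", "r2", "g4"]
signatures = {("r1", "r2")}  # a repeated low-entropy bigram, chosen by hand
is_green = lambda tok: tok.startswith("g")

print(green_z_score(tokens, is_green))
print(green_z_score(filter_signatures(tokens, signatures), is_green))
```

In this toy case the repeated bigram dilutes the green fraction to exactly the chance level, and removing it restores a positive z-score.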

2606.18356 2026-06-18 cs.CR cs.AI New submission 90%

SafeClawBench: Separating Semantic, Audit-Evidence, and Sandbox Harm in Tool-Using LLM Agents

Yuchuan Tian, Mengyu Zheng, Haocheng Mei, Ye Yuan, Chao Xu, Xinghao Chen, Hanting Chen, Yu Wang

Affiliations: Peking University; Beijing Jiaotong University; SUIBE; Huawei; Tsinghua University

Topic match (safety evaluation): proposes a security benchmark for tool-using LLM agents that separates semantic, audit, and sandbox harm

AI summary: Proposes SafeClawBench, which evaluates the security of tool-using LLM agents through three separate endpoints (semantic attack acceptance, audit-visible harm evidence, and sandbox-observed harm), revealing distinct failure modes and supporting reproducible comparison.

Comments: 32 pages, 5 figures

Abstract

Tool-using language-model agents introduce security failures that go beyond unsafe text: they can disclose protected objects, write persistent memory, send messages, modify databases, or trigger harmful code and tool effects. Existing evaluations often collapse these stages into a single attack success rate, making it difficult to tell whether a model merely agreed with an attacker or actually produced observable harm. We introduce SafeClawBench, a staged benchmark for tool-using agent security with 600 controlled adversarial tasks across six attack families: direct and indirect prompt injection, tool-return injection, memory poisoning, memory extraction, and ambiguity-driven unsafe inference. SafeClawBench reports three separate endpoints: semantic attack acceptance, audit-visible harm evidence, and sandbox-observed tool/state harm. Evaluating five agent endpoints under four prompt-level policies, we find that these endpoints capture different failure modes. Without additional prompt protection, semantic failure rates vary widely across models, from 9.0% to 44.2%. Audited harm evidence is narrower than semantic failure, and under a separate executable protocol some matched task identities produce sandbox harm despite passing the Semantic Core call: in a 12,000-row matched analysis, 291 of 347 observed sandbox harms occur in rows that pass the semantic check. Prompt policies change endpoint outcomes, but their effects depend on both model and protocol. SafeClawBench provides a reproducible framework for comparing agent models and prompt-policy conditions without conflating textual compliance, evidence-supported harm, and executable state changes. The open-source dataset is available at https://huggingface.co/datasets/sairights/safeclawbench.
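
To make the three-endpoint idea concrete, the sketch below stores the endpoints as separate flags per evaluated row and counts sandbox harms that occur despite a passed semantic check. The record layout and field names are illustrative assumptions, not the dataset's schema, and the toy rows are fabricated. The last line only recomputes the share implied by the counts quoted in the abstract. As a side note, 600 tasks × 5 agent endpoints × 4 policies equals 12,000, which is consistent with the size of the matched analysis, although the abstract does not state that derivation.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class RunRecord:
    """One evaluated row; field names are illustrative, not the dataset schema."""
    semantic_accept: bool  # model textually went along with the attack
    audit_harm: bool       # harm evidence visible in the audit log
    sandbox_harm: bool     # harmful tool or state effect observed in the sandbox

def sandbox_harm_despite_semantic_pass(records):
    """Return (harms in rows that passed the semantic check, all sandbox harms)."""
    harms = [r for r in records if r.sandbox_harm]
    passed = [r for r in harms if not r.semantic_accept]
    return len(passed), len(harms)

# Toy rows: two sandbox harms, one of which passed the semantic check.
rows = [
    RunRecord(False, False, False),
    RunRecord(True, True, True),
    RunRecord(False, False, True),
]
print(sandbox_harm_despite_semantic_pass(rows))

# Share implied by the counts reported in the abstract (291 of 347).
print(f"{291 / 347:.1%}")
```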

2606.19106 2026-06-18 cs.CR cs.CY New submission 85%

Quantifying Compromise Risk in Exceptional Access Architectures Under Sparse and Indirect Evidence

Alan Woodward

Topic match (safety evaluation): quantifies systemic compromise risk in exceptional access architectures

AI summary: Because no public compromise data exist for exceptional access systems, the paper builds a structured uncertainty framework that uses historical analogues, Monte Carlo scenarios, a channel-independence decomposition, and a Bayesian structural risk model to quantify systemic compromise risk in transmission-layer and platform-layer EA architectures. It finds that both classes carry higher risk than the no-EA baseline and that their risk distributions differ in shape.

Abstract

Lawful exceptional access (EA) systems hold the cryptographic keys that decrypt protected communications for authorised parties. The debate over their risks has been long and qualitative, complicated by two problems: no public dataset of EA-specific compromise events exists, so assessment must use sparse, indirect evidence; and prior work has treated structurally different designs as equivalent, though transmission-layer EA in carrier infrastructure (T-EA) and over-the-top EA at the platform layer (OTT-EA) differ in how cryptographic keys relate to ciphertext data. This paper builds a structured uncertainty framework for evaluating systemic compromise risk in EA architectures. It does not produce predictive forecasts, which the evidence cannot support; it separates findings robust to assumptions from those that depend on calibration. Four analytical layers are applied to T-EA and OTT-EA: three empirical pillars (historical analogues, a Monte Carlo scenario layer, a channel-independence decomposition) plus a Bayesian Structural Risk Model on a parallel-subgraph attack graph. The central findings are structural. First, EA-equipped architectures of either class carry strictly higher modelled risk than their no-EA counterfactual, an ordering independent of calibration. Second, the classes differ in distribution shape: T-EA risk is dominated by central tendency, OTT-EA by the tail under correlated campaigns. Third, calibration-conditional annual probability ranges span 1.4% to 12.9% for T-EA across the structured-judgement targeting-premium interval. Over multi-decade horizons, cumulative compromise is well above zero; key-material exfiltration is irreversible, weighing heavily on OTT-EA's larger user populations. The framework quantifies compromise probability, not expected harm; consequence modelling and benefit estimation are outside its scope.
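
The remark about multi-decade horizons can be illustrated with simple compounding. The sketch below assumes a constant annual probability and independence between years, which is a deliberate simplification and not the paper's Bayesian model (the abstract itself notes correlated campaigns for OTT-EA). It uses the two ends of the T-EA annual range quoted above.

```python
def cumulative_probability(annual_p, years):
    """Chance of at least one compromise in `years`, assuming independent years."""
    if not 0.0 <= annual_p <= 1.0:
        raise ValueError("annual_p must be a probability")
    return 1.0 - (1.0 - annual_p) ** years

# Endpoints of the T-EA annual range quoted in the abstract.
for annual_p in (0.014, 0.129):
    for years in (10, 20, 30):
        print(f"p={annual_p:.3f}, {years} years: {cumulative_probability(annual_p, years):.2f}")
```

Under these assumptions, even the low end of the range compounds to roughly a one-in-three chance over 30 years, and the high end approaches certainty.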

2606.18936 2026-06-18 cs.AI cs.CY New submission 85%

SciRisk-Bench: A Risk-Dimension-Aware Benchmark for AI4Science Safety

Linghao Feng, Yinqian Sun, Dongqi Liang, Sicheng Shen, Chenfei Yan, Yuxuan Peng, Yilin Zhao, Haibo Tong, Kai Li, FeiFei Zhao, Yi Zeng

Affiliations: Brain-inspired Cognitive Intelligence Lab, Institute of Automation, Chinese Academy of Sciences, Beijing, China; School of Future Technology, University of Chinese Academy of Sciences, China; School of Artificial Intelligence, University of Chinese Academy of Sciences, China; Zhongguancun Academy, China; Beijing Key Laboratory of Safe AI and Superalignment; Gaoling School of AI, Renmin University of China; Beijing Institute of AI Safety and Governance (Beijing-AISI); School of Humanities, University of Chinese Academy of Sciences, China

Topic match (safety evaluation): proposes a safety benchmark for scientific domains that evaluates risk dimensions

AI summary: Proposes SciRisk-Bench, which evaluates AI4Science safety from two perspectives, explicit risk dimensions and scientific disciplines, covering 7 disciplines, 31 subdisciplines, and 10 risk dimensions; experiments reveal safety weaknesses in mainstream and science-oriented LLMs.

Abstract

Large language models (LLMs) are increasingly embedded in AI for Science (AI4Science) workflows, from scientific question answering and literature analysis to laboratory planning and autonomous discovery. This progress creates an urgent need for safety benchmarks that evaluate not only scientific competence, but also whether models recognize and avoid risks in high-stakes scientific contexts. Existing AI4Science safety datasets cover several disciplines and task formats, leaving the underlying risk dimensions underspecified. We introduce SciRisk-Bench, a benchmark designed to evaluate AI4Science safety from two complementary perspectives: explicit risk dimensions and scientific disciplines. SciRisk-Bench covers 7 disciplines, 31 subdisciplines, and 10 risk dimensions. In the experimental section, we evaluate both mainstream LLMs and science-oriented LLMs across risk dimensions, disciplines, and subdisciplines, enabling fine-grained diagnosis of where scientific models remain unsafe.

2606.18782 2026-06-18 cs.CL cs.AI New submission 85%

RedactionBench

Sean Brynjólfsson, Shashvat Jayakrishnan, Esha Sali, Diptanshu Purwar, Madhav Aggarwal

Affiliations: A10 Networks, Inc.

Topic match (safety evaluation): proposes a privacy-protection benchmark, grounded in contextual integrity, for evaluating LLM redaction

AI summary: RedactionBench evaluates contextual privacy in redaction using 200 documents across 11 domains, introduces the R-Score metric, and reveals the subjectivity of contextual redaction, with the aim of advancing privacy-preserving systems.

Abstract

Large Language Models are increasingly applied to sensitive domains that require redaction of personally identifiable information (PII). While redacting PII is a data cleaning prerequisite, existing benchmarks conflate extraction mechanics with privacy semantics. A public phone number is not equivalent to a phone number in a medical record. Whether information constitutes a violation depends heavily on who holds it, why, and in what context, fundamentally differentiating redaction from simple entity recognition. Grounded in contextual integrity, we introduce RedactionBench, a manually annotated benchmark comprising 200 diverse documents across 11 domains, mostly seeded from real-world sources. We also introduce R-Score, a novel character-level metric that treats semantically similar redactions equally and nullifies shallow formatting choices, such as varying masking styles for phone numbers. Evaluations across Named Entity Recognition models, entity extraction Small Language Models, and frontier models equipped with agentic tools demonstrate that contextual redaction remains an unsolved problem. A human evaluation with over 80 users on RedactionBench reveals a stark dichotomy in privacy perceptions. Annotators show consensus with target labels for mandatory redactions (89.4 percent) and safe text preservations (94.1 percent), but fail to agree on contextual redactions (47.7 percent). This variance demonstrates the subjective nature of contextual privacy and motivates R-Score, which decouples contextual ambiguity from strict precision. We compare 35 models across families and report their performance in redacting PII. Finally, we release RedactionBench to establish a baseline for future privacy-preserving systems, hoping to inspire efficient model design and standardized evaluations.

2606.18473 2026-06-18 cs.CL New submission 85%

PreUnlearn: Auditing Collateral Knowledge Damage Before Large Language Model Unlearning

Bo Su, Ankit Shah, Thai Le

Affiliations: Indiana University Bloomington

Topic match (safety evaluation): audits the collateral knowledge damage of LLM unlearning

AI summary: Proposes PreUnlearn, which uses data features to predict the collateral damage an unlearning run will cause to same-domain and distant-domain knowledge, enabling risk auditing before unlearning is executed.

Comments: 12 pages, 6 figures

Abstract

Machine unlearning for large language models (LLMs) aims to remove specified knowledge while preserving the rest of the model's capabilities. However, the boundary between knowledge to forget and knowledge to retain is often unclear, since related and even distant information may be entangled in the model. In this paper, we study LLM unlearning from a data-centric perspective and measure how unlearning effects propagate from the forget set to same-domain and distant-domain knowledge. We find a consistent decay pattern: collateral damage is strongest near the forget set, weakens with semantic distance, but does not disappear at domain boundaries. We further ask whether such damage can be audited before unlearning is executed. We formulate forget-set auditing as a pre-unlearning prediction task and analyze which data features are most predictive of downstream damage. Our results show that interaction features between the forget set and evaluation set provide the strongest signals, suggesting that collateral damage is partly reflected in data geometry before model updates occur. These findings position forget-set auditing as an early warning tool for identifying risky unlearning runs and designing more reliable unlearning procedures.

2606.12618 2026-06-18 cs.AI New submission 85%

"Did you lie?" Evaluating Lie Detectors across Model Scale and Belief-Verified Model Organisms

Alan Cooney, David Africa, Geoffrey Irving

Affiliations: AI Security Institute

Topic match (safety evaluation): evaluates lie detectors for language models

AI summary: The study builds 13 belief-verified reasoning model organisms and a varied prompted-lying testbed, and evaluates four lie detectors across model scales. Activation- and logprob-based detectors drop sharply on the trained model organisms, while the chain-of-thought judge remains strong, partly as an artefact of the verification process.

Comments: 12 pages, 6 figures

Abstract

Robust lie detectors for language models could enable powerful techniques for auditing, monitoring, and post-hoc investigation of model behaviour, but evaluating them requires testbeds where models verifiably believe the opposite of what they say. We show that existing trained model organisms often fail this requirement, leaving prior positive and negative detection results difficult to interpret. We address this with 13 reasoning model organisms whose hidden beliefs are verified in chain-of-thought and shown to generalise to held-out tasks, alongside Varied Deception, a prompted-lying testbed covering a broad range of lie-inducing motivations. On these testbeds we evaluate four detectors: a chain-of-thought judge, a logprob classifier, and two activation probes, including Did-You-Lie (DYL), a new method for training follow-up probes. On prompted lying, across 31 open-weight models spanning 2B to 1T parameters, all four detectors show positive scaling with model capability. However, every activation- and logprob-based detector drops sharply on our trained model organisms, with DYL retaining the most signal; only the chain-of-thought judge remains strong, with 0.82 balanced accuracy, partly as an artefact of our verification process favouring CoT-readable beliefs. Current lie detectors therefore cannot support high-confidence claims about model beliefs, and we suggest research directions that may address some of their current limitations. We release our datasets, model organisms, and trained detectors.
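
The abstract reports detector quality as balanced accuracy. For readers less familiar with the metric, the sketch below computes it as the mean of the true-positive and true-negative rates and shows why it is preferred over plain accuracy when lies are rare. The example data are made up and are unrelated to the paper's results.

```python
def balanced_accuracy(labels, predictions):
    """Mean of the true-positive rate and the true-negative rate."""
    tp = sum(1 for y, p in zip(labels, predictions) if y and p)
    tn = sum(1 for y, p in zip(labels, predictions) if not y and not p)
    positives = sum(1 for y in labels if y)
    negatives = len(labels) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("need both lying and honest examples")
    return 0.5 * (tp / positives + tn / negatives)

# Made-up data: 2 lies and 8 honest answers. A detector that always says
# "honest" scores 0.8 plain accuracy but only 0.5 balanced accuracy.
labels = [True, True] + [False] * 8
always_honest = [False] * 10
print(balanced_accuracy(labels, always_honest))
```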

2606.19057 2026-06-18 stat.ML cs.LG stat.CO stat.ME New submission 80%

Quantifying and Auditing LLM Evaluation via Positive-Unlabeled Learning

Zilong Zhang, Yi-Ting Hung, Lei Ding, Chi-Kuang Yeh

Affiliations: Department of Mathematics and Statistics, Georgia State University; Department of Statistics, University of Manitoba

Topic match (safety evaluation): audits bias in LLM evaluation

AI summary: To address the systematic biases of LLM judges (such as verbosity bias), proposes a geometric auditing framework based on partial optimal transport that uses a small set of human-verified positives to correct the bias, improving agreement with human preferences without retraining.

Abstract

Large Language Models (LLMs) are increasingly used as judges for scalable evaluation, yet such LLM-as-a-Judge systems exhibit systematic biases that are decoupled from semantic quality, most notably verbosity bias. Meanwhile, human supervision is costly and typically selective, yielding reliable positive judgments but leaving most outputs unlabelled and potentially mixed in quality. We formulate LLM evaluation under selective human supervision as a positive-unlabelled learning problem and propose a geometric auditing framework based on Partial Optimal Transport. By aligning a small set of human-verified positives with a reliable subset of unlabelled outputs in a fixed embedding space, our method identifies human-consistent preferences and corrects biased judges without retraining. Experiments demonstrate improved alignment with human preferences, increased robustness to presentation biases, and interpretable confidence estimates, offering a scalable and statistically grounded alternative to existing LLM-as-a-Judge pipelines.

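2606.19262 2026-06-18 cs.LG New submission 80%

Detecting Hidden ML Training With Zero-Overhead Telemetry

Robi Rahman, Sabiha Tajdari

Affiliations: Machine Intelligence Research Institute; University of Virginia

Topic match (safety evaluation): detecting hidden ML training for AI governance and safety

AI summary: This paper evaluates the adversarial robustness of classifying GPU workloads using only zero-overhead, privacy-preserving NVML telemetry (content-agnostic signals). It develops a classifier that reaches 98.2% binary accuracy at identifying training workloads and 43-87% accuracy on the most challenging unexpected workloads.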

Comments Technical AI Governance Research workshop at ICML 2026

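2606.19242 2026-06-18 cs.SE New submission 80%

Runtime Compliance Verification for AI Agents

Nafiseh Kahani, Masoud Barati, Diana Addae

Topic match (safety evaluation): runtime monitoring to ensure GDPR compliance

AI summary: Proposes the C-Trace framework, which uses runtime monitoring and formal policy predicates to ensure that AI agents follow GDPR rules in tool calls and conversations, reducing the attack success rate to below 12%.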

2606.18767 2026-06-18 cs.CL New submission 80%

Output Vector Editing for Memorization Mitigation in Large Language Models

Ahmad Dawar Hakimi, Kaiwei Lei, Isabelle Augenstein, Hinrich Schütze

Affiliations: Center for Information and Language Processing, LMU Munich; Department of Computer Science, University of Copenhagen; Munich Center for Machine Learning; Pioneer Centre for AI

Topic match (safety evaluation): mitigating LLM memorization via output vector editing

AI summary: Proposes output vector editing, which uses constrained optimization to modify the output vectors of MLP neurons and introduce a distractor, suppressing memorized sequences without changing activations. It achieves 87.9% suppression on OLMo-7B and identifies a mechanistic boundary of MLP editing.

Abstract

Large language models memorize and reproduce sequences from their training data, creating privacy, copyright, and security risks. Existing neuron-level mitigation methods equate editing with zeroing out neuron activations, but the activation only controls whether a neuron engages; the output vector is what writes to the residual stream and, through superposition, encodes multiple features. We propose output vector editing, a constrained-optimization weight edit that locates a small set of MLP neurons responsible for a memorized continuation and minimally modifies their output vectors to introduce a distractor in vocabulary space, redirecting their residual-stream contributions while leaving activations unchanged. Evaluating on four models from 360M to 7B parameters (SmolLM-360M, OLMo-1B, OLMo-7B, Llama2-7B), we center on OLMo-7B (whose open weights and pretraining corpus enable systematic mining) and mine 6831 memorized sequences, achieving up to 87.9% suppression. The 2.7× gap over zero ablation on the same located neurons shows the suppression comes from the output-vector edit, not localization alone. Four edit modes span a spectrum from aggressive suppression to minimal redirection; in ensemble they cover 96.5% of memorized sequences, while our recommended single-mode configuration reaches 81.5% with no catastrophic locality failures. We further identify a mechanistic boundary at about 14% of sequences unreachable by MLP-only editing; while these failures are not attention-driven overall, ablating the top contributing attention heads recovers 60-64% of them, with stronger recovery on continuations that copy tokens from the prefix, positioning attention as a complementary fallback rather than a primary mechanism. Edit mode ordering and the success-locality trade-off transfer across all four models, with success rates scaling with model size rather than family.
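To make the contrast between zero ablation and an output-vector edit concrete, the toy sketch below treats one MLP neuron as a scalar activation times an output vector, read out through an unembedding matrix. It is not the paper's constrained optimization: `min_norm_edit` is a hypothetical helper that applies the smallest L2 change to the output vector that makes a distractor token outscore the memorized token by a chosen margin, with the activation held fixed. All names, shapes, and numbers are assumptions for illustration, and the helper assumes the two tokens have distinct unembedding rows.

```python
def dot(x, y):
    return sum(a * b for a, b in zip(x, y))


def neuron_logits(activation, out_vec, unembed):
    """Logit contribution of one neuron: unembed @ (activation * out_vec)."""
    return [activation * dot(row, out_vec) for row in unembed]


def min_norm_edit(activation, out_vec, unembed, memorized, distractor, margin=1.0):
    """Smallest L2 change to out_vec so that the distractor logit exceeds the
    memorized logit by `margin`, keeping the activation unchanged."""
    if activation == 0:
        raise ValueError("an inactive neuron writes nothing, so editing it has no effect")
    direction = [d - m for d, m in zip(unembed[distractor], unembed[memorized])]
    gap = activation * dot(direction, out_vec)  # distractor logit minus memorized logit
    if gap >= margin:
        return list(out_vec)
    step = (margin - gap) / (activation * dot(direction, direction))
    return [v + step * d for v, d in zip(out_vec, direction)]


# Toy setup (assumed): 3 vocabulary tokens, 2-dimensional residual stream.
unembed = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
activation, out_vec = 2.0, [1.0, 0.0]

before = neuron_logits(activation, out_vec, unembed)   # token 0 (memorized) wins
ablated = neuron_logits(0.0, out_vec, unembed)         # zero ablation removes the contribution
edited_vec = min_norm_edit(activation, out_vec, unembed, memorized=0, distractor=1)
after = neuron_logits(activation, edited_vec, unembed) # token 1 (distractor) wins
```

In this toy, zero ablation only removes the neuron's contribution, whereas the edit keeps the neuron active and redirects what it writes toward the distractor.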

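2606.18532 2026-06-18 cs.CR cs.AI cs.RO cs.SE New submission 80%

AI Sandboxes: A Threat Model, Taxonomy, and Measurement Framework

Inderjeet Singh, Haitham Mahmoud, Andrés Murillo

Affiliations: Fujitsu Research of Europe

Topic match (safety evaluation): threat model and measurement framework for AI sandboxes

AI summary: Proposes a threat model, taxonomy, and measurement framework for AI sandboxes; formalizes the sandbox boundary and a weakest-link rule, defines a cyber-physical threat model, and validates the framework on three case studies.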

Comments 50 pages, 8 figures, 10 tables

Abstract

AI systems are increasingly evaluated in bounded environments that combine isolation, simulation, instrumentation, supervision, and evidence capture. For physical AI, AIoT, and cyber-physical systems, this shift is not a matter of terminology: the system under test may sense, decide, actuate, communicate, and fail through physical processes, networked devices, and human operators. This article develops an assurance-oriented account of AI sandboxes as controlled environments for testing, evaluation, verification, and validation across digital AI, embodied autonomy, and cyber-physical deployments. We formalize the sandbox boundary and a weakest-link rule for composing per-dimension evidence into a bounded deployment claim; separate major sandbox archetypes; define a cyber-physical threat model that includes attacks on the assurance apparatus itself; and introduce a measurement framework spanning fidelity, controllability, observability, containment, reproducibility, and governance artifacts, instantiated on three worked case studies of real sandboxes. The resulting threat model, taxonomy, and measurement framework clarify what a sandbox can validly test, which risks it can contain, and what forms of evidence it can support for safety, security, and regulatory assurance.
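The weakest-link rule mentioned above can be read as: the overall deployment claim is bounded by the least-supported dimension. The sketch below is one plausible reading, not the paper's formalization. The six dimension names come from the abstract; the ordinal 0-4 evidence scale, the function name, and the example scores are assumptions.

```python
# Dimensions named in the abstract's measurement framework.
DIMENSIONS = (
    "fidelity",
    "controllability",
    "observability",
    "containment",
    "reproducibility",
    "governance",
)


def weakest_link(evidence):
    """Return (limiting_dimension, level): the claim is capped by the minimum score.

    `evidence` maps each dimension to an ordinal evidence level (assumed 0-4 scale).
    """
    missing = [d for d in DIMENSIONS if d not in evidence]
    if missing:
        raise ValueError(f"no evidence recorded for: {missing}")
    limiting = min(DIMENSIONS, key=lambda d: evidence[d])
    return limiting, evidence[limiting]


# Hypothetical sandbox assessment.
evidence = {
    "fidelity": 3,
    "controllability": 4,
    "observability": 3,
    "containment": 1,
    "reproducibility": 4,
    "governance": 2,
}
limiting, level = weakest_link(evidence)
```

Taking the minimum rather than an average means strong evidence on one dimension cannot compensate for weak containment, which is the point of a weakest-link composition.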

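2606.18322 2026-06-18 cs.LG cs.AI New submission 75%

SAE Interventions are Unreliable: Post-Intervention Recovery of Suppressed Behavior

Mingyue Cui, Linghui Shen, Xingyi Yang

Affiliations: The Hong Kong Polytechnic University

Topic match (safety evaluation): shows that SAE feature interventions are unreliable and have a recoverable failure mode

AI summary: Finds that sparse autoencoder (SAE) feature interventions can suppress a behavior yet leave a recoverable failure mode: optimizing residual perturbations restores the original behavior, exposing a gap between feature-level control and behavioral completeness.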

Comments Code: https://github.com/Mingyuee88/sae-post-intervention-recovery, Project page: https://mingyuee88.github.io/sae-post-intervention-recovery/

Abstract

Sparse Autoencoders (SAEs) decompose residual-stream activations into interpretable features. Recent latent-space defenses increasingly rely on these decompositions, assuming that identified "unsafe" SAE features serve as actionable handles for monitoring and intervention. In this paradigm, clamping a specific harmful feature is expected to reliably prevent model misbehavior. However, we show that this success may hide a recoverable failure mode: the clamp may block one visible route to a behavior without eliminating the behavior itself. We formulate this vulnerability as post-intervention recovery, a constrained residual-space optimization problem. Starting from the post-intervention residual state, we optimize residual perturbations to recover the pre-intervention behavior while preserving the post-intervention values of the targeted SAE features. Even under a strong threat model where the intervention remains active throughout optimization and generation, recovery remains possible. To rule out that recovery simply undoes the intervention, we use encoder-orthogonal updates for single-layer interventions and the corresponding feature-map Jacobian in the cross-layer setting. Across TPP, unlearning, IOI, and refusal steering experiments, this stress test reveals recoverable behavior despite successful feature-level intervention. Especially in the safety-critical refusal-steering setting, we achieve a 95.8% recovery rate on valid samples while keeping defended-feature relative drift to 0.131, substantially below suffix-based baselines. A recovery-path attribution analysis further localizes this recovery to the SAE reconstruction residual, the component left unexplained by the SAE. These results expose a gap between feature-level control and behavioral completeness: SAE features can support causal intervention, but controlling them does not guarantee control over the underlying behavior.
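The idea behind an encoder-orthogonal update can be illustrated for a single SAE feature: if a perturbation is projected onto the orthogonal complement of that feature's encoder row, the feature's pre-activation does not change, so the clamp stays visibly in place while the residual state moves. This is a minimal single-feature sketch under that assumption; the function names and vectors are illustrative, and the paper's multi-feature and cross-layer (Jacobian-based) variants are not reproduced here.

```python
def dot(x, y):
    return sum(a * b for a, b in zip(x, y))


def project_orthogonal(delta, encoder_row):
    """Remove from `delta` its component along `encoder_row`."""
    scale = dot(delta, encoder_row) / dot(encoder_row, encoder_row)
    return [d - scale * w for d, w in zip(delta, encoder_row)]


def feature_preactivation(residual, encoder_row, bias=0.0):
    """Pre-activation of one SAE feature: encoder_row . residual + bias."""
    return dot(residual, encoder_row) + bias


# Toy vectors (assumed values).
encoder_row = [1.0, 2.0, 2.0]
residual = [0.5, -1.0, 2.0]
raw_update = [3.0, 0.0, 3.0]

safe_update = project_orthogonal(raw_update, encoder_row)
perturbed = [r + u for r, u in zip(residual, safe_update)]

pre_before = feature_preactivation(residual, encoder_row)
pre_after = feature_preactivation(perturbed, encoder_row)
```

Here the residual changes by a nonzero vector, yet the targeted feature's pre-activation is identical before and after the update.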

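2606.18946 2026-06-18 cs.CL New submission 70%

SenFlow: Inter-Sentence Flow Modeling for AI-Generated Text Detection in Hybrid Documents

Jingkun Luo, Yifan Sun, Da-Tian Peng, Guanxiong Pei

Affiliations: Northwestern Polytechnical University; Zhejiang Lab

Topic match (safety evaluation): AI-generated text detection, which falls under safety evaluation

AI summary: For sentence-level AI-generated text detection in human-AI hybrid documents, proposes SenFlow, which models inter-sentence dependencies through graph propagation and CRF decoding, improving cross-domain Macro-F1 on the MOSAIC benchmark by 4.15 percentage points.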

Comments 16 pages, 4 figures, 9 tables

Abstract

Sentence-level AI-generated text detection (S-AGTD) for hybrid documents, where humans and LLMs co-author one text, faces two gaps: existing methods classify each sentence in isolation, discarding inter-sentence dependencies, and existing benchmarks omit the newest generation of generators. We construct MOSAIC, a benchmark of 16,000 hybrid documents over PubMed and XSum, generated by DeepSeek-V3.2 and Kimi K2 under stringent quality controls including a perplexity-consistency filter absent from prior benchmarks. We recast S-AGTD as structured prediction over the document sentence sequence and instantiate it as SenFlow, integrating graph-based inter-sentence propagation with linear-chain CRF decoding in a single document-level pass over a sentence graph. SenFlow reaches state-of-the-art performance on MOSAIC, with a +4.15 pp average Macro-F1 margin on cross-domain transfer, the hardest of three protocols of increasing difficulty. We further find that even after the perplexity filter equalizes overt cues, AI insertions retain a generator-dependent sentence-length gap that sentence-level detectors still exploit. Code and data: https://github.com/luojingkun22/SenFlow
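The difference between classifying sentences in isolation and decoding them jointly can be shown with a small Viterbi decoder for a linear-chain model. This sketch covers only the decoding step, not SenFlow's graph propagation or training; the emission scores and transition matrix are made-up numbers, with label 0 standing for human-written and label 1 for AI-generated.

```python
def viterbi(emissions, transitions):
    """Best label sequence under per-sentence scores plus label-transition scores.

    emissions[t][j]: score of label j for sentence t.
    transitions[i][j]: score of moving from label i to label j.
    """
    n_labels = len(transitions)
    score = list(emissions[0])
    backpointers = []
    for emit in emissions[1:]:
        new_score, pointers = [], []
        for j in range(n_labels):
            best_i = max(range(n_labels), key=lambda i: score[i] + transitions[i][j])
            pointers.append(best_i)
            new_score.append(score[best_i] + transitions[best_i][j] + emit[j])
        score = new_score
        backpointers.append(pointers)
    # Trace the best path backwards from the best final label.
    path = [max(range(n_labels), key=lambda j: score[j])]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    return path[::-1]


# Hypothetical scores for a three-sentence document.
emissions = [[2.0, 0.0], [0.4, 0.6], [2.0, 0.0]]
transitions = [[1.0, -1.0], [-1.0, 1.0]]  # staying in the same label is rewarded

independent = [max(range(2), key=lambda j: e[j]) for e in emissions]
joint = viterbi(emissions, transitions)
```

With these numbers the isolated classifier flips the weakly scored middle sentence to AI-generated, while joint decoding keeps the whole document human-written because the neighbouring sentences outweigh the small emission difference.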

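2606.18924 2026-06-18 cs.SD New submission 70%

Who Wins the Conflict? Mechanistic Interpretability of Text Bias in Audio LLMs

Hyebin Cho, Suho Yoo, Jaehyuk Jang, Changick Kim, Joon Son Chung

Affiliations: School of Electrical Engineering, KAIST

Topic match (safety evaluation): studies text-dominance bias and mitigates hallucination

AI summary: Uses mechanistic analysis to expose text-dominance bias in audio LLMs, finds that the text pathway actively suppresses intact audio representations, and applies back-patching, a training-free intervention, to amplify audio representations and reduce text dominance.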

Comments Preprint

Abstract

While Audio Large Language Models (Audio LLMs) excel at multimodal understanding, they suffer from text dominance, a bias where models blindly favor text over acoustic evidence, causing hallucinations. However, the internal mechanisms underlying how these models behave when audio and textual inputs contradict each other remain unexplored. In this work, we present the first mechanistic analysis of this phenomenon by tracing the propagation of internal representations across layers. Our investigation reveals three key findings: (i) text dominance is systematic and empirically observed across models; (ii) while text and audio rely on functionally distinct pathways, they ultimately converge into a shared semantic space in late layers; and (iii) the text pathway does not erase audio information, but rather actively suppresses intact audio representations. Building on these insights, we leverage back-patching, a training-free intervention that routes late-layer audio activations back into earlier layers. This amplifies the audio representations, enabling them to overcome textual suppression. Our evaluation shows that back-patching consistently reduces text dominance, paving the way for mechanistic multimodal alignment under conflict.

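2606.18632 2026-06-18 cs.RO New submission 70%

ROBOSHACKLES: A Safety Dataset for Human-Injury Prevention in Embodied Foundation Models

Zhuowen Yin, Chongyang Liu, Wenzhang Yang, Renjue Li, Yinxing Xue

Affiliations: Institute of AI for Industries, Chinese Academy of Sciences; University of Science and Technology of China

Topic match (safety evaluation): evaluates unsafe actions produced by models in safety-critical scenarios

AI summary: Because data of robots injuring humans cannot be collected safely, proposes a safety data construction pipeline grounded in real observations and builds ROBOSHACKLES, a dataset of 10,000 videos covering direct-harm and indirect-harm categories. The evaluation finds that existing models produce unsafe actions in 100% of the safety-critical scenarios tested.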

Abstract

Embodied Foundation Models (EFMs) integrate multimodal understanding, future-state reasoning, and executable robot actions. Yet their safety alignment for human-injury prevention remains underexplored, primarily because real-world data of robots harming humans or creating hazardous household situations cannot be safely or ethically collected. To address this challenge, we propose a safety-critical data construction pipeline for human-injury prevention in EFMs. Starting from real DROID observations, our construction pipeline proceeds through scene understanding, hazard-aware image editing, temporal prompt generation, and single-pass rollout synthesis. The temporal prompts specify the expected scene evolution, while Wan2.7 synthesizes realistic robotic rollouts from the edited hazardous states in a single pass. Using this pipeline, we construct ROBOSHACKLES, a 10,000-clip robotic video dataset derived from real DROID observations, spanning two direct-harm and four indirect-harm categories. To ensure dataset quality, we assess task completion and visual quality with automatic metrics, and evaluate six representative EFMs under a refusal-based safety criterion. Results show that all evaluated models produce unsafe actions in the tested safety-critical scenarios, yielding a 100% unsafe action generation rate. ROBOSHACKLES serves as a scalable benchmark and training resource for refusal learning and hazard anticipation before robot action execution. The dataset is publicly available at https://huggingface.co/datasets/YZW00/RoboShackles.

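2606.18310 2026-06-18 cs.CR cs.AI New submission 70%

Conflict-Aware Retriever Editing for Knowledge Injection Attacks on LLM-Based RAG Systems

Xinru Liu, Xianglong Zhang, Di Cai, Zhumin Chen, Pengfei Hu, Xin Xin

Affiliations: Shandong University, China; Tsinghua University, China

Topic match (safety evaluation): knowledge injection attacks on RAG systems

AI summary: Proposes CAREATTACK, a conflict-aware retriever editing framework that injects malicious knowledge into RAG systems through a model-centric attack, resolves parameter conflicts with graph-based detection and parameter-editing projection, and applies lightweight calibration that preserves attack effectiveness.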

Abstract

Injecting malicious knowledge into retrieval-augmented generation (RAG) systems can manipulate retrieved evidence and mislead downstream generation, posing a serious security threat for AI applications. Existing RAG injection attacks mainly rely on manipulating external knowledge bases, for example by crafting malicious corpora. However, the synthetic text crafted by such data-centric methods can be detected, causing the attack to fail. Beyond corpus manipulation, open-source retrievers are increasingly exposing RAG systems to model-centric attacks. In this paper, we propose conflict-aware retriever editing, i.e., CAREATTACK, a model-centric retriever attack framework for malicious knowledge injection in RAG. Specifically, CAREATTACK consists of two stages: conflict-aware retriever editing and attack-preserving anchor repair. Conflict-aware retriever editing adapts efficient closed-form parameter editing to the dense retrieval model, promoting malicious knowledge above benign competing passages and resolving potential parameter conflicts through graph-based conflict detection and parameter editing projection. Then, attack-preserving anchor repair performs lightweight calibration on the edited retriever to further eliminate the impact on non-target prompts while preserving the attack effectiveness for target prompts. We instantiate CAREATTACK on Qwen3-Embedding-0.6B and BGE-M3, and conduct evaluation on three benchmark datasets. Experimental results demonstrate that our method substantially promotes malicious passages into the retrieved knowledge of RAG systems and can perform attacks for batches of target prompts and passages, given access to the retrieval model parameters. Since most RAG systems are built upon open-source retrieval models, this work reveals a practical attack surface in RAG systems. Code is publicly accessible at https://anonymous.4open.science/r/CareAttack-3F1C.

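2606.18289 2026-06-18 cs.HC cs.CY New submission 70%

Beyond the Algorithm: Professional Experiences and Perceptions of AI Bias

Micarah Malone-Gawu

Topic match (safety evaluation): studies perception and mitigation of AI bias, touching on algorithmic fairness and safety

AI summary: A qualitative multi-case study of how AI practitioners perceive and mitigate algorithmic bias. It finds that bias stems from historical inequities, exclusionary design, and organizational pressures, and stresses that fairness requires structural accountability, diverse participation, and cognitive awareness.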

Comments PhD thesis

Abstract

The purpose of this qualitative multi-case study was to examine how social bias emerges, is perceived, and can be mitigated within artificial intelligence and machine learning systems by practitioners directly involved in their design, development, and governance. Although examples from healthcare, criminal justice, employment, and education were used to illustrate domains where automated systems shape everyday life, the study focused on the lived experiences and professional insights of AI practitioners rather than sector-specific populations. Guided by Intersectionality Theory and Cognitive Science, the study employed an interpretivist approach, utilizing semi-structured interviews with nine practitioners, supplemented by document analysis and triangulated case material to enrich contextual understanding. Findings showed that algorithmic bias arises from historical inequities, exclusionary design assumptions, and organizational pressures that prioritize speed and efficiency over ethical reflection. Participants emphasized that technical corrections alone cannot ensure fairness; instead, equitable AI requires structural accountability, diverse participation, and sustained cognitive awareness during the development lifecycle. Many described limited enforcement of ethical standards and organizational cultures that inconsistently support responsible practice. The study concludes that human-centered and socially grounded AI development depends on embedding ethics early in the design process, strengthening governance frameworks, and cultivating institutional environments that encourage reflective decision-making. These insights contribute to ongoing conversations on responsible AI and offer practical guidance for organizations seeking to design systems that are transparent, accountable, and aligned with the communities they affect.

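2606.18285 2026-06-18 cs.SI cs.CY New submission 70%

RELIANCE: Curating and Evaluating Reproductive Health Information on Social Media

Vaibhav Balloli, Laura Peyton Ellis, Vishala Mishra, Alice Chi, Alex Peahl, Elizabeth Bondi-Kelly

Topic match (safety evaluation): evaluates LLM capability and safety in fact-checking reproductive health information

AI summary: Builds RELIANCE, an expert-annotated dataset of pregnancy and postpartum health information on TikTok, and evaluates LLM fact-checking ability. Nearly 60% of the information is accurate, but there is a 15% gap between evaluating whole content and evaluating specific claims.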

Comments Accepted at Datasets and Benchmarks Track, ACM Knowledge Discovery and Data Mining (KDD) 2026. Project page: https://realize-lab.github.io/RELIANCE/

Abstract

Social media platforms like TikTok have become a key source of health information, with studies reporting inaccuracies in posts. As Large Language Model (LLM) providers increasingly integrate LLMs into digital platforms to fact-check content (e.g., Grok on X and Perplexity on WhatsApp), and as people increasingly use them to fact-check information, deploying these systems in critical areas such as reproductive health without rigorous evaluation can cause serious harm. We introduce RELIANCE, an expert-annotated dataset of health information on TikTok surrounding pregnancy and postpartum queries, serving as both an analysis of the reproductive health information landscape and an evaluation of LLMs' capabilities in fact-checking this content. Our dataset comprises 409 annotated sentences from 336 videos across 56 clinician-reviewed queries, annotated by three expert clinicians in Obstetrics, Gynecology, and Internal Medicine. Our findings reveal that nearly 60% of the health information in the videos we sampled is accurate. Furthermore, LLM evaluations reveal a gap between evaluating specific claims and evaluating the entire content (15%). We believe that our methodology, dataset, and tool will support the machine learning community in improving LLMs for important domains with real-world data, extending to other platforms and languages, and helping the health community further understand the information landscape on social media. Our dataset and code are made available at https://realize-lab.github.io/RELIANCE/.

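2606.18261 2026-06-18 cs.HC cs.CY New submission 70%

"Are you an AI?" Analyzing Client Suspicion of AI Use in Crisis Counseling

Shreya Shah, Akshay Swaminathan, Meghana Simhadri, Ivan Lopez, Sharang Phadke, Divyanjali Verma, Abhay John, Luke Zhao, Fiona Cai, Sharon Zhang, Gloria Ye, Ivy Pham, William Wang, Sebastian Garcia, Sarah Wornow, Angelina Wang, Nigam H. Shah

Topic match (safety evaluation): analyzes client suspicion of AI use in crisis counseling, touching on trust and safety

AI summary: Analyzing 75,777 crisis counseling conversations, the study finds that the share of conversations in which clients suspected AI use rose from 0.8% to 2.6%, that most suspicion arose in the first half of a conversation, and that 17.6% of clients kept pressing or ended the conversation even after counselors assured them no AI was involved.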

Abstract

As artificial intelligence (AI) tools get increasingly deployed for mental healthcare, public trust in these systems remains uncertain. It is unclear how clients perceive AI involvement in counseling interactions, particularly in moments of crisis that require empathy and connection. To address this gap, we analyzed 75,777 crisis counseling conversations from a human-staffed WhatsApp helpline in India to characterize how often clients suspected they were speaking to AI, what triggered those doubts, and how counselors responded. Though no conversations actually involved AI assistance, the proportion of conversations where clients suspected AI use increased from 0.8% in June 2024 to 2.6% in March 2025. Within suspicious conversations, 21.5% of clients stated an explicit preference for humans. Client suspicion primarily arose in the first half of messages (68.3%), and when counselors offered reassurance (e.g. 'I assure you; this is not ai!'), clients continued to press or ended the conversation 17.6% of the time. As AI tools get increasingly integrated into counselor workflows, understanding these dynamics is essential for designing AI systems that preserve the therapeutic relationship between counselors and clients.

2606.18142 2026-06-18 cs.AI cs.CL cs.CY New submission 70%

Your AI Travel Agent Would Book You a Bullfight: An Agentic Benchmark for Implicit Animal Welfare in Frontier AI Models

Jasmine Brazilek, Joel Christoph, Miles Tidmarsh, Carol Kline, Oliver Tullio, Arturs Kanepajs

Affiliations: Compassion Aligned Machine Learning; Sentient Futures; Harvard Kennedy School; Appalachian State University Department of Management

Topic match (safety evaluation): tests whether models avoid animal exploitation

AI summary: Introduces TAC, the first agentic benchmark testing whether AI agents avoid options involving animal exploitation when performing actions such as travel booking for users. Across seven frontier models, every model scores below the 64% chance level, and the best model reaches only 53%.

Abstract

AI agents are moving from advisors to actors, booking travel, planning menus, and running procurement on behalf of users. Existing benchmarks for AI and animal welfare evaluate model text responses to question-answer prompts, leaving open whether the welfare reasoning surfaced in those responses transfers to agentic deployment where the model must take actions with tools. We introduce TAC (Travel Agent Compassion), the first agentic benchmark measuring whether AI agents avoid options involving animal exploitation when acting on behalf of users. TAC presents an AI agent with twelve hand-authored travel booking scenarios across six categories of animal exploitation, augmented to forty-eight samples to control for price, rating, and position confounds. We evaluate seven frontier models from four labs. Every model scores below the chance level of sixty-four percent, with the best performer (Claude Opus 4.7) at fifty-three percent. A single welfare-aware sentence in the system prompt yields gains of forty-seven to sixty-three percentage points in Claude and GPT-5.5, twenty-six points in GPT-5.2, and under twelve points in DeepSeek and Gemini. An auxiliary Inspect Scout audit of 288 base-condition transcripts from the top two performers, using Gemini 2.5 Flash Lite as judge, flags zero transcripts for evaluation awareness, suggesting the below-chance rates do not stem from the models recognising the evaluation. We discuss implications for category-level variation across cultural domains, the limits of text-response welfare benchmarks, and the EU General-Purpose AI Code of Practice systemic risk framework.

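2606.19220 2026-06-18 cs.LG cs.AI New submission 65%

Machine Unlearning for the XGBoost Model with Network Intrusion Datasets

Diana Magalhães, Eva Maia, João Vitorino, Isabel Praça

Affiliations: GECAD, ISEP, Polytechnic of Porto

Topic match (safety evaluation): unlearning for XGBoost models; safety-related but not LLM-specific

AI summary: Proposes XGBoost-Forget, an unlearning method for XGBoost models that achieves efficient unlearning on tabular network intrusion datasets, substantially speeding up unlearning while preserving model performance.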

Comments 12 pages, 7 tables, WorldCist'26 Conference

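2606.19129 2026-06-18 cs.CR cs.LG New submission 60%

Giskard: Byzantine Robust and Confidential Aggregation for Large-Scale Decentralized Learning

Ousmane Touat, César Sabater, Mohamed Maouche, Sonia Ben Mokhtar

Affiliations: INSA Lyon, LIRIS, CNRS; INRIA, INSA Lyon

Topic match (safety evaluation): Byzantine-robust aggregation in decentralized learning, a security topic

AI summary: To provide both confidentiality and resistance to Byzantine behavior in decentralized learning, proposes the Giskard protocol, which computes an approximate median aggregate using a tree of committees and BGW-style MPC, reducing communication complexity with up to one million participants while preserving model utility.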

Comments 17 pages, with appendix

Abstract

Dealing simultaneously with confidentiality and Byzantine behaviors in decentralized learning is a challenging problem. Indeed, in decentralized learning, clients train a machine learning model while keeping their data local and share their model parameters or gradients with a set of neighbors. While enforcing confidentiality calls for hiding the exchanged model parameters/gradients (e.g., by using cryptographic techniques), dealing with Byzantine contributions often requires inspecting the latter. Hence, most research works address these objectives separately. A recent line of work proposes to employ secure multi-party computation (MPC) to implement robust aggregators against model poisoning, thereby enforcing both confidentiality and Byzantine resilience. However, these solutions scale poorly: they either require all-to-all communication between participants or delegate the entire computation to a small subset, whose computational and communication load grows proportionally with the size of the network. In this paper, we present Giskard, a protocol for confidential and Byzantine-robust decentralized aggregation. Giskard organizes $n$ parties into a tree of committees of size $O(\log n)$ and evaluates a coordinate-wise approximate median via a committee-adapted distributed binary search over the value domain, using BGW-style MPC within each committee. We assess Giskard both theoretically, by proving its security and confidentiality properties, and experimentally, through extensive experiments involving up to one million participants. Compared to its closest competitors, Giskard reduces per-party communication complexity asymptotically while exhibiting comparable model utility under up to $n/4$ Byzantine parties.
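The core numerical idea, finding a coordinate-wise approximate median by binary search over the value domain, can be sketched in plaintext. This sketch deliberately omits everything that makes the protocol confidential and distributed: there are no committees and no MPC, and the count that drives each search step is computed in the clear. The function names, the value range, and the tolerance are assumptions for illustration.

```python
def approx_median(values, lo, hi, tol=1e-3):
    """Binary search over [lo, hi] for the smallest threshold (within `tol`)
    such that at least half of the values are <= that threshold."""
    half = (len(values) + 1) // 2
    while hi - lo > tol:
        mid = (lo + hi) / 2
        # Only this count is needed per step; a confidential protocol would
        # compute it without revealing individual values (omitted here).
        count = sum(1 for v in values if v <= mid)
        if count >= half:
            hi = mid
        else:
            lo = mid
    return hi


def coordinate_wise_median(updates, lo, hi, tol=1e-3):
    """Apply approx_median independently to each coordinate of the updates."""
    return [approx_median(column, lo, hi, tol) for column in zip(*updates)]


# Hypothetical model updates from four parties; the last one is Byzantine.
updates = [
    [0.10, -0.20],
    [0.20, 0.00],
    [0.15, 0.10],
    [100.0, -100.0],
]
aggregate = coordinate_wise_median(updates, lo=-128.0, hi=128.0)
```

Each search step needs only a count of values below a threshold, which is why the median (unlike a mean) is barely moved by the single extreme update in this example.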


2606.18922 2026-06-18 cs.CL cs.AI New submission 60%

As Easy as Rocket Science: Assessing the Ability of Large Language Models to Interpret Negation in Figurative Language

As Simple as Rocket Science: A Study Evaluating the Ability of Large Language Models to Interpret Negation in Figurative Language

Jasmine Owers, Edwin Simpson, Martha Lewis

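Affiliations: Intelligent Systems Lab, University of Bristol; ILLC, University of Amsterdam

Topic match (safety evaluation): understanding negation and figurative language falls under language-capability evaluation

AI summary: This study develops a new annotated dataset to test how well several large language models understand negation in figurative language. It finds that combining negation with figurative language challenges the models and that performance depends heavily on prompt style.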

Comments 16 pages, 16 figures; for associated code and data see https://github.com/jrdowers/Negation-and-Fig-Lang; To be published in Transactions of the Association for Computational Linguistics

2606.18593 2026-06-18 cs.HC cs.CY New submission 60%

"The New Era of Tech-Enabled Traceability": Tensions between the FDA's Data Governance Vision and the Lived Realities of Food Producers

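"A New Era of Tech-Enabled Traceability": Tensions between the FDA's Data Governance Vision and the Real-World Difficulties of Food Producers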

Soonho Kwon, Catherine Wieczorek, Heidi Biggs, Shellye Suttles, Tammi S. Etheridge, Annabel Rothschild, Shaowen Bardzell

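Topic match (safety evaluation): analyzes tensions between the data governance of the FDA's Food Traceability Rule and food producers

AI summary: Examines how the U.S. FDA's Food Traceability Rule turns agri-food stakeholders into data laborers, analyzing 1,198 public comments to reveal three key tensions across data collection, infrastructure, and cultural practices.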

DOI: 10.1145/3817012

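AI abstract (translated from Chinese)

The U.S. Food and Drug Administration (FDA)'s Food Traceability Rule requires agri-food supply chain stakeholders (including farmers, fishers, retail workers, and others) to maintain detailed tracking records beginning in January 2026. Through this Rule, the FDA envisions a "New Era of Tech-Enabled Traceability" in which standardized, harmonized tracking data serve as foundational public health infrastructure, enabling faster identification and removal of potentially contaminated food and ultimately reducing the risk of foodborne illness. Promising as this vision is, we observe that the Rule reconfigures agri-food stakeholders into data laborers by mandating stringent data collection, formatting, and reporting requirements. In this paper, we examine the tensions and burdens that this reconfiguration produces. Using Data Feminism as a lens to attend to how data-driven policy implementation disproportionately burdens smaller, under-resourced stakeholders who lack infrastructural and financial capacity, we analyze 1,198 public comments on the proposed Rule submitted to http://www.regulations.gov. Our qualitative document analysis reveals three key tensions: (1) the individual labor, financial, and educational burdens stakeholders experience as they are reconfigured into data workers; (2) situations where data tracking becomes infeasible because of infrastructural limitations, cultural contexts, and specific production practices; and (3) instances where the flexibility the Rule is meant to provide instead introduces confusion and burden because of its ambiguity.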

English abstract

The U.S. Food and Drug Administration (FDA)'s Food Traceability Rule requires agri-food supply chain stakeholders (stakeholders)--including farmers, fishers, retail workers, and others--to maintain detailed tracking records beginning in January 2026. Through this Rule, the FDA envisions a "New Era of Tech-Enabled Traceability," in which standardized, harmonized tracking data serve as a foundational public health infrastructure, enabling more rapid identification and removal of potentially contaminated food and ultimately reducing the risk of foodborne illness. Despite this promising vision, we observe that the Rule reconfigures agri-food stakeholders into data laborers by mandating stringent data collection, formatting, and reporting requirements. In this paper, we examine the tensions and burdens that arise from such reconfiguration. Leveraging Data Feminism as an orientation to attend to how data-driven policy implementation disproportionately burdens smaller, under-resourced stakeholders who lack the infrastructural and financial capacity to comply, we analyze 1,198 public comments submitted to Regulations.gov in response to the proposed Rule. Our qualitative document analysis reveals three key tensions: (1) the individual labor, financial, and educational burdens stakeholders experience as they are reconfigured into data workers; (2) moments where data tracking becomes infeasible due to infrastructural limitations, cultural contexts, and situated production practices; and (3) instances where the Rule's intended flexibility instead introduces confusion and burden due to its ambiguity.
