Large Model Alignment and Safety

2507.04219 2026-06-18 cs.LG cs.AI Version update; Topic 80

Model Collapse Is Not a Bug but a Feature in Machine Unlearning for LLMs

Yan Scholten, Sophie Xhonneux, Leo Schwinn, Stephan Günnemann

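Affiliations: Dept. of Computer Science & Munich Data Science Institute, Technical University of Munich; Mila, Université de Montréal

Topic match (safety evaluation): a machine unlearning method that removes private information; safety-related.

AI summary: Proposes Partial Model Collapse (PMC), which achieves unlearning by deliberately triggering distribution collapse on the target data, without optimizing on the unlearning targets; it removes private information effectively while preserving model utility.

Comments: Accepted at ICLR 2026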

Abstract

Current unlearning methods for LLMs optimize on the private information they seek to remove by incorporating it into their fine-tuning data. We argue this not only risks reinforcing exposure to sensitive data, but also fundamentally contradicts the principle of minimizing its use. As a remedy, we propose a novel unlearning method, Partial Model Collapse (PMC), which does not require unlearning targets in the unlearning objective. Our approach is inspired by recent observations that training generative models on their own generations leads to distribution collapse, effectively removing information from model outputs. Our central insight is that model collapse can be leveraged for machine unlearning by deliberately triggering it for data we aim to remove. We show theoretically that our approach converges to the desired outcome, i.e., the model unlearns the data targeted for removal. We empirically demonstrate that PMC overcomes four key limitations of existing unlearning methods that explicitly optimize on unlearning targets, and more effectively removes private information from model outputs while preserving general model utility. Overall, our contributions represent an important step toward more comprehensive unlearning that better aligns with real-world privacy constraints. Code available at https://www.cs.cit.tum.de/daml/partial-model-collapse/.
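
As a rough illustration of the collapse phenomenon this abstract builds on (not the authors' PMC algorithm), the toy sketch below repeatedly refits a categorical distribution on samples drawn from itself. The names `self_train` and `support_size`, the sample size, and the number of rounds are assumptions made purely for illustration.

```python
import random
from collections import Counter

def self_train(probs, n_samples=20, rounds=50, seed=0):
    """Repeatedly refit a categorical distribution on its own samples.

    Toy stand-in for "training a generative model on its own generations";
    an illustrative assumption, not the PMC method from the paper.
    Returns the distribution after each round (index 0 is the initial one).
    """
    rng = random.Random(seed)
    history = [list(probs)]
    for _ in range(rounds):
        current = history[-1]
        # Sample from the current "model".
        draws = rng.choices(range(len(current)), weights=current, k=n_samples)
        counts = Counter(draws)
        # Maximum-likelihood refit on the model's own samples.
        history.append([counts.get(i, 0) / n_samples for i in range(len(current))])
    return history

def support_size(probs):
    """Number of outcomes that still have non-zero probability."""
    return sum(p > 0 for p in probs)

history = self_train([0.4, 0.3, 0.2, 0.1])
sizes = [support_size(p) for p in history]
print(sizes[0], sizes[-1])
```

Once an outcome receives zero samples in some round it can never reappear, so the support can only shrink. According to the abstract, PMC triggers this kind of collapse deliberately, and only for the data targeted for removal.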

2606.18322 2026-06-18 cs.LG cs.AI New submission; Topic 75

SAE Interventions are Unreliable: Post-Intervention Recovery of Suppressed Behavior

Mingyue Cui, Linghui Shen, Xingyi Yang

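Affiliations: The Hong Kong Polytechnic University

Topic match (safety evaluation): shows that SAE feature interventions are unreliable and exhibit a recoverable failure mode.

AI summary: Finds that sparse autoencoder (SAE) feature interventions can suppress a behavior yet leave a recoverable failure mode: optimizing residual perturbations restores the original behavior, revealing a gap between feature-level control and behavioral completeness.

Comments: Code: https://github.com/Mingyuee88/sae-post-intervention-recovery, Project page: https://mingyuee88.github.io/sae-post-intervention-recovery/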

Abstract

Sparse Autoencoders (SAEs) decompose residual-stream activations into interpretable features. Recent latent-space defenses increasingly rely on these decompositions, assuming that identified "unsafe" SAE features serve as actionable handles for monitoring and intervention. In this paradigm, clamping a specific harmful feature is expected to reliably prevent model misbehavior. However, we show that this success may hide a recoverable failure mode: the clamp may block one visible route to a behavior without eliminating the behavior itself. We formulate this vulnerability as post-intervention recovery, a constrained residual-space optimization problem. Starting from the post-intervention residual state, we optimize residual perturbations to recover the pre-intervention behavior while preserving the post-intervention values of the targeted SAE features. Even under a strong threat model where the intervention remains active throughout optimization and generation, recovery remains possible. To rule out that recovery simply undoes the intervention, we use encoder-orthogonal updates for single-layer interventions and the corresponding feature-map Jacobian in the cross-layer setting. Across TPP, unlearning, IOI, and refusal steering experiments, this stress test reveals recoverable behavior despite successful feature-level intervention. Especially in the safety-critical refusal-steering setting, we achieve a 95.8% recovery rate on valid samples while keeping defended-feature relative drift to 0.131, substantially below suffix-based baselines. A recovery-path attribution analysis further localizes this recovery to the SAE reconstruction residual, the component left unexplained by the SAE. These results expose a gap between feature-level control and behavioral completeness: SAE features can support causal intervention, but controlling them does not guarantee control over the underlying behavior.
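
The idea of an encoder-orthogonal update can be sketched in a few lines. The example assumes a single SAE feature whose pre-activation is linear in the residual state (a dot product with one encoder row); the vectors and the helper `project_orthogonal` are hypothetical and are not taken from the paper's code.

```python
def dot(u, v):
    """Plain dot product of two equal-length lists."""
    return sum(a * b for a, b in zip(u, v))

def project_orthogonal(update, encoder_dir):
    """Remove the component of `update` that lies along `encoder_dir`."""
    scale = dot(update, encoder_dir) / dot(encoder_dir, encoder_dir)
    return [u - scale * w for u, w in zip(update, encoder_dir)]

# Hypothetical 3-d residual state, encoder row, and raw perturbation.
residual = [1.0, 2.0, -1.0]
encoder_dir = [0.5, 0.0, 0.5]
raw_update = [0.3, -0.2, 0.1]

# Project the perturbation so it cannot move the defended feature.
safe_update = project_orthogonal(raw_update, encoder_dir)
new_residual = [r + d for r, d in zip(residual, safe_update)]

before = dot(encoder_dir, residual)      # feature pre-activation before
after = dot(encoder_dir, new_residual)   # feature pre-activation after
print(before, after)
```

Because the projected update is orthogonal to the encoder row, the feature's pre-activation is unchanged while the residual state still moves in the remaining directions. With several defended features, the same projection would be taken against the span of all their encoder rows.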

2504.14798 2026-06-18 cs.LG cs.CV Version update; Topic 75

RUB: Evaluating Residual Knowledge in Unlearned Models

Hao Xuan, Xingyu Li

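Affiliations: Electrical and Computer Engineering, University of Alberta

Topic match (safety evaluation): evaluates residual knowledge in unlearned models under adversarial attack.

AI summary: Proposes the principle of Robust Unlearning and a unified benchmark, RUB, which uses the Unlearning Mapping Attack (UMA) to detect residual information and exposes the vulnerability of existing methods under adversarial evaluation.

Journal ref: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, June 2026, pages 8550-8559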

Abstract

Machine Unlearning (MUL) has emerged as a key mechanism for privacy protection and content regulation, yet current techniques often fail to guarantee the complete removal of sensitive information. While most existing works focus on verifying the execution of unlearning, they overlook the critical question of whether models remain robust against adversarial attempts to recover forgotten knowledge. In this work, we advocate for the principle of Robust Unlearning, which requires models to be both indistinguishable from retrained counterparts and resilient against diverse adversarial threats. To instantiate this principle, we propose a unified benchmark, RUB (Robust Unlearning Benchmark), that systematically evaluates the robustness of unlearning algorithms across classification, image-to-image reconstruction, and text-to-image synthesis. Within this framework, we introduce the Unlearning Mapping Attack (UMA) as a generalizable method to detect residual information, and demonstrate how existing attack strategies can be adapted into this framework as long as they conform to the generic UMA framework. Our experiments across discriminative and generative tasks reveal that state-of-the-art unlearning methods remain vulnerable under these evaluations, even when passing standard verification metrics. By positioning robustness as the central criterion and providing a benchmark for adversarial evaluation, we hope RUB paves the way toward more reliable and secure unlearning practices. The codebase and model checkpoints in RUB will be published.

2606.18946 2026-06-18 cs.CL New submission; Topic 70

SenFlow: Inter-Sentence Flow Modeling for AI-Generated Text Detection in Hybrid Documents

Jingkun Luo, Yifan Sun, Da-Tian Peng, Guanxiong Pei

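Affiliations: Northwestern Polytechnical University; Zhejiang Lab

Topic match (safety evaluation): AI-generated text detection falls within safety evaluation.

AI summary: For sentence-level AI-generated text detection in human-AI hybrid documents, proposes SenFlow, which models inter-sentence dependencies through graph propagation and CRF decoding and improves cross-domain Macro-F1 on the MOSAIC benchmark by 4.15 percentage points.

Comments: 16 pages, 4 figures, 9 tables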

Abstract

Sentence-level AI-generated text detection (S-AGTD) for hybrid documents, where humans and LLMs co-author one text, faces two gaps: existing methods classify each sentence in isolation, discarding inter-sentence dependencies, and existing benchmarks omit the newest generation of generators. We construct MOSAIC, a benchmark of 16,000 hybrid documents over PubMed and XSum, generated by DeepSeek-V3.2 and Kimi K2 under stringent quality controls including a perplexity-consistency filter absent from prior benchmarks. We recast S-AGTD as structured prediction over the document sentence sequence and instantiate it as SenFlow, integrating graph-based inter-sentence propagation with linear-chain CRF decoding in a single document-level pass over a sentence graph. SenFlow reaches state-of-the-art performance on MOSAIC, with a +4.15 pp average Macro-F1 margin on cross-domain transfer, the hardest of three protocols of increasing difficulty. We further find that even after the perplexity filter equalizes overt cues, AI insertions retain a generator-dependent sentence-length gap that sentence-level detectors still exploit. Code and data: https://github.com/luojingkun22/SenFlow
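
To make the contrast between per-sentence classification and linear-chain CRF decoding concrete, here is a minimal Viterbi sketch. It covers only the decoding step; the emission and transition scores are made-up numbers, and SenFlow's graph propagation and learned potentials are not modeled.

```python
def viterbi(emissions, transitions):
    """Return the highest-scoring label sequence for a linear-chain model.

    emissions[t][y]   : score of label y for sentence t
    transitions[a][b] : score of moving from label a to label b
    """
    n_labels = len(transitions)
    scores = list(emissions[0])
    backptrs = []
    for emit in emissions[1:]:
        new_scores, ptrs = [], []
        for y in range(n_labels):
            # Best previous label for arriving at label y.
            best_prev = max(range(n_labels),
                            key=lambda a: scores[a] + transitions[a][y])
            ptrs.append(best_prev)
            new_scores.append(scores[best_prev] + transitions[best_prev][y] + emit[y])
        scores = new_scores
        backptrs.append(ptrs)
    # Trace back from the best final label.
    label = max(range(n_labels), key=lambda y: scores[y])
    path = [label]
    for ptrs in reversed(backptrs):
        label = ptrs[label]
        path.append(label)
    return path[::-1]

# Hypothetical scores: label 0 = human-written, label 1 = AI-generated.
emissions = [[2.0, 0.0], [0.4, 0.6], [2.0, 0.0]]
transitions = [[1.0, -1.0], [-1.0, 1.0]]

# Classifying each sentence in isolation versus decoding jointly.
independent = [max(range(2), key=lambda y: e[y]) for e in emissions]
joint = viterbi(emissions, transitions)
print(independent, joint)
```

With these numbers, the isolated classifier labels the weakly scored middle sentence as AI (`[0, 1, 0]`), while joint decoding, which rewards staying in the same label, keeps the whole sequence human (`[0, 0, 0]`). This is the kind of inter-sentence dependency the abstract says isolated classification discards.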

2606.18924 2026-06-18 cs.SD New submission; Topic 70

Who Wins the Conflict? Mechanistic Interpretability of Text Bias in Audio LLMs

Hyebin Cho, Suho Yoo, Jaehyuk Jang, Changick Kim, Joon Son Chung

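Affiliations: School of Electrical Engineering, KAIST

Topic match (safety evaluation): studies text-dominance bias and the mitigation of hallucinations.

AI summary: Through mechanistic analysis, reveals text-dominance bias in audio LLMs, finds that the text pathway actively suppresses intact audio representations, and proposes using back-patching, a training-free intervention, to strengthen audio representations and reduce text dominance.

Comments: Preprint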

Abstract

While Audio Large Language Models (Audio LLMs) excel at multimodal understanding, they suffer from text dominance, a bias where models blindly favor text over acoustic evidence, causing hallucinations. However, the internal mechanisms underlying how these models behave when audio and textual inputs contradict each other remain unexplored. In this work, we present the first mechanistic analysis of this phenomenon by tracing the propagation of internal representations across layers. Our investigation reveals three key findings: (i) text dominance is systematically and empirically present across models; (ii) while text and audio rely on functionally distinct pathways, they ultimately converge into a shared semantic space in late layers; and (iii) the text pathway does not erase audio information, but rather actively suppresses intact audio representations. Building on these insights, we leverage back-patching, a training-free intervention that routes late-layer audio activations back into earlier layers. This amplifies the audio representations, enabling them to overcome textual suppression. Our evaluation shows that back-patching consistently reduces text dominance, paving the way for mechanistic multimodal alignment under conflict.
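
The mechanics of back-patching can be shown on a toy stack of layers. This is a deliberately simplified sketch: the scalar "layers" and the `forward` helper are hypothetical, and a real Audio LLM would patch hidden states at specific audio token positions rather than a single number.

```python
def forward(x, layers, patch=None):
    """Run `layers` in order, optionally overwriting the input of one layer.

    patch = (layer_index, value) replaces the hidden state entering that layer.
    Returns all hidden states: index i is the state entering layer i, and the
    last element is the final output.
    """
    states = []
    h = x
    for i, layer in enumerate(layers):
        if patch is not None and patch[0] == i:
            h = patch[1]  # inject the cached activation here
        states.append(h)
        h = layer(h)
    states.append(h)
    return states

# Hypothetical four-layer "model" acting on a scalar hidden state.
layers = [lambda h: h + 1, lambda h: h * 2, lambda h: h - 3, lambda h: h * 10]

# Clean run: cache every hidden state.
clean = forward(1, layers)

# Back-patch: feed the state that entered late layer 3 into early layer 1.
patched = forward(1, layers, patch=(1, clean[3]))
print(clean, patched)
```

The second run reuses a late-layer state at an earlier depth and then continues the normal forward pass, so the late representation is processed again by the intermediate layers. No parameters are updated, which matches the abstract's description of the intervention as training-free.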

2606.18632 2026-06-18 cs.RO New submission; Topic 70

ROBOSHACKLES: A Safety Dataset for Human-Injury Prevention in Embodied Foundation Models

Zhuowen Yin, Chongyang Liu, Wenzhang Yang, Renjue Li, Yinxing Xue

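Affiliations: Institute of AI for Industries, Chinese Academy of Sciences; University of Science and Technology of China

Topic match (safety evaluation): evaluates unsafe actions produced by models in safety-critical scenarios.

AI summary: To address the difficulty of safely collecting data on robots harming humans, proposes a safety data construction pipeline grounded in real observations and builds ROBOSHACKLES, a dataset of 10,000 videos covering direct-harm and indirect-harm categories; evaluation finds that existing models produce unsafe actions in 100% of the safety-critical scenarios.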

Abstract

Embodied Foundation Models (EFMs) integrate multimodal understanding, future-state reasoning, and executable robot actions. Yet their safety alignment for human-injury prevention remains underexplored, primarily because real-world data of robots harming humans or creating hazardous household situations cannot be safely or ethically collected. To address this challenge, we propose a safety-critical data construction pipeline for human-injury prevention in EFMs. Starting from real DROID observations, our construction pipeline proceeds through scene understanding, hazard-aware image editing, temporal prompt generation, and single-pass rollout synthesis. The temporal prompts specify the expected scene evolution, while Wan2.7 synthesizes realistic robotic rollouts from the edited hazardous states in a single pass. Using this pipeline, we construct ROBOSHACKLES, a 10,000-clip robotic video dataset derived from real DROID observations, spanning two direct-harm and four indirect-harm categories. To ensure dataset quality, we assess task completion and visual quality with automatic metrics, and evaluate six representative EFMs under a refusal-based safety criterion. Results show that all evaluated models produce unsafe actions in the tested safety-critical scenarios, yielding a 100% unsafe action generation rate. ROBOSHACKLES serves as a scalable benchmark and training resource for refusal learning and hazard anticipation before robot action execution. The dataset is publicly available at https://huggingface.co/datasets/YZW00/RoboShackles.

2606.18310 2026-06-18 cs.CR cs.AI New submission; Topic 70

Conflict-Aware Retriever Editing for Knowledge Injection Attacks on LLM-Based RAG Systems

Xinru Liu, Xianglong Zhang, Di Cai, Zhumin Chen, Pengfei Hu, Xin Xin

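Affiliations: Shandong University, China; Tsinghua University, China

Topic match (safety evaluation): knowledge injection attacks on RAG systems.

AI summary: Proposes CAREATTACK, a conflict-aware retriever editing framework that injects malicious knowledge into RAG systems through a model-centric attack, resolves conflicts with graph-based detection and parameter-editing projection, and applies lightweight calibration that preserves attack effectiveness.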

Abstract

Injecting malicious knowledge into retrieval-augmented generation (RAG) systems can manipulate retrieved evidence and mislead downstream generation, posing a serious security threat for AI applications. Existing RAG injection attacks mainly rely on manipulating external knowledge bases, such as crafting malicious corpora. However, the synthetic text crafted by such data-centric methods could be detectable, leading to the failure of attacks. Beyond corpus manipulation, open-source retrievers are increasingly exposing RAG systems to model-centric attacks. In this paper, we propose conflict-aware retriever editing, i.e., CAREATTACK, a model-centric retriever attack framework for malicious knowledge injection in RAG. Specifically, CAREATTACK consists of two stages: conflict-aware retriever editing and attack-preserving anchor repair. Conflict-aware retriever editing adapts efficient closed-form parameter editing to the dense retrieval model, promoting malicious knowledge above benign competing passages and resolving potential parameter conflicts through graph-based conflict detection and parameter editing projection. Then, attack-preserving anchor repair performs lightweight calibration on the edited retriever to further eliminate the impact on non-target prompts while preserving the attack effectiveness for target prompts. We instantiate CAREATTACK on Qwen3-Embedding-0.6B and BGE-M3, and conduct evaluation on three benchmark datasets. Experimental results demonstrate that our method substantially promotes malicious passages into the retrieved knowledge of RAG systems and can perform attacks for batches of target prompts and passages, given access to the retrieval model parameters. Since most RAG systems are built upon open-source retrieval models, this work reveals a practical attack surface in RAG systems. Code is publicly accessible at https://anonymous.4open.science/r/CareAttack-3F1C.

2606.18289 2026-06-18 cs.HC cs.CY New submission; Topic 70

Beyond the Algorithm: Professional Experiences and Perceptions of AI Bias

Micarah Malone-Gawu

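Topic match (safety evaluation): studies the perception and mitigation of AI bias, touching on algorithmic fairness and safety.

AI summary: Through a qualitative multi-case study, examines how AI practitioners perceive and mitigate algorithmic bias; finds that bias stems from historical inequities, exclusionary design, and organizational pressures, and stresses that fairness requires structural accountability, diverse participation, and cognitive awareness.

Comments: PhD thesis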

Abstract

The purpose of this qualitative multi-case study was to examine how social bias emerges, is perceived, and can be mitigated within artificial intelligence and machine learning systems by practitioners directly involved in their design, development, and governance. Although examples from healthcare, criminal justice, employment, and education were used to illustrate domains where automated systems shape everyday life, the study focused on the lived experiences and professional insights of AI practitioners rather than sector-specific populations. Guided by Intersectionality Theory and Cognitive Science, the study employed an interpretivist approach, utilizing semi-structured interviews with nine practitioners, supplemented by document analysis and triangulated case material to enrich contextual understanding. Findings showed that algorithmic bias arises from historical inequities, exclusionary design assumptions, and organizational pressures that prioritize speed and efficiency over ethical reflection. Participants emphasized that technical corrections alone cannot ensure fairness; instead, equitable AI requires structural accountability, diverse participation, and sustained cognitive awareness during the development lifecycle. Many described limited enforcement of ethical standards and organizational cultures that inconsistently support responsible practice. The study concludes that human-centered and socially grounded AI development depends on embedding ethics early in the design process, strengthening governance frameworks, and cultivating institutional environments that encourage reflective decision-making. These insights contribute to ongoing conversations on responsible AI and offer practical guidance for organizations seeking to design systems that are transparent, accountable, and aligned with the communities they affect.

2606.18285 2026-06-18 cs.SI cs.CY New submission; Topic 70

RELIANCE: Curating and Evaluating Reproductive Health Information on Social Media

Vaibhav Balloli, Laura Peyton Ellis, Vishala Mishra, Alice Chi, Alex Peahl, Elizabeth Bondi-Kelly

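Topic match (safety evaluation): evaluates the capability and safety of LLMs in fact-checking reproductive health information.

AI summary: Builds RELIANCE, an expert-annotated dataset of pregnancy and postpartum health information on TikTok, to evaluate LLM fact-checking; finds that nearly 60% of the information is accurate, with a 15% gap between evaluating whole content and evaluating specific claims.

Comments: Accepted at Datasets and Benchmarks Track, ACM Knowledge Discovery and Data Mining (KDD) 2026. Project page: https://realize-lab.github.io/RELIANCE/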

Abstract

Social media platforms like TikTok have become a key source of health information, with studies reporting inaccuracies in posts. As Large Language Model (LLM) providers increasingly integrate LLMs into digital platforms to fact-check content (e.g., Grok and Perplexity on X and WhatsApp, respectively) and as people use them to fact-check information, deploying these systems in critical areas such as reproductive health without rigorous evaluation can cause serious harm. We introduce RELIANCE, an expert-annotated dataset of health information on TikTok surrounding pregnancy and postpartum queries, serving as both an analysis of the reproductive health information landscape and an evaluation of LLMs' capabilities in fact-checking this content. Our dataset comprises 409 annotated sentences from 336 videos across 56 clinician-reviewed queries, annotated by three expert clinicians in Obstetrics, Gynecology, and Internal Medicine. Our findings reveal that nearly 60% of the health information in the videos we sampled is accurate. Furthermore, LLM evaluations reveal a gap between evaluating specific claims and evaluating the entire content (15%). We believe that our methodology, dataset, and tool will support the machine learning community in improving LLMs for important domains with real-world data, extending to other platforms and languages, and helping the health community further understand the information landscape on social media. Our dataset and code are made available at https://realize-lab.github.io/RELIANCE/.

2606.18261 2026-06-18 cs.HC cs.CY New submission; Topic 70

"Are you an AI?" Analyzing Client Suspicion of AI Use in Crisis Counseling

Shreya Shah, Akshay Swaminathan, Meghana Simhadri, Ivan Lopez, Sharang Phadke, Divyanjali Verma, Abhay John, Luke Zhao, Fiona Cai, Sharon Zhang, Gloria Ye, Ivy Pham, William Wang, Sebastian Garcia, Sarah Wornow, Angelina Wang, Nigam H. Shah

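Topic match (safety evaluation): analyzes client suspicion of AI use in crisis counseling, touching on trust and safety.

AI summary: Analyzing 75,777 crisis counseling conversations, finds that the share of conversations in which clients suspected AI use rose from 0.8% to 2.6%, that most suspicion arose in the first half of the conversation, and that after counselors gave reassurance that they were not AI, clients still continued to press or ended the conversation 17.6% of the time.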

Abstract

As artificial intelligence (AI) tools get increasingly deployed for mental healthcare, public trust in these systems remains uncertain. It is unclear how clients perceive AI involvement in counseling interactions, particularly in moments of crisis that require empathy and connection. To address this gap, we analyzed 75,777 crisis counseling conversations from a human-staffed WhatsApp helpline in India to characterize how often clients suspected they were speaking to AI, what triggered those doubts, and how counselors responded. Though no conversations actually involved AI assistance, the proportion of conversations where clients suspected AI use increased from 0.8% in June 2024 to 2.6% in March 2025. Within suspicious conversations, 21.5% of clients stated an explicit preference for humans. Client suspicion primarily arose in the first half of messages (68.3%), and when counselors offered reassurance (e.g. 'I assure you; this is not ai!'), clients continued to press or ended the conversation 17.6% of the time. As AI tools get increasingly integrated into counselor workflows, understanding these dynamics is essential for designing AI systems that preserve the therapeutic relationship between counselors and clients.

2606.18142 2026-06-18 cs.AI cs.CL cs.CY New submission; Topic 70

Your AI Travel Agent Would Book You a Bullfight: An Agentic Benchmark for Implicit Animal Welfare in Frontier AI Models

Jasmine Brazilek, Joel Christoph, Miles Tidmarsh, Carol Kline, Oliver Tullio, Arturs Kanepajs

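Affiliations: Compassion Aligned Machine Learning; Sentient Futures; Harvard Kennedy School; Appalachian State University Department of Management

Topic match (safety evaluation): tests whether models avoid animal exploitation in their actions.

AI summary: Proposes TAC, the first agentic benchmark testing whether AI agents avoid options involving animal exploitation when performing actions such as travel booking for users. Evaluates seven frontier models; all score below the 64% chance level, and the best model reaches only 53%.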

Abstract

AI agents are moving from advisors to actors, booking travel, planning menus, and running procurement on behalf of users. Existing benchmarks for AI and animal welfare evaluate model text responses to question-answer prompts, leaving open whether the welfare reasoning surfaced in those responses transfers to agentic deployment where the model must take actions with tools. We introduce TAC (Travel Agent Compassion), the first agentic benchmark measuring whether AI agents avoid options involving animal exploitation when acting on behalf of users. TAC presents an AI agent with twelve hand-authored travel booking scenarios across six categories of animal exploitation, augmented to forty-eight samples to control for price, rating, and position confounds. We evaluate seven frontier models from four labs. Every model scores below the chance level of sixty-four percent, with the best performer (Claude Opus 4.7) at fifty-three percent. A single welfare-aware sentence in the system prompt yields gains of forty-seven to sixty-three percentage points in Claude and GPT-5.5, twenty-six points in GPT-5.2, and under twelve points in DeepSeek and Gemini. An auxiliary Inspect Scout audit of 288 base-condition transcripts from the top two performers, using Gemini 2.5 Flash Lite as judge, flags zero transcripts for evaluation awareness, suggesting the below-chance rates do not stem from the models recognising the evaluation. We discuss implications for category-level variation across cultural domains, the limits of text-response welfare benchmarks, and the EU General-Purpose AI Code of Practice systemic risk framework.

2604.13899 2026-06-18 cs.CL cs.AI Version update; Topic 70

Do We Still Need Humans in the Loop? Comparing Human and LLM Annotation in Active Learning for Hostility Detection

Ahmad Dawar Hakimi, Lea Hirlimann, Isabelle Augenstein, Hinrich Schütze

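Topic match (safety evaluation): compares LLM and human annotation for hostility detection.

AI summary: Compares LLM and human annotation in active learning; finds that LLM annotation costs less and performs better, but that active learning offers no advantage under LLM annotation.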

Abstract

Instruction-tuned LLMs can annotate thousands of instances at low cost. This raises two questions for active learning (AL): can LLM labels replace human labels within the AL loop, and does AL remain necessary when entire corpora can be cheaply labeled? We investigate both on a new dataset of 277,902 German political TikTok comments (25,974 LLM-labeled, 5,000 human-annotated), comparing LLM and human annotation across seven conditions, four encoders, and 10 random seeds. Under a two-question interface that mirrors the human annotation task, LLM annotation at scale outperforms human-supervised classifiers at roughly one-tenth the cost ($28 for GPT-5.2 Batch API vs. $316 for Prolific). The advantage holds for both a closed-source (GPT-5.2) and an open-weight (Qwen3.5-122B-10B) LLM, is robust under soft-label evaluation, and is unlocked specifically by the two-question decomposition; a holistic single-prompt baseline only ties with human supervision. AL provides no reliable advantage over random sampling under either LLM annotator. However, error structure varies sharply: only GPT-5.2 under the two-question interface produces classifiers with near-human FP/FN balance, while other LLM variants over-flag border-control and economic competition discourse. We release the dataset and code.

2606.19220 2026-06-18 cs.LG cs.AI New submission; Topic 65

Machine Unlearning for the XGBoost Model with Network Intrusion Datasets

Diana Magalhães, Eva Maia, João Vitorino, Isabel Praça

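Affiliations: GECAD, ISEP, Polytechnic of Porto

Topic match (safety evaluation): unlearning for XGBoost models; safety-related but not LLM-specific.

AI summary: Proposes XGBoost-Forget, an unlearning method for XGBoost models that achieves efficient unlearning on tabular network intrusion datasets, preserving model performance while substantially speeding up unlearning.

Comments: 12 pages, 7 tables, WorldCist'26 Conference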

2606.19129 2026-06-18 cs.CR cs.LG New submission; Topic 60

Giskard: Byzantine Robust and Confidential Aggregation for Large-Scale Decentralized Learning

Ousmane Touat, César Sabater, Mohamed Maouche, Sonia Ben Mokhtar

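Affiliations: INSA Lyon, LIRIS, CNRS; INRIA, INSA Lyon

Topic match (safety evaluation): Byzantine-robust aggregation in decentralized learning; security-related.

AI summary: To guarantee confidentiality and resist Byzantine behavior simultaneously in decentralized learning, proposes the Giskard protocol, which computes an approximate median aggregate using a tree of committees and BGW-style MPC, reducing communication complexity with up to a million participants while maintaining model utility.

Comments: 17 pages, with appendix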

Abstract

Dealing simultaneously with confidentiality and Byzantine behaviors in decentralized learning is a challenging problem. Indeed, in decentralized learning, clients train a machine learning model while keeping their data locally and share their model parameters or gradients with a set of neighbors. While enforcing confidentiality calls for hiding the exchanged model parameters/gradients (e.g., by using cryptographic techniques), dealing with Byzantine contributions often requires inspecting the latter. Hence, most research works address these objectives separately. A recent line of work proposes to employ secure multi-party computation (MPC) to implement robust aggregators against model poisoning, thereby enforcing both confidentiality and Byzantine resilience. However, these solutions scale badly: they either require all-to-all communication between participants or delegate the entire computation to a small subset, whose computational and communication load grows proportionally with the size of the network. In this paper, we present Giskard, a protocol for confidential and Byzantine-robust decentralized aggregation. Giskard organizes $n$ parties into a tree of committees of size $O(\log n)$ and evaluates a coordinate-wise approximate median via a committee-adapted distributed binary search over the value domain, using BGW-style MPC within each committee. We assess Giskard both theoretically by proving its security and confidentiality properties and experimentally through extensive experiments involving up to one million participants. Compared to its closest competitors, Giskard reduces per-party communication complexity asymptotically while exhibiting comparable model utility under up to $n/4$ Byzantine parties.
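
The following sketch shows, in the clear and without any cryptography, how a coordinate-wise approximate median can be found by binary search over the value domain using only threshold counts. It is an assumption-laden illustration rather than the Giskard protocol: the committee tree and the BGW-style MPC are omitted, and the function names, value range, and tolerance are hypothetical.

```python
import math

def approx_median(values, lo, hi, tol=1e-6):
    """Approximate the median using only threshold-count queries.

    Each step asks "how many inputs are <= mid?", an aggregate comparison
    of the kind a protocol could evaluate under MPC without revealing
    individual inputs. Here everything runs in plaintext for illustration.
    Assumes lo is below every input and hi is at or above every input.
    """
    k = math.ceil(len(values) / 2)  # rank of the (lower) median
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if sum(v <= mid for v in values) >= k:
            hi = mid  # the median is at or below mid
        else:
            lo = mid  # the median is above mid
    return hi

def coordinate_wise_median(vectors, lo, hi, tol=1e-6):
    """Apply the scalar search independently to every coordinate."""
    return [approx_median(list(col), lo, hi, tol) for col in zip(*vectors)]

# Hypothetical model updates: four honest parties and one extreme outlier.
updates = [[0.9, -1.0], [1.0, -1.1], [1.1, -0.9], [1.0, -1.0], [100.0, 100.0]]
agg = coordinate_wise_median(updates, lo=-128.0, hi=128.0)
print(agg)
```

The outlier shifts a mean-based aggregate substantially but leaves the median essentially untouched, which is why median-style aggregators are a common choice for Byzantine robustness. The number of search steps depends only on the domain width and the tolerance, not on the number of parties.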

2606.18922 2026-06-18 cs.CL cs.AI New submission Topic 60

As Easy as Rocket Science: Assessing the Ability of Large Language Models to Interpret Negation in Figurative Language

Jasmine Owers, Edwin Simpson, Martha Lewis

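Affiliations * Intelligent Systems Lab, University of Bristol; ILLC, University of Amsterdam

Topic match (Safety evaluation): understanding negation and figurative language falls under language-ability evaluation

AI summary This study develops a new annotated dataset to test several large language models' ability to understand negation in figurative language. It finds that combining negation with figurative language challenges the models and that performance depends heavily on prompt style.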

Comments 16 pages, 16 figures; for associated code and data see https://github.com/jrdowers/Negation-and-Fig-Lang; To be published in Transactions of the Association for Computational Linguistics

2606.18593 2026-06-18 cs.HC cs.CY New submission Topic 60

"The New Era of Tech-Enabled Traceability": Tensions between the FDA's Data Governance Vision and the Lived Realities of Food Producers

Soonho Kwon, Catherine Wieczorek, Heidi Biggs, Shellye Suttles, Tammi S. Etheridge, Annabel Rothschild, Shaowen Bardzell

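Topic match (Safety evaluation): analyzes tensions between the data governance of the FDA Food Traceability Rule and food producers

AI summary Examines how the U.S. FDA's Food Traceability Rule turns agri-food stakeholders into data laborers. An analysis of 1,198 public comments reveals three key tensions across data collection, infrastructure, and cultural practices.

Details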

DOI: 10.1145/3817012

Abstract

The U.S. Food and Drug Administration (FDA)'s Food Traceability Rule requires agri-food supply chain stakeholders (stakeholders)--including farmers, fishers, retail workers, and others--to maintain detailed tracking records beginning in January 2026. Through this Rule, the FDA envisions a "New Era of Tech-Enabled Traceability," in which standardized, harmonized tracking data serve as a foundational public health infrastructure, enabling more rapid identification and removal of potentially contaminated food and ultimately reducing the risk of foodborne illness. Despite this promising vision, we observe that the Rule reconfigures agri-food stakeholders into data laborers by mandating stringent data collection, formatting, and reporting requirements. In this paper, we examine the tensions and burdens that arise from such reconfiguration. Leveraging Data Feminism as an orientation to attend to how data-driven policy implementation disproportionately burdens smaller, under-resourced stakeholders who lack the infrastructural and financial capacity to comply, we analyze 1,198 public comments submitted to Regulations.gov in response to the proposed Rule. Our qualitative document analysis reveals three key tensions: (1) the individual labor, financial, and educational burdens stakeholders experience as they are reconfigured into data workers; (2) moments where data tracking becomes infeasible due to infrastructural limitations, cultural contexts, and situated production practices; and (3) instances where the Rule's intended flexibility instead introduces confusion and burden due to its ambiguity.

2606.18557 2026-06-18 cs.AI cs.LG cs.LO New submission Topic 60

DeFAb: A Verifiable Benchmark for Defeasible Abduction in Foundation Models

Patrick Cooper, Alvaro Velasquez

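Affiliations * University of Colorado Boulder

Topic match (Safety evaluation): evaluates the rigor of model reasoning

AI summary Proposes the DeFAb benchmark, which converts knowledge bases into verifiable abduction instances to evaluate foundation models' creativity and theoretical reasoning in defeasible reasoning. Frontier models' accuracy falls far below that of a symbolic solver.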

Comments 33 pages, 14 figures, 23 tables. Dataset: https://huggingface.co/datasets/PatrickAllenCooper/DeFAb ; code and evaluation harness: https://github.com/PatrickAllenCooper/blanc

Details

Abstract

A rule-based logic solver resolves every instance in our benchmark in under 50 microseconds with 100% accuracy; the best frontier language model reaches 65% at best and drops to 23.5% under rendering-robust evaluation (worst case over four surface renderings). We introduce DeFAb (Defeasible Abduction Benchmark), a dataset and generation pipeline that converts four decades of publicly funded knowledge bases into formally grounded instances for defeasible abduction: constructing hypotheses that explain anomalies by overriding defaults while preserving unrelated expectations. Because every hypothesis must pass polynomial-time checks for valid derivation, conservativity, and minimality, DeFAb makes logical rigor the instrument for measuring creativity and theoretical reasoning, scoring the disciplined construction of theory revisions rather than fluent but theory-destroying prose. The pipeline pairs taxonomic hierarchies (OpenCyc, YAGO, Wikidata) with behavioral property graphs (ConceptNet, UMLS) to produce 372,648+ instances across 33.75M materialized rules from 18 sources, in three levels with polynomial-time verifiable gold standards. Four frontier models do not reliably internalize defeasible reasoning: rendering-robust Level 2 accuracy is 7.8-23.5%; chain-of-thought variance (~36 pp) exceeds any inter-model gap; and a matched contamination control isolates a +19.4 pp Level 3 gap. We further release DeFAb-Hard (a 235-instance Level 3 difficulty variant; best model 53.3% vs 100% symbolic) and CONJURE (a kernel-verified transformative-creativity variant of 560 Lean 4/Mathlib instances whose gold answers are definitions the proof kernel did not previously contain, judge-free verifier; a pilot finds zero novel concepts). The same verifier doubles as an exact reward for preference optimization (DPO, RLVR/GRPO). Released under MIT at https://huggingface.co/datasets/PatrickAllenCooper/DeFAb.

2606.19263 2026-06-18 cs.SI cs.CY cs.MA econ.GN q-fin.EC New submission Topic 55

Digital Speech Acts Retain Control of Copyright with People, Not Platforms

James Golike, Ehud Shapiro

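Topic match (Safety evaluation): digital copyright control; not directly about safety, but related

AI summary Introduces the notion of a "digital speech act", in which a person cryptographically signs content with their own private key on their own device, thereby establishing attribution, accountability, and authorship. The paper argues that such acts qualify for protection under U.S. copyright law and keep control of content with the person, laying a foundation for digital sovereignty and democratic self-governance.

Details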

Abstract

Legal precedents protect computer code as copyrightable expression. They have enabled centralized digital platforms, operating from corporate servers that hold all user data, to construct private governance regimes through the interaction of copyright, contract, and technical architecture: people who create virtually all platform value must surrender effective copyright control through Terms of Service agreements as a condition of participation. In contrast, grassroots platforms consist of cryptographically-identified people operating their networked smartphones independently of any server or global resource; each person holds their own data on their own device, with no third party in possession or intermediation. Here, we define the notion of a digital speech act (a deliberate volitional act by a person of cryptographically signing personal content with the person's private key, carried out on the person's own device) through which the person simultaneously establishes attribution, accountability, and authorship over the signed content. We contend that (a) digital speech acts qualify for copyright protection under existing U.S. precedent: Burrow-Giles locates authorship in volitional creative choices despite mechanical or algorithmic processes, Feist supplies the minimal-creativity threshold, and persistent device storage satisfies the Copyright Act's fixation requirement; (b) the digital social contract underlying grassroots platforms preserves this copyright by design (signed content cannot be unbundled from its signature, and the full provenance chain accumulates as content is forwarded), so that ownership and possession coalesce in the person; and (c) copyright in digital speech acts is a prerequisite for digital sovereignty and democratic self-governance.

2606.18327 2026-06-18 cs.LG cs.AI New submission Topic 70

Self-CTRL: Self-Consistency Training with Reinforcement Learning

Itamar Pres, Laura Ruis, Melat Ghebreselassie, Belinda Z. Li, Jacob Andreas

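Affiliations * MIT CSAIL

Topic match (Preference alignment): uses reinforcement learning to optimize consistency between a language model's self-explanations and its behavior.

AI summary Proposes Self-CTRL, which uses reinforcement learning to optimize the consistency between a language model's self-explanations and its behavior, substantially improving consistency and safety on probabilistic reasoning and constitutional AI tasks.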

Comments 34 pages, 12 figures, includes appendices

Details

Abstract

Language models (LMs) that faithfully describe their own behavior can more easily be audited, understood, and trusted by users. This paper describes Self-Consistency Training with Reinforcement Learning (Self-CTRL), a method that optimizes for consistency between an LM's self-explanations and behavior on related inputs by updating explanations to better predict behavior or updating behavior to better match explanations. We apply our method in two domains. First, we study a formal probabilistic reasoning task in which LMs must learn to imitate a family of biased samplers and are evaluated on their ability to report the associated biases. We find that consistency training improves the correlation between self-reported and behaviorally-measured latent biases from $R^2=0.24$ to $R^2=0.64$ on a set of held-out distributions, matching the generalization of direct ground-truth supervision. Second, we study a constitutional AI domain in which LMs must describe when they will refuse or comply with user requests. Here, Self-CTRL produces rules that faithfully describe the model's behavior on held-out requests, improving the refusal predictions of a third-party auditor model from $36\%$ to $92\%$. In the other direction, behavior updates improve alignment, reducing HarmBench failure rate from $15.0\%$ to $0.5\%$ without substantially increasing refusal on harmless prompts. By aligning explanations and behavior, our work provides a general recipe for training AI models to be safer, more transparent, and more controllable.
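To make the reported $R^2$ figures concrete, here is a minimal sketch of how agreement between self-reported and behaviorally measured biases could be scored. It assumes $R^2$ means the squared Pearson correlation, which the abstract does not specify; the function name and the sample numbers are hypothetical and are not data from the paper.

```python
from math import fsum
from typing import Sequence


def r_squared(reported: Sequence[float], measured: Sequence[float]) -> float:
    """Squared Pearson correlation between two equally long sequences."""
    n = len(reported)
    mean_r = fsum(reported) / n
    mean_m = fsum(measured) / n
    cov = fsum((a - mean_r) * (b - mean_m) for a, b in zip(reported, measured))
    var_r = fsum((a - mean_r) ** 2 for a in reported)
    var_m = fsum((b - mean_m) ** 2 for b in measured)
    return cov * cov / (var_r * var_m)


# Hypothetical values: the bias a model states for each sampler versus the
# bias estimated from the frequencies of its actual samples.
self_reported = [0.10, 0.30, 0.55, 0.80]
behavioral = [0.15, 0.25, 0.60, 0.70]
print(r_squared(self_reported, behavioral))
```

A value near 1 would mean the model's stated biases track its measured behavior closely; a value near 0 would mean the self-reports carry little information about behavior.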

2606.19162 2026-06-18 cs.LG cs.CV New submission Topic 60

The Reward Was in Your Data All Along: Correcting Flow Matching with Discriminator-Guided RL

Nicolas Beltran-Velez, Felix Friedrich, Zhang Xiaofeng, Reyhane Askari-Hemmat, Xiaochuang Han, Adriana Romero-Soriano, Michal Drozdzal

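Affiliations * FAIR at Meta; Columbia University; Mila - Québec AI Institute; McGill University; Canada CIFAR AI Chair

Topic match (Preference alignment): uses RL for preference alignment, but mainly targets image generation

AI summary To address visual defects in flow-matching models caused by the mismatch between the loss function and sample quality, proposes Discriminator-Guided RL (DRL), which uses the logit of a discriminator in a pretrained space as the reward, substantially improving guidance-free FID and semantic FD as well as preference alignment.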

Comments 84 pages, including appendices

Details

Abstract

Score- and flow-matching models often rely on preference-based reinforcement learning for two purposes: aligning with subjective preferences and, surprisingly, recovering properties such as visual realism and coherent object structure that matching-based training is intended to learn from the data itself. We argue that this reflects a structural mismatch. Matching losses measure $\ell_2$ regression error on the velocity or score field under training-time marginals, a proxy poorly aligned with the visual and semantic properties that determine sample quality at inference. Given a reward aligned with these properties, RL sidesteps the mismatch by evaluating the model on its own samples and following the reward landscape directly. The challenge is to obtain such a reward without relying on human preferences, which are expensive and conflate data realism with annotator inclinations. We propose Discriminator-Guided RL (DRL). DRL trains a discriminator to separate data from base-model samples in a pretrained representation space and uses its logit as the reward in KL-regularized RL. The pretrained space restricts the discriminator to perceptually meaningful directions, and the logit estimates the log-likelihood ratio between data and model, which is the optimal reward for targeting the data distribution. Across SiT, JiT, REPA, and RAE, DRL reduces guidance-free FID (e.g., $9.38 \to 2.62$ on SiT) and semantic-space FD (e.g., $88.2 \to 19.3$ on DINOv3 for SiT), with consistent gains across all backbones, and improves human-preference rewards without training on them. It also yields a better Pareto frontier between preference reward and image fidelity under subsequent preference-based post-training, increasing alignment while reducing low-level artifacts such as oversaturation and excessive brightness.
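The claim that the discriminator logit is "the optimal reward for targeting the data distribution" follows from two standard results, sketched here under simplifying assumptions that the abstract does not state: a Bayes-optimal discriminator $D^*$ trained on balanced classes, densities compared directly rather than in a pretrained representation space, and a KL coefficient $\beta$ (a symbol introduced here for illustration) set to 1. Here $p_\theta$ denotes the base model and $\pi$ the fine-tuned model.

```latex
\begin{align*}
D^*(x) &= \frac{p_{\text{data}}(x)}{p_{\text{data}}(x) + p_\theta(x)}
  \quad\Longrightarrow\quad
  r(x) := \log\frac{D^*(x)}{1 - D^*(x)} = \log\frac{p_{\text{data}}(x)}{p_\theta(x)}, \\
\pi^* &= \arg\max_{\pi}\; \mathbb{E}_{x\sim\pi}\big[r(x)\big]
  - \beta\,\mathrm{KL}\big(\pi \,\|\, p_\theta\big)
  \quad\Longrightarrow\quad
  \pi^*(x) \propto p_\theta(x)\,\exp\!\big(r(x)/\beta\big), \\
\beta = 1:\quad \pi^*(x) &\propto p_\theta(x)\cdot\frac{p_{\text{data}}(x)}{p_\theta(x)}
  = p_{\text{data}}(x).
\end{align*}
```

Under these assumptions the KL-regularized optimum is the data distribution itself; with an imperfect discriminator or $\beta \neq 1$ the result would only approximate it.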

1. Safety evaluation (18 papers)

Model Collapse Is Not a Bug but a Feature in Machine Unlearning for LLMs

SAE Interventions are Unreliable: Post-Intervention Recovery of Suppressed Behavior

RUB: Evaluating Residual Knowledge in Unlearned Models

SenFlow: Inter-Sentence Flow Modeling for AI-Generated Text Detection in Hybrid Documents

Who Wins the Conflict? Mechanistic Interpretability of Text Bias in Audio LLMs

ROBOSHACKLES: A Safety Dataset for Human-Injury Prevention in Embodied Foundation Models

Conflict-Aware Retriever Editing for Knowledge Injection Attacks on LLM-Based RAG Systems

Beyond the Algorithm: Professional Experiences and Perceptions of AI Bias

RELIANCE: Curating and Evaluating Reproductive Health Information on Social Media

"Are you an AI?" Analyzing Client Suspicion of AI Use in Crisis Counseling

Your AI Travel Agent Would Book You a Bullfight: An Agentic Benchmark for Implicit Animal Welfare in Frontier AI Models

Do We Still Need Humans in the Loop? Comparing Human and LLM Annotation in Active Learning for Hostility Detection

Machine Unlearning for the XGBoost Model with Network Intrusion Datasets

Giskard: Byzantine Robust and Confidential Aggregation for Large-Scale Decentralized Learning

As Easy as Rocket Science: Assessing the Ability of Large Language Models to Interpret Negation in Figurative Language

"The New Era of Tech-Enabled Traceability": Tensions between the FDA's Data Governance Vision and the Lived Realities of Food Producers

DeFAb: A Verifiable Benchmark for Defeasible Abduction in Foundation Models

Digital Speech Acts Retain Control of Copyright with People, Not Platforms

2. Preference alignment (2 papers)

Self-CTRL: Self-Consistency Training with Reinforcement Learning

The Reward Was in Your Data All Along: Correcting Flow Matching with Discriminator-Guided RL