arXivDaily arXiv每日学术速递 周一至周五更新

AI 大模型

大模型对齐与安全

大模型对齐、安全、越狱、红队、提示注入和可信评测。

今日/当前日期收录 37 信号源:cs.CL, cs.AI, cs.CY, cs.LG
2604.13899 2026-06-18 cs.CL cs.AI 版本更新 70%

Do We Still Need Humans in the Loop? Comparing Human and LLM Annotation in Active Learning for Hostility Detection

我们是否仍然需要人在回路中?比较主动学习中用于敌意检测的人类与LLM标注

Ahmad Dawar Hakimi, Lea Hirlimann, Isabelle Augenstein, Hinrich Schütze

专题命中 安全评测 :比较LLM与人类在敌意检测中的标注效果

AI总结 研究比较了LLM与人类在主动学习中的标注效果,发现LLM标注成本更低且性能更优,但主动学习在LLM标注下无优势。

详情
AI中文摘要

指令微调的LLM可以低成本标注数千个实例。这为主动学习(AL)提出了两个问题:LLM标签能否替代AL回路中的人类标签?当整个语料库可以廉价标注时,AL是否仍然必要?我们在一个新的包含277,902条德国政治TikTok评论(25,974条LLM标注,5,000条人工标注)的数据集上进行了研究,比较了LLM和人类标注在七种条件、四种编码器和10个随机种子下的表现。在模仿人类标注任务的双问题界面下,大规模LLM标注的性能优于人类监督分类器,成本约为其十分之一(GPT-5.2 Batch API为28美元,Prolific为316美元)。这一优势对于闭源(GPT-5.2)和开源(Qwen3.5-122B-10B)LLM均成立,在软标签评估下具有鲁棒性,并且是通过双问题分解实现的;整体单提示基线仅与人类监督持平。在任一LLM标注器下,主动学习相比随机采样没有可靠优势。然而,错误结构差异显著:只有GPT-5.2在双问题界面下产生的分类器具有接近人类的FP/FN平衡,而其他LLM变体过度标记了边境管制和经济竞争话语。我们发布了数据集和代码。

英文摘要

Instruction-tuned LLMs can annotate thousands of instances at low cost. This raises two questions for active learning (AL): can LLM labels replace human labels within the AL loop, and does AL remain necessary when entire corpora can be cheaply labeled? We investigate both on a new dataset of 277,902 German political TikTok comments (25,974 LLM-labeled, 5,000 human-annotated), comparing LLM and human annotation across seven conditions, four encoders, and 10 random seeds. Under a two-question interface that mirrors the human annotation task, LLM annotation at scale outperforms human-supervised classifiers at roughly one-tenth the cost (\$28 for GPT-5.2 Batch API vs. \$316 for Prolific). The advantage holds for both a closed-source (GPT-5.2) and an open-weight (Qwen3.5-122B-10B) LLM, is robust under soft-label evaluation, and is unlocked specifically by the two-question decomposition; a holistic single-prompt baseline only ties with human supervision. AL provides no reliable advantage over random sampling under either LLM annotator. However, error structure varies sharply: only GPT-5.2 under the two-question interface produces classifiers with near-human FP/FN balance, while other LLM variants over-flag border-control and economic competition discourse. We release the dataset and code.

2606.19220 2026-06-18 cs.LG cs.AI 新提交 65%

Machine Unlearning for the XGBoost Model with Network Intrusion Datasets

面向网络入侵数据集的XGBoost模型机器遗忘

Diana Magalhães, Eva Maia, João Vitorino, Isabel Praça

发表机构 * GECAD, ISEP, Polytechnic of Porto(GECAD、ISEP、波尔图理工大学)

专题命中 安全评测 :XGBoost模型遗忘,与安全相关但非LLM

AI总结 针对XGBoost模型提出XGBoost-Forget遗忘方法,在表格型网络入侵数据集上实现高效遗忘,保持模型性能的同时显著提升遗忘速度。

Comments 12 pages, 7 tables, WorldCist'26 Conference

详情
AI中文摘要

机器遗忘(MU)已成为一种从训练模型中移除特定数据点而无需完全重新训练的重要技术。然而,现有大多数MU研究集中于深度学习和图像数据,在网络入侵检测领域存在空白,该领域严重依赖表格数据。本文引入XGBoost-Forget,一种针对XGBoost模型的遗忘方法,以填补这一空白。该方法在两个表格型网络入侵(NI)数据集IoT-23和GeNIS上进行了评估,使用多个指标衡量模型性能、遗忘效率和遗忘质量。结果表明,XGBoost-Forget在保持接近原始模型的预测性能的同时,提供了显著更快的遗忘速度,展示了其在表格型NI场景中用于MU的潜力。

英文摘要

Machine Unlearning (MU) has emerged as an important technique for removing specific data points from trained models without requiring full retraining. However, most existing MU research focuses on deep learning and image data, leaving a gap in the domain of network intrusion detection, which relies heavily on tabular data. This work introduces XGBoost-Forget, an unlearning approach for the XGBoost model, to address this gap. The approach is evaluated on two tabular Network Intrusion (NI) datasets, IoT-23 and GeNIS, using multiple metrics to assess model performance, unlearning efficiency, and forgetting quality. The results show that XGBoost-Forget maintains predictive performance close to the original model while providing significantly faster unlearning, demonstrating its potential for MU in tabular NI settings.

2606.19129 2026-06-18 cs.CR cs.LG 新提交 60%

Giskard : Byzantine Robust and Confidential Aggregation for Large-Scale Decentralized Learning

Giskard: 大规模去中心化学习中的拜占庭鲁棒与机密聚合

Ousmane Touat, César Sabater, Mohamed Maouche, Sonia Ben Mokhtar

发表机构 * INSA Lyon, LIRIS, CNRS(里尔斯大学 Lyon,LIRIS,CNRS) INRIA, INSA Lyon(法国国家科学研究中心 INRIA,里尔斯大学 Lyon)

专题命中 安全评测 :去中心化学习中的拜占庭鲁棒聚合,涉及安全

AI总结 针对去中心化学习中同时保证机密性和抵御拜占庭行为的挑战,提出Giskard协议,通过树状委员会结构和BGW风格MPC实现近似中位数聚合,在百万级参与者下降低通信复杂度并保持模型效用。

Comments 17 pages, with appendix

详情
AI中文摘要

在去中心化学习中同时处理机密性和拜占庭行为是一个具有挑战性的问题。实际上,在去中心化学习中,客户端在本地保留数据的同时训练机器学习模型,并与一组邻居共享其模型参数或梯度。虽然强制机密性需要隐藏交换的模型参数/梯度(例如,通过使用密码学技术),但处理拜占庭贡献通常需要检查后者。因此,大多数研究工作分别处理这些目标。最近的一系列工作提出使用安全多方计算(MPC)来实现对模型投毒攻击的鲁棒聚合器,从而同时保证机密性和拜占庭鲁棒性。然而,这些解决方案扩展性差:它们要么要求参与者之间进行全对全通信,要么将整个计算委托给一个小子集,其计算和通信负载随网络规模成比例增长。在本文中,我们提出了Giskard,一种用于机密且拜占庭鲁棒的去中心化聚合协议。Giskard将$n$个参与方组织成一个大小为$O(\log n)$的委员会树,并通过在值域上进行委员会适应的分布式二分搜索来评估坐标-wise近似中位数,在每个委员会内使用BGW风格的MPC。我们通过理论证明其安全性和机密性,并通过涉及多达一百万个参与者的广泛实验来评估Giskard。与其最接近的竞争对手相比,Giskard渐近地降低了每方通信复杂度,同时在多达$n/4$个拜占庭参与方下表现出相当的模型效用。

英文摘要

Dealing simultaneously with confidentiality and Byzantine behaviors in decentralized learning is a challenging problem. Indeed, in decentralized learning, clients train a machine learning model while keeping their data locally and share their model parameters or gradients with a set of neighbors. While enforcing confidentiality calls for hiding the exchanged model parameters/gradients (e.g., by using cryptographic techniques), dealing with Byzantine contributions often requires inspecting the latter. Hence, most research works address these objectives separately. A recent line of work proposes to employ secure multi-party computation (MPC) to implement robust aggregators against model poisoning, thereby enforcing both confidentiality and Byzantine resilience. However, these solutions scale badly: they either require all-to-all communication between participants or delegate the entire computation to a small subset, whose computational and communication load grows proportionally with the size of the network. In this paper, we present Giskard, a protocol for confidential and Byzantine-robust decentralized aggregation. Giskard organizes $n$ parties into a tree of committees of size $O(\log n)$ and evaluates a coordinate-wise approximate median via a committee-adapted distributed binary search over the value domain, using BGW-style MPC within each committee. We assess Giskard both theoretically by proving its security and confidentiality properties and experimentally through extensive experiments involving up to one million participants. Compared to its closest competitors, Giskard reduces per-party communication complexity asymptotically while exhibiting comparable model utility under up to $n/4$ Byzantine parties.

2606.18922 2026-06-18 cs.CL cs.AI 新提交 60%

As Easy as Rocket Science: Assessing the Ability of Large Language Models to Interpret Negation in Figurative Language

像火箭科学一样简单:评估大型语言模型解释比喻语言中否定能力的研究

Jasmine Owers, Edwin Simpson, Martha Lewis

发表机构 * Intelligent Systems Lab University of Bristol(智能系统实验室 英国布里斯托尔大学) ILLC University of Amsterdam(阿姆斯特丹大学语言学研究所)

专题命中 安全评测 :理解否定与比喻属于语言能力评测

AI总结 本研究通过开发新的注释数据集,测试多种大型语言模型在比喻语言中理解否定的能力,发现否定与比喻的组合对模型构成挑战,且性能高度依赖提示风格。

Comments 16 pages, 16 figures; for associated code and data see https://github.com/jrdowers/Negation-and-Fig-Lang; To be published in Transactions of the Association for Computational Linguistics

详情
AI中文摘要

比喻语言和否定是当前语言模型面临挑战的两个领域,然而,两者在书面和口语中广泛使用。大型语言模型(LLMs)也广泛应用于日常场景,在这些场景中它们不一定能针对特定数据集进行调整。因此,理解LLMs正确解释包含否定和比喻语言的文本的能力至关重要。为了研究这一点,我们为现有的比喻语言数据集开发了一套新的注释,并在该数据集上测试了一系列语言模型。我们发现,否定和比喻性的结合可能带来特殊挑战,并且整体性能以及不同否定类型上的性能特别依赖于所使用的提示风格。

英文摘要

Figurative language and negation are two areas that challenge current language models, however, both are widely used throughout written and spoken language. Large language models (LLMs) are also widely used in everyday contexts where they cannot necessarily be tuned for a specific dataset. It is therefore essential to understand the ability of LLMs to correctly interpret text that includes both negation and figurative language. To investigate this, we develop a set of new annotations to an existing dataset of figurative language, and test a range of language models on the dataset. We find that the combination of negation and figurativeness can present a particular challenge, and that performance overall and across different negation types is particularly dependent on the prompt style used.

2606.18593 2026-06-18 cs.HC cs.CY 新提交 60%

"The New Era of Tech-Enabled Traceability": Tensions between the FDA's Data Governance Vision and the Lived Realities of Food Producers

“技术赋能可追溯性的新时代”:FDA的数据治理愿景与食品生产者的现实困境之间的张力

Soonho Kwon, Catherine Wieczorek, Heidi Biggs, Shellye Suttles, Tammi S. Etheridge, Annabel Rothschild, Shaowen Bardzell

专题命中 安全评测 :分析FDA食品追溯规则数据治理与生产者矛盾

AI总结 研究美国FDA食品追溯规则如何将农业食品利益相关者转化为数据劳工,通过分析1198条公众评论揭示数据收集、基础设施和文化实践中的三大矛盾。

详情
AI中文摘要

美国食品药品监督管理局(FDA)的《食品追溯规则》要求农业食品供应链利益相关者(包括农民、渔民、零售工人等)从2026年1月起维护详细的跟踪记录。通过该规则,FDA设想了一个“技术赋能可追溯性的新时代”,其中标准化、协调一致的跟踪数据作为基础公共卫生基础设施,能够更快速地识别和移除可能受污染的食物,最终降低食源性疾病的风险。尽管这一愿景令人期待,但我们观察到,该规则通过强制要求严格的数据收集、格式化和报告要求,将农业食品利益相关者重新配置为数据劳工。在本文中,我们研究了这种重新配置所产生的张力和负担。以数据女性主义为视角,关注数据驱动的政策实施如何不成比例地加重缺乏基础设施和财务能力的小规模、资源不足的利益相关者的负担,我们分析了针对该拟议规则提交至http://www.regulations.gov的1198条公众评论。我们的定性文档分析揭示了三个关键张力:(1)利益相关者在被重新配置为数据工作者时所经历的个人劳动、财务和教育负担;(2)由于基础设施限制、文化背景和特定生产实践,数据跟踪变得不可行的情况;(3)该规则旨在提供的灵活性因其模糊性反而引入了困惑和负担的实例。

英文摘要

The U.S. Food and Drug Administration (FDA)'s Food Traceability Rule requires agri-food supply chain stakeholders (stakeholders)--including farmers, fishers, retail workers, and others--to maintain detailed tracking records beginning in January 2026. Through this Rule, the FDA envisions a "New Era of Tech-Enabled Traceability," in which standardized, harmonized tracking data serve as a foundational public health infrastructure, enabling more rapid identification and removal of potentially contaminated food and ultimately reducing the risk of foodborne illness. Despite this promising vision, we observe that the Rule reconfigures agri-food stakeholders into data laborers by mandating stringent data collection, formatting, and reporting requirements. In this paper, we examine the tensions and burdens that arise from such reconfiguration. Leveraging Data Feminism as an orientation to attend to how data-driven policy implementation disproportionately burdens smaller, under-resourced stakeholders who lack the infrastructural and financial capacity to comply, we analyze 1,198 public comments submitted to Regulations.gov in response to the proposed Rule. Our qualitative document analysis reveals three key tensions: (1) the individual labor, financial, and educational burdens stakeholders experience as they are reconfigured into data workers; (2) moments where data tracking becomes infeasible due to infrastructural limitations, cultural contexts, and situated production practices; and (3) instances where the Rule's intended flexibility instead introduces confusion and burden due to its ambiguity.

2606.18557 2026-06-18 cs.AI cs.LG cs.LO 新提交 60%

DeFAb: A Verifiable Benchmark for Defeasible Abduction in Foundation Models

DeFAb:基础模型中可废止溯因的可验证基准

Patrick Cooper, Alvaro Velasquez

发表机构 * University of Colorado Boulder(科罗拉多大学博尔德分校)

专题命中 安全评测 :评估模型推理的严谨性

AI总结 提出DeFAb基准,通过将知识库转换为可验证的溯因实例,评估基础模型在可废止推理中的创造力与理论推理能力,发现前沿模型准确率远低于符号求解器。

Comments 33 pages, 14 figures, 23 tables. Dataset: https://huggingface.co/datasets/PatrickAllenCooper/DeFAb ; code and evaluation harness: https://github.com/PatrickAllenCooper/blanc

详情
AI中文摘要

一个基于规则的逻辑求解器在不到50微秒内以100%的准确率解决了我们基准中的每个实例;而最佳前沿语言模型在渲染鲁棒评估下最高仅达65%,最差降至23.5%(四种表面渲染的最坏情况)。我们引入DeFAb(可废止溯因基准),这是一个数据集和生成流水线,将四十年的公共资助知识库转换为形式化可废止溯因实例:通过覆盖默认值同时保留无关期望来构建解释异常假设。由于每个假设必须通过多项式时间检查(有效推导、保守性和最小性),DeFAb将逻辑严谨性作为衡量创造性和理论推理的工具,评分的是理论修正的规范构建,而非流畅但破坏理论的散文。该流水线将分类层次结构(OpenCyc、YAGO、Wikidata)与行为属性图(ConceptNet、UMLS)配对,从18个来源生成372,648+个实例,涉及33.75M条实例化规则,分为三个级别,并具有多项式时间可验证的金标准。四个前沿模型未能可靠内化可废止推理:渲染鲁棒的Level 2准确率为7.8-23.5%;思维链方差(约36个百分点)超过任何模型间差距;匹配的污染控制隔离出+19.4个百分点的Level 3差距。我们进一步发布了DeFAb-Hard(235个实例的Level 3难度变体;最佳模型53.3% vs 符号100%)和CONJURE(一个内核验证的变革性创造力变体,包含560个Lean 4/Mathlib实例,其金答案证明内核先前未包含的定义,无需判断的验证器;试点发现零新概念)。同一验证器还可作为偏好优化(DPO、RLVR/GRPO)的精确奖励。基于MIT许可发布于此https URL。

英文摘要

A rule-based logic solver resolves every instance in our benchmark in under 50 microseconds with 100% accuracy; the best frontier language model reaches 65% at best and drops to 23.5% under rendering-robust evaluation (worst case over four surface renderings). We introduce DeFAb (Defeasible Abduction Benchmark), a dataset and generation pipeline that converts four decades of publicly funded knowledge bases into formally grounded instances for defeasible abduction: constructing hypotheses that explain anomalies by overriding defaults while preserving unrelated expectations. Because every hypothesis must pass polynomial-time checks for valid derivation, conservativity, and minimality, DeFAb makes logical rigor the instrument for measuring creativity and theoretical reasoning, scoring the disciplined construction of theory revisions rather than fluent but theory-destroying prose. The pipeline pairs taxonomic hierarchies (OpenCyc, YAGO, Wikidata) with behavioral property graphs (ConceptNet, UMLS) to produce 372,648+ instances across 33.75M materialized rules from 18 sources, in three levels with polynomial-time verifiable gold standards. Four frontier models do not reliably internalize defeasible reasoning: rendering-robust Level 2 accuracy is 7.8-23.5%; chain-of-thought variance (~36 pp) exceeds any inter-model gap; and a matched contamination control isolates a +19.4 pp Level 3 gap. We further release DeFAb-Hard (a 235-instance Level 3 difficulty variant; best model 53.3% vs 100% symbolic) and CONJURE (a kernel-verified transformative-creativity variant of 560 Lean 4/Mathlib instances whose gold answers are definitions the proof kernel did not previously contain, judge-free verifier; a pilot finds zero novel concepts). The same verifier doubles as an exact reward for preference optimization (DPO, RLVR/GRPO). Released under MIT at https://huggingface.co/datasets/PatrickAllenCooper/DeFAb.

2606.19263 2026-06-18 cs.SI cs.CY cs.MA econ.GN q-fin.EC 新提交 55%

Digital Speech Acts Retain Control of Copyright with People, Not Platforms

数字言语行为:版权控制权归属于人而非平台

James Golike, Ehud Shapiro

专题命中 安全评测 :数字版权控制,非直接安全但相关

AI总结 本文提出“数字言语行为”概念,即个人用自己的私钥在自有设备上对内容进行加密签名,从而确立归属、责任和作者身份,并论证该行为符合美国版权法保护条件,能确保个人对内容的控制权,为数字主权和民主自治奠定基础。

详情
AI中文摘要

法律先例保护计算机代码作为可版权化的表达。它们使集中式数字平台——运营着持有所有用户数据的企业服务器——能够通过版权、合同和技术架构的相互作用构建私人治理体制:创造几乎所有平台价值的人必须通过服务条款协议放弃有效的版权控制,作为参与的条件。相比之下,草根平台由加密身份标识的个人组成,他们独立于任何服务器或全球资源操作自己的联网智能手机;每个人在自己的设备上持有自己的数据,没有第三方占有或中介。在这里,我们定义了“数字言语行为”的概念——个人在自己的设备上用自己的私钥对个人内容进行加密签名的故意意志行为——通过该行为,个人同时确立了签名内容的归属、责任和作者身份。我们认为:(ia) 数字言语行为符合美国现有先例下的版权保护条件:《Burrow-Giles》将作者身份定位于尽管存在机械或算法过程但具有意志的创造性选择,《Feist》提供了最低创造性门槛,而持久设备存储满足了版权法的固定要求;(ib) 草根平台背后的数字社会契约通过设计保留了这一版权——签名内容不能与其签名分离,并且随着内容转发,完整的来源链不断累积——因此所有权和占有权在个人身上统一;(ic) 数字言语行为中的版权是数字主权和民主自治的先决条件。

英文摘要

Legal precedents protect computer code as copyrightable expression. They have enabled centralized digital platforms -- operating from corporate servers that hold all user data -- to construct private governance regimes through the interaction of copyright, contract, and technical architecture: people who create virtually all platform value must surrender effective copyright control through Terms of Service agreements as a condition of participation. In contrast, grassroots platforms consist of cryptographically-identified people operating their networked smartphones independently of any server or global resource; each person holds their own data on their own device, with no third party in possession or intermediation. Here, we define the notion of a \textit{digital speech act} -- a deliberate volitional act by a person of cryptographically signing personal content with the person's private key, carried out on the person's own device -- through which the person simultaneously establishes attribution, accountability, and authorship over the signed content. We contend that (\ia) digital speech acts qualify for copyright protection under existing U.S.\ precedent: \textit{Burrow-Giles} locates authorship in volitional creative choices despite mechanical or algorithmic processes, \textit{Feist} supplies the minimal-creativity threshold, and persistent device storage satisfies the Copyright Act's fixation requirement; (\ib) the digital social contract underlying grassroots platforms preserves this copyright by design -- signed content cannot be unbundled from its signature, and the full provenance chain accumulates as content is forwarded -- so that ownership and possession coalesce in the person; and (\ic) copyright in digital speech acts is a prerequisite for digital sovereignty and democratic self-governance.