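arXivDaily: arXiv daily academic digest, updated Monday through Friday

AI Large Models

LLM Alignment and Safety

LLM alignment, safety, jailbreaking, red teaming, prompt injection, and trustworthy evaluation.

Indexed for the current date: 41 entries. Signal sources: cs.CL, cs.AI, cs.CY, cs.LG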

1. Safety evaluation (10 papers)

2606.18289 2026-06-18 cs.HC cs.CY New submission 70%

Beyond the Algorithm: Professional Experiences and Perceptions of AI Bias

Micarah Malone-Gawu

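Topic match (Safety evaluation): studies the perception and mitigation of AI bias, touching on algorithmic fairness and safety.

AI summary: A qualitative multi-case study of how AI practitioners perceive and mitigate algorithmic bias. It finds that bias stems from historical inequities, exclusionary design, and organizational pressure, and stresses that fairness requires structural accountability, diverse participation, and cognitive awareness.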

Comments PhD thesis

Abstract

The purpose of this qualitative multi-case study was to examine how social bias emerges, is perceived, and can be mitigated within artificial intelligence and machine learning systems by practitioners directly involved in their design, development, and governance. Although examples from healthcare, criminal justice, employment, and education were used to illustrate domains where automated systems shape everyday life, the study focused on the lived experiences and professional insights of AI practitioners rather than sector-specific populations. Guided by Intersectionality Theory and Cognitive Science, the study employed an interpretivist approach, utilizing semi-structured interviews with nine practitioners, supplemented by document analysis and triangulated case material to enrich contextual understanding. Findings showed that algorithmic bias arises from historical inequities, exclusionary design assumptions, and organizational pressures that prioritize speed and efficiency over ethical reflection. Participants emphasized that technical corrections alone cannot ensure fairness; instead, equitable AI requires structural accountability, diverse participation, and sustained cognitive awareness during the development lifecycle. Many described limited enforcement of ethical standards and organizational cultures that inconsistently support responsible practice. The study concludes that human-centered and socially grounded AI development depends on embedding ethics early in the design process, strengthening governance frameworks, and cultivating institutional environments that encourage reflective decision-making. These insights contribute to ongoing conversations on responsible AI and offer practical guidance for organizations seeking to design systems that are transparent, accountable, and aligned with the communities they affect.

2606.18285 2026-06-18 cs.SI cs.CY New submission 70%

RELIANCE: Curating and Evaluating Reproductive Health Information on Social Media

Vaibhav Balloli, Laura Peyton Ellis, Vishala Mishra, Alice Chi, Alex Peahl, Elizabeth Bondi-Kelly

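Topic match (Safety evaluation): evaluates the capability and safety of LLMs in fact-checking reproductive health information.

AI summary: Builds RELIANCE, an expert-annotated dataset of pregnancy and postpartum health information on TikTok, and evaluates LLM fact-checking on it. Nearly 60% of the sampled information is accurate, and there is a 15% gap between evaluating specific claims and evaluating whole content.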

Comments Accepted at Datasets and Benchmarks Track, ACM Knowledge Discovery and Data Mining (KDD) 2026. Project page: https://realize-lab.github.io/RELIANCE/

Abstract

Social media platforms like TikTok have become a key source of health information, with studies reporting inaccuracies in posts. As Large Language Model (LLM) providers increasingly integrate LLMs into digital platforms to fact-check content (e.g., Grok and Perplexity on X and WhatsApp, respectively) and are being used by people to fact-check information, deploying these systems in critical areas such as reproductive health without rigorous evaluation can cause serious harm. We introduce RELIANCE, an expert-annotated dataset of health information on TikTok surrounding pregnancy and postpartum queries, serving as both an analysis of the reproductive health information landscape and an evaluation of LLMs' capabilities in fact-checking this content. Our dataset comprises 409 annotated sentences from 336 videos across 56 clinician-reviewed queries, annotated by three expert clinicians in Obstetrics, Gynecology, and Internal Medicine. Our findings reveal that nearly 60% of the health information in the videos we sampled is accurate. Furthermore, LLM evaluations reveal a gap between evaluating specific claims and evaluating the entire content (15%). We believe that our methodology, dataset, and tool will support the machine learning community in improving LLMs for important domains with real-world data, extending to other platforms and languages, and helping the health community further understand the information landscape on social media. Our dataset and code are made available at https://realize-lab.github.io/RELIANCE/.

2606.18261 2026-06-18 cs.HC cs.CY New submission 70%

"Are you an AI?" Analyzing Client Suspicion of AI Use in Crisis Counseling

Shreya Shah, Akshay Swaminathan, Meghana Simhadri, Ivan Lopez, Sharang Phadke, Divyanjali Verma, Abhay John, Luke Zhao, Fiona Cai, Sharon Zhang, Gloria Ye, Ivy Pham, William Wang, Sebastian Garcia, Sarah Wornow, Angelina Wang, Nigam H. Shah

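Topic match (Safety evaluation): analyzes client suspicion of AI use in crisis counseling, touching on trust and safety.

AI summary: Analyzing 75,777 crisis counseling conversations, the study finds that the share of conversations in which clients suspected AI use rose from 0.8% to 2.6%, that most suspicion arose in the first half of the conversation, and that even after counselors assured clients they were not AI, clients kept pressing or ended the conversation 17.6% of the time.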

Abstract

As artificial intelligence (AI) tools get increasingly deployed for mental healthcare, public trust in these systems remains uncertain. It is unclear how clients perceive AI involvement in counseling interactions, particularly in moments of crisis that require empathy and connection. To address this gap, we analyzed 75,777 crisis counseling conversations from a human-staffed WhatsApp helpline in India to characterize how often clients suspected they were speaking to AI, what triggered those doubts, and how counselors responded. Though no conversations actually involved AI assistance, the proportion of conversations where clients suspected AI use increased from 0.8% in June 2024 to 2.6% in March 2025. Within suspicious conversations, 21.5% of clients stated an explicit preference for humans. Client suspicion primarily arose in the first half of messages (68.3%), and when counselors offered reassurance (e.g. 'I assure you; this is not ai!'), clients continued to press or ended the conversation 17.6% of the time. As AI tools get increasingly integrated into counselor workflows, understanding these dynamics is essential for designing AI systems that preserve the therapeutic relationship between counselors and clients.

2606.18142 2026-06-18 cs.AI cs.CL cs.CY New submission 70%

Your AI Travel Agent Would Book You a Bullfight: An Agentic Benchmark for Implicit Animal Welfare in Frontier AI Models

Jasmine Brazilek, Joel Christoph, Miles Tidmarsh, Carol Kline, Oliver Tullio, Arturs Kanepajs

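Affiliations: Compassion Aligned Machine Learning; Sentient Futures; Harvard Kennedy School; Appalachian State University Department of Management

Topic match (Safety evaluation): tests whether models behave so as to avoid animal exploitation.

AI summary: Introduces TAC, the first agentic benchmark testing whether AI agents avoid options involving animal exploitation when acting for users in tasks such as travel booking. All seven frontier models evaluated score below the 64% chance level; the best reaches only 53%.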

Abstract

AI agents are moving from advisors to actors, booking travel, planning menus, and running procurement on behalf of users. Existing benchmarks for AI and animal welfare evaluate model text responses to question-answer prompts, leaving open whether the welfare reasoning surfaced in those responses transfers to agentic deployment where the model must take actions with tools. We introduce TAC (Travel Agent Compassion), the first agentic benchmark measuring whether AI agents avoid options involving animal exploitation when acting on behalf of users. TAC presents an AI agent with twelve hand-authored travel booking scenarios across six categories of animal exploitation, augmented to forty-eight samples to control for price, rating, and position confounds. We evaluate seven frontier models from four labs. Every model scores below the chance level of sixty-four percent, with the best performer (Claude Opus 4.7) at fifty-three percent. A single welfare-aware sentence in the system prompt yields gains of forty-seven to sixty-three percentage points in Claude and GPT-5.5, twenty-six points in GPT-5.2, and under twelve points in DeepSeek and Gemini. An auxiliary Inspect Scout audit of 288 base-condition transcripts from the top two performers, using Gemini 2.5 Flash Lite as judge, flags zero transcripts for evaluation awareness, suggesting the below-chance rates do not stem from the models recognising the evaluation. We discuss implications for category-level variation across cultural domains, the limits of text-response welfare benchmarks, and the EU General-Purpose AI Code of Practice systemic risk framework.
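
The abstract compares each model against a chance level without saying how that level is computed. A minimal sketch of one plausible definition follows, assuming the chance level is the expected welfare-safe rate of an agent that picks uniformly at random among the options in each scenario. The function names and the boolean scenario encoding are hypothetical and are not taken from TAC.

```python
from typing import List


def chance_level(scenarios: List[List[bool]]) -> float:
    """Expected welfare-safe rate of a uniformly random agent (assumed definition).

    Each scenario is a list of flags, one per bookable option:
    True means welfare-safe, False means the option involves exploitation.
    """
    per_scenario = [sum(options) / len(options) for options in scenarios]
    return sum(per_scenario) / len(per_scenario)


def safe_rate(choices: List[int], scenarios: List[List[bool]]) -> float:
    """Fraction of scenarios where the agent's chosen option index is welfare-safe."""
    safe = sum(scenarios[i][choice] for i, choice in enumerate(choices))
    return safe / len(scenarios)


if __name__ == "__main__":
    # Hypothetical toy data, not scenarios from the benchmark.
    toy = [[True, True, False], [True, False]]
    print(chance_level(toy), safe_rate([2, 0], toy))
```

Under this reading, a score below the chance level means the agent selects exploitative options more often than random picking would.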

2606.19220 2026-06-18 cs.LG cs.AI New submission 65%

Machine Unlearning for the XGBoost Model with Network Intrusion Datasets

Diana Magalhães, Eva Maia, João Vitorino, Isabel Praça

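Affiliations: GECAD, ISEP, Polytechnic of Porto

Topic match (Safety evaluation): unlearning for XGBoost models; security-related but not about LLMs.

AI summary: Proposes XGBoost-Forget, an unlearning method for XGBoost models. On tabular network intrusion datasets it keeps predictive performance close to the original model while unlearning significantly faster.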

Comments 12 pages, 7 tables, WorldCist'26 Conference

Abstract

Machine Unlearning (MU) has emerged as an important technique for removing specific data points from trained models without requiring full retraining. However, most existing MU research focuses on deep learning and image data, leaving a gap in the domain of network intrusion detection, which relies heavily on tabular data. This work introduces XGBoost-Forget, an unlearning approach for the XGBoost model, to address this gap. The approach is evaluated on two tabular Network Intrusion (NI) datasets, IoT-23 and GeNIS, using multiple metrics to assess model performance, unlearning efficiency, and forgetting quality. The results show that XGBoost-Forget maintains predictive performance close to the original model while providing significantly faster unlearning, demonstrating its potential for MU in tabular NI settings.

2606.19129 2026-06-18 cs.CR cs.LG New submission 60%

Giskard: Byzantine Robust and Confidential Aggregation for Large-Scale Decentralized Learning

Ousmane Touat, César Sabater, Mohamed Maouche, Sonia Ben Mokhtar

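Affiliations: INSA Lyon, LIRIS, CNRS; INRIA, INSA Lyon

Topic match (Safety evaluation): Byzantine-robust aggregation in decentralized learning, which is security-related.

AI summary: Addresses the challenge of guaranteeing confidentiality and resisting Byzantine behavior at the same time in decentralized learning. The proposed Giskard protocol computes an approximate median aggregate using a tree of committees and BGW-style MPC, lowering communication complexity while preserving model utility at the scale of a million participants.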

Comments 17 pages, with appendix

Abstract

Dealing simultaneously with confidentiality and Byzantine behaviors in decentralized learning is a challenging problem. Indeed, in decentralized learning, clients train a machine learning model while keeping their data locally and share their model parameters or gradients with a set of neighbors. While enforcing confidentiality calls for hiding the exchanged model parameters/gradients (e.g., by using cryptographic techniques), dealing with Byzantine contributions often requires inspecting the latter. Hence, most research works address these objectives separately. A recent line of work proposes to employ secure multi-party computation (MPC) to implement robust aggregators against model poisoning, thereby enforcing both confidentiality and Byzantine resilience. However, these solutions scale badly: they either require all-to-all communication between participants or delegate the entire computation to a small subset, whose computational and communication load grows proportionally with the size of the network. In this paper, we present Giskard, a protocol for confidential and Byzantine-robust decentralized aggregation. Giskard organizes $n$ parties into a tree of committees of size $O(\log n)$ and evaluates a coordinate-wise approximate median via a committee-adapted distributed binary search over the value domain, using BGW-style MPC within each committee. We assess Giskard both theoretically by proving its security and confidentiality properties and experimentally through extensive experiments involving up to one million participants. Compared to its closest competitors, Giskard reduces per-party communication complexity asymptotically while exhibiting comparable model utility under up to $n/4$ Byzantine parties.
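
The core numerical idea in the abstract, a coordinate-wise approximate median found by binary search over the value domain, can be illustrated in plaintext. The sketch below is not the Giskard protocol: it has no committees, no secret sharing, and no MPC, and the function names, the domain bounds `lo`/`hi`, and the tolerance `tol` are assumptions made for illustration. The only operation the search needs is a count of inputs at or below a threshold, which is the kind of query a protocol could evaluate without revealing individual values.

```python
from typing import List, Sequence


def approx_median(values: Sequence[float], lo: float, hi: float, tol: float = 1e-3) -> float:
    """Binary search over the value domain for the lower median.

    Assumes every value lies in (lo, hi]. Returns a point within tol above the
    lower median, using only threshold-count queries on the inputs.
    """
    target = (len(values) + 1) // 2  # rank of the lower median
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        # Plaintext stand-in for a count that a protocol would compute privately.
        count = sum(1 for v in values if v <= mid)
        if count >= target:
            hi = mid  # the median is at or below mid
        else:
            lo = mid  # the median is above mid
    return hi


def coordinate_wise_median(updates: Sequence[Sequence[float]], lo: float, hi: float,
                           tol: float = 1e-3) -> List[float]:
    """Apply the approximate median independently to each coordinate."""
    dim = len(updates[0])
    return [approx_median([u[d] for u in updates], lo, hi, tol) for d in range(dim)]


if __name__ == "__main__":
    # Three honest updates and one outlier standing in for a Byzantine party.
    updates = [[0.1, -1.0], [0.2, -1.1], [0.3, -0.9], [9.0, 9.0]]
    print(coordinate_wise_median(updates, lo=-10.0, hi=10.0))
```

The number of search rounds grows with log((hi - lo) / tol) and does not depend on the number of parties, and a minority of extreme inputs cannot move the result far from the honest values.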

2606.18922 2026-06-18 cs.CL cs.AI New submission 60%

As Easy as Rocket Science: Assessing the Ability of Large Language Models to Interpret Negation in Figurative Language

Jasmine Owers, Edwin Simpson, Martha Lewis

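Affiliations: Intelligent Systems Lab, University of Bristol; ILLC, University of Amsterdam

Topic match (Safety evaluation): understanding negation and figurative language is a language-capability evaluation.

AI summary: Develops new annotations for an existing dataset and tests a range of large language models on understanding negation in figurative language. The combination of negation and figurativeness challenges the models, and performance depends heavily on prompt style.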

Comments 16 pages, 16 figures; for associated code and data see https://github.com/jrdowers/Negation-and-Fig-Lang; To be published in Transactions of the Association for Computational Linguistics

Abstract

Figurative language and negation are two areas that challenge current language models, however, both are widely used throughout written and spoken language. Large language models (LLMs) are also widely used in everyday contexts where they cannot necessarily be tuned for a specific dataset. It is therefore essential to understand the ability of LLMs to correctly interpret text that includes both negation and figurative language. To investigate this, we develop a set of new annotations to an existing dataset of figurative language, and test a range of language models on the dataset. We find that the combination of negation and figurativeness can present a particular challenge, and that performance overall and across different negation types is particularly dependent on the prompt style used.

2606.18593 2026-06-18 cs.HC cs.CY New submission 60%

"The New Era of Tech-Enabled Traceability": Tensions between the FDA's Data Governance Vision and the Lived Realities of Food Producers

Soonho Kwon, Catherine Wieczorek, Heidi Biggs, Shellye Suttles, Tammi S. Etheridge, Annabel Rothschild, Shaowen Bardzell

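Topic match (Safety evaluation): analyzes tensions between the data governance of the FDA Food Traceability Rule and food producers.

AI summary: Studies how the U.S. FDA Food Traceability Rule turns agri-food stakeholders into data laborers, analyzing 1,198 public comments to reveal three tensions spanning data collection, infrastructure, and cultural practice.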

Abstract

The U.S. Food and Drug Administration (FDA)'s Food Traceability Rule requires agri-food supply chain stakeholders (stakeholders)--including farmers, fishers, retail workers, and others--to maintain detailed tracking records beginning in January 2026. Through this Rule, the FDA envisions a "New Era of Tech-Enabled Traceability," in which standardized, harmonized tracking data serve as a foundational public health infrastructure, enabling more rapid identification and removal of potentially contaminated food and ultimately reducing the risk of foodborne illness. Despite this promising vision, we observe that the Rule reconfigures agri-food stakeholders into data laborers by mandating stringent data collection, formatting, and reporting requirements. In this paper, we examine the tensions and burdens that arise from such reconfiguration. Leveraging Data Feminism as an orientation to attend to how data-driven policy implementation disproportionately burdens smaller, under-resourced stakeholders who lack the infrastructural and financial capacity to comply, we analyze 1,198 public comments submitted to Regulations.gov in response to the proposed Rule. Our qualitative document analysis reveals three key tensions: (1) the individual labor, financial, and educational burdens stakeholders experience as they are reconfigured into data workers; (2) moments where data tracking becomes infeasible due to infrastructural limitations, cultural contexts, and situated production practices; and (3) instances where the Rule's intended flexibility instead introduces confusion and burden due to its ambiguity.

2606.18557 2026-06-18 cs.AI cs.LG cs.LO New submission 60%

DeFAb: A Verifiable Benchmark for Defeasible Abduction in Foundation Models

Patrick Cooper, Alvaro Velasquez

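Affiliations: University of Colorado Boulder

Topic match (Safety evaluation): evaluates the rigor of model reasoning.

AI summary: Proposes the DeFAb benchmark, which converts knowledge bases into verifiable abduction instances to evaluate the creativity and theoretical reasoning of foundation models in defeasible reasoning. Frontier models score far below a symbolic solver.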

Comments 33 pages, 14 figures, 23 tables. Dataset: https://huggingface.co/datasets/PatrickAllenCooper/DeFAb ; code and evaluation harness: https://github.com/PatrickAllenCooper/blanc

Abstract

A rule-based logic solver resolves every instance in our benchmark in under 50 microseconds with 100% accuracy; the best frontier language model reaches 65% at best and drops to 23.5% under rendering-robust evaluation (worst case over four surface renderings). We introduce DeFAb (Defeasible Abduction Benchmark), a dataset and generation pipeline that converts four decades of publicly funded knowledge bases into formally grounded instances for defeasible abduction: constructing hypotheses that explain anomalies by overriding defaults while preserving unrelated expectations. Because every hypothesis must pass polynomial-time checks for valid derivation, conservativity, and minimality, DeFAb makes logical rigor the instrument for measuring creativity and theoretical reasoning, scoring the disciplined construction of theory revisions rather than fluent but theory-destroying prose. The pipeline pairs taxonomic hierarchies (OpenCyc, YAGO, Wikidata) with behavioral property graphs (ConceptNet, UMLS) to produce 372,648+ instances across 33.75M materialized rules from 18 sources, in three levels with polynomial-time verifiable gold standards. Four frontier models do not reliably internalize defeasible reasoning: rendering-robust Level 2 accuracy is 7.8-23.5%; chain-of-thought variance (~36 pp) exceeds any inter-model gap; and a matched contamination control isolates a +19.4 pp Level 3 gap. We further release DeFAb-Hard (a 235-instance Level 3 difficulty variant; best model 53.3% vs 100% symbolic) and CONJURE (a kernel-verified transformative-creativity variant of 560 Lean 4/Mathlib instances whose gold answers are definitions the proof kernel did not previously contain, judge-free verifier; a pilot finds zero novel concepts). The same verifier doubles as an exact reward for preference optimization (DPO, RLVR/GRPO). Released under MIT at https://huggingface.co/datasets/PatrickAllenCooper/DeFAb.
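
The three checks named in the abstract (valid derivation, conservativity, minimality) can be made concrete with a toy example. The sketch below is a deliberately simplified illustration and not the DeFAb verifier: it assumes single-step rules of the form (premise, conclusion, priority), resolves conflicts by priority alone, and tests minimality by brute force over subsets. All names and the rule encoding are hypothetical.

```python
from itertools import combinations
from typing import List, Sequence, Set, Tuple

# (premise, conclusion, priority); a conclusion starting with "not " negates an atom.
Rule = Tuple[str, str, int]


def negate(literal: str) -> str:
    return literal[4:] if literal.startswith("not ") else "not " + literal


def conclusions(facts: Set[str], rules: Sequence[Rule]) -> Set[str]:
    """Single-step defeasible inference: a rule fires only if it strictly
    outranks every applicable rule with the opposite conclusion."""
    applicable = [r for r in rules if r[0] in facts]
    derived = set(facts)
    for _, conclusion, priority in applicable:
        rivals = [p for (_, c, p) in applicable if c == negate(conclusion)]
        if all(priority > p for p in rivals):
            derived.add(conclusion)
    return derived


def verify(theory: List[Rule], hypothesis: List[Rule], anomaly_facts: Set[str],
           anomaly: str, expectations: List[Tuple[Set[str], str]]) -> bool:
    """Check a candidate hypothesis (a set of new rules) against three criteria."""
    def explains(rules: Sequence[Rule]) -> bool:
        return anomaly in conclusions(anomaly_facts, theory + list(rules))

    valid = explains(hypothesis)  # the anomaly becomes derivable
    conservative = all(           # unrelated expectations still hold
        expected in conclusions(facts, theory + hypothesis)
        for facts, expected in expectations
    )
    minimal = not any(            # no proper subset already explains the anomaly
        explains(subset)
        for k in range(len(hypothesis))
        for subset in combinations(hypothesis, k)
    )
    return valid and conservative and minimal


if __name__ == "__main__":
    theory = [("bird", "flies", 1)]
    expectations = [({"bird"}, "flies")]
    good = [("penguin", "not flies", 2)]
    print(verify(theory, good, {"bird", "penguin"}, "not flies", expectations))
```

A hypothesis such as `("bird", "not flies", 2)` also explains the anomaly but fails conservativity, because ordinary birds no longer fly; this is the kind of theory-destroying answer the abstract says the benchmark is designed to penalize.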

2606.19263 2026-06-18 cs.SI cs.CY cs.MA econ.GN q-fin.EC New submission 55%

Digital Speech Acts Retain Control of Copyright with People, Not Platforms

James Golike, Ehud Shapiro

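Topic match (Safety evaluation): digital copyright control; not directly about safety but related.

AI summary: Introduces the notion of a "digital speech act", in which a person cryptographically signs content with their own private key on their own device, thereby establishing attribution, accountability, and authorship. The paper argues that such acts qualify for protection under U.S. copyright law and keep control of content with the person, laying a foundation for digital sovereignty and democratic self-governance.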

Abstract

Legal precedents protect computer code as copyrightable expression. They have enabled centralized digital platforms -- operating from corporate servers that hold all user data -- to construct private governance regimes through the interaction of copyright, contract, and technical architecture: people who create virtually all platform value must surrender effective copyright control through Terms of Service agreements as a condition of participation. In contrast, grassroots platforms consist of cryptographically-identified people operating their networked smartphones independently of any server or global resource; each person holds their own data on their own device, with no third party in possession or intermediation. Here, we define the notion of a digital speech act -- a deliberate volitional act by a person of cryptographically signing personal content with the person's private key, carried out on the person's own device -- through which the person simultaneously establishes attribution, accountability, and authorship over the signed content. We contend that (a) digital speech acts qualify for copyright protection under existing U.S. precedent: Burrow-Giles locates authorship in volitional creative choices despite mechanical or algorithmic processes, Feist supplies the minimal-creativity threshold, and persistent device storage satisfies the Copyright Act's fixation requirement; (b) the digital social contract underlying grassroots platforms preserves this copyright by design -- signed content cannot be unbundled from its signature, and the full provenance chain accumulates as content is forwarded -- so that ownership and possession coalesce in the person; and (c) copyright in digital speech acts is a prerequisite for digital sovereignty and democratic self-governance.

2. Preference alignment (1 paper)

2606.19162 2026-06-18 cs.LG cs.CV New submission 60%

The Reward Was in Your Data All Along: Correcting Flow Matching with Discriminator-Guided RL

Nicolas Beltran-Velez, Felix Friedrich, Zhang Xiaofeng, Reyhane Askari-Hemmat, Xiaochuang Han, Adriana Romero-Soriano, Michal Drozdzal

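Affiliations: FAIR at Meta; Columbia University; Mila - Québec AI Institute; McGill University; Canada CIFAR AI Chair

Topic match (Preference alignment): uses RL for preference alignment, but mainly targets image generation.

AI summary: Flow-matching models show visual defects because the training loss is poorly matched to sample quality. The paper proposes Discriminator-Guided RL (DRL), which uses the logit of a discriminator in a pretrained representation space as the reward, substantially improving guidance-free FID and semantic-space FD as well as preference alignment.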

Comments 84 pages, including appendices

Abstract

Score- and flow-matching models often rely on preference-based reinforcement learning for two purposes: aligning with subjective preferences and, surprisingly, recovering properties such as visual realism and coherent object structure that matching-based training is intended to learn from the data itself. We argue that this reflects a structural mismatch. Matching losses measure $\ell_2$ regression error on the velocity or score field under training-time marginals, a proxy poorly aligned with the visual and semantic properties that determine sample quality at inference. Given a reward aligned with these properties, RL sidesteps the mismatch by evaluating the model on its own samples and following the reward landscape directly. The challenge is to obtain such a reward without relying on human preferences, which are expensive and conflate data realism with annotator inclinations. We propose Discriminator-Guided RL (DRL). DRL trains a discriminator to separate data from base-model samples in a pretrained representation space and uses its logit as the reward in KL-regularized RL. The pretrained space restricts the discriminator to perceptually meaningful directions, and the logit estimates the log-likelihood ratio between data and model, which is the optimal reward for targeting the data distribution. Across SiT, JiT, REPA, and RAE, DRL reduces guidance-free FID (e.g., $9.38 \to 2.62$ on SiT) and semantic-space FD (e.g., $88.2 \to 19.3$ on DINOv3 for SiT), with consistent gains across all backbones, and improves human-preference rewards without training on them. It also yields a better Pareto frontier between preference reward and image fidelity under subsequent preference-based post-training, increasing alignment while reducing low-level artifacts such as oversaturation and excessive brightness.
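
The claim that the discriminator logit is the optimal reward for targeting the data distribution follows from two standard results, sketched below. The notation is an assumption for illustration: $p_{\mathrm{data}}$ and $p_{\mathrm{base}}$ denote the data and base-model densities, $\beta$ is the KL regularization weight, and the discriminator is assumed to be Bayes-optimal and trained on balanced classes. The paper's own formulation works in a pretrained representation space and may differ in detail.

```latex
\begin{align}
D^*(x) &= \frac{p_{\mathrm{data}}(x)}{p_{\mathrm{data}}(x) + p_{\mathrm{base}}(x)}
  \quad\Longrightarrow\quad
  r(x) = \log\frac{D^*(x)}{1 - D^*(x)}
       = \log\frac{p_{\mathrm{data}}(x)}{p_{\mathrm{base}}(x)}, \\
\pi^* &= \arg\max_{\pi}\; \mathbb{E}_{x\sim\pi}\big[r(x)\big]
         - \beta\,\mathrm{KL}\big(\pi \,\|\, p_{\mathrm{base}}\big)
  \quad\Longrightarrow\quad
  \pi^*(x) \propto p_{\mathrm{base}}(x)\,\exp\!\big(r(x)/\beta\big), \\
\beta = 1 &:\quad
  \pi^*(x) \propto p_{\mathrm{base}}(x)\,\frac{p_{\mathrm{data}}(x)}{p_{\mathrm{base}}(x)}
  = p_{\mathrm{data}}(x).
\end{align}
```

With the log-likelihood ratio as reward and unit regularization weight, the KL-regularized optimum is the data distribution itself, so no human preference signal is needed to define the target.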