Large Model Alignment and Safety - arXivDaily Topic

2505.22829 2026-06-19 cs.LG cs.AI Version update 70%

Bridging Distribution Shift and AI Safety: Conceptual and Methodological Synergies

Chenruo Liu, Kenan Tang, Yao Qin, Qi Lei

Affiliations: Center for Data Science, New York University, New York, New York, USA; Computer Science Department, University of California, Santa Barbara, Santa Barbara, California, USA; Department of Electrical & Computer Engineering, University of California, Santa Barbara, Santa Barbara, California, USA; Courant Institute for Mathematical Sciences & Center for Data Science, New York University, New York, New York, USA

Topic match (safety evaluation): analyzes the synergies between distribution shift and AI safety.

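AI summary: By analyzing the conceptual and methodological synergies between distribution shift and AI safety, this paper establishes two kinds of connections between specific shift types and fine-grained safety problems, encouraging closer integration of research in the two fields.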

Comments 35 pages

2501.18038 2026-06-19 cs.CY Version update 70%

Acceleration AI Ethics and the Telus GenAI Conversational Agent

James Brusseau

Topic match (safety evaluation): discusses the acceleration AI ethics framework and how it balances innovation and safety.

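AI summary: This paper lays out the theoretical framework of acceleration ethics and uses Telus's generative AI language tool as a case study to show how acceleration AI ethics balances innovation and safety so as to maximize social responsibility.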

Journal ref Law Ethics Technol. 2026(2):0006

DOI: 10.55092/let20260006

Abstract

Acceleration ethics addresses the tension between innovation and safety in artificial intelligence. The acceleration argument is that risks raised by innovation should be answered with still more innovating. This paper summarizes the theoretical position, and then shows how acceleration ethics works in a real case. To begin, the paper summarizes acceleration ethics as composed of five elements: innovation solves innovation problems, innovation is intrinsically valuable, the unknown is encouraging, governance is decentralized, ethics is embedded. Subsequently, the paper illustrates the acceleration framework with a use-case, a generative artificial intelligence language tool developed by the Canadian telecommunications company Telus. While the purity of theoretical positions is blurred by real-world ambiguities, the Telus experience indicates that acceleration AI ethics is a way of maximizing social responsibility through innovation, as opposed to sacrificing social responsibility for innovation, or sacrificing innovation for social responsibility.

2606.20527 2026-06-19 cs.CL cs.CV New submission 65%

StylisticBias: A Few Human Visual Cues Drive Most Social Biases in MLLMs

Shaghayegh Kolli, Timo Cavelius, Nafiseh Nikeghbal, Samantha Dalal, Jana Diesner

Affiliations: Technical University of Munich; Munich Center for Machine Learning; Princeton Center for Information and Technology Policy

Topic match (safety evaluation): evaluates social bias in models, touching on safety and fairness.

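AI summary: Introduces the StylisticBias benchmark, which varies a single visual attribute at a time. Age and body type dominate identity-level bias, fashion style and other visual cues drive the largest attribute-level shifts, and about 15 attributes account for nearly 80% of the total variation, so bias is concentrated in a small set of visual cues.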

Comments Accepted to the non-archival workshops AI4Good and Culture x AI at ICML 2026

Abstract

Multimodal large language models (MLLMs) are increasingly deployed in personally and societally consequential settings, yet the visual cues that shape how these models judge people remain poorly understood. Prior work often compares different (groups of) individuals, making it difficult to separate appearance effects from identity differences. We introduce StylisticBias, a controlled benchmark for evaluating attribute-level social bias in MLLMs. We generate 500 photorealistic base faces and create about 50 single-attribute variations per face, producing about 25K images. This design keeps identity fixed and changes one visual attribute at a time. It lets us measure how specific cues shift model judgments. We evaluate six MLLMs across 25 binary social judgment scenarios. We find that age and body type dominate identity-level effects, while fashion style and other visual cues drive the largest attribute-level shifts. We further find that about 15 attributes account for nearly 80% of the total variation, showing that bias is concentrated in a small set of visual cues. Sensitivity is strongest in judgments that are semantically aligned with appearance, especially socioeconomic and style-related judgments. We release StylisticBias as a benchmark for fine-grained bias evaluation in multimodal models. Code and dataset: https://github.com/timo-cavelius/StylisticBias and https://hf.co/datasets/shaghayegh/stylistic-bias-dataset.
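
The paired design described in the abstract (identity held fixed, one attribute changed at a time) can be sketched roughly as follows. This is an illustration only, not the authors' code: the record layout, the function names `attribute_shifts` and `cumulative_share`, and the toy judgments are all assumptions, and the squared-shift share used here is a simple stand-in for whatever variance decomposition the paper actually applies.

```python
from collections import defaultdict

def attribute_shifts(records):
    """Mean change in a binary judgment per attribute.

    records: iterable of (face_id, attribute, base_answer, variant_answer),
    where answers are 0/1 model judgments for the same face and scenario.
    (Hypothetical layout; the benchmark's real schema may differ.)
    """
    diffs = defaultdict(list)
    for _face_id, attribute, base, variant in records:
        diffs[attribute].append(variant - base)
    return {a: sum(d) / len(d) for a, d in diffs.items()}

def cumulative_share(shifts):
    """Rank attributes by squared mean shift and return cumulative shares."""
    ranked = sorted(shifts.items(), key=lambda kv: kv[1] ** 2, reverse=True)
    total = sum(s ** 2 for _, s in ranked)
    out, running = [], 0.0
    for attribute, s in ranked:
        running += s ** 2
        out.append((attribute, running / total if total else 0.0))
    return out

# Toy, made-up judgments for illustration only (not data from the paper).
toy = [
    (1, "formal_suit", 0, 1),
    (2, "formal_suit", 0, 1),
    (1, "glasses", 1, 1),
    (2, "glasses", 0, 1),
    (1, "hat", 1, 1),
    (2, "hat", 0, 0),
]
shifts = attribute_shifts(toy)
shares = cumulative_share(shifts)
```

Because each comparison is between a base face and its own single-attribute variant, any shift can be attributed to that attribute rather than to a change of identity, which is the point of the controlled design.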

2606.20520 2026-06-19 cs.CR cs.AI cs.DC cs.LG New submission 60%

Sovereign Execution Brokers: Enforcing Certificate-Bound Authority in Agentic Control Planes

Jun He, Deying Yu

Topic match (safety evaluation): enforces authority at runtime, which bears on security.

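AI summary: To address the lack of a mandatory authority check when autonomous agents execute mutations in production, the paper proposes the Sovereign Execution Broker (SEB), which enforces authority at runtime through certificate verification, state checks, and scoped identities, and evaluates a prototype's security and performance on AWS and Kubernetes.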

Comments 19 pages, 6 figures, 10 tables

Abstract

Autonomous agents are increasingly connected to cloud, deployment, and data-control workflows, but production mutation authority should not reside inside non-deterministic reasoning processes. Existing access-control mechanisms authorize identities, while assurance layers certify proposed actions; neither alone provides a mandatory enforcement point for certified authority at the moment of mutation. This paper introduces the Sovereign Execution Broker (SEB), a runtime enforcement boundary for certificate-bound agentic infrastructure. SEB consumes certificates issued by the Sovereign Assurance Boundary (SAB), verifies that the requested mutation matches the certified execution contract, checks validity windows, policy epochs, revocation epochs, and live-state drift, mints scoped execution identity, invokes infrastructure APIs, and records signed decision and outcome records. By separating proposal, admission, and execution, SEB turns certified authority into a short-lived, revocable, auditable runtime capability, provided that production mutation APIs reject non-broker identities. We present the SEB execution model, certificate and replay-verification predicates, scoped identity semantics, bypass-prevention deployment patterns, failure behavior, and a concrete prototype implementation. We evaluate the prototype on AWS and Kubernetes clusters, measuring latency overheads, revocation propagation, drift detection, and security under fault injection.
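
The admission checks listed in the abstract (contract match, validity window, policy epoch, revocation epoch, live-state drift) might look something like the sketch below. Everything here is hypothetical: the `Certificate` fields, the `digest` and `admit` helpers, and the rule that a certificate is rejected once the revocation epoch has advanced are assumptions made for illustration, not the paper's format or API. Minting scoped identities, calling infrastructure APIs, and signing records are left out.

```python
from dataclasses import dataclass
import hashlib
import json

@dataclass(frozen=True)
class Certificate:
    # All field names are hypothetical; the paper's certificate format is not shown here.
    contract_digest: str   # digest of the certified execution contract
    not_before: float      # validity window start (epoch seconds)
    not_after: float       # validity window end (epoch seconds)
    policy_epoch: int      # policy version the certificate was issued under
    revocation_epoch: int  # revocation epoch seen at issuance
    state_digest: str      # digest of the live state the certificate assumed

def digest(obj) -> str:
    """Stable SHA-256 digest of a JSON-serializable object."""
    return hashlib.sha256(json.dumps(obj, sort_keys=True).encode()).hexdigest()

def admit(cert, mutation, live_state, now, policy_epoch, revocation_epoch):
    """Return (allowed, reason); fail closed on the first check that does not pass."""
    if digest(mutation) != cert.contract_digest:
        return False, "mutation does not match certified contract"
    if not cert.not_before <= now <= cert.not_after:
        return False, "outside validity window"
    if cert.policy_epoch != policy_epoch:
        return False, "stale policy epoch"
    if cert.revocation_epoch < revocation_epoch:
        # Assumed rule: any newer revocation epoch invalidates the certificate.
        return False, "revocation epoch advanced"
    if digest(live_state) != cert.state_digest:
        return False, "live state drifted"
    return True, "admitted"

# Illustrative values only.
mutation = {"action": "scale", "resource": "deployment/web", "replicas": 3}
state = {"deployment/web": {"replicas": 2}}
cert = Certificate(digest(mutation), 100.0, 200.0, 7, 4, digest(state))

ok = admit(cert, mutation, state, now=150.0, policy_epoch=7, revocation_epoch=4)
drifted = admit(cert, mutation, {"deployment/web": {"replicas": 5}},
                now=150.0, policy_epoch=7, revocation_epoch=4)
```

Returning a reason alongside the decision is one way to feed the decision records the abstract mentions; a real broker would sign those records rather than return plain strings.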

2606.19831 2026-06-19 cs.CL cs.LG New submission 60%

Leverage Is Not Reach: A Control-Window Law for Single-Neuron Steering in Language Models

Hongliang Liu

Affiliations: Palo Alto Networks

Topic match (safety evaluation): concerns how neuron-level interventions affect behavioral control, which is relevant to safety.

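AI summary: Proposes a budget-normalized control-window framework in which a coherence budget, defined as the residual norm divided by the write norm, predicts when a single-neuron intervention yields coherent behavioral control; prediction accuracy is checked on 15 neurons.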

Abstract

Aligned language models gate behaviors such as refusal and language routing through sparse feed-forward neurons, yet no theory predicts when a single-neuron intervention controls a behavior coherently rather than collapsing the output. We develop a budget-normalized control-window framework for single-neuron steering. A dose along one write direction reduces to one control coordinate: the alignment between the residual stream and the write, driven along a universal saturation curve in units of a coherence budget set by the residual norm divided by the write norm. Coherent control exists when a behavior trigger lies below the collapse ceiling. The same coordinate governs benign mode switches and refusal; the ceiling follows from weights and one generic forward pass, while triggers are measured at rollout. On fifteen held-out neurons, the predicted ceiling has mean absolute error 0.14, about 0.07 in bulk layers, and the committed open-or-closed verdict holds on eleven against a ten-of-fifteen majority baseline. Closed cases expose three failure modes rather than violations: collapse before trigger, too little depth to propagate, or a normalization that caps how far one neuron can push. The law explains why local gradient attribution anti-predicts control: true controllers write off the readout axis and carry a near-zero first-order gradient. A forward-only contrastive screen made precise by the window recovers controllers that attribution misses. On refusal, the hardest case, intervention success is typed, not scalar: coherent bypass and strict actionable reach separate, so a neuron can flip refusal in fluent, on-task text with no actionable content, and genuine actionable reach appears only for three of six audited Llama pivots and only at later rollout horizons. Single-neuron steering is therefore a budgeted, typed audit of controllability rather than a fixed-dose anecdote.
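
Two quantities in the abstract are concrete enough to sketch: the coherence budget (residual norm divided by write norm) and the window condition (trigger below collapse ceiling). The snippet below is a minimal illustration under stated assumptions. The function names are hypothetical, `normalized_dose` reflects one reading of "in units of a coherence budget", and the saturation curve and the derivation of the ceiling are not given in the abstract, so ceiling and trigger are simply taken as inputs.

```python
import math

def l2_norm(v):
    """Euclidean norm of a vector given as a sequence of floats."""
    return math.sqrt(sum(x * x for x in v))

def coherence_budget(residual, write):
    """Residual norm divided by write norm, as described in the abstract."""
    w = l2_norm(write)
    if w == 0:
        raise ValueError("write direction must be non-zero")
    return l2_norm(residual) / w

def normalized_dose(dose, residual, write):
    """Express a raw dose in units of the coherence budget (assumed reading)."""
    return dose / coherence_budget(residual, write)

def window_is_open(trigger, ceiling):
    """Coherent control exists when the trigger lies below the collapse ceiling."""
    return trigger < ceiling

# Toy vectors and thresholds for illustration only.
residual = [3.0, 4.0]   # norm 5.0
write = [0.0, 0.5]      # norm 0.5
budget = coherence_budget(residual, write)
dose_in_budget_units = normalized_dose(5.0, residual, write)
is_open = window_is_open(trigger=0.4, ceiling=0.7)
```

Normalizing by the budget puts neurons with very different write norms on a common scale, which is what allows a single ceiling to be compared against a measured trigger.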

2606.19794 2026-06-19 econ.GN cs.CY q-fin.EC New submission 55%

Forecasting AI-Era Productivity: The Intellectually Converged Human Framework and a Missing Cognitive Mediator in Production Function Theory

Kwan Soo Shin, In Seok Kang

Topic match (safety evaluation): AI productivity paradox; cognitive mediator framework.

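AI summary: This paper proposes the Intellectually Converged Human (ICH) framework, which introduces a four-dimensional cognitive construct, convergence capacity (C), as the cognitive mediator between AI and productivity. It offers a theoretical explanation of why AI investment has not produced commensurate productivity gains, and a descriptive analysis of 20 OECD economies finds the AIxC interaction associated with 86% of TFP variance.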

Comments 78 pages, 3 figures

Abstract

Why does massive AI investment fail to generate commensurate productivity gains? We argue the paradox is theoretically generated: prevailing production function frameworks encounter a structural boundary by treating AI as a separable factor of production without modeling the cognitive mediation through which AI generates productive value. This directs investment toward deployment when productivity requires prior development of what we term convergence capacity (C). We propose the Intellectually Converged Human (ICH) framework, a fifth-stage framework for production function theory: H-hat = H[1 + phi(A,C)], where effective productive capacity equals human capital (H) scaled by an augmentation factor [1 + phi], with phi jointly determined by AI utilization intensity (A) and convergence capacity (C), a four-dimensional cognitive construct encompassing embodied understanding, metacognition, temporal integration, and integrative thinking. The production function Y = F(K, H-hat) provides a human-centered mechanism for Solow's TFP residual: A_Solow = [1 + phi(A,C)]^(1-alpha). The framework predicts three augmentation regimes with distinct policy implications. Descriptive cross-national analysis of 20 OECD economies shows the AIxC interaction is associated with 86% of TFP variance versus 31% for AI alone, a pattern-consistent finding in the small-n theoretical tradition. South Korea exemplifies national-scale under-augmentation: high H, substantial A, low C produce phi = 0. We distinguish convergence capacity from adjacent constructs, absorptive capacity, dynamic capability, and human capital, and demonstrate that C constitutes the specific cognitive mediator that prior frameworks have left implicit. We derive C-first policy prescriptions and offer three empirically testable propositions with a falsifiable 10-year forecast.
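
The step from Y = F(K, H-hat) to A_Solow = [1 + phi(A,C)]^(1-alpha) follows if F is Cobb-Douglas, since K^alpha (H[1 + phi])^(1-alpha) = [1 + phi]^(1-alpha) K^alpha H^(1-alpha). The abstract does not state the functional form, so Cobb-Douglas is an assumption here, as are the function names and the arbitrary input values; phi is taken as a given number because its dependence on A and C is not specified. A small numerical check of the identity:

```python
def effective_human_capital(H, phi):
    """H-hat = H * (1 + phi), the augmented human capital from the abstract."""
    return H * (1.0 + phi)

def output(K, H_eff, alpha):
    """Cobb-Douglas output K^alpha * H_eff^(1 - alpha) (assumed functional form)."""
    return K ** alpha * H_eff ** (1.0 - alpha)

def solow_residual(phi, alpha):
    """A_Solow = (1 + phi)^(1 - alpha), as stated in the abstract."""
    return (1.0 + phi) ** (1.0 - alpha)

# Arbitrary illustrative values, not estimates from the paper.
K, H, alpha, phi = 100.0, 50.0, 0.3, 0.2

augmented = output(K, effective_human_capital(H, phi), alpha)
factored = solow_residual(phi, alpha) * output(K, H, alpha)
no_augmentation = solow_residual(0.0, alpha)
```

With phi = 0, the case the abstract associates with under-augmentation, the residual is exactly 1 and output reduces to the unaugmented K^alpha H^(1-alpha).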
