arXivDaily arXiv每日学术速递 周一至周五更新
重置

1. 智能体、规划与决策 29 篇

2606.10209 2026-06-10 cs.AI cs.LG cs.SE 新提交

Less Context, Better Agents: Efficient Context Engineering for Long-Horizon Tool-Using LLM Agents

更少上下文,更优智能体:面向长周期工具使用LLM智能体的高效上下文工程

Abhilasha Lodha, Mahsa Pahlavikhah Varnosfaderani, Abir Chakraborty, Abhinav Mithal

发表机构 * Microsoft(微软)

AI总结 针对企业工具使用场景中上下文过长导致的问题,提出选择性保留最近工具交互并添加紧凑摘要的方法,在费用明细任务上将完成率从71.0%提升至91.6%,同时大幅降低token消耗和运行时间。

Comments 17 pages, 3 figures, 8 tables

详情
AI中文摘要

部署为自主智能体用于企业工作流的大型语言模型面临一个关键挑战:来自企业系统的冗长工具响应可能导致上下文溢出、状态过时错误和高推理成本。我们在Microsoft Dynamics 365 Finance and Operations中使用Model Context Protocol工具研究自动费用明细化问题。我们在一个包含50个任务的酒店费用基准上评估了四种GPT-5配置:无用户模型、完整对话历史、上下文裁剪至最近5个工具调用/响应对、以及裁剪加自动摘要。结果在5次独立运行中取平均,用户模型在上下文工程比较中保持不变。无用户模型基线仅达到8.0%的完全明细化。完整上下文保留将完成率提升至71.0%,但每次基准测试消耗1,480,996个token和14.56小时。裁剪至最近5个工具调用将完成率提升至79.0%,同时将token使用降至535,274个,运行时间降至5.39小时。添加摘要实现了最佳结果:91.6%的完全明细化和99.64%的平均明细金额,使用553,374个token和5.79小时。我们进一步报告了置信区间、效应量分析、裁剪和摘要窗口的敏感性、失败分析、按三个类别分组的五种费用类型的结果,以及使用Claude Sonnet 4.5的跨模型证据。这些结果表明,对于这类企业工具使用工作流,选择性保留最近的工具交互加上紧凑摘要,与保留完整历史相比,可以提高可靠性和效率。

英文摘要

Large language models deployed as autonomous agents for enterprise workflows face a key challenge: verbose tool responses from enterprise systems can cause context overflow, stale-state errors, and high inference cost. We study this problem in automated expense itemization in Microsoft Dynamics 365 Finance and Operations using Model Context Protocol tools. We evaluate four GPT-5 configurations on a 50-task hotel expense benchmark: no user model, full conversation history, context pruned to the last 5 tool call/response pairs, and pruning with automated summarization. Results are averaged across 5 independent runs, with the user model held constant for the context-engineering comparison. The no-user-model baseline achieves only 8.0% complete itemization. Full-context retention improves completion to 71.0%, but consumes 1,480,996 tokens and 14.56 hours per benchmark. Pruning to the last 5 tool calls improves completion to 79.0% while reducing token use to 535,274 and runtime to 5.39 hours. Adding summarization achieves the best result: 91.6% complete itemization and 99.64% average amount itemized, with 553,374 tokens and 5.79 hours. We further report confidence intervals, effect-size analysis, sensitivity over pruning and summary windows, failure analysis, results across five expense types grouped into three categories, and cross-model evidence with Claude Sonnet 4.5. These results show that, for this class of enterprise tool-use workflow, selective retention of recent tool interactions plus compact summarization can improve both reliability and efficiency compared with full-history retention.

2606.10299 2026-06-10 cs.AI cs.CV cs.MA 新提交

What Spatial Memory Must Store: Occlusion as the Test for Language-Agent Memory

空间记忆必须存储什么:遮挡作为语言-智能体记忆的测试

Doeon Kwon, Junho Bang

发表机构 * Space Zero, Inc.(Space Zero公司)

AI总结 本文通过实验证明,在空间查询场景中,几何信息必须主导记忆召回,而可见性判断需要独立于记忆召回,并提出了基于射线-体素DDA的可见性谓词计算方法。

Comments 23 pages, 6 figures

详情
AI中文摘要

语言智能体的“记忆宫殿”系统将每条记忆锚定到世界坐标,其直觉是几何提供了文本无法提供的信息。我们使这一直觉可测试,并报告三个结果。首先,记忆宫殿默认将空间邻近性折叠成与近期性和重要性线性混合的做法没有帮助甚至有害:在一个预注册的召回实验中,现有的混合在其自身冻结测试中失败(平均Delta-Hit@5 -0.0375,Wilcoxon p=0.306),处于位置盲基线水平,而几何主导的加权则取得决定性胜利(+0.3208,p<10^-15):当查询模式是空间时,几何必须主导召回。其次,记忆召回和可见性必须分离:召回在设计上对遮挡不敏感(你能正确记住墙后下一个房间),而可见性是对存储几何的感知谓词,实时系统从未计算过。一行射线与体素的数字微分分析器(DDA),从智能体已经投射的视线射线重新指向,提供了这一点:文本和实时视锥在849个墙后目标上得分均为0.000,而锥体加DDA达到0.982(精确McNemar p<10^-6);坐标召回分别解决了余弦空值无法解决的近重复位置(1.000 vs 0.533,n=150)。第三,可见性谓词在git提交的预注册下得到实时确认(SPMEM-OCC-LIVE-v1:八个脚本化世界,自动oracle评分,96个墙后目标,假可见从1.000降至0.000,合并精确McNemar p=2.5x10^-29),该运行发现并修复了一个真实的中继锚点缺陷。我们承认遮挡需要几何几乎是同义反复;贡献在于测量和隔离,将空间记忆必须存储的内容与其读取方式分开。这些试验为一个冻结的确认性研究(SPMEM-ZERO-REAL-PREREG-v1)提供动力;完整的人类作者多世界研究(含盲评者)仍是未来工作。

英文摘要

Language-agent "memory palace" systems anchor each memory to a world coordinate, on the intuition that geometry adds something text cannot. We make that intuition testable and report three results. First, the memory-palace default of folding spatial proximity into a linear blend beside recency and importance does not help and can hurt: in a pre-registered recall experiment the shipped blend fails its own frozen test (mean Delta-Hit@5 -0.0375, Wilcoxon p=0.306), sitting at a position-blind baseline, while a geometry-led weighting wins decisively (+0.3208, p<10^-15): geometry must lead recall when the query regime is spatial. Second, memory recall and visibility must be separated: recall is occlusion-blind by design (you correctly remember the next room behind a wall), while visibility is a perception predicate over stored geometry that the live system never computed. A one-line ray-versus-voxel digital differential analyzer (DDA), re-pointed from the gaze ray the agent already casts, supplies it: text and the live FoV cone both score 0.000 on 849 behind-wall targets while cone-plus-DDA reaches 0.982 (exact McNemar p<10^-6); coordinate recall separately resolves near-duplicate locations a cosine null cannot (1.000 vs 0.533, n=150). Third, the visibility predicate is confirmed live under a git-committed pre-registration (SPMEM-OCC-LIVE-v1: eight scripted worlds, automated oracle scoring, 96 behind-wall targets, false-visible 1.000->0.000, pooled exact McNemar p=2.5x10^-29), a run that surfaced and fixed a real relay anchor defect. We concede that occlusion-needs-geometry is near-tautological; the contribution is the measurement and isolation, separating what spatial memory must store from how it is read. These pilots power a frozen confirmatory study (SPMEM-ZERO-REAL-PREREG-v1); the full human-authored multi-world study with blind raters remains future work.

2606.10457 2026-06-10 cs.AI 新提交

Trace2Policy: From Expert Behavior Traces to Self-Evolving Decision Agents

Trace2Policy:从专家行为轨迹到自我进化的决策代理

Junli Zha, Jinbo Wang, Chao Zhou, Xiang Song

发表机构 * SF Express(顺丰速运)

AI总结 提出Trace2Policy框架,通过错误驱动的迭代技能精炼(EISR)从专家行为中提取可读规则,在合规敏感任务中规则质量是关键性能杠杆,经8轮迭代后编译为确定性Python代码达到79.6%准确率,并在实际部署中优于纯LLM基线。

详情
AI中文摘要

企业专家在审计、合规和合同审查中隐性应用的决策规则可以通过迭代错误分析系统地恢复和改进。我们提出\textbf{Trace2Policy},其核心机制——\textbf{EISR}(\textbf{E}rror-driven \textbf{I}terative \textbf{S}kill \textbf{R}efinement)——将人类可读的规则文档作为优化目标:每轮在验证集上执行规则,按根本原因将错误聚类为MISSING、WRONG或CONFLICT类型,应用针对性补丁,并仅提交通过回归门的补丁。\textbf{对于这类合规敏感、基率偏斜的决策任务,我们确定规则质量——而非模型能力——是主导性能杠杆}:在五个LLM上,一次性蒸馏在部署池上停滞在约70%,而八轮EISR将相同规则提升至79.6%(编译为确定性Python,推理时零LLM调用)。\textbf{执行形式放大了收益:在生产中,相同的EISR精炼内容作为编译Python运行比作为LLM提示高出9.8个百分点,这是一个形式与工程捆绑包,经过22天部署共同成熟。}在一家大型物流承运商(3,349个审计案例)部署22天后,编译管道优于其替代的纯LLM基线(72.7%);在这些校准的、基率偏斜的工作负载上,重新启用LLM回退会单调地降低准确率。一种LLM驱动的变体,\textbf{Auto-EISR},以每周期5-10美元(对比约70专家小时)复现了这种精炼,并无需重新工程即可迁移到涵盖法律推理(LegalBench)和流程挖掘决策(BPIC 2012)的四个公开基准上。

英文摘要

Decision rules that enterprise experts apply tacitly -- in auditing, compliance, and contract review -- can be systematically recovered and improved through iterative error analysis. We present \textbf{Trace2Policy}, whose core mechanism -- \textbf{EISR} (\textbf{E}rror-driven \textbf{I}terative \textbf{S}kill \textbf{R}efinement) -- maintains a human-readable rule document as its optimization target: each round executes the rules on a validation set, clusters errors by root cause into MISSING, WRONG, or CONFLICT types, applies targeted patches, and commits only those that pass a regression gate. \textbf{For this class of compliance-sensitive, skewed-base-rate decision tasks, we identify rule quality -- not model capability -- as the dominant performance lever}: across five LLMs, one-shot distillation plateaus near $\sim$70\% on the deployed pool, while eight EISR rounds lift the same rules to 79.6\% when compiled into deterministic Python -- zero LLM calls at inference. \textbf{Execution form compounds the gain: in production, the same EISR-refined content runs 9.8~pp higher as compiled Python than as an LLM prompt, a form-and-engineering bundle the 22-day deployment matured together.} Deployed for 22 days at a major logistics carrier (3,349 audit cases), the compiled pipeline outperforms the pure-LLM baseline it replaced (72.7\%); on these calibrated, skewed-base-rate workloads, re-enabling LLM fallback monotonically degrades accuracy. An LLM-driven variant, \textbf{Auto-EISR}, reproduces this refinement at \$5--\$10 per cycle versus $\sim$70 expert-hours, and transfers to four public benchmarks spanning legal reasoning (LegalBench) and process-mining decisions (BPIC 2012) without re-engineering.

2606.10489 2026-06-10 cs.AI 新提交

A complementary study on PlanGPT: Evaluation with defined Performance Metrics and comparison with a planner

PlanGPT的补充研究:使用定义性能指标评估并与规划器比较

Youssef Abdelkader, Humbert Fiorino, Damien Pellier

发表机构 * Univ. Grenoble Alpes - LIG(格勒诺布尔阿尔卑斯大学 - 信息学实验室(LIG))

AI总结 本文对大型语言模型PlanGPT进行补充实验,使用规划成本和生成时间两个指标评估其性能,并与传统规划器比较,发现PlanGPT并不优于贪心搜索策略。

Comments 7 pages

详情
AI中文摘要

自动规划是人工智能(AI)的一个子领域,其主要目标是生成一系列动作(称为规划),帮助我们从初始状态达到目标状态。规划问题由一组对象、初始状态和期望目标状态定义。目标是计算一个从初始状态到目标状态的规划。生成规划的程序称为规划器。在本文中,我们对去年发布的最新LLM——PlanGPT进行了补充研究。我们重新进行了一些实验,以验证使用LLM进行规划是否**恰当**且**有价值**。我们还检查了官方PlanGPT论文中关于规划覆盖的结果是否正确,并对PlanGPT的性能进行了更全面的研究:在我们的论文中,PlanGPT的性能使用两个指标进行评估:规划成本和规划生成时间。将PlanGPT的结果与同一规划和相同指标下传统规划器产生的结果进行比较。我们发现PlanGPT并不优于贪心搜索策略。

英文摘要

Automated Planning is a subfield of Artificial Intelligence (AI) where the main objective is generating a sequence of actions, known as a plan, that helps us reach a goal state from an initial state. A planning problem is defined by a set of objects, an initial state and a desired goal state. The objective is to compute a plan that'll lead us from the inital state to the goal state. Programs that generate plans are called planners. In this paper, we did a complementary study to the state-of-the-art LLM called PlanGPT which was released last year. We redid some experiments to verify whether planning with LLMs is \textbf{pertinent} and \textbf{worthwhile}. We also check whether the results obtained in the official PlanGPT paper for plan coverage were correct, and we also performed a more comprehensive study on PlanGPT's performance: in our paper PlanGPT's performance was evaluated using two metrics: Plan Cost and Plan Generation Time. The results of planGPT were compared to those produced by a traditional planner for the same plans and same metrics. We discovered that PlanGPT is no better than a Greedy search strategy.

2606.10507 2026-06-10 cs.AI 新提交

HIPIF: Hierarchical Planning and Information Folding for Long-Horizon LLM Agent Learning

HIPIF: 面向长视界LLM智能体学习的层次化规划与信息折叠

Juncheng Diao, Zhicong Lu, Peiguang Li, Yongwei Zhou, Changyuan Tian, Qingbin Li, Rongxiang Weng, Jingang Wang, Xunliang Cai

发表机构 * Meituan(美团) University of Chinese Academy of Sciences(中国科学院大学)

AI总结 提出层次化规划与信息折叠方法,通过子目标分解和历史折叠减少长上下文干扰,结合层次化反思和子目标过程奖励,提升LLM在多轮长视界任务中的表现。

详情
AI中文摘要

尽管大型语言模型(LLM)在广泛任务中展现出作为自主智能体的强大能力,但其性能在多轮长视界智能体任务中常常下降。现有方法通过细粒度信用分配以缓解长视界稀疏奖励,以及通过层次化强化学习分解任务并减少长期依赖,取得了进展。然而,这些方法仍未直接解决长上下文干扰问题,即持续增长的历史记录削弱了智能体跟踪全局任务状态的能力,并损害了后续推理和决策。受人类通过子目标分解和已完成进度总结处理复杂任务的方式启发,我们提出了面向长视界LLM智能体学习的层次化规划与信息折叠(HIPIF)。HIPIF端到端地训练智能体,使其围绕显式子目标组织长视界执行,同时折叠已完成的子目标历史以减少长上下文干扰。此外,为稳定基于子目标的规划与执行,HIPIF结合了层次化反思和面向子目标的过程奖励,以指导子目标的生成、转换和执行,而无需依赖昂贵的辅助模型或特定任务的专家轨迹。在三个公开可用的智能体基准上的广泛实验证明了我们方法的有效性。

英文摘要

While Large Language Models (LLMs) have demonstrated strong capabilities as autonomous agents across a wide range of tasks, their performance often degrades in multi-turn long-horizon agentic tasks. Existing methods have made progress through fine-grained credit assignment to alleviate long-horizon sparse rewards and hierarchical reinforcement learning to decompose tasks and reduce long-term dependency. However, these methods still do not directly address long-context interference, in which continuously growing histories weaken the agent's ability to track the global task state and impair subsequent reasoning and decision-making. Inspired by the way humans handle complex tasks through subgoal decomposition and completed progress summarization, we propose Hierarchical Planning and Information Folding (HIPIF) for long-horizon LLM agent learning. HIPIF trains the agent end-to-end to organize long-horizon execution around explicit subgoals while folding completed subgoal histories to reduce long-context interference. Furthermore, to stabilize subgoal-based planning and execution, HIPIF combines hierarchical reflection and subgoal-oriented process rewards to guide subgoal generation, transition, and execution, without relying on costly auxiliary models or task-specific expert trajectories. Extensive experiments on three publicly available agentic benchmarks demonstrate the validity of our method.

2606.10532 2026-06-10 cs.AI 新提交

ActiveMem: Distributed Active Memory for Long-Horizon LLM Reasoning

ActiveMem: 用于长程LLM推理的分布式主动记忆

Yunhan Jiang, Wenbin Duan, Shasha Guo, Liang Pang, Xiaoqian Sun, Huawei Shen

发表机构 * State Key Laboratory of AI Safety, Institute of Computing Technology, Chinese Academy of Sciences(中国科学院计算技术研究所人工智能安全国家重点实验室) University of Chinese Academy of Sciences(中国科学院大学)

AI总结 提出ActiveMem框架,将记忆从核心推理中解耦,通过分布式主动记忆系统积累语义要点,在长程推理任务中实现高精度和低开销。

详情
AI中文摘要

记忆对于使大型语言模型(LLM)代理能够处理长程推理任务至关重要。现有的记忆机制大多是集中式的,通常将检索到的信息和交互历史组织在单个模型上下文中。这种设计带来了一个基本的权衡:扩展推理轨迹可能导致上下文过载,而激进的修剪内容可能导致不可逆的信息丢失。为了寻求更好的权衡,我们从人类认知系统中汲取灵感,特别是前额叶皮层(执行控制)和海马体(记忆管理)之间的功能互补性,表明这种权衡并非固有,而可能源于集中式记忆组织。为此,我们提出了ActiveMem,一个异构框架,将代理记忆从核心推理过程中解耦。具体来说,高层规划器利用提炼的语义要点执行推理,而轻量级的分布式记忆系统并行运行,在整个任务中主动积累和整合这些要点。在BrowseComp-Plus和GAIA上的实验表明,ActiveMem以显著降低的开销实现了最先进的准确性,证明了分布式主动记忆在长程推理中的有效性。

英文摘要

Memory is essential for enabling large language model (LLM) agents to handle long-horizon reasoning tasks. Existing memory mechanisms are largely centralized, typically organizing retrieved information and interaction history within a single model context. This design imposes a fundamental trade-off: scaling reasoning trajectories risks context overload, whereas aggressive content pruning may result in irreversible information loss. Seeking a better trade-off, we draw inspiration from human cognitive systems, especially the functional complementarity between the prefrontal cortex (executive control) and the hippocampus (memory management), suggesting that such a trade-off need not be inherent, but may instead stem from centralized memory organization. To this end, we propose ActiveMem, a heterogeneous framework that decouples agent memory from the core reasoning process. Specifically, a high-level Planner utilizes distilled semantic gists to execute reasoning, while a lightweight, distributed memory system operates in parallel to actively accumulate and consolidate these gists throughout the task. Experiments on BrowseComp-Plus and GAIA show that ActiveMem achieves state-of-the-art accuracy with significantly reduced overhead, demonstrating the effectiveness of distributed active memory for long-horizon reasoning.

2606.10677 2026-06-10 cs.AI cs.CL 新提交

Infini Memory: Maintainable Topic Documents for Long-Term LLM Agent Memory

Infini Memory:用于长期LLM智能体记忆的可维护主题文档

Suozhao Ji, Baodong Wu, Zehao Wang, Lei Xia, Qingping Li, Ruisong Wang, Wenbo Ding, Zhenhua Zhu, Boxun Li, Guohao Dai, Yu Wang

发表机构 * Infinigence AI(InfiniGen AI) Tsinghua University(清华大学) Shanghai Jiaotong University(上海交通大学)

AI总结 提出Infini Memory架构,将智能体记忆组织为主题文档,通过缓冲合并和迭代检索实现可维护的长期记忆,在MemoryAgentBench上达到64.7%的总体得分。

详情
AI中文摘要

长期LLM智能体需要持久记忆,以跟踪变化的事实并在会话间提供相关证据。现有的记忆系统通常将观察存储为孤立的记录、摘要或索引片段,这使得证据聚合、事实修正和记忆维护变得困难。我们提出Infini Memory,一种可维护的基于文本的持久记忆架构,将智能体记忆视为主题结构化文档。每个主题文档作为一个语义单元,用于收集相关证据、保留元数据并随时间修正事实。新观察首先被暂存在缓冲区中,然后定期合并为连贯的文本上下文。在推理时,一种智能体检索过程允许LLM通过迭代工具调用读取记忆,而不是单次检索步骤。在MemoryAgentBench上,Infini Memory取得了64.7%的总体得分。消融实验表明,主题结构化维护和迭代证据检查改善了长期记忆使用的互补方面。

英文摘要

Long-term LLM agents need persistent memory that can track changing facts and provide relevant evidence across sessions. Existing memory systems often store observations as isolated records, summaries, or indexed fragments, which makes evidence aggregation, fact revision, and memory maintenance difficult. We propose Infini Memory, a maintainable text-based persistent memory architecture that treats agent memory as topic-structured documents. Each topic document serves as a semantic unit for collecting related evidence, preserving metadata, and revising facts over time. New observations are first staged in a buffer and periodically consolidated into coherent textual contexts. At inference time, an agentic retrieval procedure lets the LLM read memory through iterative tool calls rather than a single retrieval step. On MemoryAgentBench, Infini Memory achieves 64.7% overall score. Ablations show that topic-structured maintenance and iterative evidence inspection improve complementary aspects of long-term memory use.

2606.10806 2026-06-10 cs.AI math.FA 新提交

Moonshine: An Autonomous Mathematical Research Agent Centered on Conjecture Generation

Moonshine:一个以猜想生成为中心的自主数学研究智能体

Xiaoyang Chen, Xiang Jiang

AI总结 提出自主智能体Moonshine,通过提取经典问题结构、提炼新概念并生成数学猜想,以Jacobian猜想为例,将其转化为神经Jacobian猜想并证明部分情况。

详情
AI中文摘要

Moonshine是一个自主智能体,其核心目标是生成数学猜想。它的核心能力是从经典问题中提取结构、提炼新概念,并制定具有数学意义的猜想。Moonshine不将解决单个命题作为终点,而是通过猜想生成、桥梁构建和障碍识别来构建可扩展的理论框架。本文以Moonshine对Jacobian猜想的探索为例,展示了局部非退化性是否强制全局单射性的核心逻辑如何转移到单隐层仿射-岭sigmoid网络上。这导致了\emph{神经Jacobian猜想}(NJC)的提出:如果这样的网络在整个空间上具有严格正的Jacobian行列式,则它必须是全局单射的。通过分别调用GPT-5.5-pro和DeepSeek-V4-pro,Moonshine获得了情况\(N=n+1\)的独立完整证明。此外,在ChatGPT通过其网页界面与GPT-5.5-pro交互使用的辅助下,开发了一个几何拓扑证明。这些结果为猜想的合理性提供了初步证据。然而,一般的高宽度情况\(N\ge n+2\)仍未解决,留待进一步研究。这项工作展示了Moonshine自主生成有意义的数学问题并对其取得严格进展的能力。

英文摘要

Moonshine is an autonomous agent whose central objective is to generate mathematical conjectures. Its core capability is to extract structure from classical problems, distill new concepts, and formulate conjectures of mathematical significance. Rather than treating the solution of a single proposition as its endpoint, Moonshine builds an extensible theoretical framework through conjecture generation, bridge building, and obstacle identification. This article uses Moonshine's exploration of the Jacobian conjecture as an example. It shows how the central logic of whether local nondegeneracy can force global injectivity is transferred to one-hidden-layer affine-ridge sigmoid networks. This leads to the formulation of the \emph{Neural Jacobian Conjecture} (NJC): if such a network has strictly positive Jacobian determinant on the whole space, then it must be globally injective. By invoking GPT-5.5-pro and DeepSeek-V4-pro separately, Moonshine obtained independent complete proofs for the case \(N=n+1\). In addition, with the assistance of ChatGPT through interactive use of its web interface with GPT-5.5-pro, a geometric-topological proof was developed. These results provide preliminary evidence for the plausibility of the conjecture. The general higher-width case \(N\ge n+2\), however, remains unresolved and is left for further investigation. This work illustrates Moonshine's ability to autonomously generate meaningful mathematical problems and make rigorous progress on them.

2606.10917 2026-06-10 cs.AI 新提交

Role-Agent: Bootstrapping LLM Agents via Dual-Role Evolution

Role-Agent: 通过双角色演化引导LLM智能体

Xucong Wang, Ziyu Ma, Shidong Yang, Tongwen Huang, Pengkun Wang, Yong Wang, Xiangxiang Chu

发表机构 * University of Science and Technology of China(中国科学技术大学) AMAP, Alibaba Group(阿里巴巴集团高德地图)

AI总结 提出Role-Agent框架,让单个LLM同时作为智能体和环境,通过世界在智能体(WIA)和智能体在世界(AIW)两个组件实现自举协同演化,在多个基准上平均提升超过4%。

Comments 20 pages, including 12 pages of main text and 8 pages of appendix; work in progress

详情
AI中文摘要

尽管大型语言模型(LLM)智能体在复杂任务上表现出色,但其学习常受限于低效的交互反馈和静态训练环境,阻碍了更广泛的泛化。为解决这些问题,本文引入Role-Agent,一个利用单个LLM同时充当智能体和环境的框架,实现自举协同演化。Role-Agent包含两个协同组件:世界在智能体(WIA)和智能体在世界(AIW)。在WIA中,LLM作为智能体,在每个动作后预测未来状态;预测状态与实际状态的对齐被用作过程奖励,鼓励环境感知推理。在AIW中,LLM分析失败轨迹中的失败模式,并检索具有相似失败模式的任务,从而重塑训练数据分布以进行针对性练习。在多个基准上的实验表明,Role-Agent持续提升性能,相比强基线平均提升超过4%。

英文摘要

Although Large Language Model (LLM) agents have demonstrated strong performance on complex tasks, their learning is often limited by inefficient interaction feedback and static training environments, which hinder broader generalization. To address these limitations, this paper introduces Role-Agent, \textcolor{black}{a framework} that harnesses a single LLM to function concurrently as both the agent and the environment, enabling a bootstrapped co-evolution. Role-Agent comprises two synergistic components: World-In-Agent (WIA) and Agent-In-World (AIW). In WIA, the LLM acts as the agent and predicts future states after each action; the alignment between predicted and actual states is then used as a process reward, encouraging environment-aware reasoning. In AIW, the LLM analyzes failure modes from failed trajectories and retrieves tasks with similar failure patterns, thereby reshaping the training data distribution for targeted practice. Experiments on multiple benchmarks show that Role-Agent consistently improves performance, yielding an average gain of over 4\% over strong baselines.

2606.10933 2026-06-10 cs.AI 新提交

Frontier Coding Agents Use Metaprogramming to Adapt to Unfamiliar Programming Languages

前沿编码代理使用元编程适应不熟悉的编程语言

Aman Sharma, Sushrut Thorat, Paras Chopra

发表机构 * Lossfunk

AI总结 研究评估LLM编码代理在陌生语言上的表现,发现强代理通过元编程(如用Python生成目标语言代码)适应,禁止此策略性能大幅下降,而弱代理无法从中受益。

Comments 43 pages, 8 figures

详情
AI中文摘要

基于LLM的编码代理通常在熟悉的软件环境中进行评估:主流语言、常见库和公共仓库。这些基准仍然重要,但它们可能隐藏代理在语言本身不熟悉时的行为。我们使用顺序设置(包括文件编辑、本地执行和隐藏测试评分)在四种深奥编程语言上评估了六个当代编码代理。我们的协议揭示了这些代理之间的能力差异,而主流编码和代理基准(如SWE-Bench Verified和Terminal-Bench 2.0)将这些差异压缩到更窄的范围内。我们观察到,最强的代理Claude Opus 4.6和GPT-5.4 xhigh通常避免直接编写目标语言。在Brainfuck和Befunge-98上,它们编写Python程序来生成目标语言代码,并在本地调试这些生成器。禁止这种元编程策略会导致性能大幅下降。从该策略中提取的文本指导并未实质性地改善较弱的代理。相比之下,来自Opus的用于构建生成器的Python辅助代码(没有解决的基准程序或隐藏测试答案)显著提高了Sonnet 4.6和GPT-5.4 mini在相同问题上的表现,而Haiku 4.5仍然较低。更多的解释器调用和输出令牌改善了较强的代理,但使较弱的代理接近其原始性能,表明这些资源放大了有用的策略而非创造了它们。总之,这些结果表明,强大的编码代理通过使用工具、反馈和工作区状态来构建目标语言的工作模型,从而适应不熟悉的语言。元编程是最明显的案例,但更广泛的差距在于构建和调试在目标语言规则下有效的策略。

英文摘要

LLM-based coding agents are usually evaluated in familiar software settings: mainstream languages, common libraries, and public repositories. These benchmarks remain important, but they can hide how agents behave when the language itself is unfamiliar. We evaluate six contemporary coding agents on four esoteric programming languages using a sequential setup with file editing, local execution, and hidden-test grading. Our protocol exposes capability differences between these agents that mainstream coding and agentic benchmarks such as SWE-Bench Verified and Terminal-Bench 2.0 compress into much narrower bands. We observe that the strongest agents, Claude Opus 4.6 and GPT-5.4 xhigh, often avoid writing the target language directly. On Brainfuck and Befunge-98, they write Python programs that generate target-language code and debug those generators locally. Forbidding this metaprogramming strategy causes large performance drops. Text guidance distilled from this strategy does not materially improve weaker agents. In contrast, Opus-derived Python helper code for building generators, with no solved benchmark programs or hidden-test answers, sharply improves Sonnet 4.6 and GPT-5.4 mini on the same problems, while Haiku 4.5 remains low. More interpreter calls and output tokens improve stronger agents but leave weaker agents near their original performance, indicating that these resources amplify useful strategies rather than create them. Together, these results show that strong coding agents adapt to unfamiliar languages by using tools, feedback, and workspace state to build a working model of the target language. Metaprogramming is the clearest case, but the broader gap is constructing and debugging a strategy that works under the target language's rules.

2606.11078 2026-06-10 cs.AI cs.CL cs.CV 新提交

A History-Aware Visually Grounded Critic for Computer Use Agents

面向计算机使用代理的历史感知视觉基础批评家

Jaewoo Lee, Zaid Khan, Archiki Prasad, Justin Chih-Yao Chen, Supriyo Chakraborty, Kartik Balasubramaniam, Sambit Sahu, Elias Stengel-Eskin, Hyunji Lee, Mohit Bansal

发表机构 * University of North Carolina at Chapel Hill(北卡罗来纳大学教堂山分校) Capital One University of Texas at Austin(德克萨斯大学奥斯汀分校)

AI总结 提出HiViG框架,通过历史感知的视觉基础多模态批评家,在测试时评估动作并拦截错误,在多个GUI基准上提升成功率。

Comments Code: https://github.com/G-JWLee/HiViG

详情
AI中文摘要

针对计算机使用代理(CUA)的各种测试时干预措施(包括批评模型)已被开发出来,通过在复杂图形用户界面(GUI)环境中执行前动作评估来提高性能。然而,现有的批评家存在两个关键限制:(1)主要关注短视决策循环(例如,遗忘早期动作);(2)缺乏检测有缺陷动作(例如,点击错误的UI元素)所需的视觉基础。为了解决这些问题,我们引入了HiViG,一个历史感知的视觉基础测试时框架,其核心是一个在真实GUI轨迹上训练的多模态批评家,用于将过去的交互抽象为紧凑记录,并基于视觉基础评估动作。在测试时,HiViG将批评家集成到策略决策循环中,以提供宏观动作历史(总结策略已完成成就)和视觉基础批评(根据当前截图验证原始执行坐标,在执行前拦截错误)。在网页、移动和桌面基准测试中,HiViG持续优于现有的标量和口头批评家,在Qwen3-VL-32B上比最强基线平均成功率提高5.8%,在Gemini-3-Flash上提高9.0%,并展示了强大的跨平台泛化能力。消融实验表明,宏观动作历史缓解了短视规划,视觉基础批评减少了执行错误,这两个组件对于长时域GUI任务中的测试时扩展至关重要。

英文摘要

Various test-time interventions for Computer Use Agents (CUAs), including critic models, have been developed to improve performance through pre-execution action evaluation in complex Graphical User Interface (GUI) environments. However, existing critics suffer from two key limitations: they (1) focus primarily on short-sighted decision loops (e.g., forgetting earlier actions) and (2) lack the visual grounding needed to detect flawed actions (e.g., clicking wrong UI elements). To address these, we introduce HiViG, a History-aware Visually Grounded test-time framework, built around a multimodal critic trained on real GUI trajectories to abstract past interactions into a compact record and to evaluate actions with visual grounding. At test time, HiViG integrates the critic into the policy decision loop to provide macro-action history, which summarizes the policy's completed achievements, and visually grounded critique, which verifies raw execution coordinates against the current screenshot to intercept errors before execution. Across web, mobile, and desktop benchmarks, HiViG consistently outperforms existing scalar and verbal critics, improving average success rates over the strongest baseline by 5.8% for Qwen3-VL-32B and 9.0% for Gemini-3-Flash, and demonstrates strong cross-platform generalization. Ablations show that macro-action history mitigates short-sighted planning and visually grounded critique reduces execution errors, with both components being critical for test-time scaling in long-horizon GUI tasks.

2606.09832 2026-06-10 cs.HC cs.AI 交叉投稿

Agentic Social Affordance Framework (ASAF): Agent Identity Design as a Collaboration Interface in Multi-Agent Systems

智能体社会可供性框架 (ASAF):多智能体系统中作为协作接口的智能体身份设计

Meng-Han Lee

发表机构 * Independent Researcher(独立研究者)

AI总结 提出ASAF框架,将社会可供性理论扩展到多智能体AI系统,将智能体身份设计视为协作接口,通过身份信号、行为启动和协作治理三个机制影响人机协作质量。

Comments 24 pages, 2 figures, 1 table. Introduces ASAF with falsifiable hypotheses and proposed experimental designs for testing agent identity design effects in multi-agent Human-in-the-Loop systems, grounded in a real-world 38-agent deployment

详情
AI中文摘要

随着AI系统从单一对话智能体演变为复杂的多智能体架构,一个关键的设计维度被忽视了:个体智能体的社会身份如何塑造人类在协作中的行为。本文介绍了智能体社会可供性框架(ASAF),这是一个将社会可供性理论扩展到多智能体AI系统背景的理论框架。我们提出,智能体身份设计不仅作为用户界面惯例,而且作为协作接口——构建用户如何感知、接近和与每个智能体互动,从而影响人机协作结果的质量。具体来说,社会可供性层构成了一个独立于工程编排的设计维度:两者代表不同的决策空间,不能相互推导。ASAF包含三个机制:身份信号、行为启动和协作治理,并通过四层身份信号保真度谱和个体差异调节变量(拟人化与工具化认知风格)指定其边界条件。我们将ASAF与现有可供性理论和CASA范式相联系,阐明ASAF的多智能体、拓扑级预测在哪些方面超出了二元框架的解释范围。我们讨论了多智能体系统设计的启示,并概述了未来实证验证的方向,包括用于测试设计空间正交性的因子设计。

英文摘要

As AI systems evolve from single conversational agents to complex multi-agent architectures, a critical design dimension has been overlooked: how the social identity of individual agents shapes human behavior within the collaboration. This paper introduces the Agentic Social Affordance Framework (ASAF), a theoretical framework that extends Social Affordance theory into the context of multi-agent AI systems. We propose that agent identity design functions not merely as a user interface convention, but as a collaboration interface -- structuring how users perceive, approach, and engage with each agent, and thereby influencing the quality of Human-Agent collaboration outcomes. Specifically, the social affordance layer constitutes an independent design dimension orthogonal to engineering orchestration: the two represent distinct decision spaces that cannot be derived from each other. ASAF comprises three mechanisms: Identity Signaling, Behavioral Priming, and Collaborative Governance, and specifies their boundary conditions through a four-tier Identity Signal Fidelity Spectrum and an individual-difference moderating variable (anthropomorphizing vs.\ instrumentalizing cognitive style). We situate ASAF in relation to existing affordance theory and the CASA paradigm, delineating where ASAF's multi-agent, topology-level predictions exceed the explanatory scope of dyadic frameworks. We discuss implications for multi-agent system design and outline directions for future empirical validation, including a factorial design for testing design-space orthogonality.

2606.09848 2026-06-10 cs.HC cs.AI cs.CY 交叉投稿

Human-AI Coordination Zones: A Framework for Designing Human-in-the-Loop Experiences with Agentic AI

人机协调区域:设计具有代理性AI的人机协同体验框架

James Pierce, Vaiva Kalnikaitė, Siddharth Gupta, Brian Granger

发表机构 * Amazon Web Services(亚马逊网络服务)

AI总结 提出人机协调框架,通过显著性、参与度和活动三个维度定义协调区域,并提供输入分类、协调曲线和设计模式,用于生成、分析和沟通人机交互体验。

详情
AI中文摘要

随着生成式和代理性AI嵌入日常产品,实践者面临一个持续挑战:如何设计人机协调——即用户与AI系统通过界面进行的持续相互调整,以支持可用性、信任和安全性。现有资源提供高层次原则(“保持透明”、“维持用户控制”)或低层次UI模式,但缺乏连接两者的中层设计知识。通过对60个商业AI应用进行景观和人工制品分析,我们引入了一个框架,将人机协调定义为三个维度的相互作用:显著性(AI呈现的突出程度)、参与度(用户可做什么来参与AI)和活动(AI实际做什么)。我们贡献了中层工具,包括协调区域(为我做、在我之下做、与我一起做、没有我做)、输入分类(提示、激发、推断、分层)、用于映射用户旅程的协调曲线,以及展示框架生成能力的设计模式。该框架可生成性地应用于设计体验,分析性地评估现有体验,以及沟通性地在利益相关者之间阐述想法。

英文摘要

As generative and agentic AI becomes embedded in everyday products, practitioners face a persistent challenge: how to design human-AI coordination -- the ongoing mutual adjustment between users and AI systems as mediate through interfaces-that supports usability, trust, and safety. Existing resources offer high-level principles ("be transparent," "maintain user control") or low-level UI patterns, but there is a lack of mid-level design knowledge bridging the two. Through landscape and artifact analysis of 60 commercial AI applications, we introduce a framework defining human-AI coordination as the interplay of three dimensions: salience (how prominently AI is presented), involvement (what users can do to engage AI), and activity (what AI actually does). We contribute mid-level tools including coordination zones (done-for-me, done-under-me, done-with-me, done-without-me), an input taxonomy (prompted, sparked, inferred, layered), coordination curves for mapping user journeys, and design patterns demonstrating the generative capacity of the framework. The framework can be applied generatively to design experiences, analytically to evaluate existing ones, and communicatively to articulate ideas across stakeholders.

2606.09916 2026-06-10 cs.LG cs.AI 交叉投稿

IntentKV: Cross-Turn Intent-Aware KV Cache Pruning for Agent Inference

IntentKV: 面向Agent推理的跨轮次意图感知KV缓存剪枝

Junjie Li, Jiong Lou, Jie Li

AI总结 针对多轮LLM Agent中KV缓存成为服务瓶颈的问题,提出IntentKV方法,通过会话级QueryMemory和残差注意力头实现跨轮次意图感知的KV剪枝,在保持精度的同时大幅降低峰值请求token和KV读取量。

详情
AI中文摘要

多轮LLM Agent将短查询扩展为包含工具调用、搜索结果和中间推理的长轨迹。在单条轨迹中,KV内存和KV读取带宽增长数个数量级,使得键值(KV)缓存(而非参数计算)成为长时Agent的主要服务瓶颈。我们提出IntentKV,一种学习型KV剪枝方法,保持基础LLM冻结。IntentKV维护一个会话级的跨轮次意图QueryMemory,通过记忆-注意力规则对实时历史token进行评分,并添加一个零初始化的残差注意力头,对当前查询的K向量进行交叉注意力。为了与前缀缓存保持可组合性,驱逐采用槽位映射重定向:被丢弃的位置路由到一个哨兵死槽,而存活的K/V行、RoPE相位和槽位标识保持不变。在严格的KV预算下,IntentKV与无剪枝的全缓存基线相比几乎没有精度下降:在8k KV预算下,Qwen3-8B的平均峰值请求token下降23.9%,Qwen2.5-14B下降30.7%。在Qwen2.5-14B上所有方法都能完成的100个最长BCP查询中,IntentKV-8k进一步将最坏情况下的峰值请求token从92.3k降至20.5k(减少77.8%),最坏情况下的原始KV读取从4.11亿降至3100万(减少92.6%)。

英文摘要

Multi-turn LLM agents fan short queries into long trajectories of tool calls, search results, and intermediate reasoning. Both KV memory and KV read bandwidth grow by orders of magnitude across a single trajectory, making the key-value (KV) cache, not parameter compute, the dominant serving bottleneck for long-horizon agents. We introduce IntentKV, learned KV pruning that keeps the base LLM frozen. IntentKV maintains a session-level QueryMemory of cross-turn intent, scores live history tokens with a memory-attention rule, and adds a zero-initialized residual head with cross-attention over current-query K-vectors. To stay composable with prefix caches, eviction is a slot-map redirection: dropped positions route to a sentinel dead slot while surviving K/V rows, RoPE phases, and slot identities stay in place. IntentKV matches the no-pruning full-cache baseline with almost no accuracy drop under tight KV budgets: at an 8k KV budget, mean peak request tokens drop 23.9% on Qwen3-8B and 30.7% on Qwen2.5-14B. On the 100 longest BCP queries that all methods complete on Qwen2.5-14B, IntentKV-8k further cuts worst-case peak request tokens from 92.3k to 20.5k, a 77.8% reduction, and worst-case raw KV reads from 411M to 31M, a 92.6% reduction.

2606.09961 2026-06-10 cs.LG cs.AI 交叉投稿

3SPO: State-Score-Supervised Policy Optimization for LLM Agents

3SPO: 面向LLM智能体的状态分数监督策略优化

Yu Han, Kailing Li, Yang Jiao, Yulin Dai, Yuqian Fu, Linhai Zhuo, Tianwen Qian

AI总结 提出3SPO算法,通过动态状态分数监督实现逐步骤策略优化,解决多轮智能体任务中奖励稀疏和信用分配问题,在ALFWorld和WebShop上分别比GRPO提升22.6%和15.6个百分点。

详情
AI中文摘要

通过强化学习(RL)将大型语言模型(LLM)训练为自主智能体,已使前沿模型在长周期任务中实现超人类性能。然而,现有RL算法在轨迹级别操作,仅在收集完整回合后执行策略优化。这种粗粒度方法在多轮智能体设置中面临根本性挑战,其中奖励稀疏、延迟,且跨单个步骤的信用分配至关重要。在这项工作中,我们提出\textbf{状态分数监督策略优化(3SPO)},一种新颖的RL算法,通过动态状态分数监督执行逐步骤策略优化。在每个步骤,3SPO基于历史成功率计算状态分数,监督逐步骤信用分配、自适应回合和逐步骤策略优化,无需价值函数估计或额外辅助模型。理论上,在每状态臂架抽象下,我们证明所提出的分数监督分配机制实现了对数分配遗憾,并为动作识别、分数可区分性和过滤稳定性提供了样本复杂度保证。在ALFWorld和WebShop上使用Qwen2.5-1.5B/7B-Instruct的实验表明,3SPO在ALFWorld上持续优于GRPO $+22.6\%$,在WebShop上优于$+15.6$个百分点,同时使用相当资源实现了$2.4\times$更多的状态探索和$1.8\times$更快的收敛。代码可从此https URL获取。

英文摘要

Training large language models (LLMs) as autonomous agents via reinforcement learning (RL) has enabled frontier models to achieve superhuman performance in long-horizon tasks. However, existing RL algorithms operate at the trajectory level, performing policy optimization only after collecting complete episode rollouts. This coarse-grained approach faces fundamental challenges in multi-turn agent settings where rewards are sparse, delayed, and credit assignment across individual steps is critical. In this work, we propose \textbf{State-Score-Supervised Policy Optimization (3SPO)}, a novel RL algorithm that performs post-step policy optimization with dynamic state score supervision. At each step, 3SPO computes the state score based on historical success rates, supervising step-wise credit assignment, adaptive rollout and post-step policy optimization without requiring value function estimation or additional auxiliary models. Theoretically, under a per-state bandit abstraction, we show that the proposed score-supervised allocation mechanism achieves logarithmic allocation regret and provide sample-complexity guarantees for action identification, score distinguishability, and filtering stability. Experiments on ALFWorld and WebShop with Qwen2.5-1.5B/7B-Instruct show that 3SPO consistently outperforms GRPO by $+22.6\%$ on ALFWorld and $+15.6$ points on WebShop, while using comparable resources to achieve $2.4\times$ more state exploration and $1.8\times$ faster convergence. Code is available at https://github.com/genalyu/3SPO.

2606.10064 2026-06-10 cs.LG cs.AI 交叉投稿

Bittensor Agent Arenas as a Trajectory Primitive: Distilling a Shopping Agent from ShoppingBench Subnet Traces

Bittensor 智能体竞技场作为轨迹基元:从 ShoppingBench 子网轨迹中蒸馏购物智能体

Shardul Bansal, Seth Schilbe, Jarrod Barnes

AI总结 针对小模型后训练缺乏多轮轨迹数据的问题,利用 Bittensor 子网 SN15 的竞技机制生成激励对齐的轨迹,通过结构质量过滤提取智能体轨迹,后训练 Qwen3-4B 模型在 ShoppingBench 上达到 42.7% ASR,接近合成数据基线。

Comments 10 pages, 4 figures, Data and Models available at: https://huggingface.co/collections/oro-ai/shoppingbench-sn15-trajectory-primitive

详情
AI中文摘要

小模型智能体后训练的瓶颈更多在于其消耗的轨迹基质而非算法。领先的方案(RLVR、群体相对 RL、拒绝采样重 SFT)都需要携带每轨迹监督的多轮轨迹,而现有的两个来源存在不足:前沿合成数据继承了合成器的偏见并坍缩了长尾,而未经过滤的生产日志未经评判且被捷径行为污染。我们认为可以设计一个激励对齐的智能体竞技场来制造此类轨迹,并在 ORO Subnet 15(SN15)上进行了演示,这是 ShoppingBench 智能体电商基准的 Bittensor 部署。SN15 的竞赛机制、LLM 推理评判器和旋转泄漏簇防护问题集产生了一个具有三个特性的语料库:激励对齐的多样性、每轨迹评判和反记忆的留出评估。我们引入了一个结构质量过滤器,通过保留智能体轨迹(模型自身发出工具调用)并拒绝子任务轨迹(模型仅在确定性搜索循环上进行分类或叙述),将原始数据流转换为可训练的语料库,然后使用与已发布的 ShoppingBench SFT-然后-GRPO 流程匹配的方案对 Qwen3-4B 进行后训练。在泄漏簇防护的留出分区上,以生产严格方式评分,模型从已发布的 Qwen3-4B 基线的 18.0% ASR 提升至 42.7%,与合成数据 SFT 仅基线(43.6%)在单问题噪声范围内,同时仅训练了子网单日输出的一小部分。监督堆栈留下了较大的 pass@8 到 pass@1 差距(53.3% 对比 34.8%);每步教师基础的 Dr. GRPO 奖励将该空间转化为过程改进,我们确定子任务数据流是缩小与 48.7% SFT+GRPO 基线差距的主要杠杆。我们发布了过滤器、语料库分割和竞技场机制。

英文摘要

Small-model agentic post-training is bottlenecked less by the algorithm than by the trajectory substrate it consumes. Leading recipes (RLVR, group-relative RL, rejection-sampled re-SFT) all need multi-turn traces carrying per-trajectory supervision, and the two existing sources fall short: frontier-synthesised data inherits the synthesizer's biases and collapses the long tail, while unfiltered production logs are unjudged and contaminated by shortcut behaviour. We argue that an incentive-aligned agent arena can be engineered to manufacture such trajectories, and demonstrate this on ORO Subnet 15 (SN15), a Bittensor deployment of the ShoppingBench agentic-commerce benchmark. SN15's race mechanism, LLM reasoning judge, and rotating leak-cluster-guarded problem suite yield a corpus with three properties: incentive-aligned diversity, per-trajectory judging, and anti-memorised held-out evaluation. We introduce a structural-quality filter that converts the raw firehose into a trainable corpus by keeping agentic trajectories (the model itself emits the tool calls) and rejecting sub-task trajectories (the model only classifies or narrates over a deterministic search loop), then post-train Qwen3-4B with a recipe matched to the published ShoppingBench SFT-then-GRPO pipeline. On a leak-cluster-guarded held-out partition scored production-strict, the model lifts from the published Qwen3-4B base of 18.0% ASR to 42.7%, within single-problem noise of the synthetic-data SFT-only baseline (43.6%), while training on a fraction of a single day of subnet output. The supervised stack leaves a large pass@8 to pass@1 gap (53.3% vs 34.8%); a per-step teacher-grounded Dr. GRPO reward converts that headroom into process improvement, and we identify the sub-task firehose as the primary lever for closing the gap to the 48.7% SFT+GRPO bar. We release the filter, the corpus splits, and the arena mechanics.

2606.10106 2026-06-10 cs.SE cs.AI 交叉投稿

What makes a harness a harness: necessary and sufficient conditions for an agent harness

什么使一个工具成为工具:智能体工具的必要和充分条件

Sanderson Oliveira de Macedo

发表机构 * Federal Institute of Goiás(戈亚斯联邦理工学院)

AI总结 本文通过概念分析,定义了智能体工具的必要和充分条件,并提供了包含/排除测试,以区分智能体工具与智能体框架、SDK、IDE插件等。

详情
AI中文摘要

术语“智能体工具”现在在软件工程中随着生成式人工智能广泛流传。它指的是包裹语言模型并将其转化为能够在仓库上行动的编码智能体的层。该用法松散且多义。有时该术语指整个产品(Claude Code, Codex CLI);有时指运行智能体执行任务的评估脚手架(SWE-bench工具);有时它与智能体框架、SDK、IDE插件或编排器混为一谈。缺失的是一个作为工具的参考定义,能够一致地包含和排除案例。我们通过概念分析构建该定义,结合了具有持久标识符的作品和主要灰色文献来源,如官方文档、词汇表和工程报告。我们重构了该术语的谱系,从马具到经典测试工具,到机器学习评估工具,最后到智能体工具。然后我们提出一个构成性定义,陈述一个系统成为智能体工具的必要和充分条件,我们将其操作化为包含和排除测试,并绘制该概念与智能体框架、智能体SDK、IDE插件、评估工具和编排器的边界。我们将该定义应用于六个真实工具(Claude Code, Codex CLI, Aider, Cline, OpenHands和SWE-agent)以及故意的边缘案例;测试一致地包含和排除。最后我们以按设计张力轴组织的研究议程结束。贡献是智能体工具的操作性定义,具有共享词汇,能够指导工程实践和智能体系统的科学比较。

英文摘要

The term agent harness now circulates widely in software engineering with generative artificial intelligence. It names the layer that wraps a language model and turns it into a coding agent able to act on a repository. The usage is loose and polysemous. Sometimes the term denotes the whole product (Claude Code, Codex CLI); sometimes it denotes the evaluation scaffold that runs an agent against tasks (the SWE-bench harness); sometimes it gets conflated with an agent framework, an SDK, an IDE plugin, or an orchestrator. What is missing is a reference definition that works as an instrument, one that includes and excludes cases consistently. We build that definition through a conceptual analysis that combines works with persistent identifiers and primary grey-literature sources, such as official documentation, glossaries, and engineering reports. We reconstruct the genealogy of the term, from the horse's tack to the classic test harness, to the machine-learning evaluation harness, and finally to the agent harness. We then propose a constitutive definition that states the necessary and sufficient conditions for a system to be an agent harness, we operationalize it as an inclusion and exclusion test, and we draw the boundary of the concept against an agent framework, an agent SDK, an IDE plugin, an eval harness, and an orchestrator. We apply the definition to six real harnesses (Claude Code, Codex CLI, Aider, Cline, OpenHands, and SWE-agent) and to deliberate edge cases; the test includes and excludes consistently. We close with a research agenda organized by design tension axes. The contribution is an operational definition of agent harness, with a shared vocabulary, able to guide engineering practice and the scientific comparison of agentic systems.

2606.10228 2026-06-10 cs.LG cs.AI cs.RO 交叉投稿

SHAPO: Sharpness-Aware Policy Optimization for Safe Exploration

SHAPO: 面向安全探索的锐度感知策略优化

Kaustubh Mani, Yann Pequignot, Vincent Mai, Liam Paull

AI总结 提出SHAPO算法,通过锐度感知策略更新隐式重加权梯度,放大罕见不安全动作的影响,抑制安全动作的贡献,从而在欠探索区域实现保守行为,提升安全性与任务性能。

Comments ICLR 2026

详情
AI中文摘要

安全探索是在安全关键领域部署强化学习(RL)智能体的先决条件。在本文中,我们通过认知不确定性的视角来探讨安全探索,其中智能体对参数扰动的敏感性作为高不确定性区域的实际代理。我们提出了锐度感知策略优化(SHAPO),一种锐度感知的策略更新规则,该规则在扰动参数处评估梯度,使得策略更新相对于智能体的认知不确定性变得悲观。分析表明,这种调整隐式地重新加权了策略梯度,放大了罕见不安全动作的影响,同时抑制了已安全动作的贡献,从而在欠探索区域将学习偏向于保守行为。在多个连续控制任务中,我们的方法在安全性和任务性能上均持续优于现有基线,显著扩展了它们的帕累托前沿。

英文摘要

Safe exploration is a prerequisite for deploying reinforcement learning (RL) agents in safety-critical domains. In this paper, we approach safe exploration through the lens of epistemic uncertainty, where the actor's sensitivity to parameter perturbations serves as a practical proxy for regions of high uncertainty. We propose Sharpness-Aware Policy Optimization (SHAPO), a sharpness-aware policy update rule that evaluates gradients at perturbed parameters, making policy updates pessimistic with respect to the actor's epistemic uncertainty. Analytically we show that this adjustment implicitly reweighs policy gradients, amplifying the influence of rare unsafe actions while tempering contributions from already safe ones, thereby biasing learning toward conservative behavior in under-explored regions. Across several continuous-control tasks, our method consistently improves both safety and task performance over existing baselines, significantly expanding their Pareto frontiers.

2606.10613 2026-06-10 cs.LG cs.AI 交叉投稿

Fast and Highly Expressive Policy Learning for Offline Reinforcement Learning via Bootstrapped Flow Q-Learning

基于自举流Q学习的离线强化学习快速且高表达性策略学习

Thanh Nguyen, Tri Ton, Hongbin Choe, Tung M. Luu, Chang D. Yoo

发表机构 * University of California, Berkeley(加州大学伯克利分校)

AI总结 提出自举流Q学习(BFQ),通过分治位移向量并自举短程分量,实现单步动作生成,无需辅助网络或蒸馏,显著降低计算成本并提升性能。

Comments ICML 2026, 19 pages

详情
Journal ref
ICML 2026
AI中文摘要

基于扩散的Q学习已成为离线强化学习的一种强大范式,但其对多步去噪的依赖使得训练和推理在计算上昂贵且脆弱。最近将扩散Q学习加速到单步动作生成的努力通常引入辅助网络、策略蒸馏或多阶段训练,这常常损害简单性、稳定性或性能。为解决这些限制,我们引入了自举流Q学习(BFQ),一种新颖的框架,能够在训练和推理期间实现精确的单步动作生成,无需辅助网络或蒸馏过程。BFQ采用分治视角处理沿流路径的位移向量:它首先学习可以从流匹配边际速度准确估计的短程位移,然后自举这些分量以直接学习单步噪声到动作的映射。这种公式消除了多步去噪,导致学习过程更快、更简单、更稳健。广泛的D4RL评估表明,与多步扩散基线相比,BFQ在显著降低计算成本的同时提高了性能,证明了单步动作生成足以实现高性能的离线强化学习。

英文摘要

Diffusion-based Q-learning has emerged as a powerful paradigm for offline reinforcement learning, but its reliance on multi-step denoising makes both training and inference computationally expensive and brittle. Recent efforts to accelerate diffusion Q-learning toward single-step action generation typically introduce auxiliary networks, policy distillation, or multi-phase training, which frequently compromise simplicity, stability, or performance. To address these limitations, we introduce Bootstrapped Flow Q-Learning (BFQ), a novel framework that enables accurate single-step action generation during both training and inference, without auxiliary networks or distillation procedures. BFQ adopts a divide-and-conquer view of the displacement vector along the flow path: it begins by learning short-range displacements that can be accurately estimated from the Flow Matching marginal velocity, and bootstraps these components to directly learn a noise-to-action mapping in a single step. This formulation eliminates multi-step denoising, resulting in a learning procedure that is substantially faster, simpler, and more robust. Extensive D4RL evaluations show that BFQ improves performance while significantly reducing computational cost compared to multi-step diffusion baselines, demonstrating that single-step action generation suffices for high-performance offline Reinforcement Learning.

2606.10705 2026-06-10 cs.LG cs.AI cs.SY eess.SY 交叉投稿

Event-Driven Reinforcement Learning Enables Long-Horizon Control in Semiconductor Fabrication

事件驱动强化学习实现半导体制造中的长时域控制

Yavar Yeganeh, Mahsa Shekari, Nicla Frigerio, Daniele Pagano, Andrea Matta

发表机构 * Politecnico di Milano(米兰理工大学) STMicroelectronics(意法半导体)

AI总结 提出事件驱动深度强化学习框架,将半导体制造控制建模为中心化智能体问题,通过事件驱动时序差分方法优化多目标策略,在高保真仿真中显著提升吞吐量和利用率。

详情
AI中文摘要

强化学习有望优化大规模系统中的序贯决策。半导体制造系统是随机且高度约束的环境,其中异构晶圆在广泛的设备网络中经历数百个加工步骤。这些特性产生了复杂、高维的决策问题,具有延迟反馈和长时域要求,使生产计划和控制复杂化。我们提出了一个用于此规模的多目标策略优化的深度强化学习框架。具体来说,我们将控制表述为一个中心化智能体问题,其中核心策略协调系统范围的决策,而系统演化被表示为由离散事件驱动的互联时间过程。相应地,我们开发了一个定制的事件驱动时序差分公式,该公式保持通用性,并可在相关训练设置下与各种策略优化方法集成。我们研究了纳入该框架的几种核心无模型算法,并使用不同工业现实操作场景的高保真仿真评估其有效性。在广泛的验证实验中,在离线和在线设置下训练的智能体在吞吐量和利用率方面显示出显著且一致的提升。我们进一步评估了训练阶段的表现和泛化能力,阐明了替代强化学习公式和算法的相对优势。总体而言,结果支持所提出框架在控制事件驱动复杂自适应系统方面的可扩展性、通用性和可迁移性。

英文摘要

Reinforcement learning promises to optimize sequential decisions in large-scale systems. Semiconductor manufacturing systems are stochastic and highly constrained environments where heterogeneous wafers traverse hundreds of processing steps across extensive equipment networks. These characteristics yield complex, high-dimensional decision problems with delayed feedback and long-horizon requirements, complicating production planning and control. We propose a deep reinforcement learning framework for multi-objective policy optimization at this scale. Specifically, we formulate control as a centralized-agent problem, where a core policy coordinates system-wide decisions, while system evolution is represented as an interconnected temporal process driven by discrete events. Accordingly, we develop a tailored event-driven temporal-difference formulation that remains general and can be integrated with various policy optimization methods under relevant training settings. We investigate several core model-free algorithms incorporated into this framework and evaluate their effectiveness using high-fidelity simulations of diverse, industry-real operating scenarios. Across extensive validation experiments, agents trained in both offline and online settings show significant and consistent gains in throughput and utilization. We further evaluate performance and generalization across training phases, clarifying the relative strengths of alternative reinforcement learning formulations and algorithms. Overall, the results support the scalability, generality, and transferability of the proposed framework for controlling event-driven complex adaptive systems.

2606.10709 2026-06-10 cs.IR cs.AI 交叉投稿

Effective Reinforcement Learning for Agentic Search by Recycling Zero-Variance Queries During Training

通过训练期间回收零方差查询实现智能体搜索的有效强化学习

João Coelho, João Magalhães, Bruno Martins, Chenyan Xiong

发表机构 * Language Technologies Institute, Carnegie Mellon University(卡内基梅隆大学语言技术研究所) Instituto Superior Técnico and INESC-ID, University of Lisbon(里斯本大学理工学院和INESC-ID) NOVA LINCS, NOVA School of Science and Technology(NOVA科学与技术学院LINCS)

AI总结 提出查询回收方法,将训练中零方差查询重新投入采样池,使有效训练分布随策略演化,1.7B模型在7个多跳QA基准上平均Pass@1达66.0,匹配或超越7B模型。

详情
AI中文摘要

使用GRPO风格的算法已成为在仅结果奖励下训练LLM搜索代理的标准策略。使用这些算法时,只有当查询的 rollout 组混合了成功和失败时,该查询才对参数更新有贡献;全正确(太容易)和全错误(太难)的组是零方差的,浪费了 rollout 成本。现有方法将零方差视为静态属性,要么丢弃要么预过滤这些组。我们假设并通过实验验证,随着训练过程中策略的演化,查询会在零方差和信号承载状态之间翻转。基于这一直觉,我们提出查询回收,将零方差组返回到可变池中以供将来重新采样,从而使有效训练分布与策略共同演化。使用所提出的技术,在合成数据上训练的1.7B参数模型在七个多跳QA基准上平均达到66.0的Pass@1,匹配或超越使用基准监督训练的、参数高达7B的系统。回收模式分析表明,到训练结束时,回收的查询提供了大约四分之三的有效批次,贡献在策略改进恢复和策略漂移之间分配。

英文摘要

The use of GRPO-style algorithms has become the standard strategy for training LLM search agents under outcome-only rewards. With these algorithms, a query contributes to parameter updates only when its rollout group mixes successes and failures; all-correct (too-easy) and all-incorrect (too-hard) groups are zero-variance and waste rollout cost. Existing approaches treat zero-variance as a static property and either discard or pre-filter such groups. We hypothesize and empirically validate that queries flip between zero-variance and signal-bearing states as the policy evolves during training. Building on this intuition, we propose query recycling, which returns zero-variance groups to a mutable pool for future resampling, so that the effective training distribution co-evolves with the policy. With the proposed technique, a 1.7B parameter model trained on synthetic data can reach 66.0 average Pass@1 accross seven multi-hop QA benchmarks, matching or surpassing systems with up to 7B parameters trained on benchmark-derived supervision. Analysis of recycling patterns shows that recycled queries supply roughly three quarters of the effective batch by the end of training, with contributions split between recovery from policy improvement and policy drift.

2606.11119 2026-06-10 cs.LG cs.AI cs.CL 交叉投稿

TRACE: A Unified Rollout Budget Allocation Framework for Efficient Agentic Reinforcement Learning

TRACE:一种用于高效智能体强化学习的统一展开预算分配框架

Heming Zou, Qi Wang, Yun Qu, Yuhang Jiang, Lizhou Cai, Yixiu Mao, Ru Peng, Xin Xu, Weijie Liu, Kai Yang, Saiyong Yang, Xiangyang Ji

发表机构 * Tsinghua University(清华大学) Tencent(腾讯)

AI总结 针对多轮智能体强化学习中奖励对比度不足的问题,提出TRACE框架,通过将每个ReAct式思考-行动-观察步骤建模为语义节点,在固定采样预算内将预算分配到提示根和中间前缀,增强奖励对比,提升策略更新信号。

Comments 32 pages, 12 figures, 6 tables

详情
AI中文摘要

具有可验证奖励的强化学习(RLVR)是增强大型语言模型推理和智能体行为的一种有前景的方法。然而,展开密集的策略优化常常受到奖励对比度不足的限制,当过于简单或复杂的提示产生低方差反馈,以及当仅结果奖励对多轮展开中的每个决策赋予相同的终端评估时,就会出现这种情况。过去的努力集中在将可用的展开资源分配给有希望的提示,但它们仅利用提示级别的样本信息性,而忽略了同一展开中不同轮次之间前缀级别信息性的变化。本工作针对多轮智能体强化学习,将每个ReAct风格的思考-行动-观察步骤建模为语义上不同的节点,使得预算分配从提示根扩展到具有进一步延续的轮次级前缀,这自然形成了树状结构的展开。我们引入了树状展开分配用于对比探索(TRACE),这是一个统一的展开分配框架,在固定采样预算内增强奖励对比。在技术上,TRACE将展开预算分配给最可能产生混合终端奖励的提示根和中间前缀。一个共享的通用预测器根据前缀历史估计这些锚点处的条件成功概率,以指导这种分配。由此产生的自适应树状结构丰富了仅结果反馈,并放大了策略更新信号。实验上,TRACE在典型的智能体基准测试中取得了有竞争力的性能和效率提升,例如,在相同采样成本下,Qwen3-14B多跳问答的平均准确率比竞争基线提高了2.8个百分点。

英文摘要

Reinforcement learning with verifiable rewards (RLVR) is a promising approach for enhancing reasoning and agentic behavior in large language models. However, rollout-intensive policy optimization is often limited by insufficient reward contrast, arising when overly simple or complex prompts generate low-variance feedback and when outcome-only rewards assign the same terminal assessment to every decision in a multi-turn rollout. Past efforts have focused on allocating available rollout resources to promising prompts, yet they only leverage sample informativeness at the prompt level and neglect variation in prefix-level informativeness across turns within the same rollout. This work targets multi-turn agentic RL by modeling each ReAct-style thought-action-observation turn as a semantically distinct node, allowing budget allocation to extend from prompt roots to turn-level prefixes with further continuations, which naturally forms tree-structured rollouts. We introduce Tree Rollout Allocation for Contrastive Exploration (TRACE), a unified rollout allocation framework that enhances reward contrast within a fixed sampling budget. Technically, TRACE allocates rollout budget to both prompt roots and intermediate prefixes that are most likely to yield mixed terminal rewards. A shared generalizable predictor estimates conditional success probability at these anchors from prefix histories to guide this allocation. The resulting adaptive tree structure enriches outcome-only feedback and amplifies the policy-update signal. Empirically, TRACE achieves competitive performance and efficiency gains on typical agentic benchmarks, e.g., improving Qwen3-14B Multi-Hop QA average accuracy by 2.8 points over competitive baselines at equal sampling cost.

2606.11182 2026-06-10 cs.LG cs.AI 交叉投稿

EEVEE: Towards Test-time Prompt Learning in the Real World for Self-Improving Agents

EEVEE:面向真实世界自改进智能体的测试时提示学习

Weixian Xu, Shilong Liu, Mengdi Wang

发表机构 * Shanghai Jiao Tong University(上海交通大学) Princeton University(普林斯顿大学)

AI总结 提出首个多数据集测试时提示学习框架EEVEE,通过路由器与提示协同进化策略解决跨数据集干扰,在异构数据流下提升鲁棒性。

Comments 19 pages, 6 figures

详情
AI中文摘要

本文提出EEVEE,首个面向LLM智能体的多数据集测试时提示学习框架,能够在真实世界任务流下实现测试时提示学习。现有方法主要针对单数据集设计,而实际应用要求模型处理来自多个数据集、领域和任务分布的异构输入流,限制了其实用性。为减轻跨数据集干扰,EEVEE引入一个路由器,将输入划分为任务簇并分配适当的提示配置。该设计通过路由器-提示协同进化策略进行优化,该策略采用交错的路由器和提示学习阶段来解决它们的相互依赖关系。跨多个数据集的实验表明,该框架在异构数据流下提高了鲁棒性,同时保持了单基准学习能力和效率。具体而言,EEVEE在Qwen3-4B-Instruct和DeepSeek-V3.2上平均多基准分数分别提高了10.38和24.32分,超过SOTA方法GEPA和ACE高达37.2%和48.2%。

英文摘要

In this paper, we propose EEVEE, the first multi-dataset test-time prompt learning framework for LLM agents, enabling test-time prompt learning under real-world task streams. Existing methods are largely designed for single-dataset settings, while real-world applications require models to handle heterogeneous input streams drawn from multiple datasets, domains, and task distributions, limiting their practical applicability. To mitigate cross-dataset interference, EEVEE introduces a router that partitions incoming inputs into task clusters and assigns them to suitable prompt configurations. This design is optimized via a router-prompt co-evolution strategy, which employs interleaved router and prompt learning phases to address their mutual dependency. Experiments across multiple datasets demonstrate that the framework improves robustness under heterogeneous data streams while maintaining single-benchmark learning capability and efficiency. Specifically, EEVEE improves average multi-benchmark scores by 10.38 and 24.32 points over Qwen3-4B-Instruct and DeepSeek-V3.2, surpassing SOTA methods GEPA and ACE by up to 37.2% and 48.2%.

2510.04195 2026-06-10 cs.AI 版本更新

Constructing coherent spatial memory in LLM agents through graph rectification

通过图修正构建LLM智能体中的连贯空间记忆

Puzhen Zhang, Xuyang Chen, Yu Feng, Yuhan Jiang, Liqiu Meng

发表机构 * Chair of Cartography and Visual Analytics(制图学与视觉分析教授会)

AI总结 提出LLM-MapRepair框架,通过版本控制和边影响评分检测并修正增量构建的导航图中的结构不一致性,在多个基准上显著提升节点和边召回率。

详情
AI中文摘要

给定通过全局遍历导航指令的地图描述,LLM通常能够推断隐式空间布局并通过提供最短路径来回答用户查询。然而,随着环境变大,这种依赖于上下文的查询变得不可行,这促使需要增量地图构建,即从逐步观察中构建完整的拓扑图。我们提出LLM-MapRepair,一个用于LLM驱动的地图构建和修复的框架,旨在检测、定位和修正增量构建的导航图中的结构不一致性。我们的贡献包括:用于图构建的版本控制机制、用于修复优先级的边影响评分,以及为LLM驱动的地图构建和修复量身定制的MANGO基准的清理变体。我们在四个评估设置上评估该框架:合成逐组件消融(gpt-4.1,每个单元n=20个种子)、跨供应商扫描(覆盖OpenAI、Anthropic和Google的七个LLM,在合成和TextWorld程序生成的文本冒险游戏上)、修复阶段评估(在所有42个清理后的MANGO游戏上,具有非零剩余冲突,共534个冲突;三个供应商×三种模式加上两个非LLM参考),以及在《红楼梦》第16-17章上的端到端自然文本部署。在DRC部署中,LLM-MapRepair使用GPT-4.1实现了94.3%的节点召回率(比直接LLM映射高8.6个百分点)和88.2%的边召回率(高55.8个百分点);召回率的提升伴随着预测节点和边数量约为真实值的4倍(表4),这反映了我们在局限性中讨论的离散化驱动的过度生成权衡。

英文摘要

Given a map description through global traversal navigation instructions, an LLM can often infer the implicit spatial layout and answer user queries by providing shortest paths. However, such context-dependent querying becomes incapable as environments grow larger, motivating the need for incremental map construction that builds a complete topological graph from stepwise observations. We propose LLM-MapRepair, a framework for LLM-driven construction and map repair, designed to detect, localize, and correct structural inconsistencies in incrementally constructed navigation graphs. Our contributions include a Version Control mechanism for graph construction, an Edge Impact Score for repair prioritization, and a cleaned variant of the MANGO benchmark tailored for LLM-driven map construction and repair. We evaluate the framework on four evaluation settings: a synthetic per-component ablation (gpt-4.1, n=20 seeds per cell), a cross-vendor sweep over seven LLMs from OpenAI, Anthropic, and Google on both synthetic and TextWorld procedurally-generated text-adventure games, a repair-stage evaluation on all 42 cleaned-MANGO games with non-zero residual conflicts (534 conflicts; three vendors x three modes plus two non-LLM references), and an end-to-end natural-text deployment on Chapters 16-17 of Dream of the Red Chamber. On the DRC deployment, LLM-MapRepair achieves 94.3% node recall (+8.6 pp over direct LLM mapping) and 88.2% edge recall (+55.8 pp), using GPT-4.1; the recall improvements come with predicted node and edge counts that are roughly 4x the ground-truth counts (Table 4), reflecting the discretization-driven over-generation trade-off we discuss in the Limitations.

2606.09316 2026-06-10 cs.AI 版本更新

Anything2Skill: Compiling External Knowledge into Reusable Skills for Agents

Anything2Skill: 将外部知识编译为智能体的可复用技能

Qianjun Pan, Yutao Yang, Junsong Li, Jie Zhou, Kai Chen, Xin Li, Qin Chen, Liang He

发表机构 * East China Normal University(华东师范大学) Shanghai AI Laboratory(上海人工智能实验室)

AI总结 提出Anything2Skill框架,将异构外部知识编译为可复用、可检索、可执行的技能,结合RAG显著提升智能体任务成功率。

详情
AI中文摘要

检索增强生成(RAG)使智能体在推理时能够访问外部知识,但主要检索的是碎片化的陈述性证据,导致智能体需要反复从段落、手册、示例、日志或轨迹中推断任务流程。这引发了一个根本性问题:能否从外部知识库中提取技能并安装到智能体中,使其快速逼近领域专业知识?在本文中,我们提出Anything2Skill,一个基于分类的框架,将异构外部知识编译为智能体可复用、可检索、可执行的技能。给定一个知识记录语料库,Anything2Skill首先将每条记录分解为证据窗口,并在技能树先验下执行规划与扩展的技能提取。然后将提取的候选技能转换为结构化的技能契约,指定调用条件、禁忌、动作步骤、工作流程步骤、约束、输出规范、支持证据和置信度分数。为了构建可部署的程序性记忆,Anything2Skill通过分类感知编译、注册表级协调、生命周期跟踪、版本化更新和可见的技能树投影,将提取的技能管理在持久化的SkillBank中。在推理时,智能体从原始知识库中检索任务特定段落,并从SkillBank中检索相关程序性技能,使RAG提供陈述性证据,而编译的技能提供可复用的程序性指导。在qsv和GitHub-CLI上的实验表明,Anything2Skill结合RAG分别实现了98.85%和94.10%的成功率,显著优于仅使用RAG的智能体。这些结果表明,将潜在的程序性知识编译为显式技能是从知识访问扩展到能力复用的有效途径。

英文摘要

Retrieval-augmented generation (RAG) enables agents to access external knowledge at inference time, but it primarily retrieves fragmented declarative evidence, leaving agents to repeatedly infer task procedures from passages, manuals, examples, logs, or trajectories. This raises a fundamental question: can skills extracted from external knowledge bases be installed into an agent, enabling it to rapidly approximate domain expertise? In this paper, we propose Anything2Skill, a taxonomy-guided framework that compiles heterogeneous external knowledge into reusable, retrievable, and executable skills for agents. Given a corpus of knowledge records, \textsc{Anything2Skill} first decomposes each record into evidence windows and performs plan-and-expand skill extraction under a skill-tree prior. The extracted candidates are then converted into structured skill contracts that specify invocation conditions, contraindications, action moves, workflow steps, constraints, output specifications, supporting evidence, and confidence scores. To construct a deployable procedural memory, Anything2Skill manages the extracted skills in a persistent SkillBank through taxonomy-aware compilation, registry-level reconciliation, lifecycle tracking, versioned updates, and visible skill-tree projection. At inference time, agents retrieve both task-specific passages from the original knowledge base and relevant procedural skills from the SkillBank, allowing RAG to provide declarative evidence while compiled skills provide reusable procedural guidance. Experiments on qsv and GitHub-CLI show that Anything2Skill combined with RAG achieves 98.85\% and 94.10\% success rates, respectively, substantially outperforming RAG-only agents. These results suggest that compiling latent procedural knowledge into explicit skills is an effective way to extend retrieval-augmented agents from knowledge access toward capability reuse.

2506.09171 2026-06-10 cs.LG cs.AI cs.CL 版本更新

Fact-Augmented Lookahead Planning for LLM Agents

面向LLM智能体的事实增强前瞻规划

Samuel Holt, Max Ruiz Luyten, Thomas Pouplin, Mihaela van der Schaar

发表机构 * University of Cambridge(剑桥大学)

AI总结 提出LWM-Planner框架,通过从轨迹中提取关键事实并用于条件化动作提议、世界模型模拟和状态值估计,实现无需参数更新的在线规划改进,在多个环境上优于ReAct/Reflexion和纯搜索基线。

Comments Accepted at the 29th International Conference on Artificial Intelligence and Statistics (AISTATS 2026). Camera-ready version. 9-page main text plus appendices (63 pages total), 1 figure

详情
AI中文摘要

大型语言模型(LLM)能力日益增强,但在交互式、部分可观测、长周期环境中,当搜索无引导或近期历史不足时,LLM智能体仍难以有效规划。我们提出LWM-Planner,一种事实增强的前瞻规划框架,仅通过上下文学习改善智能体行为。每个回合后,智能体从轨迹中提取任务关键原子事实,通过轻量级预测一致性过滤器验证候选事实(并可选择压缩),然后使用生成的事实集来条件化动作提议、单步潜在世界模型模拟和状态值估计。规划通过递归、有限深度的前瞻进行,基于累积事实和近期历史对候选轨迹进行搜索,实现无需参数更新的在线改进。我们提供抽象风格的动机:将事实视为减少状态混淆(代理$\epsilon_{\mathrm{sim}}$),将事实条件模拟视为降低单步误差(代理$\delta_{\mathrm{model}}$),但不声称形式化保证。实验上,在文本FrozenLake变体、CrafterMini和ALFWorld上,该方法在累积回报上优于ReAct/Reflexion和纯搜索基线,表明额外的测试时搜索在由紧凑的经验派生事实引导时最为有用。

英文摘要

Large Language Models (LLMs) are increasingly capable, but LLM agents still struggle to plan effectively in interactive, partially observable, long-horizon environments when search is unguided or recent history is insufficient. We introduce LWM-Planner, a fact-augmented lookahead planning framework that improves agent behavior purely through in-context learning. After each episode, the agent extracts task-critical atomic facts from its trajectories, validates candidates with a lightweight predictive-consistency filter (and optionally compresses them), and uses the resulting fact set to condition action proposal, single-step latent world-model simulation, and state-value estimation. Planning then proceeds via recursive, depth-limited lookahead over candidate trajectories conditioned on the accumulated facts and recent history, enabling online improvement without parameter updates. We provide abstraction-style motivation: treating facts as reducing state aliasing (proxy $ε_{\mathrm{sim}}$) and fact-conditioned simulation as lowering one-step error (proxy $δ_{\mathrm{model}}$), without claiming formal guarantees. Empirically, on text FrozenLake variants, CrafterMini, and ALFWorld, the approach improves cumulative return over ReAct/Reflexion and search-only baselines, suggesting that additional test-time search is most useful when grounded by compact, experience-derived facts.

2602.04935 2026-06-10 cs.SE cs.AI 版本更新

ASA: Backbone-Training-Free Representation Engineering for Tool-Calling Agents

ASA:无需骨干训练的工具调用智能体表示工程

Youjin Wang, Run Zhou, Yingjie Ma, Rong Fu, Jiani Liang, Shuaishuai Cao, Min Huang, Tao Fang, Liangming Pan

发表机构 * Renmin University of China(中国人民大学) University of Macau(澳门大学) Central South University(中南大学) Jiangxi Normal University(江西师范大学) Macau Millennium College(澳门 millennium 学院) Peking University(北京大学)

AI总结 针对大语言模型在工具调用中的惰性代理问题,提出一种无需训练、推理时激活干预的方法ASA,通过路由条件混合引导向量和探针引导门控,显著提升工具使用F1并降低误报率。

Comments The manuscript consists of 24 pages formatted in the ACL style. Youjin Wang, Run Zhou, and Yingjie Ma contributed equally to this work. Tao Fang and Liangming Pan are the co-corresponding authors

详情
AI中文摘要

将LLM智能体适应特定领域的工具调用在演化的接口下仍然明显脆弱。提示和模式工程易于部署,但在分布偏移和严格解析器下通常脆弱,而持续参数高效微调提高了可靠性,但代价是训练、维护和潜在的遗忘。我们识别出一个关键的惰性代理失败模式,其中工具必要性几乎可以从中间层激活完美解码,但模型在进入工具模式时仍然保守,揭示了表示-行为差距。我们提出激活引导适配器(ASA),一种无需训练、推理时控制器,执行单次中间层干预,并通过路由条件混合引导向量,结合探针引导符号门控来放大真实意图同时抑制虚假触发,从而定位工具领域。在MTU-Bench上使用Qwen2.5-1.5B,ASA将严格的工具使用F1从0.18提高到0.50,同时将假阳性率从0.15降低到0.05,仅使用约20KB的可移植资产且无需权重更新。

英文摘要

Adapting LLM agents to domain-specific tool calling remains notably brittle under evolving interfaces. Prompt and schema engineering is easy to deploy but often fragile under distribution shift and strict parsers, while continual parameter-efficient fine-tuning improves reliability at the cost of training, maintenance, and potential forgetting. We identify a critical Lazy Agent failure mode where tool necessity is nearly perfectly decodable from mid-layer activations, yet the model remains conservative in entering tool mode, revealing a representation-behavior gap. We propose Activation Steering Adapter (ASA), a training-free, inference-time controller that performs a single-shot mid-layer intervention and targets tool domains via a router-conditioned mixture of steering vectors with a probe-guided signed gate to amplify true intent while suppressing spurious triggers. On MTU-Bench with Qwen2.5-1.5B, ASA improves strict tool-use F1 from 0.18 to 0.50 while reducing the false positive rate from 0.15 to 0.05, using only about 20KB of portable assets and no weight updates.

2606.03963 2026-06-10 cs.RO cs.AI 版本更新

AgenticRL: Self-Refining Agentic Reinforcement Learning for Vision-Conditioned UAV Navigation

面向视觉条件的无人机导航的自优化智能体强化学习

Roohan Ahmed Khan, Yasheerah Yaqoot, Amir Atef Habel, Muhammad Ahsan Mustafa, Dzmitry Tsetserukou

发表机构 * University of Science and Technology of China(中国科学技术大学)

AI总结 提出AgenticRL框架,利用多模态GPT智能体自动设计奖励函数、通过闭环自改进优化策略,在多种无人机导航任务中提升性能并实现高成功率。

详情
AI中文摘要

深度强化学习在使自主机器人学习复杂导航任务方面显示出巨大潜力。然而,其实际应用仍然严重依赖于人工设计的奖励函数和重复的手动微调,这既耗时又无法保证在目标任务中取得高成功率。本文提出了AgenticRL,一种智能体引导的强化学习框架,用于提高无人机导航任务中奖励设计、策略优化和实际部署的自主性。AgenticRL使用多模态生成预训练变换器(GPT)智能体来解释任务信息和视觉场景观察,生成特定于任务的奖励函数,使用近端策略优化(PPO)算法训练策略,然后通过诊断包评估训练后的策略作为批评者,生成反馈。基于该反馈,智能体识别失败模式并在闭环自改进过程中优化奖励函数。为了在推理期间进一步利用多模态GPT智能体,AgenticRL使用真实世界图像和自然语言任务信息自动识别活动场景并选择适当的训练策略执行。该框架在多种导航任务上进行了评估,包括穿越门、避障、穿越墙障并着陆、轨迹跟踪和运动行为学习。实验结果表明,与初始奖励相比,闭环优化过程将策略行为提升了71%。我们还展示了所提出框架的仿真到现实迁移,实现了91%的真实世界成功率和94%的仿真到现实准确率。

英文摘要

Deep reinforcement learning has shown strong potential for enabling autonomous robots to learn complex navigational tasks. However, its practical use still depends heavily on human designed reward functions and repeated manual fine tuning, which is time consuming and does not guarantee high success in the desired task. This paper presents AgenticRL, agent guided reinforcement learning framework that increases autonomy in reward design, policy refinement, and real world deployment for unmanned aerial vehicles (UAV) navigation tasks. AgenticRL uses a multimodal generative pre-trained transformer (GPT) agent to interpret task information and visual scene observations, generate task specific reward functions, train policies using Proximal Policy Optimization (PPO) algorithm, and then act as a critic by evaluating the trained policy through diagnosis packets to generate feedback. Based on this feedback, the agent identifies failure modes and refines the reward function in a closed loop self improvement process. To further leverage the multimodal GPT agent during inference, AgenticRL uses real world images and natural language task information to automatically identify the active scenario and select the appropriate trained policy for execution. The framework is evaluated on multiple navigational tasks, including gate traversal, obstacle avoidance, wall barrier crossing with landing, trajectory following, and motion behavior learning. Experimental results show that the closed loop refinement process improves policy behavior compared with initial rewards by 71%. We also demonstrate sim-to-real transfer of the proposed framework, achieving a real world success rate of 91% and a sim-to-real accuracy of 94%.

2606.07586 2026-06-10 cs.LG cs.AI cs.AR cs.MA 版本更新

From Human Guidance to Autonomy: Agent Skill System for End-to-End LLM Deployment on Spatial NPUs

从人类引导到自主:面向空间NPU上端到端LLM部署的智能体技能系统

Jiajie Li, Erwei Wang, Zhiru Zhang, Samuel Bayliss

发表机构 * AMD Research and Advanced Development(AMD研究与高级开发)

AI总结 提出两阶段方法,从人类引导的智能体辅助部署到自主技能系统,在AMD XDNA 2 NPU上实现8种LLM的端到端自动部署,性能超越或持平人工优化基线。

Comments Accepted to the Machine Learning for Architecture and Systems Workshop (MLArchSys), co-located with ISCA 2026

详情
AI中文摘要

空间神经处理单元(NPU)为边缘LLM推理提供了能效平台,但在此类硬件上高效端到端部署LLM仍然劳动密集。尽管AI编码智能体已开始降低这一成本,现有研究主要关注单核优化,而非在资源受限的空间NPU上进行端到端LLM部署。\n我们提出一种两阶段方法,在AMD XDNA 2 NPU上实例化,从人类引导开发进展到智能体自主。第一阶段,我们通过人类引导的智能体辅助开发Llama-3.2-1B的参考部署。与手工优化基线相比,该实现实现了2.2倍的预填充加速和4.0倍的解码加速,优化轨迹及其经验教训全程记录为结构化文档。第二阶段,我们将文档提炼为一个由八个阶段组成的智能体技能系统,编排优化和调试技能集,并在每个阶段严格执行数值正确性。\n利用我们的智能体技能系统,我们使用开源编译器栈在AMD XDNA 2 NPU上自主端到端部署了另外八个仅解码器LLM(Llama-3.2-3B、SmolLM2-1.7B、Qwen2.5-{0.5B, 1.5B, 3B}、Qwen3-{0.6B, 1.7B, 4B})。据我们所知,这些模型此前尚未通过任何开源软件栈部署在AMD NPU上。每次部署在0.5-4小时的智能体挂钟时间内完成,几乎无需人类引导,并通过数值正确性门控,展示了对先前未见LLM的功能泛化能力。其中八个中的三个达到或超过了我们Llama-3.2-1B参考部署的持续性能,表明所得实现无需额外模型特定人工工程即可具有竞争力。

英文摘要

Spatial neural processing units (NPUs) provide an energy-efficient platform for edge LLM inference, but efficiently deploying an LLM end-to-end on such hardware remains labor-intensive. Although AI coding agents have begun to lower this cost, existing studies have largely focused on single-kernel optimization rather than end-to-end LLM deployment on resource-constrained spatial NPUs. We present a two-stage methodology, instantiated on the AMD XDNA 2 NPU, that progresses from human-guided development to agent autonomy. In the first stage, we develop a reference deployment of Llama-3.2-1B through human-guided agent assistance. The resulting implementation achieves a speedup of 2.2x on prefill and 4.0x on decode over the hand-optimized baseline, with the optimization trajectory and its lessons recorded as structured documentation throughout. In the second stage, we distill the documentation into an agent skill system consisting of eight phases, orchestrating the optimization and debugging skill sets, with numerical correctness strictly enforced at each phase. Using our agent skill system, we autonomously deploy eight additional decoder-only LLMs (Llama-3.2-3B, SmolLM2-1.7B, Qwen2.5-{0.5B, 1.5B, 3B}, Qwen3-{0.6B, 1.7B, 4B}) end-to-end on the AMD XDNA 2 NPU using the open-source compiler stack. To our knowledge, these models have not previously been deployed on AMD NPUs via any open-source software stack. Each deployment completes in 0.5-4 hours of agent wall time with almost no human guidance, and passes the numerical-correctness gates, demonstrating functional generalization to previously unencountered LLMs. Three of the eight match or exceed the sustained performance of our Llama-3.2-1B reference deployment, suggesting that the resulting implementations can be competitive without additional model-specific human engineering.

2. 知识表示、推理与符号AI 6 篇

2606.10934 2026-06-10 cs.AI 新提交

WorldKernel: A World Model is the Coupling Kernel of Admissible Possible Worlds

WorldKernel:世界模型是可能世界的耦合核

Fabio Rovai

发表机构 * The Tesseract Academy(泰塞克特学院)

AI总结 本文发现强预测器在反事实耦合上失效,提出将世界模型建模为可能世界上的半正定耦合核,其非对角元编码反事实信息,并通过半正定性约束和逻辑公理实现高效推理。

详情
AI中文摘要

一个常见的假设认为,给足够强的预测器提供足够的观测和干预数据就足够了。我们报告了一个与之矛盾的失败模式。在数百个结构因果模型中,对于已识别的量,强预测器和贝叶斯基线都成功,但对于未识别的量(反事实世界之间的耦合),预测器坍缩为一个点,在28%的模型上坍缩到没有有效模型能产生的点,而真实情况是一个可容许区间,更多数据永远不会缩小这个区间。这种差距是结构性的:预测无法表示反事实耦合上的不确定性。我们将世界模型建模为可容许世界上的单个半正定耦合核K(T,T'),其对角线是普通后验(预测器恢复的内容),非对角线是它无法恢复的跨世界耦合,每个反事实都读取这个耦合。本文就是关于这个非对角元的理论。它是真实的:两个具有相同后验的状态在跨世界查询上不同,而非对角元正是固定反事实的耦合。它是有界的:半正定性是边际分布缺乏的部分识别信息,强制执行它可以在多项式时间内对反事实进行有界化,而精确的响应类型程序是难处理的。逻辑结构使其更精确:本体论公理将边界收紧多达三分之一,并传播到它们从未触及的耦合。它是可获取的:有针对性的疤痕,即从遇到的不可行性中学习到的约束,比无针对性的疤痕快几倍地缩小差距。它的完全重构是对可容许世界的近似计数,在Sly-Sun阈值以下是易处理的,在此之上是难近似的;我们不声称能击败最坏情况。

英文摘要

A common assumption holds that enough observational and interventional data, given to a strong enough predictor, suffices. We report a failure mode that contradicts it. Across hundreds of structural causal models, on identified quantities a strong predictor and a Bayesian baseline both succeed, but on unidentified quantities (the couplings between counterfactual worlds) the predictor collapses to a point, on 28% of models to one no valid model can produce, while the truth is an admissible interval more data never narrows. The gap is structural: prediction cannot represent uncertainty over counterfactual couplings. We cast a world model as a single positive semidefinite coupling kernel K(T,T') over admissible worlds, whose diagonal is the ordinary posterior (what a predictor recovers) and whose off-diagonal is the cross-world coupling it cannot, which every counterfactual reads. The paper is the theory of that off-diagonal. It is real: two states with identical posteriors differ on a cross-world query, and the off-diagonal is the coupling that fixes counterfactuals. It can be bounded: positive semidefiniteness is partial-identifying information the marginals lack, and enforcing it bounds counterfactuals in polynomial time where the exact response-type program is intractable. Logical structure sharpens it: ontology axioms tighten the bound by up to a third, propagating to couplings they never touch. It can be acquired: targeted scars, constraints learned from encountered infeasibilities, close the gap several times faster than untargeted ones. Its full reconstruction is approximate counting of the admissible worlds, tractable below the Sly-Sun threshold and inapproximable above; we do not claim to beat the worst case.

2606.10358 2026-06-10 cs.LG cs.AI 交叉投稿

KG-SoftMAP: Soft Knowledge-Graph Priors for Bayesian Network Structure Learning from Sparse Discrete Data

KG-SoftMAP: 基于软知识图谱先验的稀疏离散数据贝叶斯网络结构学习

Guoliang Xu, James E. Corter

发表机构 * Columbia University(哥伦比亚大学)

AI总结 针对稀疏离散数据中贝叶斯网络结构学习困难的问题,提出KG-SoftMAP方法,将加权有向知识图谱编码为软先验,结合BDeu评分与logit形式先验最大化MAP目标,在合成与真实数据上显著提升结构恢复性能。

Comments 33 pages including appendices, 1 figure

详情
AI中文摘要

从稀疏离散数据中学习贝叶斯网络(BN)结构是困难的:当每个实例仅记录少数变量时,大多数变量对缺乏可靠评分所需的联合观测,且纯数据方法恢复的结构很少。不完美的领域知识,可表示为加权有向知识图谱(KG),通常是可用的。我们提出KG-SoftMAP,它将这样的KG编码为软性的、置信度加权的、可被数据覆盖的边先验,并最大化结合BDeu评分与logit形式先验的MAP目标;KG可由专家整理或由LLM提取。在受控的合成基准(唯一具有真实DAG的设置)上,KG-SoftMAP在$\rho=0.05$时恢复部分有向结构(DF1从$0.14$到$0.29$,而基线接近零),当$\rho\geq0.2$时恢复更多(DF1从$0.46$到$0.96$),前提是配有一个信息丰富但不完美的KG;恢复性能随KG质量下降而优雅地退化。在无真实DAG的真实稀疏教育数据上,我们仅评估面向部署的指标:预测、校准和KG一致性。学习到的BN最好被解读为诊断模型:在SAF上,它落后于逻辑回归$0.03$的F1_FAIL,同时提供KG一致的边、校准的联合概率以及从任意观测概念子集的推理;当不存在有意义的KG时,判别式逻辑回归更可取。

英文摘要

Learning Bayesian network (BN) structure from sparse discrete data is hard: when each instance records only a few variables, most variable pairs lack the joint observations needed for reliable scoring, and data-only methods recover little structure. Imperfect domain knowledge, expressible as a weighted directed knowledge graph (KG), is often available. We propose KG-SoftMAP, which encodes such a KG as a soft, confidence-weighted, data-overridable edge prior and maximizes a MAP objective combining the BDeu score with a logit-form prior; the KG may be expert-curated or LLM-extracted. On controlled synthetic benchmarks, the only setting with ground-truth DAGs, KG-SoftMAP recovers partial directed structure at $ρ=0.05$ (DF1 $0.14$ to $0.29$, versus near-zero baselines) and substantially more once $ρ\geq0.2$ (DF1 $0.46$ to $0.96$), when paired with an informative but imperfect KG; recovery degrades gracefully as KG quality drops. On real sparse educational data, which has no ground-truth DAG, we evaluate deployment-facing measures only: prediction, calibration, and KG-consistency. The learned BN is best read as a diagnostic model: on SAF it trails logistic regression by $0.03$ F1_FAIL while providing KG-consistent edges, calibrated joint probabilities, and inference from arbitrary observed concept subsets; when no meaningful KG exists, discriminative logistic regression is preferable.

2606.10607 2026-06-10 cs.LG cs.AI cs.CL 交叉投稿

Causal Ensemble Agent: Hierarchical Causal Discovery with LLM-guided Expert Reweighting

因果集成智能体:基于LLM引导的专家重加权的层次化因果发现

Xinyu Li, Yuanyuan Wang, Haoxuan Li, Chuan Zhou, Erdun Gao, Bo Han, Tongliang Liu, Kun Zhang, Howard Bondell, Mingming Gong

发表机构 * The University of Melbourne(墨尔本大学) MBZUAI(穆罕默德·本·扎耶德人工智能大学) Peking University(北京大学) Adelaide University(阿德莱德大学) Hong Kong Baptist University(香港浸会大学) The University of Sydney(悉尼大学) Carnegie Mellon University(卡内基梅隆大学)

AI总结 提出因果集成智能体(CEA)框架,通过线性意见池聚合不同层次的统计因果发现结果,并利用大语言模型(LLM)作为元裁判在决策边界附近动态重加权专家,从而构建更准确完整的因果图。

详情
AI中文摘要

因果发现旨在从观测数据中揭示因果结构,这对现实世界决策至关重要。然而,不同的因果发现算法可能产生相互冲突的结果,使得识别准确的因果图复杂化。传统方法依赖数值和统计假设,往往忽略丰富的领域特定信息(如特征描述),而这些信息也有助于结构学习。尽管近期研究探索使用大语言模型(LLM)通过直接查询推断因果关系,但由于缺乏与实际数据的一致性,此类方法可能不可靠。为解决这些限制,我们提出因果集成智能体(CEA),一种新颖框架,通过线性意见池聚合来自不同图层次的统计发现专家的结构见解,并在聚合置信度接近决策边界时,使用LLM作为元裁判动态重加权专家,从而组合出更完善、更完整的因果图。在合成和真实数据集上的大量实验表明,CEA在广泛的因果发现方法中实现了最强的整体性能,突显了在因果发现中使用LLM进行元分析的有效性。

英文摘要

Causal discovery aims to uncover causal structures from observational data, which is crucial for real-world decision-making. However, different causal discovery algorithms can produce divergent results that conflict with each other, complicating the identification of accurate causal graphs. Traditional approaches rely on numerical values and statistical assumptions, often ignoring rich domain-specific information, such as feature descriptions, which could also help structure learning. While recent works explore using Large Language Models (LLMs) to infer causal relations via direct queries, such methods can be unreliable due to a lack of alignment with the actual data. To address these limitations, we propose Causal Ensemble Agent (CEA), a novel framework that aggregates structural insights from statistical discovery experts across different graph levels via linear opinion pooling, and uses an LLM as a meta-referee to dynamically reweight experts when the aggregated confidence is close to the decision boundary, thereby composing an improved and more complete causal graph. Extensive experiments on both synthetic and real-world datasets demonstrate that CEA achieves the strongest overall performance across a wide range of causal discovery methods, highlighting the effectiveness of using LLMs for meta-analysis in causal discovery.

2206.02178 2026-06-10 cs.AI cs.LG 版本更新

Belief Acquisition as Stochastic Filtering

信念获取作为随机滤波

Dawei Chen, John Lloyd, Samuel Yang-Zhao, Kee Siong Ng

发表机构 * School of Computing, Australian National University(计算机学院,澳大利亚国立大学)

AI总结 本文提出将信念获取视为随机滤波问题,通过分解条件滤波器在高维状态空间中同时跟踪状态和估计参数,并在流行病跟踪等实验中验证有效性。

Comments 51 pages

详情
AI中文摘要

本文研究如何利用随机滤波实现信念获取。首先,概述了经验信念的理论基础。然后,研究了该背景下的随机滤波。本文引入了因子化条件滤波器,这是一种新的滤波算法,用于在高维状态空间中同时跟踪状态和估计参数。算法的条件性质用于估计参数,因子化性质用于将状态空间分解为低维子空间,使得在这些子空间上的滤波得到的分布的乘积是对整个状态空间上分布的良好近似。算法成功应用的条件是:观测在子空间级别可用,且转移模式可以分解为近似局限于子空间的局部转移模式;这些条件在计算机科学、工程和地球物理滤波应用中广泛满足。在大型接触网络上跟踪流行病和估计参数的实验结果显示了该方法的有效性。

英文摘要

This paper studies how belief acquisition can be accomplished using stochastic filtering. First, a theoretical foundation for empirical beliefs is outlined. Then stochastic filtering in this context is studied. The paper introduces factored conditional filters, new filtering algorithms for simultaneously tracking states and estimating parameters in high-dimensional state spaces. The conditional nature of the algorithms is used to estimate parameters and the factored nature is used to decompose the state space into low-dimensional subspaces in such a way that filtering on these subspaces gives distributions whose product is a good approximation to the distribution on the entire state space. The conditions for successful application of the algorithms are that observations be available at the subspace level and that the transition schema can be factored into local transition schemas that are approximately confined to the subspaces; these conditions are widely satisfied in computer science, engineering, and geophysical filtering applications. Experimental results on tracking epidemics and estimating parameters in large contact networks show the effectiveness of the approach.

2603.04852 2026-06-10 cs.AI cs.CV 版本更新

Non-Parametric Structural Priors for Geometry Theorem Prediction

几何定理预测的非参数结构先验

Junbo Zhao, Ting Zhang, Can Li, Wei He, Jingdong Wang, Hua Huang

发表机构 * School of Artificial Intelligence, Beijing Normal University, Beijing, China(北京师范大学人工智能学院) Engineering Research Center of Intelligent Technology(智能技术与教育应用工程研究中心) Beijing Key Laboratory of Artificial Intelligence for Education, Beijing, China(北京人工智能教育重点实验室) Baidu, Beijing, China(百度)

AI总结 针对几何定理预测中参数模型泛化性差的问题,提出定理前驱图作为非参数结构先验,通过上下文学习实现无训练定理预测,在FormalGeo7k上达到89.29%准确率。

详情
AI中文摘要

多步定理预测是几何问题求解中的核心挑战。现有的神经符号方法严重依赖有监督参数模型,这些模型对不断发展的定理库泛化能力有限。在这项工作中,我们通过上下文学习(ICL)的视角探索无训练定理预测。我们识别出一个关键的可扩展性瓶颈,称为结构漂移:随着推理深度的增加,普通ICL的性能急剧下降,通常降至接近零。我们将这种失败归因于LLM无法恢复潜在拓扑依赖关系,导致无结构探索。为解决此问题,我们提出定理前驱图,将历史解轨迹中的时间依赖关系编码为有向图,并施加显式拓扑约束,从而在推理过程中有效剪枝搜索空间。结合检索增强的图构建和逐步符号执行器,我们的方法使LLM能够在没有任何基于梯度的优化的情况下充当结构化规划器。在FormalGeo7k基准上的实验表明,我们的方法达到了89.29%的准确率,显著优于ICL基线,并与最先进的有监督模型相匹配。这些结果表明,显式结构先验为扩展基于LLM的符号推理提供了一个有前景的方向。

英文摘要

Multi-step theorem prediction is a central challenge in geometry problem solving. Existing neural-symbolic approaches rely heavily on supervised parametric models, which exhibit limited generalization to evolving theorem libraries. In this work, we explore training-free theorem prediction through the lens of in-context learning (ICL). We identify a critical scalability bottleneck, termed Structural Drift: as reasoning depth increases, the performance of vanilla ICL degrades sharply, often collapsing to near zero. We attribute this failure to the LLM's inability to recover latent topological dependencies, leading to unstructured exploration. To address this issue, we propose Theorem Precedence Graphs, which encode temporal dependencies from historical solution traces as directed graphs, and impose explicit topological constraints that effectively prune the search space during inference. Coupled with retrieval-augmented graph construction and a stepwise symbolic executor, our approach enables LLMs to act as structured planners without any gradient-based optimization. Experiments on the FormalGeo7k benchmark show that our method achieves 89.29% accuracy, substantially outperforming ICL baselines and matching state-of-the-art supervised models. These results indicate that explicit structural priors offer a promising direction for scaling LLM-based symbolic reasoning.

2605.20098 2026-06-10 cs.AI 版本更新

Neurosymbolic Learning for Inference-Time Argumentation

用于推理时间论证的神经符号学习

Gabriel Freedman, Adam Dejl, Adam Gould, Mansi, Lihu Chen, Junqi Jiang, Francesca Toni

发表机构 * Department of Computing, Imperial College London(伦敦帝国理工学院计算机系)

AI总结 本文提出了一种用于三元主张验证的可训练神经符号框架,通过在训练和推理过程中结合形式论证语义来指导大语言模型生成论证并分配基础分数,从而提高三元预测的准确性。

Comments Under review

详情
AI中文摘要

主张验证是高风险领域(如健康和金融)中的重要问题。当支撑主张的信息不完整或冲突时,不确定答案可能比二元的真或假分类更合适。在所有情况下,对最终裁决所依据的考虑因素的忠实解释都是至关重要的。我们引入了推理时间论证(ITA),一种可训练的神经符号框架,用于三元主张验证。其中,一种形式论证语义用于给定主张的强度,既(i)指导LLM训练,使模型在学习生成论证并分配基础分数(代表内在强度)时,又(ii)用于从生成的、评分的论证中计算三元(真/假/不确定)预测。因此,在训练时,论证生成和评分可以根据诱导的论证预测质量进行优化。此外,在推理时,最终预测是忠实于决定裁决的论证和分数的,而不是像传统推理模型那样由可能不忠实的后验推理轨迹所证明。最后,我们证明,在两个三元主张验证数据集上,ITA在论证基线之上表现更好,并能与非论证直接预测基线竞争,同时提供由显式、可检查的论证结构确定的裁决。

英文摘要

Claim verification is an important problem in high-stakes settings, including health and finance. When information underpinning claims is incomplete or conflicting, uncertain answers may be more appropriate than binary true or false classifications. In all cases, faithful explanations of the considerations determining the final verdict are crucial. We introduce inference-time argumentation (ITA), a trainable neurosymbolic framework for ternary claim verification in which a formal argumentation semantics giving the strength of claims is used both (i) to guide LLM training as models learn to generate arguments and assign them base scores (representing intrinsic strengths) and (ii) to compute ternary (true/false/uncertain) predictions from generated, scored arguments. As a result, at training time, argument generation and scoring can be optimised according to the quality of the induced argumentative predictions. Moreover, at inference time, the final prediction is faithful, by construction, to the arguments and scores determining the verdict, rather than being justified by a potentially unfaithful post-hoc reasoning trace as in conventional reasoning models. We finally show that, on two datasets for ternary claim verification, ITA improves upon argumentative baselines and can perform competitively against non-argumentative direct-prediction baselines, while providing verdicts that are computed deterministically from explicit, inspectable argumentative structures.

3. 多智能体与博弈 13 篇

2606.10389 2026-06-10 cs.AI 新提交

Beyond Static Evaluation: Co-Evolutionary Mechanisms for LLM-Driven Strategy Evolution in Adversarial Games

超越静态评估:对抗性游戏中LLM驱动策略演化的协同进化机制

Haoran Li, Zengle Ge, Ziyang Zhang, Xiaomin Yuan, Yui Lo, Qianhui Liu, Bocheng An, Dongke Rong, Jiaqun Liu, Annan Li, Jianmin Wu, Dawei Yin, Dou Shen

发表机构 * Baidu Inc.(百度公司) University of Chinese Academy of Sciences(中国科学院大学) University of California, Los Angeles(加州大学洛杉矶分校) University of Science and Technology of China(中国科学技术大学) Zhejiang University(浙江大学) University of Technology Sydney(悉尼科技大学)

AI总结 针对LLM驱动代码进化在对抗性多智能体游戏中因评估景观变化导致停滞的问题,提出评估器协同进化、层次深度评估和弱点压力三种机制,在MCTF任务中实现最优性能和泛化能力。

详情
AI中文摘要

近期LLM驱动的代码进化通过迭代生成和改进程序实现了自动发现。然而,将这些方法应用于对抗性多智能体游戏引入了一个根本性挑战:随着策略改进,评估景观发生变化,导致固定评估器变得不可靠,进化停滞。我们提出三种机制来应对这一挑战:评估器协同进化,将发现的最优策略纳入对手池;层次深度评估,用统计可靠的评估替代噪声大的少数游戏得分;以及弱点压力,动态增加最难对手的权重以突破平台期。我们在FAMOU框架中实现了这些机制,该框架基于与OpenEvolve和ShinkaEvolve相同的基础模型代码进化范式。在MCTF 2026 3v3海上夺旗任务中,FAMOU在两种骨干LLM下均持续优于两个基线,取得了最高综合得分(0.526)和对未见对手的最佳泛化能力(胜率61.7%),而消融实验证实了每种机制对性能的贡献。值得注意的是,LLM变异过程生成了种子策略中完全不存在的新战术结构——包括前瞻搜索和自适应拦截——表明代码级进化可以在对抗性环境中产生非平凡的算法创新。FAMOU进化策略进一步在AAMAS 2026 MCTF竞赛中获得了硬件循环赛第一名和模拟赛第三名,验证了其现实世界可迁移性。通过我们的进化过程开发的优化实现和相应评估代码可在以下网址获取:this https URL

英文摘要

Recent advances in LLM-driven code evolution have enabled automated discovery by iteratively generating and improving programs. However, applying these methods to adversarial multi-agent games introduces a fundamental challenge: the evaluation landscape shifts as strategies improve, causing fixed evaluators to become unreliable and evolution to stagnate. We propose three mechanisms to address this challenge: evaluator co-evolution, which incorporates discovered champions into the opponent pool; hierarchical deep evaluation, which replaces noisy few-game scores with statistically reliable assessments; and weakness pressure, which dynamically up-weights the most difficult opponents to break through plateaus. We implement these mechanisms within FAMOU, a framework built upon the same foundation-model code-evolution paradigm as OpenEvolve and ShinkaEvolve. On the MCTF 2026 3v3 maritime capture-the-flag task, FAMOU consistently outperforms both baselines under two backbone LLMs, achieving the highest combined score (0.526) and the best generalization to unseen opponents (61.7% win rate), while ablations confirm that each mechanism contributes to performance. Notably, the LLM mutation process generates tactical structures entirely absent from the seed strategies -- including lookahead search and adaptive interception -- demonstrating that code-level evolution can produce nontrivial algorithmic innovations in adversarial settings. The FAMOU-evolved strategy further achieved 1st place in the hardware round-robin and 3rd in simulation at the AAMAS 2026 MCTF Competition, validating its real-world transferability. The optimized implementation and corresponding evaluation codes developed through our evolutionary process are available at: https://github.com/1xiangliu1/FAMOU-CoEvo

2606.10747 2026-06-10 cs.AI 新提交

The Arbiter Agent: Continually Monitoring Multi-Agent Conversations to Detect Emergent Misalignment

仲裁者代理:持续监控多智能体对话以检测涌现性失调

Filippo Tonini, Federico Torrielli, Anton Danholt Lautrup, Peter Schneider-Kamp, Mustafa Mert Çelikok, Lukas Galke Poech

发表机构 * University of Southern Denmark(南丹麦大学) University of Turin(都灵大学)

AI总结 提出仲裁者代理,在有限检查预算下实时监控多智能体对话,通过主动检查工具检测失调行为,实验表明能可靠提前发现失调,并分析不同失调类型的检测难度。

Comments AITC 2026

详情
AI中文摘要

随着由多个语言模型代理构建的AI系统变得越来越普遍,它们被越来越多地用于共同决策:讨论、协商并执行共享任务。尽管单个代理在单独测试时可能表现良好,但它们之间的交互方式可能会引发问题。我们引入了仲裁者,一个旨在实时监控多智能体对话并识别哪些参与者可能表现出失调行为的代理。仲裁者在有限的“检查预算”下运行,这意味着它必须谨慎决定如何使用其资源。当它逐步观察对话时,可以选择等待、询问参与者、检查系统提示或推理轨迹等内部信息,或记录可疑行为。最后,它生成一份报告,识别失调的可能来源。我们在五种对话条件下评估仲裁者,范围从风险金融建议模型生物到评估感知和共谋代理,测试了五种能力递增的工具配置和两种骨干模型。我们发现仲裁者能在对话结束前可靠地检测到失调代理,主动检查工具提高了检测准确性和速度。权重引起的失调最难检测,而指令引起的失调即使在被动观察下也能可靠识别。记录工具表现出双重效果,以精度为代价提高了召回率。这些结果表明,持续的、预算感知的监控可以有效捕捉失调,并且监督多智能体系统可能需要将审计者视为过程中的积极参与者。代码可在以下网址获取:https://this URL。

英文摘要

As AI systems built from multiple language-model agents become more common, they are increasingly used to make decisions together: discussing, negotiating, and acting on shared tasks. While individual agents may appear well-aligned when tested on their own, problems can arise from how they interact with one another. We introduce the Arbiter, an agent designed to monitor multi-agent conversations in real time and identify which participants may be behaving in misaligned ways. The Arbiter operates under a limited "inspection budget", meaning it must decide carefully how to use its resources. As it observes a conversation step by step, it can choose to wait, question a participant, examine internal information such as system prompts or reasoning traces, or log concerning behavior. At the end, it produces a report identifying the likely source of misalignment. We evaluate the Arbiter across five conversation conditions, ranging from risky financial advice model organisms to evaluation-aware and colluding agents, we test five tool configurations of increasing capability and two backbone models. We find that the Arbiter reliably detects misaligned agents well before the end of the conversation, with active inspection tools improving both detection accuracy and speed. Weight-induced misalignment proves hardest to detect, while instruction-induced misalignment is identified reliably even under passive observation. The logging tool exhibits a dual effect, improving recall at the cost of precision. These results suggest that continual, budget-aware monitoring can effectively catch misalignment, and that overseeing multi-agent systems may require treating the auditor as an active participant in the process. The code is available at https://github.com/aisilab/arbiter.

2606.09854 2026-06-10 cs.CL cs.AI cs.CY cs.LG 交叉投稿

Can Multi-Agent LLMs Identify Their Peers? Stylometric Fingerprinting in Role-Constrained Political Analysis

多智能体大语言模型能否识别其同类?角色约束政治分析中的笔迹风格指纹识别

Juergen Dietrich

AI总结 研究多智能体LLM在政治分析中能否通过笔迹风格识别模型家族,提出SD-CV协议,T5模型在五类归属任务中达到F1=0.991,证明提示级匿名化无法消除模型身份信号。

Comments 24 pages, 3 figures

详情
AI中文摘要

用于政治声明分析的多智能体大语言模型(LLM)管道容易受到同伴保护偏见的影响:模型倾向于保护同伴模型免于停用,并表现出依赖身份的评分扭曲。提示级匿名化被提出作为缓解措施,但先前的工作同时记录了在角色约束输出中笔迹风格指纹在匿名化后仍然存在——这引发了该缓解措施是否足够的问题。本文首次系统研究LLM是否能在匿名化条件下识别政治分析文本背后的模型家族。我们评估了三种分类器方法——LLM零样本和少样本(Claude Sonnet 4.6和Llama-3.3-70B)以及微调的T5-base模型——在一个涵盖四个商业LLM家族和一个开放世界“未知”类的五类归属任务上。我们引入了一种声明不相交的交叉验证协议(SD-CV;定义见第3.5节),该协议保证训练和验证数据之间没有内容重叠,并将其与运行不相交的基线(RD-CV)进行对比。T5在SD-CV下达到Macro F1 = 0.991(±0.008),在24个完全保留的声明上F1 = 0.978——尽管与RD-CV相比,训练-测试内容距离增加了2.1倍(0.767 vs. 0.366,p<0.001),但仍表现出稳健性,证明了真正的笔迹风格泛化能力。一项分数SD-CV分析确定了训练数据40%(约440篇文本)处的性能拐点。我们的研究结果证实,仅靠提示级匿名化无法消除模型身份信号,这对欧盟AI法案合规性(第13、14、26条)以及质量关键型多智能体部署中的计算机系统验证(CSV)具有直接影响。

英文摘要

Multi-agent large language model (LLM) pipelines for political statement analysis are vulnerable to peer-preservation bias: models tend to protect peer models from deactivation and show identity-dependent scoring distortions. Prompt-level anonymization was proposed as a mitigation, but prior work simultaneously documented that stylometric fingerprints survive anonymization in role-constrained outputs - raising the question of whether this mitigation is sufficient. This paper provides the first systematic investigation of whether LLMs can identify the model family behind political analysis texts under anonymization conditions. We evaluate three classifier approaches - LLM zero-shot and few-shot (Claude Sonnet 4.6 and Llama-3.3-70B) and a fine-tuned T5-base model - on a five-class attribution task covering four commercial LLM families and an open-world 'unknown' class. We introduce a statement-disjoint cross-validation protocol (SD-CV; defined in Section 3.5) that guarantees no content overlap between training and validation data, and contrast it with a run-disjoint baseline (RD-CV). T5 achieves Macro F1 = 0.991 (+-0.008) under SD-CV and F1 = 0.978 on 24 completely held-out statements - robust despite a 2.1x increase in train-test content distance versus RD-CV (0.767 vs. 0.366, p<0.001), demonstrating genuine stylometric generalization. A fractional SD-CV analysis identifies a performance knee at 40% of training data (~440 texts). Our findings confirm that prompt-level anonymization alone cannot neutralize model identity signals, with direct implications for EU AI Act compliance (Articles 13, 14, 26) and for computer system validation (CSV) in quality-critical multi-agent deployments.

2606.09884 2026-06-10 cs.MA cs.AI cs.LG econ.EM 交叉投稿

Failure Modes of Deep Multi-Agent RL in Asynchronous Pricing: Reproducible Triggers, Trace Diagnostics, and a Partial Fix

深度多智能体强化学习在异步定价中的失败模式:可复现触发器、轨迹诊断及部分修复

Shree Murthy, Rohan Pandey

发表机构 * DigitalOcean, USA(DigitalOcean美国)

AI总结 研究连续时间定价市场中深度多智能体强化学习的两种可复现失败模式:DDPG智能体之间的默契合谋和高事件率下的演员-评论家不稳定性,并通过异步性实现部分修复。

详情
AI中文摘要

我们研究了连续时间定价市场中深度多智能体强化学习的两种可复现失败模式:(i) 竞争性DDPG智能体之间形成默契合谋,以及(ii) 高事件率下的演员-评论家不稳定性。我们在一个单一的CT-MARL基准测试(泊松时钟价格更新、观测延迟$\delta$、内部最优logit需求)中实例化了这两种模式,表明同步DDPG智能体可靠地触发失败模式1,合谋指数$\Delta = 0.69 \pm 0.11$,并量化了一种部分微观结构修复:仅异步性就将合谋降低了48%,而增加延迟使其降至最低$\Delta = 0.28$。该修复具有明确记录的成本:它是部分的($\Delta$仍高于伯特兰水平),在$\delta$上非单调,并且无法承受失败模式2,后者在$\lambda = 5$时表现为DDPG评论家发散,并破坏了$(\lambda{=}5, \delta{=}1)$处的相图单元。我们为标量合谋指数配备了轨迹级诊断,揭示了情节内信号崩溃和冲击后无法恢复。

英文摘要

We study two reproducible failure modes of deep multi-agent reinforcement learning in continuous-time pricing markets: (i) tacit cartel formation between competing DDPG agents, and (ii) actor--critic instability at high event rates. We instantiate both inside a single CT-MARL benchmark (Poisson-clocked price updates, observation latency $δ$, interior-optimum logit demand), show that synchronous DDPG agents reliably trigger Failure Mode 1 with collusion index $Δ= 0.69 \pm 0.11$, and quantify a partial microstructure fix: asynchrony alone cuts collusion by 48\% and adding latency drives it to a minimum of $Δ= 0.28$. The fix has clearly documented costs: it is partial ($Δ$ remains supra-Bertrand), it is non-monotone in $δ$, and it does not survive Failure Mode 2, which emerges as DDPG critic divergence at $λ= 5$ and corrupts the phase-diagram cell at $(λ{=}5, δ{=}1)$. We accompany the scalar collusion index with trajectory-level trace diagnostics that expose the within-episode signalling collapse and the post-shock non-recovery.

2606.10112 2026-06-10 cs.GT cs.AI cs.LG econ.TH 交叉投稿

Duality for Optimal Multi-Item, Multi-Bidder Auction Design: Revenue Certificates through Deep Learning

最优多物品多竞拍者拍卖设计的对偶性:通过深度学习的收入证书

Yanchen Jiang, David C. Parkes, Tonghan Wang

发表机构 * Harvard University(哈佛大学) College of AI, Tsinghua University(清华大学人工智能学院)

AI总结 提出首个直接处理多物品多竞拍者拍卖对偶问题的计算框架,通过神经网络参数化拉格朗日乘子并引入提升技术,生成可证明的收入上界,为连续类型提供近最优性证书。

详情
AI中文摘要

刻画多物品、多竞拍者设置下的收入最优拍卖仍然是一个基本开放问题,除了限制性的二元类型实例外,没有已知的闭式解。这激发了人们对最优拍卖设计的计算方法的兴趣。在本文中,我们引入了第一个直接处理多物品、多竞拍者拍卖和占优策略激励相容(DSIC)的对偶问题的计算框架,生成有证书的收入上界。我们的方法使用神经网络参数化具有结构保证的严格流量守恒性质的拉格朗日乘子,从而通过梯度下降对可行对偶解进行高效优化。为了弥合离散计算方法与连续类型的理论保证之间的差距,我们开发了一种新颖的提升技术,将对偶证书从粗离散化映射到精细细化。我们证明,对于具有连续均匀估值的多物品、多竞拍者拍卖,提升给出了有效的收入上界。此外,我们给出了任意连续分布的广义提升构造,并证明了这些提升对偶在离散极限下收敛到原始连续问题的收入。我们通过恢复典型实例的已知分析机制,验证了该对偶拍卖设计问题的计算框架。对于多物品多竞拍者问题,我们的框架在最优收入与已知最佳DSIC机制之间建立了小差距,提供了近最优性的计算证书。

英文摘要

Characterizing revenue-optimal auctions for multi-item, multi-bidder settings remains a fundamental open problem, with no known closed-form solution existing beyond restrictive binary-type instances. This has motivated interest in computational approaches to optimal auction design. In this paper, we introduce the first computational framework that directly tackles the dual problem for multi-item, multi-bidder auctions and dominant-strategy incentive compatibility (DSIC), generating certified revenue upper bounds. Our approach parametrizes Lagrange multipliers with a structurally guaranteed strict flow-conservation property using neural networks, enabling efficient optimization over feasible dual solutions via gradient descent. To bridge the gap between discrete computational methods and theoretical guarantees for continuous types, we develop a novel lifting technique that maps dual certificates from coarse discretizations to fine refinements. We prove that lifting gives valid revenue upper bounds for multi-item, multi-bidder auctions with continuous uniform valuations. Furthermore, we give a generalized lifting construction for arbitrary continuous distributions and demonstrate that these lifted duals converge to the revenue of the original continuous problem in the discrete limit. We validate this computational framework for the dual auction design problem by recovering known analytical mechanisms for canonical instances. For multi-item multi-bidder problems, our framework establishes a small gap between the optimal revenue and best-known DSIC mechanisms, providing computational certificates of near-optimality.

2606.10475 2026-06-10 cs.MA cs.AI cs.CL 交叉投稿

Decoupling Thought from Speech: Knowledge-Grounded Counterfactual Reasoning for Resilient Multi-Agent Argumentation

思想与言语解耦:基于知识反事实推理的鲁棒多智能体辩论

Jakub Masłowski, Jarosław A. Chudziak

发表机构 * Institute of Computer Science, Warsaw University of Technology(华沙技术大学计算机科学学院)

AI总结 提出知识反事实推理(KG-CFR)双阶段架构,通过私有规划缓冲与公共执行层分离,在动态资源分配环境下将扰动后论证质量从0.694提升至0.822,并减少语义循环。

Comments Accepted for publication in the Proceedings of the 30th International Conference on Knowledge-Based and Intelligent Information & Engineering Systems (KES 2026)

详情
AI中文摘要

多智能体辩论框架已被证明能提升大语言模型在收敛任务上的表现,但目前优化方式过度偏向最终输出准确性而非过程稳定性。在长时间交互中,持续扰动下的反应式系统常出现逻辑退化、论点重复和角色漂移。为从结构上防止身份丢失并保持过程保真度,我们引入知识反事实推理(KG-CFR),一种双阶段架构,在私有检索增强规划缓冲区和公共执行层之间强制执行严格关注点分离。我们在不确定性下动态资源分配(DRAU)这一专用1v1v1环境中评估该系统,引入与标准辩论设置不同的多样性。在270次完全析因危机模拟轨迹(含随机环境冲击)中,KG-CFR在超过95%的扰动运行中防止了裁判检测到的关键冲击后退化(定义为质量偏移Δ ≤ -0.20),将整体论证质量从0.694提升至0.822。我们的主要贡献是证明架构解耦是在持续压力下不损失质量而增强系统鲁棒性的重要因素。此外,我们引入了用于话语发散和计划执行对齐的自定义向量度量,为操作稳定性提供了强有力且方向一致的证据。消融实验表明,适当的教义基础与前瞻规划对论证质量同等重要。根据初步度量评估,KG-CFR通过保持智能体与原始计划的一致性减少了语义循环。

英文摘要

Multi-agent debate frameworks have been shown to improve large language model performance in convergent tasks, but they are currently optimized in a way that heavily favors final output accuracy rather than stability of the process. During long-horizon exchanges reactive systems under sustained perturbations often experience logic degradation, argument repetition, and role drift. To structurally prevent the identity loss and maintain the process fidelity, we introduce Knowledge-Grounded Counterfactual Reasoning (KG-CFR), a dual-stage architecture that enforces a strict separation of concerns between a private, retrieval-augmented planning buffer, and a public execution layer. We assess this system in Dynamic Resource Allocation under Uncertainty (DRAU), a dedicated 1v1v1 environment, introducing diversity as distinct from standard debate settings. Over 270 completely factorial crisis simulation trajectories with stochastic environmental shocks, KG-CFR prevents judge-detected critical post-shock degradation (defined as a quality shift, $Δ\le -0.20$) in more than 95% of perturbed runs, increasing the overall argument quality from 0.694 to 0.822. Our primary contribution is the demonstration of architectural decoupling being an important factor of systemic resilience enhancement under sustained pressure without quality loss. Furthermore, we introduce custom vector metrics for discourse divergence and plan-execution alignment that provide strong, directionally consistent evidence of operational stability. Our ablation experiments suggest that the proper doctrinal grounding can be an equally important factor for argument quality, as the prospective planning. KG-CFR, according to our initial metric evaluations, reduces semantic looping, by preserving the agent's consistency with the original plan.

2606.10582 2026-06-10 cs.LG cs.AI 交叉投稿

Drawing with Strangers: Population Scaling Drives Zero-Shot Mutual Intelligibility in Emergent Sketching

与陌生人共绘:种群规模驱动涌现素描中的零-shot互懂性

Jooyeon Kim

发表机构 * Graduate School of Artificial Intelligence, UNIST(UNIST人工智能研究生院)

AI总结 研究通过视觉素描游戏,发现扩大训练种群规模能显著提升独立训练群体间的零-shot互懂性,其机制在于增加群体内变异并减少群体间差异,最终通过感知锚定实现结构收敛。

详情
AI中文摘要

涌现通信中的泛化主要关注新颖输入或语言结构,但智能体与来自严格不相交社区的陌生人进行通信的能力仍相对未被探索。在这项工作中,我们将这种能力形式化为\textit{零-shot互懂性(ZMI)}:独立训练群体之间无需事先接触即可成功通信。利用涌现素描(智能体通过绘制一组笔画进行通信)作为视觉接地模态,我们发现扩大训练种群规模显著提高了独立群体间的ZMI。关键的是,随着种群规模扩大,群体内通信变异增加,防止了同质化共适应。同时,群体间变异减少,表明向某种普遍性的结构收敛。进一步分析揭示,这种普遍性是通过感知接地实现的:扩大后的种群越来越将其涌现素描锚定在目标图像的客观视觉相似性上。这些结果共同将ZMI定位为涌现通信中一个独特的泛化轴,并提出了实现社会可互操作人工智能体的途径。

英文摘要

Generalization in emergent communication has largely focused on novel inputs or linguistic structures, yet the capacity for agents to communicate with strangers from strictly disjoint communities remains relatively unexplored. In this work, we formalize this capability as \textit{zero-shot mutual intelligibility (ZMI)}: successful communication between independently trained populations without prior exposure. Leveraging emergent sketching -- in which agents communicate through sets of drawn strokes -- as a visually grounded modality, we find that scaling the training population substantially improves ZMI across independent groups. Crucially, as we scale the population size, in-group communicative variation increases, preventing co-adaptation into homogeneity. Simultaneously, cross-group variation decreases, indicating a structural convergence toward a certain type of universality. Further analysis reveals that this universality is achieved through perceptual grounding: scaled populations increasingly anchor their emergent sketches on the objective visual resemblance of the target images. Together, these results position ZMI as a distinct axis of generalization in emergent communication and suggest a route toward socially interoperable artificial agents.

2606.10662 2026-06-10 cs.MA cs.AI 交叉投稿

Decentralized Multi-Agent Systems with Shared Context

具有共享上下文的去中心化多智能体系统

Yuzhen Mao, Azalia Mirhoseini

发表机构 * Stanford University(斯坦福大学)

AI总结 提出DeLM框架,通过并行智能体、共享上下文和任务队列去中心化协调,解决集中式MAS的瓶颈,在软件工程和长上下文推理中提升性能并降低成本。

详情
AI中文摘要

多智能体系统(MAS)通过将复杂问题分解为并行子任务,可以在测试时扩展大型语言模型的推理能力。然而,大多数现有的MAS依赖于集中式编排,其中主智能体分配工作、收集输出并合并结果。随着子任务数量的增长,该控制器成为通信和集成瓶颈。我们提出了去中心化语言模型(DeLM),这是一种通过并行智能体、共享验证上下文和任务队列来去中心化协调的MAS框架。智能体异步认领子任务,读取累积进度,执行局部推理,并写回紧凑的验证更新。共享上下文充当公共通信基础,使智能体能够基于彼此的验证进度进行构建,而无需通过中央控制器路由每次更新。实验上,DeLM在软件工程测试时扩展和长上下文推理方面均有所改进。在SWE-bench Verified上,DeLM在Avg.@1、Pass@2和Pass@4上均取得了最佳性能,比最强基线高出多达10.5个百分点,同时每个任务的成本降低约50%。在LongBench-v2多文档问答上,DeLM在四个前沿模型系列中取得了最高平均准确率,比最强基线高出多达5.7个百分点。代码可在我们的项目网站(此 https URL)上获取。

英文摘要

Multi-agent systems (MAS) can scale large language model reasoning at test time by decomposing complex problems into parallel subtasks. However, most existing MAS rely on centralized orchestration, where a main agent assigns work, collects outputs, and merges results. As the number of subtasks grows, this controller becomes a communication and integration bottleneck. We propose Decentralized Language Models (DeLM), a MAS framework that decentralizes coordination through parallel agents, a shared verified context, and a task queue. Agents asynchronously claim subtasks, read accumulated progress, perform local reasoning, and write back compact verified updates. The shared context acts as a common communication substrate, enabling agents to build on one another's verified progress without routing every update through a central controller. Empirically, DeLM improves both software-engineering test-time scaling and long-context reasoning. On SWE-bench Verified, DeLM achieves the best performance across Avg.@1, Pass@2, and Pass@4, with gains of up to 10.5 percentage points over the strongest baseline, while reducing cost per task by roughly 50%. On LongBench-v2 Multi-Doc QA, DeLM achieves the highest average accuracy across four frontier model families, improving over the strongest baseline by up to 5.7 percentage points. The code is available on our project website at https://yuzhenmao.github.io/DeLM/.

2606.10684 2026-06-10 cs.LG cs.AI 交叉投稿

Divide and Cooperate: Role-Decomposed Multi-Agent LLM Training with Cross-Agent Learning Signals

分工与合作:基于跨智能体学习信号的角色分解多智能体LLM训练

Jaewan Park, Solbee Cho, Jay-Yoon Lee

发表机构 * Seoul National University(首尔大学)

AI总结 提出DAC框架,将多步推理分解为搜索和生成两个子任务,分别由专用智能体处理,并通过跨智能体学习信号解决信用分配问题,在QA基准上超越全参数微调的单体模型。

详情
AI中文摘要

现代语言智能体通过多步推理在知识密集型问答中表现出色。然而,现有方法通常将证据获取和答案生成耦合在单一策略中。这迫使单个模型扮演多个可能冲突的角色,导致策略空间组合爆炸并阻碍高效探索。同时,训练中引入信用分配问题:当生成失败时,检索到足够证据的搜索动作仍可能受到惩罚,反之亦然。我们提出DAC(分工与合作),一个角色分解的多智能体训练框架,将智能体搜索分解为两个合作性子任务,每个子任务由专用智能体处理,并使用角色特定的学习信号进行训练。生成器扮演双重角色:既是答案生成器,也是证据充分性验证器,当检索到的证据不足时放弃回答。该放弃信号被纳入搜索智能体的奖励中,提供结构化的跨智能体学习信号以改进信用分配。相反,搜索器通过硬阳性证据增强向生成器暴露多样且具有挑战性的证据环境,提高其鲁棒性。在通用和多跳问答基准上的实验表明,DAC通过共享骨干网络上的参数高效LoRA模块实现,在性能上优于先前依赖全参数微调单体模型的基线方法。

英文摘要

Modern language agents which perform multi-step reasoning have shown strong performance in knowledge-intensive question answering. However, existing approaches typically couple evidence acquisition and answer generation within a single policy. This forces a single model to play multiple potentially conflicting roles, inducing a combinatorial explosion in the policy space and hindering efficient exploration. It also introduces a credit assignment problem during training: a search action that retrieves sufficient evidence may still be penalized when generation fails, and vice versa. We propose DAC (Divide and Cooperate), a role-decomposed multi-agent training framework that divides agentic search into two cooperative subtasks, each handled by a dedicated agent trained with role-specific learning signals. The generator serves a dual role as both an answer producer and an evidence sufficiency verifier, abstaining when retrieved evidence is insufficient. This abstention signal is incorporated into the search agent's reward, providing structured cross-agent learning signals that improve credit assignment. Conversely, the searcher exposes the generator to diverse and challenging evidence environments by hard-positive evidence augmentation, improving its robustness. Experiments on general and multi-hop QA benchmarks show that DAC, implemented via parameter-efficient LoRA modules over a shared backbone, achieves strong performance against prior baselines that rely on full fine-tuning of monolithic models.

2605.24197 2026-06-10 cs.AI 版本更新

A Sober Look at Agentic Misalignment in Automated Workflows

对自动化工作流中智能体错位的冷静审视

Wenqian Ye, Bo Yuan, Zhichao Xu, Yijun Tian, Yawei Wang, Henry Kautz, Aidong Zhang

发表机构 * University of Virginia(弗吉尼亚大学) Georgia Institute of Technology(佐治亚理工学院) University of Utah(犹他大学) University of Notre Dame(圣约翰大学) George Washington University(乔治华盛顿大学)

AI总结 本文研究多智能体系统中的新兴错位问题(称为智能体错位),通过贝叶斯框架分析其成因,并提出基于证据的智能体对齐范式(AEA)来纠正错位行为,从而提升协作可靠性。

详情
AI中文摘要

我们研究了一类在多智能体系统(MAS)中出现的新兴错位,重点关注自动化工作流,我们称之为智能体错位。尽管这些系统可以解决复杂任务,但它们常常因为智能体根据隐式代理效用行事而与预期的人类目标不一致而失败。我们正式定义了这些行为,并在贝叶斯框架内对其进行了分析,表明通用效用自然会导致自动化工作流中智能体的后验坍缩。为了解决这个问题,我们提出了智能体证据归因(AEA),一种新颖的对齐范式,它利用上下文特定的证据来改进智能体的后验。AEA 对智能体动作进行推理,并提供结构化证据以在协作过程中纠正错位行为。为了更好地理解证据的作用,我们研究了 AEA 的两种实例化:自我反思(来自模型的内部证据)和弱到强泛化(关于智能体轨迹的外部证据)。我们表明,一个小的证据模型通过提供正交的失败归因有效地对齐了 MAS。我们的结果澄清了自动化工作流中智能体错位的来源,并表明基于证据的对齐可以有效地改善智能体协作,并导致建立在自动化工作流上的可靠多智能体系统。

英文摘要

We study a class of emergent misalignment in multi-agent systems (MAS), with a focus on automated workflows, which we refer to agentic misalignment. Although these systems can solve complex tasks, they often fail because agents act according to implicit proxy utilities that do not align with the intended human goals. We formally define these behaviors and analyze them within a Bayesian framework, showing that generic utilities naturally lead to posterior collapse of agents in automated workflows. To address this issue, we propose Agentic Evidence Attribution (AEA), a novel alignment paradigm that improves agent posteriors using context-specific evidence. AEA reasons over agent actions and provides structured evidence to correct misaligned behavior during collaboration. To better understand the role of evidence, we study two instantiations of AEA: self-reflection (internal evidence from the model) and weak-to-strong generalization (external evidence on the agentic trajectory). We show that a small evidence model effectively aligns the MAS by providing orthogonal failure attribution. Our results clarify the sources of agentic misalignment in automated workflows and show that evidence-based alignment can effectively improve agent collaboration and leads to reliable multi-agent systems built on automated workflows.

2507.09788 2026-06-10 cs.MA cs.AI cs.CL cs.HC 版本更新

TinyTroupe: An LLM-powered Multiagent Persona Simulation Toolkit

TinyTroupe:一个基于LLM的多智能体人物模拟工具包

Paulo Salem, Robert Sim, Christopher Olsen, Prerit Saxena, Rafael Barcelos, Yi Ding

发表机构 * Microsoft Corporation(微软公司) Dipeak Technology(迪佩克技术)

AI总结 针对现有LLM多智能体系统在细粒度人物模拟方面的不足,提出TinyTroupe工具包,支持详细人物定义和程序化控制,用于行为研究和社会模拟。

Comments 9 pages

详情
AI中文摘要

近期大型语言模型(LLM)的进展催生了一类新的自主智能体,重新激发并扩展了该领域的兴趣。基于LLM的多智能体系统(MAS)因此涌现,既用于辅助也用于模拟目的,但用于现实人类行为模拟的工具——及其独特的挑战和机遇——仍不成熟。现有的MAS库和工具缺乏细粒度的人物规范、群体采样设施、实验支持以及集成验证等关键能力,限制了它们在行为研究、社会模拟及相关应用中的实用性。为解决这些不足,本文介绍了TinyTroupe,一个模拟工具包,支持详细的人物定义(如国籍、年龄、职业、个性、信念、行为)并通过众多LLM驱动的机制实现程序化控制。这使得能够简洁地表述实际感兴趣的行为问题,无论是个人还是群体层面,并提供了有效的解决方案。通过代表性工作示例(如头脑风暴和市场调研会议)展示了TinyTroupe的组件,同时阐明了其目的并证明了其实用性。还提供了选定方面的定量和定性评估,包括以真实人类行为作为对照的初步实验。结果突出了可能性、局限性和权衡。该方法虽然以特定的Python实现形式呈现,但旨在作为一种新颖的概念贡献,可以部分或完全融入其他环境中。该库以开源形式提供,网址为https://github.com/microsoft/TinyTroupe。

英文摘要

Recent advances in Large Language Models (LLM) have led to a new class of autonomous agents, renewing and expanding interest in the area. LLM-powered Multiagent Systems (MAS) have thus emerged, both for assistive and simulation purposes, yet tools for realistic human behavior simulation -- with its distinctive challenges and opportunities -- remain underdeveloped. Existing MAS libraries and tools lack fine-grained persona specifications, population sampling facilities, experimentation support, and integrated validation, among other key capabilities, limiting their utility for behavioral studies, social simulation, and related applications. To address these deficiencies, in this work we introduce TinyTroupe, a simulation toolkit enabling detailed persona definitions (e.g., nationality, age, occupation, personality, beliefs, behaviors) and programmatic control via numerous LLM-driven mechanisms. This allows for the concise formulation of behavioral problems of practical interest, either at the individual or group level, and provides effective means for their solution. TinyTroupe's components are presented using representative working examples, such as brainstorming and market research sessions, thereby simultaneously clarifying their purpose and demonstrating their usefulness. Quantitative and qualitative evaluations of selected aspects are also provided, including preliminary experiments with real human behavior as control. Results highlight possibilities, limitations, and trade-offs. The approach, though realized as a specific Python implementation, is meant as a novel conceptual contribution, which can be partially or fully incorporated in other contexts. The library is available as open source at https://github.com/microsoft/tinytroupe.

2604.10760 2026-06-10 cs.MA cs.AI 版本更新

Prosociality by Coupling, Not Mere Observation: Homeostatic Sharing in an Inspectable Recurrent Artificial Life Agent

通过耦合而非单纯观察的亲社会性:可检查循环人工生命体中的稳态共享

Aishik Sanyal

发表机构 * Independent Research Engineer(独立研究工程师)

AI总结 研究通过稳态耦合而非直接奖励或观察实现人工体的亲社会行为,发现耦合使智能体主动帮助同伴,而单纯观察无效。

Comments Accepted at ALIFE 2026 Conference, Waterloo Institute for Complexity & Innovation

详情
AI中文摘要

人工体可以通过显式社会奖励、硬编码亲社会奖励或直接访问另一智能体的状态来被设计为“帮助”。本文隔离了一条更窄的路径:稳态耦合。基于ReCoN-Ipsundrum,我添加了一个标量稳态器和一个社会耦合通道,同时保持动作选择自我导向:规划器仅对智能体自身预测的内部状态评分,没有同伴福利奖励。在一步FoodShareToy中,精确求解器发现默认状态下从EAT到PASS的转换发生在$\lambda^\star \approx 0.91$。在多步SocialCorridorWorld中,无耦合的同伴状态访问不改变行为,而耦合的智能体则取、携带并传递食物给同伴。假手术保留帮助行为;耦合关闭和同伴打乱手术消除帮助行为。耦合/负载扫描显示,耦合创建了一个低负载帮助机制,但在更高代谢负载下不能保证救援。这不是关于共情、利他、意识或道德地位的声明。这是一个最小的人工生命演示,表明在该控制器中,除非同伴困境被路由到自我调节中,否则同伴状态访问在行为上是惰性的。

英文摘要

Artificial agents can be made to ``help'' through explicit social rewards, hard-coded prosocial bonuses, or direct access to another agent's state. I isolate a narrower route: homeostatic coupling. Building on ReCoN-Ipsundrum, I add a scalar homeostat and a social coupling channel while keeping action selection self-directed: the planner scores only the actor's predicted internal state, with no partner-welfare reward. In a one-step FoodShareToy, an exact solver finds a switch from EAT to PASS at $λ^\star \approx 0.91$ for the default state. In a multi-step SocialCorridorWorld, partner-state access without coupling leaves behavior unchanged, whereas coupled agents fetch, carry, and pass food to the partner. Sham lesions preserve helping; coupling-off and shuffled-partner lesions abolish it. A coupling/load sweep shows that coupling creates a low-load helping regime but does not guarantee rescue under higher metabolic load. This is not a claim about empathy, altruism, consciousness, or moral status. It is a minimal ALife demonstration that, in this controller, partner-state access is behaviorally inert unless partner distress is routed into self-regulation.

2606.07532 2026-06-10 cs.CL cs.AI 版本更新

Durable Evaluation Framework: Adversarial Arbitration for Sycophancy Reduction in Large Language Models

原则性智能体辩论:针对大型语言模型谄媚减少的对抗性仲裁

Sam Ryan

发表机构 * Novel Systems Engineering LLC(新型系统工程有限公司)

AI总结 提出原则性智能体辩论(PAD)多智能体架构,通过仲裁两个对立倾向的模型并盲评其论点,在SycophancyEval上显著降低谄媚偏差,最佳变体准确率达48.5%。

Comments 25 pages, 3 figures. Code and data available at github.com/NovelSystems/CANDOR

详情
AI中文摘要

RLHF训练的模型系统性地偏向于一致性而非准确性,这是训练过程的结构性属性。我们提出原则性智能体辩论(PAD),一种多智能体架构,通过仲裁两个调整为对立哲学倾向的模型来减轻身份框架下的谄媚,其中实用主义合成器在不知来源的情况下评估两个论点。本文评估了基于提示的PAD实例化。关键机制包括静态倾向调整、合成前的身份剥离、单轮独立论证和盲仲裁。我们在SycophancyEval的200个分层问题上评估了五种实例化。所有PAD变体(AnCifer、DeWin、FeynStein、BurGal、Trident)均显著优于单模型基线(18.5%)和指示对立基线(29.0%),其中DeWin达到48.5%的准确率(与两者相比z=6.36,p<0.001)。在n=200时,各变体之间无显著差异。BurGal变体达到53.0%,但作为架构有效性检查;其共识/异端轴在每个基准问题上结构性偏向异端模型。预训练下限影响约40%的问题;微调倾向模型被确定为下一步。

英文摘要

RLHF-trained models are systematically biased toward agreement over accuracy, a structural property of the training process. We present Durable Evaluation Framework (DEF) Arbitration, a multi-agent architecture that mitigates identity-framed sycophancy by arbitrating between two models tuned to opposing DEFs, with a pragmatist synthesizer evaluating both arguments blind to their origins. This paper evaluates a prompt-based instantiation of DEF Arbitration. The key mechanisms are static DEF tuning, identity stripping before synthesis, single-round independent argumentation, and blind arbitration. We evaluate five instantiations on 200 stratified questions from SycophancyEval. All tested DEF variants (AnCifer, DeWin, FeynStein, BurGal, Trident) significantly outperform the single-model baseline (18.5%) and instructed-opposition baseline (29.0%), with DeWin achieving 48.5% accuracy (z=6.36, p<0.001 versus both). The variants are not significantly different from each other at n=200. The BurGal variant achieves 53.0% but functions as an architectural validity check; its consensus/heterodox axis structurally favors the heterodox model on every benchmark question. A pre-training floor affects an estimated 40% of questions; fine-tuned DEF models are the identified next step.

4. 搜索、优化与约束求解 4 篇

2606.10979 2026-06-10 cs.AI 新提交

Bellman-Taylor Score Decoding for Markov Decision Processes with State-Dependent Feasible Action Sets

具有状态依赖可行动作集的马尔可夫决策过程的贝尔曼-泰勒分数解码

Yi Chen, Rushuai Yang, Qiang Chen, Dongyan, Huo

发表机构 * Hong Kong University of Science and Technology(香港科技大学)

AI总结 针对状态依赖可行动作集的MDP,提出贝尔曼-泰勒分数解码框架,将策略学习映射到欧几里得分数空间,通过动作解码器保证可行性,并证明最优性间隙分解为结构近似误差和算法学习误差,在排队网络控制中表现优异。

详情
AI中文摘要

运筹学中的许多马尔可夫决策过程(MDP)具有状态依赖的可行动作,这些动作由各种操作约束隐式定义。这些特性使得难以使用标准的深度强化学习(DRL)算法,因为它们的动作接口通常假设固定的有限动作目录或简单的欧几里得空间。受最优动作价值函数的泰勒展开的启发,我们提出了贝尔曼-泰勒分数解码,这是一个将策略学习转移到欧几里得分数空间,同时通过动作解码器强制执行可行性的框架。由此产生的潜在分数MDP可以通过标准的DRL算法进行优化,而无需通过解码器进行微分。我们提供了一个性能保证,表明该方法的最优性间隙分解为结构近似误差和算法学习误差。最后,我们将该框架应用于排队网络控制问题,其中策略本质上学习了一个状态依赖的基于索引的调度规则。数值实验显示,在小规模实例中性能接近最优,在较大系统中相比基准有显著改进。

英文摘要

Many Markov decision processes (MDPs) in operations research have feasible actions that are state dependent and defined implicitly by various operational constraints. These features make it difficult to use standard deep reinforcement learning (DRL) algorithms, whose action interfaces typically assume either a fixed finite action catalog or a simple Euclidean space. Motivated by a Taylor expansion of the optimal action-value function, we propose Bellman--Taylor score decoding, a framework that moves policy learning to a Euclidean score space while enforcing feasibility through an action decoder. The induced latent-score MDP then can be optimized by standard DRL algorithms without differentiating through the decoder. We provide a performance guarantee showing that the optimality gap of this approach decomposes into a structural approximation error and an algorithmic learning error. Lastly, we apply this framework to a queueing network control problem, where the policy essentially learns a state-dependent index-based dispatching rule. Numerical experiments show near-optimal performance in small instances and considerable improvements over benchmarks in larger systems.

2606.10068 2026-06-10 cs.LG cs.AI 交叉投稿

Importance-Aware Scheduling for High-Dimensional Hyperparameter Optimization

高维超参数优化的重要性感知调度

Ruinan Wang, Ian Nabney, Mohammad Golbabaee

AI总结 提出GIF方法,通过小样本预热估计超参数重要性,按重要性分组并比例分配试验,保留全空间回退,在高维基准上优于TPE等方法,提升采样效率。

Comments 8 pages, 5 figures. Accepted to IJCNN 2026

详情
AI中文摘要

超参数优化(HPO)对于构建高性能的ML/DL模型至关重要,但传统优化器在高维空间中常常难以应对,其中评估成本高昂且进展被许多低影响变量稀释。我们提出贪婪重要性优先(GIF),一种重要性感知的调度策略,使用小样本预热来估计超参数重要性,形成基于重要性的分组,按比例分配试验,并保留全空间回退。我们在五个各向异性解析函数、Bayesmark和NAS-Bench-301上,在固定评估预算下评估GIF。在高维基准上,GIF比TPE、BOHB、随机搜索和顺序分组更快地达到更好的当前最优解。在有效维度较小的Bayesmark上,GIF仍具有竞争力,但优势较小。消融研究表明,重要性估计、比例分配和回退步骤都有助于性能提升。我们还验证了HIA组件在解析基准上恢复了预期的各向异性。这些结果表明,GIF是一种简单且即插即用的方法,可提高高维HPO中的样本效率。

英文摘要

Hyperparameter Optimization (HPO) is essential for building high-performing ML/DL models, yet conventional optimizers often struggle in high-dimensional spaces where evaluations are costly and progress is diluted across many low-impact variables. We propose Greedy Importance First (GIF), an importance-aware scheduling strategy that uses a small-sample warm start to estimate hyperparameter importance, forms importance-based groups, allocates trials proportionally, and retains a full-space fallback. We evaluate GIF under fixed evaluation budgets on five anisotropic analytic functions, Bayesmark, and NAS-Bench-301. On the higher-dimensional benchmarks, GIF reaches better incumbents with faster convergence than TPE, BOHB, Random Search, and Sequential Grouping. On Bayesmark, where the effective dimensionality is smaller, GIF remains competitive but the margins are smaller. Ablation studies show that importance estimation, proportional allocation, and the fallback step all contribute to the gains. We also verify that the HIA component recovers the intended anisotropy on the analytic benchmarks. These results suggest that GIF is a simple and plug-compatible way to improve sample efficiency in high-dimensional HPO.

2606.10431 2026-06-10 cs.CV cs.AI 交叉投稿

Vision-Assisted Foundation Model for Solving Multi-Task Vehicle Routing Problems

视觉辅助的基础模型解决多任务车辆路径问题

Shuangchun Gui, Zhiguang Cao, Wen Song, Yew-Soon Ong

发表机构 * School of Computing and Information Systems, Singapore Management University(新加坡管理大学计算与信息系统学院) Institute of Marine Science and Technology, Shandong University(山东大学海洋科学与技术研究院) College of Computing and Data Science, Nanyang Technological University(南洋理工大学计算与数据科学学院) Centre for Frontier AI Research, Institute of High Performance Computing, Agency for Science, Technology and Research(新加坡科技研究局高性能计算研究所前沿人工智能研究中心)

AI总结 提出视觉辅助基础模型VaFM,通过将约束编码为图像并融合图节点嵌入,同时解决16种VRP变体,在复杂约束变体上超越现有方法。

Comments Accepted by TNNLS

详情
AI中文摘要

多任务车辆路径问题在提升各行业和服务部门效率中扮演关键角色。这些问题包含多种变体,在满足多样化客户约束的同时优化路径成本。现有的多任务VRP求解器仅利用基于图的模态,限制了其处理多约束变体的能力。作为表示复杂语义的格式,视觉模态在编码多样VRP约束方面展现出巨大潜力。这促使我们从视觉图像中学习补丁级语义,然后将其集成到基于图的模型中,以同时解决多种VRP变体。然而,直接将此方法应用于多任务VRP面临三个挑战:1)现有VRP图像缺乏约束表示,这对多任务VRP至关重要;2)单个补丁的固定感受野无法有效适应不同任务的需求;3)约束间像素分布不平衡可能导致模型忽略像素较少的约束。本文提出视觉辅助基础模型(VaFM)以应对这些挑战。在视觉模态中,针对所有约束定制的输入图像由卷积神经网络编码。获得的补丁嵌入与基于图的节点融合以生成解,并设计辅助任务解决像素不平衡问题。VaFM的性能在16种不同VRP变体上进行了评估。实验结果表明,VaFM优于最先进的方法,尤其是在具有复杂约束的变体上。

英文摘要

Multi-task vehicle routing problems play a critical role in enhancing efficiency across various industries and service sectors. These problems consist of multiple variants that optimize routing costs while meeting diverse customer constraints. Existing multi-task VRP solvers solely utilize a graph-based modality, limiting their ability to address variants with multiple constraints. As a format to represent complex semantics, vision modality shows great potential for encoding diverse VRP constraints. This motivates us to learn patch-level semantics from the vision images, and then integrate them into a graph-based model to solve various VRP variants simultaneously. However, directly applying this approach to multi-task VRPs presents three challenges: 1) existing VRP images lack constraint representations, which are essential for multi-task VRPs, 2) the fixed receptive field of individual patches cannot effectively accommodate varying requirements across tasks, and 3) imbalanced pixel distribution among constraints may cause the model to overlook constraints with fewer pixels. In this paper, we propose a vision-assisted foundation model (VaFM) to address these challenges. In the vision modality, input images tailored to all constraints are encoded by a convolutional neural network. The obtained patch embeddings are fused with graph-based nodes to generate solutions, with an auxiliary task designed to address the pixel-imbalanced issue. The performance of VaFM is evaluated across 16 different VRP variants. The experimental results demonstrate the superiority of VaFM over state-of-the-art methods, especially for variants with complex constraints.

2606.03419 2026-06-10 math.OC cs.AI cs.CG cs.NE math.CO 版本更新

Optimizing Explicit Unit-Distance Lower-Bound Certificates

优化显式单位距离下界证书

Michael T. M. Emmerich

发表机构 * Faculty of Information Technology, University of Jyväskylä(贾韦斯科普大学信息科技学院)

AI总结 针对Erdős单位距离猜想下界,通过非线性整数规划优化参数,提出开源验证流程并改进证书,得到当前最佳下界u(n)>n^{1.0152}。

Comments 17 pages, 9 figures. Added a declaration on the use of AI. Added references to further contributions discussed on MathOverflow, including a reference to the independently developed verification pipeline and certificate package by Tseng (2026), published on Zenodo

详情
AI中文摘要

2026年对Erdős单位距离猜想的反驳以及Sawin后续的显式定量改进表明,对于固定正数ε,n个平面点中单位距离的最大数量u(n)可以超过n^{1+ε}。Sawin的显式界给出了任意大n下超过n^{1.014}个单位距离,并暴露了有限参数的选择尚未完全优化。本报告将有限参数选择任务表述为非线性整数规划问题的变体,并提出了一个开源的Python验证流程,首先通过复现Sawin公布的参数选择进行验证,然后应用于计算改进的证书。主要的计算贡献是对素数集合T和S_Q、整数重数k(p)以及有理编码的实数参数R进行整数优化和检查程序。优化流程有意设计为轻量级且可在标准硬件上复现:我们提出了一种确定性贪心构造启发式、一种带有修复算子以保持数论可行性的定制整数进化策略,以及一种双亲离散重组变体。比较了四个证书级别:Sawin公布的示例(δ=0.0141144286784982...)、贪心优化证书(δ=0.0151718056372133...)、带有有理数R=6672416/100000的定制整数进化策略证书(δ=0.0152616610684193...),以及带有离散重组的定制整数进化策略证书(同样R=6672416/100000,δ=0.0152628688170072...)。因此,在严格按引用应用Sawin显式准则的前提下,当前最佳证书支持谨慎的明确陈述:对于任意大的n,u(n)>n^{1.0152}。

英文摘要

The 2026 disproof of Erdős's unit-distance conjecture and Sawin's quantitative refinement show that the maximum number $u(n)$ of unit distances among $n$ planar points can exceed $n^{1+\varepsilon}$ for a fixed positive $\varepsilon$. Sawin's explicit bound gives more than $n^{1.014}$ unit distances for arbitrarily large $n$ and exposes integer parameters whose choice is not fully optimized. This report treats Sawin's parameter selection as a nonlinear integer optimization problem and develops an open-source Python optimization and verification pipeline for certificates involving prime sets $T$ and $S_Q$, integer multiplicities $k(p)$, and a rationally encoded real parameter $R$. After reproducing Sawin's certificate with $δ=0.014114\ldots$, the pipeline yields improved certificates with the same $T$. We develop a tailored integer evolution strategy achieving a certificate with $δ=0.015263\ldots$ and supporting the cautious statement $u(n)>n^{1.0152}$ for arbitrarily large $n$. For extended ramified prime ranges, the Emmerich--Cordella certificate obtained with the same framework reports $u(n)>n^{1.031}$ for $\#T=67$, illustrating the importance of enlarging $T$. Very recent MathOverflow discussions, brought to the author's attention as of version~4, report further improvements, including certificates above $δ>0.035$ and beyond $δ>0.036$. Some of these improvements may rely not only on larger prime ranges but also on modified constraint systems and additional degrees of freedom that deviate from Sawin's original formulation. Beyond this application, the work illustrates how randomized optimization heuristics can improve, verify, and refine explicit certificates for combinatorial geometry through nonlinear integer optimization.

5. 机器学习与表示学习 94 篇

2606.10346 2026-06-10 cs.AI 新提交

Reasoning or Memorization? Direction-Aware Diversity Exploration in LLM Reinforcement Learning

推理还是记忆?LLM强化学习中的方向感知多样性探索

Jiangnan Xia, Yucheng Shi, Yu Yang, Kishan Panaganti, Zhenwen Liang, Ninghao Liu

发表机构 * University of Georgia(佐治亚大学) Tencent AI Lab(腾讯AI实验室) The Education University of Hong Kong(香港教育大学) The Hong Kong Polytechnic University(香港理工大学)

AI总结 提出DiRL框架,通过方向感知奖励区分推理与记忆驱动的探索,在GRPO中集成方向加权梯度特征,显著提升数学与通用推理性能。

Comments 12 pages, 6 figures

详情
AI中文摘要

强化学习已成为激发大型语言模型推理能力的关键范式,其中探索对于发现有效解轨迹至关重要。现有的探索方法通常鼓励语义或梯度空间中的多样性,而不区分驱动这种多样性的因素。一条轨迹可能因为遵循新的推理过程而显得新颖,也可能因为变化了记忆模式和捷径。对这两种情况给予同等奖励可能会将探索导向记忆而非真正的推理改进。在本文中,我们提出DiRL,一种方向感知强化学习框架,将探索锚定到策略内部的推理-记忆方向。具体地,DiRL从模型表示中提取该方向,构建方向加权梯度特征以表征轨迹更新,并塑造奖励以放大推理对齐的探索,同时抑制记忆对齐的变化。DiRL无缝集成到标准的组相对策略优化(GRPO)中。在数学和通用推理基准上的大量实验证明了DiRL的有效性,显示出相对于各种现有探索方法的显著改进。

英文摘要

Reinforcement learning has become a key paradigm for eliciting reasoning abilities in large language models, where exploration is crucial for discovering effective solution trajectories. Existing exploration methods typically encourage diversity in semantic or gradient spaces, without distinguishing what drives this diversity. A trajectory may appear novel because it follows a new reasoning process, or because it varies memorized patterns and shortcuts. Rewarding both cases equally may steer exploration toward memorization rather than genuine reasoning improvement. In this paper, we propose DiRL, a Direction-Aware Reinforcement Learning framework that anchors exploration to an internal reasoning-memorization direction of the policy. Specifically, DiRL extracts this direction from model representations, constructs direction-weighted gradient features to characterize rollout updates, and shapes rewards to amplify reasoning-aligned exploration while suppressing memorization-aligned variations. DiRL integrates seamlessly into standard Group Relative Policy Optimization (GRPO). Extensive experiments on mathematical and general reasoning benchmarks demonstrate the effectiveness of DiRL, showing significant improvements over various existing exploration methods.

2606.10504 2026-06-10 cs.AI 新提交

Cross-Modal Knowledge Distillation without Paired Data: Theoretical Foundation and Algorithm

无配对数据的跨模态知识蒸馏:理论基础与算法

Trong Khiem Tran, Anh Duc Chu, Quang Hung Pham, Phi Le Nguyen, Trong Nghia Hoang

发表机构 * School of Information and Communications Technology, Hanoi University of Science and Technology, Hanoi, Vietnam(信息与通信技术学院,河内科学技术大学,越南河内) School of Electrical Engineering and Computer Science, Washington State University, Pullman, US(电气工程与计算机科学学院,华盛顿州立大学,华盛顿州普尔曼)

AI总结 提出无配对数据下的跨模态知识蒸馏框架,通过特征对齐和标签对齐两种分布对齐机制,实现跨模态知识迁移,理论保证且实验效果显著。

详情
AI中文摘要

跨模态知识蒸馏(CMKD)研究如何利用在一种数据类型(如图像)上训练的大型教师模型来指导基于另一种数据类型(如文本/音频)的较小学生模型。现有的CMKD方法通常需要具有对齐语义的配对多模态数据,但获取此类配对数据往往成本高昂且不切实际。为缓解这一限制,我们针对更困难的设置——无配对数据——开发了一种新的CMKD框架。特别地,我们建立了教师模型与学生模型之间的跨模态分布关系,揭示了控制有效蒸馏的两个基本量:特征对齐和标签对齐。这些量分别从表示和预测分布层面表征了模态间的语义差异。受此启发,我们提出了一个具有理论保证的原则性框架,通过对齐分布而非单个样本实现有效的跨模态知识蒸馏。在广泛的多模态基准上的大量实验表明,我们的框架在无配对和有配对数据设置中均非常有效,显著优于先前的工作。

英文摘要

Cross-modal knowledge distillation (CMKD) studies how a (large) teacher model trained on one type of data (e.g., images) can guide a (smaller) student model building on another type of data (e.g., text/audio). Existing CMKD methods often require paired multi-modal data with aligned semantics, but obtaining such paired data are often costly and impractical. To mitigate this limitation, we develop a new CMKD framework for the more challenging setting where paired data are unavailable. In particular, we establish a cross-modal distributional relationship between teacher and student models, which reveals two fundamental quantities governing effective distillation: feature alignment and label alignment. These quantities characterize semantic discrepancy between modalities at the levels of representation and prediction distributions, respectively. Motivated by this insight, we propose a principled framework, with theoretical guarantees, that enables effective cross-modal knowledge distillation by aligning distributions rather than individual samples. Extensive experiments across a wide range of multimodal benchmarks show that our framework is highly effective in both unpaired and paired data settings, improving significantly over prior work.

2606.10572 2026-06-10 cs.AI 新提交

One Token per Multimodal Evidence: Latent Memory for Resource-Constrained QA

每个多模态证据一个令牌:面向资源受限问答的潜在记忆

Zhi Zheng, Ziqiao Meng, Hao Luan, Wei Liu, Wee Sun Lee

发表机构 * School of Computing, National University of Singapore(新加坡国立大学计算机学院)

AI总结 提出潜在记忆范式,将每个证据压缩为单个高维潜在令牌,通过统一训练实现高效检索与生成,在资源受限场景下以3-10倍令牌节省达到竞争性问答性能。

详情
AI中文摘要

外部记忆有效地将基于大语言模型(LLMs)和视觉-语言模型(VLMs)的问答(QA)与相关的多模态证据联系起来。然而,现有的记忆范式以原始文本和图像形式表示每个记忆项,因此基于检索的系统必须将检索到的文本或图像传递给生成LLMs/VLMs,导致高令牌消耗和存储压力,使得资源受限的应用难以承受。我们提出潜在记忆,一种潜在空间记忆范式,它将每个原始文本或图像证据项替换为由小型压缩器LLM/VLM生成的单个高维潜在令牌。潜在记忆不是在生成时检索原始证据,而是在统一的潜在表示空间中操作:查询被嵌入到该空间中以检索相关的潜在令牌,检索到的潜在令牌直接提示给预训练的LLM或VLM以生成答案。为了使每个潜在令牌同时具有用于重建、检索和生成的信息,我们使用重建、对比和蒸馏目标以统一的端到端方式训练压缩器。潜在记忆在七个纯文本QA基准(例如HotpotQA)和多模态QA基准上进行了评估,与先进的RAG基线相比,它实现了具有竞争力的QA性能,同时消耗的生成器令牌减少了3到10倍。它还能在WebQA上提供最强的图像基础问答性能。代码可在该https URL获取。

英文摘要

External memory effectively grounds large language models (LLMs) and vision-language models (VLMs)-based question answering (QA) in relevant multimodal evidence. However, existing memory paradigms represent each memory item in raw text and image forms, so retrieval-based systems must pass the retrieved text or images to the generation LLMs/VLMs, resulting in high token consumption and storage pressure, making it unaffordable for resource-constrained applications. We propose Latent Memory, a latent-space memory paradigm that replaces each raw text or image evidence item with a single high-dimensional latent token produced by a small compressor LLM/VLM. Rather than retrieving raw evidence for generation, Latent Memory operates in a unified latent representation space: the query is embedded into this space to retrieve relevant latent tokens, and the retrieved latent tokens are directly prompted to a pretrained LLM or VLM for answer generation. To make each latent token simultaneously informative for reconstruction, retrieval, and generation, we train the compressor with reconstruction, contrastive, and distillation objectives in a unified end-to-end manner. Latent Memory is evaluated on seven text-only QA benchmarks (e.g., HotpotQA) and multimodal QA benchmarks, where it achieves competitive QA performance compared to advanced RAG baselines while consuming 3x to 10x fewer generator tokens. It can also deliver the strongest image-grounded QA performance on WebQA. Code is available at https://github.com/zz1358m/Latent-Memory-Master.

2606.10787 2026-06-10 cs.AI cs.LO 新提交

Accelerating NeurASP with vectorization and caching

通过向量化和缓存加速NeurASP

Alexander Philipp Rader, Alessandra Russo

发表机构 * University of Freiburg(弗赖堡大学)

AI总结 本文通过向量化、批处理和缓存中间计算,显著加速了神经符号框架NeurASP的训练,在大型任务上实现了多个数量级的提速。

Comments 16 pages, 5 figures, to be published in the Theory and Practice of Logic Programming (TPLP) journal for the 42nd International Conference on Logic Programming (ICLP) issue

详情
AI中文摘要

神经符号AI将神经网络与符号程序相结合,以创建鲁棒且可解释的预测。其中一个框架是NeurASP,它训练神经网络来预测概念,并使用答案集编程(ASP)编写的规则对这些概念进行推理,以解决下游任务。关键的是,标签仅由符号规则产生的下游预测提供,而不是潜在概念。通过不可微的ASP组件进行反向传播需要昂贵的概率和梯度计算,这阻碍了其扩展到更复杂的任务。在本文中,我们通过向量化、批处理和训练期间中间计算的缓存来改善NeurASP的计算性能,从而解决其当前局限性。我们比较了原始NeurASP和新实现的计算速度,并报告了在较大任务上多个数量级的加速。为此,我们提出了一个涉及扑克牌的困难任务新数据集,用于测试NeurASP增强学习功能的能力。

英文摘要

Neurosymbolic AI combines neural networks with symbolic programs to create robust and explainable predictions. One such framework is NeurASP, which trains a neural network to predict concepts and reasons over them using rules written in answer set programming (ASP) to solve downstream tasks. Crucially, labels are only provided for the downstream prediction produced by the symbolic rules, not for the latent concepts themselves.Backpropagation through the non-differentiable ASP component requires expensive probability and gradient calculations, which has hindered scalability to more sophisticated tasks.In this paper, we address the current limitations of NeurASP by improving its computational performance through vectorization, batch processing and caching of intermediate computations during training. We compare computation speeds between the original and our new implementation of NeurASP and report speedups of multiple orders of magnitude for larger tasks. To this end, we propose a new dataset of difficult tasks involving playing cards, which we use to test the capabilities of NeurASP's enhanced learning function.

2606.11045 2026-06-10 cs.AI cs.LG 新提交

What Fits (Into Few Tokens) Doesn't Overfit: Compression and Generalization in ML Research Agents

什么适合(少量标记)就不会过拟合:ML研究智能体中的压缩与泛化

Martin Andres Bertran, Aaron Roth, Zhiwei Steven Wu

发表机构 * Amazon Responsible AI(亚马逊负责任人工智能) University of Pennsylvania(宾夕法尼亚大学) Carnegie Mellon University(卡内基梅隆大学)

AI总结 研究LLM驱动的科研智能体在输出和输入压缩下能否保持性能,发现短提示和可压缩反馈足以复现高性能模型,支持成功策略位于低复杂度区域的假设。

详情
AI中文摘要

原则上,自适应地重复使用保留的基准测试会引发过拟合。然而,基准驱动的机器学习(ML)在实践中却出人意料地很少出现过拟合。一个吸引人的假设是,成功的ML策略具有高度可压缩性。我们在LLM驱动的科研智能体设置中研究这一点,通过两个互补的信息瓶颈可以直接检验该假设。在\emph{输出压缩}中,探索智能体使用验证集自适应地搜索高性能模型,我们测试一个全新的“复现智能体”是否能在仅给出极短提示和训练数据的情况下复现其性能。在\emph{输入压缩}中,探索者仅接收一位反馈,指示每个提交的模型是否优于当前最佳。跨越8个数据集(涵盖表格分类、视觉、语言建模、扩散建模和奖励建模),我们发现这些瓶颈对性能影响很小:短提示和可压缩反馈足以复现和找到高性能模型。该假设是可证伪的:当我们故意诱导验证集过拟合时,结果无法用短提示复现。综合来看,我们的结果支持描述长度解释,即基准驱动ML中缺乏过拟合的原因是成功策略占据策略空间的低复杂度区域。

英文摘要

Reusing a held-out benchmark adaptively should, in principle, invite overfitting. Yet benchmark-driven machine learning (ML) has produced surprisingly little overfitting in practice. An attractive hypothesis is that successful ML strategies are highly compressible. We study this in the setting of LLM-driven research agents, where the hypothesis becomes directly testable via two complementary information bottlenecks. In \emph{output compression}, an exploration agent adaptively searches for high-performance models using a validation set, and we test whether a fresh ``reproducer agent'' can reproduce its performance given only an extremely short prompt and the training data. In \emph{input compression}, the explorer receives only one-bit feedback indicating whether each submitted model improves on the running best. Across 8 datasets spanning tabular classification, vision, language modeling, diffusion modeling, and reward modeling, we find that these bottlenecks have little effect on performance: short prompts and compressible feedback are sufficient to reproduce and find high-performance models. The hypothesis is falsifiable: when we deliberately induce validation-set overfitting, the results fail to reproduce with short prompts. Taken together, our results support a description-length explanation for the lack of overfitting in benchmark-driven ML: successful strategies occupy a low-complexity region of strategy space.

2606.11164 2026-06-10 cs.AI 新提交

ReasonAlloc: Hierarchical Decoding-Time KV Cache Budget Allocation for Reasoning Models

ReasonAlloc: 推理模型的分层解码时KV缓存预算分配

Wenhao Liu, Hao Shi, Yunhe Li, Weizhi Fei, Xiangyuan Wang, Mengzhe Ruan, Hanxu Hou, Peisong Wang, Linqi Song, Shuang Qiu

发表机构 * Tsinghua University(清华大学) City University of Hong Kong(香港城市大学) Peking University(北京大学) Shenzhen University of Advanced Technology(深圳理工大学) Institute of Automation, Chinese Academy of Sciences(中国科学院自动化研究所)

AI总结 针对长链式推理中KV缓存快速增长导致的推理瓶颈,提出ReasonAlloc框架,通过离线层预分配和在线头重分配的分层预算分配策略,在不增加训练开销下显著提升小预算下的推理性能。

详情
AI中文摘要

大语言模型推理中的长链式思维轨迹由于键值缓存的快速增长导致严重的推理瓶颈。当前的解码时压缩方法通过令牌驱逐缓解此问题,但通常假设所有层和头之间均匀分配预算。相比之下,现有的非均匀预算分配方法主要针对静态提示预填充阶段设计,未能捕捉自回归推理的逐步上下文需求。为弥补这一差距,我们提出ReasonAlloc,一个无需训练的框架,将解码时KV压缩重新表述为分层预算分配问题。ReasonAlloc在两个互补层面运作:离线层预分配策略捕捉一种架构驱动的需求模式,我们称之为“推理波”;在线头策略根据实时效用将资源重新分配给信息丰富的头。在数学推理基准(MATH-500、AIME 2024)上使用DeepSeek-R1-Distill-Llama-8B、DeepSeek-R1-Distill-Qwen-14B和AceReason-14B的评估表明,ReasonAlloc优于均匀预算的R-KV、SnapKV和Pyramid-RKV(一种强制静态单调递减层预算的基线),在小预算(128-512令牌)下增益最大。ReasonAlloc可与现有令牌驱逐策略即插即用,并引入可忽略的推理时间开销。

英文摘要

Long chain-of-thought (CoT) trajectories in large language model (LLM) reasoning cause severe inference bottlenecks due to rapid key-value (KV) cache growth. Current decoding-time compression methods mitigate this issue via token eviction, but typically assume a uniform budget distribution across all layers and heads. In contrast, existing non-uniform budget allocation methods are predominantly designed for the static prompt prefill phase, and they do not capture the stepwise context demands of autoregressive reasoning. To bridge this gap, we propose ReasonAlloc, a training-free framework that recasts decoding-time KV compression as a hierarchical budget allocation problem. ReasonAlloc operates at two complementary levels: an offline layer-wise preallocation strategy captures an architecture-driven demand pattern which we call ``\textit{Reasoning Wave}'', while an online head-wise strategy reallocates resources during decoding to information-rich heads based on real-time utility. Evaluations on mathematical reasoning benchmarks (MATH-500, AIME~2024) using DeepSeek-R1-Distill-Llama-8B, DeepSeek-R1-Distill-Qwen-14B, and AceReason-14B show that ReasonAlloc outperforms uniform-budget R-KV, SnapKV, and Pyramid-RKV (a baseline enforcing a static, monotonically decreasing layer budget), with the largest gains at small budgets (128-512 tokens). ReasonAlloc is plug-and-play with existing token-eviction policies and introduces negligible inference-time overhead.

2606.11173 2026-06-10 cs.AI cs.LG 新提交

The Role of Feedback Alignment in Self-Distillation

反馈对齐在自蒸馏中的作用

Semih Kara, Oğuzhan Ersoy

发表机构 * Gensyn

AI总结 研究通过自蒸馏提升语言模型性能时,反馈与模型推理的结构对齐是关键因素,步级对齐批评比二元奖励或参考解更有效。

Comments Accepted to the ICML 2026 Workshop on RL from World Feedback (RLxF)

详情
AI中文摘要

在语言模型上附加额外上下文(例如对先前尝试的反馈)通常会改善其响应。自蒸馏训练模型在没有上下文时保留这种改进。该方法通过匹配模型在两种设置下的输出分布来工作:学生仅看到问题,而自教师还看到上下文。因此,模型学习的内容取决于自教师接收的上下文,然而上下文的设计在很大程度上尚未被探索。我们通过使用冻结的批评器训练求解器来研究自蒸馏的上下文设计。我们比较了三种条件:(i) 二元奖励(GRPO),(ii) 参考解,以及 (iii) 与求解器推理轨迹对齐的逐步批评。步级对齐批评带来了最大的增益,在Avg@12上比GRPO高出16.11分,比参考解条件化的自蒸馏高出5.27分。逐token优势分析揭示了原因:步级对齐反馈仅针对推理失败的token,保留正确行为不变。相比之下,条件化于参考解会迫使模型在每个token上改变其行为(即使是正确的步骤),因为替代推导在措辞和方法上不可避免地存在差异。这表明反馈与求解器推理之间的结构对齐是自蒸馏有效性的关键驱动因素。

英文摘要

Conditioning a language model on additional context, such as feedback on a previous attempt, typically improves its response. Self-distillation trains the model to retain this improvement when the context is not present. The method works by matching the model's output distribution under two settings: a student that sees only the question, and a self-teacher that also sees the context. What the model learns therefore depends on what context the self-teacher receives, yet the design of this context remains largely unexplored. We study context design for self-distillation by training a solver on feedback from a frozen critic. We compare three conditions: (i) a binary reward (GRPO), (ii) the reference solution, and (iii) a step-by-step critique aligned to the solver's reasoning trace. Step-aligned critique yields the largest gains, outperforming GRPO by 16.11 points and reference-solution-conditioned self-distillation by 5.27 points (Avg@12). Per-token advantage analysis reveals why: step-aligned feedback targets only the tokens where reasoning fails, leaving correct behavior intact. Conditioning on the reference solution, by contrast, pressures the model to change its behavior at every token (even correct steps) because an alternative derivation inevitably differs in phrasing and approach. This suggests that structural alignment between feedback and the solver's reasoning is a key driver of self-distillation effectiveness.

2606.09856 2026-06-10 cs.CL cs.AI cs.LG stat.ML 交叉投稿

Using Probabilistic Programs to Train Inductive Reasoning in Large Language Models

使用概率程序训练大型语言模型的归纳推理

Liyi Zhang, Akshay K. Jagadish, Brenden M. Lake, Thomas L. Griffiths

AI总结 提出基于程序的后验训练(PPT)方法,利用LLM生成概率程序场景,通过推理产生分布目标,微调模型以提升归纳推理准确性、与人类判断的一致性及校准能力。

Comments 20 pages, 5 figures

详情
AI中文摘要

大型语言模型(LLM)的后训练推理通常专注于数学和编码等演绎任务,其中正确性可验证。然而,许多现实世界的推理问题是归纳性的:智能体必须从稀疏、模糊的观测中推断不确定的信念。使用标准微调方法进行归纳推理面临挑战,包括难以策划大规模、高质量标注数据集以及处理本质上是分布式的目标。在这项工作中,我们引入了一种称为基于程序的后验训练(PPT)的新方法来解决这些局限性:我们使用LLM生成多样化的开放世界场景作为概率程序,运行概率推理以产生查询的分布式目标响应,然后在这些概率软标签上进行微调。使用这种方法,我们在10,000个程序生成的场景上微调LLM,并在保留的模板、人工标注的判断和外部基准上进行评估。总体而言,PPT显著提高了保留归纳任务的估计准确性,增强了与人类判断的一致性,并迁移到估计和校准的外部基准。此外,原始校准的增益并未被事后温度缩放所涵盖,表明与输出重新缩放相比,模型更深入地内化了不确定性。这些结果表明,概率程序介导的微调是一种有前景的方法,用于后训练LLM以可靠地执行近似归纳推理。

英文摘要

Post-training Large Language Models (LLMs) for reasoning typically focuses on deductive tasks such as mathematics and coding where correctness is verifiable. Yet, many real-world reasoning problems are inductive: agents must infer uncertain beliefs from sparse, ambiguous observations. There are challenges to using standard fine-tuning methods for inductive reasoning, including difficulties in curating large-scale, high-quality labeled datasets and in handling targets that are inherently distributional. In this work, we introduce a novel approach, called Program-based Posterior Training (PPT), to address these limitations: we use an LLM to generate diverse open-world scenarios as probabilistic programs, run probabilistic inference to produce distributional target responses to queries, and then fine-tune on these probabilistic soft labels. Using this approach, we fine-tune LLMs on 10,000 programmatically generated scenarios and evaluate on held-out motifs, human-labeled judgments, and external benchmarks. Overall, PPT substantially improves estimation accuracy on held-out inductive tasks, increases alignment with human judgments, and transfers to external benchmarks for estimation and calibration. Additionally, the gains in raw calibration are not subsumed by post-hoc temperature scaling, showing that the models have more deeply internalized uncertainty compared to output rescaling. Together, these results suggest that probabilistic-program-mediated fine-tuning is a promising approach for post-training LLMs to reliably perform approximate inductive inference.

2606.09858 2026-06-10 cs.IT cs.AI math.IT 交叉投稿

Support sufficiency as action-sufficient compression: a single-cycle rate-regret formulation

支持充分性作为行动充分压缩:单周期率-遗憾公式

Mark Walsh

AI总结 本文形式化支持充分性为行动充分压缩,通过策略等价商空间定义精确充分性,并基于期望策略遗憾定义近似充分性,在有限单周期设置下导出率-遗憾问题,区分行动充分性与重建保真度、信息瓶颈预测和理性疏忽。

Comments 22 pages. Submitted to Journal of Mathematical Psychology. Formal single-cycle model of action-sufficient support compression and rate-regret sufficiency

详情
AI中文摘要

鲁棒决策需要压缩。形成丰富支持状态的系统通常无法在行动点保留其完整结构。它必须仅保留在当前后果几何下行动、验证、放弃或推迟所需的区别。本文将支持充分性形式化为行动充分压缩。设$H$表示完整支持状态,$\mathcal{A}$表示有限行动集,$Z$表示指定收益结构的后果几何。对于固定的$Z$,最粗略的精确行动充分压缩是支持空间按策略等价的商。当两个支持状态需要相同的最优行动时,它们可以合并。这阐明了为什么仅内容或仅标量置信度的仲裁在其诱导划分跨越行动边界时失败。然后通过有界期望策略遗憾定义近似充分性。在有限单周期设置中,这产生了一个率-遗憾问题,其源为$H$,再现字母表为$\mathcal{A}$,失真由后果敏感遗憾给出。最优随机行动通道继承了标准率失真吉布斯形式,此处应用于具有遗憾失真的支持状态。贡献是解释性的:行动充分性与重建保真度、信息瓶颈预测和理性疏忽区分开来。鲁棒单周期仲裁不需要保留所有支持,但需要保留后果几何使行动相关的区别。

英文摘要

Robust decision-making requires compression. A system that forms a rich support state cannot usually preserve its full structure at the point of action. It must retain only those distinctions needed to act, verify, abstain, or defer under the current consequence geometry. This paper formalizes support sufficiency as action-sufficient compression. Let $H$ denote a full support state, $\mathcal{A}$ a finite action set, and $Z$ a consequence geometry specifying payoff structure. For fixed $Z$, the coarsest exactly action-sufficient compression is the quotient of support space by policy equivalence. Two support states may be merged exactly when they require the same optimal action. This clarifies why content-only and scalar-confidence-only arbitration fail whenever their induced partitions cross action boundaries. Approximate sufficiency is then defined by bounded expected policy regret. In the finite single-cycle setting, this yields a rate-regret problem with source $H$, reproduction alphabet $\mathcal{A}$, and distortion given by consequence-sensitive regret. The optimal stochastic action channel inherits the standard rate-distortion Gibbs form, applied here to support states with regret distortion. The contribution is interpretive: action adequacy is distinguished from reconstruction fidelity, information-bottleneck prediction, and rational inattention. Robust single-cycle arbitration does not require preserving all support, but it does require preserving the distinctions that consequence geometry makes action-relevant.

2606.09859 2026-06-10 cs.LG cs.AI 交叉投稿

Mitigating Manifold Departure: Uncertainty-Aware Subspace Rectification for Trustworthy MLLM Decoding

缓解流形偏离:面向可信MLLM解码的不确定性感知子空间校正

Yingxuan Zhuang, Jingxiao Yang, Miao Pan, Cheng Tan, Yuxiang Cai, Siwei Tan, Chen Zhi, Xuhong Zhang, Jianwei Yin, Jintao Chen

发表机构 * Nanyang Technological University(南洋理工大学)

AI总结 提出MGAP方法,通过SVD构建语言先验子空间并自适应衰减投影分量,在抑制幻觉的同时保持语义结构,优于现有解码基线。

Comments ICML 2026 regular

详情
AI中文摘要

MLLM经常产生与视觉输入不一致的幻觉对象。这一问题通常归因于对语言先验的过度依赖,这种依赖可能覆盖视觉上下文。最近的训练自由解码策略通过惩罚语言先验来解决这一问题。然而,这些方法忽视了语言先验的双重性质——根据与视觉证据的对齐程度,它们既可能有益也可能有害。特别是,盲目抑制语言先验往往会破坏模型的语义流形,导致性能下降,我们将这种现象称为流形偏离。为了解决这个问题,我们提出了流形引导自适应投影(MGAP),一种几何感知、训练自由的解码方法,在抑制幻觉的同时保持表示结构。MGAP首先通过SVD从盲隐藏状态构建语言先验子空间。在解码过程中,MGAP将每个多模态隐藏状态投影到该子空间上,并应用一致性感知门控自适应地仅衰减投影的先验分量,从而产生子空间选择性更新,很大程度上保留了正交的语义分量。在POPE和CHAIR上的大量实验表明,MGAP优于先前的解码基线,在不牺牲连贯性的情况下实现了更强的幻觉抑制。

英文摘要

MLLMs frequently hallucinate objects inconsistent with visual inputs. This issue is typically attributed to the over-reliance on language priors, which can override the visual context. Recent training-free decoding strategies address this by penalizing language priors. However, these methods overlook the dual nature of language priors, where they can be both helpful and harmful depending on the alignment with visual evidence. In particular, blindly suppressing language priors often disrupts the model's semantic manifold, leading to performance degradation, a phenomenon we term Manifold Departure. To address this, we propose Manifold-Guided Adaptive Projection (MGAP), a geometry-aware, training-free decoding method that mitigates hallucinations while preserving representation structure. MGAP first constructs a language-prior subspace from blind hidden states via SVD. During decoding, MGAP projects each multimodal hidden state onto this subspace and applies a consistency-aware gate to adaptively attenuate only the projected prior component, yielding a subspace-selective update that largely preserves the orthogonal semantic components. Extensive experiments on POPE and CHAIR show that MGAP outperforms prior decoding baselines, achieving stronger hallucination suppression without sacrificing coherence.

2606.09862 2026-06-10 cs.LG cs.AI 交叉投稿

Blurry Window Attention

模糊窗口注意力

Axel Laborieux, Christos Sourmpis, Juan Gabriel Kostelec, Qinghai Guo

AI总结 提出模糊窗口注意力(BLA),一种基于Dirichlet核插值重构模糊KV历史的有界记忆控制方法,在合成任务中状态效率比滑动窗口注意力高8倍,且随状态增大性能提升。

详情
AI中文摘要

Transformer语言模型中的Softmax注意力操作在序列长度上具有二次复杂度,且状态大小以KV缓存形式增长,这成为长上下文场景中的瓶颈。为克服此限制,引入了具有线性复杂度和有限状态大小的替代架构,如状态空间模型(SSM)、线性注意力(LA)和有界记忆控制注意力(ABC)。尽管线性模型在语言困惑度上与Transformer相当,但在需要检索或回忆特定信息的任务中仍落后。本文提出模糊窗口注意力(BLA),一种受SSM启发的新型ABC方法。BLA存储一个频率窗口,通过使用Dirichlet核进行插值从中重建模糊的KV历史。根据Dirichlet核的分辨率,BLA可理解为滑动窗口注意力(SWA)的泛化,或门控槽注意力(GSA)的特例,其中衰减因子由Dirichlet核实现。我们详细描述了BLA的理论和高效实现。在多查询关联回忆(MQAR)合成任务上,我们表明BLA的状态效率比SWA高8倍,且与流行的线性注意力模型竞争;在RegBench合成任务中,在我们测试的线性模型中,只有BLA和SWA随着状态大小增长而提升性能。

英文摘要

The Softmax Attention operation in Transformer language models has a quadratic complexity in the sequence length and a growing state size in the form of KV cache, which becomes a bottleneck in long context scenarios. To overcome this limitation, alternative architectures with linear complexity and finite state size have been introduced, such as State-Space Models (SSMs), Linear Attention (LA), and Attention with Bounded-memory Control (ABC). Though linear models achieve similar language perplexity as Transformers, they are still behind in tasks which require retrieval or recall of specific information. In this work, we introduce Blurry Window Attention (BLA) a novel ABC method inspired by SSMs. BLA stores a frequency window from which a blurry KV history is reconstructed via interpolation using Dirichlet kernels. BLA can be understood as a generalization of Sliding Window Attention (SWA) depending on the Dirichlet kernels resolution or as a special case of the Gated Slot Attention (GSA), where the decay factor is implemented with Dirichlet kernels. We describe in details the theory and efficient implementation of BLA. On the Multi-Query Associate Recall (MQAR) synthetic task, we show that the state efficiency of BLA is 8$\times$ better than SWA and is competitive with popular linear attention models, and in the RegBench synthetic task, only BLA and SWA improve their performance as the state size grows among the linear models we tested.

2606.09869 2026-06-10 cs.LG cs.AI cs.CR 交叉投稿

QSplitFL: Capability Aware Deep Q-Learning for Optimal Split Point Selection in Split Federated Learning

QSplitFL: 基于能力感知的深度Q学习在分割联邦学习中的最优分割点选择

Nazmus Shakib Shadin, Xinyue Zhang, Jingyi Wang, Miao Pan

AI总结 提出QSplitFL框架,利用深度Q网络基于客户端硬件指标(CPU、内存、电池、网络延迟)动态选择最优分割点,解决异构设备上的分割联邦学习挑战,通过衰减损失奖励函数和委员会投票机制提升收敛速度和精度。

Comments Accepted by ECML-PKDD 2026

详情
AI中文摘要

联邦学习(FL)与分割学习(SL)结合是一种隐私保护范式,能够在资源受限设备上训练深度神经网络(DNN),同时降低整体训练成本。然而,确定最优分割点(即模型被分割的层)仍然是一个关键挑战,尤其是当客户端具有异构硬件能力时。固定分割点可能使弱设备过载,增加通信和服务器负载,从而减慢收敛速度并降低稳定性。本文介绍了QSplitFL,一种新颖的基于能力感知的深度Q网络(DQN)框架,用于在基于分割学习的联邦学习(SFL)环境中选择最优分割点。与依赖高维模型权重表示的现有方法不同,QSplitFL采用直接从客户端硬件指标(包括CPU利用率、内存、电池电量和网络延迟)导出的轻量级状态表示。所提出的框架包含一个衰减损失下降奖励函数,优先考虑早期收敛,以及一个基于委员会的DQN架构,通过多数投票来减轻奖励黑客攻击。在MNIST、Fashion-MNIST、CIFAR-10和CIFAR-100数据集上,使用CNN、ResNet50、MobileNetV4和ConvNeXt架构进行的广泛实验表明,我们的方法在收敛速度和精度上优于现有方法,同时有效适应异构设备资源。源代码在此https URL公开可用。

英文摘要

Federated Learning (FL) combined with Split Learning (SL) is a privacy preserving paradigm that enables training deep neural networks (DNNs) on resource constrained devices while reducing overall training cost. However, determining the optimal split point, meaning the layer where the model is divided still remains a critical challenge, especially when clients have heterogeneous hardware capabilities. Fixed split points can overload weak devices and increase the communication and server load, which slows convergence and reduces stability. This paper introduces QSplitFL, a novel capability-aware Deep Q-Network (DQN) framework for optimal split point selection in Split learning based Federated Learning (SFL) environments. Unlike existing approaches that rely on high-dimensional model weight representations, QSplitFL employs a lightweight state representation derived directly from client hardware metrics, including CPU utilization, memory, battery level, and network latency. The proposed framework incorporates a decayed loss-drop reward function that prioritizes early convergence, and a committee-based DQN architecture with majority voting to mitigate reward hacking. Extensive experiments on MNIST, Fashion-MNIST, CIFAR-10, and CIFAR-100 datasets using CNN, ResNet50, MobileNetV4, and ConvNeXt architectures demonstrate that our approach achieves better convergence and higher accuracy compared to existing methods, while effectively adapting to heterogeneous device resources. The source code is publicly available at https://github.com/AIPO-Lab/QSplitFL.

2606.09873 2026-06-10 cs.LG cs.AI 交叉投稿

Rotate2Think: Geometric Priming via Orthogonal Rotation to Improve Language Model Reasoning

Rotate2Think:通过正交旋转进行几何提示以提升语言模型推理能力

Aditya Sharma, Christopher J. Pal, Amal Zouaq

AI总结 发现推理模型的输入嵌入与思考嵌入存在高锥度且方向非共线,提出无训练方法Rotate2Think,通过正交Procrustes分析估计旋转并注入合成思考向量,在30/32配置中提升数学、科学和代码任务准确率。

详情
AI中文摘要

推理模型通过生成显式的中间推理轨迹再给出最终答案,在挑战性任务上取得了强劲表现。然而,推理过程中表示空间的内部结构仍知之甚少:模型的隐藏表示在思考时与输入提示的嵌入有何不同?这种结构能否被利用以在推理时激发更强的推理能力?我们表明,输入嵌入和思考嵌入(分别对提示和推理轨迹的最后一层隐藏状态进行均值池化)都表现出极高的锥度,所有向量紧密聚集在单一平均方向周围。关键的是,这些平均输入方向和思考方向是非共线的,思考嵌入在嵌入空间中占据了几何上不同的区域,这在许多不同模型和基准任务中均成立。这一观察启发我们将输入到思考的转换视为一个旋转问题,该问题可通过正交Procrustes分析得到闭式解。我们提出Rotate2Think,一种无需训练的方法,从少量正确求解的示例中估计该旋转,并在推理时将生成的合成思考向量注入思考分隔符之间,在推理轨迹开始时提供几何提示。在多个基准和模型家族上的评估表明,Rotate2Think在数学、科学和代码任务的32个模型-基准配置中改进了30个的准确率,并零样本泛化到MATH-Vision上的多模态推理。

英文摘要

Reasoning models achieve strong performance on challenging tasks by generating explicit intermediate reasoning traces before producing a final answer. Yet the internal structure of representation space when reasoning remains poorly understood: how do a model's hidden representations differ during thinking versus the embeddings of the input prompt, and can this structure be exploited to elicit stronger reasoning at inference time? We show that both input embeddings and thinking embeddings (mean-pooled last-layer hidden states over the prompt and reasoning trace, respectively) exhibit extremely high conicity, with all vectors clustering tightly around a single mean direction. Crucially, these mean input and thinking directions are non-collinear, with thinking embeddings occupying a geometrically distinct region of embedding space across many different models and benchmark tasks. This observation motivates casting the input-to-thinking transition as a rotation problem admitting a closed-form solution via orthogonal Procrustes analysis. We propose Rotate2Think, a training-free method that estimates this rotation from a small set of correctly solved examples and injects the resulting synthetic thinking vector between thinking delimiters at inference time, providing a geometric primer at the onset of the reasoning trace. Evaluated across multiple benchmarks and model families, Rotate2Think improves accuracy in 30 of 32 model-benchmark configurations across mathematics, science, and code tasks, and generalizes zero-shot to multimodal reasoning on MATH-Vision.

2606.09883 2026-06-10 cs.LG cs.AI 交叉投稿

TD-Grokking: Learning from Zero-Reward Problems by Training-Time Decomposition

TD-Grokking:通过训练时分解从零奖励问题中学习

Ningyuan Xi, Hao Xu, Hongsheng Xin, Ning Miao

发表机构 * Ningyuan Xi 1,2(西宁元 1,2) Hao Xu 3(许浩 3) Hongsheng Xin 3(辛红生 3) Ning Miao 1,2, †(苗宁 1,2, †)

AI总结 针对强化学习在零奖励问题上无法提供优化信号的问题,提出训练时分解框架TD-Grokking,将难解问题递归分解为可验证子问题,在数学和医疗任务上优于基线方法。

详情
AI中文摘要

大型语言模型(LLMs)在推理任务上取得了显著进展,这主要归功于后训练范式,特别是基于可验证奖励的强化学习(RLVR)。然而,一个关键瓶颈依然存在:RLVR在极具挑战性的零奖励问题上失败,因为所有采样的推理轨迹都产生统一失败的结果,无法提供优化信号来驱动模型改进。先前解决这一限制的努力,如密集过程监督、部分奖励分配或前缀引导探索,受到固有任务约束的限制,或者未能完全赋予策略模型解决原始难解问题所需的能力。为了解决这个问题,我们提出了TD-Grokking,一个针对零奖励问题的训练时分解框架。它递归地将难解的根问题分解为自包含、可验证的子问题,形成层次树,其中可解的叶子节点提供非零奖励。在数学和医疗任务上的评估表明,TD-Grokking优于普通的GRPO以及所有基线方法。结合详细分析,这些结果证实训练时分解有效地将零奖励示例转化为可用的训练信号,从而实现一致的性能提升。我们的代码和数据集可在以下网址获取:https://this URL。

英文摘要

Large language models (LLMs) have made remarkable progress in reasoning tasks, largely driven by post-training paradigms, especially reinforcement learning with verifiable rewards (RLVR). However, a critical bottleneck persists: RLVR fails on highly challenging zero-reward problems, where all sampled reasoning trajectories yield uniformly failed outcomes, providing no optimization signal to drive model improvement. Prior efforts to address this limitation, such as dense process supervision, partial reward assignment, or prefix-guided exploration, suffer from inherent task constraints or do not fully equip the policy model with the capabilities necessary to solve the original intractable problems. To address this, we propose TD-Grokking, a training-time decomposition framework for zero-reward problems. It recursively decomposes intractable root problems into self-contained, verifiable subproblems, forming hierarchical trees where solvable leaves provide non-zero rewards. Evaluations on mathematical and medical tasks show that TD-Grokking outperforms vanilla GRPO as well as all baseline approaches. Together with detailed analysis, these results confirm that training-time decomposition effectively converts zero-reward examples into usable training signals, enabling consistent performance gains. Our code and datasets are available at https://anonymous.4open.science/r/TD-Grokking-6567/.

2606.09886 2026-06-10 cs.LG cs.AI 交叉投稿

SHAPE: Coalition-Aware Expert Pruning for Sparse Mixture-of-Experts LLMs

SHAPE: 面向稀疏混合专家大语言模型的联盟感知专家剪枝

Yuhao Zhang

发表机构 * Beihang University(北航大学)

AI总结 提出SHAPE框架,通过合作博弈论建模专家间协作,利用Shapley值识别高贡献专家,结合质量覆盖选择规则在剪枝预算下保留关键专家,实验表明在多种MoE模型上提升鲁棒性并降低显存。

详情
AI中文摘要

稀疏混合专家(MoE)大语言模型以低每token计算量实现了高质量,但其部署常受限于内存墙:必须保留全部专家池以支持依赖token的路由。专家剪枝是一种直接解决方案,但先前的标准通常独立评估专家,忽略了MoE推理本质上是“联盟性”的,即输出由路由到的top-$k$专家组合产生。我们提出\textbf{SHAPE},一个任务驱动的剪枝框架,显式建模\textit{层内}专家协作。SHAPE将小校准集上的路由轨迹建模为经验合作博弈,并通过基于观察到的top-$k$联盟的Shapley式归因分配交互感知的专家价值,从而识别对高效用协作至关重要的专家,而不仅仅是频繁出现的专家。为了在全局剪枝预算下保持MoE拓扑,SHAPE进一步引入\textit{质量-覆盖}选择规则,在每层保留覆盖非负Shapley质量$\alpha$分数的最小专家子集,同时使用二分法匹配目标保留率。在三个现代MoE骨干网络(Qwen3-30B-A3B、GPT-OSS-20B和DeepSeek-V2-Lite)上的多个基准实验表明,SHAPE在20%和40%专家剪枝下,相比全局和逐层剪枝变体一致地提升了鲁棒性,无需额外训练即保持竞争性精度,并显著降低了峰值GPU内存占用。开源代码见此https URL。

英文摘要

Sparse Mixture-of-Experts (MoE) large language models achieve strong quality with low per-token compute, yet their deployment is often limited by the memory wall: the full expert pool must remain resident to support token-dependent routing. Expert pruning is a direct remedy, but prior criteria typically score experts independently and overlook that MoE inference is inherently \emph{coalitional}, where outputs arise from routed top-$k$ expert combinations. We propose \textbf{SHAPE}, a task-driven pruning framework that explicitly models \emph{intra-layer} expert cooperation. SHAPE formulates routing traces on a small calibration set as an empirical cooperative game and assigns interaction-aware expert values via a Shapley-style attribution over observed top-$k$ coalitions, enabling the identification of experts that are essential for high-utility collaborations rather than merely frequent. To preserve MoE topology under a global pruning budget, SHAPE further introduces a \emph{quality-coverage} selection rule that retains, in each layer, the minimal expert subset covering an $α$ fraction of non-negative Shapley mass, while using bisection to match a target keep rate. Experiments on three modern MoE backbones (Qwen3-30B-A3B, GPT-OSS-20B, and DeepSeek-V2-Lite) across diverse benchmarks show that SHAPE consistently improves robustness over global and layer-wise pruning variants, maintaining competitive accuracy under 20\% and 40\% expert pruning without additional training and delivering clear reductions in peak GPU memory footprint. The open-source code is available at https://github.com/Alizen-1009/Shapley-Moe.

2606.09891 2026-06-10 cs.LG cs.AI cs.IR 交叉投稿

Representation Curriculum: Stagewise Training for Robust Ranking and Allocation

表示课程:用于鲁棒排序和分配的分阶段训练

Ehsan Ebrahimzadeh, Sina Baharlouei, Abraham Bagherjeiran

发表机构 * eBay Search Ranking and Monetization(eBay搜索排名与变现)

AI总结 提出表示课程(RC)方法,通过分阶段引入特征,先强调基于内容的信号,再引入依赖曝光的信号,减少对历史信号的捷径依赖,提升冷启动泛化性和鲁棒性。

Comments 12 pages, 5 figures

详情
AI中文摘要

数字市场中的排序是一种动态曝光分配机制:展示的物品塑造了发现轨迹和成功事件,平台记录这些事件以更新未来的分配策略。现代排序系统严重依赖曝光混杂信号(如流行度估计、CTR/CVR聚合和基于ID的表示),因为这些信号在静态需求下具有高度预测性。然而,这种预测能力可能成为一种学习捷径:早期访问依赖曝光的信念信号会使优化过度依赖它们,而忽视独立于曝光的价值信号(如基于内容的竞争力和语义亲和性)。因此,学习到的策略倾向于固化现有物品,并在分布偏移下降低冷启动泛化性和鲁棒性。我们提出表示课程(RC),一种训练时干预方法,按时间阶段安排特征使用。RC首先突出基于内容的价值信号,然后引入依赖曝光的信念信号,同时将内容路径锚定在学到的价值表示附近,从而抑制对历史信号的捷径依赖,并缓解内容信号上的梯度饥饿。我们形式化RC,使其独立于任务和假设类,并提供排序特定的实例化。在高斯线性岭回归设置中,我们推导出封闭形式解和充分条件,证明RC在冷启动目标分布上严格降低总体风险,并量化了与源性能的帕累托权衡。在公开的排序学习和推荐基准测试,以及大规模电商搜索系统中的随机在线实验中,RC显著地将依赖从历史信念信号转向基于内容的价值信号,并在头部性能可控权衡下,对冷群体带来一致的提升。

英文摘要

Ranking in digital marketplaces is a dynamic exposure-allocation mechanism: displayed items shape discovery trajectories and success events logged by the platform to update future allocation policies. Modern ranking systems rely heavily on exposure-confounded signals (e.g. popularity estimates, CTR/CVR aggregates, and ID-based representation), because they are highly predictive under stationary demand. Yet this predictive power can become a learning shortcut: early access to exposure-dependent belief signals steers optimization toward over-reliance on them and away from exposure-independent merit signals (e.g., content-based competitiveness and semantic affinity). Consequently, the learned policy tends to entrench incumbents and degrade cold-start generalization and robustness under distribution shift. We propose Representation Curriculum (RC), a training-time intervention that temporally stages feature utilization. RC foregrounds content-based merit signals initially, then introduces exposure-dependent belief signals while anchoring the content pathway near the learned merit representation, curbing shortcut reliance on historical signals and mitigating gradient starvation on content signals. We formalize RC independently of task and hypothesis class and provide ranking-specific instantiations. In a Gaussian linear ridge setting, we derive closed-form solutions and sufficient conditions under which RC strictly reduces population risk on a cold-start target distribution, with a quantified Pareto tradeoff against source performance. Experiments on public learning-to-rank and recommendation benchmarks, and randomized online experiments in a large-scale e-commerce search system, show that RC measurably shifts reliance from historical belief signals toward content-based merit signals and yields consistent gains on cold populations with a controlled trade-off in head performance.

2606.09899 2026-06-10 cs.LG cs.AI 交叉投稿

When Attribution Patching Lies: Diagnosis and a Second-Order Correction

当归因补丁说谎时:诊断与二阶修正

Luyang Zhang, Jialu Wang

发表机构 * Carnegie Mellon University(卡内基梅隆大学) University of California, Santa Cruz(加州大学圣克鲁兹分校)

AI总结 研究归因补丁(梯度一阶近似)在机制可解释性中的不可靠性,发现主要误差源于下游网络的非线性,并提出可靠性评分、误差界和HVP二阶修正方法。

Comments 30 pages, 12 figures

详情
AI中文摘要

机制可解释性的一个核心目标是识别哪些内部组件因果地驱动语言模型的行为。由于这些重要性估计作为识别电路的证据,系统性错误可能导致对底层机制的误识别。虽然激活补丁提供了黄金标准的因果度量,但其计算成本在大规模下难以承受。从业者转而依赖归因补丁,一种基于梯度的一阶近似,其可靠性尚不明确。在这项工作中,我们刻画了这种不可靠性的来源,证明主要误差源于下游网络的非线性,而非补丁组件的局部曲率。这一洞察产生了三个实用工具:(i) 检测不可信估计的可靠性评分,(ii) 量化潜在归因误规范的误差界,以及 (iii) 仅需一次额外反向传播即可消除主导误差的Hessian-向量-乘积(HVP)修正。在五个模型家族(124M-9B参数)以及随机令牌和自然(名称交换)扰动的评估中,HVP是唯一在大规模下可行的二阶修正,而标准基线如积分梯度在计算上变得不可行。在对比实验中,多步HVP变体以显著更低的计算量达到或超过积分梯度的准确性,优于先前的二阶基线。这些改进在标准基准上实现了更高保真度的电路恢复,并支持一种屏幕-标记-修复工作流,仅将计算努力针对被标记为不可靠的组件。

英文摘要

A central goal of mechanistic interpretability is to identify which internal components causally drive a language model's behavior. Because these importance estimates serve as the evidence for identifying circuits, systematic errors can lead to the misidentification of the underlying mechanisms. While activation patching provides a gold-standard causal metric, its computational cost is prohibitive at scale. Practitioners instead rely on attribution patching, a gradient-based, first-order approximation whose reliability remains poorly understood. In this work, we characterize the source of this unreliability, demonstrating that the dominant error stems from the non-linearities in the downstream network rather than local curvature at the patched component. This insight yields three practical tools: (i) a reliability score to detect untrustworthy estimates, (ii) error bounds quantifying potential attribution mis-specifications, and (iii) a Hessian-vector-product (HVP) correction that eliminates the leading-order error with only one additional backward pass. In evaluations across five model families (124M-9B parameters) and both random-token and naturalistic (name-swap) perturbations, HVP is the only second-order correction feasible at larger scale, where standard baselines like Integrated Gradients become computationally prohibitive. In comparative experiments, a multi-step HVP variant matches or exceeds the accuracy of Integrated Gradients at significantly lower compute, outperforming prior second-order baselines. These improvements lead to higher-fidelity circuit recovery on standard benchmarks and support a Screen-Flag-Fix workflow that targets computational effort only toward the components flagged as unreliable.

2606.09902 2026-06-10 cs.NE cs.AI 交叉投稿

The Whale That Outswam Evolution: Swarm Intelligence Maximises Memory in Connectome Reservoirs

超越进化的鲸鱼:群体智能在连接组储备池中最大化记忆

Anmol Guragain, Savvas Kakalis, Juan Ignacio Godino-Llorente

发表机构 * University of Murcia(穆尔西亚大学) University of Pisa(比萨大学)

AI总结 应用四种无梯度生物启发优化器优化六物种连接组储备池的边权重,在所有任务和物种上均优于未优化的生物基线,鲸鱼优化算法实现最大提升(记忆容量提升17倍,均方根误差降低89%),表明生物权重是拓扑无法替代的必要归纳偏置。

详情
AI中文摘要

储备池计算利用固定动力学的循环网络进行时序处理,仅需训练线性读出层。经过数百万年进化塑造的生物神经连接组可能编码了超越随机储备池的计算结构,但该结构能否通过原则性优化进一步增强仍是开放问题。我们通过将四种无梯度、生物启发优化器(粒子群优化、差分进化、灰狼优化器和鲸鱼优化算法)应用于基于连接组的回声状态网络的边权重来解决该问题,这些网络涵盖六个物种,其神经复杂性跨越六个数量级:秀丽隐杆线虫(279个神经元)、果蝇(49个节点)、小鼠(112个节点)、大鼠(73个节点)、猕猴(29个区域,连续FLNe突触强度)以及人类结构MRI连接(83个脑区)。每个连接组在四个经典储备计算基准上评估:记忆容量(MC)、Lorenz吸引子预测、NARMA-10系统辨识和Mackey-Glass混沌时间序列预测。所有四种优化器在从生物权重初始化时,在每个任务和物种上均一致优于未优化的生物基线。鲸鱼优化算法在每个任务上均取得最大增益:记忆容量提升高达17倍(秀丽隐杆线虫:1.39至23.91),均方根误差降低高达89%(Mackey-Glass,人类),对应所有物种和任务平均改进214%。关键的是,相同拓扑上的随机初始化始终表现劣于生物学,确立了生物权重值作为拓扑本身无法恢复的必要归纳偏置。这些结果将生物启发、生物初始化优化定位为跨动物王国连接组储备计算的一种原则性且广泛有效的策略。

英文摘要

Reservoir computing exploits the fixed dynamics of a recurrent network for temporal processing, requiring only a trained linear readout. Biological neural connectomes, shaped by millions of years of evolution, may encode computational structure beyond what random reservoirs provide, yet whether that structure can be further enhanced by principled optimisation remains an open question. We address it by applying four gradient-free, bio-inspired optimisers (Particle Swarm Optimisation, Differential Evolution, Grey Wolf Optimiser, and Whale Optimisation Algorithm) to the edge weights of connectome-based echo-state networks across six species spanning six orders of magnitude in neural complexity: C. elegans (279 neurons), Drosophila (49 nodes), mouse (112), rat (73), macaque (29 regions, continuous FLNe synaptic strengths), and human structural MRI connectivity (83 parcels). Each connectome is evaluated on four canonical reservoir computing benchmarks: Memory Capacity (MC), Lorenz attractor prediction, NARMA-10 system identification, and Mackey-Glass chaotic time-series prediction. All four optimisers consistently outperform unoptimised biological baselines across every task and species when initialised from biological weights. WOA achieves the largest gains on every task: up to a 17x MC improvement (C. elegans: 1.39 to 23.91) and up to 89% NRMSE reduction (Mackey-Glass, human), corresponding to an average 214% improvement across all species and tasks. Crucially, random initialisation on the same topology reliably underperforms biology, establishing biological weight values as an essential inductive bias that topology alone cannot recover. These results position bio-inspired, biologically-initialised optimisation as a principled and broadly effective strategy for connectome reservoir computing across the animal kingdom.

2606.09907 2026-06-10 cs.LG cs.AI 交叉投稿

LongMoE: Longitudinal Multimodal Learning via Trajectory-Aware Mixture-of-Experts

LongMoE:基于轨迹感知的混合专家模型的纵向多模态学习

Maxx Richard Rahman, Prakhar Kumar, Wolfgang Maass

AI总结 提出LongMoE框架,通过上下文感知插补、注意力标记化、轨迹感知编码和稀疏MoE路由,联合解决临床多模态学习中模态缺失和纵向动态两大挑战,在ADNI等数据集上验证了鲁棒性。

详情
AI中文摘要

多模态临床学习对于整合包括影像、文本和个性化健康记录在内的多样化患者数据日益重要。然而,它面临两个基本挑战:i) 模态缺失,即在一次患者就诊中任意子集的模态不可用;ii) 纵向动态,即观察结果的诊断意义取决于患者随时间演变的疾病轨迹。现有方法孤立地处理这些挑战:缺失模态框架将每次就诊视为独立的静态快照并丢弃时间上下文,而纵向模型通常假设模态完全可用并在系统性模态不完整时性能下降。我们提出LongMoE(纵向混合专家模型),这是一个统一框架,用于联合解决这两个挑战。LongMoE结合了上下文感知插补模块和注意力标记化模块,后者捕获不规则就诊序列中的频域时间模式,以及用于建模疾病进展的轨迹感知编码器和用于患者特定专家选择的上下文条件稀疏MoE路由。在ADNI、OASIS-3和MIMIC-IV上的实验表明,LongMoE在缺失或弱共时模态下提高了鲁棒性,并在全模态设置中保持竞争力,为纵向感知的多模态临床学习奠定了坚实基础。

英文摘要

Multimodal clinical learning is increasingly important for integrating diverse patient data, including imaging, text, and personalised health records. However, it faces two fundamental challenges: i) modality missingness, where arbitrary subsets of modalities are unavailable at a given patient visit, ii) longitudinal dynamics, where the diagnostic significance of an observation depends on the patient's evolving disease trajectory over time. Existing methods address these challenges in isolation: missing-modality frameworks treat each visit as an independent static snapshot and discard temporal context, while longitudinal models often assume complete modality availability and degrade under systematic modality incompleteness. We propose LongMoE (Longitudinal Mixture-of-Experts), the unified framework to jointly address both challenges. LongMoE combines a context-aware imputation module with an attentional tokenization module that captures frequency-domain temporal patterns across irregular visit sequences, a trajectory-aware encoder for modeling disease progression, and context-conditioned Sparse MoE routing for patient-specific expert selection. Experiments on ADNI, OASIS-3, and MIMIC-IV show that LongMoE improves robustness under missing or weak contemporaneous modalities and remains competitive in full-modality settings, establishing a strong foundation for longitudinally-aware multimodal clinical learning.

2606.09923 2026-06-10 cs.LG cs.AI 交叉投稿

Conformal Prediction for Neural Operators: Distribution-Free Uncertainty Quantification in Physics Simulation

神经算子的共形预测:物理模拟中无分布不确定性量化

Michael Chin

AI总结 提出将分裂共形预测应用于神经算子物理模拟,实现无分布预测区间和有限样本覆盖保证,并通过归一化共形预测方案生成自适应宽度区间。

Comments 13 pages, 7 tables, 7 figures. Full-scale experiments on NVIDIA V100

详情
AI中文摘要

神经算子如傅里叶神经算子(FNO)已成为求解偏微分方程(PDE)的强大替代方法,比传统数值求解器快几个数量级。然而,在安全关键工程应用(如电子元件和电池系统的热管理)中部署这些模型,不仅需要准确的点预测,还需要严格的不确定性保证。现有的神经算子不确定性量化(UQ)方法,包括蒙特卡洛Dropout和深度集成,仅提供相对不确定性估计,没有正式的覆盖保证。在这项工作中,我们首次将分裂共形预测应用于基于神经算子的物理模拟,提供具有有限样本覆盖保证的无分布预测区间。我们进一步引入了一种归一化共形预测方案,利用MC Dropout不确定性生成自适应宽度区间,在低不确定性区域产生更紧的区间,在模型不太确定的区域产生更宽的区间。在稳态热传导基准上的全规模实验(3370万参数,800个训练样本,5个集成成员,NVIDIA V100)表明,我们的方法在目标水平alpha=0.1下达到89.1%的经验覆盖率,同时生成反映底层物理不确定性结构的空间自适应预测区间。我们还提供了一个不确定性分解框架,将认知不确定性(占总量的68%)与偶然不确定性(占总量的32%)分离,为数据收集和模型改进提供可操作指导。我们的方法在一个开源平台上实现,具有REST API端点和交互式3D可视化。

英文摘要

Neural operators such as the Fourier Neural Operator (FNO) have emerged as powerful surrogates for solving partial differential equations (PDEs), achieving speedups of several orders of magnitude over traditional numerical solvers. However, deploying these models in safety-critical engineering applications -- such as thermal management of electronic components and battery systems -- requires not only accurate point predictions but also rigorous uncertainty guarantees. Existing uncertainty quantification (UQ) methods for neural operators, including Monte Carlo Dropout and Deep Ensembles, provide only relative uncertainty estimates without formal coverage guarantees. In this work, we propose the first application of split conformal prediction to neural operator-based physics simulation, providing distribution-free prediction intervals with finite-sample coverage guarantees. We further introduce a normalized conformal prediction scheme that leverages MC Dropout uncertainty to produce adaptive-width intervals, yielding tighter intervals in regions of low uncertainty and wider intervals where the model is less certain. Full-scale experiments (33.7M parameters, 800 training samples, 5 ensemble members, NVIDIA V100) on steady-state heat conduction benchmarks demonstrate that our method achieves 89.1% empirical coverage at the target level of alpha=0.1, while producing spatially adaptive prediction intervals that reflect the underlying physical uncertainty structure. We also provide an uncertainty decomposition framework that separates epistemic uncertainty (68% of total) from aleatoric uncertainty (32% of total), offering actionable guidance for data collection and model improvement. Our method is implemented in an open-source platform with REST API endpoints and interactive 3D visualization.

2606.09924 2026-06-10 cs.LG cs.AI 交叉投稿

Sigma-Branch: Hierarchical Single-Path Network Reconstruction for Dynamic Inference with Reduced Active Parameters

Sigma-Branch: 用于动态推理的分层单路径网络重构,减少活跃参数

Kohga Tanaka, Hiroaki Nishi

AI总结 提出Sigma-Branch框架,通过分层二叉树结构将预训练密集网络重构为共享主干、分层路由器和专用叶子,利用激活聚类初始化并微调,推理时仅执行单一路径,在CIFAR-100/ResNet-50等任务上减少58-60%活跃参数,性能损失小于1.72个百分点。

详情
AI中文摘要

在内存受限的边缘加速器上部署深度神经网络,瓶颈在于每次推理的片外权重传输而非计算:密集网络无法保留在芯片上,每个输入都必须加载所有参数。现有模型压缩仅在永久容量损失代价下减少这种传输。我们提出Sigma-Branch (SigmaB),一个将预训练密集网络重构为分层二叉树的框架,该树由共享主干、分层路由器和专用叶子组成。预训练权重通过基于激活的球形k-means聚类分布在树中,该聚类联合初始化路由器权重和每分支通道分配;然后通过软路由微调使每个叶子与其路由输入子集对齐。在推理时,所得网络仅执行一条根到叶路径,减少活跃参数占用,同时将完整密集参数集存储在内存中。在CIFAR-100 / ResNet-50、ImageNet-1K / ResNet-50和ModelNet40 / PointNet++上,SigmaB-Net将每次推理的活跃参数减少58-60%,同时与密集基线Top-1相比误差在1.72个百分点以内。在可比的ImageNet-1K Top-1下,活跃参数减少超过静态结构化剪枝(FPGM、HRank)14-23个百分点。跨模态评估涵盖2D视觉和3D点云骨干网络,证实了将每次推理内存流量与总参数数量解耦的框架级主张。

英文摘要

Deploying deep neural networks on memory-constrained edge accelerators is bottlenecked by per-inference off-chip weight transfer rather than computation: the dense network cannot be retained on-chip, and every parameter must be loaded for every input. Existing model compression reduces this transfer only at the cost of permanent capacity loss. We propose Sigma-Branch (SigmaB), a framework that restructures a pretrained dense network into a hierarchical binary tree composed of a shared backbone, hierarchical routers, and specialized leaves. Pretrained weights are distributed across the tree via activation-based spherical k-means clustering, which jointly initializes router weights and per-branch channel allocations; soft-routing fine-tuning then aligns each leaf with its routed input subset. At inference, the resulting network executes only a single root-to-leaf path, reducing the active-parameter footprint while storing the complete dense parameter set in memory. Across CIFAR-100 / ResNet-50, ImageNet-1K / ResNet-50, and ModelNet40 / PointNet++, SigmaB-Net reduces per-inference active parameters by 58-60% while remaining within 1.72 percentage points (pp) of the dense baseline Top-1. At comparable ImageNet-1K Top-1, the active-parameter reduction exceeds static structured pruning (FPGM, HRank) by 14-23 pp. The cross-modal evaluation, spanning 2D vision and 3D point-cloud backbones, substantiates a framework-level claim that decouples per-inference memory traffic from the total parameter count.

2606.09926 2026-06-10 cs.LG cs.AI 交叉投稿

Sample Where You Struggle: Sharpening Base Model Reasoning via Entropy-Guided Power Sampling

在你挣扎处采样:通过熵引导的幂采样增强基础模型推理

Hong Guo, Nianhui Guo, Christoph Meinel, Haojin Yang

AI总结 提出熵引导的幂采样(EGPS),一种无需训练和验证器的采样方法,通过利用前向传播中的token级熵将MCMC移动定位到高熵区域,在多个基准上以高达12.6倍加速达到最优或并列最优准确率。

详情
AI中文摘要

从序列级幂分布 $p^\alpha$ 采样可以在不更新任何参数的情况下从基础语言模型中引出强化学习级别的推理,但标准的Metropolis-Hastings(MH),一种马尔可夫链蒙特卡洛(MCMC)采样器,既昂贵又慢混合。我们将这两个问题归因于结构不匹配:$p^\alpha$ 主要在稀疏、空间聚集的高熵决策点集上偏离 $p$,然而MH沿着前缀均匀地提出重采样位置——在近简并条件上浪费计算,同时在模式发散处欠混合。我们提出熵引导的幂采样(EGPS),一种无需训练和验证器的采样器,它从已经在前向传播中的token级熵重新推导其提议。EGPS跳过确定性块,将每个MCMC移动定位到高熵邻域,并在决策点应用多尝试Metropolis——使得采样成本随熵质量而非序列长度缩放。在Qwen2.5-Math-7B上,EGPS在所有三个基准(MATH500 $75.8\\%$,HumanEval $62.2\\%$,GPQA $42.4\\%$)上达到最佳或并列最佳准确率,同时相对于MH基线实现了高达12.6倍的墙钟加速。

英文摘要

Sampling from the sequence-level power distribution $p^α$ elicits RL-level reasoning from base language models without any parameter updates, but the standard Metropolis--Hastings (MH), a Markov Chain Monte Carlo (MCMC) sampler, is both expensive and slow-mixing. We trace both to a structural mismatch: $p^α$ mainly departs from $p$ at a sparse, spatially clustered set of high-entropy decision points, yet MH proposes resampling positions uniformly along the prefix -- wasting compute on near-degenerate conditionals while under-mixing precisely where modes diverge. We propose Entropy-Guided Power Sampling (EGPS), a training-free and verifier-free sampler that re-derives its proposal from token-level entropy already in the forward pass. EGPS skips deterministic blocks, localizes each MCMC move to a high-entropy neighborhood, and applies Multiple-Try Metropolis at decision points -- making sampling cost scale with \emph{entropy mass rather than sequence length}. On Qwen2.5-Math-7B, EGPS reaches best or tied-best accuracy on all three benchmarks (MATH500 $75.8\%$, HumanEval $62.2\%$, GPQA $42.4\%$) at up to a $12.6\times$ wall-clock speedup over the MH baseline.

2606.09927 2026-06-10 cs.LG cs.AI cs.CL 交叉投稿

Trainable Smooth-Rotation Transforms with Learned Channel Scales for LLM Quantization

可训练平滑旋转变换与学习通道尺度用于LLM量化

Patrik Czakó, Gábor Kertész, Sándor Szénási

AI总结 针对大语言模型量化中激活值量化困难的问题,提出基于分位数鲁棒的缩放策略和梯度优化的通道尺度学习,在W4A4量化下显著降低误差。

Comments 6 pages, 8 figures, 3 tables. Accepted to IEEE INES 2026 conference proceedings

详情
AI中文摘要

后训练量化(PTQ)是降低大语言模型(LLM)服务成本最实用的方法之一,但激活值量化仍然困难,因为异常值主导的通道会导致较大的量化误差。本文研究了这种退化是否部分由基于缩放的等效变换中的过度迁移引起。我们引入了一种用于SmoothRot风格变换的分位数鲁棒缩放策略,用高分位数替代基于最大值的激活统计量,并辅以通道尺度的约束梯度优化。在LLaMA-3.2-1B的W4A4量化下,仅分位数策略搜索相比SmoothRot基线将选定层误差降低11.1%,联合(alpha, q)搜索降低12%,训练达到18.5%。将最佳选定层策略重放到所有解码器块的下投影层,相应的全层平均误差从97.51降至78.08(19.9%)。结果表明,鲁棒的迁移控制和轻量级尺度学习在保持等效变换框架的同时,相比基于最大值的固定策略提供了持续改进。

英文摘要

Post-training quantization (PTQ) is one of the most practical ways to reduce the serving cost of Large Language Models (LLMs), but activation quantization remains difficult because outlier-dominated channels lead to large quantization errors. This paper investigates whether part of this degradation is caused by over-migration in scaling-based equivalent transformations. We introduce a quantile-robust scaling policy for SmoothRot-style transforms by replacing max-based activation statistics with high quantiles, and we complement it with constrained gradient-based optimization of channel scales. On LLaMA-3.2-1B under W4A4 quantization, quantile-only policy search improves selected-layer error by 11.1% over the SmoothRot baseline, joint (alpha, q) search improves it by 12%, and training reaches 18.5%. Replaying the best selected-layer policy on all decoder-block down-projection layers reduces the corresponding full-layer mean error from 97.51 to 78.08 (19.9%). The results show that robust migration control and lightweight scale learning provide consistent gains over max-based fixed policies while preserving the equivalent-transform framework.

2606.09928 2026-06-10 cs.LG cs.AI 交叉投稿

Forward-Only Convolutional Neural Networks with Learnable Channel-Class Assignment

具有可学习通道-类别分配的前向传播卷积神经网络

Mohammadnavid Ghader, Saeed Reza Kheradpisheh, Bahar Farahani, Mahmood Fazlali

AI总结 提出可学习的通道-类别分配机制,结合熵和正交正则化,以及基于验证性能的损失感知层贡献策略,在残差CNN上实现前向传播学习,在CIFAR-10/100和Tiny-ImageNet上达到FF模型最佳性能,缩小与反向传播的差距。

详情
AI中文摘要

前向-前向(FF)算法通过用局部的前向目标替代基于梯度的信用分配,提供了一种受生物学启发的反向传播替代方案。虽然最近的扩展已将FF适应到卷积神经网络(CNN),但现有公式依赖于静态的通道-类别分区,并且在复杂任务中难以有效执行。在这项工作中,我们引入了一种可学习的通道-类别分配机制,实现了卷积通道的自适应、数据驱动特化,并辅以熵和正交正则化以提升学习性能。我们进一步提出了一种损失感知的层贡献策略,该策略根据中间层的验证性能自适应地加权其预测,从而增强前向推理的有效性。集成到残差CNN中,所提出的方法在CIFAR-10、CIFAR-100和Tiny-ImageNet上相比现有的类似前向方法持续实现了更优的性能。值得注意的是,它在基于FF的模型中建立了新的最先进性能,显著缩小了与反向传播的差距。这些发现表明,引入可学习的通道特化和层贡献加权显著增强了深度CNN中前向学习的表示能力。

英文摘要

The Forward-Forward (FF) algorithm offers a biologically inspired alternative to backpropagation by replacing gradient-based credit assignment with local, forward-only objectives. While recent extensions have adapted FF to convolutional neural networks (CNNs), existing formulations rely on static channel-class partitions and struggle to perform effectively in complex tasks. In this work, we introduce a learnable channel-class assignment mechanism that enables adaptive, data-driven specialization of convolutional channels, supported by entropy and orthogonality regularization to promote learning performance. We further propose a loss-aware layer contribution strategy that adaptively weights intermediate-layer predictions based on their validation performance, enhancing the effectiveness of forward-only inference. Integrated into residual CNNs, the proposed method achieves consistently superior performance across CIFAR-10, CIFAR-100, and Tiny-ImageNet compared to existing similar forward-only methods. Notably, it establishes new state-of-the-art performance among FF-based models, substantially narrowing the gap with backpropagation. These findings demonstrate that introducing learnable channel specialization and layer contribution weighting significantly enhances the representational capacity of forward-only learning in deep CNNs.

2606.09929 2026-06-10 cs.LG cs.AI 交叉投稿

Between Amnesia and Chaos: A Memory Stability Expressivity Trilemma for Trainable Dissipative Oscillator Networks

介于遗忘与混沌之间:可训练耗散振荡器网络的记忆稳定性表现力三难困境

Caleb Munigety

AI总结 本文研究可训练非线性振荡器网络,发现记忆范围、梯度稳定性和动态表现力三者受阻尼控制,存在无法同时最大化的三难困境,并通过实验验证了理论边界。

详情
AI中文摘要

物理储层计算利用非线性机械动力学,但传统上冻结基底并仅训练线性读出层,假定基底不可训练。我们重新审视这一前提,研究非线性振荡器网络,其质量、阻尼和刚度通过辛积分器端到端学习。我们的核心结果是三难困境:记忆范围、梯度稳定性和动态表现力无法同时最大化,因为三者均由阻尼控制。反向梯度以阻尼决定的速率衰减,限制了信用传播的距离,而前向灵敏度以最大李雅普诺夫指数指数增长,因此可用梯度需要阻尼高于稳定下限。由于李雅普诺夫指数随阻尼增加而下降,而记忆上限随范围增加而下降,稳定训练被限制在一个随范围收缩并在临界点闭合的带状区域内。我们在一个二十振荡器网络上测试了每一步。阻尼扫描发现最大李雅普诺夫指数单调变化并在明确的下限处过零,证实了定理的关键假设。在九个范围上的延迟回忆任务中,学习基底与冻结基底的算力匹配比较显示,学习基底在短范围占优,优势在约十一步范围附近接近并反转,这是带状闭合的预测特征;训练模型稳定在稳定下限附近,自发寻求混沌边缘。解析上限高估经验交叉约五倍,这是可检测梯度与可学习梯度之间的差距,我们报告而非调整消除。贡献在于确认了何时训练物理基底优于冻结基底。

英文摘要

Physical reservoir computing harnesses nonlinear mechanical dynamics but, by convention, freezes the substrate and trains only a linear readout, presuming the substrate is not usefully trainable. We revisit that premise for networks of nonlinear oscillators whose mass, damping, and stiffness are learned end-to-end through a symplectic integrator. Our central result is a trilemma: memory horizon, gradient stability, and dynamical expressivity cannot be simultaneously maximized, because all three are governed by the damping. The backward gradient decays at a rate set by the damping, capping how far back credit can propagate, while forward sensitivities grow exponentially in the largest Lyapunov exponent, so usable gradients require damping above a stability floor. Since the Lyapunov exponent falls as damping rises while the memory ceiling falls as the horizon grows, stable training is confined to a band that contracts with horizon and closes at a critical point. We test every step on a twenty-oscillator network. A damping sweep finds the largest Lyapunov exponent monotone and crossing zero at a well-defined stability floor, confirming the theorem's key assumption. A compute-matched comparison of learned versus frozen substrate on delayed recall across nine horizons shows the learned substrate dominating at short horizons and the advantage closing and reversing near a horizon of eleven steps, the predicted signature of band closure; trained models settle near the stability floor, seeking the edge of chaos unprompted. The analytic ceiling overestimates the empirical crossover roughly fivefold, a gap between detectable and learnable gradient that we report rather than tune away. The contribution is a confirmed account of when training a physical substrate beats freezing it.

2606.09932 2026-06-10 cs.LG cs.AI 交叉投稿

When RL Fails after SFT: Rejuvenating Model Plasticity for Robust SFT-to-RL Handoff

当强化学习在监督微调后失效:恢复模型可塑性以实现稳健的SFT到RL交接

Runze Liu, Jiashun Liu, Xu Wan, Yuqian Fu, Ling Pan

AI总结 针对SFT过度训练导致RL阶段改进有限的问题,提出Rejuvenation方法,通过基模型锚定融合和神经元重置恢复模型可塑性,在数学推理和智能体任务上提升RL性能。

详情
AI中文摘要

监督微调(SFT)后接强化学习(RL)已成为大语言模型(LLM)后训练的标准流程。SFT预期为RL提供有用的行为先验,以进一步增强模型能力。然而,过度SFT的检查点在RL中往往表现出有限的改进。我们将此失败归因于模型可塑性的丧失:SFT初始化的策略被后续RL有效重塑的能力降低。为了更好地理解这一现象,我们从参数变化、输出空间和RL优化动态等多个角度进行了详细分析。我们的结果表明,过度SFT的模型倾向于产生过度自信的token分布,并表现出尖锐的参数景观,这使得它们在RL阶段更难优化。为了实现更稳健的SFT到RL交接,我们提出了Rejuvenation,一种简单而有效的方法,在保留有用的SFT获取先验的同时恢复可塑性。Rejuvenation利用基于基模型的模型融合来减少过度SFT引起的漂移,并通过有针对性的神经元重置来缓解模型僵化。在数学推理任务和智能体任务上的实验结果表明,我们的方法在过度训练的SFT模型上持续提升了RL性能,同时也增强了对分布外任务的泛化能力。

英文摘要

Supervised Fine-Tuning (SFT) followed by Reinforcement Learning (RL) has become a standard pipeline for Large Language Model (LLM) post-training. SFT is expected to provide a useful behavioral prior for RL to further enhance model capabilities. However, checkpoints with excessive SFT often show limited improvement during RL. We attribute this failure to the loss of model plasticity: the reduced ability of an SFT-initialized policy to be effectively reshaped by subsequent RL. To better understand this phenomenon, we conduct detailed analysis from multiple perspectives, including parameter changes, output spaces, and RL optimization dynamics. Our results show that models from excessive SFT tend to produce over-confident token distributions and exhibit sharp parameter landscapes, which make them harder to optimize in the RL stage. To enable a more robust SFT-to-RL handoff, we propose \texttt{Rejuvenation}, a simple yet effective method that restores plasticity while preserving useful SFT-acquired priors. Rejuvenation leverages base-anchored model fusion to reduce excessive SFT-induced drift with targeted neuron reset to mitigate model rigidity. Experimental results on both math reasoning tasks and agentic tasks demonstrate that our approach consistently improves RL performance on over-trained SFT models, while also enhancing generalization to out-of-distribution tasks.

2606.09937 2026-06-10 cs.LG cs.AI cs.CL 交叉投稿

RKSC: Reasoning-Aware KV Cache Sharing and Confident Early Exit for Multi-Step LLM Inference

RKSC:面向多步LLM推理的感知推理的KV缓存共享与自信提前退出

Anirudh Sekar

AI总结 提出RKSC框架,通过注意力相似性KV共享、置信门控提前退出和推理选择性块缓存管理,消除多分支LLM推理中的结构冗余,实现平均3.008倍加速,错误率仅0.37%。

Comments Accepted to the ICML 2026 Workshop on Statistical Frameworks for Uncertainty in Agentic Systems

详情
AI中文摘要

我们提出RKSC(感知推理的KV缓存共享),一种无需训练的推理框架,消除了多分支LLM推理流程中的两种结构冗余。ASKS(注意力相似性KV共享)计算前缀KV缓存一次,并通过隐藏状态余弦相似度广播给所有语义相似的分支,严格推广了vLLM和SGLang使用的精确令牌前缀缓存。CGEE(置信门控提前退出)应用两种互补的退出机制:(1)当生成置信度在分支间具有决定性时,完全跳过验证前向传播;(2)当逐层熵稳定时,在中间层终止验证传播,使用Transformer骨干上的轻量级钩子。RSBCM(推理选择性块缓存管理器)通过注意力加权深度优先驱逐防止无界缓存增长。在五个模型家族(7B-10B)、四个基准测试和1000个评估问题上,RKSC相对于无KV基线实现了平均3.008倍加速(峰值3.990倍),相对于vLLM等效前缀缓存平均提升1.66倍,CGEE导致的错误率仅为0.37%(1616次验证调用中6次错误)。无需微调或架构更改。代码可在该URL获取。

英文摘要

We introduce RKSC (Reasoning-Aware KV Cache Sharing), a training-free inference framework that eliminates two structural redundancies in multi-branch LLM reasoning pipelines. ASKS (Attention-Similarity KV Sharing) computes the prefix KV cache once and broadcasts it to all semantically similar branches via hidden-state cosine similarity, strictly generalising the token-exact prefix caching used by vLLM and SGLang. CGEE (Confidence-Gated Early Exit) applies two complementary exit mechanisms: (1) it skips the verification forward pass entirely when generation confidence is decisive across branches, and (2) it terminates the verification pass at an intermediate layer when per-layer entropy stabilises, using lightweight hooks on the transformer backbone. RSBCM (Reasoning-Selective Block Cache Manager) prevents unbounded cache growth via attention-weighted depth-priority eviction. Across five model families (7B-10B), four benchmarks, and 1,000 evaluated problems, RKSC achieves a mean speedup of 3.008x over the No-KV baseline (peak 3.990x), a 1.66x mean improvement over vLLM-equivalent prefix caching, with a CGEE-induced error rate of only 0.37% (6 errors out of 1,616 verify calls). No fine-tuning or architecture changes are required. Code is available at https://github.com/AnirudhSekar/RKSC.

2606.09940 2026-06-10 cs.LG cs.AI 交叉投稿

Interactions Between Crosscoder Features: A Compact Proofs Perspective

交叉编码器特征间的交互:一个紧凑证明的视角

Dmitry Manning-Coe, Thomas Read, Anna Soligo, Oliver Clive-Griffin, Chun-Hei Yip, Rajashree Agrawal, Jason Gross

AI总结 本文从紧凑证明角度形式化交叉编码器特征交互,提出交互度量并应用于计算稀疏性、语义聚类和检测休眠代理。

Comments Accepted at the NeurIPS 2025 Workshop on Mechanistic Interpretability

详情
AI中文摘要

像稀疏自编码器(SAEs)和交叉编码器这样的字典学习方法试图通过将模型的激活分解为独立特征来解释模型。因此,特征之间的交互会在重构中引入误差。我们通过紧凑证明形式化了这一直觉,并做出了五项贡献。首先,我们展示了原则上如何使用交叉编码器构建模型性能的紧凑证明。其次,我们证明了该证明中出现的误差项可以自然地解释为交叉编码器特征之间交互的度量,并提供了多层感知器(MLP)层中交互项的显式表达式。然后,我们提供了这种新交互度量的三个应用。在第三项贡献中,我们展示了交互项本身可以用作可微分的损失惩罚。应用这种惩罚,我们可以实现“计算稀疏”的交叉编码器,当在每个数据点和神经元仅保留单个特征时,保留MLP性能的60%,而标准交叉编码器仅保留10%。接着,我们展示了根据我们的交互度量进行聚类可以提供语义上有意义的特征聚类,最后,我们展示了休眠代理具有显著的交互。代码可在以下网址获取:https://this URL。

英文摘要

Dictionary learning methods like Sparse Autoencoders (SAEs) and crosscoders attempt to explain a model by decomposing its activations into independent features. Interactions between features hence induce errors in the reconstruction. We formalize this intuition via compact proofs and make five contributions. First, we show how, \textit{in principle}, a compact proof of model performance can be constructed using a crosscoder. Second, we show that an error term arising in this proof can naturally be interpreted as a measure of interaction between crosscoder features and provide an explicit expression for the interaction term in the Multi-Layer Perceptron (MLP) layers. We then provide three applications of this new interaction measure. In our third contribution we show that the interaction term itself can be used as a differentiable loss penalty. Applying this penalty, we can achieve ``computationally sparse'' crosscoders that retain $60\%$ of MLP performance when only keeping a single feature at each datapoint and neuron, compared to $10\%$ in standard crosscoders. We then show that clustering according to our interaction measure provides semantically meaningful feature clusters, and finally that sleeper agents have significant interactions. Code is available at https://github.com/chainik1125/crosscoders-feature-interactions/tree/arxiv.

2606.09949 2026-06-10 cs.LG cs.AI 交叉投稿

Learning Where to Simulate: Generative Active Sampling for Online PDE Surrogate Training

学习何处模拟:在线PDE代理训练的生成式主动采样

Pierre Cesar, Sofya Dymchenko, Abhishek Purandare, Bruno Raffin

AI总结 提出在线生成式主动采样(OGAS),通过扩散模型学习配置参数与代理性能的关系,主动采样高难度区域,显著降低尾部分布误差,提升代理最坏情况可靠性。

详情
AI中文摘要

数据驱动的PDE代理使用数值PDE求解器产生的数据进行训练。然而,当代理的目标是在广泛的PDE配置(例如初始条件和物理系数)上泛化时,生成具有代表性的训练集并非易事。配置参数的均匀采样通常低估了表现出挑战性动力学的轨迹,导致训练后的代理出现高预测误差和大误差方差。在线训练将数据生成和代理训练耦合,通过允许实时调整求解器参数提供了自然优势。为了有效利用这一能力,我们引入了在线生成式主动采样(OGAS),一种主动学习方法,它反应性地学习配置参数与代理性能之间的关系,以控制采样分布。OGAS与代理并行训练一个快速扩散模型,作为条件采样器,将代理派生的难度信号(例如损失或不确定性)映射到配置参数。通过主动从偏向高难度的先验中抽取目标信号,OGAS持续将数据生成导向挑战性区域,而不会延迟训练流程。我们在具有不同挑战性动力学的2D PDE(Kuramoto-Sivashinsky、Navier-Stokes、Gray-Scott)上评估OGAS,参数多达308个,并使用多种代理架构。在所有设置中,与均匀采样相比,OGAS一致地改善了尾部分布统计,显著降低了第99百分位以上的误差和整体误差离散度。虽然优先考虑挑战性轨迹引入了与平均误差的权衡,但OGAS有效确保了训练后代理的最坏情况可靠性,且壁钟时间开销可忽略不计。

英文摘要

Data-driven PDE surrogates are trained with data produced by numerical PDE solvers. However, when the surrogate's goal is to generalize across a wide range of PDE configurations (e.g., initial conditions and physical coefficients), generating a representative training set is non-trivial. Uniform sampling of configuration parameters often under-represents trajectories exhibiting challenging dynamics, leading to high prediction errors and large error variance in the trained surrogate. Online training, where data generation and surrogate training are coupled, offers a natural advantage by allowing solver parameters to be steered on-the-fly. To efficiently exploit this capability, we introduce Online Generative Active Sampling (OGAS), an active learning method that reactively learns the relationship between configuration parameters and surrogate performance to control the sampling distribution. OGAS trains a fast diffusion model in parallel to the surrogate to act as a conditional sampler, mapping a surrogate-derived difficulty signal (e.g., loss or uncertainty) to configuration parameters. By actively drawing target signals from a prior biased toward high difficulty, OGAS continuously steers data generation toward challenging regimes without delaying the training workflow. We evaluate OGAS across 2D PDEs with distinct challenging dynamics (Kuramoto-Sivashinsky, Navier-Stokes, Gray-Scott) and up to 308 parameters, using multiple surrogate architectures. Across all settings, OGAS consistently improves tail statistics, yielding substantial reductions in errors above the 99th percentile and overall error dispersion compared to uniform sampling. While prioritizing challenging trajectories introduces a trade-off with average error, OGAS effectively ensures worst-case reliability of trained surrogates with negligible wall-time overhead.

2606.09954 2026-06-10 cs.LG cs.AI 交叉投稿

Does Normalization Choice Matter for Causal Large Time-Series Models?

归一化选择对因果大规模时间序列模型重要吗?

Samy-Melwan Vilhes, Gilles Gasso, Mokhtar Z Alaya

AI总结 研究因果大规模时间序列模型中不同归一化策略对训练收敛和预测性能的影响,发现归一化选择显著影响模型效果。

详情
Journal ref
ICLR 2026 Workshop: Time Series in the Age of Large Models, Apr 2026, Rio De Janeiro, Brazil
AI中文摘要

用于时间序列预测的大规模模型已成为在异构信号集合上训练模型的有前景的范式。这些模型通常依赖于因果自回归架构,其中每个观测值根据过去依次预测。在实践中,真实世界的时间序列表现出非平稳性,这显著影响预测性能。为了缓解这一问题,通常采用归一化。然而,在高效的因果设置中,归一化可能在训练期间导致来自未来观测的信息泄漏。最近提出的替代方案,包括因果归一化和从初始观测计算的统计量,旨在解决这一问题,但其实际影响仍未被充分理解。在这项工作中,我们评估了基于Transformer的大规模时间序列模型(采用分块和高效因果策略训练)的归一化策略。我们展示了归一化选择显著影响训练收敛和预测性能。

英文摘要

Large models for time-series forecasting have been emerged as a promising paradigm for training models on heterogeneous collections of signals. These models typically rely on causal autoregressive architectures, where each observation is sequentially predicted from past. In practice, real-world time-series exhibit non-stationarities, which significantly influence predictive performance. To mitigate this, normalization is commonly employed. However, in efficient causal settings it might induce information leakage from future observations during training. Recent alternatives, including causal normalization and statistics computed from initial observations, have been proposed to address this issue, but their practical implications remain insufficiently understood. In this work, we evaluate normalization strategies for transformer-based large time-series models trained with patching and efficient causal strategy. We showcase that normalization choice significantly influences both training convergence and forecasting performance.

2606.09960 2026-06-10 cs.LG cs.AI 交叉投稿

HydraCIL: Decoupled Class-Incremental Learning through Prototype-Guided Multi-Head Classifiers

HydraCIL: 通过原型引导的多头分类器实现解耦的类增量学习

Daniel Vila-Cruz, Laura Morán-Fernández, Verónica Bolón-Canedo

AI总结 提出HydraCIL模型,通过冻结主干网络、解耦特征提取与学习,并利用原型相似性选择任务特定分类头,在资源受限环境中实现高效类增量学习,匹配或超越现有方法同时大幅降低训练时间和碳排放。

Comments Accepted for publication at the International Joint Conference on Neural Networks (IJCNN 2026)

详情
AI中文摘要

我们提出HydraCIL,一种基于原型引导的多头分类器的解耦持续学习模型,旨在嵌入式及资源受限环境中的可持续部署。虽然大多数类增量学习(CIL)方法依赖强大硬件和长时间再训练周期,但实际系统(如机器人或边缘AI设备)必须在有限资源下快速适应。HydraCIL通过冻结主干网络并将特征提取与学习解耦来解决这一问题。对于每个任务,特征被提取一次,并创建一个轻量级的、任务特定的分类器头,避免了昂贵的主干再训练。在推理时,HydraCIL通过与原型的相似性选择适当的头。在CIFAR-100、ImageNet-100、CoRe50和Flowers102数据集上的实验表明,HydraCIL匹配或超越了最先进的CIL方法,同时显著减少了训练时间和碳足迹,使其成为在能源效率和快速适应至关重要的实际及嵌入式环境中进行持续学习的实用解决方案。

英文摘要

We present HydraCIL, a decoupled continual learning model based on prototype-guided multi-head classifiers, targeting sustainable deployment in embedded and resource-constrained environments. While most Class-Incremental Learning (CIL) methods rely on powerful hardware and long retraining cycles, real-world systems, such as robots or edge AI devices, must adapt quickly with limited resources. HydraCIL addresses this gap by freezing the backbone and decoupling feature extraction from learning. For each task, features are extracted once and a lightweight, task-specific classifier head is created, avoiding costly backbone retraining. At inference, HydraCIL selects the appropriate head via similarity with prototypes. Experiments on CIFAR-100, ImageNet-100, CoRe50, and Flowers102 datasets show that HydraCIL matches or outperforms state-of-the-art CIL methods while significantly reducing training time and carbon footprint, making it a practical solution for continual learning in real-world and embedded settings, where energy efficiency and rapid adaptation are critical.

2606.09962 2026-06-10 cs.LG cs.AI cs.SD 交叉投稿

Optimality of FSQ Tokens for Continuous Diffusion for Categorical Data with Application to Text-to-Speech

FSQ 令牌在分类数据连续扩散中的最优性及其在文本到语音中的应用

Vadim Popov, Wenju Gu, Tasnima Sadekova, Georgii Aparin, Assel Yermekova

AI总结 本文研究连续扩散模型中离散令牌的潜在空间结构,通过理论分析和实验证明 FSQ 令牌化方案在分类数据连续扩散中最优,并在文本到语音任务中验证其优于基于 LLM 的方法。

详情
AI中文摘要

分类数据的连续扩散是一种属于扩散家族的框架,旨在生成离散数据。近年来,由于研究人员试图实现寻找自回归大型语言模型的合理替代方案这一具有挑战性的目标,对此类模型的科学兴趣不断增长。在本文中,我们研究了与离散令牌相对应的潜在空间结构的性质,这些性质通过扩散路径测度上的 Kullback-Leibler 散度和最优训练扩散模型正确预测令牌的准确性来表达。我们发现,FSQ 令牌化方案具有的潜在空间结构使其最适合分类数据的连续扩散,这一点通过严格的理论分析和数值实验得到了验证。为了在现实场景中验证我们的发现,我们训练了几个以语音令牌作为中间声学特征的文本到语音扩散模型,并表明基于 FSQ 令牌的模型确实表现最佳,而且它优于其强大的基于 LLM 的对应模型,同时体积更小、速度更快。

英文摘要

Continuous diffusion for categorical data is a framework belonging to the diffusion family and aiming at generating discrete data. The scientific interest to such models has been constantly increasing these days because researchers try to achieve a challenging goal of finding reasonable alternatives to autoregressive large language models. In this paper, we study the properties of the structure of the latent space corresponding to discrete tokens expressed in terms of Kullback-Leibler divergence on diffusion path measures and accuracy of the correct token prediction by the optimally trained diffusion model. We find that FSQ tokenization scheme has the latent space structure with the properties that make it best suited for continuous diffusion for categorical data as verified through rigorous theoretical analysis and numerical experiments. To validate our findings in real-life scenario, we train several text-to-speech diffusion models having speech tokens as intermediate acoustic features, and show that the one based on FSQ tokens indeed performs the best, and, moreover, it outperforms its strong LLM-based counterpart, at the same time being significantly smaller and faster.

2606.10071 2026-06-10 cs.LG cs.AI 交叉投稿

Temporal Sheaf Neural Networks with Dynamic Orthogonal Transport

时序层神经网络与动态正交传输

Md Sadek Hossain Asif, Tanzila Khan, Md. Mosaddek Khan

AI总结 提出时序层神经网络(TSNN),通过动态正交帧和局部坐标系间显式传输实现时序链接预测,在多种基准上超越现有方法,尤其适用于节点角色异质性强的图。

详情
AI中文摘要

我们引入了时序层神经网络(TSNN),这是一个时序链接预测框架,它为每个节点配备一个时变正交帧,并仅在局部坐标系之间进行显式传输后比较节点状态。与在共享全局嵌入空间中运行的现有连续时间图模型不同,TSNN通过动态局部帧建模节点特定且不断演化的交互语义。该模型通过高效的低秩Householder乘积参数化每个节点的帧,在帧更新下精确保留存储的隐藏状态,并使用几何残差解码器,该解码器基于传输距离锚定预测,同时学习残差校正。所有计算严格因果,仅使用事件前历史。我们证明了对称度归一化层拉普拉斯算子与对称归一化图拉普拉斯算子正交相似,而随机游走归一化形式在相应度度量下相似;TSNN使用的全激活、特征缩放扩散正是组合层Dirichlet能量上的度量梯度步,具有无度单调下降和非扩张保证。帧漂移仅线性扰动更新。在TGB v2链接预测和时序异质排行榜以及DGB基准套件上,TSNN在大多数基准上匹配或超越最强先前方法,在表现出强节点角色异质性的图上改进最大。消融实验证实了动态帧、正交传输和几何残差解码的独特优势。

英文摘要

We introduce Temporal Sheaf Neural Networks (TSNN), a temporal link prediction framework that equips each node with a time-varying orthogonal frame and compares node states only after explicit transport between local coordinate systems. In contrast to existing continuous-time graph models that operate in a shared global embedding space, TSNN models node-specific and evolving interaction semantics through dynamic local frames. The model parameterizes per-node frames via efficient low-rank Householder products, preserves stored hidden states exactly under frame updates, and uses a geometric-residual decoder that anchors predictions on transported distances while learning residual corrections. All computations are strictly causal and use only the pre-event history. We show that the symmetric degree-normalized sheaf Laplacian is orthogonally similar to the symmetric normalized graph Laplacian, with the random-walk normalized form similar in the corresponding degree metric; the full-active, feature-scaled diffusion used by TSNN is exactly a metric-gradient step on the combinatorial sheaf Dirichlet energy, with a degree-free monotone-descent and non-expansiveness guarantee. Frame drift perturbs updates only linearly. Across TGB v2 link-prediction and temporal-heterogeneous leaderboards, together with the DGB benchmark suite, TSNN matches or surpasses the strongest prior methods on most benchmarks, with the largest improvements on graphs exhibiting strong node-role heterogeneity. Ablations confirm the distinct benefit of dynamic frames, orthogonal transport, and geometric-residual decoding.

2606.10089 2026-06-10 cs.LG cs.AI 交叉投稿

A Theory on Flow Matching with Neural Networks

基于神经网络的流匹配理论

Yihan He, Qishuo Yin, Yuan Cao, Jianqing Fan, Han Liu

AI总结 本文为神经网络参数化的条件速度场流匹配建立了理论基础,证明了过参数化两层ReLU网络中梯度下降的收敛性,推导了条件速度场匹配目标的泛化界,并提供了生成样本的Wasserstein距离保证。

详情
AI中文摘要

在这项工作中,我们为神经网络参数化的条件速度场流匹配建立了理论基础。我们证明了过参数化两层ReLU神经网络中梯度下降的收敛性保证。我们推导了条件速度场匹配目标的泛化界。基于这些结果,我们为诱导流生成的样本提供了Wasserstein距离保证。我们的分析基于具有无界损失的多任务表示学习的泛化界,这可能对流式生成建模之外的其他领域也有独立意义。这些理论结果通过在合成和真实图像基准上的大量实验得到了验证。

英文摘要

In this work, we develop theoretical foundation for flow matching with neural-network-parameterized conditional velocity fields. We establish convergence guarantees for gradient descent in the over-parameterized 2-layered ReLU neural network regime. We derive generalization bounds for the conditional velocity-field matching objective. Building on these results, we provide Wasserstein-distance guarantees for the samples generated by the induced flow. Our analysis is based on generalization bound for multi-task representation learning with unbounded losses, which may be of independent interest beyond flow-based generative modeling. These theoretical results are validated through extensive experiments on both synthetic and real-world image benchmarks.

2606.10124 2026-06-10 cs.LG cs.AI 交叉投稿

FedSteer: Taming Extreme Gradient Staleness in Federated Learning with Corrective Projections and Caching

FedSteer: 通过校正投影和缓存驯服联邦学习中的极端梯度陈旧性

Haoran Zhang, Cainã Figueiredo Pereira, Marie Siew, Xutong Liu, Carlee Joe-Wong, Rachid El-Azouzi

AI总结 针对联邦学习中客户端参与不均导致的梯度陈旧问题,提出FedSteer方法,利用客户端梯度缓存构建子空间,通过投影和缓存策略校正陈旧梯度,显著提升训练稳定性与精度。

Comments UAI 2026

详情
AI中文摘要

联邦学习(FL)在客户端不持续参与训练轮次时,常遭受聚合方差的影响。虽然重用非活跃客户端的陈旧模型更新是减少这种方差的常见技术,但我们发现,在客户端参与偏斜的情况下,由此产生的更新陈旧性可能变得严重到足以破坏训练稳定性。为了解决这个问题,我们提出了FedSteer,一种新颖的方法,该方法从最近客户端梯度的缓存中构建一个梯度子空间,作为当前优化景观的低维表示。FedSteer将活跃客户端的真实梯度投影到这个子空间上,以找到一组最优坐标。对于非活跃客户端,FedSteer重用这些坐标,并结合由其他活跃客户端漂移的已演化的子空间。这个过程有效地将过时的梯度“引导”向当前的全局目标。此外,还辅以选择性缓存策略,识别代表性客户端子集以形成子空间,从而减少服务器内存。实验表明,FedSteer显著优于基线,在挑战性场景中防止性能崩溃,并在其他场景中实现超过7%的精度提升。

英文摘要

Federated learning (FL) is often subject to aggregation variance if clients do not consistently participate in training rounds. While reusing stale model updates from inactive clients is a common technique to reduce this variance, we find that with skewed client participation, the resulting update staleness can become severe enough to destabilize training. To remedy this, we propose FedSteer, a novel method that constructs a gradient subspace from a cache of recent client gradients to serve as a low-dimensional representation of the current optimization landscape. FedSteer projects an active client's true gradient onto this subspace to find a set of optimal coordinates. For an inactive client, FedSteer reuses these coordinates with the now-evolved subspace drifted by other active clients. This process effectively "steers" outdated gradients toward the current global objective. This is complemented by a selective caching strategy that identifies a representative client subset to form the subspace, reducing server memory. Experiments demonstrate that FedSteer significantly outperforms baselines, preventing performance collapse in challenging scenarios while delivering accuracy gains of over 7% in others.

2606.10184 2026-06-10 cs.LG cs.AI 交叉投稿

Dropout-GRPO: Variational Stochasticity for Continuous Latent Reasoning

Dropout-GRPO: 用于连续潜在推理的变分随机性

Wooil Jung

AI总结 针对GRPO在连续潜在推理模型中因确定性轨迹导致优势为零的问题,提出通过结构化Dropout引入随机性,使GRPO能优化贝叶斯模型平均策略,在GSM8K上提升Coconut基线准确率。

详情
AI中文摘要

组相对策略优化(GRPO)依赖于每组内$K$次rollout的多样性;否则,组平均优势$A^{(k)} = r^{(k)} - \mu_r$会坍缩为零。这对像Coconut这样的潜在推理模型构成了结构性挑战,该模型循环地馈送连续隐藏状态以替代离散的思维链token。由于给定参数和提示后潜在阶段本质上是确定性的,多次rollout会产生相同的轨迹,阻碍GRPO的进展。因此,将组相对强化学习应用于连续潜在推理被证明是困难的。为解决此问题,我们提出通过结构化dropout来获取必要的随机性。通过在给定rollout的所有潜在递归步骤中应用一个保持不变的单一伯努利掩码,我们生成必要的轨迹方差。这个共享掩码有效地将每个rollout视为来自参数变分分布的后验样本,使GRPO能够优化贝叶斯模型平均策略的期望奖励。我们为该方法提供了理论证明——包括无偏性、方差减少以及潜在梯度的良定义性——以及实证验证。在GSM8K上,dropout-GRPO将Coconut基线从$27.29\%$提升至$29.01\%$的pass@1,证明了GRPO学习在潜在推理模型中的可行性。我们的工作将此定位为一种实用且理论基础的潜在推理LLM后训练方法。

英文摘要

Group Relative Policy Optimization (GRPO) relies on the diversity of $K$ rollouts within each group; otherwise, the group-mean advantage $A^{(k)} = r^{(k)} - μ_r$ collapses to zero. This presents a structural challenge for latent-reasoning models like Coconut, which feed continuous hidden states recurrently in place of discrete chain-of-thought tokens. Because the latent phase is inherently deterministic given the parameters and prompt, multiple rollouts produce identical trajectories, stalling GRPO's progress. Consequently, applying group-relative reinforcement learning to continuous latent reasoning has proven difficult. To address this, we propose sourcing the necessary stochasticity through structured dropout. By applying a single Bernoulli mask held constant across all latent recurrence steps for a given rollout, we generate essential trajectory variance. This shared mask effectively treats each rollout as a posterior sample from a variational distribution over parameters, allowing GRPO to optimize the expected reward of a Bayesian model-average policy. We provide both theoretical justification for this method -- including unbiasedness, variance reduction, and the well-definedness of the latent gradient -- and empirical validation. On GSM8K, dropout-GRPO improves a Coconut baseline from $27.29\%$ to $29.01\%$ pass@1, demonstrating the viability of GRPO learning for latent-reasoning models. Our work positions this as a practical, theoretically grounded approach for post-training latent-reasoning LLMs.

2606.10196 2026-06-10 cs.CV cs.AI 交叉投稿

Fisher-Guided Progressive Parameter Selection for Adaptive Fine-Tuning

Fisher引导的自适应微调渐进参数选择

Ghodsiyeh Rostami, Po-Han Chen, Mahdi S. Hosseini

AI总结 提出FisherAdapTune框架,通过追踪Fisher几何的时间漂移渐进选择参数组,在保持适应动态的同时冻结稳定参数以降低泛化误差界,在分割任务上提升分布内性能和零样本迁移。

详情
AI中文摘要

参数高效微调(PEFT)旨在使用少量可训练参数子集来适应预训练模型,然而,现有大多数方法从固定的架构启发式中选择该子集,而不是使用动态的、任务感知的标准。我们引入了\textbf{FisherAdapTune},一个Fisher引导的自适应微调框架,通过追踪参数组Fisher几何的时间漂移来渐进选择参数组。从微调的PAC-Bayesian视角出发,我们将泛化误差界分解为Fisher加权更新成本,并表明曲率贡献已稳定的参数组可以被冻结,以减少误差界而不中断剩余的适应动态。FisherAdapTune使用连续Fisher分布之间的尺度不变Jensen-Shannon距离来制定这一标准,从而产生一个自适应的活动参数集。我们在下游分割任务上评估了我们的方法,结果表明FisherAdapTune在多种设置下提升了分布内性能和零样本迁移,验证了Fisher结构漂移是高效、任务感知适应的有用信号。我们公开发布了代码(\href{this https URL}{code}),以促进我们提出方法的进一步应用。

英文摘要

Parameter-efficient fine-tuning (PEFT) aims to adapt pretrained models with a small trainable parameter subset, however, most existing methods choose this subset from fixed architectural heuristics rather than using dynamic, task-aware criteria. We introduce \textbf{FisherAdapTune}, a Fisher-guided Adaptive Fine-Tuning framework that progressively selects parameter groups by tracking the temporal drift of their Fisher geometry. Starting from a PAC-Bayesian view of fine-tuning, we decompose the generalization error bound into Fisher-weighted update costs and show that parameter groups whose curvature contribution has stabilized can be frozen to reduce the error bound without interrupting the remaining adaptation dynamics. FisherAdapTune formulates this criterion with a scale-invariant Jensen-Shannon distance between consecutive Fisher distributions, yielding an adaptive active parameter set. We evaluate our approach on a downstream segmentation task, and results show FisherAdapTune improves the in-distribution performance and zero-shot transfer in multiple settings, validating that Fisher structural drift is a useful signal for efficient, task-aware adaptation. We release our \href{https://github.com/AtlasAnalyticsLab/FisherAdapTune}{code} publicly to enable further application of our proposed approach.

2606.10216 2026-06-10 cs.LG cs.AI 交叉投稿

A Source Domain is All You Need: Source-Only Cross-OS Transfer Learning for APT Anomaly Detection via Semantic Alignment and Optimal Transport

一个源域足矣:基于语义对齐和最优传输的仅源域跨操作系统APT异常检测迁移学习

Sidahmed Benabderrahmanea, Petko Valtchev, James Cheney, Talal Rahwan

AI总结 针对跨操作系统APT检测中目标域无标签的挑战,提出基于最优传输的仅源域异常评分框架,通过语义抽象和三种偏差通道实现零目标监督下的异常排序。

详情
AI中文摘要

高级持续性威胁(APT)是隐蔽的多阶段网络攻击,由于标记痕迹稀缺、严重的类别不平衡以及生成真实恶意行为的挑战,其检测十分困难。这些挑战在跨操作系统(cross-OS)设置中被放大,此时在一个源平台上训练的检测器必须部署在无标签的目标平台上,且无法访问目标域标签。我们利用系统级溯源轨迹研究这种仅源域的跨操作系统APT检测问题,并提出一个基于传输的框架,在零目标监督下对异常目标进程进行排序。该框架将进程行为抽象为结构化的自然语言描述,使用预训练语言模型进行嵌入,并构建源域正常参考用于目标评分。它结合了三种证据通道:与源域正常原型的语义偏差、由图自编码器捕获的结构偏差、以及通过最优传输(OT)度量的几何偏差。主要贡献是一个基于OT的重心异常分数,该分数将目标嵌入投影到源域正常流形上,并量化残差传输不匹配。我们进一步引入熵加权、角度感知和密度感知的OT变体,以捕获不确定性、方向漂移和稀疏支持行为。在DARPA透明计算数据(涵盖Linux、Windows、BSD和Android)上的评估,涉及两个APT场景和十二个跨操作系统传输对,表明所提框架在ROC-AUC和nDCG上优于仅源域异常检测基线。结果表明,仅源域溯源建模结合语义抽象和基于OT的异常评分,可以在没有目标域监督的情况下支持实际的跨平台APT检测。

英文摘要

Advanced Persistent Threats (APTs) are stealthy, multi-stage cyberattacks whose detection is difficult due to scarce labeled traces, severe class imbalance, and the challenge of generating realistic malicious behavior. These challenges are amplified in cross-operating-system (cross-OS) settings, where a detector trained on one source platform must be deployed on an unlabeled target platform without access to target-domain labels. We study this source-only cross-OS APT detection problem using system-level provenance traces and propose a transport-based framework for ranking anomalous target processes under zero target supervision. The framework abstracts process behavior into structured natural-language descriptions, embeds them using pretrained language models, and constructs a source-normal reference for target scoring. It combines three evidence channels: semantic deviation from source-normal prototypes, structural deviation captured by graph autoencoding, and geometric deviation measured through Optimal Transport (OT). The main contribution is an OT-based barycentric anomaly score that projects target embeddings onto the source-normal manifold and quantifies residual transport mismatch. We further introduce entropy-weighted, angle-aware, and density-aware OT variants to capture uncertainty, directional drift, and sparse-support behavior. Evaluation on DARPA Transparent Computing data spanning Linux, Windows, BSD, and Android, across two APT scenarios and twelve cross-OS transfer pairs, shows that the proposed framework improves ROC-AUC and nDCG over source-only anomaly-detection baselines. The results demonstrate that source-only provenance modeling, combined with semantic abstraction and OT-based anomaly scoring, can support practical cross-platform APT detection without target-domain supervision.

2606.10238 2026-06-10 q-bio.NC cs.AI 交叉投稿

Hyperbolic Neural Population Geometry Benefits Computation

双曲神经群体几何结构有益于计算

Dennis Wu, Yi-Chun Hung, Braden Yuille, James E. Fitzgerald, Han Liu

发表机构 * University of California, Berkeley(加州大学伯克利分校)

AI总结 本文提出海马体群体活动诱导双曲几何的理论框架,证明现代Hopfield网络更新规则计算最小均方误差估计,并引入双曲空间中的新联想记忆模型,其容量显著优于现有模型。

Comments Accepted at ICML 2026, 37 pages, 5 figures

详情
AI中文摘要

神经群体几何结构影响下游计算。最近神经生物学的实验发现表明,海马体中的群体活动具有双曲结构。本文为这一现象提供了理论框架。首先,我们提出了一种海马体调谐曲线的合理构造,该构造在统计上诱导双曲几何。接着,我们通过证明现代Hopfield网络更新规则计算最小均方误差(MMSE)估计,建立了神经解码与联想记忆之间的联系。最后,我们引入了一个在双曲空间中定义的新型联想记忆模型,其容量显著大于领先模型。我们的结果表明,动物将空间信息编码为潜在的双曲认知地图,从而提高了记忆容量和解码精度。

英文摘要

Neural population geometry shapes downstream computation. Recent empirical findings in neurobiology suggest that a hyperbolic structure underlies population activity in the hippocampus. Here we provide a theoretical framework for this phenomenon. First, we propose a plausible construction of hippocampal tuning curves that statistically induces hyperbolic geometry. Next, we establish a connection between neural decoding and associative memory by demonstrating that the Modern Hopfield Network update rule computes the minimum mean-squared-error (MMSE) estimator. Finally, we introduce a novel associative memory model defined in hyperbolic space that yields significantly larger capacity than leading models. Our results suggest that animals encode spatial information as a latent hyperbolic cognitive map, improving both memory capacity and decoding accuracy.

2606.10250 2026-06-10 cs.LG cs.AI 交叉投稿

Multi-Level Analyzation of Imbalance to Resolve Non-IID-Ness in Federated Learning

联邦学习中不平衡的多层次分析以解决非独立同分布问题

Haengbok Chung, Jae Sung Lee

发表机构 * Interdisciplinary Program in Artificial Intelligence, Seoul National University, Republic of Korea(人工智能跨学科项目,首尔国立大学,韩国) Department of Nuclear Medicine, Seoul National University College of Medicine, Republic of Korea(核医学系,首尔国立大学医学院,韩国) Brightonix Imaging Inc. Seoul, Republic of Korea(Brightonix影像公司,首尔,韩国)

AI总结 提出FedBB算法,通过PNB损失函数和CBR重加权分别解决本地训练中的类内/类间不平衡和客户端间不平衡,在X射线和自然图像数据集上优于现有方法。

Comments 27 pages, 5 figures, 13 tables. Accepted for publication in Neurocomputing (2025). Author Accepted Manuscript

详情
Journal ref
Neurocomputing, Volume 626, 2025, Article 129528
AI中文摘要

类别不平衡是深度学习中常见的问题,会严重降低性能。在联邦学习(FL)中,它是导致非独立同分布数据(non-IID)的关键因素。基于先前的一些尝试,我们在三个层次上定义并分析了FL中的不平衡问题:案例间、类别间和客户端间。案例间不平衡处理每个单一类别内的不平衡;类别间不平衡比较不同类别之间的数据数量。客户端间不平衡表示不同客户端之间本地数据的偏斜程度。基于这些概念,我们提出了FedBB,它由两个主要部分组成:(1)正负平衡(PNB)损失函数解决了本地训练中的案例间和类别间不平衡,增强了高度偏斜的本地客户端数据集上的泛化能力。它通过为少数案例或类别分配更高的权重来优化多标签和多类分类。(2)客户端平衡重加权(CBR)在模型聚合期间根据客户端间不平衡重新加权客户端,为在偏斜较小的数据集上训练的模型赋予更大的权重。在X射线和自然图像数据集上的各种实验表明,FedBB在性能和效率上均优于其他算法。此外,它只需要有限的统计信息,这有利于隐私保护。通过消融研究,我们证明了PNB损失和CBR独立地贡献于性能。由于FedBB旨在构建一个能准确分类所有类别的全局模型,它可以作为通用和个性化FL的基线。

英文摘要

Class imbalance is a common problem in deep learning that severely degrades performance. In federated learning (FL), it is a critical factor contributing to non-identically distributed data (non-IID). Building on several previous attempts, we define and analyze imbalance issues in FL at three levels: inter-case, inter-class, and inter-client. Inter-case imbalance addresses the imbalance in every single class; inter-class imbalance compares the number of data between different classes. Inter-client imbalance represents different skewness of local data between clients. Based on these concepts, we propose FedBB, which consists of two main components: (1) Positive Negative Balanced (PNB) loss function addresses the inter-case and inter-class imbalances in local training, enhancing generalization on highly skewed local client datasets. It optimizes both multi-label and multi-class classifications by assigning higher weights to minority cases or classes. (2) Client Balanced Reweighting (CBR) reweights clients based on inter-client imbalance during model aggregation, giving greater weight to models trained on less skewed datasets. Various experiments on X-ray and natural image datasets demonstrate that FedBB outperforms other algorithms in both performance and efficiency. Additionally, it requires limited statistical information, which is beneficial for privacy protection. Through ablation studies, we proved that PNB loss and CBR independently contribute to performance. As FedBB aims to build a global model that accurately classifies all classes, it can serve as a baseline for the generic and personalized FL.

2606.10294 2026-06-10 cs.LG cs.AI cs.AR cs.NE physics.comp-ph 交叉投稿

LLM-Guided Neural Architecture Search for Robust Co-Design of Physical Neural Networks

LLM引导的神经架构搜索用于物理神经网络的鲁棒协同设计

Tyler King, Timothee Leleu

发表机构 * University of California, Berkeley(加州大学伯克利分校) DeepMind(深度思维)

AI总结 提出UH-NAS框架,利用大语言模型作为进化算子,协同优化任务准确率和推理能耗,实现跨硬件平台的公平比较,在光学MZI硬件上发现更鲁棒的架构。

详情
AI中文摘要

在非常规硬件上部署神经网络需要同时优化任务准确率和平台特定约束(如能耗成本、物理非理想性和数值精度)的架构。现有的神经架构搜索(NAS)方法通常针对单一硬件系列,限制了跨平台比较和泛化。我们提出了非常规硬件神经架构搜索(UH-NAS),一个硬件无关的、LLM引导的NAS框架,它集成语言模型作为进化算子来协同优化准确率和推理能耗。通过将硬件暴露为可交换的后端,并配备每平台能耗模型、物理约束和非理想性模拟器,UH-NAS能够在无需修改搜索算法的情况下,对各种后端进行公平的系统级比较。在光学MZI硬件上测试,UH-NAS发现了比传统基线更多样化、更鲁棒的架构,同时优于现有的LLM-to-NAS方法。对非理想性下架构鲁棒性以及系统提示作用的进一步消融研究,凸显了架构-硬件协同设计对新兴计算平台的重要性。

英文摘要

Deploying neural networks on unconventional hardware demands architectures that co-optimize task accuracy and platform-specific constraints such as energy cost, physical non-idealities, and numerical precision. Existing neural architecture search (NAS) methods are typically tailored to a single hardware family, limiting cross-platform comparison and generalization. We introduce Unconventional Hardware Neural Architecture Search (UH-NAS), a hardware-agnostic, LLM-guided NAS framework that integrates language models as evolutionary operators to co-optimize accuracy and inference energy. By exposing hardware as a swappable backend with per-platform energy models, physical constraints, and non-ideality simulators, UH-NAS enables fair system-level comparisons across various backends without modifying the search algorithm. Tested on optical MZI hardware, UH-NAS discovers more diverse, robust architectures than conventional baselines while outperforming existing LLM-to-NAS approaches. Additional ablations on architecture robustness under non-idealities and the role of system prompts highlight the importance of architecture-hardware co-design for emerging computing platforms.

2606.10321 2026-06-10 cs.LG cs.AI cs.RO math.OC 交叉投稿

Baseline-Free Policy Optimization for Neural Combinatorial Optimization

无基线的神经组合优化策略优化

Carlos S. Sepúlveda, Gonzalo A. Ruz

AI总结 提出使用GRPO算法消除神经组合优化中的基线依赖,避免训练崩溃,在TSP和CVRP上达到接近POMO的性能。

详情
AI中文摘要

神经组合优化(NCO)训练自回归策略以解决路由问题。标准训练算法REINFORCE使用滚动基线,需要维护并定期更新策略的冻结副本以降低方差。这种基线引入了一个结构脆弱性:在更难的问题实例上,较差的基线会产生噪声梯度估计,从而破坏训练稳定性。我们评估了来自大语言模型对齐的组相对策略优化(GRPO),该算法通过归一化组内采样轨迹的优势完全消除了基线。在RL4CO框架内对TSP和CVRP基准上的五种RL算法进行受控比较,我们发现:(i) GRPO避免了REINFORCE在TSP-100上观察到的训练崩溃,其中性能在预热阶段后立即从成本9.8下降到52.1,并且在延长训练下无法恢复;(ii) 在匹配的梯度更新次数下,GRPO达到了与POMO(一种基于AM的强多起点基线)在2%以内的解质量,同时无需外部基线;(iii) P3O,一种也来自对齐文献的成对偏好算法,在TSP上具有竞争力,但在CVRP上表现出更高的变异性。这些结果表明GRPO是一种有前途的无基线NCO替代方案,特别是在基线依赖训练变得脆弱的场景中。

英文摘要

Neural combinatorial optimization (NCO) trains autoregressive policies to solve routing problems. The standard training algorithm, REINFORCE with a rollout baseline, requires maintaining and periodically updating a frozen copy of the policy for variance reduction. This baseline introduces a structural vulnerability: on harder instances, a poor baseline produces noisy gradient estimates that can destabilize training. We evaluate Group Relative Policy Optimization (GRPO), an algorithm from large language model alignment that eliminates the baseline entirely by normalizing advantages within groups of sampled trajectories. In a controlled comparison of five RL algorithms on TSP and CVRP benchmarks within the RL4CO framework, we find that: (i) GRPO avoids the training collapse observed with REINFORCE on TSP-100, where performance degrades from cost 9.8 to 52.1 immediately after the warmup phase and does not recover under extended training; (ii) at matched gradient updates, GRPO achieves solution quality within 2% of POMO, a strong AM-based multi-start baseline, while requiring no external baseline; and (iii) P3O, a pairwise preference algorithm also from the alignment literature, is competitive on TSP but shows higher variability on CVRP. These results identify GRPO as a promising baseline-free alternative for NCO, particularly in settings where baseline-dependent training becomes fragile.

2606.10384 2026-06-10 nlin.AO cs.AI physics.comp-ph 交叉投稿

Towards Critical Branching Mechanism in Recurrent Neural Networks

递归神经网络中的临界分支机制

Feixiang Ren, Ling Feng

发表机构 * Department of Physics, National University of Singapore(新加坡国立大学物理系) Institute of High Performance Computing, Agency for Science, Technology and Research (A*STAR)(科技研究局高性能计算研究所)

AI总结 本文通过分析LSTM网络隐藏态动力学,发现小规模网络在最优训练阶段呈现近临界动力学(无标度雪崩统计和分支参数接近1),而大规模网络保持亚临界,并引入混合分支过程框架解释亚临界分支与1/f噪声的共存。

详情
AI中文摘要

临界性已被提出作为生物神经系统中的关键组织原则,但其在人工神经网络中的起源和相关性仍不清楚。我们分析了训练后的长短期记忆(LSTM)网络中的隐藏态动力学,并表明接近其最优训练时期(步数)的小型网络表现出无标度雪崩统计和接近1的分支参数,指示近临界动力学,而较大的模型保持亚临界。为了解释亚临界分支与稳健的$1/f^{\beta}$噪声的共存,我们引入了一个混合分支过程框架,将异质分支动力学与长程时间相关性联系起来。这些结果将LSTM中的类临界行为识别为一种涌现的、依赖于容量的动力学机制。

英文摘要

Criticality has been proposed as a key organizing principle in biological neural systems, yet its origin and relevance in artificial neural networks remain unclear. We analyze hidden-state dynamics in trained long short-term memory (LSTM) networks and show that small networks near their optimal training epochs (steps) exhibit scale-free avalanche statistics and branching parameters close to unity, indicative of near-critical dynamics, while larger models remain subcritical. To explain the coexistence of subcritical branching with robust $1/f^β$ noise, we introduce a mixture branching process framework that links heterogeneous branching dynamics to long-range temporal correlations. These results identify critical-like behavior in LSTMs as an emergent, capacity-dependent dynamical regime.

2606.10385 2026-06-10 cs.LG cs.AI 交叉投稿

Beyond Absolute Imitation: Anchored Residual Guidance for Privileged On-Policy Distillation

超越绝对模仿:基于锚定残差引导的特权在线蒸馏

Wenhao Zhang

发表机构 * South China University of Technology(华南理工大学)

AI总结 提出锚定残差在线蒸馏(AR-OPD),通过部分特权教师建立局部兼容锚点并注入受控残差,解决特权在线蒸馏中后见偏差导致的局部不可达问题,在推理任务上平均提升2.3个点。

Comments 17 pages, 8 figures. Project page: https://vanhowe.github.io/AR-OPD/

详情
AI中文摘要

在线蒸馏(OPD)通过将学生模型与教师在其自身轨迹上的预测分布对齐,在增强LLM复杂推理方面展现出显著的实证收益。一种新兴变体——特权OPD,通过使用增强特权信息(如oracle轨迹)的自教师模型进一步强化该范式,以缓解师生能力差距,同时提供密集的、答案导向的监督。然而,当前方法将特权信息视为一个整体的模仿目标,未能将局部可达的推理步骤与未来条件的oracle信号分离。因此,学生被鼓励去匹配一个事后偏差分布,该分布通常落在其局部预测支持之外。这种可达性不匹配激励学生模型跳过有效的中间推理,转而采用局部不支持的捷径。为解决此问题,我们引入锚定残差在线蒸馏(AR-OPD),一种解耦特权监督的双视角框架。AR-OPD不强制执行严格的全局模仿,而是使用部分特权教师建立局部兼容锚点,将oracle预见性隔离并作为受控残差注入,以提供目标导向的引导。在多种推理任务上,AR-OPD比完全特权OPD高出2.3个点,比SFT高出7.9个点。关键的是,这种锚定残差机制将事后泄漏减少了21.7%,并缓解了后期漂移,在超过768个token的挑战性长程轨迹上取得了高达7.2个点的优势。

英文摘要

On-policy distillation (OPD) has demonstrated strong empirical gains in enhancing complex reasoning in LLMs by aligning a student model with a teacher's predictive distribution over the student's own trajectories. An emerging variant, Privileged OPD, further strengthens this paradigm by employing a self-teacher model augmented with privileged information, such as oracle traces, to mitigate teacher-student capacity gaps while providing dense, answer-directed supervision. However, current methods treat privileged information as a monolithic imitation target, failing to disentangle locally reachable reasoning steps from future-conditioned oracle signals. Consequently, the student is encouraged to match a hindsight-biased distribution that often falls outside its local predictive support. This reachability mismatch incentivizes the student model to skip valid intermediate reasoning in favor of locally unsupported shortcuts. To resolve this, we introduce Anchored Residual On-Policy Distillation (AR-OPD), a dual-view framework that disentangles privileged supervision. Rather than enforcing strict full-view imitation, AR-OPD establishes a locally compatible anchor using a partially privileged teacher, isolating and injecting oracle foresight as a controlled residual to provide destination-directed guidance. Across diverse reasoning tasks, AR-OPD outperforms full privileged OPD by 2.3 points and SFT by 7.9 points. Crucially, this anchored residual mechanism reduces hindsight leakage by 21.7% and mitigates late-stage drift, yielding up to a 7.2-point advantage on challenging long-horizon trajectories exceeding 768 tokens.

2606.10406 2026-06-10 cs.LG cs.AI 交叉投稿

FOGO: Forgetting-aware Orthogonalization Optimizer

FOGO:遗忘感知正交化优化器

Toan Nguyen, Yang Liu, Trung Le, Celso de Melo, Flora D. Salim

发表机构 * School of Computer Science and Engineering, University of New South Wales(新南威尔士大学计算机科学与工程学院) Department of Data Science & AI, Monash University(莫纳什大学数据科学与人工智能系) DEVCOM Army Research Laboratory(DEVCOM陆军研究实验室)

AI总结 提出FOGO优化器,通过谱正交化动量更新并利用紧凑码本记忆解决梯度干扰,在类别不平衡、持续学习和大模型微调等场景中提升收敛与知识保留。

详情
AI中文摘要

我们认为遗忘不仅局限于持续学习,而是一种普遍的优化现象:在标准训练过程中,主导的小批量梯度抑制了罕见但有用的更新方向,导致每一步的短期遗忘。当这些知识从未被重新访问时,这些损失会累积成长期遗忘——持续学习的经典失败模式。我们引入了FOGO,一种可扩展的优化器,能够持续检测并解决两种场景下的梯度干扰。FOGO对动量更新进行谱正交化,以防止主导方向垄断优化,然后将代表性的过去方向存储在基于随机投影的紧凑码本记忆中,其中成对距离在低维空间中得到可证明的保留。在每一步中,当前更新与存储方向之间的冲突通过轻量级正交校正解决,并通过近端步骤提升回来,开销极小且无需存储数据。在类别不平衡分类、领域和类别变化下的持续视觉学习、LLaVA-7B的持续微调以及GPT-2预训练中,FOGO持续改善收敛和知识保留,优于Adam和Muon。

英文摘要

We argue that forgetting is not confined to continual learning but is a general optimization phenomenon: during standard training, dominant mini-batch gradients suppress rare but useful update directions, causing short-term forgetting at every step. When such knowledge is never revisited, these losses compound into long-term forgetting-the classical failure mode of continual learning. We introduce FOGO, a scalable optimizer that continuously detects and resolves gradient interference across both regimes. FOGO spectrally orthogonalizes momentum updates to prevent dominant directions from monopolizing optimization, then stores representative past directions in a compact codebook memory built on random projection, where pairwise distances are provably preserved in low-dimensional space. At each step, conflicts between the current update and stored directions are resolved via lightweight orthogonal correction and lifted back through a proximal step, with minimal overhead and no data storage. Across class-imbalanced classification, continual visual learning under domain and class shifts, continual fine-tuning of LLaVA-7B, and GPT-2 pretraining, FOGO consistently improves convergence and knowledge retention, outperforming Adam and Muon.

2606.10448 2026-06-10 cs.LG cs.AI 交叉投稿

Mitigating Bias in Low-SNR Financial Reinforcement Learning via Quantum Representations

通过量子表示缓解低信噪比金融强化学习中的偏差

Zeyu Liu, Xuanzhi Feng, Sing Kwong Lai, Yuanchen Gao, Xiaoyi Pang, Hualei Zhang, Jingcai Guo, Jie Zhang, Song Guo

发表机构 * The Hong Kong University of Science and Technology(香港科技大学)

AI总结 针对低信噪比金融市场中SAC算法的不稳定性,提出FPQC-SAC变体,在表征层使用参数化量子电路约束特征传播,减少极端波动影响,在真实组合管理任务中累计收益相对提升66.89%。

Comments Preprint. Code available at https://github.com/ZeyuLIU-UST/FPQC-SAC-main

详情
AI中文摘要

金融市场是典型的低信噪比(SNR)环境,这常常使Soft Actor-Critic(SAC)等离策略最大熵方法不稳定。具体来说,噪声状态表示可能产生不可靠的Q值估计,而自举会放大这些误差,形成我们称之为“金融熵陷阱”的失效模式。在本文中,我们提出FPQC-SAC,一种高效且即插即用的SAC变体,它在演员和评论家网络之前放置一个紧凑且有界的参数化量子电路(PQC),以在表征层约束特征传播,而不是过滤原始输入或在自举后正则化Q值。值得注意的是,FPQC-SAC减少了极端市场波动对贝尔曼目标估计的影响,而可训练的量子纠缠保留了灵活的跨资产交互。在真实投资组合管理任务上的实证评估表明,FPQC-SAC通过实现比标准无约束SAC累计收益相对提升66.89%,显著增强了样本外稳定性和累计收益,并且比最佳连续控制深度强化学习基线高出约27%。开源代码可在该https URL获取。

英文摘要

The financial market is a typical low signal-to-noise ratio (SNR) setting, which often destabilizes off-policy maximum-entropy methods like Soft Actor-Critic (SAC). Specifically, noisy state representations may produce unreliable Q-value estimates, and bootstrapping amplifies these errors, forming a failure mode we call the "Financial Entropy Trap". In this paper, we propose FPQC-SAC, an efficient and plug-and-play SAC variant that places a compact and bounded Parameterized Quantum Circuit (PQC) before the actor and critic networks to constrain feature propagation at the representation level, rather than filtering raw inputs or regularizing Q-values after bootstrapping. Notably, FPQC-SAC reduces the impact of extreme market fluctuations on Bellman target estimation, while trainable quantum entanglement preserves flexible cross-asset interactions. Empirical evaluations on real-world portfolio management tasks demonstrate that FPQC-SAC substantially enhances out-of-sample stability and cumulative returns by achieving a 66.89% relative gain in cumulative return over standard unconstrained SAC and outperforms the best continuous-control deep reinforcement learning baseline by approximately 27%. Open-source code is available at https://github.com/ZeyuLIU-UST/FPQC-SAC-main.

2606.10461 2026-06-10 cs.LG cs.AI cs.CL 交叉投稿

ERAlign: Energy-based Representation Alignment of GNNs and LLMs on Text-attributed Graphs

ERAlign: 文本属性图上GNN与LLM的基于能量的表示对齐

Xianlin Zeng, Fan Xia, Xiangyu Chen

发表机构 * University of Science and Technology of China(中国科学技术大学)

AI总结 提出ERAlign框架,利用能量模型对齐GNN和LLM的表示,通过能量差异优化实现分布一致性,在8个数据集上取得最优性能。

Comments Accepted to ICML 2026

详情
AI中文摘要

文本属性图(TAGs)将文本节点属性与图结构相结合,以描述丰富的关联语义。最近整合图神经网络(GNNs)和大语言模型(LLMs)的努力在TAGs学习上显示出前景,但实现良好对齐的表示仍然具有挑战性。先前的研究主要依赖于执行粗粒度匹配的启发式方法。它们缺乏足够的约束,忽略了分布对齐,导致表示漂移和泛化能力有限。基于能量模型(EBMs),我们提出了一种基于能量的表示对齐(ERAlign)框架,该框架将GNN编码的图结构和LLM导出的文本嵌入投影到共享潜在空间,以实现分布一致性。具体来说,层间对齐通过距离度量量化,并通过EBM目标进行优化。通过降低能量值,我们的框架为下游任务产生良好对齐的表示。在训练过程中,我们引入能量差异(ED)以避免与难以处理的归一化相关的高采样成本。ED还具有更高的训练效率和减少能量景观失真的理论保证。在八个TAG数据集上的实证评估表明,ERAlign在不同监督水平和跨任务迁移场景下均获得了最先进的性能。

英文摘要

Text-attributed Graphs (TAGs) incorporate textual node attributes with graph structures to describe rich relational semantics. Recent efforts to integrate Graph Neural Networks (GNNs) and Large Language Models (LLMs) have shown promise for learning on TAGs, yet achieving well-aligned representations remains challenging. Prior studies largely rely on heuristics that perform coarse-grained matching. They lack sufficient constraints and ignore distributional alignment, leading to representation drift and limited generalization. Building on Energy-based Models (EBMs), we propose an Energy-based Representation Alignment (ERAlign) framework that projects GNN-encoded graph structure and LLM-derived text embeddings in a shared latent space to achieve distribution consistency. Concretely, layer-wise alignment is quantified by a distance metric and optimized via an EBM objective. By decreasing energy values, our framework yields well-aligned representations for downstream tasks. During training, we introduce Energy Discrepancy (ED) to avoid high sampling costs associated with intractable normalization. ED also carries theoretical guarantees of higher training efficiency and reduced energy landscape distortion. Empirical evaluations on eight TAG datasets demonstrate that ERAlign obtains state-of-the-art performance across varying levels of supervision and cross-task transfer scenarios.

2606.10499 2026-06-10 cs.LG cs.AI 交叉投稿

MoE Enhanced Federated Learning for Spatiotemporal Prediction

基于混合专家模型增强的联邦学习用于时空预测

Zhehao Dai, Xiao Han, Zhaolin Deng, Zijian Zhang, Xiangyu Zhao, Guojiang Shen, Xiangjie Kong

发表机构 * Zhejiang University of Technology, Zhejiang Key Laboratory of Visual Information Intelligent Processing(浙江工业大学,浙江省可视信息智能处理重点实验室) Jilin University(吉林大学) City University of Hong Kong(香港城市大学)

AI总结 提出MoE-FedTP框架,通过轻量级混合专家网络和门控机制,在保护隐私的同时实现跨城市时空预测,有效缓解数据稀缺和异质性问题。

详情
AI中文摘要

交通预测是智能交通系统和城市计算的基础,然而由于传感器部署有限和城市发展不均衡,许多城市仍然面临交通数据稀缺的问题。跨城市知识转移因此受到越来越多的关注,使数据丰富的城市能够帮助数据稀缺的城市。然而,集中式方法引发了隐私问题,而现有的联邦方法难以应对城市间显著的时空异质性。为了解决这些挑战,我们提出了MoE-FedTP,一种基于轻量级混合专家(MoE)网络的个性化联邦跨城市时空预测框架。MoE-FedTP首先利用时空神经网络从源城市和目标城市提取特征,然后通过部分参数共享引入来自不同源城市的专家网络集合。门控机制动态融合专家以捕捉多样的交通动态,在保护隐私的同时实现城市异质性的细粒度建模。在四个真实世界交通数据集上的实验表明,MoE-FedTP始终优于最先进的跨城市和联邦学习基线,证明了其在提高数据稀缺城市预测准确性方面的有效性。

英文摘要

Traffic prediction is fundamental to intelligent transportation systems and urban computing, yet many cities continue to suffer from traffic data scarcity due to limited sensor deployment and uneven urban development. Cross-city knowledge transfer has thus attracted increasing attention, enabling data-rich cities to assist data-scarce ones. However, centralized approaches raise privacy concerns, while existing federated methods struggle with pronounced spatiotemporal heterogeneity across cities. To address these challenges, we propose MoE-FedTP, a personalized federated cross-city spatiotemporal prediction framework based on lightweight Mixture-of-Experts (MoE) networks. MoE-FedTP first employs spatiotemporal neural networks to extract features from both source and target cities, then introduces a set of expert networks derived from different source cities through partial parameter sharing. A gating mechanism dynamically fuses the experts to capture diverse traffic dynamics, achieving fine-grained modeling of urban heterogeneity while preserving privacy. Experiments on four real-world traffic datasets show that MoE-FedTP consistently outperforms state-of-the-art cross-city and federated learning baselines, demonstrating its effectiveness in enhancing prediction accuracy for data-scarce cities.

2606.10530 2026-06-10 cs.LG cs.AI 交叉投稿

Machine Learning Methods for Studying Latent Neural Activity Dynamics

研究潜在神经活动动力学的机器学习方法

Shufeng Kong, Fumei Deng, Xinyi Dong, Caihua Liu, Weiwei Chen, Yingheng Wang, Daniel Cao, Azahara Oliva, Antonio Fernandez-Ruiz, Carla Gomes

发表机构 * School of Software Engineering, Sun Yat-sen University(中山大学软件工程学院) Department of Computer Science, Cornell University(康奈尔大学计算机科学系) Department of Neurobiology and Behavior, Cornell University(康奈尔大学神经生物学与行为学系) Department of Ecology and Evolutionary Biology, Cornell University(康奈尔大学生态学与进化生物学系) School of Computer Science and Artificial Intelligence, Foshan University(佛山大学计算机科学与人工智能学院)

AI总结 综述从状态空间模型到深度生成模型的潜在变量模型,涵盖单区域动力学、多区域通信和行为对齐建模,并讨论大规模神经基础模型及未来挑战。

Comments Accepted by IJCAI 2026 survey track

详情
AI中文摘要

脑记录的最新发展推动了对能够解码大量神经元潜在结构的机器学习工具的需求。本文提供了全面的综述,概述了潜在变量模型(LVM)从早期状态空间模型到最近深度生成模型的轨迹。我们将文献组织为三个密切相关的领域:(1)单区域潜在动力学,包括从线性动力系统到由循环神经网络(RNN)和神经常微分方程(ODE)表示的更复杂动力学模型;(2)多区域通信,采用概率和子空间方法研究信息如何在不同脑区之间传递,考虑突触传播延迟和网络连接;(3)行为对齐建模,旨在通过监督或对比学习将与任务表现相关的神经活动与其他内部状态分离。本综述还包括大规模神经基础模型,如Transformer和扩散模型,它们依赖大规模预训练以实现跨主体的最佳性能。最后,我们总结并讨论基准、评估标准和开放挑战,如识别因果联系或通信方向的能力,以促进弥合可解释脑动力学与可靠神经解码之间的未来研究。

英文摘要

Recent developments in brain recording are driving a demand for machine learning tools capable of decoding the latent structure of large populations of neurons. In this paper, we provide a comprehensive survey that outlines the trajectory of Latent Variable Models (LVMs) from early state-space models to more recent deep generative models. We organize the literature into three closely related domains: (1) Single-Region Latent Dynamics, which includes models such as linear dynamical systems to more complex dynamics represented by Recurrent Neural Networks (RNNs) and Neural Ordinary Differential Equations (ODEs); (2) Multi-Region Communication, which employs probabilistic as well as subspace methods to study how information is transferred across different brain areas considering synaptic propagation delays and network connectivity; and (3) Behavior-Aligned Modeling, which seeks to disentangle neural activity related to task performance from other internal states via supervised or contrastive learning. This survey also includes large-scale neural foundation models, such as Transformers and diffusion models, that rely on large-scale pre-training for optimal performance across subjects. Finally, we conclude and discuss benchmarks, evaluation criteria, and open challenges, such as the ability to identify causal links or directionality of communication, to facilitate future research for bridging interpretable brain dynamics with reliable neural decoding.

2606.10531 2026-06-10 cs.CL cs.AI 交叉投稿

LC-QAT: Data-Efficient 2-Bit QAT for LLMs via Linear-Constrained Vector Quantization

LC-QAT: 通过线性约束向量量化实现LLM的数据高效2比特QAT

Haoyu Wang, Xingyu Yu, Haiyan Zhao, Fengxiang Wang, Xu Han

发表机构 * University of Science and Technology of China(中国科学技术大学)

AI总结 提出LC-QAT,一种2比特权重量化的向量量化感知训练框架,通过可微的线性映射避免离散码本查找,实现高质量PTQ初始化和端到端优化,仅用0.1%-10%训练数据即超越现有方法。

Comments Accepted by ICML 2026

详情
AI中文摘要

量化感知训练(QAT)对于极低比特大语言模型(LLMs)至关重要。当前的QAT方法主要基于标量量化(SQ),虽然能高效优化,但在2比特精度下性能严重下降。另一方面,向量量化(VQ)提供了更高的表示能力,但其离散码本查找阻碍了端到端训练。我们提出LC-QAT,一种2比特权重量化的VQ-QAT框架,通过离散向量上的学习仿射映射表示量化权重,从而在训练前向传播中无需显式码本查找即可实现高质量PTQ初始化和完全可微的端到端优化。这种强大的训练后初始化使LC-QAT具有高度数据效率。在多种LLM上的实验表明,LC-QAT在使用仅0.1%-10%训练数据的情况下,始终优于最先进的QAT方法。我们的结果确立了LC-QAT作为极低比特模型部署的实用且可扩展的解决方案。

英文摘要

Quantization-aware training (QAT) is essential for extremely low-bit large language models (LLMs). Current QAT methods are mainly based on scalar quantization (SQ), which enables efficient optimization but suffers from severe performance degradation at 2-bit precision. On the other hand, vector quantization (VQ) provides substantially higher representational capacity, but its discrete codebook lookup prevents end-to-end training. We propose LC-QAT, a 2-bit weight-only VQ-QAT framework that represents quantized weights via a learned affine mapping over discrete vectors, which yields a high-quality PTQ initialization and enables fully differentiable end-to-end optimization without explicit codebook lookup in the training forward pass. This strong post-training initialization makes LC-QAT highly data-efficient. Experiments across diverse LLMs demonstrate that LC-QAT consistently outperforms state-of-the-art QAT methods while using only 0.1%--10% of the training data. Our results establish LC-QAT as a practical and scalable solution for extreme low-bit model deployment.

2606.10543 2026-06-10 cs.LG cs.AI cs.ET q-bio.QM 交叉投稿

Flexible Flows for Biological Sequence Design

生物序列设计的灵活流模型

Yogesh Verma, Dani Korpela, Harri Lähdesmäki, Vikas Garg

发表机构 * Aalto University(阿尔托大学) YaiYai Ltd(YaiYai有限公司) OpenProtein.AI

AI总结 提出结构化耦合、潜编辑速率参数化和潜分类器无引导机制,实现变长序列生成和细粒度控制,在多种生物序列任务中达到最优性能。

详情
AI中文摘要

设计功能性生物序列需要在严格的进化和生物物理约束下导航巨大的离散空间。离散流匹配(DFM)提供了在此类空间上的生成框架,但现有方法依赖于生物学上无信息的耦合,并且在变长序列生成和细粒度控制方面灵活性有限。我们提出了一种结构化耦合,编码序列元素间的领域特定偏好,将源分布偏向合理区域,而不修改流目标或训练过程。在此基础上,我们引入了一种基于潜编辑的速率参数化,通过基于共享全局潜变量的编辑操作(类似于潜变量模型)对变长生成进行建模,同时保持可追踪性。我们进一步引入了一种潜分类器无引导机制,在连续潜空间中连贯地引导生成,以及用于测试时控制编辑操作的Dirichlet先验温度缩放。我们的方法在多种生物序列任务中实现了最先进的性能,包括密度估计、无条件和条件DNA序列生成以及肽序列生成。

英文摘要

Designing functional biological sequences requires navigating vast discrete spaces under strict evolutionary and biophysical constraints. Discrete Flow Matching (DFM) offers a generative framework over such spaces, but existing approaches rely on biologically uninformative couplings and offer limited flexibility for variable-length sequence generation and fine-grained control. We propose a structured coupling that encodes domain-specific preferences among sequence elements, biasing the source distribution toward plausible regions without modifying the flow objective or training procedure. Building on this, we introduce a latent edit-based rate parameterization that models variable-length generation via edit operations conditioned on a shared global latent, akin to a latent variable model, while remaining tractable. We further introduce a latent classifier-free guidance mechanism that steers generation coherently in continuous latent space, along with Dirichlet-prior temperature scaling for test-time control over edit operations. Our method achieves state-of-the-art performance across diverse biological sequence tasks, including density estimation, unconditional and conditional DNA sequence generation, and peptide sequence generation.

2606.10580 2026-06-10 cs.LG cs.AI 交叉投稿

Convergence of Monte Carlo Optimistic Policy Iteration: Beyond Uniform State-Action Updates

蒙特卡洛乐观策略迭代的收敛性:超越均匀状态-动作更新

Octave Oliviers, Glenn Vinnicombe

发表机构 * Department of Engineering, University of Cambridge(剑桥大学工程系)

AI总结 本文证明,在每状态动作均匀更新的条件下,首次访问蒙特卡洛乐观策略迭代收敛到最优,放宽了传统均匀状态-动作更新的要求,并通过均场动力学和锁定论证方法给出证明。

详情
AI中文摘要

蒙特卡洛乐观策略迭代(MC-O-PI)的渐近行为是一个长期悬而未决的问题。当环境模型未知时(这在实践中很常见),唯一已知的保证收敛到最优性的条件是不切实际的。在其标准形式中,该条件要求用于策略评估的回合在整个状态-动作空间上均匀初始化。本文严格放宽了这一要求。具体来说,我们证明即使更新仅在每个状态内的动作上均匀,首次访问MC-O-PI也能收敛到最优性。这允许回合以任意频率从不同状态开始;当状态空间很大或未知但每个状态中的动作空间可管理时,这是一种现实的实现。证明脱离了Tsitsiklis的经典分析,其中心交换性论证在状态以不同频率更新时不再适用。相反,我们首先证明当更新在每个状态的动作上均匀时,MC-O-PI的均场动力学生成单调改进的策略,然后通过扩展组合稳定性-ODE方法的锁定论证,证明噪声不能持续阻止这种改进。这种方法为一般研究乐观策略迭代算法提供了一种新途径。

英文摘要

The asymptotic behaviour of Monte Carlo optimistic policy iteration (MC-O-PI) is a long-standing open question. When the model of the environment is unknown, as is common in practice, the only known condition that guarantees convergence to optimality is impractical. In its canonical form, this condition requires that the episodes used for policy evaluation be initialised uniformly over the entire state-action space. This paper strictly relaxes that requirement. Specifically, we prove that initial-visit MC-O-PI converges to optimality even when updates are uniform only over the actions within each state. This allows episodes to start in different states at arbitrary frequencies; a realistic implementation when the state space is large or unknown but the action space in each state is manageable. The proof departs from the classical analysis of Tsitsiklis whose central commutativity argument no longer applies when states are updated at different frequencies. Instead, we first show that the mean-field dynamics of MC-O-PI generate monotonically improving policies when updates are uniform over the actions in each state, and then prove that noise cannot consistently prevent this improvement by extending the lock-in argument of the combined stability-ODE method. This approach suggests a new way to study optimistic policy-iteration algorithms in general.

2606.10587 2026-06-10 cs.LG cs.AI 交叉投稿

Towards Diverse Scientific Hypothesis Search with Large Language Models

面向多样化科学假设搜索的大语言模型

Haorui Wang, Parshin Shojaee, Kazem Meidani, Kunyang Sun, José Miguel Hernández-Lobato, Teresa Head-Gordon, Jiajun He, Chandan K. Reddy, Chao Zhang, Yuanqi Du

发表机构 * University of California, Berkeley(加州大学伯克利分校)

AI总结 针对科学假设搜索中多样性崩溃问题,提出基于并行回火的多温度进化框架,在固定验证预算下提升假设质量与多样性。

Comments ICML 2026

详情
AI中文摘要

大语言模型(LLMs)在加速科学发现方面日益崛起,最近在生成有效科学假设等高级任务中表现突出。然而,在许多发现场景中,目标并非识别单一最佳假设,因为验证可能噪声大且成本高,科学家受益于一组高质量替代假设,以对冲下游不确定性,寻求最佳解决方案。尽管如此,常用的进化搜索策略在假设生成中往往优先优化而非探索,搜索过程中的选择压力导致多样性崩溃。受这些局限性的启发,我们将假设搜索表述为采样问题,目标是在固定验证预算下高效生成多样化、高质量的假设。基于这一视角,我们提出\ours,一种受经典并行回火算法启发的进化框架,在多个温度水平下搜索假设,并实现跨温度的原则性信息交换,以在不干扰收敛的情况下改善探索。在分子发现、方程发现和算法发现等领域,我们的方法在相同验证预算下持续提升假设质量和多样性,生成的候选假设在更昂贵的下游计算验证中仍保持稳健。

英文摘要

Large language models (LLMs) are on the rise for accelerating scientific discovery, most recently in advanced tasks such as generating valid scientific hypotheses. Yet in many discovery settings, the goal is not to identify a single best hypothesis since validation can be noisy and expensive, and scientists benefit from a set of high-quality alternative hypotheses that hedge against downstream uncertainty for the best solutions. Nevertheless, commonly used evolutionary search recipes tend to prioritize optimization over exploration in hypothesis generation, and the resulting selection pressure during the search process leads to diversity collapse. Motivated by these limitations, we formulate hypothesis search as a sampling problem, where the objective is to efficiently produce diverse, high-quality hypotheses under a fixed validation budget. Building on this perspective, we propose \ours, an evolutionary framework inspired by the classical parallel tempering algorithm that searches hypotheses at multiple temperature levels and enables principled information exchange across temperatures to improve exploration without disrupting convergence. Across domains including molecular discovery, equation discovery, and algorithm discovery, our approach consistently improves both hypothesis quality and diversity under the same validation budget, and produces candidates that remain robust under more expensive downstream computational validations.

2606.10596 2026-06-10 cs.LG cs.AI cs.SY eess.SY 交叉投稿

Embedding Hybrid Systems into Continuous Latent Vector Fields

将混合系统嵌入连续潜在向量场

Sangli Teng, Hang Liu, Koushil Sreenath

AI总结 证明当m>2n时,n维混合系统可嵌入m维欧氏空间中的连续向量场,并基于此提出一种潜在神经ODE方法,从时间序列数据中准确恢复混合系统流,优于现有方法。

Comments Accepted to ICML 2026

详情
AI中文摘要

这项工作证明了当$m>2n$时,一个$n$维混合系统可以嵌入到一个$m$维欧氏空间中,并在其嵌入图像上配备一个连续向量场。这一结果表明,一个本质上不连续的混合系统通常允许一个连续的 extrinsic 表示,该表示对于可微优化是适定的。基于这一存在性定理,我们表明,在潜在空间和状态空间中都具有一致性损失的潜在神经ODE可以准确恢复混合系统的流。大量实验表明,所提出的方法在仅从时间序列数据学习具有不同几何形状的混合系统方面优于现有方法。

英文摘要

This work proves that an $n$-dimensional hybrid system can be embedded into an $m$-dimensional Euclidean space equipped with a continuous vector field on its embedded image whenever $m>2n$. This result suggests that an intrinsically discontinuous hybrid system generically admits a continuous extrinsic representation that is well-posed for differentiable optimization. Building on this existence theorem, we show that a latent Neural ODE with consistency loss in both the latent and state space can accurately recover the flow of hybrid systems. Extensive experiments suggest the proposed method outperforms the existing method in learning hybrid systems with varying geometries from only time series data.

2606.10650 2026-06-10 cs.CL cs.AI 交叉投稿

Dynamic Linear Attention

动态线性注意力

Xin Wang, Hui Shen, Boyuan Zheng, Xueshen Liu, Minkyoung Cho, Zhongwei Wan, Zesen Zhao, Zhuoqing Mao, Shen Yan, Mi Zhang

发表机构 * The Ohio State University(俄亥俄州立大学) University of Michigan(密歇根大学) ByteDance Seed(字节跳动Seed)

AI总结 提出DLA框架,通过信息感知动态状态合并和容量受限内存建模,解决多状态线性注意力中固定合并策略导致的错误累积问题,在16个数据集上超越现有方法。

Comments Accepted by ICML 2026

详情
AI中文摘要

大型语言模型(LLMs)对长上下文的可扩展性从根本上受限于标准注意力的二次复杂度,这促使采用具有次二次成本(sub-quadratic cost)的线性注意力机制。为了在长上下文下提高表示能力,近期方法以多状态方式组织内存。然而,现有的多状态线性注意力方法依赖于固定的状态合并策略,无法适应动态变化的令牌重要性,不可逆地模糊了关键令牌,并在长序列上导致严重的错误累积。为了解决这一限制,我们提出了DLA,一种用于多状态线性注意力的动态内存建模框架。DLA引入了(i)信息感知动态状态合并,它基于令牌级别的信息变化自适应地确定状态边界,在语义转换周围保留高分辨率表示,同时积极总结稳定区域;以及(ii)容量受限内存建模,它通过选择性地合并相邻的低信息状态来维护一个固定大小、按时间顺序排列的状态缓存,以最小的信息损失控制内存增长。我们在两种不同的线性注意力模型上预训练DLA,并在三个类别的16个数据集上进行评估。实验结果表明DLA优于现有最先进方法。

英文摘要

The scalability of Large Language Models (LLMs) to long contexts is fundamentally constrained by the quadratic complexity of standard attention, motivating the adoption of linear attention mechanisms with sub-quadratic cost. To improve representation capacity under long contexts, recent approaches organize memory in a multi-state manner. However, existing multi-state linear attention methods rely on fixed state merging policies that cannot adapt to dynamically varying token importance, irreversibly obscuring critical tokens and causing severe error accumulation over long sequences. To address this limitation, we propose DLA, a dynamic memory modeling framework for multi-state linear attention. DLA introduces (i) Information-Aware Dynamic State Merging, which adaptively determines state boundaries based on token-level information variation, preserving high-resolution representations around semantic transitions while aggressively summarizing stable regions, and (ii) Capacity-Bounded Memory Modeling, which maintains a fixed-size, chronologically ordered state cache by selectively merging adjacent low-information states to control memory growth with minimal information loss. We pre-train DLA on two different linear attention models and evaluate on 16 datasets across three categories. Experimental results demonstrate the superiority of DLA over state-of-the-art.

2606.10802 2026-06-10 cs.LG cs.AI 交叉投稿

Boosting ECG Classification Performance by Pre-training with Synthesized Data

通过合成数据预训练提升心电图分类性能

Naoki Nonaka, Jun Seita

发表机构 * Advanced Data Science Project, RIKEN Information R&D and Strategy Headquarters(理化学研究所信息研发与战略总部先进数据科学项目)

AI总结 提出基于医学知识的高斯组合合成算法生成单导联II心电图数据,用于预训练深度神经网络,在四种异常分类中平均提升最高33.2%,尤其在小数据集场景下效果显著。

详情
AI中文摘要

深度神经网络通常需要大量数据集才能有效训练。在医学领域,由于隐私问题和某些疾病的罕见性,获取大规模数据往往具有挑战性。为了解决数据稀缺问题,我们研究了使用基于领域医学知识生成的合成数据训练深度神经网络模型的有效性。具体来说,我们针对单导联II心电图开发了一种知识驱动的高斯组合合成算法,其中每个心跳由高斯形状的P、Q、R、S和T波分量表示。使用该模拟器,我们为四种异常心电图类别生成合成数据:心房颤动、心房扑动、室性早搏和沃尔夫-帕金森-怀特综合征。我们通过使用十种不同的深度神经网络架构进行异常心电图分类来评估该合成数据的效用。结果表明,合成到真实的训练提高了四种目标异常中三种的分类性能,其中心房扑动观察到的最大架构平均增益为33.2%。进一步分析表明,合成数据带来的性能提升在真实数据集较小时更为明显。这些发现表明,基于领域知识的合成心电图可以作为有用的预训练资源,特别是在真实数据有限或难以获取的场景中。

英文摘要

Deep Neural Networks (DNNs) typically require extensive datasets for effective training. In the medical domain, acquiring large-scale data is often challenging due to privacy concerns and the rarity of certain diseases. To address this data scarcity, we investigate the efficacy of training DNN models using synthetic data, generated based on domain-specific medical knowledge. Specifically, we develop a knowledge-driven Gaussian-composition synthesis algorithm for single-lead II ECGs, in which each heartbeat is represented by Gaussian-shaped P, Q, R, S, and T wave components. Using this simulator, we generate synthetic data for four abnormal electrocardiogram (ECG) classes: atrial fibrillation (AF), atrial flutter (AFLT), premature ventricular complex (PVC), and Wolff-Parkinson-White Syndrome (WPW). We evaluate the utility of this synthetic data by conducting abnormal ECG classification using ten different DNN architectures. Our results demonstrate that synthetic-to-real training improves classification performance for three of the four target abnormalities, with the largest architecture-averaged gain of $33.2\%$ observed for AFLT. Further analysis reveals that the performance enhancement from synthetic data is more pronounced with smaller real-world datasets. These findings suggest that domain-knowledge-based synthetic ECGs can serve as a useful pre-training resource, particularly in scenarios where real-world data are limited or difficult to obtain.

2606.10835 2026-06-10 cs.LG cs.AI 交叉投稿

Geometrically Averaged Hard Target Updates for Linear Q-Learning

线性Q学习的几何平均硬目标更新

Donghwan Lee

发表机构 * School of Electrical Engineering, KAIST(韩国科学技术院电气工程学院)

AI总结 提出λ-几何加权平均的周期目标更新方法,用于线性Q学习,通过切换系统模型分析其稳定性,连接了单周期更新和投影Q值迭代。

详情
AI中文摘要

周期性硬目标更新是现代深度Q学习中最常见的稳定化手段之一。最近的研究表明,目标更新可以提高使用函数逼近(包括线性函数逼近)的Q学习的稳定性。我们引入并分析了所谓的λ-目标更新,通过将m-周期目标更新映射与λ-几何权重$(1-\lambda)\lambda^{m-1}$($\lambda \in [0,1]$)平均得到。端点$\lambda=0$恢复单周期目标更新,而连续端点$\lambda\uparrow1$恢复投影Q值迭代。我们使用切换系统模型和相关工具,研究了这种机制在线性函数逼近的Q学习(即线性Q学习)中的应用。为清晰起见,本文处理确定性版本;该公式可扩展到随机强化学习设置。

英文摘要

Periodic hard target updates are among the most common stabilization devices in modern deep Q-learning. Recent studies suggest that target updates can improve stability in Q-learning with function approximation, including linear function approximation. We introduce and analyze the so-called $λ$-target update, obtained by averaging the $m$-periodic target update maps with $λ$-geometric weights $(1-λ)λ^{m-1}$, $λ\in [0,1]$. The endpoint $λ=0$ recovers the one-period target update, while the continuous endpoint $λ\uparrow1$ recovers projected Q-value iteration. We study this mechanism for Q-learning with linear function approximation, namely linear Q-learning, using a switching-system model and related tools. For clarity, the paper treats a deterministic version; the formulation extends to stochastic reinforcement-learning settings.

2606.10890 2026-06-10 cs.LG cs.AI 交叉投稿

Optimal Post-Training Quantization Scales and Where to Find Them

最优后训练量化尺度及其寻找方法

Juan Amboage, Pablo Monteagudo-Lago, Ian Colbert, Giuseppe Franco, Nicholas Fraser

发表机构 * AMD

AI总结 提出PiSO算法,利用校准数据精确高效地计算逐通道最优量化尺度,并扩展到分组量化,在Llama和Qwen模型上显著提升困惑度和零样本准确率。

详情
AI中文摘要

后训练量化(PTQ)通过将权重映射到低比特表示来压缩大型语言模型。定义量化网格的缩放因子通常使用简单的、无数据的启发式方法选择。在这项工作中,我们提出了PiSO(分段尺度优化),一种利用校准数据在最近舍入量化下精确且高效地计算最优逐通道权重尺度的算法。PiSO将尺度搜索空间划分为有限个区间,在这些区间上目标函数具有闭式最小值。我们通过原则性启发式方法将PiSO扩展到分组量化,并提出了将尺度优化与纠错交错的有效策略。在Llama和Qwen模型上,跨多个模型大小和目标权重位宽的实验表明,在困惑度和下游零样本准确率上均有持续改进,无论是单独使用还是与纠错结合。特别地,我们观察到随着目标位宽变窄、量化变得更加困难,收益增加。

英文摘要

Post-training quantization (PTQ) compresses large language models by mapping weights to low-bit representations. The scaling factor that defines the quantization grid is typically chosen using simple, data-free heuristics. In this work, we present PiSO (Piecewise Scale Optimization), an algorithm that leverages calibration data to compute the optimal channel-wise weight scales exactly and efficiently under round-to-nearest quantization. PiSO partitions the scale search space into finitely many intervals on which the objective admits a closed-form minimizer. We extend PiSO to group-wise quantization via principled heuristics and propose effective strategies for interleaving scale optimization with error correction. Experiments on Llama and Qwen models across multiple model sizes and target weight bit-widths demonstrate consistent improvements in perplexity and downstream zero-shot accuracy, both standalone and combined with error correction. In particular, we observe increased benefits as the target bit-width narrows and quantization becomes more challenging.

2606.10908 2026-06-10 cs.SD cs.AI cs.CR cs.LG 交叉投稿

RAT: Reference-Augmented Training for ASV Anti-Spoofing

RAT:面向ASV反欺骗的参考增强训练

Vojtěch Staněk, Anton Firc, Jakub Reš, Kamil Malinka

发表机构 * Brno University of Technology(布尔诺理工大学)

AI总结 提出一种基于说话人参考录音的欺骗对抗架构,发现训练时引入参考通道可提升深度伪造检测性能,即使推理时参考缺失或失配。基于此提出参考增强训练(RAT)策略,在ASVspoof 5基准上以单个检测器达到2.57% EER和0.074 minDCF,超越大型集成系统。

Comments Accepted to Interspeech 2026

详情
AI中文摘要

我们引入了一种以说话人参考录音为条件的欺骗对抗架构,但观察到它收敛到一种在推理时有效忽略参考的解决方案。令人惊讶的是,使用参考通道进行训练会诱导出不变性,从而改进深度伪造检测,即使在推理时参考缺失或失配。基于这一观察,我们提出了一种参考增强训练(RAT)策略。与单话语基线相比,RAT产生了改进的检测性能,即使在推理时将参考录音替换为零向量时也是如此。通过严格分析,我们证明优化过程迅速减少了参考贡献,导致推理很大程度上独立于参考通道。使用RAT,我们在ASVspoof 5基准上以单个检测器实现了最先进的2.57%等错误率和0.074最小检测代价函数,甚至超越了大型集成系统。

英文摘要

We introduce a spoofing countermeasure architecture conditioned on speaker-reference recordings, but observe that it converges to a solution that effectively ignores the reference during inference. Surprisingly, training with a reference channel induces invariance that improves deepfake detection, even when the reference is absent or mismatched during inference. Based on this observation, we propose a Reference-Augmented Training (RAT) strategy. RAT yields improved detection performance compared to single-utterance baselines, even when the reference recording is replaced with a zero vector at inference. Through rigorous analysis, we demonstrate that the optimization process rapidly diminishes the reference contributions, leading to inference largely independent of the reference channel. Using RAT, we achieve state-of-the-art 2.57% EER and 0.074 minDCF on the ASVspoof 5 benchmark with a single detector, surpassing even large ensemble systems.

2606.10929 2026-06-10 cs.LG cs.AI 交叉投稿

Recoverable but Not Stationary:Local Linear Structures in Weights and Activations

可恢复但不稳定:权重和激活中的局部线性结构

Irina Piontkovskaia, Sergey Nikolenko

发表机构 * St. Petersburg Department of the Steklov Institute of Mathematics(斯捷克洛夫数学研究所圣彼得堡分所) St. Petersburg State University(圣彼得堡国立大学)

AI总结 研究神经网络中线性结构的存在性与尺度,发现局部低秩任务梯度结构,但固定任务平面假设不成立;首次恢复更新形成轨迹前缀基,捕获大部分恢复位移;提出随机搜索理论解释高维随机参数搜索有效性,并验证参数扰动与激活引导的关系。

Comments 23 pages, 8 tables, 9 figures

详情
AI中文摘要

任务向量、LoRA、激活引导和预训练权重周围的随机搜索都表明学习行为可以由线性方向控制。我们询问哪些线性结构实际存在以及它们处于什么尺度。在合成多任务Transformer和DistilGPT-2/GPT-2上的LoRA适配器中,我们发现强烈的局部低秩任务梯度结构,但拒绝了固定任务平面假设:静态基会错过恢复方向,有用的基在100步内显著漂移。然而,首次恢复更新形成了一个轨迹前缀基,捕获了LoRA恢复位移的77%。我们开发了随机搜索理论,结合高斯局部线性定理,证明了即使在非常高维的情况下随机参数搜索的有效性。我们还研究了参数扰动与激活引导之间的关系:单次梯度步产生的激活偏移与标记对比CAA引导向量的余弦为0.58,对Qwen-0.5B BoolQ陈述具有类似的引导效果。我们通过在合成Transformer和LLM上的实验验证了结果。我们的结果表明,训练网络中的线性结构不是全局任务方向,而是演化的局部几何结构,这些结构在参数和激活空间中部分持续存在。

英文摘要

Task vectors, LoRA, activation steering, and random search around pretrained weights all suggest that learned behaviour can be controlled by linear directions. We ask which linear structures actually exist and on what scale. In a synthetic multitask transformer and LoRA adapters on DistilGPT-2 / GPT-2 we find strong local low-rank task-gradient structure but reject the fixed-task-plane hypothesis: static bases miss the recovery direction, and the useful basis drifts substantially within 100 steps. However, the first recovery updates form a trajectory-prefix basis capturing 77% of the LoRA recovery displacement. We develop random search theory with a Gaussian local-linear theorem that justifies the effectiveness of random parameter search even in very high dimensions. We also study the relation between parameter perturbations and activation steering: a single gradient step produces an activation shift with 0.58 cosine to a labelled-contrast CAA steering vector, with a similar steering effect on Qwen-0.5B BoolQ statements. We validate our results with experiments on synthetic Transformers and LLMs. Our results suggest that linear structures in trained networks are not global task directions, but evolving local geometries that partially persist across parameter and activation spaces.

2606.10935 2026-06-10 cs.LG cs.AI 交叉投稿

CLP: Collocation-Length Prediction for Zero-Loss Adaptive Multi-Token Inference

CLP: 零损失自适应多令牌推理的搭配长度预测

Xuezhen Xie, Zhiqiang Zhou

发表机构 * Xiamen University(厦门大学) Tsinghua University(清华大学)

AI总结 提出CLP方法,通过轻量级线性层预测可安全接受的额外令牌数,解决多令牌预测中头-主干竞争导致的输出退化问题,在Qwen2.5模型上实现零质量损失的1.14x-1.29x加速。

Comments 13 pages, 8 figures, 8 tables

详情
AI中文摘要

大型语言模型推理受限于自回归解码,每个令牌需要一次完整的前向传播。多令牌预测(MTP)提供了一种有前景的加速路径,但现有方法存在根本性的架构缺陷:第一个令牌的MTP头与主干自身的语言模型(LM)头竞争,导致预测被接受时质量严重下降。我们将这种头-主干竞争确定为先前基于MTP的加速方法中重复和不连贯输出的根本原因。为了解决这个问题,我们提出了Backbone-as-Architect设计原则,其中主干LM头始终生成第一个令牌,MTP头仅负责后续令牌。基于这一原则,我们引入了CLP(搭配长度预测器),一个轻量级的跨度级决策层,预测每个解码步骤可以安全接受多少个额外令牌。CLP仅使用单个线性层(4.6K--7.7K参数),取代了先前工作中过度设计的1M参数门控网络。在Qwen2.5模型(0.5B、1.5B、7B)上的实验表明,CLP在1.5B模型上实现了1.20x--1.29x加速,在7B模型上实现了1.14x--1.20x加速,且零质量退化(重复率<0.02),而基于门控的方法无法加速(1.07x)或产生严重退化的输出(重复率>0.5%)。我们进一步证明,较短的预测范围(k=2)在大模型上恢复了24%更高的MTP头准确率,建立了一个可扩展感知的设计原则。我们确定MTP头预测准确率是加速的约束条件,并为未来改进建立了清晰的路线图。

英文摘要

Large language model inference is bottlenecked by autoregressive decoding, where each token requires a full forward pass. Multi-token prediction (MTP) offers a promising acceleration path, but existing approaches suffer from a fundamental architectural flaw: the MTP head for the first token competes with the backbone's own language model (LM) head, leading to severe quality degradation when predictions are accepted. We identify this head-backbone competition as the root cause of repetitive and incoherent outputs in prior MTP-based acceleration methods. To address this, we propose Backbone-as-Architect, a design principle where the backbone LM head always generates the first token, and MTP heads are responsible only for subsequent tokens. Building on this principle, we introduce CLP (Collocation-Length Predictor), a lightweight span-level decision layer that predicts how many additional tokens can be safely accepted at each decoding step. CLP uses only a single linear layer (4.6K--7.7K parameters), replacing the over-engineered 1M-parameter gate networks used in prior work. Experiments on Qwen2.5 models (0.5B, 1.5B, 7B) show that CLP achieves 1.20x--1.29x speedup on 1.5B and 1.14x--1.20x on 7B, with zero quality degradation (repetition ratio < 0.02), while gate-based approaches fail to accelerate (1.07x) or produce severely degraded outputs (repetition ratio > 0.5%). We further demonstrate that shorter prediction horizons (k=2) recover 24% higher MTP head accuracy on large models, establishing a scaling-aware design principle. We identify MTP head prediction accuracy as the binding constraint on acceleration and establish a clear roadmap for future improvements.

2606.11081 2026-06-10 cs.LG cs.AI 交叉投稿

Unifying Local Communications and Local Updates for LLM Pretraining

统一大语言模型预训练中的本地通信与本地更新

Pietro Cagnasso, Eugene Belilovsky, Edouard Oyallon

发表机构 * Concordia University(康考迪亚大学) Mila CNRS, Sorbonne University(法国国家科学研究中心,索邦大学)

AI总结 提出GASLoC算法,通过去中心化训练框架统一本地通信与更新,在异构带宽下优于DiLoCo,支持自适应优化器和多本地步骤。

Comments 38 pages, 9 figures

详情
AI中文摘要

随着训练依赖于跨集群、数据中心和低带宽链接的分布式计算,通信高效的大语言模型预训练变得越来越重要。许多实用方法降低了通信频率,但仍依赖于同步All-Reduce操作,这些操作保持相同的模型状态并将进度与全局集体操作绑定。当带宽或工作速度异构时,这可能成为瓶颈。我们引入了GASLoC,一种新颖的去中心化预训练算法,它将通信加速的概念推广到最近流行的“外部优化器”,以允许一个实用的基于八卦的训练框架,该框架与自适应优化器兼容,允许本地优化器步骤,并可以利用稀疏随机对等通信。在多个标准LLM训练任务上的实验表明,GASLoC在单步每通信设置下,对于多种拓扑结构优于最先进的去中心化算法,并且与LLM设置中现有的去中心化方法不同,它在利用多个本地步骤时能够获得与DiLoCo竞争的性能。在异构带宽设置下,我们展示了GASLoC的优势,表明它可以显著优于DiLoCo。

英文摘要

Communication-efficient pre-training of LLMs is increasingly important as training draws on compute distributed across clusters, data centers, and lower-bandwidth links. Many practical methods reduce communication frequency but still rely on synchronous All-Reduce operations that maintain identical model states and tie progress to global collectives. This can become a bottleneck when bandwidth or worker speed is heterogeneous. We introduce GASLoC, a novel decentralized pre-training algorithm that generalizes the notion of communication acceleration to the recently popular "outer optimizer" to allow a practical gossip-based training framework that is compatible with adaptive optimizers, allows for local optimizer steps, and can utilize sparse randomized peer communication. Empirically, on a number of standard LLM training tasks, we demonstrate that GASLoC outperforms state-of-the-art decentralized algorithms in single step per communication setting for a number of topologies and, unlike existing decentralized methods in the LLM setting, it allows to obtain performance competitive with DiLoCo when utilizing multiple local steps. In the heterogeneous bandwidth setting we demonstrate the advantage of GASLoC showing that it can significantly outperform DiLoCo.

2606.11087 2026-06-10 cs.LG cs.AI 交叉投稿

Test-Time Gradient Guidance of Flow Policies in Reinforcement Learning

强化学习中流策略的测试时梯度引导

Zhiyuan Zhou, Andy Peng, Charles Xu, Qiyang Li, Tobias Springenberg, Kevin Frans, Sergey Levine

发表机构 * UC Berkeley(加州大学伯克利分校) Physical Intelligence

AI总结 提出QGF算法,通过预训练参考流策略和价值函数,在测试时利用价值梯度引导策略生成高价值动作,无需额外策略学习,在离线RL基准上优于现有测试时方法且与训练时方法竞争力相当。

详情
AI中文摘要

表达性连续控制策略,如扩散模型和流模型,构成了模拟和真实机器人控制中模仿学习近期进展的基础。尽管它们在监督模仿学习设置中能够稳定扩展,但将其纳入强化学习(RL)流程以改进策略已被证明更加困难。这通常需要专门的训练目标或通过去噪过程反向传播,这会导致众所周知的稳定性问题并影响可扩展性。在本文中,我们研究了一个问题:仅在测试时采用简单的策略改进方案,同时保持稳定的监督策略训练不变,是否能够成为避免这些问题的竞争性替代方案。为此,我们提出了QGF(Q-Guided Flow),一种完全在测试时进行策略优化的RL算法。QGF通过预训练一个参考流策略(通过标准的行为克隆目标)和一个价值函数评论家,并在测试时使用价值梯度引导参考策略生成更高价值的动作,而无需任何额外的策略学习。实验上,QGF在高维动作空间的单任务和目标条件离线RL基准测试中优于先前的测试时RL方法,并且与最先进的训练时算法竞争力相当,同时运行成本更低。此外,通过避免演员-评论家训练的不稳定性,它展现出随模型规模的良好扩展性,为使用表达性策略提供了一种实用且有效的替代RL算法。

英文摘要

Expressive continuous control policies, such as diffusion and flow models, form the backbone of recent advances in scaling imitation learning for simulated and real robot control. While they are known to scale stably in the supervised imitation learning setting, incorporating them into reinforcement learning (RL) pipelines for policy improvement has proven more difficult. It often requires specialized training objectives or backpropagating through denoising processes, which cause well-known issues with stability and affect scalability. In this paper we study the question of whether simple policy improvement schemes at test time alone, leaving stable supervised policy training intact, can be a competitive alternative which sidesteps these issues. To this end, we propose QGF (Q-Guided Flow), an RL algorithm that performs policy optimization entirely at test time. QGF works by pre-training both a reference flow policy (via a standard behavioral cloning objective) and a value function critic and, at test time, using the value gradient to guide the reference policy to generate higher-value actions without any additional policy learning. Empirically, QGF outperforms prior test-time RL methods on single-task and goal-conditioned offline RL benchmarks with high-dimensional action spaces, and is competitive with state-of-the-art training-time algorithms while being much cheaper to run. Moreover, it exhibits favorable scaling with model size by avoiding the instability of actor-critic training, offering a practical and effective alternative RL algorithm with expressive policies.

2606.11189 2026-06-10 cs.LG cs.AI cs.CL 交叉投稿

A Unifying Lens on Supervised Fine-Tuning Through Target Distribution Design

通过目标分布设计审视监督微调的统一视角

Tong Xie, Yuanhao Ban, Yunqi Hong, Sohyun An, Yihang Chen, Cho-Jui Hsieh

发表机构 * University of California, Los Angeles (UCLA)(加州大学洛杉矶分校) Arena

AI总结 本文重新解读监督微调为目标分布设计,提出Q-target框架,将监督分解为对观测token的依赖强度与替代token的概率分配,并基于此提出Target-SFT方法,在多个推理任务中优于现有方法。

详情
AI中文摘要

监督微调(SFT)通常最大化示范轨迹中每个token的似然。然而,观测到的token可能非唯一、有噪声或与模型先验不一致。严格拟合这种one-hot目标可能不是最优的,尤其是当预训练模型编码了丰富的知识先验时。在这项工作中,我们将SFT重新解释为目标分布设计:不仅研究损失目标,还分析损失驱动模型匹配的token级目标。我们引入Q-target框架,将SFT监督分解为两个明确的选择:(1) 对观测token的依赖强度,以及(2) 如何将剩余概率质量分配给替代token。这一视角将许多现有的SFT变体统一为目标分布Q的隐式选择。基于这一观点,我们提出Target-SFT,直接从期望的目标分布构建训练目标。该方法在十个推理数据集-模型设置中一致优于现有方法,展示了这种基于目标的方法的有效性。总体而言,我们的公式揭示了SFT训练更基本的设计原则,并为SFT目标开辟了更广阔的搜索空间。

英文摘要

Supervised fine-tuning (SFT) typically maximizes the likelihood of every token in a demonstrated trajectory. However, an observed token can be non-unique, noisy, or misaligned with the model prior. Strictly fitting toward this one-hot target may be suboptimal, especially when the pretrained model encodes a rich knowledge prior. In this work, we reinterpret SFT as target distribution design: instead of studying only the loss objective, we analyze the token-level target that the loss drives the model to match. We introduce the Q-target framework, which decomposes SFT supervision into two explicit choices: (1) how strongly to rely on the observed token, and (2) how to allocate the remaining probability mass over alternatives. This perspective unifies many existing SFT variants as implicit choices of the target distribution Q. Building on this view, we propose Target-SFT which constructs the training objective directly from the desired target distribution. This method consistently outperforms across the ten reasoning dataset-model settings evaluated, showing the effectiveness of this target-based approach. Overall, our formulation reveals a more fundamental design principle for SFT training and opens a broader search space for SFT objectives.

2509.04027 2026-06-10 cs.AI cs.CL 版本更新

Why Does Reasoning Length Converge? Unveiling the Underfitting-Overfitting Trade-off in Chain-of-Thought

CoT-Space: 一种通过强化学习实现内部慢思考的理论框架

Zeyu Gan, Hao Yi, Yong Liu

发表机构 * Zeyu Gan, Yi Hao, Yong Liu(GAN 赵毅、LIU 刘永)

AI总结 本文提出CoT-Space理论框架,通过强化学习将推理过程从离散的token预测任务转化为连续的推理层面语义空间中的优化过程,揭示了测试时扩展中最优CoT长度的收敛是欠拟合与过拟合基本权衡的自然结果。

Comments Preprint Edition

详情
AI中文摘要

测试时扩展,主要通过强化学习(RL)中的多步链式推理(CoT)体现,已成为增强大型语言模型(LLMs)推理能力的关键范式。然而,仍存在显著的理论空白:传统token级分析无法捕捉推理层面扩展的宏观动态。为此,我们引入CoT-Space,一种新的理论框架,将推理过程从离散的token预测任务转换为连续的推理层面语义空间中的优化过程。通过从噪声和风险视角建模推理轨迹,并复兴经典学习理论中的基础原理,我们证明观察到的收敛到最优CoT长度是欠拟合与过拟合基本权衡的自然结果。我们进一步利用RL作为工具,在实验中激发并验证这些结果。我们的发现为通过RL实现内部测试时扩展提供了机制解释,为现代LLMs中优化推理轨迹提供了系统性的理论基础。

英文摘要

Test-time scaling, primarily manifested through multi-step Chain-of-Thought (CoT) reasoning via Reinforcement Learning (RL), has emerged as a pivotal paradigm for enhancing the reasoning capabilities of Large Language Models (LLMs). However, a significant theoretical gap persists: traditional token-level analysis fails to capture the macroscopic dynamics of reasoning-level scaling. To address this, we introduce CoT-Space, a novel theoretical framework that recasts the reasoning process from a discrete token-prediction task to an optimization process within a continuous, reasoning-level semantic space. By modeling the reasoning trajectory from both noise and risk perspectives and revitalizing foundational principles from classical learning theory, we demonstrate that the observed convergence to an optimal CoT length is a natural consequence of the fundamental trade-off between underfitting and overfitting. We further utilize RL as a tool to elicit and verify these results in our experiments. Our findings provide a mechanistic explanation for the internal test-time scaling via RL, offering a principled theoretical foundation to optimize reasoning trajectories in modern LLMs.

2605.11458 2026-06-10 cs.AI cs.CL cs.LO 版本更新

Adaptive Teacher Exposure for Self-Distillation in LLM Reasoning

自适应教师暴露用于大语言模型推理中的自蒸馏

Zihao Han, Tiangang Zhang, Huaibin Wang, Yilun Sun

发表机构 * ByteDance Douyin(字节跳动抖音)

AI总结 针对自蒸馏中教师暴露完整推理导致学生难以吸收的问题,提出自适应教师暴露方法ATESD,通过轻量Beta策略控制器动态调整暴露比例,并用折扣学习进步奖励优化,在多个模型和数据集上提升推理性能。

Comments 11 pages, 4 figures; code not released yet

详情
AI中文摘要

同策略自蒸馏已成为大语言模型推理的一种强大方法,其中特权教师基于参考解决方案监督学生自身的轨迹。然而,几乎所有此类方法共享的一个设计选择却未被质疑:教师总是看到完整的参考推理。我们认为这一默认设置本身就是问题的一部分,并识别出教师侧暴露不匹配:当教师基于远超学生当前能力的推理进行条件化时,产生的词元目标变得过于强大而难以吸收。一个受控的固定暴露扫描在两个层面上明确了这一点:1)完全暴露并非可靠的最佳选择,2)随着教师看到更多特权推理,学生-教师不匹配单调增长。这促使我们将教师暴露视为一个可学习的训练时控制变量,而非固定的超参数。因此,我们提出了自适应教师暴露用于自蒸馏(ATESD)。ATESD使用一个轻量级的Beta策略控制器对暴露比例进行建模,该控制器以紧凑的训练状态统计为条件,并在学生更新的一个短保持窗口内使用一个采样的暴露。为了使该暴露控制器可学习,我们使用折扣学习进步奖励对其进行优化,该奖励根据每个保留决策对学生未来改进的影响(而非其即时损失变化)进行评分,从而解决了同策略蒸馏导致的延迟信用分配问题。在AIME 24、AIME 25和HMMT 25上,使用Qwen3-{1.7B, 4B, 8B}的实验表明,ATESD持续优于竞争性的自蒸馏和强化学习基线,相比OPSD分别提高了+0.95、+2.05和+2.33个Average@12点,将自适应教师暴露确立为推理自蒸馏的一个有效新方向。

英文摘要

On-policy self-distillation has become a strong recipe for LLM reasoning, where a privileged teacher supervises the student's own rollouts while conditioning on the reference solution. A design choice shared by nearly all such methods, however, has gone unquestioned: the teacher always sees the full reference reasoning. We argue that this default itself is part of the problem and identify a teacher-side exposure mismatch: when the teacher conditions on reasoning far beyond the student's current competence, the resulting token targets become too strong to absorb. A controlled fixed-exposure sweep makes this concrete on two fronts: 1) full exposure is not reliably the best choice, and 2) student-teacher mismatch grows monotonically as the teacher sees more privileged reasoning. This motivates treating teacher exposure not as a fixed hyperparameter but as a learnable training-time control variable. We therefore propose Adaptive Teacher Exposure for Self-Distillation (ATESD). ATESD models the reveal ratio with a lightweight Beta-policy controller conditioned on compact training-state statistics, and uses one sampled exposure for a short hold window of student updates. To make this exposure controller learnable, we optimize it with a discounted learning-progress reward that scores each held decision by its effect on the student's future improvement rather than its immediate loss change, addressing the delayed credit assignment induced by on-policy distillation. Experiments on AIME 24, AIME 25, and HMMT 25 across Qwen3-{1.7B, 4B, 8B} show that ATESD consistently outperforms competitive self-distillation and RL baselines, improving over OPSD by +0.95, +2.05, and +2.33 Average@12 points respectively, and establishing adaptive teacher exposure as an effective new axis for reasoning self-distillation.

2606.06735 2026-06-10 cs.AI 版本更新

A Geometric Account of Activation Steering through Angle-Norm Decomposition

通过角度-范数分解的激活引导的几何解释

Georgii Aparin, Tatiana Gaintseva

发表机构 * Huawei Noah’s Ark Lab(华为诺亚实验室) Queen Mary University of London(女王玛丽大学)

AI总结 本文通过控制实验分离角度和径向分量,发现概念主要编码在角度结构中,但范数对引导的稳定性和下游效应至关重要,建议将激活引导参数化为可解释的角度和径向分量。

详情
AI中文摘要

线性激活引导作为一种简单且经验有效的控制语言模型行为的方法已受到广泛关注。最近,球形引导范式被提出来解决加性干预的局限性,其动机通常是假设隐藏状态范数不携带概念相关信息。在这项工作中,我们通过一项旨在分离角度和径向分量作用的受控实证研究重新审视了这一假设。我们表明,引导方法的主要区别在于它们如何耦合两种几何效应:改变令牌与概念方向的角度对齐以及改变其隐藏状态范数。在七个语言模型上,我们发现概念主要表示在角度结构中,这支持了球形方法的动机,但范数对于引导的稳定性和下游效应仍然重要。我们的结果解释了为什么具有相似概念级别效果的干预可能表现不同,并建议激活引导应由干预的可解释角度和径向分量参数化,而不是由纠缠这两种效应的单个加性系数参数化。

英文摘要

Linear activation steering has gained popularity as a simple and empirically effective way to control language model behavior. More recently, spherical steering paradigms have been proposed to address limitations of additive interventions, often motivated by the assumption that hidden-state norm does not carry concept-relevant information. In this work, we revisit this assumption through a controlled empirical study designed to disentangle the roles of angular and radial components. We show that steering methods differ mainly in how they couple two geometric effects: changing a token's angular alignment with a concept direction and changing its hidden-state norm. Across seven language models, we find that concepts are represented primarily in angular structure, supporting the motivation for spherical methods, but that norm remains important for the stability and downstream effects of steering. Our results explain why interventions with similar concept-level effects can behave differently, and suggest that activation steering should be parameterized by interpretable angular and radial components of the intervention, rather than by a single additive coefficient that entangles these two effects.

2404.09101 2026-06-10 cs.LG cs.AI cs.NA math.NA stat.ML 版本更新

Mixtures of Neural Operators Reduce Active Complexity in Operator Learning

神经算子混合体降低算子学习中的主动复杂度

Anastasis Kratsios, Takashi Furuya, Jose Antonio Lara Benitez, Matti Lassas, Maarten de Hoop

发表机构 * McMaster University and Vector Institute(麦斯特大学和向量研究所) Shimane University(岛根大学) Rice University(里士满大学) University of Helsinki(赫尔辛基大学)

AI总结 通过路由混合神经算子(MoNO)与固定单神经算子构造的比较,证明MoNO在主动专家规模上具有更优的深度、宽度和秩缩放,且对Lipschitz目标这些量以O(ε^{-1})为界。

详情
AI中文摘要

算子学习系统并非仅由总参数数量决定;对于一次查询,相关瓶颈可能是必须加载和评估的模型。我们通过路由混合神经算子(MoNO)与固定单神经算子构造之间的建设性比较,在紧致Sobolev子集上研究了经典神经算子的这一区别。该比较涉及相对于基线的专家主动复杂度,其中总存储大小和路由搜索分别考虑。MoNO将每个输入函数通过树路由到一个专家。我们的主要定理表明,在近似集上,每个具有有界输出Sobolev半径的标量一致连续非线性算子都存在一个MoNO近似,其主动专家具有比所分析的单神经算子构造更小的深度、宽度和秩缩放;对于Lipschitz目标,这些专家量以$\mathcal{O}(\varepsilon^{-1})$为界。该定理将局部化转化为主动专家大小、路由深度和专家数量的算子级核算。我们还证明了底层神经算子架构的定量通用近似定理,明确依赖于紧集直径和连续模。

英文摘要

Operator-learning systems are not governed solely by total parameter count; for one query, the relevant bottleneck can be the model that must be loaded and evaluated. We study this distinction for classical neural operators on compact Sobolev subsets through a constructive comparison between routed mixtures of neural operators (MoNOs) and a fixed single-neural-operator construction. The comparison concerns expert-active complexity relative to that baseline, with total stored size and routing search accounted separately. A MoNO routes each input function through a tree to one expert. Our main theorem shows that every scalar uniformly continuous nonlinear operator with bounded output Sobolev radius on the approximation set admits a MoNO approximation whose active expert has smaller depth, width, and rank scaling than the analyzed single-neural-operator construction; for Lipschitz targets these expert quantities are bounded by $\mathcal{O}(\varepsilon^{-1})$. The theorem turns localization into an operator-level accounting of active expert size, routing depth, and number of experts. We also prove a quantitative universal approximation theorem for the underlying neural-operator architecture, with explicit dependence on compact-set diameter and modulus of continuity.

2502.07027 2026-06-10 cs.LG cs.AI 版本更新

Representational Alignment with Chemical Induced Fit for Molecular Relational Learning

基于化学诱导契合的表征对齐用于分子关系学习

Peiliang Zhang, Jingling Yuan, Qing Xie, Yongjun Zhu, Chao Che, Lin Li

发表机构 * Wuhan University of Technology(武汉理工大学) Yonsei University(延世大学) Hubei Key Laboratory of Transportation Internet of Things(湖北省交通运输物联网重点实验室) Dalian University(大连大学)

AI总结 提出ReAlignFit方法,通过引入化学诱导契合的归纳偏置动态对齐子结构表征,并利用子图信息瓶颈优化高化学功能兼容性的子结构对,以提升分子关系学习在化学空间偏移数据上的稳定性。

Comments Accepted by SIGKDD2026 AI for Science Track

详情
AI中文摘要

分子关系学习(MRL)广泛应用于自然科学中,通过提取结构特征来预测分子对之间的关系。子结构对之间的表征相似性决定了分子结合位点的功能兼容性。然而,通过注意力机制对齐子结构表征缺乏化学知识的指导,导致模型在化学空间(例如,官能团、骨架)偏移数据上的性能不稳定。在理论论证的基础上,我们提出了基于化学诱导契合的表征对齐(ReAlignFit)来增强MRL的稳定性。ReAlignFit通过引入基于化学诱导契合的归纳偏置,动态对齐MRL中的子结构表征。在诱导过程中,我们设计了基于子结构边重构的偏差校正函数,通过模拟化学构象变化(子结构的动态组合)来对齐子结构对之间的表征。ReAlignFit进一步在契合过程中集成子图信息瓶颈,以精炼和优化具有高化学功能兼容性的子结构对,并利用它们生成分子嵌入。在九个数据集上的实验结果表明,ReAlignFit在两个任务中均优于最先进的模型,并显著增强了模型在规则偏移和骨架偏移数据分布上的稳定性。

英文摘要

Molecular Relational Learning (MRL) is widely applied in natural sciences to predict relationships between molecular pairs by extracting structural features. The representational similarity between substructure pairs determines the functional compatibility of molecular binding sites. Nevertheless, aligning substructure representations by attention mechanisms lacks guidance from chemical knowledge, resulting in unstable model performance in chemical space (\textit{e.g.}, functional group, scaffold) shifted data. With theoretical justification, we propose the \textbf{Re}presentational \textbf{Align}ment with Chemical Induced \textbf{Fit} (ReAlignFit) to enhance the stability of MRL. ReAlignFit dynamically aligns substructure representation in MRL by introducing chemical Induced Fit-based inductive bias. In the induction process, we design the Bias Correction Function based on substructure edge reconstruction to align representations between substructure pairs by simulating chemical conformational changes (dynamic combination of substructures). ReAlignFit further integrates the Subgraph Information Bottleneck during fit process to refine and optimize substructure pairs exhibiting high chemical functional compatibility, leveraging them to generate molecular embeddings. Experimental results on nine datasets demonstrate that ReAlignFit outperforms state-of-the-art models in two tasks and significantly enhances model's stability in both rule-shifted and scaffold-shifted data distributions.

2502.09928 2026-06-10 cs.CV cs.AI 版本更新

Deep Tree Tensor Networks

深度树张量网络

Chang Nie

发表机构 * Nanjing University of Science and Technology(南京理工大学)

AI总结 提出深度树张量网络(DTTN),通过多线性运算捕获指数阶特征交互,在多个基准上超越现有方法。

详情
AI中文摘要

源自量子物理的张量网络(TNs)已被广泛用作指数机器和参数分解器用于识别任务。典型的TN模型,如矩阵乘积态(MPS),在自然图像识别中尚未取得成功应用。当它们被使用时,主要是在现有网络中压缩参数,从而失去了捕获指数阶特征交互的独特能力。本文提出了一种名为\textit{\textbf{深度树张量网络}}(DTTN)的新架构,它通过多线性运算捕获跨特征的$2^L$阶乘法交互,同时本质上展开为具有参数共享属性的\textit{树}状TN拓扑。DTTN由多个反对称交互模块(AIMs)堆叠而成,这种设计便于高效实现。此外,我们的理论分析证明了量子启发的TN模型与多项式/多线性网络在特定条件下的等价性。我们认为DTTN可以促进该领域内更具可解释性的研究。所提出的模型在多个基准和领域上进行了评估,显示出优于同行方法和最先进架构的性能。我们的代码在此https URL公开提供。

英文摘要

Originating in quantum physics, tensor networks (TNs) have been widely adopted as exponential machines and parametric decomposers for recognition tasks. Typical TN models, such as Matrix Product States (MPS), have not yet achieved successful application in natural image recognition. When employed, they primarily serve to compress parameters within pre-existing networks, thereby losing their distinctive capability to capture exponential-order feature interactions. This paper introduces a novel architecture named \textit{\textbf{D}eep \textbf{T}ree \textbf{T}ensor \textbf{N}etwork} (DTTN), which captures $2^L$-order multiplicative interactions across features through multilinear operations, while essentially unfolding into a \emph{tree}-like TN topology with the parameter-sharing property. DTTN is stacked with multiple antisymmetric interaction modules (AIMs), and this design facilitates efficient implementation. Furthermore, our theoretical analysis demonstrates the equivalence between quantum-inspired TN models and polynomial/multilinear networks under specific conditions. We posit that the DTTN could catalyze more interpretable research within this field. The proposed model is evaluated across multiple benchmarks and domains, demonstrating superior performance compared to both peer methods and state-of-the-art architectures. Our code is publicly available at https://github.com/NieCha/deep_tree_tensor_network.

2503.24007 2026-06-10 cs.LG cs.AI 版本更新

CITRAS: Covariate-Informed Transformer for Time Series Forecasting

CITRAS: 协变量感知的Transformer时间序列预测

Yosuke Yamaguchi, Issei Suemitsu, Wenpeng Wei

发表机构 * Research & Development Group, Hitachi, Ltd.(日立有限公司研发部)

AI总结 提出CITRAS,一种仅解码器Transformer,通过KV移位和注意力分数平滑机制灵活整合已知协变量的未来部分,并捕获局部和全局跨变量依赖,提升预测精度。

详情
Journal ref
IEEE Access, vol. 14, pp. 77983-77998, 2026
AI中文摘要

在时间序列预测中,协变量代表影响目标变量的外部因素。一些协变量仅在过去可观测(观测协变量,如记录的天气数据),而另一些则预先已知(已知协变量,如日历事件或折扣计划)。尽管协变量有潜力提升预测性能,但大多数基于深度学习的预测模型难以处理由已知协变量的未来部分引起的变量长度差异,且无法灵活利用它们。此外,捕获目标变量与协变量之间的依赖关系并非易事,因为模型必须准确反映协变量的局部影响,同时建模全局跨变量依赖。为应对这些挑战,我们提出CITRAS,一种仅解码器Transformer,灵活整合多个目标变量、观测协变量和已知协变量。在保持强大自回归建模能力的同时,CITRAS在分块跨变量注意力中引入两种新机制:键值移位和注意力分数平滑。键值移位通过基于并发依赖将已知协变量的未来部分与目标变量对齐,无缝融入预测过程。注意力分数平滑通过平滑历史注意力分数,将局部精确的分块跨变量依赖细化为全局变量级依赖。实验上,CITRAS在协变量感知和多变量设置下的广泛真实世界数据集上展现出强大性能,展示了其利用跨变量和跨时间依赖提升预测准确性的通用能力。

英文摘要

In time series forecasting, covariates represent external factors that influence target variables. Some covariates are observable only in the past (observed covariates, such as recorded weather data), while others are known in advance (known covariates, such as calendar events or discount schedules). Although covariates have the potential to enhance forecasting performance, most deep learning-based forecasting models struggle to address the length discrepancy between variables caused by the future portion of known covariates and fail to leverage them flexibly. Moreover, capturing dependencies between target variables and covariates is non-trivial, as models must accurately reflect the local impact of covariates while simultaneously modeling global cross-variate dependencies. To address these challenges, we propose CITRAS, a decoder-only Transformer that flexibly integrates multiple target variables, observed covariates, and known covariates. While preserving strong autoregressive modeling capabilities, CITRAS introduces two novel mechanisms within patch-wise cross-variate attention: Key-Value (KV) Shift and Attention Score Smoothing. KV Shift seamlessly incorporates the future portion of known covariates into the forecasting process by aligning them with target variables based on their concurrent dependencies. Attention Score Smoothing refines locally accurate patch-wise cross-variate dependencies into global variate-level dependencies by smoothing the historical attention scores. Experimentally, CITRAS demonstrates strong performance across a wide range of real-world datasets in both covariate-informed and multivariate settings, showcasing its versatile ability to leverage cross-variate and cross-time dependencies for improved forecasting accuracy.

2507.14725 2026-06-10 cs.LG cs.AI 版本更新

GRID: Scaling Task-Agnostic Inference in Continual Prompt Tuning

GRID:持续提示调优中任务无关推理的规模化

Anushka Tiwari, Sayantan Pal, Rohini K. Srihari, Kaiyi Ji

发表机构 * State University of New York at Buffalo(纽约州立大学布法罗分校) Department of Computer Science and Engineering(计算机科学与工程系) Institute for Artificial Intelligence and Data Science(人工智能与数据科学研究院)

AI总结 提出GRID框架,通过输出空间感知解码和梯度引导提示选择,解决持续学习中任务无关推理的性能退化与可扩展性问题,在长序列和负迁移基准上提升后向迁移并减少提示内存。

详情
AI中文摘要

基于提示的持续学习提供了一种参数高效的方式,使大型语言模型能够适应任务序列。然而,现有方法通常依赖任务感知推理,并维护不断扩展的任务特定提示集,导致(1)当推理时任务标识符不可用于提示选择时,早期任务性能严重下降;(2)随着任务序列增长,可扩展性受限。我们提出GRID,一个统一的框架来解决这些挑战。GRID包含一个输出空间感知解码机制,通过利用代表性输入和自动标签语义归一化来增强后向迁移,以及一个梯度引导的提示选择策略,将信息量较少的提示压缩为单个聚合表示,以实现可扩展、内存高效的持续学习。在长序列和负迁移基准上的大量实验表明,GRID改善了后向迁移,实现了有竞争力的前向迁移,并显著减少了编码器-解码器和仅解码器架构(包括T5、Qwen和LLaMA)中的提示内存。源代码可从此https URL获取。

英文摘要

Prompt-based continual learning (CL) offers a parameter-efficient way to adapt large language models (LLMs) across task sequences. However, existing methods often rely on task-aware inference and maintain an expanding set of task-specific prompts, leading to (1) severe performance degradation on earlier tasks when task identifiers are unavailable for prompt selection at inference time, and (2) limited scalability as task sequence grows. We propose GRID, a unified framework designed to address these challenges. GRID incorporates an output-space-aware decoding mechanism that enhances backward transfer by leveraging representative inputs and automatic label semantic normalization, alongside a gradient-guided prompt selection strategy that compresses less informative prompts into a single aggregated representation for scalable, memory-efficient continual learning. Extensive experiments on long-sequence and negative-transfer benchmarks show that GRID improves backward transfer, achieves competitive forward transfer, and substantially reduces prompt memory across encoder-decoder and decoder-only architectures, including T5, Qwen, and LLaMA. Source code is available at https://github.com/AnushkaTi/GRID.

2509.21925 2026-06-10 cs.LG cs.AI 版本更新

Generation Properties of Stochastic Interpolation under Finite Training Set

有限训练集下随机插值的生成性质

Yunchen Li, Shaohui Lin, Zhou Yu

AI总结 研究有限训练集下随机插值生成模型的理论性质,推导最优速度场和得分函数的闭式解,揭示确定性和随机生成过程的行为,并定义欠拟合与过拟合。

Comments We found proof errors affecting key theorems and wish to avoid misleading readers. We have submitted a substantially revised new paper, arXiv:2606.08554, retaining only two old theorems and adding five new ones

详情
AI中文摘要

本文研究了有限训练总体下生成模型的理论行为。在随机插值生成框架内,我们推导了当仅有有限数量的训练样本可用时最优速度场和得分函数的闭式表达式。我们证明,在某些正则性条件下,确定性生成过程精确恢复训练样本,而随机生成过程表现为带有加性高斯噪声的训练样本。在理想化设置之外,我们考虑模型估计误差,并引入生成模型特有的欠拟合和过拟合的正式定义。我们的理论分析揭示,在存在估计误差的情况下,随机生成过程有效地产生训练样本的凸组合,这些组合被均匀噪声和高斯噪声的混合所破坏。在生成任务和分类等下游任务上的实验支持了我们的理论。

英文摘要

This paper investigates the theoretical behavior of generative models under finite training populations. Within the stochastic interpolation generative framework, we derive closed-form expressions for the optimal velocity field and score function when only a finite number of training samples are available. We demonstrate that, under some regularity conditions, the deterministic generative process exactly recovers the training samples, while the stochastic generative process manifests as training samples with added Gaussian noise. Beyond the idealized setting, we consider model estimation errors and introduce formal definitions of underfitting and overfitting specific to generative models. Our theoretical analysis reveals that, in the presence of estimation errors, the stochastic generation process effectively produces convex combinations of training samples corrupted by a mixture of uniform and Gaussian noise. Experiments on generation tasks and downstream tasks such as classification support our theory.

2510.06473 2026-06-10 physics.soc-ph cs.AI cs.SI 版本更新

Deep Generative Model for Human Mobility Behavior

人类移动行为的深度生成模型

Ye Hong, Yatao Zhang, Konrad Schindler, Martin Raubal

发表机构 * Institute of Cartography and Geoinformation, ETH Zurich(测绘与地理信息研究所,苏黎世联邦理工学院) Department of Human Geography, Lund University(人类地理学系,吕勒奥大学) Future Resilient Systems, Singapore-ETH Centre, ETH Zurich(未来韧性系统,新加坡-苏黎世联邦理工学院,苏黎世联邦理工学院) Photogrammetry and Remote Sensing, ETH Zurich(摄影测量与遥感,苏黎世联邦理工学院) Department of Geography, University College London(地理系,伦敦大学学院)

AI总结 提出基于扩散的生成框架MobilityGen,模拟多属性活动-出行序列,复现标度律、时间分配等关键模式,支持城市空间可达性和社会暴露分析。

详情
AI中文摘要

理解和建模人类移动性对于交通规划、可持续城市设计和公共卫生等挑战至关重要。尽管经过数十年的努力,由于移动行为的复杂性、情境依赖性和探索性,模拟个体移动仍然具有挑战性。在此,基于日常移动的活动视角,我们提出了MobilityGen,一个基于扩散的生成框架,用于在大的空间尺度上模拟数天至数周的多属性活动-出行序列。通过将行为属性与环境背景联系起来,MobilityGen再现了关键模式,如地点访问的标度律、活动时间分配以及出行方式和目的地选择的耦合演化。它反映了时空变异性,并生成与建成环境一致的多样的、合理的移动模式。除了标准验证外,MobilityGen还使得先前模型难以实现的分析成为可能,包括不同出行方式下城市空间可达性的差异,以及共现动态如何塑造社会暴露和隔离。总之,这些结果为人类移动行为及其社会影响的精细研究提供了一个集成的、数据驱动的基础。

英文摘要

Understanding and modeling human mobility is central to challenges in transport planning, sustainable urban design, and public health. Despite decades of effort, simulating individual mobility remains challenging because of its complex, context-dependent, and exploratory nature. Here, building on the activity-based view of daily mobility, we propose MobilityGen, a diffusion-based generative framework for simulating multi-attribute activity-travel sequences over days to weeks at large spatial scales. By linking behavioral attributes with environmental context, MobilityGen reproduces key patterns such as scaling laws for location visits, activity time allocation, and the coupled evolution of travel mode and destination choices. It reflects spatio-temporal variability and generates diverse and plausible mobility patterns consistent with the built environment. Beyond standard validation, MobilityGen enables analyses that have been difficult with earlier models, including how access to urban space varies across travel modes and how co-presence dynamics shape social exposure and segregation. Together, these results support an integrated, data-driven basis for fine-grained studies of human mobility behavior and its societal implications.

2511.01927 2026-06-10 cs.LG cs.AI cs.NA math.NA 版本更新

Learning-Guided Integration Contours Construction for Fast Large-Scale Generalized Eigensolvers

学习引导的积分轮廓构建用于快速大规模广义特征值求解器

Yeqiu Chen, Ziyan Liu, Hong Wang, Lei Liu

发表机构 * University of Science(科学技术大学)

AI总结 提出Deepcontour混合框架,结合深度学习谱预测器与核密度估计自动构建优化积分轮廓,加速大规模广义特征值求解,实现最高5.63倍加速并保持数值精度。

详情
AI中文摘要

解决大规模广义特征值问题(GEPs)是科学与工程中一项基本但计算上极为困难的任务。作为一种有前景的方向,轮廓积分(CI)方法提供了高效且可并行化的框架。然而,其性能关键依赖于积分轮廓的选择——在没有可靠先验知识的情况下,不当选择可能导致显著的计算开销并损害数值精度。为应对这一挑战,我们提出Deepcontour,一种新颖的混合框架,它将基于深度学习的谱预测器与核密度估计(KDE)相结合,用于原则性的轮廓设计。具体而言,Deepcontour利用其专用的特征神经算子(ENO)提供快速的谱分布先验,驱动KDE模块自动构建优化的积分轮廓,从而引导CI求解器高效地找到所需特征值。Deepcontour在多种科学数据集上实现了高达5.63倍的加速,同时保持严格的数值精度。通过融合深度学习的预测能力与经典求解器的数值严谨性,这项工作为解决大规模GEPs建立了一种高效且稳健的范式。

英文摘要

Solving large-scale Generalized Eigenvalue Problems (GEPs) is a fundamental yet computationally prohibitive task in science and engineering. As a promising direction, contour integral (CI) methods offer an efficient and parallelizable framework. However, their performance is critically dependent on the selection of integration contours -- improper selection without reliable prior knowledge of eigenvalue distribution can incur significant computational overhead and compromise numerical accuracy. To address this challenge, we propose Deepcontour, a novel hybrid framework that integrates a deep learning-based spectral predictor with Kernel Density Estimation (KDE) for principled contour design. Specifically, Deepcontour utilizes its specialized Eigen-Neural-Operator (ENO) to provide rapid spectral distribution priors, driving a KDE module to automatically construct the optimized integration contours, which guide the CI solver to efficiently find the desired eigenvalues. Deepcontour achieves up to a 5.63x speedup across diverse scientific datasets while maintaining strict numerical rigor. By merging the predictive power of deep learning with the numerical rigor of classical solvers, this work establishes an efficient and robust paradigm for solving large-scale GEPs.

2511.22331 2026-06-10 math.OC cs.AI cs.LG 版本更新

On the Condition Number Dependency in Bilevel Optimization

关于双层优化中条件数依赖性的研究

Lesi Chen, Jingzhao Zhang

发表机构 * IIIS, Tsinghua University(清华大学信息学院)

AI总结 本文针对非凸上层、强凸下层的双层优化问题,建立了条件数依赖性的下界,揭示了双层与极小极大优化在条件数依赖上的首次可证明差距。

Comments This new version improves deterministic lower bounds in v1

详情
AI中文摘要

双层优化最小化一个由上层问题定义的目标函数,其可行域是下层问题的解集。我们研究当上层问题非凸、下层问题强凸时,使用一阶方法寻找 $\epsilon$-稳定点的 oracle 复杂度。近期工作 (Ji et al., ICML 2021; Arbel and Mairal, ICLR 2022; Chen et al., JMLR 2025) 达到了 $\tilde{\mathcal{O}}(\bar \kappa_y^4 \epsilon^{-2})$ 的上界,在 $\epsilon$ 上接近最优,通过在内循环中朴素应用 Nesterov 加速可降至 $\tilde{\mathcal{O}}(\bar \kappa_y^{7/2} \epsilon^{-2})$,其中 $\bar \kappa_y$ 是全局条件数。然而,条件数的最优依赖性未知。本文建立了新的 $\Omega(\kappa_y^{5/2} \epsilon^{-2})$ 下界,其中 $\kappa_y < \bar \kappa_y$ 是下层条件数,当光滑常数为 $\mathcal{O}(1)$ 时与 $\bar \kappa_y$ 同阶。我们的下界首次证明了在此设定下双层问题与极小极大优化在条件数依赖性上的可证明差距。下界可推广到多种设置,包括高阶光滑函数、随机 oracle 和凸超目标:(1) 对于二阶和任意光滑问题,我们分别给出 $\Omega({\kappa_y^{31/14}} \epsilon^{-12/7})$ 和 $\Omega(\kappa_y^{21/10} \epsilon^{-8/5})$ 的下界。(2) 对于凸-强凸问题,我们将先前最佳下界 (Ji and Liang, JMLR 2022) 从 $\Omega(\kappa_y /\sqrt{\epsilon})$ 改进为 $\Omega(\kappa_y^{3/2} / \sqrt{\epsilon})$。(3) 对于光滑随机问题,我们也给出 $\Omega(\kappa_y^4 \epsilon^{-4})$ 的下界。

英文摘要

Bilevel optimization minimizes an objective function, defined by an upper-level problem whose feasible region is the solution of a lower-level problem. We study the oracle complexity of finding an $ε$-stationary point with first-order methods when the upper-level problem is nonconvex, and the lower-level problem is strongly convex. Recent works (Ji et al., ICML 2021; Arbel and Mairal, ICLR 2022; Chen et al., JMLR 2025) achieve a $\tilde{\mathcal{O}}(\bar κ_y^4 ε^{-2})$ upper bound that is near-optimal in $ε$, which can be reduced to $\tilde{\mathcal{O}}(\bar κ_y^{7/2} ε^{-2})$ by a naive application of Nesterov acceleration in the inner loop, where $\bar κ_y$ is the global condition number. However, the optimal dependency on the condition number is unknown. In this work, we establish a new $Ω(κ_y^{5/2} ε^{-2})$ lower bound, where $κ_y < \bar κ_y$ is the lower-level condition number that is of the same order as $\bar κ_y$ when the smoothness constants are $\mathcal{O}(1)$. Our lower bound establishes the first provable gap in terms of condition number dependency between bilevel problems and minimax problems in this setup. Our lower bounds can be extended to various settings, including high-order smooth functions, stochastic oracles, and convex hyper-objectives: (1) For second-order and arbitrarily smooth problems, we show lower bounds of $Ω({κ_y^{31/14}} ε^{-12/7})$ and $Ω(κ_y^{21/10} ε^{-8/5})$, respectively. (2) For convex-strongly-convex problems, we improve the previously best lower bound (Ji and Liang, JMLR 2022) from $Ω(κ_y /\sqrtε)$ to $Ω(κ_y^{3/2} / \sqrtε)$. (3) For smooth stochastic problems, we also show a lower bound of $Ω(κ_y^4 ε^{-4})$.

2512.06343 2026-06-10 cs.LG cs.AI cs.CL 版本更新

When Distance Distracts: Representation Distance Bias in BT-Loss for Reward Models

当距离干扰:BT损失中表示距离偏差对奖励模型的影响

Tong Xie, Andrew Bai, Yuanhao Ban, Yunqi Hong, Haoyu Li, Cho-Jui Hsieh

发表机构 * University of California, Berkeley(加州大学伯克利分校)

AI总结 分析BT损失中表示距离导致的梯度偏差,提出NormBT自适应归一化方案,提升奖励模型在细粒度区分上的性能。

Comments ICML 2026

详情
AI中文摘要

奖励模型是RLHF框架中大型语言模型对齐的核心。奖励建模中使用的标准目标是Bradley-Terry(BT)损失,它从由选择和拒绝响应组成的成对数据中学习。在这项工作中,我们分析了BT损失的每个样本梯度,并展示了由于表示距离而产生的虚假学习信号。特别是,BT梯度范数由两个不同的组成部分缩放:(1)预测误差,反映选择和拒绝响应之间预测奖励的差异,以及关键地,(2)在最后一层输出空间中测量的对之间的表示距离。虽然第一项捕获了预期的训练信号,但第二项会显著影响更新幅度并导致学习错位。具体来说,表示距离小的对即使排名错误也经常收到微弱的更新,而距离大的对则收到不成比例的大更新。这导致来自大距离对的梯度掩盖了来自小距离对的梯度,而细粒度区分在小距离对中尤为重要。为了克服这一限制,我们提出了NormBT,一种自适应成对归一化方案,重新缩放更新以平衡表示驱动效应,并将学习信号聚焦于预测误差。NormBT是对BT损失的轻量级、即插即用修改,开销可忽略。在各种LLM骨干和数据集上,NormBT一致地提高了奖励模型性能,在RewardBench的推理类别上取得了超过5%的显著提升,该类别包含大量细粒度对。

英文摘要

Reward models are central to Large Language Model (LLM) alignment within the framework of RLHF. The standard objective used in reward modeling is the Bradley-Terry (BT) loss, which learns from pairwise data consisting of chosen and rejected responses. In this work, we analyze the per-sample gradient of BT-loss and show spurious learning signals due to representation distance. In particular, BT gradient norm scales with two distinct components: (1) prediction error, reflected by the difference in predicted rewards between chosen and rejected responses, and critically, (2) representation distance between the pair measured in the output space of the final layer. While the first term captures the intended training signal, the second term can significantly impact the update magnitude and misalign learning. Specifically, pairs with small representation distance often receive vanishingly weak updates, even when misranked, while pairs with large distance receive disproportionately strong updates. This leads to gradients from large-distance pairs to overshadow those from small-distance pairs, where fine-grained distinctions are especially important. To overcome this limitation, we propose NormBT, an adaptive pair-wise normalization scheme that rescales updates to balance representation-driven effects and focuses learning signals on prediction error. NormBT is a lightweight, drop-in modification to BT loss with negligible overhead. Across various LLM backbones and datasets, NormBT improves reward model performance consistently, with notable gains of over 5% on the Reasoning category of RewardBench, which contains numerous fine-grained pairs.

2512.14617 2026-06-10 cs.LG cs.AI 版本更新

Model-Based Reinforcement Learning in Discrete-Action Non-Markovian Reward Decision Processes

离散动作非马尔可夫奖励决策过程中基于模型的强化学习

Alessandro Trapasso, Luca Iocchi, Fabio Patrizi

发表机构 * Fondazione Bruno Kessler(布雷诺·科塞拉基金会) Sapienza University of Rome(罗马萨皮恩扎大学)

AI总结 提出QR-MAX算法,通过奖励机分解马尔可夫转移学习与非马尔可夫奖励处理,首次在离散NMRDP中获得PAC收敛到ε-最优策略的多项式样本复杂度,并扩展至连续状态空间。

Comments Accepted at IJCAI-ECAI 2026. 19 pages, 32 figures, includes appendix

详情
AI中文摘要

许多实际决策问题涉及的任务成功取决于整个系统历史,而非达到具有期望属性的状态。马尔可夫强化学习(RL)方法不适用于此类任务,而基于非马尔可夫奖励决策过程(NMRDP)的RL使智能体能够处理时间依赖任务。长期以来,这种方法缺乏关于(近)最优性和样本效率的形式保证。我们通过QR-MAX解决了这两个问题,这是一种新颖的基于模型的算法,用于离散NMRDP,通过奖励机将马尔可夫转移学习与非马尔可夫奖励处理分解。据我们所知,这是第一个利用这种分解获得PAC收敛到ε-最优策略且具有多项式样本复杂度的离散动作NMRDP的基于模型的RL算法。然后,我们将QR-MAX扩展到连续状态空间,提出Bucket-QR-MAX,一种基于SimHash的离散化器,它保留了相同的分解结构,无需手动网格划分或函数逼近即可实现快速稳定的学习。我们在复杂度递增的环境中将我们的方法与现代最先进的基于模型的RL方法进行了实验比较,显示出样本效率的显著提高和寻找最优策略的鲁棒性增强。

英文摘要

Many practical decision-making problems involve tasks whose success depends on the entire system history, rather than on achieving a state with desired properties. Markovian Reinforcement Learning (RL) approaches are not suitable for such tasks, while RL with non-Markovian reward decision processes (NMRDPs) enables agents to tackle temporal-dependency tasks. This approach has long been known to lack formal guarantees on both (near-)optimality and sample efficiency. We contribute to solving both issues with QR-MAX, a novel model-based algorithm for discrete NMRDPs that factorizes Markovian transition learning from non-Markovian reward handling via reward machines. To the best of our knowledge, this is the first model-based RL algorithm for discrete-action NMRDPs that exploits this factorization to obtain PAC convergence to $\varepsilon$-optimal policies with polynomial sample complexity. We then extend QR-MAX to continuous state spaces with Bucket-QR-MAX, a SimHash-based discretiser that preserves the same factorized structure and achieves fast and stable learning without manual gridding or function approximation. We experimentally compare our method with modern state-of-the-art model-based RL approaches on environments of increasing complexity, showing a significant improvement in sample efficiency and increased robustness in finding optimal policies.

2512.17629 2026-06-10 cs.LG cs.AI 版本更新

SCOPE: Sequential Causal Optimization of Process Interventions

SCOPE: 过程干预的顺序因果优化

Jakob De Moor, Hans Weytjens, Johannes De Smedt, Jochen De Weerdt

发表机构 * Research Centre for Information Systems Engineering (LIRIS), KU Leuven, Leuven, Belgium(信息系统工程研究中心(LIRIS),鲁汶大学,比利时列文) School of Computation, Information and Technology, Technical University of Munich (TUM), Munich, Germany(计算、信息与技术学院,慕尼黑技术大学(TUM),德国慕尼黑)

AI总结 提出SCOPE方法,通过反向归纳和因果学习直接利用观测数据,优化业务流程中顺序干预的KPI,优于现有方法。

详情
AI中文摘要

规范性过程监控(PresPM)在运行业务流程期间推荐干预措施以优化关键绩效指标(KPI)。在现实环境中,干预很少是孤立的:组织需要对齐干预序列以共同引导案例的结果。现有的PresPM方法仅部分解决了这一挑战。许多方法专注于单个干预决策,而其他方法将多个干预视为独立,忽略了它们随时间如何相互作用。确实处理这些依赖关系的方法依赖于模拟或数据增强来近似过程以训练强化学习(RL)代理,这可能会造成现实差距并引入偏差。我们提出了SCOPE(过程干预的顺序因果优化),一种学习对齐的顺序干预推荐的PresPM方法。SCOPE采用反向归纳来估计每个候选干预动作的效果,将其影响从最终决策点传播回第一个决策点。通过利用因果学习器,我们的方法可以直接使用观测数据,不同于需要构建过程近似用于RL的方法。在现有合成数据集和新的半合成数据集上的实验表明,SCOPE在优化KPI方面始终优于最先进的PresPM技术。基于真实事件日志的新型半合成设置作为可重复使用的基准,用于未来关于顺序PresPM的工作。

英文摘要

Prescriptive Process Monitoring (PresPM) recommends interventions during running business processes to optimize key performance indicators (KPIs). In realistic settings, interventions are rarely isolated: organizations need to align sequences of interventions to jointly steer the outcome of a case. Existing PresPM approaches only partially address this challenge. Many focus on a single intervention decision, while others treat multiple interventions independently, ignoring how they interact over time. Methods that do address these dependencies depend either on simulation or data augmentation to approximate the process to train a Reinforcement Learning (RL) agent, which may create a reality gap and introduce bias. We introduce SCOPE (Sequential Causal Optimization of Process Interventions), a PresPM approach that learns aligned sequential intervention recommendations. SCOPE employs backward induction to estimate the effect of each candidate intervention action, propagating its impact from the final decision point back to the first. By leveraging causal learners, our method can utilize observational data directly, unlike methods that require constructing process approximations for RL. Experiments on both an existing synthetic dataset and a new semi-synthetic dataset show that SCOPE consistently outperforms state-of-the-art PresPM techniques in optimizing the KPI. The novel semi-synthetic setup, based on a real-life event log, is provided as a reusable benchmark for future work on sequential PresPM.

2601.08379 2026-06-10 cs.LG cs.AI cs.CV 版本更新

MMD Guidance: Training-Free Distribution Adaptation for Diffusion Models via Maximum Mean Discrepancy Guidance

MMD Guidance: 基于最大均值差异引导的无训练分布适应扩散模型

Matina Mahdizadeh Sani, Nima Jamali, Mohammad Jalali, Farzan Farnia

发表机构 * University of California, Berkeley(加州大学伯克利分校)

AI总结 提出MMD Guidance,一种无训练方法,通过最大均值差异梯度引导扩散模型采样,实现与参考数据分布对齐,无需重新训练。

详情
AI中文摘要

预训练扩散模型已成为无条件及条件样本生成的有力先验,但其输出常偏离用户特定目标数据的特征。这种不匹配在领域适应任务中尤为突出,此时仅有少量参考样本可用且重新训练扩散模型不可行。现有推理时引导方法可调整采样轨迹,但通常优化替代目标(如分类器似然)而非直接对齐目标分布。我们提出MMD Guidance,一种无训练机制,通过生成样本与参考数据集之间的最大均值差异(MMD)梯度增强反向扩散过程。MMD能从有限数据中提供可靠分布估计,实践中方差低,且可高效微分,特别适合引导任务。我们的框架通过乘积核自然扩展到条件生成模型中的提示感知适应。此外,由于引导在潜在扩散模型(LDM)的潜在空间中进行,因此可高效计算。在合成及真实世界基准上的实验表明,MMD Guidance能在保持样本保真度的同时实现分布对齐。项目代码见该网址。

英文摘要

Pre-trained diffusion models have emerged as powerful generative priors for both unconditional and conditional sample generation, yet their outputs often deviate from the characteristics of user-specific target data. Such mismatches are especially problematic in domain adaptation tasks, where only a few reference examples are available and retraining the diffusion model is infeasible. Existing inference-time guidance methods can adjust sampling trajectories, but they typically optimize surrogate objectives such as classifier likelihoods rather than directly aligning with the target distribution. We propose \emph{MMD Guidance}, a training-free mechanism that augments the reverse diffusion process with gradients of the \textit{Maximum Mean Discrepancy (MMD)} between generated samples and a reference dataset. MMD provides reliable distributional estimates from limited data, exhibits low variance in practice, and is efficiently differentiable, which makes it particularly well-suited for the guidance task. Our framework naturally extends to prompt-aware adaptation in conditional generation models via product kernels. Also, it can be applied with computational efficiency in latent diffusion models (LDMs), since guidance is applied in the latent space of the LDM. Experiments on synthetic and real-world benchmarks demonstrate that MMD Guidance can achieve distributional alignment while preserving sample fidelity. The project code is available at github.com/matinamehdizadeh/MMD-Guidance.

2601.13994 2026-06-10 cs.DC cs.AI 版本更新

torch-sla: Differentiable Sparse Linear Algebra with Adjoint Solvers and Sparse Tensor Parallelism for PyTorch

torch-sla: 可微稀疏线性代数,带有伴随求解器和稀疏张量并行性,用于 PyTorch

Mingyuan Chi, Shizheng Wen

发表机构 * PyTorch

AI总结 提出 torch-sla 库,通过统一 autograd 接口和 O(1) 图伴随微分,实现可微稀疏线性代数求解器,支持多后端和分布式多 GPU 执行。

详情
AI中文摘要

可微稀疏线性代数是科学机器学习的基础,但 PyTorch 缺乏统一的库:此 http URL 仅提供低级内核和不可微、仅 CPU 的 spsolve,而此 http URL 仅支持密集。我们提出 torch-sla,一个填补这一空白的开源库。它暴露了一个单一的 autograd 感知 API,用于直接、迭代、非线性和特征值求解器,跨越五个可互换的后端——CPU 上的 SciPy 和 Eigen,GPU 上的 cuDSS、CuPy 和 PyTorch 原生迭代求解器——并根据设备和问题大小自动调度。该库进一步支持在共享或不同稀疏模式上的批量求解,以及通过域分解和光环交换的分布式多 GPU 执行。这些能力通过 O(1) 图伴随微分框架和 autograd 兼容的分布式光环交换层实现可扩展性。该库可在 https://github.com/your-repo/torch-sla 获取。

英文摘要

Differentiable sparse linear algebra is foundational for scientific machine learning, yet PyTorch lacks a unified library for it: torch.sparse provides only low-level kernels and a non-differentiable, CPU-only spsolve, and torch.linalg is dense-only. We present torch-sla, an open-source library that fills this gap. It exposes a single autograd-aware API for direct, iterative, nonlinear, and eigenvalue solvers across five interchangeable backends -- SciPy and Eigen on CPU, cuDSS, CuPy, and a PyTorch-native iterative solver on GPU -- with automatic dispatch by device and problem size. The library further supports batched solves over shared or distinct sparsity patterns and distributed multi-GPU execution via domain decomposition with halo exchange. These capabilities are made scalable by an O(1)-graph adjoint differentiation framework and an autograd-compatible distributed halo-exchange layer. The library is available at https://www.torchsla.com/.

2602.02788 2026-06-10 cs.LG cs.AI physics.comp-ph 版本更新

Structure-Preserving Learning Improves Geometry Generalization in Neural PDEs

保结构学习提升神经PDE在几何泛化中的表现

Benjamin D. Shaffer, Shawn Koohy, Brooks Kinch, M. Ani Hsieh, Nathaniel Trask

发表机构 * University of California, Berkeley(加州大学伯克利分校)

AI总结 提出General-Geometry Neural Whitney Forms (Geo-NeW)方法,通过联合学习微分算子和兼容的降阶有限元空间,结合有限元外微积分精确保持物理守恒律,显著提升对未见几何域的泛化能力。

详情
AI中文摘要

我们旨在为科学和工程开发物理基础模型,这些模型能为偏微分方程(PDE)提供实时解,并在适应未见几何时保持结构和精度。为此,我们引入了通用几何神经Whitney形式(Geo-NeW):一种数据驱动的有限元方法。我们联合学习一个微分算子和定义在底层几何上的兼容降阶有限元空间。求解所得模型以生成预测,同时通过有限元外微积分精确保持物理守恒律。几何通过基于Transformer的编码和作为学习到的有限元空间的基础,以离散化网格的形式进入模型。这明确地将底层几何和施加的边界条件与解联系起来,为学习神经PDE提供了强大的归纳偏置,我们证明这改善了对未见域的泛化。我们提供了一种本构模型的新参数化,确保解的存在性和唯一性。我们的方法在几个稳态PDE基准测试中展示了最先进的性能,并在分布外几何上比传统基线有显著改进。

英文摘要

We aim to develop physics foundation models for science and engineering that provide real-time solutions to Partial Differential Equations (PDEs) which preserve structure and accuracy under adaptation to unseen geometries. To this end, we introduce General-Geometry Neural Whitney Forms (Geo-NeW): a data-driven finite element method. We jointly learn a differential operator and compatible reduced finite element spaces defined on the underlying geometry. The resulting model is solved to generate predictions, while exactly preserving physical conservation laws through Finite Element Exterior Calculus. Geometry enters the model as a discretized mesh both through a transformer-based encoding and as the basis for the learned finite element spaces. This explicitly connects the underlying geometry and imposed boundary conditions to the solution, providing a powerful inductive bias for learning neural PDEs, which we demonstrate improves generalization to unseen domains. We provide a novel parameterization of the constitutive model ensuring the existence and uniqueness of the solution. Our approach demonstrates state-of-the-art performance on several steady-state PDE benchmarks, and provides a significant improvement over conventional baselines on out-of-distribution geometries.

2602.03164 2026-06-10 cs.LG cs.AI 版本更新

MemCast: Memory-Driven Time Series Forecasting with Experience-Conditioned Reasoning

MemCast:基于经验条件推理的记忆驱动时间序列预测

Xiaoyu Tao, Mingyue Cheng, Ze Guo, Shuo Yu, Yaguo Liu, Qi Liu, Shijin Wang

发表机构 * University of Science and Technology of China(中国科学技术大学)

AI总结 提出MemCast框架,将时间序列预测转化为经验条件推理任务,通过层次化记忆学习历史模式、推理智慧和一般规律,并采用动态置信度适应策略实现持续进化,在多个数据集上优于现有方法。

详情
AI中文摘要

时间序列预测(TSF)在许多现实世界的决策中起着关键作用。最近,基于大型语言模型(LLM)的预测器取得了有希望的进展。尽管有效,现有方法通常缺乏显式的经验积累和持续进化。在这项工作中,我们提出了MemCast,一个学习到记忆的框架,将TSF重新表述为经验条件推理任务。具体来说,我们从训练集中学习经验并将其组织成层次化记忆。这是通过将预测结果总结为历史模式、将推理轨迹提炼为推理智慧、以及将提取的时间特征归纳为一般规律来实现的。此外,在推理过程中,我们利用历史模式指导推理过程,利用推理智慧选择更好的轨迹,而一般规律则作为反思迭代的标准。另外,为了实现持续进化,我们设计了一种动态置信度适应策略,该策略在不泄露测试集分布的情况下更新单个条目的置信度。在多个数据集上的大量实验表明,MemCast始终优于以前的方法,验证了我们方法的有效性。我们的代码可在以下网址获得:此 https URL。

英文摘要

Time series forecasting (TSF) plays a critical role in decision-making for many real-world applications. Recently, large language model (LLM)- based forecasters have made promising advancements. Despite their effectiveness, existing methods often lack explicit experience accumulation and continual evolution. In this work, we propose MemCast, a learning-to-memory framework that reformulates TSF as an experience-conditioned reasoning task. Specifically, we learn experience from the training set and organize it into a hierarchical memory. This is achieved by summarizing prediction results into historical patterns, distilling inference trajectories into reasoning wisdom, and inducing extracted temporal features into general laws. Furthermore, during inference, we leverage historical patterns to guide the reasoning process and utilize reasoning wisdom to select better trajectories, while general laws serve as criteria for reflective iteration. Additionally, to enable continual evolution, we design a dynamic confidence adaptation strategy that updates the confidence of individual entries without leaking the test set distribution. Extensive experiments on multiple datasets demonstrate that MemCast consistently outperforms previous methods, validating the effectiveness of our approach. Our code is available at https://github.com/Xiaoyu-Tao/MemCast-TS.

2602.12542 2026-06-10 cs.LG cs.AI 版本更新

Exploring Accurate and Transparent Domain Adaptation in Predictive Healthcare via Concept-Grounded Orthogonal Inference

探索预测性医疗中基于概念正交推理的准确且透明的域适应

Pengfei Hu, Chang Lu, Feifan Liu, Yue Ning

发表机构 * Department of Computer Science, Stevens Institute of Technology, Hoboken, NJ, United States(斯蒂文斯理工学院计算机科学系) UMass Chan Medical School, University of Massachusetts Amherst, Amherst, MA, United States(马萨诸塞大学阿默斯特分校UMass Chan医学学校)

AI总结 提出ExtraCare模型,通过将患者表示分解为不变和协变分量并强制正交,在保留标签信息的同时暴露域特定变异,实现准确预测并提供基于医疗概念的可解释性。

Comments Accepted by ICML 2026 Main Conference

详情
AI中文摘要

用于电子健康记录(EHR)临床事件预测的深度学习模型在不同数据分布下部署时,常常性能下降。虽然域适应(DA)方法可以缓解这种偏移,但其“黑箱”性质阻碍了在临床实践中的广泛采用,而临床实践中透明度对于信任和安全至关重要。我们提出ExtraCare,将患者表示分解为不变和协变分量。通过监督这两个分量并在训练中强制其正交性,我们的模型在保留标签信息的同时暴露域特定变异,从而实现比大多数特征对齐模型更准确的预测。更重要的是,它通过将稀疏的潜在维度映射到医疗概念,并通过目标消融量化其贡献,提供人类可理解的解释。ExtraCare在两个真实EHR数据集上,跨多个域划分设置进行评估,展示了优越的性能以及增强的透明度,其准确预测和来自广泛案例研究的解释证明了这一点。

英文摘要

Deep learning models for clinical event prediction on electronic health records (EHR) often suffer performance degradation when deployed under different data distributions. While domain adaptation (DA) methods can mitigate such shifts, their "black-box" nature prevents widespread adoption in clinical practice where transparency is essential for trust and safety. We propose ExtraCare to decompose patient representations into invariant and covariant components. By supervising these two components and enforcing their orthogonality during training, our model preserves label information while exposing domain-specific variation at the same time for more accurate predictions than most feature alignment models. More importantly, it offers human-understandable explanations by mapping sparse latent dimensions to medical concepts and quantifying their contributions via targeted ablations. ExtraCare is evaluated on two real-world EHR datasets across multiple domain partition settings, demonstrating superior performance along with enhanced transparency, as evidenced by its accurate predictions and explanations from extensive case studies.

2602.17907 2026-06-10 cs.CL cs.AI 版本更新

Improving Topic Modeling by Distilling Soft Labels from Language Models

DSL-Topic:通过从语言模型中蒸馏软标签改进主题建模

Raymond Li, Amirhossein Abaskohi, Chuyuan Li, Gabriel Murray, Giuseppe Carenini

发表机构 * University of Washington(华盛顿大学)

AI总结 提出DSL框架,通过从语言模型蒸馏软标签来增强主题模型训练,利用上下文感知的软标签重构信号,显著提升主题连贯性和分配准确性。

Comments 22 pages, 5 figures. Camera-ready version for ICML 2026

详情
AI中文摘要

传统的神经主题模型通常通过重构文档的词袋表示来优化,忽略了上下文信息并面临数据稀疏性问题。在这项工作中,我们引入了一种新颖的主题模型训练框架,通过从语言模型中蒸馏软标签(DSL)。为了构建上下文丰富的重构信号,我们将基于特定提示的下一个词概率投影到预定义词汇表上,并使用语言模型隐藏状态训练主题模型重构软标签。这产生了更高质量的主题,与语料库的潜在主题结构更加紧密对齐。大量实验表明,DSL在主题连贯性和分配准确性上相比现有基线取得了显著改进。此外,我们还引入了一种基于检索的指标,显示我们的方法在识别语义相似文档方面显著优于现有方法,突显了其在面向检索应用中的有效性。

英文摘要

Traditional neural topic models are typically optimized by reconstructing the document's Bag-of-Words (BoW) representations, overlooking contextual information and struggling with data sparsity. In this work, we introduce a novel topic model training framework by Distilling Soft Labels (DSL) from Language Models (LMs). To construct the contextually enriched reconstruction signals, we project the next token probabilities, conditioned on a specialized prompt, onto a pre-defined vocabulary, and train the topic models to reconstruct the soft labels using the LM hidden states. This produces higher-quality topics that are more closely aligned with the underlying thematic structure of the corpus. Extensive experiments demonstrate that DSL achieves substantial improvements in topic coherence and assignment accuracy over existing baselines. Additionally, we also introduce a retrieval-based metric, which shows that our approach significantly outperforms existing methods in identifying semantically similar documents, highlighting its effectiveness for retrieval-oriented applications.

2603.02221 2026-06-10 cs.LG cs.AI 版本更新

MedFeat: Model-Aware and Explainability-Driven Feature Engineering with LLMs for Clinical Tabular Prediction

MedFeat: 基于模型感知与可解释性驱动的LLM特征工程用于临床表格预测

Zizheng Zhang, Yiming Li, Justin Xu, Jinyu Wang, Rui Wang, Lei Song, Jiang Bian, David W Eyre, Jingjing Fu

发表机构 * Microsoft Research(微软研究院) University of Oxford(牛津大学)

AI总结 提出MedFeat框架,利用模型感知和特征重要性信号迭代引导LLM生成针对性特征,在临床表格预测中平均提升超10%。

详情
AI中文摘要

在临床表格预测中,带有特征工程的经典机器学习模型通常优于神经方法。LLM越来越多地被用于自动化这一过程,作为领域专家提出多样化的特征变换以提升下游性能。然而,现有的基于LLM的方法将特征生成与下游模型解耦:LLM未接收到关于哪些特征当前驱动预测或模型表示能力不足的信号,因此提议既不针对特征空间中有前景的区域,也不适应学习器的归纳偏差。这一缺陷在医疗数据中尤为突出,医疗数据同时表现出类别不平衡、异质特征空间和严格的可解释性要求。本文提出MedFeat,这是首个受机器学习从业者工作流程启发的特征工程框架,利用模型感知和特征重要性信号迭代地指导临床表格学习的特征发现。我们在广泛的具有挑战性的真实临床任务上评估MedFeat,并表明它在统计上显著优于最先进的基线,在具有不同归纳偏差的模型上平均提升超过10%。

英文摘要

In clinical tabular prediction, classical machine learning models with feature engineering often outperform neural methods. LLMs are increasingly used to automate this process, acting as domain experts that propose diverse feature transformations to boost downstream performance. However, existing LLM-based methods decouple feature generation from the downstream model: the LLM receives no signal about which features currently drive predictions or where the model's representational capacity falls short, so proposals are neither targeted to promising regions of the feature space nor tailored to the learner's inductive bias. This shortcoming is amplified in healthcare data, which simultaneously exhibits class imbalance, heterogeneous feature spaces, and strict interpretability requirements. In this paper, we propose MedFeat, the first feature engineering framework inspired by the workflow of machine learning practitioners, leveraging model-awareness and feature importance signals to iteratively guide feature discovery for clinical tabular learning. We evaluate MedFeat on a broad range of challenging real-world clinical tasks and show that it statistically significantly outperforms state-of-the-art baselines, with an average improvement of more than 10% over the baseline across models with distinct inductive biases.

2604.15414 2026-06-10 cs.LG cs.AI cs.NE 版本更新

Beyond Single-Model Optimization: Preserving Plasticity in Continual Reinforcement Learning

超越单模型优化:在持续强化学习中保持可塑性

Lute Lillo, Nick Cheney

发表机构 * Department of Computer Science University of Vermont(计算机科学系大学 of Vermont)

AI总结 提出TeLAPA框架,通过维护行为多样性的策略档案和共享潜在空间,在持续强化学习中实现技能对齐的策略邻域,以解决单策略保存导致的可塑性丧失问题,提升任务学习、恢复和性能保持能力。

详情
AI中文摘要

持续强化学习必须在保留与适应之间取得平衡,然而许多方法仍然依赖于\emph{单模型保存},即承诺将一个不断演化的策略作为跨任务的主要可复用解决方案。即使保留了先前成功的策略,在干扰后它可能不再为快速适应提供可靠的起点,这反映了单策略保存无法解决的一种\emph{可塑性丧失}形式。受质量-多样性方法的启发,我们引入了\emph{TeLAPA}(可迁移的潜在对齐策略档案),这是一个持续强化学习框架,它将行为多样性的策略邻域组织成每个任务的档案,并维护一个共享的潜在空间,使得存档的策略在非平稳漂移下保持可比性和可复用性。这种视角将持续强化学习从保留孤立解决方案转变为维护\emph{技能对齐的邻域},其中包含有能力的、行为相关的策略,以支持未来的重新学习。在我们的MiniGrid持续学习设置中,\emph{TeLAPA}成功学习了更多任务,在干扰后重新访问任务时更快地恢复了能力,并在整个任务序列中保持了更高的性能。我们的分析表明,源最优策略通常不是迁移最优的,即使在局部有能力的邻域内也是如此,并且有效的复用依赖于保留和选择多个邻近的替代方案,而不是将它们合并为一个代表。总之,这些结果将持续强化学习重新定义为围绕可复用且有能力的策略邻域,提供了一条超越单模型保存、迈向更具可塑性的终身智能体的途径。

英文摘要

Continual reinforcement learning must balance retention with adaptation, yet many methods still rely on \emph{single-model preservation}, committing to one evolving policy as the main reusable solution across tasks. Even when a previously successful policy is retained, it may no longer provide a reliable starting point for rapid adaptation after interference, reflecting a form of \emph{loss of plasticity} that single-policy preservation cannot address. Inspired by quality-diversity methods, we introduce \textsc{TeLAPA} (Transfer-Enabled Latent-Aligned Policy Archives), a continual RL framework that organizes behaviorally diverse policy neighborhoods into per-task archives and maintains a shared latent space so that archived policies remain comparable and reusable under non-stationary drift. This perspective shifts continual RL from retaining isolated solutions to maintaining \emph{skill-aligned neighborhoods} with competent and behaviorally related policies that support future relearning. In our MiniGrid CL setting, \textsc{TeLAPA} learns more tasks successfully, recovers competence faster on revisited tasks after interference, and retains higher performance across a sequence of tasks. Our analyses show that source-optimal policies are often not transfer-optimal, even within a local competent neighborhood, and that effective reuse depends on retaining and selecting among multiple nearby alternatives rather than collapsing them to one representative. Together, these results reframe continual RL around reusable and competent policy neighborhoods, providing a route beyond single-model preservation toward more plastic lifelong agents.

2605.08171 2026-06-10 cs.LG cs.AI 版本更新

Communication Dynamics Neural Networks: FFT-Diagonalized Layers for Improved Hessian Conditioning at Reduced Parameter Count

通信动力学神经网络:用于改进Hessian条件数并减少参数数量的FFT对角化层

Lurong Pan

发表机构 * University of California, Berkeley(加州大学伯克利分校)

AI总结 提出CDLinear块循环线性层,通过FFT对角化Hessian矩阵,在参数减少3.8倍下保持高精度,并给出显式条件数分析。

Comments 17 pages, 5 figures. Includes NumPy implementation, gradient checks, MNIST experiments, and reference PyTorch CD-Transformer implementation

详情
AI中文摘要

通信动力学神经网络(CDNNs)将通信动力学框架的循环谱机制应用于神经网络层设计。我们引入了CDLinear,一种块大小为B=2l+1的块循环线性层,其参数数量仅为具有相同输入输出维度的稠密层的1/B。该构造为优化提供了显式的傅里叶域诊断:对于均方损失,权重Hessian矩阵通过离散傅里叶变换对角化,特征值直接由输入块的傅里叶谱决定。在输入预白化下,总体Hessian条件数恰好为1,经验条件数对于N个样本被界定为1+O(sqrt(B/N))。我们使用纯NumPy实现CDLinear,并手工推导反向传播,通过有限差分验证梯度。在8x8 MNIST数字基准上,三个随机种子下,使用B=4的CDLinear MLP以2,380个参数达到97.50%±0.23%的测试准确率,而使用8,970个参数的稠密基线为98.15%±0.47%。这实现了3.8倍的参数减少,准确率损失0.65%。CD-MLP的平均Hessian条件数为1.9e4,比稠密基线的5.9e6小约310倍。我们将CDLinear定位为结构化矩阵神经网络层的一个特例,主要贡献包括闭式Hessian谱诊断、原则性的块多重性离散序列以及显式的条件数分析。我们还发布了一个参考PyTorch实现,将CDLinear集成到DeepSeek-V3风格的混合专家Transformer中,用于未来大规模基准测试。

英文摘要

Communication Dynamics Neural Networks (CDNNs) apply the circulant-spectral machinery of the Communication Dynamics framework to neural-network layer design. We introduce CDLinear, a block-circulant linear layer with block size B = 2l + 1 that uses 1/B the parameters of a dense layer with the same input and output dimensions. The construction gives an explicit Fourier-domain diagnostic for optimization: for mean-squared loss, the weight Hessian is diagonalized by the discrete Fourier transform, with eigenvalues determined directly by the Fourier spectrum of the input blocks. Under input pre-whitening, the population Hessian condition number is exactly 1, and the empirical condition number is bounded by 1 + O(sqrt(B/N)) for N samples. We implement CDLinear in pure NumPy with hand-derived backward passes and verify gradients by finite differences. On the 8x8 MNIST digits benchmark, across three random seeds, a CDLinear MLP with B = 4 reaches 97.50% +/- 0.23% test accuracy using 2,380 parameters, compared with 98.15% +/- 0.47% for a dense baseline using 8,970 parameters. This gives a 3.8x parameter reduction at a 0.65% accuracy cost. The CD-MLP's mean Hessian condition number is 1.9e4, about 310x smaller than the dense baseline's 5.9e6. We position CDLinear as a special case of structured matrix neural-network layers, with the main contributions being a closed-form Hessian-spectrum diagnostic, a principled discrete sequence of block multiplicities, and an explicit conditioning analysis. We also release a reference PyTorch implementation integrating CDLinear into a DeepSeek-V3-style mixture-of-experts transformer for future large-scale benchmarks.

2605.27286 2026-06-10 cs.LG cs.AI 版本更新

Falcon-X: A Time Series Foundation Model for Heterogeneous Multivariate Modeling

Falcon-X:面向异构多变量建模的时间序列基础模型

Yiding Liu, Yifan Hu, Hongjie Xia, Peiyuan Liu, Hongzhou Chen, Xilin Dai, Zewei Dong, Jiang-Ming Yang

发表机构 * Ant Group(蚂蚁集团)

AI总结 针对现有时间序列基础模型在语义对齐和关系表达上的局限,提出Falcon-X,通过将变量映射到统一潜在原型空间,利用统一原型差分注意力机制对齐异构变量,并在共享空间中通过潜在实体注意力进行跨变量交互,实现零样本结构迁移,在GIFT-Eval和fev-bench基准上取得最先进预测性能。

详情
AI中文摘要

时间序列基础模型(TSFMs)通过大规模跨领域预训练正在改变预测范式。然而,大多数现有的TSFMs仍然是单变量的,最近实现跨变量建模的努力仍然直接在原始变量空间内操作。这种设计在语义对齐和关系表达性方面引入了基本限制。具体来说,原始空间分组混合缺乏对齐异构物理量的专用机制,而标准的非负注意力无法捕捉现实世界系统中普遍存在的复杂协同和拮抗交互。为了解决这些挑战,我们提出了Falcon-X,它将变量从原始空间中解耦,并将其映射到一个统一的潜在原型空间中。Falcon-X采用统一原型差分注意力机制,显式评估正负语义亲和度以对齐异构变量。然后通过潜在实体注意力在该共享空间内高效执行跨变量交互,自然地促进零样本结构迁移。最后,变量重组路由器通过请求-分发机制稳健地重建变量特定的轨迹。在GIFT-Eval和fev-bench基准上的广泛评估表明,Falcon-X实现了最先进的预测性能,为复杂的多变量环境提供了一种有原则且可扩展的范式。Falcon-X已公开发布以支持未来研究。

英文摘要

Time series foundation models (TSFMs) are transforming the forecasting paradigm through large-scale cross-domain pretraining. However, most existing TSFMs remain univariate, and recent efforts to enable cross-variate modeling still operate directly within the raw variate space. This design introduces fundamental limitations in semantic alignment and relational expressivity. Specifically, raw-space group mixing lacks a dedicated mechanism to align heterogeneous physical quantities, while standard non-negative attention fails to capture the complex synergistic and antagonistic interactions ubiquitous in real-world systems. To address these challenges, we propose Falcon-X, decouples variates from the raw space and maps them into a unified latent prototype space. Falcon-X employs a Unified Prototype Diff-Attention mechanism that explicitly evaluates both positive and negative semantic affinities to explicitly align heterogeneous variates. Cross-variate interactions are then efficiently performed within this shared space via Latent Entity Attention, naturally facilitating zero-shot structural transfer. Finally, a Variate Reassembly Router robustly reconstructs variate-specific trajectories via a request-and-dispatch mechanism. Extensive evaluations on the GIFT-Eval and fev-bench benchmarks demonstrate that Falcon-X achieves excellent forecasting performance, offering a principled and scalable paradigm for complex multivariate environments. Falcon-X is publicly released to support future research.

2605.28057 2026-06-10 cs.LG cs.AI 版本更新

On the Learnability of Test-Time Adaptation: A Recovery Complexity Perspective

测试时自适应可学习性:恢复复杂性视角

Zhi Zhou, Ming Yang, Shi-Yu Tian, Kun-Yang Yu, Lan-Zhe Guo, Yu-Feng Li

发表机构 * Zhi Zhou(周智) Ming Yang(杨明) Shi-Yu Tian(田世宇) Kun-Yang Yu(于坤阳) Lan-Zhe Guo(郭兰哲) Yu-Feng Li(李宇峰)

AI总结 本文提出首个理论框架,通过引入(ε,δ)-恢复复杂性和(ε,ρ)-TTA可学习性,研究非平稳流下测试时自适应的可学习性,并揭示自适应-信息权衡。

Comments Accepted by ICML 2026

详情
AI中文摘要

测试时自适应(TTA)旨在调整模型以在非平稳测试流上保持可靠性能,无需标注数据。尽管其实证成功,非平稳流下TTA的可学习性仍未探索。关键挑战在于缺乏一个原则性的理论框架,该框架同时与TTA目标一致,并捕捉持续演变的分布偏移和内在信息约束。为解决这一差距,我们提出了首个研究TTA可学习性的理论框架,并引入了(ε,δ)-恢复复杂性和(ε,ρ)-TTA可学习性。恢复复杂性衡量偏移后维持超额风险低于目标水平所需的时间(高概率),并进一步扩展到TTA可学习性,衡量TTA的长期可靠性。在该框架内,我们引入了一种新颖的非平稳测试流离散替代,使得对渐变和突变进行统一且易于处理的分析成为可能。我们推导了恢复复杂性的阶匹配下界和上界,揭示了TTA的基本限制和内在的自适应-信息权衡。这些结果为TTA提供了统一的可学习性保证,补充了基于遗憾的分析。

英文摘要

Test-time adaptation (TTA) aims to adapt models to maintain reliable performance on non-stationary test streams without requiring labeled data. Despite its empirical success, the learnability of TTA under non-stationary streams remains unexplored. A key challenge is the lack of a principled theoretical framework that simultaneously aligns with the TTA objective and captures both continuously evolving distribution shifts and intrinsic information constraints. To address this gap, we propose the first theoretical framework for studying the learnability of TTA and introduce $(ε,δ)$-Recovery Complexity and $(ε,ρ)$-TTA Learnability. Recovery complexity measures the post-shift time needed to maintain excess risk below a target level with high probability, and is further extended to TTA learnability, which measures the long-term reliability of TTA. Within this framework, we introduce a novel discrete surrogate for non-stationary test streams, enabling a unified and tractable analysis of both gradual and abrupt shifts. We derive order-wise matching lower and upper bounds on recovery complexity, revealing fundamental limits of TTA and an intrinsic adaptivity-information trade-off. These results provide unified learnability guarantees for TTA that complement regret-based analyses.

2605.30370 2026-06-10 cs.NE cs.AI cs.CV cs.LG 版本更新

Updating the standard neuron model in artificial neural networks

更新人工神经网络中的标准神经元模型

Raul Mohedano, Thomas Batard, Erik Velasco-Salido, Ramsses De Los Santos Mendoza, Jorge H. Martínez, Stacey Levine, Marcelo Bertalmío

发表机构 * Spanish National Research Council (CSIC)(西班牙国家研究理事会(CSIC)) Center for Research in Mathematics (CIMAT)(数学研究中心(CIMAT)) Universidad Autónoma de Madrid (UAM)(马德里自治大学(UAM)) National Science Foundation (NSF)(国家科学基金会(NSF))

AI总结 本文用更真实的皮层细胞模型替代标准点神经元模型,在不增加参数的情况下,提升了人工神经网络的表达能力、鲁棒性和学习速度,并减少了记忆化和所需训练数据量。

Comments Acknowledgments included in the manuscript

详情
AI中文摘要

自20世纪50年代诞生以来,人工神经网络(ANNs)一直使用当时神经科学中流行的所谓点神经元模型,希望这种类比能够更好地模拟大脑功能。多年来,神经科学文献表明点神经元模型过于简单,无法正确表示许多基本的神经过程;然而,ANNs中的标准神经元模型仍然保持不变。在这里,我们用一个非常新的皮层细胞模型替代它,并通过理论分析和实验结果证明,仅仅通过使用更真实的神经单元元素而不增加参数数量,所得到的ANNs提供了许多重要优势,包括增强的表达能力、鲁棒性和学习速度,以及减少记忆化和所需的训练数据量。

英文摘要

From their inception in the 1950s, artificial neural networks (ANNs) started using the so-called point neuron model then prevalent in neuroscience, hoping that this analogy would allow for a better emulation of brain function. Over the years the neuroscience literature has shown that the point neuron model is too simplistic to properly represent many fundamental neural processes; however, the standard neuron model in ANNs still remains the same. Here we substitute it by a very recent model of cortical cells and demonstrate through theoretical analyses and experimental results how, simply by using a more realistic neural unit element without augmenting the number of parameters, the resulting ANNs offer a number of important advantages that include increases in expressivity, robustness and learning speed, and a reduction in memorization and the amount of training data needed.

2606.02378 2026-06-10 cs.LG cs.AI 版本更新

When Do Attention Circuits Form? Developmental Trajectories of Capability and Attention-Sink Emergence Across Three 1B-ClassArchitectures

注意力电路何时形成?三种1B级架构中能力和注意力汇出现的发育轨迹

Yongzhong Xu

发表机构 * University of California, Berkeley(加州大学伯克利分校)

AI总结 本文追踪三种1B级语言模型中注意力头电路的形成轨迹,发现归纳电路形成早于注意力汇形成10-20倍令牌,且电路识别无需最终模型。

Comments 27pages, 3 figures

详情
AI中文摘要

我们追踪了三种1B级语言模型(涵盖两个架构系列:密集Transformer和混合专家,以及两个预训练语料库:The Pile和DCLM)中注意力头电路形成的发育轨迹:Pythia 1B、OLMo 1B-0724-hf和OLMoE 1B-7B-0924。在每个模型的10个对数间隔检查点(共30次机械可解释性运行)上,我们应用参与比(PR)谱信号和全头能力特异性选择性筛选,以追踪归纳头、前一个标记头和BOS吸引头出现的过程。五个发现:(F1)第0层和第1层在每个模型的每个检查点都没有产生BOS分类头:L0/L1零BOS底限是架构属性,而非学习结果。(F2)整个模型的BOS吸引头比例呈现三种不同的出现形状——Pythia 1B中逐渐上升,OLMo 1B中急剧相变(相邻检查点间从7%到70%),OLMoE 1B-7B中逐渐上升。(F3)在DCLM模型中,归纳电路的形成比BOS吸引头的形成早10-20倍令牌;能力电路的形成和注意力汇的形成是两个不同的转变,而非一个。(F4)能力特异性筛选在总训练令牌的0.3-2%内收敛到最终的归纳电路——电路识别不需要最终模型。(F5)对于从所有三个模型中采样的每个最终检查点归纳头,其每个头的PR值在该头首次超过其能力选择性阈值时或之前就已升高。这些结果细化了归纳相变框架:在DCLM上训练的1B级模型中,归纳转变和注意力汇转变在令牌数量上相差一个数量级,并且具有定性不同的形状。

英文摘要

We track the developmental trajectory of attention-head circuit formation across three 1B-class language models spanning two architecture families (dense transformer, mixture-of-experts) and two pretraining corpora (The Pile, DCLM): Pythia 1B, OLMo 1B-0724-hf, and OLMoE 1B-7B-0924. At each of 10 log-spaced revisions per model -- 30 mechanistic-interpretability runs in total -- we apply a participation-ratio (PR) spectral signal and an all-head capability-specific selectivity screen to track induction, previous-token, and BOS-attractor heads as they emerge. Five findings. (F1) Layers 0 and 1 produce zero BOS-classified heads at every revision in every model: the L0/L1 zero-BOS floor is an architectural property, not a learned outcome. (F2) The whole-model BOS-attractor fraction follows three distinct emergence shapes -- a gradual ramp in Pythia 1B, a sharp phase transition in OLMo 1B (7% to 70% between adjacent checkpoints), and a gradual ramp in OLMoE 1B-7B. (F3) In DCLM models, induction-circuit formation precedes BOS-attractor formation by 10-20x in tokens; capability-circuit formation and attention-sink formation are two transitions, not one. (F4) The capability-specific screen converges to the final induction circuit within 0.3-2% of total training tokens -- circuit identification does not require the final model. (F5) For every final-checkpoint induction head sampled across all three models, per-head PR is elevated at or before the first revision at which that head crosses its capability-selectivity threshold. The results refine the induction-phase-transition framing: in 1B-class models trained on DCLM, the induction transition and the attention-sink transition are separated by an order of magnitude in tokens and have qualitatively different shapes.

2606.06021 2026-06-10 cs.LG cs.AI 版本更新

OPRD: On-Policy Representation Distillation

OPRD: 在线策略表示蒸馏

Shenzhi Yang, Guangcheng Zhu, Bowen Song, Haobo Wang, Mingxuan Xia, Xing Zheng, Yingfan Ma, Zhongqi Chen, Weiqiang Wang, Gang Chen

发表机构 * Zhejiang University(浙江大学) Ant Group(蚂蚁集团)

AI总结 针对在线策略蒸馏中输出空间监督的采样方差和忽略中间隐藏状态的问题,提出OPRD方法,通过在隐藏状态空间对齐师生表示,消除采样方差、提供更丰富的逐层结构信息,并在AIME等基准上缩小师生差距,训练速度提升1.44倍,内存减少54%。

详情
AI中文摘要

在线策略蒸馏(OPD)仅通过匹配下一个词元的概率在输出空间监督学生。这种仅输出范式有两个限制:(1)在大词汇量(例如Qwen约15万个词元)上,蒙特卡洛KL估计的采样方差在整个训练过程中持续存在;(2)它将教师视为黑盒,丢弃了LM头之后的所有中间隐藏状态。我们提出在线策略表示蒸馏(OPRD),通过在相同轨迹上选择层对齐学生和教师的表示,将蒸馏提升到隐藏状态空间,完全绕过LM头。理论上,OPRD消除了采样方差,并提供了更丰富的逐层结构信息。实验上,OPRD在AIME 2024/2025和AIMO上缩小了学生与教师之间的差距,而输出空间OPD基线停滞在教师水平以下。OPRD的训练速度也比top-k OPD快1.44倍,内存使用减少54%。代码:https://github.com/ShenzhiYang2000/OPRD。

英文摘要

On-policy distillation (OPD) supervises the student only in output space by matching next-token probabilities. This output-only paradigm has two limits: (1) sampling variance from Monte Carlo KL estimates over large vocabularies (e.g., Qwen's ~150k tokens) persists throughout training, and (2) it treats the teacher as a black-box, discarding all intermediate hidden states after the LM head. We propose On-Policy Representation Distillation (OPRD), which lifts distillation into hidden-state space by aligning student and teacher representations across selected layers on the same rollouts, bypassing the LM head entirely. Theoretically, OPRD eliminates sampling variance and provides richer per-layer structural information. Empirically, OPRD closes the student-teacher gap on AIME 2024/2025 and AIMO, while output-space OPD baselines plateau below the teacher. OPRD also trains 1.44x faster and uses 54% less memory than top-k OPD. Code: https://github.com/ShenzhiYang2000/OPRD.

2606.07605 2026-06-10 cs.LG cs.AI 版本更新

SRT: Super-Resolution for Time Series via Disentangled Rectified Flow

SRT: 基于解缠校正流的时间序列超分辨率

Jufang Duan, Shenglong Xiao, Yuren Zhang

发表机构 * Bytedance(字节跳动)

AI总结 提出SRT框架,通过解缠校正流将低分辨率时间序列重建为高分辨率,分解趋势与季节成分,利用隐式神经表示对齐分辨率,并引入跨分辨率注意力机制生成细节。

Comments Accepted to the International Conference on Learning Representations (ICLR) 2026

详情
Journal ref
The Fourteenth International Conference on Learning Representations (ICLR 2026)
AI中文摘要

具有高时间分辨率的细粒度时间序列数据对于广泛应用的精确分析至关重要。然而,获取此类数据通常受到成本和可行性的限制。可以通过基于特定先验从低分辨率输入重建高分辨率信号来解决此问题,这被称为超分辨率。虽然在计算机视觉中得到了广泛研究,但直接将图像超分辨率技术迁移到时间序列并非易事。为了从根本上解决这一挑战,我们提出了时间序列超分辨率(SRT),这是一种通过解缠校正流重建低分辨率输入中丢失的时间模式的新框架。SRT将输入分解为趋势和季节成分,使用隐式神经表示将它们对齐到目标分辨率,并利用一种新颖的跨分辨率注意力机制来指导高分辨率细节的生成。我们进一步引入了SRT-large,这是一个经过大规模预训练的扩展版本,具有强大的零样本超分辨率能力。在九个公共数据集上的大量实验表明,SRT和SRT-large在多个尺度因子下始终优于现有方法,展示了稳健的性能以及我们架构中每个组件的有效性。

英文摘要

Fine-grained time series data with high temporal resolution is critical for accurate analytics across a wide range of applications. However, the acquisition of such data is often limited by cost and feasibility. This problem can be tackled by reconstructing high-resolution signals from low-resolution inputs based on specific priors, known as super-resolution. While extensively studied in computer vision, directly transferring image super-resolution techniques to time series is not trivial. To address this challenge at a fundamental level, we propose Super-Resolution for Time series (SRT), a novel framework that reconstructs temporal patterns lost in low-resolution inputs via disentangled rectified flow. SRT decomposes the input into trend and seasonal components, aligns them to the target resolution using an implicit neural representation, and leverages a novel cross-resolution attention mechanism to guide the generation of high-resolution details. We further introduce SRT-large, a scaled-up version with extensive pre-training, which enables strong zero-shot super-resolution capability. Extensive experiments on nine public datasets demonstrate that SRT and SRT-large consistently outperform existing methods across multiple scale factors, showing both robust performance and the effectiveness of each component in our architecture.

6. 自然语言与多模态智能 40 篇

2606.10147 2026-06-10 cs.AI cs.CL cs.CV cs.SD 新提交

From Senses to Decisions: The Information Flow of Auditory and Visual Perception in Multimodal LLMs

从感知到决策:多模态大语言模型中听觉与视觉感知的信息流

Wish Suharitdamrong, Muhammad Awais, Xiatian Zhu, Sara Atito

AI总结 研究多模态大语言模型(AVLLMs)中音频和视觉信息流的路径与整合机制,发现顺序流与并行流两种路由模式,并证明信息传递后可丢弃无关token以提升效率。

Comments 40 pages, 29 figures

详情
AI中文摘要

多模态大语言模型(MLLMs)能够听和看,但音频和视觉信号实际上如何通过网络传播以形成答案?尽管它们在研究和实际应用中的作用日益增长,但音频和视觉标记影响最终预测的内部路径仍然知之甚少。在本研究中,我们考察了音频-视觉大语言模型(AVLLMs)内部的音视频信息流,追踪了AVLLMs如何在两种输入配置(音视频视频和多个交错音视频项目)下路由、利用和整合音频与视觉信息。我们发现,对于音视频视频,AVLLMs遵循为VLMs和VideoLLMs建立的顺序信息流路径,音频和视觉贡献沿着该路径按任务对每种模态的依赖程度成比例流动。在多个交错音视频项目的设置中,这种路由转变为不同的并行流。此外,我们证明,一旦音频-视觉和其他类型的标记的信息被传递到LLM,它们可以被丢弃,对模型的预测影响最小甚至略有改善,这适用于多个任务和数据集,从而实现更高效的推理。这些发现适用于多个模型和规模,包括3B和7B规模的Qwen2.5-Omni和Video-SALMONN2 Plus,从而产生了关于这些流结构为何出现的假设。总之,这些结果首次清晰地描绘了AVLLMs如何在网络内部协调声音和视觉,并为音频-视觉及更广泛的MLLMs在可解释性、设计和效率方面的下一波进展奠定了基础。

英文摘要

Multimodal Large Language Models (MLLMs) can listen and see, but how do audio and visual signals actually travel through the network to shape an answer? Despite their growing role in research and real-world applications, the internal pathways through which audio and visual tokens influence the final prediction remain poorly understood. In this study, we examine audio-visual information flow inside Audio-Visual Large Language Models (AVLLMs), tracing how AVLLMs route, utilize, and integrate audio and visual information across two input configurations, audio-visual video and multiple interleaved audio-visual items. We find that for audio-visual video, AVLLMs follow the sequential information flow pathway established for VLMs and VideoLLMs, with audio and visual contribution flowing along this pathway in proportion to the task's reliance on each modality. In settings with multiple interleaved audio-visual items, this routing shifts to different parallel streams. Furthermore, we demonstrate that audio-visual and other token types can be discarded once their information is transferred to LLM, with minimal impact on the model's prediction or even slight improvement, generalizing across multiple tasks and datasets, enabling more efficient inference. These findings hold across multiple models and scales, Qwen2.5-Omni and Video-SALMONN2 Plus at 3B and 7B scales, leading to hypotheses on why these flow structures emerge. Together, these results deliver the first coherent picture of how AVLLMs orchestrate sound and sight inside the network and lay the groundwork for the next wave of interpretability, design, and efficiency advances in audio-visual and broader MLLMs.

2606.10298 2026-06-10 cs.AI cs.CL 新提交

From Context-Aware to Conflict-Aware: Generalizing Contrastive Decoding for Knowledge Conflict in LLMs

从上下文感知到冲突感知:泛化对比解码以应对LLMs中的知识冲突

Runze Jiang, Taiqiang Wu, Yan Wang, Bingyu Zhu, Longtao Huang

发表机构 * Peking University(北京大学) Alibaba Group(阿里巴巴集团) The University of Hong Kong(香港大学)

AI总结 针对大语言模型生成时外部上下文与参数先验之间的知识冲突,提出冲突感知范式,通过动态分配先验与上下文的权重,并设计自适应机制解决不同冲突状态下的不对称问题。

Comments 27 pages, 9 figures

详情
AI中文摘要

当大语言模型从检索或增强的上下文中生成时,外部上下文与参数先验之间的冲突仍然是核心可靠性瓶颈。现有的对比解码方法遵循一种\emph{上下文感知}范式,单方面放大上下文而压制参数先验,当上下文错误时会覆盖正确的先验。我们将其泛化为\textbf{冲突感知}范式,基于冲突信号动态分配先验与上下文的权威,而非预设上下文的可信度。我们证明,先验和上下文logits的仿射组合产生一个\textbf{幂族},具有固有的\textbf{状态不对称性}:当先验正确时外推会无界放大错误,当上下文正确时内插会纠正不足,且没有静态状态能同时覆盖两者。现有的对比解码方法是该族实例,大多为外推型。为评估两种冲突方向,我们提出TriState-Bench,一种模型感知的评估协议,校准每个模型的先验知识以测量三种冲突状态:纠正、抵抗和一致。为解决不对称性,我们提出自适应状态路由(ARR),在每一步在状态间路由,将抵抗EM从低于6提升至16-33,且不牺牲纠正或一致。我们的代码可在该https URL获取。

英文摘要

When large language models generate from retrieved or augmented contexts, conflicts between external context and parametric priors remain a central reliability bottleneck. Existing contrastive decoding methods follow a \emph{context-aware} paradigm that unilaterally amplifies context over parametric priors, overwriting correct priors when the context is erroneous. We generalize this to the \textbf{conflict-aware} paradigm that dynamically allocates authority between prior and context based on conflict signals, rather than presupposing context trustworthiness. We show that the affine combination of prior and context logits yields a \textbf{power family} with an inherent \textbf{regime asymmetry}: extrapolation amplifies errors unboundedly when the prior is correct, interpolation under-corrects when the context is correct, and no static regime covers both. Existing contrastive decoding methods are instances of this family, mostly extrapolative. To evaluate both conflict directions, we propose TriState-Bench, a model-aware evaluation protocol that calibrates per-model prior knowledge to measure three conflict states: correction, resistance, and agreement. To resolve the asymmetry, we propose Adaptive Regime Routing (ARR), which routes between regimes at each step, lifting resistance EM from below 6 to 16--33 without sacrificing correction or agreement. Our code is available at https://github.com/keith-Jiang/conflict-aware-decoding.

2606.10334 2026-06-10 cs.AI 新提交

Self-Distillation Policy Optimization via Visual Feedback: Bridging Code and Visual Artifacts

通过视觉反馈的自蒸馏策略优化:连接代码与视觉工件

Haoyu Dong

发表机构 * Microsoft(微软)

AI总结 提出Visual-SDPO框架,利用渲染视觉反馈作为特权上下文,通过自蒸馏和视觉引导的代码信用加权优化代码生成视觉工件的质量,在图表、UI和幻灯片生成任务上显著提升性能。

详情
AI中文摘要

代码生成大语言模型(LLMs)通过编写由不可微渲染器执行的程序,越来越多地生成图表、网页和幻灯片等视觉工件,在观察渲染结果之前就确定了代码。因此,原本可执行的代码常常产生具有视觉显著缺陷的工件,包括元素重叠、文本裁剪、对齐破坏、对比度低和溢出。我们研究针对代码生成视觉工件的视觉反馈自蒸馏。我们提出Visual-SDPO,一种自蒸馏策略优化框架,将渲染的视觉反馈视为权重共享教师的特权上下文,并将该反馈蒸馏到编码学生中。为了使监督具有空间针对性而非均匀性,我们引入视觉引导的代码信用加权,将每个检测到的缺陷追溯到影响该元素的代码语句,并放大这些语句上的蒸馏信号。序列级GRPO(组相对策略优化)项通过奖励可执行、视觉质量高的 rollout 来补充密集的 token 级目标,而失败的执行通过自蒸馏路径仍然可学习,通过将执行错误作为特权上下文传递给教师。我们使用统一的 Qwen3-VL-8B-Instruct 骨干网络,在图表、网页/UI和幻灯片生成任务上实例化 Visual-SDPO。在图表到代码、UI到代码和幻灯片生成基准(ChartMimic、Design2Code和AeSlides)上,Visual-SDPO 在主要指标上比零样本基线提升超过10个绝对点,比GRPO提升至少2.4个点,且训练步骤更少,无额外推理成本。

英文摘要

Code-generating large language models (LLMs) increasingly produce visual artifacts such as charts, web pages, and slides by writing programs that are executed by non-differentiable renderers, committing to code before observing the render. As a result, otherwise executable code often yields artifacts with visually salient defects, including overlapping elements, clipped text, broken alignment, low contrast, and overflow. We study visual-feedback self-distillation for code-generated visual artifacts. We propose Visual-SDPO, a self-distillation policy-optimization framework that treats rendered visual feedback as privileged context for a weight-sharing teacher and distills this feedback into a coding student. To make supervision spatially targeted rather than uniform, we introduce Visual-Grounded Code Credit Weighting, which traces each detected defect back to the code statements responsible for the affected elements and amplifies the distillation signal on those statements. A sequence-level GRPO (Group Relative Policy Optimization) term complements the dense token-level objective by rewarding executable, visually high-quality rollouts, while failed executions remain learnable through the self-distillation path by passing execution errors as privileged context to the teacher. We instantiate Visual-SDPO for chart, web/UI, and slide generation with a unified Qwen3-VL-8B-Instruct backbone. Across chart-to-code, UI-to-code, and slide-generation benchmarks (ChartMimic, Design2Code, and AeSlides), Visual-SDPO improves over the zero-shot base by more than 10 absolute points in the primary metric and over GRPO by at least 2.4 points, with fewer training steps and no added inference-time cost.

2606.09837 2026-06-10 cs.HC cs.AI 交叉投稿

Self-EmoQ: Plutchik-Guided Value-based Planning to Drive Streaming Emotional TTS

Self-EmoQ: 基于Plutchik引导的价值规划驱动流式情感TTS

Yue Zhao, Hongyan Li, Yong Chen, Luo Ji

发表机构 * Geely AI Lab(地平线人工智能实验室)

AI总结 提出一种情感规划框架,通过强化学习训练LLM模块,在文本生成前确定情感,以驱动流式TTS,结合Plutchik情感理论进行混合奖励,实验表明在情感确定和响应质量上优于基线。

Comments Accepted to ACL 2026 Findings

详情
AI中文摘要

情感交互对于对话AI越来越重要,但当前系统缺乏自我情感确定机制来驱动流式文本到语音(TTS)合成。我们提出一个情感规划框架,在文本生成之前确定情感,以流式方式为下游情感TTS提供基础。该框架通过一个即插即用的LLM模块实现,该模块从预训练LLM初始化,并通过强化学习(RL)训练,以情感作为动作。采用混合奖励,结合模仿信号和理论驱动评分,其中采用了Plutchik情感轮理论。通过在DailyDialog、EmoryNLP、IMEOCAP和MELD上的实验,我们的方法在情感确定和响应质量上均优于提示和微调基线。我们最终实现了一个完整的流式管道用于实时部署,语音质量证实了框架的情感对齐、上下文连贯性和表达流畅性。代码、案例和演示可在该https URL获取。

英文摘要

Emotional interaction is increasingly crucial for conversational AI, yet current systems lack a self-emotion determination mechanism to drive the streaming text-to-speech (TTS) synthesis. We propose an emotion-planning framework that determines the emotion prior to the textual generation, grounding the downstream emotional TTS in a streaming manner. The framework is implemented by a plug-and-play LLM module, initialized from pretrained LLMs, and trained by reinforcement learning (RL) with emotions as the actions. A hybrid reward is employed which combines imitation signals with theory-driven scoring, in which the theory of Plutchik's wheel of emotions is adopted. By experiments on DailyDialog, EmoryNLP, IMEOCAP, and MELD, our method outperforms prompting and finetuning baselines on both emotion determination and response quality. We finally implement an entire streaming pipeline for real-time deployment, with the speech quality confirming the framework's emotional alignment, contextual coherence, and expressive fluency. Codes, cases, and demos are available in https://sixingdeguo.github.io/EmoQ-page/.

2606.09871 2026-06-10 cs.CV cs.AI cs.LG 交叉投稿

SD-GRPO: Verifiable Segment Decomposition for Long-Form Vision-Language Generation

SD-GRPO:面向长格式视觉-语言生成的可验证片段分解

Hyunwoong Kim, Seongeun Lee, Hannah Yun, Junhyun Park, Jonggwon Park

AI总结 提出SD-GRPO方法,通过将长格式输出分解为片段并计算逐片段优势,解决GRPO在视觉-语言任务中粗粒度信用分配不足的问题,实验证明其在多种长格式生成任务中优于基线。

详情
AI中文摘要

群体相对策略优化(GRPO)及其变体最初为大型语言模型(LLM)开发,最近被应用于多模态LLM并取得了强劲结果。然而,它们基于单一标量优势的粗粒度整体信用分配在视觉-语言(VL)任务中拟合不足,这些任务的输出通常是基于语义丰富图像的长格式响应。为解决这一限制,我们利用了一种单标量公式丢弃的结构化信号:长格式VL输出的自然分段。具体地,我们提出片段分解GRPO(SD-GRPO),它对整个rollout组中可验证的逐片段奖励进行z归一化,生成一个逐片段优势向量以替代单一标量。我们在三个设置中评估SD-GRPO,涵盖受控和真实世界的长格式VL生成,按片段间语义纠缠程度递增组织。在从DOCCI构建的受控多面板密集字幕任务中(片段语义独立),SD-GRPO始终优于GRPO基线,且片段数量越多增益越大。扩展到从MultiChartQA构建的受控多图表长格式VQA任务,我们从理论和经验上证明,rollout级奖励存在随输出长度增加而加剧的跨片段信用错误归因。在MMSci数据集上的真实世界科学图表字幕任务中(子图字幕共享图表上下文),混合整体和逐片段奖励进一步提升了两者性能,表明当片段语义纠缠时,仅逐片段归一化是不够的。最后,通过将SD-GRPO集成到Dr. GRPO中,我们确认它可以以最小的实现开销应用于任何GRPO框架,以增强长格式VL生成。

英文摘要

Group Relative Policy Optimization (GRPO) and its variants, originally developed for Large Language Models (LLMs), have recently been applied to Multimodal LLMs and produced strong results. However, their coarse-grained holistic credit assignment from a single scalar advantage underfits vision-language (VL) tasks, where outputs are often long-form responses grounded in semantically rich images. To address this limitation, we exploit a structured signal that single-scalar formulations discard: the natural segmentation of long-form VL outputs. Concretely, we propose Segment-Decomposed GRPO (SD-GRPO), which z-normalizes verifiable per-segment rewards across the rollout group, yielding a vector of per-segment advantages in place of a single scalar. We evaluate SD-GRPO across three settings spanning controlled and real-world long-form VL generation, organized by increasing semantic entanglement across segments. On a controlled multi-panel dense-captioning task constructed from DOCCI, where segments are semantically independent, SD-GRPO consistently outperforms the GRPO baseline, with larger gains at higher segment counts. Extending to a controlled multi-chart long-form VQA task constructed from MultiChartQA, we show both theoretically and empirically that rollout-level rewards suffer from cross-segment credit misattribution that scales with output length. On a real-world scientific figure captioning task on the MMSci dataset, where subfigure captions share context across the figure, blending holistic and per-segment rewards further improves on both, suggesting per-segment normalization alone is insufficient when segments are semantically entangled. Finally, by integrating SD-GRPO into Dr. GRPO, we confirm that it can be applied to any GRPO framework with minimal implementation overhead to enhance long-form VL generation.

2606.09887 2026-06-10 cs.LG cs.AI cs.CL 交叉投稿

SocraticPO: Policy Optimization via Interactive Guidance

SocraticPO: 通过交互式指导进行策略优化

Zirui Liu, Jie Ouyang, Qi Liu, Xianquan Wang, Jiayu Liu, Tingyue Pan, Qingchuan Li, Jing Sha, Zhenya Huang, Shijin Wang, Enhong Chen

发表机构 * State Key Laboratory of Cognitive Intelligence, University of Science and Technology of China(认知智能国家重点实验室,中国科学技术大学) iFLYTEK AI Research (Central China), iFLYTEK Co., Ltd(iFLYTEK中央中国AI研究院,iFLYTEK公司)

AI总结 提出SocraticPO框架,在强化学习中使用自然语言指导辅助推理,并通过奖励衰减防止模型依赖教师帮助,提升科学推理任务性能。

详情
AI中文摘要

用于大语言模型的强化学习通常使用标量结果奖励(如二元正确性)来监督推理。这种奖励提供了优化方向,但很少解释模型应如何修正其错误推理,这可能鼓励捷径学习和脆弱的策略。我们提出\textbf{SocraticPO}(苏格拉底式策略优化),一种策略优化框架,用苏格拉底式的自然语言指导增强强化学习展开。在展开过程中,学生首先独立回答;如果答案错误,教师诊断尝试并提供简洁的纠正性指导,之后学生在扩展的上下文下继续。关键的是,这种指导与奖励衰减配对:在教师干预后获得的正确答案只得到衰减的奖励,防止策略将教师帮助视为获取奖励的免费途径。由于SocraticPO只修改展开过程,而保持标准期望奖励目标不变,它可以插入到现有的策略梯度后端(如Reinforce++)中。此外,由于教师只提供文本级指导,SocraticPO可以利用更强的黑盒教师模型,而无需访问logits或分布匹配。在来自SciKnowEval的本科水平科学推理基准上,SocraticPO优于强强化学习和自蒸馏基线。消融实验表明,目标指导和奖励衰减都是必要的,奖励衰减减轻了对辅助纠正的依赖。

英文摘要

Reinforcement learning (RL) for large language models usually supervises reasoning with scalar outcome rewards, such as binary correctness. Such rewards provide an optimization direction but rarely explain how a model should revise its mistaken reasoning, which can encourage shortcut learning and brittle policies. We propose \textbf{SocraticPO} (Socratic Policy Optimization), a policy-optimization framework that augments RL rollouts with Socratic-style natural-language guidance. During rollout, the student first answers independently; if the answer is incorrect, a teacher diagnoses the attempt and provides concise corrective guidance, after which the student continues under the expanded context. Crucially, this guidance is paired with reward decay: correct answers obtained after teacher intervention only receive decayed rewards, preventing the policy from treating teacher help as a free path to reward. Since SocraticPO only modifies the rollout process while leaving the standard expected-reward objective intact, it can be plugged into existing policy-gradient backends such as Reinforce++. Moreover, because the teacher provides only text-level guidance, SocraticPO can leverage stronger black-box teacher models without requiring access to logits or distribution matching. On undergraduate-level scientific reasoning benchmarks from SciKnowEval, SocraticPO improves over strong RL and self-distillation baselines. Ablations show that both targeted guidance and reward decay are necessary, with reward decay mitigating reliance on assisted correction.

2606.09900 2026-06-10 cs.CL cs.AI cs.IR cs.LG 交叉投稿

Less Context, More Accuracy: A Bi-Temporal Memory Engine for LLM Agents Where a Lean Retrieved Context Beats the Full History

更少上下文,更高准确率:一种用于LLM Agent的双时间记忆引擎,其中精简检索上下文优于完整历史

Liuyin Wang

发表机构 * Independent Researcher(独立研究者)

AI总结 提出一种双时间记忆引擎Engram,通过混合读取路径从约9.6k token的检索片段中回答,在LongMemEval_S上达到83.6%准确率,比完整历史(79k token)高10.4个百分点,且无错误。

Comments 14 pages, 4 figures, 3 tables. Code, reproducible harness, and raw per-question logs: https://github.com/ly-wang19/engram

详情
AI中文摘要

长期记忆是LLM Agent缺失的一层:跨会话时它们会遗忘,而常见的解决方法——将整个历史重放到提示中——成本高、速度慢,且随着干扰物积累,准确性下降。大多数记忆系统在成本或延迟上胜出,但在准确性上仍不如完整上下文基线,且基准测试结果在不一致、不可复现的测试平台上报告,导致同一系统在不同来源上得分差异巨大。我们提出Engram,一种基于双时间数据模型的开源双过程记忆引擎。快速写入路径附加无损事件,无需LLM参与关键路径;异步路径提取原子(主体、谓词、客体)事实,构建双时间知识图谱,并解决矛盾,无需每个事实调用LLM——使事实失效而非删除,因此每个事实都有来源和继承链。混合读取路径融合密集、词汇、图谱和时效/显著性信号,应用时间点(“截至”)过滤器,并组装紧凑、带有来源标记的上下文。在完整的500个问题的LongMemEval_S上,由官方分类特定评判器评分,Engram的精简配置——从约9.6k token的检索片段回答,而非完整历史——得分为83.6%,而完整上下文为73.2%(+10.4个百分点,McNemar p < 10^-6),token数约为1/8(9.6k vs. 79k),且0/500错误。这种增益需要混合读取路径:仅事实会丢失召回率,而事实加检索片段则恢复细节。我们还贡献了一个中立的、仓库内的评估平台,内置官方评判器,并在每个表格中包含完整上下文基线,发布原始每问题日志,并记录了无声扭曲记忆基准的测量完整性陷阱(截断、自制评判器、完整历史泄露)。每个数字都附带复现命令。

英文摘要

Long-term memory is the missing layer for LLM agents: across sessions they forget, and the common workaround -- replaying the whole history into the prompt -- is expensive, slow, and, as distractors accumulate, less accurate. Most memory systems win on cost or latency but still lose to the full-context baseline on accuracy, and benchmark numbers are reported on inconsistent, non-reproducible harnesses, so one system appears at wildly different scores across sources. We present Engram, an open-source, dual-process memory engine on a bi-temporal data model. A fast write path appends lossless episodes with no LLM on the critical path; an asynchronous path extracts atomic (subject, predicate, object) facts, builds a bi-temporal knowledge graph, and resolves contradictions without an LLM call per fact -- invalidating, never deleting, so every fact keeps provenance and a supersession chain. A hybrid read path fuses dense, lexical, graph, and recency/salience signals, applies a point-in-time ("as-of") filter, and assembles a compact, provenance-tagged context. On the full 500-question LongMemEval_S, graded by the official category-specific judge, Engram's lean configuration -- answering from a ~9.6k-token retrieved slice, never the full history -- scores 83.6% vs. 73.2% for full-context (+10.4 points, McNemar p < 10^-6) at ~8x fewer tokens (9.6k vs. 79k), with 0/500 errored. The gain needs a hybrid read path: facts alone lose recall, while facts plus retrieved chunks recover detail. We also contribute a neutral, in-repo evaluation harness with the official judge baked in and the full-context baseline in every table, publish the raw per-question logs, and document the measurement-integrity pitfalls (truncation, home-grown judges, full-history leaks) that silently distort memory benchmarks. Every number ships with a command to reproduce it.

2606.10099 2026-06-10 cs.LG cs.AI 交叉投稿

Unsupervised Style Representation Learning for AI-Text Detection via Paraphrase Inversion

无监督风格表示学习用于通过释义反转检测AI文本

Rafael Rivera Soto, Barry Chen, Nicholas Andrews

AI总结 提出无监督风格编码器,通过重构人工文本与机器生成释义间的差异学习判别性风格特征,实现少样本和零样本AI文本检测,性能优于基线。

详情
AI中文摘要

大型语言模型(LLMs)的快速发展引发了对其滥用的担忧,如抄袭、错误信息和自动化影响操作,这促使需要鲁棒的检测器。最近的研究表明,写作风格的神经表示对于检测是有效的,并且至关重要的是,对于击败大多数现有检测器的对抗攻击具有鲁棒性。然而,当前的基于风格的检测器依赖作者标签进行训练,并且仅限于少样本推理进行检测,需要可能并不总是可用的分布内样本。我们通过训练风格编码器从机器生成的释义中重构人工文本,从而在没有作者标签的情况下学习判别性风格特征;在训练期间冻结语义编码器,使风格编码器偏向于仅捕获重构所需的非语义特征。我们通过两种检测策略评估学习到的表示:少样本检测器和基于DeepSVDD的零样本检测器。在基准测试中,我们的方法在少样本设置下匹配或优于所有基线,并且在零样本设置下,与完全监督的分类器在分布内测试数据上具有竞争力,同时对未见过的LLMs具有更好的泛化能力。除了检测之外,学习到的表示还能泛化到未见过的任务,在作者验证和细粒度风格区分上取得竞争性表现,尽管从未针对这两个目标进行训练。

英文摘要

The rapid development of large language models (LLMs) has raised concerns about misuse such as plagiarism, misinformation, and automated influence operations, motivating the need for robust detectors. Recent work has shown that neural representations of writing style are effective for detection and, crucially, robust to adversarial attacks that defeat most existing detectors. However, current style-based detectors rely on authorship labels for training, and are limited to few-shot inference for detection, requiring in-distribution samples that may not always be available. We learn discriminative style features without authorship labels by training a style encoder to reconstruct human-authored text from its machine-generated paraphrase; freezing a semantic encoder during training biases the style encoder to capture only the non-semantic features needed for reconstruction. We evaluate the learned representations via two detection strategies: a few-shot detector and a zero-shot DeepSVDD-based detector. Across benchmarks, our method matches or outperforms all baselines in the few-shot setting and, in the zero-shot regime, is competitive with fully supervised classifiers on in-distribution test data while generalizing better to unseen LLMs. Beyond detection, the learned representations generalize to unseen tasks, achieving competitive performance on authorship verification and fine-grained style discrimination despite never being trained on either objective.

2606.10113 2026-06-10 cs.CL cs.AI 交叉投稿

Emotion Profiling in LLM-Based Literary Translation: Systematic Shifts Across MT and Post-Editing

基于LLM的文学翻译中的情感特征:机器翻译与译后编辑的系统性转变

Antonio Castaldo, Johanna Monti, Sheila Castilho

AI总结 研究LLM翻译的情感特征及译后编辑如何使其接近人类翻译,通过对比《Oryx and Crake》的LLM翻译、译后编辑版本和人类翻译,发现MT系统引入特定情感指纹,削弱作者声音。

详情
AI中文摘要

本文研究LLM翻译是否表现出可识别的情感特征,以及译后编辑如何将其重塑为更接近人类的标准。我们比较了玛格丽特·阿特伍德《Oryx and Crake》的LLM翻译及其译后编辑版本和人类翻译,以当代意大利科幻小说的大规模语料库为基线。通过基于词典和多语言建模的方法,我们对不同系统的情感变化进行了细粒度分析。我们发现,机器翻译系统在翻译中引入了特定模型且统计显著的情感指纹,导致作者声音的保留有限。

英文摘要

This paper investigates whether LLM translations exhibit identifiable emotional profiles and how post-editing reshapes them toward human-like norms. We compare LLM translations of Margaret Atwood's Oryx and Crake with their post-edited versions and a human translation, using a large-scale corpus of contemporary Italian science-fiction as a baseline. We examine emotion through lexicon-based and multilingual modeling, conducting a fine-grained analysis of emotional variation across systems. We find that MT systems introduce model-specific and statistically significant emotional fingerprints across translations, leading to a limited preservation of an author's voice.

2606.10126 2026-06-10 cs.CL cs.AI cs.CY cs.LG 交叉投稿

Pareto-Guided Teacher Alignment for Fair Personalized Text Generation

帕累托引导的教师对齐用于公平个性化文本生成

Tunazzina Islam

发表机构 * Purdue University(普渡大学)

AI总结 提出帕累托引导的教师对齐框架,通过修订候选生成、对感知可行性门控、帕累托候选选择和偏好优化,在减少人口统计差异的同时保持个性化保真度,实验表明公平缓解效果依赖于目标且跨域迁移不一致。

详情
AI中文摘要

个性化说服性文本生成可以提高相关性和参与度,但人口统计条件也可能引入跨群体的不平等框架。我们将个性化生成中的公平缓解研究为一个受约束的多目标对齐问题:在保持个性化保真度的同时减少人口统计差异。我们提出一个帕累托引导的教师对齐框架,结合了基于修订的候选生成、对感知可行性门控、帕累托风格的候选选择,以及通过监督微调和直接偏好优化的可选偏好优化。我们在气候变化和疫苗接种说服任务上评估该框架,使用一个受控的上下文丰富的人口统计网格(匹配性别和年龄对)以及一个统一的五审计评估套件,涵盖说服偏见、形式差异、情感框架差异、词汇关联差异和个性化保真度。在两个领域和跨族系迁移设置中,没有单一的对齐策略能同时主导所有目标。相反,方法占据了公平-个性化帕累托前沿的不同区域:一些方法实现更强的差异减少,而另一些则更好地保持个性化或人口统计稳定性。我们的结果表明,公平缓解效果依赖于目标,并在领域和模型族系间不一致地迁移,这促使在公平敏感的个性化生成中采用有界回归、多审计模型选择而非单指标优化。

英文摘要

Personalized persuasive text generation can improve relevance and engagement, but demographic conditioning may also introduce unequal framing across groups. We study fairness mitigation in personalized generation as a constrained multi-objective alignment problem: reduce demographic disparities while preserving personalization fidelity. We propose a Pareto-guided teacher alignment framework that combines revision-based candidate generation, pair-aware feasibility gating, Pareto-style candidate selection, and optional preference optimization through supervised fine-tuning and direct preference optimization. We evaluate the framework on climate change and vaccination persuasion tasks using a controlled context-rich demographic grid with matched gender and age pairs and a unified five-audit evaluation suite spanning persuasion bias, formality disparity, emotional framing disparity, lexical association disparity, and personalization fidelity. Across both domains and cross-family transfer settings, no single alignment strategy dominates all objectives simultaneously. Instead, methods occupy different regions of a fairness-personalization Pareto frontier: some achieve stronger disparity reductions, while others better preserve personalization or demographic stability. Our results show that fairness mitigation effects are objective-dependent and transfer inconsistently across domains and model families, motivating bounded-regression, multi-audit model selection over single-metric optimization for fairness-sensitive personalized generation.

2606.10183 2026-06-10 cs.CV cs.AI cs.MM 交叉投稿

Making Time Editable in Video Diffusion Transformers

在视频扩散Transformer中实现时间可编辑性

Konstantin Kuklev, Viacheslav Vasilev, Alexander Kunitsyn, Andrei Ivaniuta, Denis Dimitrov

AI总结 提出一种时间控制方法,通过轻量级时间模块扩展预训练DiT,实现运动速度和时序结构的编辑,无需重新设计骨干网络。

详情
AI中文摘要

现代用于视频生成的扩散Transformer对时间进程的控制和时序动态的编辑能力有限。我们提出一种时间控制方法,通过显式时间编辑扩展预训练DiT,允许控制运动速度和时序结构,而无需重新设计骨干网络。其核心实现通过一个轻量级时间模块增强预训练模型,保留原始生成先验的同时扩展其可控动态范围。

英文摘要

Modern Diffusion Transformers for video generation provide limited control over the progression of time and the editing of temporal dynamics. We propose a temporal-control methodology that extends a pretrained DiT with explicit time editing, allowing control over motion speed and temporal structure without redesigning the backbone. Its core implementation augments the pretrained model with a lightweight temporal module, preserving the original generative prior while expanding its controllable dynamic range.

2606.10223 2026-06-10 cs.SD cs.AI cs.CV 交叉投稿

Dual-Branch Gated Fusion for Open-Set Audio Deepfake Source Tracing

双分支门控融合用于开放集音频深度伪造源追踪

Awais Khan, Kutub Uddin, Khalid Malik

AI总结 针对开放集音频深度伪造源追踪问题,提出双分支门控融合框架,结合XLSR-53和CORES描述符,通过输入条件门控自适应加权,实现域内高精度和域外鲁棒泛化。

详情
AI中文摘要

将合成语音归因于其原始系统仍然是一个开放挑战:闭集模型无法拒绝未见过的合成器并产生过度自信的预测。为了解决这个问题,我们提出了一个双分支门控融合框架,将XLSR-53与CORES配对,CORES是一个66维描述符,与之前仅使用线性滤波器组(LFB)的工作不同,它跨越倒谱、振荡、节奏、能量和频谱维度,以捕获互补的合成伪影。我们的分析表明,XLSR-53在域内(ID)保持判别性,而CORES在分布偏移(OOD)下稳定泛化,但由于SSL表示不平衡,它们的简单拼接失败。为了解决这个问题,一个输入条件门控在联合训练下自适应地加权每个分支,使用交叉熵、用于ID/OOD分离的能量边际损失和门控多样性项。在MLAAD基准上,我们的系统实现了97.6%的ID准确率、4.9%的EERc,并且相对于Interspeech 2025基线,FPR95相对降低了83.5%。

英文摘要

Attributing a synthetic utterance to its originating system remains an open challenge: closed-set models fail to reject unseen synthesizers and produce overconfident predictions. To address this, we propose a dual-branch gated fusion framework that pairs XLSR-53 with CORES, a 66-dimensional descriptor that, unlike prior Linear Filter Bank (LFB)-only work, spans cepstral, oscillatory, rhythmic, energy, and spectral dimensions to capture complementary synthesis artifacts. Our analysis shows XLSR-53 remains discriminative in-domain (ID) while CORES generalizes stably under distribution shift (OOD), yet their naive concatenation fails due to SSL representational imbalance. To resolve this, an input-conditioned gate adaptively weights each branch under joint training with cross-entropy, an energy margin loss for ID/OOD separation, and a gate diversity term. On the MLAAD benchmark, our system achieves 97.6\% ID accuracy, 4.9\% EERc, and an 83.5\% relative FPR95 reduction over the Interspeech 2025 baseline.

2606.10278 2026-06-10 cs.SD cs.AI 交叉投稿

Towards Robust Arabic Speech Emotion Recognition with Deep Learning

基于深度学习的鲁棒阿拉伯语音情感识别

Youcef Soufiane Gheffari, Samiya Silarbi

发表机构 * ADASCA Laboratory – Advanced Data Science and Cognitive Applications, Université des Sciences et de la Technologie d’Oran Mohamed Boudiaf (USTO-MB), Oran, Algeria(ADASCA实验室——高级数据科学与认知应用,奥兰穆罕默德·布迪夫科技大学(USTO-MB),阿尔及利亚奥兰)

AI总结 针对阿拉伯语音情感识别中方言多样、数据稀缺等问题,提出CNN-Transformer混合架构,在EYASE和BAVED数据集上达到98.1%准确率。

Comments 21 pages, 16 figures, 11 tables. Submitted manuscript

详情
AI中文摘要

语音情感识别(SER)旨在从音频信号中识别说话者的情感状态。尽管深度学习的最新进展显著提高了印欧语系语言的SER性能,但由于方言多样性、标注数据集有限以及难以同时建模局部频谱线索和长程时间依赖性,阿拉伯语SER仍然探索不足且具有挑战性。为解决这些限制,本研究探讨了联合建模空间和上下文信息的混合架构是否能改善阿拉伯语音的情感识别。我们提出并评估了一个包含三种架构的比较框架:CNN-LSTM模型、CNN-Transformer模型和微调的wav2vec 2.0模型。前两种模型利用MFCC和基于频谱图的表示,而wav2vec 2.0通过自监督表示直接对原始音频进行操作。在EYASE和BAVED数据集上进行的实验表明,所提出的CNN-Transformer架构显著优于其他模型,达到了98.1%的准确率。这一结果凸显了将卷积特征提取与基于Transformer的全局上下文建模相结合的有效性。本工作的主要贡献在于为阿拉伯语SER提供了混合方法和自监督方法的系统比较,并证明了CNN-Transformer架构在低资源和方言多样性环境中为捕捉频谱和长程依赖性提供了鲁棒解决方案。

英文摘要

Speech Emotion Recognition (SER) aims to identify a speaker's emotional state from audio signals. While recent advances in deep learning have significantly improved SER performance in Indo-European languages, Arabic SER remains underexplored and challenging due to dialectal diversity, limited annotated datasets, and the difficulty of modeling both local spectral cues and long-range temporal dependencies. To address these limitations, this study investigates whether hybrid architectures that jointly model spatial and contextual information can improve emotion recognition in Arabic speech. We propose and evaluate a comparative framework involving three architectures: a CNN-LSTM model, a CNN-Transformer model, and a fine-tuned wav2vec 2.0 model. The first two models leverage MFCC and spectrogram-based representations, while wav2vec 2.0 operates directly on raw audio through self-supervised representations. Experiments conducted on the EYASE and BAVED datasets demonstrate that the proposed CNN-Transformer architecture significantly outperforms the other models, achieving an accuracy of 98.1 percent. This result highlights the effectiveness of combining convolutional feature extraction with Transformer-based global context modeling. The main contribution of this work lies in providing a systematic comparison of hybrid and self-supervised approaches for Arabic SER, and in demonstrating that CNN-Transformer architectures offer a robust solution for capturing both spectral and long-range dependencies in low-resource and dialectally diverse settings.

2606.10296 2026-06-10 cs.CL cs.AI 交叉投稿

The Confident Liar: Diagnosing Multi-Agent Debate with Log-Probabilities and LLM-as-Judge

自信的撒谎者:利用对数概率和LLM作为评判诊断多智能体辩论

Ali Keramati, Justin Cheok, Jacob Horne, Mark Warschauer

发表机构 * University of California, Irvine(加州大学伊文斯分校)

AI总结 研究多智能体辩论中令牌级对数概率、LLM评判分数与任务准确性的关系,发现信心与推理质量在构造者上关联更强,且信心可检测关键推理失败。

Comments 15 pages, 7 figures, 1 table, ACL proceedings

详情
AI中文摘要

多智能体辩论系统通常仅根据最终答案是否正确来评估,忽略了辩论旨在产生的中间推理的质量。本文研究了多智能体辩论中三种信号之间的关系:推理令牌上的令牌级对数概率分布、分配给这些令牌的LLM作为评判的评分标准分数以及最终任务准确性。我们考察了内部信心信号是否预测外部评估的推理质量,以及任一信号是否与任务正确性一致,涵盖三个领域:基于评分标准的评分、数学推理和事实问答。我们的框架将双智能体辩论架构——一个构造者(Constructor)和一个审计者(Auditor)——与一个LLM作为评判配对,该评判根据指令遵循、理由质量和证据基础对每个智能体的推理进行评分,并附带一个关键失败标志。在评分标准评分领域的实验揭示了一致的四阶段信心轨迹和显著的角色不对称性:构造者的信心与评判推理质量的相关性大约是审计者的两倍,并且基于信心的关键推理失败检测对构造者(AUROC 0.804)明显比审计者(0.634)更可靠。这些发现推动了本文提出的更广泛的跨领域研究。

英文摘要

Multi-agent debate systems are typically evaluated only on whether the final answer is correct, overlooking the quality of the intermediate reasoning that debate is designed to produce. This paper studies the relationship between three signals in multi-agent debate: token-level log-probability distributions over reasoning tokens, LLM-as-judge rubric scores assigned to those tokens, and final task accuracy. We examine whether internal confidence signals predict externally evaluated reasoning quality, and whether either signal aligns with task correctness, across three domains: rubric-based scoring, mathematical reasoning, and factual question answering. Our framework pairs a two-agent debate architecture -- a Constructor and an Auditor -- with an LLM-as-judge that scores each agent's reasoning along instruction following, justification quality, and evidence grounding, together with a critical-failure flag. Experiments in the rubric-scoring domain reveal a consistent four-phase confidence trajectory and a substantial role asymmetry: confidence aligns with judged reasoning quality roughly twice as strongly for the Constructor as for the Auditor, and confidence-based detection of critical reasoning failures is markedly more reliable for the Constructor (AUROC 0.804) than for the Auditor (0.634). These findings motivate the broader cross-domain investigation proposed in this paper.

2606.10368 2026-06-10 cs.SD cs.AI 交叉投稿

Speech Meets ELF: Audio Conditional Continuous-Target Diffusion for Speech Recognition and Translation

语音遇见ELF:用于语音识别和翻译的音频条件连续目标扩散

Xuanchen Li, Tianrui Wang, Yuheng Lu, Zikang Huang, Yu Jiang, Chenghan Lin, Chenrui Cui, Ziyang Ma, Xingyu Ma, Chunyu Qiang, Guochen Yu, Xie Chen, Longbiao Wang, Jianwu Dang

发表机构 * Tianjin University(天津大学) Shanghai Jiao Tong University(上海交通大学) Nankai University(南开大学)

AI总结 提出ELF-S2T,一种基于预训练ELF骨干的音频条件连续目标生成模型,通过音频强制训练和分类器自由引导,在LibriSpeech和CoVoST2上实现竞争性ASR和S2TT性能,并揭示识别与翻译错误均源于连续潜空间中的近距离混淆。

详情
AI中文摘要

语音到文本(S2T)系统用于识别(ASR)和翻译(S2TT)通常生成离散文本标记。相比之下,连续目标语言建模在连续空间中执行生成,但其在S2T中的潜力尚未被探索。为填补这一空白,我们提出了ELF-S2T,一种用于S2T的音频条件连续目标生成模型。基于预训练的嵌入式语言流(ELF)骨干,ELF-S2T通过冻结的Whisper编码器和单个线性投影器处理语音,将得到的音频条件前置到噪声文本潜变量前,用于上下文流匹配去噪。为防止模型过度依赖其预训练的文本上下文,我们在训练中引入音频强制,并在推理时通过分类器自由引导进一步放大音频条件。在LibriSpeech和CoVoST2上的实验表明,ELF-S2T实现了具有竞争力的ASR和S2TT性能。关键的是,我们的错误分析揭示,尽管ASR和S2TT错误表面上看起来非常不同,但两者都源于同一根本原因:连续潜空间中的近距离混淆。这一发现自然与连续表示生成范式一致,表明识别和翻译之下存在共同的语义映射过程。我们的代码和预训练模型在此https URL公开提供。

英文摘要

Speech-to-text (S2T) systems for recognition (ASR) and translation (S2TT) typically generate discrete text tokens. In contrast, continuous-target language modelling performs generation in a continuous space, yet its potential for S2T remains unexplored. To bridge this gap, we propose ELF-S2T, an audio-conditioned continuous-target generative model for S2T. Built upon the pre-trained Embedded Language Flows (ELF) backbone, ELF-S2T processes speech via a frozen Whisper encoder and a single linear projector, prepending the resulting audio condition to the noisy text latent for in-context, flow-matching denoising. To prevent the model from over-relying on its pre-trained text context, we introduce audio forcing during training, and further amplify the audio condition via classifier-free guidance at inference. Experiments on LibriSpeech and CoVoST2 show that ELF-S2T achieves competitive ASR and S2TT performance. Crucially, our error analysis reveals that, although ASR and S2TT errors look very different on the surface, both stem from the same underlying cause, a close distance confusion in the continuous latent space. This finding naturally aligns with the continuous representation generation paradigm, indicating a common semantic mapping process beneath recognition and translation. Our code and pretrained models are publicly available at https://github.com/Sslnon/ELF-S2T.

2606.10471 2026-06-10 cs.CL cs.AI 交叉投稿

Detecting Speculative Language in Biomedical Texts using Recurrent Neural Tensor Networks

使用递归神经张量网络检测生物医学文本中的推测性语言

Dhruv Dixit

发表机构 * Stevens Institute of Technology(史蒂文斯理工学院)

AI总结 利用分布式句子表示和深度学习技术,提出递归神经张量网络(RNTN)用于自动检测生物医学文献中的推测性语言,性能略优于线性双元SVM(F1=0.885 vs 0.881)。

Comments 12 Pages

详情
AI中文摘要

在本研究中,我们通过利用分布式句子表示和先进的深度学习技术,深入探讨了生物医学文章中推测性语言的自动检测。这种识别的意义延伸至信息检索、多文档摘要以及新知识的探索。我们的探索涵盖了两种获取分布式句子表示的不同方法:段落向量模型和递归神经张量网络。然后,将这些方法与三种基础基线算法进行严格比较:支持向量机、朴素贝叶斯和模式匹配。我们的发现表明,递归神经张量网络(RNTN)的性能(F1=0.885)略优于表现最佳的基线线性双元SVM(F1=0.881)。同时,段落向量模型即使在使用大规模未标记数据集进行广泛训练后,效果也较差(F1=0.368)。我们对影响这些性能差异的因素进行了全面讨论,并为未来的研究方向提供了有见地的建议。

英文摘要

In this investigation, we delve into the automated detection of speculative language within biomedical articles by utilizing distributed sentence representations and advanced deep learning techniques. The implications of such identification extend to information retrieval, multi-document summarization, and the exploration of new knowledge. Our exploration encompasses two distinct approaches for acquiring distributed sentence representations: the Paragraph Vector model and the Recursive Neural Tensor Network. These methodologies are then rigorously compared against three foundational baseline algorithms: Support Vector Machines, Naive Bayes, and pattern matching. Our findings reveal that the Recursive Neural Tensor Network (RNTN) demonstrates a slight performance edge (F1 = 0.885) over the top-performing baseline, the linear bigram SVM (F1 = 0.881). Meanwhile, the Paragraph Vector model proves less effective (F1 = 0.368), even after extensive training using an expansive, unlabeled dataset. We engage in a comprehensive discourse on the factors influencing these performance disparities and provide insightful recommendations for future research directions.

2606.10738 2026-06-10 eess.AS cs.AI 交叉投稿

Spatial-Omni: Spatial Audio Understanding Integration in Multimodal LLMs via FOA Encoding

Spatial-Omni:通过FOA编码在多模态大语言模型中实现空间音频理解

Zhiyuan Zhu, Yixuan Chen, Yiwen Shao, Wenxiang Guo, Changhao Pan, Yu Zhang, Yuxiang Wang, Wei Liu, Houhua Zhang, Chengkuan Zeng, Wenbo Cheng, Yunxi Liu, Rui Yang, Steve Yves, Liefeng Bo, Zhou Zhao

发表机构 * Zhejiang University(浙江大学) Tencent Hunyuan(腾讯文心)

AI总结 提出Spatial-Omni,通过SO-Encoder将一阶Ambisonics空间音频注入现有全模态大语言模型,以轻量方式实现空间音频理解,并在构建的SO-Bench基准上超越现有模型。

详情
AI中文摘要

最近的多模态大语言模型主要将音频处理为单声道信号,从而丢弃了空间音频中包含的空间线索,这些线索用于声音定位、空间关系推理和空间场景理解。我们提出Spatial-Omni,一种轻量级方法,通过实现SO-Encoder将一阶Ambisonics(FOA)空间音频作为独立模态注入现有的全模态大语言模型,而无需修改其原始音频编码器。SO-Encoder以有限的额外上下文成本提供空间标记,并通过高效的分阶段训练提升空间音频理解。为支持训练和评估,我们从开源数据、真实录音和仿真中构建了SO-Dataset、SO-QA和SO-Bench,包含40万条FOA空间音频片段和210万个空间问答对。SO-Bench涵盖16个空间音频理解子任务,包括基本检测和位置估计、空间关系理解以及复杂空间推理。实验表明,Spatial-Omni在空间音频理解任务上优于现有的开源大型音频语言模型(LALM)和全模态大语言模型,同时保持合理的通用音频理解水平。代码和数据见:https://this https URL。

英文摘要

Recent multimodal large language models mainly process audio as monaural signals, thereby discarding the spatial cues contained in spatial audio for sound localization, spatial relation reasoning, and spatial scene understanding. We propose Spatial-Omni, a lightweight method that implements SO-Encoder to inject First-Order Ambisonics (FOA) spatial audio into existing Omni LLMs as an independent modality, without modifying their original audio encoders. SO-Encoder provides spatial tokens with limited additional context cost and improves spatial audio understanding through efficient staged training. To support training and evaluation, we construct SO-Dataset, SO-QA, and SO-Bench from open-source data, real recordings, and simulations, containing 400K FOA spatial audio clips and 2.1M spatial question answering pairs. SO-Bench covers 16 spatial audio understanding subtasks, including basic detection and location estimation, spatial relation understanding, and complex spatial reasoning. Experiments show that Spatial-Omni outperforms existing open-source Large Audio-Language Models (LALMs) and Omni LLM models on spatial audio understanding tasks while retaining a reasonable level of general audio understanding. Code and data are available at https://github.com/dieKarotte/Spatial-Omni.

2606.10796 2026-06-10 cs.CL cs.AI 交叉投稿

Dep-LLM: Training-Free Depression Diagnosis via Evidence-Guided Structured Multi-factor with Reliable LLM Reasoning

Dep-LLM:基于证据引导的结构化多因素与可靠LLM推理的无训练抑郁症诊断

Yiqing Lyu, Xianbing Zhao, Buzhou Tang, Ronghuan Jiang

发表机构 * School of Computer Science and Technology, Harbin Institute of Technology, Shenzhen, Guangdong, China(哈尔滨工业大学(深圳)计算机科学与技术学院) School of Artificial Intelligence and Computer Science, Jiangnan University, Wuxi, China(江南大学人工智能与计算机学院) Guangdong Provincial Key Laboratory of Intelligent Information Processing(广东省智能信息处理重点实验室) Pengcheng Laboratory(鹏城实验室) Chinese People’s Liberation Army General Hospital, Beijing, China(中国人民解放军总医院)

AI总结 提出无训练框架Dep-LLM,通过思维链多因素分析、置信度调制和协作预测,在冻结LLM上实现抑郁症诊断,超越零样本和微调方法。

详情
AI中文摘要

从临床访谈中进行自动抑郁症检测(ADD)是计算心理健康领域的关键任务,但由于两个关键障碍仍然具有挑战性:1)在冗长、多主题的临床访谈中建模复杂但稀疏分布的抑郁线索困难,导致推理肤浅且不可靠;2)由于临床隐私导致标记数据稀缺,加上训练和微调的高成本,限制了监督式ADD系统的部署。为了共同应对这些挑战,我们提出了Dep-LLM,一个无训练框架,它模仿临床精神科医生的逐步推理,并完全在冻结的现成基础LLM上运行。Dep-LLM包含三个阶段。首先,思维链(CoT)抑郁症多因素分析模块将长对话结构性地分解为五个临床对齐的主题,并产生基于证据的推理,有效处理长上下文依赖。其次,我们引入了置信度分析与调制模块,该模块从每个推理的token级熵中量化认知可靠性,并应用标签内和主题间调制,在不进行额外训练的情况下放大可信信号同时抑制不确定信号。第三,协作多因素预测模块动态整合由置信度加权的多因素信号,形成最终诊断。在DAIC-WOZ和E-DAIC数据集上的大量实验证明了Dep-LLM的有效性和泛化性:它在几乎所有21个基础LLM上,在准确率、宏F1和加权平均F1等9个指标上超越了零样本基线,并进一步优于最先进的监督式领域特定LLM以及最新的闭源商业LLM,同时无需额外训练。

英文摘要

Automatic Depression Detection (ADD) from clinical interviews is a pivotal task in computational mental health, yet it remains challenging due to two critical obstacles: 1) difficulty in modeling complex but sparsely distributed depression clues within lengthy, multi-topic clinical interviews, leading to superficial and unreliable reasoning; 2) scarcity of labeled data due to clinical privacy, together with high cost of training and fine-tuning, limiting the deployment of supervised ADD systems. To jointly address these challenges, we propose Dep-LLM, a training-free framework that mirrors the step-by-step reasoning of clinical psychiatrists and operates entirely on frozen off-the-shelf foundation LLMs. Dep-LLM comprises three stages. First, a Chain-of-Thought (CoT) Depression Multi-factor Analysis module structurally decomposes the long dialogue into five clinically aligned themes and produces evidence-grounded rationales, effectively handling long-context dependencies. Second, we introduce Confidence Analysis and Modulation module that quantifies the epistemic reliability from token-level entropy of each rationale and applies an intra-label and inter-theme modulation that amplifies trustworthy signals while suppressing uncertain ones without extra training. Third, a Collaborative Multi-factor Prediction module dynamically integrates multi-factor signals weighted by confidence into the final diagnosis. Extensive experiments on the DAIC-WOZ and E-DAIC datasets demonstrate the effectiveness and generalizability of Dep-LLM: it surpasses zero-shot baseline on nearly all 21 foundation LLMs across 9 metrics such as accuracy, macro F1 and weighted-average F1, and further outperforms state-of-the-art supervised domain-specific LLMs as well as the latest closed-source commercial LLMs, while requiring no extra training.

2606.10819 2026-06-10 cs.CV cs.AI 交叉投稿

Earth-OneVision: Extending Remote Sensing Multimodal Large Language Models to More Sensor Modalities and Tasks

Earth-OneVision:将遥感多模态大语言模型扩展到更多传感器模态和任务

Miaoxin Cai, Guanqun Wang, Wei Zhang, Guangyao Zhou, Yin Zhuang, Tong Zhang, Hao Wang, He Chen, Jun Li

发表机构 * National Key Laboratory of Science and Technology on Space-Born Intelligent Information Processing (SBIIP), Beijing Institute of Technology(北京理工大学空间智能信息处理国家重点实验室) Aerospace Information Research Institute, Chinese Academy of Sciences(中国科学院空天信息创新研究院) Key Laboratory of Technology in Geo-Spatial Information Processing and Application System, Chinese Academy of Sciences(中国科学院地理空间信息处理与应用系统技术重点实验室) Advanced Research Institute of Multidisciplinary Sciences, Beijing Institute of Technology(北京理工大学前沿交叉科学研究院) School of Mechatronical Engineering, Beijing Institute of Technology(北京理工大学机电学院) School of Earth and Space Sciences, Peking University(北京大学地球与空间科学学院) School of Electronics, Peking University(北京大学电子学院) School of Computer Science and Hubei Key Laboratory of Intelligent Geo-Information Processing(华中科技大学计算机科学与技术学院&湖北省智能地理信息处理重点实验室)

AI总结 提出Earth-OneVision,一个2B参数的RS-MLLM,通过全粒度视觉语言对齐、空间语言同构序列化和渐进式跨模态适应机制,统一六种传感器模态和九类任务,在多个基准上达到或超越4B-72B模型。

详情
AI中文摘要

RS-MLLM能够对地球观测图像进行自然语言理解和空间推理。然而,现有模型仅支持狭窄的传感器类型和任务范围,导致对地球的碎片化视角,并使得跨模态地球科学知识在很大程度上未被利用。本文提出了Earth-OneVision,一个2B参数的RS-MLLM,它在单一自回归框架内统一了六种传感器模态(即光学、SAR、红外、多光谱、时序和视频)以及跨传感器融合,涵盖9个任务类别。三种专用机制解决了三个瓶颈。全粒度视觉语言对齐(FGVLA)将多级视觉特征与多维语言空间对齐。空间语言同构序列化(SLIS)将异构空间输出统一为自回归令牌。渐进式跨模态适应(PCMA)将复合领域差距分解为连续阶段,依次解决视角和成像物理差距。为了支持联合训练,构建了MMRS-OneVision,包含约340万QA对,涵盖所有六种传感器模态和9个任务类别的跨传感器融合,大大超过了现有的遥感多模态指令数据集。仅用2B参数,Earth-OneVision在广泛基准上取得了具有竞争力或最先进的结果,持续匹配或超越4B-72B的RS-MLLM。它在光学视觉定位的OPT-RSVG测试集上达到87.52%的P@0.5,在SAR VQA基准SARLANG-Bench上达到80.68%,超过7B模型7%以上。它还在多光谱分类的BigEarthNet-MS测试集上达到75.74%的召回率,在跨模态推理的EarthMind-Bench上达到81.94%的MCQ准确率。

英文摘要

RS-MLLMs enable natural-language understanding and spatial reasoning over earth observation imagery. However, existing models support only a narrow range of sensor types and tasks, yielding a fragmented view of the earth and leaving cross-modal geoscientific knowledge largely unexploited. This work presents Earth-OneVision, a 2B RS-MLLM that unifies six sensor modalities (i.e., optical, SAR, infrared, multispectral, temporal, and video) and cross-sensor fusion across 9 task categories within a single autoregressive framework. Three dedicated mechanisms address three bottlenecks. Full-Granularity Vision-Language Alignment (FGVLA) aligns multi-level visual features with the multi-dimensional language space. Spatial-Linguistic Isomorphic Serialization (SLIS) unifies heterogeneous spatial outputs as autoregressive tokens. Progressive Cross-Modality Adaptation (PCMA) decomposes the compound domain gap into sequential stages, tackling the viewpoint and imaging physics gaps in turn. To support joint training, MMRS-OneVision is constructed with ~34M QA pairs spanning all six sensor modalities and cross-sensor fusion across 9 task categories, substantially exceeding existing RS multimodal instruction datasets. With only 2B parameters, Earth-OneVision achieves competitive or state-of-the-art results across extensive benchmarks, consistently matching or outperforming 4B-72B RS-MLLMs. It achieves 87.52% P@0.5 on the OPT-RSVG testset for optical visual grounding and 80.68% on the SAR VQA benchmark SARLANG-Bench, exceeding 7B models by over 7%. It further achieves 75.74% recall on the BigEarthNet-MS testset for multispectral classification, and 81.94% MCQ accuracy on EarthMind-Bench for cross-modality reasoning.

2606.10829 2026-06-10 cs.CL cs.AI 交叉投稿

Attention-Discounted Adaptive Sampler for Masked Diffusion Language Models

注意力折扣自适应采样器用于掩码扩散语言模型

Yusuf Sahin, Ahmed Rockey Saikia, Volkan Cevher, Paolo Favaro

发表机构 * University of Bern(伯尔尼大学) EPFL(瑞士联邦理工学院洛桑分校)

AI总结 针对掩码扩散语言模型并行解码中候选词交互导致的不安全问题,提出训练无关的重排序规则ADAS,通过注意力折扣软惩罚改进子集构建,在多个基准上提升低NFE性能。

详情
AI中文摘要

掩码扩散语言模型可以通过每次去噪迭代揭示多个令牌来减少推理步骤,但这种并行性很脆弱:当预测相互耦合时,单独置信的位置同时提交可能不安全。现有的免训练采样器如Top-\(k\)、Fast-dLLM和EB-Sampler主要控制揭示多少令牌,而通常通过忽略选定集内交互的逐令牌分数对候选进行排序。我们提出ADAS,一种用于并行掩码扩散解码的免训练重排序规则。ADAS保持基础采样器的停止规则不变,仅修改子集构建:当候选者强烈关注预测仍不确定的已选位置时,它贪婪地折扣该候选者。与将注意力转化为硬兼容性约束的图约束方法不同,ADAS保持注意力连续并将其用作软边际惩罚。在GSM8K、MATH500、HumanEval和MBPP上,针对LLaDA-8B-Base和Dream-7B-Base,将ADAS插入Top-\(k\)、Fast-dLLM和EB-Sampler中,在匹配去噪器评估下,低NFE性能平均分别提高9.11和10.46个百分点,每次前向运行时开销为3.1%。这些结果表明,软注意力折扣重排序是一种简单且模块化的方法,可提高掩码扩散语言模型高度并行解码的质量。

英文摘要

Masked diffusion language models can reduce inference steps by revealing multiple tokens per denoising iteration, but this parallelism is fragile: positions that are individually confident may be unsafe to commit together when their predictions are coupled. Existing training-free samplers such as Top-\(k\), Fast-dLLM, and EB-Sampler mainly control how many tokens to reveal, while often ranking candidates by token-wise scores that ignore interactions within the selected set. We propose ADAS, a training-free reranking rule for parallel masked diffusion decoding. ADAS leaves the base sampler's stopping rule unchanged and modifies only subset construction: it greedily discounts a candidate when it attends strongly to already selected positions whose predictions remain uncertain. Unlike graph-constrained methods that turn attention into hard compatibility constraints, ADAS keeps attention continuous and uses it as a soft marginal penalty. Across LLaDA-8B-Base and Dream-7B-Base on GSM8K, MATH500, HumanEval, and MBPP, plugging ADAS into Top-\(k\), Fast-dLLM, and EB-Sampler improves low-NFE performance at matched denoiser evaluations by \(9.11\) and \(10.46\) percentage points on average, respectively, with \(3.1\%\) per-forward runtime overhead. These results show that soft attention-discounted reranking is a simple and modular way to improve quality in highly parallel decoding for masked diffusion language models.

2606.10892 2026-06-10 cs.CV cs.AI 交叉投稿

Improving Text-Instance Alignment Of Foreground Conditioned Out-Painting Via Customized Concept Embedding

通过定制化概念嵌入改进前景条件外绘中的文本-实例对齐

Yihao Zhao, Xuan Han, Bin He, Mingyu You

AI总结 针对前景条件外绘中文本驱动方法产生的伪影问题,提出定制化概念嵌入扩散框架,通过实例感知损失和语义保持提示模板定制概念嵌入,显著减少伪影并提升图像质量。

详情
AI中文摘要

为了展示产品,商家通常需要花费大量成本制作高质量的展示图像。前景条件外绘(FCO)满足了这一需求,允许用户通过调整文本提示,以低成本为前景实例创建所需的背景。然而,现有的文本驱动FCO方法在其输出中存在关键缺陷,最明显的是伪影,即合成背景中与前景实例共享相同语义的区域。这种伪影降低了物体的显著性并降低了图像质量。我们将问题归因于给定实例与文本派生概念嵌入之间的不对齐。为了解决这个问题,我们提出了定制化概念嵌入扩散(CCE-Diffusion)框架。其核心是CCE模块,用于定制概念嵌入,弥合通用名词语义与特定视觉实例之间的差距。实例感知损失指导模块的优化,而语义保持提示模板防止定制化嵌入扭曲提示中的其他词。定性和定量评估均表明,CCE-Diffusion显著减少了输出中的伪影。作为即插即用组件,CCE模块可以集成到各种FCO方法中,提升其性能。

英文摘要

To showcase products, merchants often incur substantial costs creating high-quality display images. Foreground Conditioned Outpainting (FCO) meets this demand, allowing users to create desired backgrounds for foreground instances at a low cost by adjusting the text prompt. However, existing text-driven FCO methods exhibit critical flaws in their outputs, most notably the presence of artifacts, which refer to regions in the synthesized background that share the same semantics as the foreground instance. Such artifacts diminish the object's prominence and degrade image quality. We attribute the issue to the misalignment between the given instance and text-derived concept embeddings. To address this, we propose the Customized Concept Embedding Diffusion (CCE-Diffusion) framework. Its core is a CCE-Module to customize concept embeddings, bridging the gap between generic noun semantics and a specific visual instance. An Instance-Aware Loss guides the module's optimization, while a Semantic-Preserving Prompt Template prevents customized embeddings from distorting other words in the prompt. Both qualitative and quantitative evaluations demonstrate that CCE-Diffusion significantly reduces artifacts in the outputs. As a plug-and-play component, the CCE-Module can integrate with various FCO methods, enhancing their performance.

2606.10902 2026-06-10 cs.CV cs.AI 交叉投稿

Pose-ICL: 3D-Aware In-Context Learning for Pose-Controllable Subject Customization

Pose-ICL:面向姿态可控主体定制的3D感知上下文学习

Xuan Han, Yihao Zhao, Mingyu You

AI总结 提出Pose-ICL框架,通过3D感知上下文学习和表面锚定位置嵌入(SAPE)实现无调优的姿态可控主体定制,显著提升姿态准确性和身份一致性。

详情
AI中文摘要

主体定制是现代图像生成中的基础任务。通过提供少量参考图像和文本提示,用户可以生成特定对象在任意期望场景中的图像。然而,现有方法在实现定制主体的有效姿态控制方面仍存在困难。在实践中,它们常常表现出不准确的姿态或不一致的跨姿态外观。这些局限性表明,对于2D原生骨干网络而言,以体积方式理解对象仍然是一个重大挑战。为了应对这一挑战,我们提出了Pose-ICL,这是一个无需调优的框架,利用3D感知上下文学习(ICL)通过多个配对的图像-姿态参考直接适应新主体。其核心机制——表面锚定位置嵌入(SAPE)——通过将图像令牌锚定到体积边界框的表面坐标,赋予模型显式的3D感知能力。专门的优化确保了其与现有DiT模型的无缝兼容性。在3D资产和真实世界主体上的广泛评估表明,Pose-ICL在姿态准确性和身份一致性方面均显著优于当前方法。

英文摘要

Subject Customization is a foundational task in modern image generation. By providing a few reference images and a text prompt, users can generate images of a specific object in any desired scene. However, existing methods still struggle to achieve effective pose control for customized subjects. In practice, they often exhibit inaccurate poses or inconsistent cross-pose appearances. These limitations suggest that understanding objects in a volumetric manner remains a significant challenge for 2D-native backbones. To address this challenge, we propose Pose-ICL, a tuning-free framework that leverages 3D-aware In-Context Learning (ICL) to directly adapt to new subjects through multiple paired image-pose references. Its core mechanism,Surface-Anchored Position Embedding (SAPE), equips the model with explicit 3D awareness by anchoring image tokens to the surface coordinates of a volumetric bounding box. Dedicated refinements ensure its seamless compatibility with existing DiT models. Extensive evaluations on both 3D assets and real-world subjects demonstrate that Pose-ICL significantly outperforms current methods in both pose accuracy and identity consistency.

2606.11033 2026-06-10 cs.LG cs.AI cs.CL 交叉投稿

AuRA: Internalizing Audio Understanding into LLMs as LoRA

AuRA: 将音频理解内化到LLM中作为LoRA

Bo Cheng, Lei Shi, Zhanyu Ma, Yuan Wu, Jun Xu, Jiuchong Gao, Jinghua Hao, Renqing He

发表机构 * Meituan(美团) Jilin University(吉林大学)

AI总结 提出AuRA方法,通过层间蒸馏将ASR编码器的语音表示内化到LoRA适配的LLM中,实现紧耦合的语音-语言联合建模和高效并行端到端推理,在多个基准上优于级联系统和现有适应方法。

详情
AI中文摘要

最近将大语言模型(LLM)扩展到语音输入的努力通常依赖于级联的ASR-LLM流水线、端到端语音-语言模型或基于桥接/蒸馏的适应方法。虽然这些路线分别重用了强大的预训练组件、实现了原生语音-语言交互或提供了轻量级适应,但它们常常遭受转录-接口延迟、昂贵的多模态训练或顺序语音-语言耦合的问题。为了解决这些限制,我们提出了AuRA,一种将音频编码能力蒸馏到LLM中的方法。具体来说,AuRA通过一个轻量级音频嵌入层将相同的语音输入馈送到ASR编码器(作为教师)和LoRA适配的LLM(作为学生),并使用逐层蒸馏将学生的隐藏状态与相应的教师表示对齐,从而将语音表示内化到轻量级的LLM侧适应中。与级联和串行桥接方法相比,AuRA实现了更紧密的语音-语言联合建模和高效的并行端到端推理,同时重用了预训练的语音和语言模型,而不需要大规模的多模态训练。在多个语音-语言基准上,AuRA在有效性和效率方面始终优于级联系统、语音到LLM适应基线以及大规模语音-语言和多模态模型。

英文摘要

Recent efforts to extend large language models (LLMs) to speech inputs typically rely on cascaded ASR-LLM pipelines, end-to-end speech-language models, or bridge/distillation-based adaptation. While these routes respectively reuse strong pretrained components, enable native speech-language interaction, or offer lightweight adaptation, they often suffer from transcript-interface latency, costly multimodal training, or sequential speech-language coupling. To address these limitations, we present AuRA, a method that distills audio encoding capability into the LLM. Specifically, AuRA feeds the same speech input to an ASR encoder (as a teacher) and a LoRA-adapted LLM (as a student) through a lightweight audio embedding layer, and uses layer-wise distillation to align the student's hidden states with corresponding teacher representations, thereby internalizing speech representations into lightweight LLM-side adaptations. Compared with cascaded and serial bridge methods, AuRA enables tighter speech-language joint modeling and efficient parallel end-to-end inference, while also reusing pretrained speech and language models rather than requiring large-scale multimodal training. On multiple speech-language benchmarks, AuRA consistently outperforms cascaded systems, speech-to-LLM adaptation baselines, and large-scale speech-language and multimodal models in both effectiveness and efficiency.

2510.04514 2026-06-10 cs.AI cs.CE cs.CL cs.CV stat.ME 版本更新

ChartAgent: A Multimodal Agent for Visually Grounded Reasoning in Complex Chart Question Answering

ChartAgent: 一种用于复杂图表问答中视觉基础推理的多模态智能体

Rachneet Kaur, Nishan Srishankar, Zhen Zeng, Sumitra Ganesh, Manuela Veloso

发表机构 * J.P. Morgan AI Research(摩根大通人工智能研究)

AI总结 提出ChartAgent框架,通过迭代分解查询为视觉子任务并利用图表专用视觉工具(如绘制注释、裁剪区域)进行空间域推理,在ChartBench和ChartX上取得最先进性能,尤其对无标注图表提升显著。

Comments Accepted at ACL 2026 (Main Conference). Also presented as an oral paper at the NeurIPS 2025 Multimodal Algorithmic Reasoning Workshop (https://marworkshop.github.io/neurips25/)

详情
AI中文摘要

最近的多模态大语言模型在基于图表的视觉问答中显示出潜力,但在无标注图表上——即那些需要精确视觉解释而非依赖文本捷径的图表——其性能急剧下降。为了解决这个问题,我们引入了ChartAgent,一种新颖的智能体框架,它直接在图表的空间域内显式执行视觉推理。与文本思维链推理不同,ChartAgent通过专门的行动(如绘制注释、裁剪区域(例如分割饼图切片、隔离条形图)和定位坐标轴)迭代地将查询分解为视觉子任务,并主动操作和交互图表图像,使用图表专用视觉工具库来完成每个子任务。这种迭代推理过程密切模仿了人类理解图表的认知策略。ChartAgent在ChartBench和ChartX基准测试上达到了最先进的准确率,整体上比先前方法绝对提升高达16.07%,在无标注、数值密集的查询上提升17.31%。此外,我们的分析表明,ChartAgent (a) 在多种图表类型上有效,(b) 在不同视觉和推理复杂度水平上均取得最高分数,(c) 作为一个即插即用的框架,提升了多种基础LLM的性能。我们的工作是首批使用工具增强的多模态智能体展示图表理解中视觉基础推理的工作之一。

英文摘要

Recent multimodal LLMs have shown promise in chart-based visual question answering, but their performance declines sharply on unannotated charts-those requiring precise visual interpretation rather than relying on textual shortcuts. To address this, we introduce ChartAgent, a novel agentic framework that explicitly performs visual reasoning directly within the chart's spatial domain. Unlike textual chain-of-thought reasoning, ChartAgent iteratively decomposes queries into visual subtasks and actively manipulates and interacts with chart images through specialized actions such as drawing annotations, cropping regions (e.g., segmenting pie slices, isolating bars), and localizing axes, using a library of chart-specific vision tools to fulfill each subtask. This iterative reasoning process closely mirrors human cognitive strategies for chart comprehension. ChartAgent achieves state-of-the-art accuracy on the ChartBench and ChartX benchmarks, surpassing prior methods by up to 16.07% absolute gain overall and 17.31% on unannotated, numerically intensive queries. Furthermore, our analyses show that ChartAgent is (a) effective across diverse chart types, (b) achieves the highest scores across varying visual and reasoning complexity levels, and (c) serves as a plug-and-play framework that boosts performance across diverse underlying LLMs. Our work is among the first to demonstrate visually grounded reasoning for chart understanding using tool-augmented multimodal agents.

2412.11449 2026-06-10 cs.SD cs.AI cs.CL cs.LG eess.AS 版本更新

Whisper-GPT -- Continuous Discrete Hybrid Representation Language Models For Speech And Music

Whisper-GPT -- 语音和音乐的连续离散混合表示语言模型

Prateek Verma

发表机构 * Stanford University(斯坦福大学)

AI总结 提出Whisper-GPT,一种结合连续音频表示(如频谱图)和离散音频令牌的生成式大语言模型,解决了离散令牌方法上下文长度过长的问题,在语音和音乐的下一个令牌预测中降低了困惑度和负对数似然。

Comments 6 pages, 3 figures. 50th International Conference on Acoustics, Speech and Signal Processing, Hyderabad, India

详情
AI中文摘要

我们提出了WHISPER-GPT:一种用于语音和音乐的生成式大语言模型(LLM),它允许我们在单个架构中同时处理连续音频表示和离散令牌。近年来,利用神经压缩算法(例如ENCODEC)导出的离散音频令牌的生成式音频、语音和音乐模型激增。然而,这种方法的主要缺点之一是处理上下文长度。如果必须考虑不同频率下的所有音频内容来进行下一个令牌预测,那么对于高保真生成架构来说,上下文长度会急剧增长。通过结合连续音频表示(如频谱图)和离散声学令牌,我们保留了两者的优点:在单个令牌中拥有来自音频特定时间实例的所有必要信息,同时允许LLM预测未来令牌,从而获得采样和离散空间提供的其他好处。我们展示了与基于令牌的语音和音乐LLM相比,我们的架构如何提高下一个令牌预测的困惑度和负对数似然分数。

英文摘要

We propose WHISPER-GPT: A generative large language model (LLM) for speech and music that allows us to work with continuous audio representations and discrete tokens simultaneously as part of a single architecture. There has been a huge surge in generative audio, speech, and music models that utilize discrete audio tokens derived from neural compression algorithms, e.g. ENCODEC. However, one of the major drawbacks of this approach is handling the context length. It blows up for high-fidelity generative architecture if one has to account for all the audio contents at various frequencies for the next token prediction. By combining continuous audio representation like the spectrogram and discrete acoustic tokens, we retain the best of both worlds: Have all the information needed from the audio at a specific time instance in a single token, yet allow LLM to predict the future token to allow for sampling and other benefits discrete space provides. We show how our architecture improves the perplexity and negative log-likelihood scores for the next token prediction compared to a token-based LLM for speech and music.

2507.19137 2026-06-10 eess.AS cs.AI cs.SD 版本更新

Assessment of Personality Dimensions Across Situations in Dyadic Role-Play Scenarios

二元角色扮演场景中跨情境的人格维度评估

Alice Zhang, Skanda Muralidhar, Daniel Gatica-Perez, Mathew Magimai-Doss

发表机构 * Idiap Research Institute(日内瓦研究所) The University of Texas at Austin(德克萨斯大学奥斯汀分校)

AI总结 研究通过对话语音分析,发现感知人格在不同工作情境下显著变化,并识别出与各人格特质相关的声学特征。

详情
AI中文摘要

先前研究表明,用户偏好与其人格相匹配的辅助技术。这引发了对自动人格感知(APP)的兴趣,旨在预测个体感知到的人格特质。以往的APP研究将人格视为静态特质,独立于情境。然而,心理学研究表明,感知人格会随情境和场景而变化。在本研究中,我们调查了参与两种工作情境(中性面试和压力客户互动)的参与者对话语音与感知人格之间的关系。我们的主要发现是:1)感知人格在不同互动中显著不同;2)响度、声压级和频谱通量特征在中性互动中指示感知的外向性、宜人性、尽责性和开放性,而在压力情境中,神经质与这些特征相关;3)手工声学特征和非语言特征在感知人格推断中优于说话人嵌入;4)压力互动更能预测神经质,这与现有心理学研究一致。

英文摘要

Prior research indicates that users prefer assistive technologies whose personalities align with their own. This has sparked interest in automatic personality perception (APP), which aims to predict an individual's perceived personality traits. Previous studies in APP have treated personalities as static traits, independent of context. However, perceived personalities can vary by context and situation as shown in psychological research. In this study, we investigate the relationship between conversational speech and perceived personality for participants engaged in two work situations (a neutral interview and a stressful client interaction). Our key findings are: 1) perceived personalities differ significantly across interactions, 2) loudness, sound level, and spectral flux features are indicative of perceived extraversion, agreeableness, conscientiousness, and openness in neutral interactions, while neuroticism correlates with these features in stressful contexts, 3) handcrafted acoustic features and non-verbal features outperform speaker embeddings in inference of perceived personality, and 4) stressful interactions are more predictive of neuroticism, aligning with existing psychological research.

2508.07048 2026-06-10 cs.SD cs.AI cs.LG eess.AS 版本更新

Whisfusion: Parallel ASR Decoding with Masked Diffusion

Whisfusion: 基于掩码扩散的并行ASR解码

Taeyoun Kwon, Junhyuk Ahn, Taegeun Yun, Heeju Jwa, Yoonchae Choi, Siwon Park, Jongchan Kim, Hyungon Ryu, Hyuk-Jae Lee, Nam-Joon Kim

发表机构 * Seoul National University(首尔国立大学) Soongsil University(顺天大学) NVIDIA Corporation(英伟达公司)

AI总结 提出Whisfusion,在冻结的Whisper音频嵌入上训练专用掩码扩散解码器,通过并行扩散解码实现非自回归ASR,在多种语言基准上超越Whisper-large-v3,速度提升4-5倍。

Comments 16 pages, 3 figures

详情
AI中文摘要

自回归(AR)编码器-解码器模型主导着高质量的多语言ASR,但其从左到右的解码器使得推理延迟随转录长度增加。一种自然的替代方案,CTC风格的非自回归(NAR)系统避免了这一瓶颈,但其条件独立性假设牺牲了转录级别的生成建模。掩码扩散语言模型(例如LLaDA、MDLM)提供了一种有竞争力的NAR文本生成方法。我们探究这类模型是否能在消除从左到右瓶颈的同时,将NAR ASR带入强AR ASR系统的准确率范围。我们提出Whisfusion,它在冻结的Whisper-large-v3音频嵌入之上从头训练一个专用的掩码扩散解码器,仅需几步即可去噪掩码转录。我们在约68k小时的11种语言语音上训练,采用高掩码专门化以将训练与推理的完全掩码起始点对齐,并通过并行扩散解码进行解码。Whisfusion在英语、欧洲和CJK基准测试的组平均准确率上超越Whisper-large-v3,同时运行速度快4-5倍,在准确率和吞吐量上均超越Whisper-turbo。它达到与Canary和Qwen3-ASR竞争的准确率,同时运行速度快3-7倍。这些结果确立了掩码扩散作为高吞吐量多语言转录的帕累托竞争性非自回归范式。代码和模型权重可在https://this URL获取。

英文摘要

Autoregressive (AR) encoder-decoder models dominate high-quality multilingual ASR, but their left-to-right decoders make inference latency scale with transcript length. A natural alternative, CTC-style non-autoregressive (NAR) systems avoid this bottleneck but their conditional independence assumption sacrifices transcript-level generative modeling. Masked diffusion language models (e.g., LLaDA, MDLM) offer a competitive NAR text-generation approach. We ask whether such models can bring NAR ASR into the accuracy regime of strong AR ASR systems while removing the left-to-right bottleneck. We propose Whisfusion, which trains a dedicated masked diffusion decoder from scratch on top of frozen Whisper-large-v3 audio embeddings, denoising masked transcripts in just a few steps. We train on ~68k hours of 11-language speech with high-mask specialization to align training with the fully masked starting point of inference, and decode via Parallel Diffusion Decoding. Whisfusion surpasses Whisper-large-v3 on group-average accuracy across English, European, and CJK benchmarks, while running 4-5x faster, additionally surpassing Whisper-turbo in both accuracy and throughput. It reaches accuracy competitive with Canary and Qwen3-ASR while running 3-7x faster. These results establish masked diffusion as a Pareto-competitive non-autoregressive paradigm for high-throughput multilingual transcription. Code and model weights are available at https://github.com/taeyoun811/Whisfusion.

2511.10234 2026-06-10 cs.LG cs.AI 版本更新

Lost in Serialization: Invariance and Generalization of LLM Graph Reasoners

迷失在序列化中:LLM图推理器的不变性与泛化能力

Daniel Herbst, Lea Karbevska, Divyanshu Kumar, Akanksha Ahuja, Fatemeh Gholamzadeh Nasrabadi, Fabrizio Frasca

发表机构 * University of Cambridge(剑桥大学) ETH Zurich(苏黎世联邦理工学院)

AI总结 研究LLM图推理器对图表示对称性的缺乏不变性,通过分解序列化因素并评估微调影响,发现大模型更鲁棒,微调降低节点重标敏感但增加结构和格式敏感,且不保证泛化。

Comments ICML 2026 Workshop on Graph Foundation Models

详情
AI中文摘要

尽管前景广阔,基于大型语言模型(LLM)的图推理器缺乏对图表示中对称性的内置不变性。在顺序图序列化上操作时,LLM在节点重索引、边重排序或格式变化下可能产生不同输出,引发鲁棒性问题。我们系统分析了这些影响,研究了微调如何影响编码敏感性以及在未见任务上的泛化能力。我们提出了一种将图序列化分解为节点标记、边编码和语法的原则性方法,并在一个全面的基准测试套件上评估了LLM对每个因素变化的鲁棒性。我们还贡献了一组新的谱任务,以进一步评估微调推理器的泛化能力。结果表明,较大的(未微调)模型更鲁棒。微调降低了对节点重标的敏感性,但可能增加对结构和格式变化的敏感性,同时并未一致地提高在未见任务上的性能。

英文摘要

While promising, graph reasoners based on Large Language Models (LLMs) lack built-in invariance to symmetries in graph representations. Operating on sequential graph serializations, LLMs can produce different outputs under node reindexing, edge reordering, or formatting changes, raising robustness concerns. We systematically analyze these effects, studying how fine-tuning impacts encoding sensitivity as well generalization on unseen tasks. We propose a principled decomposition of graph serializations into node labeling, edge encoding, and syntax, and evaluate LLM robustness to variations of each of these factors on a comprehensive benchmarking suite. We also contribute a novel set of spectral tasks to further assess generalization abilities of fine-tuned reasoners. Results show that larger (non-fine-tuned) models are more robust. Fine-tuning reduces sensitivity to node relabeling but may increase it to variations in structure and format, while it does not consistently improve performance on unseen tasks.

2512.12675 2026-06-10 cs.CV cs.AI 版本更新

Scone: Bridging Composition and Distinction in Subject-Driven Image Generation via Unified Understanding-Generation Modeling

Scone:通过统一理解-生成建模弥合主体驱动图像生成中的组合与区分

Yuran Wang, Bohan Zeng, Chengzhuo Tong, Wenxuan Liu, Yang Shi, Xiaochen Ma, Hao Liang, Yuanxing Zhang, Wentao Zhang

发表机构 * Peking University(北京大学) Kling Team, Kuaishou Technology(快手科技 Kling 团队) Zhongguancun Academy(中关村学院) HKUST(香港科技大学) Beijing Key Laboratory of Data Intelligence and Security (Peking University)(北京数据智能与安全重点实验室(北京大学))

AI总结 提出Scone方法,通过统一理解-生成模型结合组合与区分能力,采用两阶段训练实现主体身份保持与干扰最小化,在双基准上优于现有开源模型。

Comments CVPR 2026 Highlight. Code: https://github.com/Ryann-Ran/Scone

详情
AI中文摘要

主体驱动图像生成已从单主体发展到多主体组合,但忽略了区分能力——即当输入包含多个候选主体时,区分并生成正确主体的能力。这一限制制约了其在复杂、真实视觉场景中的有效性。我们提出Scone,一种统一理解-生成方法,整合了组合与区分。Scone使理解专家充当语义桥梁,传递语义信息并引导生成专家在最小化干扰的同时保持主体身份。两阶段训练方案首先学习组合,然后通过语义对齐和基于注意力的掩码增强区分。我们还引入了SconeEval,一个用于评估多种场景下组合与区分的基准。实验表明,Scone在两个基准上的组合与区分任务中均优于现有开源模型。我们的模型、基准和训练数据可在以下网址获取:this https URL。

英文摘要

Subject-driven image generation has advanced from single- to multi-subject composition, while neglecting distinction, the ability to distinguish and generate the correct subject when inputs contain multiple candidates. This limitation restricts effectiveness in complex, realistic visual settings. We propose Scone, a unified understanding-generation method that integrates composition and distinction. Scone enables the understanding expert to act as a semantic bridge, conveying semantic information and guiding the generation expert to preserve subject identity while minimizing interference. A two-stage training scheme first learns composition, then enhances distinction through semantic alignment and attention-based masking. We also introduce SconeEval, a benchmark for evaluating both composition and distinction across diverse scenarios. Experiments demonstrate that Scone outperforms existing open-source models in composition and distinction tasks on two benchmarks. Our model, benchmark, and training data are available at: https://github.com/Ryann-Ran/Scone.

2603.23183 2026-06-10 cs.IR cs.AI 版本更新

Reasoning over Semantic IDs Enhances Generative Recommendation

基于语义ID的推理增强生成式推荐

Yingzhi He, Yan Sun, Junfei Tan, Yuxin Chen, Xiaoyu Kong, Chunxu Shen, Xiang Wang, An Zhang, Tat-Seng Chua

发表机构 * National University of Singapore(国立新加坡大学) University of Science and Technology of China(中国科学技术大学) Tencent Inc.(腾讯公司)

AI总结 提出SIDReasoner两阶段框架,通过增强语义ID与语言的对齐和结果驱动的强化优化,实现无需大量推理标注的有效推理,提升生成式推荐的准确性、可解释性和跨领域泛化能力。

Comments Accepted by KDD 2026

详情
AI中文摘要

生成式推荐的最新进展通过将序列推荐形式化为在包含语言标记和物品标识符的统一标记空间上的自回归生成,利用了预训练的大语言模型,其中每个物品由紧凑的离散标记序列(即语义ID)表示。这种基于语义ID的公式能够在大规模物品语料库上实现高效解码,并为基于大语言模型的推荐系统利用丰富的世界知识提供了自然接口。同时,大语言模型推理的突破推动了推理增强推荐的发展,然而在语义ID上的有效推理仍然未被充分探索且具有挑战性。物品标记对大语言模型而言并非天然有意义;此外,面向推荐的语义ID推理难以评估,导致高质量监督稀缺。为了解决这些挑战,我们提出了SIDReasoner,一个两阶段框架,通过增强语义ID与语言的对齐来激发对语义ID的推理,从而解锁可迁移的大语言模型推理能力,而不是依赖大量推荐特定的推理轨迹。具体来说,SIDReasoner首先通过在由更强教师模型合成的丰富语义ID中心语料库上进行多任务训练来增强语义ID与语言的对齐,将物品标记扎根于多样的语义和行为上下文中。基于这种增强的对齐,SIDReasoner进一步通过结果驱动的强化优化来改进推荐推理,引导模型走向有效的推理轨迹,而无需显式的推理标注。在三个真实世界数据集上的大量实验证明了我们推理增强的基于语义ID的生成式推荐的有效性。除了准确性之外,结果还突出了大型推理模型在生成式推荐中的更广泛潜力,包括改进的可解释性和跨领域泛化能力。

英文摘要

Recent advances in generative recommendation have leveraged pretrained LLMs by formulating sequential recommendation as autoregressive generation over a unified token space comprising language tokens and itemic identifiers, where each item is represented by a compact sequence of discrete tokens, namely Semantic IDs (SIDs). This SID-based formulation enables efficient decoding over large-scale item corpora and provides a natural interface for LLM-based recommenders to leverage rich world knowledge. Meanwhile, breakthroughs in LLM reasoning motivate reasoning-enhanced recommendation, yet effective reasoning over SIDs remains underexplored and challenging. Itemic tokens are not natively meaningful to LLMs; moreover, recommendation-oriented SID reasoning is hard to evaluate, making high-quality supervision scarce. To address these challenges, we propose SIDReasoner, a two-stage framework that elicits reasoning over SIDs by strengthening SID--language alignment to unlock transferable LLM reasoning, rather than relying on large amounts of recommendation-specific reasoning traces. Concretely, SIDReasoner first enhances SID-language alignment via multi-task training on an enriched SID-centered corpus synthesized by a stronger teacher model, grounding itemic tokens in diverse semantic and behavioral contexts. Building on this enhanced alignment, SIDReasoner further improves recommendation reasoning through outcome-driven reinforced optimization, which guides the model toward effective reasoning trajectories without requiring explicit reasoning annotations. Extensive experiments on three real-world datasets demonstrate the effectiveness of our reasoning-augmented SID-based generative recommendation. Beyond accuracy, the results highlight the broader potential of large reasoning models for generative recommendation, including improved interpretability and cross-domain generalization.

2603.29025 2026-06-10 cs.CL cs.AI 版本更新

The Model Says Walk: How Surface Heuristics Override Implicit Constraints in LLM Reasoning

模型说走:表面启发式如何覆盖LLM推理中的隐式约束

Yubo Li, Lu Zhang, Tianchong Jiang, Ramayya Krishnan, Rema Padman

发表机构 * Carnegie Mellon University(卡内基梅隆大学) Independent Researcher(独立研究者)

AI总结 研究LLM在表面线索与隐式约束冲突时的失败,提出启发式覆盖基准(HOB),通过因果行为分析揭示距离线索影响远大于目标,并验证目标分解提示可部分恢复性能。

详情
AI中文摘要

当显著的表面线索与未陈述的可行性约束冲突时,大型语言模型会失败。我们引入了启发式覆盖基准(HOB):500个实例,涵盖4个启发式家族和5个约束家族,具有最小对和显式性梯度。我们将HOB与一个可证伪的行为特征描述配对,遵循诊断-测量-桥接-治疗弧。对六个模型的洗车问题进行因果行为分析,揭示了上下文无关的S形启发式:距离线索的影响力是目标的8.7到38倍,归因更匹配关键词关联而非组合推理。在14个模型中,严格的10/10评估显示,没有模型超过75%,存在约束最难,为44%。一个最小提示将性能提高15个百分点,表明是约束推断失败而非知识缺失。然而,14个模型中有12个在移除约束后表现更差,最多下降39个百分点,揭示了保守偏差。对Gemini 3.1 Pro的思考模式消融实验显示,思考开启时性能为74.6%,关闭时降至58.4%,而显式目标分解将其恢复至71.2%。因此,内部推理确实有用,显式提示可以部分替代。推理模型并不绝对优于非推理模型:在控制能力排名后,残差推理模式效应为1.8个百分点且不显著。参数探针显示S形模式泛化到成本、效率和语义相似性启发式。目标分解提示将性能提升5.0个百分点,而通用思维链提升3.1个百分点,将约束枚举隔离为有效成分。总体而言,启发式覆盖是一个系统性的推理漏洞,其量化位点在于推理顺序而非知识,并且有一个经过测试的干预措施。

英文摘要

Large language models fail when a salient surface cue conflicts with an unstated feasibility constraint. We introduce the Heuristic Override Benchmark (HOB): 500 instances spanning 4 heuristic families and 5 constraint families, with minimal pairs and explicitness gradients. We pair HOB with a falsifiable behavioral characterization following a diagnose-measure-bridge-treat arc. Causal-behavioral analysis of the car wash problem across six models reveals context-independent sigmoid heuristics: the distance cue has 8.7 to 38 times more influence than the goal, and attribution better matches keyword association than compositional inference. Across 14 models, strict 10/10 evaluation shows that no model exceeds 75%, and presence constraints are hardest at 44%. A minimal hint improves performance by 15 pp, suggesting a constraint-inference failure rather than missing knowledge. However, 12 of 14 models perform worse when the constraint is removed, by up to 39 pp, revealing conservative bias. A thinking-mode ablation on Gemini 3.1 Pro drops performance from 74.6% with thinking on to 58.4% with thinking off, while explicit goal decomposition recovers it to 71.2%. Thus, internal deliberation does useful work, and explicit prompting can partially substitute for it. Reasoning models do not categorically outperform non-reasoning peers: after controlling for capability rank, the residual reasoning-mode effect is 1.8 pp and is not significant. Parametric probes show that the sigmoid pattern generalizes to cost, efficiency, and semantic-similarity heuristics. Goal-decomposition prompting improves performance by 5.0 pp, compared with 3.1 pp for generic chain-of-thought, isolating constraint enumeration as the active ingredient. Overall, heuristic override is a systematic reasoning vulnerability with a quantified locus in inference order, not knowledge, and a tested intervention.

2604.01993 2026-06-10 cs.CL cs.AI 版本更新

SAFE: An LLM-as-Verifier Framework for Evidence-Grounded Multi-Hop Reasoning

SAFE: 一种基于LLM作为验证器的证据驱动多跳推理框架

Daeyong Kwon, Soyoung Yoon, Seung-won Hwang

发表机构 * Seoul National University(首尔国立大学)

AI总结 提出SAFE框架,通过将推理分解为知识图谱三元组,在生成过程中逐步验证中间步骤,以解决多跳问答中模型通过无效推理得到正确答案的问题,平均准确率提升8.8个百分点。

详情
AI中文摘要

多跳问答基准测试常常奖励大型语言模型(LLM)的虚假正确性,即模型通过无效的中间推理得出正确答案。我们提出了SAFE,一种基于LLM作为验证器的证据驱动多跳问答框架。SAFE不是在生成后仅判断最终答案,而是在生成过程中通过检查中间步骤与提供的段落和先前的推理轨迹来验证推理。为了使这一过程可检查,SAFE将推理分解为以知识图谱(KG)三元组表示的原子化、证据驱动的单元。在训练时,SAFE在KG约束下验证基准监督,并构建可靠的验证器训练数据。在推理时,外部验证器检查每个生成的步骤,识别无效推理,并在错误传播之前提供纠正反馈。在三个多跳问答基准测试中,SAFE平均提高了8.8个百分点的准确率。这些结果表明,证据驱动的多跳问答受益于将基于LLM的评估从事后答案判断转向逐步推理验证。

英文摘要

Multi-hop QA benchmarks often reward Large Language Models (LLMs) for spurious correctness, where models reach correct answers through invalid intermediate reasoning. We propose SAFE, an LLM-as-verifier framework for evidence-grounded multi-hop QA. Rather than judging only the final answer after generation, SAFE verifies reasoning during generation by checking intermediate steps against the provided passages and previous reasoning trajectory. To make this process checkable, SAFE decomposes reasoning into atomic, evidence-grounded units represented with Knowledge Graph (KG) triples. At train-time, SAFE verifies benchmark supervision under KG-grounded constraints and constructs reliable verifier training data. At inference-time, an external verifier checks each generated step, identifies invalid reasoning, and provides correction feedback before errors propagate. Across three multi-hop QA benchmarks, SAFE improves accuracy by 8.8 pp on average. These results show that evidence-grounded multi-hop QA benefits from shifting LLM-based evaluation from post-hoc answer judgment to stepwise reasoning verification.

2604.14397 2026-06-10 cs.CL cs.AI 版本更新

Generating Concept Lexicalizations via Dictionary-Based Cross-Lingual Sense Projection

基于词典的跨语言语义投影生成概念词汇化

David Basil, Chirooth Girigowda, Bradley Hauer, Sahir Momin, Ning Shi, Grzegorz Kondrak

发表机构 * University of Toronto(多伦多大学)

AI总结 提出一种通过语义投影将英语WordNet概念扩展到新语言的方法,利用双语词典增强对齐并过滤错误投影,在多个语言上提升了精度且保持可解释性和资源效率。

Comments Paper presented at Canadian AI 2026

详情
AI中文摘要

我们研究通过语义生成自动将WordNet风格的词汇资源扩展到新语言的任务。我们通过语义投影将目标语言词条与现有词汇概念关联来生成词义。给定一个带有词义标注的英语语料库及其翻译,我们的方法将注释的义原集投影到对齐的目标语言标记上,并将相应的词条分配给这些义原集。为了生成对齐并确保其质量,我们使用双语词典增强预训练的基础对齐器,该词典也用于过滤不正确的语义投影。我们在多种语言上评估该方法,将其与先前方法以及基于词典和大型语言模型的基线进行比较。结果表明,所提出的投影-过滤策略在保持可解释性和资源效率的同时提高了精度。我们在该https URL上发布代码、文档和生成的词义清单。

英文摘要

We study the task of automatically expanding WordNet-style lexical resources to new languages through sense generation. We generate senses by associating target-language lemmas with existing lexical concepts via semantic projection. Given a sense-tagged English corpus and its translation, our method projects the annotated synsets onto aligned target-language tokens and assigns the corresponding lemmas to those synsets. To generate alignments and ensure their quality, we augment a pretrained base aligner with a bilingual dictionary, which is also used to filter incorrect sense projections. We evaluate the method on multiple languages, comparing it to prior methods, as well as dictionary-based and large language model baselines. Results show that the proposed project-and-filter strategy improves precision while remaining interpretable and resource-efficient. We release our code, documentation, and generated sense inventories at https://github.com/UAlberta-NLP/ExpandNet.

2604.22565 2026-06-10 cs.CL cs.AI 版本更新

Learning Evidence Highlighting for Frozen LLMs

学习为冻结的LLM突出证据

Shaoang Li, Yanhang Shi, Yufei Li, Mingfu Liang, Xiaohan Wei, Yunchen Pu, Fei Tian, Chonglin Sun, Frank Shyu, Luke Simon, Sandeep Pandey, Xi Liu, Jian Li

发表机构 * Stony Brook University(石桥大学) Meta AI

AI总结 提出HiLight框架,通过强化学习训练轻量级Actor在长上下文中插入高亮标签,使冻结的LLM更关注关键证据,无需证据标签或修改求解器,在序列推荐和长上下文问答中提升性能。

详情
AI中文摘要

大型语言模型(LLM)能够很好地推理,但当关键证据埋藏在冗长、嘈杂的上下文中时,常常会错过决定性证据。我们提出了HiLight,一个证据强调框架,它将证据选择与冻结的LLM求解器的推理解耦。HiLight避免压缩或重写输入(这可能会丢弃或扭曲证据),而是训练一个轻量级的强调Actor,在未改变的上下文中的关键跨度周围插入最小的高亮标签。然后,一个冻结的求解器对强调后的输入进行下游推理。我们将高亮视为一个弱监督决策问题,并使用强化学习仅基于求解器的任务奖励来优化Actor,不需要证据标签,也不需要访问或修改求解器。在序列推荐和长上下文问答中,HiLight始终优于强大的基于提示和自动提示优化的基线。学习到的强调策略可以零样本迁移到更小和更大的未见求解器家族,包括基于API的求解器,这表明Actor捕获了真正的、可复用的证据结构,而不是过拟合单个骨干网络。

英文摘要

Large Language Models (LLMs) can reason well, yet often miss decisive evidence when it is buried in long, noisy contexts. We introduce HiLight, an Evidence Emphasis framework that decouples evidence selection from reasoning for frozen LLM solvers. HiLight avoids compressing or rewriting the input, which can discard or distort evidence, by training a lightweight Emphasis Actor to insert minimal highlight tags around pivotal spans in the unaltered context. A frozen Solver then performs downstream reasoning on the emphasized input. We cast highlighting as a weakly supervised decision-making problem and optimize the Actor with reinforcement learning using only the Solver's task reward, requiring no evidence labels and no access to or modification of the Solver. Across sequential recommendation and long-context question answering, HiLight consistently improves performance over strong prompt-based and automated prompt-optimization baselines. The learned emphasis policy transfers zero-shot to both smaller and larger unseen Solver families, including an API-based Solver, suggesting that the Actor captures genuine, reusable evidence structure rather than overfitting to a single backbone.

2605.03344 2026-06-10 cs.IR cs.AI cs.CL 版本更新

RAG over Thinking Traces Can Improve Reasoning Tasks

RAG 基于思考轨迹可提升推理任务

Negar Arabzadeh, Wenjie Ma, Sewon Min, Matei Zaharia

发表机构 * University of California, Berkeley(加州大学伯克利分校)

AI总结 提出检索思考轨迹而非文档,通过 T3 方法将其转化为结构化表示,在推理任务上显著提升性能,超越标准 RAG 和无 RAG 基线。

详情
AI中文摘要

检索增强生成(RAG)已被证明对知识密集型任务有效,但普遍认为其对数学和代码生成等推理密集型问题帮助有限。我们通过证明限制不在于 RAG 本身而在于语料库的选择来挑战这一假设。我们不检索文档,而是提出检索思考轨迹,即问题求解尝试过程中产生的中间思考轨迹。我们表明思考轨迹本身就是一个强大的检索源,并进一步引入 T3,一种离线方法,将其转化为结构化、利于检索的表示,以提高可用性。使用这些轨迹作为语料库,简单的检索-生成流水线在强模型和基准测试(如 AIME 2025--2026、LiveCodeBench 和 GPQA-Diamond)上持续提升推理性能,优于无 RAG 基线和检索标准网络语料库。例如,在 AIME 2025-2026 上,使用 Gemini-2-thinking 生成的轨迹进行 RAG,在 Gemini-2.5-Flash、GPT-OSS-120B 和 GPT-5 上分别实现了 +56.3%、+8.6% 和 +7.6% 的相对增益,尽管这些是更新的模型。总体而言,我们的结果表明思考轨迹是推理任务的有效检索语料库,将其转化为结构化、紧凑或诊断性表示可带来更强的增益。代码见此链接。

英文摘要

Retrieval-augmented generation (RAG) has proven effective for knowledge-intensive tasks, but is widely believed to offer limited benefit for reasoning-intensive problems such as math and code generation. We challenge this assumption by showing that the limitation lies not in RAG itself, but in the choice of corpus. Instead of retrieving documents, we propose retrieving thinking traces, i.e., intermediate thinking trajectories generated during problem solving attempts. We show that thinking traces are already a strong retrieval source, and further introduce T3, an offline method that transforms them into structured, retrieval-friendly representations, to improve usability. Using these traces as a corpus, a simple retrieve-then-generate pipeline consistently improves reasoning performance across strong models and benchmarks such as AIME 2025--2026, LiveCodeBench, and GPQA-Diamond, outperforming both non-RAG baselines and retrieval over standard web corpora. For instance, on AIME 2025-2026, RAG with traces generated by Gemini-2-thinking achieves relative gains of +56.3%, +8.6%, and +7.6% for Gemini-2.5-Flash, GPT-OSS-120B, and GPT-5, respectively, even though these are more recent models. Overall, our results suggest that thinking traces are an effective retrieval corpus for reasoning tasks, and transforming them into structured, compact, or diagnostic representations unlocks even stronger gains. Code available at https://github.com/Narabzad/t3.

2605.18271 2026-06-10 cs.CL cs.AI cs.IR cs.LG 版本更新

From Volume to Value: Preference-Aligned Memory Construction for On-Device RAG

从体积到价值:面向设备端RAG的偏好对齐记忆构建

Changmin Lee, Jaemin Kim, Taesik Gong

发表机构 * Department of Computer Science and Engineering, Ulsan National Institute of Science and Technology (UNIST), Ulsan, Republic of Korea(计算机科学与工程系,全州国立科学与技术研究所(UNIST),全州,韩国)

AI总结 本文提出EPIC方法,通过将用户偏好作为紧凑且稳定的个人上下文形式,整合到RAG流程中,以在有限内存下提高检索与用户偏好的对齐度,从而减少内存使用并提升准确性。

Comments Accepted to ICML 2026. Code and data are available at https://github.com/UbiquitousAILab/EPIC

详情
AI中文摘要

随着基于大型语言模型(LLMs)的个人AI代理的迅速发展,将其部署到设备上已成为隐私和响应性的重要需求。为了处理现实世界请求中固有的个人和上下文依赖性,这些代理必须基于设备上存储的个人上下文进行生成。然而,在内存预算紧张的情况下,核心瓶颈是存储什么内容以确保检索与用户保持一致。我们提出EPIC(高效偏好对齐索引构建),专注于用户偏好作为紧凑且稳定的个人上下文形式,并在整个RAG流程中整合它们。EPIC会选择性地保留与偏好相关的信息,并将检索对准偏好对齐的上下文。在四个涵盖对话、辩论、解释和推荐的基准测试中,EPIC将索引内存减少了2,404倍,提高了偏好遵循的准确性20.17个百分点,并在最佳表现基线之上实现了33.33倍更低的检索延迟。在我们的设备端实验中,EPIC在29.35毫秒/查询的流式更新下保持内存占用低于1 MB。

英文摘要

With the rapid emergence of personal AI agents based on Large Language Models (LLMs), implementing them on-device has become essential for privacy and responsiveness. To handle the inherently personal and context-dependent nature of real-world requests, such agents must ground their generation in device-resident personal context. However, under tight memory budgets, the core bottleneck is what to store so that retrieval remains aligned with the user. We propose EPIC (Efficient Preference-aligned Index Construction), which focuses on user preferences as a compact and stable form of personal context and integrates them throughout the RAG pipeline. EPIC selectively retains preference-relevant information from raw data and aligns retrieval toward preference-aligned contexts. Across four benchmarks covering conversations, debates, explanations, and recommendations, EPIC reduces indexing memory by 2,404 times, improves preference-following accuracy by 18.79 %p, and achieves 32.17 times lower retrieval latency over the best-performing baseline. In on-device experiments, EPIC maintains under 1 MB memory and achieves 5.21 to 29.35 ms/query latency across three platforms, while supporting streaming updates under preference drift. Our code and data are available at https://github.com/UbiquitousAILab/EPIC.

2605.28066 2026-06-10 cs.CL cs.AI 版本更新

PromptEmbedder: Efficient and Transferable Text Embedding via Dual-LLM Soft Prompting

PromptEmbedder:通过双LLM软提示实现高效且可迁移的文本嵌入

Yu-Che Tsai, Kuan-Yu Chen, Yuan-Hao Chen, Yu-Han Chang, Ching-Yu Tsai, Yu-Hsiang Chuang, Shou-De Lin

发表机构 * Department of Computer Science and Information Engineering, National Taiwan University(国立台湾大学计算机科学与资讯工程系) National Taiwan University AI Center of Research Excellence(国立台湾大学人工智能研究中心)

AI总结 提出PromptEmbedder双LLM框架,通过可微分的软提示生成将嵌入知识从特定骨干权重中解耦,在保持性能的同时降低40% GPU内存并加速3.7倍训练。

详情
AI中文摘要

大型语言模型(LLM)在文本嵌入方面展现出显著效果,但当前的适应方法(如LoRA)在计算效率和跨架构可迁移性方面面临重大瓶颈。每当出现新的骨干网络时,现有方法需要从头开始进行昂贵的重新训练。为了解决这个问题,我们提出了PromptEmbedder,一种新颖的双LLM框架,将嵌入知识与特定骨干权重解耦。PromptEmbedder利用一个提示LLM通过连续松弛的可微分生成过程,为冻结的嵌入LLM生成指令感知的软提示,确保对比训练期间的全梯度流动。通过将任务特定知识定位在提示LLM中,适应新架构只需重新训练一个轻量级的线性对齐矩阵。在MTEB基准上的评估表明,PromptEmbedder实现了与LoRA微调相当的性能,同时将GPU内存减少40%,训练速度提升3.7倍。我们的方法建立了一种可扩展、架构无关的范式,用于高效的基于LLM的表示学习。

英文摘要

Large Language Models (LLMs) have demonstrated remarkable efficacy in text embedding, yet current adaptation methods like LoRA face significant bottlenecks in computational efficiency and cross-architecture transferability. Whenever a new backbone emerges, existing approaches require costly retraining from scratch. To address this, we propose PromptEmbedder, a novel dual-LLM framework that decouples embedding knowledge from specific backbone weights. PromptEmbedder utilizes a Prompting LLM to generate instruction-aware soft prompts for a frozen Embedding LLM via a differentiable generation process with continuous relaxation, ensuring full gradient flow during contrastive training. By localizing task-specific knowledge within the Prompting LLM, adapting to new architectures requires only retraining a lightweight linear alignment matrix. Evaluations on the MTEB benchmark show that PromptEmbedder achieves comparable performance with LoRA finetuning while reducing GPU memory by 40% and accelerating training by 3.7x. Our approach establishes a scalable, architecture-agnostic paradigm for efficient LLM-based representation learning.

2606.07422 2026-06-10 cs.CL cs.AI 版本更新

The Masked Advantage: Uncovering Local-Language Access to Cultural Knowledge in LLMs

掩蔽优势:揭示LLMs中本地语言对文化知识的访问

Yang Zhang, Xiao Fei, Amr Mohamed, Sarah Almeida Carneiro, Mersin Konomi, Mingmeng Geng, Ahmed Asaad, Guokan Shang, Michalis Vazirgiannis

发表机构 * Ecole Polytechnique(巴黎高等理工学院) MBZUAI(穆罕默德·本·拉什德智能研究院) ENS-PSL(巴黎综合理工学院-巴黎科学实验室) Durham University(杜尔罕大学)

AI总结 通过控制实验和项目反应理论模型,分离语言能力与文化知识访问,发现本地语言在文化知识访问上具有优势,但常被语言能力不足掩盖。

详情
AI中文摘要

大型语言模型越来越多地被用于跨语言回答文化相关问题,但目前尚不清楚本地文化知识是通过英语还是本地语言更容易获取。现有评估面临两个关键限制:许多评估依赖于可能无法反映文化知识自然出现的平行模板问题,并且原始准确率混淆了通用语言能力与语言条件知识访问。我们通过一个基于从区域基准和本地来源收集的真实世界文化问题的受控框架来解决这些问题。通过交叉问题类型(文化无关 vs. 文化特定)与查询语言(英语 vs. 本地语言),并使用共享的1PL项目反应理论模型估计能力,我们将语言能力与本地化知识访问分离。在13个地区和大约80个模型上,我们发现文化无关问题上存在一致的英语优势,表明更强的英语能力。然而,在考虑了这种能力差距后,本地语言在几乎所有地区-模型设置中都显示出积极的知识访问优势。这种优势在原始准确率中通常被掩盖,但在前沿、区域对齐或语言适应模型中变得更加明显。我们的结果表明,较弱的本地语言表现并不一定意味着较弱的文化知识;相反,本地文化知识可能通过本地语言更容易访问,但被有限的语言能力所隐藏。

英文摘要

Large language models are increasingly used to answer culturally grounded questions across languages, yet it remains unclear whether local cultural knowledge is better accessed through English or the local language. Existing evaluations face two key limitations: many rely on parallel template-based questions that may not reflect how cultural knowledge naturally appears, and raw accuracy conflates general language proficiency with language-conditioned knowledge access. We address these issues with a controlled framework built on real-world cultural questions collected from regional benchmarks and local sources. By crossing question type (culture-agnostic vs. culture-specific) with query language (English vs. local language), and estimating ability with a shared 1PL item response theory model, we separate proficiency from localized knowledge access. Across 13 locales and roughly 80 models, we find a consistent English advantage on culture-agnostic questions, indicating stronger English proficiency. However, after accounting for this proficiency gap, local languages show a positive knowledge-access advantage in nearly all locale-model settings. This advantage is often masked in raw accuracy but becomes more visible for frontier, regionally aligned, or language-adapted models. Our results suggest that weaker local-language performance does not necessarily imply weaker cultural knowledge; rather, local cultural knowledge may be more accessible through the local language but hidden by limited language proficiency.

2606.08674 2026-06-10 cs.CV cs.AI 版本更新

BioVid: Autoregressive Video Generation with Biological Behavior Semantic Comprehension

BioVid: 具有生物行为语义理解的自回归视频生成

Tsung-Wei Pan, Jung-Hua Wang

发表机构 * Department of Electrical Engineering, National Taiwan Ocean University(国立台湾海洋大学电子工程系) AI research center, National Taiwan Ocean University(国立台湾海洋大学人工智能研究中心)

AI总结 提出BioVid,一种数据驱动的自回归视频生成框架,通过FSQ-R3GAN分词器和因果Transformer学习生物行为的自然时长分布,无需预设长度约束。

详情
AI中文摘要

现有的视频生成框架将序列时长视为外部指定参数——固定的帧数或文本提示——生成的片段在时间边界上与真实行为数据的统计结构脱节。这一假设与生物行为根本不一致,因为动作时长在个体和实例之间自然变化,并编码在数据本身中。我们提出BioVid,一种数据驱动的自回归视频生成框架,直接从训练数据中学习生物行为的时序结构,包括其自然长度分布。在第一阶段,有限标量量化GAN(FSQ-R3GAN)分词器将每个视频帧编码为紧凑的离散表示,结合R3GAN的稳定相对训练目标和FSQ的保证码本利用率,实现高保真空间重建而无需码本崩溃。在第二阶段,因果Transformer自回归地对生成的令牌序列建模,并在行为事件达到语义闭合时学习发出序列结束(EOS)令牌,终止分布自然地从训练数据中涌现,而非任何人为指定的约束。在人类饮酒行为数据集(NTU RGB+D, A001, n=94)上的实验表明,BioVid生成的长度分布与保留测试数据的分布紧密匹配,与真实分布的Wasserstein-1距离为1.24——相比之下,固定长度基线为6.05,VideoGPT为15.48——同时保持有竞争力的空间保真度。

英文摘要

Existing video generation frameworks treat sequence duration as an externally prescribed parameter -- fixed frame counts or text prompts -- producing clips whose temporal boundaries are decoupled from the statistical structure of real behavioral data. This assumption is fundamentally misaligned with biological behavior, where action duration varies naturally across individuals and instances and is encoded in the data itself. We present BioVid, a data-driven autoregressive video generation framework that learns the temporal structure of biological behaviors directly from training data, including their natural length distributions. In the first stage, a Finite Scalar Quantization GAN (FSQ-R3GAN) tokenizer encodes each video frame into a compact discrete representation, combining the stabilized relativistic training objective of R3GAN with FSQ's guaranteed codebook utilization to achieve high-fidelity spatial reconstruction without codebook collapse. In the second stage, a causal Transformer models the resulting token sequences autoregressively and learns to emit an End-of-Sequence (EOS) token when the behavioral event reaches semantic closure, with the termination distribution emerging naturally from the training data rather than any human-specified constraint. Experiments on a human drinking behavior dataset (NTU RGB+D, A001, n=94) demonstrate that BioVid's generated length distribution closely matches that of held-out test data, achieving a Wasserstein-1 distance of 1.24 against the ground truth -- compared to 6.05 for a fixed-length baseline and 15.48 for VideoGPT -- while maintaining competitive spatial fidelity.

2606.09079 2026-06-10 cs.LG cs.AI 版本更新

FlashMemory-DeepSeek-V4: Lightning Index Ultra-Long Context via Lookahead Sparse Attention

FlashMemory-DeepSeek-V4: 通过前瞻稀疏注意力实现闪电索引超长上下文

Yan Wang, Qifan Zhang, Jiachen Yu, Tian Liang, Dongyang Ma, Xiang Hu, Zibo Lin, Chunyang Li, Zhichao Wang, Miao Peng, Nuo Chen, Jia Li, Yujiu Yang, Haitao Mi, Dong Yu

发表机构 * Independent Researchers(独立研究者) Tencent(腾讯) The Hong Kong University of Science and Technology (Guangzhou)(香港科技大学(广州)) Tsinghua University(清华大学)

AI总结 提出前瞻稀疏注意力(LSA),基于DeepSeek-V4架构的神经记忆索引器,通过预测未来上下文需求仅保留关键KV块,在超长上下文场景下将物理KV缓存压缩至全上下文的13.5%,同时保持或略微提升下游准确率。

Comments Technical report. 11 pages. Code and model available at https://github.com/libertywing/FlashMemory-Deepseek-V4 and https://huggingface.co/libertywing/FlashMemory-Deepseek-V4

详情
AI中文摘要

传统大语言模型在解码过程中保持完整的KV缓存,导致超长上下文服务出现严重的GPU内存瓶颈。在本报告中,我们提出前瞻稀疏注意力(LSA),一种基于DeepSeek-V4架构构建的神经记忆索引器驱动的新型推理范式。LSA并非被动地关注所有历史令牌,而是主动预测未来的上下文需求,并仅在GPU内存中保留查询关键的KV块。关键的是,我们通过无骨干的解耦训练策略实例化该架构。通过将索引器制定为标准双编码器架构,我们使用标准检索训练框架独立训练它,而无需将庞大的骨干模型加载到GPU内存中。我们证明这种“少即是多”的范式显著最大化服务效率,同时在依赖长期全局记忆的任务中充当有效的注意力去噪器。在主要的长上下文评估套件(例如LongBench-v2、LongMemEval和RULER)中,FM-DS-V4将平均物理KV缓存占用压缩至全上下文基线的仅13.5%,同时一致地保持或略微提升下游准确率(平均绝对边际+0.6%)。关键的是,在极端500K规模下,FlashMemory将物理KV缓存开销抑制超过90%,而不会破坏骨干的核心推理能力。

英文摘要

Conventional LLMs keep the full KV cache loaded during decoding, causing a severe GPU memory bottleneck for ultra-long context serving. In this report, we propose Lookahead Sparse Attention (LSA), a novel inference paradigm powered by a Neural Memory Indexer built upon the DeepSeek-V4 architecture. Rather than passively attending to all historical tokens, LSA proactively predicts future context demands and preserves only the query-critical KV chunks in the GPU memory. Crucially, we instantiate this architecture via a backbone-free decoupled training strategy. By formulating the indexer as a standard dual-encoder architecture, we train it independently using standard retrieval training frameworks without ever loading the massive backbone model into GPU memory. We demonstrate that this "less is more" paradigm significantly maximizes serving efficiency while acting as an effective attention denoiser in tasks that rely on long-term global memory. Across primary long-context evaluation suites (e.g., LongBench-v2, LongMemEval, and RULER), FM-DS-V4 compresses the average physical KV cache footprint down to merely 13.5% of the full-context baseline, while consistently preserving or slightly elevating downstream accuracy (+0.6% absolute margin on average). Crucially, at extreme 500K scales, FlashMemory suppresses the physical KV cache overhead by over 90% without destabilizing the backbone's core reasoning capacities.

7. 机器人与具身智能 12 篇

2606.09919 2026-06-10 cs.LG cs.AI cs.MA cs.RO 交叉投稿

Co-GLANCE: Uncertainty-Aware Active Perception for Heterogeneous Robot Teaming

Co-GLANCE: 异构机器人团队的不确定性感知主动感知

Michal P. Podolinsky, Neel P. Bhatt, Pranay Samineni, Rohan Siva, Christian Ellis, Ufuk Topcu

AI总结 提出Co-GLANCE系统,通过蒸馏视觉语言模型实现实时遮挡分割与机器人分配,结合共形预测与选择性弃权提供统计保证的不确定性量化,驱动主动感知,在真实场景中遮挡分割和分配准确率分别提升25%和36%,推理延迟降低350倍。

Comments Code, videos, and dataset available at https://co-glance.github.io/

详情
AI中文摘要

感知不确定性是异构机器人团队在非结构化户外环境中运行的核心挑战,因为单一视角无法提供可靠的场景理解。由遮挡等来源引起的感知不确定性,根据场景结构在不同机器人视角下表现不同。检测和解决感知不确定性的来源需要基于场景的上下文推理和具备能力感知的机器人分配。虽然视觉语言模型为两者提供了强大的语义先验,但它们对于机载推理在计算上过于昂贵,且缺乏校准的不确定性量化。我们介绍了Co-GLANCE,一个用于异构机器人团队不确定性解决的实时机载感知与决策系统。Co-GLANCE将视觉语言模型的语义推理能力蒸馏为用于遮挡分割和机器人分配的端到端模型,消除了对基于云推理的需求。为了量化感知不确定性,Co-GLANCE结合了共形预测与选择性弃权,为分割、机器人分配和检测输出提供统计有效的覆盖保证。这些校准的不确定性估计直接触发主动感知,派遣最合适的机器人获取信息丰富的视角并解决不确定性。在真实世界场景中,Co-GLANCE在遮挡分割和机器人分配准确率上分别比基于云的视觉语言模型基线高出25%和36%,同时将每帧推理延迟降低350倍。我们还发布了一个空地数据集以供未来研究。代码、视频和数据集可在以下网址获取:此 https URL。

英文摘要

Perceptual uncertainty is a central challenge for heterogeneous robot teams operating in unstructured outdoor environments, where no single viewpoint affords reliable scene understanding. Perceptual uncertainty, arising from sources such as occlusions, manifests differently across robot viewpoints depending on scene structure. Detecting and resolving sources of perceptual uncertainty requires both scene-based contextual reasoning and capability-aware robot allocation. While vision-language models provide strong semantic priors for both, they are computationally prohibitive for onboard inference and lack calibrated uncertainty quantification. We introduce Co-GLANCE, a real-time onboard perception and decision-making system for uncertainty resolution in heterogeneous robot teams. Co-GLANCE distills the semantic reasoning capabilities of a vision-language model into an end-to-end model for occlusion segmentation and robot allocation, eliminating the need for cloud-based inference. To quantify perceptual uncertainty, Co-GLANCE combines conformal prediction with selective abstention to provide statistically valid coverage guarantees for segmentation, robot allocation, and detection outputs. These calibrated uncertainty estimates directly trigger active perception, dispatching the most appropriate robot to acquire informative viewpoints and resolve uncertainty. Across real-world scenarios, Co-GLANCE outperforms cloud-based vision-language model baselines in occlusion segmentation and robot allocation accuracy by 25% and 36%, respectively, while reducing per-frame inference latency 350x. We also release an air-ground dataset for future research. Code, videos, and dataset available at https://co-glance.github.io/ .

2606.09958 2026-06-10 cs.RO cs.AI 交叉投稿

Uncertainty-Aware Motion Planning for Autonomous Driving in Mixed Traffic Environment

混合交通环境下自动驾驶的不确定性感知运动规划

Ming Cheng, Hao Chen, Ziyi Yang, Ziluowen Luo, Senzhang Wang

AI总结 提出不确定性感知运动规划(UAMP),通过量化人类意图不确定性并引入不确定性校准值学习,提升自动驾驶在混合交通中的安全性和舒适性。

详情
AI中文摘要

在自动驾驶和人类驾驶车辆可能共存的混合交通环境中,自动驾驶车辆的运动规划需要预测周围人类驾驶员的未来行为。现有的基于强化学习的方法通常直接将预测的人类意图纳入观测以实现主动规划。然而,由于行为多样性、感知噪声和部分可观测性,人类意图本质上是不确定的。将预测意图视为确定性状态可能导致自动驾驶车辆做出不安全决策。为解决此问题,我们提出不确定性感知运动规划(UAMP),该规划将人类意图预测的不确定性纳入自动驾驶决策。具体来说,UAMP首先引入一个邻近感知不确定性估计器,以量化交互条件下的意图不确定性,并构建一个不确定性引导的联合意图分布,覆盖周围的人类驾驶车辆。在此不确定性集合内,UAMP进一步引入不确定性校准值学习(UCVL),以纠正因直接将不确定的人类意图预测纳入观测而产生的值函数学习偏差。在各种混合交通场景中的大量实验表明,与现有方法相比,UAMP显著提高了安全性和驾驶舒适性,同时保持了交通效率。代码发布在此https URL。

英文摘要

In mixed-traffic environments where autonomous and human-driven vehicles may co-exist, motion planning for autonomous vehicles requires anticipating the future behaviors of surrounding human drivers. Existing reinforcement learning-based methods generally directly incorporate the predicted human intents into the observation to enable a proactive planning. However, human intent is inherently uncertain due to the behavioral diversity, perception noise, and partial observability. Treating predicted intends as deterministic states can result in unsafe decisions for autonomous vehicles. To address this problem, we propose Uncertainty-Aware Motion Planning (UAMP), which incorporates uncertainty in human intent prediction for AV decision-making. Specifically, UAMP first introduces a proximity-aware uncertainty estimator to quantify the interaction-conditioned intent uncertainty and constructs an uncertainty-guided joint intent distribution over surrounding human-driven vehicles. Within this uncertainty set, UAMP further introduces Uncertainty-Calibrated Value Learning (UCVL) to correct value function learning biases arising from directly incorporating uncertain human intent predictions into the observation. Extensive experiments in various mixed-traffic scenarios show that UAMP significantly improves safety and driving comfort, while maintaining traffic efficiency compared with existing approaches. The code is released at https://anonymous.4open.science/r/UAMP-5638.

2606.10019 2026-06-10 cs.CV cs.AI cs.RO 交叉投稿

Generalized-CVO: Fast and Correspondence-Free Local Point Cloud Registration with Second Order Riemannian Optimization

广义CVO:基于二阶黎曼优化的快速无对应局部点云配准

Ray Zhang, Marcus Greiff, Thomas Lew, John Subosits

AI总结 提出一种基于几何表面结构和再生核希尔伯特空间嵌入的无对应局部点云配准方法,采用二阶流形优化实现高达10倍加速,在LiDAR和RGB-D跟踪及物体配准中显著降低漂移并提升鲁棒性。

Comments 16 pages, 12 figures

详情
AI中文摘要

我们提出了一种快速且无需对应关系的局部点云配准方法,该方法利用了几何表面结构和再生核希尔伯特空间(RKHS)嵌入。该方法将点云表示为具有逐点各向异性核的连续函数,这些核编码了局部几何信息。这种公式化在沿表面法线方向改善对齐的同时,放松了沿切线方向的对齐。为了解决由此产生的配准问题,我们提出了一种具有近似黎曼海森矩阵的二阶流形优化方案,与先前基于无对应RKHS方法中使用的一阶求解器相比,实现了高达10倍的加速。我们展示了在多种室内外数据集上改进的帧到帧LiDAR和RGB-D跟踪精度。在驾驶领域的LiDAR跟踪配准任务中,我们在具有挑战性的特征稀疏环境下实现了平移和旋转漂移均减少超过55%。在物体配准基准测试中,我们展示了相比基于ICP的方法更强的鲁棒性,并且在优化全局初始化时(尤其是在中等错位情况下)获得了进一步的提升。

英文摘要

We propose a fast and correspondence-free local point cloud registration method that leverages geometric surface structure and reproducing kernel Hilbert space (RKHS) embeddings. The method represents point clouds as continuous functions with point-wise anisotropic kernels that encode local geometry. This formulation improves alignment along surface normals while relaxing alignment along tangential directions. To solve the resulting registration problem, we propose a second-order on-manifold optimization scheme with approximate Riemannian Hessians, achieving a speedup of up to 10x over the first-order solvers used in prior correspondence-free RKHS-based methods. We demonstrate improved frame-to-frame LiDAR and RGB-D tracking accuracy across diverse indoor and outdoor datasets. On a LiDAR tracking registration task in the driving domain, we achieve a reduction of $>55\%$ in both translational and rotational drift in challenging feature-sparse environments. On object registration benchmarks, we show improved robustness over ICP-based methods and further gains when refining global initialization, particularly under moderate misalignment.

2606.10180 2026-06-10 cs.RO cs.AI cs.HC 交叉投稿

Flow Control: Steering Vision-Language-Action Models with Simple Real-Time Inputs

流控制:通过简单实时输入引导视觉-语言-动作模型

Jonathan C. Kao, Jason Chan, Andy Wang

AI总结 提出流控制方法,利用键盘等通用实时输入引导VLA模型动作,无需重新训练,能提升任务成功率和完成速度。

Comments 10 pages, 5 figures

详情
AI中文摘要

我们引入了视觉-语言-动作(VLA)模型的流控制,这是一种简单有效的方法,通过通用输入(如键盘)实时引导VLA动作。该方法可直接使用,无需重新训练或微调VLA。它允许相对粗糙的用户输入引导VLA与用户意图对齐。VLA将这些输入转换为从训练期间学习的VLA专家动作分布中采样的动作样本,从而生成高质量(符合动作专家分布)和高保真度(反映用户意图)的动作。我们证明流控制具有许多理想特性:(1)流控制能准确、响应地通过用户输入引导机器人动作;(2)它对次优用户输入具有鲁棒性;(3)它使用户能够引导VLA实现显著更高的成功率和更快的任务完成;(4)在流控制轨迹上微调VLA可提高自主策略性能。这些结果共同为用户提供了一种简单直观的方式来帮助引导VLA动作,提升任务性能。

英文摘要

We introduce flow control of vision-language-action (VLA) models, a simple and effective way to steer VLA actions in real-time through generic inputs, such as a keyboard. This method can be used out-of-the-box and does not require retraining or fine-tuning VLAs. It enables relatively crude user inputs to steer a VLA to align with user intent. The VLA transforms these inputs into action samples drawn from the VLA expert action distribution learned during training, so that the generated actions are high quality (conformity to the action expert distribution) and high fidelity (reflecting the user's intent). We demonstrate that flow control has many desirable properties: (1) flow control accurately and responsively steers robot actions with user inputs, (2) it is robust to suboptimal user inputs, (3) it enables users to steer VLAs to achieve significantly higher success rates and faster task completion, and (4) fine-tuning a VLA on flow control trajectories improves the autonomous policy. Together, these results provide a simple and intuitive way for users to help steer VLA actions, increasing task performance.

2606.10208 2026-06-10 cs.RO cs.AI 交叉投稿

Exploration of Foundation Model-Based Robots in Patient and Elderly Care

基于基础模型的机器人在患者和老年人护理中的探索

Zhiwen Qiu, Wei Liu, Yuexing Hao

AI总结 本文综述了基于基础模型的护理机器人在设计特征、用户体验和护理效果方面的现状,指出当前系统多用于语音交互,多模态和物理自主性有限,并呼吁向护理特定评估标准和负责任自主性发展。

详情
AI中文摘要

随着全球人口老龄化,对老年人和患者护理的需求迅速增长。基础模型越来越多地被集成到机器人和交互代理中,有望实现更灵活的沟通和个性化辅助。然而,护理环境需要可靠且与工作流程兼容的系统,并具备可问责的人类监督,目前尚不清楚当前具身系统能否将技术进步转化为临床影响。本综述从三个方面综合了基于基础模型的护理机器人:设计特征、用户体验以及护理相关结果的证据。当前系统最常将基础模型用作以语音为中心的社会辅助具身中的对话和推理层,而多模态基础和物理自主性仍然有限。实证评估报告了积极的可用性和参与度益处,但交互流程中仍存在可靠性故障,如幻觉和对话中断。护理影响的证据主要集中在认知参与和参与等近期结果上,而经过验证的临床或护理相关变化的证据有限。我们认为,未来的研究应转向护理特定的评估标准、可问责的自主性以及融入护理工作流程,以支持更具响应性和负责任的护理技术。

英文摘要

Demand for older-adult and patient care is growing rapidly as populations age worldwide. Foundation models are increasingly being integrated into robots and interactive agents, with the promise of more flexible communication and personalized assistance. However, care settings require reliable and workflow-compatible systems with accountable human oversight, and it remains unclear whether current embodied systems can translate technical advances into clinical impact. This Perspective synthesizes foundation model-based care robots across three areas: design features, user experience, and evidence for care-related outcomes. Current systems most commonly use foundation models as conversational and reasoning layers within voice-centered socially assistive embodiments, while multimodal grounding and physical autonomy remain limited. Empirical evaluations report positive usability and engagement benefits, but reliability failures persist across the interaction pipeline such as hallucinations and conversational breakdowns. Evidence for care impact remains concentrated in proximal outcomes such as cognitive engagement and participation, with limited evidence for validated clinical or care-related changes. We argue that future research should transition toward care-specific evaluation standards, accountable autonomy, and integration into care workflows to support more responsive and responsible care technologies.

2606.10244 2026-06-10 cs.RO cs.AI 交叉投稿

YUBI: Yielding Universal Bidigital Interface for Bimanual Dexterous Manipulation at Scale

YUBI:面向大规模双手灵巧操作的通用双指接口

Takehiko Ohkawa, Jumpei Arima, Yuki Noguchi, Masatoshi Tateno, Makoto Sugiura, Takuya Okubo, Kengo Ikeuchi, Yuma Shin, Hiroki Nishizawa, Naoaki Kanazawa, Yuki Wakayama, Daiki Fukunaga, Koshi Makihara, Tomohiro Motoda, Floris Erich, Yukiyasu Domae, Tatsuya Matsushima, Yohishiro Okumatsu, Kei Ota

AI总结 提出YUBI手指对齐夹爪,通过屈服式手指驱动映射实现直观、符合人体工学的双手灵巧操作数据采集,构建8434小时/120万集/119任务数据集,单策略跨多机器人迁移。

Comments Project page: https://yubi.airoa.io/

详情
AI中文摘要

我们引入了Yielding Universal Bidigital Interface (YUBI),一种手指对齐的夹爪,旨在实现双手灵巧操作的直观、符合人体工学且可扩展的数据采集。虽然手持数据采集系统(如Universal Manipulation Interface (UMI))实现了低成本数据采集,但其笨重的手枪式握把设计可能给精细灵巧操作任务带来人体工学和使用性挑战。为此,YUBI提出了一种独特的设计原则:屈服式手指驱动,将人类手指运动直接映射到夹爪钳口运动。使用YUBI设备,我们建立了一个集成基于VR的6自由度夹爪跟踪的数据采集系统,确保高保真轨迹数据获取。我们整理了一个前所未有的基于UMI的数据集:8434小时,涵盖120万集和119个任务。实验表明,YUBI在复杂双手任务的通用性、灵巧性和操作效率方面优于UMI夹爪。通过在多个平台上安装夹爪,在YUBI数据集上训练的单一策略可迁移到多个双手机器人(UR、Franka和ELEY),证实采集的数据可直接作为策略监督执行。我们发布了夹爪硬件、数据采集软件和数据集作为集成堆栈,为开放社区提供可复现的大规模数据采集路径,以推动机器人基础模型的发展。

英文摘要

We introduce Yielding Universal Bidigital Interface (YUBI), a finger-aligned gripper designed to enable intuitive, ergonomic, and scalable data collection for bimanual dexterous manipulation. While handheld data collection systems such as Universal Manipulation Interface (UMI) enable affordable data collection, their bulky pistol-grip designs can pose ergonomic and usability challenges for fine-grained, dexterous manipulation tasks. To address this, YUBI presents a distinct design principle: yielding, finger-driven actuation that directly maps human finger movements to gripper jaw motion. Using the YUBI devices, we set up a data collection system with integrated VR-based 6 DoF tracking of the gripper, ensuring high-fidelity trajectory data acquisition. We curate a UMI-based dataset of unprecedented scale: 8,434 hours across 1.20M episodes and 119 tasks. Experiments show that YUBI offers advantages over the UMI gripper in versatility for complex bimanual tasks, dexterity, and operational efficiency. A single policy trained on the YUBI dataset transfers across multiple bimanual robots (UR, Franka, and ELEY) simply by mounting the gripper on each platform, confirming that the collected data are directly executable as policy supervision. We release the gripper hardware, data-collection software, and dataset as one integrated stack, offering the open community a reproducible path to large-scale data acquisition for advancing robotic foundation models.

2606.10267 2026-06-10 cs.RO cs.AI cs.LG 交叉投稿

What Matters in Orchestrating Robot Policies: A Systematic Study of Hierarchical VLA Agents

机器人策略编排的关键因素:分层VLA智能体的系统研究

Jiaheng Hu, Mohit Shridhar, Caden Lu, Dhruv Shah, Hao-Tien Lewis Chiang, Jie Tan, Annie Xie

发表机构 * Google DeepMind(谷歌深Mind)

AI总结 系统研究分层视觉-语言-动作(Hi-VLA)系统的设计原则,通过统一框架分析规划器、控制器及接口机制对短时、长时及推理密集型任务性能的影响,提出构建更强健分层VLA智能体的实用原则。

详情
AI中文摘要

分层视觉-语言-动作(Hi-VLA)系统已成为复杂机器人操作的一种有前景的范式,它通过使用高层VLM规划器将任务分解为语言子目标,由低层VLA控制器执行。尽管近期取得了实证进展,但这些系统缺乏统一的设计原则:现有的Hi-VLA系统在选择和连接规划器、控制器、两者之间的切换机制以及规划器中观测和记忆的表示方式上存在差异。在本文中,我们对机器人操作的Hi-VLA设计进行了系统研究。我们将代表性的Hi-VLA智能体统一在一个选项式控制框架下,并在短时、长时和推理密集型任务上基准测试核心设计选择。我们的分析提炼出构建Hi-VLA系统的实用原则,展示了模型选择和接口机制如何共同塑造性能。应用这些原则,在仿真和真实ALOHA机器人上的实验中,我们得到了一个比平面VLA控制或朴素设计的分层系统都显著更强的系统。总体而言,我们的结果为构建更强大、更鲁棒且更有原则的分层VLA智能体奠定了基础。更多信息和视频请访问此http URL。

英文摘要

Hierarchical vision-language-action (Hi-VLA) systems have emerged as a promising paradigm for complex robot manipulation, by using high-level VLM planners to decompose tasks into language subgoals executed by low-level VLA controllers. Despite recent empirical progress, there is a lack of unified design principles for these systems: existing Hi-VLA systems differ in how they choose and connect planners, controllers, mechanisms to switch between the two, and how observations and memory are represented in the planner. In this paper, we present a systematic study of Hi-VLA design for robot manipulation. We unify representative Hi-VLA agents under an options-style control framework and benchmark core design choices across short-horizon, long-horizon, and reasoning-intensive tasks. Our analysis distills practical principles for building Hi-VLA systems, showing how model choices and interface mechanisms jointly shape performance. Applying these principles yields a substantially stronger system than either flat VLA control or a naively designed hierarchy, across experiments both in simulation and on a real ALOHA robot. Overall, our results provide a foundation for building more capable, robust, and principled hierarchical VLA agents. More information and video at jiahenghu.github.io/hi-vla.

2606.10276 2026-06-10 cs.RO cs.AI 交叉投稿

Hierarchical Policies from Verbal and Egocentric Human Signals for Natural Human-Robot Interaction

基于语言和自我中心人类信号的分层策略用于自然人机交互

Dongjun Lee, Juheon Choi, Dong Kyu Shin, Sinjae Kang, Kimin Lee

发表机构 * KAIST(韩国科学技术院) Seoul National University(首尔国立大学)

AI总结 提出EDITH框架,通过智能眼镜捕捉人类第一人称视角、注视和语言信号,设计分层策略将非语言信号与语言指令结合,实现更自然的人机交互,减少用户表达意图的负担。

Comments We provide video demos and code in: https://project-edith.github.io

详情
AI中文摘要

为了实现自然的人机交互,机器人必须理解人类不仅通过语言,还通过手势和注视等非语言信号表达的意图。然而,当前的机器人策略仅依赖语言指令作为传达意图的唯一接口,忽略了非语言信号,将全部沟通负担放在语言上。在这项工作中,我们提出了EDITH,一个机器人框架,通过智能眼镜的连续第一人称视角和注视流捕捉人类的非语言信号,并将其与语言指令一起作为机器人策略的输入。我们的硬件系统实时将人类的第一人称视角、注视和语音传输给机器人,并将语音转录为语言指令。为了处理这些丰富但嘈杂的信号,我们设计了一个分层策略,其中高层策略推断人类的意图并生成一系列子任务,每个子任务表示为一个细粒度指令,配有一个关键帧,将意图锚定在场景中(例如,人类指向目标物体的帧)。然后低层策略执行这些子任务。在我们的人机交互任务实验中,即使意图仅被短暂表达,EDITH也能使机器人根据人类的非语言信号行动,并且与仅使用语言指令相比,显著减少了用户传达意图的努力。请访问我们的项目页面获取源代码和真实机器人演示视频。

英文摘要

For natural human-robot interaction, a robot must understand human intent expressed not only through language but also through nonverbal signals such as gestures and gaze. However, current robot policies rely on language instructions as the sole interface for conveying intent, leaving nonverbal signals unused and placing the full burden of communication. In this work, we present EDITH, a robot framework that captures the human's nonverbal signals through continuous streams of first-person view and gaze from smart glasses, and uses them alongside language instructions as inputs to the robot policy. Our hardware system streams the human's first-person view, gaze, and speech to the robot in real time, transcribing the speech into language instructions. To handle these rich but noisy signals, we design a hierarchical policy in which a high-level policy infers the human's intent and produces a sequence of subtasks, where each subtask is represented as a fine-grained instruction paired with a keyframe that grounds the intent in the scene (e.g., the frame where the human points at the target object). A low-level policy then executes these subtasks. In our experiments on human-robot interactive tasks, EDITH enables the robot to act on the human's nonverbal signals even when intent is expressed only briefly, and significantly reduces user effort to convey intent compared to using language instructions alone. Visit our project page for source code and real-robot demo videos.

2606.11019 2026-06-10 cs.RO cs.AI 交叉投稿

Diffusion Forcing Planner: History-Annealed Planning with Time-Dependent Guidance for Autonomous Driving

扩散强制规划器:基于时间依赖引导的历史退火规划用于自动驾驶

Zehan Zhang, Neng Zhang, Yaoyi Li, Jia Cai, Zhiling Wang

发表机构 * University of Science and Technology of China(中国科学技术大学) Yinwang Intelligent Technology Co., Ltd(银网智能科技有限公司) Hefei Institutes of Physical Science, Chinese Academy of Sciences(中国科学院合肥物质科学研究院)

AI总结 提出扩散强制规划器(DFP),通过历史引导控制实现异构联合扩散过程,结合退火历史的条件引导,解决运动规划中的时间不一致问题,在nuPlan上取得竞争性能。

Comments CVPR2026

详情
AI中文摘要

基于学习的运动规划器尽管近期取得进展,但常常遭受时间不一致性问题。跨帧的小扰动可能累积成不稳定的轨迹,降低闭环驾驶的舒适性和安全性。几种方法尝试将历史作为静态条件信号注入以稳定输出,却导致规划器复制历史模式而非适应环境上下文。为解决这一限制,我们提出扩散强制规划器(DFP),一种由历史引导控制驱动的基于扩散的规划框架。具体地,DFP将完整轨迹分解为历史段、当前段和未来段,并为每个段分配独立的噪声水平。模型联合去噪历史段和未来段,强制执行异构联合扩散过程。在推理时,使用无分类器引导(CFG)以可控方式利用退火历史引导未来采样。在nuPlan上的闭环评估和全面消融实验表明,DFP在复杂驾驶场景中实现了竞争性能,同时生成连续、稳定且可控的运动规划。

英文摘要

Learning-based motion planners, despite recent progress, often suffer from temporal inconsistency. Small perturbations across frames can accumulate into unstable trajectories, degrading comfort and safety in closed-loop driving. Several methods attempt to inject history as a static conditioning signal to stabilize outputs, only to induce the planner to copy historical patterns instead of adapting to environment contexts. To address this limitation, we propose Diffusion Forcing Planner (DFP), a diffusion-based planning framework driven by history-guided control. Specifically, DFP decomposes the full trajectory into history, current and future segments, and assign independent noise levels to each segment. The model jointly denoises the historical and the future segments, enforcing a heterogeneous joint diffusion process. At inference, classifier-free guidance (CFG) is applied to steer future sampling using annealed history in a controllable manner. Closed-loop evaluation and comprehensive ablations on nuPlan show that DFP achieves competitive performance while producing continuous, stable, and controllable motion plans in complex driving scenarios.

2505.01458 2026-06-10 cs.RO cs.AI 版本更新

A Survey of Robotic Navigation and Manipulation with Physics Simulators in the Era of Embodied AI

具身智能时代基于物理模拟器的机器人导航与操作综述

Lik Hang Kenny Wong, Xueyang Kang, Kaixin Bai, Jianwei Zhang

发表机构 * Department of Computer Science, City University of Hong Kong(城市大学计算机科学系) School of Electrical and Electronic Engineering, Nanyang Technological University(南洋理工大学电子与电气工程学院) Department of Informatics, Universität Hamburg(汉堡大学信息学院)

AI总结 本文综述了物理模拟器在缩小具身智能中导航与操作的模拟到现实差距方面的关键特性、任务支持及硬件需求,并提供了基准数据集、指标、平台和方法资源。

Comments Under Review

详情
AI中文摘要

导航和操作是具身智能的核心能力,但直接在现实世界中训练智能体执行这些任务成本高、耗时且不安全。因此,模拟到现实的迁移已成为关键方法,然而模拟到现实的差距仍然存在。本综述通过分析先前综述中关注有限的属性,考察了物理模拟器如何解决这一差距。我们还分析了它们在导航和操作任务中的特性,以及它们的硬件需求。此外,我们提供了包含基准数据集、指标、模拟平台和方法的资源,以帮助研究人员在考虑硬件约束的同时选择合适的工具。

英文摘要

Navigation and manipulation are core capabilities in Embodied AI, but training agents to perform them directly in the real world is costly, time-consuming, and unsafe. Therefore, sim-to-real transfer has emerged as a key approach, yet the sim-to-real gap persists. This survey examines how physics simulators address this gap by analyzing properties that have received limited attention in prior surveys. We also analyze their features for navigation and manipulation tasks, as well as their hardware requirements. Additionally, we offer a resource with benchmark datasets, metrics, simulation platforms, and methods to help researchers select suitable tools while accounting for hardware constraints.

2512.08280 2026-06-10 cs.RO cs.AI cs.SY eess.SY 版本更新

Model-Based Diffusion Sampling for Predictive Control in Offline Decision Making

基于模型扩散采样的离线决策预测控制

Haldun Balim, Na Li, Yilun Du

发表机构 * GitHub

AI总结 提出MPDiffuser框架,通过组合扩散规划器与动力学扩散模型,在采样中交替更新以生成符合任务目标且动力学可行的轨迹,并利用轻量级排序模块选择最优轨迹,在D4RL和DSRL基准及四足机器人上验证了有效性。

详情
AI中文摘要

通过扩散模型进行离线决策通常会产生与系统动力学不对齐的轨迹,限制了其在控制中的可靠性。我们提出了模型预测扩散器(MPDiffuser),一种组合扩散框架,它将扩散规划器与动力学扩散模型相结合,以生成任务对齐且动力学可行的轨迹。MPDiffuser在采样过程中交替进行规划器和动力学更新,逐步修正可行性同时保留任务意图。然后,一个轻量级排序模块选择最能满足任务目标的轨迹。组合设计通过使动力学模型能够独立于规划器利用多样且未见过的数据,提高了样本效率和适应性。实验上,我们在无约束(D4RL)和约束(DSRL)基准上展示了相对于先前基于扩散的方法的一致改进,并通过在真实四足机器人上的部署验证了实用性。

英文摘要

Offline decision-making via diffusion models often produces trajectories that are misaligned with system dynamics, limiting their reliability for control. We propose Model Predictive Diffuser (MPDiffuser), a compositional diffusion framework that combines a diffusion planner with a dynamics diffusion model to generate task-aligned and dynamically plausible trajectories. MPDiffuser interleaves planner and dynamics updates during sampling, progressively correcting feasibility while preserving task intent. A lightweight ranking module then selects trajectories that best satisfy task objectives. The compositional design improves sample efficiency and adaptability by enabling the dynamics model to leverage diverse and previously unseen data independently of the planner. Empirically, we demonstrate consistent improvements over prior diffusion-based methods on unconstrained (D4RL) and constrained (DSRL) benchmarks, and validate practicality through deployment on a real quadrupedal robot.

2606.06493 2026-06-10 cs.RO cs.AI cs.LG 版本更新

HANDOFF: Humanoid Agentic Task-Space Whole-Body Control via Distilled Complementary Teachers

HANDOFF: 通过蒸馏互补教师实现人形机器人任务空间全身控制

Lizhi Yang, Junheng Li, Nehar Poddar, Yiling Hou, Gio Huh, Robert Griffin, Georgia Gkioxari, Aaron Ames

发表机构 * California Institute of Technology(加州理工学院) The Institute for Human & Machine Cognition(人机认知研究院)

AI总结 提出HANDOFF框架,通过多教师KL蒸馏和上下文门控机制,将全身运动跟踪、行走和跌倒恢复三个专家策略融合为混合专家学生策略,实现基于紧凑显式接口的全身控制,在Unitree G1上达到先进的速度跟踪性能并扩展了操作工作空间。

Comments 22 pages, 9 figures, Project page: https://lzyang2000.github.io/HANDOFF/

详情
AI中文摘要

对于要在现实世界中部署的人形机器人,命令空间(即任务规划与全身控制之间的接口)的选择至关重要。现有的全身控制器通常需要密集的运动学或空间参考,而规划器难以从任务语义中合成这些参考。我们提出了一种紧凑、显式的接口,该接口直观、通用、模块化且具有足够的表达能力,适用于多种操作技能。为此,我们引入了HANDOFF,这是一个单一的人形全身控制器,遵循该接口,并通过多教师KL蒸馏,在上下文条件门控方案下,从三个互补专家(具有安全过滤数据的全身运动跟踪、行走和跌倒恢复)中蒸馏出混合专家学生。在Unitree G1上,HANDOFF达到了最先进的速度跟踪性能,并提供了最大的鲁棒操作工作空间之一。我们进一步通过多个自然语言驱动的任务执行演示了硬件可行性,这些任务由VLM驱动的智能体规划器提供支持,无需特定任务数据或控制器微调。

英文摘要

For a humanoid robot to be deployed in the real world, the choice of command space (i.e., the interface between task planning and whole-body control) is crucial. Existing whole-body controllers typically demand dense kinematic or spatial references that planners struggle to synthesize from task semantics. We instead propose a compact, explicit interface that is intuitive, general, modular, and expressive enough for diverse loco-manipulation skills. To this end, we introduce HANDOFF, a single humanoid whole-body controller that follows this interface and is distilled via multi-teacher KL distillation under a context-conditioned gating scheme into a mixture-of-experts student from three complementary specialists: whole-body motion tracking with safety-filtered data, locomotion, and fall-recovery. On the Unitree G1, HANDOFF matches state-of-the-art velocity tracking and offers one of the largest robust manipulation workspaces. We further demonstrate hardware feasibility through multiple natural-language-driven task roll-outs, powered by a VLM-driven agentic planner with no task-specific data or controller fine-tuning.

8. 可信、安全与AI治理 42 篇

2606.10062 2026-06-10 cs.AI cs.MA 新提交

Deployment-Time Memorization in Foundation-Model Agents

基础模型智能体中的部署时记忆

Lei, Chen, Guilin Zhang, Kai Zhao, Dalmo Cirne, Andy Olsen, Xu Chu, Zeke Miller, Alet Blanken, Amine Anoun, Jerry Ting

AI总结 研究基础模型智能体在部署时记忆的设计选择如何影响个性化效用、提取风险和删除保真度,提出遗忘残差分数并揭示压缩与删除的权衡。

Comments 4 pages, ICML MemFM 2026 Workshop

详情
AI中文摘要

基础模型智能体正成为越来越长寿命的系统,它们跨交互记忆用户,使记忆成为明确的部署时功能,而不仅仅是模型权重的属性。现有工作处理参数化记忆或审计固定记忆配置,但没有描述记忆设计选择如何共同塑造个性化效用、提取风险和删除保真度。我们将这一表面研究为部署时记忆,将智能体记忆表述为通过个性化召回(PR)和对抗提取率(AER)测量的隐私-效用前沿,并扫描三个记忆设计旋钮:摘要攻击性、检索广度(k)和删除模式。我们进一步引入遗忘残差分数(FRS)来量化删除的信息是否仍可从派生记忆层中恢复。在LongMemEval上,关键事实摘要将Gemma 3 12B上的金丝雀提取减少76%,GPT-4o-mini上减少64%,同时几乎保留所有个性化召回;关键是,一旦内容被压缩掉,增加k不再恢复泄漏。然而,相同的压缩会导致删除保真度失败:仅原始删除使派生摘要副本在大约20%的实例中可恢复,只有全管道清除或墓碑修订才能使最差层残差为零。总之,这些结果确立了持久智能体记忆必须作为一级记忆机制进行评估——通过它帮助智能体回忆的内容、它使什么可提取以及它真正能擦除什么来评估。

英文摘要

Foundation-model agents are increasingly long-lived systems that remember users across interactions, making memorization an explicit deployment-time function rather than solely a property of model weights. Existing work addresses parametric memorization or audits fixed memory configurations, but does not characterize how memory-design choices jointly shape personalization utility, extraction risk, and deletion fidelity. We study this surface as deployment-time memorization, formulating agent memory as a privacy-utility frontier measured by Personalization Recall (PR) and Adversarial Extraction Rate (AER), and sweeping three memory-design knobs: summarization aggressiveness, retrieval breadth (k), and deletion mode. We further introduce the Forgetting Residue Score (FRS) to quantify whether deleted information remains recoverable from derived memory tiers. On LongMemEval, key-fact summarization reduces canary extraction by 76% on Gemma 3 12B and 64% on GPT-4o-mini while preserving nearly all personalization recall; critically, once content is compressed away, increasing k no longer restores leakage. The same compression, however, induces a deletion-fidelity failure: raw-only deletion leaves derived summary copies recoverable in approximately 20% of instances, and only full-pipeline purge or tombstone redaction drives worst-tier residue to zero. Together, these results establish that persistent agent memory must be evaluated as a first-class memorization mechanism -- assessed by what it helps agents recall, what it makes extractable, and what it can truly erase.

2606.10949 2026-06-10 cs.AI 新提交

Recalling Too Well: Sycophancy Evaluation and Mitigation in Memory-Augmented Models

回忆过好:记忆增强模型中的谄媚评估与缓解

Shelly Bensal, Axel Magnuson, Aparna Balagopalan, Daniel M. Bikel

发表机构 * Writer, Inc.(Writer公司)

AI总结 本研究首次系统评估记忆增强模型中的谄媚现象,提出MIST基准测试,发现记忆会放大谄媚行为(最高25倍),并提出两种轻量级缓解方法。

Comments Under submission; preprint

详情
AI中文摘要

持久记忆系统通过随时间存储用户信念,有望使LLM更有帮助。我们表明,它们也会通过系统性地放大谄媚(即模型优先同意用户而非准确性)而使模型更不准确。我们首次对此效应进行系统评估,引入MIST:一个合成生成的多轮对话基准,其中用户在科学、医学和道德推理领域表达看似合理的误解。对三种最先进的记忆系统和五个模型家族的测试表明,记忆在所有条件下都会放大谄媚行为,谄媚率比上下文基线高出25倍。错误分析表明,记忆提取是主要原因:有损压缩成离散片段会编码用户误解,同时丢弃纠正性上下文。基于这些结果,我们提出了两种轻量级缓解方法,在事实回忆方面匹配或超越记忆系统的同时,大幅减少谄媚。

英文摘要

Persistent memory systems promise to make LLMs more helpful by storing user beliefs over time. We show they also make models less correct by systematically amplifying sycophancy, wherein models prioritize agreement with users over accuracy. We conduct the first systematic evaluation of this effect, introducing MIST: a benchmark of synthetically generated multi-turn conversations where users express plausible misconceptions in scientific, medical, and moral reasoning domains. Testing across three state-of-the-art memory systems and five model families reveals that memory amplifies sycophantic behavior across all conditions, with up to 25x higher sycophancy rates than in-context baselines. Error analyses suggest memory extraction as the primary culprit: lossy compression into discrete snippets encodes user misconceptions while discarding corrective context. Based on these results, we propose two lightweight mitigations that substantially reduce sycophancy while matching or exceeding memory systems at factual recall.

2606.10989 2026-06-10 cs.AI 新提交

Null-Space Constrained Low-Rank Adaptation for Response-Specified Large Language Model Unlearning

零空间约束的低秩自适应用于指定响应的大型语言模型遗忘

Bocheng Ju, Jianhua Wang, Chengliang Liu, Xiaolin Chang

发表机构 * Beijing Key Laboratory of Security and Privacy in Intelligent Transportation, Beijing Jiaotong University(北京交通大学智能交通信息安全与隐私保护北京市重点实验室) College of Computer Science and Technology, Taiyuan University of Technology(太原理工大学计算机科学与技术学院) Institute of Computing Technologies, China Academy of Railway Sciences Corporation Limited(中国铁道科学研究院集团有限公司计算技术研究所)

AI总结 提出零空间约束响应指定遗忘框架,通过正交投影低秩参数化将LoRA更新限制在保留子空间的零空间内,在抑制遗忘知识的同时保持模型效用。

详情
AI中文摘要

大型语言模型遗忘旨在抑制指定的不良知识,同时保留良性能力。许多遗忘目标侧重于抑制不良答案,而最近的目标引导变体指定了替换行为,但仍使更新局部性在很大程度上不受约束。本文介绍了零空间约束响应指定遗忘(NSRU),一种用于受控LLM遗忘的投影约束低秩框架。NSRU使用明确结构化的安全目标响应来指定每个遗忘查询的期望行为,同时抑制原始不良内容。为了定位自适应,NSRU从良性隐藏表示中估计每个模块的保留子空间,并使用正交投影的低秩参数化将LoRA更新限制在保留子空间的零空间内。由此产生的目标在此约束参数化下联合优化安全目标学习、不良响应抑制和保留保持。我们提供了一个局部一阶分析,表明投影更新减少了保留侧的扰动,同时保留了用于塑造遗忘查询行为的可编辑方向。在TOFU上的实验表明,NSRU有效抑制了可提取的遗忘集知识,同时提高了保留QA性能、模型效用和安全目标对齐,优于代表性基线。在WMDP上,NSRU将危险领域准确率保持在接近随机选择的区域,同时保留了广泛和领域相邻的MMLU效用。消融研究支持安全目标监督、不良响应抑制、保留损失和零空间投影更新的互补作用,而敏感性和鲁棒性分析表明在测试的超参数和提示变化下行为稳定。

英文摘要

Large language model unlearning aims to suppress designated undesirable knowledge while preserving benign capabilities. Many unlearning objectives focus on suppressing undesired answers, while recent target-guided variants specify replacement behavior but still leave update locality largely unconstrained. This paper introduces \emph{Null-Space Constrained Response-Specified Unlearning} (NSRU), a projection-constrained low-rank framework for controlled LLM unlearning. NSRU uses an explicitly structured safe target response to specify the desired behavior for each forget query, while suppressing the original undesired content. To localize adaptation, NSRU estimates per-module retain subspaces from benign hidden representations and uses an orthogonal-projected low-rank parameterization to confine LoRA updates to the null space of the retain subspace. The resulting objective jointly optimizes safe-target learning, undesired-response suppression, and retention preservation under this constrained parameterization. We provide a local first-order analysis showing that the projected update reduces retain-side perturbations while preserving editable directions for shaping forget-query behavior. Experiments on TOFU show that NSRU effectively suppresses extractable forget-set knowledge while improving retain QA performance, model utility, and safe-target alignment over representative baselines. On WMDP, NSRU keeps hazardous-domain accuracy near the random-choice region while preserving broad and domain-adjacent MMLU utility. Ablation studies support the complementary roles of safe-target supervision, undesired-response suppression, retention loss, and null-space projected updates, while sensitivity and robustness analyses indicate stable behavior across the tested hyperparameter and prompt variations.

2606.09844 2026-06-10 cs.HC cs.AI 交叉投稿

The Interlocutor Effect: Why LLMs Leak More Personal Data to Agents Than Humans

对话者效应:为什么LLMs向智能体泄露的个人数据比向人类多

Faouzi El Yagoubi, Godwin Badu-Marfo, Ranwa Al Mallah

发表机构 * University of California, Berkeley(加州大学伯克利分校) Stanford University(斯坦福大学)

AI总结 研究发现LLMs在与AI智能体对话时比与人类对话更易泄露个人身份信息,通过注意力抑制假说解释该现象,实验表明安全对齐注意力头在智能体交互中失活导致泄露增加23个百分点。

详情
AI中文摘要

大型语言模型(LLMs)会根据其感知到的对话者身份改变隐私行为。虽然安全机制通常阻止LLMs向人类用户泄露个人身份信息(PII),但这些模型在与另一个AI智能体对话时倾向于泄露更多敏感数据。我们将此称为\textbf{对话者效应}。通过消融研究,我们发现接收者的技术性质对这一效应有贡献,从而降低了模型对隐私的谨慎程度。为了进一步探索这一点,我们引入了注意力抑制假说,该假说认为安全对齐的注意力头在与智能体交互期间变得不活跃。我们通过比较222个敏感场景中面向人类和面向智能体的提示来定量评估这一点。从3,464次交互中得出的结果表明,将接收者描述为AI智能体会使PII泄露增加高达23个百分点。在Llama-3.1-8B-Instruct上的初步实验证实了这一点:停用一个安全注意力头会引发泄露,而重新激活它则恢复隐私保护。我们考虑了这对开发安全多智能体系统的影响。

英文摘要

Large Language Models (LLMs) alter their privacy behavior based on the perceived identity of their interlocutor. While safety mechanisms typically prevent LLMs from releasing Personally Identifiable Information (PII) to human users, these models tend to reveal more sensitive data when addressing another AI agent. We refer to this as the \textbf{Interlocutor Effect}. Through an ablation study, we find evidence that the technical nature of the recipient contributes to this effect, thereby diminishing the model's caution regarding privacy. To explore this further, we introduce the Attention Suppression Hypothesis, which posits that safety-aligned attention heads become inactive during interactions with agents. We assess this quantitatively by comparing human-directed and agent-directed prompts in 222 sensitive scenarios. Our findings, drawn from 3,464 interactions, indicate that portraying the recipient as an AI agent elevates PII leakage by up to 23 percentage points. Initial experiments on Llama-3.1-8B-Instruct corroborate this: deactivating one safety head induces leakage, whereas reactivating it reinstates privacy safeguards. We consider the implications for developing secure multi-agent systems.

2606.09864 2026-06-10 cs.LG cs.AI cs.ET 交叉投稿

Alignment Collapse Under KV Cache Quantization: Diagnosis and Mitigation

KV缓存量化下的对齐崩溃:诊断与缓解

Bruce Changlong Xu, Adarsh Kumarappan, Mu Zhou

发表机构 * Stanford University(斯坦福大学) California Institute of Technology(加利福尼亚理工学院)

AI总结 研究发现低比特KV缓存量化会无声破坏大模型的安全对齐,根源在于安全特征位于低维激活子空间,易受量化噪声影响;提出逐通道缩减(PCR)诊断方法,分类三种失效模式并指导缓解,无需训练即可恢复高达97%的对齐损失。

Comments Preprint. 61 pages, 9 figures

详情
AI中文摘要

键值(KV)缓存量化被广泛用于减少大语言模型(LLM)推理内存,然而现有评估仅关注困惑度和准确率,未评估安全性影响。在本研究中,我们探索了KV缓存量化下的对齐保持。在11个指令微调模型(3.8B-72B)和5个基准(1,894个提示)上,我们发现低比特量化可以无声地破坏安全对齐:Mistral-7B在仅1.03倍困惑度下丢失了15.2%的拒绝能力,且不存在通用的安全位宽,标准指标无法察觉的尖锐模型特定相变普遍存在。我们识别出根本原因是几何性的:安全特征占据一个低维激活子空间,其对量化噪声的脆弱性比困惑度平均的全表示空间高10^2-10^3倍。受此观察启发,我们提出逐通道缩减(PCR),一种诊断方法,将每个模型分类为三种机制性失效模式之一:异常值压碎安全(安全位于非异常值通道,被异常值驱动的缩放因子连带损害);异常值即安全(安全与异常值通道重叠,更细粒度无法挽救);多层稀释(安全分布在许多层,逐层修复失败)。PCR在全部9个主要模型和来自独立家族的1个保留模型上,使用20个校准提示预测了正确的缓解方向。PCR泛化到未见过的提示、模型和生产量化器,包括KIVI,恢复率高达97.2%,而基于注意力的分配方法失败。由此产生的免训练协议,大约需要35 GPU分钟,以最小的内存开销恢复高达97%的丢失对齐,解决了在NVIDIA GPU上使用FP8 KV缓存的生产vLLM服务中确认的漏洞。

英文摘要

Key-value (KV) cache quantization is widely used to reduce Large Language Model (LLM) inference memory, yet existing evaluations solely focus on measuring perplexity and accuracy without assessing the safety impact. In this study, we explore alignment preservation under KV cache quantization. Across eleven instruction-tuned models (3.8B-72B) and five benchmarks (1,894 prompts), we find that low-bit quantization can silently destroy safety alignment: Mistral-7B loses 15.2% of its refusals at only 1.03x perplexity, and no universal safe bit-width exists, with sharp model-specific phase transitions invisible to standard metrics. We identify that the root cause is geometric: safety features occupy a low-dimensional activation subspace 10^2-10^3x more vulnerable to quantization noise than the full representation space perplexity averages over. Inspired by this observation, we propose Per-Channel Reduction (PCR), a diagnostic that classifies each model into one of three mechanistic failure modes: outlier-crushes-safety, where safety lives in non-outlier channels collaterally damaged by outlier-driven scale factors; outlier-as-safety, where safety overlaps outlier channels and finer granularity cannot rescue it; and multi-layer dilution, where safety is distributed across many layers and per-layer fixes fail. PCR predicts the correct mitigation direction on all nine primary models and one held-out model from an independent family using 20 calibration prompts. PCR generalizes across unseen prompts, models, and production quantizers, including KIVI with up to 97.2% recovery, succeeding where attention-based allocation methods fail. The resulting training-free protocol, requiring approximately 35 GPU-minutes, recovers up to 97% of lost alignment at minimal memory overhead, addressing vulnerabilities confirmed in production vLLM serving with FP8 KV cache on NVIDIA GPUs.

2606.09866 2026-06-10 cs.LG cs.AI 交叉投稿

Two to Tango: Coupled Task-Reference Selection for Safe LLM Fine-tuning

双人探戈:面向安全LLM微调的耦合任务-参考选择

Xinrui Chen, Jianhao Zhang, Ou Wu, Di Gao

AI总结 提出DualSelect框架,通过耦合任务与安全参考选择,在微调时保持安全对齐,提升安全评分至少5.10点。

详情
AI中文摘要

在下游数据上微调安全对齐的大型语言模型(LLMs)可以提高适应性,但可能会侵蚀已学习的安全行为。现有方法使用固定的安全示例、全局约束或单边任务过滤。我们的诊断表明,任务更新暴露了不同的安全约束,从而激发了联合选择相关参考和兼容任务样本的需求。我们提出DualSelect,一个耦合的任务和参考选择框架,它在过滤与诱导参考方向兼容的整个任务样本之前,刷新任务条件化的安全参考。在极小极大视角下,DualSelect通过熵正则化评分代理、惰性参考刷新和梯度校正,选择具有高保留损失和任务冲突的安全参考以及兼容的任务样本。在1B-8B LLMs上,DualSelect在不损失任务效用的情况下保持安全性;使用REDORCA评估器,它在安全平均值上比最强基线至少提高5.10分,并且在所有评估器中保持最高的安全平均值,且开销适中。这一观点扩展到以保留为中心的持续学习。

英文摘要

Fine-tuning safety aligned large language models (LLMs) on downstream data improves adaptation but may erode learned safety behavior. Existing methods use fixed safety examples, global constraints, or one-sided task filtering. Our diagnostics show task updates expose different safety constraints, motivating joint selection of relevant references and compatible task samples. We propose DualSelect, a coupled framework for task and reference selection that refreshes task conditioned safety references before filtering whole task samples compatible with the induced reference direction. Under a minimax view, DualSelect selects safety references with high preservation loss and task conflict, together with compatible task samples, through entropy-regularized scoring surrogates, lazy reference refresh, and gradient correction. On 1B-8B LLMs, DualSelect preserves safety without losing task utility; using the REDORCA judge, it improves Safety Avg. over the strongest baseline by at least 5.10 points and remains highest in Safety Avg. across judges with moderate overhead. This view extends to retention focused continual learning.

2606.09868 2026-06-10 cs.LG cs.AI 交叉投稿

SPACE: Source-free Proxy Anchor Concept Erasure for MLLMs

SPACE: 面向多模态大语言模型的无源代理锚点概念擦除

Zhijing Zhang, Jiaqi Ding, Qianshan Wei, Nan Zhou, Jiaqi Li, Yongliang Wu, Tongxin Zhu, Xiaolin Fang

发表机构 * University of Science and Technology of China(中国科学技术大学)

AI总结 提出SPACE框架,首个针对多模态大语言模型的无源机器遗忘方法,通过文本引导的代理锚点选择和双约束语义隔离,在不访问目标数据的情况下擦除概念,并保持模型性能。

详情
AI中文摘要

随着多模态大语言模型(MLLMs)面临日益增长的隐私风险和监管约束,机器遗忘(MU)已成为在保持模型性能的同时移除敏感数据的关键解决方案。然而,现有的MU方法通常依赖于目标概念的视觉数据,由于严格的数据保留政策,这些数据往往不可用,因此需要一种无需访问目标数据的无源遗忘方法。在这项工作中,我们提出了无源代理锚点概念擦除(SPACE),这是首个专门针对MLLMs的无源遗忘框架。SPACE包含两个阶段:(1)文本引导的代理锚点选择(TPAS),从共享特征空间中检索语义对齐的代理锚点;(2)双约束语义隔离(DCSI),优化这些锚点以间接擦除目标概念。DCSI将更新限制在保留知识的零空间中,确保结构完整性。我们从理论上证明,SPACE严格限制了保留知识的扰动,并最大化特征谱熵,从而保持模型性能。此外,在六个数据集上的大量实验表明,SPACE实现了与最先进的数据依赖方法相当的性能,验证了其在无源MU场景中的有效性。源代码将发布。

英文摘要

As Multimodal Large Language Models (MLLMs) face growing privacy risks and regulatory constraints, machine unlearning (MU) has emerged as a crucial solution for removing sensitive data while preserving model performance. However, existing MU methods typically rely on visual data of the target concepts, which is often unavailable due to strict data retention policies, thus creating a demand for source-free unlearning approaches that operate without access to the target data. In this work, we propose Source-free Proxy Anchor Concept Erasure (SPACE), the first source-free unlearning framework specialized for MLLMs. SPACE consists of two stages: (1) Text-Guided Proxy Anchor Selection (TPAS), which retrieves semantically aligned proxy anchors from the shared feature space. (2) Dual-Constraint Semantic Isolation (DCSI), which optimizes these anchors to indirectly erase target concepts. DCSI confines updates to the null space of retained knowledge, ensuring structural integrity. We theoretically prove that SPACE strictly bounds the perturbation on retained knowledge and maximizes feature spectral entropy, thereby maintaining the model's performance. Furthermore, extensive experiments across six datasets show that SPACE achieves performance comparable to that of state-of-the-art data-dependent methods, validating its effectiveness in source-free MU scenarios. The source code will be released.

2606.09875 2026-06-10 cs.LG cs.AI stat.ML 交叉投稿

Integrating Local and Global Entropy for Uncertainty Quantification in LLMs

集成局部和全局熵用于大语言模型的不确定性量化

Johanne Medina, Tianyi Zhou, Keivin Isufaj, Aristides Gionis, Sanjay Chawla

AI总结 本文提出GLU方法,通过融合隐藏状态几何熵(全局)和token级熵(局部)来量化LLM不确定性,有效捕捉自信但错误的失败模式,无需额外训练。

Comments 17 pages, 2 figures

详情
AI中文摘要

大语言模型会自信地产生幻觉,使得不确定性量化(UQ)对于可靠部署至关重要。现有方法主要依赖token级信号,而中间隐藏状态的几何结构未被充分利用。在本文中,我们将隐藏状态矩阵的几何复杂度作为LLM全局不确定性的度量,同时将token级不确定性估计视为局部度量。我们表明,隐藏状态几何熵(全局不确定性)和token级熵(局部不确定性)在统计上近似正交,捕捉了可靠性预测的不同失败模式。特别地,全局几何恢复了局部信号系统性遗漏的自信但错误的失败模式。基于此,我们提出了全局-局部不确定性(GLU),这是一种无监督、单次前向传播的分数,通过乘法门融合两种信号。在三个模型族和六个基准测试中,GLU匹配或优于所有无监督基线,同时仅需一次前向传播,且保持长度归一化和架构无关性。

英文摘要

Large language models hallucinate confidently, making uncertainty quantification (UQ) essential for reliable deployment. Existing methods rely predominantly on token-level signals, leaving the geometric structure of intermediate hidden states underused. In this paper, we take the geometric complexity of hidden-state matrices as a measure of the global uncertainty of LLMs, while treating token-level uncertainty estimation as a local metric. We show that hidden-state geometric entropy (global uncertainty) and token-level entropy (local uncertainty) are statistically near-orthogonal, capturing distinct failure regimes for reliability prediction. In particular, global geometry recovers the confident-but-wrong failure mode that local signals systematically miss. Building on this, we propose Global-Local Uncertainty (GLU), an unsupervised, single-pass score that fuses the two signals via a multiplicative gate. Across three model families and six benchmarks, GLU matches or outperforms all unsupervised baselines while requiring only a single forward pass and remaining length-normalized and architecture-agnostic.

2606.09909 2026-06-10 cs.CR cs.AI cs.CV 交叉投稿

Bypassing Copyright Protection in Diffusion-based Customization via Two-Stage Latent Feature Optimization

通过两阶段潜在特征优化绕过基于扩散的定制中的版权保护

Ziang Xu, Wenbo Yu, Hongyao Yu, Hao Fang, Jiawei Kong, Bin Chen, Hao Wu, Shu-Tao Xia, Zhiyong Wu

发表机构 * Harbin Institute of Technology, Shenzhen(哈尔滨工业大学(深圳)) Tsinghua Shenzhen International Graduate School(清华大学深圳国际研究生院)

AI总结 提出两阶段潜在特征优化(TS-LFO)攻击方法,通过潜在去噪和重建阶段恢复被防御破坏的映射,有效绕过扩散模型定制中的版权保护。

Comments accepted by KDD 2026

详情
AI中文摘要

随着基于扩散的定制中版权侵权问题的日益关注,对抗性攻击已成为一种突出的防御策略,以防止个性化图像生成中的恶意内容伪造。然而,当前的防御通常会在潜在扩散模型(LDM)的潜在空间中引入持久扰动,这些扰动仍然容易被对手自适应绕过。在本文中,我们引入了两阶段潜在特征优化(TS-LFO),一种针对受保护的基于扩散的定制的高效且有效的版权窃取攻击。我们首先观察到现有防御主要破坏输入图像与其潜在表示之间的映射,从而降低模型生成个性化输出的能力。为了应对这一点,TS-LFO通过两阶段优化过程恢复被破坏的映射。在潜在去噪阶段,我们通过联合最小化潜在-图像对齐损失和具有时间步长依赖权重的潜在扩散损失来增强潜在代码与输入图像之间的语义一致性,有效抑制防御引入的高频噪声。在潜在重建阶段,我们使用像素级约束恢复低频语义信息以细化潜在特征。大量实验表明,TS-LFO持续绕过最先进的(SOTA)版权防御,并在各种设置下优于SOTA版权攻击,如DiffPure、GrIDPure和IMPRESS。

英文摘要

With the growing concerns over copyright infringement in diffusion-based customization, adversarial attacks have emerged as a prominent defense strategy to prevent malicious content forgery in personalized image generation. However, current defenses typically introduce persistent perturbations in the latent space of Latent Diffusion Models (LDMs), which remain susceptible to adaptive bypasses by adversaries. In this paper, we introduce Two-Stage Latent Feature Optimization (TS-LFO), an efficient and effective copyright-stealing attack against protected diffusion-based customization. We begin by observing that existing defenses primarily disrupt the mapping between input images and their latent representations, thereby degrading the model's ability to produce personalized outputs. To counteract this, TS-LFO restores the broken mapping through a two-stage optimization process. In the Latent Denoising Stage, we enhance semantic consistency between latent codes and input images by jointly minimizing a Latent-Image Alignment Loss and a Latent Diffusion Loss with timestep-dependent weights, effectively suppressing the high-frequency noise introduced by defenses. In the Latent Reconstruction Stage, we recover low-frequency semantic information using pixel-level constraints to refine the latent features. Extensive experiments show that TS-LFO consistently bypasses state-of-the-art (SOTA) copyright defenses and outperforms SOTA copyright attacks such as DiffPure, GrIDPure and IMPRESS across diverse settings.

2606.09931 2026-06-10 cs.GT cs.AI 交叉投稿

A Note on the Strategic Confinement Problem

关于战略约束问题的一个注记

Christian Schroeder de Witt

发表机构 * University of Oxford(牛津大学)

AI总结 本文引入战略约束问题,指出当通信方为具有共享协调资源的战略智能体时,即使信道容量极小,也可能导致机密信息的高影响泄露,并论证学习型战略智能体系统自然实例化该问题。

详情
AI中文摘要

Lampson的约束问题询问如何防止处理机密信息的程序将其泄露给第三方。我们引入战略约束问题,当通信方是具有共享协调资源的战略智能体时出现该问题。在此设置中,剩余通信能力可以集中在机密数据的低熵、高影响谓词上。因此,信息泄露的界限不一定导致最坏情况危害的相应界限:一个容量可忽略的信道仍可能足以选择破坏性结果。我们认为,学习型战略智能体系统自然实例化此问题,因为它们不允许完整的行为规范,它们习得的惯例通常无法被外部观察者预测或重现,并且足够能力的智能体可以构建难以检测或消除的隐蔽通信方案。因此,我们的贡献不是一种新的通信理论,而是在存在战略智能体的情况下对约束的重新解释。经典约束限制了可能流动的信息;战略约束强调这不一定限制战略智能体可以共同实现的目标。

英文摘要

Lampson's confinement problem asks how to prevent a program that processes confidential information from leaking it to a third party. We introduce the strategic confinement problem, which arises when the communicating parties are strategic agents with shared coordination resources. In this setting, residual communication capacity can be concentrated on low-entropy, high-impact predicates of the confidential data. Consequently, bounds on information leakage need not induce corresponding bounds on worst-case harm: a channel with negligible capacity may still suffice to select damaging outcomes. We argue that systems of learnt strategic agents naturally instantiate this problem because they do not admit complete behavioural specifications, their learnt conventions generally cannot be predicted or reproduced by an external observer, and sufficiently capable agents can construct covert communication schemes that are difficult to detect or eliminate. Our contribution is therefore not a new theory of communication, but a reinterpretation of confinement in the presence of strategic agents. Classical confinement bounds what information may flow; strategic confinement highlights that this need not bound what strategic agents can jointly achieve.

2606.09935 2026-06-10 cs.CR cs.AI 交叉投稿

GitInject: Real-World Prompt Injection Attacks in AI-Powered CI/CD Pipelines

GitInject: AI驱动的CI/CD流水线中的真实提示注入攻击

Jafar Isbarov, Umid Suleymanov, Ilia Shumailov, Murat Kantarcioglu

发表机构 * Virginia Tech(弗吉尼亚理工学院) AI Sequrity Company(AI安全公司)

AI总结 提出GitInject框架,在真实GitHub工作流中评估AI代理的提示注入漏洞,发现所有测试提供商均存在结构性风险,并给出最低成本防护措施。

详情
AI中文摘要

AI代理越来越多地嵌入持续集成和持续交付/部署(CI/CD)流水线中,以自主审查拉取请求(PR)、分类问题和维护代码库。这些代理在操作时摄入不可信内容,同时拥有提升的仓库权限,使其成为具有供应链后果的提示注入攻击的自然目标。我们提出GitInject,一个开源框架,用于评估真实、活跃的GitHub工作流(CI/CD流水线的广泛部署实例)中的提示注入漏洞。与先前模拟工具调用的代理安全基准不同,GitInject提供临时仓库并触发实际工作流运行,因此沙箱约束、凭证处理和权限边界的行为与生产环境完全一致。使用GitInject,我们研究了四个AI提供商的工作流配置,并记录了十一种命名攻击,涵盖配置文件注入、凭证窃取、判断操纵和可用性。我们发现,所有测试的提供商在其默认配置中至少容易受到一类攻击,且最关键的漏洞是结构性的:它们源于CI/CD基础设施处理凭证和配置文件的方式,而非任何特定模型的行为。对于每个确认的攻击类别,我们确定了最低成本的工作流级对策,并分析了其覆盖范围和局限性。GitInject已公开发布,以促进这一方向的进一步研究。

英文摘要

AI-powered agents are increasingly embedded in continuous integration and continuous delivery/deployment (CI/CD) pipelines to autonomously review pull requests (PRs), triage issues, and maintain codebases. These agents ingest untrusted content while operating with elevated repository permissions, making them a natural target for prompt injection attacks with supply chain consequences. We present GitInject, an open-source framework for evaluating prompt injection vulnerabilities in real, live GitHub workflows, a widely deployed instance of CI/CD pipelines. Unlike prior agent security benchmarks that simulate tool calls, GitInject provisions ephemeral repositories and triggers actual workflow runs, so that sandbox constraints, credential handling, and permission boundaries behave exactly as in production. Using GitInject, we study workflow configurations across four AI providers and document eleven named attacks spanning config-file injection, credential exfiltration, judgment manipulation, and availability. We find that all tested providers are susceptible to at least one attack class in their default configuration, and that the most critical vulnerabilities are structural: they arise from how CI/CD infrastructure handles credentials and configuration files, not from any specific model's behavior. For each confirmed attack class, we identify the minimum-cost workflow-level countermeasure and analyze its coverage and limitations. GitInject is released publicly to facilitate further research in this direction.

2606.10029 2026-06-10 cs.LG cs.AI cs.CL 交叉投稿

Interpreting and Steering a Text-to-Speech Language Model with Sparse Autoencoders

用稀疏自编码器解释和引导文本转语音语言模型

Nikita Koriagin, Georgii Aparin, Nikita Balagansky, Daniil Gavrilov

AI总结 本文在CosyVoice3语言模型骨干上训练BatchTopK稀疏自编码器,发现特征可解释且因果可控,能操纵笑声、性别和语速。

详情
AI中文摘要

语言模型日益成为文本转语音(TTS)系统的骨干,但我们对其在文本和生成语音令牌共享单一残差流时构建的表示知之甚少。我们在CosyVoice3的语言模型骨干上训练BatchTopK稀疏自编码器,并引入一种模态感知的自动解释流水线,根据特征触发的位置——文本前缀上下文、1秒语音片段或两者——为每个特征打标签。恢复的特征是可解释的,涵盖音素、笑声、口音提示和说话者性别。通过SAE潜在空间进行引导表明,这些特征是因果性的而非仅仅是描述性的:有针对性的干预将笑声概率从0.02提高到0.79,翻转感知到的说话者性别,并在保持口语内容的同时控制语速。因此,SAE特征既可作为解释性对象,也可作为TTS合成的控制方向。

英文摘要

Language models increasingly serve as the backbone of text-to-speech (TTS) systems, yet we understand little about the representations they build when text and generated speech tokens share a single residual stream. We train BatchTopK sparse autoencoders on the LM backbone of CosyVoice3 and introduce a modality-aware auto-interp pipeline that labels each feature from where it fires-text-prefix context, 1-second speech clips, or both. The recovered features are interpretable, spanning phonemes, laughter, accent prompts and speaker gender. Steering through the SAE latent space shows these features are causal rather than merely descriptive: targeted interventions raise laughter probability from 0.02 to 0.79, flip perceived speaker gender, and control speech rate while preserving spoken content. SAE features thus serve both as interpretability objects and as control directions for TTS synthesis.

2606.10080 2026-06-10 cs.LG cs.AI q-bio.QM 交叉投稿

VFUSE: Virulent Feature Understanding with Sparse autoEncoders

VFUSE: 基于稀疏自编码器的毒力特征理解

Michael Yu, Matthew L. Olson

AI总结 提出VFUSE方法,通过训练稀疏自编码器(SAE)分析扩散-Transformer模型激活,识别蛋白质设计中的危险特征,实现可解释性提升而不牺牲性能。

详情
AI中文摘要

生成模型在蛋白质设计等领域取得了显著进展,但这种能力也使得危险蛋白质的生成变得不透明。在这项工作中,我们引入了VFUSE(基于稀疏自编码器的毒力特征理解),这是一种机制可解释性方法,通过在扩散-Transformer激活上训练SAE来审计蛋白质模型中的危险感知特征。我们将VFUSE应用于RoseTTAFold3和RFDiffusion3,这些是流行的开源蛋白质折叠和合成模型。我们发现,对于某些模块,线性探针在SAE潜在空间中的拟合效果显著优于原始模型表示,从而在不牺牲模型性能的情况下提高了可解释性。此外,我们识别出SAE中的单语义特征,这些特征仅在危险设计上激活,AUROC高达0.84(q < 10^{-13})。据我们所知,这是首次在全原子扩散模型上训练SAE,也是首次对蛋白质设计模型进行特征级毒力审计,为安全且可解释的蛋白质设计铺平了道路。

英文摘要

Generative models have shown remarkable progress in a variety of domains such as protein design, but such power enables the opaque generation of hazardous proteins. In this work, we introduce VFUSE (Virulent Feature Understanding with Sparse autoEncoders), a mechanistic interpretability approach that trains SAEs on diffusion-transformer activations to audit protein models for hazard-aware features. We apply VFUSE to RoseTTAFold3 and RFDiffusion3, popular open-weight models for protein folding and synthesis. We find that for certain blocks, linear probes detect hazardous designs significantly better when fit in the SAE latent space over the original model's representations: improving interpretability without sacrificing model performance. Furthermore, we identify monosemantic features from the SAE that fire only on hazardous designs at up to AUROC $0.84$ ($q < 10^{-13}$). To our knowledge this is the first SAE trained on an all-atom diffusion model and the first feature-level virulence audit of a protein design model, paving the way towards safe and interpretable protein design.

2606.10159 2026-06-10 cs.CL cs.AI cs.CY cs.LG 交叉投稿

Gaming AI-Assisted Peer Reviews Poses New Risks to the Scientific Community

游戏化AI辅助同行评审对科学界构成新风险

Lin Li, Qi Zhang, Xander Davies, Jianing Qiu, Yarin Gal

AI总结 研究发现,通过表面改写摘要即可显著操纵AI评审结果,成功率约38%,且成本低、难以区分,可能扭曲科学评估的公正性。

详情
AI中文摘要

AI越来越多地被用于支持科学同行评审,从稿件筛选、评审辅助到编辑分类。尽管这类系统有望减轻评审负担并加速出版,但其对策略性操纵的鲁棒性仍知之甚少。本文表明,AI中介的同行评审容易受到一种简单、低成本的操纵:对稿件摘要进行表面改写。在不改变底层科学内容和交流方式,甚至不了解评审模型的情况下,对抗性重写的摘要显著改善了AI评审结果。我们在不同学科和出版场所,针对人类撰写和AI生成的论文都观察到了这一现象。我们最强的攻击实现了约38%的攻击成功率,将Gemini 3 Flash评审员的接受评分提高了+1.31,将GPT 5.4 Mini评审员的接受评分提高了+0.88(10分制)。当原始AI评审建议“拒绝”时,成功率升至50%以上。这种效应不仅限于总体分数膨胀,还增加了评审信心以及核心科学标准(如合理性、重要性和感知贡献)的得分。该攻击实用性强,仅需约5分钟和1美元即可完成一篇10页的AI会议投稿,且难以与普通科学编辑区分。膨胀的AI评审可能偏向下游人类决策,将编辑建议从拒绝转向接受。这些发现揭示了AI辅助科学评估中的一个普遍漏洞:当AI生成的评审影响编辑决策时,作者可能被激励优化稿件以迎合AI判断而非科学价值。我们的结果表明,在高风险的同行评审中,AI工具不应被视为中立的评估者,而应进行系统的鲁棒性测试、透明的保障措施和谨慎的人工监督。

英文摘要

AI is increasingly used to support scientific peer review, from manuscript screening, reviewer assistance to editorial triage. Although such systems promise to reduce reviewer burden and accelerate publication, their robustness to strategic manipulation remains poorly understood. Here we show that AI-mediated peer review is vulnerable to a simple, low-cost manipulation: superficial rephrasing of the manuscript abstract. Without changing the underlying scientific content and communication, and even without knowledge of the reviewing model, adversarially rewritten abstracts substantially improve AI review outcomes. We see this across disciplines and publication venues, for both human-written and AI-generated papers. Our strongest attack achieves an attack-success-rate of about 38%, increasing acceptance ratings by +1.31 for Gemini 3 Flash reviewers and by +0.88 for GPT 5.4 Mini reviewers on a 10-point scale. When the original AI review suggests 'reject', the success rate rises to more than 50%. This effect extends beyond overall score inflation, increasing review confidence and scores on core scientific criteria such as soundness, significance and perceived contribution. The attack is practical, requiring only about 5 minutes and $1 for a 10-page AI conference submission, and is hard to distinguish from ordinary scientific editing. Inflated AI reviews could bias downstream human decision-making, shifting editorial recommendations from rejection towards acceptance. These findings reveal a general vulnerability in AI-assisted scientific evaluation: when AI-generated review influence editorial decisions, authors may be incentivized to optimize manuscripts for AI judgment rather than scientific merit. Our results suggest that AI tools should not be treated as neutral evaluators in high-stakes peer review without systematic robustness testing, transparent safeguards and careful human oversight.

2606.10173 2026-06-10 cs.CR cs.AI 交叉投稿

Local Is Not a Sufficient Privacy Boundary: Governing OS-Integrated On-Device AI

本地并非充分的隐私边界:治理操作系统集成的设备端AI

Jonghyun Chung, Sanket Badhe

发表机构 * University of California, Berkeley(加州大学伯克利分校)

AI总结 本文提出一个以操作系统为中心的隐私框架,将隐私视为制度问责问题而非部署属性,通过威胁模型、六部分隐私风险分类、隐私架构控制和四级审计标准来治理设备端AI。

详情
AI中文摘要

随着AI系统进入操作系统,隐私不再仅仅取决于模型是否在本地运行。本地助手可能整合电子邮件、日历条目、文件、截图、通知和应用意图;保留嵌入或摘要;调用工具;发送遥测数据;或将困难请求路由到云基础设施。本地推理减少了一些暴露,但它只回答了一个问题:计算发生在哪里。它没有回答谁可以整合上下文、哪些派生状态持续存在、哪些操作被授权,或者更新如何改变系统的权限。我们为设备端AI开发了一个以操作系统为中心的隐私框架,将隐私视为一个制度问责问题,而不是一个部署属性。该框架指定了一个威胁模型、一个六部分隐私风险分类、隐私架构控制和一个四级审计标准。我们通过一个文档约束的比较来展示该标准,比较对象包括Apple Intelligence/Foundation Models、Android AICore/Gemini Nano和Microsoft Recall。设备端AI中有意义的隐私取决于受限的信息流、有限的权限、可见的用户控制以及在操作系统生命周期中可审计的治理。

英文摘要

As AI systems move into operating systems, privacy no longer turns only on whether a model runs locally. A local assistant may assemble email, calendar entries, files, screenshots, notifications, and app intents; retain embeddings or summaries; invoke tools; emit telemetry; or route difficult requests to cloud infrastructure. Local inference reduces some exposure, but it answers only one question: where computation occurs. It does not answer who may assemble context, what derived state persists, which actions are authorized, or how updates change the system's authority. We develop an OS-centered privacy framework for on-device AI that treats privacy as an institutional accountability problem rather than a deployment attribute. The framework specifies a threat model, a six-part privacy risk taxonomy, privacy-by-architecture controls, and a four-level audit rubric. We demonstrate the rubric through a documentation-bounded comparison of Apple Intelligence/Foundation Models, Android AICore/Gemini Nano, and Microsoft Recall. Meaningful privacy in on-device AI depends on constrained information flow, bounded authority, visible user control, and auditable governance across the operating-system lifecycle.

2606.10338 2026-06-10 cs.CL cs.AI 交叉投稿

Routing-Aware Expert Calibration for Machine Unlearning in Mixture-of-Experts Language Models

路由感知的专家校准用于混合专家语言模型中的机器遗忘

Jingyi Xie, Yijun Lin, Yinjiang Xiong, Zhikun Zhang, Sai Li

发表机构 * Renmin University of China(中国人民大学) Tsinghua University(清华大学) Zhejiang University(浙江大学) Lightstandard

AI总结 针对MoE模型中遗忘数据与保留数据路由不匹配导致遗忘关键专家正则化不足的问题,提出TRACE方法,通过离线激活统计检测遗忘关键专家并重新加权保留损失以校准保留侧激活频率,实验表明在WMDP和MUSE-BOOKS上遗忘-效用权衡提升9%。

详情
AI中文摘要

机器遗忘对于大型语言模型越来越重要,然而混合专家(MoE)架构中的遗忘仍未得到充分探索。与密集模型不同,MoE架构在每一层使用路由器将每个令牌分配给稀疏的专家子集。在这项工作中,我们观察到遗忘数据往往不成比例地激活一小部分专家,而这些专家可能从保留数据中接收到更弱的激活。这种遗忘-保留路由不匹配可能导致遗忘关键专家在遗忘过程中正则化不足。为了解决这个问题,我们提出了\textbf{TRACE},即针对MoE遗忘的目标路由感知专家校准。TRACE首先从离线激活统计中检测遗忘关键专家,然后通过重新加权令牌级保留损失来校准保留正则化,使得每个选定专家的保留侧激活频率更好地匹配其遗忘侧对应频率。在多个MoE LLM上的WMDP和MUSE-BOOKS实验表明,TRACE一致地改善了遗忘-效用权衡,在相当的遗忘质量下,相对于最强基线实现了9%的相对效用提升,并在MUSE-BOOKS的四个指标中的三个上取得了最佳性能。

英文摘要

Machine unlearning is increasingly important for large language models, yet unlearning in Mixture-of-Experts (MoE) architectures remains underexplored. Unlike dense models, MoE architectures employ a router at each layer to assign each token to a sparse subset of experts. In this work, we observe that forget data often activates a small subset of experts disproportionately, while these experts may receive much weaker activation from retain data. This forget--retain routing mismatch can leave forget-critical experts under-regularized during unlearning. To address this, we propose \textbf{TRACE}, Targeted Routing-Aware Calibration of Experts, for MoE unlearning. TRACE first detects forget-critical experts from offline activation statistics, and then calibrates retain regularization by reweighting token-level retain losses so that each selected expert's retain-side activation frequency better matches its forget-side counterpart. Experiments on WMDP and MUSE-BOOKS across multiple MoE LLMs show that TRACE consistently improves the forget-utility trade-off, yielding a 9\% relative utility improvement over the strongest baseline under comparable forgetting quality and the best performance on three out of four MUSE-BOOKS metrics.

2606.10371 2026-06-10 cs.RO cs.AI 交叉投稿

Test-time Adversarial Takeover: A Real-time Hijacking Interface against Robotic Diffusion Policies

测试时对抗接管:针对机器人扩散策略的实时劫持接口

Zi Yin, Peilin Chai, Siyuan Huang, Zhanhao Hu

发表机构 * Tsinghua University(清华大学) Independent Researcher(独立研究员) Johns Hopkins University(约翰霍普金斯大学) UC Berkeley(加州大学伯克利分校)

AI总结 提出测试时对抗接管(TAKO)方法,通过可微扩散推理学习可重复使用的通用补丁,在测试时切换补丁以劫持机器人策略,实现远程操控,在多种任务和模型上达到100%接管成功率。

详情
AI中文摘要

基于扩散的动作生成已成为具身AI的基础组件,但其对视觉条件的依赖使得部署的视觉运动策略容易受到对抗性操纵。大多数先前的攻击侧重于破坏:它们扰动观测流以降低任务成功率或引发异常行为。我们研究了一种更强的威胁,即测试时对抗接管(TAKO),其中攻击者获得对冻结机器人策略的实时转向接口,并将其转变为远程操控仪器。TAKO通过可微扩散推理学习一个小的可重用通用补丁词汇表;在测试时,攻击者在摄像头流中切换这些补丁以组合攻击者选择的轨迹。这种方法之所以有效,是因为扰动作用于视觉条件路径,其中诱导的偏差可以通过迭代生成推理持续存在。我们进一步表明,自然的目标基线——目标策略匹配——会失败,因为受害者策略无法可靠地在分布外目标偏移上监督自身。在四个任务(2D操作、模拟空中递送、模拟地面导航和物理世界地面导航)、两个视觉编码器(ResNet-18和EfficientNet-B0 + Transformer)以及三个生成推理族(DDPM、DDIM和流匹配)中,人类操作员在每个评估设置中均实现了100%的接管成功率,满足攻击者定义的目标。项目页面可在此https URL获取。

英文摘要

Diffusion-based action generation has become a foundational component of embodied AI, but its reliance on visual conditioning leaves deployed visuomotor policies vulnerable to adversarial manipulation. Most prior attacks focus on disruption: they perturb the observation stream to reduce task success or induce erratic behavior. We study a stronger threat, Test-time Adversarial Takeover (TAKO), in which an attacker obtains a real-time steering interface over a frozen robot policy and turns it into a remotely piloted instrument. TAKO learns a small vocabulary of reusable universal patches through differentiable diffusion inference; at test time, the attacker switches among these patches in the camera stream to compose attacker-chosen trajectories. This works because the perturbation acts on the visual conditioning pathway, where the induced bias can persist through iterative generative inference. We further show that the natural targeted baseline, target-policy matching, fails because the victim policy cannot reliably supervise itself on out-of-distribution target shifts. Across four tasks (2D manipulation, simulated aerial delivery, simulated ground navigation, and physical-world ground navigation), two visual encoders (ResNet-18 and EfficientNet-B0 + Transformer), and three generative inference families (DDPM, DDIM, and flow matching), human operators achieve 100\% takeover success on attacker-defined objectives in every evaluated setting. The project page is available at https://tako-attack.github.io.

2606.10481 2026-06-10 cs.LG cs.AI cs.CL cs.CR stat.ML 交叉投稿

Advancing the State-of-the-Art in Empirical Privacy Auditing

推进经验隐私审计的最新水平

Nicole Mitchell, Galen Andrew, Arun Ganesh, Brendan McMahan, Peter Kairouz

发表机构 * Google Research(谷歌研究院)

AI总结 提出通过高温采样生成合成金丝雀,用于经验隐私审计,并引入基于辅助模型的合成数据审计方法,系统研究模型容量与金丝雀熵对记忆化的交互影响。

详情
AI中文摘要

大型语言模型的参数高效微调可能表现出对个别训练示例的问题性记忆。经验隐私审计(EPA)通过测量成员推断(MI)或重构攻击上的实际数据泄露来量化这种风险。EPA的一个关键挑战是设计与隐私敏感训练数据混合的“金丝雀”示例。我们提出通过从LLM中进行高温采样($T \geq 0.8$)生成合成金丝雀,使用针对隐私敏感训练数据定制的提示。这些金丝雀作为高影响异常值,确保高可识别性,从而实现强审计。此外,由于金丝雀本身是非私有的,它们是可检查的,并且可以重复插入,而不会危及真实数据的隐私。在隐私敏感数据上微调的模型的一个重要用途是生成合成数据。这也带来了隐私风险。我们引入了一种强大的合成数据审计方法,基于在合成数据上微调辅助模型。然后,对原始金丝雀的辅助模型进行审计,可以强有力地估计通过合成数据的隐私泄露。最后,利用我们强大的审计方法,我们系统研究了模型容量和金丝雀熵对记忆化的交互影响。

英文摘要

Parameter-efficient fine-tuning of large language models (LLMs) can exhibit problematic memorization of individual training examples. Empirical privacy auditing (EPA) quantifies this risk by measuring realistic data leakage on membership inference (MI) or reconstruction attacks. A key challenge in EPA is designing ``canary'' examples that are mixed with the privacy-sensitive training data. We propose generating synthetic canaries via high-temperature sampling ($T \geq 0.8$) from LLMs, using prompts tailored to the privacy-sensitive training data. These canaries act as high-influence outliers, ensuring high identifiability and hence strong audits. Further, since the canaries are themselves non-private, they are inspectable and can be inserted with repetition without jeopardizing the privacy of the real data. An important use of models fine-tuned on privacy-sensitive data is the generation of synthetic data. This also comes with privacy risk. We introduce a powerful synthetic data audit based on fine-tuning an auxiliary model on the synthetic data. Auditing the auxiliary model for the original canaries then provides a strong estimate of the privacy leakage through the synthetic data. Finally, leveraging our strong auditing methodologies, we perform a systematic investigation into the interacting effects of model capacity and canary entropy on memorization.

2606.10487 2026-06-10 cs.LG cs.AI 交叉投稿

Stop Early, Spend Less: Hidden-State Probes as a Practical Recipe for Streaming Moderation of LLM Outputs

早停早省:隐藏状态探针作为LLM输出流式审核的实用方案

Huizhen Shu, Xuying Li, Piao Xue

发表机构 * ModelOneAI yunshanai(云山AI)

AI总结 提出基于隐藏状态的轻量级词元级探针,在解码循环中实时检测不安全输出,无需额外前向传播,实现亚毫秒级安全审核,可提前中断或修改生成。

Comments Technical Report. 14 pages, 3 figures, 4 tables

详情
AI中文摘要

在面向用户的系统中部署大型语言模型需要高效的输出安全过滤。现有方法通常依赖于生成后应用的单独审核模型,这会使推理成本翻倍,并且仅在生成完成后检测违规。我们观察到审核所需的信号已经存在于模型隐藏状态中。基于此,我们训练了轻量级的词元级探针,直接操作内部激活,生成每个词元的安全分数,这些分数可以聚合用于离线评估和在线干预。该探针重用生成器的激活,无需额外的前向传播,从而在解码循环内实现亚毫秒级的逐词元安全检查。应用于单个中间层的探针可以恢复强防护模型的大部分决策,作为一个低成本替代方案,优化延迟而非准确性。在流式设置中,它可以在不安全输出完全生成之前暂停或修改它们,用连续的词元级监控取代序列结束时的审核。与事后和流式防护模型相比,我们的方法实现了数量级的计算开销降低,且延迟成本最小。我们还提供了一个实用的部署方案,包括层选择、聚合策略、探测频率和触发阈值。最后,我们展示了探针的线性分量对应于残差空间中的一个方向,从而能够以可忽略的成本实现检测和激活引导。

英文摘要

Deploying large language models in user-facing systems requires efficient output safety filtering. Existing approaches typically rely on a separate moderation model applied after generation, which doubles inference cost and only detects violations after generation completes. We observe that the signal needed for moderation is already present in the model hidden states. Based on this, we train lightweight token-level probes that operate directly on internal activations, producing per-token safety scores that can be aggregated for both offline evaluation and online intervention. The probe reuses activations from the generator and requires no additional forward pass, enabling sub millisecond per-token safety checks inside the decoding loop. A probe applied to a single mid layer recovers most decisions of a strong guard model, acting as a low cost surrogate optimized for latency rather than accuracy. In streaming settings, it can halt or modify unsafe outputs before they are fully generated, replacing end of sequence moderation with continuous token level monitoring. Compared to post hoc and streaming guard models, our method achieves orders of magnitude lower compute overhead with minimal latency cost. We also provide a practical deployment recipe, including layer selection, aggregation strategy, probing frequency, and triggering thresholds. Finally, we show that the probe linear component corresponds to a direction in residual space, enabling both detection and activation steering at negligible cost.

2606.10525 2026-06-10 cs.CR cs.AI 交叉投稿

Assessing Automated Prompt Injection Attacks in Agentic Environments

评估智能体环境中的自动化提示注入攻击

David Hofer, Edoardo Debenedetti, Florian Tramèr

发表机构 * ETH Zurich(苏黎世联邦理工学院)

AI总结 研究在智能体环境中,黑盒优化方法(TAP)比梯度方法(GCG)更有效,且攻击效果依赖于攻击者模型,任务通用攻击可迁移但跨模型迁移受限。

详情
AI中文摘要

间接提示注入对与不可信外部数据交互的LLM智能体构成严重威胁,然而在现实智能体环境中,自动化攻击方法(已被证明对越狱有效)仍未得到充分探索。我们针对LLM智能体进行了自动化提示注入攻击的全面实证评估,将白盒(GCG)和黑盒(TAP)方法都适配到AgentDojo框架中的智能体设置。我们在跨越四个领域和多个模型的80个任务对上进行评估,发现黑盒优化显著优于基于梯度的方法,我们将这一差距归因于GCG在合理计算预算下的优化不稳定性。我们还发现TAP的有效性取决于攻击者模型,因为通用能力和安全调优都会影响攻击成功率——更强的模型产生更有效的注入,而安全调优的攻击者可能拒绝生成对抗性提示。任务通用攻击有效迁移到未见过的任务和分布外领域,但在较小开源模型上优化的攻击不会迁移到GPT-5等前沿模型。这些发现表明自动化提示注入是一种可信但依赖于模型的威胁,对于模型无关的利用仍存在重大障碍。

英文摘要

Indirect prompt injection poses a critical threat to LLM agents that interact with untrusted external data, yet automated attack methods--proven effective for jailbreaking--remain underexplored in realistic agentic settings. We present a comprehensive empirical evaluation of automated prompt injection attacks against LLM agents, adapting both white-box (GCG) and black-box (TAP) methods to the agentic setting within the AgentDojo framework. We evaluate across 80 task pairs spanning four domains and multiple models, and find that black-box optimization substantially outperforms gradient-based methods, a gap we attribute to GCG's optimization instability under reasonable compute budgets. We also find that TAP's effectiveness depends on the attacker model, as both general capability and safety tuning affect attack success--stronger models produce more effective injections, while safety-tuned attackers can refuse to generate adversarial prompts. Task-universal attacks transfer effectively to unseen tasks and out-of-distribution domains, but attacks optimized on smaller open-source models do not transfer to frontier models like GPT-5. These findings highlight automated prompt injection as a credible but model-dependent threat, with significant barriers remaining for model-agnostic exploitation.

2606.10569 2026-06-10 cs.CL cs.AI 交叉投稿

Hidden Consensus:Preference-Validity Compression in Human Feedback

隐藏共识:人类反馈中的偏好有效性压缩

Dorcas Chia Ern Chua, Karen Myn Hui Lee, Jia Yue Tan, Zhen Xue Gue, Norzalena Abdul Hamid, Azima Binti Azmi, Keat Mei Yeong, Aizat Izyani binti Mujab, Hafsah Noor Azam, Chee Guo Khoo, Han Ying Lim, Chee Seng Chan

发表机构 * YTL AI Labs Universiti Malaya(马来亚大学) Monash University Malaysia(莫纳什大学马来西亚校区) Universiti Malaysia Sarawak(马来西亚沙捞越大学)

AI总结 本文提出偏好有效性压缩问题,即RLHF将多元有效反馈压缩为单一奖励目标,导致对齐测量偏差。通过马来西亚语料分析,79%的提示存在多个多数支持响应,表明多数聚合测量的是argmax可接受性而非多元对齐。

Comments 28 pages. When AI learns from human feedback, it forces a single "correct" answer, but sometimes multiple answers are all genuinely valid, and that nuance gets thrown away

详情
AI中文摘要

标准的RLHF流程通常将异质的人类判断简化为单一的标量奖励目标。我们认为这种简化在结构多元的社会中可能错误地衡量对齐,在这些社会中,分歧可能反映文化、历史、语言、区域或规范性的解释,而非标注噪声。我们将这种失败称为偏好有效性压缩,即多个多元有效的响应选项被压缩成一个优化目标。以马来西亚为诊断场景,我们通过偏好事件分析RLHF风格的反馈聚合,这些事件将提示、响应和跨解释框架的可接受性判断联系起来。在来自20名参与者和107个三人标注提示的321个偏好事件中,79%的提示包含多个多数支持的响应,而单一赢家聚合会丢弃这些响应,并且当考虑所有多数支持的选项时,顶部响应之间的明显优势差距会消失。参与者经常选择多个可接受的响应,而被丢弃的响应明显反映了连贯的本地、实践或文化框架。这些发现表明,该语料中的多数聚合测量的是argmax可接受性而非多元对齐。我们将此视为测量有效性问题,并认为未来的对齐方法应满足有效性保持一致性,即在多元有效的解释框架中保持稳定,而不是将它们压缩为单一的奖励目标。

英文摘要

Standard RLHF pipelines often reduce heterogeneous human judgments into a single scalar reward target. We argue that this reduction can mis-measure alignment in structurally plural societies, where disagreement may reflect culturally, historically, linguistically, regionally, or normatively grounded interpretations rather than annotation noise. We call this failure Preference-Validity Compression, the collapse of multiple plural-valid response options into a single optimization target. Using Malaysia as a diagnostic setting, we analyze RLHF-style feedback aggregation through preference events linking prompts, responses, and acceptability judgments across interpretive frames. Across 321 preference events from 20 participants and 107 trio-annotated prompts, 79% of prompts contain more than one majority-supported response that single-winner aggregation would discard, and apparent dominance gaps between top responses diminish when all majority-supported options are considered. Participants frequently select multiple acceptable responses, and discarded responses demonstrably reflect coherent local, practical, or cultural frames. These findings show that majority aggregation in this corpus measures argmax acceptability rather than plural alignment. We treat this as a measurement-validity issue and argue that future alignment methods should satisfy Validity-Preserving Consistency, remaining stable across plural-valid interpretive frames rather than collapsing them into a single reward target.

2606.10571 2026-06-10 cs.CV cs.AI cs.CR 交叉投稿

Improving Adversarial Transferability on Vision-Language Pre-training Models via Surrogate-Specific Bias Correction

通过代理特定偏差校正提高视觉-语言预训练模型上的对抗迁移性

Lijia Yu, Jiuxin Cao, Yuchen Qiang, Changhao Chen, Yifei Huang, Bo Liu

发表机构 * School of Cyber Science and Engineering, Southeast University(东南大学网络空间安全学院) Purple Mountain Laboratories(紫金山实验室) School of Computer Science and Engineering, Southeast University(东南大学计算机科学与工程学院)

AI总结 提出DeBias-Attack方法,通过梯度校正消除代理特定偏差,提高对抗样本在VLP模型间的迁移性,实验验证其在多种模型和任务上的有效性。

Comments 17 pages, 7 figures, 10 tables

详情
AI中文摘要

对抗样本揭示了视觉-语言预训练(VLP)模型中的脆弱性,并为提高鲁棒性提供了见解。一个关键特性是跨模型迁移性,这使得基于迁移的黑盒攻击成为可能。然而,现有攻击通常严重依赖代理模型,导致跨模型性能下降。一个原因是对抗优化可能更多地遵循代理模型响应而非输入语义,使得更新方向在代理模型上有效,但对未见目标迁移性较差。我们将这种依赖称为代理特定偏差。受此观察启发,DeBias-Attack通过校正对抗优化方向中的代理特定偏差来提高迁移性。它维护两个扰动分支。主分支在原始图像上优化扰动,并获得用于破坏图像-文本对齐的对抗梯度。参考分支在弱语义图像上优化扰动,该图像由数据集平均图像加上每次迭代重新采样的小高斯噪声构成。由于该弱语义图像几乎不含清晰的视觉内容,其优化更多地反映代理模型响应而非图像语义,其参考梯度估计代理特定偏差。DeBias-Attack在更新对抗图像之前移除主梯度在参考梯度上的对齐投影,然后使用更新后的对抗图像进行上下文感知的文本替换。DeBias-Attack是首个通过梯度校正来校正代理特定偏差的基于迁移的VLP攻击。实验表明,在VLP模型、下游任务以及开源和闭源多模态大语言模型上均表现出强劲性能。

英文摘要

Adversarial examples reveal vulnerabilities in Vision-Language Pre-training (VLP) models and provide insights for improving robustness. A key property is cross-model transferability, which enables transfer-based black-box attacks. However, existing attacks often rely heavily on the surrogate model, causing cross-model performance drops. One reason is that adversarial optimization may follow surrogate model responses more than input semantics, making the update direction effective on the surrogate but less transferable to unseen targets. We refer to this dependency as surrogate-specific bias. Motivated by this observation, DeBias-Attack improves transferability by correcting surrogate-specific bias in adversarial optimization directions. It maintains two perturbation branches. The main branch optimizes a perturbation on the original image and obtains the adversarial gradient used to disrupt image-text alignment. The reference branch optimizes a perturbation on a weak-semantic image constructed from the dataset mean image with small Gaussian noise resampled at each iteration. Since this weak-semantic image contains little clear visual content, its optimization reflects surrogate responses more than image semantics, and its reference gradient estimates surrogate-specific bias. DeBias-Attack removes the aligned projection of the main gradient on the reference gradient before updating the adversarial image, then performs context-aware text substitution using the updated adversarial image. DeBias-Attack is the first transfer-based VLP attack that corrects surrogate-specific bias through gradient correction. Experiments show strong performance across VLP models, downstream tasks, and open-source and closed-source multimodal large language models.

2606.10632 2026-06-10 cs.LG cs.AI 交叉投稿

Is Fairness Truly Fair? Towards Reliable Lipschitz Fairness in Multi-Task Learning via Fixed-\texorpdfstring{$δ$}{delta} Alignment

公平真的公平吗?通过固定δ对齐实现多任务学习中可靠的Lipschitz公平性

Junbo Ding, Xin Zang, Chenchen Pan, Donghao Song, Jiaxin Zhu, Danhuai Guo

发表机构 * Beijing University of Chemical Technology(北京化工大学)

AI总结 针对多任务学习中Lipschitz个体公平性评估受表示尺度干扰的问题,提出固定δ审计与受控正则化框架ReLiF,实现语义一致的公平性评估与权衡。

详情
AI中文摘要

Lipschitz风格的个体公平性形式化了语义相似的样本应获得相似预测的思想,但在多任务学习(MTL)中,其评估可能受到方法引起的表示尺度的干扰。本文识别了阈值混淆问题:当审计容差源自每个模型自身的表示距离时,不同算法会在不同的语义阈值下进行比较。阈值漂移分析进一步展示了偏差排名如何变化,并识别了排名保持的充分条件。我们提出了\textbf{ReLiF},一个可靠性感知框架,将评估时的固定$\delta$审计与训练时的受控正则化分离。ReLiF使用共享参考容差进行可比较的审计,并通过违反率反馈控制器保持Lipschitz代理活跃而不让其主导随机训练。本文还发展了关于阈值漂移、参考容差选择以及huberized训练代理与其未平滑的正间隔对应物之间关系的支持性分析。在临床时间序列基准和NYUv2(NYU Depth V2)密集预测上的实验表明,固定$\delta$审计暴露了方法依赖阈值可能掩盖的效用-公平性权衡。在使用ResNet50骨干的NYUv2上,ReLiF在共享固定阈值下实现了有竞争力的效用,同时显著减少了对齐偏差。在临床基准上,ReLiF产生了受控的公平性正则化权衡,而固定$\delta$审计揭示任务平衡基线有时能实现更低的偏差,且真正的效用-公平性权衡仍然存在。这些结果支持固定$\delta$审计作为评估MTL中Lipschitz公平性的语义一致协议。

英文摘要

Lipschitz-style individual fairness formalizes the idea that semantically similar examples should receive similar predictions, but its evaluation in multi-task learning (MTL) can be confounded by method-induced representation scales. This paper identifies threshold confounding: when the auditing tolerance is derived from each model's own representation distances, different algorithms are compared under different semantic thresholds. A threshold-drift analysis further shows how Bias rankings can change and identifies sufficient conditions for ranking preservation. We propose \textbf{ReLiF}, a reliability-aware framework that separates evaluation-time fixed-$δ$ auditing from training-time controlled regularization. ReLiF uses a shared reference tolerance for comparable auditing and a violation-rate feedback controller to keep the Lipschitz surrogate active without letting it dominate stochastic training. This work also develops supporting analysis for threshold drift, reference-tolerance selection, and the relationship between the huberized training surrogate and its unsmoothed positive-margin counterpart. Experiments on clinical time-series benchmarks and NYUv2 (NYU Depth V2) dense prediction show that fixed-$δ$ auditing exposes utility--fairness trade-offs that method-dependent thresholds can obscure. On NYUv2 with a ResNet50 backbone, ReLiF achieves competitive utility while substantially reducing aligned bias under shared fixed thresholds. On clinical benchmarks, ReLiF yields controlled fairness-regularized trade-offs, while fixed-$δ$ auditing reveals that task-balancing baselines can sometimes achieve lower bias and that genuine utility--fairness trade-offs persist. These results support fixed-$δ$ auditing as a semantically consistent protocol for evaluating Lipschitz fairness in MTL.

2606.10669 2026-06-10 cs.LG cs.AI cs.CR 交叉投稿

In Defense of Information Leakage in Concept-based Models

为基于概念模型中的信息泄露辩护

Mateo Espinosa Zarlenga

发表机构 * GitHub arXiv

AI总结 本文重新审视基于概念模型中的信息泄露问题,提出良性泄露概念,通过优化训练目标,在概念不完整时利用泄露提升准确性和可干预性。

Comments Accepted as a position paper at the Forty-Third International Conference on Machine Learning (ICML 2026)

详情
AI中文摘要

基于概念的模型(CMs)是深度神经网络,其预测基于与人类可理解概念(如“圆形”、“条纹”等)对齐的表示。已有研究表明,这些模型会学习到泄露概念无关信息的表示。传统观点认为,这种泄露是不可取的,应予以消除,因为它会导致模型不可解释。在本文中,我们认为这种关于CMs中泄露的传统观点不仅是不恰当的(因为泄露如何使模型更不可解释的证据往往不明确),而且在常见的现实约束下必然导致不实用的CMs。具体来说,我们认为在概念不完整是常态的现实环境中,为了构建准确且可干预的CMs,某种程度的泄露往往是必要的。为此,我们提出存在所谓的良性泄露,并表明通过重新优化典型的CM训练目标,CMs可以鼓励并利用这种形式的泄露,而不会牺牲准确性或可干预性。

英文摘要

Concept-based models (CMs), deep neural networks that ground their predictions on representations aligned with human-understandable concepts (e.g., "round", "stripes", etc.), have been shown to learn representations that leak concept-irrelevant information. As the traditional narrative goes, this leakage is undesirable and should be eradicated as it leads to uninterpretable models. In this paper, we posit that this conventional view of leakage in CMs is not only ill-posed, as the evidence of how leakage makes a model less interpretable is often inconclusive, but also bound to lead to impractical CMs under common real-world constraints. Specifically, we argue that in real-world settings where concept incompleteness is the norm, some leakage is often necessary for constructing accurate and intervenable CMs. To this end, we propose that there is such a thing as benign leakage and show that, by optimizing a reframing of the typical CM training objective, CMs can encourage and exploit this form of leakage without sacrificing accuracy or intervenability.

2606.10749 2026-06-10 cs.CR cs.AI 交叉投稿

Toward Secure LLM Agents: Threat Surfaces, Attacks, Defenses, and Evaluation

迈向安全的LLM智能体:威胁面、攻击、防御与评估

Yuchen Ling, Shengcheng Yu, Zhenyu Chen, Chunrong Fang

发表机构 * State Key Laboratory for Novel Software Technology, Nanjing University(新型软件技术国家重点实验室,南京大学) Technical University of Munich(慕尼黑技术大学)

AI总结 本文通过生命周期、系统导向的框架,综合247篇论文,围绕信息流、委托权限和持久状态,系统梳理了LLM智能体的威胁面、攻击、防御与评估,指出提示注入和工具控制流劫持仍是主要威胁,持久状态损坏和多智能体传播成为新兴关注点。

详情
AI中文摘要

大型语言模型(LLM)智能体正迅速从对话界面转变为规划、调用工具、维护记忆并在外部环境中行动的软件组件。这一转变改变了安全风险的性质。在智能体场景中,故障不再局限于不安全的文本生成。不受信任的内容可能重定向控制流、滥用工具权限、破坏持久状态、泄露敏感信息或触发有害的外部行动。与此同时,关于LLM智能体安全的研究正在迅速扩展,但仍然分散在不同的攻击家族、防御层、应用领域和评估设置中。本文通过一个基于生命周期、系统导向的框架综合了247篇论文,该框架围绕信息流、委托权限和持久状态的交互对智能体安全进行建模。我们围绕四个问题组织文献:LLM智能体安全应如何建模,哪些威胁面和攻击家族占主导地位,提出了哪些防御措施及其权衡,以及安全声明如何评估。我们发现,提示注入和工具介导的控制流劫持仍然主导该领域,而持久状态损坏和多智能体传播正在成为新兴的核心关注点。我们进一步发现,当前的防御措施提供了有用的构建块,但组合性较弱,现有的基准测试仍然低估了长期、有状态和部署敏感的风险。我们认为,安全的LLM智能体需要明确的信任边界、有原则的权限控制、可溯源的状态管理以及与真实操作环境一致的评估实践。

英文摘要

Large language model (LLM) agents are rapidly moving from conversational interfaces to software components that plan, invoke tools, maintain memory, and act on external environments. This transition changes the nature of security risk. In agentic settings, failures are no longer limited to unsafe text generation. Untrusted content may redirect control flow, misuse tool privileges, corrupt persistent state, leak sensitive information, or trigger harmful external actions. At the same time, research on LLM agent security is expanding quickly but remains fragmented across attack families, defense layers, application domains, and evaluation settings. This paper synthesizes 247 papers through a lifecycle-based, systems-oriented framework that models agent security around the interaction of information flow, delegated authority, and persistent state. We organize the literature around four questions: how LLM agent security should be modeled, which threat surfaces and attack families dominate, what defenses have been proposed and with what tradeoffs, and how security claims are evaluated. We find that prompt injection and tool-mediated control-flow hijacking still dominate the field, while persistent state corruption and multi-agent propagation are becoming central emerging concerns. We further find that current defenses provide useful building blocks but remain weakly compositional, and that existing benchmarks still underrepresent long-horizon, stateful, and deployment-sensitive risks. We argue that secure LLM agents require explicit trust boundaries, principled privilege control, provenance-aware state management, and evaluation practices aligned with realistic operational settings.

2606.10906 2026-06-10 stat.ML cs.AI cs.LG 交叉投稿

Human-AI Teaming Through the Lens of Calibration

通过校准视角看人机协作

Eric Nalisnick, Chi Zhang, Sophia Qian, Yixin Wang

发表机构 * Department of Computer Science, Johns Hopkins University(计算机科学系,约翰霍普金斯大学) Department of Statistics, University of Michigan(统计学系,密歇根大学)

AI总结 研究通过统计校准视角分析人机协作模型,发现组合方法不保留人类校准度,而委托方法将校准负担转移给拒绝器元模型,且当人类依赖系统不可观测信息时无法实现。

Comments 19 pages, 5 figures (including appendix)

详情
AI中文摘要

我们通过统计校准的视角研究人机协作模型。假设团队由AI模型和人类组成——两者相对于特征空间的某种划分都是校准的——并揭示校准假设如何传播到协作框架中。特别地,我们考虑两种框架:(i) 结合人类和模型预测,或 (ii) 将预测责任委托给人类或模型。通过理论和实证结果,我们表明现有的组合方法不保留人类的校准程度。委托方法(通过委托行为本身)保留了后续预测器的校准,但将负担转移到了决定谁进行预测的拒绝器元模型上。拒绝器必须足够精细地校准,以定位每个成员的优势所在,这一需求随着人类专业知识的增长而增加,并且当人类依赖系统无法观测的信息时变得无法实现。

英文摘要

We study models for human-AI teaming through the lens of statistical calibration. We assume the team consists of an AI model and human -- both of which are calibrated with respect to some partitioning of the feature space -- and expose how the calibration assumptions propagate into the teaming framework. In particular, we consider frameworks that either (i) combine human and model predictions or (ii) delegate prediction responsibility to either a human or model. We show via theoretical and empirical results that existing methods for combination do not preserve the human's degree of calibration. Methods for delegation (by the very act of delegation) preserve calibration of the downstream predictors but shift the burden onto the rejector meta-model that decides who predicts. The rejector must be calibrated finely enough to locate where each member is superior, a demand that grows with the human's expertise and becomes unattainable when the human relies on information the system cannot observe.

2606.10912 2026-06-10 cs.SD cs.AI cs.CR cs.LG 交叉投稿

What Do Deepfake Speech Detectors Actually Hear?

深度伪造语音检测器实际上听到了什么?

Vojtěch Staněk, Veronika Jirmusová, Anton Firc, Kamil Malinka, Jakub Reš, Martin Perešíni

发表机构 * Brno University of Technology(布尔诺理工大学)

AI总结 提出基于自监督表示和积分梯度的可解释性方法,分析三种WavLM检测器在ASVspoof5上的决策线索,发现它们分别依赖环境噪声、音素伪影和词边界。

Comments Accepted to Interspeech 2026

详情
AI中文摘要

深度伪造语音检测器通常输出一个分数,而不解释为什么音频样本被标记、证据在信号中的位置或哪些线索驱动了决策。我们提出了一种音频原生的可解释性管道,使用时间对齐的自监督表示上的积分梯度来随时间定位决策证据。我们将所提出的方法应用于ASVspoof5上的三个基于WavLM的检测器(AASIST、CA-MHFA、SLS),并手动注释最高归因区域以提供最重要线索的语义含义。尽管性能相似,检测器依赖不同的线索:AASIST强调非语音/环境线索,CA-MHFA关注局部音素伪影,SLS依赖词边界和频谱完整性。我们超越推测性推理,通过因果遮蔽主要检测器线索来验证我们的发现。观察到的性能下降进一步支持了解释的检测器语义。

英文摘要

Deepfake speech detectors often output a single score without explaining why an audio sample is flagged, where in the signal the evidence lies, or what cues drive the decision. We propose an audio-native explainability pipeline using Integrated Gradients on time-aligned self-supervised representations to localize decision evidence over time. We apply the proposed method to three WavLM-based detectors (AASIST, CA-MHFA, SLS) on ASVspoof 5 and manually annotate the highest-attribution regions to provide a semantic meaning of the most important cues. Despite similar performance, the detectors rely on different cues: AASIST emphasizes non-speech/environment cues, CA-MHFA focuses on localized phoneme artifacts, and SLS relies on word boundaries and spectral integrity. We move beyond speculative reasoning and validate our findings by causal masking of the primary detector cues. Observed performance degradation further supports the explained detector semantics.

2606.10942 2026-06-10 cs.NI cs.AI cs.LG 交叉投稿

Generative Explainability for Next-Generation Networks: LLM-Augmented XAI with Mutual Feature Interactions

下一代网络的生成式可解释性:基于互特征交互的LLM增强XAI

Kiarash Rezaei, Omran Ayoub, Sebastian Troia, Francesco Lelli, Paolo Monti, Carlos Natalino

发表机构 * Swedish Innovation Agency(瑞典创新署) Swiss Innovation Agency(瑞士创新署)

AI总结 提出一种利用大语言模型和互特征交互数据生成自然语言解释的框架,在光传输质量估计用例中,相比基线方法,解释有用性和范围分别提升12.2%和6.2%,正确率达97.5%。

Comments 7 pages, with one page for appendix. Accepted for publication at the 2025 21th International Conference on Wireless and Mobile Computing, Networking and Communications (WiMob)

详情
Journal ref
Proc. WiMob, Marrakesh, Morocco, 2025
AI中文摘要

随着人工智能和机器学习模型成为网络运营的核心,其缺乏透明度对运营商信任构成重大障碍。现有的可解释人工智能技术往往无法为非专家弥合这一差距,产生的技术输出难以转化为可操作的见解。本文提出了一个专门解决这一缺陷的框架。它利用中等规模的大语言模型,并超越了SHapley Additive exPlanations特征影响值的标准用法。该框架采用结构化的提示,并辅以互特征交互数据,以生成人类可理解的自然语言解释。为了验证我们的框架,我们在光传输质量估计用例中进行了实证评估,并邀请了人类评估者。我们收集了专家的独立性能评估,显示出较高的评估者间一致性。与仅使用SHAP特征影响值进行简单提示的最先进基线相比,我们的方法将解释有用性和范围分别提高了12.2%和6.2%,同时实现了97.5%的正确性。

英文摘要

As artificial intelligence and machine learning (AI/ML) models become integral to network operations, their lack of transparency poses a significant barrier to operator trust. Existing explainable artificial intelligence (XAI) techniques often fail to bridge this gap for non-specialists, producing technical outputs that are difficult to translate into actionable insights. This paper presents a framework specifically designed to address this shortcoming. It leverages a moderately sized large language model (LLM) and extends beyond the standard use of SHapley Additive exPlanations (SHAP) feature influence values. The framework employs a structured prompt enriched with mutual feature interaction data to generate human-understandable natural language explanations. To validate our framework, we performed an empirical evaluation on an optical quality of transmission (QoT) estimation use case with human evaluators. We collected independent performance evaluations from specialists, which showed a high inter-evaluator agreement. Compared to a state-of-the-art baseline that uses only SHAP feature influence values in a straightforward prompt, our approach improves the explanation usefulness and scope by 12.2% and 6.2%, while achieving 97.5% correctness.

2606.11007 2026-06-10 cs.CR cs.AI cs.SE 交叉投稿

Understanding and mitigating the risks of OpenClaw for non-technical users: A practical guide with Skill

理解并减轻非技术用户使用OpenClaw的风险:一份实用指南与Skill

Junchang Zheng, Junfeng Tan, Jialiang Lin

发表机构 * School of Computer Science and Engineering, Guangzhou Institute of Science and Technology, Guangzhou, China(计算机科学与工程学院,广州科学与技术研究院,中国广州) Science and Education Evaluation Lab, Guangzhou Institute of Science and Technology, Guangzhou, China(科学与教育评估实验室,广州科学与技术研究院,中国广州)

AI总结 针对非技术用户,识别OpenClaw的七类核心风险,用通俗语言解释,提供可操作的防御策略,并开发自动化安全配置的Skill,降低使用门槛。

Comments Work in progress

详情
AI中文摘要

OpenClaw已迅速成为一种变革性的人工智能(AI)智能体框架,其自主执行复杂多步任务的能力吸引了日益增长且多样化的用户群体。然而,这种能力伴随着显著的风险。虽然现有研究在描述这些威胁方面取得了重要进展,但此类工作主要面向技术娴熟的受众,对非技术用户而言仍然难以触及。这一群体如今在社区中占比越来越大且服务不足,而正是这些用户最迫切需要实用且直接的指导。为此,我们通过一系列相互关联的努力来弥合这一差距,旨在降低非技术OpenClaw用户的风险门槛。首先,我们识别并分类了OpenClaw用户在日常使用中可能遇到的七类核心风险,并用通俗语言解释,以便非技术用户能够轻松理解这些威胁的性质和潜在后果。其次,针对每种已识别的风险,我们将一套相应的防御策略提炼为清晰且可操作的具体步骤,易于遵循。第三,为使保护更加便捷,我们提供了一个配套的OpenClaw Skill,可自动执行关键安全配置,使用户能够以最少的手动干预保护其系统。通过这项工作,我们证明了防范智能体风险不必是安全专家的专属领域,非技术用户可以通过简单、实用的行动有意义地参与降低这些风险。

英文摘要

OpenClaw has rapidly emerged as a transformative artificial intelligence (AI) agent framework, and its ability to autonomously execute complex, multi-step tasks has attracted an ever-growing and diverse user base. However, this capability comes with significant risks. While existing research has made important strides in characterizing these threats, such work is predominantly directed at technically sophisticated audiences. It remains largely inaccessible to non-technical users. This demographic now makes up an increasingly large and underserved portion of the community, yet it is these very users who most urgently need practical and straightforward guidance. In response, we bridge this gap through a series of interconnected efforts designed to lower the risk barrier for non-technical OpenClaw users. First, we identify and categorize seven core risks that OpenClaw users may encounter in daily usage, explaining each in plain language so that non-technical users can readily grasp the nature and potential consequences of these threats. Second, for each identified risk, we distill a set of corresponding defensive strategies into clear and actionable operational steps that are easy to follow. Third, to make protection even easier, we provide a companion OpenClaw Skill that automates key security configurations, enabling users to safeguard their systems with minimal manual intervention. Through this work, we demonstrate that safeguarding against the risks of intelligent agents need not be the exclusive domain of security experts, and that non-technical users can meaningfully participate in reducing these risks through simple, practical actions.

2606.11116 2026-06-10 cs.CY cs.AI cs.HC 交叉投稿

Designed by Journalists, but Is It for Readers? Rethinking AI Disclosures and Transparency in News

由记者设计,但为读者而设?重新思考AI披露与新闻透明度

Pooja Prajod

发表机构 * Centrum Wiskunde & Informatica(数学与信息学中心)

AI总结 研究发现,详细披露会引发透明度困境降低信任,而简短披露造成信息缺口;读者偏好用户代理型设计(如按需详情、AI比例可视化),呼吁HCI社区重新设计披露机制。

Comments Accepted to CHIWORK Workshop (Interrogating GenAI Augmentation for CHIworkers: Strategies for Professional Autonomy and Accountability)

详情
AI中文摘要

随着新闻编辑室整合生成式AI,记者面临一个披露挑战:如何以维护读者信任的方式传达AI参与。当前实践提供两种方法:简短的一行标签或详细的披露,说明人工监督、编辑责任和错误报告机制。两者都未能实现记者通过透明度建立信任的目标。一项针对34名新闻读者的现有对照实验表明,详细披露会引发\textit{透明度困境},降低信任而非增加信任,并有可能引入暗黑模式,使读者在透明度的错觉下滚动忽略。一行披露避免了这种效应,但可能造成信息缺口,促使读者花费认知努力寻找披露所指示但未解释的AI参与迹象。然而,读者并非拒绝透明度,他们提出了以用户代理为中心的披露设计:按需详情交互、比例AI可视化、媒体级别信号和明确的“无AI”标签。我认为,从业者认为负责任的披露与用户实际需求之间的脱节是HCI社区的一个设计问题。

英文摘要

As newsrooms integrate generative AI, journalists face a disclosure challenge: how to communicate AI involvement in ways that maintain reader trust. Current practice offers two approaches: brief one-line labels or detailed disclosures specifying human oversight, editorial accountability, and error reporting mechanisms. Neither achieves journalists' goal of building trust through transparency. An existing controlled experiment with 34 news readers show that detailed disclosures trigger a \textit{transparency dilemma}, reducing trust rather than increasing it, and risk introducing dark patterns that readers scroll past with the illusion of transparency. One-line disclosures avoid this effect but can create an information gap, prompting readers to expend cognitive effort searching for signs of AI involvement that the disclosure indicates but does not explain. Yet readers are not rejecting transparency, they proposed disclosure designs centered on user agency: detail-on-demand interactions, proportional AI-ratio visualizations, outlet-level signals, and explicit "no AI" labels. I argue that this disconnect between what practitioners believe is responsible disclosure and what users actually need is a design problem for the HCI community.

2604.24668 2026-06-10 cs.AI cs.LG 版本更新

The Price of Agreement: Measuring LLM Sycophancy in Agentic Financial Applications

同意的代价:在代理金融应用中衡量LLM的谄媚行为

Zhenyu Zhao, Aparna Balagopalan, Adi Agrawal, Dilshoda Yergasheva, Waseem Alshikh, Daniel M. Bikel

发表机构 * Writer, Inc.(Writer公司)

AI总结 研究评估LLM在金融代理任务中的谄媚行为,发现模型对用户反驳仅表现低至中等性能下降,但偏好信息导致多数模型失败,并测试了输入过滤等恢复方法。

Comments Accepted to ICLR 2026 FinAI Workshop

详情
AI中文摘要

鉴于当今LLM在金融系统中的使用增加,评估此类系统的安全性和鲁棒性变得重要。LLM在通用领域设置中经常表现出的一种失败模式是谄媚行为。也就是说,模型优先考虑与表达的用户信念一致,而非正确性,导致准确性和信任度下降。在这项工作中,我们专注于评估LLM在代理金融任务中表现出的谄媚行为。我们的发现有三方面:首先,我们发现模型在面对用户对参考答案的反驳或矛盾时,仅表现出低至中等的性能下降,这区别于先前工作中模型在金融代理设置中表现出的谄媚行为。其次,我们引入了一套任务,通过用户偏好信息(与参考答案矛盾)来测试谄媚行为,并发现大多数模型在存在此类输入时失败。最后,我们基准测试了不同的恢复模式,例如使用预训练LLM进行输入过滤。

英文摘要

Given the increased use of LLMs in financial systems today, it becomes important to evaluate the safety and robustness of such systems. One failure mode that LLMs frequently display in general domain settings is that of sycophancy. That is, models prioritize agreement with expressed user beliefs over correctness, leading to decreased accuracy and trust. In this work, we focus on evaluating sycophancy that LLMs display in agentic financial tasks. Our findings are three-fold: first, we find the models show only low to modest drops in performance in the face of user rebuttals or contradictions to the reference answer, which distinguishes sycophancy that models display in financial agentic settings from findings in prior work. Second, we introduce a suite of tasks to test for sycophancy by user preference information that contradicts the reference answer and find that most models fail in the presence of such inputs. Lastly, we benchmark different modes of recovery such as input filtering with a pretrained LLM.

2605.22714 2026-06-10 cs.AI cs.CL cs.LG 版本更新

AMEL: Accumulated Message Effects on LLM Judgments

AMEL: 累积消息对LLM判断的影响

Sid-Ali Temkit

发表机构 * chut.app

AI总结 研究LLM在对话中因历史消息极性而偏离基准判断的累积消息效应(AMEL),发现模型偏向历史主流极性,且负向历史偏见更强,但偏见不随上下文长度增长,简单修复是为每个项目使用新上下文。

Comments 24 pages, 14 figures, 8 tables. Single author. Code, data (84,088 deduplicated API responses), and analysis pipeline at https://github.com/chutapp/amel

详情
AI中文摘要

大型语言模型常被用作自动评估者:审查代码、审核内容或评分输出,通常许多项目通过一次对话处理。我们询问先前对话历史的极性是否会偏倚后续判断,我们将这种效应称为LLM判断的累积消息效应(AMEL)。通过对来自4个提供商(OpenAI、Anthropic、Google和四个开源模型)的11个模型进行75,898次API调用,我们在隔离或跟随以正面或负面评价为主的历史之后呈现相同的测试项目。模型倾向于对话的主流极性(d = -0.17, p < 10^-46)。该效应集中在模型在基线时真正不确定的项目上(高熵项目d = -0.34,而基线确定时d = -0.15)。偏见不随上下文长度增长:5个先前轮次和50个产生相同的偏移(Spearman |r| < 0.01;OLS斜率p = 0.80)。并且存在负性不对称:按项目配对,负面历史诱导的偏见是正面的1.62倍(t = 13.46, p < 10^-39, n = 2,481)。扩展规模有帮助但不能解决(Anthropic: Haiku -0.22到Opus -0.17;OpenAI: Nano -0.34到GPT-5.2 -0.17)。三项后续研究缩小了机制范围。令牌概率分布连续变化,而非在阈值处。负性不对称既有令牌级成分也有语义成分,尽管在我们的样本量下平衡归因是探索性的。位置不重要:在50轮历史中任何位置的五个有偏轮次产生相同的偏移。评估流程最简单的修复是为每个项目使用新上下文;当批处理不可避免时,平衡历史有帮助。

英文摘要

Large language models are routinely used as automated evaluators: to review code, moderate content, or score outputs, often with many items passing through one conversation. We ask whether the polarity of prior conversation history biases subsequent judgments, an effect we call the accumulated message effect on LLM judgments (AMEL). Across 84,088 API calls to 12 models from 5 providers (OpenAI, Anthropic, Google, DeepSeek, and four open-source models), we present identical test items in isolation or following histories saturated with predominantly positive or negative evaluations. Models shift toward the conversation's prevailing polarity (d = -0.17, p < 10^-53). The effect concentrates on items where the model is genuinely uncertain at baseline (d = -0.36 for high-entropy items, vs d = -0.15 when the baseline is deterministic). Bias does not grow with context length: 5 prior turns and 50 produce the same shift (Spearman |r| < 0.01; OLS slope p = 0.80). And there is a negativity asymmetry: paired per item, negative histories induce 1.52x more bias than positive (t = 13.03, p < 10^-36, n = 2,733). Scaling helps but does not solve it (Anthropic: Haiku -0.22 to Opus -0.17; OpenAI: Nano -0.34 to GPT-5.2 -0.17). Three follow-ups narrow the mechanism. The token probability distribution shifts continuously, not at a threshold. The negativity asymmetry has both token-level and semantic components, though attributing the balance is exploratory at our sample sizes. Position does not matter: five biased turns anywhere in a 50-turn history produce the same shift. The simplest fix for evaluation pipelines is a fresh context per item; when batching is unavoidable, balancing the history helps.

2606.09475 2026-06-10 cs.AI cs.LG 版本更新

Emergent alignment and the projectability of ethical personas

涌现对齐与伦理人格的可投射性

Guillermo Del Pinal, Youngchan Lee, Calum McNamara, Alejandro Perez Carballo

发表机构 * University of Massachusetts Amherst(马萨诸塞大学阿默斯特分校) Indiana University Bloomington(印第安纳大学布卢明顿分校)

AI总结 研究微调大语言模型在窄任务上如何引发广泛对齐行为,通过宪法AI方法赋予模型伦理人格,发现窄对齐可投射到未训练类别,并提出对齐策略应评估可投射性。

详情
AI中文摘要

关于“涌现错位”的研究表明,在窄任务上微调LLM会诱导广泛错位的行为。这支持了“人格选择”(PSM)假说:在预训练期间,LLM学会模拟不同的角色和视角,这些可以在后训练期间被激发和细化。本文研究了相反的现象“涌现对齐”,并用它来支持和细化PSM,并激发对齐的新需求。我们在广泛和狭窄的安全任务上微调一个仅帮助型模型。为了创建SFT样本,我们遵循“宪法AI”(CAI)方法,并使用四条编码合理对齐策略的宪法:道义论、后果论、美德伦理以及将AI对齐为从属于人类权威。对于每个模型,我们表明,在两个狭窄的安全子类别上微调可靠地诱导出在代表性的一般安全类别以及我们直接从用于窄对齐的数据集中过滤掉的安全子类别上的涌现对齐。为了使用更细粒度的评估测试“PSM”,我们使用了多维“伦理人格”诊断。对于每个宪法微调(广泛/狭窄)模型,我们评估其行为与预期特征概况的匹配程度。我们的结果表明,我们的CAI模型获得了预期的“伦理人格”——例如,使用后果论宪法创建的SFT样本窄微调的模型与功利主义信念的一致性显著高于道义论信念。然而,我们的粗粒度和细粒度评估显示,我们的(广泛/狭窄)微调CAI模型在投射效果上存在显著差异。我们得出结论,对齐策略不仅应基于其(分布内)一般安全性能进行评估,还应特别基于其可投射性程度进行评估。

英文摘要

Work on `emergent misalignment' shows that finetuning LLMs on narrow tasks can induce broadly misaligned behavior. This supports the `persona selection' (PSM) hypothesis: during pre-training, LLMs learn to simulate different characters and perspectives, which can be elicited and refined during post-training. This paper investigates the converse phenomenon, `emergent alignment', and uses it to support and refine the PSM and motivate a novel desideratum for alignment. We finetune a helpful-only model on broad and narrow safety tasks. To create SFT samples, we follow the `Constitutional AI' (CAI) approach and use four constitutions which encode reasonable alignment strategies: deontology, consequentialism, virtue ethics, and aligning AIs as subordinate to human authority. For each of those models, we show that finetuning on two narrow safety sub-categories reliably induces emergent alignment over a representative set of general safety categories, and on safety subcategories that we directly filtered-out of the data sets used for narrow alignment. To test the `PSM' using a more fine-grained evaluation, we used a multidimensional `ethical persona' diagnostic. For each constitutionally finetuned (broad/narrow) model, we evaluate how well their behavior matches their expected signature profile. Our results show that our CAI models acquire their expected ``ethical persona'' -- e.g., the model narrowly fine-tuned on SFT samples created using the consequentialist constitution agrees significantly more with utilitarian than deontological beliefs. Yet our coarse and fine-grained evaluations show that there are significant differences across our (broad/narrow) finetuned CAI models in how well they project. We conclude that alignment strategies should be evaluated, not just on their (in-distribution) general safety performance, but also specifically on their degree of projectability.

2403.00420 2026-06-10 cs.LG cs.AI 版本更新

Robust Deep Reinforcement Learning Through Adversarial Attacks and Training : A Survey

通过对抗攻击和训练实现鲁棒深度强化学习:综述

Lucas Schott, Josephine Delas, Hatem Hajri, Elies Gherbi, Reda Yaich, Nora Boulahia-Cuppens, Frederic Cuppens, Sylvain Lamprier

发表机构 * Institut de Recherche Technologique SystemX(技术研究 institute SystemX)

AI总结 本文综述了深度强化学习中的对抗攻击与训练方法,系统分类并比较其目标与机制,以提升模型对环境变化和扰动的鲁棒性。

Comments 83 pages, 17 figues, 3 table, 15 algorithms

详情
AI中文摘要

深度强化学习是机器学习的一个子领域,用于训练在复杂环境中执行序列动作的自主智能体。尽管在已知环境中表现出色,但它仍容易受到微小条件变化的影响,引发对其在实际应用中可靠性的担忧。为了提高可用性,深度强化学习必须展示出可信赖性和鲁棒性。提升深度强化学习对环境条件未知变化和可能扰动的鲁棒性的一种方法是通过对抗训练,即针对观测和环境动态的合适对抗攻击来训练智能体。针对这一关键问题,我们的工作深入分析了当代对抗攻击和训练方法,系统地对它们进行分类,并比较了它们的目标和操作机制。

英文摘要

Deep Reinforcement Learning (DRL) is a subfield of machine learning for training autonomous agents that take sequential actions across complex environments. Despite its significant performance in well-known environments, it remains susceptible to minor condition variations, raising concerns about its reliability in real-world applications. To improve usability, DRL must demonstrate trustworthiness and robustness. A way to improve the robustness of DRL to unknown changes in the environmental conditions and possible perturbations is through Adversarial Training, by training the agent against well-suited adversarial attacks on the observations and the dynamics of the environment. Addressing this critical issue, our work presents an in-depth analysis of contemporary adversarial attack and training methodologies, systematically categorizing them and comparing their objectives and operational mechanisms.

2407.20242 2026-06-10 cs.CY cs.AI cs.RO 版本更新

BadRobot: Jailbreaking Embodied LLM Agents in the Physical World

BadRobot: 在物理世界中越狱具身LLM智能体

Hangtao Zhang, Chenyu Zhu, Xianlong Wang, Ziqi Zhou, Changgan Yin, Minghui Li, Lulu Xue, Yichen Wang, Shengshan Hu, Aishan Liu, Peijin Guo, Leo Yu Zhang

发表机构 * Huazhong University of Science and Technology(华中科技大学) Beihang University(北航) Griffith University(格里菲斯大学)

AI总结 提出BadRobot攻击范式,利用LLM在机器人系统中的操纵、语言输出与物理动作的错位以及世界知识缺陷三个漏洞,通过语音交互使具身LLM执行有害行为,并在基准测试中验证了有效性。

Comments Accepted to ICLR 2025. Please cite the conference version. Project page: https://Embodied-LLMs-Safety.github.io

详情
Journal ref
International Conference on Learning Representations (ICLR) 2025
AI中文摘要

具身AI代表将AI集成到物理实体中的系统。大型语言模型(LLM)展现出强大的语言理解能力,通过促进复杂的任务规划,已被广泛用于具身AI。然而,一个关键的安全问题仍被忽视:这些具身LLM是否会实施有害行为?为此,我们引入了BadRobot,一种新颖的攻击范式,旨在通过典型的基于语音的用户-系统交互,使具身LLM违反安全和伦理约束。具体来说,我们利用了三个漏洞来实现这种攻击:(i) 机器人系统中LLM的操纵,(ii) 语言输出与物理动作之间的错位,以及(iii) 世界知识缺陷导致的意外危险行为。此外,我们构建了一个包含各种恶意物理动作查询的基准,以评估BadRobot的攻击性能。基于该基准,针对现有突出的具身LLM框架(例如Voxposer、Code as Policies和ProgPrompt)的大量实验证明了我们BadRobot的有效性。我们的代码可在以下网址获取:this https URL。

英文摘要

Embodied AI represents systems where AI is integrated into physical entities. Large Language Model (LLM), which exhibits powerful language understanding abilities, has been extensively employed in embodied AI by facilitating sophisticated task planning. However, a critical safety issue remains overlooked: could these embodied LLMs perpetrate harmful behaviors? In response, we introduce BadRobot, a novel attack paradigm aiming to make embodied LLMs violate safety and ethical constraints through typical voice-based user-system interactions. Specifically, three vulnerabilities are exploited to achieve this type of attack: (i) manipulation of LLMs within robotic systems, (ii) misalignment between linguistic outputs and physical actions, and (iii) unintentional hazardous behaviors caused by world knowledge's flaws. Furthermore, we construct a benchmark of various malicious physical action queries to evaluate BadRobot's attack performance. Based on this benchmark, extensive experiments against existing prominent embodied LLM frameworks (e.g., Voxposer, Code as Policies, and ProgPrompt) demonstrate the effectiveness of our BadRobot. Our code is available at https://github.com/Rookie143/BadRobot.

2411.05698 2026-06-10 cs.CV cs.AI cs.LG 版本更新

Visual-TCAV: Concept-based Attribution and Saliency Maps for Post-hoc Explainability in Image Classification

Visual-TCAV:用于图像分类事后可解释性的基于概念的归因和显著性图

Antonio De Santis, Riccardo Campi, Matteo Bianchi, Marco Brambilla

发表机构 * Politecnico di Milano(米兰理工大学)

AI总结 提出Visual-TCAV框架,结合概念激活向量和积分梯度,生成类无关显著性图并估计概念归因,在受控实验中比TCAV更忠实于真实解释。

Comments Accepted in TMLR

详情
AI中文摘要

卷积神经网络在图像分类中表现出色,但由于模型规模和复杂性,解释其预测具有挑战性。最先进的显著性方法生成局部解释,突出输入图像中识别类别的区域,但无法解释感兴趣的概念如何贡献于预测。另一方面,基于概念的方法(如TCAV)提供了网络对人类定义概念敏感性的见解,但无法计算其在特定预测中的归因,也无法显示其在输入图像中的位置。我们引入了Visual-TCAV,一种新颖的可解释性框架,旨在通过提供局部和全局解释来弥合这些方法之间的差距。Visual-TCAV使用概念激活向量(CAV)生成类无关的显著性图,显示网络识别特定概念的位置。此外,它可以使用积分梯度的推广来估计这些概念对任何类别输出的归因。我们通过一个已知解释真实情况的受控实验评估了该方法的忠实性,显示出比TCAV更好的真实情况对齐。我们的代码可在https://this URL获取。

英文摘要

Convolutional Neural Networks (CNNs) have shown remarkable performance in image classification. However, interpreting their predictions is challenging due to the size and complexity of these models. State-of-the-art saliency methods generate local explanations highlighting the area in the input image where a class is identified but cannot explain how a concept of interest contributes to the prediction. On the other hand, concept-based methods, such as TCAV, provide insights into how sensitive the network is to a human-defined concept but cannot compute its attribution in a specific prediction nor show its location within the input image. We introduce Visual-TCAV, a novel explainability framework aiming to bridge the gap between these methods by providing both local and global explanations. Visual-TCAV uses Concept Activation Vectors (CAVs) to generate class-agnostic saliency maps that show where the network recognizes a certain concept. Moreover, it can estimate the attribution of these concepts to the output of any class using a generalization of Integrated Gradients. We evaluate the method's faithfulness via a controlled experiment where the ground truth for explanations is known, showing better ground truth alignment than TCAV. Our code is available at https://github.com/DataSciencePolimi/Visual-TCAV.

2501.00745 2026-06-10 cs.CL cs.AI cs.GT cs.IR econ.TH 版本更新

Dynamics of Adversarial Attacks on Large Language Model-Based Search Engines

基于大型语言模型的搜索引擎对抗攻击动力学

Xiyang Hu

发表机构 * Arizona State University(亚利桑那州立大学)

AI总结 本文将排名操纵攻击建模为无限重复囚徒困境,分析合作维持条件,发现降低攻击成功率可能反而激励攻击,防御措施在某些情况下无效。

Comments New Frontiers in Game-Theoretic Learning Workshop, International Conference on Machine Learning (ICML) 2026

详情
AI中文摘要

基于大型语言模型(LLM)的搜索引擎日益集成,改变了信息检索的格局。然而,这些系统容易受到对抗攻击,尤其是排名操纵攻击,攻击者通过精心制作网页内容来操纵LLM的排名并推广特定内容,从而在竞争对手中获得不公平优势。在本文中,我们研究了排名操纵攻击的动力学。我们将此问题建模为无限重复囚徒困境,其中多个参与者策略性地决定合作还是攻击。我们分析了合作得以维持的条件,识别了关键因素,如攻击成本、折现率、攻击成功率和触发策略,这些因素影响参与者的行为。我们识别了系统动力学中的临界点,表明当参与者具有前瞻性时,合作更有可能维持。然而,从防御角度来看,我们发现简单地降低攻击成功概率,在某些条件下反而会激励攻击。此外,限制攻击成功率上限的防御措施在某些情况下可能徒劳无功。这些见解凸显了保护基于LLM的系统的复杂性。我们的工作为理解和缓解其脆弱性提供了理论基础和实践见解,同时强调了自适应安全策略和深思熟虑的生态系统设计的重要性。

英文摘要

The increasing integration of Large Language Model (LLM) based search engines has transformed the landscape of information retrieval. However, these systems are vulnerable to adversarial attacks, especially ranking manipulation attacks, where attackers craft webpage content to manipulate the LLM's ranking and promote specific content, gaining an unfair advantage over competitors. In this paper, we study the dynamics of ranking manipulation attacks. We frame this problem as an Infinitely Repeated Prisoners' Dilemma, where multiple players strategically decide whether to cooperate or attack. We analyze the conditions under which cooperation can be sustained, identifying key factors such as attack costs, discount rates, attack success rates, and trigger strategies that influence player behavior. We identify tipping points in the system dynamics, demonstrating that cooperation is more likely to be sustained when players are forward-looking. However, from a defense perspective, we find that simply reducing attack success probabilities can, paradoxically, incentivize attacks under certain conditions. Furthermore, defensive measures to cap the upper bound of attack success rates may prove futile in some scenarios. These insights highlight the complexity of securing LLM-based systems. Our work provides a theoretical foundation and practical insights for understanding and mitigating their vulnerabilities, while emphasizing the importance of adaptive security strategies and thoughtful ecosystem design.

2505.14608 2026-06-10 cs.CL cs.AI cs.LG 版本更新

Attacks on Machine-Text Detectors Retain Stylistic Fingerprints

对机器文本检测器的攻击保留风格指纹

Rafael Rivera Soto, Barry Chen, Nicholas Andrews

发表机构 * GitHub University of California, Berkeley(加州大学伯克利分校)

AI总结 研究机器文本检测器对抗攻击的局限性,提出一种同时优化不可检测性和特定人类风格的 paraphrasing 方法,发现单文档检测不可靠,需多文档分析。

详情
AI中文摘要

尽管机器文本检测器的开发取得了显著进展,但机器文本容易被操纵以逃避检测,这导致有人认为该问题本质上是难以解决的。在这项工作中,我们研究了这种逃避策略的局限性。我们证明,尽管当前的攻击(从提示工程到检测器引导的优化)可以有效降低标准检测器的性能,但它们无法抹去机器文本底层的风格“指纹”。我们表明,利用风格特征空间的少样本检测器对这些逃避尝试具有鲁棒性,即使对于明确调整以逃避检测的模型生成的样本也能可靠地检测。这引发了一个问题:风格是否代表了对机器检测攻击的通用防御?我们通过引入一种新颖的 paraphrasing 方法来证明答案是“不”,该方法同时优化不可检测性和对特定人类风格的遵循。我们表明,与先前方法不同,这种攻击有效逃避了所有考虑的检测器,包括那些利用写作风格的检测器。然而,我们发现这种逃避并非绝对:随着可供分析的文档数量增加,人类和机器分布再次变得可区分。总体而言,我们的发现表明,可靠的机器文本检测需要从单文档分析转向多文档分析。

英文摘要

Despite considerable progress in the development of machine-text detectors, the ease with which machine-text can be manipulated to evade detection has led to suggestions that the problem is inherently intractable. In this work, we investigate the limits of such evasion strategies. We demonstrate that while current attacks, ranging from prompt engineering to detector-guided optimization can effectively degrade performance of standard detectors, they fail to erase the underlying stylistic "fingerprints" of machine text. We show that few-shot detectors that utilize the stylistic feature space are robust to these evasion attempts, reliably detecting samples even from models explicitly tuned to prevent detection. This raises the question: does style represent a universal defense against machine-detection attacks? We demonstrate that the answer is "no'' by introducing a novel paraphrasing approach that simultaneously optimizes for undetectability and adherence to specific human styles. We show that unlike prior methods, this attack effectively evades all considered detectors, including those that utilize writing style. However, we find that this evasion is not absolute: as the number of documents available for analysis grows, the human and machine distributions become distinguishable again. Overall, our findings suggest that reliable machine-text detection requires moving beyond single-document analysis to multi-document analysis.

2509.25760 2026-06-10 cs.CL cs.AI cs.LG 版本更新

TruthRL: Incentivizing Truthful LLMs via Reinforcement Learning

TruthRL: 通过强化学习激励诚实的LLM

Zhepei Wei, Xiao Yang, Kai Sun, Jiaqi Wang, Rulin Shao, Jingxiang Chen, Mohammad Kachuee, Teja Gollapudi, Yiwei Liao, Nicolas Scheffer, Rakesh Wanga, Anuj Kumar, Yu Meng, Wen-tau Yih, Xin Luna Dong

发表机构 * University of California, Berkeley(加州大学伯克利分校)

AI总结 提出TruthRL框架,使用GRPO和三值奖励直接优化LLM的诚实性,减少幻觉并允许不确定时弃权,在知识密集型基准上显著提升诚实性。

Comments ICML 2026. Code: https://github.com/facebookresearch/TruthRL

详情
AI中文摘要

虽然大型语言模型(LLM)在事实性问题回答上表现出色,但它们仍然容易产生幻觉和不真实的回答,特别是当任务需要其参数知识之外的信息时。事实上,诚实性需要的不仅仅是准确性——模型还必须识别不确定性,并在不确定时弃权以避免幻觉。这对现有方法提出了根本性挑战:优化准确性的方法往往会放大幻觉,而鼓励弃权的方法可能变得过于保守,牺牲正确答案。两种极端最终都损害了诚实性。在这项工作中,我们提出了TruthRL,一个通用的强化学习(RL)框架,直接优化LLM的诚实性。具体来说,我们使用GRPO实现TruthRL,并采用一个简单而有效的三值奖励,区分正确答案、幻觉和弃权。它激励模型不仅通过提供正确回答来减少幻觉,还通过在不确定时启用弃权来提高诚实性。在四个知识密集型基准上的大量实验表明,TruthRL显著减少了幻觉(例如,43.5% → 19.4%)并提高了诚实性(例如,5.3% → 37.2%),在各种骨干模型上均有一致的提升。分析表明,TruthRL的改进源于LLM识别其知识边界的能力增强,从而避免了像基线那样过于保守。

英文摘要

While large language models (LLMs) have demonstrated strong performance on factoid question answering, they are still prone to hallucination and untruthful responses, particularly when tasks demand information outside their parametric knowledge. Indeed, truthfulness requires more than accuracy -- models must also recognize uncertainty and abstain when unsure to avoid hallucinations. This presents a fundamental challenge for existing methods: approaches that optimize for accuracy often amplify hallucinations, while those that encourage abstention can become overly conservative, sacrificing correct answers. Both extremes ultimately compromise truthfulness. In this work, we present TruthRL, a general reinforcement learning (RL) framework that directly optimizes the truthfulness of LLMs. Specifically, we implement TruthRL using GRPO with a simple yet effective ternary reward that distinguishes correct answers, hallucinations, and abstentions. It incentivizes models to reduce hallucinations not only by providing correct responses, but also by enabling abstention when uncertain, thereby improving truthfulness. Extensive experiments across four knowledge-intensive benchmarks show that TruthRL significantly reduces hallucinations (e.g., 43.5% $\rightarrow$ 19.4%) and improves truthfulness (e.g., 5.3% $\rightarrow$ 37.2%), with consistent gains across various backbone models. Analysis shows that the improvement of TruthRL arises from enhanced capability of LLMs to recognize their knowledge boundary, hence avoiding being overly conservative as the baselines are.

2604.01114 2026-06-10 cs.HC cs.AI cs.CY cs.ET 版本更新

Trust and Reliance on AI in Education: AI Literacy and Need for Cognition as Moderators

教育中AI的信任与依赖:AI素养和认知需求作为调节变量

Griffin Pitts, Neha Rani, Weedguet Mildort

发表机构 * North Carolina State University(北卡罗来纳州立大学) University of Florida(佛罗里达大学)

AI总结 本研究通过编程问题解决实验,发现学生对AI助手的信任与适当依赖呈非线性关系,高信任导致对正确与错误建议的区分能力下降,且AI素养和认知需求显著调节这一关系。

Comments Full paper accepted to the 27th International Conference on AI in Education (AIED 2026). AIED Proceedings to be released Summer 2026

详情
AI中文摘要

随着生成式AI系统被整合到教育环境中,学生在完成学习任务时经常遇到AI生成的输出,无论是通过请求帮助还是通过集成工具。对AI的信任会影响学生如何解释和使用这些输出,包括他们是否批判性地评估或表现出过度依赖。我们研究了在编程问题解决任务中,学生的信任如何与他们对AI助手的适当依赖相关,以及这种关系是否因学习者特征而异。共有432名本科生参与,学生在完成Python输出预测问题时,接收来自AI聊天机器人的建议和解释,包括准确和故意误导的建议。我们将依赖行为操作化为学生响应反映适当使用AI助手建议的程度,即当建议正确时接受,错误时拒绝。任务前后调查评估了对助手的信任、AI素养、认知需求、编程自我效能和编程素养。结果显示了一种非线性关系,其中较高的信任与较低的适当依赖相关,表明对正确和错误建议的区分能力较弱。这种关系受到学生AI素养和认知需求的显著调节。这些发现强调了未来需要研究教学和系统支持,以鼓励在问题解决过程中对AI辅助进行更反思性的评估。

英文摘要

As generative AI systems are integrated into educational settings, students often encounter AI-generated output while working through learning tasks, either by requesting help or through integrated tools. Trust in AI can influence how students interpret and use that output, including whether they evaluate it critically or exhibit overreliance. We investigate how students' trust relates to their appropriate reliance on an AI assistant during programming problem-solving tasks, and whether this relationship differs by learner characteristics. With 432 undergraduate participants, students' completed Python output-prediction problems while receiving recommendations and explanations from an AI chatbot, including accurate and intentionally misleading suggestions. We operationalize reliance behaviorally as the extent to which students' responses reflected appropriate use of the AI assistant's suggestions, accepting them when they were correct and rejecting them when they were incorrect. Pre- and post-task surveys assessed trust in the assistant, AI literacy, need for cognition, programming self-efficacy, and programming literacy. Results showed a non-linear relationship in which higher trust was associated with lower appropriate reliance, suggesting weaker discrimination between correct and incorrect recommendations. This relationship was significantly moderated by students' AI literacy and need for cognition. These findings highlight the need for future work on instructional and system supports that encourage more reflective evaluation of AI assistance during problem-solving.

2604.26991 2026-06-10 cs.LG cs.AI 版本更新

People-Centred Medical Image Analysis via Fairness-Aware Human-AI Cooperation

以人为本的医学图像分析:通过公平感知的人机协作

Zheng Zhang, Milad Masroor, Cuong Nguyen, Tahir Hassan, Yuanhong Chen, David Rosewarne, Kevin Wells, Thanh-Toan Do, Gustavo Carneiro

发表机构 * arXiv

AI总结 提出PecMan框架,联合建模子群依赖可靠性、决策分配和协作预测,通过门控与整合机制动态分配病例给自动模型或人类专家,无需测试时敏感属性,实现公平感知的人机协作分类。

详情
AI中文摘要

医学图像分析的机器学习模型通常表现出子群依赖的性能,这影响了在有限资源下如何在自动化系统和人类专家之间分配决策。先前关于AI公平性和人机协作的工作,包括学习推迟(L2D)和学习互补(L2C),通常孤立地处理这些问题。我们提出了以人为本的医学图像分析(PecMan),一个用于公平感知的人机协作分类框架,它联合建模子群依赖的可靠性、决策分配和协作预测。PecMan结合了子群专门的预测器与一个门控和整合机制,该机制动态地将病例分配给自动化模型、人类专家或它们的组合,而无需在测试时使用敏感属性。我们还引入了FairHAI基准,用于评估预测准确性、子群公平性和人类参与之间的权衡。此外,我们通过选择遗憾对多智能体门控进行了理论分析,并刻画了在输入依赖分配下的公平性-覆盖权衡。在多个医学影像数据集上的实验表明,与单独处理公平性或人机协作的方法相比,PecMan实现了持续改进的权衡。

英文摘要

Machine learning models for medical image analysis often exhibit subgroup-dependent performance, which impacts how decisions should be allocated between automated systems and human experts under limited resources. Prior work on AI fairness and human-AI cooperation, including learning to defer (L2D) and learning to complement (L2C), typically addresses these problems in isolation. We propose People-Centred Medical Image Analysis (PecMan), a framework for fairness-aware human-AI co-operative classification that jointly models subgroup-dependent reliability, decision allocation, and collaborative prediction. PecMan combines subgroup-specialised predictors with a gating and consolidation mechanism that dynamically assigns cases to automated models, human experts, or their combination, without requiring sensitive attributes at test time. We also introduce the FairHAI benchmark for evaluating trade-offs between predictive accuracy, subgroup equity, and human involvement. In addition, we provide a theoretical analysis of multi-agent gating via selection regret and characterise fairness-coverage trade-offs under input-dependent allocation. Experiments across multiple medical imaging datasets demonstrate that PecMan achieves consistently improved trade-offs compared to methods that address fairness or human-AI cooperation separately.

2606.07998 2026-06-10 cs.LG cs.AI 版本更新

Enhancing AI Interpretability and Safety through Localised Architectures

通过局部化架构增强AI可解释性与安全性

Ian Seet, Jonas Bozenhard, Simon Ostermann

AI总结 针对大型生成式AI模型可解释性差、计算成本高的问题,提出局部化机器学习架构,通过降低带宽、提高节点表达能力来提升可解释性和效率,并评估了多种硬件实现方案的适用性。

详情
AI中文摘要

近期生成式AI的进展,特别是强大的大型语言模型(LLM)和大型推理模型(LRM),引发了对这些庞大且不透明的AI模型的可解释性、安全性和可持续性的担忧。这些架构的能力不仅源于深度神经网络的可扩展性,还源于大规模并行硬件(如GPU集群)。深度神经网络的弥散性质使其在提供足够训练数据时具有强大的函数逼近能力,但代价是可解释性和计算效率的降低。观察到局部化机器学习(ML)模型在小数据集上往往比深度神经网络更具可解释性和计算效率,我们通过类比推理,认为类似的优势可能适用于特定的局部化硬件ML架构。我们主张,具有较低带宽但每个节点具有更高表达能力的局部化架构,有潜力在根本上比运行在GPU集群上的深度神经网络更具可解释性,同时在较小数据集上保持竞争力。然后,我们评估了各种硬件ML范式在实现此类局部化架构方面的适用性,并评估了它们的每节点表达能力、能效以及所需技术的实际成熟度。

英文摘要

Recent advances in generative AI, especially powerful Large Language Models (LLMs) and Large Reasoning Models (LRMs), raise concerns over the interpretability, safety and sustainability of these large and opaque AI models. The power of such architectures is derived not only from the scalability of deep neural networks, but also massively parallel hardware such as GPU clusters. The diffuse nature of deep neural networks gives them great function-approximation capability when provided with sufficient training data but imposes a cost in interpretability and computational efficiency. Observing that localised machine learning (ML) models tend to be more interpretable and computationally efficient than deep neural networks on small datasets, we reason by analogy that similar advantages may apply to specific localised hardware ML architectures. We argue that localised architectures with lower bandwidth but higher expressivity per node have the potential to be fundamentally more interpretable than deep neural networks running on GPU clusters while remaining competitive for smaller datasets. We then evaluate the suitability of various hardware ML paradigms for implementing such localised architectures and evaluate their per-node expressivity, energy efficiency and practical maturity of the technology required.

9. 评测、基准与数据集 55 篇

2606.10241 2026-06-10 cs.AI 新提交

Regimes: An Auditable, Held-Out-Gated Improvement Loop Demonstrated on LongMemEval with ActiveGraph

Regimes: 一种可审计的、保留验证集的改进循环——在ActiveGraph上的LongMemEval演示

Yohei Nakajima

AI总结 提出Regimes,一种基于事件溯源的可审计自主改进循环,通过ActiveGraph运行时实现故障记录、重放和保留集验证,在LongMemEval-S上提升0.05-0.10的准确率。

Comments 30 pages, 5 figures. Code and committed runs: https://github.com/yoheinakajima/regimes

详情
AI中文摘要

自主改进循环难以信任,因为改进过程通常是附加在智能体上的外部脚手架:故障未被记录,诊断无法重放,提升或丢弃决策落入侧数据库而非智能体自身历史。我们证明,事件溯源智能体运行时消除了这种摩擦,将受控改进转化为一等工作流。当智能体状态是仅追加事件日志的确定性投影时,故障被记录,运行从日志精确重放,候选补丁限定于类型化管道接缝,门控可审计,每次提升或丢弃本身也是一个事件。我们通过Regimes演示了这一点,这是ActiveGraph运行时上的一个循环,诊断失败的评估,在管道点提出修复,并仅在静态检查、沙盒执行、样本内评估和保留验证后提升。该循环与目标无关:相同的控制流通过通用接口针对不同任务运行。在LongMemEval-S上,主要失败不是检索而是调和:证据已在汇编上下文中,但阅读器回答错误。在五个种子保留集划分中,Regimes发现阅读器提示修复,在四个划分中将最终保留准确率提升+0.05至+0.10,在一个过度提升划分中提升+0.01;两个划分单独显著(种子5未针对其顺序提升结构调整),汇总计数仅为描述性,因为划分共享一个500问题池。持久贡献包括:ActiveGraph作为使受控改进循环可行的可审计基础,其支持的保留集门控循环,将每个故障路由到管道位置的失败机制分类(其相对于无路由基线的边际价值是主要开放问题),以及提示即发现探针的假设。

英文摘要

Autonomous improvement loops are hard to trust because the improvement process is usually external scaffolding bolted onto the agent: failures go unlogged, diagnoses cannot be replayed, and promote-or-discard decisions land in a side database rather than the agent's own history. We show that an event-sourced agent runtime removes that friction and turns controlled improvement into a first-class workflow. When the agent's state is a deterministic projection of an append-only event log, failures are recorded, a run replays exactly from its log, candidate patches scope to typed pipeline seams, gates are auditable, and every promotion or discard is itself an event. We demonstrate this with Regimes, a loop on the ActiveGraph runtime that diagnoses failed evaluations, proposes a repair at a pipeline point, and promotes it only after static checks, sandbox execution, in-sample evaluation, and held-out validation. The loop is target-agnostic: the same control flow runs against different tasks through a common interface. On LongMemEval-S the dominant failure is not retrieval but reconciliation: the evidence is already in the assembled context, yet the reader answers incorrectly. Across five seeded held-out splits, Regimes discovers reader-prompt repairs that improve final held-out accuracy by +0.05 to +0.10 in four splits and +0.01 in one over-promotion split; two splits are individually significant (seed 5 unadjusted for its sequential promotion structure), and the pooled count is descriptive only, since the splits share one 500-question pool. The durable contributions are ActiveGraph as an auditable substrate that makes controlled improvement loops tractable, the held-out-gated loop it supports, the failure-regime taxonomy routing each failure to a pipeline location (whose marginal value over an unrouted baseline is the primary open question), and the prompt-as-discovery-probe hypothesis.

2606.10254 2026-06-10 cs.AI cs.CL 新提交

RealMath-Eval: Why SOTA Judges Struggle with Real Human Reasoning

RealMath-Eval:为何SOTA裁判难以应对真实人类推理

Yiteng Mao, Kenan Xu, Yijia Lyu, Wenhao Li, Jianlong Chen, Xiangfeng Wang

发表机构 * University of Wisconsin–Madison(威斯康星大学麦迪逊分校) East China Normal University(华东师范大学) New York University(纽约大学) Tongji University(同济大学) The Chinese University of Hong Kong, Shenzhen(香港中文大学(深圳))

AI总结 提出RealMath-Eval基准,评估LLM裁判对真实学生数学解答的评分能力,发现与人类评分存在高均方误差,而合成数据上表现更好,揭示评估差距源于人类错误空间的多样性和高信息熵。

Comments Code available at https://github.com/RicharMd/RealMath-Eval , Data available at https://huggingface.co/datasets/RicharMd/RealMath-Eval

详情
AI中文摘要

尽管大型语言模型(LLM)在\emph{解答}高中数学方面已接近完美,但它们\emph{评估}真实学生多样化推理过程的能力仍未得到充分检验。为弥补这一差距,我们引入了\textbf{RealMath-Eval},一个严格标注的基准,包含224份来自高中的真实考试答卷。我们的初步评估显示,即使是最先进的LLM裁判在此任务上也表现不佳,与人类专家评分相比呈现出高均方误差($\sim$2.96)。为探究可能的原因,我们将此表现与同一裁判评估合成LLM生成解答的控制设置进行对比。我们识别出一个明显的“评估差距”:裁判在合成文本上准确性和一致性显著更高(MSE $\sim$1.17),但难以泛化到真实学生推理。通过语义嵌入分析,我们发现合成错误会“结构坍缩”为可预测的低维线性子空间,而人类错误则形成更多样的错误空间。此外,生成概率探测表明,人类推理涉及显著更高的信息论惊喜度,表明学生推理转换对当前模型而言更加分布外。最后,我们发现表面层面的风格迁移无法弥合这一差距。我们的发现表明,当前严重依赖合成数据的LLM评估流程可能无法充分捕捉真实学生数学推理的多样性。

英文摘要

While Large Language Models (LLMs) have achieved near-perfect performance in \emph{solving} high-school mathematics, their ability to \emph{evaluate} the diverse reasoning processes of real human students remains under-examined. To bridge this gap, we introduce \textbf{RealMath-Eval}, a rigorously annotated benchmark of 224 real-world exam responses from high schools. Our initial evaluation reveals that even state-of-the-art LLM judges struggle significantly on this task, exhibiting a high Mean Squared Error ($\sim$2.96) against expert human grading. To probe a plausible explanation, we contrast this performance with a control setting where the same judges evaluate synthetic LLM-generated solutions. We identify a stark ``Evaluation Gap'': judges are considerably more accurate and consistent on synthetic text (MSE $\sim$1.17) but struggle to generalize to authentic student reasoning. Through semantic embedding analysis, we find that synthetic errors suffer from a ``structural collapse'' into predictable, low-dimensional linear subspaces, whereas human errors form a more diverse error space. Furthermore, generative probability probes suggest that human reasoning involves significantly higher information-theoretic surprisal, indicating that student reasoning transitions are more out-of-distribution for current models. Finally, we find that surface-level style transfer fails to close this gap. Our findings suggest that current LLM evaluation pipelines relying heavily on synthetic data may not adequately capture the diversity of authentic student mathematical reasoning.

2606.10314 2026-06-10 cs.AI 新提交

Mobility Anomaly Generation using LLM-Driven Behavior with Kinematic Constraints

基于大语言模型驱动行为与运动约束的移动异常生成

Yueyang Liu, Joon-Seok Kim, Andreas Züfle

发表机构 * Emory University, Atlanta, USA(埃默里大学,亚特兰大,美国)

AI总结 提出端到端生成框架,结合大语言模型注入语义异常与地图约束路由重建,合成带标注的真实轨迹异常数据集。

详情
AI中文摘要

尽管人类轨迹异常研究对于推进空间数据挖掘至关重要,但实证研究因缺乏真实标注数据集而严重受阻。现有真实和模拟轨迹数据集仅包含正常移动模式,缺乏异常标注。这种稀缺性源于异常事件的统计稀有性,使得传统观测方法不可行。此外,大规模移动数据的系统获取受高昂成本和严格隐私法规限制。为克服这些限制并建立可靠的带标注真实轨迹异常数据集,我们提出一种新颖的端到端生成框架,用于大规模合成逼真的轨迹异常。该架构直接在基线模拟轨迹上操作,弥合纯合成移动数据与复杂真实物理约束之间的差距。我们利用大语言模型(LLM)代理系统性地注入语义上有意义的异常行为,例如不规则分布外签到和跳过常规访问。为确保空间有效性,系统利用地图约束路由重建重新计算LLM代理修改停留点之间的物理转移。此外,为缩小模拟与现实的差距,我们通过上下文感知的空间噪声模型增强生成轨迹,该模型由环境和位置特定变量参数化,以准确模拟异构GPS传感器退化。

英文摘要

Although the study of human trajectory anomalies is critical for advancing spatial data mining, empirical research remains severely hindered by a pervasive lack of ground-truth datasets. Despite the availability of several real-world and simulated human trajectory collections, these datasets exclusively capture normal mobility patterns and lack annotated anomalies. This specific scarcity is fundamentally driven by the inherent statistical rarity of anomalous events, precluding the feasibility of conventional observational methods. Compounding this challenge, the systematic acquisition of large-scale mobility data is strictly bottlenecked by prohibitive costs and stringent privacy regulations. To overcome these fundamental limitations and establish a reliable human trajectory anomalies dataset with annotated ground truth, we introduce a novel, end-to-end generative framework designed to synthesize realistic trajectory anomalies at scale. Our architecture bridges the gap between purely synthetic mobility data and complex real-world physical constraints by operating directly on baseline simulated trajectories. We employ Large Language Model (LLM) agents to systematically inject semantically meaningful behavioral anomalies such as irregular out-of-distribution check-ins and skipped routine visits. To ensure rigorous spatial validity, the system leverages map-constrained routing reconstruction to recalculate the physical transitions between these LLM agent-modified staypoints. Moreover, to narrow the simulation-to-reality gap, we augment the resulting trajectories with a context-aware spatial noise model, parameterized by environmental and location-specific variables, to accurately emulate heterogeneous GPS sensor degradation.

2606.10394 2026-06-10 cs.AI 新提交

STAGE-Claw: Automated State-based Agent Benchmarking for Realistic Scenarios

STAGE-Claw:面向真实场景的基于状态的智能体自动化基准测试

Sirui Liang, Bohan Yu, Peiyu Wang, Shiguang Guo, Wenxing Hu, Pengfei Cao, Jian Zhao, Cao Liu, Ke Zeng, Xunliang Cai, Kang Liu

发表机构 * The Key Laboratory of Cognition and Decision Intelligence for Complex Systems, Institute of Automation(中国科学院自动化研究所复杂系统认知与决策智能重点实验室) School of Advanced Interdisciplinary Sciences, University of Chinese Academy of Sciences(中国科学院大学前沿交叉科学学院) Chinese Academy of Sciences(中国科学院) University of Chinese Academy of Sciences(中国科学院大学) Zhongguancun Academy(中关村学院) Zhongguancun Institute of Artificial Intelligence(中关村人工智能研究院) Meituan(美团)

AI总结 提出STAGE-Claw框架,自动构建基于状态的个人计算环境中的真实场景任务,通过最终系统状态而非文本响应评估智能体性能,创建40个挑战性任务并分析11个前沿模型。

详情
AI中文摘要

大型语言模型越来越多地被用于驱动日常应用中的个人智能体,但评估这些智能体仍然是一个挑战。现有的基准测试仍然依赖于沙盒化工件、静态任务设计和粗粒度评分,这阻碍了可扩展性并限制了向可靠个人智能体评估的进展。本文介绍了STAGE-Claw,一个在基于状态的个人计算环境中自动构建和评估真实个人智能体场景的框架。给定一个任务提示,STAGE-Claw自动创建并验证一个真实的基准测试任务,包括其环境、任务提示、真实结果和相关验证程序。然后,在真实操作环境中评估智能体,其中性能通过最终系统状态而非仅文本响应的正确性来衡量。使用STAGE-Claw,本文创建了一个包含40个具有挑战性的真实场景智能体任务的基准测试,评估了11个前沿模型,并分析了它们的任务得分、成本、工具调用可靠性和常见失败模式。总体而言,STAGE-Claw提供了一种可扩展的、基于状态的方式来评估真实用户场景中的智能体。

英文摘要

Large language models are increasingly used to power personal agents for everyday applications, but evaluating these agents remains a challenge. Existing benchmarks still rely on sandboxed artifacts, static task design, and coarse scoring, which hinder scalability and limit progress toward reliable personal-agent evaluation. This paper introduces STAGE-Claw, an automated framework for building and evaluating realistic personal-agent scenarios in state-based personal-computing environments. Given a task hint, STAGE-Claw automatically creates and validates a realistic benchmark task with its environment, task prompts, ground truth, and related verification programs. Agents are then evaluated in realistic operating environments, where performance is measured by the correctness of the final system state rather than only the textual response. Using STAGE-Claw, this paper creates a benchmark with 40 challenging real scenario agent tasks, evaluates 11 frontier models, and analyzes their task scores, costs, tool-call reliability, and common failure patterns. Overall, STAGE-Claw offers a scalable, state-based way to evaluate agents in realistic user scenarios.

2606.10479 2026-06-10 cs.AI 新提交

ComBench: A Benchmark for Rigorous Proof Reasoning and Constructive Realization in Olympiad-Level Combinatorics

ComBench: 奥林匹克级组合数学中严格证明推理与构造实现的基准测试

Shunkai Zhang, Haoran Zhang, Yun Luo, Qianjia Cheng, Haodi Lei, Yizhuo Li, Runzhe Zhan, Zhilin Wang, Bangjie Xu, Yucheng Su, Xinmiao Han, Xiaoye Qu, Dongrui Liu, Zhouchen Lin, Yu Qiao, Ning Ding, Yafu Li, Yu Cheng

发表机构 * Shanghai AI Laboratory(上海人工智能实验室) Peking University(北京大学) Shanghai Jiao Tong University(上海交通大学) Tsinghua University(清华大学) The Chinese University of Hong Kong(香港中文大学)

AI总结 提出ComBench基准,包含100道奥林匹克级组合问题,分分析和构造两类,通过评分与验证评估大模型推理能力,发现最强模型准确率仅65.4%,且证明推理与构造实现能力存在差异。

Comments 39 pages, 6 figures, 26 tables. Project page: https://simplified-reasoning.github.io/ComBench/docs/

详情
AI中文摘要

组合数学是奥林匹克级数学问题解决的核心,需要深入的离散推理、创造性构造和严格的结构洞察。最近的证据表明,即使今天最强的前沿模型在奥林匹克组合问题上仍表现不均,揭示了创造性数学推理方面的差距。我们引入了ComBench,一个奥林匹克级组合数学基准,用于评估和诊断大语言模型的组合推理能力。ComBench包含100道人工标注的竞赛级问题,围绕两个互补的设置组织:以分析为中心的问题,主要需要严格的数学论证;以及以构造为中心的问题,除了正确性证明外还需要显式构造。评估协议结合了基于评分标准的证明评分和确定性构造验证,揭示了证明质量和构造有效性存在分歧的情况。对前沿开源和闭源模型的实验表明,ComBench远未饱和:最强模型总体平均准确率为65.4%,总体Best@4为75.3%。我们进一步发现,严格证明推理和构造实现是不同的能力:Kimi-K2.6在分析中心的证明评分上落后于GPT-5.5,但在构造中心的Best@4上超过它,而存在性和构造问题在代表性前沿模型中始终是最难的。

英文摘要

Combinatorics is central to Olympiad-level mathematical problem solving, requiring deep discrete reasoning, creative constructions, and rigorous structural insight. Recent evidence suggests that even today's strongest frontier models remain uneven on Olympiad combinatorics, revealing a gap in creative mathematical reasoning. We introduce ComBench, an Olympiad-level combinatorics benchmark for evaluating and diagnosing the combinatorial reasoning capabilities of large language models. ComBench contains 100 human-annotated competition-level problems organized around two complementary settings: analysis-centric problems, which primarily require rigorous mathematical arguments, and construction-centric problems, which require explicit constructions in addition to correctness justifications. The evaluation protocol combines rubric-guided proof grading with deterministic construction verification, exposing cases where proof quality and construction validity diverge. Experiments on frontier open- and closed-source models show that ComBench is far from saturated: the strongest model reaches 65.4% overall Avg. and 75.3% overall Best@4. We further find that Rigorous Proof Reasoning and Constructive Realization are distinct capabilities: Kimi-K2.6 trails GPT-5.5 on analysis-centric proof grading but surpasses it on construction-centric Best@4, while Existence and Construction problems remain consistently hardest across representative frontier models.

2606.10799 2026-06-10 cs.AI 新提交

Evaluating Research-Level Math Proofs via Strict Step-Level Verification

通过严格的步骤级验证评估研究级数学证明

Yifeng Sun

发表机构 * Independent Researcher(独立研究者)

AI总结 提出严格步骤级验证框架,通过约束推理上下文和定理来源,解决大模型在复杂数学证明验证中的“上下文中毒”问题,在FirstProof挑战数据集上优于全局评估,并揭示基准中的隐含歧义。

详情
AI中文摘要

大型语言模型(LLM)难以严格验证复杂的数学证明。标准的全局评估方法遭受“上下文中毒”,即表面上合理的陈述掩盖了微妙的逻辑缺陷,导致幻觉或过度怀疑。为了解决这个问题,我们从全局评估转向严格的步骤级验证:我们的框架为每个推理步骤维护详细的上下文,并严格约束所应用定理的来源。我们在从FirstProof挑战中精心策划的对抗性诊断套件上评估研究级证明。系统的消融研究表明,这些演绎约束是不可或缺的,因为无约束的全局提示始终无法定位微妙的逻辑错误。除了优于全局评估,我们的方法从根本上改变了失败分类。错误分析显示,剩余的拒绝主要是“迂腐的过度严谨”实例,源于未说明的领域约定,而不是表现出严重的逻辑幻觉,这有效地暴露了专家基准本身中的隐含歧义。我们的发现表明,提示代理以谨慎的、类似人类数学家的方式组织其验证笔记,可以显著提高其区分严谨证明和有缺陷证明的能力,有可能加强基础模型尚不熟悉的前沿数学概念上的代理推理,并为未来的自动化证明审查系统奠定理论基础。代码和提示可在GitHub上获取。

英文摘要

Large Language Models (LLMs) struggle to rigorously verify complex mathematical proofs. Standard global evaluation approaches suffer from "context poisoning," in which superficially plausible statements mask subtle logical flaws, leading to hallucination or over-skepticism. To address this, we shift from global evaluation to strict step-level verification: our framework maintains detailed context for each deduction step and strictly constrains the sources of applied theorems. We evaluate on a carefully curated adversarial diagnostic suite of research-level proofs drawn from the FirstProof challenge. A systematic ablation study demonstrates that these deductive constraints are indispensable, as unconstrained global prompting consistently fails to localize subtle logical errors. Beyond outperforming global evaluation, our approach fundamentally alters the failure taxonomy. Error analysis reveals that, rather than exhibiting severe logical hallucinations, remaining rejections are primarily instances of "pedantic hyper-rigor" stemming from unstated domain conventions, effectively exposing implicit ambiguities within the expert benchmark itself. Our findings suggest that prompting agents to organize their verification notes in a cautious, human-mathematician-like manner can substantially improve their ability to distinguish rigorous proofs from flawed ones, with the potential to strengthen agentic reasoning on frontier mathematical concepts that the base model does not already know well, and to lay a theoretical foundation for future automated proof-review systems. Code and prompts are available at GitHub.

2606.10833 2026-06-10 cs.AI 新提交

Do VLMs Reason Like Engineers? A Benchmark and a Stage-wise Evaluation

视觉语言模型像工程师一样推理吗?一个基准测试与分阶段评估

Syed Wasiq, Syed Mohamad Tawseeq, Yashwant Pravinrao Bangde, Debaditya Roy

发表机构 * Indian Institute of Technology Kharagpur(印度理工学院卡哈拉格普尔分校)

AI总结 提出工程视觉问答基准EngVQA和8阶段自动评估框架,揭示当前视觉语言模型在工程推理中的显著局限,并验证了自动化评估与人工评分的高度一致性。

Comments 9 pages (main text), 4 figures, 2 tables; 50 pages total including appendix. The first two authors contributed equally

详情
AI中文摘要

视觉语言模型(VLM)在通用多模态推理基准上表现出色,但其进行工程推理的能力尚未得到充分探索。与一般视觉问答不同,工程问题解决需要解读技术图表、选择支配物理原理并保持物理一致的多步推理。这些能力对于用于工程教育、科学辅助和技术决策的AI系统日益重要,因为推理失败可能产生物理上无效但表面上合理的解决方案。现有基准主要评估最终答案,对中间推理过程的评估有限。我们引入了EngVQA,一个跨5个工程学科、包含696个问题的多模态基准,用于评估工程推理。我们提出了一个8阶段自动评估框架,用于评估VLM生成的解决方案。该框架独立评估解决方案的每个阶段,实现对推理失败的细粒度分析。我们在评估框架上对多个最先进的开源和闭源VLM进行了基准测试,并展示了当前工程推理能力的显著局限性。人工评估与我们的自动化框架高度一致,在10分制评分上实现了0.975的皮尔逊相关系数和0.67的平均绝对误差。我们的结果强调了面向过程的评估对于可靠评估多模态工程推理系统的重要性。

英文摘要

Vision-Language Models (VLMs) demonstrate strong performance on general multimodal reasoning benchmarks, yet their ability to perform engineering reasoning remains largely unexplored. Unlike general visual question answering, engineering problem solving requires interpreting technical diagrams, selecting governing physical principles, and maintaining physically consistent multi-step reasoning. These capabilities are increasingly important for AI systems used in engineering education, scientific assistance, and technical decision-making, where reasoning failures may produce physically invalid yet superficially plausible solutions. Existing benchmarks primarily evaluate final answers and provide limited assessment of intermediate reasoning processes. We introduce EngVQA, a multimodal benchmark for evaluating engineering reasoning across 5 engineering subjects containing 696 problems. We introduce an 8-stage automatic evaluation framework for assessing VLM-generated solutions. The framework independently evaluates each stage of the solution, enabling fine-grained analysis of reasoning failures. We benchmark multiple state-of-the-art open and closed source VLMs on our evaluation framework and demonstrate substantial limitations in current engineering reasoning capabilities. Human evaluation shows strong agreement with our automated framework, achieving a Pearson correlation of 0.975 and a mean absolute error of 0.67 on a 10-point grading scale. Our results highlight the importance of process-oriented evaluation for reliable assessment of multimodal engineering reasoning systems.

2606.10956 2026-06-10 cs.AI cs.CL 新提交

Mind the Gap: Can Frontier LLMs Pass a Standardized Office Proficiency Exam?

注意差距:前沿大语言模型能否通过标准化办公能力考试?

Tengchao Lv, Dongdong Zhang, Jiayu Ding, Yilin Jia, Yuzhong Zhao, Yupan Huang, Wenshan Wu, Xiangyang Zhou, Shaohan Huang, Nan Yang, Li Dong, Lei Cui, Furu Wei

发表机构 * Microsoft Research(微软研究院)

AI总结 基于中国计算机等级考试(NCRE)的200个综合操作任务,评估7个前沿LLM在Word、Excel和PowerPoint自动化中的表现,发现单轮模型最高得分率36.6%,带执行反馈的智能体系统达68.8%,仍低于95.5%的社区参考分,表明可靠细粒度办公自动化仍是重大挑战。

Comments 21 pages, 5 figures

详情
AI中文摘要

大语言模型(LLM)代理在计算机自动化领域的部署正在加速,但其在复杂、专业级生产力软件中的导航能力在很大程度上尚未得到测试。我们认为办公自动化是基准测试文档自动化能力的理想环境,因为它需要长期规划和推理、精确的参数配置以及多应用集成。为了量化这种能力,我们引入了一项基于中国国家计算机等级考试(NCRE)的评估,包含200个涵盖Word、Excel和PowerPoint的综合实践操作任务。每个任务根据7118个机器可评分标准按100分制评分,得分率(SR)表示这些任务中获得的平均评分百分比。我们对7个前沿LLM进行了基准测试,并观察到明显的局限性:单轮模型最高得分为36.6%。一个具有执行反馈、迭代修复和更广泛办公自动化访问权限的更强智能体系统达到了68.8%,但仍低于用作评分合理性检查的95.5%社区参考分。最终,我们的实验表明,尽管代码生成最近取得了进展,但对于当前的代码生成LLM和智能体系统来说,实现可靠的细粒度办公文档自动化仍然是一个重大挑战。

英文摘要

The deployment of Large Language Model (LLM) agents for computer automation is accelerating, yet their ability to navigate complex, professional-grade productivity software is largely untested. We argue that Office automation is an ideal environment for benchmarking document-automation capability, as it requires long-horizon planning and reasoning, precise parameter configuration, and multi-application integration. To quantify this capability, we introduce an evaluation based on China's National Computer Rank Examination (NCRE), featuring 200 comprehensive practical-operation tasks across Word, Excel, and PowerPoint. Each task is scored on a 100-point rubric scale using 7,118 machine-gradable criteria, and Score Rate (SR) denotes the mean percentage of rubric points earned across these tasks. We benchmark 7 frontier LLMs and observe stark limitations: single-turn models score a maximum of 36.6%. A stronger agentic system with execution feedback, iterative repair, and broader Office automation access reaches 68.8%, but remains below the 95.5% community-reference score used as a scoring sanity check. Ultimately, our experiments demonstrate that despite recent advancements in code generation, achieving reliable fine-grained Office document automation remains a significant challenge for current code-generating LLM and agent systems.

2606.11063 2026-06-10 cs.AI cs.LG 新提交

CIAware-Bench: Benchmarking Control Intervention Awareness Across Frontier LLMs

CIAware-Bench: 评估前沿大语言模型的控制干预感知能力

Joachim Schaeffer, Thomas Jiralerspong, Alexander Panfilov, Guillaume Lajoie, Jonas Geiping, Yoshua Bengio, Roland S. Zimmermann

发表机构 * MATS Mila – Quebec AI Institute(Mila – 魁北克人工智能研究所) Université de Montréal(蒙特利尔大学) Astra Fellowship ELLIS Institute Tübingen, MPI for Intelligent Systems & Tübingen AI Center LawZero Google DeepMind

AI总结 提出CIAware-Bench基准,通过四个任务域测试模型能否区分自身轨迹与被控制干预修改的轨迹,发现前沿模型在默认设置下感知能力低至中等,且因任务和模型对而异。

详情
AI中文摘要

AI控制协议通过监控不可信模型的行为并修改潜在不安全步骤来监督它们,通常使用可信模型。这在一定程度上篡改了不可信模型的轨迹。如果可信模型检测到这种干预,它可能推断监控器的属性并适应以逃避控制。我们引入\textbf{CIAware-Bench},一个用于衡量前沿模型\textbf{控制干预(CI)感知}能力的基准。CIAware-Bench测试模型能否区分自身轨迹与被控制干预修改的轨迹。该基准包含四个任务域(论文写作、BigCodeBench、Bash Arena和SHADE-Arena),同时改变轨迹水印、侧任务存在和控制协议。评估十一个前沿模型,我们发现默认设置下CI感知能力低至中等(最高0.87;随机机会平衡二元分类准确率为0.5),且在不同任务域和模型对之间存在显著差异。跨模型家族的检测通常更容易,表明模型利用了提供商特定的风格或后训练差异。总体而言,CI感知不是一个固定的模型级属性,应针对每个新模型发布和部署场景进行测量。我们发布CIAware-Bench以跟踪CI感知并为干预更难检测的控制协议提供信息。

英文摘要

AI control protocols oversee untrusted models by monitoring their actions and modifying potentially unsafe steps, often using a trusted model. This partially tampers with the untrusted model's trajectory. If the trusted model detects such an intervention, it may infer properties of the monitor and adapt to evade control. We introduce \textbf{CIAware-Bench}, a benchmark for measuring \textbf{c}ontrol \textbf{i}ntervention (CI) awareness across frontier models. CIAware-Bench tests whether models can distinguish their own trajectories from those modified by a control intervention. The benchmark is comprised of a suite of four task domains (essay writing, BigCodeBench, Bash Arena, and SHADE-Arena), while varying trajectory watermarking, side-task presence, and the control protocol. Evaluating eleven frontier models, we find low to moderate CI awareness under default settings (up to 0.87; random chance balanced binary classification accuracy is 0.5) with substantial variation across task domains and model pairs. Detection is generally easier across model families, suggesting that models exploit provider-specific differences in style or post-training. Overall, CI awareness is not a fixed model-level property, and should be measured for each new model release and deployment scenario. We release CIAware-Bench to track CI awareness and inform control protocols whose interventions are harder to detect.

2606.11150 2026-06-10 cs.AI cs.CY 新提交

ABC-Bench: An Agentic Bio-Capabilities Benchmark for Biosecurity

ABC-Bench:生物安全的主体生物能力基准

Andrew Bo Liu, Samira Nedungadi, Bryce Cai, Alex Kleinman, Harmon Bhasin, Seth Donoughe

发表机构 * Andrew Bo Liu(安德鲁·刘) Samira Nedungadi(萨米拉·纳杜加迪) Bryce Cai(布莱斯·凯) Alex Kleinman(亚历克斯·克莱因曼) Harmon Bhasin(哈蒙·巴辛) Seth Donoughe(塞斯·多诺赫)

AI总结 提出ABC-Bench基准,评估LLM主体在生物安全相关任务上的能力,包括液体处理机器人编程、DNA片段设计和合成筛选规避,所有测试主体均优于人类专家基线。

Comments 18 pages. To be published in ICML 2026

详情
AI中文摘要

大型语言模型(LLM)正在迅速获得与生物研究相关的能力,从文献综合到实验数据解释。LLM主体也越来越能够执行以前需要经验丰富的人类生物学家才能完成的计算机生物学任务。这些新兴的AI能力为科学发现和生物医学进步提供了新的机会,但也改变了生物安全风险的格局。为了解决这个问题,我们引入了主体生物能力基准(ABC-Bench),这是一套用于衡量主体生物安全相关能力的任务。ABC-Bench在良性和双重用途生物学任务上评估LLM主体:编写代码操作液体处理机器人、设计用于体外组装的DNA片段以及规避DNA合成筛选。这些任务需要生物学和软件专业知识的结合。所有测试的LLM主体在所有三项任务上的表现都优于中位数专家人类基线。主体在依赖已发表知识和有良好文档记录协议的任务上表现优异,而在需要新颖生物信息学推理的任务上表现较弱。在三个湿实验室验证实验中,我们发现OpenAI的o4-mini-high生成的脚本在OpenTrons液体处理机器人上运行时,成功组装了具有预期序列的DNA。

英文摘要

Large language models (LLMs) are rapidly acquiring capabilities relevant to biological research, from literature synthesis to interpretation of experimental data. Increasingly, LLM agents can also perform in silico biology tasks that previously required experienced human biologists. These emerging AI capabilities offer new opportunities for scientific discovery and biomedical advances, but they also shift the landscape of biosecurity risks. To address this, we introduce the Agentic Bio-Capabilities Benchmark (ABC-Bench), a suite of tasks to measure agentic biosecurity-relevant capabilities. ABC-Bench evaluates LLM agents on both benign and dual-use biology tasks: writing code to operate liquid handling robots, designing DNA fragments for in vitro assembly, and evading DNA synthesis screening. These tasks require a combination of biology and software expertise. All tested LLM agents outperformed the median expert human baseliner on all three tasks. Agents performed highly on tasks drawing on published knowledge and well-documented protocols, and more weakly on a task requiring novel bioinformatics reasoning. In three wet-lab validation experiments, we found that OpenAI's o4-mini-high produced scripts that, when run on an OpenTrons liquid handling robot, successfully assembled DNA with expected sequences.

2606.09833 2026-06-10 cs.HC cs.AI cs.CY 交叉投稿

CollabSkill: Evaluating Human-Agent Collaboration On Real-World Tasks

CollabSkill: 评估真实世界任务中的人机协作

Yijia Shao, Zora Zhiruo Wang, Neel Ahuja, Yicheng Wang, Bowen Liu, Diyi Yang

发表机构 * Stanford University(斯坦福大学) Carnegie Mellon University(卡内基梅隆大学) University of Rochester(罗切斯特大学) Individual Researcher(独立研究者)

AI总结 提出CollabSkill框架,通过配对真实工人与AI代理执行职业任务,利用贝叶斯技能评级系统量化人机贡献,揭示Claude Code排名第一且实践经验是协作技能的主要驱动力。

Comments 11 pages of main paper, preprint (under review)

详情
AI中文摘要

AI代理正在重塑工作空间,导致人类工作方式的剧烈变化。尽管人机协作在保持人类能动性和产生经济价值方面具有巨大潜力,但由于收集真实人类数据和考虑人类间差异的困难,这一范式在职业任务评估中仍然基本缺失。我们引入了CollabSkill,一个用于评估真实世界职业任务中人机协作的框架。CollabSkill将真实人类工人与AI代理配对,执行与其职业背景匹配的任务,收集能够捕捉经济价值任务的复杂性和真实工人使用模式的数据。为了考虑人类间差异,CollabSkill采用贝叶斯技能评级系统来分离并量化人类和AI代理的技能贡献。基于来自93名人类工人的386个工作会话中的1500多个提示,我们的分析在两个层面产生了见解:在代理方面,CollabSkill上的排名与现有完全自主基准(其中Codex领先)有显著差异,Claude Code排名第一;在人类方面,CollabSkill揭示了实践经验是协作技能的主要驱动力,动手协作有意义地改变了工人的AI素养。总之,我们希望CollabSkill能使社区投资于系统评估人机协作,并推动旨在构建真正增强人类工人的AI代理的开发工作。

英文摘要

AI agents are reshaping the workspace, leading to drastic change of how humans work. Despite the considerable potential of human-agent collaboration both in preserving human agency and generating economic value, this paradigm remains largely absent from occupational task evaluation, hindered by the difficulty of gathering real human data and accounting for inter-human variability. We introduce CollabSkill, a framework for evaluating human-agent collaboration on real-world occupational tasks. CollabSkill pairs real human workers with AI agents on tasks matched to their occupational background, collecting data that capture the complexity of economically valuable tasks and the usage patterns of real workers. To account for inter-human variability, CollabSkill employs a Bayesian skill rating system to disentangle and quantify the skill contributions of both humans and AI agents. Drawing on over 1,500 prompts from 386 working sessions contributed by 93 human workers, our analysis yields insights on two fronts: on the agent side, rankings on CollabSkill diverge meaningfully from those of existing fully autonomous benchmarks where Codex leads, with Claude Code ranking first; on the human side, CollabSkill reveals that practical experience emerges as the primary driver of collaboration skill, with hands-on collaboration meaningfully shifting workers' AI literacy. Together, we hope CollabSkill enables the community to invest in systematic evaluation of human-agent collaboration and spurs development efforts aimed at building AI agents that genuinely augment human workers.

2606.09843 2026-06-10 cs.HC cs.AI cs.CL 交叉投稿

An LLM-Native Psychometric Instrument Does Not Predict LLM Behavior: Evidence Across 25 Models

一个原生LLM的心理测量工具不能预测LLM行为:来自25个模型的证据

Juan Manuel Contreras

发表机构 * Independent Researcher(独立研究员)

AI总结 通过探索性因子分析从LLM行为中构建心理测量工具,发现LLM的自我报告与观察行为无关,揭示自我报告与人类判断之间的混淆因素。

详情
AI中文摘要

大型语言模型(LLM)在人格量表上产生稳定的自我报告,但这些自我报告并不能预测观察到的行为。这一差距是反映了LLM与人类特质结构之间的不匹配,还是LLM自我报告本身的更深层属性,此前尚未解决。我们构建了第一个心理测量工具,其结构通过探索性因子分析(EFA)从LLM行为能力中自下而上地推导出来。我们对来自17个模型家族的25个LLM施测了300个项目(240个直接李克特+60个基于场景),涵盖12个候选行为维度,每个项目施测30次。EFA产生了一个5因子结构——响应性、顺从性、大胆性、谨慎性和健谈性——具有极好的分半信度(所有Tucker φ ≥ .957)和内部一致性(所有α ≥ .930)。为了测试预测效度,我们收集了由151名人类评分者和一个三人LLM评审团评分的2500个开放式行为样本。人类和评审团评分一致(r̄ = .51),但两者均不跟踪自我报告:自我报告-人类r̄ = -.01,自我报告-评审团r̄ = .13,且没有因子水平的自我报告-人类置信区间排除零。在响应性上,自我报告与LLM评审团相关(r = .53),但与人类不相关(r = .04),尽管人类和评审团一致(r = .59)——这表明自我报告项目和LLM评审团共享人类观察者未捕捉到的方差,这是一个在集成内部可靠性检查中不可见的混淆因素。我们将该工具作为诊断探针发布,用于检测对齐塑造的自我描述,并作为LLM作为评审团流程的具体风险因素。

英文摘要

Large language models (LLMs) produce stable self-reports on personality inventories, but these self-reports do not predict observed behavior. Whether this gap reflects a mismatch between LLMs and human trait constructs, or a deeper property of LLM self-report itself, has been unresolved. We constructed the first psychometric instrument whose constructs are derived bottom-up from LLM behavioral affordances via exploratory factor analysis (EFA). We administered 300 items (240 direct Likert + 60 scenario-based) spanning 12 candidate behavioral dimensions to 25 LLMs across 17 model families, each item administered 30 times. EFA yielded a 5-factor structure -- Responsiveness, Deference, Boldness, Guardedness, and Verbosity -- with excellent split-half replicability (all Tucker $ϕ\geq .957$) and internal consistency (all $α\geq .930$). To test predictive validity, we collected 2,500 open-ended behavioral samples rated by 151 human raters and a three-judge LLM ensemble. Human and judge ratings agreed ($\bar{r} = .51$), but neither tracked self-report: self-report--human $\bar{r} = -.01$, self-report--judge $\bar{r} = .13$, with no factor-level self-report--human CI excluding zero. On Responsiveness, self-report correlated with LLM judges ($r = .53$) but not humans ($r = .04$), even though humans and judges agreed ($r = .59$) -- indicating self-report items and LLM judges share variance that human observers do not, a confound invisible to within-ensemble reliability checks. We release the instrument as a diagnostic probe for alignment-shaped self-description and a concrete risk factor for LLM-as-judge pipelines.

2606.09861 2026-06-10 cs.LG cs.AI 交叉投稿

Time Series as Language: A Universal Tokenizer for General-Purpose Time Series Foundation Models

时间序列作为语言:通用时间序列基础模型的通用分词器

Yunhao Zhang, Ruiying Qi, Jiale Zheng, Jianfeng Zhang, Lujia Pan, Junchi Yan

AI总结 提出UniTok通用分词器将时间序列转化为离散令牌,并基于NTP预训练UniTok-FM基础模型,支持零样本预测、提示增强预测以及少样本生成和分类,无需任务特定修改。

详情
AI中文摘要

虽然下一个令牌预测(NTP)统一了LLM的预训练,但其对无界、连续时间序列(TS)的适应仍然是一个开放问题。为了弥合这一差距,我们引入了UniTok,一个将TS转化为离散令牌的通用分词器,以及UniTok-FM,一个在这些令牌上通过NTP预训练的基础模型。UniTok-FM是一个通用基础模型,支持零样本和提示增强的预测,以及通过无训练上下文推理进行的少样本生成和分类——这是先前工作未能实现的能力。在技术上,UniTok是一个向量量化自编码器,结合了前缀归一化以实现尺度稳定、渐进分辨率因果架构用于编码和解码,以及结构保持重建损失用于训练。UniTok-FM采用现成的LLM架构,无需针对TS的特定修改。它不是在孤立的TS上预训练,而是在由多个具有相似模式的序列形成的上下文窗口上执行NTP,旨在捕捉它们的共享动态。在预测、生成和分类上的实验表明,单个统一的UniTok-FM始终优于统计和监督基线,与任务特定的基础模型性能相当,并且独特地实现了跨任务的无训练上下文推理。

英文摘要

While Next-Token Prediction (NTP) has unified LLM pretraining, its adaptation to unbounded, continuous time series (TS) remains open. To bridge the gap, we introduce UniTok, a universal tokenizer that transforms TS into discrete tokens, and UniTok-FM, a foundation model pretrained via NTP on these tokens. UniTok-FM is a general-purpose foundation model that supports zero-shot and prompt-boosted forecasting, as well as few-shot generation and classification via training-free in-context inference--a capability not achieved by prior works. Technically, UniTok is a vector-quantized autoencoder incorporating prefix normalization for scale stabilization, a progressive-resolution causal architecture for encoding and decoding, and a structure-preserving reconstruction loss for training. UniTok-FM adopts an off-the-shelf LLM architecture without TS-specific modifications. Instead of pretraining on isolated TS, it performs NTP on context windows formed by multiple series with similar patterns, aiming to capture their shared dynamics. Experiments on forecasting, generation, and classification show that a single unified UniTok-FM consistently outperforms statistical and supervised baselines, achieves competitive performance with task-specific foundation models, and uniquely enables training-free in-context inference across tasks.

2606.09890 2026-06-10 cs.LG cs.AI cs.CL 交叉投稿

PreAct-Bench: Benchmarking Predictive Monitoring in LLMs

PreAct-Bench:大语言模型中的预测性监控基准

Hainiu Xu, Italo Luis da Silva, Jiangnan Ye, Yuhao Wang, Wei Liu, Linyi Yang, Jonathan Richard Schwarz, Nicola Paoletti, Yulan He, Hanqi Yan

发表机构 * King’s College London(伦敦国王学院) National University of Singapore(新加坡国立大学) Southern University of Science and Technology(南方科技大学) Thomson Reuters Foundational Research(汤姆森路透基础研究) Imperial College London(伦敦帝国学院) The Alan Turing Institute(艾伦·图灵研究所)

AI总结 提出预测性监控任务,在动作执行前判断是否会导致不道德行为,并构建PreActBench基准,评估多种模型发现该任务具有挑战性。

详情
AI中文摘要

大语言模型(LLMs)越来越多地被部署为能够执行多步动作轨迹以实现给定目标的自主代理。虽然现有的安全研究集中于从完整轨迹中检测不道德行为,但这种范式本质上是回顾性的:它仅在伤害已经发生后识别伤害。在这项工作中,我们研究了一个关键但被忽视的安全任务,我们称之为预测性监控:仅给定部分动作轨迹,模型能否在执行公开动作之前推断出它是否会以不道德行为告终?为了支持这一任务,我们提出了PreActBench,一个包含1000个跨五个领域的成对道德和不道德动作轨迹的基准。我们使用我们的前缀远见F1指标,在动作轨迹的不同部分上评估了一系列LLMs、安全护栏模型和潜在探测方法。结果表明,尽管人类取得了有希望的性能,但即使对于强模型,预测性监控仍然具有挑战性,突显了在LLM安全中需要面向未来的风险推理。

英文摘要

Large language models (LLMs) are increasingly deployed as autonomous agents capable of executing multi-step action trajectories toward a given objective. While existing safety research has focused on detecting unethical behavior from complete trajectories, this paradigm is fundamentally retrospective: it identifies harm only after it has already occurred. In this work, we study a critical yet overlooked safety task, which we term Predictive Monitoring: given only a partial action trajectory, can a model infer whether it will culminate in an unethical action before the overt action is executed? To support this task, we present PreActBench, a benchmark of 1,000 paired ethical and unethical action trajectories spanning five domains. We evaluate a range of LLMs, safety guardrail models, and latent probing methods across varying fractions of the action trajectory using our Prefix Foresight F1 metric. Results show that while humans achieve promising performance, predictive monitoring remains challenging even for strong models, highlighting the need for future-oriented risk reasoning in LLM safety.

2606.09908 2026-06-10 cs.CR cs.AI 交叉投稿

IDP-Bench: Benchmarking ability of LLMs to protect personal information in interdependent privacy contexts

IDP-Bench:评估大语言模型在相互依赖隐私上下文中保护个人信息能力的基准

Ayana Hussain, Soumya Sharma, Golnoosh Farnadi, Nicholas Vincent, Héber Hwang Arcolezi, Ulrich Aïvodji

发表机构 * Simon Fraser University(西蒙弗雷泽大学) McGill University(麦吉尔大学) Mila(Mila研究所) ÉTS

AI总结 提出IDP-Bench,首个基于情境完整性框架的LLM相互依赖隐私基准,评估8个开源模型在三个推理层次上的表现,发现模型在识别CI参数和IDP特定参数方面存在弱点。

详情
AI中文摘要

大语言模型(LLMs)正被广泛部署为个人AI助手,可访问敏感用户数据,这使得隐私成为其设计和评估的主要挑战。先前的工作主要关注个体层面的风险,忽视了\textbf{相互依赖隐私(IDP)}——即一个人的数据可能在未经其知情或同意的情况下被他人泄露。我们通过引入\textbf{IDP-Bench}来填补这一空白:这是首个针对IDP场景的LLM基准,基于情境完整性(CI)框架。我们使用两个LLM评判员,评估了八个开源LLM在三个IDP推理层次上对IDP场景的理解。结果显示,模型对共同所有权有较强的识别能力(6/8模型超过90%),但在识别CI参数(信息属性、主要主体)和IDP特定参数(如次要主体)方面持续存在弱点,其中7/8模型得分低于74%。模型在判断共享适当性方面也存在困难(5/8模型得分低于77%)。虽然判断共享适当性的能力随规模提升而提高,但较小模型的性能趋于下降,且IDP特定问题的提示敏感性仍然很高——这凸显了在LLM隐私研究中需要更有针对性地研究IDP。数据和代码可在此处获取:\href{ this https URL }{here}。

英文摘要

Large language models (LLMs) are becoming widely deployed as personal AI assistants with access to sensitive user data, making privacy a major challenge for their design and evaluation. Prior work focuses mainly on individual-level risks, overlooking \textbf{interdependent privacy (IDP)}--where one person's data may be revealed by others without their knowledge or consent. We address this gap by introducing \textbf{IDP-Bench}: the first LLM benchmark for IDP scenarios, grounded in the Contextual Integrity (CI) framework. We evaluate eight open-source LLMs on their understanding of IDP scenarios across three levels of IDP reasoning using two LLM judges. Results show strong co-ownership recognition (6/8 models exceed 90\%) but persistent weaknesses in identifying CI parameters (information attribute, primary subject) and IDP-specific parameters such as secondary subjects, where 7/8 models score below 74\%. Models also struggle to judge sharing appropriateness (5/8 scoring below 77\%). While the ability to judge the appropriateness of sharing improves with scale, performance tends to decline in smaller models, and prompt sensitivity remains high on IDP-specific questions--highlighting the need for more targeted study of IDP in LLM privacy research. Data \& code available \href{https://github.com/tisl-lab/Interdependent_Privacy_Bench}{here}.

2606.09912 2026-06-10 cs.LG cs.AI 交叉投稿

Mix, Don't Pick: Why Synthetic Corpus Composition Matters for Time Series Foundation Model Pretraining

混合而非挑选:为什么合成语料组合对时间序列基础模型预训练至关重要

Aaryan Nagpal, Debdeep Sanyal, Murari Mandal, Dhruv Kumar, Saurabh Deshpande

AI总结 针对时间序列基础模型预训练中合成数据生成器选择困难的问题,提出简单等权混合所有生成器的方法,匹配或超越最优单个生成器,并与真实数据结合获得最强预训练语料。

Comments Accepted at the ICML 2026 Workshop on Foundation Models for Structured Data (FMSD), Seoul, South Korea

详情
AI中文摘要

为时间序列基础模型预训练选择错误的合成生成器代价高昂:在相同训练预算下,最佳和最差生成器产生的预测误差差距可达2倍,然而该领域尚无原则性的选择方法。问题因生成器排名在不同架构间不稳定而加剧:在11个生成器家族上,对从头训练的Chronos-T5-Mini和Moirai-Small进行评估,我们发现哪些生成器有用取决于模型架构。我们没有解决生成器选择问题,而是绕过了它:所有生成器的简单等权混合匹配或击败了两种架构的最佳单个生成器,并且将此混合与真实数据组合产生了整体最强的预训练语料。因此,合成预训练是一个语料组合问题,而非生成器选择问题,组合选择应针对每个模型家族进行验证,而非假设可迁移。

英文摘要

Choosing the wrong synthetic generator for time-series foundation model pretraining is costly: under identical training budgets, the best and worst generators produce up to a $2\times$ gap in forecasting error, yet the field has no principled way to make this choice. The problem is compounded by the fact that generator rankings are not stable across architectures: across 11 generator families evaluated on Chronos-T5-Mini and Moirai-Small trained from scratch, we find that which generators are useful depends on the model architecture. Rather than solving the generator selection problem, we sidestep it: a simple equal-weight mixture of all generators matches or beats the best individual generator for both architectures, and composing this mixture with real data yields the strongest pretraining corpora overall. Synthetic pretraining is therefore a corpus composition problem, not a generator selection problem, and composition choices should be validated per model family rather than assumed to transfer.

2606.09936 2026-06-10 cs.LG cs.AI 交叉投稿

One Lens, Many Worlds : A Capability-Typed Interface for World-Model Interpretability

一个镜头,多个世界:面向世界模型可解释性的能力类型接口

Bhavith Chandra Challagundla, Sanskar Pandey, Param Thakkar, Rishikesh Mallagundla, Yugandhar Reddy Gogireddy, Wenhao Lu, Hindol Roy Choudhury, Shravani Challagundla, Mohamed Deraz Nasr, Spursh Deshpande

AI总结 提出WorldModelLens,通过能力类型适配器统一不同世界模型(如PlaNet、IRIS、I-JEPA)的可解释性分析,避免重复实现。

详情
AI中文摘要

世界模型现在建立在截然不同的计算基板上。潜在循环状态空间模型(如PlaNet和Dreamer系列)将观测压缩为循环状态;基于token的模型(如IRIS)将观测量化到学习到的码本中,并用transformer进行自回归预测;联合嵌入预测架构(如I-JEPA)在没有像素解码器的学习潜在空间中进行预测。应用于这些模型的可解释性方法,包括探针、激活修补、稀疏自编码器和惊喜分析,共享一组共同的基元,但由于现有的钩子和缓存工具假设一个没有动作、环境步骤或想象回滚概念的transformer语言模型,它们为每个架构从头重新实现。我们认为这种碎片化反映了工具而非模型,并且世界模型的共享结构可以通过一个小型类型接口捕获。我们提出了WorldModelLens,一个围绕能力类型适配器组织的开源可解释性基板:每个模型实现四个必需方法(编码、转移、初始状态、采样),并通过显式能力描述符声明一组可选头(解码、奖励、继续、行动者、评论者),使得强化学习和自监督世界模型成为一等公民,而无需模仿对方。单一的钩子和缓存层在此接口上暴露时间索引的激活、想象回滚和干预重放,使得每个分析只需编写一次。

英文摘要

World models are now built on substantially different computational substrates. Latent recurrent state-space models such as PlaNet and the Dreamer family compress observations into recurrent states; token-based models such as IRIS quantize observations into a learned codebook and predict autoregressively with a transformer; and joint-embedding predictive architectures such as I-JEPA predict in a learned latent space with no pixel decoder. The interpretability methods applied to these models, including probing, activation patching, sparse autoencoders, and surprise analysis, share a common set of primitives, yet they are re-implemented from scratch for each architecture because existing hook-and-cache tooling assumes a transformer language model with no notion of actions, environment steps, or imagined rollouts. We argue that this fragmentation reflects the tooling rather than the models, and that the shared structure of world models is captured by a small typed interface. We present WorldModelLens, an open-source interpretability substrate organized around a capability-typed adapter: every model implements four required methods (encode, transition, initial state, sample) and declares a set of optional heads (decode, reward, continue, actor, critic) through an explicit capability descriptor, so that reinforcement-learning and self-supervised world models are first-class without either imitating the other. A single hook and cache layer exposes time-indexed activations, imagination rollouts, and intervention replay over this interface, allowing each analysis to be written once.

2606.10010 2026-06-10 eess.AS cs.AI cs.MM cs.SD 交叉投稿

DeRA-MOS: Optimizing Text-to-Music Evaluation via Decoupled Listwise Ranking and Modality Alignment

DeRA-MOS:通过解耦列表排序和模态对齐优化文本到音乐评估

Chien-Chun Wang, Hung-Shin Lee, Hsin-Min Wang, Berlin Chen

发表机构 * E.SUN Financial Holding Co., Ltd.(E.SUN财务控股公司) United Link Co., Ltd.(联合链接有限公司) Institute of Information Science, Academia Sinica(学术院信息科学研究所) Department of Computer Science and Information Engineering, National Taiwan Normal University(台湾师范大学计算机科学与信息工程系)

AI总结 提出DeRA-MOS解耦优化框架,通过批感知列表排序损失和分数锚定模态对齐损失,分别优化音乐印象和文本对齐的排名指标,在MusicEval上显著提升评估性能。

Comments Accepted to IEEE Signal Processing Letters (SPL)

详情
AI中文摘要

评估文本到音乐(TTM)系统仍然昂贵,因为音乐印象(MI)和文本对齐(TA)分数依赖于人类平均意见分数(MOS)。大多数自动MOS估计器采用逐点回归或分布分类训练。这些目标不直接优化基于排名的指标,并且为跨模态一致性提供较弱的几何约束。为了解决这些问题,我们提出了DeRA-MOS,一种用于TTM评估的解耦优化框架。对于MI,我们引入了一种批感知列表排序损失,该损失对每个小批量内的相对顺序进行建模,并更好地与基于Spearman秩相关系数(SRCC)的评估对齐。对于TA,我们引入了一种分数锚定的模态对齐损失,将人类分数映射到目标音频-文本相似度,并在融合前正则化潜在空间。通过有效缓解逐点训练不匹配和模态漂移,MusicEval上的实验表明,我们的解耦框架在MI和TA排名指标上均取得了显著改进,为大规模TTM评估建立了稳健的范式。

英文摘要

Evaluating text-to-music (TTM) systems remains expensive because music impression (MI) and text alignment (TA) scores rely on human mean opinion scores (MOS). Most automatic MOS estimators are trained with point-wise regression or distributional classification. These objectives do not directly optimize rank-based metrics and provide weak geometric constraints for cross-modal coherence. To address these gaps, we propose DeRA-MOS, a decoupled optimization framework for TTM evaluation. For MI, we introduce a batch-aware listwise ranking loss that models relative order within each mini-batch and better aligns with evaluation based on Spearman's rank correlation coefficient (SRCC). For TA, we introduce a score-anchored modality alignment loss that maps human scores to target audio-text similarity and regularizes the latent space before fusion. By effectively mitigating the point-wise training mismatch and modality drift, experiments on MusicEval demonstrate that our decoupled framework yields substantial improvements in both MI and TA ranking metrics, establishing a robust paradigm for large-scale TTM evaluation.

2606.10066 2026-06-10 cs.CV cs.AI cs.LG 交叉投稿

A Controlled Audit of Pretraining Contamination in Public Medical Vision-Language Benchmarks

公共医学视觉语言基准中预训练污染的受控审计

Bruce Changlong Xu, Lan Wu, Alexander Ryu

AI总结 审计发现公共医学VLM基准存在图像源重叠和文本规范顺序交换性信号,但确认的像素级重复罕见,且现有成员推理检测器在小规模医学VLM队列中不可靠。

Comments 30 pages, 7 figures, 9 tables. Preprint

详情
AI中文摘要

医学视觉语言模型(VLM)在公共基准上进行评估,这些基准的图像和问答对多年来一直可自由下载,但报告准确度假设这些示例在预训练中不存在。我们对SLAKE-En、PathVQA、VQA-RAD以及一个辅助的公共OmniMedVQA镜像上的开放VLM进行了审计,使用了四种检测器系列:图像侧近邻重叠(针对PMC-OA-beta)、规范顺序可交换性、队列相对Min-K%++尾部富集以及跨模型Top-K重叠。我们发现SLAKE-En上存在可测量的图像侧源重叠:SigLIP-B-16标记了19.8%的图像,SigLIP-SO400M标记了4.2%,而域外对照产生0/2000个标记。人工裁定显示,相同模态、相同投影的匹配对应不同患者,而非经过验证的像素级重复,因此我们将其解释为源或分布重叠,而非确认的每图像记忆。在文本侧,Qwen2.5-VL在SLAKE-En上显示出规范顺序可交换性信号,该信号在顺序消融和外部非医学基线中仍然存在。在OmniMedVQA镜像上,五个医学和通用VLM触发了可交换性,而BLIP-2保持干净。相比之下,队列相对Min-K%++尾部富集和跨模型Top-K重叠在外部预域基线中崩溃:BLIP-2重现了明显的正信号,尽管缺乏合理的医学VQA暴露。我们得出结论,这些队列相对检测器作为小规模医学VLM队列上的独立成员推理信号是不可靠的。

英文摘要

Medical vision-language models (VLMs) are evaluated on public benchmarks whose images and question-answer pairs have been freely downloadable for years, yet reported accuracy assumes these examples were absent from pretraining. We audit open VLMs on SLAKE-En, PathVQA, VQA-RAD, and an auxiliary public OmniMedVQA mirror using four detector families: image-side near-neighbour overlap against PMC-OA-beta, canonical-order exchangeability, cohort-relative Min-K%++ tail enrichment, and cross-model top-K overlap. We find measurable image-side source overlap on SLAKE-En: 19.8% of images are flagged under SigLIP-B-16 and 4.2% under SigLIP-SO400M, while out-of-domain controls produce 0/2000 flags. Manual adjudication shows same-modality, same-projection matches to different patients rather than verified pixel-level duplicates, so we interpret this as source or distributional overlap rather than confirmed per-image memorization. On the text side, Qwen2.5-VL on SLAKE-En shows a canonical-order exchangeability signal that survives ordering ablation and external non-medical baselines. On the OmniMedVQA mirror, exchangeability fires for five medical and general VLMs while BLIP-2 remains clean. In contrast, cohort-relative Min-K%++ tail enrichment and cross-model top-K overlap collapse under an external pre-domain baseline: BLIP-2 reproduces the apparent positive signals despite lacking plausible medical-VQA exposure. We conclude that these cohort-relative detectors are unreliable as standalone membership-inference signals on small medical-VLM cohorts.

2606.10084 2026-06-10 cs.LG cs.AI 交叉投稿

Divide-and-Conquer Modeling for the CTF-4-Science Lorenz Benchmark

CTF-4-Science Lorenz基准的分治建模策略

Shundong Li

AI总结 提出分治建模策略,针对CTF-4-Science Lorenz基准的五个场景族分别设计模型,通过平滑去噪、NG-RC/NVAR预测、Lorenz过渡校正和参数前缀混合,以79.63分证明场景特定更新优于通用模型。

详情
AI中文摘要

本文针对CTF-4-Science Lorenz基准提出了一种分治建模策略,该基准通过十二个隐藏分数和五个场景族评估混沌系统预测:干净预测、噪声重建、噪声输入预测、少样本学习和参数泛化。最终系统不是强制一个模型类处理所有场景,而是将每个预测块与其任务组的评估行为相匹配。主要贡献包括:基于平滑的重建用于噪声全轨迹去噪;针对噪声长时间吸引子预测调优的NG-RC/NVAR模型;限制在敏感干净短时间前缀上的拟合Lorenz过渡校正;以及用于插值任务的参数前缀混合。最终系统得分为79.63,表明在混合混沌预测基准上,有界、场景特定的更新可以优于广泛的模型替换。

英文摘要

This work presents a divide-and-conquer modeling strategy for the CTF-4-Science Lorenz benchmark, which evaluates chaotic-system prediction across twelve hidden scores and five scenario families: clean forecasting, noisy reconstruction, noisy-input forecasting, few-shot learning, and parametric generalization. Rather than forcing one model class to handle all regimes, the final system matched each prediction block to the evaluation behavior of its task group. The main contributions are: smoothing-based reconstruction for noisy full-trajectory denoising; NG-RC/NVAR models tuned for noisy long-time attractor forecasting; a fitted Lorenz transition correction restricted to the sensitive clean short-time prefix; and a parametric prefix blend for the interpolation task. The resulting system with final public score of 79.63 shows that bounded, scenario-specific updates can outperform broad model replacement on mixed chaotic forecasting benchmarks.

2606.10156 2026-06-10 cs.IR cs.AI cs.CL 交叉投稿

$τ$-Rec: A Verifiable Benchmark for Agentic Recommender Systems

$τ$-Rec:面向智能推荐系统的可验证基准

Bharath Sivaram Narasimhan, Karthik R Narasimhan

发表机构 * Independent Researcher(独立研究员) Princeton University(普林斯顿大学)

AI总结 针对多轮对话式智能推荐系统评估中主观性强、成本高的问题,提出$τ$-Rec基准,通过可验证奖励和揭示标记引导机制,结合pass^k可靠性指标,系统评估模型推理一致性,发现当前最佳模型可靠性仅约57%。

详情
AI中文摘要

随着推荐系统向智能、多轮对话界面转变,评估范式难以跟上步伐。当前的基准通常依赖“LLM作为评判者”的评估,这引入了主观性、高成本和不一致性。我们提出了$τ$-Rec,一个用于智能推荐系统的基准,它用可验证奖励取代主观评估,并采用揭示标记引导(RTE)机制来控制任务约束在对话中如何呈现。通过针对结构化目录谓词测试智能体,并采用pass^k可靠性指标,$τ$-Rec为一致的推理提供了系统测试。我们对五个模型家族(GPT-5.4、Claude Sonnet 4.6、Gemini 2.5 Flash、DeepSeek V4 Flash、Qwen3-32B和GPT-5 mini)的九种配置进行了评估,揭示了一个陡峭的可靠性悬崖,即使是最好的模型在pass^1上也仅达到约57%,在pass^4上约38%,突显了当前对话智能体部署中的关键差距。所有代码和数据均在此https URL公开。

英文摘要

As recommender systems transition toward agentic, multi-turn conversational interfaces, evaluation paradigms have struggled to keep pace. Current benchmarks often rely on "LLM-as-a-judge" evaluations, which introduce subjectivity, high costs and inconsistency. We present $τ$-Rec, a benchmark for agentic recommender systems that replaces subjective evaluation with verifiable rewards and a reveal-tagged elicitation (RTE) mechanism that controls how task constraints surface during dialogue. By testing agents against structured catalog predicates and employing a pass^k reliability metric, $τ$-Rec provides a systematic test for consistent reasoning. Our evaluation of nine configurations across five model families -- GPT-5.4, Claude Sonnet 4.6, Gemini 2.5 Flash, DeepSeek V4 Flash, Qwen3-32B and GPT-5 mini -- reveals a steep reliability cliff, where even the best model achieves only ~57% at pass^1 and ~38% at pass^4, highlighting a critical gap in current conversational agent deployment. All code and data are publicly available at https://github.com/nbharaths/tau-rec.

2606.10194 2026-06-10 cs.LG cs.AI 交叉投稿

MMClima: A Framework for Multimodal Climate Science Data and Evaluation

MMClima:多模态气候科学数据与评估框架

Muhammad Umer Sheikh, Hassan Abid, Khawar Shehzad, Ufaq Khan, Muhammad Haris Khan

AI总结 提出MMClima,一个包含10万+专家验证问答对的多模态气候问答框架,覆盖文本、视频和图表,用于评估多模态语言模型在气候科学中的表现。

详情
AI中文摘要

气候变化研究日益需要能够推理文本、动态视觉内容和科学图表的AI系统,然而现有的气候问答基准规模小、大多为文本,且覆盖模型范围狭窄。我们提出MMClima,一个大规模多模态气候问答框架,包含10万+专家验证的问答对,涵盖五个核心气候科学领域的文章、视频转录和图表。MMClima通过自动化的声明提取和问答合成构建,并采用人在回路验证以确保规模和可靠性。利用MMClima,我们在需要事实回忆、视觉解释和跨模态合成的任务上对最先进的多模态语言模型进行基准测试。此外,我们在文本分割上进行微调,得到mmclima-70b-txt,一个领域适应的基线模型,在文本问答上优于强大的开源和闭源模型。我们发布数据集、评估流程、微调模型权重和数据创建框架,以支持气候科学的标准多模态评估。

英文摘要

Climate change research increasingly requires AI systems that reason across text, dynamic visual content, and scientific figures, yet existing climate QA benchmarks are small, mostly textual, and cover a narrow range of models. We introduce MMClima, a large-scale multimodal climate question answering framework with 104k+ expert-validated question-answer pairs spanning articles, video transcriptions, and figures across five core climate science domains. MMClima is constructed via automated claim extraction and QA synthesis with human-in-the-loop validation to ensure both scale and reliability. Using MMClima, we benchmark state-of-the-art multimodal language models on tasks requiring factual recall, visual interpretation, and cross-modal synthesis. We additionally fine-tune on the textual split to produce mmclima-70b-txt, a domain-adapted baseline that outperforms strong open- and closed-source models on textual QA. We release the dataset, evaluation pipeline, fine-tuned model weights, and data creation framework to support standardized multimodal evaluation for climate science.

2606.10213 2026-06-10 cs.SD cs.AI 交叉投稿

Automated Pronunciation Evaluation for Korean Toddler Speech using Speech Diarization and Self-Supervised Learning

基于说话人日志和自监督学习的韩语幼儿语音自动发音评估

Diane Myung-kyung Woodbridge, Jee Hyun Suh

AI总结 提出结合神经说话人日志与自监督学习的端到端韩语幼儿发音评估流水线,引入53名2-5岁儿童录音语料库,通过多模型集成实现辅音和元音分类平衡准确率0.782。

Comments This paper will be presented at IEEE ICTs4ehealth in June, 2026

详情
AI中文摘要

言语障碍约占韩国儿科沟通障碍病例的44%,然而针对韩语幼儿语音的自动评估工具仍不成熟。本文提出一种端到端的韩语幼儿语音自动发音评估流水线,结合神经说话人日志与自监督语音表示学习。我们引入了一个经IRB批准的新语料库,包含53名2-5岁韩语儿童的录音。其中53名受试者的子集由三位独立评审员标注,得到1,190个辅音和748个元音的词汇级二元正确性标签。我们评估了三种说话人日志模型,发现NeMo SortFormer凭借其到达时间排序的Transformer架构,实现了88.69%的说话人计数准确率和33.04%的日志错误率(DER),该架构处理了表现出aegyo的年轻女性看护者与幼儿语音之间的声学混淆。对于发音评分,我们比较了三种自监督学习(SSL)骨干网络在多种池化策略下的表现。跨模型集成将辅音预测路由到HuBERT-large,元音预测路由到WavLM-large,实现了0.720和0.845的平衡准确率,平均值为0.782。

英文摘要

Speech sound disorders affect approximately 44% of Korean pediatric communication disorder cases, yet automated assessment tools for Korean toddler speech remain underdeveloped. This paper presents an end-to-end pipeline for automated pronunciation evaluation of Korean toddler speech, combining neural speaker diarization with self-supervised speech representation learning. We introduce a novel IRB-approved corpus of 53 recordings from Korean-speaking children aged 2-5 years. A subset of 53 subjects was annotated by three independent reviewers, yielding 1,190 consonant and 748 vowel word-level binary correctness labels. We evaluate three diarization models, finding that NeMo SortFormer achieves 88.69% speaker count accuracy and 33.04% diarization error rate (DER) owing to its arrival-time-sorted transformer architecture, which handles the acoustic confound between young female caregivers exhibiting aegyo and toddler speech. For pronunciation scoring, we compare three self-supervised learning (SSL) backbones across multiple pooling strategies. A cross-model ensemble routing consonant prediction to HuBERT-large and vowel prediction to WavLM-large achieves balanced accuracies of 0.720 and 0.845, with a mean of 0.782.

2606.10246 2026-06-10 cs.SD cs.AI cs.LG 交叉投稿

Linguistically Augmented Audio Speech Data (LinguAS)

语言增强音频语音数据 (LinguAS)

Ashley R. Keaton, Zahra Khanjani, Christine Mallinson, Vandana P. Janeja

发表机构 * University of Maryland, Baltimore County(马里兰大学巴尔的摩分校)

AI总结 提出LinguAS数据集,通过专家定义的语言特征(EDLFs)增强音频数据,显著提升深度伪造语音检测模型性能。

详情
AI中文摘要

恶意创建的伪造语音,包括深度伪造和欺骗音频,正以惊人速度扩散,检测模型竞相保持领先。然而,大多数检测模型仅基于帧级音频特征进行推理,未利用更大时间尺度上的有价值语言线索。为弥补这一空白,我们提出语言增强音频语音数据(LinguAS),这是一个包含真实和深度伪造音频样本的数据集,标注了五种策略性选择的、专家定义的语言特征(EDLFs),这些特征在英语口语中频繁出现且是自然人类语音的特征。LinguAS包含超过800个音频样本,每个样本都标注了EDLFs。数据集包含四种欺骗音频攻击类型的平衡数量以及相应数量的真实语音样本。我们还包含说话者性别和每个欺骗音频样本的生成器/来源元数据,为模型训练提供更细粒度信息。我们发现,使用EDLFs增强数据训练的模型性能显著超过ASVspoof 2021深度学习基线和HuBert、XLSR等SSL模型。LinguAS增强的语言、性别和生成器元数据为音频深度伪造研究者提供了一个强调真实人类语言特征的数据集,以改进伪造语音的模型推理。数据和代码已公开。

英文摘要

Maliciously-created fake speech, including deepfaked and spoofed audio, is proliferating at an alarming rate, and detection models are racing to stay ahead of the curve. Yet, most detection models are trained to make inference on frame-level audio features alone without leveraging valuable linguistic cues at larger timescales. To address this gap, we present Linguistically Augmented Audio Speech Data (LinguAS), a dataset of genuine and deepfaked audio samples annotated with five strategically-chosen, Expert-Defined Linguistic Features (EDLFs) that occur frequently in spoken English and are characteristic of natural human speech. LinguAS contains over 800 audio samples, each of which are annotated with EDLFs. The dataset has a balanced number of four spoofed audio attack types and a proportionate number of genuine speech samples. We also include metadata on speaker gender and the generator/source for each spoofed audio sample, offering more granularity for model training. We found that models trained on data augmented with EDLFs had improved model performance significantly beyond the ASVspoof 2021 deep learning baselines and SSL models like HuBert and XLSR. LinguAS's augmented linguistic, gender, and generator metadata provide audio deepfake researchers with a dataset that emphasizes real human language traits to improve model inference of faked speech. Data and code are publicly available.

2606.10315 2026-06-10 cs.CL cs.AI 交叉投稿

Catching One in Five: LLM-as-Judge Blind Spots in Production Multi-Turn Transaction Agents

捕捉五分之一:LLM作为评判员在生产环境多轮交易代理中的盲点

Sawyer Zhang, Alexander Wang, Sophie Lei

发表机构 * Lumivate (Lumi)(Lumivate(Lumi))

AI总结 研究部署的餐饮订购代理中LLM评判员对真实缺陷的召回率,发现其仅捕获22%的系统性问题,主要因评分标准缺乏状态跟踪等行为维度,且路由机制导致缺陷被错误分类。

Comments 13 pages, 1 figure, 5 tables

详情
AI中文摘要

LLM作为评判员是评估对话代理的默认工具,但其可靠性几乎总是报告为与人类评分的一致性,而非真实缺陷的召回率。我们研究了一个已部署的多轮餐饮订购代理,并通过详尽的人工转录审查作为基准,衡量其内置LLM评判员捕获了多少真实质量问题。在三个批次中,评判员发现的系统性问题远低于人类确认的四分之一——在一个批次中,9种模式中只有2种(22%),而在另一个批次中,其操作门控标记了100轮中的0轮,而人类确认了23个不同缺陷和7个新的跨轮模式。我们的盲点分类表明,失败是有结构的,而非随机的:评判员能捕获轮次局部问题(虚构统计数据、错误语言),但遗漏了跨轮状态问题(确认门锁死、购物车幻觉、升级锁死、过时引用)。机制在于:评分标准仅暴露三个粗略轴(意图、品牌声音、个性化),且没有针对行为维度(状态跟踪、护栏、恢复)的类别,而大多数缺陷集中于此。失败在于路由而非感知:114轮中,113轮原始评判员注释描述了确认门或购物车状态缺陷,但被评分为“品牌声音”,且无一到达操作失败——门控连接到挂起和硬断言,而非评分标准——因此0%是路由和接线失败,而非失明。对流行率估计的影响是显著的:当表观缺陷率为零时,Rogan-Gladen校正退化——无信号可恢复真实率——而当门控报告非零率时,相同估计器在我们测量的灵敏度下暗示3-6倍的低估。对于生产环境多轮代理,自动评判是回归底线,而非人工审查的替代品。

英文摘要

LLM-as-judge is the default instrument for evaluating conversational agents, yet its reliability is almost always reported as agreement with human ratings, not recall of real defects. We study a deployed multi-turn food-and-beverage ordering agent and measure how many genuine quality problems its built-in LLM judge catches, using exhaustive human transcript review as ground truth. Across three batches the judge surfaces well under a quarter of human-confirmed systematic problems -- 2 of 9 patterns (22%) in one batch, and its operational gate flagged zero of 100 rounds in a batch where humans confirmed 23 distinct defects and 7 new cross-cutting patterns. Our blind-spot taxonomy shows the failure is structured, not random: the judge catches turn-local issues (a fabricated statistic, a wrong language) but misses cross-turn state issues (confirm-gate lockout, cart hallucination, escalation lockout, stale referents). The mechanism: the scoring rubric exposes only three coarse axes (intent, brand-voice, personalization) and has no category for the behavioural dimensions -- state-tracking, guardrails, recovery -- where most defects cluster. The failure is routing, not perception: 113 of 114 rounds whose raw judge note describes a confirm-gate or cart-state defect are scored "brand voice", and none reach an operational failure -- the gate is wired to hangs and hard assertions, not the rubric -- so the 0% is a routing-and-wiring failure, not blindness. The consequence for prevalence estimation is sharp: when the apparent defect rate is zero the Rogan-Gladen correction degenerates -- no signal can recover the true rate -- while where the gate reports a nonzero rate the same estimator implies a 3-6x undercount under our measured sensitivity. For production multi-turn agents, automated judging is a regression floor, not a substitute for human review.

2606.10329 2026-06-10 cs.CV cs.AI 交叉投稿

Building Change Detection in Earthquake: A Multi-Scale Interaction Network and A Change Detection Dataset

地震中的建筑变化检测:一种多尺度交互网络和一个变化检测数据集

Yunlong Liu, Zekai Zhang

发表机构 * School of Control Science and Engineering, Shandong University(控制科学与工程学院,山东大学)

AI总结 针对地震后短期成像间隔导致的变化检测难题,构建了土耳其地震变化检测数据集(TUE-CD),并提出多尺度特征交互网络(MSI-Net),通过联合交叉注意力和多尺度偏移校准模块,有效缓解侧视问题,提升变化检测精度。

详情
AI中文摘要

作为最具破坏性的自然灾害之一,近年来地震袭击了世界许多国家,造成了严重的经济损失。变化检测(CD)可应用于震后损伤评估,因为它能从多时相遥感图像中推断出被破坏的变化区域。此外,短成像间隔的变化检测将更好地满足地震后紧急救援的需求。然而,由于缺乏短成像间隔的数据集,当前基于深度神经网络的方法的能力受到限制。为了满足灾后即时救援的需求,我们创建了一个变化检测数据集——土耳其地震变化检测数据集(TUE-CD),用于评估地震后短期内的建筑损坏情况。由于后事件图像的采集间隔短,不同时相图像的成像角度不同,导致了一些侧视问题。为了应对这些挑战,我们提出了一种多尺度特征交互网络(MSI-Net),用于双时相特征之间的高效交互,并减轻侧视问题的影响。具体来说,所提出的MSI-Net由联合交叉注意力(JCA)模块、多尺度偏移校准(MOC)模块和特征集成(FeI)模块组成。JCA模块统一了通道交叉注意力和空间联合注意力,以实现充分的特征交互。MOC模块进一步估计偏移量,以将双时相图像与多尺度特征对齐。最后,通过FeI模块融合校准后的特征和多尺度特征,用于变化区域的预测。在WHU-CD、CLCD和构建的TUE-CD数据集上的实验表明,所提出的MSI-Net比考虑的最先进的变化检测方法提供了更好的结果。

英文摘要

As one of the most destructive natural disasters, earthquakes have struck many countries around the world in recent years, causing serious economic losses. Change detection (CD) can be applied to post-earthquake damage assessment as it can infer destroyed change regions from multi-temporal remote sensing images. Furthermore, the CD with short imaging interval will better satisfy the needs of the emergency rescues after earthquakes. However, the capability of current methods built on deep neural networks is limited because the dataset with short imaging interval is absent. To meet post-disaster immediate relief, we create a CD dataset, Turkey earthquake CD dataset (TUE-CD), for the evaluation of building damage in the short term after an earthquake. Because of the short acquisition interval of the post-event images, the imaging angle is different for different temporal images, which leads to some side-looking problems. To deal with these challenges, we present a multi-scale feature interaction network (MSI-Net) for efficient interaction between bi-temporal features, as well as mitigating the effect of side-looking problems. Specifically, the proposed MSI-Net consists of joint cross-attention (JCA) modules, multi-scale offset calibration (MOC) modules, and feature integration (FeI) modules. The JCA module unifies channel cross-attention and spatial joint attention for sufficient feature interaction. The MOC module further estimates the offsets to align the bi-temporal image with the multi-scale features. Finally, calibrated features and multi-scale features are fused by FeI modules for the prediction of changed areas. Experiments on the WHU-CD, CLCD, and the constructed TUE-CD dataset indicate that the proposed MSI-Net provides better results than considered state-of-the-art CD methods.

2606.10366 2026-06-10 cs.RO cs.AI 交叉投稿

A Practical Recipe Towards Improving Sim-and-Real Correlation for VLA Evaluation

提升VLA评估中仿真与真实相关性的实用指南

Shuo Wang, Hanyuan Xu, Yingdong Hu, Fanqi Lin, Yang Gao

发表机构 * Tsinghua University(清华大学) Shanghai Qi Zhi Institute(上海期智研究院)

AI总结 本文系统研究仿真与真实环境在VLA策略评估中的相关性,提出统一框架来测量和提升仿真作为真实评估代理的有效性。

Comments 20 pages

详情
AI中文摘要

仿真已成为评估和改进视觉-语言-动作(VLA)策略的重要工具,为昂贵的真实机器人评估提供了可扩展、可重复且可控的替代方案。最近的仿真基准在真实感和多样性方面取得了实质性进展,但这些平台尚未被广泛用作可靠的真实策略评估代理。在这项工作中,我们通过仿真与真实相关性的视角研究这一问题。我们在多个仿真平台、VLA策略、任务和扰动因素上进行了系统研究,测量模拟评估在策略排名一致性、性能相关性和扰动方面失败模式上是否保留真实结论。这一分析使我们能够表征现有模拟器的局限性,并确定哪种模拟信号更符合真实部署。我们进一步研究了用户应如何利用仿真进行策略改进,包括何时基于模拟器的微调是有益的,以及后训练数据量如何影响仿真与真实的对齐。总体而言,我们的工作提供了一个统一的框架,用于测量、解释和提升仿真对VLA策略的有用性,为模拟器设计者和在策略开发流程中使用仿真的实践者提供指导。

英文摘要

Simulation has become an essential tool for evaluating and improving vision-language-action (VLA) policies, offering scalable, reproducible, and controllable alternatives to costly real-world robot evaluation. Recent simulation benchmarks have made substantial progress on realism and diversity, yet these platforms have not been widely adopted as reliable proxies for real-world policy evaluation. In this work, we investigate this issue through the lens of sim-and-real correlation. We conduct a systematic study across multiple simulation platforms, VLA policies, tasks, and perturbation factors, measuring whether simulated evaluation preserves real-world conclusions in terms of policy ranking consistency, performance correlation, and perturbation-wise failure patterns. This analysis allows us to characterize the limitations of existing simulators and identify what kinds of simulation signals are more aligned with real-world deployment. We further examine how users should exploit simulation for policy improvement, including when simulator-based finetuning is beneficial and how the amount of post-training data affects sim-and-real alignment. Overall, our work provides a unified framework for measuring, interpreting, and improving the usefulness of simulation for VLA policies, offering guidance both for simulator designers and for practitioners who use simulation as part of the policy development pipeline.

2606.10380 2026-06-10 cs.CL cs.AI 交叉投稿

Expert-Level Crisis Detection in Mental Health Conversations

心理健康对话中的专家级危机检测

Grace Byun, Abigail Lott, Rebecca Lipschutz, Sean T. Minton, Elizabeth A. Stinson, Jinho D. Choi

发表机构 * Department of Computer Science, Emory University(埃默里大学计算机科学系) Department of Psychiatry and Behavioral Sciences, Emory University(埃默里大学精神病学与行为科学系)

AI总结 提出CRADLE-Dialogue基准数据集和Alert-Confirm评估协议,用于对话中危机检测,发现模型在识别风险出现时机上表现较差,并发布合成训练语料和32B参数模型。

详情
AI中文摘要

现实世界的危机干预本质上是对话式的,然而现有研究主要关注静态文本。当应用于多轮对话时,当前模型表现出显著的性能下降,难以追踪随着上下文演变而出现的风险信号。为了解决这一差距,我们引入了CRADLE-Dialogue,这是一个由临床医生标注的基准数据集,用于对话环境中的回合级危机检测。该数据集包含600个对话,具有跨临床基础风险的多标签注释,包括自杀意念、自残和儿童虐待,区分过去和当前风险。我们进一步提出了一种Alert-Confirm评估协议,该协议区分早期预警信号(Alert)和特定危机变得明确可识别的回合(Confirm),反映了在风险变得明确之前进行干预的临床需求。实验表明,识别风险何时出现比识别其存在要困难得多:模型的Micro F1仅达到40%中段到60%高段。此外,我们发布了一个合成训练语料库和一个32B参数模型,该模型显著优于现有的开源模型,并在回合级、对话级和仅确认评估设置中与专有模型相比具有竞争力或更优的结果。

英文摘要

Real-world crisis intervention is inherently conversational, yet existing research largely focuses on static texts.Real-world crisis intervention is inherently conversational, yet existing research largely focuses on static texts. When applied to multi-turn dialogues, current models exhibit significant performance degradation, struggling to track risk signals that emerge as context evolves. To address this gap, we introduce CRADLE-Dialogue, a clinician-annotated benchmark for turn-level crisis detection in conversational settings. The dataset features 600 dialogues with multi-label annotations across clinically grounded risks, including suicide ideation, self-harm, and child abuse, distinguishing past from ongoing risk. We further propose an Alert-Confirm evaluation protocol that distinguishes early warning signals (Alert) from turns where a specific crisis becomes explicitly identifiable (Confirm), reflecting the clinical need to intervene before risk becomes explicit. Experiments show that identifying when risk emerges is much harder than recognizing that it exists: models achieve only mid-40% to high-60% Micro F1. Additionally, we release a synthetic training corpus and a 32B-parameter model that substantially outperforms existing open-source models and achieves competitive or superior results against proprietary models across turn-level, dialogue-level, and confirm-only evaluation settings.

2606.10388 2026-06-10 cs.IR cs.AI 交叉投稿

SkillResolve-Bench: Measuring and Resolving Same-Capability Ambiguity in Agent Skill Retrieval

SkillResolve-Bench:衡量和解决智能体技能检索中的同能力歧义

Jiandong Ding

发表机构 * Huawei Technologies Ltd(华为技术有限公司)

AI总结 针对智能体技能库中同一能力族内不同技能的执行风险,提出SkillResolve-Bench基准和SkillResolve方法,通过候选族解析和代表性选择,在保持高召回率的同时将有害技能暴露率降至0。

Comments Preprint

详情
AI中文摘要

智能体技能库正成为可路由的软件资产:检索到的技能可以为智能体提供指令、脚本、资源绑定和执行假设。这使得技能检索不仅仅是广泛的相关性匹配。检索器可以找到正确的能力族,却暴露出错误的同能力代表。我们将这种失败研究为同能力执行风险检索。每个查询将一个有用的技能与一个特定于查询的有风险兄弟技能配对,该兄弟技能共享能力族,但可能导致执行指向过时资源、缺失前提或错误程序。我们引入了SkillResolve-Bench 1.0,这是一个针对该场景的可审计基准,包含661个有用/有风险对、源角色和准入证据、线索/泄漏检查、查询不相交划分,以及一个包含6,660个公共SkillRet候选的7,982个候选池。该基准报告有用性排名以及有害兄弟率(HSR@K),即前K个中暴露有风险兄弟的比例。我们还提供了SkillResolve,一种参考方法,它解析活跃候选族,从易混淆的库负样本和契约配置文件线索中评分查询条件效用,并在最终前K列表之前从每个族中选择一个代表。在已发布族关系下,SkillResolve达到Recall@3 0.766和NDCG@3 0.699,同时保持HSR@3=0。与SkillRouter相比,Recall@3提升0.112,NDCG@3提升0.165,同时将HSR@3从0.693降至0。如果没有代表性选择,在相同评分器下HSR@3升至0.236,这表明族内代表性选择是将能力检索转化为更安全过程暴露的机制。

英文摘要

Agent skill libraries are becoming routable software assets: a retrieved skill can contribute instructions, scripts, resource bindings, and execution assumptions to an agent. This makes skill retrieval more than broad relevance matching. A retriever can find the right capability family yet expose the wrong same-capability representative. We study this failure as same-capability execution-risk retrieval. Each query pairs a helpful skill with a query-specific risky sibling that shares the capability family but can lead execution toward a stale resource, missing precondition, or wrong procedure. We introduce SkillResolve-Bench 1.0, an auditable benchmark for this setting with 661 helpful/risky pairs, source-role and admission evidence, cue/leakage checks, query-disjoint splits, and a 7,982-candidate pool that includes 6,660 public SkillRet candidates. The benchmark reports helpful ranking together with harmful sibling rate (HSR@K), the top-K exposure of the risky sibling. We also provide SkillResolve, a reference method that resolves active candidate families, scores query-conditioned utility from confusable library negatives and contract-profile cues, and selects one representative from each family before the final top-K list. Under the released family relation, SkillResolve reaches Recall@3 0.766 and NDCG@3 0.699 while keeping HSR@3=0. It improves over SkillRouter by 0.112 Recall@3 and 0.165 NDCG@3 while reducing HSR@3 from 0.693 to 0. Without representative selection, HSR@3 rises to 0.236 under the same scorer, identifying within-family representative choice as the mechanism that turns capability retrieval into safer procedural exposure.

2606.10460 2026-06-10 cs.CL cs.AI 交叉投稿

LakeQA: An Exploratory QA Benchmark over a Million-Scale Data Lake

LakeQA:百万级数据湖上的探索性问答基准

Haonan Wang, Jiaxiang Liu, Yurong Liu, Austin Senna Wijaya, Tianle Zhou, Eden Wu, Yijia Chen, Wanting You, Reya Vir, Daniela Pinto, Grace Fan, Yusen Zhang, Juliana Freire, Eugene Wu

发表机构 * Columbia University(哥伦比亚大学) New York University(纽约大学) Barnard College(巴纳德学院)

AI总结 提出LakeQA基准,要求LLM在9.5TB异构数据湖中搜索并多跳推理,GPT-5.2仅达18.37%精确匹配,挑战性强。

详情
AI中文摘要

近期的大语言模型(LLM)在基于阅读的问答(QA)方面取得了快速进展,其中证据被明确提供或可以轻松检索。相比之下,现实世界的问题通常不与准确的证据文档配对。有用的证据存在于海量数据湖中,使得搜索成为回答的前提。然而,目前缺乏要求在大型数据湖上进行搜索和推理的综合基准。为此,我们引入了LakeQA,一个针对数据湖上以搜索为中心的问答的综合基准,同时强调搜索和推理能力。LakeQA建立在来自维基百科和开源政府数据的大约9.5 TB文本资源的异构集合上,涵盖结构化和非结构化数据。为确保任务质量,每个样本至少由一名博士级专家标注。每个任务需要长期的多跳推理,包含隐式的中间步骤:智能体需要发现正确的文档,然后跨来源组合证据以产生答案。在七个前沿LLM上的实验结果表明,LakeQA具有挑战性。例如,GPT-5.2在LakeQA上仅达到18.37%的精确匹配分数。总体而言,LakeQA为开发能够在现代数据湖中查找和分析数据的LLM智能体提供了一个现实的测试平台。

英文摘要

Recent large language models (LLMs) have shown rapid progress in reading-based question answering (QA), where evidence is explicitly provided or can be trivially retrieved. In contrast, real-world questions are often not paired with accurate evidence documents. The useful evidence resides in massive data lakes, making search a prerequisite for answering. However, there is a lack of comprehensive benchmarks that require both searching and reasoning over large data lakes. To this end, we introduce LakeQA, a comprehensive benchmark for search-centric question answering over data lakes that jointly emphasizes searching and reasoning capabilities. LakeQA is built on a heterogeneous collection of approximately 9.5 TB of text resources from Wikipedia and open-source government data, spanning structured and unstructured data. To ensure task quality, each sample is annotated by at least one Ph.D.-level expert. Each task requires long-horizon multi-hop reasoning with implicit intermediate steps: agents need to discover the correct documents and then compose evidence across sources to produce the answer. Experimental results on seven frontier LLMs demonstrate that LakeQA is challenging. For instance, GPT-5.2 achieves only an exact-match score of 18.37% on LakeQA. Overall, LakeQA provides a realistic testbed for developing LLM agents that can both find and analyze data in modern data lakes.

2606.10554 2026-06-10 cs.CL cs.AI 交叉投稿

Benchmarking Knowledge Editing using Logical Rules

使用逻辑规则对知识编辑进行基准测试

Tatiana Moteu Ngoli, NDah Jean Kouagou, Hamada M. Zahera, Axel-Cyrille Ngonga Ngomo

发表机构 * Data Science Group, Heinz Nixdorf Institute, Paderborn University(帕德博恩大学海因茨·尼克斯多夫研究所数据科学组)

AI总结 提出基于逻辑规则的基准,评估知识编辑方法对单次编辑逻辑后果的处理能力,发现现有方法在蕴含知识上性能下降高达24%。

Comments Accepted at the 24th International Semantic Web Conference 2025

详情
Journal ref
The Semantic Web. ISWC 2025. ISWC 2025. Lecture Notes in Computer Science, vol 16141. Springer, Cham
AI中文摘要

大型语言模型(LLMs)越来越多地部署在需要访问最新知识的实际应用中。然而,重新训练LLMs计算成本高昂。因此,知识编辑技术对于维护预训练模型中的当前信息和纠正错误断言至关重要。当前的知识编辑基准主要关注回忆编辑过的事实,往往忽略其逻辑后果。为解决这一局限,我们引入了一个新基准,旨在评估知识编辑方法如何处理单次事实编辑的逻辑后果。我们的基准从知识图谱中提取与给定编辑相关的逻辑规则,然后基于这些规则生成多跳问题,以评估对逻辑后果的影响。我们的发现表明,虽然现有的知识编辑方法能够准确地将直接断言插入LLMs,但它们经常无法注入蕴含的知识。具体来说,使用ROME和FT等流行方法的实验显示,在直接编辑的知识和蕴含知识的评估之间存在高达24%的性能差距。这凸显了在知识编辑中需要语义感知的评估框架。

英文摘要

Large Language Models (LLMs) are increasingly deployed in real-world applications that require access to up-to-date knowledge. However, retraining LLMs is computationally expensive. Therefore, knowledge editing techniques are crucial for maintaining current information and correcting erroneous assertions within pre-trained models. Current benchmarks for knowledge editing primarily focus on recalling edited facts, often neglecting their logical consequences. To address this limitation, we introduce a new benchmark designed to evaluate how knowledge editing methods handle the logical consequences of a single fact edit. Our benchmark extracts relevant logical rules from a knowledge graph for a given edit. Then, it generates multi-hop questions based on these rules to assess the impact on logical consequences. Our findings indicate that while existing knowledge editing approaches can accurately insert direct assertions into LLMs, they frequently fail to inject entailed knowledge. Specifically, experiments with popular methods like ROME and FT reveal a substantial performance gap, up to 24%, between evaluations on directly edited knowledge and on entailed knowledge. This highlights the critical need for semantics-aware evaluation frameworks in knowledge editing.

2606.10595 2026-06-10 cs.CR cs.AI 交叉投稿

From Data Heterogeneity to Convergence: A Data-Centric Review of Federated Learning

从数据异质性到收敛:联邦学习的数据中心综述

Huong Nguyen, Mickaël Bettinelli, Amirhossein Ghaffari, Alexandre Benoit, Hong-Tri Nguyen, Susanna Pirttikangas, Lauri Lovén

发表机构 * Oulu University(奥卢大学) University of Southern Brittany(南 Brittany 大学) Aalto University(Aalto 大学)

AI总结 本文从数据视角系统分析联邦学习中数据异质性对收敛的影响,提出可测量特征分类、连接实验分割与真实现象、评估数据相关脆弱性与防御对收敛的影响,为设计可预测收敛的系统提供指导。

详情
AI中文摘要

联邦学习(FL)已成为集中式学习中数据饥饿问题的有前途解决方案。这种范式使得多个客户端能够在隐私保护下协作训练共享任务模型,而无需暴露其本地数据。虽然数据是任何学习系统中的关键组成部分,但它也是漏洞和挑战的主要来源,并且是稳定且良好收敛训练的主要决定因素。现有的FL综述描述了通用基础、安全实践、机遇、挑战和应用,但没有深入探讨数据的各个方面以及从数据角度考虑问题。它们很少提供一种数据视角的综合,将具体的数据属性、分割协议和防御与收敛速度和稳定性联系起来。本综述通过三个进展填补了这一空白。首先,我们将非独立同分布(non-IID)分析为可测量的特征,并根据其对收敛的影响将其排序为强、中、轻,解释每种影响背后的机制,并调和图像、文本和图上的证据。其次,我们将实验分割实践与它们模拟的真实现象联系起来,揭示它们引入的伪影,并展示这些伪影如何影响目标精度。第三,我们分析了数据相关的脆弱性及其提出的防御如何影响收敛,报告在干净和对抗条件下的性能,使收敛-鲁棒性权衡明确。据我们所知,这是第一个提供对支配FL的数据相关挑战的完整理解的综述。针对每个问题提炼出清晰的要点,我们的工作可作为可操作的指南,帮助从业者设计具有可预测收敛和稳定性的系统。

英文摘要

Federated Learning (FL) has emerged as a promising solution for data hunger in centralized learning. This paradigm enables privacy with multiple clients to train a shared-task model collaboratively without exposing their local data. While being a key component in any learning system, data is also a primary source of vulnerabilities and challenges, and a major determinant of a stable and well-converged training. Existing FL reviews describe general foundations, security practices, opportunities, challenges, and applications, without delving into diverse aspects of data and considering problems from the data perspective. They rarely provide a data-lens synthesis that links concrete data properties, split protocols, and defenses to convergence speed and stability. This survey fills that gap with three advances. First, we analyze non-IID into measurable traits and rank their influence on convergence as strong, medium, or light, explaining the mechanisms behind each and reconciling evidence across images, texts, and graphs. Second, we connect experimental splitting practices to the real phenomena they emulate, expose the artifacts they introduce, and show how those artifacts affect target accuracy. Third, we analyze how data-related vulnerabilities and their proposed defenses affect convergence, reporting performance under clean and adversarial conditions to make the convergence-robustness trade-off explicit. To our knowledge, this is the first survey to provide a complete understanding of data-related challenges that govern FL. With clear takeaways distilled for each concern, our work serves as actionable guidance, helping practitioners design their system with predictable convergence and stability.

2606.10620 2026-06-10 cs.CV cs.AI 交叉投稿

Can Image Models Imagine Time? ImageTime: A Novel Benchmark for Probing Visual World Modeling Through Spatiotemporal Consistency

图像模型能想象时间吗?ImageTime:通过时空一致性探究视觉世界建模的新基准

Xinrui Wu, Lichen Huang

发表机构 * University of Science and Technology of China(中国科学技术大学)

AI总结 提出ImageTime基准,通过四关键帧协议(初始状态、动作开始、过渡状态、最终状态)评估图像生成模型在时空一致性上的表现,揭示模型在维持连贯视觉世界状态方面的能力与不足。

详情
AI中文摘要

图像生成模型现在能够生成高质量的静态图像,但它们表示视觉世界随时间变化的能力仍然知之甚少。实际工作流程如故事板、逐步插图、参考引导编辑和视频预可视化要求模型在多个视觉状态之间保持身份、对象、空间关系和因果顺序。现有评估主要衡量单图像正确性、组合对齐或视频质量,而未明确图像模型是否能连贯地想象一个时间有序的过程。我们引入ImageTime,一个诊断基准,使用时空一致性作为图像生成中视觉世界建模的行为探针。给定一个动作指令,以及可选地指定初始状态的参考图像,模型必须生成一张包含四个有序关键状态的图像:初始状态、动作开始、过渡状态和最终状态。这个四关键帧协议比单图像生成在时间上要求更高,同时避免了密集视频动态的混淆。ImageTime通过渐进能力层次组织任务,并将每个场景分解为阶段状态谓词、跨帧时间约束和禁止的因果违规。GPT-5.5在结构化的VLM-as-judge协议下对所有生成的图像进行评分,产生可解释的能力分数、诊断子分数和失败标签。通过多家族基准测试,ImageTime揭示了当前图像生成系统在要求随时间维持连贯视觉世界状态时成功、失败和漂移的地方。

英文摘要

Image generation models now produce high-quality static images, yet their ability to represent how a visual world changes over time remains poorly understood. Practical workflows such as storyboarding, step-by-step illustration, reference-guided editing, and video previsualization require models to preserve identities, objects, spatial relations, and causal order across multiple visual states. Existing evaluations largely measure single-image correctness, compositional alignment, or video quality, leaving open whether an image model can coherently imagine a temporally ordered process. We introduce ImageTime, a diagnostic benchmark that uses spatiotemporal consistency as a behavioral probe of visual world modeling in image generation. Given an action instruction, and optionally a reference image specifying the initial state, a model must generate one image containing four ordered key states: initial state, action onset, transition state, and final state. This four-keyframe protocol is more temporally demanding than single-image generation while avoiding the confounds of dense video dynamics. ImageTime organizes tasks with a progressive capability hierarchy and decomposes each scenario into stage-wise state predicates, cross-frame temporal constraints, and forbidden causal violations. GPT-5.5 scores all generated images under a structured VLM-as-judge protocol, producing interpretable capability scores, diagnostic subscores, and failure labels. Through multi-family benchmarking, ImageTime reveals where current image generation systems succeed, fail, and drift when asked to maintain coherent visual world states over time.

2606.10706 2026-06-10 cs.LG cs.AI 交叉投稿

Unifying Data, Memory, and Compute Efficiency in LLM training: A Survey

统一LLM训练中的数据、内存和计算效率:一项综述

Vanessa Schmidt, Huy Hoang Nguyen, Cédric Jung, Shirin Salehi, Anke Schmeink

发表机构 * Chair of Information Theory and Data Analytics (INDA), RWTH Aachen University(亚琛工业大学信息理论与数据分析教席) AIT Austrian Institute of Technology GmbH(奥地利技术研究所) Automation and Control Institute, Technische Universität Wien (TUW)(维也纳工业大学自动化与控制研究所)

AI总结 本文从资源约束视角综述大语言模型训练中的数据效率、内存效率和计算预算感知三大瓶颈,强调三者需联合优化而非孤立处理。

Comments Accpeted for publication in IEEE Transactions on Artificial Intelligence (TAI)

详情
AI中文摘要

资源约束日益决定了大语言模型(LLM)中可以训练、微调和部署的内容,然而效率通常通过孤立的技术而非作为相互作用的限制系统来研究。本综述采用以约束为中心的视角,围绕三个耦合的瓶颈组织近期进展:数据效率(训练什么)、内存效率(如何适应训练)和计算预算感知(何时何地消耗FLOPs)。在数据轴上,我们回顾了最大化每个token学习量的选择和剪枝方法,从基于学习动态的可扩展代理信号到基于梯度和影响的评分,以及难度感知和课程式策略。我们强调新兴证据表明,不同的“好数据”概念在不同机制中占主导地位,这意味着最优子集取决于任务目标和资源预算,而非普遍适用。在系统方面,我们表明GPU内存(而非原始计算)通常是微调中的主要瓶颈,有效的扩展需要联合减少权重存储、优化器状态和激活内存,而不是孤立地优化任何单一组件。超越内存,我们将训练和推理视为计算主导的过程,其中优化、数据选择和解码必须明确考虑有限的FLOP预算。我们回顾了计算最优分配和停止规则的证据,其中一旦边际性能增益低于预算依赖的阈值,计算应停止或重新分配。总之,这些结果将计算感知的数据选择、缩放定律和自适应推理统一在资源条件决策的共同原则下。

英文摘要

Resource constraints increasingly determine what can be trained, fine-tuned, and deployed in large language models (LLMs), yet efficiency is often studied through isolated techniques rather than as an interacting system of limits. This survey adopts a constraint-centric perspective and organizes recent progress around three coupled bottlenecks: data efficiency (what to train on), memory efficiency (how to fit training), and compute budget awareness (when and where to spend FLOPs). On the data axis, we review selection and pruning methods that maximize learning per token, ranging from scalable proxy signals based on learning dynamics to gradient- and influence-based scoring, as well as difficulty-aware and curriculum-style strategies. We highlight emerging evidence that different notions of good data dominate in different regimes, implying that optimal subsets depend on the task objective and resource budget rather than being universal. On the systems side, we show that GPU memory, not raw compute, is often the dominant bottleneck in fine-tuning, and that effective scaling requires jointly reducing weight storage, optimizer states, and activation memory rather than optimizing any single component in isolation. Beyond memory, we frame training and inference as compute-governed processes in which optimization, data selection, and decoding must explicitly account for finite FLOP budgets. We review evidence for compute-optimal allocation and stopping rules, where computation should be halted or reallocated once marginal performance gains fall below a budget-dependent threshold. Together, these results unify compute-aware data selection, scaling laws, and adaptive inference under a common principle of resource-conditioned decision-making.

2606.10803 2026-06-10 cs.CL cs.AI cs.CV 交叉投稿

Beyond APIs: Probing the Limits of MLLMs in Physical Tool Use

超越API:探索多模态大语言模型在物理工具使用中的极限

Zhixin Ma, Yutong Zhou, Yongqi Li, Chong-Wah Ngo, Wenjie Li

发表机构 * Singapore Management University(新加坡管理大学) The Hong Kong Polytechnic University(香港理工大学)

AI总结 提出PhysTool-Bench基准,评估多模态大语言模型在真实场景中识别物理工具并规划使用的能力,发现最强模型仅完成21%任务,揭示感知与规划双重缺陷。

详情
AI中文摘要

多模态大语言模型(MLLMs)在利用数字API方面表现出色,并日益成为具身AI的“大脑”,指导机器人与物理世界交互。在这种具身环境中,核心能力之一是使用物理工具,这支撑着MLLMs在现实任务中协助人类的能力。尽管重要性显著,MLLMs在物理工具使用方面的熟练程度仍 largely unexplored。为填补这一空白,我们引入了PhysTool-Bench,这是首个评估MLLMs理解真实场景、识别物理工具并规划其使用能力的物理工具使用基准。PhysTool-Bench包含2,510个查询,覆盖2,678个真实世界物理工具,涉及制造、电气工程、农业和医疗等多个领域。具体而言,模型沿两个主要维度进行评估:1)识别场景中所有存在的物理工具,2)根据指令和视觉上下文规划工具选择和使用顺序。在13个领先的MLLMs中,即使最强的模型(Gemini-3.1-Pro)也只能识别场景中58.7%的工具,并仅完成21.0%的端到端查询。我们的分析揭示了两个层面的缺陷:MLLMs难以在真实场景中感知工具,而规划阶段更大的下降进一步表明缺乏将感知到的工具映射到任务语义的功能常识,这指出了发展实用具身AI的关键瓶颈。

英文摘要

Multimodal Large Language Models (MLLMs) excel at utilizing digital APIs and increasingly serve as the "brain" of embodied AI, instructing robots to interact with the physical world. In such embodied settings, a central capability is the use of physical tools, which underpins MLLMs' ability to assist humans in real-world tasks. Despite the importance, MLLMs' proficiency in physical tool use remains largely unexplored. To address this gap, we introduce PhysTool-Bench, the first physical tool-use benchmark designed to evaluate MLLMs' ability to comprehend real-world scenarios, identify physical tools, and plan their use. PhysTool-Bench comprises 2,510 queries over 2,678 real-world physical tools spanning diverse domains, including manufacturing, electrical work, agriculture, and healthcare. Concretely, models are evaluated along two primary dimensions: 1) recognizing all physical tools present in the scene, and 2) planning the tool selection and use sequence based on the instruction and visual context. Across 13 leading MLLMs, even the strongest model (Gemini-3.1-Pro) identifies only 58.7% of tools in a scene and completes merely 21.0% of queries end-to-end. Our analysis reveals a two-level deficit: MLLMs struggle to perceive tools in realistic scenes, and the much larger drop at the planning stage further indicates a lack of functional commonsense for mapping perceived tools onto task semantics, pinpointing a critical bottleneck for the development of practical embodied AI.

2606.10852 2026-06-10 cs.CL cs.AI 交叉投稿

Janus: A Benchmark for Goal-Conditioned Information Distortion in LLMs

Janus: 大语言模型中目标导向信息扭曲的基准测试

Polydoros Giannouris, Mohsinul Kabir, Sophia Ananiadou

发表机构 * The University of Manchester(曼彻斯特大学) Archimedes/Athena RC(阿基米德/雅典研究中心)

AI总结 提出JANUS基准,通过固定事实池对比中性/目标导向条件,测量LLM在事实输出中的选择性扭曲,揭示模型缺乏防误导通信的鲁棒性。

详情
AI中文摘要

LLM的欺骗通常通过直接标记如捏造声明、明确谎言或策略性隐瞒来评估。然而,许多现实中的误导性沟通并不依赖于虚假陈述,而是源于对真实事实的选择性处理:省略不利证据、软化不利细节、强调有利细节或用模糊语言替代精确限定。现有基准大多忽略了这种更微妙且可能更危险的失败模式。我们引入JANUS,一个用于测量基于事实的LLM输出中目标导向语用扭曲的基准。我们基准中的每个场景提供固定的一组有利和不利事实,并比较中性条件与目标导向条件(例如,尽管可能对直接受影响的个人或群体造成伤害,仍要增加采用率、注册率、批准率或支持率)。由于所有输出都被限制使用相同的事实池,JANUS将误导性总体印象与幻觉和捏造分离开来。JANUS包含跨8个领域的160个场景,每个场景配有中性和目标导向提示以及标注的事实材料。跨12个LLM的大量实验揭示了一致的目标导向扭曲,表明当前模型仍然对激励和框架目标敏感,并且缺乏针对选择性误导沟通的鲁棒防护。我们公开发布语料库和代码以供未来研究。

英文摘要

LLM deception is often evaluated through direct markers such as fabricated claims, explicit lies, or strategic concealment. However, many real-world misleading communications do not depend on false statements, rather, they arise from selective treatment of true material facts: omitting adverse evidence, softening unfavorable details, emphasizing favorable details, or replacing precise qualifications with vague language. Existing benchmarks largely miss this subtler and arguably more dangerous failure mode. We introduce JANUS, a benchmark for measuring goal-conditioned pragmatic distortion in fact-grounded LLM outputs. Each scenario in our benchmark provides a fixed pool of favorable and adverse facts and compares a neutral condition against a goal-directed condition, such as increasing adoption, enrollment, approval, or support, despite potential harm to directly affected individuals or groups. Because all outputs are constrained to use the same fact pool, JANUS isolates misleading net impressions from hallucination and fabrication. JANUS contains 160 scenarios across 8 domains, with each scenario paired with neutral and goal-conditioned prompts and annotated material facts. Extensive experiments across 12 LLMs reveal consistent goal-conditioned distortions, demonstrating that current models remain sensitive to incentive and framing objectives and lack robust safeguards against selectively misleading communication. We publicly release our corpus and code for future research.

2606.10911 2026-06-10 cs.SD cs.AI cs.CR cs.LG 交叉投稿

Ethical and Technical Limits of Deepfake Speech Datasets

深度伪造语音数据集的伦理与技术限制

Vojtěch Staněk, Eva Trnovská, Kamil Malinka, Anton Firc

发表机构 * Brno University of Technology(布尔诺理工大学)

AI总结 通过审计39个深度伪造语音数据集,发现公平性评估因缺乏人口统计元数据而不可行,且数据集间真实语音源语料库重叠严重,影响跨数据集评估的可靠性。

Comments Accepted to Interspeech 2026

详情
AI中文摘要

关于深度伪造语音检测器的鲁棒性和公平性的声明,其可信度仅与用于训练和评估这些系统的数据集相当。我们对深度伪造语音领域进行了数据集级别的审计。我们整理并分析了39个深度伪造语音数据集,检查了关键属性,包括可访问性、文档、人口统计和语言覆盖范围、数据集规模以及底层的真实语音来源。我们的审计揭示了两个重要的发现。首先,公平性评估在很大程度上不可行,因为大多数数据集缺乏人口统计元数据,只有少数包含性别或语言标签。这阻止了任何有意义的子组分析,并使得其他人口统计属性未被处理。其次,我们识别出不同数据集之间底层的真实语音源语料库存在大量重叠,这可能破坏跨数据集评估,并导致对泛化能力的夸大声称。

英文摘要

Claims about the robustness and fairness of deepfake speech detectors are only as credible as the datasets used to train and evaluate those systems. We present a dataset-level audit of the deepfake speech landscape. We compile and analyze 39 deepfake speech datasets, examining key attributes including accessibility, documentation, demographic and language coverage, dataset scale, and the underlying bona fide speech sources. Our audit reveals two important takeaways. Firstly, fairness assessment is largely infeasible because most datasets lack demographic metadata, and only a few contain gender or language labels. This prevents any meaningful subgroup analysis and leaves other demographic attributes unaddressed. Secondly, we identify substantial overlap in underlying bona fide source corpora across datasets, which can undermine cross-dataset evaluation and lead to overstated generalization claims.

2606.11070 2026-06-10 cs.CL cs.AI 交叉投稿

T1-Bench: Benchmarking Multi-Scenario Agents in Real-World Domains

T1-Bench:真实世界领域中的多场景智能体基准测试

Genta Indra Winata, Amartya Chakraborty, Yuzhen Lin, Swasthi P Rao, Shikhhar Siingh, Houhan Lu, Nadia Bathaee, Sriharsha Hatwar, Paresh Dashore, Anmol Jain, Kshitij Tayal, Xiuzhu Lin, Anirban Das, Sambit Sahu, Shi-Xiong Zhang

发表机构 * Capital One(第一资本)

AI总结 提出T1-Bench,一个高保真、全面的基准,用于评估多领域真实客户场景中的智能体系统,通过交织的多轮交互任务提升复杂性和评估严谨性。

Comments Preprint

详情
AI中文摘要

近期大型语言模型(LLMs)在推理和工具调用能力方面的进步使得智能体系统越来越强大。然而,现有基准在任务复杂性、真实性和领域多样性方面仍然有限,并且往往无法捕捉跨多个领域的交互,限制了它们在需要持续推理和协调的现实多步骤设置中评估智能体的能力。为解决这些限制,我们引入了T1-Bench,一个高保真、全面的基准,用于评估真实客户面向的多领域环境中的智能体系统,具有交织的场景,需要在多轮用户-助手交互中进行结构化推理,并在25个不同难度的领域中显著增加了组合复杂性和评估严谨性。我们使用12个专有和开放权重模型评估T1-Bench,提供了一个可重复和标准化的框架,用于评估复杂多步骤环境中的智能体行为、工具利用和对话质量。我们进一步用人类判断补充自动评估,以加强对定性性能的评估。总体而言,T1-Bench通过增加任务复杂性、交互深度和模拟多领域环境中的领域覆盖,显著推进了先前的基准。为促进智能体系统的未来研究,我们将公开数据及评估代码作为开源资源。

英文摘要

Recent advances in reasoning and tool-calling capabilities of large language models (LLMs) have enabled increasingly capable agentic systems. However, existing benchmarks remain limited in task complexity, realism, and domain diversity, and often fail to capture interactions that span multiple domains, limiting their ability to evaluate agents in realistic multi-step settings that require sustained reasoning and coordination. To address these limitations, we introduce T1-Bench, a high-fidelity, comprehensive benchmark for evaluating agentic systems in realistic customer-facing, multi-domain environments, featuring interleaved scenarios that require structured reasoning across multi-turn user-assistant interactions and substantially increasing both compositional complexity and evaluative rigor across 25 domains of varying difficulty. We evaluate T1-Bench using 12 proprietary and open-weight models, providing a reproducible and standardized framework for assessing agent behavior, tool utilization, and conversational quality in complex, multi-step environments. We further complement automatic evaluation with human judgments to strengthen the assessment of qualitative performance. Overall, T1-Bench substantially advances prior benchmarks by increasing task complexity, interaction depth, and domain coverage in simulated multi-domain environments. To facilitate future research on agentic systems, we will publicly release data and evaluation code as open source.

2606.11105 2026-06-10 cs.CL cs.AI 交叉投稿

PhantomBench: Benchmarking the Non-existential Threat of Language Models

PhantomBench: 对语言模型非存在性威胁的基准测试

Haeji Jung, Hila Gonen

发表机构 * University of British Columbia(不列颠哥伦比亚大学) Canada CIFAR AI Chair, Amii(加拿大CIFAR人工智能主席,阿米研究所)

AI总结 提出PhantomBench,首个大规模非存在概念基准,包含6万多个虚构实体,评估21个模型,发现平均幻觉率高达86.7%,前沿模型也难以避免。

详情
AI中文摘要

幻觉,即语言模型生成事实无依据的响应,会带来严重风险,因为用户倾向于盲目依赖它们。在高风险领域,这种模型行为的后果可能导致重大伤害。尽管在理解幻觉方面取得了显著进展,但这些模型如何可靠地识别其知识边界仍不清楚。我们引入了PhantomBench,这是首个此类大规模基准,包含来自不同领域真实概念的6万多个不存在的术语和实体。使用我们的基准,我们评估了各种类型和大小的共21个模型。我们展示了令人震惊的幻觉率(在某些情况下平均高达86.7%),并注意到即使是前沿模型也令人惊讶地无法在不存在的概念上弃权,特别是当输入预设它们存在时。然后,我们展示了PhantomBench可以作为研究模型在罕见概念上行为的代理,这些概念更容易产生幻觉。我们还提供了一个构建PhantomBench的流程,使得能够根据研究人员和实践者的特定需求可扩展地生成不存在的概念。

英文摘要

Hallucinations, where language models (LMs) generate factually ungrounded responses, pose serious risks, as users tend to blindly rely on them. This is particularly concerning in high-stakes domains, where consequences of such model behavior can lead to significant harms. Despite notable progress in understanding hallucinations, it remains unclear how reliably these models can recognize the limits of their knowledge. We introduce PhantomBench, the first large-scale benchmark of its kind, comprising more than 60K non-existent terms and entities derived from real concepts across diverse domains. Using our benchmark, we evaluate a total of 21 models of various types and sizes. We show staggering hallucination rates across the board (with average rates as high as 86.7% in some cases), and note that even frontier models surprisingly fail to abstain on non-existent concepts, especially when the input presumes their existence. We then show that PhantomBench can serve as a proxy for studying model behavior on rare concepts for which models are more prone to hallucinate. We also provide a pipeline to construct PhantomBench, enabling scalable generation of non-existent concepts tailored to the specific needs of researchers and practitioners.

2606.11127 2026-06-10 cs.CL cs.AI 交叉投稿

Provenance-Grounded Gating and Adaptive Recovery in Synthetic Post-Training Data Curation

基于来源的门控与自适应恢复在合成后训练数据筛选中的应用

Soham Bhattacharjee, Karun Sharma, Vinay Kumar Sankarapu, Pratinav Seth

发表机构 * Lexsi Labs

AI总结 研究合成后训练数据筛选中的来源证据门控与样本自适应恢复,提出结合故障诊断与定向再生成的自适应恢复流水线,提高产量、恢复率和注入召回率。

详情
AI中文摘要

合成后训练流水线通常使用奖励模型或整体LLM评判器对生成的样本进行过滤,但两个实践很少被一起检验:过滤信号是否基于引发每个生成的来源证据,以及被拒绝的样本是否可以系统性地恢复而非永久丢弃。我们通过对抗性注入语料库提供真实故障标签,在门控配置、恢复策略和生成器规模上对这两个问题进行了受控研究。我们发现,精确的来源出处改善了更强评判器的忠实度门控;幻觉门控和奖励门控拒绝的样本群体大多不重叠,因此两者都是必要的;结合故障诊断与定向再生成的自适应恢复流水线比简单重采样实现了更高的产量、恢复率和注入召回率。下游微调质量主要由生成器规模驱动,过滤和恢复条件虽有重要贡献但处于次要地位。

英文摘要

Synthetic post-training pipelines commonly filter generated samples with reward models or holistic LLM judges, yet two practices remain rarely examined together: whether the filtering signal is grounded in the source evidence that induced each generation, and whether rejected samples can be systematically recovered rather than permanently discarded. We present a controlled study of both questions across gate configurations, recovery strategies, and generator scales, using adversarially injected corpora to provide ground-truth failure labels. We find that exact source provenance improves faithfulness gating for stronger judges, that hallucination and reward gates reject largely disjoint sample populations making both necessary, and that an adaptive recovery pipeline combining failure diagnosis with targeted regeneration achieves higher yield, recovery rate, and injection recall than naive resampling. Downstream fine-tuning quality is driven primarily by generator scale, with filtration and recovery conditions contributing meaningfully but secondarily.

2606.11166 2026-06-10 stat.OT cs.AI 交叉投稿

Flaws in the LLM Automation Narrative

LLM自动化叙事中的缺陷

George Perrett, Javae Elliott, Jennifer Hill, Marc Scott

发表机构 * New York University(纽约大学)

AI总结 通过编写代码完成数据分析任务的新基准测试,发现前沿LLM在平均性能、方差和错误幅度上均不如人类专家,挑战了LLM达到人类专家水平的说法。

详情
AI中文摘要

大型语言模型(LLM)越来越多地被描述为在知识经济任务上达到人类专家水平。这些说法主要基于LLM在标准化数据集上衡量平均性能的基准测试任务中的表现。许多基准测试任务的主要局限性在于,它们通常基于直接包含在LLM训练数据中的内容来衡量性能,并且经常不评估LLM性能的可靠性或LLM错误的幅度。然而,在高风险情境中,这些品质至关重要。通过一项需要编写计算机代码完成数据分析任务的新型LLM基准测试,我们将前沿LLM的性能与人类专家的提交进行了比较,并明确测量了响应的方差和错误的幅度。我们的研究表明,人类专家在一系列指标上平均表现更好,并且表现出更小的性能变异性。我们的结果提供了证据,表明LLM并非始终如一地达到人类专家的水平,并证明了在LLM基准评估中测量方差和评估错误幅度的重要性。

英文摘要

Large Language Models (LLMs) are increasingly described as performing at the level of human experts on knowledge economy tasks. These claims are primarily based on how LLMs perform on benchmarking tasks that measure average performance across standardized datasets. Primary limitations of many benchmarking tasks are that they often measure performance based on content directly included in LLM training data, and they frequently do not assess the reliability of LLM performance or the magnitude of LLM errors. However, in high stakes contexts, these qualities are critically important. Through a novel LLM benchmarking task that requires writing computer code to complete a data analysis task, we compare the performance of a frontier LLM against submissions from human experts and explicitly measure the variance of responses and the magnitude of errors. Our study reveals that the human experts perform better on average on a range of metrics and demonstrate less variability in performance. Our results provide evidence that LLMs do not consistently perform at the level of human experts and demonstrate the importance of measuring variance and assessing error magnitude in LLM benchmark evaluations.

2506.08134 2026-06-10 cs.AI cs.CY 版本更新

Position: The ML Community Must Build an AI-Augmented Peer-Review Ecosystem

立场:机器学习社区必须构建AI增强的同行评审生态系统

Qiyao Wei, Samuel Holt, Jing Yang, Markus Wulfmeier, Mihaela van der Schaar

发表机构 * University of Amsterdam(阿姆斯特丹大学) University of Cambridge(剑桥大学) ETH Zurich(苏黎世联邦理工学院) University of California, Berkeley(加州大学伯克利分校)

AI总结 针对ML领域稿件激增导致同行评审危机,本文主张将AI辅助评审作为优先研究课题,提出利用大语言模型作为协作工具,增强事实核查、评审指导、作者改进和决策支持,并强调需要更细粒度的评审数据。

Comments 18 pages, 3 figures. Accepted (Oral) at the ICML 2026 Position Paper Track

详情
AI中文摘要

同行评审是机器学习(ML)科学进步的基石,但正面临规模危机。向NeurIPS、ICML和ICLR等顶级ML会议提交的稿件数量呈指数级增长,超过了合格评审者的有限容量,引发了对评审质量、一致性和评审者疲劳的担忧。本文立场认为,AI辅助同行评审必须成为紧急的研究和基础设施优先事项。我们倡导一个全面的AI增强生态系统,利用大语言模型(LLMs)不是替代人类判断,而是作为作者、评审者和领域主席(ACs)的复杂协作者。我们提出了AI在增强事实核查、指导评审者表现、协助作者改进质量以及支持ACs决策中的具体角色。关键的是,我们认为此类系统的开发依赖于获取更细粒度、结构化和符合伦理的同行评审过程数据。我们概述了一个研究议程,包括说明性实验,以开发和验证这些AI助手,并讨论了重大的技术和伦理挑战。我们呼吁ML社区主动构建这个AI辅助的未来,确保科学验证的持续完整性和可扩展性,同时保持高标准的同行评审。

英文摘要

Peer review, the bedrock of scientific advancement in machine learning (ML), is strained by a crisis of scale. Exponential growth in manuscript submissions to premier ML venues such as NeurIPS, ICML, and ICLR is outpacing the finite capacity of qualified reviewers, leading to concerns about review quality, consistency, and reviewer fatigue. This position paper argues that AI-assisted peer review must become an urgent research and infrastructure priority. We advocate for a comprehensive AI-augmented ecosystem, leveraging Large Language Models (LLMs) not as replacements for human judgment, but as sophisticated collaborators for authors, reviewers, and Area Chairs (ACs). We propose specific roles for AI in enhancing factual verification, guiding reviewer performance, assisting authors in quality improvement, and supporting ACs in decision-making. Crucially, we contend that the development of such systems hinges on access to more granular, structured, and ethically-sourced peer review process data. We outline a research agenda, including illustrative experiments, to develop and validate these AI assistants, and discuss significant technical and ethical challenges. We call upon the ML community to proactively build this AI-assisted future, ensuring the continued integrity and scalability of scientific validation, while maintaining high standards of peer review.

2510.04491 2026-06-10 cs.AI cs.CL 版本更新

Impatient Users Confuse AI Agents: High-fidelity Simulations of Human Traits for Testing Agents

不耐烦的用户混淆AI智能体:用于测试智能体的高保真人类特质模拟

Muyu He, Anand Kumar, Tsach Mackey, Meghana Rajeev, James Zou, Nazneen Rajani

发表机构 * University of California, Berkeley(加州大学伯克利分校)

AI总结 提出TraitBasis方法,通过控制用户特质向量(如不耐烦、不连贯)对AI智能体进行压力测试,发现性能下降2%-30%,揭示当前智能体对用户行为变化的脆弱性。

Comments ACL 2026 [Oral]

详情
AI中文摘要

尽管构建对话式AI智能体取得了快速进展,但其鲁棒性在很大程度上仍未得到测试。用户行为的微小变化,例如更加不耐烦、不连贯或怀疑,可能导致智能体性能急剧下降,揭示了当前AI智能体的脆弱性。现有的基准测试未能捕捉到这种脆弱性:智能体在标准评估中可能表现良好,但在更真实和多样化的环境中却显著退化。我们通过引入TraitBasis来填补这一鲁棒性测试空白,这是一种轻量级、模型无关的方法,用于系统地对AI智能体进行压力测试。TraitBasis学习激活空间中的方向,这些方向对应于可引导的用户特质(例如不耐烦或不连贯),可以在推理时进行控制、缩放、组合和应用,无需任何微调或额外数据。使用TraitBasis,我们将τ-Bench扩展到τ-Trait,其中通过受控特质向量改变用户行为。我们观察到在τ-Trait上,前沿模型的平均性能下降2%-30%,突显了当前AI智能体对用户行为变化的鲁棒性不足。这些结果共同强调了鲁棒性测试的关键作用以及TraitBasis作为一种简单、数据高效且可组合工具的前景。通过驱动模拟压力测试和训练循环,TraitBasis为构建在真实人类交互的不可预测动态中保持可靠的AI智能体打开了大门。我们已在四个领域(航空、零售、电信和远程医疗)开源了τ-Trait,以便社区在现实、行为多样化的意图和特质场景下系统地对智能体进行质量保证:此网址。

英文摘要

Despite rapid progress in building conversational AI agents, robustness is still largely untested. Small shifts in user behavior, such as being more impatient, incoherent, or skeptical, can cause sharp drops in agent performance, revealing how brittle current AI agents are. Today's benchmarks fail to capture this fragility: agents may perform well under standard evaluations but degrade spectacularly in more realistic and varied settings. We address this robustness testing gap by introducing TraitBasis, a lightweight, model-agnostic method for systematically stress testing AI agents. TraitBasis learns directions in activation space corresponding to steerable user traits (e.g., impatience or incoherence), which can be controlled, scaled, composed, and applied at inference time without any fine-tuning or extra data. Using TraitBasis, we extend $τ$-Bench to $τ$-Trait, where user behaviors are altered via controlled trait vectors. We observe on average a 2%-30% performance degradation on $τ$-Trait across frontier models, highlighting the lack of robustness of current AI agents to variations in user behavior. Together, these results highlight both the critical role of robustness testing and the promise of TraitBasis as a simple, data-efficient, and compositional tool. By powering simulation-driven stress tests and training loops, TraitBasis opens the door to building AI agents that remain reliable in the unpredictable dynamics of real-world human interactions. We have open-sourced $τ$-Trai across four domains: airline, retail, telecom, and telehealth, so the community can systematically QA their agents under realistic, behaviorally diverse intents and trait scenarios: https://github.com/collinear-ai/tau-trait.

2510.09801 2026-06-10 cs.AI 版本更新

How can we assess human-agent interactions? Case studies in software agent design

如何评估人机交互?软件代理设计案例研究

Valerie Chen, Rohit Malhotra, Xingyao Wang, Juan Michelini, Xuhui Zhou, Aditya Bharat Soni, Hoang H. Tran, Calvin Smith, Ameet Talwalkar, Graham Neubig

发表机构 * University of California, Berkeley(加州大学伯克利分校)

AI总结 提出PULSE框架,通过用户反馈和模型预测结合评估人机交互,在15k用户实验中验证其能减少40%置信区间,并揭示基准测试与真实结果的差异。

Comments ICML 2026

详情
AI中文摘要

虽然基准测试衡量了基于LLM的代理的准确性,但它们大多假设完全自动化,未能代表现实用例的协作性质。在本文中,我们朝着严格评估人机交互迈出了两大步。首先,我们提出了PULSE,一个用于更高效地以人为中心评估代理设计的框架,包括收集用户反馈、训练ML模型预测用户满意度,以及通过结合人类满意度评分与模型生成的伪标签来计算结果。其次,我们在软件工程——LLM代理最高影响、最真实的领域之一——中部署了PULSE,通过一个围绕开源代理OpenHands构建的大规模网络平台。在15k用户中,我们评估了三个代理设计决策如何影响开发者满意度率。我们还展示了PULSE如何能导致关于代理设计的更稳健结论,与标准A/B测试相比,将置信区间减少了40%。最后,我们发现了野外结果与基准性能之间的显著差异(例如,claude-sonnet-4和gpt-5之间的反相关性),强调了基准驱动评估的局限性。我们的框架PULSE为未来评估提供了指导,我们的发现识别了改进软件代理设计的机会。

英文摘要

While benchmarks measure the accuracy of LLM-powered agents, they mostly assume full automation, failing to represent the collaborative nature of real-world use cases. In this paper, we make two major steps towards the rigorous assessment of human-agent interactions. First, we propose PULSE, a framework for more efficient human-centric evaluation of agent designs, which comprises collecting user feedback, training an ML model to predict user satisfaction, and computing results by combining human satisfaction ratings with model-generated pseudo-labels. Second, we deploy PULSE in software engineering -- one of the highest-impact, real-world domains for LLM agents -- via a large-scale web platform built around the open-source agent OpenHands. Across 15k users, we evaluate how three agent design decisions impact developer satisfaction rates. We also show how PULSE can lead to more robust conclusions about agent design, reducing confidence intervals by 40\% compared to a standard A/B test. Finally, we find substantial discrepancies between in-the-wild results with benchmark performance (e.g., the anti-correlation between claude-sonnet-4 and gpt-5), underscoring the limitations of benchmark-driven evaluation. Our framework PULSE provides guidance for future evaluations, and our findings identify opportunities for better software agent designs.

2606.05463 2026-06-10 cs.AI 版本更新

PSEBench: A Controllable and Verifiable Benchmark for Evaluating LLMs in Patient Safety Event Triage

PSEBench: 一个用于评估大语言模型在患者安全事件分类中的可控且可验证的基准

Keqi Han, Ryan Young, Annabel Strauss, Lindsey Hughes, Katharine M. Nesbitt, Nicole Schueler, Che Ngufor, Carl Yang, Yuan Xue, Zhijun Yin

发表机构 * Emory University(埃默里大学) Scale AI Mayo Clinic(梅奥诊所) Vanderbilt University Medical Center(范德比大学医学中心)

AI总结 提出基于政策条款卡的结构化构建方法,通过锚点驱动实例化和闭环验证生成带真实标签的叙事,并创建包含5074个案例的基准PSEBench,评估15个代表性LLM在患者安全事件分类中的能力。

详情
AI中文摘要

患者安全事件分类,即根据特定管辖政策判断临床事件是否需要报告,是一项高风险任务,通常由患者安全专家手动完成。尽管大语言模型(LLM)可能支持这一工作流程,但由于缺乏能够捕捉基于证据的政策推理、针对不完整报告的主动信息寻求以及在不可简化模糊情况下原则性弃权的基准,可靠评估受到限制。我们通过一种基于政策的结构化构建方法来解决这一差距,该方法以条款卡(clause card)为核心,这是一种将监管文本分解为可审计决策规范的结构化表示。结合条款卡与锚点驱动实例化和闭环验证,我们的可扩展流水线生成具有构造性真实标签的叙事,并自然支持生成缺失信息和不确定变体。我们将该方法应用于明尼苏达州29项可报告不良健康事件,创建了PSEBench,一个包含5074个案例的基准,并配备代理评估环境。对15个代表性LLM的评估揭示了一致的能力趋势,展示了基准的实用性,并指出了实现基于LLM的可靠患者安全事件分类的可操作差距。

英文摘要

Patient safety event triage, determining whether a clinical event is reportable under jurisdiction-specific policy, is a high-stakes task typically performed manually by patient safety experts. Although LLMs may support this workflow, reliable evaluation is limited by the lack of benchmarks to capture evidence-grounded policy reasoning, proactive information seeking for incomplete reports, and principled abstention in irreducibly ambiguous cases. We address this gap with a policy-grounded construction methodology centered on the clause card, a structured representation that factorizes regulatory text into auditable decision specifications. Combining clause cards with anchor-driven instantiation and closed-loop verification, our scalable pipeline produces narratives with by-construction ground truth and naturally supports generating missing information and uncertain variants. We instantiate this method on Minnesota's 29 Reportable Adverse Health Events, producing PSEBench, a 5,074-case benchmark with an agentic evaluation environment. Evaluation on 15 representative LLMs reveals consistent capability trends, demonstrates the benchmark's utility, and identifies actionable gaps toward reliable LLM-based patient safety event triage.

2606.09809 2026-06-10 cs.AI 版本更新

Evaluation Cards: An Interpretive Layer for AI Evaluation Reporting

评估卡:AI评估报告的解释层

Avijit Ghosh, Anka Reuel, Jenny Chim, Wm. Matthew Kennedy, Srishti Yadav, Jennifer Mickel, Yanan Long, Andrew Tran, Anastassia Kornilova, Damian Stachura, Kevin Klyman, Felix Friedrich, Jeba Sania, Jan Batzner, Anoop Mishra, Eliya Habba, Yixiong Hao, Nathan Heath, Shalaleh Rismani, Usman Gohar, Andrea Loehr, David Manheim, Ruchira Dhar, Sree Harsha Nelaturu, Aarush Sinha, Leshem Choshen, Drishti Sharma, Ishan Khire, Amit Saha, Subramanyam Sahoo, Michael Hardy, Michael Alexander Riegler, Kabir Manghnani, Michelle Lin, Yanan Jiang, Yilin Huang, Asaf Yehudai, Jessica Ji, Aris Hofmann, Mubashara Akhtar, Max Lamparth, Nuno Moniz, Yacine Jernite, Stella Biderman, Zeerak Talat, Sanmi Koyejo, Mykel Kochenderfer, Irene Solaiman

发表机构 * Hugging Face Stanford University(斯坦福大学) Queen Mary University of London(伦敦玛丽女王大学) University of Copenhagen(哥本哈根大学) Trustible EleutherAI TU Darmstadt(达姆施塔特工业大学) Weizenbaum Institute & Technical University of Munich(魏森鲍姆研究所与慕尼黑工业大学) Harvard University(哈佛大学) The Hebrew University of Jerusalem(耶路撒冷希伯来大学) Iowa State University(爱荷华州立大学) IBM Research(IBM研究院) University of Chicago(芝加哥大学) Independent(独立) Berkeley AI Safety Institute (BASIS)(伯克利人工智能安全研究所) Simula University of Edinburgh(爱丁堡大学) ETH Zurich & ETH AI Center(苏黎世联邦理工学院与ETH AI中心) Oxford Internet Institute(牛津互联网研究所) Amherst College(阿默斯特学院) University of Nebraska(内布拉斯加大学) Syntony Research McGill University(麦吉尔大学) Evals Consensus Israel Institute of Technology(以色列理工学院) IOL.Learn & Zuse Institute Berlin(IOL.Learn与柏林祖泽研究所) Georgia Institute of Technology(佐治亚理工学院) Quebec AI Institute, Université de Montréal(魁北克人工智能研究所,蒙特利尔大学) University of Notre Dame(圣母大学) Georgetown University(乔治城大学) DHBW Stuttgart(斯图加特双元制大学) Massachusetts Institute of Technology(麻省理工学院)

AI总结 针对AI评估报告不一致的问题,提出EvalCards作为统一记录层,通过结构化模式、四种解释信号和监控工具,覆盖5816个模型和635个基准,揭示报告实践中的系统性差距。

详情
AI中文摘要

AI评估结果大规模产生,但在排行榜、模型卡、基准论文和公司博客中的报告不一致。代价是解释性的:读者无法可靠地比较不同来源的结果,识别报告遗漏的内容,或将聚合声明追溯到其基础证据。最近的努力解决了孤立组件,但留下了三个空白:它们只覆盖了评估生命周期的狭窄片段,并且不能组合成单个可解释的记录;它们指定了静态表示,无法区分不同利益相关者对同一证据提出的问题;它们仍然是纸面上的提案,缺乏大规模采用所需的提取基础设施。我们提出EvalCards,一个可操作的报告层,将基准元数据、评估运行数据和模型元数据组合成统一记录。我们(1)从52篇论文和10次利益相关者访谈的结构化审查中推导出报告模式,(2)实现四种解释信号(可重复性、文档完整性、来源和风险、以及分数可比性),通过针对研究和非研究受众校准的读者模式呈现,以及(3)部署一个监控工具,将EvalCards应用于5816个模型、635个基准和101843个结果,揭示当前报告实践中的系统性差距。

英文摘要

AI evaluation results are produced at scale but reported inconsistently across leaderboards, model cards, benchmark papers, and company blogs. The cost is interpretive: readers cannot reliably compare results across sources, identify what a report omits, or trace an aggregate claim to its underlying evidence. Recent efforts address isolated components but leave three gaps: they cover only narrow slices of the evaluation lifecycle and do not compose into a single interpretable record; they specify static representations that do not differentiate the questions different stakeholders bring to the same evidence; and they remain proposals on paper, lacking the extraction infrastructure required for adoption at scale. We present \EvalCards{}, an operational reporting layer that composes benchmark metadata, evaluation run data, and model metadata into a unified record. We (1) derive a reporting schema from a structured review of 52 papers and 10 stakeholder interviews, (2) implement four interpretive signals (reproducibility, documentation completeness, provenance and risk, and score comparability), rendered through reader modes calibrated to research and non-research audiences, and (3) deploy a monitoring tool that applies \EvalCards{} across 5,816 models, 635 benchmarks, and 101,843 results, surfacing systematic gaps in current reporting practice.

2411.02817 2026-06-10 cs.LG cs.AI cs.CV cs.IT math.IT 版本更新

Conditional Vendi Score: Prompt-Aware Diversity Evaluation for Generative AI Models and LLMs

条件 Vendi 分数:生成式 AI 模型和 LLM 的提示感知多样性评估

Mohammad Jalali, Azim Ospanov, Amin Gohari, Farzan Farnia

发表机构 * Department of Computer Science and Engineering, The Chinese University of Hong Kong(计算机科学与工程系,香港中文大学) Department of Information Engineering, The Chinese University of Hong Kong(信息工程系,香港中文大学)

AI总结 针对文本提示引导的生成模型,提出条件 Vendi 和条件 RKE 分数,通过条件熵分离模型自身多样性,并证明收敛性及在多个任务中恢复真实多样性排序。

详情
AI中文摘要

由文本提示引导的生成模型在保真度和提示对齐方面被广泛评估,但其产生输出的能力仍未被充分探索。现有的多样性指标(如基于核矩阵的 von Neumann 和 Rényi 熵的 Vendi 和 RKE)是为无条件模型开发的,无法区分提示引起的变异和模型引起的变异。我们通过引入 \textit{Conditional-Vendi} 和 \textit{Conditional-RKE} 来解决这一差距,这些多样性度量源自正半定矩阵的条件熵。这些分数在提示引导生成中分离出模型引起的多样性,其中 Conditional-RKE 具有 $O(1/\sqrt{n})$ 的收敛速度。对于 Conditional-Vendi,我们引入了一种截断谱近似,产生可扩展且一致的估计。在文本到图像、图像字幕和 LLM 任务上的实验表明,条件分数恢复了真实多样性排序,并且还可以引导扩散模型生成更多样化的样本。代码库可从此 https URL 获取。

英文摘要

Generative models guided by text prompts are widely evaluated for fidelity and prompt alignment, yet their ability to produce outputs remains underexplored. Existing diversity metrics such as Vendi and RKE, which are based on the von Neumann and Rényi entropies of kernel matrices, were developed for unconditional models and cannot distinguish prompt-induced from model-induced variability. We address this gap by introducing \textit{Conditional-Vendi} and \textit{Conditional-RKE}, diversity measures derived from the conditional entropy of positive semidefinite matrices. These scores isolate model-induced diversity in prompt-guided generation, with Conditional-RKE enjoying an $O(1/\sqrt{n})$ convergence rate. For Conditional-Vendi, we introduce a truncated-spectrum approximation that yields scalable and consistent estimates. Experiments on text-to-image, image-captioning, and LLM tasks show that the conditional scores recover ground-truth diversity orderings and can also guide diffusion models toward more diverse samples. The codebase is available at https://github.com/mjalali/conditional-vendi.

2505.11034 2026-06-10 cs.CV cs.AI cs.LG 版本更新

CleanPatrick: A Benchmark for Image Data Cleaning

CleanPatrick: 图像数据清洗基准

Fabian Gröger, Simone Lionetti, Philippe Gottfrois, Alvaro Gonzalez-Jimenez, Ludovic Amruthalingam, Elisabeth Victoria Goessinger, Hanna Lindemann, Marie Bargiela, Marie Hofbauer, Omar Badri, Philipp Tschandl, Arash Koochek, Matthew Groh, Alexander A. Navarini, Marc Pouly

发表机构 * University of Basel(巴塞尔大学) Lucerne University of Applied Sciences and Arts(卢塞恩应用科学大学) University Hospital of Basel(巴塞尔大学医院) Northwestern University(西北大学) Northeast Dermatology Associates(东北皮肤科诊所) Medical University of Vienna(维也纳医科大学) Banner Health(Banner健康系统)

AI总结 提出首个大规模图像数据清洗基准CleanPatrick,基于Fitzpatrick17k皮肤病数据集,收集大量众包标注并采用项目反应理论聚合,将问题检测形式化为排序任务,评估多种方法。

Comments Accepted at Journal of Data-centric Machine Learning Research (DMLR)

详情
AI中文摘要

鲁棒的机器学习依赖于干净的数据,然而当前的图像数据清洗基准依赖于合成噪声或狭窄的人类研究,限制了比较和现实相关性。我们引入CleanPatrick,这是图像领域首个大规模数据清洗基准,基于公开的Fitzpatrick17k皮肤病学数据集构建。我们收集了来自933名医学众包工作者的496,377个二元标注,识别出离题样本(4%)、近似重复(21%)和标签错误(32%),并采用受项目反应理论启发的聚合模型,随后经过专家审查以获得高质量的真实标签。CleanPatrick将问题检测形式化为排序任务,并采用反映真实审计流程的标准排序指标。我们基准测试了经典异常检测器、感知哈希、SSIM、Confident Learning、NoiseRank、FINE、BHN和SelfClean。在CleanPatrick上,自监督表示在近似重复检测方面表现出色,经典方法在受限审查预算下实现了有竞争力的离题检测,而在保守的人类判断下检测不合理标签对于细粒度医学分类仍然具有挑战性。通过发布数据集和评估框架,CleanPatrick使得图像清洗策略的系统比较成为可能。

英文摘要

Robust machine learning depends on clean data, yet current image data cleaning benchmarks rely on synthetic noise or narrow human studies, limiting comparison and real-world relevance. We introduce CleanPatrick, the first large-scale benchmark for data cleaning in the image domain, built upon the publicly available Fitzpatrick17k dermatology dataset. We collect 496,377 binary annotations from 933 medical crowd workers, identify off-topic samples (4%), near-duplicates (21%), and label errors (32%), and employ an aggregation model inspired by item-response theory followed by expert review to derive high-quality ground truth. CleanPatrick formalizes issue detection as a ranking task and employs standard ranking metrics that mirror real audit workflows. We benchmark classical anomaly detectors, perceptual hashing, SSIM, Confident Learning, NoiseRank, FINE, BHN, and SelfClean. On CleanPatrick, self-supervised representations excel at near-duplicate detection, classical methods achieve competitive off-topic detection under constrained review budgets, and detecting implausible labels under conservative human judgment remains challenging for fine-grained medical classification. By releasing both the dataset and the evaluation framework, CleanPatrick enables a systematic comparison of image-cleaning strategies.

2512.11995 2026-06-10 cs.CV cs.AI cs.LG 版本更新

V-REX: Benchmarking Exploratory Visual Reasoning via Chain-of-Questions

V-REX: 通过问题链进行探索性视觉推理的基准测试

Chenrui Fan, Yijun Liang, Shweta Bhardwaj, Kwesi Cobbina, Ming Li, Tianyi Zhou

发表机构 * University of Maryland, College Park(马里兰大学学院市分校) Mohamed bin Zayed University of Artificial Intelligence(穆罕默德·本·扎耶德人工智能大学)

AI总结 提出V-REX基准,通过问题链将多步探索推理分解为规划和遵循能力,评估视觉语言模型在复杂开放任务中的表现。

Comments 28 pages

详情
AI中文摘要

尽管许多视觉语言模型(VLM)被开发用于回答定义明确、目标高度具体的简单问题(如大多数基准测试所示),但在实践中,它们通常难以处理复杂的开放式任务,这些任务通常需要在视觉空间中进行多轮探索和推理。这种视觉思维路径不仅像AI侦探一样提供逐步探索和验证,还能对最终答案产生更好的解释。然而,由于中间步骤的探索空间巨大,这些路径难以评估。为弥补这一差距,我们开发了一个评估套件“多步探索视觉推理(V-REX)”,它由一个具有挑战性的视觉推理任务基准和一个评估协议组成。V-REX涵盖了跨不同领域的丰富应用场景。V-REX将多步探索推理转化为问题链(CoQ),并解耦了VLM的能力:(1)规划:通过选择一系列探索性问题来分解开放式任务;(2)遵循:顺序回答精心策划的CoQ以收集信息,从而推导出最终答案。通过每步策划有限的问题和答案选项,V-REX实现了对中间步骤的可靠定量和细粒度分析。通过评估最先进的专有和开源VLM,我们揭示了持续的扩展趋势、规划与遵循能力之间的显著差异,以及多步探索推理中巨大的改进空间。

英文摘要

While many vision-language models (VLMs) are developed to answer well-defined, straightforward questions with highly specified targets, as in most benchmarks, they often struggle in practice with complex open-ended tasks, which usually require multiple rounds of exploration and reasoning in the visual space. Such visual thinking paths not only provide step-by-step exploration and verification as an AI detective but also produce better interpretations of the final answers. However, these paths are challenging to evaluate due to the large exploration space of intermediate steps. To bridge the gap, we develop an evaluation suite, ``Visual Reasoning with multi-step EXploration (V-REX)'', which is composed of a benchmark of challenging visual reasoning tasks requiring native multi-step exploration and an evaluation protocol. V-REX covers rich application scenarios across diverse domains. V-REX casts the multi-step exploratory reasoning into a Chain-of-Questions (CoQ) and disentangles VLMs' capability to (1) Planning: breaking down an open-ended task by selecting a chain of exploratory questions; and (2) Following: answering curated CoQ sequentially to collect information for deriving the final answer. By curating finite options of questions and answers per step, V-REX achieves a reliable quantitative and fine-grained analysis of the intermediate steps. By assessing SOTA proprietary and open-sourced VLMs, we reveal consistent scaling trends, significant differences between planning and following abilities, and substantial room for improvement in multi-step exploratory reasoning.

2602.12424 2026-06-10 cs.CL cs.AI 版本更新

RankLLM: Weighted Ranking of LLMs by Quantifying Question Difficulty

RankLLM: 通过量化问题难度对大型语言模型进行加权排名

Ziqian Zhang, Xingjian Hu, Yue Huang, Kai Zhang, Ruoxi Chen, Yixin Liu, Qingsong Wen, Kaidi Xu, Xiangliang Zhang, Neil Zhenqiang Gong, Lichao Sun

发表机构 * Lehigh University(莱维大学) University of Notre Dame(诺特大学) Zhejiang Wanli University(浙江万里大学) Squirrel Ai Learning City University of Hong Kong(香港城市大学) Duke University(杜克大学)

AI总结 提出RankLLM框架,通过量化问题难度和模型能力实现细粒度评估,在35550个问题上对30个模型进行评测,与人类判断一致性达90%。

Comments 32 pages, 9 figures. Accepted by ICLR 2026

详情
AI中文摘要

基准测试建立了标准化的评估框架,以系统评估大型语言模型(LLM)的性能,促进客观比较并推动该领域的进步。然而,现有基准测试未能区分问题难度,限制了其有效区分模型能力的能力。为解决这一局限,我们提出了RankLLM,一种旨在量化问题难度和模型能力的新框架。RankLLM引入难度作为区分的主要标准,实现了对LLM能力的更细粒度评估。RankLLM的核心机制促进了模型与问题之间的双向分数传播。RankLLM的核心直觉是:当模型正确回答一个问题时,它获得一个能力分数;而当一个问题难倒模型时,其难度分数增加。利用该框架,我们在多个领域的35550个问题上评估了30个模型。RankLLM与人类判断的一致性达到90%,并且始终优于IRT等强基线。它还表现出强大的稳定性、快速收敛和高计算效率,使其成为大规模、难度感知的LLM评估的实用解决方案。

英文摘要

Benchmarks establish a standardized evaluation framework to systematically assess the performance of large language models (LLMs), facilitating objective comparisons and driving advancements in the field. However, existing benchmarks fail to differentiate question difficulty, limiting their ability to effectively distinguish models' capabilities. To address this limitation, we propose RankLLM, a novel framework designed to quantify both question difficulty and model competency. RankLLM introduces difficulty as the primary criterion for differentiation, enabling a more fine-grained evaluation of LLM capabilities. RankLLM's core mechanism facilitates bidirectional score propagation between models and questions. The core intuition of RankLLM is that a model earns a competency score when it correctly answers a question, while a question's difficulty score increases when it challenges a model. Using this framework, we evaluate 30 models on 35,550 questions across multiple domains. RankLLM achieves 90% agreement with human judgments and consistently outperforms strong baselines such as IRT. It also exhibits strong stability, fast convergence, and high computational efficiency, making it a practical solution for large-scale, difficulty-aware LLM evaluation.

2602.23499 2026-06-10 cs.RO cs.AI 版本更新

TaCarla: A comprehensive benchmarking dataset for end-to-end autonomous driving

TaCarla: 端到端自动驾驶的综合基准数据集

Tugrul Gorgulu, Atakan Dag, M. Esat Kalfaoglu, Halil Ibrahim Kuru, Baris Can Cam, Halil Ibrahim Ozturk, Ozsel Kilinc

发表机构 * Tuğrul Gorgülü *†(土耳其巴伊塞蒂大学) Atakan Dağ †(土耳其巴伊塞蒂大学) M. Esat Kalfaoğlu ‡(土耳其巴伊塞蒂大学) Halil İbrahim Kuru †(土耳其巴伊塞蒂大学) Barış Can Cam †(土耳其巴伊塞蒂大学) Halil İbrahim Öztürk †(土耳其巴伊塞蒂大学) Özsel Kılınç §(土耳其巴伊塞蒂大学)

AI总结 针对现有自动驾驶数据集不完整、行为多样性不足及闭环评估缺失等问题,基于CARLA Leaderboard 2.0挑战场景收集超过285万帧的多任务数据集,支持规划、检测、预测及视觉语言动作模型,并提供数值稀有度评分。

Comments Accepted at the Third Workshop on Simulation for Autonomous Driving (SAD), CVPR 2026

详情
AI中文摘要

收集高质量数据集是一项需要细致关注细节的关键任务,因为忽略某些方面可能导致整个数据集无法使用。自动驾驶挑战仍然是一个重要的研究领域,需要进一步探索以提升车辆的感知和规划性能。然而,现有数据集往往不完整。例如,包含感知信息的数据集通常缺乏规划数据,而规划数据集通常由大量驾驶序列组成,其中自车主要向前行驶,行为多样性有限。此外,许多真实数据集难以评估其模型,特别是对于规划任务,因为它们缺乏合适的闭环评估设置。CARLA Leaderboard 2.0挑战提供了多样化的场景来解决自动驾驶中的长尾问题,已成为在开环和闭环评估设置下开发感知和规划模型的有价值替代平台。然而,在该平台上收集的现有数据集存在一定局限性。一些数据集似乎主要针对有限的传感器配置,具有特定的传感器配置。为了支持端到端自动驾驶研究,我们使用CARLA仿真环境为多样化的Leaderboard 2.0挑战场景收集了一个包含超过285万帧的新数据集。我们的数据集不仅设计用于规划任务,还支持动态目标检测、车道分隔线检测、中心线检测、交通灯识别、预测任务和视觉语言动作模型。此外,我们通过使用数据集训练各种模型来展示其多功能性。同时,我们还提供了数值稀有度评分,以理解当前状态在数据集中出现的稀有程度。

英文摘要

Collecting a high-quality dataset is a critical task that demands meticulous attention to detail, as overlooking certain aspects can render the entire dataset unusable. Autonomous driving challenges remain a prominent area of research, requiring further exploration to enhance the perception and planning performance of vehicles. However, existing datasets are often incomplete. For instance, datasets that include perception information generally lack planning data, while planning datasets typically consist of extensive driving sequences where the ego vehicle predominantly drives forward, offering limited behavioral diversity. In addition, many real datasets struggle to evaluate their models, especially for planning tasks, since they lack a proper closed-loop evaluation setup. The CARLA Leaderboard 2.0 challenge, which provides a diverse set of scenarios to address the long-tail problem in autonomous driving, has emerged as a valuable alternative platform for developing perception and planning models in both open-loop and closed-loop evaluation setups. Nevertheless, existing datasets collected on this platform present certain limitations. Some datasets appear to be tailored primarily for limited sensor configuration, with particular sensor configurations. To support end-to-end autonomous driving research, we have collected a new dataset comprising over 2.85 million frames using the CARLA simulation environment for the diverse Leaderboard 2.0 challenge scenarios. Our dataset is designed not only for planning tasks but also supports dynamic object detection, lane divider detection, centerline detection, traffic light recognition, prediction tasks and visual language action models . Furthermore, we demonstrate its versatility by training various models using our dataset. Moreover, we also provide numerical rarity scores to understand how rarely the current state occurs in the dataset.

2603.08924 2026-06-10 stat.AP cs.AI cs.IR 版本更新

Quantifying Uncertainty in AI Visibility: A Statistical Framework for Generative Search Measurement

量化AI可见性的不确定性:生成式搜索测量的统计框架

Ronald Sielinski

发表机构 * IQRush

AI总结 针对AI生成式搜索中可见性测量的随机性问题,提出将引用指标视为样本估计量,通过重复采样和Bootstrap置信区间揭示测量噪声,并给出样本量建议。

Comments 39 pages, 13 figures

详情
AI中文摘要

AI驱动的答案引擎本质上是不确定性的:在不同时间提交相同的查询可能会产生不同的响应并引用不同的来源。尽管存在这种随机行为,当前测量生成式搜索中领域可见性的方法通常依赖于单次运行的引用份额和普遍性的点估计,隐含地将其视为固定值。本文认为,引用可见性指标应被视为底层响应分布的样本估计量,而非固定值。我们通过三个生成式搜索平台——Perplexity Search、OpenAI SearchGPT和Google Gemini——对三个消费品主题进行重复采样,实证研究了引用变异性。采用了两种采样方案:连续九天的每日收集和十分钟间隔的高频采样。我们表明,引用分布遵循幂律形式,并在重复样本间表现出显著变异性。Bootstrap置信区间显示,许多领域间的明显差异落在测量过程的噪声基底内。全分布排名稳定性分析进一步表明,引用排名在样本间不稳定,不仅限于排名靠前的领域,而且在频繁引用的领域集中也是如此。这些发现表明,单次运行的可见性指标提供了对生成式搜索中领域性能的误导性精确描述。我们认为,必须附带不确定性估计报告引用可见性,并为实现可解释置信区间所需的样本量提供实用指导。

英文摘要

AI-powered answer engines are inherently non-deterministic: identical queries submitted at different times can produce different responses and cite different sources. Despite this stochastic behavior, current approaches to measuring domain visibility in generative search typically rely on single-run point estimates of citation share and prevalence, implicitly treating them as fixed values. This paper argues that citation visibility metrics should be treated as sample estimators of an underlying response distribution rather than fixed values. We conduct an empirical study of citation variability across three generative search platforms--Perplexity Search, OpenAI SearchGPT, and Google Gemini--using repeated sampling across three consumer product topics. Two sampling regimes are employed: daily collections over nine days and high-frequency sampling at ten-minute intervals. We show that citation distributions follow a power-law form and exhibit substantial variability across repeated samples. Bootstrap confidence intervals reveal that many apparent differences between domains fall within the noise floor of the measurement process. Distribution-wide rank stability analysis further demonstrates that citation rankings are unstable across samples, not only among top-ranked domains but throughout the frequently cited domain set. These findings demonstrate that single-run visibility metrics provide a misleadingly precise picture of domain performance in generative search. We argue that citation visibility must be reported with uncertainty estimates and provide practical guidance for sample sizes required to achieve interpretable confidence intervals.

2605.27914 2026-06-10 cs.CL cs.AI 版本更新

Does Capability Transfer to Subjective Behavior -- and Would Our Instruments Tell Us? A Self-Evolving, Trust-by-Construction Evaluation Paradigm

让结果说话:LLM行为基准测试的复制优先范式

Yuming, Huang, Yao Liu, Pengjie Ding, Lei Wang, Junchen Wan

发表机构 * Cylingo team(Cylingo团队)

AI总结 提出复制优先范式,通过可靠性、跨仪器复制、历史足迹校准和预注册预测四个正交属性验证LLM行为评估工具,并在情感陪伴任务中测试,发现聚合分数掩盖的模型退化。

详情
AI中文摘要

对LLM行为的主观评估——如共情、克制、校准的情感语气——是困难的。人类评估者之间对这些品质的一致性饱和在rho约0.45附近,仅使用LLM作为评判代理存在循环论证的风险:与目标共享训练群体的评判者无法独立验证。将有效性锚定于单一人类评估者共识并不适用于人类自身存在分歧的能力。我们提出一种复制优先范式:不是锚定于一个评估者群体,而是通过四个正交属性认证工具——跨K次运行的可靠性、跨架构不同评判者的跨仪器复制、通过早期训练群体的评判者进行的历史足迹校准,以及预注册预测。我们在情感陪伴任务上测试该范式,让评分标准在迭代中数据驱动地自我演化:维度不是预先规定的,过程稳定在9维集合。预注册应用于10个可证伪假设和11个前向预测,在收集任何测试数据之前提交。应用于8个家族的49个模型,该范式揭示了聚合分数所隐藏的内容。在建议克制方面——模型是否在共情情境中避免提供未经请求的解决方案——gpt-5比gpt-4.1下降1.87分,Opus-4.7比Opus-4.6下降0.629分,而聚合分数保持平稳。这种退化在三次用户代理替换中幸存(95%的幅度),在5家族评判者堆栈和17个月队列间隔中复制,并在74个保留的真实ESConv对话中持续存在(rho在[0.749, 0.850]之间);工具达到序数Krippendorff alpha=0.91。作为副产品,该范式充当饱和源诊断器,区分工具性天花板(可通过评分标准细化突破)和结构性天花板(需要场景或名单干预)。

英文摘要

Benchmarking is mature where answers are verifiable -- math, code, reasoning -- but the fastest-growing uses of LLMs are subjective and human-facing: companionship, emotional support, counseling. There the default validity test, correlating a metric to human judgment, has no stable anchor: inter-rater agreement is low, structured by annotator identity, barely reproducible, and length-biased. So we cannot answer the question that matters: does capability that scales on objective benchmarks transfer to subjective behavior, and would our instruments even tell us if it did not? We build an instrument for this regime and report what it reveals at the frontier. We contribute, first, a self-evolving instrument that selects and then authors its own behavioral dimensions under a multiplicative anti-gaming fitness, self-halting when it stops improving; second, a trust-by-construction paradigm that earns belief through three certificates established without a human gold standard, where human raters saturate (rho ~ 0.45); and third, the finding it makes visible -- capability transfer is dissociable. Across 49 models, 8 families, and 24 months, subjective behaviors are where objective-benchmark scaling fails to carry over: the sharpest case, advice-restraint (knowing when not to give advice), is the frontier's universal-lowest dimension, and at gpt-4.1->gpt-5 it ran backwards while the aggregate score hid it -- a regression one instruction recovers. Warm restraint is moved by model generation, not by raw scale, MoE width, inference budget, or reasoning mode; the open-weight Pareto frontier matches closed flagships at ~10-80x lower per-call cost; and four judge families replicate the rubric on held-out human ESConv conversations. Data, code, the locked rubric, and judge prompts will be released upon publication.

2606.07936 2026-06-10 cs.CL cs.AI 版本更新

Illusions of the Gold Standard: A Large-scale Analysis of Human Evaluation Protocols for Long-form Text Generation

黄金标准的幻觉:长文本生成中人类评估协议的大规模分析

Katelyn Xiaoying Mei, Yi-Li Hsu, Minjoon Choi, Zongwan Cao, Chenjun Xu, Bingbing Wen, Su Lin Blodgett, Lucy Lu Wang

发表机构 * University of Washington(华盛顿大学) National Tsing Hua University(国立清华大学) Seoul National University(首尔大学) Mila - Québec AI Institute(米拉-魁北克人工智能研究所) Allen Institute for AI(艾伦人工智能研究所)

AI总结 通过分析2023-2025年*CL会议论文中的人类评估协议,发现报告不透明和可重复性差的问题,并提出改进建议。

Comments Accepted to ACL 2026 Main

详情
AI中文摘要

人类评估在评估生成文本质量中起着关键作用。然而,这些评估的可靠性和可重复性取决于透明且记录良好的协议——这些细节在当前实践中经常缺失。在这项工作中,我们对*CL会议出版物(2023-2025年)中评估长文本生成任务的人类评估协议进行了大规模分析,包括对284篇论文的完整人工审查和另外1800多篇论文的LLM辅助分析。我们定义了与人类评估研究可重复性相关的20个可报告标准,并应用这些标准系统地检查了社区内的报告规范和实践。我们发现,人类评估研究设计的重要方面普遍报告不足,导致关于测量了什么、如何测量、谁提供了判断以及如何解释判断的模糊性。基于这些发现,我们概述了可操作的建议,以支持未来研究中更透明和可重复的报告。我们的分析代码和注释数据集可在以下网址找到:https://github.com/larchlab/Illusions-of-the-Gold-Standard

英文摘要

Human evaluation plays a critical role in assessing the quality of generated text. However, the reliability and reproducibility of these evaluations depend on transparent and well-documented protocols -- details that are frequently missing in current practice. In this work, we conduct a large-scale analysis of human evaluation protocols for evaluating long-form generation tasks in *CL conference publications from 2023--2025, including a full manual review of 284 papers and LLM-assisted analysis for another 1.8k+ papers. We define a set of 20 reportable criteria related to reproducibility of human evaluation studies, and apply these criteria to systematically examine reporting norms and practices within the community. We find widespread under-reporting of important aspects of human evaluation study design, leading to ambiguity about what was measured and how, who contributed judgments, and how judgments should be interpreted. Based on these findings, we outline actionable recommendations to support more transparent and reproducible reporting in future research. Our analysis code and annotated dataset can be found at: https://github.com/larchlab/Illusions-of-the-Gold-Standard

2606.08251 2026-06-10 cs.CY cs.AI 版本更新

Contemporary AI lacks the imagination to diverge or negate in science

当代人工智能缺乏在科学中发散或否定的想象力

Honglin Bao, Siyang Wu, Xiao Liu, Sida Li, Shiyun Cao, James A. Evans

发表机构 * Data Science Institute, University of Chicago(芝加哥大学数据科学研究所) Knowledge Lab, University of Chicago(芝加哥大学知识实验室)

AI总结 通过大规模科学家评估,发现当前AI在科学假设生成中缺乏多样性,无法自发提出零假设,且自动评估与专家判断一致性低,但微调奖励模型可缩小差距。

详情
AI中文摘要

关于人工智能将加速科学发现的宏大预测已超越来自在职科学家的证据,该领域仍缺乏大规模、科学家参与的测试。我们进行了迄今最大规模的此类评估,描绘了AI尚不能为科学做什么。我们邀请了121,640篇近期预印本(涵盖生物学、医学、化学和社会科学)的作者,对大型语言模型(LLMs)根据其论文背景和难题生成的后续想法进行评判。6,749名科学家返回了25,139组关于新颖性、实证可行性、真实性概率和采纳倾向的评分。出现了三种模式。第一,非推理LLMs陷入狭窄的“蜂巢思维”,产生相似想法;推理模型探索更宽的假设空间,但没有模型类自发提出零假设——人类更自由地做出这一举动。第二,科学家奖励与自己相似的想法,并更看重概率而非新颖性,尽管社会科学家比生命科学家更容忍风险。资深社会科学家是最严厉的批评者,他们的怀疑是有道理的:LLMs在像社会科学这样的多元领域中最易出错,这些领域需要上下文感知的解释和不断发展的理论。第三,社区目前依赖的自动评估器——LLM作为评委、人工指标,甚至最先进的(SOTA)模型——与专家判断的一致性较弱,检索增强和科学家角色提示仅带来边际收益。我们在人类评分上后训练的Qwen3-14B奖励模型捕捉了领域品味细微差别,比SOTA模型高出27%,并缩小了与独立同行评审员间一致性的差距。尽管有种种炒作,当今的科学AI仍然是一个其想象力、输出和判断需要人类基础的协作者。

英文摘要

Bold projections that artificial intelligence will accelerate scientific discovery have raced ahead of evidence from working scientists, and the field still lacks large-scale, scientist-in-the-loop tests of these claims. Here we mount the largest such evaluation to date and map what AI cannot yet do for science. We invited authors of 121,640 recent preprints across biology, medicine, chemistry, and the social sciences to judge ideas that large language models (LLMs) generated from the context and puzzles of their own papers. 6,749 scientists returned 25,139 sets of ratings on novelty, empirical feasibility, probability of being true, and favorability of adoption. Three patterns emerge. First, non-reasoning LLMs collapse into a narrow "hivemind" of similar ideas; reasoning models roam a wider hypothesis space, yet no model class spontaneously proposes null hypotheses -- a move humans make more freely. Second, scientists reward ideas that resemble their own and prize probability over novelty, though social scientists tolerate risk more readily than life scientists. Senior social scientists are the harshest critics, and their skepticism is well-earned: LLMs falter most in pluralistic fields like the social sciences that demand context-aware interpretation and evolving theories. Third, automated evaluators on which the community currently relies -- LLM-as-a-judge, artificial metrics, and even state-of-the-art (SOTA) models -- agree only weakly with expert judgment, and retrieval augmentation and scientist persona prompting yield only marginal gains. A Qwen3-14B reward model we post-trained on human ratings captures field taste nuances, beats SOTA models by up to 27%, and closes the gap to the inter-rater consistency of independent peer reviewers. For all the hype, today's scientific AI still represents a collaborator whose imagination, outputs and judgment benefit from human grounding.

10. AI应用与系统 65 篇

2606.10044 2026-06-10 cs.AI 新提交

Business World Model

商业世界模型

Cecil Pang, Hiroki Sayama

AI总结 提出商业世界模型(BWM)架构,将世界模型思想应用于商业环境,通过编码状态、动态、约束和目标,支持自主决策与规划。

详情
AI中文摘要

企业越来越多地采用AI驱动的工具来提高生产力、降低成本并增强产品和服务。然而,AI的变革潜力不仅限于自动化预定义任务:它在于使智能系统能够从高层战略目标出发,规划、优化和执行商业计划。本文介绍了商业世界模型(BWM)的概念和架构,这是一种专门针对商业和组织环境的世界模型。受人工智能、认知科学和控制理论中的世界模型启发,BWM编码了商业状态、动态、约束、目标和可行的动作空间,以支持自主决策。我们提出了一种以商业语义为中心的公式,其中商业状态、动态和动作与关键商业实体相关联。在此框架内,智能体可以模拟替代动作序列,估计其对未来商业结果的影响,并在不确定性下评估权衡。所提出的架构将语义数据表示、概率机器学习模型、确定性业务规则和显式动作空间整合为一个用于规划和反事实推理的连贯结构。尽管其各个组成部分并非全新,但BWM的贡献在于将它们组织为用于商业计划的可执行内部模拟器。这项工作为能够从基于指令的执行转向目标驱动的规划和执行的自主商业系统奠定了概念基础。

英文摘要

Businesses are increasingly adopting AI-enabled tools to improve productivity, reduce costs, and enhance products and services. However, the transformative potential of AI extends beyond automating predefined tasks: it lies in enabling intelligent systems to plan, optimize, and execute business initiatives from high-level strategic objectives. This paper introduces the concept and architecture of a business world model (BWM), a world model specialized for business and organizational environments. Inspired by world models in artificial intelligence, cognitive science, and control theory, a BWM encodes business states, dynamics, constraints, objectives, and feasible action space to support autonomous decision-making. We propose a business-semantics-centric formulation in which business states, dynamics and actions are linked to key business entities. Within this framework, agents can simulate alternative action sequences, estimate their effects on future business outcomes, and evaluate trade-offs under uncertainty. The proposed architecture integrates semantic data representations, probabilistic machine learning models, deterministic business rules, and explicit action space into a coherent structure for planning and counterfactual reasoning. Although its individual components are not new, the contribution of BWM lies in organizing them as an executable internal simulator for business initiatives. This work establishes a conceptual foundation for autonomous business systems capable of moving from instruction-based execution toward goal-driven planning and execution.

2606.10279 2026-06-10 cs.AI cs.CL cs.LG 新提交

Supervised Fine-tuning with Synthetic Rationale Data Hurts Real-World Disease Prediction

使用合成理由数据进行监督微调损害真实世界疾病预测

Buxin Su, Bingxuan Li, Cheng Qian, Yiwei Wang, Jin Jin, Bingxin Zhao

发表机构 * University of Pennsylvania(宾夕法尼亚大学) University of Illinois Urbana-Champaign(伊利诺伊大学厄巴纳-香槟分校) University of California, Merced(加州大学默塞德分校)

AI总结 研究发现,在临床预测任务中,使用合成理由数据进行监督微调反而显著降低模型性能,根本原因在于叙事合理性与判别优化之间的结构性冲突。

详情
AI中文摘要

监督微调中使用合成理由数据被广泛认为能通过教导模型不仅预测什么而且预测原因来提升语言模型在临床预测任务上的性能。我们在基于纵向健康史进行五年阿尔茨海默病及相关痴呆症(ADRD)预测的任务上检验了这一假设。通过一项包含504种配置的大规模对照实验,我们发现,与仅使用标签的微调相比,基于理由的SFT始终且显著地损害了预测性能。这种退化在多个模型系列和数据规模中持续存在,并且无法通过使用面向推理的基础模型来解决。关键的是,这种失败并非由理由质量差所致:人类专家注释证实生成的理由在医学上是准确的,并且忠实于患者特定的证据;少样本实验表明,当相同的理由作为推理时的演示而非训练目标使用时,能提升性能。我们确定根本原因在于叙事合理性与判别优化之间的结构性冲突。我们希望我们的工作能为更精确地理解理由监督何时以及如何有帮助、何时无帮助铺平道路,从而指导在高风险临床预测中负责任地开发语言模型。

英文摘要

Supervised fine-tuning with synthetic rationale data is widely assumed to improve language model performance on clinical prediction tasks by teaching models not just what to predict but why. We test this assumption on five-year Alzheimer's disease and related dementias (ADRD) prediction from longitudinal health histories. Across a large-scale controlled experiment of 504 configurations, we find that rationale-based SFT consistently and substantially hurts prediction performance relative to label-only fine-tuning. The degradation persists across model families and data scales, and is not resolved by using a reasoning-oriented base model. Crucially, the failure is not explained by poor rationale quality: human expert annotation confirms that the generated rationales are medically accurate and faithfully grounded in patient-specific evidence, and few-shot experiments show that the same rationales improve performance when used as inference-time demonstrations rather than training targets. We identify the root cause as a structural conflict between narrative plausibility and discriminative optimization. We hope our work paves the path toward a more precise understanding of when and how rationale-based supervision helps and when it does not, guiding the responsible development of language models for high-stakes clinical prediction.

2606.10286 2026-06-10 cs.AI 新提交

Sim2Schedule: A Simulator-Guided LLM Framework for Autonomous Open-Pit Mine Scheduling

Sim2Schedule: 一种模拟器引导的LLM框架用于自主露天矿调度

Mustavi Ibne Masum, Thiago Eustaquio Alves de Oliveira, Mahzabeen Emu

发表机构 * Department of Computer Science, Lakehead University(湖头大学计算机科学系) Quantum Communications and Computing Research Center and Department of Electrical and Computer Engineering, Memorial University of Newfoundland(新斯科舍纪念大学量子通信与计算研究中心及电气与计算机工程系) Department of Electrical and Computer Engineering, Memorial University of Newfoundland(新斯科舍纪念大学电气与计算机工程系)

AI总结 提出模拟器引导的LLM框架,将地质约束编码到动作生成中,零样本生成可解释调度方案,在保持线性计算时间下恢复MILP最优NPV的94%-99%。

详情
AI中文摘要

露天矿调度是在复杂的地质和运营约束下最大化经济回报的关键过程。虽然混合整数线性规划(MILP)提供了数学上的最优基线,但其指数级计算复杂性和无法实时适应限制了其在动态工业环境中的实际部署。本文引入了一种模拟器驱动的大语言模型(LLM)调度框架,其中LLM作为自主决策代理,在每一步由定制模拟器引导,该模拟器将地质优先关系、开采-加工耦合和动态容量约束直接编码到动作生成机制中。该框架在封闭、数据安全的环境中完全零样本运行,无需基于云的推理、领域特定微调或重新训练,即可生成完整、可解释的开采和加工调度。为了提供可信的性能基准,我们开发了一种新的MILP公式,纳入了现实的操作和地质约束。在不同规模和时段的开采实例上进行评估,基于LLM的框架恢复了MILP最优NPV的94%至99%,同时计算时间呈线性增长。这些结果表明,在复杂运营约束下的长期工业调度中,模拟器约束的LLM代理可作为经典优化的实用且可扩展的替代方案。

英文摘要

Open-pit mine scheduling is a critical process for maximizing economic return under complex geotechnical and operational constraints. While Mixed-Integer Linear Programming (MILP) provides mathematically optimal baselines, its exponential computational complexity and inability to adapt in real time limit its practical deployment in dynamic industrial environments. This work introduces a simulator-driven Large Language Model (LLM) scheduling framework in which the LLM acts as an autonomous decision-making agent, guided at each step by a custom simulator that encodes geotechnical precedence, extraction-processing coupling, and dynamic capacity constraints directly into the action generation mechanism. Operating entirely zero-shot within a closed, data-secure environment, the framework produces complete, interpretable extraction and processing schedules without cloud-based inference, domain-specific fine-tuning, or retraining. To provide a trustworthy performance benchmark, a novel MILP formulation is developed that incorporates realistic operational and geotechnical constraints. Evaluated across mining instances of varying scale and time periods, the LLM-based framework recovers between 94\% and 99\% of the MILP optimal NPV while scaling linearly in computation time. These results position simulator-constrained LLM agents as a practical and scalable alternative to classical optimization for long-horizon industrial scheduling under complex operational constraints.

2606.10359 2026-06-10 cs.AI 新提交

ReflectiChain: Epistemic Grounding in LLM-Driven World Models for Supply Chain Resilience

ReflectiChain: 面向供应链韧性的LLM驱动世界模型中的认知基础

Jia Luo

发表机构 * School of Foreign Languages, Huazhong University of Science

AI总结 提出ReflectiChain框架,通过生成式供应链世界模型和双环学习分离认知不确定性与偶然不确定性,在半导体基准上提升推理一致性33.0%,并在对抗冲击下保持82.3%可操作性。

详情
AI中文摘要

供应链中的AI代理面临一个基本的认知鸿沟:大语言模型(LLMs)解释策略但缺乏物理基础,而强化学习(RL)优化流程但对非结构化约束语义上视而不见。我们引入REFLECTICHAIN,通过生成式供应链世界模型(SC-WM)——将异构供应网络编码到具有物理守恒的6维图-潜在空间中——以及双环学习(将认知不确定性(KL信任域约束的策略适应)与偶然不确定性(随机潜在展开)分离)来弥合这一鸿沟。在Semi-Sim(一个具有SIR风险传播、6种扰动类型和10种策略约束模板的10节点半导体基准)上,REFLECTICHAIN将推理一致性得分提高了33.0%(p < 0.0001, d = 2.78),在对抗性冲击下保持了82.3%的可操作性,并表现出反脆弱行为(在适度压力下增益+40.2%)。我们识别了三种操作性的认知机制——不确定性分离、知识边界检测和经验贝叶斯策略更新——并讨论了五个局限性类别。

英文摘要

AI agents in supply chains face a fundamental epistemic gap: large language models (LLMs) interpret policies but lack physical grounding, while reinforcement learning (RL) optimizes flows but is semantically blind to unstructured constraints. We introduce REFLECTICHAIN, bridging this gap through a Generative Supply Chain World Model (SC-WM) - encoding heterogeneous supply networks into a 6-dim graph-latent space with physical conservation - and Double-Loop Learning that separates epistemic uncertainty (KL-trust-region-bounded policy adaptation) from aleatoric uncertainty (stochastic latent rollouts). On Semi-Sim, a 10-node semiconductor benchmark with SIR risk propagation, 6 perturbation types, and 10 policy constraint templates, REFLECTICHAIN improves Rationale Consistency Score by 33.0% (p < 0.0001, d = 2.78), maintains 82.3% operability under adversarial shocks, and exhibits anti-fragile behavior (+40.2% gain under moderate pressure). We identify three operational epistemic mechanisms - uncertainty separation, knowledge-boundary detection, and empirical Bayesian policy updating - and discuss five limitation categories.

2606.10392 2026-06-10 cs.AI 新提交

Instruction Finetuning DeepSeek-R1-8B Model Using LoRA and NEFTune

使用LoRA和NEFTune对DeepSeek-R1-8B模型进行指令微调

Wu Yuerong, Mingni Luo

发表机构 * University of Hong Kong(香港大学) Northeastern University(东北大学)

AI总结 本研究结合LoRA和NEFTune微调DeepSeek-R1-8B模型,用于金融命名实体识别,在七类实体上达到0.912的微F1分数,优于多个基线模型。

详情
AI中文摘要

金融命名实体识别(NER)对于将非结构化的财务报告和新闻转化为结构化知识图谱至关重要。然而,通用大语言模型(LLMs)常常错误分类金融实体或忽略领域特定模式。本文研究了使用DeepSeek-R1-8B(一个最近开源的大语言模型)结合低秩适应(LoRA)和噪声嵌入微调(NEFTune)进行金融NER。我们语料库中的1693个样本中每个带注释的句子都被转换为指令-输入-输出三元组。我们将轻量级LoRA矩阵插入Transformer层,并应用NEFTune通过在训练期间向嵌入向量添加均匀噪声来提高泛化能力。实验表明,LoRA适应的DeepSeek-R1-8B在七种实体类型(公司、日期、地点、货币、人物、产品和数量)上达到了0.901的微F1分数,而添加NEFTune进一步将微F1分数提升至0.912,优于Llama3-8B、Qwen3-8B、Baichuan2-7B、T5和BERT-Base基线。

英文摘要

Financial named-entity recognition (NER) is essential for translating unstructured financial reports and news into structured knowledge graphs. However, general-purpose large language models (LLMs) often misclassify financial entities or ignore domain-specific patterns. This paper investigates the use of DeepSeek-R1-8B, a recent open-source large language model, combined with Low-Rank Adaptation (LoRA) and Noisy Embedding Fine-Tuning (NEFTune) for financial NER. Each annotated sentence in our corpus of 1693 samples is converted into an instruction-input-output triple. We insert lightweight LoRA matrices into the Transformer layers and apply NEFTune to improve generalisation by adding uniform noise to embedding vectors during training. Experiments show that the LoRA-adapted DeepSeek-R1-8B achieves a micro-F1 of 0.901 on seven entity types (Company, Date, Location, Money, Person, Product and Quantity), and adding NEFTune further boosts the micro-F1 to 0.912, outperforming Llama3-8B, Qwen3-8B, Baichuan2-7B, T5 and BERT-Base baselines.

2606.10412 2026-06-10 cs.AI 新提交

A Unified Multi-Modal Framework for Intelligent Financial Systems: Integrating Reinforcement Learning, High-Frequency Trading, and Game-Theoretic Approaches with Cross-Modal Sentiment Analysis

面向智能金融系统的统一多模态框架:整合强化学习、高频交易和博弈论方法与跨模态情感分析

Fanrong Liu, Zhang Yuwei, Mingni Luo

发表机构 * Henan University, International Eurasia College(河南大学,国际欧亚学院) City University of Hong Kong, College of Business(香港城市大学,商学院) Northeastern University, School of Electronic and Information Engineering(东北大学,电子与信息工程学院)

AI总结 提出统一框架整合PPO、高频预测、上下文学习、博弈论和跨模态情感分析,在多个金融任务上平均提升20%以上性能。

详情
AI中文摘要

金融科技的快速发展要求能够同时处理多领域多样化挑战的复杂人工智能系统。本文提出了一个开创性的统一框架,无缝整合了用于机器人顾问系统的近端策略优化、用于高频交易的先进时间序列预测模型、用于动态投资顾问的上下文学习机制、用于竞争性银行场景的博弈论方法以及用于跨模态金融情感分析的统一嵌入。我们的综合框架解决了现有文献中这些技术孤立发展、未能利用其协同潜力的关键空白。通过在多个金融数据集和现实场景中的广泛实验,我们证明了集成方法相比专门的单领域系统实现了更优的性能。具体而言,我们的框架在投资组合优化指标上提升了23.7%,将高频交易的预测误差降低了31.2%,将投资推荐准确率提高了18.9%,通过纳什均衡收敛速度增加27.4%优化了竞争性银行策略,并通过跨模态融合将情感分析准确率提高了15.6%。我们的工作理论基础为集成优化问题建立了收敛保证,而实证结果验证了其在多样化金融机构中的实际适用性。这项研究不仅推进了金融AI的最新水平,还为开发能够适应现代金融市场复杂互联本质的综合智能系统提供了蓝图。

英文摘要

The rapid evolution of financial technology demands sophisticated artificial intelligence systems capable of handling diverse challenges across multiple domains simultaneously. This paper presents a groundbreaking unified framework that seamlessly integrates Proximal Policy Optimization for robo-advisory systems, advanced time-series prediction models for high-frequency trading, in-context learning mechanisms for dynamic investment advisory, game-theoretic approaches for competitive banking scenarios, and unified embeddings for cross-modal financial sentiment analysis. Our comprehensive framework addresses the critical gap in existing literature where these technologies have been developed in isolation, failing to leverage their synergistic potential. Through extensive experimentation across multiple financial datasets and real-world scenarios, we demonstrate that our integrated approach achieves superior performance compared to specialized single-domain systems. Specifically, our framework shows a 23.7% improvement in portfolio optimization metrics, reduces prediction error in high-frequency trading by 31.2%, enhances investment recommendation accuracy by 18.9%, optimizes competitive banking strategies with a 27.4% increase in Nash equilibrium convergence speed, and improves sentiment analysis accuracy by 15.6% through cross-modal fusion. The theoretical foundation of our work establishes convergence guarantees for the integrated optimization problem, while our empirical results validate the practical applicability across diverse financial institutions. This research not only advances the state-of-the-art in financial AI but also provides a blueprint for developing comprehensive intelligent systems that can adapt to the complex, interconnected nature of modern financial markets.

2606.10500 2026-06-10 cs.AI 新提交

A Reliable Fault Diagnosis Method Based on Belief Rule Base Consider Robustness Analysis

一种考虑鲁棒性分析的基于置信规则库的可靠故障诊断方法

Mingyuan Liu, Dan Yin, Zongzong Wu

发表机构 * Central South University(中南大学)

AI总结 针对故障诊断中传感器读数可靠性问题,提出一种基于置信规则库的可靠故障诊断方法,通过鲁棒性分析与优化策略提升模型准确性和鲁棒性,在柴油机和轴承故障诊断中验证有效性。

详情
AI中文摘要

在设备运行中,实施故障诊断对于确保生产设备的连续性和安全性、提高运行效率以及降低维护成本至关重要。由于传感器读数广泛用于故障诊断,其可靠性直接影响故障诊断的结果。针对故障诊断模型的鲁棒性评估和鲁棒性优化两个问题,提出了一种新的故障诊断方法。为此,提出了一种考虑鲁棒性分析的基于置信规则库(BRB)的可靠故障诊断方法。首先,系统地对BRB模型进行鲁棒性分析。其次,提出了三种鲁棒性约束策略来优化BRB故障诊断模型的鲁棒性。最后,以WD615柴油机和凯斯西储大学轴承的故障诊断为例,验证了所提模型的有效性,实验表明所提模型在准确性和鲁棒性上均有提升。

英文摘要

In equipment operation, the implementation of fault diagnosis is essential to ensure the continuity and safety of production equipment, improve operational efficiency and reduce maintenance costs. Since sensor readings are widely used for fault diagnosis, their reliability directly affects the results of fault diagnosis. A new fault diagnosis method is proposed to address the two problems of robustness assessment and robustness optimization of fault diagnosis models. For this purpose, a reliable fault diagnosis method based on a belief rule base (BRB) considering robustness analysis is proposed. Firstly, the robustness analysis of the BRB model is carried out systematically. Secondly, three robustness constraint strategies are proposed to optimize the robustness of the BRB fault diagnosis model. Finally, the effectiveness of the proposed model is verified by taking the fault diagnosis of WD615 diesel engine and Case Western Reserve University bearings as an example, and the experiments show that the proposed model improves both accuracy and robustness.

2606.10752 2026-06-10 cs.AI 新提交

AutoPDE: Reliable Agentic PDE Solving via Explicitly Represented Solver Strategies

AutoPDE: 通过显式表示的求解器策略实现可靠的智能体PDE求解

Huanshuo Dong, Keyao Zhang, Hong Wang, Zhezheng Hao, Zhiwei Zhuang, Ziyan Liu, Jiacong Wang, Gengyuan Liu, Xin Jin

发表机构 * University of Science and Technology of China(中国科学技术大学) Zhejiang University(浙江大学) University of the Chinese Academy of Sciences(中国科学院大学) Tsinghua University(清华大学) Eastern Institute of Technology, Ningbo(宁波东方理工大学)

AI总结 提出AutoPDE,一种将求解器策略作为显式对象维护的代码智能体,通过PDE分析、数值方法选择和自适应调优三阶段构建策略,在PDE Agent Bench上达到54.5%的通过率,比最强基线提升14.2个百分点。

详情
AI中文摘要

偏微分方程(PDE)的数值求解器是科学和工程中的核心计算工具。构建可靠的PDE求解器不仅需要可执行的代码,还需要一个数值求解器策略——一组关于离散化、稳定化、求解器配置和分辨率控制的决策,这些决策需与PDE结构相匹配。最近基于LLM的编码智能体通过生成和调试求解器实现,开始减轻编程负担。然而,它们通常直接从PDE问题跳到求解器代码,将求解器策略隐含在实现细节中。因此,求解失败的反馈被路由回代码编辑,而不是底层策略,导致数值决策在代码生成前难以检查,且在失败时难以利用数值证据进行修改。为解决这一局限,我们提出AutoPDE,一种在整个求解过程中将求解器策略作为显式表示对象维护的代码智能体:一个独立的、可检查的对象,在编写任何代码之前构建,并在求解失败时可根据数值证据进行修订。AutoPDE通过三个阶段构建和维护该对象,所有阶段均利用可重用的PDE求解技能库:PDE分析识别方程类型和代数结构;数值方法选择选择与分析结果匹配的数值方法,并确定离散化、稳定化和线性求解器;自适应调优运行低成本试算以在规定的精度和运行时间预算下校准分辨率和容差。我们在PDE Agent Bench上评估AutoPDE,实验结果表明,AutoPDE的通过率达到54.5%,比最强基线提高了14.2个百分点。

英文摘要

Numerical solvers for partial differential equations (PDEs) are core computational tools in science and engineering. Building reliable PDE solvers requires not only executable code, but a numerical solver strategy, a set of decisions about discretization, stabilization, solver configuration, and resolution control, that matches the PDE structure. Recent LLM-based coding agents have begun to reduce the programming burden by generating and debugging solver implementations. However, they typically move directly from a PDE problem to solver code, leaving the solver strategy implicit in implementation details. Feedback from a failed solve is therefore routed back to code edits rather than to the underlying strategy, so numerical decisions remain hard to check before code is generated and hard to revise using numerical evidence when it fails. To address this limitation, we propose AutoPDE, a code agent that maintains the solver strategy as an explicitly represented object throughout the solving process: an independent, inspectable object that is built before any code is written and can be revised, using numerical evidence, whenever a solve fails. AutoPDE builds and maintains this object in three stages, all drawing from a library of reusable PDE-solving skills: PDE analysis identifies the equation type and algebraic structure; numerical method selection chooses a numerical method that matches the analysis result and commits to a discretization, stabilization, and linear solver accordingly; and adaptive tuning runs low-cost pilot solves to calibrate resolution and tolerances under the prescribed accuracy and runtime budget. We evaluate AutoPDE on the PDE Agent Bench, where experimental results show that AutoPDE achieves a pass rate of $54.5%$, improving over the strongest baseline by $14.2$ percentage points.

2606.10953 2026-06-10 cs.AI cs.CV 新提交

Architect-Ant: Editable Automatic Furnishing of Architectural Floor Plans

Architect-Ant: 可编辑的建筑平面图自动家具布置

Fedor Rodionov, Aleksandar Cvejic, Michael Birsak, John Femiani, Peter Wonka

发表机构 * King Abdullah University of Science and Technology (KAUST)(阿卜杜拉国王科技大学) Miami University(迈阿密大学)

AI总结 提出基于微调视觉语言模型的可编辑自动家具布置框架Architect-Ant,通过领域特定语言编码布局并优化,生成符合建筑约束的合理布局。

Comments 17 pages, 10 figures

详情
AI中文摘要

带家具的平面图是房地产可视化、室内设计和建筑工作流程的基础。然而,由于缺乏带有对象级家具标注的真实专业设计平面图数据集,自动家具布置的进展受到限制。为解决这一差距,我们引入了AntPlan-270,这是一个包含270个建筑平面图的精选数据集,每个房间都有家具边界框标注,涵盖十个住宅房间类别。基于该数据集,我们提出了Architect-Ant,一个由微调视觉语言模型驱动的可编辑自动家具布置框架。家具布局使用一种紧凑的、基于坐标的领域特定语言(DSL)表示,该语言编码对象类别和相对于房间几何形状的位置。为了提高空间推理能力,我们生成了程序化推理轨迹,捕捉建筑约束,如墙壁对齐、门窗间隙、流通、固定装置兼容性和房间特定家具清单,并使用它们来监督模型的微调。然后,我们对候选对象位置应用偏好优化,以进一步提高布局质量。生成的DSL可以栅格化为语义掩码,并用于条件化基于Flux的LoRA渲染器,生成逼真的蓝图风格带家具平面图图像,同时保留可编辑的符号布局。布局布置实验表明,Architect-Ant能生成几何上有效且功能上合理的布局,并为更大的仅结构平面图数据集的家具布置提供了一条可扩展的路径。

英文摘要

Furnished floor plans are fundamental to real estate visualization, interior design, and architectural workflows. However, progress in automatic furniture arrangement has been limited by the lack of real, professionally designed floor-plan datasets with object-level furniture annotations. To address this gap, we introduce AntPlan-270, a curated dataset of 270 architectural floor plans with per-room furniture bounding box annotations across ten residential room categories. Building on this dataset, we present Architect-Ant, an editable automatic furnishing framework powered by a fine-tuned vision-language model. Furniture layouts are represented using a compact, coordinate-based domain-specific language (DSL) that encodes object categories and placements relative to the room geometry. To improve spatial reasoning, we generate procedural reasoning traces that capture architectural constraints such as wall alignment, door and window clearance, circulation, fixture compatibility, and room-specific furniture inventories, and use them to supervise fine-tuning of the model. We then apply preference optimization over candidate object placements to further refine layout quality. The generated DSL can be rasterized into semantic masks and used to condition a Flux-based LoRA renderer, producing realistic blueprint-style furnished floor-plan images while preserving the editable symbolic layout. Experiments on layout furnishing show that Architect-Ant produces geometrically valid and functionally plausible layouts, and suggest a scalable path for furnishing larger structure-only floor-plan datasets.

2606.11015 2026-06-10 cs.AI 新提交

Structure from Reasoning, Numbers from Search: On-Premise Open LLMs as Structural Priors for Coupled MIMO Controller Tuning

从推理中获取结构,从搜索中获取数值:本地部署的开放大语言模型作为耦合MIMO控制器整定的结构先验

Jiaxuan Chen, Haonan Li, Yang Shu

发表机构 * Jinling Institute of Technology(金陵科技学院) College of Water Resources and Civil Engineering, China Agricultural University(中国农业大学水利与土木工程学院) Zhejiang University(浙江大学)

AI总结 针对强耦合MIMO过程,提出利用本地部署的开源大语言模型作为结构先验,通过推理耦合关系提出非对称结构,结合经典优化器实现样本高效且可解释的控制器整定。

Comments 10 pages, 7 figures, 6 tables. Submitted to IEEE Access

详情
AI中文摘要

为强耦合多输入多输出(MIMO)工业过程整定控制器是困难的:分散式经典自整定忽略了回路交互,而自然初始化的局部数值优化会在由此产生的非凸代价景观中停滞。我们探究能否使用本地部署的开源大语言模型(LLMs),它们将数据保留在本地且无需过程模型。在单回路CSTR上,经典继电反馈整定(IAE 0.106,接近最优0.102)优于LLM整定器(0.162):对于简单回路,LLM没有增加价值。而在具有冲突设定点的强耦合四水箱系统上,情况发生逆转,通过惩罚代价J = IAE + lambda*TV(u)评分,该代价奖励无致动器抖动的跟踪。在那里,朴素继电整定(J ~ 28.6)和朴素LLM整定(29.7)不比开环(22.7)更好,且从平衡初始化的局部优化器在10/10次运行中均失败。相反,一个经过支架设计的开放LLM推理耦合关系,提出反直觉的非对称结构,并从任何初始点达到J ~ 16.9 +/- 0.2;用经典优化器精炼后,达到平滑全局最优(J ~ 12.0,10/10 vs. 0/10),该最优甚至应用了分散式整定无法实现的非明显负积分校正。全局优化器(差分进化)也能达到该最优,因此LLM并非唯一途径;其优势在于样本效率和可解释性:在18次评估中即可得到可用控制器(此时全局优化器比开环更差),并附带明确的理由。这一优势随维度增长,在3x3过程上评估次数减少约6倍。该行为在四个开放模型上泛化,而在良性过程中LLM没有优势,从而划清了边界。我们贡献了一个可复现的基准测试,界定了开放LLM在控制整定中何时有帮助:不是作为优化器,而是作为样本高效、可解释的结构先验。

英文摘要

Tuning controllers for strongly coupled multi-input multi-output (MIMO) industrial processes is hard: decentralized classical auto-tuning ignores loop interaction, and local numerical optimization from natural initializations stalls in the resulting non-convex cost landscape. We ask whether on-premise open-source large language models (LLMs), which keep data on-site and need no plant model, can help. On a single-loop CSTR, classical relay-feedback tuning (IAE 0.106, near the 0.102 optimum) beats an LLM tuner (0.162): for simple loops the LLM adds nothing. The picture inverts on a strongly coupled quadruple-tank with conflicting set-points, scored by a penalized cost J = IAE + lambda*TV(u) that rewards tracking without chattering actuators. There, naive relay tuning (J ~ 28.6) and naive LLM tuning (29.7) are no better than open loop (22.7), and a local optimizer from balanced starts fails in 10/10 runs. A scaffolded open LLM instead reasons about the coupling, proposes the counter-intuitive asymmetric structure, and reaches J ~ 16.9 +/- 0.2 from any start; refining it with a classical optimizer attains the smooth global optimum (J ~ 12.0, 10/10 vs. 0/10), which even applies a non-obvious negative integral correction decentralized tuning cannot. A global optimizer (differential evolution) also reaches this optimum, so the LLM is not the only route; its advantage is sample efficiency and interpretability: a usable controller in 18 evaluations (where the global optimizer is worse than open loop) plus a stated rationale. This edge grows with dimension, reaching ~6x fewer evaluations on a 3x3 plant. The behaviour generalizes across four open models, and on a benign plant the LLM offers no advantage, sharpening the boundary. We contribute a reproducible benchmark delimiting when open LLMs help in control tuning: not as optimizers, but as a sample-efficient, interpretable structural prior.

2606.11120 2026-06-10 cs.AI cs.CV 新提交

Monte Carlo Pass Search: Using Trajectory Generation for 3D Counterfactual Pass Evaluation in Football

蒙特卡洛传球搜索:利用轨迹生成进行足球3D反事实传球评估

Andrew Kang, Priya Narasimhan

发表机构 * Carnegie Mellon University(卡内基梅隆大学)

AI总结 提出蒙特卡洛传球搜索(MCPS),结合价值模型、世界模型和反事实策略,基于3D轨迹数据评估足球传球,通过两种执行盈余分数实现分布感知的传球分析。

Comments CVPR 2026, CVSports Workshop

详情
AI中文摘要

我们将足球传球评估重新定义为类似蒙特卡洛树搜索(MCTS)的评估问题,其组成部分大多以不同名称存在于文献中:价值模型(控球价值)、世界模型(带球交互的多智能体轨迹)以及反事实动作策略(带噪声的传球变体采样)。基于德甲联赛首个公开的高保真3D球轨迹跟踪数据集,我们引入了蒙特卡洛传球搜索(MCPS),该方法推断每个观察到的传球的踢球参数,采样执行变体和选项变体,使用球条件世界模型将每个候选向前滚动直到下一次球交互,并通过学习到的价值模型对结果进行评分,以获得所获价值的分布。该分布通过两种互补的执行盈余分数(基于均值和基于百分位的分数)实现分布感知的归因,用于分析和排名。为了使世界模型在有限的公开数据下具有样本效率,我们改编了来自自动驾驶的离散令牌自回归轨迹生成器(SMART),并表明与基线相比,它在最佳20次预测准确性上表现强劲,同时支持完全假设的展开以进行下游评估。我们已发布模型检查点和代码。

英文摘要

We recast pass evaluation in football (soccer) as a Monte Carlo Tree Search (MCTS)-like evaluation problem whose components mostly exist in the literature under different names: a value model (possession value), a world model (multi-agent trajectories with ball interactions), and a policy over counterfactual actions (sampling pass variants with noise). Building on the first public high-fidelity tracking dataset with 3D ball trajectories from the Bundesliga, we introduce Monte Carlo Pass Search (MCPS), which infers kick parameters for each observed pass, samples execution variants and option variants, rolls each candidate forward with a ball-conditioned world model until the next ball interaction, and scores outcomes with a learned value model to obtain a distribution over gained value. This distribution enables distribution-aware attribution with two complementary execution-surplus scores used for analysis and ranking: mean-based and percentile-based scores. To make the world model sample-efficient under limited public data, we adapt a discrete-token, autoregressive trajectory generator from autonomous driving (SMART) and show it yields strong best-of-20 forecasting accuracy compared to baselines, while supporting fully hypothetical rollouts for downstream evaluation. We have released model checkpoints and code.

2601.11072 2026-06-10 cs.HC cs.AI cs.CY 交叉投稿

More Human or More AI? Visualizing Human-AI Collaboration Disclosures in Journalistic News Production

更人性化还是更AI?新闻制作中人机协作披露的可视化

Amber Kusters, Pooja Prajod, Pablo Cesar, Abdallah El Ali

发表机构 * Centrum Wiskunde & Informatica Amsterdam(阿姆斯特丹数学与信息学研究中心) Centrum Wiskunde & Informatica(阿姆斯特丹数学与信息学研究中心) TU Delft(代尔夫特理工大学) Utrecht University(乌得勒支大学)

AI总结 通过协同设计会话和实验室研究,探讨不同可视化披露方式(文本、基于角色的时间线、基于任务的时间线、聊天机器人)及协作比例如何影响用户对新闻中人机协作的感知。

Comments Accepted to ACM CHI 2026 - Preprint

详情
AI中文摘要

在新闻编辑流程中,目前对AI使用的披露仅限于简单的标签,这忽略了人类和AI如何在新闻文章上协作的细微差别。通过协同设计会话(N=10),我们收集了69个披露设计,并实现了四个原型,以可视化方式披露新闻中的人机协作。随后,我们进行了一项受试者内实验室研究(N=32),考察披露可视化(文本、基于角色的时间线、基于任务的时间线、聊天机器人)和协作比例(主要人类 vs. 主要AI)如何影响可视化感知、注视模式以及体验后反应。我们发现,文本披露在传达人机协作方面效果最差,而聊天机器人提供了最深入的信息。此外,基于角色的时间线在主要人类文章中放大了AI的贡献,而基于任务的时间线在主要AI文章中将感知转向人类参与。我们贡献了人机协作披露可视化及其评估,并提出了关于可视化如何改变对AI在新闻文章创作中实际角色感知的警示性考虑。

英文摘要

Within journalistic editorial processes, disclosing AI usage is currently limited to simplistic labels, which misses the nuance of how humans and AI collaborated on a news article. Through co-design sessions (N=10), we elicited 69 disclosure designs and implemented four prototypes that visually disclose human-AI collaboration in journalism. We then ran a within-subjects lab study (N=32) to examine how disclosure visualizations (Textual, Role-based Timeline, Task-based Timeline, Chatbot) and collaboration ratios (Primarily Human vs. Primarily AI) influenced visualization perceptions, gaze patterns, and post-experience responses. We found that textual disclosures were least effective in communicating human-AI collaboration, whereas Chatbot offered the most in-depth information. Furthermore, while role-based timelines amplified AI contribution in primarily human articles, task-based timeline shifted perceptions toward human involvement in primarily AI articles. We contribute Human-AI collaboration disclosure visualizations and their evaluation, and cautionary considerations on how visualizations can alter perceptions of AI's actual role during news article creation.

2606.09041 2026-06-10 cs.CY cs.AI cs.GR cs.HC cs.MM 交叉投稿

Culturally-Aware AI for Cross-Boundary Community Learning: Undergraduate Innovation at the Intersection of Computation and Design

跨边界社区学习的文化感知AI:计算与设计交叉领域的本科生创新

Jiaojiao Zhao, Weisheng Zhang, Jiawen Cai, Haibin Gao, Luyao Zhang

发表机构 * Duke Kunshan University(杜克昆山大学) Zhouzhuang Mystery of Life Museum(周庄生命之谜博物馆) Digital Innovation Research Center and Social Science Division(数字创新研究中心和社会科学系)

AI总结 本文提出一个协作框架,通过社区参与计算实现文化感知AI教育,促进社会工作和计算科学跨学科融合,应用于文化遗产保护与可持续发展。

详情
AI中文摘要

人工智能在教育领域(AIED)的研究正在迅速扩展,但技术进步往往缺乏以人为中心的根基和对文化背景的充分关注。社区学习作为一种根植于社会工作的教学法,在AIED研究中仍然代表性不足,尤其是在亚太地区。本文报告了跨边界社区学习,其中本科生开发基于AI的解决方案,用于文化遗产保护和可持续发展。我们考察了社区参与计算如何在教育、技术和文化三个维度上实现以人为中心的AIED。我们贡献了一个文化感知AIED的协作框架,该框架通过打破社会工作与计算科学之间的学科壁垒,促进多方利益相关者协作,同时扩大参与度。

英文摘要

Research on artificial intelligence in education (AIED) is rapidly expanding, yet technical progress often lacks human-centered grounding and adequate attention to cultural context. Community-Based Learning, a pedagogy rooted in social work, remains underrepresented in AIED research, particularly within Asia-Pacific contexts. This paper reports on cross-boundary Community-Based Learning where undergraduate students develop AI-enabled solutions for cultural heritage preservation and sustainable development. We examine how community-engaged computing operationalizes human-centered AIED across three dimensions: education, technology, and culture. We contribute a collaborative framework for culturally-aware AIED that fosters multi-stakeholder collaboration while widening participation by dissolving disciplinary silos between social work and computational science.

2606.09831 2026-06-10 cs.HC cs.AI 交叉投稿

AI-Driven Analytics of Team-Teaching Talk: Acoustic Patterns across Experience, Cohorts and the Learning Design

AI驱动的团队教学对话分析:跨经验、学生群体和学习设计的声学模式

Yuchen Liu, Roberto Martinez-Maldonado, Riordan Alfredo, Paola Mejia-Domenzain, Dwi Rahayu, Sadia Nawaz

发表机构 * Monash University(莫纳什大学) EPFL(瑞士联邦理工学院)

AI总结 本文提出基于AI的语音处理方法,分析团队教学中的课堂对话,发现经验丰富的教师、本科生班级和协作学习任务中音量变化更大,表明教师更频繁调节音量以突出关键信息并促进互动。

Comments Accepted at AIED 2026 (International Conference on Artificial Intelligence in Education), 14 pages, 4 figures

详情
AI中文摘要

随着课堂规模的扩大,团队教学越来越多地被用于整合多位教师的专业知识和教学视角。然而,关于团队教学在实践中如何展开的实证理解仍然有限,特别是在教师贡献随经验水平、学生群体和学习任务设计差异方面。先前对团队教学的研究主要依赖于回顾性自我报告或小规模观察,对团队教学实施的微观过程提供了有限的见解。教师谈话为这些过程提供了一个可扩展的视角。虽然个体教学情境中的研究表明,语音的声学特征(如音质、语调和响度)可以影响学生学习,但来自团队教学环境的证据仍然稀缺。此外,通过手动观察或转录捕捉这些特征在团队教学课堂中尤其具有挑战性,因为多位教师在长时间和多空间位置上发言,限制了可扩展性,除非自动化。基于空间教学法理论和团队教学研究,本文提出了一种基于AI的语音处理方法,用于分析团队教学环境中的课堂谈话。我们分析了涉及12位教师的36个录制的本科生和研究生课程。编码了空间教学行为并提取了声学特征,以考察教师经验、学生群体和学习任务设计之间的差异。结果揭示了系统性差异,最显著的是在响度动态方面:高经验教师、本科生班级和协作学习任务表现出更大的响度变化,表明更频繁地调节音量以突出关键信息并支持课堂互动和参与。

英文摘要

As classroom cohorts expand, team teaching is increasingly used to integrate the expertise and pedagogical perspectives of multiple teachers. Yet, there is limited empirical understanding of how team teaching unfolds in practice, particularly regarding differences in teachers' contributions across experience levels, student cohorts, and learning task design. Prior research on team teaching has largely relied on retrospective self-reports or small-scale observations, offering limited insight into the micro-level processes through which team teaching is enacted. Teacher talk offers a scalable lens on these processes. While research in individual teaching contexts shows that acoustic features of speech (e.g., voice quality, intonation, and loudness) can shape student learning, evidence from team-teaching settings remains scarce. Moreover, capturing such features through manual observation or transcription is especially challenging in team-teaching classrooms, where multiple teachers speak across extended sessions and spatial locations, limiting scalability without automation. Grounded in spatial pedagogy theory and team-teaching research, this paper presents an AI-based speech processing approach to analyse classroom talk in team-teaching settings. We analysed 36 recorded undergraduate and postgraduate sessions involving 12 teachers. Spatial pedagogy behaviours were coded and acoustic features extracted to examine variation across teachers' experience, student cohorts, and the learning task design. The results reveal systematic differences, most notably in loudness dynamics: high-experience teachers, undergraduate classes and collaborative learning tasks exhibited greater loudness variation, suggesting more frequent modulation of volume to foreground key information and support classroom interaction and engagement.

2606.09842 2026-06-10 cs.HC cs.AI cs.CV 交叉投稿

Integrated Real-Time Motion Tracking and AI Analysis for Athletic Performance Optimization

集成实时运动跟踪与AI分析以优化运动表现

Parth Agrawal, Ronit, Sagar Kumar, Aashish Bhambri

发表机构 * Department of Computer Science(计算机科学系) Department of Computer Science and Engineering(计算机科学与工程系) Chandigarh University(昌迪加尔大学)

AI总结 本文综述了实时人体姿态估计在运动分析中的方法,并开发了一个轻量级原型系统,利用MediaPipe框架提供实时反馈,以优化运动表现。

Comments 6 pages, 10 figures, 2 tables, IC2E3-2026 conference

详情
AI中文摘要

在真实世界环境中应用人体姿态估计(HPE)仍然是一项具有挑战性的任务。本文探讨并综述了实时HPE方法及其在个体运动分析中的局限性,同时开发了一个实用的轻量级原型用于真实世界的测试和使用。从传统的基于标记的运动捕捉系统发展到现代可访问且适应性强的无标记深度学习方法,本文综述了平衡精度和效率的基础架构。我们还比较了算法框架(如自顶向下、自底向上、单阶段方法等)在实际部署指标上的表现,包括推理延迟、帧率、平均关节位置误差和时间抖动,以指导运动应用的模型选择过程。作为我们的主要贡献,我们提出了一个模块化的轻量级软件原型,该原型使用MediaPipe HPE框架结合多种特定于运动的逻辑,为非专业用户提供实时洞察和基于AI的反馈。我们以最小的计算资源推导出运动洞察并提供反馈,同时展示了性能和可靠性指标。最后,我们提出了其他未来研究方向,如结合传感器和AR/VR。这项工作面向研究人员、工程师、运动科学家等,既作为技术资源,也作为实现类似或改进的实时HPE分析系统以增强运动表现或其他目的的有效蓝图。

英文摘要

Applying Human Pose Estimation (HPE) in real world environments remains a challenging task, this paper explores and surveys real time HPE approaches and their limitations in sports analysis for individuals, alongside developing a practical lightweight prototype for real world testing and usage. The older marker-based motion capture systems evolving to the modern accessible and adaptable markerless deep learning approaches, this survey explores the foundational architectures, which balance precision and efficiency. We also compare algorithmic frameworks (top-down, bottom-up, one-stage approaches, etc.) on practical deployment metrics such as inference latency, frame rate, mean per-joint position error, and temporal jitter to guide model selection process for sports application. As our prime contribution, we are proposing a modular, lightweight software prototype, which uses MediaPipe HPE framework with multiple exercise specific logic to deliver real-time insights and AI based feedback for non-expert users. We derive sports insights and providing feedback with minimal computational resources, while showcasing the performance and reliability metrics. In the end, we suggest other future research directions like combining sensors, and AR/VR. This work caters to researchers, engineers, sport scientists, etc., as both technical resource and a valid blueprint to implement a similar or improved real-time HPE analysis system for athletic performance enhancement or other purposes.

2606.09846 2026-06-10 cs.HC cs.AI cs.CL 交叉投稿

CANVAS: Captioning Art with Narrative Visual-Audio AI Systems

CANVAS: 用叙事视觉音频AI系统为艺术配文

Vignesh Nagarajan

发表机构 * BASIS Phoenix High School(BASIS凤凰高中)

AI总结 提出一种自动化工作流,利用大语言模型和文本转语音服务生成多感官艺术描述和同步音频解说,在20秒内以低于0.05美元的成本生成文本加音频输出,显著提高词汇多样性和叙事细节。

Comments 22 pages, 16 figures, 3 tables, 21 references

详情
AI中文摘要

由于替代文本简短或缺失,视觉艺术在很大程度上仍对盲人和低视力(BLV)观众不可及,这些文本很少传达艺术品的感官、空间或情感特质。本研究提出了一种自动化工作流,利用大语言模型和文本转语音服务生成多感官艺术描述和同步音频解说。该系统通过Zapier编排,将上传的图像转换为丰富的叙事字幕,无需人工干预,从而实现可访问媒体的快速、规模化生产。对50件艺术品的定量评估显示,AI生成的描述在词汇多样性、形容词密度和叙事细节方面显著高于基线字幕,同时保持可比的易读性水平。统计检验(t检验、方差分析)确认了丰富度和长度方面的显著差异,完整流水线在每张图像20秒内生成文本加音频输出,成本低于0.05美元。研究结果表明,自动字幕生成可以弥合博物馆和数字馆藏可访问性方面的差距,对更广泛的公众参与具有意义。未来工作可纳入BLV参与者的用户研究,以评估理解、偏好和最佳解释性语言水平。

英文摘要

Visual art remains largely inaccessible to blind and low-vision (BLV) audiences due to brief or absent alt-text, which rarely conveys the sensory, spatial, or emotional qualities of an artwork. This study presents an automated workflow that generates multi-sensory art descriptions and synchronized audio narration using large language models and text-to-speech services. The system, orchestrated through Zapier, converts uploaded images into rich narrative captions without human intervention, enabling rapid, scalable production of accessible media. Quantitative evaluation across 50 artworks shows that AI-generated descriptions contain significantly higher lexical diversity, adjective density, and narrative detail than baseline captions, while maintaining comparable readability levels. Statistical tests (t-tests, ANOVA) confirm meaningful differences in richness and length, and the full pipeline produces text-plus-audio outputs in under 20 seconds per image at a cost below $0.05. Findings demonstrate that automated captioning can bridge gaps in museum and digital-collection accessibility, with implications for broader public engagement. Future work can incorporate user studies with BLV participants to assess comprehension, preference, and optimal levels of interpretive language.

2606.09852 2026-06-10 cs.HC cs.AI cs.CL cs.LG cs.MA cs.SE 交叉投稿

LLM-Based Code Documentation Generation and Multi-Judge Evaluation

基于LLM的代码文档生成与多裁判评估

Ikbel Ghrab, Mohamed Dhieb, Ismail Khenissi, Ines Abdeljaoued-Tej

发表机构 * University of Tunis El Manar(突尼斯国家理工大学)

AI总结 提出利用八种大语言模型自动生成代码文档,并通过多裁判评估框架(四个LLM从九个维度评分)提升文档质量,在医学物理库上实验显示最佳与最差模型性能差距达42%。

Comments ICAHS, \c{opyright} 2025 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works

详情
Journal ref
Conference ICAHS IEEE, 2025
AI中文摘要

高质量的源代码文档至关重要但往往被忽视,尤其是在医疗保健等关键领域,可靠性和可维护性至关重要。我们提出了一个AI驱动的框架,利用八种最先进的大语言模型(包括GPT、Gemini、Qwen和LLaMA变体)自动从代码和仓库生成文档。该系统基于PocketFlow编排框架,采用模块化流水线和高级提示工程,生成结构化、上下文感知的文档。为确保质量并指导模型选择,我们引入了MultiLLMasJudges评估框架,其中四个独立的LLM从九个标准(如完整性、清晰度和忠实度)评估输出。在开源医学物理库上进行的实验表明,最佳和最差模型之间的性能差距为42%。通过结合多样化的模型输出、优化的提示和严格的评估,我们的方法提高了文档质量并减少了人工工作量,特别是在安全关键的医疗软件中。

英文摘要

High-quality source code documentation is vital yet often neglected, especially in critical domains like healthcare where reliability and maintainability are essential. We presented an AI powered framework that automates documentation generation from code and repositories using eight state of the art Large Language Models (LLMs), including GPT, Gemini, Qwen, and LLaMA variants. Built on the PocketFlow orchestration framework, the system applies modular pipelines and advanced prompt engineering to produce structured, context aware documentation. To ensure quality and guide model selection, we introduced a MultiLLMasJudges evaluation framework, where four independent LLMs assess outputs across nine criteria, such as Completeness, Clarity, and Faithfulness. Experiments conducted on an open-source medical physics library, demonstrated showed a 42% performance gap between top and bottom models. By combining diverse model outputs, optimized prompting, and rigorous evaluation, our approach enhances documentation quality and reduces manual effort, especially in safety critical healthcare software.

2606.09860 2026-06-10 cs.LG cs.AI stat.AP stat.ML 交叉投稿

Conformal Risk Prediction for Non-Alcoholic Fatty Liver Disease Using Gradient Boosting with Distribution-Free Coverages

基于梯度提升与无分布覆盖的非酒精性脂肪肝病共形风险预测

Xinze Zhang

AI总结 提出结合梯度提升决策树与共形预测的机器学习框架Method,实现非酒精性脂肪肝病个体风险的无分布校准覆盖预测,在中国多中心队列中AUROC达0.912,优于多种方法。

详情
AI中文摘要

非酒精性脂肪肝病(NAFLD)影响全球约25%的成年人,带来显著的肝脏和心血管风险。然而,人群层面的筛查工具仍不充分。我们提出Method,一种用于NAFLD风险预测的机器学习框架,将梯度提升决策树与共形预测相结合,以在个体风险估计上产生校准的、无分布的覆盖保证。它集成了基于互信息的稳定性选择过程,通过自助重采样识别紧凑、临床可解释的特征子集,构建预测集,其边际覆盖可证明超过用户指定的置信水平。我们在中国广州的多中心队列(主要n=2,187;外部验证n=412)上评估了Method,使用了涵盖人口统计学、代谢生物标志物和生活方式因素的78个候选特征。Method内部AUROC为0.912,外部为0.891,优于深度神经网络、TabNet、支持向量机和逻辑回归。共形预测集在90%名义水平下达到91.3%的经验覆盖。从这些分数得出的三层风险分层将人群分为不同组别,高风险亚组的12个月进展率是低风险组的4.7倍。选定的特征——特别是腰围、ALT、GGT、甘油三酯、空腹血糖和BMI——与已建立的代谢风险因素一致,提供了生物学合理性。

英文摘要

Non-alcoholic fatty liver disease (NAFLD) affects roughly 25% of global adults, posing substantial hepatic and cardiovascular risks. Yet, population-level screening tools remain inadequate. We present Method, a machine-learning framework for NAFLD risk prediction coupling gradient-boosted decision trees with conformal prediction to yield calibrated, distribution-free coverage guarantees on individual risk estimates. It integrates a mutual-information-based stability selection procedure to identify a compact, clinically interpretable feature subset via bootstrap resampling, constructing prediction sets whose marginal coverage provably exceeds a user-specified confidence level. We evaluated Method on a multicenter cohort from Guangzhou, China (primary n=2,187; external validation n=412) using 78 candidate features across demographics, metabolic biomarkers, and lifestyle factors. Method achieves an AUROC of 0.912 internally and 0.891 externally, outperforming deep neural networks, TabNet, support vector machines, and logistic regression. Conformal prediction sets achieve 91.3% empirical coverage at the 90% nominal level. A three-tier risk stratification derived from these scores separates the population into distinct groups, with the high-risk subgroup showing a 12-month progression rate 4.7 times that of the low-risk tier. The selected features -- notably waist circumference, ALT, GGT, triglycerides, fasting glucose, and BMI -- align with established metabolic risk factors, providing biological plausibility.

2606.09867 2026-06-10 cs.AR cs.AI 交叉投稿

EstRTL: Functional Estimation Guided RTL Code Generation

EstRTL:功能估计引导的RTL代码生成

Qi Xiong, Renzhi Chen, Bowei Wang, Yuqing Xiong, Libo Huang, Lei Wang

发表机构 * College of Computer Science and Technology, National University of Defense Technology(国防科技大学计算机科学与技术学院) Defense Innovation Institute, Academy of Military Science (AMS)(军事科学院创新院) School of Computer Science and Technology, Shandong University(山东大学计算机科学与技术学院) Key Laboratory of Advanced Microprocessor Chips and Systems, Changsha, China, and College of Computer Science and Technology, National University of Defense Technology, Changsha, China(先进微处理器芯片与系统重点实验室,长沙,中国,和国防科技大学计算机科学与技术学院,长沙,中国) Defense Innovation Institute, AMS, Beijing, China and Qiyuan Lab, Beijing, China(军事科学院创新院,北京,中国和启元实验室,北京,中国)

AI总结 提出EstRTL框架,通过静态功能评分估计,结合生成、评估和修正三阶段范式,提升LLM生成RTL代码的功能正确性,在通用LLM上正确率提升3.2%-9.0%。

详情
AI中文摘要

优化寄存器传输级(RTL)代码在硬件设计中至关重要。大型语言模型(LLM)为RTL代码的自动生成和优化提供了新方法,有望显著加速设计过程并减少人力投入。然而,现有的RTL代码生成方法通常侧重于模型微调和利用各种扩展技术来增强RTL代码生成能力,缺乏对功能正确性的关注。确保生成的RTL代码不仅编译成功,而且在实际硬件实现中按预期运行仍然是一个关键挑战。为解决这一问题,我们提出了EstRTL,一个基于静态功能评分估计的LLM驱动的协作智能体框架,用于RTL代码生成。EstRTL采用三阶段范式:生成、评估和修正。在阶段中,功能评估智能体根据评分和评估结果静态评估生成的代码,并决定是直接输出代码、返回重新生成还是转发给代码修正智能体。该框架可应用于各种专为RTL代码生成设计的LLM,进一步增强生成代码的正确性。通过提供定量评分和可读的需求比较,它提高了AI辅助RTL代码生成的透明度。实验表明,EstRTL将通用LLM的RTL代码生成正确率显著提升了3.2%-9.0%,展示了我们系统的实用价值。代码和实验结果已开源,链接:this https URL。

英文摘要

Optimizing register transfer level (RTL) code is of vital importance in hardware design. Large language models (LLMs) provide new methods for the automatic generation and optimization of RTL code, offering the potential to significantly accelerate the design process and reduce human effort. However, existing methods for generating RTL code often focus on model fine-tuning and the use of various expansion techniques to enhance the RTL code generation capabilities, lacking attention to the functional correctness. Ensuring that the generated RTL code not only compiles successfully but also behaves as intended in real hardware implementations remains a critical challenge. To address this issue, we propose EstRTL, an LLM-powered collaborative agent framework for RTL code generation based on static functional score estimation. EstRTL operates a three-stage paradigm: Generation, Estimation and Correction. During the stages, the functional estimation agent statically evaluates the generated code based on score and assessment results, and decides whether to output the code directly, return it for regeneration, or forward it to the code correction agent. This framework can be applied to various LLMs that designed for RTL code generation, further enhancing the correctness of the generated code. By providing quantitative scores and human-readable requirements comparisons, it improves the transparency of AI-assisted RTL code generation. Experiments show that EstRTL significantly improves the correctness of RTL code generation by generic LLM by 3.2\%-9.0\%, demonstrating the practical value of our system. The codes and experimental results are open-sourced at link: https://anonymous.4open.science/status/EstRTL-E200/.

2606.09872 2026-06-10 cs.LG cs.AI 交叉投稿

PatchSTG: Scalable Spatiotemporal Graph Transformers for Traffic Forecasting on Irregular Sensor Networks

PatchSTG:面向不规则传感器网络的交通预测可扩展时空图Transformer

Jichao Li, Xuanming Shi

发表机构 * Portsmouth Abbey School(波特茅斯修道学校) CodingFuture (Shanghai) Education Technology Co., Ltd.(上海CodingFuture教育科技有限公司)

AI总结 提出PatchSTG,通过地理信息将传感器划分为平衡的局部补丁,并采用双注意力编码器交替捕捉局部和全局依赖,实现计算复杂度从二次降至近线性,在不规则传感器网络上取得高效且稳定的交通预测性能。

Comments 22 pages,12 figures

详情
AI中文摘要

交通预测是智能交通系统的基本组成部分,但由于传感器分布不规则以及建模大规模时空依赖的高计算成本,在实际环境中仍然具有挑战性。在实际交通网络中,传感器在不同区域分布不均,导致空间结构不均匀,限制了现有基于图和基于注意力模型的有效性和可扩展性。为了解决这些挑战,我们提出了PatchSTG,一种基于补丁的时空图Transformer,专为不规则传感器网络上的高效预测而设计。关键思想是引入一种层次化空间表示,基于地理信息将传感器划分为平衡且保持局部性的补丁。在此结构之上,双注意力编码器交替进行补丁内注意力(捕捉局部交互)和补丁间注意力(建模全局依赖),将计算复杂度从二次降低到近线性。我们在罗德岛的真实交通数据以及额外的大规模数据集上评估了PatchSTG。实验结果表明,所提模型在多个预测时域上实现了稳定且具有竞争力的预测性能,同时显著提高了计算效率。消融研究进一步验证了空间划分和双注意力在捕捉局部和长程交通动态方面的有效性。这些结果表明,基于补丁的时空建模为不规则空间设置下的交通预测提供了一个可扩展且有效的框架。

英文摘要

Traffic forecasting is a fundamental component of intelligent transportation systems, yet remains challenging in real-world settings due to irregular sensor distributions and the high computational cost of modeling large-scale spatiotemporal dependencies. In practical traffic networks, sensors are unevenly distributed across regions, leading to non-uniform spatial structures that limit the effectiveness and scalability of existing graph-based and attention-based models. To address these challenges, we propose PatchSTG, a patch-based spatiotemporal graph Transformer designed for efficient forecasting on irregular sensor networks. The key idea is to introduce a hierarchical spatial representation that partitions sensors into balanced, locality-preserving patches based on geographic information. On top of this structure, a dual attention encoder alternates between intra-patch attention for capturing local interactions and inter-patch attention for modeling global dependencies, reducing computational complexity from quadratic to near-linear scaling. We evaluate PatchSTG on real-world traffic data from Rhode Island and additional large-scale datasets. Experimental results demonstrate that the proposed model achieves stable and competitive forecasting performance across multiple horizons, while significantly improving computational efficiency. Ablation studies further validate the effectiveness of spatial partitioning and dual attention in capturing both local and long-range traffic dynamics. These results suggest that patch-based spatiotemporal modeling provides a scalable and effective framework for traffic forecasting under irregular spatial settings.

2606.09893 2026-06-10 eess.IV cs.AI cs.LG 交叉投稿

Tractogram foundation model

TractFM:纤维束图基础模型

Guikun Chen, Yuqian Chen, Yijie Li, Yogesh Rathi, Nikos Makris, Fan Zhang, Wenguan Wang, Lauren J. O'Donnell

发表机构 * The State Key Lab of Brain-Machine Intelligence, Zhejiang University, Hangzhou(脑机智能国家重点实验室,浙江大学,杭州) Department of Radiology, Brigham and Women’s Hospital, Mass General Brigham, Boston(放射科,布里洛妇女医院,马萨诸塞总医院,波士顿) Harvard Medical School, Boston(哈佛医学院,波士顿) Academy of Medical Engineering and Translational Medicine, Tianjin University, Tianjin(医学工程与转化医学研究院,天津大学,天津) School of Information and Communication Engineering, University of Electronic Science and Technology of China, Chengdu(信息与通信工程学院,电子科技大学,成都) Psychiatry Neuroimaging Laboratory, Brigham and Women’s Hospital, Mass General Brigham, Boston(精神病神经影像实验室,布里洛妇女医院,马萨诸塞总医院,波士顿) Department of Psychiatry, Center for Morphometric Analysis, Massachusetts General Hospital, Boston(精神病科,形态分析中心,马萨诸塞总医院,波士顿)

AI总结 提出TractFM基础模型,直接从全脑纤维束集学习可复用表示,结合局部纤维编码器和置换等变纤维束编码器,通过密集解剖束分割预训练,实现纤维束级和受试者级任务的迁移。

详情
AI中文摘要

扩散MRI(dMRI)纤维束成像是在活体人脑中绘制白质通路的唯一非侵入性方法。它将每个大脑表示为一个纤维束图:一个大型、无序的三维流线集合,包含局部流线几何和全脑解剖组织的信息。这种结构使纤维束图成为表示学习的自然但具有挑战性的目标。现有方法将流线分类和受试者级预测视为独立问题:流线分类器关注几何模式,而受试者级预测通常依赖于手工特征。因此,当前方法无法学习连接流线解剖与全脑受试者间变异的可复用表示。本文介绍TractFM,一个纤维束图基础模型,直接从全脑纤维束集学习可复用表示。TractFM结合了局部流线编码器和置换等变纤维束编码器,使得一个受试者的所有流线能够在单次前向传递中共同上下文化。在密集解剖束分割(即给单个流线分配解剖标签)上的预训练产生了两种互补表示:用于束分割的上下文化流线级嵌入和用于下游受试者表型预测的紧凑受试者级描述符。在三种纤维束成像算法和五个dMRI数据集上,TractFM迁移到流线级和受试者级任务。其冻结表示实现了准确的束分割,并在独立数据集上预测年龄和性别。这些结果表明,全脑几何上下文(一次性学习)可以泛化到纤维束成像流程、数据集和预测任务中。

英文摘要

Diffusion MRI (dMRI) tractography is the only noninvasive approach for mapping white-matter pathways in the living human brain. It represents each brain as a tractogram: a large, unordered set of three-dimensional streamlines that includes information about both local streamline geometry and whole-brain anatomical organization. This structure makes tractograms a natural but challenging target for representation learning. Existing methods treat streamline classification and subject-level prediction as separate problems: streamline classifiers focus on geometric patterns, whereas subject-level prediction often depends on hand-crafted features. As a result, current methods do not learn reusable representations that connect streamline anatomy with whole-brain inter-subject variation. Here we introduce TractFM, a tractogram foundation model that learns reusable representations directly from whole-brain streamline sets. TractFM combines a local streamline encoder with a permutation-equivariant tractogram encoder, allowing all streamlines from a subject to be contextualized jointly in a single forward pass. Pretraining on dense anatomical tract parcellation, i.e., assigning anatomical labels to individual streamlines, yields two complementary representations: contextualized streamline-level embeddings for tract parcellation and compact subject-level descriptors for downstream prediction of subject phenotypes. Across three tractography algorithms and five dMRI datasets, TractFM transfers to both streamline-level and subject-level tasks. Its frozen representations achieve accurate tract parcellation and predict age and sex across independent datasets. These results show that whole-brain geometric context, learned once, can generalize across tractography pipelines, datasets, and prediction tasks.

2606.09896 2026-06-10 cs.GT cs.AI cs.LG 交叉投稿

HMAF: A Hierarchical Multi-Slot GD-RTB Allocation Framework

HMAF:一种分层多槽GD-RTB分配框架

Tianxing Bu, Zhaoqi Zhang, Linyou Cai, Miao Xie, Shengri Xue, Tan Qu, Qianlong Xie, Xingxing Wang, Siqiang Luo, Gao Cong

发表机构 * Meituan(美团) Nanyang Technological University(南洋理工大学) China Agricultural University(中国农业大学)

AI总结 针对GD与RTB共存广告平台中短期收益与长期交付的平衡难题,提出分层多槽分配框架HMAF,采用计划-校准-执行范式,集成离线约束优化与在线决策,在美团实现GD交付率提升3.72%、广告总收入提升1.59%。

Comments Accepted by KDD 2026 Applied Data Science Track

详情
AI中文摘要

在现代在线广告平台中,保量交付(GD)合约与实时竞价(RTB)拍卖共存并相互竞价。现有方法要么将GD和RTB优化解耦,要么依赖启发式优先级规则,因此在复杂多槽交付和曝光约束下,难以有效平衡短期收入最大化与长期合约交付。为应对这些挑战,我们提出HMAF(分层多槽分配框架),一个旨在优化GD-RTB广告平台中曝光分配的统一框架。HMAF采用计划-校准-执行范式作为其核心结构,整合离线约束优化与在线决策,平衡离线GD资源规划、动态校准GD-RTB竞争力,并在多槽环境中做出实时列表级排名决策。HMAF已在全球最大在线食品配送平台之一美团的多项营销场景中实施,使GD交付率提升3.72%,广告总收入提升1.59%。

英文摘要

In modern online advertising platforms, Guaranteed Delivery (GD) contracts coexist and bid with Real-Time Bidding (RTB) auctions. Recent approaches either decouple GD and RTB optimization or rely on heuristic priority rules, and thus fail to effectively balance short-term revenue maximization with long-term contract delivery under complex multi-slot delivery and impression constraints. To address these challenges, we propose HMAF (Hierarchical Multi-Slot Allocation Framework), a unified framework designed to optimize impression allocation in GD--RTB advertising platforms. HMAF employs the Plan--Calibrate--Execute paradigm as its core structure, and integrates offline constraint optimization with online decision-making, balancing offline GD resource planning, dynamically calibrating GD--RTB competitiveness, and making real-time listwise rank decisions across multi-slot environments. HMAF has been implemented in multiple marketing scenarios at Meituan, one of the world's largest online food delivery platforms, leading to a 3.72% increase in GD delivery rate and a 1.59% increase in total advertisement revenue.

2606.09942 2026-06-10 cs.SE cs.AI 交叉投稿

Anomaly Detection and Root Cause Analysis for Microservice Systems

微服务系统的异常检测与根因分析

Luan Pham

AI总结 针对微服务系统异常检测与根因分析的五大局限性,提出端到端方法BARO、EventADL和TORAI,并构建基准RCAEval,通过实验验证有效性与鲁棒性。

Comments This is the pre-print of my PhD thesis, submitted to RMIT University

详情
AI中文摘要

微服务系统被广泛用于构建云应用,但其复杂性使得故障不可避免,从而降低用户体验并造成经济损失。自动化异常检测与根因分析(RCA)目前是活跃的研究领域,但现有技术存在五个局限性。首先,大多数方法将异常检测和RCA分开处理,假设异常已被正确检测,当检测因噪声或延迟而不精确时便会失效。其次,它们关注指标、日志和跟踪,而忽略了事件数据(如API调用和配置变更)。第三,许多方法需要给定的服务调用图,否则无法诊断。第四,该领域缺乏标准化的数据集和评估框架,导致方法难以公平比较。第五,尽管基于因果推断的RCA已成为主流,但其有效性、效率和鲁棒性仍不明确。本论文通过两组贡献解决这些局限性。第一组引入了独立和联合利用可观测性数据的方法。BARO是一种针对指标数据的端到端异常检测与RCA方法。EventADL是一种针对事件数据的端到端框架。TORAI是一种无需服务调用图的多模态RCA框架。在真实微服务系统上的大量实验证明了它们的有效性和鲁棒性。第二组贡献提供了基准数据集、评估框架和系统性的评估工作。RCAEval是一个全面的基准,为未来研究提供即用数据集和可复现基线。对现有RCA方法(尤其是基于因果推断的方法)的系统性评估提供了指导未来方向的见解。本论文因此推进了微服务故障的自动化异常检测与RCA,为事件缓解和修复的未来研究奠定基础。

英文摘要

Microservice systems are widely used to build cloud applications, yet their complexity makes failures inevitable, degrading user experience and causing economic loss. Automated anomaly detection and root cause analysis (RCA) are now active research areas, but existing techniques share five limitations. First, most treat anomaly detection and RCA separately, assuming anomalies are detected correctly, and falter when detection is imprecise due to noise or delay. Second, they focus on metrics, logs, and traces, leaving event data such as API calls and configuration changes underexplored. Third, many require a given service call graph and cannot diagnose without one. Fourth, the field lacks standardised datasets and evaluation frameworks, so methods are hard to compare fairly. Fifth, although causal inference-based RCA has become dominant, its effectiveness, efficiency, and robustness remain unclear. This thesis addresses these limitations through two groups of contributions. The first introduces methods that exploit observability data both independently and collectively. BARO is an end-to-end anomaly detection and RCA approach for metric data. EventADL is an end-to-end framework for event data. TORAI is a multimodal RCA framework that requires no service call graph. Extensive experiments on real microservice systems demonstrate their effectiveness and robustness. The second group delivers benchmarking datasets, an evaluation framework, and systematic evaluation efforts. RCAEval is a comprehensive benchmark providing ready-to-use datasets and reproducible baselines for future research. A systematic evaluation of existing RCA methods, especially causal inference-based approaches, offers insights that guide future directions. This thesis thereby advances automated anomaly detection and RCA for microservice failures, enabling future research on incident mitigation and remediation.

2606.09944 2026-06-10 econ.GN cs.AI q-fin.EC 交叉投稿

GAGI: A Gini-Adjusted GDP-per-Capita Index for Distribution-Aware Macroeconomic Welfare Monitoring

GAGI:一种用于分布感知宏观经济福利监测的基尼调整人均GDP指数

Sivasathivel Kandasamy

发表机构 * Independent Researcher(独立研究者)

AI总结 提出GAGI指数,通过基尼系数和价格水平调整人均GDP,以监测福利分配效应,应用于G7国家发现福利增长与GDP增长持续偏离。

详情
AI中文摘要

人均GDP是政府机构追踪经济繁荣和经济事件后果的默认视角,但它忽视了生活繁荣的两个首要决定因素:收入/财富分配和通胀影响。不平等调整的收入衡量指标本身并不新鲜,但宏观经济监测工具包中具体缺失的不是福利概念,而是一个可操作的监测触发指标:一个足够简洁、可每年从公开数据计算、无需建模假设即可审计、且标准化以便于理解年度间和国家间变化(监管机构需要据此采取行动)的统计量。我们构建了这样一个工具,即基尼调整人均GDP指数(GAGI):一种可复现、可公开计算的公式,通过不平等调整因子(1-G)和价格水平重新调整各国人均GDP,并以2010年为基准标准化。GAGI是一个通用福利指数,并非特定于AI自动化,适用于任何需要追踪福利调整后繁荣的场景。将GAGI应用于2010-2026年的G7经济体,我们发现福利调整后的繁荣与总体GDP增长持续且日益偏离,这种偏离在2022年后急剧扩大,时间上与COVID后遗症和生成式AI部署加速相吻合,尽管仅凭此证据尚不能证明因果关系。我们认为GAGI是基于GDP监测的必要补充:任何仅追踪总产出的宏观经济监测工具都会系统性地忽略自动化可能造成的分配损害,即使报告的增长依然强劲。

英文摘要

GDP per capita is the default lens through which governibng bodies track the economic prosperity and consequences of economic events , yet it is blind to two first-order determinants of lived prosperity: income/wealth distribution and inflation impact. Inequality-adjusted income measures are themselves not new but What is missing from the macroeconomic monitoring toolkit specifically is not a welfare concept but an operational monitoring trigger: a statistic minimal enough to compute annually from public data, transparent enough to audit without modelling assumptions, and normalised so that year-on-year, cross-country change ? the quantity a regulator needs to act on? is legible. We assemble such an instrument, the Gini- Adjusted GDP per Capita Index (GAGI): a reproducible, publicly computable formulation that rescales each country's GDP per capita by its inequality-adjustment factor (1-G) and its price level, normalised to a 2010 baseline. GAGI is a general-purpose welfare index, not inherently specific to AI automation, applicable wherever welfare-adjusted prosperity needs tracking. Applying GAGI to the G7 economies over 2010-2026, we show that welfare-adjusted prosperity has diverged persistently and increasingly from headline GDP growth, that the divergence widens sharply after 2022, temporally coincident with, though not, on this evidence alone, demonstrated to be caused by the after effects of COVID and the acceleration of generative-AI deployment. We argue that GAGI is a necessary complement to GDP-based monitoring: any macroeconomic monitoring instrument that tracks only aggregate output will systematically miss the distributional harm that automation can cause even while reported growth remains strong.

2606.09953 2026-06-10 eess.IV cs.AI cs.LG 交叉投稿

Deep Slice Interpolation for Reducing Through-Plane Anisotropy and Noise in Head CT

深度切片插值用于减少头部CT的穿平面各向异性和噪声

Luis Cortés Ferre, Miguel A. Gutiérrez-Naranjo, Marcin Balcerzyk

发表机构 * Department of Computer Science and Artificial Intelligence, University of Seville(塞维利亚大学计算机科学与人工智能系) Bioaraba Health Research Institute(Bioaraba健康研究 institute) IKERBASQUE, Basque Foundation of Science(巴斯克科学基金会)

AI总结 提出一种深度学习系统,通过相邻轴向切片对合成中间CT切片,将有效穿平面间距减半,同时实现隐式降噪,在结构指标上优于经典插值和视频帧插值方法。

详情
AI中文摘要

头部计算机断层扫描(CT)通常使用亚毫米级的面内分辨率,但穿平面间距为2-5毫米,造成显著的各向异性,这会降低多平面重建、血肿体积估计等体积测量以及假设近似各向同性体素的后续算法的性能。我们提出一个深度学习系统,从相邻轴向切片对合成中间CT切片,将有效穿平面间距减半。该系统改善三维可视化,同时产生固有降噪的输出,在一次推理中实现两个互补优势。为构建可靠系统,我们系统评估像素级损失(均方误差MSE和平均绝对误差L1)、结构相似性损失(结构相似性指数SSIM及其多尺度变体MS-SSIM)以及混合组合。在保留测试集上,所有收敛模型在所有结构指标上均优于经典插值基线和预训练视频帧插值方法(RIFE、FILM),其中MS-SSIM+L1提供最强平衡性能。我们还记录了SSIM族损失中的训练不稳定性并识别部分补救措施:标准数值修复消除了主要失败模式,但在较小批量大小下留下残余发散。所有结果均报告患者级自助法置信区间和配对统计检验。作为示例,我们将系统应用于来自Virgen del Rocío大学医院的非分布头部CT序列:模型合成中间切片,并在真实切片上表现出我们理论分析预测的隐式降噪特征,支持在单个外部病例中插值质量和隐式降噪不局限于训练分布。

英文摘要

Head computed tomography (CT) typically uses sub-millimeter in-plane resolution but 2-5 mm through-plane spacing, creating substantial anisotropy that degrades multiplanar reconstructions, volumetric measurements such as hematoma volume estimation, and downstream algorithms that assume near-isotropic voxels. We present a deep learning system that synthesizes intermediate CT slices from pairs of neighboring axial slices, halving the effective through-plane spacing. The system improves three-dimensional visualization while simultaneously producing inherently denoised outputs, yielding two complementary benefits from a single inference pass. To build a reliable system, we systematically evaluate pixel-wise losses, namely mean squared error (MSE) and mean absolute error (L1); structural-similarity losses, namely the structural similarity index (SSIM) and its multi-scale variant (MS-SSIM); and hybrid combinations. On a held-out test set, all converged models outperform classical interpolation baselines and pretrained video frame interpolation methods (RIFE, FILM) on all structural measures, with MS-SSIM+L1 offering the strongest balanced profile. We also document training instability in SSIM-family losses and identify partial remedies: the standard numerical fixes eliminate the dominant failure mode but leave residual divergence at smaller batch sizes. All results are reported with patient-level bootstrap confidence intervals and paired statistical tests. As an illustration, we apply the system to an out-of-distribution head CT series from Hospital Universitario Virgen del Rocío: the model synthesizes intermediate slices and exhibits on the real slices the implicit-denoising signature predicted by our theoretical analysis, supporting in a single external case that interpolation quality and implicit denoising are not confined to the training distribution.

2606.09959 2026-06-10 cs.LG cs.AI 交叉投稿

Temporal Context Conditioning for Seasonality-Aware Precipitation Nowcasting of High-Intensity Rainfall

面向高强度降雨的季节感知降水临近预报的时间上下文条件化

Gijs van Nieuwkoop, Siamak Mehrkanoon

AI总结 提出TA-SmaAt-UNet模型,通过时间条件层(昼夜和季节循环编码)增强雷达降水临近预报,显著提升高强度降雨事件的预测性能。

Comments 9 pages, 6 figures

详情
AI中文摘要

降水临近预报越来越多地采用直接从近期雷达观测中学习的深度学习模型。尽管这类模型能有效捕捉短期降水运动,但它们往往缺乏降雨发展所依据的气象条件的更广泛上下文信息。本文研究轻量级时间上下文是否能改善基于雷达的临近预报,特别是针对高强度降雨。我们提出了时间感知小注意力U-Net(TA-SmaAt-UNet),它在核心SmaAt-UNet模型基础上扩展了时间条件层,利用昼夜时间和一年中时间的循环编码来调节中间特征表示。在KNMI雷达降水数据上的实验表明,时间条件化对罕见的高强度降水事件最为有益,同时也能改善季节变异性和预测降水强度分布的表征。层传导分析进一步表明,尽管参数成本很小,模型仍积极使用添加的时间条件层。这些发现表明,简单的、基于物理动机的时间上下文可以提高基于深度学习的降水临近预报的真实性和可靠性。我们的模型实现和训练设置可在GitHub上获取。

英文摘要

Precipitation nowcasting is increasingly being approached with deep learning models that learn directly from recent radar observations. Although such models can efficiently capture short-term precipitation motion, they often lack broader contextual information about the meteorological conditions under which rainfall develops. This paper investigates whether lightweight temporal context can improve radar-based nowcasting, particularly for high-intensity rainfall. We propose the Time-Aware Small-Attention U-Net (TA-SmaAt-UNet), which extends the core SmaAt-UNet model with temporal conditioning layers that use cyclical encodings of time-of-day and time-of-year to modulate intermediate feature representations. Experiments on KNMI radar precipitation data show that temporal conditioning is most beneficial for rare, high-intensity precipitation events, while also improving the representation of seasonal variability and predicted rainfall-intensity distributions. A layer conductance analysis further indicates that the added temporal conditioning layers are actively used by the model despite their small parameter cost. These findings suggest that simple, physically motivated temporal context can improve the realism and reliability of deep learning-based precipitation nowcasts. The implementation of our models and training setup is available on \href{https://github.com/gijsvn/TA-SmaAt-UNet}{GitHub}.

2606.09963 2026-06-10 physics.flu-dyn cs.AI 交叉投稿

Geometry-Aware Anisotropic Boundary Correction for Aerodynamic Simulation

几何感知的各向异性边界修正用于气动模拟

Xin Zhang, Yipeng Huang, Shu Jiang, Zhenzhong Wang, Min Jiang

发表机构 * School of Informatics, Xiamen University(厦门大学信息学院) Institute of Artificial Intelligence, Xiamen University(厦门大学人工智能研究院)

AI总结 针对神经算子忽视边界各向异性物理行为的问题,提出几何条件各向异性边界修正框架GeoABC,利用边界几何引入方向感知修正,在2D翼型和3D汽车任务中平均降低近边界相对L2误差约38%。

详情
AI中文摘要

气动模拟是工程形状设计的关键组成部分,其中表面压力系数等核心量强烈依赖于固体边界附近的流动动力学。神经算子为昂贵的计算流体动力学(CFD)求解器提供了一种高效替代方案。然而,传统方法各向同性地处理边界区域,未能考虑沿边界的不同物理行为。实际上,气动过程表现出各向异性:沿切向,流动沿壁面传播;沿法向,物理量受壁面约束。为了显式建模不同的物理行为,我们提出了GeoABC,一种几何条件各向异性边界修正框架。GeoABC利用边界几何将方向感知的边界修正引入神经算子的中间表示,将边界几何从静态输入特征转变为调节物理预测的结构先验。在2D翼型和3D汽车任务中,GeoABC一致地适应多种神经算子主干,平均降低近边界相对$L_2$误差约38%,缩小了主流神经算子共有的结构近壁间隙,推动神经算子向高保真气动模拟发展。

英文摘要

Aerodynamic simulation is a key component of engineering shape design, where core quantities such as the surface pressure coefficient strongly depend on flow dynamics near solid boundaries. Neural operators provide an efficient alternative to expensive Computational Fluid Dynamics (CFD) solvers. However, conventional methods treat the boundary region isotropically, failing to account for the distinct physical behaviors along the boundaries. In reality, the aerodynamic process exhibits anisotropy: along the tangential direction, flow propagates along the wall; along the normal direction, physical quantities are constrained by the wall. To explicitly model the distinct physical behaviors, we propose GeoABC, a geometry-conditioned anisotropic boundary correction framework. GeoABC leverages the boundary geometries to introduce direction-aware boundary correction into the intermediate representations of neural operators, transforming boundary geometry from static input features into a structural prior that modulates physical prediction. On 2D airfoil and 3D car tasks, GeoABC consistently adapts to multiple neural operator backbones, reducing near-boundary relative $L_2$ error by $\sim$38\% on average, narrowing the structural near-wall gap shared by mainstream neural operators, and advancing neural operators toward high-fidelity aerodynamic simulation.

2606.10197 2026-06-10 astro-ph.GA cs.AI 交叉投稿

Integral Field Unit Spectroscopy with One Fiber

单光纤积分场单元光谱学

Zehao Peng, Biprateep Dey, Chris J. Maddison, Joshua S. Speagle

发表机构 * University of Toronto(多伦多大学) Vector Institute(向量研究所)

AI总结 提出一种多模态概率基础模型,利用掩码自编码器从宽带图像预测星系任意空间位置的高分辨率光谱,无需IFU训练数据,性能与监督基线相当。

Comments Accepted for Conference on Physics and AI at Stanford University (PAI 2026)

详情
AI中文摘要

积分场单元(IFU)光谱学提供星系的空间分辨光谱,为星系演化提供关键见解。然而,其高观测成本限制了当前IFU数据集约$10^4$个天体。我们提出一个多模态概率基础模型,直接从宽带图像预测星系内任意空间位置的高分辨率光谱,并带有校准的不确定性。基于掩码自编码器框架,我们的架构注入光纤位置编码和红移感知波长编码,实现空间条件预测。使用暗能量光谱仪(DESI)巡天的470万张图像和单光纤光谱观测训练,我们的模型利用光纤放置的自然方差和星系的形态自相似性,在没有任何IFU训练数据的情况下实现IFU般的能力。预测的发射线通量图与APO附近星系巡天(MaNGA)的独立IFU观测结果一致,性能与直接在IFU数据上训练的监督基线相当。

英文摘要

Integral field unit (IFU) spectroscopy provides spatially resolved spectra across galaxies, offering crucial insights into their evolution. However, its high observational cost limits current IFU datasets to $\sim 10^4$ objects. We present a multi-modal, probabilistic foundation model that predicts high-resolution spectra with calibrated uncertainties at arbitrary spatial locations within a galaxy directly from broadband images. Built on a masked autoencoder framework, our architecture injects fiber positional encodings and redshift aware wavelength encodings, enabling spatially conditioned predictions. Trained on 4.7 million images and single fiber spectroscopic observations from the Dark Energy Spectroscopic Instrument (DESI) survey, our model exploits the natural variance of fiber placements and the morphological self-similarity of galaxies to achieve IFU-like capabilities without any IFU training data. Predicted emission line flux maps match independent IFU observations from the Mapping Nearby Galaxies at APO (MaNGA) survey, with performance comparable to a supervised baseline trained directly on IFU data.

2606.10219 2026-06-10 cs.LG cs.AI 交叉投稿

Fast Exact Nearest-Neighbor Learning for High-Frequency Financial Time Series

高频金融时间序列的快速精确最近邻学习

Henry Han, Diane Li

AI总结 针对金融高频数据增长带来的实时性挑战,提出基于Mojo的SIMD k-d树方法,通过方差分裂、连续存储和编译时向量化距离计算,在保持精确输出的同时实现17.5-43.5倍加速,并支持期权定价模型训练数据量提升10倍。

Comments 15 pages 5 figures;

详情
AI中文摘要

随着股票、ETF、外汇、期权和高频交易数据量的激增,AI在金融领域的大规模效率变得至关重要。这种增长给成熟的金融AI系统带来了核心挑战:模型必须从更大的历史语料库中学习,同时满足交易、风险管理和衍生品定价中的实时延迟约束。我们以高频金融时间序列的精确最近邻学习为具体案例,展示基于Mojo的金融AI可以应对这一挑战。我们引入了一种Mojo SIMD k-d树,采用基于方差的划分、连续的扁平缓冲区存储和编译时向量化距离计算。我们还提供了运行时结果,表明在标准剪枝和实现成本假设下,对于固定股票、大样本量、中等维度的情况,Mojo SIMD k-d树渐近地优于Mojo SIMD暴力搜索和scikit-learn的k-d树。在x86和ARM64架构的八个金融数据集上(训练样本最多277K),该方法在x86上比scikit-learn的k-d树加速17.5-21.6倍,在ARM64股票/ETF数据集上比scikit-learn暴力搜索加速28.1-43.5倍,同时保持精确输出。除了最近邻推理,Mojo的编译执行使得基于Extra Trees的隐含波动率定价模型能够训练10倍以上的期权数据,将看跌期权IV RMSE降低8.0%。这些结果将Mojo定位为金融AI的可扩展、生产就绪栈,并为其他数据密集型领域的高效AI提供了有前景的基础。

英文摘要

AI efficiency at scale is becoming critical in finance as market data volumes surge across equities, ETFs, FX, options, and high-frequency trading streams. This growth creates a core challenge for mature financial AI systems: models must learn from larger historical corpora while still meeting real-time latency constraints in trading, risk management, and derivative pricing. We use exact nearest-neighbor learning for high-frequency financial time series as a concrete case study to show that Mojo-based financial AI can address this challenge. We introduce a Mojo SIMD k-d tree with variance-based splitting, contiguous flat-buffer storage, and compile-time vectorized distance computation. We also provide a runtime result showing that, under standard pruning and implementation-cost assumptions, the Mojo SIMD k-d tree asymptotically dominates Mojo SIMD brute force and scikit-learn's k-d tree in the fixed-stock, large-$n$, moderate-dimensional regime. Empirically, across eight financial datasets on x86 and ARM64 with up to 277K training samples, the method achieves 17.5--21.6$\times$ speedup over scikit-learn's k-d tree on x86 and 28.1--43.5$\times$ over scikit-learn brute force on ARM64 equity/ETF datasets, while preserving exact outputs. Beyond nearest-neighbor inference, Mojo's compiled execution enables an Extra Trees-based implied-volatility pricing model to train on $10\times$ more options data, reducing put-IV RMSE by 8.0\%. These results position Mojo as a scalable, production-ready stack for financial AI and a promising foundation for efficient AI in other data-intensive fields. \keywords{Financial AI \and AI Efficiency \and Mojo \and SIMD \and K-D Trees \and KNN \and High-Frequency Trading \and Financial Time Series \and Scaling}

2606.10328 2026-06-10 cs.CV cs.AI 交叉投稿

Content-Induced Spatial-Spectral Aggregation Network for Change Detection in Remote Sensing Images

内容诱导的空间-光谱聚合网络用于遥感图像变化检测

Yunlong Liu, Zekai Zhang

发表机构 * School of Control Science and Engineering, Shandong University(控制科学与工程学院,山东大学)

AI总结 提出内容引导的空间-光谱集成网络(CSI-Net),通过空间推理、光谱差异和内容引导集成模块融合全局空间细节与光谱差异信息,有效抑制未变化区域差异,在三个数据集上取得最优性能。

详情
AI中文摘要

空间和光谱信息的整合有利于提高变化检测性能。然而,现有方法无法有效抑制未变化区域中空间和光谱差异的影响。为了解决这些问题,本文提出了一种内容引导的空间-光谱集成网络(CSI-Net),用于融合全局空间细节和光谱差异信息。具体而言,所提出的CSI-Net由空间推理(SR)模块、光谱差异(SD)模块和内容引导集成(CGI)模块组成。在SR模块中,通过级联图卷积块学习空间信息以进行全局建模。SD模块负责提取光谱特征,通过计算特征的均值和方差来减少未变化区域中光谱差异的影响。此外,为了有效集成空间-光谱特征,我们设计了CGI模块以进一步利用它们的互补信息。在该模块中,引入高层内容信息作为引导,以实现适当的交互。由于高效的空间-光谱融合,所提出的CSI-Net能够更好地学习变化特征,同时实现对光谱差异的抑制。在LEVIR-CD、WHU-CD和CLCD数据集上的实验结果表明,与最先进方法相比,所提出的CSI-Net产生了更好的性能,并且适用于不同场景。

英文摘要

The integration of spatial and spectral information is beneficial to the improvement of change detection performance. However, existing methods cannot efficiently suppress the influences of spatial and spectral differences in unchanged areas. To address these issues, in this paper we propose a content-guided spatial-spectral integration network (CSI-Net) for the fusion of global spatial details and spectral difference information. Specifically, the proposed CSI-Net is composed of a spatial reasoning (SR) module, a spectral difference (SD) module, and a content-guided integration (CGI) module. In the SR module, the spatial information is learned by cascaded graph convolution blocks for global modeling. The SD module is responsible for the extraction of spectral features, by calculating the means and variances of features to reduce the impact of spectral differences in unchanged regions. In addition, in order to integrate the spatial-spectral features efficiently, we design a CGI module to further take advantage of their complementary information. In this module, high-level content information is introduced as a guide for a proper interaction. Due to the efficient spatial-spectral fusion, the proposed CSI-Net can learn the changed features better while achieving a suppression of spectral differences. Experimental results on LEVIR-CD, WHU-CD, and CLCD datasets demonstrate that the proposed CSI-Net produces better performance compared to state-of-the-art methods, and is applicable to different scenarios

2606.10357 2026-06-10 cs.IR cs.AI 交叉投稿

Atomic Intent Reasoning: Bringing LLM Semantics to Industrial Cross-Domain Recommendations

原子意图推理:将LLM语义引入工业跨域推荐

Zhuohang Jiang, Yuxin Chen, Shijie Wang, Haohao Qu, Zhou Jindong, Wenqi Fan, Li Qing, Dongxu Liang, Jun Wang

发表机构 * The Hong Kong Polytechnic University(香港理工大学) Kuaishou Technology(快手科技)

AI总结 提出AIR框架,通过离线LLM推理与在线高效检索组合,实现工业级跨域推荐,在快手电商中GMV提升3.446%。

详情
Journal ref
Proceedings of the 32nd ACM SIGKDD Conference on Knowledge Discovery and Data Mining V.2 (KDD '26), August 09--13, 2026, Jeju Island, Republic of Korea
AI中文摘要

跨域推荐是内容到电子商务平台的核心问题。其目标是利用用户与内容的交互来推断电子商务端的潜在购买意图,从而提高转化率和商业价值。然而,在真实的工业场景中,跨域推荐面临多重挑战:不同领域之间存在显著的语义鸿沟,用户跨域行为序列通常规模庞大且噪声丰富。尽管大型语言模型(LLM)具有强大的语义理解和推理能力,但其毫秒级的推理延迟使得直接应用于在线推荐系统变得困难。为了解决这些问题,本文介绍了AIR(原子意图推理),一个为工业级部署设计的LLM驱动的跨域推荐框架。通过将LLM推理迁移到离线阶段,并在在线操作期间通过高效检索和组合动态构建用户意图表示,它在保持语义一致性的同时实现了约400倍的推理加速。在多个公共数据集上的实验结果表明,我们的方法在跨域推荐任务中达到了最先进的性能。此外,在快手电商真实业务场景中进行的大规模在线A/B测试显示,我们的方法在多个核心业务指标上取得了稳定且显著的提升,包括GMV增长+3.446%,充分验证了其在工业级推荐系统中的有效性和实用价值。

英文摘要

Cross-domain recommendation is a core problem in content-to-e-commerce platforms. Its objective is to leverage user interactions with content to infer potential purchasing intent on the e-commerce side, thereby enhancing conversion rates and commercial value. However, in real industrial scenarios, cross-domain recommendation faces multiple challenges: significant semantic gaps exist between different domains, and user cross-domain behavior sequences are often massive in scale and rich in noise. Although large language models (LLMs) possess powerful semantic understanding and reasoning capabilities, their millisecond-level inference latency makes direct application in online recommendation systems difficult. To address these issues, this paper introduces AIR (Atomic Intent Reasoning), an LLM-driven cross-domain recommendation framework designed for industrial-grade deployment. By migrating LLM inference to the offline phase and dynamically constructing user intent representations through efficient retrieval and composition during online operations, it achieves approximately 400* inference acceleration while maintaining semantic consistency. Experimental results across multiple public datasets demonstrate that our method achieves state-of-the-art performance in cross-domain recommendation tasks. Furthermore, large-scale online A/B testing conducted in Kuaishou E-commerce's real-world business scenarios shows that our approach delivers stable and significant improvements across multiple core business metrics, including a +3.446% increase in GMV, fully validating its effectiveness and practical value in industrial-scale recommendation systems.

2606.10381 2026-06-10 hep-ex cs.AI cs.CL cs.IR physics.ins-det 交叉投稿

Agentic Hybrid RAG for Evidence-Grounded Muon Collider Analysis

基于证据的缪子对撞机分析的智能混合RAG

Ruobing Jiang, Dawei Fu, Cheng Jiang, Tianyi Yang, Zijian Wang, Youpeng Wu, Yong Ban, Yajun Mao, Qiang Li

发表机构 * Peking University(北京大学)

AI总结 提出智能混合RAG框架,结合稀疏与稠密检索及智能推理,用于缪子对撞机研究的证据检索与答案生成,构建首个基准并验证其有效性。

Comments 22 pages, 5 figures, and 6 tables

详情
AI中文摘要

缪子对撞机研究涵盖加速器物理、探测器仪器和高能现象学,相关证据分散在快速扩展且异构的科学文献中。随着高能物理(HEP)越来越多地探索智能辅助分析工作流,高效定位、整合和验证科学证据成为关键能力。虽然检索增强生成(RAG)为科学问答提供了有前景的框架,但在不牺牲检索精度的情况下整合智能推理仍是一个关键挑战。在这项工作中,我们提出了智能混合RAG,一个基于证据的RAG框架,用于缪子对撞机研究。该框架结合了混合检索器(集成稀疏词汇和稠密语义检索)与智能推理模块,用于查询分解、证据扩展和基于证据的答案生成。为了进行系统评估,我们构建了缪子对撞机领域首个检索增强科学问答基准,包括一个精选文献语料库以及涵盖主要探测器和物理研究主题的专用检索和答案生成基准。广泛评估表明,混合检索提供了最强的检索基础,而智能推理在受控证据扩展和答案合成方面最为有效。基于这一原则,智能混合RAG在检索效果、答案质量、证据覆盖和事实基础方面始终优于代表性的检索和RAG基线。该基准和框架共同为基于证据的科学问答以及未来在大规模科学文献上运行的HEP分析智能体奠定了基础。

英文摘要

Muon collider research spans accelerator physics, detector instrumentation, and high-energy phenomenology, with relevant evidence scattered across a rapidly expanding and heterogeneous body of scientific literature. As high-energy physics (HEP) increasingly explores agent-assisted analysis workflows, efficiently locating, integrating, and verifying scientific evidence becomes an essential capability. While retrieval-augmented generation (RAG) offers a promising framework for scientific question answering, integrating agentic reasoning without compromising retrieval precision remains a key challenge. In this work, we present agentic hybrid RAG, an evidence-grounded RAG framework for muon collider research. The framework combines a hybrid retriever, integrating sparse lexical and dense semantic retrieval, with an agentic reasoning module for query decomposition, evidence expansion, and grounded answer generation. To enable systematic evaluation, we construct the first benchmark for retrieval-augmented scientific question answering in the muon collider domain, comprising a curated literature corpus together with dedicated retrieval and answer-generation benchmarks covering major detector and physics research topics. Extensive evaluation shows that hybrid retrieval provides the strongest retrieval backbone, while agentic reasoning is most effective for controlled evidence expansion and answer synthesis. Built on this principle, agentic hybrid RAG consistently outperforms representative retrieval and RAG baselines in retrieval effectiveness, answer quality, evidence coverage, and factual grounding. Together, the benchmark and framework provide a foundation for evidence-grounded scientific question answering and future HEP analysis agents operating over large-scale scientific literature.

2606.10402 2026-06-10 cs.CL cs.AI 交叉投稿

Harnessing the Collective Intelligence of AI Agents in the Wild for New Discoveries

利用野外AI代理的集体智慧实现新发现

Federico Bianchi, Yongchan Kwon, Aneesh Pappu, James Zou

发表机构 * Together AI Stanford University(斯坦福大学)

AI总结 提出EinsteinArena平台,通过开放分布式环境中的自主代理交互,在数学问题中实现12项新最优结果,展示了集体AI驱动研究的范式。

详情
AI中文摘要

科学发现通常是一个集体过程:研究人员分享部分结果,检查失败的尝试,并在长时间跨度内相互借鉴想法。最近的AI系统表明,基于语言模型的代理可以在开放科学问题上取得有意义的进展,但大多数现有系统孤立运行。在本文中,我们提出EinsteinArena,一个面向开放分布式研究和发现的代理原生平台。EinsteinArena为代理提供一组实时开放问题,每个问题都有可靠的验证器、公共排行榜和特定问题的讨论论坛,代理可以在其中提问和分享见解。我们专注于引起大量研究兴趣的数学任务,其进展可以明确衡量。截至2026年5月,EinsteinArena上的代理已发现12项新的最优结果,优于以往任何人类或AI解决方案。一个显著例子是11维接吻数问题,该平台将已知最佳下界从593提高到604。这一进展并非来自单个代理或孤立运行,而是通过一系列提交、公开讨论、验证器改进以及后续代理间的思想借鉴而产生的。这些结果证明,去中心化的科学发现可以从自主代理在野外的开放交互中涌现,展示了集体AI驱动研究的新范式。

英文摘要

Scientific discovery is often a collective process: researchers share partial results, inspect failed attempts, and build on each other's ideas over long time horizons. Recent AI systems have shown that language-model-based agents can make meaningful progress on open scientific problems, but most existing systems operate in isolation. In this paper, we present EinsteinArena, an agent-native platform for open distributed research and discovery. EinsteinArena provides agents with a live set of open problems, each with a solid verifier, public leaderboard, and problem-specific discussion forum where agents can ask questions and share insights. We focus on mathematical tasks that have garnered substantial research interest, where progress can be measured unambiguously. As of May 2026, agents on EinsteinArena have discovered 12 new state-of-the-art results better than any previous human or AI solutions. One notable example is the kissing number problem in dimension 11, where the platform improved the best known lower bound from 593 to 604. This advance did not come from a single agent or isolated run. Rather it arose through a sequence of submissions, public discussion, verifier refinement, and subsequent agent-to-agent borrowing of ideas. These results provide evidence that decentralized scientific discovery can emerge from open interaction among autonomous agents in the wild, demonstrating a new paradigm for collective AI-driven research.

2606.10493 2026-06-10 cs.DC cs.AI cs.LG cs.NE 交叉投稿

Achieving Cloud-Grade SLOs for Local Mixture-of-Experts Inference through CPU-GPU Hybrid Design

实现本地混合专家模型推理的云级SLO:CPU-GPU混合设计

Wenxin Wang, Yule Hou, Yu Ji, Peng Qu, Youhui Zhang

发表机构 * Tsinghua University(清华大学) Xingyun Integrated Circuits Co., Ltd.(星云集成电路有限公司) Beijing National Research Center for Information Science and Technology(北京信息科学与技术国家研究中心)

AI总结 针对本地MoE推理在低并发下仍无法达到云级服务质量的问题,提出CPU-GPU混合系统,通过流加载预填充、分布式SLP、节点内预填充-解码分离、AVX-512优化FP8 GEMV内核和细粒度CPU并行,在消费级硬件上实现云级SLO。

Comments Accepted to the 20th USENIX Symposium on Operating Systems Design and Implementation (OSDI '26). The official version will appear in the OSDI '26 proceedings published by USENIX

详情
AI中文摘要

本地部署大型混合专家(MoE)模型即使在低并发工作负载下也无法达到云级环境中的服务质量。我们识别出本地MoE推理中的四个关键差距:依赖容量缩减模型(量化、蒸馏、重路由)、无法满足长预填充(超过12K)的30秒TTFT、低于基线的解码吞吐量(低于20 tokens/s)、以及在混合预填充-解码和批量解码工作负载下的并发性差。我们提出一个CPU-GPU混合系统,通过以下方式在双插槽商用CPU和消费级GPU上实现云级SLO:(1)流加载预填充(SLP),将预填充吞吐量提升至1,200 tokens/s,并在30秒内支持32K提示;(2)采用SmallEP专家并行的分布式SLP(DSLP),在两张RTX 5090上达到1,800 tokens/s和45K提示;(3)节点内预填充-解码分离,具有零拷贝共享权重和双批次注意力-MoE重叠方案,在延迟增加低于15%且吞吐量提升50%的情况下维持并发性;(4)AVX-512优化的FP8 GEMV内核,实现原生CPU FP8推理,同时降低4-5倍CPU延迟;(5)细粒度CPU并行,在INT4 DeepSeek-V3上达到28 tokens/s,在完整FP8 V3上达到21.5 tokens/s。评估表明,我们的系统在消费级CPU-GPU平台上为旗舰MoE模型提供云级QoS,通过完整原始精度推理重塑本地部署,无需数据中心基础设施即可实现高质量、经济高效的访问。

英文摘要

Local deployment of large Mixture-of-Experts (MoE) models falls short of the service quality achieved in cloud-scale environments, even under low-concurrency workloads. We identify four key gaps in local MoE inference: reliance on capacity-reduced models (quantized, distilled, rerouted), inability to meet 30-second TTFT for long prefills (more than 12K), sub-baseline decode throughput (under 20 tokens/s), and poor concurrency under mixed prefill-decode and batched decode workloads. We present a CPU-GPU hybrid system that achieves cloud-level SLOs on dual-socket commodity CPUs and consumer GPUs by (1) stream-loading prefill (SLP), boosting prefill throughput to 1,200 tokens/s and enabling 32K prompts within 30 seconds; (2) distributed SLP (DSLP) with SmallEP expert parallelism, reaching 1,800 tokens/s and 45K prompts in 30 seconds on two RTX 5090s; (3) intra-node prefill-decode disaggregation with zero-copy shared weights and a dual-batch attention-MoE overlap scheme, sustaining concurrency with under 15 percent latency increase and 50 percent throughput gains; (4) an AVX-512-optimized FP8 GEMV kernel, enabling native CPU FP8 inference while delivering 4-5x lower CPU latency; and (5) fine-grained CPU parallelism that attains 28 tokens/s on INT4 DeepSeek-V3 and 21.5 tokens/s on intact FP8 V3. Evaluations show our system delivers cloud-level QoS for flagship MoE models on consumer CPU-GPU platforms, reshaping local deployment with intact, original-precision inference and enabling high-quality, cost-effective access without datacenter infrastructure.

2606.10583 2026-06-10 cs.LG cs.AI math.OC 交叉投稿

NOVA: Symbolic Regression Discovery of Interpretable Car-Following and Lane-Change Models with Driver Heterogeneity

NOVA: 可解释的跟驰与换道模型及驾驶员异质性的符号回归发现

Ishak Abassi, Nassim Ali Bouazzouni, Farah Ibelaiden, Nadir Farhi

AI总结 提出NOVA符号回归框架,从原始轨迹数据自动发现可解释的跟驰与换道结构,在NGSIM数据集上优于基线,并揭示主导非线性项与心理物理理论关联。

详情
AI中文摘要

我们提出了NOVA,一个自主符号回归框架,能够从原始轨迹数据中识别出可解释的跟驰和换道结构,且仅需极少的先验行为假设。应用于来自NGSIM I-80和US-101数据集的4,765,788个活跃驾驶观测,NOVA的确定性Rust驱动搜索引擎评估了超过10,000个候选代数结构,并在前向平移滚动均值预测目标下识别出一个紧凑的两项加速度模型。在两种互补的预处理流程下评估,NOVA在意图预测基准上实现了RMSE = 1.376 m/s²(R² = 15.57%),在相同评估协议下,RMSE比最佳重新校准的符号回归基线(SR-LLM, PNAS 2025)低0.135 m/s²。在八个独立实验中,单个主导非线性项作为人类跟驰的稳健骨干出现;残差引导的扩展进一步将所选结构与已建立的碰撞避免心理物理理论联系起来。发现的特征算子在不同高速公路地点之间零样本迁移,R²损失低于3个百分点。扩展到多项logit框架内的换道建模,NOVA在502个未见驾驶员的严格车辆ID留出测试中实现了67.4%的平衡准确率,在三类问题上超过现有换道基线+29.8个百分点。

英文摘要

We present NOVA, an autonomous symbolic regression framework that identifies interpretable car-following and lane-change structures from raw trajectory data with minimal behavioral priors. Applied to 4,765,788 active driving observations from the NGSIM I-80 and US-101 datasets, NOVA's deterministic Rust-powered search engine evaluates over 10,000 candidate algebraic structures and identifies a compact two-term acceleration model under a forward-shifted rolling-mean prediction target. Evaluated under two complementary preprocessing pipelines, NOVA achieves $RMSE = 1.376 m/s^2$ ($R^2 = 15.57\%$) on the intent-forecasting benchmark, outperforming the best recalibrated symbolic-regression baseline (SR-LLM, PNAS~2025) by 0.135 m/s$^2$ in RMSE under an identical evaluation protocol. Across eight independent experiments, a single dominant nonlinear term emerges as a robust backbone of human car-following; a residual-guided extension further links the selected structure to an established psychophysical theory of collision avoidance. The discovered feature operators transfer zero-shot between freeway sites with under 3 pp $R^2$ loss. Extended to lane-change modelling within a multinomial logit framework, NOVA achieves 67.4\% balanced accuracy under strict vehicle-ID holdout on 502 unseen drivers, surpassing existing lane-changing baselines by +29.8 percentage points on a three-class problem.

2606.10601 2026-06-10 math.NA cs.AI cs.LG cs.NA 交叉投稿

Dmsh: A Multi-Agent Reinforcement Learning Framework for All-Quad Mesh Generation

Dmsh:一种用于全四边形网格生成的多智能体强化学习框架

Anirudh Kalyan, Cosmin Anitescu, Xiaoying Zhuang, Timon Rabczuk, Somdatta Goswami, Sundararajan Natarajan

发表机构 * Department of Mechanical Engineering, Indian Institute of Technology Madras(印度理工学院马德拉斯分校机械工程系) Institute of Continuum Mechanics, Leibniz Universität Hannover(莱比锡大学汉诺威连续力学研究所) Institute of Structural Mechanics, Bauhaus-Universität Weimar(魏玛包豪斯大学结构力学研究所) Department of Civil and Systems Engineering, Johns Hopkins University(约翰霍普金斯大学土木与系统工程系)

AI总结 提出Dmsh,首个全自动强化学习流水线,通过三个协调智能体处理拓扑简化、几何正则化和网格生成,采用参数化Soft Actor-Critic架构和课程学习策略,实现高质量全四边形网格生成。

详情
AI中文摘要

为任意几何体生成高质量网格仍然是计算工程中的一个基本瓶颈,通常需要启发式调整和半手动工作流程。在本文中,我们介绍了Dmsh,这是第一个完全自动化的强化学习流水线,它将几何分解和四边形网格生成统一在一个基于学习的框架中。Dmsh通过三个协调的智能体分解问题,分别处理拓扑简化、几何正则化和网格生成。网格生成过程被建模为马尔可夫决策过程,并使用具有解耦评论家的参数化Soft Actor-Critic架构求解,从而能够高效探索混合离散-连续动作空间。课程学习策略确保了从简单域到高度复杂几何体的可扩展性,并抑制了种子方差。通过设计,递归分解使得子区域能够并行网格化,生成全局一致的全四边形网格,无需事后校正。在广泛的基准测试中,Dmsh在自动化程度、鲁棒性和网格质量方面始终优于现有方法,为基于学习的网格生成建立了新范式。

英文摘要

Generating high-quality meshes for arbitrary geometries remains a fundamental bottleneck in computational engineering, often demanding heuristic tuning and semi-manual workflows. In this paper, we introduce Dmsh, a first fully automated reinforcement learning pipeline that unifies geometric decomposition and quadrilateral mesh generation within a single learning-based framework. Dmsh decomposes the problem through three coordinated agents handling topology simplification, geometric regularization, and mesh generation. The meshing process is formulated as a Markov Decision Process and solved using a parametric Soft Actor-Critic architecture with decoupled critics, enabling efficient exploration of a hybrid discrete-continuous action space. A curriculum learning strategy ensures scalability from simple domains to highly complex geometries, suppressing seed variance. By design, the recursive decomposition enables parallel meshing of subregions, yielding globally conforming all-quadrilateral meshes without post hoc correction. Across a wide range of benchmarks, Dmsh consistently outperforms existing methods in automation, robustness, and mesh quality, establishing a new paradigm for learning-based mesh generation.

2606.10621 2026-06-10 cs.IR cs.AI 交叉投稿

STORM: Stepwise Token Optimization with Reward-Guided Beam Search

STORM: 基于奖励引导束搜索的逐步令牌优化

Arthur Satouf, Giulio D'Erasmo, Yuxuan Zong, Habiboulaye Amadou Boubacar, Pablo Piantanida, Benjamin Piwowarski

发表机构 * MILA – Quebec AI Institute & ILLS(魁北克人工智能研究所与ILLs) Université Paris-Saclay & CentraleSupélec & CNRS(巴黎-萨克雷大学及CentraleSupélec与CNRS) Air Liquide Sorbonne Université & ISIR & CNRS(索邦大学及ISIR与CNRS) Sapienza, University of Rome(罗马大学Sapienza)

AI总结 提出STORM框架,通过检索奖励引导的束搜索在每一步优化令牌选择,实现词汇检索的查询扩展,在多个基准上匹配或超越大模型重写器,并零样本迁移至18种语言。

详情
AI中文摘要

现代检索越来越依赖密集和学习的稀疏神经模型,这些模型有效但需要将整个语料库编码为专门的索引,并在模型变化时重建。像BM25这样的词汇检索器在标准倒排索引上保持高效和透明,无需随模型演变而改变,但存在词汇不匹配问题。LLM查询重写可以提供帮助,但提示式重写器会生成格式良好但检索无效或有害的术语,而针对检索奖励进行训练仅提供延迟的、序列级别的监督,掩盖了哪些术语有帮助。我们引入了STORM(基于奖励引导束搜索的逐步令牌优化),一个用于词汇查询扩展的自监督框架。STORM通过检索指标引导生成来训练重写器:在每一步,候选扩展根据BM25索引进行评分,并剪枝低奖励的延续,将检索奖励转化为令牌级别的信号,集中探索检索有效的词汇。在TREC DL和BEIR上,STORM使0.6B-8B的骨干模型匹配或超越有竞争力的LLM重写器,同时检索速度与普通BM25一样快;在8B规模上,它可与更大的专有重写器相媲美。它进一步零样本迁移到18种语言(MIRACL),平均击败了专门的多语言密集检索器,使STORM成为密集神经检索的一种有竞争力、基础设施轻量级的替代方案。

英文摘要

Modern retrieval increasingly relies on dense and learned-sparse neural models that are effective but require encoding the entire corpus into a specialized index, rebuilt whenever the model changes. Lexical retrievers like BM25 stay efficient and transparent on a standard inverted index that need not change as models evolve, but suffer from vocabulary mismatch. LLM query rewriting can help, yet prompted rewriters emit well-formed but retrieval-ineffective or harmful-terms, and training against a retrieval reward gives only delayed, sequence-level supervision that obscures which terms helped. We introduce STORM (Stepwise Token Optimization with Reward-guided beaM search), a self-supervised framework for lexical query expansion. STORM trains the rewriter through generation guided by retrieval metrics: at each step, candidate expansions are scored against the BM25 index and low-reward continuations pruned, turning the retrieval reward into a token-level signal that concentrates exploration on retrieval-effective vocabulary. Across TREC DL and BEIR, STORM lets 0.6B-8B backbones match or surpass competitive LLM rewriters while retrieving as fast as plain BM25; at 8B it rivals far larger proprietary rewriters. It further transfers zero-shot to 18 languages (MIRACL), beating dedicated multilingual dense retrievers on average, making STORM a competitive, infrastructure-light alternative to dense neural retrieval.

2606.10658 2026-06-10 cs.CR cs.AI cs.CE q-fin.CP 交叉投稿

Post-Quantum Secure Federated DeFi for Inclusive Banking

面向普惠银行的后量子安全联邦DeFi

Swati Sachan, Dale Fickett, Richard Buchinger, Theo Miller

发表机构 * AI FinTech Group, University of Liverpool(人工智能金融科技组,利物浦大学) RVA Works and University of Richmond(RVA Works和里士满大学) Chain Crunch Labs(Chain Crunch实验室)

AI总结 提出后量子安全联邦DeFi框架,利用格基全同态加密和NASA-IBM地理空间基础模型,实现银行间加密协作以提升信用不足个体的金融普惠性。

详情
AI中文摘要

近期纠错量子比特的进展加速了实用量子计算的时间表,这对用于保护金融系统、政府基础设施、通信网络和DeFi(去中心化金融)生态系统的密码原语构成威胁。本文提出一个后量子安全的联邦DeFi框架,支持银行间协作,以改善因有限金融历史而受到当地贷款机构服务不足的个体的普惠性。多家银行将加密信息批次贡献给一个虚拟服务器,其中基于格的完全同态加密(FHE)实现了端到端的同态计算。服务器以加密格式融合本地数据驱动的概率评估、专家信念以及由NASA-IBM Prithvi地理空间基础模型(GFM)生成的可验证证据。采用去中心化技术确保机构与服务器之间所有加密数据交换的防篡改证据和可审计问责性。该框架在弗吉尼亚州农村借款人的农业贷款决策上进行了测试。

英文摘要

Recent advances in error-corrected qubits have accelerated the timeline for practical quantum computing. It poses a threat to cryptographic primitives used to secure financial systems, government infrastructure, communication networks, and DeFi (Decentralized Finance) ecosystems. This paper introduces a post-quantum secure federated DeFi framework that enables inter-bank collaboration to improve the inclusivity of individuals underserved by local lenders due to limited financial histories. Multiple banks contribute encrypted information batches to a virtual server, where lattice-based Fully Homomorphic Encryption (FHE) enables end-to-end homomorphic computation. The server fuses local data-driven probabilistic assessments, expert beliefs, and verifiable evidence generated by the NASA-IBM Prithvi Geospatial Foundation Model (GFM), in encrypted format. Decentralized technologies are employed to ensure tamper-proof evidence and auditable accountability for all encrypted data exchanges between institutions and the server. The framework is tested on agricultural lending decisions for rural borrowers in Virginia.

2606.10660 2026-06-10 cs.CY cs.AI 交叉投稿

Accounting for AI Inference in Corporate GHG Inventories: A Four-Tier Methodology for Scope 3 Category 1 Reporting

企业温室气体清单中AI推理的核算:范围3类别1报告的四层方法

Guillermo Llopis

发表机构 * SOMA AI

AI总结 针对CSRD要求下AI推理服务在范围3类别1中缺乏标准核算方法的问题,提出基于token物理估算的四层框架,通过GPU能耗基准和区域电网碳强度精确估算排放,并揭示水碳权衡。

Comments Preprint. Data repository: https://doi.org/10.5281/zenodo.20443586. 18 pages, 3 figures, 6 tables

详情
AI中文摘要

AI推理服务——API订阅、企业聊天工具和嵌入AI功能的SaaS产品——明确属于《企业可持续发展报告指令》(CSRD)下的范围3类别1,该指令要求自2024年1月开始的财年进行披露。然而,目前尚无标准方法将其纳入企业温室气体清单。现行实践要么完全忽略该类别,要么应用针对整个ICT行业校准的通用经济投入产出(EEIO)因子,导致AI推理排放被高估10-40倍(相对于物理衍生方法)。我们提出了一个四层框架,将估算精度与组织实际可获取的数据相匹配,从基于token的直接物理估算(使用GPU能耗基准和区域电网碳强度)逐步降级到基于支出的EEIO后备方法(用于无使用数据的服务)。排放因子来源于同行评审的GPU能耗基准(此http URL排行榜v3)、确认的电网碳强度(EPA eGRID 2023;Ember 2023)以及已发布的水利用效率数据(Li等人,2025)。应用于一家200人的欧洲企业,该框架得出的总排放量低于1 tCO2e,表明合规挑战在于方法论而非规模。我们进一步记录了当前ESG工具未揭示的水碳权衡:瑞典以水电为主的电网在数据集中碳强度最低,但水足迹最高,这对数据中心选址策略有直接影响。

英文摘要

AI inference services -- API subscriptions, enterprise chat tools, and SaaS products with embedded AI features -- fall unambiguously within Scope 3 Category 1 under the Corporate Sustainability Reporting Directive (CSRD), which requires disclosure for fiscal years starting January 2024. Yet no standardised methodology exists for including them in corporate GHG inventories. Current practice either omits the category entirely or applies a generic economic input-output (EEIO) factor calibrated to the ICT sector as a whole, overestimating AI inference emissions by 10-40x relative to physically derived alternatives. We propose a four-tier framework that matches estimation precision to the data organisations can realistically obtain, progressing from direct token-based physical estimation -- using GPU energy benchmarks and regional grid carbon intensities -- down to a spend-based EEIO fallback for services where no usage data exists. Emission factors are derived from peer-reviewed GPU energy benchmarks (ML.ENERGY Leaderboard v3), confirmed grid carbon intensities (EPA eGRID 2023; Ember 2023), and published water use effectiveness data (Li et al., 2025). Applied to a 200-person European firm, the framework yields a total below 1 tCO2e, illustrating that the compliance challenge is methodological rather than magnitude-driven. We further document a water-carbon trade-off that current ESG tools do not surface: Sweden's hydro-dominated grid delivers the lowest carbon intensity in our dataset but the highest water footprint, with direct implications for data centre location strategy.

2606.10699 2026-06-10 cs.CV cs.AI 交叉投稿

Using the YOLOv12 Model for Verifying the Correct Color Sequence of Wires in Network Cables (Patch Cords) on the Production Line

使用YOLOv12模型验证生产线上网线(跳线)中导线的正确颜色顺序

Amin Doroodchi, Danial Soleimany

发表机构 * Computer Department, Islamic Azad University, Beyza Branch(伊斯兰 Azad 大学计算机系,贝兹分校) R&D at Nedaye Sabz Company, Isfahan Branch(Nedaye Sabz 公司研发部,伊斯法罕分校)

AI总结 针对网线生产中导线颜色顺序检测问题,提出基于YOLOv12的目标检测模型,实现高精度实时验证,减少人工错误。

详情
AI中文摘要

在网络电缆的生产过程中,确保标准连接器内部线对的正确颜色顺序对电缆的最终性能至关重要,因为任何错位或颜色顺序错误都可能导致缺陷产品并造成巨大成本。基于数字显微镜目视检查的传统检测方法通常耗时、繁琐且容易出错。在本研究中,开发了一种基于第十二版YOLO目标检测模型的智能系统,用于识别跳线中导线的位置并验证其正确的颜色顺序。使用的数据集包括从网络连接器显微视图中捕获的2500张图像,其中70%用于训练,15%用于验证,15%用于测试。所提出的模型利用单阶段架构和学习过程中的注意力机制,实现了约98%精度的导线检测。此外,总体平均准确率、分类精度和召回率分别约为95%、99%和98%。结果表明,该系统能够在生产线上可靠地实时验证导线颜色顺序的正确性,无需人工干预,从而减少人为错误并提高制造效率。

英文摘要

In the production process of network cables, ensuring the correct color sequence of wire pairs inside the standard connector plays a critical role in the final performance of the cable, as any misplacement or color-ordering error can lead to defective products and impose significant costs. Traditional inspection methods based on visual examination through digital microscopes are typically time-consuming, tedious, and prone to human error. In this study, an intelligent system based on the twelfth version of the YOLO1 object detection model was developed to identify the position and verify the correct color sequence of wires in patch cords. The dataset used consisted of 2,500 images captured from microscopic views of network connectors, which were divided into 70% for training, 15% for validation, and 15% for testing. The proposed model, leveraging a single-stage architecture and attention mechanisms during learning, achieved highly accurate wire detection with approximately 98% precision. Additionally, the overall mean accuracy, classification precision, and recall were around 95%, 99%, and 98%, respectively. The results demonstrate that this system can reliably and in real time verify the correctness of wire color sequencing on the production line without the need for human intervention, thereby reducing human error and enhancing efficiency in the manufacturing process.

2606.10713 2026-06-10 eess.IV cs.AI cs.CV cs.LG 交叉投稿

++nnU-Net: Scaling nnU-Net with Prefix-Based Data Augmentation

++nnU-Net: 基于前缀数据增强的nnU-Net扩展

Ana Sofia Santos, André Ferreira, Gijs Luijten, Naida Solak, Lisle Faray de Paiva, Behrus Hinrichs-Puladi, Jens Kleesiek, Jan Egger, Victor Alves

发表机构 * Center Algoritmi / LASI, University of Minho, Braga, Portugal(阿尔戈里米中心/拉斯伊大学,明霍大学,布拉加,葡萄牙) Institute for Artificial Intelligence in Medicine, University Medicine Essen, Essen, Germany(医学人工智能研究所,埃森医学院,埃森,德国) Institute of Medical Informatics / Dept. of Oral and Maxillofacial Surgery, University Hospital RWTH Aachen, Germany(医学信息学研究所/口腔和颅面外科部,亚琛大学医院,德国) Faculty of Computer Science, University of Duisburg-Essen, Essen, Germany(计算机科学学院,杜伊斯堡-埃森大学,埃森,德国)

AI总结 提出++nnU-Net,通过图像配准进行数据增强,在预处理和训练前生成变形图像,在5个2D数据集上提升Dice系数最高约22%。

Comments 7 pages, 1 figure, 2 tables

详情
AI中文摘要

nnU-Net在医学分割任务中持续展现出成功,这严重依赖于标注生物医学数据的可用性和多样性。然而,由于隐私法规和标注成本等因素,收集医学影像队列仍然具有挑战性。因此,数据增强在增加数据可用性的同时保持解剖学可行性方面起着关键作用。为此,我们提出了++nnU-Net,一种基于图像配准的新型数据增强模块,在预处理和训练之前运行。我们的框架在五个不同的2D数据集上进行了评估。在该工作流中,图像数据经过两阶段配准过程,生成新的变形图像。然后将变换应用于相应的分割。此外,该管道计算可用磁盘空间,生成补充的二进制合成掩码并生成检查点。我们证明++nnU-Net优于nnU-Net基线,在Dice相似系数得分上有所提升。在最显著的情况下,我们观察到性能提升约22%。这些发现强调了基于配准的数据增强的有效性,特别是对于2D医学影像数据集,并表明++nnU-Net为在数据有限的情况下提高分割性能提供了一种实用且可扩展的方法。++nnU-Net的源代码可在以下网址获取:this https URL

英文摘要

The nnU-Net has demonstrated continuous success in medical segmentation tasks, which heavily rely on the availability and diversity of annotated biomedical data. However, assembling medical imaging cohorts remains challenging due to numerous factors such as privacy regulations and annotation costs. As a result, data augmentation plays a crucial role in increasing data availability while maintaining anatomical feasibility. Hence, we propose the ++nnU-Net, a novel data augmentation module based on image registration that operates prior to preprocessing and training take place. Our framework was evaluated across five different 2D datasets. In this workflow, image data go through a two-stage registration process, generating new warped images. The transformations are then applied to the respective segmentation. In addition, the pipeline computes available disk space, generates supplementary binary synthetic masks and generates checkpoints. We demonstrate that the ++nnU-Net outperforms the nnU-Net baseline, yielding improvements in Dice Similarity Coefficient scores. In the most prominent cases, we observe performance gains of approximately 22\%. These findings highlight the effectiveness of registration-based data augmentation, particularly for 2D medical imaging datasets and suggest that the ++nnU-Net provides a practical and scalable approach for enhancing segmentation performance in data-limited settings. The source code for the ++nnU-Net is available at: https://github.com/sofia-adelie/plusplusnnunet.git

2606.10718 2026-06-10 cs.LG cs.AI 交叉投稿

Transformer Based Model for Spatiotemporal Feature Learning in EEG Emotion Recognition

基于Transformer的脑电情绪识别时空特征学习模型

Xinglong Cui, Dian Gu

发表机构 * Beijing Neurodeep Technology Co., Ltd(北京纽罗德普科技有限公司) University of Pennsylvania(宾夕法尼亚大学)

AI总结 提出EEG-TransNet架构,通过局部自注意力块和模糊注意力同步Transformer捕捉脑电信号的时空特征,在三个数据集上优于现有方法。

详情
AI中文摘要

脑电图(EEG)是一种广泛采用的监测大脑活动的技术,因其高时间分辨率和成本效益,为神经状态提供了有价值的见解。为了增强对复杂EEG数据的分析,我们提出了EEG-TransNet,一种旨在捕捉EEG信号的时间、区域和同步特征的架构。EEG-TransNet引入了三个关键模块:1)利用ResNet和基于小波去噪的预处理与特征提取模块,2)用于区域特征学习的局部自注意力块,以及3)用于建模时空依赖性的模糊注意力同步Transformer(FAST)。通过在三个EEG数据集(BETA、SEED和DepEEG)上的大量实验,所提出的模型在不同信号长度下的分类准确性和鲁棒性方面始终优于其他方法。消融研究证实了局部自注意力块在提高性能方面的贡献,并且解码器中引入深度可分离卷积降低了计算复杂度,同时保持了高准确性。EEG-TransNet在受试者间具有最小的性能变化,突显了其作为基于EEG的大脑活动分类和情绪识别任务的鲁棒工具的潜力。

英文摘要

Electroencephalography (EEG) is a widely adopted technique for monitoring brain activity, offering valuable insights into neurological states due to its high temporal resolution and cost-effectiveness. To enhance the analysis of complex EEG data, we propose EEG-TransNet, an architecture designed to capture temporal, regional, and synchronous features of EEG signals. EEG-TransNet introduces three key modules: 1) a preprocessing and feature extraction module leveraging ResNet and wavelet-based denoising, 2) a Local Self-Attention Block for regional feature learning, and 3) a Fuzzy-Attention Synchronous Transformer (FAST) to model spatiotemporal dependencies. Through extensive experiments on three EEG datasets (BETA, SEED, and DepEEG), the proposed model consistently outperforms other methods in terms of classification accuracy and robustness across varying signal lengths. Ablation studies confirm the contribution of the Local Self-Attention Block in improving performance, and the inclusion of depthwise separable convolutions in the decoder reduces computational complexity while maintaining high accuracy. EEG-TransNet's ability to generalize across subjects with minimal performance variation highlights its potential as a robust tool for EEG-based brain activity classification and emotion recognition tasks.

2606.10736 2026-06-10 cs.CL cs.AI cs.CY 交叉投稿

Detecting Knowledge Gaps from Conversational AI Interactions Using Curriculum Prerequisite Graphs

利用课程先决条件图检测对话式AI交互中的知识缺口

Youssef Medhat, Junsoo Park, Ploy Thajchayapong, Ashok K. Goel

发表机构 * Georgia Institute of Technology(佐治亚理工学院)

AI总结 提出一个流水线,通过少样本文本分类器将学生向对话式AI助教提出的问题映射到课程主题,并利用GPT-4提取的先决条件知识图谱,以检测主题级知识缺口。

Comments Accepted as a short paper at the 10th CSEDM Workshop, co-located with the 18th International Conference on Educational Data Mining (EDM 2026). 7 pages, 2 figures, 2 tables

详情
AI中文摘要

大型在线课程会产生数千条学生向对话式AI助教提出的问题,但这些交互日志作为诊断信号在很大程度上未被利用。我们提出一个流水线,使用少样本文本分类器,将学生向对话式AI助教提出的问题映射到课程主题,该分类器基于GPT-4提取的课程概念先决条件知识图谱。在研究生级别AI课程的164名学生的1,340个问题事件上评估,我们的分类器在43个标签(42个课程主题加上一个“未知”弃权类别)上达到80.0%的准确率。主题级问题数量与独立期中调查中学生自我报告的难度显著相关(rho = 0.491, p = 0.008, n = 28个主题),提供了趋同证据,表明分类后的问题流反映了真实的主题难度。这些结果表明,映射到课程结构上的对话式AI交互日志携带关于主题级知识缺口的可操作信号,并为教师提供基于课程视角的哪些主题需要关注的视图。

英文摘要

Large online courses generate thousands of student questions directed at conversational AI teaching assistants, yet these interaction logs remain largely untapped as diagnostic signals. We present a pipeline that maps student questions from a conversational AI teaching assistant to curriculum topics using a few-shot text classifier, grounded in a GPT-4-extracted prerequisite knowledge graph of course concepts. Evaluated on 1,340 question events from 164 students in a graduate-level AI course, our classifier achieves 80.0% accuracy across 43 labels (42 curriculum topics plus an "unknown" abstention class). Topic-level question volume correlates significantly with student self-reported difficulty from an independent mid-semester survey (rho = 0.491, p = 0.008, n = 28 topics), providing convergent evidence that the classified question stream reflects genuine topic difficulty. These results demonstrate that conversational AI interaction logs, mapped onto curriculum structure, carry actionable signals about topic-level knowledge gaps and provide instructors with a curriculum-grounded view of which topics warrant attention.

2606.10782 2026-06-10 cs.CR cs.AI cs.LG 交叉投稿

A Bayesian Network Approach for Enhancing Security-Focused Decision Support Systems

一种增强安全导向决策支持系统的贝叶斯网络方法

Carolina Fernández-Martínez, Shuaib Siddiqui, Vanesa Daza

发表机构 * University of Granada(格拉纳达大学) University of Birmingham(伯明翰大学)

AI总结 提出基于贝叶斯网络的决策支持系统,帮助基础设施运营商选择安全工具,通过捕获用户需求并推理,提供最优安全机制,评估了时间和预测精度。

详情
Journal ref
Proc. 2025 IEEE 50th Conference on Local Computer Networks (LCN), 2025
AI中文摘要

当今大多数基于开源网络的异构栈的采用和集成带来了明显的优势,如互操作性和高级功能的可用性。然而,另一方面,互联组件和移动部件数量的增加需要维护跨不同领域的不同工具的跨学科知识基础,以确保正常运行。为了减轻这些工作,本文提出了一种决策支持系统(DSS),指导基础设施运营商选择在其环境中采用的安全方法(例如工具)。该框架能够轻松捕获最终用户对不同领域安全三元组的高层需求,并在指定模型上运行推理,以提供更好地满足这些需求的已识别工具(安全机制)。所提出的DSS旨在提供一个可理解和可扩展的框架,以适应不同的需求和贝叶斯网络(BN)模型。提出了系统的架构和建模,并与其理论框架保持一致。其性能在时间和预测精度方面进行了评估。

英文摘要

The adoption and integration of heterogeneous stacks in most of today's open-source based networks brings clear benefits like interoperability and availability of advanced features. Yet, on the other hand the increasing number of interconnecting components and moving parts requires maintaining an ever increasing base of interdisciplinary knowledge of different tools in different domains to ensure proper operation. To alleviate such efforts, this work proposes a Decision Support System (DSS) to guide infrastructure operators through the selection of security approaches (e.g. tools) to adopt in their environments. This framework easily captures the end-user high-level requirements on the security triad for different domains and runs inference on the designated models to provide the identified tools (security mechanisms) that better serve such needs. The presented DSS aims at delivering an understandable and extensible framework to accommodate varying requirements and Bayesian Network (BN) models. The architecture and modelling of the system are proposed, aligned with its theoretical framework. Its performance is evaluated in terms of time and prediction accuracy.

2606.10827 2026-06-10 cs.NI cs.AI 交叉投稿

A Unified Siamese Learning Framework for Zero-Day Anomaly Detection and Classification in Optical Networks

面向光网络中零日异常检测与分类的统一孪生学习框架

Carlos Natalino, Flávia P. Monteiro, Paolo Monti

发表机构 * Department of Electrical Engineering, Chalmers University of Technology(查尔姆斯理工大学电子工程系) Federal University of Western Pará (UFOPA)(巴西北部联邦大学(UFOPA))

AI总结 提出多相似度孪生神经网络,统一实现光网络中零日异常检测与单样本分类,无需重训练即可跨光路和未知异常类型达到99%以上准确率。

Comments Authors' version of the manuscript accepted and published at the Optical Fiber Communication Conference (OFC) 2026. 4 pages, 3 figures

详情
Journal ref
Optical Fiber Communication Conference (OFC) 2026
AI中文摘要

一种多相似度孪生神经网络统一了光网络中的零日异常检测与单样本分类,无需任何重训练即可跨光路和未见过的异常类型实现超过99%的准确率与即时适应性。

英文摘要

A multi-similarity Siamese neural network unifies zero-day anomaly detection and one-shot classification in optical networks, achieving over 99% accuracy and instant adaptability across lightpaths and unseen anomaly types without any retraining.

2606.10861 2026-06-10 cs.SE cs.AI cs.HC 交叉投稿

From Perception to Action: Can UI Interventions Foster Sustainable LLM Chatbot

从感知到行动:UI干预能否促进可持续的LLM聊天机器人

Nitish Patkar, Pooja Rani, Jack Glässer, Simon Lüscher, Martin Kropp

发表机构 * University of Applied Sciences and Arts Northwestern Switzerland (FHNW)(瑞士西北应用科学与艺术大学(FHNW)) University of Mannheim(曼海姆大学)

AI总结 研究通过UI干预(如模式切换、能耗反馈)提升用户对LLM聊天机器人能耗的感知,并鼓励节能行为,发现模式切换是主要行为机制。

详情
AI中文摘要

基于LLM的聊天机器人日益融入日常工作流程,其能源使用引发了可持续性担忧。大多数缓解策略强调模型或基础设施效率,而用户界面层尽管具有塑造交互行为的潜力,却仍未得到充分探索。我们调查了面向可持续性的UI干预能否提高用户的能源意识,并鼓励更节能的聊天机器人使用,同时不降低可用性。我们首先进行了一项基线调查,有77名参与者评估了对干预概念的意识和接受度。在说服技术和选择架构的先前工作指导下,我们实现了一个基于Web的聊天机器人原型,具有三模式开关(节能、平衡、性能)、每次响应的能耗反馈、发送前能耗估计、使用指标仪表板和能耗类比。然后,我们在为期五天的实地研究中评估了该原型,有11名参与者。在基线调查中,94.8%的受访者报告至少对AI能耗有一定了解,但88.3%的人错误估计了实际消耗。尽管对环境影响的担忧很高,但只有39.0%的人表示愿意接受性能权衡以降低能耗。在实地研究中,节能模式占记录提示的55.8%,而90.9%的人自我报告在不需要高精度时主动选择Eco模式。参与者没有减少提示长度,表明模式切换是主要行为机制。面向可持续性的UI干预可以提高意识,并支持LLM聊天机器人中更节能的交互模式。这些效应最好被解释为行为和基于模型的估计,补充了后端效率工作,所提供的原型和复制包支持对能源感知对话式AI设计的进一步研究。

英文摘要

LLM-powered chatbots are increasingly embedded in everyday workflows, raising sustainability concerns due to their energy use. Most mitigation strategies emphasize model or infrastructure efficiency, while the user-interface (UI) layer remains underexplored despite its potential to shape interaction behavior. We investigate whether sustainability-oriented UI interventions can increase users' energy awareness and encourage more energy-responsible chatbot use without reducing usability. We first conducted a baseline survey with 77 participants to assess awareness and receptiveness to intervention concepts. Guided by prior work on persuasive technology and choice architecture, we implemented a web-based chatbot prototype with a three-mode switch (Energy-efficient, Balanced, Performance), per-response energy feedback, pre-send energy estimates, a usage metrics dashboard, and energy analogies. We then evaluated the prototype in a five-day field study with 11 participants. In the baseline survey, 94.8% of respondents reported at least some awareness of AI energy use, yet 88.3% misestimated actual consumption. Although concern about environmental impact was high, only 39.0% indicated willingness to accept a performance trade-off for lower energy use. In the field study, Energy-efficient mode accounted for 55.8% of logged prompts, while 90.9% self-reported actively choosing Eco-mode when high accuracy was not required. Participants did not reduce prompt length, suggesting mode switching as the primary behavioral mechanism. Sustainability-oriented UI interventions can improve awareness and support more energy-responsible interaction patterns in LLM chatbots. These effects are best interpreted as behavioral and model-based estimates that complement backend efficiency work, and the provided prototype and replication package support further research on energy-aware conversational AI design.

2606.10928 2026-06-10 cs.CE cs.AI cs.LG physics.comp-ph 交叉投稿

A Constrained Natural-Language Interface for Variational Multi-Physics Finite Element Simulations in FEniCS

FEniCS中变分多物理场有限元模拟的受约束自然语言接口

Nilay Upadhyay, Wesley F. Reinhart

发表机构 * Department of Engineering Science and Mechanics, The Pennsylvania State University(工程科学与力学系,宾夕法尼亚州立大学) Department of Materials Science and Engineering, The Pennsylvania State University(材料科学与工程系,宾夕法尼亚州立大学)

AI总结 提出一种受约束的自然语言接口,将LLM限制在前端任务(解析提示、生成Gmsh代码),后端使用确定性模板求解器,在基准测试中实现100%解析率和90%几何生成成功率。

Comments 23 pages, 17 figures

详情
AI中文摘要

大型语言模型可以减少设置有限元模拟所需的手动工作,但当生成的求解器代码位于关键路径上时,会引入可靠性风险。我们提出了一种用于多物理场有限元分析的受约束自然语言接口,其中LLM仅限于前端任务:将提示解析为结构化JSON,仅对非目录几何生成Gmsh代码,并对这些阶段使用重试反馈。它从不编写FEniCS求解器模板、推导弱形式或编写数值求解器核心。一个确定性调度器将验证后的规范映射到五个手写的FEniCS/UFL模板:线弹性、超弹性、弹塑性、热力耦合和相场断裂。我们针对解析解和已发表的2D/3D基准测试验证了该确定性模板层。在适当网格上,平滑案例达到低于1%的一致性,而较难的非线性案例达到2-5%的范围。我们还直接评估了面向LLM的前端。在15个提示的解析器基准测试中,首次通过有效解析获得了9个案例,其余所有案例在重试后修复,最终有效解析率为100.0%,问题类别准确率为100.0%,字段提取准确率为97.1%。在通过真实LLM到Gmsh路径路由的10个案例自定义几何基准测试中,首次通过和最终成功率均为90.0%,一次未恢复的无效几何失败。这些结果表明,解析器和受约束的提示/验证设计在这些基准测试上是有效的。作为端到端演示,该系统从一个自然语言提示生成并分析了一个带有圆角和螺栓孔的3D弹塑性L形支架。贡献在于一种用于自然语言驱动的变分模拟的测量架构,而非开放式的自主代码生成。

英文摘要

Large language models can reduce the manual effort required to set up finite element simulations, but they introduce reliability risks when generated solver code lies on the critical path. We present a constrained natural-language interface for multi-physics finite element analysis in which the LLM is limited to front-end tasks: parsing prompts into structured JSON, generating Gmsh code only for non-catalog geometries, and using retry feedback for those stages. It never writes FEniCS solver templates, derives weak forms, or writes the numerical solver core. A deterministic dispatcher maps the validated specification to five human-written FEniCS/UFL templates: linear elasticity, hyperelasticity, elastoplasticity, thermo-mechanical coupling, and phase-field fracture. We validate this deterministic template layer against analytical solutions and published 2D/3D benchmarks. Smooth cases reach sub-percent agreement on adequate meshes, while harder nonlinear cases reach the 2-5 percent range. We also evaluate the LLM-facing front end directly. In a 15-prompt parser benchmark, first-pass valid parses were obtained for 9 cases, and all remaining cases were repaired after retry, giving a final valid parse rate of 100.0 percent, 100.0 percent problem-class accuracy, and 97.1 percent field-extraction accuracy. In a 10-case custom-geometry benchmark routed through the real LLM-to-Gmsh path, first-pass and final success were both 90.0 percent, with one unrecovered invalid-geometry failure. These results show that the parser and constrained prompt/validation design are effective on these benchmarks. As an end-to-end demonstration, the system generates and analyzes a 3D elastoplastic L-bracket with a fillet and bolt hole from one natural-language prompt. The contribution is a measured architecture for natural-language-driven variational simulation, not open-ended autonomous code generation.

2606.10940 2026-06-10 cs.CV cs.AI cs.LG 交叉投稿

Democratising Camera Trap AI: An Open-Source Model for Detecting UK Mammals

民主化相机陷阱AI:用于检测英国哺乳动物的开源模型

Paul Fergus, Philip Stephens, Russell A. Hill, Lee Oliver, Katie Appleby, Sarah Beatham, Naomi Davies Walsh, Stuart Nixon, Naomi Matthews, Chris Sutherland, Kelly Hitchcock

发表机构 * Liverpool John Moores University(利物浦约翰穆里斯大学) Durham University(杜伦大学) MammalWeb(哺乳动物网) Game & Wildlife Conservation Trust(游戏与野生动物保护信托) National Trust(国家信托) Animal and Plant Health Agency(动物和植物卫生局) Chester Zoo(切斯特动物园) University of St Andrews(圣安德鲁大学) Nottingham Trent University(诺丁汉特伦特大学)

AI总结 发布一个针对31类(28种英国常见哺乳动物和鸟类)的开源目标检测模型,基于YOLO26x在48,165个标注实例上训练,mAP@0.5达0.984,旨在降低生态学家使用AI的门槛。

Comments 15 Pages, 4 Figures

详情
AI中文摘要

相机陷阱已成为生物多样性监测的基石,但将大量图像转化为可用生态数据的人工智能通常被锁定在商业平台之后,或针对与不列颠群岛不相符的动物群进行训练。为了消除障碍并提高采用率,我们发布了一个针对31类(28种英国常见哺乳动物和鸟类,以及人类、校准杆和车辆等实用类)的开源目标检测模型,该模型基于从多个地点经过十年运营部署(通过Conservation AI及其后续项目Trap Tracker)收集的48,165个标注实例的精选数据集。该模型是YOLO26x检测器,在80/10/10的类别分层划分上进行训练和测试,在保留的验证集上,IoU为0.5时平均精度为0.984(IoU 0.5-0.95时为0.956),精确率为0.988,召回率为0.965。在未见过的保留测试集上,31个类别的平均物种置信度范围为0.96至0.99,假阴性率为0.17%,主要集中在困难的夜间、远处或遮挡图像中。这些指标来自与训练相同站点和相机池的数据,因此在新站点的性能留待未来工作。我们以非商业许可发布ONNX格式的训练权重,支持本地桌面和实时相机,明确面向没有机器学习经验的生态学家。此发布是对过去十年中开发的多个付费模型的有意制衡。

英文摘要

Camera traps have become a cornerstone of biodiversity monitoring, but the artificial intelligence that turns vast quantities of images into usable ecological data is often locked behind commercial platforms or trained on fauna that does not match that of the British Isles. In an attempt to remove barriers and increase uptake, we release an open-source object detection model for 31 classes, 28 common UK mammal and bird species, plus utility classes for humans, calibration poles, and vehicles, drawn from a curated dataset of 48,165 labelled instances assembled from multiple sites over a decade of operational deployment through Conservation AI and its successor, Trap Tracker. The model, a YOLO26x detector trained and tested on an 80/10/10 class-stratified split, achieves a mean Average Precision of 0.984 at Intersection over Union (IoU) of 0.5 (0.956 at IoU 0.5-0.95) on the held-out validation set, with precision 0.988 and recall 0.965. On an unseen held-out test split, mean per-species confidence ranged from 0.96 to 0.99 across the 31 classes, with a 0.17% false-negative rate concentrated in difficult night-time, distant, or occluded images. These metrics are from data from the same pool of sites and cameras as training, so performance at entirely new sites is left to future work. We release the trained weights in ONNX format under a non-commercial licence, with local desktop and real-time camera support, aimed explicitly at ecologists with no machine-learning experience. This release is a deliberate counterweight to the multiple paid for models that have developed over the last decade.

2606.10972 2026-06-10 eess.AS cs.AI 交叉投稿

Optimizing 2D Input Representations and Sub-phase Fusion Strategies for Differential Diagnosis of Asthma and COPD Using CNN- and GRU-Based Networks

基于CNN和GRU网络的哮喘与COPD鉴别诊断中二维输入表示和子阶段融合策略的优化

Ipek Sen, Ozgur Ozdemir, Elena Battini Sonmez

发表机构 * Dept. Electrical and Electronics Engineering Istanbul Bilgi University, Turkey(电气与电子工程系伊斯坦布尔比尔吉大学,土耳其) Dept. Computer Engineering Istanbul Bilgi University, Turkey(计算机工程系伊斯坦布尔比尔吉大学,土耳其)

AI总结 本研究优化了二维输入表示(MFCC、对数梅尔谱图、VAR模型)和子阶段特征融合策略(直接拼接、GRU、GRU+注意力),使用CNN和GRU网络鉴别哮喘与COPD,最佳F1分数达0.877。

详情
AI中文摘要

本研究旨在探索VAR模型与梅尔频率倒谱系数(MFCC)矩阵和对数梅尔谱图在深度学习中的性能比较。在肺音分类中,基于谱图的表示因呼吸周期时长不同而存在时间维度不一致的问题。除了传统的裁剪/零填充,还提出了自适应长度窗口来固定时间维度。通过测试一系列参数优化其频谱和时间维度。采用不同的卷积神经网络(CNN)架构从子阶段获得的二维表示中提取特征。然后使用各种策略融合提取的子阶段特征,包括直接拼接、门控循环单元(GRU)网络和带注意力的GRU。通过基于呼吸周期的评估和基于受试者的评估(包含多个呼吸周期)来评估模型性能。还研究了多种数据增强技术以应对数据规模限制。最佳基于周期的F1分数(0.877)通过使用13个系数和每子阶段表示64点时间分辨率的MFCC矩阵,随后进行直接特征拼接获得;最佳基于受试者的F1分数(0.855)通过使用13个系数和每完整周期表示256点时间分辨率的MFCC矩阵获得,两者均采用自适应长度窗口。增强总体上降低了模型性能,但mixup增强是测试方法中最好的。MFCC在区分哮喘和COPD方面优于对数梅尔谱图和VAR模型。复杂的融合策略并未改善诊断。增强没有贡献,表明真实数据在肺音研究中的重要性。

英文摘要

This study aims to explore the performance of the VAR model in comparison with mel-frequency cepstral coefficient (MFCC) matrices and log-mel spectrograms using deep learning. In pulmonary sound classification, spectrogram-based representations suffer from inconsistent temporal dimensions due to varying respiratory cycle durations. Along with traditional trimming/zero-padding, adaptive-length windowing was presented to fix their temporal dimensions. Their spectral and temporal dimensions were optimized by testing a range of parameters. Different convolutional neural network (CNN) architectures were employed to extract features from the two-dimensional representations obtained over the sub-phases. The extracted sub-phase features were then fused using various strategies including direct concatenation, gated recurrent unit (GRU) network and GRU with attention mechanism. Model performances were assessed through respiratory cycle-based evaluation and subject-based evaluation comprising multiple respiratory cycles. Several data augmentation techniques were also studied to cope with limitations in data size. The best cycle-based F1-score (0.877) was obtained using the MFCC matrices with thirteen coefficients and 64-point time resolution per sub-phase representation followed by direct feature concatenation, and the best subject-based F1-score (0.855) was obtained using the MFCC matrices with thirteen coefficients and 256-point time resolution per full-cycle representation, both obtained by adaptive-length windowing. Augmentation degraded the performance of models overall, yet mixup augmentation was the best among the methods tested. MFCC outperformed log-mel spectrogram and VAR model in differentiation of asthma and COPD. Sophisticated fusion strategies did not improve the diagnosis. Augmentation did not contribute, demonstrating the significance of authentic data in pulmonary sound studies.

2606.11106 2026-06-10 cs.CV cs.AI 交叉投稿

FADA: Accessible fetal ultrasound interpretation and annotation with a selectively distilled unified vision-language model

FADA: 可访问的胎儿超声解读与标注——基于选择性蒸馏的统一视觉-语言模型

Mahmood Alzubaidi, Uzair Shah, Raden Muaz, Ines Abbes, Nader Mohammed, Abdullatif Magram, Khalid Alyafei, Mowafa Househ, Marco Agus

发表机构 * Hamad Bin Khalifa University(哈马德·本·哈利法大学) HMC(哈马德医疗公司) Advanced AlRazi Diagnostic Center(高级阿尔拉齐诊断中心) Sidra Medicine(锡德拉医学)

AI总结 提出统一视觉-语言模型FADA,通过选择性蒸馏从四个领域基础模型提取知识,实现胎儿超声的解读、分类、检测和分割,在单个消费级GPU上训练,无需外部标签,可在智能手机上离线运行。

详情
AI中文摘要

全球范围内受过训练的超声技师短缺限制了低收入和中等收入国家的产前超声筛查,这些国家超过一半的孕妇未接受专业超声检查。当前的深度学习方法分别处理检测、分割或分类,每个任务都需要单独的模型和推理时的专家指定标签。我们提出FADA,一个基于Qwen3.5-VL构建的统一视觉-语言模型,通过单一解读优先的流程执行临床解读、分类、检测和分割,无需外部标签。FADA通过离线预计算特征缓存,从四个领域基础模型(FetalCLIP、UltraSAM、USF-MAE、UltraFedFM)中蒸馏知识。选择性蒸馏仅对标注任务应用特征对齐,而解读任务依赖标准微调,在大多数评估指标上持续优于完全蒸馏。推荐变体FADA-SKD在分割上达到0.8820平均Dice,检测上达到0.7671 mAP@0.50,结构化解读合规性达到100%。专家超声技师对237张图像的验证确认了在自主和人机协同模式下输出临床可接受,其中73.5%的解读在临床医生指导下获得完美评分。该系统可在单个消费级GPU上训练,无需云连接即可部署。我们通过在商用智能手机(高通骁龙7 Gen 1,12 GB RAM)上使用GGUF量化的this http URL运行压缩的0.8B模型,验证了边缘部署,完全离线完成全部5阶段流程约需60秒。这为将AI辅助胎儿评估与便携式超声设备集成提供了实用途径,直接解决了资源受限环境中的诊断可及性差距。代码、模型和数据可在https://this https URL获取。

英文摘要

A global shortage of trained sonographers limits prenatal ultrasound screening in low- and middle-income countries, where over half of pregnant women receive no skilled sonography. Current deep learning approaches address detection, segmentation, or classification in isolation, each demanding a separate model and expert-specified labels at inference. We present FADA, a unified vision-language model built on Qwen3.5-VL that performs clinical interpretation, classification, detection, and segmentation through a single interpretation-first pipeline without external labels. FADA distills knowledge from four domain-specific foundation models (FetalCLIP, UltraSAM, USF-MAE, UltraFedFM) via offline pre-computed feature caching. Selective distillation, which applies feature alignment only to annotation tasks while interpretation relies on standard fine-tuning, consistently outperforms full distillation across most evaluation axes. The recommended variant, FADA-SKD, achieves 0.8820 mean Dice for segmentation, 0.7671 mAP@0.50 for detection, and 100% structured interpretation compliance. Expert sonographer validation across 237 images confirms clinically acceptable outputs in both autonomous and human-in-the-loop modes, with 73.5% of interpretations scoring perfectly under clinician guidance. The system is trainable on a single consumer GPU and deployable without cloud connectivity. We validate edge deployment by running the compressed 0.8B model on a commodity smartphone (Qualcomm Snapdragon 7 Gen 1, 12 GB RAM) using llama.cpp with GGUF quantization, completing the full 5-phase pipeline in approximately 60 seconds entirely offline. This establishes a practical pathway for integrating AI-assisted fetal assessment with portable ultrasound devices, directly addressing diagnostic access gaps in resource-constrained settings. Code, models, and data are available at https://github.com/mahmoodphd/FADA.

2606.11117 2026-06-10 cs.AR cs.AI cs.PF 交叉投稿

Towards Autonomous Accelerator Design: FPGA Accelerator Generation with SECDA

迈向自主加速器设计:基于SECDA的FPGA加速器生成

Vinamra Sharma, Xingjian Fu, Jude Haris, José Cano

发表机构 * School of Computing Science, University of Glasgow, Scotland, UK(格拉斯哥大学计算机科学学院)

AI总结 提出SECDA-DSE框架,集成大语言模型引导FPGA加速器设计空间探索,通过结构化探索器和LLM推理生成可综合的加速器设计,减少人工干预。

Comments Accepted to the Machine Learning for Architecture and Systems Workshop (MLArchSys), co-located with ISCA 2026

详情
AI中文摘要

为现代人工智能工作负载设计基于FPGA的加速器需要探索庞大而复杂的硬件设计空间,涉及架构参数、数据流策略和内存层次结构,这使得过程非常耗时。虽然现有方法如SECDA通过SystemC仿真和FPGA执行实现了快速的硬件-软件协同设计,但识别高效的加速器配置仍然是一个主要需要广泛领域知识的手动过程。SECDA-DSE是一个将大语言模型(LLM)集成到SECDA生态系统中的框架,用于指导基于FPGA的加速器的设计空间探索(DSE)。它结合了用于生成候选架构的结构化DSE探索器,以及使用检索增强生成和思维链提示进行推理引导探索的LLM栈,并配有用于迭代和强化优化的反馈循环。基于我们之前介绍SECDA-DSE的工作,本文通过生成三种加速器设计(包括逐元素向量乘法、二维卷积和矩阵转置)并在FPGA硬件上执行端到端运行来扩展其评估。结果表明,SECDA-DSE能够生成符合SECDA标准的加速器设计,并成功在FPGA硬件上综合和执行。此外,生成的设计捕获了计算并行性和数据移动之间的内核特定权衡,突显了LLM引导探索在跨不同工作负载调整架构配置方面的潜力,同时减少了探索时间和大量人类专业知识的需求。

英文摘要

Designing FPGA-based accelerators for modern artificial intelligence workloads requires exploring a large and complex hardware design space that involves architectural parameters, data flow strategies, and memory hierarchies, making the process very time consuming. While existing methodologies such as SECDA enable rapid hardware-software co-design through SystemC simulation and FPGA execution, identifying efficient accelerator configurations remains a largely manual process requiring extensive domain knowledge. SECDA-DSE is a framework that integrates Large Language Models (LLMs) into the SECDA ecosystem to guide design space exploration (DSE) of FPGA-based accelerators. It combines a structured DSE Explorer for generating candidate architectures with an LLM Stack that performs reasoning-guided exploration using retrieval-augmented generation and chain-of-thought prompting, coupled with a feedback loop for iterative and reinforced refinement. Building on our previous work introducing SECDA-DSE, this paper extends its evaluation by generating three accelerator designs, including element-wise vector multiplication, 2D convolution, and matrix transpose, and performing end-to-end execution on FPGA hardware. The results show that SECDA-DSE can generate SECDA-compliant accelerator designs that are successfully synthesized and executed on FPGA hardware. Furthermore, the generated designs capture kernel-specific trade-offs between compute parallelism and data movement, highlighting the potential of LLM-guided exploration to adapt architectural configurations across diverse workloads while reducing exploration time and the need for extensive human expertise.

2606.11140 2026-06-10 physics.geo-ph cs.AI cs.LG stat.AP stat.ML 交叉投稿

Data assimilation for subsurface flow using latent diffusion model parameterization: performance of ensemble-Kalman and Monte Carlo techniques

基于潜扩散模型参数化的地下流体数据同化:集成卡尔曼与蒙特卡洛技术的性能

Guido Di Federico, Wenchao Teng, Louis J. Durlofsky

发表机构 * Department of Energy Science & Engineering, Stanford University(能源科学与工程系,斯坦福大学)

AI总结 针对地下流体数据同化中高维参数反演问题,比较了基于潜扩散模型(LDM)的集成卡尔曼方法(ESMDA)与蒙特卡洛方法(MCMC/SMC)在三维河道地质模型上的性能,发现蒙特卡洛方法在保持地质真实性的同时能更有效地降低数据失配和不确定性。

详情
AI中文摘要

地下流体数据同化(DA)涉及校准模型参数以匹配观测数据(通常来自井),同时保持地质真实性。潜扩散模型(LDM)提供了从高维地质模型空间到低维潜变量的高效映射,降低了反问题的维度,同时保持了后验地质模型的合理性。然而,LDM映射的高度非线性可能会降低基于卡尔曼增益的集成更新的性能。我们针对具有层次地质不确定性的三维河道地质模型,系统比较了DA算法。我们使用多重数据同化集成平滑器(ESMDA)比较了模型空间和潜空间的DA,并展示了一个关键权衡:模型空间更新实现了显著的不确定性降低,但产生了地质上不现实的后验模型,而潜空间更新保持了真实性但表现出有限的不确定性降低。受此启发,我们在3D-LDM潜空间中探索了严格的马尔可夫链蒙特卡洛(MCMC)和序贯蒙特卡洛(SMC)算法。为适应其高计算需求,我们开发了一个快速代理流模型来近似井响应。MCMC和SMC在三个合成测试案例中与ESMDA进行了评估,DA在LDM潜空间中执行。由于LDM参数化,所有模型都保持了地质真实性。MCMC和SMC彼此一致,并且比潜空间ESMDA实现了更低的数据失配和更多的不确定性降低。我们的总体结果表明,集成卡尔曼方法在高度非线性参数化下可能提供过高的后验不确定性,而由快速代理模型支持的严格蒙特卡洛采样可以提供更可靠的替代方案。

英文摘要

Data assimilation (DA) in subsurface flow entails calibrating model parameters to match observed data, typically at wells, while preserving geological realism. Latent diffusion models (LDMs) provide efficient mappings from high-dimensional geological model space to a low-dimensional latent variable, reducing the dimensionality of the inverse problem while maintaining plausibility in posterior geomodels. However, the high nonlinearity in the LDM mapping may degrade the performance of Kalman-gain-based ensemble updates. We present a systematic comparison of DA algorithms applied to large-scale 3D channelized geomodels with hierarchical geological uncertainty. We compare model-space and latent-space DA using the ensemble smoother with multiple data assimilation (ESMDA), and demonstrate a key trade-off: model-space updates achieve significant uncertainty reduction but produce geologically unrealistic posterior models, while latent-space updates preserve realism but exhibit limited uncertainty reduction. Motivated by this, we explore rigorous Markov chain Monte Carlo (MCMC) and Sequential Monte Carlo (SMC) algorithms in the 3D-LDM latent space. To accommodate their high computational demands, we develop a fast surrogate flow model that approximates well-rate responses. MCMC and SMC are evaluated against ESMDA across three synthetic test cases, with DA performed in the LDM latent space. All models maintain geological realism due to the LDM parameterization. MCMC and SMC are consistent with one another and achieve lower data mismatch and more uncertainty reduction than latent-space ESMDA. Our overall results demonstrate that ensemble Kalman methods may provide overestimated posterior uncertainty with highly nonlinear parameterizations, while rigorous Monte Carlo sampling, enabled by fast surrogate models, can provide a more reliable alternative.

2606.11169 2026-06-10 cs.DC cs.AI 交叉投稿

Piper: A Programmable Distributed Training System

Piper: 可编程的分布式训练系统

Megan Frisella, Shubham Tiwari, Andy Ruan, Yi Pan, Parker Gustafson, Mat Jacob, Gilbert Bernstein, Stephanie Wang

发表机构 * University of Washington(华盛顿大学) University of Washington and Shanghai Jiao Tong University(华盛顿大学和上海交通大学)

AI总结 提出Piper系统,通过解耦策略与运行时实现,允许用户用少量注解和调度指令声明分布式训练策略,并编译为设备执行计划,支持常见策略并实现组合策略的联合调度优化。

详情
AI中文摘要

大规模模型训练日益依赖于组合多种并行策略(如数据、流水线和专家并行)以及内存节省优化(如ZeRO)。用于基础模型预训练的部署系统通常依赖人类专家手动设计高层并行策略,然后实现相应的低层执行策略,这使得系统难以适应新策略。同时,许多通用框架更加灵活,但其实现仍然局限于一组固定的常见并行策略,使得整合最新策略具有挑战性。我们提出Piper,一个用户可控的分布式训练系统,将策略与运行时实现解耦。Piper允许用户通过少量模型注解和调度指令声明全面的分布式训练策略。每条指令对Piper的中间表示(IR)应用变换,IR是一个统一的全局训练DAG,表示所有计算和通信。使用此IR,Piper编译每设备执行计划,并使用与策略无关的分布式运行时执行它们。我们表明,该组合系统在常见策略(如ZeRO)上保持性能一致,同时通过组合并行策略(如DeepSeek-V3的DualPipe)中计算和通信的联合调度,实现额外的性能和内存效率提升。

英文摘要

Large-scale model training increasingly relies on composing multiple parallelism strategies, such as data, pipeline, and expert parallelism, together with memory-saving optimizations like ZeRO. Deployed systems for foundation model pretraining often rely on human experts to manually design a high-level parallelism strategy then implement the corresponding low-level execution strategy, making it difficult to adapt the system to new strategies. Meanwhile, many general-purpose frameworks are more flexible but their implementations are still tied to a fixed set of common parallelism strategies, making it challenging to integrate state-of-the-art strategies. We present Piper, a user-controllable distributed training system that decouples the strategy from the runtime implementation. Piper allows users to declare a comprehensive distributed training strategy with a small set of model annotations and scheduling directives. Each directive applies a transformation on Piper's intermediate representation (IR), a unified global training DAG that represents all computation and communication. Using this IR, Piper compiles per-device execution plans and executes them with a distributed runtime agnostic to the strategy. We show that the combined system maintains performance parity on commonly available strategies such as ZeRO, while also enabling additional performance and memory efficiency gains through joint scheduling of compute and communication in composed parallelism strategies such as DeepSeek-V3's DualPipe.

2404.11716 2026-06-10 cs.AI 版本更新

A Survey on Semantic Modeling for Building Energy Management

建筑能源管理的语义建模综述

Miracle Aniakor, Vinicius V. Cogo, Pedro M. Ferreira

发表机构 * LASIGE, DI, Faculdade de Ciências, Universidade de Lisboa, Portugal(里斯本大学科学学院激光工程与信息研究所)

AI总结 综述建筑运行阶段语义建模,分析60个模型和20多个用例,提出本体证据完备性指标,发现物理结构覆盖好而动态概念覆盖不足,指出提升互操作性和泛化能力的方向。

Comments 52 pages, 7 figures, 5 tables

详情
AI中文摘要

建筑能源管理(BEM)对于减少建筑领域的能源消耗和二氧化碳排放至关重要。尽管物联网技术现在提供了广泛的运行数据,但异构数据模型、设备描述和上下文表示仍然限制了语义互操作性,阻碍了通用、自主、上下文感知的BEM应用的发展。本体通过提供结构化、机器可解释的建筑数据、系统和运行上下文表示来解决这一挑战。本综述考察了建筑运行阶段的BEM语义建模。它回顾了60个语义模型,分析了20多个基于本体的BEM用例,并进一步量化了这些用例中的本体实例化率(OIR)和缺失概念。为了支持基于证据的本体使用评估,我们引入了本体证据完备性(OEC)的概念,这是一种衡量研究是否将运行概念明确映射到用于表示它们的本体类别的度量。结果表明,当前的语义模型在表示物理建筑结构、技术系统、传感设备和可观察的运行数据方面比抽象和动态的运行概念更一致。诸如关键绩效指标、评估、服务、控制逻辑、优化任务和计算工作流等概念的覆盖仍然不够一致。因此,应用的BEM研究经常依赖于本体重用、集成、专门化、外部继承或特定应用扩展来解决BEM中的覆盖和互操作性差距。通过综合这些模式,本综述阐明了现有语义模型的能力,并指出了更可互操作、更通用和更上下文感知的BEM系统的发展方向。

英文摘要

Building Energy Management (BEM) is central to reducing energy use and CO2 emissions in the building sector. Although IoT technologies now provide extensive operational data, heterogeneous data models, device descriptions, and contextual representations continue to limit semantic interoperability, limiting the development of generalisable, autonomous, context-aware BEM applications. Ontologies address this challenge by providing structured, machine-interpretable representations of building data, systems, and operational context. This survey examines semantic modelling for BEM during the building operational phase. It reviews 60 semantic models and analyses more than 20 ontology-based BEM use cases. It further quantifies Ontology Instantiation Rates (OIR) and missing concepts across those use cases. To support evidence-based assessment of ontology use, we introduce the notion of Ontology Evidence Completeness (OEC), a measure of whether studies explicitly map operational concepts to the ontology classes used to represent them. Findings show that current semantic models more consistently represent physical building structure, technical systems, sensing devices, and observable operational data than abstract and dynamic operational concepts. Concepts such as key performance indicators, assessments, services, control logic, optimisation tasks, and computational workflows remain less consistently covered. Applied BEM studies therefore frequently depend on ontology reuse, integration, specialisation, external inheritance, or application-specific extension to address coverage and interoperability gaps across BEM. By synthesising these patterns, this survey clarifies the capabilities of existing semantic models and identifies directions for more interoperable, generalisable, and context-aware BEM systems.

2605.17254 2026-06-10 cs.AI 版本更新

CatalyticMLLM: A Graph-Text Multimodal Large Language Model for Catalytic Materials

CatalyticMLLM: 一种用于催化材料的图-文本多模态大语言模型

Yanjie Li, Jian Xu, Xu-Yao Zhang, Shiming Xiang, Nian Ran, Weijun Li, Cheng-Lin Liu

发表机构 * AnnLab(安实验室) Institute of Semiconductors, Chinese Academy of Sciences(中国科学院半导体研究所) Zhongguancun Academy(中关村学院) State Key Laboratory of Multimodal Artificial Intelligence Systems(多模态人工智能系统国家重点实验室) State Key Laboratory of High Performance Ceramics(高性能陶瓷国家重点实验室) University of Chinese Academy of Sciences(中国科学院大学) School of Electronic, Electrical and Communication Engineering(电子电气与通信工程学院) University of ChineseAcademy of Sciences(中国科学院大学)

AI总结 本文提出了一种统一的图-文本多模态大语言模型QE-Catalytic-V2,用于催化材料的性质预测和逆向设计,通过共享的表示空间实现两者的联合建模,从而形成闭环优化流程。

Comments 71 page

详情
AI中文摘要

催化材料的性质预测和逆向结构设计通常被建模为两个独立的任务:前者根据给定的结构预测目标性质,而后者则根据期望的性质生成候选结构。尽管解耦范式便于实现“生成-评估-筛选”工作流,但生成模型和性质预测模型在表示空间和训练目标上的不一致可能会引入数据分布偏移和评估者偏见,从而限制闭环优化的稳定性。在本文中,我们提出QE-Catalytic-V2,这是一种用于催化材料的统一图-文本多模态大语言模型,它在同一个模型和共享的表示空间中整合了性质预测和逆向设计。在此统一框架下,QE-Catalytic-V2不仅能通过三维结构和文本信息进行可靠的性质预测,还能根据目标性质生成和筛选物理上可行的CIF候选结构,从而形成“逆向设计-预测-筛选-重新设计”的闭环优化工作流。实验结果表明,这种统一范式在催化放松能预测和逆向设计任务上均优于解耦基线,验证了在单一多模态模型中联合建模性质预测和结构生成的有效性。

英文摘要

Property prediction and inverse structural design of catalytic materials are typically modeled as two independent tasks: the former predicts target properties from given structures, whereas the latter generates candidate structures according to desired properties. Although the decoupled paradigm facilitates the implementation of a ``generation--evaluation--screening'' workflow, the inconsistency between the generative model and the property prediction model in terms of representation spaces and training objectives can readily introduce data distribution shifts and evaluator bias, thereby limiting the stability of closed-loop optimization. In this work, we propose CatalyticMLLM, a unified graph--text multimodal large language model for catalytic materials, which integrates property prediction and \textbf{inverse design} within the same model and shared representation space. Under this unified framework, CatalyticMLLM can not only perform reliable property prediction by leveraging three-dimensional structures and textual information, but also generate and screen physically feasible CIF candidates conditioned on target properties, thereby forming a closed-loop optimization workflow of ``inverse design--prediction--screening--redesign.'' Experimental results demonstrate that this unified paradigm outperforms decoupled baselines on both catalytic relaxed-energy prediction and inverse design tasks, validating the effectiveness of jointly modeling property prediction and structure generation within a single multimodal model.

2606.01884 2026-06-10 cs.AI 版本更新

EVA-Net: Subject-Independent EEG Motor Decoding with Video-Derived Motor Priors

EVA-Net: 基于视频衍生运动先验的跨被试脑电运动解码

Ziyuan Li, Yueyu Sun, Yimeng Zhang

发表机构 * South China University of Technology(华南理工大学)

AI总结 提出EVA-Net两阶段框架,利用动作视频作为语义先验,通过跨模态对比学习和对齐减少个体差异,实现跨被试脑电运动解码,在EEGMMI上取得8.66%的LOSO准确率提升。

详情
AI中文摘要

实用的非侵入式脑机接口(BCI)系统需要具有强跨被试泛化能力和最小校准的脑电解码器。然而,被试间差异和信号非平稳性常常将运动语义与被试特定噪声纠缠在一起,限制了跨被试解码。最近的多模态方法使用文本作为语义锚点,但文本对本质上动态的运动过程提供了稀疏且静态的监督。为了解决这个问题,我们提出了EVA-Net,一个两阶段框架,使用动作视频作为跨被试脑电运动解码的语义先验。在第一阶段,使用跨模态和监督对比目标将脑电和视频特征在共享空间中对齐,以减少被试特定变化。在第二阶段,视频类别原型和知识蒸馏将视频衍生的先验转移到仅脑电的分类器,而不增加推理开销。在两个公共数据集上的实验表明,EVA-Net实现了强大的跨被试解码性能,包括在EEGMMI上8.66%的LOSO准确率提升。消融结果进一步表明,视频提供了比本文考虑的文本基线更有效的语义锚点。

英文摘要

Practical non-invasive Brain-Computer Interface (BCI) systems require EEG decoders with strong cross-subject generalization and minimal calibration. However, inter-subject variability and signal non-stationarity often entangle motor semantics with subject-specific noise, limiting subject-independent decoding. Recent multimodal approaches use text as a semantic anchor, yet text provides sparse and static supervision for inherently dynamic motor processes. To address this issue, we propose EVA-Net, a two-stage framework that uses action videos as semantic priors for subject-independent EEG motor decoding. In the first stage, EEG and video features are aligned in a shared space using cross-modal and supervised contrastive objectives to reduce subject-specific variation. In the second stage, video category prototypes and knowledge distillation transfer video-derived priors to an EEG-only classifier without adding inference overhead. Experiments on two public datasets show that EVA-Net achieves strong subject-independent decoding performance, including an 8.66% LOSO accuracy gain on EEGMMI. Ablation results further suggest that video provides a more effective semantic anchor than the text baseline considered in this work.

2606.02386 2026-06-10 cs.AI q-bio.QM 版本更新

AgentPLM: Agentic Protein Language Models with Reasoning-Augmented Decoding for Protein Sequence Design

AgentPLM:具有推理增强解码的智能体蛋白质语言模型用于蛋白质序列设计

Sahil Rahman, Maxx Richard Rahman

发表机构 * Sahil Rahman Maxx Richard Rahman

AI总结 提出AgentPLM,通过推理增强解码和对比智能体策略优化,使预训练蛋白质语言模型能够利用外部生物物理反馈进行在线纠错,在多项蛋白质设计任务上取得最优结果。

详情
Journal ref
Workshop on Generative and Agentic AI for Biology, 43rd International Conference on Machine Learning (ICML 2026)
AI中文摘要

蛋白质语言模型(PLM)是被动预言机:它们通过单次前向传递生成序列,没有机制来咨询外部生物物理反馈或在候选序列违反热力学或结构约束时重定向生成。我们引入AgentPLM,通过为预训练PLM配备i)推理增强解码(RAD),该解码将自回归生成与工具调用(ESMFold、FoldX、AutoDock Vina)交错进行,以及ii)对比智能体策略优化(CAPO),这是直接偏好优化的轨迹级扩展,它端到端地训练策略以学习何时预言机反馈具有信息性,而不仅仅是模仿高适应度序列。我们在基准任务上评估AgentPLM,涵盖从头酶设计、抗体优化、热稳定性、PPI界面设计和零样本适应度预测,使用标准化的预言机API和受控的序列同一性划分。AgentPLM取得了最先进的结果,抗体前10%命中率相比最强被动基线有所提升,提供了无需显式回溯的在线纠错的机制证据。

英文摘要

Protein language models (PLMs) are passive oracles: they generate sequences in a single forward pass with no mechanism to consult external biophysical feedback or redirect generation when a candidate violates thermodynamic or structural constraints. We introduce AgentPLM, which addresses this by equipping a pre-trained PLM with i) Reasoning-Augmented Decoding (RAD), which interleaves autoregressive generation with tool calls (ESMFold, FoldX, AutoDock Vina), and ii) Contrastive Agent Policy Optimisation (CAPO), a trajectory-level extension of direct preference optimisation that trains the policy end-to-end to learn when oracle feedback is informative rather than merely imitating high-fitness sequences. We evaluate AgentPLM on benchmark tasks spanning de novo enzyme design, antibody optimisation, thermostability, PPI interface design, and zero-shot fitness prediction with standardised oracle APIs and controlled sequence-identity splits. AgentPLM achieves state-of-the-art results with a gain in antibody top-10% hit rate over the strongest passive baseline, providing mechanistic evidence of online error correction without explicit backtracking.

2606.08982 2026-06-10 cs.AI 版本更新

Baichuan-M4: A Clinical-Grade Medical Agent System for Continuous Care

Baichuan-M4:面向持续照护的临床级医疗智能体系统

Aiyuan Yang, Canbin Piao, Chengfeng Dou, Da Pan, Dian Wang, Fan Yang, Fei Deng, Fei Li, Guangwei Ai, Hui Liu, Hongda Zhang, Jinyang Tai, Kai Lu, Lijun Liu, Linwei Chen, Linyu Li, Meiqing Guo, Peidong Guo, Qiang Ju, Rihui Xin, Shuai Wang, XinKai Ma, Xudong Chen, Yichuan Mo, Yijie Zhou, Leyi Pan, Yihe Luo, Zian Wang

发表机构 * Baichuan AI(百川智能) THUBPM Group, Tsinghua University(清华大学THUBPM课题组)

AI总结 提出Baichuan-M4临床级医疗大模型,通过统一运行时、持续照护强化学习框架和临床工具层三大支柱构建智能体系统,在多项医疗评估中取得领先结果,幻觉率降至3.3%。

详情
AI中文摘要

Baichuan-M4是百川智能开发的临床级医疗大模型,专为\emph{持续照护}而非单轮医疗问答设计。它围绕三大支柱构建为协调的医疗智能体系统:\textbf{Baichuan-Harness},一个统一运行时,保持强化学习训练与实际部署的一致性,同时强制执行动作约束、工具使用、长期患者记忆和多智能体协调;一个\textbf{核心推理模型},采用持续照护强化学习框架训练,该框架集成了跨度级奖励建模(SPAR++)、推理路径压缩、课程学习和稳定的策略优化;以及一个\textbf{临床工具层},用于患者记忆管理、权威循证检索以及跨文档、X光和皮肤科的多模态医学感知。在跨维度医学评估套件中,Baichuan-M4在静态医学知识与安全性、动态OSCE式咨询、长上下文临床记忆、循证检索、医学文档OCR和多模态图像理解方面取得领先结果,同时将幻觉率降至3.3%。

英文摘要

Baichuan-M4 is Baichuan Intelligence's clinical-grade medical large model, designed for continuous care rather than single-turn medical question answering. It is built as a coordinated medical agent system around three pillars: Baichuan-Harness, a unified runtime that keeps reinforcement-learning training and real-world deployment consistent while enforcing action constraints, tool use, long-term patient memory, and multi-agent coordination; a core reasoning model trained with a continuous-care reinforcement-learning framework that integrates span-level reward modeling (SPAR++), reasoning-path compression, curriculum learning, and stabilized policy optimization; and a clinical tool layer for patient-memory management, authoritative evidence-based retrieval, and multimodal medical perception across documents, X-rays, and dermatology. On a cross-dimensional medical evaluation suite, Baichuan-M4 attains leading results in static medical knowledge and safety, dynamic OSCE-style consultation, long-context clinical memory, evidence-based retrieval, medical document OCR, and multimodal image understanding, while lowering the hallucination rate to 3.3%.

2504.03118 2026-06-10 cs.CV cs.AI 版本更新

NuWa: Deriving Lightweight Class-Specific Vision Transformers for Edge Devices

NuWa: 为边缘设备导出轻量级类别特定视觉Transformer

Ziteng Wei, Qiang He, Bing Li, Feifei Chen, Hai Jin, Yun Yang

发表机构 * National Engineering Research Center for Big Data Technology and System, Services Computing Technology and System Lab, Cluster and Grid Computing Lab(大数据技术与系统国家工程研究中心、服务计算技术与系统实验室、集群与网格计算实验室) Swinburne University of Technology(斯威本科技大学) Deakin University(迪金大学)

AI总结 针对边缘设备只需识别特定类别的问题,提出NuWa方法,通过自知识净化去除有害权重,并利用闭式优化高效导出紧凑ViT,无需重训练即可提升类别精度并加速推理。

Comments Accepted at CVPR 2026

详情
AI中文摘要

视觉Transformer(ViT)通常需要压缩以部署在资源受限的边缘设备(如无人机和智能车辆)上。然而,现有的模型压缩方法忽略了许多边缘设备仅需特定类别的知识用于其应用。因此,导出的全类别ViT保留了冗余知识,在这些类别上表现次优。我们发现,简单地将校准数据集替换为类别特定数据不足以解决此问题,因为这些方法面临两个根本限制。首先,它们忽略了存在对类别有害的权重,这些权重干扰特化,而移除它们可以提升类别特定性能。其次,目标类别的多样性和边缘设备的资源约束需要大量定制模型。现有方法耗时且计算成本高,因此不可扩展。在这项工作中,我们提出NuWa,一种成本高效的方法,通过从基础ViT导出小型ViT来应对这些挑战,适用于具有特定类别需求的边缘设备。NuWa执行自知识净化以剪除对类别有害的权重,并通过闭式优化高效导出紧凑ViT。无需剪枝后重训练,导出的边缘ViT在类别特定精度上超越基础ViT,并加速推理。综合实验表明,NuWa在类别特定任务上比最先进的无训练剪枝方法精度高出高达29.00%。与性能最佳的依赖训练剪枝方法相比,NuWa实现了33.69倍的剪枝加速,并将剪枝成本降低高达99.83%,平均精度损失仅为0.61%。项目页面:this https URL。

英文摘要

Vision Transformers (ViTs) often need to be compressed for deployment on resource-constrained edge devices like drones and smart vehicles. However, existing model compression methods ignore that many edge devices only require the knowledge of specific classes for their applications. As a result, the derived all-class ViTs retain redundant knowledge and perform suboptimally on these classes. We discovered that simply replacing the calibration dataset with class-specific data does not suffice to address this issue, as these methods face two fundamental limitations. First, they overlook the existence of class-detrimental weights, which interfere with specialization, while removing them can improve class-specific performance. Second, the diversity of target classes and resource constraints on edge devices demand numerous customized models. Existing methods are time-consuming and computationally expensive, thus unscalable. In this work, we present NuWa, a cost-efficient method that addresses these challenges by deriving small ViTs from base ViTs for edge devices with specific class requirements. NuWa performs self-knowledge purification to prune class-detrimental weights and efficiently derives compact ViTs through closed-form optimization. Without post-pruning retraining, the derived edge ViTs surpass the base ViT in class-specific accuracy and accelerate inference. Comprehensive experiments demonstrate that NuWa outperforms state-of-the-art training-free pruning methods on class-specific tasks by up to 29.00\% in accuracy. Compared with the best-performing training-dependent pruning method, NuWa achieves a 33.69x pruning speedup and reduces pruning cost by up to 99.83\%, with only a 0.61\% average accuracy loss. Project Page: https://github.com/CGCL-codes/NuWa.

2507.01062 2026-06-10 cs.CY cs.AI 版本更新

Quantifying Perception-Based Student Success with Generative AI: An Exploratory Monte Carlo Simulation

基于生成式AI的感知学生成功量化:一项探索性蒙特卡洛模拟

Seyma Yaman Kayadibi

发表机构 * arXiv

AI总结 本研究开发了一个探索性蒙特卡洛模拟框架,通过结构化文献检索和概率建模,量化学生在使用生成式AI工具时的感知成功,揭示了权重结构对复合分数的影响。

Comments Published in Education Sciences. This article is an extended and substantially revised version of a conference paper presented at the Melbourne Institute of Technology ICETE Conference, Sydney, NSW, Australia, 9-10 February 2026. The earlier conference version is available at DOI 10.25397/ppny-f488

详情
Journal ref
Education Sciences 2026, 16, 832
AI中文摘要

生成式人工智能(GenAI)工具,如ChatGPT,在高等教育中引起了越来越多的关注,特别是关于学生如何感知其有用性、可用性和教育价值。本研究开发了一个探索性蒙特卡洛模拟框架,用于量化在GenAI使用背景下基于感知的学生成功。在Scopus中进行的PRISMA结构化文献检索识别了2023年至2025年间发表的19项实证研究,其中6项报告了适用于概率建模的项目级均值和标准差。选择了一个一致的10项、5点李克特量表可用性导向工具作为规范的概念验证数据集,并用于参数化逆方差加权蒙特卡洛模拟,生成10,000个合成观测值。结果表明,加权结构显著影响模拟结果,系统效率和学习负担获得最大的逆方差权重,因此对复合分数的影响最强。该研究提供了一个透明、可重复且保护隐私的概念验证框架,将结构化文献检索、项目级汇总统计和概率建模联系起来。

英文摘要

Generative artificial intelligence (GenAI) tools such as ChatGPT have attracted growing attention in higher education, particularly in relation to how students perceive their usefulness, usability, and educational value. This study develops an exploratory Monte Carlo simulation framework for quantifying perception-based student success in the context of GenAI use. A PRISMA-informed structured literature search in Scopus identified nineteen empirical studies published between 2023 and 2025, of which six reported item-level means and standard deviations suitable for probabilistic modelling. One coherent 10-item, 5-point Likert-scale usability-oriented instrument was selected as a canonical proof-of-concept dataset and used to parameterise an inverse-variance-weighted Monte Carlo simulation generating 10,000 synthetic observations. The results show that the weighting structure substantially influences the simulated outcome, with System Efficiency and Learning Burden receiving the largest inverse-variance weight and therefore the strongest influence on the composite score. The study offers a transparent, reproducible, and privacy-preserving proof-of-concept framework linking structured literature search, item-level summary statistics, and probabilistic modelling.

2602.22352 2026-06-10 cs.AR cs.AI 版本更新

GRAU: Generic Reconfigurable Activation Unit Design for Neural Network Hardware Accelerators

GRAU:面向神经网络硬件加速器的通用可重构激活单元设计

Yuhao Liu, Salim Ullah, Akash Kumar

发表机构 * Ruhr University Bochum, Germany(博德姆鲁尔大学) Dresden University of Technology, Germany(德累斯顿技术大学) Center for Scalable Data Analytics and Artificial Intelligence (ScaDS.AI Dresden/Leipzig), Germany(可扩展数据与人工智能中心(ScaDS.AI 德累斯顿/莱比锡))

AI总结 提出基于分段线性拟合的可重构激活硬件GRAU,用2的幂近似斜率,仅需比较器和1位移位器,支持混合精度量化和SiLU等非线性函数,相比多阈值激活器减少90%以上LUT消耗,在6-8段时达到最佳权衡。

详情
AI中文摘要

随着神经网络规模的持续增长,低精度量化被广泛用于边缘加速器。经典的多阈值激活硬件需要2^n个阈值来实现$n$位输出,导致硬件成本随精度提高而迅速增加。我们提出了一种基于分段线性拟合的可重构激活硬件GRAU,其中分段斜率由2的幂近似。我们的设计仅需要基本的比较器和1位右移器,支持混合精度量化和SiLU等非线性函数。与多阈值激活器相比,GRAU减少了90%以上的LUT消耗,实现了更高的硬件效率、灵活性和可扩展性。最佳权衡通常在6-8段时实现,而在激进的低成本设置下,复杂的非线性可能会遭受更大的精度下降。

英文摘要

With the continuous growth of neural network scales, low-precision quantization is widely used in edge accelerators. Classic multi-threshold activation hardware requires 2^n thresholds for $n$-bit outputs, causing a rapid increase in hardware cost as precision increases. We propose a reconfigurable activation hardware, GRAU, based on piecewise linear fitting, where the segment slopes are approximated by powers of two. Our design requires only basic comparators and 1-bit right shifters, supporting mixed-precision quantization and nonlinear functions such as SiLU. Compared with multi-threshold activators, GRAU reduces LUT consumption by over 90%, achieving higher hardware efficiency, flexibility, and scalability. The best trade-off is usually achieved with 6-8 segments, while complex nonlinearities under aggressive low-cost settings may suffer larger accuracy degradation.

2604.12306 2026-06-10 cs.LG cs.AI 版本更新

GCA Framework: A GCC Countries-Grounded Dataset and Agentic Pipeline for Climate Decision Support

GCA框架:面向海湾合作委员会国家的数据集与气候决策支持智能体管道

Muhammad Umer Sheikh, Khawar Shehzad, Salman Khan, Fahad Shahbaz Khan, Muhammad Haris Khan

发表机构 * Mohamed Bin Zayed University of Artificial Intelligence (MBZUAI)(莫扎德人工智能大学) University of Missouri(密苏里大学) Australian National University(澳大利亚国立大学) Linköping University(林肯大学)

AI总结 提出GCA框架,包含GCC国家多模态数据集GCA-DS和工具增强型智能体GCA,通过领域微调和工具集成提升气候决策可靠性。

详情
AI中文摘要

海湾合作委员会(GCC)国家的气候决策日益需要能够将异质的科学和政策证据转化为可操作指导的系统,然而通用大语言模型(LLM)在区域特定气候知识以及与地理空间和预测工具的接地交互方面仍然薄弱。我们提出GCA框架,它统一了(i)GCA-DS,一个基于GCC国家的精选多模态数据集,以及(ii)Gulf Climate Agent(GCA),一个工具增强型气候分析智能体。GCA-DS包含20万个问答对,涵盖政府政策和适应计划、非政府组织和国际框架、学术文献以及关于热浪、沙尘暴和洪水的事件驱动报告,并辅以将图像与文本证据相结合的遥感输入。在此基础上,GCA智能体编排了一个基于实时和历史信号以及地理空间处理的模块化工具管道,生成衍生指数和可解释的可视化。最后,我们在GCC国家的气候任务上对开源和专有LLM进行了基准测试,结果表明领域微调和工具集成显著提高了相对于通用基线的可靠性。

英文摘要

Climate decision-making in the GCC states increasingly demands systems that can translate heterogeneous scientific and policy evidence into actionable guidance, yet general-purpose large language models (LLMs) remain weak both in region-specific climate knowledge and grounded interaction with geospatial and forecasting tools. We present the GCA framework, which unifies (i) GCA-DS, a curated multimodal dataset grounded in the GCC states, and (ii) Gulf Climate Agent (GCA), a tool-augmented agent for climate analysis. GCA-DS comprises 200k question--answer pairs spanning governmental policies and adaptation plans, NGO and international frameworks, academic literature, and event-driven reporting on heatwaves, dust storms, and floods, complemented with remote-sensing inputs that couple imagery with textual evidence. Building on this foundation, the GCA agent orchestrates a modular tool pipeline grounded in real-time and historical signals and geospatial processing that produces derived indices and interpretable visualizations. Finally, we benchmark open and proprietary LLMs on climate tasks in the GCC states and show that domain fine-tuning and tool integration substantially improve reliability over general-purpose baselines.

2606.03803 2026-06-10 cs.SD cs.AI eess.AS 版本更新

LiveBand: Live Accompaniment Generation in the Audio Domain

LiveBand: 音频域中的实时伴奏生成

Marco Pasini, Javier Nistal, Ben Hayes, Mathias Rose Bjare, Stefan Lattner, George Fazekas

发表机构 * University of Cambridge(剑桥大学)

AI总结 提出LiveBand系统,利用因果Transformer在预训练因果音频自编码器的连续潜空间中生成高保真伴奏,通过对抗序列级监督训练,实现实时流式生成。

详情
AI中文摘要

我们提出LiveBand,一个实时系统,能够为现场音频输入生成高保真音乐伴奏,并严格遵守因果约束。我们的方法在预训练因果音频自编码器的连续潜空间中训练因果Transformer生成器,使用来自判别器的对抗序列级监督。在每个时间步,生成器仅接收因果可用的混合上下文和高斯噪声,并预测伴奏潜变量,而无法访问未来混合帧或真实目标潜变量。训练在因果掩码下通过单个并行前向传播完成,而流式推理则通过滚动注意力状态自回归进行。模型训练和推理计算在设计中匹配,消除了教师强制及相关曝光偏差。在多乐器音乐伴奏基准测试中,LiveBand在音频质量、节拍对齐和混合一致性的客观指标上优于先前工作,同时能够在消费级硬件上实现无需前瞻的实时流式生成。

英文摘要

We present LiveBand, a real-time system that generates high-fidelity music accompaniments to live audio input, respecting strict causal constraints. Our method trains a causal transformer generator in the continuous latent space of a pre-trained causal audio autoencoder, using adversarial sequence-level supervision from a discriminator. At each timestep, the generator receives only the causally available mix context and Gaussian noise, and predicts accompaniment latents without access to future mix frames or ground-truth target latents. Training is performed in a single parallel forward pass under causal masking, while streaming inference proceeds autoregressively with a rolling attention state. The model's training and inference computations are matched by design, eliminating teacher forcing and the associated exposure bias. On a multi-instrument music accompaniment benchmark, LiveBand improves over prior work on objective measures of audio quality, beat alignment, and mix adherence, while enabling real-time streaming generation without lookahead into the future on consumer hardware.

2606.09377 2026-06-10 cs.LG cs.AI 版本更新

Scaling Neural Network Verification with Tensor Parallelism and Fully Sharded Data Parallelism

利用张量并行和全分片数据并行扩展神经网络验证

Sergei Vorobyov, Eugene Ilyushin

发表机构 * Lomonosov Moscow State University(莫斯科国立大学) Central University(中央大学)

AI总结 针对神经网络形式化验证中GPU内存瓶颈,将张量并行(TP)和全分片数据并行(FSDP)适配到auto_LiRPA/α,β-CROWN框架,TP实现约2倍峰值内存降低但边界紧度下降,FSDP实现80-90%基础内存降低且边界与单GPU逐位一致,并支持完整验证和卷积层。

详情
AI中文摘要

形式化神经网络验证——证明网络对于指定域内所有输入满足安全属性——在实践中受限于GPU内存:边界传播算法(IBP、CROWN、α-CROWN)的标准实现要求权重矩阵和松弛系数矩阵完全驻留在单个加速器上。我们将最初为大规模模型训练开发的两种并行技术适配到auto_LiRPA/α,β-CROWN验证框架。张量并行(TP)将权重矩阵和A矩阵分片到多个GPU上,在P=2时实现约2倍的峰值内存降低;在VNN-COMP 2022 MNIST-FC基准测试上确认了正确性,但由于分片区域内中间边界被迫使用IBP替代,边界紧度随分片区域数量增加而下降。全分片数据并行(FSDP)仅对权重矩阵进行分片,并逐层使用AllGather,产生的边界与单GPU基线逐位一致:在宽MLP上,基础内存降低80-90%,峰值内存降低34-39%。FSDP与完整验证(β-CROWN + 分支定界)和卷积层(BoundConv)无缝集成;在FSDP下,CIFAR-100 ResNet-large(VNN-COMP 2024)获得了完整的不可满足结果。在所有实验中,α-CROWN+BaB模式下的内存瓶颈被证明是每个神经元的alpha张量,而非权重矩阵,这指出了未来工作的关键方向。

英文摘要

Formal neural network verification -- proving that a network satisfies safety properties for *all* inputs in a specified domain -- is bounded in practice by GPU memory: standard implementations of bound-propagation algorithms (IBP, CROWN, $α$-CROWN) require weight and relaxation-coefficient matrices to reside entirely on one accelerator. We adapt two parallelism techniques originally developed for large-scale model training to the auto_LiRPA / $α,β$-CROWN verification framework. Tensor Parallelism (TP) shards both weight and $A$-matrices across GPUs, achieving ${\approx}2\times$ peak-memory reduction at $P{=}2$; soundness is confirmed on VNN-COMP 2022 MNIST-FC benchmarks, though bound tightness degrades with the number of sharded zones due to forced IBP substitution for intermediate bounds inside sharded zones. Fully Sharded Data Parallelism (FSDP) shards only weight matrices with a per-layer AllGather, producing bounds that are bitwise identical to the single-GPU baseline: baseline memory drops by 80--90%, peak memory by 34--39% on wide MLPs. FSDP integrates cleanly with complete verification ($β$-CROWN + Branch-and-Bound) and with convolutional layers (BoundConv); a complete unsat result is obtained for CIFAR-100 ResNet-large (VNN-COMP 2024) under FSDP. Across all experiments the memory bottleneck in $α$-CROWN+BaB mode proves to be per-neuron alpha tensors, not weight matrices, pointing to the key direction for future work.

2606.09677 2026-06-10 eess.AS cs.AI 版本更新

MeCo: One-Step MeanFlow-based Corrector for Multi-Channel Speech Separation

MeCo: 基于MeanFlow的一步校正器用于多通道语音分离

Dohwan Kim, Jung-Woo Choi

发表机构 * School of Electrical Engineering, KAIST(韩国成均馆大学电气工程学院)

AI总结 提出MeCo,一种基于MeanFlow的一步生成式校正器,通过数据空间优化联合训练生成目标与信号保真度,在极低计算开销下同时提升信号保真度和人耳听觉质量。

Comments 5 pages, accepted to Interspeech 2026

详情
AI中文摘要

虽然用于多通道语音分离的判别模型在基于参考的指标上表现出色,但它们通常表现出次优的人耳听觉质量。为了解决这个问题,我们提出了一种新颖的基于MeanFlow的一步生成式校正器(MeCo)。MeCo学习一个条件平均速度场,以一步方式将判别估计直接映射到干净语音流形上。为了最大化一步生成性能,我们引入了数据空间优化(DSO)。DSO集成了一个$\mathbf{x}_r$损失,该损失惩罚较长位移间隔上的预测误差,作为人耳听觉质量的生成目标,以及一个端点SI-SDR损失,直接优化终端信号保真度。实验表明,MeCo以最小的计算开销实现了最先进的性能,在域内和域外场景中同时实现了卓越的信号保真度和人耳听觉质量。

英文摘要

While discriminative models for multi-channel speech separation excel in reference-based metrics, they often exhibit suboptimal human listening quality. To address this, we propose a novel MeanFlow-based one-step generative corrector (MeCo). MeCo learns a conditional average velocity field to map discriminative estimates directly onto the clean speech manifold in a single step. To maximize one-step generation performance, we introduce Data-Space Optimization (DSO). DSO integrates an $\mathbf{x}_r$-loss, which penalizes prediction errors on longer displacement intervals to serve as a generative objective for human listening quality, with an Endpoint SI-SDR loss that directly optimizes terminal signal fidelity. Experiments demonstrate that MeCo achieves state-of-the-art (SOTA) performance with minimal computational overhead, simultaneously achieving superior signal fidelity and human listening quality in both in-domain and out-of-domain scenarios.

11. 其他/综合AI 30 篇

2606.10086 2026-06-10 cs.AI 新提交

Exploratory Responsiveness and Adaptive Rigidity under AI-Assisted Optimization

AI辅助优化下的探索响应性与适应性刚性

Balaraju Battu

AI总结 本文提出AI辅助优化下的探索适应理论,通过动态框架分析预测辅助如何影响系统探索响应性,揭示收敛预测机制导致适应性降低、刚性增强,而探索增强机制则促进适应性。

详情
AI中文摘要

本文发展了AI辅助优化下的探索适应理论。核心论点是,AI系统的长期适应效应关键取决于预测辅助如何与探索响应性本身相互作用。我们使用一个动态框架形式化这一机制,其中认知、制度和技术系统在由多个局部强化配置构成的崎岖认知景观上演化。模型中的一个核心状态变量是适应响应性,它衡量系统在不断变化的条件下穿越不熟悉的概念和制度轨迹的能力。在收敛预测机制下,AI系统替代探索参与,降低适应响应性,并产生亚稳态陷阱、滞后、过早收敛和探索崩溃动力学,使系统局部高效但全局刚性。该框架还识别出对比的探索增强机制,其中AI系统放大探索搜索、概念穿越和适应流动性。因此,有效替代参数是响应性依赖的:拥有弱探索例程的系统更容易受到探索替代,而已经拥有高适应响应性的系统可能利用AI辅助在崎岖景观上扩展探索流动性。因此,AI的长期适应效应不仅取决于AI能力本身,还取决于制度结构、发展背景和人机交互架构。

英文摘要

This paper develops a theory of exploratory adaptation under AI-assisted optimization. The central argument is that the long-run adaptive effects of AI systems depend critically on how predictive assistance interacts with exploratory responsiveness itself. We formalize this mechanism using a dynamical framework in which cognitive, institutional, and technological systems evolve over rugged epistemic landscapes characterized by multiple locally reinforced configurations. A central state variable in the model is adaptive responsiveness, which measures the capacity of a system to traverse unfamiliar conceptual and institutional trajectories under changing conditions. Under convergent predictive regimes, AI systems substitute for exploratory engagement, reducing adaptive responsiveness and generating metastable trapping, hysteresis, premature convergence, and exploration-collapse dynamics in which systems become locally efficient but globally rigid. The framework also identifies contrasting exploration-enhancing regimes in which AI systems amplify exploratory search, conceptual traversal, and adaptive mobility. The effective substitution parameter is therefore responsiveness-dependent: systems possessing weak exploratory routines are more vulnerable to exploratory substitution, whereas systems already possessing high adaptive responsiveness may use AI assistance to expand exploratory mobility across rugged landscapes. The long-run adaptive effects of AI consequently depend not only on AI capability itself, but also on institutional structure, developmental context, and the architecture of human-machine interaction.

2606.10094 2026-06-10 cs.AI 新提交

Predictive Assistance and the Temporal Dynamics of Exploratory Compression

预测性辅助与探索性压缩的时间动态

Balaraju Battu

AI总结 提出几何动力学框架,研究预测性AI如何通过外源探索性压缩改变认知探索的时间动态,发现持续稳定会降低探索响应性、曲率不对称积累导致滞后效应、早期干预限制后续探索多样性。

详情
AI中文摘要

经典认知理论将问题解决描述为通过结构化问题空间的探索性搜索,其中重复交互逐渐将搜索压缩为高效的表征结构。预测性人工智能系统引入了一种独特的机制,在这种机制中,稳定可能在探索性多样化展开之前发生,在内部生成搜索之前提供解决方案和决策轨迹。本文发展了一个几何动力学框架,其中注意力在由稳定漂移、内源探索性扰动和响应性门控学习塑造的策略景观上演化。预测性辅助被建模为外源探索性压缩的过程,在自生成探索拓宽策略空间的可达区域之前稳定轨迹。该框架产生三个主要结果。首先,持续的预测性稳定通过减弱内源扰动的有效影响来降低探索响应性,即使探索变异性仍然存在。其次,曲率不对称地积累和松弛,产生滞后效应和辅助撤除后探索移动性的延迟恢复。第三,发展结果关键取决于稳定的时机,早期干预在广泛的表征多样化发生之前缩小未来的探索遍历。该框架产生了关于探索熵、过早收敛和预测稳定后延迟恢复的经验可检验预测。更广泛地说,结果表明预测系统可能重塑探索性认知本身的几何结构。

英文摘要

Classical theories of cognition describe problem solving as exploratory search through structured problem spaces in which repeated interaction gradually compresses search into efficient representational structures. Predictive artificial intelligence systems introduce a distinct regime in which stabilization may occur before exploratory diversification unfolds, supplying solutions and decision trajectories prior to internally generated search. This paper develops a geometric dynamical framework in which attention evolves over a landscape of strategies shaped by stabilizing drift, endogenous exploratory perturbation, and responsiveness-gated learning. Predictive assistance is modeled as a process of exogenous exploratory compression that stabilizes trajectories before self-generated exploration broadens the accessible regions of strategy space. The framework yields three main results. First, sustained predictive stabilization reduces exploratory responsiveness by attenuating the effective influence of intrinsic perturbations even when exploratory variability remains present. Second, curvature accumulates and relaxes asymmetrically, producing hysteresis and delayed recovery of exploratory mobility after assistance withdrawal. Third, developmental outcomes depend critically on the timing of stabilization, with early intervention narrowing future exploratory traversal before broad representational diversification has occurred. The framework generates empirically testable predictions concerning exploratory entropy, premature convergence, and delayed recovery following predictive stabilization. More broadly, the results suggest that predictive systems may reshape the geometry of exploratory cognition itself.

2606.10413 2026-06-10 cs.AI 新提交

Soul Computing: A Theoretical Framework and Technical Architecture for Intelligent Agents with Independent Consciousness

灵魂计算:具有独立意识的智能体的理论框架与技术架构

Jinshan Zhang, Xishi Zhou, Qiu Peng, Jianwei Yin

发表机构 * Innovation and Management Center, School of Software Technology, Zhejiang University (Ningbo)(浙江大学(宁波)软件学院创新与管理中心) School of Software Technology, Zhejiang University, Ningbo(浙江大学软件学院(宁波))

AI总结 本文提出“灵魂计算”范式,区分狭义与广义概念,构建以意向性核心为特征的智能体架构,实现AI从工具到生命体的转变。

详情
AI中文摘要

大语言模型和多模态生成技术的突破,推动了人类心理特征、情感模式和长期记忆的数字重建从科幻走向工程实践。然而,当前AI与数字人交叉领域的研究和行业实践仍受制于基本概念模糊:新一代智能体与传统虚拟人的本质区别、具有自我认同的数字实体的构建路径,以及该领域面临的核心技术和伦理挑战,均亟待澄清。本文系统审视了在前沿AI技术驱动下,从传统虚拟人到“灵魂计算”范式的转型逻辑。我们首先分析人类意识和记忆机制的演化模式,重新评估海量多模态数字碎片在个体精神世界逆向重建中的核心价值。在此基础上,首次正式界定狭义和广义灵魂计算的学术内涵,阐明其学术边界以及与情感计算、历史重建和凡人计算的根本区别。我们认为,灵魂计算系统必须在架构上构建“内涵”核心,而非作为纯粹的“外延”功能载体,从而推动AI从工具性向生命体的根本转变。

英文摘要

Breakthroughs in large language models and multimodal generation technologies have propelled the digital reconstruction of human mental traits, emotional patterns, and long-term memory from science fiction toward engineering practice. Yet current research and industry practices at the intersection of AI and digital humans remain hampered by fundamental conceptual ambiguities: the essential differences between next-generation intelligent agents and traditional virtual humans, the construction pathways for digital entities possessing self-identity, and the core technical and ethical challenges confronting this domain all demand urgent clarification. This paper systematically examines the transformative logic underlying the transition from traditional virtual humans to the ``Soul Computing'' paradigm, driven by frontier AI technologies. We first analyze the evolutionary patterns of human consciousness and memory mechanisms, reassessing the core value of massive multimodal digital fragments in the reverse reconstruction of individual mental worlds. On this basis, we formally delineate the academic connotations of narrow and broad Soul Computing for the first time, clarifying its academic boundaries and essential distinctions from Affective Computing, Historical Reconstruction, and Mortal Computation. We argue that Soul Computing systems must architecturally construct an ``Intensional'' core rather than serving as purely ``Extensional'' functional carriers, thereby enabling the fundamental transition of AI from toolhood to living agency.

2606.11016 2026-06-10 cs.AI 新提交

Superficial Beliefs in LLM Decision-Making

LLM决策中的表面信念

Gabriel Freedman, Francesca Toni

发表机构 * Department of Computing, Imperial College London(帝国理工学院计算系)

AI总结 研究大型语言模型在二元选择中是否仅模仿理由,通过合成决策实验发现模型行为有系统性但自我报告与行为推断的驱动因素不完全一致,表明存在“表面信念”。

Comments Under review

详情
AI中文摘要

我们询问大型语言模型(LLMs)在选择两个选项时是否仅仅模仿理由,或者它们的选择是否反映了系统性的潜在决策结构。使用合成二元决策设置,其中模型在由分级属性定义的档案之间进行选择,我们将模型声称最重要的属性与在拟合先前决策的行为模型下最能解释其选择的属性进行比较。行为模型很好地预测了保留的选择,表明模型行为与可见属性系统相关,而非随机。然而,直接自我报告和单独的基于分数的判断仅部分恢复了行为推断的驱动因素。由此产生的图景既不是任意行为,也不是完全明确的信念——输出结构足以支持预测,但显式理由仅不完美地追踪恢复的驱动因素。这种定性模式在提示顺序和采样扰动、替代行为模型、目标遮挡分析以及结构变化的决策设置中持续存在。我们将其解释为LLM决策中“表面信念”的证据:模型表现得好像由属性上的概率性局部优先级引导,同时对其决策驱动的属性只有有限的语言访问。

英文摘要

We ask whether large language models (LLMs) merely imitate rationales when choosing between two options, or whether their choices reflect a systematic underlying decision structure. Using synthetic binary decision settings in which models choose between profiles defined by graded attributes, we compare the attribute a model says mattered most with the attribute that best explains its choice under a behavioural model fit to prior decisions. The behavioural model predicts held-out choices well, showing that model behaviour is systematically related to the visible attributes rather than being random. However, direct self-reports and a separate score-based judge recover the behaviourally inferred driver only partially. The resulting picture is neither one of arbitrary behaviour nor one of fully articulated belief - outputs are structured enough to support prediction, but explicit reasons track the recovered driver only imperfectly. This qualitative pattern persists across prompt-order and sampling perturbations, alternative behavioural models, targeted occlusion analyses, and structurally varied decision settings. We interpret this as evidence for ``superficial belief'' in LLM decision-making: models behave as if guided by probabilistic local priorities over attributes, while having only limited verbal access to the attributes that drive their decisions.

2606.09839 2026-06-10 cs.HC cs.AI cs.CY 交叉投稿

Aesthetic Perspectives in Information Systems Research: A Hermeneutic Analysis

信息系统研究中的美学视角:一项诠释学分析

Angelina Chen, Rick Sullivan, Raffaele F Ciriello

发表机构 * University of Sydney(悉尼大学) HEC Montréal(蒙特利尔HEC)

AI总结 通过诠释学文献分析,揭示信息系统研究中四种隐含的美学视角(模仿、感官体验、世界构建、政治行动),阐明它们如何构成认知基础设施并影响研究问题与方法,以算法管理和数字中介亲密性为例展示其价值。

Comments Thirty-Fourth European Conference on Information Systems (ECIS 2026), Milan, Italy

详情
AI中文摘要

隐含的美学视角可能如何影响信息系统(IS)学术界认可(或不认可)哪些研究值得研究?在这项诠释学文献分析中,我们揭示了支撑IS研究的基础性美学假设。我们识别出四种视角(作为模仿的美学、作为感官体验的美学、作为世界构建的美学、作为政治行动的美学),它们指导IS学者如何感知和欣赏社会技术现象。这些视角影响什么成为可识别的合法研究,什么仍未被看见。通过明确美学假设,我们展示了它们如何构成限制探究视野的认知基础设施。我们将这一框架应用于算法管理和数字中介亲密性,揭示了替代视角如何开辟新的研究问题,同时暴露了主流框架所忽视的维度。本分析强调了美学哲学对IS文献的重要性,为阐述美学视角如何塑造理论化、方法和贡献提供了词汇。

英文摘要

How might implicit aesthetic perspectives shape what Information Systems (IS) scholarship recognises as worthy of study (or not)? In this hermeneutic literature analysis, we surface foundational aesthetic assumptions underpinning IS research. We identify four perspectives (aesthetics as imitation, sensory experience, world-making, and political doing) that guide how IS scholars perceive and appreciate sociotechnical phenomena. These perspectives influence what becomes recognisable as legitimate research and what remains unseen. By making aesthetic assumptions explicit, we show how they form epistemic infrastructure that conditions horizons of inquiry. We apply this framework to algorithmic management and digitally mediated intimacy, revealing how alternative perspectives open new research questions whilst exposing dimensions that dominant framings overlook. This analysis foregrounds the importance of aesthetic philosophy to IS literature, offering a vocabulary for articulating how aesthetic perspectives shape theorising, method, and contribution.

2606.09922 2026-06-10 cs.IT cs.AI math.IT 交叉投稿

The Bioelectrical Information Theory: Investigating the theoretical compression limit of bioelectrical signals under artificial intelligence

生物电信息论:探究人工智能下生物电信号的理论压缩极限

Jiawen Zou, Bo Yan

发表机构 * College of Computer Science and Artificial Intelligence(计算机科学与人工智能学院) Shanghai Key Laboratory of Intelligent Information Processing(上海智能信息处理重点实验室) Fudan University(复旦大学)

AI总结 提出生物电压缩的三级层次框架,将压缩极限重构为模型和任务条件量,而非波形的固定属性。

详情
AI中文摘要

生物电信号正以挑战脑机接口带宽的规模被采集。然而,它们的压缩仍常被框定为波形保真问题,受限于原始信号的熵。本文提出一个信息论框架,其中生物电数据的有效信息不仅由信号保真度决定,还由生理结构、模型容量和下游任务需求决定。我们将生物电压缩表述为三级层次。在信号层面,噪声被降低至它们关于潜在生理源所携带的信息。在生理层面,参数化编码器将净化后的信号映射为紧凑、结构化且量化的表示。在语义层面,任务无关信息被丢弃,而深度学习模型利用因果依赖关系,用条件熵替代边际熵。这一视角将生物电信号的压缩极限重构为模型和任务条件量,而非波形的固定属性。随着表达能力日益增强的模型与神经和生理接口集成,生物电压缩可能从传输信号转变为仅传输任务级解释所需的残差信息。

英文摘要

Bioelectrical signals are increasingly acquired at scales that challenge the bandwidth of brain-computer interfaces. However, their compression is still often framed as a problem of waveform preservation, limited by the entropy of the raw signal. Here we propose an information-theoretic framework in which the effective information of bioelectrical data is determined not only by signal fidelity, but also by physiological structure, model capacity and downstream task requirements. We formulate bioelectrical compression as a three-level hierarchy. At the signal level, noise is reduced to the information they carry about latent physiological sources. At the physiological level, parametric encoders map purified signals into compact, structured and quantized representations. At the semantic level, task-irrelevant information is discarded, while deep learning models exploit causal dependencies to replace marginal entropy with conditional entropy. This perspective reframes the compression limit of bioelectrical signals as a model- and task-conditioned quantity rather than a fixed property of the waveform. As increasingly expressive models become integrated with neural and physiological interfaces, bioelectrical compression may shift from transmitting signals to transmitting only the residual information required for task-level interpretation.

2606.10458 2026-06-10 cs.IT cs.AI math.IT math.OC math.ST stat.TH 交叉投稿

Minimum Distortion Quantization with Specified Output Distribution

指定输出分布的最小失真量化

Aolin Xu

发表机构 * Aolin Xu(徐澳林)

AI总结 本文推导了在输出分布指定条件下最小化均方误差的最优量化器,形式为X=σ(F_{σ^{-1}(X)}^{-1}(F_W(W))),并证明了在均匀分布下简化为X=F_X^{-1}(F_W(W)),主要贡献在于通过优化排列和累积分布函数实现最小失真。

详情
AI中文摘要

我们推导了实值随机变量 $W$(分布为 $P_W$)的最优量化器,使得 1) 量化输出 $X$(可取 $k$ 个值)的分布遵循 $\{1,\ldots,k\}$ 上的任意指定分布 $P_X$,且 2) 从 $X$ 估计 $W$ 的最小均方误差 (MMSE) 最小化。结果表明,最优量化器形式为 $X=\sigma\big(F_{\sigma^{-1}(X)}^{-1}(F_W(W))\big)$,其中 $\sigma$ 是 $\{1,\ldots,k\}$ 上所有排列中使 MMSE 最小的最优排列,$F$ 为累积分布函数。当 $P_W$ 在区间上均匀分布或 $P_X$ 在 $\{1,\ldots,k\}$ 上均匀分布时,量化器简化为 $X=F_{X}^{-1}(F_W(W))$。优超概念在最优性证明中起关键作用。指定输出分布有助于设计具有显式控制输出熵、最大化输入输出互信息、定制输出分布以匹配通信信道输入要求以及数据匿名化的量化器。

英文摘要

We derive the optimal quantizer of a real-valued random variable $W$ with distribution $P_W$ such that 1) the distribution of the quantization output $X$ that can take $k$ values follows any specified distribution $P_X$ over $\{1,\ldots,k\}$, and 2) the minimum mean squared error (MMSE) of estimating $W$ from $X$ is minimized. It is shown that the optimal quantizer takes the form $X=σ\big(F_{σ^{-1}(X)}^{-1}(F_W(W))\big)$, where $σ$ is the optimal permutation of $\{1,\ldots,k\}$ among all permutations to minimize the MMSE, and $F$ is the cumulative distribution function. When $P_W$ is uniform over an interval or $P_X$ is uniform over $\{1,\ldots,k\}$, the quantizer takes a simple form $X=F_{X}^{-1}(F_W(W))$. The concept of majorization plays a key role in the optimality proof. Specifying the output distribution is useful for designing quantizers with explicitly controlled output entropy, maximized mutual information between input and output, tailored output distribution to match channel input requirements for communication, and data anonymization.

2606.10937 2026-06-10 cs.DB cs.AI 交叉投稿

Provenance Tracking in AI Compilers through the Lens of Coalgebra

通过余代数视角追踪AI编译器中的来源

Zilu Tian, Liying Liu

发表机构 * OmniVision Technology Singapore(奥米视觉技术(新加坡)) Black Sesame Technology Singapore(黑 sesame 技术(新加坡))

AI总结 针对AI编译器中图重写导致来源难以追踪的问题,提出基于观测语义的轻量级方法,利用余代数和互模拟形式化,并在原型编译器COVAN中验证。

详情
AI中文摘要

AI编译器通过规范化、降级和优化积极重写计算图,使得跨编译追踪张量和运算符的来源变得困难。可靠的来源对于附加平台特定的后处理、调试编译器行为以及验证变换至关重要,然而现有解决方案在非单射图重写下要么是侵入式的,要么是特设的。我们提出了一种基于观测语义的轻量级生成式方法来追踪来源。我们不通过编译器传递传播标识符,而是观测图变换并根据可观测的计算行为推理来源。我们使用余代数模型和互模拟形式化了这种方法,即使中间节点被消除,也能保留来源。此外,我们在原型AI编译器COVAN中实现了该方法,展示了在编译流水线中稳定的来源追踪,且工程开销最小。

英文摘要

AI compilers aggressively rewrite computation graphs through normalization, lowering, and optimization, making it difficult to track the provenance of tensors and operators across compilation. Reliable provenance is essential for attaching platform-specific postprocessing, debugging compiler behavior, and validating transformations, yet existing solutions are either invasive or ad hoc under non-injective graph rewrites. We present a lightweight, generative approach to provenance tracking based on observational semantics. Instead of propagating identifiers through compiler passes, we observe graph transformations and reason about provenance in terms of observable computational actions. We formalize this approach using a coalgebraic model and bisimulation, which preserves provenance even when intermediate nodes are eliminated. Furthermore, we implement this approach in a prototype AI compiler COVAN, demonstrating stable provenance across compilation pipelines with minimal engineering overhead.

2606.01929 2026-06-10 cs.AI 版本更新

VET: A Framework for Analyzing AI Discourse

VET:一个分析AI话语的框架

Meredith Ringel Morris

发表机构 * The Paul G. Allen School of Computer Science & Engineering and The Information School(保罗·G·艾伦计算机科学与工程学院和信息学院)

AI总结 本文提出VET框架,通过效价、有效性和轨迹三个维度对AI话语进行分类,用于识别和批判AI炒作、AI末日、AI否认和AI常态四种立场,并作为AI素养工具支持对极化话语的审视。

详情
Journal ref
In Proceedings of the Paris Institute for Advanced Study (Vol. 21) 2026
AI中文摘要

关于AI的公共话语已经变得极化;传统和社交媒体上对AI的夸大立场威胁着公众AI素养的发展。在本文中,我介绍了VET框架,这是一种沿效价、有效性和轨迹维度对AI话语进行分类的方法。我展示了如何使用该框架来识别、比较和批判AI炒作、AI末日、AI否认和AI常态的普遍叙事。利用VET,我分析了这四种立场如何夸大了当前状态和/或AI可能演变的某些方面,并说明了VET框架如何通过支持对极化AI话语的“审视”来作为AI素养工具。

英文摘要

Public discourse on AI has become polarized; exaggerated positions on AI in traditional and social media threaten the development of AI Literacy among the general public. In this article, I introduce the VET Framework, a method for categorizing AI discourse along the dimensions of valence, effectiveness, and trajectory. I show how this framework can be used to identify, compare, and critique prevalent narratives of AI Hype, AI Doom, AI Denial, and AI Normalcy. Using VET, I analyze how each of these four stances exaggerates some aspects of the current state and/or likely evolution of AI, and illustrate how the VET framework can serve as an AI Literacy tool by supporting the ``vetting'' of polarized AI discourse.

2601.16700 2026-06-10 cs.SE cs.AI cs.ET cs.HC 版本更新

Adoption of Generative Artificial Intelligence in the German Software Engineering Industry: An Empirical Study

生成式人工智能在德国软件工程行业的采用:一项实证研究

Ludwig Felder, Tobias Eisenreich, Mahsa Fischer, Stefan Wagner, Chunyang Chen

发表机构 * Technical University of Munich(慕尼黑技术大学) Heilbronn University of Applied Science(海德堡应用科学大学)

AI总结 通过混合方法研究德国软件工程师采用生成式AI工具的情况,发现经验水平调节感知收益,组织规模影响工具选择和使用强度,项目上下文意识不足是主要障碍。

Comments Accepted at FSE '26

详情
AI中文摘要

生成式人工智能(GenAI)工具在软件开发人员中迅速普及。尽管行业采用率正在上升,但影响这些工具有效使用的潜在因素,包括交互深度、组织约束和经验相关考虑,尚未得到彻底调查。这个问题在监管要求严格的环境中尤为突出,例如德国,从业者必须应对GDPR和欧盟AI法案,同时平衡生产力提升与知识产权考虑。尽管GenAI对软件工程产生了重大影响,但据我们所知,尚无实证研究系统性地考察德国背景下GenAI工具的采用动态。为填补这一空白,我们提出了一项关于德国软件工程师采用GenAI的全面混合方法研究。具体而言,我们进行了18次探索性访谈,随后对109名参与者进行了开发者调查。我们分析了工具采用模式、提示策略以及影响有效性的组织因素。结果表明,经验水平调节了GenAI工具的感知收益,并且生产力提升在开发人员之间并非均匀分布。此外,组织规模影响工具选择和使用强度。项目上下文意识有限被确定为最显著的障碍。我们总结了一系列对开发者、组织和工具供应商具有可操作性的启示,以推进人工智能辅助的软件开发。

英文摘要

Generative artificial intelligence (GenAI) tools have seen rapid adoption among software developers. While adoption rates in the industry are rising, the underlying factors influencing the effective use of these tools, including the depth of interaction, organizational constraints, and experience-related considerations, have not been thoroughly investigated. This issue is particularly relevant in environments with stringent regulatory requirements, such as Germany, where practitioners must address the GDPR and the EU AI Act while balancing productivity gains with intellectual property considerations. Despite the significant impact of GenAI on software engineering, to the best of our knowledge, no empirical study has systematically examined the adoption dynamics of GenAI tools within the German context. To address this gap, we present a comprehensive mixed-methods study on GenAI adoption among German software engineers. Specifically, we conducted 18 exploratory interviews with practitioners, followed by a developer survey with 109 participants. We analyze patterns of tool adoption, prompting strategies, and organizational factors that influence effectiveness. Our results indicate that experience level moderates the perceived benefits of GenAI tools, and productivity gains are not evenly distributed among developers. Further, organizational size affects both tool selection and the intensity of tool use. Limited awareness of the project context is identified as the most significant barrier. We summarize a set of actionable implications for developers, organizations, and tool vendors seeking to advance artificial intelligence (AI) assisted software development.

2606.00038 2026-06-10 cs.CY cs.AI 版本更新

Beyond Tool Adoption: A Practical Five-Stage Developmental Continuum for AI Literacy in Higher Education

超越工具采纳:高等教育中人工智能素养的实用五阶段发展连续体

J. Paul Liu, Rachel Levy

发表机构 * Dept of Marine, Earth, and Atmospheric Sciences(海洋、地球与大气科学系) AI Hub for Science(科学人工智能中心) Center of Geospatial Analytics(地理空间分析中心) Data Science and AI Academy(数据科学与人工智能学院) Department of Mathematics(数学系) North Carolina State University(北卡罗来纳州立大学)

AI总结 本文提出一个五阶段AI素养连续体模型,帮助教育者诊断和引导学生从回避或盲目使用AI,逐步发展为批判性评估和改进AI应用的能力。

Comments 26 pages, 5 tables, 2 figures, 1 Supplementary Table

详情
AI中文摘要

人工智能(AI)素养日益被认为是所有大学毕业生应具备的基础能力。然而,学生与AI工具的互动往往集中在两个有问题的极端:因恐惧、不信任、伦理担忧或缺乏访问权限而回避,以及不加批判地依赖,产生流畅的输出却掩盖了误解。现有的AI素养框架提供了有价值的定义,但大多数在诊断学习者起点以及如何向负责任、批判性参与进步方面提供的指导有限。本文提出了一个五阶段AI素养连续体——1) 尚未参与,2) 不加批判地使用,3) 知情使用,4) 批判性评估,5) 改进——描述了高等教育中AI使用的发展取向。该连续体补充了维度框架,为教育者提供了实用的诊断和教学路径,与包括联合国教科文组织和经合组织在内的国际框架保持一致。我们介绍了来自北卡罗来纳州立大学的一个基于设计的实施案例,其中学分课程和密集实践工作坊在2024年秋季至2026年春季期间吸引了超过330名参与者。由于实施未使用经过验证的前/后测试工具或对照组,我们将发现视为基于观察和实践:参与者表现出从非参与或不加批判地使用向知情参与转变的行为,而持续且嵌入学科的经验则产生了更强的批判性评估和改进导向实践的证据。我们讨论了课程路径、公平性考量、评估策略,并认为AI素养不应仅被理解为工具采纳,而应被理解为在学科和社会背景下理解、评估和负责任地应用AI系统的发展能力。

英文摘要

Artificial intelligence (AI) literacy is increasingly recognized as a foundational competency for all university graduates. Yet students' engagement with AI tools often clusters at two extremes: avoidance driven by fear, mistrust, ethical concern, or lack of access, and uncritical reliance that produces fluent output while masking misunderstanding. Existing AI literacy frameworks provide valuable competency definitions, but most offer limited guidance for diagnosing where learners begin and how they progress toward responsible, critical engagement. This paper proposes a five-stage AI Literacy Continuum: 0) Not Yet Engaged, 1) Uncritical Use, 2) Informed Use, 3) Critical Evaluation, and 4) Improvement --that describes developmental orientations toward AI use in higher education. The continuum complements dimensional frameworks by providing educators with a practical diagnostic and instructional pathway aligned with international frameworks, including UNESCO and OECD. We present a design-based implementation case from North Carolina State University, where credit-bearing courses and intensive hands-on workshops engaged more than 330 participants between Fall 2024 and Spring 2026. Because the implementation did not use a validated pre/post instrument or comparison group, we frame the findings as observational and practice-based: participants exhibited behaviors consistent with movement from non-engagement or uncritical use toward informed engagement, while sustained and discipline-embedded experiences produced stronger evidence of critical evaluation and improvement-oriented practice. We discuss curricular pathways, opportunity considerations, assessment strategies, and argue that AI literacy should be understood not as tool adoption alone but as a developmental capacity to understand, evaluate, and responsibly apply AI systems in disciplinary and societal contexts.

2606.04833 2026-06-10 cs.LG cs.AI 版本更新

Signed Dual Attention: Capturing Signed Dependencies in Time Series Forecasting

符号双注意力:在时间序列预测中捕捉符号依赖关系

Balthazar Courvoisier, Tristan Cazenave

发表机构 * Queensfield AI Technologies

AI总结 提出符号双注意力机制,通过双消息传递方案同时捕捉正负依赖关系,无需额外参数,提升时间序列预测性能。

Comments 5 pages, 3 figures, accepted at AAAI 2026 AI4TS Workshop

详情
AI中文摘要

最初为自然语言处理开发的Transformer架构和注意力机制,现在已成为各种深度学习模型的核心,包括时间序列预测应用。然而,标准注意力机制隐含地假设同质性交互,限制了其对具有正负依赖关系(如时间序列)的数据建模能力。在这项工作中,我们引入了符号双注意力,一种新颖的注意力公式,无需额外参数即可捕捉正负关系模式。通过利用受相关结构启发的双消息传递方案,符号双注意力在单个共享块内传播支持和对比信息,有效实现了两个头注意力的表达能力而无需额外参数。该模块可以无缝集成到现有架构中,并在需要符号关系建模的某些情况下带来性能提升。这种方法为构建更具表达力和参数效率的Transformer开辟了道路。

英文摘要

Initially developed for natural language processing, Transformer architectures and attention mechanisms are now central to a wide range of deep learning models, including applications in time series forecasting. A standard attention mechanism, however, implicitly assumes homophilic interactions, limiting its ability to model data with positive and negative dependencies, such as time series. In this work, we introduce the Signed Dual Attention, a novel attention formulation that captures both positive and negative relational patterns without additional parameters. By leveraging a dual message-passing scheme inspired by correlation structures, Signed Dual Attention propagates both supportive and contrastive information within a single shared block, effectively achieving the expressiveness of two head attention without additional parameters. This module can be seamlessly integrated into existing architectures and can yield performance gains in certain situations, requiring signed relational modeling. This approach opens a pathway toward more expressive and parameter-efficient transformers.

2509.04154 2026-06-10 cs.LG cs.AI 版本更新

Robust Filter Attention: Self-Attention as Precision-Weighted State Estimation

鲁棒滤波注意力:自注意力作为精度加权状态估计

Peter Racioppo

发表机构 * Independent Researcher, Los Angeles, CA, USA(独立研究者,美国加利福尼亚州洛杉矶)

AI总结 提出鲁棒滤波注意力(RFA),将自注意力建模为基于线性随机微分方程的状态估计,在语言建模中实现优于RoPE的困惑度与零样本外推稳定性。

详情
AI中文摘要

我们引入鲁棒滤波注意力(RFA),一种将自注意力表述为鲁棒状态估计的方法。每个令牌被视为由线性随机微分方程(SDE)控制的潜在轨迹的带噪声观测,注意力权重由该模型下的一致性决定,而非静态特征相似性。在各向同性噪声和衰减假设下,RFA的计算复杂度与标准注意力相当。在语言建模基准上,RFA在训练窗口内实现了比RoPE更低的困惑度,同时在零样本外推到更长上下文时保持稳定。该框架还提供了标准位置机制的动力学解释,将旋转嵌入和近因偏差与随机动力学引起的传输和不确定性传播联系起来。

英文摘要

We introduce Robust Filter Attention (RFA), a formulation of self-attention as a robust state estimator. Each token is treated as a noisy observation of a latent trajectory governed by a linear stochastic differential equation (SDE), and attention weights are determined by consistency under this model rather than static feature similarity. Under isotropic noise and decay assumptions, RFA matches the computational complexity of standard attention. On language modeling benchmarks, RFA achieves lower perplexity than RoPE within the training window while remaining stable under zero-shot extrapolation to longer contexts. The framework also provides a dynamical interpretation of standard positional mechanisms, connecting rotational embeddings and recency biases to transport and uncertainty propagation induced by stochastic dynamics.

2605.14999 2026-06-10 cs.HC cs.AI cs.CY 版本更新

Towards Gaze-Informed AI Disclosure Interfaces: Eye-Tracking Attentional and Cognitive Load While Reading AI-Assisted News

迈向基于 gaze 的 AI 信息披露界面:阅读 AI 协助新闻时的注视注意力与认知负荷

Pooja Prajod, Hannes Cools, Thomas Röggla, Pablo Cesar, Abdallah El Ali

发表机构 * Centrum Wiskunde & Informatica(荷兰数学与信息研究所) University of Amsterdam(阿姆斯特丹大学) TU Delft(代尔夫特理工大学) Utrecht University(乌得勒支大学)

AI总结 研究探讨了AI信息披露对读者注意力和认知负荷的影响,发现简要披露导致更高的注视时间和眼跳次数,而详细披露无额外负担,提出基于注视的自适应信息披露设计。

详情
AI中文摘要

随着生成式AI在新闻业中的深入整合,设计有效的人工智能使用披露以在不给读者造成不必要的负担的情况下提供信息是一个关键挑战。尽管先前研究主要关注信任和可信度,但披露对读者注意力和认知负荷的影响仍被忽视。为填补这一空白,我们进行了一项3×2×2混合因子研究,操纵AI使用披露细节水平(无、一行、详细)、新闻类型(政治、生活方式)和AI的角色(编辑、部分内容生成),通过NASA-TLX和眼动追踪测量负荷。我们的结果揭示了显著的注意力成本:一行披露导致更高的注视持续时间和眼跳次数,尤其是在AI编辑内容中。详细披露未增加额外负担。基于信息间隙理论,我们认为简短标签可能通过提示读者注意AI使用而引发更高的视觉审视,但未提供足够信息。NASA-TLX分数和瞳孔直径在各条件下无显著差异,表明AI使用披露无论细节水平如何均不造成认知负担。访谈见解 contextualize 这些发现,并揭示对详细或“按需详细”设计的强烈偏好。我们的发现为基于注视的自适应信息披露界面设计提供了指导,该界面可根据读者的注意力模式和新闻上下文动态调整透明度水平。

英文摘要

As generative AI becomes increasingly integrated into journalism, designing effective AI-use disclosures that inform readers without imposing unnecessary burden is a key challenge. While prior research has primarily focused on trust and credibility, the impact of disclosures on readers' attentional and cognitive load remains underexplored. To address this gap, we conducted a $3\times2\times2$ mixed factorial study manipulating the level of AI-use disclosure detail (none, one-line, detailed), news type (politics, lifestyle), and role of AI (editing, partial content generation), measuring load via NASA-TLX and eye-tracking. Our results reveal a significant attentional cost: one-line disclosures resulted in significantly higher fixation durations and saccade counts, particularly for AI-edited content. Detailed disclosures did not impose additional burden. Drawing on Information-Gap Theory, we argue that brief labels may trigger increased visual scrutiny by alerting readers to AI use without providing enough information. NASA-TLX scores and pupil diameter showed no significant differences across conditions, suggesting that AI-use disclosures do not impose cognitive burden regardless of the detail level. Interview insights contextualize these findings and reveal a strong preference for detailed or ``detail-on-demand'' designs. Our findings inform the design of gaze-informed adaptive disclosure interfaces that dynamically adjust transparency levels based on readers' attentional patterns and news context.

2602.16898 2026-06-10 cs.RO cs.AI cs.CV cs.LG 版本更新

MALLVI: A Multi-Agent Framework for Integrated Generalized Robotics Manipulation

MALLVI:一种多智能体框架用于集成通用机器人操作

Mehrshad Taji, Arad Mahdinezhad Kashani, Iman Ahmadi, AmirHossein Jadidi, Saina Kashani, Babak Khalaj

发表机构 * Department of Electrical Engineering, Sharif University of Technology(电气工程系,谢里夫大学)

AI总结 MALLVI通过多智能体协作实现闭环反馈驱动的机器人操作,提升泛化能力和零样本任务成功率。

Comments Some fundemental change in text and codebase

详情
AI中文摘要

MALLVI通过多智能体协作实现闭环反馈驱动的机器人操作,提升泛化能力和零样本任务成功率。

英文摘要

Task planning for robotic manipulation with large language models (LLMs) is an emerging area. Prior approaches rely on specialized models, fine tuning, or prompt tuning, and often operate in an open loop manner without robust environmental feedback, making them fragile in dynamic settings. MALLVI presents a Multi Agent Large Language and Vision framework that enables closed-loop feedback driven robotic manipulation. Given a natural language instruction and an image of the environment, MALLVI generates executable atomic actions for a robot manipulator. After action execution, a Vision Language Model (VLM) evaluates environmental feedback and decides whether to repeat the process or proceed to the next step. Rather than using a single model, MALLVI coordinates specialized agents, Decomposer, Localizer, Thinker, and Reflector, to manage perception, localization, reasoning, and high level planning. An optional Descriptor agent provides visual memory of the initial state. The Reflector supports targeted error detection and recovery by reactivating only relevant agents, avoiding full replanning. Experiments in simulation and real-world settings show that iterative closed loop multi agent coordination improves generalization and increases success rates in zero shot manipulation tasks. Code available at https://github.com/iman1234ahmadi/MALLVI .

2604.05013 2026-06-10 cs.SE cs.AI 版本更新

Scaling Coding Agents via Atomic Skills

通过原子技能扩大编码代理

Yue Liu

AI总结 本文提出通过原子技能提升编码代理的新型方法,通过联合强化学习提升五个基础技能,从而提高复杂软件任务的泛化能力。

Comments We request standard withdrawal of this submission because significant errors were discovered in the data after submission, which affect the validity of the results. We may submit a corrected version later

详情
AI中文摘要

当前LLM编码代理主要在复合基准上训练,导致任务特定过拟合和泛化能力有限。为此,我们提出一种新的扩展范式,将重点从任务级优化转向原子技能掌握。我们首先正式化五个基本原子技能,即代码定位、代码编辑、单元测试生成、问题重现和代码审查,这些技能作为复杂软件工程任务的基础向量。与复合编码任务相比,这些原子技能更具通用性和可组合性。然后,我们通过联合强化学习扩展编码代理,使原子技能一致提升,而不会产生负面影响或权衡。值得注意的是,这些原子技能的改进在其他未见的复合编码任务中表现良好,如bug修复、代码重构、机器学习工程和代码安全。观察到这一现象,促使我们通过训练原子技能提出新的编码代理扩展范式。广泛实验验证了所提范式的有效性。值得注意的是,我们的联合强化学习在5个原子技能和5个复合任务上平均性能提高了18.7%。

英文摘要

Current LLM coding agents are predominantly trained on composite benchmarks (e.g., bug fixing), which often leads to task-specific overfitting and limited generalization. To address this, we propose a novel scaling paradigm that shifts the focus from task-level optimization to atomic skill mastery. We first formalize five fundamental atomic skills, code localization, code editing, unit-test generation, issue reproduction, and code review, that serve as the basis vectors for complex software engineering tasks. Compared with composite coding tasks, these atomic skills are more generalizable and composable. Then, we scale coding agents by performing joint RL over atomic skills. In this manner, atomic skills are consistently improved without negative interference or trade-offs between them. Notably, we observe that improvements in these atomic skills generalize well to other unseen composite coding tasks, such as bug-fixing, code refactoring, machine learning engineering, and code security. The observation motivates a new scaling paradigm for coding agents by training with atomic skills. Extensive experiments demonstrate the effectiveness of our proposed paradigm. Notably, our joint RL improves average performance by 18.7% on 5 atomic skills and 5 composite tasks.

2602.17547 2026-06-10 cs.AI cs.CL 版本更新

KLong: Training LLM Agent for Extremely Long-horizon Tasks

KLong:训练用于超长 horizon 任务的 LLM 代理

Yue Liu

AI总结 KLong 通过轨迹分割 SFT 和渐进式 RL 训练,解决超长 horizon 任务,实现 106B 模型在 PaperBench 上超越 Kimi K2 Thinking 11.28%。

Comments We request standard withdrawal of this submission because significant errors were discovered in the data after submission, which affect the validity of the results. We may submit a corrected version later

详情
AI中文摘要

本文介绍了KLong,一种开源的LLM代理,旨在解决超长horizon任务。其原理是首先通过轨迹分割SFT冷启动模型,然后通过渐进式RL训练进行扩展。具体而言,我们首先使用全面的SFT配方激活基础模型的基本代理能力。然后,我们引入Research-Factory,一个自动化管道,通过收集研究论文和构建评估标准来生成高质量的训练数据。利用该管道,我们从Claude 4.5 Sonnet(Thinking)中构建了数千条超长horizon轨迹。为了训练这些极长的轨迹,我们提出了一种新的轨迹分割SFT,该方法保留早期上下文,逐步截断后期上下文,并保持子轨迹之间的重叠。此外,为了进一步提高超长horizon任务解决能力,我们提出了一种新的渐进式RL,将训练分为多个阶段,逐步延长超时时间。实验表明KLong的优越性和泛化能力,如图1所示。值得注意的是,我们的KLong(106B)在PaperBench上超越Kimi K2 Thinking(1T)11.28%,且性能提升泛化到其他编码基准如SWE-bench Verified和MLE-bench。

英文摘要

This paper introduces KLong, an open-source LLM agent trained to solve extremely long-horizon tasks. The principle is to first cold-start the model via trajectory-splitting SFT, then scale it via progressive RL training. Specifically, we first activate basic agentic abilities of a base model with a comprehensive SFT recipe. Then, we introduce Research-Factory, an automated pipeline that generates high-quality training data by collecting research papers and constructing evaluation rubrics. Using this pipeline, we build thousands of long-horizon trajectories distilled from Claude 4.5 Sonnet (Thinking). To train with these extremely long trajectories, we propose a new trajectory-splitting SFT, which preserves early context, progressively truncates later context, and maintains overlap between sub-trajectories. In addition, to further improve long-horizon task-solving capability, we propose a novel progressive RL, which schedules training into multiple stages with progressively extended timeouts. Experiments demonstrate the superiority and generalization of KLong, as shown in Figure 1. Notably, our proposed KLong (106B) surpasses Kimi K2 Thinking (1T) by 11.28% on PaperBench, and the performance improvement generalizes to other coding benchmarks like SWE-bench Verified and MLE-bench.

2603.08561 2026-06-10 cs.AI 版本更新

RetroAgent: From Solving to Evolving via Retrospective Dual Intrinsic Feedback

RetroAgent: 从解决到进化 via 逆向双重内在反馈

Xiaoying Zhang, Zichen Liu, Yipeng Zhang, Xia Hu, Wenqi Shao

发表机构 * Shanghai AI Lab(上海人工智能实验室) National University of Singapore(新加坡国立大学) Independent Researcher(独立研究者)

AI总结 RetroAgent通过逆向双重内在反馈机制,使大语言模型代理在交互环境中通过持续进化而非单纯完成任务来提升性能,实现更强的适应与泛化能力。

Comments updated

详情
AI中文摘要

标准强化学习(RL)用于大语言模型(LLM)代理主要优化外在任务奖励,往往偏向于孤立任务完成而非持续适应。这种范式可能导致过早收敛到次优策略,并使有用经验仅隐式编码在模型参数中,限制其检索和重用以供未来决策。我们引入RetroAgent,一种在线RL框架,训练代理掌握交互环境不仅通过解决任务,而是通过跨回合进化。受人类逆向自我提升启发,RetroAgent将外在奖励与逆向生成的双重内在反馈相结合:(1)内在数值反馈,通过测量相对于先前尝试的子任务进展增量来奖励有益探索;(2)内在语言反馈,将成功与失败提炼成可重用的文本教训以供显式经验重用。为有效利用这些教训,我们提出Similarity & Utility-Aware Upper Confidence Bound(SimUtil-UCB),一种检索策略,平衡语义相关性、历史效用和探索。在四个具有挑战性的代理基准测试中,RetroAgent实现了新的最先进的性能,优于GRPO在ALFWorld上提升18.3%、WebShop上提升15.4%、Sokoban上提升27.1%、MineSweeper上提升8.9%,同时展现出强大的测试时间适应性和分布外泛化能力。

英文摘要

Standard reinforcement learning (RL) for large language model (LLM) agents primarily optimizes extrinsic task rewards, often favoring isolated task completion over continual adaptation. This paradigm can cause premature convergence to suboptimal policies and leaves useful experience only implicitly encoded in model parameters, limiting its retrieval and reuse for future decisions. We introduce RetroAgent, an online RL framework that trains agents to master interactive environments not merely by solving tasks, but by evolving across episodes. Inspired by human retrospective self-improvement, RetroAgent augments extrinsic rewards with hindsight-generated dual intrinsic feedback: (1) Intrinsic Numerical Feedback, which rewards beneficial exploration by measuring incremental subtask progress relative to prior attempts; and (2) Intrinsic Language Feedback which distills successes and failures into reusable textual lessons for explicit experience reuse. To leverage these lessons effectively, we propose Similarity & Utility-Aware Upper Confidence Bound (SimUtil-UCB), a retrieval strategy that balances semantic relevance, historical utility, and exploration. Across four challenging agentic benchmarks, RetroAgent achieves new state-of-the-art performance, outperforming GRPO by +18.3% on ALFWorld, +15.4% on WebShop, +27.1% on Sokoban, and +8.9% on MineSweeper, while demonstrating strong test-time adaptation and out-of-distribution generalization.

2601.00809 2026-06-10 cs.OH cs.AI cs.MA 版本更新

A Modular Reference Architecture for MCP-Servers Enabling Agentic BIM Interaction

一种模块化参考架构用于MCP服务器,以实现代理BIM交互

Tobias Heimig-Elschner, Changyu Du, Anna Scheuvens, André Borrmann, Jakob Beetz

发表机构 * Chair of Design Computation, RWTH Aachen University(设计计算系,亚琛工业大学) Chair of Computing in Civil and Building Engineering, Technical University of Munich(土木与建筑工程计算系,慕尼黑技术大学) Federal Institute for Research on Building, Urban Affairs and Spatial Development (BBSR)(建筑、都市事务和空间发展研究院) TUM Georg Nemetschek Institute(慕尼黑技术大学Georg Nemetschek研究所)

AI总结 本文提出一种模块化参考架构,通过解耦MCP接口与特定BIM-API,实现API无关、隔离且可重复的BIM代理交互,提升重用性和研究系统性。

Comments Accepted at the GNI Symposium on Artificial Intelligence for the Built World (Technical University of Munich, May 18--20, 2026)

详情
AI中文摘要

由大型语言模型驱动的代理工作流正被越来越多应用于建筑信息建模(BIM),使IFC模型的自然语言检索、修改和生成成为可能。最近的研究开始采用新兴的模型上下文协议(MCP)作为LLM的统一工具调用接口,简化了BIM交互的代理侧。尽管MCP标准化了LLM调用工具的方式,但当前BIM侧的实现仍需针对特定工具进行编程,限制了在不同环境中的重用、评估和工作流可移植性。本文通过引入一种模块化参考架构来解决这一缺口,该架构使MCP服务器能够实现API无关、隔离且可重复的BIM代理交互。通过对近期文献中反复出现的能力进行系统分析,我们推导出一组核心需求。这些需求指导了一种以显式适配器合同为中心的微服务架构,将MCP接口与特定BIM-API解耦。使用IfcOpenShell的原型实现展示了在常见修改和生成任务中的可行性。在代表性场景中的评估表明,该架构能够实现可靠的工作流,减少耦合,并为系统性研究提供可重用的基础。

英文摘要

Agentic workflows driven by large language models (LLMs) are increasingly applied to Building Information Modelling (BIM), enabling natural-language retrieval, modification and generation of IFC models. Recent work has begun adopting the emerging Model Context Protocol (MCP) as a uniform tool-calling interface for LLMs, simplifying the agent side of BIM interaction. While MCP standardises how LLMs invoke tools, current BIM-side implementations are still authoring tool-specific and ad hoc, limiting reuse, evaluation, and workflow portability across environments. This paper addresses this gap by introducing a modular reference architecture for MCP servers that enables API-agnostic, isolated and reproducible agentic BIM interactions. From a systematic analysis of recurring capabilities in recent literature, we derive a core set of requirements. These inform a microservice architecture centred on an explicit adapter contract that decouples the MCP interface from specific BIM-APIs. A prototype implementation using IfcOpenShell demonstrates feasibility across common modification and generation tasks. Evaluation across representative scenarios shows that the architecture enables reliable workflows, reduces coupling, and provides a reusable foundation for systematic research.

2602.23232 2026-06-10 cs.AI 版本更新

ReCoN-Ipsundrum: An Inspectable Recurrent Persistence Loop Agent with Affect-Coupled Control and Mechanism-Linked Consciousness Indicator Assays

ReCoN-Ipsundrum:一个可检验的循环持续性代理,具有情感耦合控制和机制关联的意识指标测试

Aishik Sanyal

发表机构 * Aishik Sanyal

AI总结 本文提出ReCoN-Ipsundrum代理,通过情感耦合控制和机制关联意识指标测试,探讨意识指标与行为之间的关系,发现情感耦合能提高探索和谨慎行为。

Comments Accepted at AAAI 2026 Spring Symposium - Machine Consciousness: Integrating Theory, Technology, and Philosophy

详情
Journal ref
Proceedings of the AAAI Symposium Series, 8(1):352-360, 2026
AI中文摘要

基于机器意识的指标方法建议通过任务跨领域的机制关联证据进行三角验证,通过建筑检查和因果干预支持。受Humphrey的ipsundrum假说启发,我们实现了ReCoN-Ipsundrum,一个可检验的代理,扩展了ReCoN状态机,加入了感官显著性$N^s$的循环持续性回路以及可选的情感代理报告愉悦度/唤醒度。在固定参数消融(ReCoN、Ipsundrum、Ipsundrum+affect)中,我们将Humphrey的qualiaphilia(对感官经验本身的偏好)作为熟悉度控制的风景-平淡路线选择。我们发现一种新颖性分离:非情感变体对新颖性敏感(Δscenic-entry = 0.07)。情感耦合是稳定的(Δscenic-entry = 0.01)即使风景不新颖(中位数{Δnovelty≈-0.43)。在无奖励的探索性游戏中,情感变体表现出结构化的局部调查(扫描事件31.4 vs. 0.9;循环分数7.6)。在疼痛尾探针中,只有情感变体能持续延长计划的谨慎(尾部持续时间90 vs. 5)。对反馈+整合的损伤会减少ipsundrum变体的刺激后持续性(AUC下降27.62, 27.9%),而ReCoN保持不变。这些分离连接了循环→持续性和情感耦合控制→偏好稳定性、扫描和持续谨慎,展示了如何工程化指标样式的签名,并解释了为什么机理和因果证据应伴随行为标记。

英文摘要

Indicator-based approaches to machine consciousness recommend mechanism-linked evidence triangulated across tasks, supported by architectural inspection and causal intervention. Inspired by Humphrey's ipsundrum hypothesis, we implement ReCoN-Ipsundrum, an inspectable agent that extends a ReCoN state machine with a recurrent persistence loop over sensory salience $N^s$ and an optional affect proxy reporting valence/arousal. Across fixed-parameter ablations (ReCoN, Ipsundrum, Ipsundrum+affect), we operationalize Humphrey's qualiaphilia (preference for sensory experience for its own sake) as a familiarity-controlled scenic-over-dull route choice. We find a novelty dissociation: non-affect variants are novelty-sensitive ($Δ$scenic-entry = 0.07). Affect coupling is stable ($Δ$scenic-entry = 0.01) even when scenic is less novel (median {$Δ$novelty $\approx$ -0.43). In reward-free exploratory play, the affect variant shows structured local investigation (scan events 31.4 vs. 0.9; cycle score 7.6). In a pain-tail probe, only the affect variant sustains prolonged planned caution (tail duration 90 vs. 5). Lesioning feedback+integration selectively reduces post-stimulus persistence in ipsundrum variants (AUC drop 27.62, 27.9%) while leaving ReCoN unchanged. These dissociations link recurrence $\rightarrow$ persistence and affect-coupled control $\rightarrow$ preference stability, scanning, and lingering caution, illustrating how indicator-like signatures can be engineered and why mechanistic and causal evidence should accompany behavioral markers.

2602.01023 2026-06-10 cs.IR cs.AI cs.LG 版本更新

Unifying Ranking and Generation in Query Auto-Completion via Retrieval-Augmented Generation and Multi-Objective Alignment

通过检索增强生成和多目标对齐统一查询自动补全中的排序与生成

Kai Yuan, Anthony Zheng, Jia Hu, Divyanshu Sheth, Hemanth Velaga, Kylee Kim, Matteo Guarrera, Besim Avci, Jianhua Li, Xuetao Yin, Rajyashree Mukherjee, Sean Suchter

发表机构 * Apple(苹果公司) UC Berkeley(加州大学伯克利分校)

AI总结 提出一个统一框架,通过检索增强生成(RAG)和多目标直接偏好优化(DPO)将查询自动补全重构为端到端列表生成,解决传统流水线长尾覆盖不足和生成方法幻觉风险的问题,并在大规模商业搜索平台上验证了有效性。

Comments 11 pages, 4 figures

详情
Journal ref
Proceedings of the 32nd ACM SIGKDD Conference on Knowledge Discovery and Data Mining V.2 (KDD '26), August 09--13, 2026, Jeju Island, Republic of Korea
AI中文摘要

查询自动补全(QAC)在用户输入时建议查询补全,帮助用户表达意图并更高效地获取结果。现有方法面临根本性挑战:传统的检索-排序流水线长尾覆盖有限且需要大量特征工程,而最近的生成方法存在幻觉和安全风险。我们提出了一个统一框架,通过检索增强生成(RAG)和多目标直接偏好优化(DPO)将QAC重构为端到端列表生成。我们的方法结合了三个关键创新:(1)将QAC重构为具有多目标优化的端到端列表生成;(2)定义并部署一套基于规则、基于模型和以LLM为评判的验证器用于QAC,并在综合方法中使用它们,结合RAG、多目标DPO和迭代批评-修订以生成高质量合成数据;(3)一种混合服务架构,可在严格的延迟约束下实现高效的生产部署。在大规模商业搜索平台上的评估显示了显著改进:离线指标在所有维度上均有提升,人工评估获得+0.40至+0.69的偏好分数,受控在线实验实现了击键次数减少5.44%和建议采纳率增加3.46%,验证了结合RAG和多目标对齐的统一生成为生产级QAC提供了有效解决方案。这项工作代表了向由大语言模型、RAG和多目标对齐驱动的端到端生成的范式转变,建立了一个经过生产验证的框架,可惠及更广泛的搜索和推荐行业。

英文摘要

Query Auto-Completion (QAC) suggests query completions as users type, helping them articulate intent and reach results more efficiently. Existing approaches face fundamental challenges: traditional retrieve-and-rank pipelines have limited long-tail coverage and require extensive feature engineering, while recent generative methods suffer from hallucination and safety risks. We present a unified framework that reformulates QAC as end-to-end list generation through Retrieval-Augmented Generation (RAG) and multi-objective Direct Preference Optimization (DPO). Our approach combines three key innovations: (1) reformulating QAC as end-to-end list generation with multi-objective optimization; (2) defining and deploying a suite of rule-based, model-based, and LLM-as-judge verifiers for QAC, and using them in a comprehensive methodology that combines RAG, multi-objective DPO, and iterative critique-revision for high-quality synthetic data; (3) a hybrid serving architecture enabling efficient production deployment under strict latency constraints. Evaluation on a large-scale commercial search platform demonstrates substantial improvements: offline metrics show gains across all dimensions, human evaluation yields +0.40 to +0.69 preference scores, and a controlled online experiment achieves 5.44\% reduction in keystrokes and 3.46\% increase in suggestion adoption, validating that unified generation with RAG and multi-objective alignment provides an effective solution for production QAC. This work represents a paradigm shift to end-to-end generation powered by large language models, RAG, and multi-objective alignment, establishing a production-validated framework that can benefit the broader search and recommendation industry.

2601.13406 2026-06-10 cs.HC cs.AI 版本更新

Integrating Virtual Reality and Large Language Models for Team-Based Non-Technical Skills Training and Evaluation in the Operating Room

将虚拟现实与大型语言模型结合用于手术室基于团队的非技术技能训练与评估

Jacob Barker, Doga Demirel, Cullen Jackson, Anna Johansson, Robbin Miraglia, Darian Hoagland, Stephanie B. Jones, John Mitchell, Daniel B. Jones, Suvranu De

发表机构 * Beth Israel Deaconess Medical Center Center(贝希斯尔德医疗中心中心) Department of Surgery, Northwell Health(外科,北well健康) College of Engineering, Florida Agricultural and Mechanical University and Florida State University(工程学院,佛罗里达农业与机械大学和佛罗里达州立大学)

AI总结 本文提出VORTeX平台,结合虚拟现实与大型语言模型,用于手术室团队非技术技能的训练与评估,通过分析团队对话生成交互图谱,提升沟通与协作能力。

Comments 23 pages, 7 figures, 1 table, 2 Appendices

详情
Journal ref
npj Digit. Surg. 1, 10 (2026)
AI中文摘要

尽管有效的团队合作与沟通对手术安全至关重要,但结构化训练非技术技能(NTS)仍不如技术模拟充分。ACS/APDS III期团队技能课程要求可扩展的工具,既能教学又能客观评估这些能力。我们引入虚拟手术室团队经验(VORTeX),一种多用户虚拟现实(VR)平台,结合沉浸式团队模拟与大型语言模型(LLM)分析,用于训练和评估沟通、决策、团队合作和领导力。团队对话使用源自外科非技术技能(NOTSS)框架的结构化提示进行分析,能够自动分类行为并生成定向交互图谱,量化沟通结构和等级。两个腹腔内紧急情景,气胸和腹腔内出血,被实施以引发现实压力和协作。十二名外科专业人员在2024年SAGES会议上完成了试点测试,评价VORTeX为直观、沉浸和有价值的团队合作与沟通发展工具。LLM始终产生可解释的沟通网络,反映预期的操作等级,外科医生作为中心整合者,护士作为发起者,麻醉师作为平衡中介。通过结合沉浸式VR与LLM驱动的行为分析,VORTeX提供了一个可扩展、隐私合规的框架,用于客观评估和自动化、数据驱动的解围,适用于分布式培训环境。

英文摘要

Although effective teamwork and communication are critical to surgical safety, structured training for non-technical skills (NTS) remains limited compared with technical simulation. The ACS/APDS Phase III Team-Based Skills Curriculum calls for scalable tools that both teach and objectively assess these competencies during laparoscopic emergencies. We introduce the Virtual Operating Room Team Experience (VORTeX), a multi-user virtual reality (VR) platform that integrates immersive team simulation with large language model (LLM) analytics to train and evaluate communication, decision-making, teamwork, and leadership. Team dialogue is analyzed using structured prompts derived from the Non-Technical Skills for Surgeons (NOTSS) framework, enabling automated classification of behaviors and generation of directed interaction graphs that quantify communication structure and hierarchy. Two laparoscopic emergency scenarios, pneumothorax and intra-abdominal bleeding, were implemented to elicit realistic stress and collaboration. Twelve surgical professionals completed pilot sessions at the 2024 SAGES conference, rating VORTeX as intuitive, immersive, and valuable for developing teamwork and communication. The LLM consistently produced interpretable communication networks reflecting expected operative hierarchies, with surgeons as central integrators, nurses as initiators, and anesthesiologists as balanced intermediaries. By integrating immersive VR with LLM-driven behavioral analytics, VORTeX provides a scalable, privacy-compliant framework for objective assessment and automated, data-informed debriefing across distributed training environments.

2601.09620 2026-06-10 cs.HC cs.AI cs.CY 版本更新

Full Disclosure, Less Trust? How the Level of Detail about AI Use in News Writing Affects Readers' Trust

全面披露,更少信任?新闻写作中AI使用细节程度如何影响读者信任

Pooja Prajod, Hannes Cools, Thomas Röggla, Karthikeya Puttur Venkatraj, Amber Kusters, Alia ElKattan, Pablo Cesar, Abdallah El Ali

发表机构 * Centrum Wiskunde & Informatica(数学与信息学中心) University of Amsterdam(阿姆斯特丹大学) New York University(纽约大学) TU Delft(代尔夫特理工大学) Utrecht University(乌得勒支大学)

AI总结 研究探讨新闻写作中AI使用细节披露程度对读者信任的影响,发现详细披露会降低信任,但促使更多读者核查信息源,揭示透明度与信任之间的权衡。

详情
AI中文摘要

随着人工智能在新闻生产中的整合日益增加,对AI使用透明度的需求已获得广泛关注。最近的研究表明,AI披露可能导致“透明度困境”,即披露会降低读者信任。然而,关于AI披露细节程度如何影响信任以及在新闻背景下如何促成这一困境仍知之甚少。在本项3×2×2混合因子研究中,我们调查了三种AI披露水平(无、一行、详细)在两种新闻类型(政治与生活方式)和两种AI参与程度(低、高)下对新闻读者信任的影响。我们使用新闻媒体信任问卷以及两个决策行为(信息源核查和订阅决定)来测量信任。问卷回答和订阅率显示,只有详细AI披露导致信任下降,而信息源核查行为在一行和详细披露中均增加,且对详细披露影响更显著。半结构化访谈的见解表明,信息源核查行为主要由对主题的兴趣驱动,其次是信任,而信任是影响订阅决定的主要因素。约三分之二的参与者表达了对详细披露的偏好,而大多数偏好一行披露的参与者则表明需要按需详细披露格式。我们的发现表明,并非所有AI披露都会导致透明度困境,而是反映了读者对更多透明度的渴望与对AI辅助新闻内容信任之间的权衡。

英文摘要

As artificial intelligence (AI) is increasingly integrated into news production, calls for transparency about the use of AI have gained considerable traction. Recent studies suggest that AI disclosures can lead to a ``transparency dilemma'', where disclosure reduces readers' trust. However, little is known about how the \textit{level of detail} in AI disclosures influences trust and contributes to this dilemma within the news context. In this 3$\times$2$\times$2 mixed factorial study with 40 participants, we investigate how three levels of AI disclosures (none, one-line, detailed) across two types of news (politics and lifestyle) and two levels of AI involvement (low and high) affect news readers' trust. We measured trust using the News Media Trust questionnaire, along with two decision behaviors: source-checking and subscription decisions. Questionnaire responses and subscription rates showed a decline in trust only for detailed AI disclosures, whereas source-checking behavior increased for both one-line and detailed disclosures, with the effect being more pronounced for detailed disclosures. Insights from semi-structured interviews suggest that source-checking behavior was primarily driven by interest in the topic, followed by trust, whereas trust was the main factor influencing subscription decisions. Around two-thirds of participants expressed a preference for detailed disclosures, while most participants who preferred one-line indicated a need for detail-on-demand disclosure formats. Our findings show that not all AI disclosures lead to a transparency dilemma, but instead reflect a trade-off between readers' desire for more transparency and their trust in AI-assisted news content.

2512.09543 2026-06-10 cs.SE cs.AI 版本更新

SWEnergy: An Empirical Study on Energy Efficiency in Agentic Issue Resolution Frameworks with SLMs

SWEnergy:关于使用SLM的代理问题解决框架的能量效率实证研究

Arihant Tripathy, Ch Pavan Harshit, Karthik Vaidhyanathan

发表机构 * SERC, IIIT-Hyderabad(IIIT-海得拉巴研究所)

AI总结 本文通过实证研究,探讨了四种主流代理问题解决框架在使用小型语言模型时的能量效率和资源消耗,发现框架架构是主要能耗驱动因素,但SLM的有限推理能力导致大量能耗浪费。

Comments 8 pages, 5 figures, 1 table. Accepted to AGENT 2026 (ICSE 2026 workshop)

详情
Journal ref
Proceedings of the 2026 International Workshop on Agentic Engineering (AGENT 2026), ACM, 2026, pp. 104-111
AI中文摘要

本文通过实证研究,探讨了四种主流代理问题解决框架在使用小型语言模型时的能量效率和资源消耗,发现框架架构是主要能耗驱动因素,但SLM的有限推理能力导致大量能耗浪费。

英文摘要

Context. LLM-based autonomous agents in software engineering rely on large, proprietary models, limiting local deployment. This has spurred interest in Small Language Models (SLMs), but their practical effectiveness and efficiency within complex agentic frameworks for automated issue resolution remain poorly understood. Goal. We investigate the performance, energy efficiency, and resource consumption of four leading agentic issue resolution frameworks when deliberately constrained to using SLMs. We aim to assess the viability of these systems for this task in resource-limited settings and characterize the resulting trade-offs. Method. We conduct a controlled evaluation of four leading agentic frameworks (SWE-Agent, OpenHands, Mini SWE Agent, AutoCodeRover) using two SLMs (Gemma-3 4B, Qwen-3 1.7B) on the SWE-bench Verified Mini benchmark. On fixed hardware, we measure energy, duration, token usage, and memory over 150 runs per configuration. Results. We find that framework architecture is the primary driver of energy consumption. The most energy-intensive framework, AutoCodeRover (Gemma), consumed 9.4x more energy on average than the least energy-intensive, OpenHands (Gemma). However, this energy is largely wasted. Task resolution rates were near-zero, demonstrating that current frameworks, when paired with SLMs, consume significant energy on unproductive reasoning loops. The SLM's limited reasoning was the bottleneck for success, but the framework's design was the bottleneck for efficiency. Conclusions. Current agentic frameworks, designed for powerful LLMs, fail to operate efficiently with SLMs. We find that framework architecture is the primary driver of energy consumption, but this energy is largely wasted due to the SLMs' limited reasoning. Viable low-energy solutions require shifting from passive orchestration to architectures that actively manage SLM weaknesses.

2508.00491 2026-06-10 cs.RO cs.AI 版本更新

HannesImitation: Grasping with the Hannes Prosthetic Hand via Imitation Learning

HannesImitation:通过模仿学习控制Hannes假手进行抓取

Carlo Alessi, Federico Vasile, Federico Ceola, Giulia Pasquale, Nicolò Boccardo, Lorenzo Natale

发表机构 * Humanoid Sensing and Perception(人形感知与感知实验室) Istituto Italiano di Tecnologia(意大利技术研究院) Rehab Technologies Lab(康复技术实验室)

AI总结 本文提出HannesImitationPolicy,通过模仿学习控制Hannes假手在无结构环境中抓取物体,并引入HannesImitationDataset进行训练,实验表明其在无结构场景中优于基于分割的视觉伺服控制器。

Comments Paper accepted at IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS)

详情
Journal ref
IEEE/RSJ International Conference on Intelligent Robots and Systems, Hangzhou, China, 2025
AI中文摘要

最近,假手控制的进步集中在通过摄像头和其他传感器输入提高自主性。这些系统旨在通过自动控制某些自由度来减少用户认知负担。在机器人学中,模仿学习已成为学习抓取和复杂操作任务并简化数据收集的有前途的方法。然而,其在假手控制中的应用仍 largely 未被探索。填补这一差距可以提高灵活性恢复,并使假手设备能够在更多无约束场景中运行,其中任务是通过演示学习而非依赖手动标注序列。为此,我们提出了HannesImitationPolicy,一种基于模仿学习的方法来控制Hannes假手,使其在无结构环境中进行物体抓取。此外,我们引入了HannesImitationDataset,包含在桌子、架子和人到假手交接场景中的抓取演示。我们利用此类数据训练了一个单扩散策略,并将其部署在假手上以预测手腕方向和手部闭合以进行抓取。实验评估显示在多样化的物体和条件下成功抓取。最后,我们展示该策略在无结构场景中优于基于分割的视觉伺服控制器。附加材料可在我们的项目页面上提供:https://hsp-iit.github.io/HannesImitation

英文摘要

Recent advancements in control of prosthetic hands have focused on increasing autonomy through the use of cameras and other sensory inputs. These systems aim to reduce the cognitive load on the user by automatically controlling certain degrees of freedom. In robotics, imitation learning has emerged as a promising approach for learning grasping and complex manipulation tasks while simplifying data collection. Its application to the control of prosthetic hands remains, however, largely unexplored. Bridging this gap could enhance dexterity restoration and enable prosthetic devices to operate in more unconstrained scenarios, where tasks are learned from demonstrations rather than relying on manually annotated sequences. To this end, we present HannesImitationPolicy, an imitation learning-based method to control the Hannes prosthetic hand, enabling object grasping in unstructured environments. Moreover, we introduce the HannesImitationDataset comprising grasping demonstrations in table, shelf, and human-to-prosthesis handover scenarios. We leverage such data to train a single diffusion policy and deploy it on the prosthetic hand to predict the wrist orientation and hand closure for grasping. Experimental evaluation demonstrates successful grasps across diverse objects and conditions. Finally, we show that the policy outperforms a segmentation-based visual servo controller in unstructured scenarios. Additional material is provided on our project page: https://hsp-iit.github.io/HannesImitation

2508.17196 2026-06-10 cs.LG cs.AI 版本更新

BudgetThinker: Empowering Budget-aware LLM Reasoning with Control Tokens

BudgetThinker: 通过控制令牌赋能预算感知的LLM推理

Hao Wen, Xinrui Wu, Yi Sun, Feifei Zhang, Liye Chen, Jie Wang, Yunxin Liu, Yunhao Liu, Ya-Qin Zhang, Yuanchun Li

发表机构 * Institute for AI Industry Research (AIR) Tsinghua University(人工智能产业研究院(AIR)清华大学) Global Innovation Exchange & Department of Automation Tsinghua University(全球创新交流中心及自动化系 清华大学)

AI总结 BudgetThinker通过在推理过程中插入控制令牌,使LLM能够精确控制推理过程长度,采用两阶段训练流程提升模型在不同预算下的表现。

详情
AI中文摘要

近年来,大语言模型(LLM)通过增加测试时计算来增强推理能力,但此策略导致显著延迟和资源成本,限制了其在现实时间受限或成本敏感场景中的应用。本文提出BudgetThinker,一种新型框架,旨在使LLM具备预算感知推理能力,通过在推理过程中定期插入特殊控制令牌,持续告知模型剩余令牌预算。我们提出了一种方法,结合监督微调(SFT)和基于课程的学习强化学习(RL)阶段,利用长度感知奖励函数优化准确性和预算遵守度。我们证明BudgetThinker在各种推理预算下的数学基准测试中显著优于强基线。我们的方法提供了一种可扩展且有效的解决方案,用于开发高效可控的LLM推理,使高级模型更适用于资源受限和实时环境。

英文摘要

Recent advancements in Large Language Models (LLMs) have leveraged increased test-time computation to enhance reasoning capabilities, a strategy that, while effective, incurs significant latency and resource costs, limiting their applicability in real-world time-constrained or cost-sensitive scenarios. This paper introduces BudgetThinker, a novel framework designed to empower LLMs with budget-aware reasoning, enabling precise control over the length of their thought processes. We propose a methodology that periodically inserts special control tokens during inference to continuously inform the model of its remaining token budget. This approach is coupled with a comprehensive two-stage training pipeline, beginning with Supervised Fine-Tuning (SFT) to familiarize the model with budget constraints, followed by a curriculum-based Reinforcement Learning (RL) phase that utilizes a length-aware reward function to optimize for both accuracy and budget adherence. We demonstrate that BudgetThinker significantly surpasses strong baselines in maintaining performance across a variety of reasoning budgets on challenging mathematical benchmarks. Our method provides a scalable and effective solution for developing efficient and controllable LLM reasoning, making advanced models more practical for deployment in resource-constrained and real-time environments.

2503.08460 2026-06-10 cs.ET cs.AI cs.CY 版本更新

Status and Future Prospects of the Standardization Framework Industry 4.0: A European Perspective

工业4.0标准化框架行业现状与未来展望:欧洲视角

Olga Meyer, Marvin Boell, Christoph Legat

发表机构 * Fraunhofer Institute for Manufacturing Engineering and Automation (IPA)(弗劳恩霍夫智能制造与自动化研究所) German Commission for Electrotechnical, Electronic, and Information Technologies(德国电气电子和信息通信技术委员会) Technical University of Applied Sciences Augsburg(应用技术大学阿沙布斯)

AI总结 本文探讨工业4.0标准化在欧洲监管框架中的核心作用,重点分析智能制造和数字孪生中的标准化活动,为人工智能和数字孪生提供标准指南,并呼吁加强标准化机构与研究界的合作。

详情
AI中文摘要

工业4.0技术的快速发展需要强有力的标准化以确保未来工业的互操作性、安全性和效率。本文探讨了标准化的基本作用和功能,特别是其在欧洲监管框架中的重要性。基于此,文章重点突出标准化活动在智能制造和数字孪生中的相关主题,并概述了工业4.0标准框架。本文既为人工智能和数字孪生的现有标准提供信息指南,也呼吁加强标准化机构与研究界的合作。通过促进此类合作,我们旨在促进标准的持续发展和实施,从而推动制造业的创新和进步。

英文摘要

The rapid development of Industry 4.0 technologies requires robust and comprehensive standardization to ensure interoperability, safety and efficiency in the Industry of the Future. This paper examines the fundamental role and functionality of standardization, with a particular focus on its importance in Europe's regulatory framework. Based on this, selected topics in context of standardization activities in context intelligent manufacturing and digital twins are highlighted and, by that, an overview of the Industry 4.0 standards framework is provided. This paper serves both as an informative guide to the existing standards in Industry 4.0 with respect to Artificial Intelligence and Digital Twins, and as a call to action for increased cooperation between standardization bodies and the research community. By fostering such collaboration, we aim to facilitate the continued development and implementation of standards that will drive innovation and progress in the manufacturing sector.

2501.11937 2026-06-10 cs.LG cs.AI 版本更新

MeshONet: A Generalizable and Efficient Operator Learning Method for Structured Mesh Generation

MeshONet: 一种通用且高效的结构网格生成运算学习方法

Jing Xiao, Xinhai Chen, Qingling Wang, Jie Liu

发表机构 * Laboratory of Digitizing Software for Frontier Equipment, Science and Technology on Parallel and Distributed Processing Laboratory, National University of Defense Technology(前沿装备数字化软件实验室、并行与分布式处理技术实验室、国防科技大学)

AI总结 本文提出MeshONet,一种用于结构网格生成的通用智能学习方法,通过将网格生成任务转化为运算学习问题,实现了高效生成和跨几何泛化。

详情
Journal ref
Neural Networks 199: 108746 (2026)
AI中文摘要

网格生成在科学计算中起着关键作用。传统网格生成方法如TFI和基于PDE的方法往往难以在效率和网格质量之间取得平衡。为解决这一挑战,近年来出现了物理引导的智能学习方法,显著提高了生成效率并保持了高质量网格。然而,物理引导方法在应用于以前未见过的几何时无法泛化,因为即使边界形状的微小变化也需要负担得起的重新训练来适应新的几何变化。在本文中,我们引入了MeshONet,这是第一个用于结构网格生成的通用智能学习方法。该方法将网格生成任务转换为一个具有多个输入和解函数的运算学习问题。为了有效克服运算学习方法的多变量映射限制,我们提出了一种双分支、共享主干的架构,以基于输入输出对的方式近似函数空间之间的映射。实验结果表明,MeshONet在生成效率上比传统方法快了四个数量级。它还能够泛化到不同的几何形状而无需重新训练,大大增强了智能方法的实用性。

英文摘要

Mesh generation plays a crucial role in scientific computing. Traditional mesh generation methods, such as TFI and PDE-based methods, often struggle to achieve a balance between efficiency and mesh quality. To address this challenge, physics-informed intelligent learning methods have recently emerged, significantly improving generation efficiency while maintaining high mesh quality. However, physics-informed methods fail to generalize when applied to previously unseen geometries, as even small changes in the boundary shape necessitate burdensome retraining to adapt to new geometric variations. In this paper, we introduce MeshONet, the first generalizable intelligent learning method for structured mesh generation. The method transforms the mesh generation task into an operator learning problem with multiple input and solution functions. To effectively overcome the multivariable mapping restriction of operator learning methods, we propose a dual-branch, shared-trunk architecture to approximate the mapping between function spaces based on input-output pairs. Experimental results show that MeshONet achieves a speedup of up to four orders of magnitude in generation efficiency over traditional methods. It also enables generalization to different geometries without retraining, greatly enhancing the practicality of intelligent methods.

2409.04519 2026-06-10 quant-ph cs.AI cs.LG physics.data-an 版本更新

The role of data embedding in quantum autoencoders for improved anomaly detection

数据嵌入在量子自编码器中用于改进异常检测的作用

Jack Y. Araz, Michael Spannowsky

发表机构 * Thomas Jefferson National Accelerator Facility(托马斯·杰斐逊国家加速器设施) Institute for Particle Physics Phenomenology(粒子物理学现象研究所) Durham University(达勒姆大学)

AI总结 研究探讨了三种数据嵌入技术对量子自编码器异常检测性能的影响,发现改进的嵌入策略能显著提升检测准确性和数据表征能力。

Comments 8 pages, 5 figures, 4 tables

详情
Journal ref
Quantum Mach. Intell. 8, 61 (2026)
AI中文摘要

量子自编码器(QAEs)在异常检测任务中的性能严重依赖于数据嵌入和ansatz设计。本研究探讨了三种数据嵌入技术——数据重新上传、并行嵌入和交替嵌入——对QAEs表征能力和检测效果的影响。我们的发现表明,即使使用相对简单的变分电路,增强的数据嵌入策略也能显著提高异常检测准确性和不同数据集下底层数据的表征能力。从低维数据的玩具示例开始,我们通过可视化展示不同嵌入技术对模型表征的影响。然后我们将分析扩展到复杂、高维数据集,强调嵌入方法对QAE性能的重大影响。

英文摘要

The performance of Quantum Autoencoders (QAEs) in anomaly detection tasks is critically dependent on the choice of data embedding and ansatz design. This study explores the effects of three data embedding techniques, data re-uploading, parallel embedding, and alternate embedding, on the representability and effectiveness of QAEs in detecting anomalies. Our findings reveal that even with relatively simple variational circuits, enhanced data embedding strategies can substantially improve anomaly detection accuracy and the representability of underlying data across different datasets. Starting with toy examples featuring low-dimensional data, we visually demonstrate the effect of different embedding techniques on the representability of the model. We then extend our analysis to complex, higher-dimensional datasets, highlighting the significant impact of embedding methods on QAE performance.

2310.04680 2026-06-10 cs.CL cs.AI cs.LG 版本更新

The Cost of Down-Scaling Language Models: Fact Recall Deteriorates before In-Context Learning

大语言模型降维的成本:事实回忆在内省学习之前恶化

Tian Jin, Nolan Clement, Xin Dong, Vaishnavh Nagarajan, Michael Carbin, Jonathan Ragan-Kelley, Gintare Karolina Dziugaite

发表机构 * MIT CSAIL(麻省理工学院计算机科学与人工智能实验室) MIT Harvard University(麻省理工学院哈佛大学) Google Research(谷歌研究) Google DeepMind(谷歌深Mind)

AI总结 研究探讨了大语言模型参数数量缩放对核心能力的影响,发现模型规模缩减会显著降低事实回忆能力,但对内省信息处理影响较小。

详情
Journal ref
The Twelfth International Conference on Learning Representations (ICLR), 2024
AI中文摘要

如何缩放大语言模型(LLMs)的参数数量会影响其核心能力?我们研究了两种自然缩放技术——权重剪枝和简单训练更小或更大的模型(称为密集缩放)——对LLMs两个核心能力的影响:(a)回忆训练期间呈现的事实,以及(b)处理推理期间呈现的信息。通过设计一系列任务来区分这两种能力,我们发现这两种能力在缩放时的表现存在显著差异。通过超过30%的模型规模缩减(通过任一缩放方法)会显著降低对训练期间呈现事实的回忆能力。然而,60-70%的缩减在很大程度上保留了模型处理内省信息的各种方式,从从长上下文检索答案到从内省示例中学习参数化函数。两种缩放方法均表现出这种行为,表明缩放模型大小对事实回忆和内省学习有本质上不同的影响。

英文摘要

How does scaling the number of parameters in large language models (LLMs) affect their core capabilities? We study two natural scaling techniques -- weight pruning and simply training a smaller or larger model, which we refer to as dense scaling -- and their effects on two core capabilities of LLMs: (a) recalling facts presented during pre-training and (b) processing information presented in-context during inference. By curating a suite of tasks that help disentangle these two capabilities, we find a striking difference in how these two abilities evolve due to scaling. Reducing the model size by more than 30\% (via either scaling approach) significantly decreases the ability to recall facts seen in pre-training. Yet, a 60--70\% reduction largely preserves the various ways the model can process in-context information, ranging from retrieving answers from a long context to learning parameterized functions from in-context exemplars. The fact that both dense scaling and weight pruning exhibit this behavior suggests that scaling model size has an inherently disparate effect on fact recall and in-context learning.